### Basic Deep Learning Language Model

#### Language Model

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

**◆ Language model basics**
- A model that assigns probabilities to word sequences (arrangements of words).
- By the chain rule, the probability of a whole word sequence is the product of the probabilities of each word given the words before it. In practice, a language model therefore estimates the probability of the next word given the previous words.

$$P(a_1, a_2, a_3) = P(a_1) \times P(a_2 \mid a_1) \times P(a_3 \mid a_1, a_2)$$
$$P(\text{I, Sogang University, I am a student}) = P(\text{I}) \times P(\text{Sogang University} \mid \text{I}) \times P(\text{I am a student} \mid \text{I, Sogang University})$$
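The chain rule above can be sketched in a few lines of Python. The conditional probabilities in `cond_prob` are made-up numbers chosen only for illustration, and the function name `sequence_probability` is hypothetical; a real language model would estimate these values from data.

```python
from math import prod

# Hypothetical conditional probabilities, keyed by (history, next_word).
# The numbers are invented for illustration only.
cond_prob = {
    ((), "I"): 0.5,                  # P(I)
    (("I",), "am"): 0.5,             # P(am | I)
    (("I", "am"), "happy"): 0.25,    # P(happy | I, am)
}

def sequence_probability(words, cond_prob):
    """Multiply P(w_i | w_1 .. w_{i-1}) over every position (chain rule)."""
    factors = []
    for i, word in enumerate(words):
        history = tuple(words[:i])   # all words before position i
        factors.append(cond_prob[(history, word)])
    return prod(factors)

p = sequence_probability(["I", "am", "happy"], cond_prob)
print(p)  # 0.5 * 0.5 * 0.25
```

Each factor conditions on the full history, which is exactly what makes the table grow quickly; the models listed next differ mainly in how they approximate that history.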

**◆ Language model evolution**
- N-gram model: Estimates the probability of each word from how often it follows the preceding n-1 words in a corpus -> context outside that short window is not considered
- Word2Vec-based model: A neural network-based model (a shallow network) that learns word vectors expressing word-to-word relationships. Each word gets a single fixed vector -> context is not considered
- RNN-based model: Has a recurrent structure that processes the elements of a sequence one at a time and uses previous information in the current calculation. As the input sequence grows longer, it becomes difficult to retain information from distant positions (the long-term dependency problem). Modified RNN structures such as LSTM and GRU have emerged to mitigate this, but performance still deteriorates as sentences become longer.
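- Attention-based model: Attention vectors (weighted sums over the input positions) are generated and applied, letting the model focus on the relevant parts of the input
- Transformer-based model: Built on self-attention instead of recurrence
    - Reasons for using self-attention
        - (1) Computational complexity per layer
        - (2) Amount of computation that can be parallelized
        - (3) Learning dependencies between distant words
    - Ex) BERT, GPT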
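To make the N-gram entry in the list above concrete, here is a minimal sketch of a bigram (n = 2) model estimated from counts. The corpus is invented, and `bigram_prob` is a hypothetical helper name; no smoothing is applied, so unseen pairs get probability 0.

```python
from collections import Counter

# Tiny made-up corpus, already tokenized; <s> marks the start of a sentence.
corpus = [
    ["<s>", "i", "like", "tea"],
    ["<s>", "i", "like", "coffee"],
    ["<s>", "you", "like", "tea"],
]

bigram_counts = Counter()    # how often (prev, word) occurs
history_counts = Counter()   # how often prev occurs as a history
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        bigram_counts[(prev, word)] += 1
        history_counts[prev] += 1

def bigram_prob(prev, word):
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / history_counts[prev]

p_tea = bigram_prob("like", "tea")   # 2 of the 3 "like" bigrams continue with "tea"
p_i = bigram_prob("<s>", "i")        # 2 of the 3 sentences start with "i"
```

Note that `bigram_prob("like", "tea")` is the same whether the subject was "i" or "you": the model only sees one preceding word, which is the limited-context weakness the list describes.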

**◆ Difference between RNN-based model and attention-based model**

1. Sequence processing method:
    - Attention: Processes all elements of the input sequence at once and computes the relevance between every pair of elements. The output for each element is a weighted sum over all elements.
    - RNN: Processes the sequence step by step, passing information from the previous step to the current one. The hidden state carries the previous information forward.
2. Handling long-term dependencies:
    - Attention: Relevance is computed directly between any two elements, however far apart, so important information can be highlighted and focused on regardless of distance.
    - RNN: Has difficulty handling long-term dependencies. Information loss and vanishing gradients occur in long sequences.
3. Parallel processing:
    - Attention: Well suited to parallel processing because the relevance scores between elements can be computed independently.
    - RNN: Difficult to parallelize because each step depends on the previous one, so processing is slow on large data.
4. Sequence length handling:
    - Attention: Every element is processed in the same way regardless of the length of the input sequence.
    - RNN: The longer the input sequence, the harder it is to retain information.
5. Memory requirements:
    - Attention: Memory requirements can be high because a relevance score is computed for every pair of elements, so memory grows quadratically with the sequence length.
    - RNN: Memory requirements can be relatively low because the sequence is processed one step at a time.
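The contrast in the list above can be seen in a small pure-Python sketch. Both functions are simplified assumptions rather than production layers: `self_attention` uses the input directly as queries, keys, and values (no learned projections), and `rnn` is a toy elementwise recurrence with hand-picked scalar weights `w_in` and `w_rec`.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned weights)."""
    d = len(x[0])
    outputs = []
    for query in x:  # rows are independent of each other, so they could run in parallel
        # One score per position: n scores per row, n * n in total.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        weights = softmax(scores)
        # Output is a weighted sum of all positions.
        outputs.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return outputs

def rnn(x, w_in=0.5, w_rec=0.5):
    """Toy elementwise RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h = [0.0] * len(x[0])
    states = []
    for x_t in x:  # step t needs h from step t-1, so this loop is inherently sequential
        h = [math.tanh(w_in * xi + w_rec * hi) for xi, hi in zip(x_t, h)]
        states.append(h)
    return states

sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(sequence)   # every position looks at every other position
hidden = rnn(sequence)                # information flows only through the hidden state
```

The attention function touches every pair of positions (hence the quadratic score count), while the RNN keeps only one hidden state at a time but cannot start step `t` before step `t-1` finishes.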

#### Transformer

<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/8/8f/The-Transformer-model-architecture.png" alt="My Image"> </center>

**◆ Transformer Model Summary**
- The Transformer is a sequence-to-sequence model proposed by Google in 2017 and is widely used in machine translation.
- It is also widely used for sentiment (positive/negative) classification of reviews.
- Training time is reduced because it does not use recurrent layers such as RNN or LSTM, so a sequence can be processed in parallel.
- For a positive/negative decision, the words that received strong attention can be visualized to show the basis for the decision (explainable artificial intelligence).
- Word vectors can depend on context: a word may have several meanings, and its vector changes with the surrounding words.
- It is an encoder-decoder model. When an input sentence (the source text) is fed to the encoder, the encoder learns a representation of it and sends the result to the decoder. The decoder receives that representation and generates the desired sentence.
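For reference, the self-attention operation behind the context-dependent word vectors is usually written as scaled dot-product attention. The symbols $Q$, $K$, $V$ (query, key, and value matrices) and $d_k$ (key dimension) follow the notation of the original Transformer paper and are not defined elsewhere in these notes:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

Each row of the softmax output holds one word's attention weights over all words in the sentence, so the resulting vector for that word is a weighted sum of the value vectors and changes with its context.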

**◆ Transformer model structure (e.g. machine translation)**
- A tokenizer splits the text into tokens, and each token is mapped to a number (token ID)
- The token IDs are the input to the Transformer, which outputs token IDs in turn; these are mapped back to tokens to complete the sentence.
- The encoder and the decoder each consist of 6 stacked layers
- The encoder layers are connected in series and pass their outputs along in order, whereas each decoder layer receives both the output of the previous decoder layer and the final output of the encoder as input.
- The internal structure of the encoder and decoder layers is shown in the figure below.

<center>

![cluster](./images/deeplearning.png) 

</center>
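The tokenize, map-to-number, and map-back steps described above can be sketched with a toy whitespace tokenizer. The vocabulary and the function names `tokens_to_ids` and `ids_to_text` are hypothetical; real Transformer tokenizers typically use learned subword vocabularies rather than whole words.

```python
# Hypothetical toy vocabulary; <unk> stands in for any word not in the vocabulary.
vocab = {"<unk>": 0, "i": 1, "am": 2, "a": 3, "student": 4}
id_to_token = {idx: tok for tok, idx in vocab.items()}

def tokens_to_ids(text):
    """Split on whitespace and map each token to its integer ID."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]

def ids_to_text(ids):
    """Map IDs back to tokens and join them into a sentence."""
    return " ".join(id_to_token[i] for i in ids)

ids = tokens_to_ids("I am a student")   # what the model would receive as input
text = ids_to_text(ids)                 # what is done with the model's output IDs
```

In a real pipeline the model sits between these two calls, consuming source IDs and producing target IDs.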

**◆ Sequence to Sequence (Seq2Seq) Structure**
- Sequence: an ordered series of words
- Sequence-to-sequence: converting a sequence with one property into a sequence with another property.
- ex) 어제, 에버랜드, 갔었어, 거기, 사람, 많더라 $\Rightarrow$ I, went, to, Everland, yesterday, there, were, many, people, there

**◆ Encoder and decoder**
- A model that performs a sequence-to-sequence task consists of an encoder and a decoder.
- The encoder is responsible for compressing the source sequence information and sending it to the decoder.
- Encoding: The process of compressing source sequence information
- Decoding: The process of generating a target sequence by receiving information sent by the encoder
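The division of labor between encoder and decoder can be illustrated with a deliberately trivial sketch. Nothing here is learned: the task (digits to digit names), the `DIGIT_NAMES` lookup, and all function names are assumptions for illustration, and the "context" is simply the tuple of source tokens rather than a learned vector representation. The point is only the data flow: encode once, then decode one token at a time until an end-of-sequence marker.

```python
EOS = "<eos>"
# Hypothetical lookup standing in for learned decoder weights.
DIGIT_NAMES = {1: "one", 2: "two", 3: "three"}

def encoder(source):
    """Encoding: pack the source sequence into one context object for the decoder."""
    return tuple(source)

def decoder_step(context, generated):
    """Return the next target token given the context and the tokens produced so far."""
    position = len(generated)
    if position >= len(context):
        return EOS
    return DIGIT_NAMES[context[position]]

def decoder(context, max_len=10):
    """Decoding: generate the target sequence one token at a time until EOS."""
    generated = []
    while len(generated) < max_len:
        token = decoder_step(context, generated)
        if token == EOS:
            break
        generated.append(token)
    return generated

target = decoder(encoder([3, 1, 2]))
```

A trained model replaces the lookup with neural networks, but the loop structure (context in, tokens out step by step) stays the same.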