# Transformers 101
By Ankush Chander

# Outline
### 1. Context
### 2. Encoder decoder framework
### 3. Transformer architecture



## Language modelling
Language modeling is the task of predicting the next word or character in a sequence of text, given the words or characters that precede it.

P(w_n | w_1, w_2, ..., w_{n-1}) = ?
![title](img/transformer-talk/next_word_pred.jpeg)
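
As a rough illustration of next-word prediction, the sketch below estimates P(next word | previous word) from raw counts. It is a bigram approximation that conditions on only one previous word rather than the full context, and the toy corpus and the helper name `next_word_probs` are made up for this example.

```python
from collections import Counter, defaultdict

# Toy corpus; a real model would be estimated from far more text.
corpus = "the cat sat on the mat . the cat sat on the table .".split()

# Count how often each word follows each previous word.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Return P(next | prev) as a dict, estimated from raw counts."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

probs = next_word_probs("the")
print(probs)  # {'cat': 0.5, 'mat': 0.25, 'table': 0.25}
```

Neural language models replace the count table with a learned function, which is what lets them assign sensible probabilities to sequences that never appeared in the training text.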


## Evolution of Language models
| | Statistical Language models | Neural Language models | Large Language models |
| --- | --- | --- | --- |
|Pros: | Simple to implement | Generalize well to unseen sequences because they capture semantic relationships | Generate coherent and contextually relevant text |
|Cons: | 1. Struggle to capture long-range dependencies. <br> 2. Don't capture semantic relationships. <br>Eg: "cat sat on a table" vs "cat sat on a desk" are treated as unrelated sequences | Expensive to train | Expensive to train |
|Examples: | N-gram models, Hidden Markov Models (HMMs).   | RNNs, LSTMs, GRU | GPT-3, GPT-4, BERT |


## Word Embeddings
A word embedding is a real-valued vector that encodes the meaning of a word, such that words that are closer together in the vector space are expected to be similar in meaning.

**Key characteristics:**  
*Continuous Vector Space:* Each word is represented by a vector in a continuous multi-dimensional space. The position of a word's vector in this space reflects its semantic meaning and relationships with other words.  
*Semantic Similarity:* Words with similar meanings are represented by vectors that are closer together in the vector space. For example, the vectors for "king" and "queen" would lie close to each other.  
*Dimensionality:* The dimensionality of word embeddings is a hyperparameter that can be adjusted. Commonly used dimensions range from 50 to 300, but the optimal dimensionality may depend on the specific task and the size of the training data.

<img src="img/transformer-talk/word_vector.jpeg" width="650" height="650">
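
One common way to measure how close two embeddings are is cosine similarity. The sketch below uses hand-picked 4-dimensional vectors purely for illustration; real embeddings are learned from data and typically have 50 to 300 dimensions, as noted above.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, hand-picked for illustration only.
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1, 0.0]),
    "queen": np.array([0.7, 0.7, 0.1, 0.0]),
    "apple": np.array([0.0, 0.1, 0.9, 0.4]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])
print(sim_royal, sim_fruit)  # the first value is much larger than the second
```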

## Embedding layer
The **embedding layer** is the layer in neural NLP models that maps words to their corresponding embeddings. Words are usually represented as *one-hot vectors* of length vocab_size, and the embedding layer transforms each one-hot vector into the corresponding embedding.
Sometimes, instead of training the embeddings ourselves, we can reuse embeddings trained by someone else. These are called **pretrained embeddings**.  
<img src="img/transformer-talk/embedding_layer.png" width="650" height="650"> 
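
A minimal sketch of why the embedding layer amounts to a table lookup: multiplying a one-hot vector by the embedding matrix selects exactly one row. The vocabulary, sizes, and random weights here are assumptions chosen for readability.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]   # toy vocabulary
vocab_size, d_model = len(vocab), 3          # tiny sizes for readability

rng = np.random.default_rng(0)
# One row per word; in a real model these weights are learned during training.
embedding_matrix = rng.normal(size=(vocab_size, d_model))

word_id = vocab.index("cat")
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# Multiplying a one-hot vector by the matrix selects a single row ...
via_matmul = one_hot @ embedding_matrix
# ... so implementations usually skip the multiplication and index directly.
via_lookup = embedding_matrix[word_id]

print(np.allclose(via_matmul, via_lookup))
```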

## Sequence modelling
**Sequence models** are machine learning models that take sequences of data as input or produce them as output. Sequential data includes text streams, audio clips, video clips, and time-series data.  
Eg:  
Sentiment analysis (text => positive/negative/neutral)  
Document summarization (document => summary)  
Machine translation (source_language_text => target_language_text)  
Text generation (text_so_far => next_word)

When both the input and the desired output are sequences, the model is known as a **sequence-to-sequence model**.

## Encoder-decoder architecture
The encoder-decoder architecture is an ML architecture that is widely used in NLP tasks such as machine translation, text summarization, and language generation.  
The **encoder** takes a variable-length sequence as input and transforms it into a state with a fixed shape (**thought vector**) and the **decoder** maps the encoded state of a fixed shape to an output sequence.  
![title](img/transformer-talk/encoder_decoder_architecture.png)  
*Image Credits: [Illustrated Transformers](https://jalammar.github.io/illustrated-transformer/)*

## Variations of encoder-decoder architectures:
| | Encoder-Decoder models | Encoder only models | Decoder only models |
| --- | --- | --- | --- |
| Analogy: | ![title](img/transformer-talk/encoder_decoder_meme.png)   | ![title](img/transformer-talk/encoder_only_meme.jpg) | ![title](img/transformer-talk/decoder_only_meme.jpg)|
|Use cases: | sequence to sequence tasks like machine translation, document summarization | Embedding tasks, transfer learning for downstream tasks like classification | Language generation, completion, and other generative tasks. |
|Training objective: | Trained to minimize the difference between the predicted and target output sequences.  | Pre-trained on unsupervised tasks like language modeling, masked language modeling, etc.  | Pre-trained for generative tasks, often using autoregressive language modeling. |
|Examples: | T5(Text-to-Text Transfer Transformer), BART   | BERT(Bidirectional Encoder Representations from Transformers) | GPT family |


## Attention mechanism
The attention mechanism is a powerful technique in machine learning that allows a model to focus on specific parts of the input by assigning different weights to them based on their relevance to the task at hand.
### Scaled dot-product attention
![Attention mechanism](https://files.readme.io/ab64790-image.png)  
*Image Credits: [The Attention Mechanism- Cohere](https://docs.cohere.com/docs/the-attention-mechanism)*
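
For reference, scaled dot-product attention is usually written as the formula below, following "Attention is all you need" (reference 1). Here Q, K, and V are the query, key, and value matrices, and d_k is the dimension of the key vectors; dividing by the square root of d_k keeps the dot products from growing too large before the softmax.

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$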

## EncoderLayer = attention + feedforward
**vocab_size:** size of vocabulary (eg: 50K)  
**d_model** (int) – the number of expected features in the encoder/decoder inputs (default=512).  
**dim_feedforward** (int) – the dimension of the feedforward network model (default=2048).  
  
<img src="img/transformer-talk/encoder_with_tensors.png" width="450" height="450">  
<!-- Image Credits: Illustrated Transformers  -->
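
The feedforward half of the encoder layer can be sketched as two linear maps with a non-linearity in between, applied to every position independently. The ReLU activation follows the original paper; the random weights and the 0.02 scaling are placeholders for learned parameters.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feed-forward: expand to dim_feedforward, ReLU, project back."""
    hidden = np.maximum(0.0, x @ w1 + b1)    # (seq_len, dim_feedforward)
    return hidden @ w2 + b2                  # (seq_len, d_model)

d_model, dim_feedforward, seq_len = 512, 2048, 3
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))      # stand-in for the attention output
w1 = rng.normal(size=(d_model, dim_feedforward)) * 0.02
b1 = np.zeros(dim_feedforward)
w2 = rng.normal(size=(dim_feedforward, d_model)) * 0.02
b2 = np.zeros(d_model)

out = feed_forward(x, w1, b1, w2, b2)
print(out.shape)  # (3, 512)
```

The output has the same shape as the input, which is what allows encoder layers to be stacked on top of each other.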

## Self attention
Self attention uses three representations of each word: Query, Key, and Value.  
**Query:** representation of the current word, which is compared against the other words.  
**Key:** representation of each word that the query is compared to.  
**Value:** the actual vectors that are weighted by the attention scores and summed.  
<img src="img/transformer-talk/self-attention-output.png" width="450">  
*Image Credits: [Illustrated Transformers](https://jalammar.github.io/illustrated-transformer/)*
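
A minimal single-head sketch of how Query, Key, and Value interact, assuming random projection matrices in place of learned weights. The function name `self_attention` and the toy sizes are chosen for this example only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention for one sequence.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # (seq_len, seq_len) query-key scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights          # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8          # toy sizes, far smaller than real models
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.sum(axis=-1))
```

Row i of `weights` says how much word i attends to every word in the sequence, including itself.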

## Multiheaded attention
<img src="img/transformer-talk/transformer_multi-headed_self-attention-recap.png" width="700">
<!-- self-attention-output.png -->
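
One way to read the figure: the model dimension is split into several heads, each head runs attention on its own slice, and the results are concatenated and mixed by an output projection. The sketch below assumes that layout, with random matrices standing in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, nhead):
    """x: (seq_len, d_model); all weight matrices: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // nhead

    def split_heads(t):
        # (seq_len, d_model) -> (nhead, seq_len, d_head)
        return t.reshape(seq_len, nhead, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (nhead, seq_len, seq_len)
    heads = softmax(scores) @ v                            # (nhead, seq_len, d_head)
    # Concatenate the heads back to (seq_len, d_model), then mix them with w_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
seq_len, d_model, nhead = 5, 16, 4       # toy sizes; d_model must be divisible by nhead
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(x, w_q, w_k, w_v, w_o, nhead)
print(out.shape)  # (5, 16)
```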

## Positional Encoding
Positional encoding helps the model distinguish between the same word occurring at different positions. It can be learned or fixed.  
Eg:  
I live in New Delhi.  
New Delhi is the capital of India.     
With positional encoding, "New Delhi" receives a slightly different input representation in each sentence because it appears at different positions.  
<img src="img/transformer-talk/transformer_positional_encoding_vectors.png" width="500">  
*Image Credits: [Illustrated Transformers](https://jalammar.github.io/illustrated-transformer/)*
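
As one example of a fixed encoding, the sketch below implements the sinusoidal scheme from "Attention is all you need" and adds it to a stand-in word embedding at two positions. The embedding values are random placeholders, and an even `d_model` is assumed.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Fixed encodings: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

d_model = 8                                  # toy size; must be even here
pe = sinusoidal_positional_encoding(max_len=10, d_model=d_model)

# The same word embedding at two different positions ...
rng = np.random.default_rng(0)
word_embedding = rng.normal(size=d_model)    # stand-in for one token's embedding
at_position_3 = word_embedding + pe[3]
at_position_0 = word_embedding + pe[0]
# ... ends up as two different input vectors.
print(np.allclose(at_position_3, at_position_0))  # False
```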


## Residual connections and Layer normalization
**Layer Normalization:**  It involves normalizing the inputs of a layer by *subtracting the mean and dividing by the standard deviation*.  
It ensures that the inputs to each layer have a consistent distribution, which can aid in *faster and more stable training*.  
**Residual connections:**  Residual connections, also known as skip connections, involve adding the input of a layer to its output.
The depth of the Transformer model allows it to capture long-range dependencies, but deep networks can suffer from *difficulties in learning due to vanishing gradients*. Residual connections *enable the gradient to flow directly through the skip connection*, making it easier for the model to learn identity mappings and facilitating the training of deeper networks.  
<img src="img/transformer-talk/transformer_resideual_layer_norm.png" width="400">  
*Image Credits: [Illustrated Transformers](https://jalammar.github.io/illustrated-transformer/)*
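
A small sketch of the "add and normalize" step, assuming the ordering shown in the figure (add the sublayer input to its output, then normalize). The learnable scale and shift of layer normalization are left out, and the sublayer is a hypothetical stand-in.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector (last axis) to zero mean and unit variance.

    The learnable scale and shift that usually follow are omitted for brevity.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(4, 8))   # (seq_len, d_model), toy sizes
w = rng.normal(size=(8, 8))

# Hypothetical sublayer standing in for attention or the feedforward network.
out = add_and_norm(x, lambda t: np.tanh(t @ w))
print(out.mean(axis=-1), out.std(axis=-1))        # roughly 0 and 1 for every row
```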

## Encoder stack
**d_model** (int) – the number of expected features in the encoder/decoder inputs (default=512).  
**nhead** (int) – the number of heads in the multiheadattention models (default=8).  
**num_encoder_layers** (int) – the number of sub-encoder-layers in the encoder (default=6).  
**dim_feedforward** (int) – the dimension of the feedforward network model (default=2048).  

<img src="img/transformer-talk/transformer_encoder_decoder_stack.png" width="500">  
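
To make the default hyperparameters concrete, the back-of-the-envelope count below estimates the number of trainable parameters in the encoder stack. It assumes biases on every linear layer and two layer norms per encoder layer, and it ignores embeddings; the function name is made up for this example.

```python
def encoder_layer_params(d_model=512, dim_feedforward=2048):
    """Rough parameter count for one encoder layer, biases included."""
    # Q, K, V and output projections: four (d_model x d_model) matrices plus biases.
    attention = 4 * (d_model * d_model + d_model)
    # Two linear layers: d_model -> dim_feedforward -> d_model.
    feedforward = (d_model * dim_feedforward + dim_feedforward) + (
        dim_feedforward * d_model + d_model
    )
    # Two layer norms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    return attention + feedforward + layer_norms

per_layer = encoder_layer_params()
num_encoder_layers = 6
print(per_layer, num_encoder_layers * per_layer)  # 3152384 18914304
```

Note that nhead does not appear in the count: splitting d_model across more heads changes how the projections are used, not how large they are.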

## Decoder = self attention + encoder-decoder attention + feedforward layer
**Self attention**: Same as self attention in the encoder, except that it is masked so that each position can only attend to previously generated tokens.  
**Encoder-decoder attention**: This layer helps the decoder focus on the relevant parts of the input sequence. Q comes from the decoder; the K and V vectors are taken from the encoder output.  
**Feedforward network**: Same as in the encoder.
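
The masking in decoder self attention can be sketched by setting the scores for future positions to negative infinity before the softmax, so their attention weights become zero. The random scores below are a stand-in for the scaled query-key products.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))   # stand-in for Q @ K.T / sqrt(d_k)

# True above the diagonal: position i must not attend to positions j > i.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked_scores = np.where(future, -np.inf, scores)

weights = softmax(masked_scores, axis=-1)
print(np.round(weights, 2))   # lower-triangular: zeros above the diagonal
```

The first row attends only to the first token, the second row to the first two tokens, and so on.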



## Final layer and softmax
**Linear layer**: The linear layer is a simple fully connected layer that projects the vector produced by the stack of decoders into a much larger vector called the logits vector, with one score per word in the vocabulary.  
**Softmax:** The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.

<img src="img/transformer-talk/transformer_decoder_output_softmax.png" width="500">  
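
The last two steps can be sketched as a matrix multiplication followed by a softmax and an argmax. The five-word vocabulary, the random weights, and the greedy argmax choice are assumptions for illustration; other decoding strategies pick the word differently.

```python
import numpy as np

vocab = ["i", "am", "a", "student", "<eos>"]   # toy vocabulary, for illustration
d_model, vocab_size = 8, len(vocab)

rng = np.random.default_rng(0)
decoder_output = rng.normal(size=d_model)          # stand-in for the decoder stack output
w_out = rng.normal(size=(d_model, vocab_size))     # linear layer weights (learned in practice)
b_out = np.zeros(vocab_size)

logits = decoder_output @ w_out + b_out            # one score per vocabulary word
exp = np.exp(logits - logits.max())                # subtract max for numerical stability
probs = exp / exp.sum()                            # softmax: positive, sums to 1

predicted_word = vocab[int(np.argmax(probs))]      # greedy choice for this time step
print(predicted_word, probs.round(3))
```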
<!-- *Image Credits:* [Illustrated Transformers]("https://jalammar.github.io/illustrated-transformer/) -->

## Inference/training
<img src="img/transformer-talk/transformer_full_view.png" width="800">  

## What next?
1. [Hugging Face NLP course](https://huggingface.co/learn/nlp-course) - Models, tokenizers, datasets, and the Hugging Face ecosystem
2. Explore foundation models like BERT, BART, T5, and GPT
3. Explore newer architectures like [Mixture of Experts](https://huggingface.co/blog/moe#mixture-of-experts-explained), [Mamba](https://arxiv.org/abs/2312.00752), etc.

## References:  
1. [Attention Is All You Need (paper)](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)
2. [Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
3. [Annotated Transformer](https://nlp.seas.harvard.edu/annotated-transformer/)
4. [The Attention Mechanism](https://docs.cohere.com/docs/the-attention-mechanism)
5. [Transformer Feed-Forward Layers Are Key-Value Memories (paper)](https://aclanthology.org/2021.emnlp-main.446.pdf)
   