# Transformer Architecture

Original Work: [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

## Main Architecture

* The Encoder
   - Purpose: Processes input sequences and generates a contextual representation.
   - The encoder outputs one vector per input token. Each vector combines the token's meaning with its position and its relationships to the other tokens in the sequence.
* The Decoder
   - Purpose: Generates output sequences, one token at a time, using the contextual representation from the encoder and the tokens generated so far.

<!-- ![Transformer Model Architecture](./Transformer-Model-Architecture.jpg) -->

### Types of Inputs and Outputs

1. **Vector to Sequence**:
   - **Example**: An image (encoded as a vector) mapped to a text description of the image.
2. **Sequence to Vector**:
   - **Example**: A movie review mapped to a vector of probabilities (for example, positive vs. negative sentiment).
3. **Sequence to Sequence**: 
   - **Example**: Translation from one language to another, such as Spanish to English.

### Vocabulary

- **Input Embeddings**: Map each token to a vector, so that similar words lie close together in the embedding space.
- **Positional Encoders**: Add information about each word's position in the sentence, using sine and cosine functions of the position. Attention by itself ignores word order, so this is how the model learns where each word sits.
- **Attention**: Weights the relevant parts of the input for each token, producing an attention vector that captures contextual relationships.
- **Tokens**: Each token (word, subword, character, etc.) is represented as a vector of real numbers.
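
To make the sine and cosine positional encoding concrete, here is a minimal NumPy sketch of the scheme described in the original paper. The function name `sinusoidal_positional_encoding` and the assumption that `d_model` is even are illustrative choices, not part of any library API.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of position encodings.

    Assumes d_model is even. Even columns hold sines and odd columns hold
    cosines, with wavelengths that grow geometrically across dimensions.
    """
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)  # PE(pos, 2i)
    encoding[:, 1::2] = np.cos(angles)  # PE(pos, 2i + 1)
    return encoding

# The encoding is added to the token embeddings before the first layer.
token_embeddings = np.zeros((4, 8))  # placeholder embeddings: 4 tokens, d_model = 8
model_input = token_embeddings + sinusoidal_positional_encoding(4, 8)
```

Because every position gets a distinct pattern, two identical words at different positions reach the attention layers as different vectors.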

### Steps Through the Encoder Block

* Encoder (processes the input sequence)
   - **Input Embedding**: Maps each word in the sequence to a vector.
   - **Positional Encoding**: Adds information about each word's position in the sequence, since attention alone does not track word order.
   - **Multi-Head Attention**: Captures the contextual relationships between words in the sentence. Each word's output is a weighted average over all words, with the weights reflecting how important each word is to it.
   - **Feed Forward Network**: Applies the same fully connected network to each position's attention vector independently.
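
The weighted-average step inside attention can be sketched as single-head scaled dot-product attention. This is a simplified illustration: real multi-head attention first applies learned projection matrices to produce queries, keys, and values and runs several heads in parallel, which is omitted here. The function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Single-head attention over (seq_len, d_k) queries, keys, and values.

    Returns the attended vectors and the attention weights.
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # similarity of every query to every key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights         # weighted average of the value vectors

# Self-attention: queries, keys, and values all come from the same sequence.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))  # 3 tokens, 4-dimensional vectors (made-up data)
output, weights = scaled_dot_product_attention(x, x, x)
```

Row `i` of `weights` shows how much token `i` draws on every token in the sequence when building its contextual vector.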

* Decoder (generates the desired output sequence)
   - **Input Embedding**: Maps each word of the output generated so far to a vector.
   - **Positional Encoding**: Adds information about each word's position in the sequence.
   - **Masked Multi-Head Attention**: Self-attention over the output sequence in which each position can attend only to itself and earlier positions. The mask stops the model from looking at words it has not generated yet.
   - **Encoder-Decoder Attention**: Lets each output position attend to the encoder's output, relating each output word to the words of the input sequence.
   - **Feed Forward Network**: Processes the encoder-decoder attention vectors.
   - **Linear Layer**: Projects each decoder output vector to one score (logit) per word in the target language vocabulary.
   - **Softmax Layer**: Converts the scores into a probability distribution. With greedy decoding, the word with the highest probability is output.
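
Two decoder-specific steps, the causal mask and the final linear plus softmax step, can be sketched as follows. This is a minimal illustration under stated assumptions: `w_vocab` and `b_vocab` stand in for learned parameters, the numbers are made up, and picking the arg max is greedy decoding, which is only one of several decoding strategies.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def causal_mask(seq_len):
    """Boolean mask: entry (i, j) is True when position i may attend to j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_weights(scores):
    """Hide future positions in raw attention scores, then normalize each row."""
    mask = causal_mask(scores.shape[0])
    return softmax(np.where(mask, scores, -np.inf), axis=-1)

def greedy_next_token(decoder_output, w_vocab, b_vocab):
    """Linear layer to vocabulary logits, softmax, then pick the top token."""
    logits = decoder_output @ w_vocab + b_vocab
    probs = softmax(logits)
    return int(np.argmax(probs)), probs

# With equal raw scores, each position spreads its attention evenly over
# itself and the earlier positions, and gives zero weight to later ones.
weights = masked_attention_weights(np.zeros((3, 3)))

# Hypothetical 2-dimensional decoder output and a 3-word vocabulary.
w_vocab = np.array([[0.1, 2.0, -1.0],
                    [0.0, 0.0, 0.0]])
token_id, probs = greedy_next_token(np.array([1.0, 0.0]), w_vocab, np.zeros(3))
```

The chosen `token_id` is appended to the output sequence and fed back into the decoder to produce the next token.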

### Popular Word Embedding Models

- Word2Vec
- GloVe
- FastText

### Transfer Learning

* BERT (Bidirectional Encoder Representations from Transformers)

   - Utilizes multiple encoder layers.
   - Focuses on understanding the context of words bidirectionally, using an encoder-only architecture.
   - **Ideal for**: Tasks requiring deep understanding of input sequences, such as classification, sentiment analysis, and question answering.

* GPT (Generative Pretrained Transformer)

   - Utilizes multiple decoder layers.
   - Focuses on text generation using a unidirectional approach with a decoder-only architecture.
   - **Ideal for**: Tasks requiring text generation, such as dialogue systems, story creation, and summarization.

* Full Transformer

   - Combines both encoder and decoder layers.
   - Transforms input sequences into output sequences, leveraging both components of the architecture.
   - **Ideal for**: Sequence-to-sequence transformations, such as machine translation and complex text summarization.

* Use Cases

   - **BERT**:
      - Uses the encoder to understand and contextualize input text bidirectionally.
      - Suitable for tasks like classification, sentiment analysis, and question answering.

   - **GPT**:
      - Uses the decoder to generate text based on previous context unidirectionally.
      - Suitable for tasks like text generation, dialogue systems, and story creation.

   - **Full Transformer**:
      - Combines encoder and decoder for transforming input sequences to output sequences.
      - Suitable for tasks requiring sequence-to-sequence transformations, like translation.
   
### Transfer Learning Process

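1. **Pretraining**: Trains the model on a large corpus of unlabeled text to predict a word from its context (e.g., given "I love", the model predicts "you").
2. **Finetuning**: Continues training the pretrained model on a specific task, usually with a much smaller task-specific dataset.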
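
As a rough illustration of the pretraining objective, the sketch below turns a sentence into (context, target) training pairs for next-word prediction. It assumes the GPT-style left-to-right setup; BERT-style pretraining instead hides words inside the sentence and predicts them from both sides. The helper name `next_word_pairs` is hypothetical.

```python
def next_word_pairs(tokens):
    """Build (context, target) pairs: each word is predicted from the words before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_word_pairs(["I", "love", "you"])
for context, target in pairs:
    print(context, "->", target)
# prints:
# ['I'] -> love
# ['I', 'love'] -> you
```

No labels are needed for this step because the targets come from the text itself, which is why pretraining can use very large unlabeled corpora.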