# Transformers
This notebook aims to build an understanding of transformers by defining the types of transformers and their functions. It then goes in depth on the architecture of encoders and decoders and the mechanisms that make up the encoder (examples include feed forward networks and multi-head self-attention mechanisms).

## Types of Transformers
### Encoder-Only Transformer 
<ins>Function</ins>: Encoder-only transformers take in an input sequence and contextualize its meaning. They aim to understand the input sequence, often so that the result can be used for other tasks.

<ins>Use Cases</ins>: Classification of Input Data

### Decoder-Only Transformer 
<ins>Function</ins>: Decoder-only transformers take in the sequence generated so far by the decoder and use it to generate a continuation of that sequence.

<ins>Use Cases</ins>: Text Completion 

### Encoder-Decoder Transformer 
<ins>Function</ins>: Encoder-decoder transformers contextualize the meaning of an input sequence through an encoder and use that information to generate a separate output sequence.

<ins>Use Cases</ins>: Language Translations

# Encoder
Encoders are generally used to obtain contextualized information about the words/phrases in an input sequence. 

The figure below represents the structure of an **encoder**:
<img src="EncoderImage.png" width="250" style="display:block; margin-left:auto; margin-right:auto;">


## Input Embedding 
<ins>Function</ins>: Input embedding aims to represent input tokens as vectors in a high-dimensional space. 

#### Tokens 
For a transformer, the input is split into **tokens** using a **tokenizer**. Tokens can be complete words, pieces of words (subwords), or punctuation marks.

#### Input ID 
Afterwards, each token is assigned its unique **input ID**, which is based on the transformer's defined vocabulary. 
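
The two steps above can be sketched as follows. This is a minimal illustration, assuming a toy regex tokenizer and a tiny hand-written vocabulary: `tokenize`, `to_input_ids`, `vocab`, and the `[UNK]` entry are all hypothetical, and a real transformer ships with its own trained tokenizer and vocabulary.

```python
import re

def tokenize(text):
    # Hypothetical toy tokenizer: lowercase words and single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.lower())

# Hypothetical vocabulary mapping each known token to its input ID
vocab = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3, ".": 4}

def to_input_ids(tokens, vocab):
    # Tokens missing from the vocabulary fall back to the [UNK] ID
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

tokens = tokenize("The cat sat.")
input_ids = to_input_ids(tokens, vocab)
print(tokens)     # ['the', 'cat', 'sat', '.']
print(input_ids)  # [1, 2, 3, 4]
```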

#### Embedding 
The input IDs of the transformer are mapped into vectors in a high-dimensional space, where the position of the vector in the high-dimensional space corresponds to the 'meaning' of the token. This is called an **embedding**, where a column represents a dimension of the embedding vector. 

The values in each token's vector are parameters, which are learned and adjusted during training to better represent the 'meaning' of the token. The collection of the embeddings of all tokens is called the **embedding matrix**, where each row corresponds to a token ID and each column corresponds to one embedding dimension across all tokens.
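
As a rough sketch of the lookup described above, the snippet below builds an embedding matrix and selects the rows for a few input IDs. The sizes and the input IDs are assumed toy values, and the matrix is randomly initialised here, whereas in a trained model its values would have been learned.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 5, 8          # assumed toy sizes

# Embedding matrix: one row per token ID, one column per embedding dimension.
# Random values stand in for parameters that would be learned during training.
embedding_matrix = rng.normal(size=(vocab_size, d_model))

input_ids = np.array([1, 2, 3, 4])  # hypothetical input IDs

# Looking up an embedding is just selecting the row for each input ID
token_embeddings = embedding_matrix[input_ids]
print(token_embeddings.shape)  # (4, 8)
```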

## Positional Encoding 
<ins>Function</ins>: Positional encoding aims to add information about a token's position in the sequence, and therefore its position relative to neighboring tokens, to the vector of the token.

### Positional Embedding 
Evaluating the following equations at each token position outputs the **positional embedding**, which encodes where a token sits in the sequence relative to the other tokens. The equations are as follows:
$$
PE_{(pos, 2i)} = \sin \left( \frac{pos}{10000^{2i/d}} \right)   \text{ (even-embedding dimensions)}
$$

$$
PE_{(pos, 2i+1)} = \cos \left( \frac{pos}{10000^{2i/d}} \right)   \text{ (odd-embedding dimensions)}
$$
where:
- $pos$ = token position (0-indexed, start with 0)  
- $i$ = index of the sine/cosine pair of embedding dimensions ($0 \le i < d/2$)  
- $d$ = embedding vector dimension

Essentially, each entry of the positional embedding is calculated by plugging the position of the token in the sequence ($pos$) and a dimension index of the embedding ($i$) into the equations.
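
The equations above translate fairly directly into NumPy. The function below is a minimal sketch that assumes the embedding dimension $d$ is even; the name `positional_encoding` and the toy sizes are illustrative choices.

```python
import numpy as np

def positional_encoding(num_positions, d):
    """Sinusoidal positional encoding; assumes d is even."""
    pos = np.arange(num_positions)[:, None]   # (num_positions, 1)
    i = np.arange(d // 2)[None, :]            # (1, d/2), one index per sin/cos pair
    angles = pos / (10000 ** (2 * i / d))     # (num_positions, d/2)
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)              # even embedding dimensions
    pe[:, 1::2] = np.cos(angles)              # odd embedding dimensions
    return pe

pe = positional_encoding(num_positions=4, d=8)
print(pe.shape)  # (4, 8)
```

At position 0 every sine entry is 0 and every cosine entry is 1, which is a quick sanity check on the indexing.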

## Finalized Embedding 
The positional embedding and input embedding are added together element-wise to form the **encoder input**, a matrix in which each row corresponds to a token and each column corresponds to a dimension of the encoder input.

## Self-Attention
Self-attention is a crucial part of transformers because it controls how much each input vector contributes to each output. To apply self-attention, the attention weights for each input vector must first be computed.

### Keys, Queries, Values
In a transformer model, each input vector is linearly transformed to create three separate vectors. These vectors are computed through the following equations:
$$
\mathbf{q}_n = \boldsymbol{\beta}_q + \mathbf{\Omega}_q \mathbf{x}_n
$$

$$
\mathbf{k}_m = \boldsymbol{\beta}_k + \mathbf{\Omega}_k \mathbf{x}_m
$$

$$
\mathbf{v}_m = \boldsymbol{\beta}_v + \mathbf{\Omega}_v \mathbf{x}_m
$$
Where: 
- $\mathbf{q}_n$ is defined as the **query** for token $n$ (stacking the queries of all tokens gives dimensions: number of tokens, embedding dimension)
- $\mathbf{k}_m$ is defined as the **key** for token $m$ (stacking the keys of all tokens gives dimensions: number of tokens, embedding dimension)
- $\mathbf{v}_m$ is defined as the **value** for token $m$

In these equations, $\boldsymbol{\beta}$ and $\mathbf{\Omega}$ are learnable parameters. Intuitively, keys represent the categories of each token, values represent the content of each token, and queries represent the category being looked up.
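
A minimal sketch of these three linear transformations, assuming the input vectors are stacked as the rows of a matrix `X` and that the parameters are randomly initialised for illustration (the sizes are toy values). Because tokens are stored as rows, $\mathbf{\Omega}$ is applied as a transpose on the right.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8                    # assumed: 4 tokens, embedding dimension 8
X = rng.normal(size=(N, d))    # row n is the input vector x_n

# Learnable parameters, randomly initialised here for illustration
Omega_q, Omega_k, Omega_v = (rng.normal(size=(d, d)) for _ in range(3))
beta_q, beta_k, beta_v = (np.zeros(d) for _ in range(3))

# Row n of Q is q_n = beta_q + Omega_q x_n (and likewise for K and V)
Q = X @ Omega_q.T + beta_q
K = X @ Omega_k.T + beta_k
V = X @ Omega_v.T + beta_v
print(Q.shape, K.shape, V.shape)  # (4, 8) (4, 8) (4, 8)
```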

### Attention Weights
The query ($\mathbf{q}_n$) and the key ($\mathbf{k}_m$) are both used to compute the attention weights for the input vector. The attention weights are computed through the following equation:
$$
a[\mathbf{x}_m, \mathbf{x}_n] 
= \text{softmax}_m\left[ \frac{\mathbf{k}_m^{T} \mathbf{q}_n }{\sqrt{d_k}} \right]
$$
Where: 
- $d_k$ is the dimension of the query and key vectors; dividing by $\sqrt{d_k}$ keeps the dot products at a stable scale

Written out in full, the softmax function gives:
$$
a[\mathbf{x}_m, \mathbf{x}_n]
= \frac{\exp\left( \mathbf{k}_m^{T} \mathbf{q}_n / \sqrt{d_k} \right)}
{\sum_{m'=1}^{N} \exp\left( \mathbf{k}_{m'}^{T} \mathbf{q}_n / \sqrt{d_k} \right)}
$$

In the equation, the scaled dot products between the query ($\mathbf{q}_n$) and the keys ($\mathbf{k}_m$) are passed through the softmax function, creating the **attention weights** needed for self-attention.

<ins>Definition of Attention Weight</ins>: The attention weights form a matrix with dimensions (number of tokens, number of tokens), where the entry at the intersection of two tokens represents how strongly one token attends to the other. For a given query, the weights over all keys are non-negative and sum to 1. The entry for a token paired with itself is often large, but this is not guaranteed, because queries and keys are produced by different learned transformations.
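
The attention weight computation can be sketched as below. The queries and keys are random stand-ins, and `softmax` and `attention_weights` are illustrative helper names. Row $n$ of the result holds the weights that query $n$ places on every key $m$.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the maximum before exponentiating for numerical stability
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(Q, K):
    d_k = K.shape[-1]
    # scores[n, m] = k_m^T q_n / sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Normalise over m so each query's weights sum to 1
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # assumed toy queries
K = rng.normal(size=(4, 8))   # assumed toy keys
A = attention_weights(Q, K)
print(A.shape)  # (4, 4)
```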

### Self-Attention Outputs
The **self-attention output** for token $n$ is computed by multiplying each attention weight by its corresponding value ($\mathbf{v}_m$) and summing the results. The equation for the self-attention output is represented as:
$$
\mathrm{sa}_n[\mathbf{x}_1, \ldots, \mathbf{x}_N]
= \sum_{m=1}^{N} a[\mathbf{x}_m, \mathbf{x}_n] \, \mathbf{v}_m
$$
Where:
- $a[\mathbf{x}_m, \mathbf{x}_n]$ is the computed attention weight
- $\mathbf{v}_m$ is the corresponding value
- $\sum_{m=1}^{N}$ sums the weighted values over all $N$ tokens

This produces the self-attention output, with dimensions (tokens, embedding dimension). The output vector for each token carries information from the input embedding, the positional embedding, and the attention weights.
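
Putting the weights and values together, a minimal sketch of the self-attention output looks like this. The inputs are random stand-ins and `self_attention` is an illustrative name. In matrix form the weighted sum over $m$ becomes a single matrix product.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    # A[n, m] is the attention weight a[x_m, x_n]
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    # Row n of the result is sum_m a[x_m, x_n] * v_m
    return A @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))  # assumed toy inputs
out = self_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

One way to check the behaviour: if every query is the zero vector, all scores are equal, the weights are uniform, and each output row is simply the average of the value vectors.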

## Multi-Head Self-Attention
In multi-head self-attention, the computed keys, queries, and values are split into smaller pieces along the embedding dimension. These pieces are defined as heads, and each contains a slice of the embeddings from all the tokens.

For each head, the attention weights and the self-attention output are computed using the same formulas defined above. The individual self-attention outputs are then combined into the multi-head self-attention output through the following formula:

$$
\mathrm{MhSa}[\mathbf{X}] 
= 
\mathbf{\Omega}_c 
\begin{bmatrix}
\mathrm{Sa}_1[\mathbf{X}]^{T},\;
\mathrm{Sa}_2[\mathbf{X}]^{T},\;
\dots,\;
\mathrm{Sa}_H[\mathbf{X}]^{T}
\end{bmatrix}^{T}.
$$
Where: 
- $\mathrm{Sa}_1[\mathbf{X}]^{T}$ represents the first self-attention output (first "head")
- $\mathbf{\Omega}_c$ represents a learnable weight matrix, which linearly transforms the concatenated self-attention outputs of all heads
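
A rough sketch of the head splitting and recombination, assuming tokens are stored as rows (so $\mathbf{\Omega}_c$ is applied as a transpose on the right) and that the embedding dimension is divisible by the number of heads. The function name and the random inputs are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(Q, K, V, Omega_c, num_heads):
    N, d = Q.shape
    d_head = d // num_heads   # assumes d is divisible by num_heads
    heads = []
    for h in range(num_heads):
        # Each head sees one slice of the embedding dimension for all tokens
        cols = slice(h * d_head, (h + 1) * d_head)
        Qh, Kh, Vh = Q[:, cols], K[:, cols], V[:, cols]
        A = softmax(Qh @ Kh.T / np.sqrt(d_head), axis=-1)
        heads.append(A @ Vh)                    # (N, d_head)
    concat = np.concatenate(heads, axis=1)      # (N, d)
    # Final linear transformation of the concatenated head outputs
    return concat @ Omega_c.T

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))  # assumed toy inputs
Omega_c = rng.normal(size=(8, 8))                      # random stand-in
out = multi_head_self_attention(Q, K, V, Omega_c, num_heads=2)
print(out.shape)  # (4, 8)
```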

## Layer Normalization
Layer normalization normalizes the embedding of each token so that its values stay on a stable scale. The formula for Layer Normalization is:

$$
\mathrm{LN}(\mathbf{x}) = \boldsymbol{\gamma} \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \boldsymbol{\beta}
$$

Where: 

- $\mathbf{x} = [x_1, x_2, \dots, x_d]$ is the input vector (for a single token)  
- $\mu$ is the mean of the components of $\mathbf{x}$, which is calculated by:  
$$
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i
$$
- $\sigma^2$ is the variance of the components of $\mathbf{x}$, which is calculated by:  
$$
\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2
$$
- $\epsilon$ is a small constant added for numerical stability  
- $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are learnable parameters (dimension $d$)  
- $\odot$ represents element-wise multiplication
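
The formula above can be sketched in a few lines of NumPy. This version normalises each row of a matrix independently (one row per token); the function name, the `eps` default, and the toy input are assumptions made for illustration.

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    # Mean and variance are taken over the components of each token's vector
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)   # population variance (divides by d)
    return gamma * (X - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(4, 8))   # assumed toy input
gamma, beta = np.ones(8), np.zeros(8)             # identity scale and shift
Y = layer_norm(X, gamma, beta)
print(Y.shape)  # (4, 8)
```

With $\boldsymbol{\gamma} = 1$ and $\boldsymbol{\beta} = 0$, each output row has a mean of approximately 0 and a variance of approximately 1.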

## Encoder Layer
### Multi-Head Self Attention
Using the multi-head self-attention mechanism, we are now able to define an encoder layer in the encoder. 
The input embeddings are used to compute the **multi-head self-attention output**, which is added back to the input:
$$
\mathbf{X} \leftarrow \mathbf{X} + \mathrm{MhSa}[\mathbf{X}] \\
$$
Where: 
- $\mathbf{X}$ represents the input of the encoder layer
- $\mathrm{MhSa}[\mathbf{X}]$ represents the multi-head self-attention output

### LayerNorm
The result is then normalized through a **LayerNorm** using the following formula:
$$
\mathbf{X} \leftarrow \mathrm{LayerNorm}[\mathbf{X}] \\
$$
Where: 
- $\mathrm{LayerNorm}$ stabilizes $\mathbf{X}$ by normalizing the mean and variance of each token's vector

### Feed Forward Network
After normalization, each token's vector is fed into a **feed forward network**, which is a fully connected neural network:
$$
\mathbf{x}_n \leftarrow \mathbf{x}_n + \mathrm{mlp}[\mathbf{x}_n] \\
$$
Where: 
- $\mathbf{x}_n$ represents a row of the matrix $\mathbf{X}$ (the vector for token $n$)
- $\mathrm{mlp}[\mathbf{x}_n]$ is a fully connected neural network that takes $\mathbf{x}_n$ as its input

Essentially, this equation passes each row of $\mathbf{X}$ through the same fully connected neural network and adds the result back to that row.

### LayerNorm
Finally, the output of the feed forward network is passed through a **LayerNorm** to normalize the output value's mean and variance. The equation is represented as:
$$
\mathbf{X} \leftarrow \mathrm{LayerNorm}[\mathbf{X}]
$$

**<ins>Note</ins>**: 
For equations 
$$
\mathbf{X} \leftarrow \mathbf{X} + \mathrm{MhSa}[\mathbf{X}] \\
$$
$$
\mathbf{x}_n \leftarrow \mathbf{x}_n + \mathrm{mlp}[\mathbf{x}_n] \\
$$

Notice how $\mathbf{X}$ is added back in the first equation and $\mathbf{x}_n$ is added back in the second equation. These are residual connections: they add the original input to the output of the sublayer, which allows the layer to preserve information from earlier stages.
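
The four steps of the encoder layer can be chained together as in the sketch below. To keep it short it makes several simplifying assumptions that are not from the text above: a single attention head instead of multiple heads, LayerNorm without the learnable $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$, an MLP with one ReLU hidden layer, and randomly initialised parameters stored in a hypothetical dictionary `p`.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(X, eps=1e-5):
    # Simplification: gamma = 1 and beta = 0
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def self_attention(X, p):
    # Simplification: one head; a full layer would use multi-head attention
    Q = X @ p["Omega_q"].T + p["beta_q"]
    K = X @ p["Omega_k"].T + p["beta_k"]
    V = X @ p["Omega_v"].T + p["beta_v"]
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return A @ V

def mlp(X, p):
    # Assumed architecture: one ReLU hidden layer, applied to every row
    hidden = np.maximum(0.0, X @ p["W1"] + p["b1"])
    return hidden @ p["W2"] + p["b2"]

def encoder_layer(X, p):
    X = X + self_attention(X, p)   # residual connection around attention
    X = layer_norm(X)
    X = X + mlp(X, p)              # residual connection around the MLP
    X = layer_norm(X)
    return X

rng = np.random.default_rng(0)
N, d, d_hidden = 4, 8, 16          # assumed toy sizes
p = {
    "Omega_q": 0.1 * rng.normal(size=(d, d)), "beta_q": np.zeros(d),
    "Omega_k": 0.1 * rng.normal(size=(d, d)), "beta_k": np.zeros(d),
    "Omega_v": 0.1 * rng.normal(size=(d, d)), "beta_v": np.zeros(d),
    "W1": 0.1 * rng.normal(size=(d, d_hidden)), "b1": np.zeros(d_hidden),
    "W2": 0.1 * rng.normal(size=(d_hidden, d)), "b2": np.zeros(d),
}
X = rng.normal(size=(N, d))
out = encoder_layer(X, p)
print(out.shape)  # (4, 8)
```

The output has the same shape as the input, which is what allows several encoder layers to be stacked one after another.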

## Encoder Output
After the encoder processes the input embedding through multiple encoder layers, it outputs a matrix with dimensions of (tokens, embedding dimension). The outputted matrix carries information from the input embedding, positional encoding, and the attention weights. Intuitively, the encoder output can be thought of as the contextualized version of the input tokens.

# Decoder
Decoders are generally used to generate the next part of an output sequence given the output sequence generated by the decoder in previous timesteps. The figure below represents the structure of a **decoder**:
<img src="DecoderImage.png" width="250" style="display:block; margin-left:auto; margin-right:auto;">
<ins>Note</ins>: In an encoder-decoder transformer, keys and values computed from the encoder output are used as an input for the decoder's multi-head attention. In a decoder-only transformer, the keys and values are generated from the input of the decoder.

## Masked Multi-Head Attention
The decoder uses many of the same mechanisms as the encoder, such as multi-head attention and layer normalization. However, the decoder begins with **masked multi-head attention**, in contrast to the encoder, which uses unmasked multi-head attention.

Masked multi-head attention calculates the attention weights by computing the dot product between the query and the key and stabilizing the term. This process is represented through the following equation.
$$
a[\mathbf{x}_m, \mathbf{x}_n] 
= \text{softmax}_m\left[ \frac{\mathbf{k}_m^{T} \mathbf{q}_n }{\sqrt{d_k}} \right]
$$
Where: 
- $d_k$ is the dimension of the query and key vectors; dividing by $\sqrt{d_k}$ keeps the dot products at a stable scale

However, masked multi-head attention ensures that the token whose attention weights are being calculated can only interact with itself and previous tokens. Therefore, for any token that comes after the current token, the scaled dot product is replaced with negative infinity before the softmax. When these scores are passed through the softmax function, the corresponding attention weights become 0, which prevents the current token from attending to future tokens.
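
A minimal sketch of the masking step, assuming random queries and keys and an illustrative function name. Entries above the diagonal of the score matrix correspond to future tokens and are set to negative infinity before the softmax.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention_weights(Q, K):
    N, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)   # scores[n, m] = k_m^T q_n / sqrt(d_k)
    # True wherever key position m comes after query position n
    future = np.triu(np.ones((N, N), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # exp(-inf) = 0, so future positions receive zero weight
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # assumed toy queries
K = rng.normal(size=(4, 8))   # assumed toy keys
A = masked_attention_weights(Q, K)
print(np.round(A, 2))
```

The printed matrix is lower triangular: the first token can only attend to itself, so its single weight is 1, while the last token can attend to every position.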

## Decoder Layer
Using the masked multi-head self-attention mechanism, we are now able to define a decoder layer in the decoder.

### Masked Multi-Head Self Attention Weights
The input embeddings are used to compute the **masked multi-head self-attention output**, which is added back to the input through the following formula:
$$
\mathbf{X} \leftarrow \mathbf{X} + \mathrm{MaSa}[\mathbf{X}] \\
$$
Where: 
- $\mathbf{X}$ represents the input of the decoder layer
- $\mathrm{MaSa}[\mathbf{X}]$ represents the masked multi-head self-attention output

### LayerNorm
The result is then normalized through **LayerNorm**, which is represented as:
$$
\mathbf{X} \leftarrow \mathrm{LayerNorm}[\mathbf{X}] \\
$$
Where: 
- $\mathrm{LayerNorm}$ stabilizes $\mathbf{X}$ by normalizing the mean and variance of each token's vector

### Multi-Head Self-Attention Weights
After normalization, the **multi-head self-attention output** is computed and added back to the input:

$$
\mathbf{X} \leftarrow \mathbf{X} + \mathrm{MhSa}[\mathbf{X}] \\
$$
Where: 
- $\mathbf{X}$ represents the input of the decoder layer
- $\mathrm{MhSa}[\mathbf{X}]$ represents the multi-head self-attention output

<ins>Note</ins>: The input of the multi-head self-attention depends on whether the transformer is an encoder-decoder or a decoder-only transformer. For an encoder-decoder transformer, the keys and values computed from the encoder output are used as the input to this attention block of the decoder, while the queries come from the decoder (intuitively, contextualized input embeddings are given to the decoder). For a decoder-only transformer, the keys and values come from the decoder itself.
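
For the encoder-decoder case, the attention step can be sketched as below, with queries computed from the decoder's tokens and keys and values computed from the encoder output. The sizes, the random parameters, the omission of the bias terms, the single head, and the name `cross_attention` are all simplifying assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_X, encoder_out, p):
    Q = decoder_X @ p["Omega_q"].T     # queries come from the decoder
    K = encoder_out @ p["Omega_k"].T   # keys come from the encoder output
    V = encoder_out @ p["Omega_v"].T   # values come from the encoder output
    # One row of weights per decoder token, one column per encoder token
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return A @ V

rng = np.random.default_rng(0)
d = 8
encoder_out = rng.normal(size=(5, d))   # assumed: 5 input tokens
decoder_X = rng.normal(size=(3, d))     # assumed: 3 output tokens so far
p = {name: rng.normal(size=(d, d)) for name in ("Omega_q", "Omega_k", "Omega_v")}
out = cross_attention(decoder_X, encoder_out, p)
print(out.shape)  # (3, 8)
```

The output has one row per decoder token, even though the keys and values come from a sequence of a different length.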

### LayerNorm
After computing the multi-head self-attention output, the result is normalized again through **LayerNorm**, which is represented as follows:

$$
\mathbf{X} \leftarrow \mathrm{LayerNorm}[\mathbf{X}] \\
$$

### Feed Forward Network
After normalization, these values are inputted into a **feed forward network**. This is represented as:
$$
\mathbf{x}_n \leftarrow \mathbf{x}_n + \mathrm{mlp}[\mathbf{x}_n] \\
$$

Where: 
- $\mathbf{x}_n$ represents a row of the matrix $\mathbf{X}$ (the vector for token $n$)
- $\mathrm{mlp}[\mathbf{x}_n]$ is a fully connected neural network that takes $\mathbf{x}_n$ as its input

Essentially, this equation passes each row of $\mathbf{X}$ through the same fully connected neural network and adds the result back to that row.

### LayerNorm
Finally, the output of the fully connected neural network is passed through a **LayerNorm** to normalize the output value's mean and variance. The equation is represented as:
$$
\mathbf{X} \leftarrow \mathrm{LayerNorm}[\mathbf{X}]
$$

**<ins>Note</ins>**: 
For equations 
$$
\mathbf{X} \leftarrow \mathbf{X} + \mathrm{MhSa}[\mathbf{X}] \\
$$
$$
\mathbf{x}_n \leftarrow \mathbf{x}_n + \mathrm{mlp}[\mathbf{x}_n] \\
$$

Notice how $\mathbf{X}$ is added back in the first equation and $\mathbf{x}_n$ is added back in the second equation. These are residual connections: they add the original input to the output of the sublayer, which allows the layer to preserve information from earlier stages.

## Decoder Output
The final decoder layer outputs a contextualized matrix of the decoder's input embeddings, which carries information from the input embeddings, positional encoding, masked self-attention, and encoder-decoder attention. The matrix is passed through a linear layer, which returns a score for every token in the vocabulary as a possible continuation of the output sequence. These scores are passed through the softmax function to obtain probabilities, and the next token is chosen according to the selection algorithm (for example, always picking the most probable token).
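
The final step can be sketched as follows, assuming a random stand-in for the decoder output, a randomly initialised linear layer (`W_out`, `b_out` are hypothetical names), and greedy selection as the selection algorithm. Other selection algorithms, such as sampling from the probabilities, would replace only the last line.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, vocab_size = 8, 5                       # assumed toy sizes
decoder_output = rng.normal(size=(3, d))   # stand-in for the decoder's output

# Linear layer mapping an embedding to one score per vocabulary token
W_out = rng.normal(size=(d, vocab_size))
b_out = np.zeros(vocab_size)

# The row for the last position is used to predict the next token
logits = decoder_output[-1] @ W_out + b_out
probs = softmax(logits)

# Greedy selection: pick the token ID with the highest probability
next_token_id = int(np.argmax(probs))
print(probs.shape)  # (5,)
```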