# Transformers
- Initially targeted at NLP
- Goal is to design a network that can learn a text representation that is suitable for downstream tasks
- Needs a way to connect related words in the text (**attention**)
- These connections should extend across large text spans

## Dot-product self attention
- A self-attention block takes $N$ inputs each of dimension $D$ and outputs $N$ vectors of the same size. Each input represents a word or word fragment
- First: A set of values is computed for each input $$v_m = \beta_v + \Omega_v x_m$$ where $\Omega_v \in \mathbb{R}^{D \times D}$ and $\beta_v \in \mathbb{R}^{D}$
- Second: 
  - The $n^{th}$ output of a self-attention block $sa_n[x_1,...,x_N]$ is a weighted sum of all values $v_1,...,v_N$ $$sa_n[x_1,...,x_N] = \sum_{m=1}^N a[x_m,x_n]v_m$$
  - The scalar weight $a[x_m,x_n]$ is the attention that the $n^{th}$ output pays to input $x_m$
  - The $N$ weights $a[*,x_n]$ are non-negative and sum to one
## Attention weights
- To compute the attention we apply two more linear transformations to the input $$q_n = \beta_q + \Omega_q x_n \\ k_m = \beta_k + \Omega_k x_m$$ where $q_n$ is the query and $k_m$ is the key.
- Then, the dot products between the queries and keys are passed through a softmax over $m$ $$a[x_m,x_n] = softmax_m[k_\bullet^T q_n] \\ = \frac{exp[k_m^T q_n]}{\sum_{m'=1}^N exp[k_{m'}^T q_n]}$$
- The dot product returns a measure of similarity between queries and keys
- Summary of self-attention
  - $n^{th}$ output is a weighted sum of the same linear transformation $v_* = \beta_v + \Omega_v x_*$ applied to all inputs
  - The weights depend on how similar $x_n$ is to the other inputs
  - There is no activation function, but the mechanism is non-linear due to the dot product and the softmax used to compute the attention weights
- Written in matrix form, where $X$ is the $D \times N$ matrix of inputs, $\mathbb{1}$ is an $N \times 1$ vector of ones, and $Softmax[\cdot]$ normalizes each column independently $$ V[X] = \beta_v \mathbb{1}^T + \Omega_v X \\ Q[X] = \beta_q \mathbb{1}^T + \Omega_q X \\ K[X] = \beta_k \mathbb{1}^T + \Omega_k X  \\ Sa[X] = V[X] \cdot Softmax[K[X]^T Q[X]]$$
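
The matrix form above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the function names, the toy sizes $D=4$, $N=3$, and the random parameters are all assumptions made for the example.

```python
import numpy as np

def softmax_columns(A):
    """Softmax applied independently to each column of A."""
    A = A - A.max(axis=0, keepdims=True)  # subtract the column max for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def self_attention(X, omega_v, beta_v, omega_q, beta_q, omega_k, beta_k):
    """Dot-product self-attention on a D x N matrix X (one column per token)."""
    V = beta_v[:, None] + omega_v @ X   # values,  D x N
    Q = beta_q[:, None] + omega_q @ X   # queries, D x N
    K = beta_k[:, None] + omega_k @ X   # keys,    D x N
    A = softmax_columns(K.T @ Q)        # N x N, A[m, n] = a[x_m, x_n]
    return V @ A, A                     # outputs (D x N) and attention weights

rng = np.random.default_rng(0)
D, N = 4, 3
X = rng.normal(size=(D, N))
# Random stand-ins for the learned parameters, in the order (omega, beta) for v, q, k
params = []
for _ in range(3):
    params += [rng.normal(size=(D, D)), rng.normal(size=D)]
out, A = self_attention(X, *params)
print(out.shape)       # (4, 3)
print(A.sum(axis=0))   # every column of attention weights sums to one
```

Column $n$ of `A` holds the weights $a[*,x_n]$, so column $n$ of `out` is the weighted sum of values for the $n^{th}$ output.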

## Extensions to self-attention
- Positional encoding 
  - Absolute positional encoding
    - Matrix $\Pi$ is added to the input
    - Each column of $\Pi$ is different, so it encodes the absolute position in the input sequence
    - Can be chosen by hand or learned
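- Scaled self attention
  - The dot products are divided by $\sqrt{D_q}$, where $D_q$ is the dimension of the queries and keys, so that large dot products do not saturate the softmax and shrink the gradients $$Sa[X] = V \cdot Softmax\left[\frac{K^T Q }{\sqrt{D_q}}\right]$$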
- Multiple heads
  - Multiple self-attention mechanisms are applied in parallel
  - Each head has its own parameter matrices for the queries, keys, and values
  - $h^{th}$ self attention is $$Sa_h[X] = V_h \cdot Softmax\left[ \frac{K_h^T Q_h}{\sqrt{D_{qh}}} \right]$$
  - Normally, if there are $H$ heads and the input dimension is $D$, the queries, keys, and values of each head have dimension $D/H$
  - The head outputs are vertically concatenated and another linear transformation $\Omega_c$ is applied to them $$MhSa[X] = \Omega_c[Sa_1[X]^T, Sa_2[X]^T,...,Sa_H[X]^T]^T$$
  - Multiple heads seem to be necessary for self-attention to work well in practice
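
The three extensions can be combined in one short NumPy sketch. The helper `init_head`, the sizes ($D=8$, $N=5$, $H=2$), and the random $\Pi$ are assumptions for illustration; in a real model $\Pi$ and all parameters would be learned or chosen deliberately.

```python
import numpy as np

def softmax_columns(A):
    """Softmax applied independently to each column of A."""
    A = A - A.max(axis=0, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def init_head(rng, D, D_head):
    """Random stand-ins for one head's learned parameters (hypothetical helper)."""
    head = {}
    for name in ("v", "q", "k"):
        head["omega_" + name] = rng.normal(size=(D_head, D))
        head["beta_" + name] = rng.normal(size=D_head)
    return head

def multi_head_self_attention(X, heads, omega_c):
    outputs = []
    for h in heads:
        V = h["beta_v"][:, None] + h["omega_v"] @ X   # (D/H) x N
        Q = h["beta_q"][:, None] + h["omega_q"] @ X   # (D/H) x N
        K = h["beta_k"][:, None] + h["omega_k"] @ X   # (D/H) x N
        D_q = Q.shape[0]                              # query/key dimension of this head
        outputs.append(V @ softmax_columns(K.T @ Q / np.sqrt(D_q)))
    return omega_c @ np.vstack(outputs)               # concatenate heads, then apply omega_c

rng = np.random.default_rng(0)
D, N, H = 8, 5, 2
X = rng.normal(size=(D, N))
Pi = rng.normal(size=(D, N))   # positional encoding; random here purely for illustration
heads = [init_head(rng, D, D // H) for _ in range(H)]
omega_c = rng.normal(size=(D, D))
Y = multi_head_self_attention(X + Pi, heads, omega_c)
print(Y.shape)   # (8, 5)
```

Without `Pi`, permuting the columns of `X` only permutes the columns of the output, so the mechanism on its own carries no information about word order.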

## Transformers
- Transformer mechanism
  - Multi-head self attention
    - Allow word representations to interact with each other
  - Fully connected network that operates separately on each word
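  - Both units are residual networks: their output is added back to their input
  - Layer norm operation after both the self-attention and the fully connected network
    - Similar to batch norm, but the normalization statistics are computed within a single input sequence rather than across the batch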
  - Can be described as $$X \leftarrow X + MhSa[X] \\ X \leftarrow LayerNorm[X] \\ x_n \leftarrow x_n + mlp[x_n], \forall n \in \{1,2,...,N\} \\ X \leftarrow LayerNorm[X]$$
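
The four update steps can be sketched as follows. Several choices here are assumptions, not taken from these notes: a single scaled attention head stands in for $MhSa[X]$, the MLP is a two-layer ReLU network, and `layer_norm` follows the common convention of normalizing each token's embedding over its $D$ entries, without learned scale and offset.

```python
import numpy as np

def softmax_columns(A):
    """Softmax applied independently to each column of A."""
    A = A - A.max(axis=0, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def self_attention(X, p):
    """Single-head scaled self-attention, standing in for MhSa[X]."""
    V = p["beta_v"][:, None] + p["omega_v"] @ X
    Q = p["beta_q"][:, None] + p["omega_q"] @ X
    K = p["beta_k"][:, None] + p["omega_k"] @ X
    return V @ softmax_columns(K.T @ Q / np.sqrt(Q.shape[0]))

def layer_norm(X, eps=1e-5):
    """Normalize each column (token) to zero mean and unit variance over its D entries."""
    mean = X.mean(axis=0, keepdims=True)
    var = X.var(axis=0, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def mlp(X, p):
    """The same two-layer ReLU network applied to every column of X."""
    hidden = np.maximum(0.0, p["beta_0"][:, None] + p["omega_0"] @ X)
    return p["beta_1"][:, None] + p["omega_1"] @ hidden

def transformer_layer(X, attn_params, mlp_params):
    X = X + self_attention(X, attn_params)   # X <- X + MhSa[X]
    X = layer_norm(X)                        # X <- LayerNorm[X]
    X = X + mlp(X, mlp_params)               # x_n <- x_n + mlp[x_n] for all n
    return layer_norm(X)                     # X <- LayerNorm[X]

rng = np.random.default_rng(0)
D, N, D_hidden = 8, 5, 32
attn_params = {}
for name in ("v", "q", "k"):
    attn_params["omega_" + name] = rng.normal(size=(D, D))
    attn_params["beta_" + name] = rng.normal(size=D)
mlp_params = {
    "omega_0": rng.normal(size=(D_hidden, D)), "beta_0": rng.normal(size=D_hidden),
    "omega_1": rng.normal(size=(D, D_hidden)), "beta_1": rng.normal(size=D),
}
X = rng.normal(size=(D, N))
Y = transformer_layer(X, attn_params, mlp_params)
print(Y.shape)   # (8, 5)
```

The output has the same shape as the input, which is what allows several such layers to be stacked.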

## Transformers for NLP
- Typical NLP pipeline starts with a tokenizer
  - Splits text into words or word fragments
  - Each of these is mapped to a word embedding
  - These are passed through a series of transformers
- Tokenization
  - Using byte-pair encoding for example
    - Greedily merges commonly occurring substrings based on their frequency
- Embeddings
  - Each token is mapped to an embedding
  - Stored in a matrix $\Omega_e \in \mathbb{R}^{D \times |V|}$, where $|V|$ is the vocabulary size
  - First, the $N$ input tokens are mapped to a matrix $T \in \mathbb{R}^{|V| \times N}$, where the $n^{th}$ column corresponds to the $n^{th}$ token and is a one-hot vector
  - The input embeddings are computed as $X = \Omega_e T$, and $\Omega_e$ is learned like any other parameter.
- Transformer model
  - Embedding matrix $X$ is passed through a series of $K$ transformers
  - Encoder
    - Transforms text embeddings into a representation that can support a variety of tasks
  - Decoder
    - Predicts the next token to continue the input text
- **Encoder-decoder** models are used in sequence-to-sequence tasks, like machine translation
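
The tokenization and embedding steps can be illustrated with a toy sketch. It is simplified under stated assumptions: `bpe_merge` is a hypothetical helper that works on a single string and ignores word boundaries, the vocabulary is built from that one string rather than a training corpus, and $\Omega_e$ is random instead of learned.

```python
from collections import Counter
import numpy as np

def bpe_merge(tokens, num_merges):
    """Greedily merge the most frequent adjacent pair of tokens, num_merges times."""
    tokens = list(tokens)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)   # replace the pair by one new token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

text = "aaabdaaabac"
tokens = bpe_merge(text, num_merges=1)
print(tokens)   # ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']

# One-hot matrix T (|V| x N) and embeddings X = omega_e @ T (D x N)
vocab = sorted(set(tokens))                  # toy vocabulary
token_ids = [vocab.index(t) for t in tokens]
N, D = len(tokens), 4
T = np.zeros((len(vocab), N))
T[token_ids, np.arange(N)] = 1.0             # column n is the one-hot vector of token n
rng = np.random.default_rng(0)
omega_e = rng.normal(size=(D, len(vocab)))   # random stand-in for the learned embeddings
X = omega_e @ T
```

Multiplying by a one-hot matrix just selects columns, so `X` equals `omega_e[:, token_ids]`; the product is usually implemented as this lookup.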

## Encoder model: BERT
- Uses 24 transformer layers
- Word embeddings of dimension 1024
- 16 heads in each self-attention mechanism
- Exploits transfer learning
- During pre-training, parameters are learned using self-supervised learning
- Fine-tuning: Resulting network is adapted to solve a downstream task
- **Pre-training**
  - Self supervised task: Predicting missing words from the text
  - Max input length: 512 tokens
  - Predicting missing words forces the network to learn some syntax
    - Red is often found before house or car, but never before a verb like shout
    - The degree of understanding that this type of model can ever have is limited
  - Also uses a secondary task: predicting whether two sentences were originally adjacent to each other in the text
- **Fine tuning**
  - Text classification
    - The output embedding of the $<cls>$ token is passed to a classification layer
  - Word classification
    - NER (Named Entity Recognition): Classify each word as an entity type
      - Each input embedding is mapped to an $E \times 1$ vector where each of the $E$ entries corresponds to an entity type
      - Passed through a softmax to create a probability for each class
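
The word-classification head for NER can be sketched as a single linear map followed by a softmax per token. The label set, the sizes, and the random values below are hypothetical; `X` stands in for the output embeddings of the final transformer layer.

```python
import numpy as np

def softmax_columns(A):
    """Softmax applied independently to each column of A."""
    A = A - A.max(axis=0, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def classify_tokens(X, omega, beta):
    """Map each column of X (a D-dim embedding) to E class probabilities."""
    logits = beta[:, None] + omega @ X   # E x N, one E x 1 vector per token
    return softmax_columns(logits)

entity_types = ["person", "place", "organization", "no-entity"]   # hypothetical labels
rng = np.random.default_rng(0)
D, N, E = 8, 6, len(entity_types)
X = rng.normal(size=(D, N))        # stand-in for the final-layer output embeddings
omega = rng.normal(size=(E, D))    # classification layer weights (random here)
beta = rng.normal(size=E)
probs = classify_tokens(X, omega, beta)
predicted = [entity_types[i] for i in probs.argmax(axis=0)]
print(probs.shape)   # (4, 6)
```

During fine-tuning, the classification layer would be trained together with the pre-trained transformer parameters.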
  
## Decoder model: GPT-3
- Generate next token in the sequence
- Auto-regressive language model
- Maximize log probability of the input text under the autoregressive model
- Masked self-attention: the attention to the answer and the right context is set to zero, so each token attends only to itself and earlier tokens
- LLMs
  - Can perform many tasks without fine-tuning
  - Are few-shot learners
    - Can learn to do novel tasks based on just a few examples
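
Masked self-attention can be sketched by setting the dot products for later positions to $-\infty$ before the softmax, which makes the corresponding attention weights exactly zero. The single head, the parameter names, and the toy sizes are assumptions for illustration.

```python
import numpy as np

def softmax_columns(A):
    """Softmax applied independently to each column of A."""
    A = A - A.max(axis=0, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def masked_self_attention(X, p):
    V = p["beta_v"][:, None] + p["omega_v"] @ X
    Q = p["beta_q"][:, None] + p["omega_q"] @ X
    K = p["beta_k"][:, None] + p["omega_k"] @ X
    scores = K.T @ Q / np.sqrt(Q.shape[0])               # scores[m, n] = k_m . q_n (scaled)
    N = X.shape[1]
    future = np.tril(np.ones((N, N), dtype=bool), k=-1)  # True where m > n (later tokens)
    scores = np.where(future, -np.inf, scores)           # exp(-inf) = 0 after the softmax
    A = softmax_columns(scores)
    return V @ A, A

rng = np.random.default_rng(0)
D, N = 4, 5
params = {}
for name in ("v", "q", "k"):
    params["omega_" + name] = rng.normal(size=(D, D))
    params["beta_" + name] = rng.normal(size=D)
X = rng.normal(size=(D, N))
out, A = masked_self_attention(X, params)

# Changing the last token leaves all earlier outputs untouched
X_changed = X.copy()
X_changed[:, -1] += 1.0
out_changed, _ = masked_self_attention(X_changed, params)
print(np.allclose(out[:, :-1], out_changed[:, :-1]))   # True
```

Because output $n$ depends only on tokens $1,...,n$, the predictions for every position can be trained in parallel on one pass over the text.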

## Encoder-Decoder: Machine Translation
- Translation between languages is a sequence-to-sequence task
- Encoder:
  - Receives a sentence in the source language and creates an output representation for each token
- Decoder:
  - During training, receives the ground truth translation, passes it through a sequence of transformers that use masked self-attention, and predicts the following word at every position
  - Conditioned on the previous output words *and* the source language text
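- Encoder-decoder cross attention
  - Queries are computed from the decoder embeddings, while keys and values are computed from the encoder outputs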
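
Cross attention differs from self attention only in where the inputs come from: the queries are computed from the decoder embeddings, and the keys and values from the encoder outputs. A minimal single-head sketch, with assumed names and toy sizes:

```python
import numpy as np

def softmax_columns(A):
    """Softmax applied independently to each column of A."""
    A = A - A.max(axis=0, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=0, keepdims=True)

def cross_attention(X_dec, X_enc, p):
    Q = p["beta_q"][:, None] + p["omega_q"] @ X_dec     # queries from the decoder, D x N_dec
    K = p["beta_k"][:, None] + p["omega_k"] @ X_enc     # keys from the encoder,    D x N_enc
    V = p["beta_v"][:, None] + p["omega_v"] @ X_enc     # values from the encoder,  D x N_enc
    A = softmax_columns(K.T @ Q / np.sqrt(Q.shape[0]))  # N_enc x N_dec
    return V @ A, A                                     # one output per decoder position

rng = np.random.default_rng(0)
D, N_enc, N_dec = 8, 6, 4
params = {}
for name in ("v", "q", "k"):
    params["omega_" + name] = rng.normal(size=(D, D))
    params["beta_" + name] = rng.normal(size=D)
X_enc = rng.normal(size=(D, N_enc))   # stand-in for the encoder output (source sentence)
X_dec = rng.normal(size=(D, N_dec))   # stand-in for the decoder embeddings (target so far)
out, A = cross_attention(X_dec, X_enc, params)
print(out.shape, A.shape)   # (8, 4) (6, 4)
```

The source and target sequences may have different lengths; each decoder position receives a weighted sum over all encoder positions.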

## Transformers for long sequences
- Computational complexity of self-attention scales quadratically with the length of the sequence
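
The quadratic growth comes from the $N \times N$ matrix of attention weights: every token interacts with every other token. A small arithmetic illustration (the sequence lengths are arbitrary examples):

```python
def attention_matrix_entries(n_tokens):
    """Number of attention weights for one head: one per (key, query) pair."""
    return n_tokens * n_tokens

for n in (512, 1024, 2048):
    print(n, attention_matrix_entries(n))
# 512 262144
# 1024 1048576
# 2048 4194304
```

Doubling the sequence length multiplies the number of attention weights by four.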

## Transformers for images
- ImageGPT
  - Autoregressive model of image pixels that ingests a partial image and predicts the subsequent pixel value
  - Because of the computational cost, it operates on 64x64 images
  - Learns that each pixel has a close relationship with nearby pixels
- Vision Transformers (ViT)
  - Divides the image into patches of 16x16 pixels
  - Each patch is mapped by a linear transformation and fed to transformer blocks
  - Standard 1D positional encodings are learned
  - Encoder model with \<cls\> token
    - Mapped via a network layer that creates activations that are fed to a softmax layer
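
The ViT input pipeline can be sketched as follows. The 224x224 RGB image size, the embedding dimension $D=64$, and all random values are assumptions for illustration; only the 16x16 patch size comes from the notes above.

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    """Split an H x W x C image into flattened non-overlapping patches, one per column."""
    H, W, C = image.shape
    if H % patch_size or W % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    gh, gw = H // patch_size, W // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)   # grid row, grid col, patch row, patch col, C
    return patches.reshape(gh * gw, patch_size * patch_size * C).T

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # assumed input size
P = image_to_patches(image)         # 768 x 196: each column is one flattened patch
D = 64                              # assumed embedding dimension
N = P.shape[1]
omega_p = rng.normal(size=(D, P.shape[0])) * 0.02   # linear patch embedding (random here)
cls = rng.normal(size=(D, 1))                       # stand-in for the learned <cls> embedding
pos = rng.normal(size=(D, N + 1))                   # stand-in for learned 1D positional encodings
X = np.hstack([cls, omega_p @ P]) + pos             # D x (N + 1) input to the transformer
print(P.shape, X.shape)   # (768, 196) (64, 197)
```

After the transformer layers, the column corresponding to the \<cls\> token would be passed to the classification layer.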