## Transformers

Transformers are state-of-the-art for processing sequences. Their first successes were with text processing and machine translation in particular.

Unlike RNNs, LSTMs and GRUs, transformers have no recurrent connections, so no hidden state is carried from one step to the next. Instead, the entire sequence is input at once.

**Attention** overcomes the lack of memory by looking at the entire sequence at once. This enables long-range semantic dependencies, but at a cost in computer memory: every position is compared with every other position, so memory use grows quadratically with sequence length.

Since there is no sequential processing, transformers are well suited to parallel hardware environments.

The transformer follows the same general pattern as a standard sequence-to-sequence model with attention:

- The input sentence is passed through N encoder layers that generate an output for each word/token in the sequence.

- The decoder attends to the encoder's output and to its own input (self-attention) to predict the next word.


![](Transformer.png)

$\text{Figure 1. Basic Transformer Architecture, Left: Encoder, Right: Decoder}$

### Transformer Basic Components for Machine Translation

#### Input Encoder

Input Embedding: The embedding for the source sentence (from the source language).

Positional Encoding: Serial positions of the words in the source sentence.

Multi-Head Attention: Attention over the source sentence.

Feed-Forward Network: Computes non-linear hierarchical features. Output goes to Multi-head attention in the decoder.

#### Output Decoder

Embedding: The embedding for the target sentence (from the target language). 

Positional Encoding: Serial positions of the words generated so far.

Masked Multi-Head Attention: Attention over the words generated so far.

Multi-Head Attention from Input and Output: Query from the output, keys and values from the input.

Feed-Forward Network: Computes non-linear hierarchical features.

Final Linear Layer with Softmax: Probabilities for the next word.

Note: $N_x$ is the number of encoder or decoder blocks to stack
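The masking used by the decoder's Masked Multi-Head Attention can be sketched as follows. This is a minimal NumPy illustration, not the code of any particular library: the function names `causal_mask` and `masked_softmax` and the all-zero `scores` matrix are assumptions made for the example. Each position is allowed to attend only to itself and to earlier positions, so the words generated so far never see future words.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean mask: entry (i, j) is True when position i may attend to position j."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked-out (False) positions."""
    scores = np.where(mask, scores, -np.inf)               # future positions get -inf
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)                                # exp(-inf) = 0
    return weights / weights.sum(axis=-1, keepdims=True)

# Hypothetical raw attention scores for a 4-word sequence (all equal here).
scores = np.zeros((4, 4))
weights = masked_softmax(scores, causal_mask(4))
print(weights)
```

With equal scores, row `i` spreads its weight evenly over positions `0..i` and puts zero weight on later positions.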

### Positional Encoding

https://kazemnejad.com/blog/transformer_architecture_positional_encoding/

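Attention is a parallel operation, not a sequential one, so word order is lost. Positional encoding adds word-position information to the embeddings.

Sines and cosines of different frequencies encode each position, so that the relative positions of words can be recovered.

Let:

- $t$ = position of the word in the sequence
- $d$ = embedding dimension
- $i$ = index of a sine/cosine pair, $0 \le i < d/2$
- $pe$ = positional encoding vector of length $d$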

Then:

$$\begin{aligned}
pe(t,2i) &= \sin\left(\frac{t}{10000^{2i/d}}\right) \\
pe(t,2i+1) &= \cos\left(\frac{t}{10000^{2i/d}}\right)
\end{aligned}$$

![](positional_encoding.png)

The figure above depicts the positional encoding for an embedding dimension d = 128. Each row is the encoding of one position.
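The formula above can be computed directly with NumPy. The sketch below is one possible implementation, assuming an even embedding dimension `d`; the function name `positional_encoding` and the choice of 50 positions are illustrative and not taken from the figure.

```python
import numpy as np

def positional_encoding(num_positions, d):
    """Return a (num_positions, d) matrix whose row t is pe(t, :). Assumes d is even."""
    t = np.arange(num_positions)[:, None]       # positions, shape (num_positions, 1)
    two_i = np.arange(0, d, 2)[None, :]         # even indices 2i, shape (1, d/2)
    angles = t / np.power(10000.0, two_i / d)   # t / 10000^(2i/d)
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)                # even columns: pe(t, 2i)
    pe[:, 1::2] = np.cos(angles)                # odd columns:  pe(t, 2i+1)
    return pe

pe = positional_encoding(num_positions=50, d=128)
print(pe.shape)
```

Each row of `pe` would be added to the embedding of the word at that position before the first attention layer.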


### Attention

Ashish Vaswani, et al. (2017) Attention Is All You Need. https://arxiv.org/abs/1706.03762

Attention is a way to capture dependencies among the elements of a sequence without sequential recurrent connections. The entire sequence is processed in one step, which eliminates back-propagation through time.

Intuitively, think of attention as a weighted average over the elements of a sequence (only the past elements in the decoder's masked attention).

#### Basic Attention

$$ \text{Attention}(Q,K,V) = \text{Softmax}_k\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

![](BasicAttention.png)

$\text{Figure 2. Basic Attention}$

As a word is being processed, we want information from all the other words in the sequence. Attention computes the relationship of each input word to all the other words in parallel.

Key-value store paradigm (as in information retrieval systems or a Python dictionary):
- Q: the Query matrix (i.e. what information in the rest of the sequence do I want for this word?)
- K: the Key matrix (i.e. something like a label, e.g. color)
- V: the Value matrix (i.e. the value of the label, e.g. red)

Each of Q, K and V is obtained by multiplying the input embeddings by a learned weight matrix. Think of these weight matrices as rotating (more precisely, linearly transforming) the input vectors in the embedding vector space.

The key and the query must have the same dimension, $d_k$.

The output of attention is a weighted sum of the Value vectors. The weights come from the softmax, so they are non-negative and sum to 1 like probabilities; they determine how attention is allocated across the sequence.
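The attention formula can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the sizes (5 tokens, embedding dimension 16, $d_k = 8$), the random input `X`, and the weight matrices `W_q`, `W_k`, `W_v` are all hypothetical, and a trained model would learn the weights rather than draw them at random.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len): query-key similarity
    weights = softmax(scores, axis=-1)   # each row sums to 1 over the keys
    return weights @ V, weights          # weighted sum of the value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))             # 5 tokens, embedding dimension 16
W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))  # untrained weights
out, weights = attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape, weights.shape)
```

Row `i` of `weights` shows how much token `i` attends to every token in the sequence, and row `i` of `out` is the corresponding weighted average of the value vectors.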



#### Multi-Head Attention

![](Attention2.png)

$\text{Figure 3. Multi-Head Attention}$

Multiple attention blocks (heads, i.e. the stacked layers in Figure 3) run in parallel, so that each can attend to a different attribute (e.g. color, gender, location). The outputs of the heads are concatenated and passed through a linear layer.
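A rough NumPy sketch of multi-head attention follows. It assumes the common arrangement in which the model dimension `d_model` is split evenly across the heads and a final matrix `W_o` mixes the concatenated head outputs; the function name, the sizes, and the random (untrained) weights are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Self-attention with several heads. Assumes d_model is divisible by num_heads."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, seq_len, seq_len)
    heads = softmax(scores, axis=-1) @ V                  # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads
    return concat @ W_o                                   # final linear mixing

rng = np.random.default_rng(0)
d_model, num_heads = 16, 4                                # hypothetical sizes
X = rng.normal(size=(5, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)
```

Because each head works on its own slice of the projected vectors, the heads are free to learn different attention patterns, while the output keeps the same shape as the input.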

### Add & Norm

Add: adds residual connections to prevent vanishing gradients.   
Norm: Layer Normalization to stabilize the computations.
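The Add & Norm step can be sketched as below. The sketch uses layer normalization, which is what the original Transformer paper (Vaswani et al., 2017) applies at this point, and it omits the learned gain and bias parameters for brevity. The function names and the stand-in sub-layer (a simple scaling) are assumptions for illustration; in the model, the sub-layer would be attention or the feed-forward network.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (one token's feature vector) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization: LayerNorm(x + sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                # 5 tokens, 16 features each
out = add_and_norm(X, lambda h: 0.5 * h)    # stand-in sub-layer, for illustration only
print(out.mean(axis=-1).round(6), out.std(axis=-1).round(3))
```

The residual path `x + sublayer(x)` gives gradients a direct route around the sub-layer, and the normalization keeps the scale of each token's features stable from block to block.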

#### Linear Layer

A regular linear layer computing $y = WX^T +b$
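As a small sketch of how this layer is used at the end of the decoder, the code below computes $y = WX^T + b$ and applies a softmax over the vocabulary to obtain next-word probabilities. The sizes, the random weights, and the variable names are hypothetical; a real model would use a trained `W` and `b` and a much larger vocabulary.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linear(X, W, b):
    """Compute y = W X^T + b for X of shape (n, d_in); the result has shape (d_out, n)."""
    return W @ X.T + b[:, None]

rng = np.random.default_rng(0)
d_model, vocab_size = 16, 10                 # hypothetical sizes
X = rng.normal(size=(3, d_model))            # 3 decoder output vectors, one per row
W = rng.normal(size=(vocab_size, d_model))   # untrained weights, for illustration
b = np.zeros(vocab_size)

logits = linear(X, W, b)                     # (vocab_size, 3): one column per position
probs = softmax(logits, axis=0)              # probabilities over the vocabulary
next_word = int(probs[:, -1].argmax())       # most likely word id at the last position
print(probs.shape, next_word)
```

Each column of `probs` is a probability distribution over the vocabulary, and the last column is the one used to pick the next word.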

### NLP Systems Based on Attention-Transformer Architecture

#### BERT - Bidirectional Encoder Representations from Transformers

Google AI Language: https://arxiv.org/pdf/1810.04805.pdf (2019)

BERT stacks Transformer encoders to attend over both left and right context (i.e. bidirectional).

BERT is pre-trained on two tasks:

- Masked Language Model: fill in words that are masked out
- Next Sentence Prediction: does the second sentence follow the first?

Fine-tuning: the pre-trained model must be fine-tuned for each new language task.

BERT-Base has 110 million parameters (BERT-Large has 340 million).

#### GPT-2, GPT-3 : Generative Pre-Training

OpenAI  
GPT-2:  https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (2019)  
GPT-3: https://arxiv.org/pdf/2005.14165.pdf (2020)  

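GPT-3 stacks Transformer decoders and performs zero-, one- or few-shot learning across many language tasks: the task is specified in the prompt, with at most a few examples, so no fine-tuning is needed for separate language tasks.

GPT-2 has 1.5 billion parameters; GPT-3 has 175 billion parameters. GPT-3 has essentially the same architecture as GPT-2, only much, much bigger. GPT-3 was trained mostly on a filtered version of Common Crawl (i.e. a large crawl of the public web), together with other web text, books and Wikipedia.  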

GPT-2 online demo https://gpt2.ai-demo.xyz/

OpenAI has recently opened up the GPT-3 API https://openai.com/api/

GPT-3 is much better than GPT-2 because the scale (training corpus and number of parameters) is so much bigger.

There is a lot of hype about GPT-3! It can generate realistic text, but it can also generate silly things.

https://machinelearningknowledge.ai/openai-gpt-3-demos-to-convince-you-that-ai-threat-is-real-or-is-it/

### The Problem with Language Models

**No Real Language Understanding**. **Language understanding requires experience, not just processing correlations.**


### References

Ashish Vaswani, et al. (2017) Attention Is All You Need. https://arxiv.org/pdf/1706.03762.pdf

https://www.kdnuggets.com/2020/08/transformer-architecture-development-transformer-models.html

https://blog.exxactcorp.com/a-deep-dive-into-the-transformer-architecture-the-development-of-transformer-models/

https://theaisummer.com/transformer/