# <center> What is a [Transformer](https://arxiv.org/abs/1706.03762)? (HIGH LEVEL OVERVIEW) </center>

![Transformer](Transformer.png)

- The __Transformer__ is the model that predominantly replaced the [RNN](https://www.youtube.com/watch?v=FBlPZJrJt9g), [GRU](https://www.youtube.com/watch?v=rdz0UqQz5Sw), and [LSTM](https://www.youtube.com/watch?v=rmxogwIjOhE) models due to its efficacy and speed. (Each of those topics has been covered in a video on my YouTube channel; see the links!)
- It can take in more context at once (it gets all the words together instead of as sequential, iterative input)
- It can be highly parallelized
- The [paper's use case](https://arxiv.org/abs/1706.03762) was machine translation; however, Transformer models (or derivations of them) have since been used for other tasks

### Fun fact:
- BERT-related models use the Encoder blocks
- GPT-3-related models use the Decoder blocks

## What is a **Transformer** Encoder Block?

<img src="Encoder.png" width="340" height="596">

- In layman's terms, the purpose of the Transformer blocks is to assign different weights to the words in a sentence in order to identify the words that are most critical to the underlying meaning.

- __Inputs__: Assigns a word embedding to *each* word in parallel. This is much **faster** than the RNN and LSTM architectures, which process words one at a time.
    - You can think of a word embedding as a vector ID in vector space. Each NLP model would typically have different vector values for the same English word.
- __Positional Encoding__: Takes the embeddings and combines them with the positional encoding, a vector that gives context based on the position of each word in the sentence
    - Provides context to a word, since the same word might mean something different in a different position or setting
    - __Embedding of sentence(s) or word(s) + Positional Encoding (Vector Encoding of position in sentence) = Embedding with Context__

<img src="pos.png">

- pos is the position of the word in the sentence
- i is the index of the dimension within the encoding vector
- d is the representation dimension (the size of the embedding)
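
Here is a minimal NumPy sketch of the positional encoding, assuming the figure above shows the sinusoidal formula from section 3.5 of the paper (sine on even dimensions, cosine on odd ones). The function name `positional_encoding`, the toy sizes, and the random stand-in embeddings are my own assumptions for illustration, and `d_model` is assumed to be even.

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding (assumes d_model is even)."""
    pos = np.arange(max_len)[:, np.newaxis]            # (max_len, 1)
    i = np.arange(d_model // 2)[np.newaxis, :]         # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

# Toy example: 6 tokens ("Make sure to Like and Subscribe!"), d = 8
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 8))  # stand-in for learned word embeddings
pe = positional_encoding(6, 8)
embeddings_with_context = embeddings + pe  # Embedding + Positional Encoding
```

Because the encoding is simply added to the embedding, the result keeps the same shape, so the rest of the model does not need to know that position information was mixed in.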

### Transformer Encoder Block
- The Multi-head __Attention__ layer, in layman's terms, identifies how relevant each word in a sentence is to the other words in the _same_ sentence.
    - Each word in a sentence is assigned a probability, where the sum of the probabilities in a sentence is equal to one.
        - Example: Make sure to Like and Subscribe!
        - Focusing on the word __Like__ will produce an output vector similar to [0.1, 0.06, 0.02, 0.4, 0.02, 0.4]<sup>T</sup>
            - where each probability lines up with one word of "Make sure to Like and Subscribe!"
            - *Note* that focusing on a different word will result in a different vector. In the paper, 8 attention heads run in parallel, each producing vectors of length 64, and their outputs are concatenated. [See section 3.2.2 of the original paper](https://arxiv.org/abs/1706.03762)


- __Feed Forward Net__: Transforms the Multi-head Attention output vectors into a digestible input that can then be used by the next encoder or decoder block.
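
As a rough sketch, the feed forward net in the paper (section 3.3) is two linear layers with a ReLU in between, applied to each position separately. The sizes below (512 and 2048) are the paper's; the function name `feed_forward` and the random weights are assumptions for illustration only, not trained values.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed forward net: linear -> ReLU -> linear."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # first linear layer followed by ReLU
    return hidden @ W2 + b2                # second linear layer, back to d_model

d_model, d_ff, seq_len = 512, 2048, 6
rng = np.random.default_rng(0)
# Random stand-ins for weights that would normally be learned
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))  # stand-in for the attention output
out = feed_forward(x, W1, b1, W2, b2)    # same shape as x: (6, 512)
```

Since the output has the same shape as the input, it can be passed straight into the next encoder or decoder block.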

### Transformer Decoder Block
<img src="Decoder.png" width="325" height="786">


Instead of feeding your inputs into this block, you plug your outputs (the target sequence generated so far) into the decoder block.
- You do the exact same process of word embeddings and positional vectors as seen in __Inputs__ and __Positional Encoding__
- There is, however, a different step called __Masked Multi-head Attention__, where part of the sentence is masked: not the entire sentence is taken into consideration, only words 1 up to __N__ are considered (the rest are "masked", i.e. their attention value becomes 0).
    - Example: Output Vector Text: "I just liked the video and subscribed!"
    - Masked Vector values [0.05, 0, 0, 0, 0, 0, 0]
        - This iterates along ... [0.05, 0.1, 0, 0, 0, 0, 0] etc.
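        - Note that this is not the same thing as the `[MASK]` token you might see in an NLP library; that token hides input words for BERT-style masked language modeling, while the mask here hides future positions inside attention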
- An additional step after the __Masked Multi-head Attention__ step is to take its output and pass it into another __Multi-head Attention__ block (as seen in the encoder block); however, this is where your independent (input) and dependent (output) variable vectors "merge".
    - This determines how related the various input and output vectors are to each other.
        - Returns a vector for each match that represents the relationship between the input and output vectors.
- We then pass the relationship vector(s) to a feed forward layer and do the typical matrix multiplication
    - After this, there could be another decoder block where you repeat the process, OR you calculate the final output for human interpretation
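
To make the masking step concrete, here is a small sketch of a "look-ahead" mask on a 7-word sentence. One common way to get the 0 values is to set the scores of future words to negative infinity *before* the softmax, which is what this sketch assumes; the scores themselves are random stand-ins, not real model outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 7  # "I just liked the video and subscribed!"
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))  # stand-in attention scores

# True above the diagonal marks "future" words that each position must not see
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked_scores = np.where(future, -np.inf, scores)

weights = softmax(masked_scores)  # row n only has non-zero weights for words 1..n
```

Each row of `weights` still sums to one, but row 1 can only look at word 1, row 2 at words 1 and 2, and so on.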

### Output Probabilities
- The __Linear layer__ has __N__ nodes, where __N__ is the number of possible outcomes (this can be as large as the total number of words in English)
- The __Softmax layer__ converts the output of the Linear layer into a probability distribution, which is finally interpretable by us and offers a prediction of a word
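
A tiny sketch of these two layers, assuming a toy vocabulary of six entries (so __N__ = 6) and random stand-in weights. The names `vocab`, `W`, `b`, and `decoder_output` are hypothetical and only here for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

vocab = ["<eos>", "I", "just", "liked", "the", "video"]  # toy vocabulary, N = 6
d_model = 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d_model, len(vocab)))  # Linear layer weights (untrained stand-ins)
b = np.zeros(len(vocab))
decoder_output = rng.normal(size=d_model)   # stand-in for the last decoder vector

logits = decoder_output @ W + b             # Linear layer: one score per word
probs = softmax(logits)                     # Softmax layer: scores -> probabilities
predicted_word = vocab[int(np.argmax(probs))]
```

With trained weights, `predicted_word` would be the model's best guess for the next word.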

# You keep doing this until there are no additional words in the sentence.
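
Here is a hedged sketch of that loop using greedy decoding (always pick the most likely word). `toy_next_token_probs` is a hypothetical stand-in for the whole decoder + Linear + Softmax stack, and I am assuming the model signals "no more words" with an `<eos>` token.

```python
import numpy as np

vocab = ["<sos>", "I", "liked", "the", "video", "<eos>"]

def toy_next_token_probs(tokens):
    """Hypothetical stand-in for the model: favors the next entry in vocab."""
    probs = np.full(len(vocab), 0.02)
    next_id = min(tokens[-1] + 1, len(vocab) - 1)
    probs[next_id] = 1.0 - 0.02 * (len(vocab) - 1)
    return probs

def greedy_decode(max_len=10):
    tokens = [0]  # start with the <sos> token
    while len(tokens) < max_len:
        next_id = int(np.argmax(toy_next_token_probs(tokens)))  # most likely word
        tokens.append(next_id)
        if vocab[next_id] == "<eos>":  # the model says the sentence is done
            break
    return [vocab[t] for t in tokens]

sentence = greedy_decode()
```

The `max_len` cap is a safety net so the loop also stops if `<eos>` is never predicted.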



<img src="a_lot.jpg">

## But wait... there's more!
# <center> Attention Vector Architecture </center>

## Query (Q), Key (K) and Value (V) vectors start from the same input value(s), and their weights are trained via backpropagation
- These are 3 separate vectors that are derived from the same vector by applying three different linear transformations
    - The proper weights for Q, K and V are learned via training
- Think of Q as the word you are currently looking at (derived from its word embedding)
- Think of K and V as the memory in the model -- they could hold the same values.

The Transformer compares the Query (Q) against each Key (K) to see which keys Q is most similar to (using a dot product, which is similar in spirit to cosine similarity). The more similar a Key (K) is, the more of its Value (V) is returned. This [Cross Validated (Stack Exchange) post](https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms) has a wonderful breakdown of the meaning behind attention values.
- Where do the values of K, Q and V come from?
    - It *depends*
    - If GPT-3, Q, K, and V typically come from the same source (i.e. self-attention)
    - Machine Translation tasks: Self-attention is applied first; *then*, in the decoder, K and V come from the source sequence and Q comes from the target sequence.
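
The difference between the two cases is only *where* Q, K and V come from, which a short sketch can show. Everything here is a stand-in: the random `source` and `target` matrices, the weight names, and reusing one set of weights for both cases (a real model learns separate weights for each attention block).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # how similar each query is to each key
    return weights @ V                         # weighted mix of the values

d_model = 8
rng = np.random.default_rng(0)
source = rng.normal(size=(5, d_model))  # stand-in: 5 source-sentence vectors
target = rng.normal(size=(3, d_model))  # stand-in: 3 target-sentence vectors
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Self-attention: Q, K and V all come from the same sequence
self_attn = attention(source @ W_q, source @ W_k, source @ W_v)    # (5, 8)

# Translation-style attention: Q from the target, K and V from the source
cross_attn = attention(target @ W_q, source @ W_k, source @ W_v)   # (3, 8)
```

Note that the output always has one row per query, so `cross_attn` has 3 rows (one per target word) even though it reads from 5 source words.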

## Attention Vector calculation
You will have multiple attention vectors per word, since Q, K and V represent different components of a word (and each attention head has its own set).
Finally, the resulting attention vectors are transformed, via weights, to be digestible for the fully connected network.

### Steps to calculate Attention
1) Calculate the dot product between the query vector of the current word and the key vector of each word (this is known as the attention score)
2) Divide each of the results by the square root of the dimension of the key vector (scaled attention score)
3) Pass the output from 2) to a softmax function, squeezing the values between 0 and 1 so that they sum to 1
4) Multiply each value vector by its softmax output from step 3)
5) Add all weighted value vectors together
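
The five steps can be sketched for a single word in NumPy. This is a minimal illustration under assumptions: Q, K and V are random stand-ins instead of learned projections, and the key dimension of 64 is taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 6, 64  # 6 words, key dimension 64
Q = rng.normal(size=(seq_len, d_k))  # stand-in query vectors, one per word
K = rng.normal(size=(seq_len, d_k))  # stand-in key vectors
V = rng.normal(size=(seq_len, d_k))  # stand-in value vectors

q = Q[3]  # focus on the 4th word

scores = K @ q                                  # 1) dot product with every key
scaled = scores / np.sqrt(d_k)                  # 2) scale by sqrt of the key dimension
weights = np.exp(scaled - scaled.max())         # 3) softmax (max subtracted for stability)
weights = weights / weights.sum()
weighted_values = weights[:, np.newaxis] * V    # 4) weight each value vector
output = weighted_values.sum(axis=0)            # 5) add the weighted value vectors
```

Steps 4 and 5 together are just the matrix product `weights @ V`, which is how it is usually written when all words are processed at once.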

![multiheadattention](multi_head_attention.png)

# <center> Scaled Dot-Product Attention Calculation </center>
![Attention_formula](Attention_formula.png)
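
For reference, assuming the image above shows the formula from section 3.2.1 of the paper, it can be written out as:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V
$$

where $d_k$ is the dimension of the key vectors, $QK^{T}$ holds the dot product of every query with every key, and the softmax is applied to each row so that every word's weights sum to 1.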

### The Multi-head Attention concatenates the outputs of multiple scaled dot-product attention calculations (since there are multiple attention vectors; in the paper there are 8 heads)
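
A compact sketch of the concatenation, assuming the paper's sizes (a model dimension of 512 split into 8 heads of length 64). The function name `multi_head_attention` and the random weights are hypothetical stand-ins; `W_o` plays the role of the final projection applied after concatenation in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = x.shape
    d_k = d_model // num_heads  # length of each head's vectors

    def split(m):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return m.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(x @ W_q), split(x @ W_k), split(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (num_heads, seq_len, seq_len)
    heads = softmax(scores) @ V                       # (num_heads, seq_len, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate the heads
    return concat @ W_o                               # final linear projection

d_model, num_heads, seq_len = 512, 8, 6
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))  # stand-in for embeddings with context
W_q, W_k, W_v, W_o = (rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)  # (6, 512)
```

Each head runs the same scaled dot-product attention on its own 64-dimensional slice, so the 8 heads can learn to focus on different relationships between words.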

# Phew!

# That was a lot.
# So, how would we go about implementing this architecture?

# [Here is an excellent notebook that details the paper step by step in python code](http://nlp.seas.harvard.edu/2018/04/03/attention.html)

# Here is a great [github repository](https://github.com/jadore801120/attention-is-all-you-need-pytorch) that more or less implemented the paper and is downloadable.
