# Week 4: Transformers

### (video 1) Transformer network

* Many of the most effective algorithms for NLP today are based on the transformer architecture
* As the complexity of your sequence task increases, so does the complexity of your model.
    * RNN: 
        * vanishing / exploding gradients =>
        * hard to capture long range dependencies and sequences
    * GRU + LSTM:
        * resolve many of those problems: gates used to control the flow of information.
        * bottleneck to the flow of information: to compute the output of this final unit, for example, you first have to compute the outputs of all of the units that come before
    * Transformer architecture:
        * allows you to run a lot more of these computations for an entire sequence in parallel.
        * ingest an entire sentence all at the same time, rather than just processing it one word at a time from left to right
        
<img src='./Images/W4_01.png' style="width: 60%"></img>

**Transformer network intuition**

* RNN
    * processes 1 output at a time

* The major innovation of the transformer architecture is combining
    * attention-based representations = a way of computing very rich, very useful representations of words
    * with a CNN (convolutional neural network) style of processing, where many inputs are processed in parallel.

<img src='./Images/W4_02.png' style="width: 60%"></img>

2 key ideas:
* Reminder: attention mechanism
    * allows the model to focus on the relevant parts of the input while producing each output
* (1) Self-attention:
    * allows inputs to interact with each other
    * if you have a sentence of five words 
    * you will end up computing five representations for these five words (A1,A2,A3,A4,A5).
    * And this will be an attention based way of computing representations for all the words in your sentence in parallel. 
* (2) Multi-headed attention: 
    * is basically a "for" loop over the self attention process.
    * You end up with multiple versions of these representations (very rich)
    * these rich representations work well for machine translation and other NLP tasks

### (video 2) Self-attention

* You've seen how attention is used with sequential neural networks such as RNNs. 
* To use attention with a style more like CNNs, you need to calculate self-attention
    * where you create attention-based representations for each of the words in your input sentence. 
    
* Example sentence: Jane, visite, l'Afrique, en, septembre
    * our goal is to compute an attention-based representation for each word (A1, A2, A3, A4, A5)

* Take the word l'Afrique => A3 representation
    * One way to represent l'Afrique would be to just look up the word embedding for l'Afrique.
    * But depending on the context, are we thinking of l'Afrique as
        * a site of historical interest,
        * a holiday destination,
        * or the world's second-largest continent?
    * Depending on context, you may choose to represent it differently
        * A3 looks at the surrounding words, picks up the context, and gives the most suitable representation
    * It won't be too different from the attention mechanism you saw in the context of RNNs,
        * except we'll compute these representations in parallel for all five words in a sentence
        * the equations are quite similar
            * also involves a softmax
            * exponent terms are akin to attention values
        * the main difference is that for every word you have 3 values
            * query = q3, 
            * key = k3,
            * value = v3
            * they are the vector inputs to computing attention value for each word 
        
<img src='./Images/W4_04.png' style="width: 60%"></img>

**Computations:** 

(1) We are going to associate each of the words with 3 values: q, k, v
* q3 = Wq * x3  
* k3 = Wk * x3
* v3 = Wv * x3

Wq, Wk, Wv = matrices of learned parameters; they let you compute the query, key, and value vectors for each word
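
A minimal NumPy sketch of step (1). The sizes `n_words`, `d_model`, `d_k` and the random weights are assumptions for illustration only (in a real model the W matrices are learned). The notes write `q3 = Wq * x3` for a single column vector; the sketch stacks the words as rows of `X`, so the same projection becomes `X @ W_q`.

```python
import numpy as np

rng = np.random.default_rng(0)

n_words, d_model, d_k = 5, 8, 4   # toy sizes, chosen for illustration only

# One embedding per row: Jane, visite, l'Afrique, en, septembre
X = rng.normal(size=(n_words, d_model))

# Parameter matrices: learned in a real model, random here
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # row i is the query q_i
K = X @ W_k   # row i is the key k_i
V = X @ W_v   # row i is the value v_i

q3, k3, v3 = Q[2], K[2], V[2]   # vectors for the third word, l'Afrique
print(Q.shape, q3.shape)
```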

(2) Compute how well each word answers a certain question about a selected word
* q3 = the question you are going to ask about word 3, e.g. what's happening there?
    * we compute the inner product between q^3 and k^1,
    * this tells us how well `Word 1 = Jane` answers the question of what's happening in Africa
    * we compute the inner product between q^3 and k^2,
    * this tells us how well `Word 2 = visite` answers the question of what's happening in Africa

The goal of this operation is to pull in the information that is most needed to compute the most useful representation A^3

For intuition:
* if k^1 represents that this word is a person,
* and k^2 represents that the second word, visite, is an action, 
* then you may find that the inner product of q^3 with k^2 has the largest value,
* which intuitively suggests that visite gives you the most relevant context for what's happening in Africa

(3) Then we take all the dot products for the selected word and compute a softmax over them: exp(q^3 * k^i) / sum_j exp(q^3 * k^j)

(4) Then we multiply each softmax value by the corresponding value vector: the first by v^1 (the value for word 1), the second by v^2, and so on

(5) Finally, we sum it all up
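
Steps (2) to (5) for a single word can be sketched as follows. This is an illustrative NumPy version with random toy vectors, not the course's code; the function name `attend_one_word` is made up for this example, and the scaling factor from the vectorized formula is left out, as in the steps above.

```python
import numpy as np

def attend_one_word(q, K, V):
    """Attention-based representation for one selected word.

    q: query vector of the selected word, shape (d_k,)
    K: keys of all words, shape (n_words, d_k)
    V: values of all words, shape (n_words, d_v)
    """
    scores = K @ q                            # step 2: q . k_i for every word i
    weights = np.exp(scores - scores.max())   # step 3: softmax (shifted for stability)
    weights /= weights.sum()
    return weights @ V, weights               # steps 4-5: weighted sum of the values

rng = np.random.default_rng(0)
K = rng.normal(size=(5, 4))   # toy keys for the five words
V = rng.normal(size=(5, 4))   # toy values for the five words
q3 = rng.normal(size=4)       # toy query for l'Afrique

A3, w = attend_one_word(q3, K, V)
print(w.round(3))   # how much each of the five words contributes to A3
```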

**The key advantage** of this representation is that
* the representation of l'Afrique isn't some fixed word embedding. 
* Instead, it lets the self-attention mechanism realize that l'Afrique is the destination of a visite
* and thus compute a richer, more useful representation for this word

(6) You do this for all words in the sequence and summarize it in a single attention formula, where
* another name for this type of attention is "scaled dot-product attention"
* Q, K, V are matrices of all these qi, ki, vi values for all words
* the term under the softmax on the right = vectorized representation of the equation with exponents
    * denominator = sqrt(d_k), just to scale the dot product so it doesn't explode.
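
Written out, the formula from the paper is `Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V`, where `d_k` is the dimension of the key vectors. A minimal NumPy sketch, assuming one word per row and random toy inputs:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n_words, d_k); V: (n_words, d_v). Returns (n_words, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_words, n_words) table of q_i . k_j
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 4)) for _ in range(3))
A, weights = scaled_dot_product_attention(Q, K, V)
print(A.shape)   # one representation per word, all computed at once
```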

<img src='./Images/W4_05.png' style="width: 80%"></img>

**Recap:**
* associated with each of the five words you end up with a query, a key, and a value. 
    * The query lets you ask a question about that word, such as what's happening in Africa.
        * Q = interesting questions about the words in a sentence,  
    * The key 
        * K = qualities of words given a Q, 
        * looks at all of the other words, 
        * and by the similarity to the query, helps you figure out which words give the most relevant answer to that question. 
        * In this case, visite is what's happening in Africa, someone's visiting Africa. 
    * The value 
        * V = specific representations of words given a Q
        * allows the representation to plug in how visite should be represented within A^3, within the representation of Africa. 
* This allows you to come up with a representation for the word Africa that says this is Africa and someone is visiting Africa.
* This is a much richer representation for the word than 
    * if you just had to pull up the same fixed word embedding for every single word 
    * without being able to adapt it based on what words are to the left and to the right of that word. 

### (video 3) Multi-headed attention

* it is basically just a big for loop over the self-attention mechanism 
* each time you calculate self-attention for a sequence is called a head
* thus the name multi-head attention: you do what you saw in the last video, but a bunch of times 

* With simple self-attention: 
    * you got the vectors Q K and V for each of the input terms by 
    * multiplying them (the terms) by the weight matrices Wq, Wk, and Wv 

* With multi-head attention, for **each word**
    * you take that same set of query, key, and value vectors as inputs (Q, K, and V)
    * and calculate multiple self-attentions by multiplying them with weight matrices,
        * w_1_q, w_1_k, w_1_v => 
        * the resulting values give a new set of q, k, and v values for the word
        * then do the same for all other words (the Ws are the same)
        * w_1_q, w_1_k, w_1_v = weights to be learned that help ask and answer the first question
    * We do this with a second head
        * w_2_q, w_2_k, w_2_v => allow it to answer a second question: when is something happening?
        * septembre = good answer to the question (for the word l'Afrique)
    * etc
       
* h = number of heads (parameter) = features
    * the values from all heads are concatenated to compute the output of the multi-headed attention 
    
* Final value: concatenation of these heads, multiplied by matrix W

* I described this as a "for loop", but you can actually compute these different heads' values in parallel because no head's value depends on the value of any other head
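
A rough NumPy sketch of the loop over heads. The assumptions here: this is self-attention, so the same input `X` feeds the query, key, and value projections of every head; the per-head size `d_head = d_model // h` is a common choice rather than something stated in these notes; and `W_o` stands for the final matrix W. All weights are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one tuple per head."""
    outputs = []
    for W_q, W_k, W_v in heads:                  # the "for loop" over heads
        outputs.append(attention(X @ W_q, X @ W_k, X @ W_v))
    concat = np.concatenate(outputs, axis=-1)    # (n_words, h * d_head)
    return concat @ W_o                          # final multiplication by W

rng = np.random.default_rng(0)
n_words, d_model, h = 5, 8, 2
d_head = d_model // h
X = rng.normal(size=(n_words, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(h)]
W_o = rng.normal(size=(h * d_head, d_model))

out = multi_head_attention(X, heads, W_o)
print(out.shape)
```

Because no iteration of the loop reads another iteration's result, real implementations usually replace it with a single batched matrix multiplication.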
        

<img src='./Images/W4_06.png' style="width: 60%"></img>

### (video 4) Transformer network

* Same task: translating a sequence from French to English 
* Up until this point, for the sake of simplicity, 
    * I've only been talking about the embeddings for the words in the sentence. 
    * But in many sequence-to-sequence translation tasks, it is useful to also add
        * the start of sentence = SOS token
        * and the end of sentence = EOS token

(1) In the first step, these embeddings are fed into an encoder block
* (a) which has a multi-head attention layer. 
    * you feed in the values Q, K, and V computed from 
        * the embeddings 
        * and the weight matrices W. 
    * This layer then produces a matrix that can be 
* (b) passed into a feed-forward neural network,
    * which helps determine what interesting features there are in the sentence. 

In the transformer paper, this encoder block is repeated N times, and a typical value for N is six.

(2) After these N = 6 passes through the encoder block, the result is fed to the decoder block
* The first token of the translation is the SOS token, which is given to the decoder as its initial input
* At each step the decoder takes as input the encoder's output + everything the decoder has already produced 

* (a) So the SOS token is passed in first and is used to compute Q, K, and V for the first multi-head attention block
    * This first block's output is used to generate the Q matrix for the next multi-head attention block = what you've translated so far
    * And the output of the **encoder** is used to generate K and V = gives context
* (b) second multi-head attention 
    * calculates attention between the translation so far (Q) and the encoded input sentence (K, V)
* (c) the output is passed to the feed-forward neural network that predicts the next word
    * tries to decide which word to generate next
    * SOS + the predicted word = next input to the decoder
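
The second attention block in (a) and (b) can be sketched like this: queries come from the decoder side, keys and values from the encoder output. This is a single-head NumPy illustration with random toy inputs; the names `cross_attention`, `decoder_state`, and `encoder_output` are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_state, encoder_output, W_q, W_k, W_v):
    Q = decoder_state @ W_q     # from what has been translated so far
    K = encoder_output @ W_k    # from the encoded source sentence
    V = encoder_output @ W_v
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model = 8
encoder_output = rng.normal(size=(5, d_model))   # 5 French tokens
decoder_state = rng.normal(size=(2, d_model))    # e.g. SOS + one generated word
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = cross_attention(decoder_state, encoder_output, W_q, W_k, W_v)
print(context.shape, weights.shape)
```

Each row of `weights` shows how strongly one target position attends to each of the five source words.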

The decoder is also repeated N times

<img src='./Images/W4_07.png' style="width: 60%"></img>

**Beyond these main ideas, there are a few extra bells and whistles:**
* (1) Positional encoding of the input:
    * the position of elements in the input is encoded using a combination of sine and cosine equations
    * let's say, for example, that your word embedding is a vector with four values. In this case the dimension d of the word embedding is 4.
    * we then create a positional embedding vector of the same dimension
        * In this equation below, 
        * pos = denotes the numerical position of the word. 
        * So for the word Jane, pos is equal to 1 
        * and i = refers to the different dimensions of the encoding vector
        * Even elements 2i of the vector use the sine and odd elements 2i+1 use the cosine, so with d = 4: elements 0 and 1 correspond to i = 0,
        * and elements 2 and 3 correspond to i = 1.
        * d = 4 = dimension of the encoding vector
        * sine/cosine used to create unique position encoding vector for each word
        * equation => values in each cell of the position vector
        * you read values from d sine/cosine graphs for each consecutive word and fill vector with them
    * Positional encoding of P1 is added directly to X1
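
For reference, the sinusoidal encoding from the "Attention Is All You Need" paper is `PE(pos, 2i) = sin(pos / 10000^(2i/d))` and `PE(pos, 2i+1) = cos(pos / 10000^(2i/d))`. A small NumPy sketch of it, assuming `d` is even and using zero vectors as placeholder word embeddings:

```python
import numpy as np

def positional_encoding(max_len, d):
    """Sinusoidal position vectors, shape (max_len, d); d is assumed even."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(d // 2)[None, :]                 # (1, d/2): one i per sin/cos pair
    angles = pos / np.power(10000.0, 2 * i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions 2i+1
    return pe

P = positional_encoding(max_len=6, d=4)
X = np.zeros((6, 4))     # placeholder word embeddings, for illustration only
X_with_pos = X + P       # the position vector is added directly to the embedding
print(P[1].round(3))     # position vector for pos = 1
```

Row `pos` of `P` is the vector for that position, and every position gets a different vector.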
    
=> The output of the encoding block contains contextual semantic embedding and positional encoding information. 

The output of the embedding layer is of dimension:
* d = in this case 4 
* by the maximum length of sequence your model can take. 

The outputs of all other layers of the encoder are also of this shape.

<img src='./Images/W4_08.png' style="width: 30%"></img>

* (2) Residual connections: Add and Norm
    * In addition to adding these position encodings to the embeddings, you'd also pass them through the network with residual connections. 
    * The purpose in this case: to pass along positional information through the entire architecture.
    * The Add & Norm layer is similar to batch norm 
    * This helps to speed up learning (**question** by normalizing the length of sentences?)
    * repeated in several places
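
A minimal sketch of one Add & Norm step, assuming the layer normalization used in the paper, `LayerNorm(x + Sublayer(x))`: each word's vector is normalized across its own features, independently of sentence length. The learnable gain and bias of a real layer norm are omitted, and `sublayer_output` is a random stand-in for the attention or feed-forward output.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each row (one word's vector) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_output):
    return layer_norm(x + sublayer_output)   # residual connection, then normalize

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                  # input to the sublayer
sublayer_output = rng.normal(size=(5, 8))    # stand-in for attention / feed-forward output
y = add_and_norm(x, sublayer_output)
print(y.shape)
```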
   
* (3) The output of the decoder block goes through a linear layer followed by a softmax, which predicts the next word
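
A toy sketch of this last step, with made-up sizes and random weights standing in for the learned linear layer: the decoder's vector for the position being predicted is mapped to one score per vocabulary word, and the softmax turns the scores into probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 10   # toy sizes, chosen for illustration only

decoder_out = rng.normal(size=d_model)        # decoder vector for the next position
W = rng.normal(size=(d_model, vocab_size))    # linear layer weights (learned in practice)
b = np.zeros(vocab_size)                      # linear layer bias

logits = decoder_out @ W + b                  # one score per vocabulary word
probs = np.exp(logits - logits.max())         # softmax, shifted for stability
probs /= probs.sum()

next_word_id = int(np.argmax(probs))          # greedy pick of the most likely word
print(next_word_id, probs[next_word_id].round(3))
```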


* (4) You may also hear of masked multi-head attention, used in the first layer of the decoder
    * It matters only during training
    * Let's say your dataset has the correct French-to-English translations
    * When training, you have access to the entire correct English translation (the correct output) as well as the correct input.
    * And because you have the full correct output, you don't have to generate the words one at a time during training.
    * Masking blocks out the later part of the sentence to mimic what the network will need to do at test time or during prediction
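
One common way to implement the mask, sketched in NumPy with random toy inputs: scores for later positions are set to minus infinity before the softmax, so each position can only attend to itself and to earlier positions. The function name `masked_self_attention` is made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)            # hide later positions
    weights = softmax(scores)                              # masked entries become 0
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 3)) for _ in range(3))
out, weights = masked_self_attention(Q, K, V)
print(weights.round(2))   # entries above the diagonal are zero
```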
    
    
<img src='./Images/W4_09.png' style="width: 80%"></img>

# Quiz

<img src='./Images/Q4_1.png' style="width: 80%"></img>

I don't understand what "word values" are in the 3rd exercise

<img src='./Images/Q4_2.png' style="width: 80%"></img>
<img src='./Images/Q4_3.png' style="width: 80%"></img>
<img src='./Images/Q4_4.png' style="width: 80%"></img>

**References**

**Week 1:**
* Minimal character-level language model with a Vanilla Recurrent Neural Network, in Python/numpy (GitHub: karpathy)
* The Unreasonable Effectiveness of Recurrent Neural Networks (Andrej Karpathy blog, 2015)
* deepjazz (GitHub: jisungk)
* Learning Jazz Grammars (Gillick, Tang & Keller, 2010)
* A Grammatical Approach to Automatic Improvisation (Keller & Morrison, 2007)
* Surprising Harmonies (Pachet, 1999)

**Week 2:**
* Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings (Bolukbasi, Chang, Zou, Saligrama & Kalai, 2016)
* GloVe: Global Vectors for Word Representation (Pennington, Socher & Manning, 2014)
* Woebot.

**Week 4:**
* Natural Language Processing Specialization (by DeepLearning.AI)
* Attention Is All You Need (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser & Polosukhin, 2017)