# ___Transformers – Attention is All You Need___

_The paper [‘Attention Is All You Need’](https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf) introduces a novel architecture called the __Transformer__. As the title indicates, it uses the attention mechanism we saw earlier. Like LSTM-based sequence-to-sequence models, the Transformer is an architecture for transforming one sequence into another with the help of two parts (Encoder and Decoder), but it differs from the previously described sequence-to-sequence models because it does not use any recurrent networks (GRU, LSTM, etc.)._

_Recurrent networks were, until then, one of the best ways to capture temporal dependencies in sequences. However, the team presenting the paper showed that an architecture built only on attention mechanisms, without any RNN (Recurrent Neural Network), can improve on the results in translation and other tasks!_

## ___The Transformer Black Box___

<img src='https://miro.medium.com/max/512/1*jKjKXD0zqUTifHMb6iFf1A.png'/>

_Let us first understand the basic similarities and differences between the attention and the transformer models. __Both aim to achieve the same result using an encoder-decoder approach.__ The encoder converts the original input sequence into its latent representation in the form of hidden state vectors. The decoder tries to predict the output sequence using this latent representation. But the RNN based approach has an inherent flaw. Due to the fundamental constraint of sequential computation, it is not possible to parallelize the network, which makes it hard to train on long sequences. This, in turn, puts a constraint on the batch size that can be used while training._

_The transformer architecture continues with the Encoder-Decoder framework that was a part of the original Attention networks — given an input sequence, create an encoding of it based on the context and decode that context-based encoding to the output sequence._

<img src='https://miro.medium.com/max/512/1*A69q3bASD9ClfSddu_D5Kw.png'/>

_Besides the __issue of not being able to parallelize__, another important reason for working on an improvement was that the __attention-based model would inadvertently give a higher weight to the elements in the sequence closer to a position.__ Though this might help in understanding the grammatical formation of various parts of the sentence, it makes it hard to find relations between words far apart in the sentence._

### ___Why do we need the Transformer?___

#### ___RNN___
* ___Advantages___ _: RNNs are popular and successful for variable-length representations such as sequences (e.g. languages), images, etc. RNNs are considered the core of seq2seq (with attention). The gating models such as LSTM or GRU are for long-range error propagation._
* ___Problems___ _: The sequentiality prohibits parallelization within instances. Long-range dependencies are still tricky, despite gating. Sequence-aligned states in RNNs are wasteful. It is hard to model hierarchical domains such as languages._

#### ___CNN___
* ___Advantages___ _: Trivial to parallelize (per layer) and fit intuition that most dependencies are local._
* ___Problems___ _: Path length between positions is at best logarithmic (when using dilated convolutions, with left-padding for text), as in the autoregressive CNNs WaveNet and ByteNet._

### ___Objective for the Architecture___
* ___Parallelization of Seq2Seq___ _: RNNs handle sequences word-by-word, which is an obstacle to parallelization. The Transformer achieves parallelization by replacing recurrence with attention and encoding the symbol position in the sequence. This, in turn, leads to significantly shorter training time._
* ___Reduce sequential computation___ _: Constant O(1) number of operations to learn dependency between two symbols independently of their position distance in sequence._

## ___Transformer Architecture___

<img src='https://miro.medium.com/max/1000/1*o0pS0LbXVw7i41vSISrw1A.png'/>

_A Transformer is composed of an encoder and a decoder. The encoder’s role is to encode the inputs (i.e. the sentence) into a state, which often contains several tensors. The state is then passed to the decoder to generate the outputs. In machine translation, the encoder transforms a source sentence, e.g., “Hello world.”, into a state, e.g., a vector, that captures its semantic information. The decoder then uses this state to generate the translated target sentence, e.g., “Bonjour le monde.”. The encoder and decoder have several submodules, but as you can see, both mainly use Multi-Head Attention and a Feed Forward Network._

_The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number._

<img src='https://blog.exxactcorp.com/wp-content/uploads/2020/05/Overview-of-a-Full-Transformer-Architecture.jpg'/>

_The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:_

<img src='http://jalammar.github.io/images/t/Transformer_encoder.png' width= 500/>

_The encoder’s inputs first flow through a __self-attention layer__ – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word._

_The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position._

_The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models)._

<img src='http://jalammar.github.io/images/t/Transformer_decoder.png' width= 600/>

### ___Encoder Side___

#### ___Input Embedding___

_Embedding aims at creating a vector representation of words. Words that have similar meanings will be close in terms of Euclidean distance. For example, the words bathroom and shower are associated with the same concept, so the two words are close in Euclidean space: they express similar senses or concepts._

_For the encoder, the authors decided to use an embedding of size 512 (i.e. each word is modeled by a vector of size 512). The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors, each of size 512. In the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that’s directly below. The size of this list is a hyperparameter we can set; basically it would be the length of the longest sentence in our training dataset._

_After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder._

<img src='http://jalammar.github.io/images/t/encoder_with_tensors.png' width= 600/>

_Here we begin to see one key property of the Transformer, which is that __the word in each position flows through its own path in the encoder__. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer._
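_To make the "independent paths" point concrete, here is a minimal NumPy sketch of a position-wise feed-forward layer. It assumes the two-layer form with a ReLU used in the paper; the toy sizes and the randomly initialised weights `W1`, `b1`, `W2`, `b2` are placeholders for illustration, not trained values._

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 3   # toy sizes; the paper uses 512 and 2048

# Hypothetical randomly initialised weights standing in for trained ones.
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    """Two linear maps with a ReLU in between, applied along the last axis."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

X = rng.normal(size=(seq_len, d_model))   # one row per position

all_at_once = feed_forward(X)                             # every position in one call
one_by_one = np.stack([feed_forward(row) for row in X])   # each position on its own

print(np.allclose(all_at_once, one_by_one))  # True
```

_Because no row depends on any other row, the per-position computations can run in parallel._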

#### ___Positional Encoding___

_As there is no component in this new encoder-decoder architecture which explains the sequential nature of the data, we need to inject some information about the relative or absolute position of the tokens in the sequence. This is the task of the __positional encoding module__._

_The position of a word plays a determining role in understanding the sequence we try to model. Therefore, we __add positional information__ about the word within the sequence in the vector._

_To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns to use, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors._

_The authors of the paper used the following functions to model the position of a word within a sequence._

<img src='https://miro.medium.com/max/439/1*0dwJvPk9mTgyBkiyJAiexA.png'/>

_where __pos is the position and i is the dimension__. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000⋅2π. The authors chose this function because they hypothesized it would allow the model to easily learn to attend by relative positions since for any fixed offset k, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$._
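_As a rough illustration, the sinusoidal encoding can be sketched in NumPy as follows. The function name `sinusoidal_positional_encoding` and its parameters are chosen here for illustration, positions are counted from 0 (an assumption of this sketch), and `d_model` is assumed to be even._

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model, base=10000.0):
    """Return a (max_len, d_model) matrix of sinusoidal position encodings.

    Even columns hold sin(pos / base^(2i / d_model)); odd columns hold the
    matching cosine. Assumes d_model is even.
    """
    positions = np.arange(max_len)[:, np.newaxis]         # shape (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]   # the values 2i
    angles = positions / np.power(base, even_dims / d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=4, d_model=512)
print(pe.shape)    # (4, 512)
print(pe[0, :4])   # position 0 alternates sin(0) = 0 and cos(0) = 1
```

_The resulting matrix has one row per position and the same width as the embeddings, so it can be added to them element-wise._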

_We will try to explain positional encoding in more detail. Let us take an example._

```
The big yellow cat
1    2    3     4
```

_We denote the position of the word in the sequence by $p_t \in [1, 4]$._

_$d_{model}$ is the dimension of the embedding; in our case $d_{model} = 512$._

_$i$ indexes the dimensions of the vector._

_In this model, the absolute position of a word in a sequence is added directly to the initial vector. To do this, the positional encoding must have the same size $d_{model}$ as the initial vector._

<img src='http://jalammar.github.io/images/t/transformer_positional_encoding_vectors.png' width=600/>

_Ignoring the mathematical formulation, given an embedding for token x at position pos, a positional encoding for that position is added to the embedding. This injection of position is done such that each positional encoding is distinct from any other. Every dimension of the positional encoding corresponds to a sinusoid of a different wavelength, with the final encoding being the value of each of these sinusoids at position pos._

<img src='http://jalammar.github.io/images/t/transformer_positional_encoding_example.png' width = 600/>

#### ___Self-Attention___

_Say the following sentence is an input sentence we want to translate:_
```
"The animal didn't cross the street because it was too tired"
```
_What does __“it”__ in this sentence refer to? __Is it referring to the street or to the animal?__ It’s a simple question to a human, but not as simple to an algorithm._

_When the model is processing the word __“it”__, self-attention allows it to associate __“it”__ with __“animal”__._

_As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word._

_If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing._

_The attention mechanism allows the output to focus on the input while producing output, while the self-attention model allows inputs to interact with each other (i.e. calculate the attention of all other inputs with respect to one input)._

_As described by the authors of “__Attention is All You Need__”,_

<img src='https://miro.medium.com/max/447/1*8n2-UGxviSJf4M04S5Q08w.png' width = 300/>

___Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.___

_This layer aims to encode a word based on all other words in the sequence. It measures the encoding of the word against the encoding of another word and gives a new encoding._

___The main purpose of attention is to estimate the relative importance of each key with respect to the query. To that end, the attention mechanism takes a query Q that represents the current word, the keys K that represent all the words in the sentence, and the values V that hold the vector representations of those words.___

##### ___Self-Attention in Detail___

_The __first step__ in calculating self-attention is to create three vectors from each of the encoder’s input vectors. So for each word, __we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process__._

_Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. They don’t HAVE to be smaller, this is an architecture choice to make the computation of multiheaded attention (mostly) constant._

<img src='http://jalammar.github.io/images/t/transformer_self_attention_vectors.png' width = 700/>

<center style="font-size:10px"><i>Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that word. We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.</i></center>
    
___What are the “query”, “key”, and “value” vectors?___

_They’re abstractions that are useful for calculating and thinking about attention. Once we understand how attention is calculated below, we’ll know the role each of these vectors plays._
    
_The __second step__ in self-attention is to __calculate a score__. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position._

_The score is calculated by taking the __dot product of the query vector with the key vector of the respective word__ we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2._

<img src='https://jalammar.github.io/images/t/transformer_self_attention_score.png' width = 500/>

_The __third and fourth__ steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper, 64. __This leads to having more stable gradients.__ There could be other possible values here, but this is the default), then __pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1__._

<img src='http://jalammar.github.io/images/t/self-attention_softmax.png' width = 500/>

_This softmax score determines how much each word will be expressed at this position. Often the word at this position will have the highest softmax score, but sometimes it’s useful to attend to another word that is relevant to the current word._

_The __fifth step__ is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them by tiny numbers like 0.001, for example)._

_The __sixth step__ is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word)._

<img src='https://jalammar.github.io/images/t/self-attention-output.png' width = 500/>
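_The six steps above can be sketched for a single word in NumPy. This is only an illustration: the two input vectors and the projection matrices `W_Q`, `W_K`, `W_V` are random placeholders rather than trained weights._

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64

# Two input embeddings (random stand-ins for the two words in the figures).
x = rng.normal(size=(2, d_model))

# Hypothetical projection matrices; in a real model these are learned.
W_Q = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
W_K = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
W_V = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)

# Step 1: query, key and value vectors for every word.
q, k, v = x @ W_Q, x @ W_K, x @ W_V

# Step 2: score every word against the first word's query (q1 . k1, q1 . k2).
scores = np.array([q[0] @ k[j] for j in range(len(x))])

# Step 3: divide by sqrt(d_k) = 8.
scaled = scores / np.sqrt(d_k)

# Step 4: softmax, so the weights are positive and sum to 1.
weights = np.exp(scaled - scaled.max())
weights /= weights.sum()

# Step 5: weight each value vector by its softmax score.
weighted_values = weights[:, np.newaxis] * v

# Step 6: sum the weighted value vectors to get the output for the first word.
z1 = weighted_values.sum(axis=0)
print(z1.shape)  # (64,)
```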

_That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network. In the actual implementation, however, this calculation is done in matrix form for faster processing. So let’s look at that now that we’ve seen the intuition of the calculation on the word level._

___Matrix Calculation of Self-Attention___

_The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we’ve trained (WQ, WK, WV)._

<img src='https://jalammar.github.io/images/t/self-attention-matrix-calculation.png' width = 300/>

<center style="font-size:10px"><i>Every row in the X matrix corresponds to a word in the input sentence. We again see the difference in size of the embedding vector (512, or 4 boxes in the figure), and the q/k/v vectors (64, or 3 boxes in the figure)</i></center>

_Finally, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer._

<img src='http://jalammar.github.io/images/t/self-attention-matrix-calculation-2.png' width = 500/>

<center style="font-size:10px"><i>The self-attention calculation in matrix form</i></center>

##### ___Summarized___

___The attention mechanism allows the output to focus on the input while producing output, while the self-attention model allows inputs to interact with each other (i.e. calculate the attention of all other inputs with respect to one input).___

* _The first step is multiplying each of the encoder input vectors with three weights matrices (W(Q), W(K), W(V)) that we trained during the training process. This matrix multiplication will give us three vectors for each of the input vector: the key vector, the query vector, and the value vector._


* _The second step in calculating self-attention is to take the dot product of the Query vector of the current input with the Key vector of every input (including itself), which gives one score per input._


* _In the third step, we divide each score by the square root of the dimension of the key vector (dk). In the paper the dimension of the key vector is 64, so we divide by 8. The reason is that large dot products push the softmax function into regions where its gradients are extremely small, which makes training harder._


* _In the fourth step, we will apply the softmax function on all self-attention scores we calculated wrt the query word (here first word)._


* _In the fifth step, we multiply each value vector by the corresponding softmax score calculated in the previous step._


* _In the final step, we sum up the weighted value vectors that we got in the previous step, this will give us the self-attention output for the given word._

_The above procedure is applied to every position in the input sequence. Mathematically, the self-attention matrix for input matrices (Q, K, V) is calculated as:_

<img src='https://www.geeksforgeeks.org/wp-content/ql-cache/quicklatex.com-d365ea4bd6d1549fe6fe8817f4e92815_l3.svg'/>

_where Q, K, V are the matrices obtained by stacking the query, key, and value vectors of all positions._
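_In matrix form, the whole calculation fits in a few lines. The sketch below assumes the scaled dot-product form softmax(QKᵀ / √dk)V from the paper; the helper names `softmax` and `scaled_dot_product_attention` and the random weights are illustrative choices, not part of any particular library._

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return the attention output and the attention weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 512))            # 4 words, embedding size 512

# Hypothetical projection matrices standing in for trained W_Q, W_K, W_V.
W_Q, W_K, W_V = (rng.normal(size=(512, 64)) / np.sqrt(512) for _ in range(3))

Z, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(Z.shape, weights.shape)  # (4, 64) (4, 4)
```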

#### ___Multi Head Attention___

<img src='https://miro.medium.com/max/576/1*yNRczSxh9b9RTN__4IxVhA.png' width = 400/>

_The paper further refined the self-attention layer by adding a mechanism called __“multi-headed” attention__. This improves the performance of the attention layer in two ways:_

* ___It expands the model’s ability to focus on different positions.___ _Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. This would be useful if we’re translating a sentence like “The animal didn’t cross the street because it was too tired”, where we would want to know which word “it” refers to._


* ___It gives the attention layer multiple “representation subspaces”.___ _As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace._

<img src='https://jalammar.github.io/images/t/transformer_attention_heads_qkv.png' width = 600/>

<center style="font-size:10px"><i>With multi-headed attention, we maintain separate Q/K/V weight matrices for each head resulting in different Q/K/V matrices. As we did before, we multiply X by the WQ/WK/WV matrices to produce Q/K/V matrices.</i></center>

_If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices._

<img src='https://jalammar.github.io/images/t/transformer_attention_heads_z.png' width = 600/>

_The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix._

_How do we do that? __We concatenate the matrices, then multiply them by an additional weights matrix WO__._

<img src='https://jalammar.github.io/images/t/transformer_attention_heads_weight_matrix_o.png' width = 600/>

<img src='https://jalammar.github.io/images/t/transformer_multi-headed_self-attention-recap.png' width = 600/>

_If we add all the attention heads to the picture, however, things can be harder to interpret:_

<img src='https://jalammar.github.io/images/t/transformer_self-attention_visualization_3.png' width = 300/>

##### ___Summarized___

_In the attention paper, the authors proposed another type of attention mechanism called multi-headed attention. Below is the step-by-step process to calculate multi-headed self-attention:_

* _Take each word of input sentence and generate the embedding from it._


* _In this mechanism, we create h (h = 8) different attention heads; each head has its own weight matrices (W(Q), W(K), W(V))._


* _In this step, we multiply the input matrix with each of the weight matrices (WQ, WK, WV) to produce the key, value, and query matrices for each attention head._


* _Now, we apply the attention mechanism to these query, key, and value matrices; this gives us an output matrix from each attention head._


* _In this step, we concatenate the output matrices obtained from the attention heads and multiply the result by the weight matrix WO to generate the output of the multi-headed attention layer._


* _Mathematically multi-head attention can be represented by:_

$$\mathrm{MultiHead}\left ( Q, K, V \right ) = \mathrm{Concat}\left ( head_{1}, head_{2}, \dots, head_{h} \right )W^{O}$$

$$\text{where } head_{i} = \mathrm{Attention} \left ( QW_{i}^{Q},KW_{i}^{K}, VW_{i}^{V} \right )$$
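_A minimal NumPy sketch of multi-head self-attention follows. It assumes h = 8 heads of size 64 and uses the same input X for queries, keys, and values; all weight arrays are random placeholders, and the function names are chosen here for illustration._

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V have shape (h, d_model, d_k); W_O has shape (h * d_k, d_model)."""
    heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i])
             for i in range(len(W_Q))]
    # Concatenate the h head outputs side by side, then mix them with W_O.
    return np.concatenate(heads, axis=-1) @ W_O

h, d_model = 8, 512
d_k = d_model // h                        # 64 per head
rng = np.random.default_rng(0)
X = rng.normal(size=(4, d_model))         # 4 words

# Hypothetical random weights standing in for trained parameters.
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) / np.sqrt(d_model)
                 for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model)) / np.sqrt(h * d_k)

out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (4, 512)
```

_The output has the same shape as the input, which is what lets the feed-forward layer (and the next encoder) consume it directly._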

#### ___The Residual Connections___

<img src='https://jalammar.github.io/images/t/transformer_resideual_layer_norm.png' width=300/>

_One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step._

_If we’re to visualize the vectors and the layer-norm operation associated with self attention, it would look like this:_
<img src='https://jalammar.github.io/images/t/transformer_resideual_layer_norm_2.png' width=300/>

_This goes for the sub-layers of the decoder as well. If we’re to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:_
<img src='https://jalammar.github.io/images/t/transformer_resideual_layer_norm_3.png' width=500/>

#### ___Layer Normalization___

<img src='https://miro.medium.com/max/410/1*F8KDxyfGG63QbJB2SB2aJw.png'/>
<center style="font-size:10px"><i>Normalization Techniques</i></center>

_The key feature of layer normalization is that it __normalizes the inputs across the features__, unlike batch normalization which normalizes each feature across a batch. Batch norm has the flaw that it imposes a lower bound on the batch size. In layer norm, the statistics are computed across the features of each example and are independent of other examples. It has been seen to perform better experimentally._

<img src='https://miro.medium.com/max/232/1*HRX5QmV1viDj3DtjdbVlLQ.png'/>

___In the transformers, layer normalization is done with residuals, allowing it to retain some form of information from the previous layer.___
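_The "add and normalize" step can be sketched as LayerNorm(x + Sublayer(x)). In the sketch below, the learned gain and bias of layer normalization are omitted for brevity, and the lambda passed as `sublayer` is a hypothetical stand-in for self-attention or the feed-forward network._

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalise each row (one position) to zero mean and unit variance
    across its features. Learned gain and bias are omitted in this sketch."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection around a sub-layer, followed by layer norm."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 512))   # 4 positions

# Hypothetical stand-in for a real sub-layer.
out = add_and_norm(x, lambda t: 0.1 * t)

print(out.mean(axis=-1))  # each entry is approximately 0
print(out.std(axis=-1))   # each entry is approximately 1
```

_Note that the statistics are taken along the feature axis of each position, so the result does not depend on the other examples in a batch._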

### ___Decoder Side___

* _Decoder Input is the Output Embedding + Positional Encoding, which is offset by 1 position to ensure the prediction for position i depends only on the positions before i_


* _N layers of Masked Multi-Head Attention, Multi-Head Attention and Position-Wise Feed Forward Network with Residual Connections around them followed by Layer Normalization_


* _Masked Multi-Head Attention to prevent future words from being part of the attention (at inference time, the decoder would not know about the future outputs)_


* _This is followed by Position-Wise Feed Forward NN_

_The decoder generates one word at a time from left to right. The first word is based on the final representation of the encoder (offset by 1 position)._

_Every word predicted subsequently attends to the previously generated words of the decoder at that layer and the final representation of the encoder (Multi-Head Attention) — similar to a typical encoder-decoder architecture._

_The decoder is autoregressive: it begins with a start token and takes in the list of previous outputs as inputs, as well as the encoder outputs that contain the attention information from the input. The decoder stops decoding when it generates an end-of-sequence token as an output._
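_The autoregressive loop itself is simple and can be sketched without a real model. In the sketch below, `toy_next_token` is a hypothetical stand-in for a trained decoder (it just copies the source and then emits an end token), and the token names `<start>` and `<end>` are assumptions made for illustration._

```python
START, END = "<start>", "<end>"   # assumed names for the special tokens

def toy_next_token(source, generated):
    """Hypothetical stand-in for a trained decoder: copies the source
    one token at a time, then emits END."""
    n_emitted = len(generated) - 1          # do not count START
    return source[n_emitted] if n_emitted < len(source) else END

def greedy_decode(source, next_token_fn, max_len=20):
    """Generate tokens one by one, feeding previous outputs back in."""
    generated = [START]
    while len(generated) < max_len:
        token = next_token_fn(source, generated)
        if token == END:                    # stop at the end token
            break
        generated.append(token)             # becomes input for the next step
    return generated[1:]                    # drop START

print(greedy_decode(["Bonjour", "le", "monde"], toy_next_token))
# ['Bonjour', 'le', 'monde']
```

_The `max_len` guard is a practical safety net in case the end token is never produced._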

#### ___Masked Self-Attention___
_At any position, a word may depend on the words before it as well as the ones after it. For example, in “I saw the \_\_\_\_ chasing a mouse.” we would intuitively fill in “cat”, as that is the most probable word._

_So while encoding a word, the model needs to know everything that comes in the whole sentence. This is why, in the encoder's self-attention layer, every word is queried against every other word._

_But at the time of decoding, when trying to predict the next word in the sentence (which is why we have a shifted input sequence for the decoder input), the model should not know which words come after the word we are trying to predict._

_Hence, the attention scores for all these later positions are masked (set to -inf before the softmax, so that their attention weights become 0), and the prediction is based only on the words which came before it._

___This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.___

<img src='https://miro.medium.com/max/436/0*0pqSkWgSPZYr_Sjx.png' width=300/>
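_One common way to implement this masking is to set the scores of future positions to -inf before the softmax, so that their attention weights come out as exactly 0. A small NumPy sketch, with function names chosen here for illustration:_

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) matrix that is True where attention is NOT allowed,
    i.e. where the key position j lies after the query position i."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_softmax(scores, mask):
    """Softmax over the last axis with masked entries forced to weight 0."""
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)                      # exp(-inf) is exactly 0
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                   # equal raw scores, for clarity
weights = masked_softmax(scores, causal_mask(4))
print(weights)
```

_With equal raw scores, row i spreads its weight evenly over positions 0 to i and gives nothing to later positions, so the printed matrix is lower triangular._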

#### ___Encoder-Decoder Attention___

_The encoder-decoder attention layer operates on the output of the decoder's self-attention layer and the output of the final encoder._

_In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation._

_The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack._

_This allows every position in the decoder to attend over all the positions in the input sequence (similar to the typical encoder-decoder architecture)._
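_The difference from self-attention is only in where Q, K, and V come from. A single-head NumPy sketch, using random placeholders for the encoder output, the decoder hidden states, and the projection matrices:_

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, d_k = 512, 64

encoder_output = rng.normal(size=(5, d_model))   # 5 source positions
decoder_hidden = rng.normal(size=(3, d_model))   # 3 target positions so far

# Hypothetical projection matrices standing in for trained weights.
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
                 for _ in range(3))

Q = decoder_hidden @ W_Q    # queries come from the decoder layer below
K = encoder_output @ W_K    # keys come from the encoder stack output
V = encoder_output @ W_V    # values come from the encoder stack output

weights = softmax(Q @ K.T / np.sqrt(d_k))
context = weights @ V
print(weights.shape, context.shape)  # (3, 5) (3, 64)
```

_Each of the 3 decoder positions gets a row of 5 weights, one per source position, and no mask is applied because the whole input sentence is already known._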

<img src='http://jalammar.github.io/images/t/transformer_decoding_2.gif' width = 500/>

##### ___Understanding the Flow___
<img src='https://miro.medium.com/max/576/1*sbfNVjf3yERRD9Rg4OLIBw.gif'/>

### ___Final Linear and Softmax Layer___

_The decoder stack outputs a vector of floats. How do we turn that into a word? That’s the job of the final Linear layer which is followed by a Softmax Layer._

_The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector._

_Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer._

_The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step._
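_These last two layers can be sketched as a matrix multiplication followed by a softmax and an argmax. The vocabulary below is a list of placeholder strings, and the weights `W` and `b` are random rather than trained, so the chosen word is meaningless; only the shapes and the flow are of interest._

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 10_000

# Placeholder vocabulary; a real model would use its learned output vocabulary.
vocab = [f"word_{i}" for i in range(vocab_size)]

# Hypothetical weights of the final Linear layer.
W = rng.normal(size=(d_model, vocab_size)) * 0.02
b = np.zeros(vocab_size)

decoder_output = rng.normal(size=d_model)   # vector from the decoder stack

logits = decoder_output @ W + b             # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                        # softmax: positive, sums to 1

next_id = int(np.argmax(probs))             # pick the most probable word
print(logits.shape, vocab[next_id])
```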

<img src='http://jalammar.github.io/images/t/transformer_decoder_output_softmax.png' width = 500/>

## ___Applications of the Transformer___

_In October 2019, Google announced the use of BERT for 10% of its English language search. __Search will attempt to understand queries the way users tend to ask them in a natural way__. This is opposed to parsing the query as a bunch of keywords. Thus, words and phrases such as "to" or "for someone" are important for meaning, and BERT picks these up._

_We can use transformers to __generate synthetic text__. Starting from a small prompt, the GPT-2 model is able to __generate long sequences and paragraphs of text__ that are realistic and coherent. This text also adapts to the style of the input._

_For __correcting grammar__, transformers provide competitive baseline performance. For sequence generation, Insertion Transformer and Levenshtein Transformer have been proposed._

_Transformers have been used beyond NLP, such as for __image generation where self-attention is restricted to local neighbourhoods__. Music Transformer applied self-attention to __generate long pieces of music__. While the original transformer used absolute positions, the music transformer used relative attention, allowing the model to create music in a consistent style._

## ___Well-Known Transformer Networks___

___BERT is an encoder-only transformer___ _. It's the first deeply bidirectional model, meaning that it uses both left and right contexts in all layers. BERT showed that as a pretrained language model it can be fine-tuned easily to obtain state-of-the-art models for many specific tasks. BERT has inspired many variants: __RoBERTa, XLNet, MT-DNN, SpanBERT, VisualBERT, K-BERT, HUBERT, and more__. Some variants attempt to compress the model: __TinyBERT, ALERT, DistilBERT, and more__._

_The other competitive model is __GPT-2__. Unlike __BERT, GPT-2 is not bidirectional and is a decoder-only transformer__. However, the training includes both unsupervised pretraining and supervised fine-tuning. The training objective combines both of these to improve generalization and convergence. This approach of training on specific tasks is also seen in __MT-DNN__._

___GPT-2 is auto-regressive___ _. Each output token is generated one by one. Once a token is generated, it's added to the input sequence. BERT is not auto-regressive but instead uses context from both sides. __XLNet__ is auto-regressive while also using context from both sides._

## ___Variations of the Transformer Network___

_Compared to the original transformer of Vaswani et al., we note the following variations:_

* ___Transformer-XL___ _: Overcomes the limitation of fixed-length context. It makes use of segment-level recurrence and relative positional encoding._


* ___DS-Init & MAtt___ _: Stacking many layers is problematic due to vanishing gradients. Therefore, depth-scaled initialization and merged attention sublayer are proposed._


* ___Average Attention Network (AAN)___ _: With the original transformer, decoder's self-attention is slow due to its auto-regressive nature. Speed is improved by replacing self-attention with an averaging layer followed by a gating layer._


* ___Dialogue Transformer___ _: Conversation that has multiple overlapping topics can be picked out. Self-attention is over the dialogue sequence turns._


* ___Tensor-Product Transformer___ _: Uses novel TP-Attention to explicitly encode relations and applies it to math problem solving._


* ___Tree Transformer___ _: Puts a constraint on the encoder to follow tree structures that are more intuitive to humans. This also helps us learn grammatical structures from unlabelled data._


* ___Tensorized Transformer___ _: Multi-head attention is difficult to deploy in a resource-limited setting. Hence, multi-linear attention with Block-Term Tensor Decomposition (BTD) is proposed._

## ___Limitations of the Transformer___

_The Transformer is undoubtedly a huge improvement over RNN-based seq2seq models, but it comes with its own share of limitations:_

* _Attention can only deal with fixed-length text strings. Long text has to be split into segments (chunks) of a fixed maximum length before being fed into the model._


* _This chunking of text causes __context fragmentation__. For example, if a sentence is split from the middle, then a significant amount of context is lost. In other words, the text is split without respecting the sentence or any other semantic boundary._
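
_To make context fragmentation concrete, here is a minimal sketch of fixed-length chunking. The helper `split_into_segments` and the whitespace tokenisation are assumptions for illustration only; real systems use subword tokenisers and much longer segments._

```python
from typing import List


def split_into_segments(tokens: List[str], segment_len: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most segment_len tokens."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


tokens = "the cat sat on the mat . it was happy .".split()
segments = split_into_segments(tokens, segment_len=4)
for segment in segments:
    print(segment)
```

_The second chunk mixes the end of the first sentence with the first word of the next, and the third chunk starts mid-sentence. A model that sees each chunk in isolation has no access to the words that were cut off._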

_So how do we deal with these pretty major issues? That’s the question folks who worked with Transformer asked. And out of this came __Transformer-XL__._

#### ___Understanding Transformer-XL___

_Transformer architectures can learn longer-term dependency. However, they can’t stretch beyond a certain level due to the use of fixed-length context (input text segments). A new architecture was proposed to overcome this shortcoming in the paper – __[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context](https://arxiv.org/pdf/1901.02860.pdf)__._

<img src='https://3.bp.blogspot.com/-THzNQ7yDkpE/XFCmkGvYeYI/AAAAAAAADuE/ycT09f23p8UzjfyBflsTvsdWoQ8KafUcgCLcBGAs/s640/GIF1.gif'/>
<center style='font-size:10px'><i>Vanilla Transformers</i></center>

___In this architecture, the hidden states obtained in previous segments are reused as a source of information for the current segment___ _. It enables modeling longer-term dependency as the information can flow from one segment to the next._

_During the evaluation phase, the representations from the previous segments can be reused instead of being recomputed from scratch for every new position, as a vanilla Transformer must do. This makes evaluation substantially faster; the Transformer-XL authors report speedups of up to 1,800+ times over a vanilla Transformer during evaluation._
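
_The idea of reusing hidden states can be sketched with a single attention head in NumPy. This is a simplified illustration under several assumptions, not the paper's implementation: it omits causal masking, relative positional encoding, multiple heads and multiple layers, the weights are random, and the function name `attend_with_memory` is hypothetical._

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attend_with_memory(segment, memory, w_q, w_k, w_v):
    """Single-head attention whose keys and values also cover cached states."""
    # Keys and values see the cached previous segment plus the current one.
    context = segment if memory is None else np.concatenate([memory, segment], axis=0)
    queries = segment @ w_q            # queries come from the current segment only
    keys = context @ w_k
    values = context @ w_v
    weights = softmax(queries @ keys.T / np.sqrt(keys.shape[-1]))
    return weights @ values, weights


rng = np.random.default_rng(0)
d_model = 4
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
segments = [rng.normal(size=(3, d_model)) for _ in range(2)]

memory = None
attention_shapes = []
for segment in segments:
    output, weights = attend_with_memory(segment, memory, w_q, w_k, w_v)
    attention_shapes.append(weights.shape)
    memory = segment.copy()  # cached for the next segment; a constant (no gradient) in the real model

print(attention_shapes)
```

_For the first segment each position attends over 3 positions; for the second it attends over 6, because the cached segment extends the context without being recomputed._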

<img src='https://2.bp.blogspot.com/--MRVzjIXx5I/XFCm-nmEDcI/AAAAAAAADuM/HoS7BQOmvrQyk833pMVHlEbdq_s_mXT2QCLcBGAs/s640/GIF2.gif'/>
<center style='font-size:10px'><i>Transformer-XL with segment-level recurrence at training time</i></center>

<img src='https://4.bp.blogspot.com/-Do42uKiMvKg/XFCns7oXi5I/AAAAAAAADuc/ZS-p1XHZUNo3K9wv6nRG5AmdEK7mJsrugCLcBGAs/s640/xl-eval.gif'/>
<center style='font-size:10px'><i>Transformer-XL with segment-level recurrence at evaluation time</i></center>

## ___Mathematical operations involved in Self-Attention___

### ___Step By Step___

* _Prepare inputs_
* _Initialise weights_
* _Derive key, query and value_
* _Calculate attention scores for Input 1_
* _Calculate softmax_
* _Multiply scores with values_
* _Sum weighted values to get Output 1_
* _Repeat steps 4–7 for Input 2 & Input 3_

#### ___Prepare Inputs___

_In this example, we start with 3 inputs, each with dimension 4._

<img src='https://miro.medium.com/max/512/1*hmvdDXrxhJsGhOQClQdkBA.png'/>

    Input 1: [1, 0, 1, 0] 
    Input 2: [0, 2, 0, 2]
    Input 3: [1, 1, 1, 1]

#### ___Initialise Weights___

_Every input must have three representations (see diagram below). These representations are called key (orange), query (red), and value (purple). For this example, suppose we want these representations to have a dimension of 3. Because every input has a dimension of 4, each set of weights must have a shape of 4×3._

_To obtain these representations, every input (green) is multiplied with a set of weights for keys, a set of weights for queries, and a set of weights for values._

<img src='https://miro.medium.com/max/512/1*VPvXYMGjv0kRuoYqgFvCag.gif'/>

_In our example, we initialise the three sets of weights as follows._

_Weights for key:_

    [[0, 0, 1],
     [1, 1, 0],
     [0, 1, 0],
     [1, 1, 0]]

_Weights for query:_

    [[1, 0, 1],
     [1, 0, 0],
     [0, 0, 1],
     [0, 1, 1]]

_Weights for value:_

    [[0, 2, 0],
     [0, 3, 0],
     [1, 0, 3],
     [1, 1, 0]]

#### ___Derive key, query and value___

_Now that we have the three sets of weights, let’s actually obtain the key, query and value representations for every input._

_Key representation for Input 1:_

                   [0, 0, 1]
    [1, 0, 1, 0] x [1, 1, 0] = [0, 1, 1]
                   [0, 1, 0]
                   [1, 1, 0]

_Use the same set of weights to get the key representation for Input 2:_

                   [0, 0, 1]
    [0, 2, 0, 2] x [1, 1, 0] = [4, 4, 0]
                   [0, 1, 0]
                   [1, 1, 0]

_Use the same set of weights to get the key representation for Input 3:_

                   [0, 0, 1]
    [1, 1, 1, 1] x [1, 1, 0] = [2, 3, 1]
                   [0, 1, 0]
                   [1, 1, 0]

_A faster way is to vectorise the above operations:_

                   [0, 0, 1]
    [1, 0, 1, 0]   [1, 1, 0]   [0, 1, 1]
    [0, 2, 0, 2] x [0, 1, 0] = [4, 4, 0]
    [1, 1, 1, 1]   [1, 1, 0]   [2, 3, 1]

<img src='https://miro.medium.com/max/512/1*dr6NIaTfTxEWzxB2rc0JWg.gif'/>

_Let’s do the same to obtain the value representations for every input:_

                   [0, 2, 0]
    [1, 0, 1, 0]   [0, 3, 0]   [1, 2, 3] 
    [0, 2, 0, 2] x [1, 0, 3] = [2, 8, 0]
    [1, 1, 1, 1]   [1, 1, 0]   [2, 6, 3]
    
<img src='https://miro.medium.com/max/512/1*5kqW7yEwvcC0tjDOW3Ia-A.gif'/>  
    
_and finally the query representations:_

                   [1, 0, 1]
    [1, 0, 1, 0]   [1, 0, 0]   [1, 0, 2]
    [0, 2, 0, 2] x [0, 0, 1] = [2, 2, 2]
    [1, 1, 1, 1]   [0, 1, 1]   [2, 1, 3]
    
<img src='https://miro.medium.com/max/512/1*wO_UqfkWkv3WmGQVHvrMJw.gif'/>
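
_The three vectorised products above can be reproduced in a few lines of NumPy. The variable names are our own; the numbers are the inputs and weights from this example._

```python
import numpy as np

# Three inputs, each of dimension 4 (one input per row).
inputs = np.array([[1, 0, 1, 0],
                   [0, 2, 0, 2],
                   [1, 1, 1, 1]])

# One 4x3 weight matrix per representation.
w_key = np.array([[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]])
w_query = np.array([[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]])
w_value = np.array([[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]])

# (3x4) @ (4x3) -> (3x3): row i is the representation of input i.
keys = inputs @ w_key
queries = inputs @ w_query
values = inputs @ w_value

print(keys)
print(queries)
print(values)
```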

#### ___Calculate attention scores for Input 1___

<img src='https://miro.medium.com/max/512/1*u27nhUppoWYIGkRDmYFN2A.gif'/>

_To obtain attention scores, we take the dot product of Input 1’s query (red) with every key (orange), including its own. Since there are 3 key representations (because we have 3 inputs), we obtain 3 attention scores (blue). In the calculation below, the keys are written as the columns of the matrix (that is, the key matrix is transposed)._

                [0, 4, 2]
    [1, 0, 2] x [1, 4, 3] = [2, 4, 4]
                [1, 0, 1]

_The above operation is known as **dot product attention**._

_Notice that we only use the query from Input 1. Later we’ll repeat this same step for the other queries._
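
_In NumPy, the scores for Input 1 are a single matrix-vector product against the transposed keys. The sketch uses the key and query values derived in this example. Note that the original Transformer additionally divides the scores by the square root of the key dimension; this walkthrough leaves that scaling out, and so does the sketch._

```python
import numpy as np

# Key representations from the example, one key per row.
keys = np.array([[0, 1, 1],
                 [4, 4, 0],
                 [2, 3, 1]], dtype=float)

# Query representation of Input 1.
query_1 = np.array([1, 0, 2], dtype=float)

# Dot product of the query with every key (keys.T puts keys in columns).
scores_1 = query_1 @ keys.T
print(scores_1)
```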

#### ___Calculate softmax___

<img src='https://miro.medium.com/max/512/1*jf__2D8RNCzefwS0TP1Kyg.gif'/>

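_Take the softmax across these attention scores (blue). The result below is rounded coarsely for readability; the exact values are approximately [0.06, 0.47, 0.47]._

    softmax([2, 4, 4]) ≈ [0.0, 0.5, 0.5]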
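
_For reference, a minimal softmax in NumPy gives the unrounded weights. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result._

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax of a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


weights_1 = softmax(np.array([2.0, 4.0, 4.0]))
print(np.round(weights_1, 4))
```

_The weights come out at roughly 0.06, 0.47 and 0.47. The rest of the walkthrough continues with the coarser values 0.0, 0.5 and 0.5 to keep the arithmetic easy to follow._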
    
    
#### ___Multiply scores with values___

<img src='https://miro.medium.com/max/512/1*9cTaJGgXPbiJ4AOCc6QHyA.gif'/>

_The softmaxed attention score for each input (blue) is multiplied with its corresponding value (purple). This results in 3 alignment vectors (yellow). In this tutorial, we’ll refer to them as weighted values._

    1: 0.0 * [1, 2, 3] = [0.0, 0.0, 0.0]
    2: 0.5 * [2, 8, 0] = [1.0, 4.0, 0.0]
    3: 0.5 * [2, 6, 3] = [1.0, 3.0, 1.5]


#### ___Sum weighted values to get Output 1___

<img src='https://miro.medium.com/max/512/1*1je5TwhVAwwnIeDFvww3ew.gif'/>

_Take all the weighted values (yellow) and sum them element-wise:_

      [0.0, 0.0, 0.0]
    + [1.0, 4.0, 0.0]
    + [1.0, 3.0, 1.5]
    -----------------
    = [2.0, 7.0, 1.5]

_The resulting vector [2.0, 7.0, 1.5] (dark green) is Output 1, which is based on the query representation from Input 1 interacting with all keys, including its own._
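
_The weighting and summing steps can be checked together in NumPy. This sketch deliberately uses the rounded weights 0.0, 0.5 and 0.5 from the walkthrough so that it reproduces the numbers shown; with unrounded softmax weights the output would differ slightly._

```python
import numpy as np

# Value representations from the example, one value per row.
values = np.array([[1, 2, 3],
                   [2, 8, 0],
                   [2, 6, 3]], dtype=float)

# Rounded softmax weights for Input 1, as used in the walkthrough.
rounded_weights = np.array([0.0, 0.5, 0.5])

# Scale each value row by its weight, then sum the rows element-wise.
weighted_values = rounded_weights[:, None] * values
output_1 = weighted_values.sum(axis=0)

print(weighted_values)
print(output_1)
```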

#### ___Repeat for Input 2 & Input 3___

_Now that we’re done with Output 1, we repeat Steps 4 to 7 for Input 2 and Input 3 to get Output 2 and Output 3. I trust that I can leave you to work out the operations yourself._

_Finally!!!_

<img src='https://miro.medium.com/max/512/1*G8thyDVqeD8WHim_QzjvFg.gif'/>
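
_If you would like to check your hand calculations, one possible way is to compute all three outputs at once with matrix operations. The sketch below starts from the query, key and value representations of this example and uses unrounded softmax weights, so Output 1 comes out slightly different from the rounded [2.0, 7.0, 1.5]. As in the walkthrough, the scores are not scaled by the square root of the key dimension._

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


queries = np.array([[1, 0, 2], [2, 2, 2], [2, 1, 3]], dtype=float)
keys = np.array([[0, 1, 1], [4, 4, 0], [2, 3, 1]], dtype=float)
values = np.array([[1, 2, 3], [2, 8, 0], [2, 6, 3]], dtype=float)

scores = queries @ keys.T            # row i holds the scores for query i
weights = softmax(scores, axis=-1)   # each row sums to 1
outputs = weights @ values           # row i is Output i

print(scores)
print(np.round(outputs, 4))
```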


___Note___: _The dimension of query and key must always be the same because of the dot product score function. However, the dimension of value may be different from query and key. The resulting output will consequently follow the dimension of value._
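
_The note on dimensions can be verified with a small shape experiment. The sizes below (query and key dimension 3, value dimension 5) and the random contents are arbitrary choices for illustration._

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 3))   # 3 inputs, query dimension 3
keys = rng.normal(size=(3, 3))      # key dimension must match the query dimension
values = rng.normal(size=(3, 5))    # value dimension is free to differ

outputs = softmax(queries @ keys.T) @ values
print(outputs.shape)
```

_The output has one row per input and as many columns as the value representation, here 5._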