# TRANSFORMER

## Attention mechanism explained

Attention is a technique that lets a model focus on the most important information and learn from it more fully. It is not a complete model in itself, but a mechanism that can be used in any sequence model.

## Seq2Seq

Before explaining Attention, let's briefly review the Seq2Seq model. Traditional machine translation is largely based on the Seq2Seq model. The model is divided into an encoder layer and a decoder layer, both composed of RNNs or RNN variants, as follows:

![alt](https://i.imgur.com/218g1E5.gif)


As pictured, in the encoding phase the first node takes a word as input, and each subsequent node takes the next word together with the hidden state of the previous node. Finally, the encoder outputs a context, which is used as the input of the decoder. Each decoder node outputs a translated word and passes its hidden state on as the input of the next node. This model works well for translating short texts, but it has shortcomings: with longer texts, it easily loses some of the information in the text. Attention was introduced to solve this problem.

## Attention

As the name implies, with Attention the model selects, at each decoding step, the context that is most suitable for the current node as input. Attention differs from the traditional Seq2Seq model mainly in the following two points.

1) The encoder provides more data to the decoder. The encoder will provide the hidden state of all nodes to the decoder, not just the hidden state of the last node of the encoder.

![alt](https://i.imgur.com/mQspchZ.gif)

2) The decoder does not directly take all the hidden states provided by the encoder as input, but uses a selection mechanism to pick the hidden states that best match the current position. The specific steps are as follows:

   - Determine which hidden states are most closely related to the current node

   - Calculate a score for each hidden state (how to calculate it is explained below)

   - Apply a softmax to the scores, which makes the weights of highly correlated hidden states larger and the weights of less correlated hidden states smaller.

Here we take a specific example to see the detailed calculation steps:

![alt](https://i.imgur.com/KlwC95p.gif)

Take the dot product of the hidden state of each encoder node with the hidden state of the previous decoder node. As shown above, h1, h2, and h3 are each multiplied by the hidden state of the node preceding the current decoder node (for the first decoder node, a hidden state has to be initialized randomly). This gives three values, which are the scores of the hidden states mentioned above. Note that the score is different for each encoder node. A softmax is then applied to the scores; the results are the weights of each encoder hidden state with respect to the current node. Each weight is multiplied by its original hidden state and the products are summed, and the result is the hidden state of the current node. The key to Attention is therefore how this score is calculated.
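
The calculation described above can be sketched in a few lines of NumPy. This is only an illustration under assumed toy values: the vectors `encoder_states` and `decoder_prev` are made up, and the plain dot product is used as the score function.

```python
import numpy as np

def softmax(x):
    # Subtracting the max improves numerical stability and does not change the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical toy values: three encoder hidden states (h1, h2, h3) of size 4.
encoder_states = np.array([
    [1.0, 0.0, 1.0, 0.0],   # h1
    [0.0, 2.0, 0.0, 2.0],   # h2
    [1.0, 1.0, 1.0, 1.0],   # h3
])
# Hidden state of the previous decoder node (made up here).
decoder_prev = np.array([1.0, 0.5, 0.0, 0.5])

scores = encoder_states @ decoder_prev   # one dot-product score per encoder state
weights = softmax(scores)                # normalized weights that sum to 1
context = weights @ encoder_states       # weighted sum of the encoder hidden states
```

Here h2 and h3 receive the same score and therefore the same weight, while h1 gets a smaller weight.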

After understanding how each node obtains the hidden state, the next step is the working principle of the decoder layer. The specific process is as follows:

The first decoder node initializes a vector and calculates the hidden state of the current node as described above. This hidden state is used as the input of the first node, which produces a new hidden state and an output value after passing through the RNN node. Note that this is a major difference from Seq2Seq. Seq2Seq directly uses the output value as the output of the current node, whereas Attention concatenates this value with the hidden state and sends the concatenated vector, as the context, to a feedforward neural network. That network determines the output of the current node. These steps are repeated until all decoder nodes have produced their output.

![alt](https://i.imgur.com/jVomlwp.gif)

The Attention model does not just blindly align the first word of the output with the first word of the input. In fact, it learns how to align words in the language pair (French and English in the example) during the training phase. In essence, the attention function can be described as mapping a query and a set of key-value pairs to an output.

![alt](https://i.imgur.com/XIKY7yS.png)

There are three main steps in calculating attention. The first step is to calculate the similarity between the query and each key to obtain the weights; commonly used similarity functions include the dot product, concatenation, and a perceptron. The second step is generally to normalize these weights with a softmax function. Finally, the weights are used to compute a weighted sum of the corresponding values, which gives the final attention. In current NLP research, key and value are often the same, that is, key = value.

![alt](https://i.imgur.com/uWKY6Qt.png)
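
The three steps can also be written compactly. The notation here is introduced only for illustration: $f$ stands for whichever similarity function is chosen, $s_i$ for the raw score of key $K_i$, and $a_i$ for its normalized weight.

$$
s_i = f(Q, K_i), \qquad a_i = \frac{\exp(s_i)}{\sum_j \exp(s_j)}, \qquad \mathrm{Attention}(Q, K, V) = \sum_i a_i V_i
$$

With the dot product as the similarity function, $f(Q, K_i) = Q \cdot K_i$.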

## Transformer model explanation

Next I will introduce the paper "Attention Is All You Need". This paper was posted on arXiv by Google's machine translation team in June 2017 and published at NIPS 2017. At the time of writing, Google Scholar shows 2203 citations, which indicates that it has received widespread attention and application. The main highlights of this paper are:


1) Unlike the previous mainstream machine translation approach, which used the RNN-based seq2seq framework, this paper builds the entire model on the attention mechanism instead of RNNs.
2) It proposes the multi-head attention mechanism and makes extensive use of multi-head self-attention in the encoder and decoder.
3) It achieves leading results on the English-German and English-French tasks of the WMT 2014 corpus, and trains faster than mainstream models.

"Attention Is All You Need" is a paper put forward by Google to bring Attention to the limit. This paper proposes a brand new model called Transformer, which abandons CNN and RNN used in previous deep learning tasks. Bert is based on Transformer. This model is widely used in the field of NLP, such as machine translation, question answering system, text Digest and speech recognition and more. For the understanding of the Transrofmer model, a foreign blogger article "The Illustrated Transformer" is particularly recommended.

## Transformer overall structure

Like the Attention model, the Transformer model also uses the encoder-decoder architecture, but its structure is more complicated. In the paper, the encoder layer is a stack of 6 encoders, and the decoder layer is likewise a stack of 6 decoders.

![alt](image/3.jpg)

The internal simplified structure of each encoder and decoder is shown in the following figure.

![alt](image/5.jpg)

The encoder contains two layers: a self-attention layer and a feedforward neural network. Self-attention helps the current node look beyond the current word, so that it can capture the semantics of the context. The decoder also contains these two layers, but with an additional attention layer between them, which helps the current node obtain the key content it needs to attend to.

Now that we know the main components of the model, let's look at its internal details. First, the model performs an embedding operation on the input data (which can be understood as an operation similar to word2vec). After embedding, the data is fed to the encoder layer. Self-attention processes the data and sends it to the feedforward neural network. The feedforward computation can be performed in parallel, and its output is passed to the next encoder.

![alt](image/8.jpg)
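
To illustrate why the feedforward computation can run in parallel, here is a minimal NumPy sketch. It assumes the position-wise form used in the paper, FFN(x) = max(0, xW1 + b1)W2 + b2. The sizes are toy values chosen for this example, and the random weights stand in for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 3   # toy sizes; the paper uses d_model=512, d_ff=2048

# Randomly initialized weights stand in for trained parameters.
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(seq_len, d_model))                  # stand-in for the self-attention output
all_at_once = feed_forward(x)                            # all positions in one matrix product
one_by_one = np.stack([feed_forward(row) for row in x])  # same result, one position at a time
```

Because each position is transformed independently, processing the whole sequence at once gives the same result as processing the words one by one.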

## Self-Attention

Let's take a closer look at self-attention. The idea is similar to attention, but self-attention is the method the Transformer uses to incorporate the "understanding" of other related words into the word currently being processed. Let's look at an example:
   
<span style="background-color: red">The animal didn't cross the street because it was too tired</span>

Whether "it" refers to the animal or the street is easy for us to judge, but difficult for a machine. Self-attention lets the machine associate "it" with "animal".

![alt](image/9.jpg)

Next, let's look at the detailed process.

1. First, self-attention calculates three new vectors for each word, which we call Query, Key, and Value. In the paper, the embedding vectors have dimension 512. Each of the three vectors is the result of multiplying the embedding vector by a matrix. The matrix is randomly initialized with shape (64, 512); note that its second dimension must equal the embedding dimension. Its values are updated during backpropagation. The three resulting vectors have dimension 64, which is smaller than the embedding dimension.

![alt](image/10.jpg)

So what are these three vectors, Query, Key, and Value? They are very important for attention, and their roles will become clear from the steps below.
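
As a rough sketch of step 1, the following NumPy snippet builds the three vectors for a single word. The embedding `x` and the three matrices are random stand-ins; only the shapes (512 for the embedding, 64 for the resulting vectors) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64            # embedding size and Query/Key/Value size from the text

x = rng.normal(size=d_model)      # embedding of one word (random stand-in)

# One randomly initialized (64, 512) matrix per vector type; training would update them.
W_q = rng.normal(size=(d_k, d_model))
W_k = rng.normal(size=(d_k, d_model))
W_v = rng.normal(size=(d_k, d_model))

# Multiplying a (64, 512) matrix by a 512-dimensional embedding gives a 64-dimensional vector.
q, k, v = W_q @ x, W_k @ x, W_v @ x
```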

2. Calculate the self-attention scores. The score determines how much attention to pay to the other parts of the input sentence when encoding the word at a given position. It is calculated as the dot product of Query and Key. Taking the following figure as an example, we calculate the score of every word with respect to the word "Thinking": the first score is q1·k1, and the score for the second word is q1·k2.

![alt](image/11.jpg)

3. Next, divide the dot products by a constant. Here we divide by 8, which is the square root of the first dimension of the matrix mentioned above (the square root of 64 is 8); other values could also be chosen. A softmax is then applied to the results. The output is the relevance of each word to the word at the current position. Naturally, the word at the current position will usually have a high relevance to itself.

![alt](image/12.jpg)

4. The next step is to multiply each Value vector by its softmax weight and add the results together. The sum is the self-attention output at the current node.

![alt](image/13.jpg)
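
Steps 2 to 4 for the word at the first position can be sketched as follows. The Query, Key, and Value vectors are random stand-ins for a hypothetical two-word input, and the names `q1`, `keys`, `values`, and `z1` are chosen for this example only.

```python
import numpy as np

d_k = 64
rng = np.random.default_rng(1)

# Random stand-ins for a two-word input.
q1 = rng.normal(size=d_k)               # Query of the first word
keys = rng.normal(size=(2, d_k))        # k1, k2
values = rng.normal(size=(2, d_k))      # v1, v2

scores = keys @ q1                      # step 2: q1·k1 and q1·k2
scaled = scores / np.sqrt(d_k)          # step 3: divide by sqrt(64) = 8
weights = np.exp(scaled - scaled.max())
weights /= weights.sum()                # softmax over the scaled scores
z1 = weights @ values                   # step 4: weighted sum of the Value vectors
```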

In practice, to speed up the computation, we use matrices. The embedding matrix is multiplied by the three weight matrices to obtain the Query, Key, and Value matrices directly. Then Q is multiplied by the transpose of K, the result is scaled by a constant, a softmax is applied, and finally the result is multiplied by the V matrix.

![alt](image/14.jpg)

![alt](image/15.jpg)

This method of determining the weight distribution over values through the similarity between query and key is called scaled dot-product attention. In fact, scaled dot-product attention is just the attention described earlier with the dot product as the similarity function, except that the scores are additionally divided by a scaling factor (the square root of the dimension of K) so that the inner products do not become too large.
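
In matrix form this is $\mathrm{softmax}(QK^T / \sqrt{d_k})V$. A minimal NumPy sketch follows, using toy sizes and random weights. It assumes a row-vector convention (one word per row), so the weight matrices here have shape `(d_model, d_k)`.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of every query with every key
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 16, 8          # toy sizes, not the paper's
X = rng.normal(size=(seq_len, d_model))   # one embedding per row
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Z, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Row `i` of `Z` is the self-attention output for word `i`, and row `i` of `weights` shows how strongly that word attends to every word in the sequence.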

## Multi-Headed Attention

What makes this paper even more powerful is that it adds another mechanism to self-attention, called "multi-headed" attention. The mechanism is simple to understand: instead of initializing only one set of Q, K, V matrices, we initialize multiple sets. The Transformer uses 8 sets, so the final result is 8 matrices.

![alt](image/16.jpg)

![alt](image/17.jpg)

This leaves us with a small challenge: the feedforward neural network cannot take 8 matrices as input. So we need a way to reduce the 8 matrices to 1. First, we concatenate the 8 matrices into one large matrix. Then we multiply this combined matrix by another randomly initialized matrix to get the final matrix.

![alt](image/18.jpg)

This is the entire process of multi-headed attention. There are quite a lot of matrices involved, so let's put them all in one picture to see the overall process.

![alt](image/19.jpg)

The entire multi-head attention process can be briefly described as follows. Query, Key, and Value first go through a linear transformation and are then fed into scaled dot-product attention. This is done h times, each time computing one so-called head, and each time with a different parameter matrix W for the linear transformations of Q, K, and V. The h scaled dot-product attention results are then concatenated, and a final linear transformation yields the multi-head attention result. The distinguishing feature of the multi-head attention proposed by Google is thus that attention is computed h times instead of only once. The advantage mentioned in the paper is that this allows the model to learn relevant information in different representation subspaces, which is illustrated below with an attention visualization.
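
The whole procedure can be sketched in NumPy as below. This is an illustrative sketch, not a reference implementation: the sizes are toy values (only the 8 heads come from the text), all weights are random stand-ins, and `W_o` is a name chosen here for the final projection matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for one head.
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 64, 8            # toy d_model; the text uses 8 heads
d_k = d_model // h

X = rng.normal(size=(seq_len, d_model))
# One (W_q, W_k, W_v) triple per head, randomly initialized as stand-ins.
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))  # final projection applied after concatenation

head_outputs = [attention(X @ W_q, X @ W_k, X @ W_v) for W_q, W_k, W_v in heads]
concat = np.concatenate(head_outputs, axis=-1)   # (seq_len, h * d_k)
Z = concat @ W_o                                 # (seq_len, d_model)
```

The result `Z` is a single matrix with one row per word, which is the shape the feedforward network expects.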

Now that we have introduced attention heads, let's re-examine our previous example to see how the word "it" in the example sentence is attended to differently by different attention heads (different colors represent different heads; the darker the color, the greater the attention value).

![alt](image/20.jpg)

When we encode the word "it", one head focuses mainly on "animal" and another focuses on "tired" (two heads).
But if we add all the attention heads to the picture, it becomes harder to interpret:

![alt](image/21.jpg)

## Positional Encoding

So far, the Transformer model lacks a way to account for the order of words in the input sequence. To deal with this problem, the Transformer adds an extra vector to the inputs of the encoder layer and the decoder layer. This vector makes it possible to determine the position of the current word, or the distance between different words in a sentence. There are many ways to compute this position vector; the one used in the paper is:

<center><h3>$PE(pos, 2i) = \sin(pos / 10000^{2i / d_{model}})$</h3></center>
<center><h3>$PE(pos, 2i + 1) = \cos(pos / 10000^{2i / d_{model}})$</h3></center>

*pos* refers to the position of the current word in the sentence, and *i* indexes the dimensions of the vector. Even dimensions use sine encoding and odd dimensions use cosine encoding. Finally, this Positional Encoding is added to the embedding, and the sum is sent as input to the next layer.

![alt](image/22.jpg)

For the model to capture word order, we add position-encoding vectors (POSITIONAL ENCODING). The position-encoding vector does not need to be trained; it is generated by a fixed rule (the formula given above).

If our embedding dimension is 4, then the actual position encoding is as shown below:

![alt](image/23.jpg)
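
The sine and cosine formula from the paper can be sketched in NumPy as follows. The function name `positional_encoding` and the `max_len` parameter are chosen for this example, and the sketch assumes an even `d_model`.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model / 2)
    angle = pos / np.power(10000.0, 2 * i / d_model)   # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                        # even dimensions 2i
    pe[:, 1::2] = np.cos(angle)                        # odd dimensions 2i + 1
    return pe

# Embedding dimension 4, as in the example above; three positions.
pe = positional_encoding(max_len=3, d_model=4)
```

For position 0 the encoding is `[0, 1, 0, 1]`, since sin(0) = 0 and cos(0) = 1. In the model, `pe` would be added element-wise to the word embeddings.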

## Layer normalization

In the Transformer, each sub-layer (self-attention, FFNN) is followed by a residual connection and a layer normalization step.

![alt](image/24.jpg)

To further explore the internal calculation, we can visualize the above layer as in the following figure:

![alt](image/25.jpg)

The residual module should be familiar to most readers and is not explained again here; we mainly explain layer normalization. There are many types of normalization, but they share a common purpose: to transform the input into data with a mean of 0 and a variance of 1. We normalize the data before sending it to the activation function because we do not want the input to fall in the saturation region of the activation function.
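
A minimal sketch of the residual connection followed by layer normalization is shown below. The inputs are random stand-ins, and the learned scale and shift parameters that implementations usually add after normalization are omitted to keep the example short.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position (row) over its feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))               # sub-layer input (toy sizes)
sublayer_out = rng.normal(size=(3, 8))    # stand-in for the self-attention or FFNN output

y = layer_norm(x + sublayer_out)          # residual connection, then layer normalization
```

After this step, every row of `y` has a mean of approximately 0 and a variance of approximately 1.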

This completes the content of an encoder. Stacking two encoders gives the structure shown below. One last point to emphasize about the self-attention sub-layer is that it uses a shortcut (residual) structure, which is designed to address the degradation problem in deep learning.

![alt](image/26.jpg)

## Mask

A mask hides certain values so that they have no effect when the parameters are updated. The Transformer model uses two types of masks: the padding mask and the sequence mask.

Among them, the padding mask is used in all scaled dot-product attention, and the sequence mask is only used in the decoder's self-attention.

### Padding Mask

What is a padding mask? Because the input sequences in a batch have different lengths, we need to align them. Specifically, shorter sequences are padded with zeros at the end. If an input sequence is too long, the content on the left is kept and the excess is discarded. Because the padding positions carry no meaning, the attention mechanism should not attend to them, so some processing is needed.

The specific approach is to add a very large negative number (effectively negative infinity) to the values at these positions, so that after the softmax the probabilities at these positions are close to 0.

The padding mask itself is a tensor of Boolean values, and the positions where the value is false are the ones we need to process.
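
The following NumPy sketch illustrates the idea for a single padded sequence. The token ids, the padding id `PAD = 0`, and the scores are made-up values, and `-1e9` stands in for negative infinity.

```python
import numpy as np

PAD = 0                                         # assumed id of the padding token
tokens = np.array([5, 7, 9, PAD, PAD])          # one sequence padded to length 5

padding_mask = tokens != PAD                    # True = real token, False = padding
scores = np.array([2.0, 1.0, 0.5, 3.0, 3.0])    # toy attention scores over the 5 keys

# Add a very large negative number wherever the mask is False.
masked = scores + np.where(padding_mask, 0.0, -1e9)
weights = np.exp(masked - masked.max())
weights /= weights.sum()                        # softmax: padded positions get weight 0
```

Even though the padded positions had the highest raw scores, their weights after the softmax are 0.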

### Sequence mask

As mentioned earlier, the sequence mask prevents the decoder from seeing future information. That is, for a sequence at time step t, the decoded output should depend only on the outputs before time t, not on those after t. So we need a way to hide the information after t.

So how is this done? It is quite simple: generate a triangular matrix in which all the values in the upper triangle are 0. Applying this matrix to each sequence achieves the goal.

   - For the decoder's self-attention, the scaled dot-product attention requires both a padding mask and a sequence mask as the attn_mask. In practice, the two masks are added together to form the attn_mask.
   
   - In other cases, attn_mask is always equal to the padding mask.
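
As a sketch of how the two masks might be combined for the decoder's self-attention, the snippet below converts both to additive form (0 where a position is visible, a large negative number where it is hidden) and adds them. The sequence length, the padding pattern, and the uniform scores are made-up values; `attn_mask` follows the name used above.

```python
import numpy as np

NEG = -1e9                                        # stand-in for negative infinity
seq_len = 4
is_token = np.array([True, True, True, False])    # padding mask: last position is padding

# Sequence mask: entries above the diagonal are 0 (hidden), the rest are 1 (visible).
sequence_mask = np.tril(np.ones((seq_len, seq_len)))

# Turn both masks into additive form and add them.
seq_add = np.where(sequence_mask == 1, 0.0, NEG)
pad_add = np.where(is_token, 0.0, NEG)[None, :]   # broadcast over query positions
attn_mask = seq_add + pad_add

scores = np.zeros((seq_len, seq_len))             # toy scores: all keys equally similar
masked = scores + attn_mask
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```

Row t of `weights` is nonzero only for positions up to t that are not padding, so the first row attends only to the first position and the second row splits its attention evenly between the first two.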
   
The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. Each decoder uses these vectors in its "encoder-decoder attention" layer, which helps the decoder focus on the appropriate places in the input sequence:

![alt](image/27.jpg)

After the encoding phase is complete, the decoding phase begins. Each step of the decoding phase outputs one element of the output sequence (in this case, the English translation).
These steps repeat until a special symbol is produced indicating that the decoder has finished its output. The output of each step is fed to the bottom decoder at the next time step. Just as we did with the encoder inputs, we embed these decoder inputs and add positional encodings to them to indicate the position of each word.

![alt](image/28.jpg)

### Output layer

After the decoder layer has finished, how do we map the resulting vector to the words we need? This is simple: add a fully connected layer and a softmax layer at the end. If our dictionary has 10,000 words, the final softmax outputs a probability for each of the 10,000 words, and the word with the highest probability is the final result.

![alt](image/29.jpg)
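
The output layer can be sketched in NumPy as below. The decoder output and the weights of the fully connected layer are random stand-ins, `d_model = 16` is a toy size, and the vocabulary size of 10,000 follows the example in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 16, 10000               # toy d_model; 10,000-word dictionary

decoder_out = rng.normal(size=d_model)        # decoder output vector for one position
W = rng.normal(size=(d_model, vocab_size))    # fully connected layer (random stand-in)
b = np.zeros(vocab_size)

logits = decoder_out @ W + b                  # one score per word in the dictionary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax over the dictionary
predicted_id = int(np.argmax(probs))          # index of the most probable word
```

Looking up `predicted_id` in the dictionary gives the output word for this step.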

