#**(15) Transformer (Deep Learning, Attention Mechanism Based)**

- A transformer is a deep learning model that adopts the mechanism of **self-attention**, differentially **weighting** the significance of each part of the input data. It is used primarily in the fields of natural language processing (NLP) and computer vision (CV).

- The Transformer model extracts features for each word using a self-attention mechanism that determines how important all the other words in the sentence are with respect to that word.

- A transformer neural network takes an input sentence in the form of a sequence of vectors, converts it into a sequence of vectors called encodings, and then decodes these into another sequence.

- An important part of the transformer is the attention mechanism. The attention mechanism represents how important other tokens in an input are for the encoding of a given token. For example, in a machine translation model, the attention mechanism allows the transformer to translate words like ‘it’ into a word of the correct gender in French or Spanish by attending to all relevant words in the original sentence.
-  The attention mechanism allows the transformer to focus on particular words on both the left and right of the current word in order to decide how to translate it.

<figure align="left">
<img src="https://drive.google.com/uc?id=1XImgT5NMp92zgpfvWpVmWVt3aOInXeKj" height="500px", width="600px"> 
</figure>

<figure align="left">
<img src="https://drive.google.com/uc?id=1oxM56il9d9zRa7B_gi2TDE8Sg9s8jB6m" height="500px", width="600px"> 
</figure>


- The transformer neural network receives an input sentence and converts it into two sequences: a sequence of word vector embeddings, and a sequence of positional encodings.

- The word vector embeddings are a numeric representation of the text. It is necessary to convert the words to the embedding representation so that a neural network can process them. In the embedding representation, each word in the dictionary is represented as a vector. The positional encodings are a vector representation of the position of the word in the original sentence.

- The transformer adds the word vector embeddings and positional encodings together and passes the result through a series of encoders, followed by a series of decoders.

- The encoders each convert their input into another sequence of vectors called encodings. The decoders do the reverse: they convert the encodings into a sequence of probabilities over the possible output words, which a final softmax function produces. Another natural language sentence is then obtained by choosing a word from these probabilities at each position, for example the most probable one.

- Each encoder and decoder contains a component called the attention mechanism, which allows the processing of one input word to include relevant data from certain other words, while giving little or no weight to words which do not contain relevant information.

- Because attention must be calculated many times, we implement multiple attention mechanisms in parallel, taking advantage of the parallel computing offered by GPUs. This is called the multi-head attention mechanism. The ability to pass all the words of a sequence through the network simultaneously, rather than one at a time, is one advantage of transformers over recurrent networks (RNNs) such as LSTMs.

- The most important part of a transformer neural network is the attention mechanism. The attention mechanism addresses the question of which parts of the input sequence the network should focus on when generating each element of the output.

- This is very important in translation. For example, the English “the red house” corresponds to “la casa roja” in Spanish: the two languages have different word orders.

- The attention mechanisms allow a decoder, while it is generating an output word, to focus more on relevant words or hidden states within the network, and focus less on irrelevant information.

- In practice attention is used in three different ways in a transformer neural network:

- (1) Encoder-decoder attention, as in the above example. This allows a decoder to attend over the input sequence when generating the output sequence.

- (2) Self-attention in the encoder. This allows an encoder to attend to all parts of the encoding output from the previous encoder.

- (3) Self-attention in the decoder. This allows a decoder to attend to all positions of the decoder sequence up to and including the current one.

- The attention mechanisms allow a model to draw information from input words and hidden states at any other point in the sentence.
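
The input step described earlier in this section, where word embeddings and positional encodings are added together before entering the first encoder, can be sketched as follows. This is a minimal illustration, not a definitive implementation: the toy `embedding_table` and its values are hypothetical, and the sinusoidal form of `positional_encoding` is an assumption (it is one commonly used scheme; the text above does not say which encoding is used).

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position (an assumed, commonly used scheme)."""
    encoding = []
    for i in range(d_model):
        # Each pair of dimensions (2k, 2k+1) shares one frequency.
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

# Hypothetical toy embedding table: one 4-dimensional vector per word.
# In a real model these vectors are learned.
embedding_table = {
    "the":   [0.1, 0.2, 0.3, 0.4],
    "red":   [0.5, 0.1, 0.0, 0.2],
    "house": [0.3, 0.3, 0.9, 0.1],
}

def embed_sentence(words, d_model=4):
    """Return (word embedding + positional encoding) for each word."""
    inputs = []
    for position, word in enumerate(words):
        word_vector = embedding_table[word]
        position_vector = positional_encoding(position, d_model)
        inputs.append([w + p for w, p in zip(word_vector, position_vector)])
    return inputs

encoder_inputs = embed_sentence(["the", "red", "house"])
```

Because the position vector differs for every position, the same word appearing at two different places in a sentence reaches the encoder as two different vectors.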







##**Self-Attention**

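- Self-attention is a sequence-to-sequence operation: a sequence of vectors goes in, and a sequence of vectors comes out. Let’s call the input vectors $x_1, x_2, \dots, x_t$ and the corresponding output vectors $y_1, y_2, \dots, y_t$. The vectors all have dimension k. To produce output vector $y_i$, the self-attention operation simply takes a weighted average over all the input vectors: $y_i = \sum_j w_{ij} x_j$. The weights $w_{ij}$ are not fixed parameters; they are derived from a function of $x_i$ and $x_j$, the simplest option being the dot product $x_i^T x_j$, normalized with a softmax so that the weights for each $y_i$ sum to one.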

- The self-attention mechanism of our model introduces three elements: Queries, Keys and Values.

- Every input vector $x_i$ is used in three different ways in the self-attention mechanism: as the Query, the Key and the Value. As the Query, it is compared to every other vector to establish the weights for its own output $y_i$. As the Key, it is compared to every other vector to establish the weights for the j-th output $y_j$. As the Value, it is used in the weighted sum that computes each output vector once the weights have been established.

- To obtain these roles, we need three weight matrices of dimensions k x k, with which we compute three linear transformations of each $x_i$:

$$q_i=W_qx_i, k_i=W_kx_i, v_i=W_vx_i$$


- Stacking the vectors $q_i$, $k_i$ and $v_i$ for all positions gives three matrices usually known as Q, K and V. They are produced by three learnable weight layers applied to the same encoded input. Consequently, because all three matrices come from the same input, the attention mechanism relates the input to itself, hence “self-attention”.
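
The basic operation described at the start of this section, before any learned query, key and value projections are introduced, can be sketched in plain Python. This is a minimal sketch under stated assumptions: the function names and the toy input `xs` are hypothetical, and the raw dot products are normalized with a softmax so that each output is a weighted average.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    largest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def basic_self_attention(xs):
    """Compute y_i = sum_j w_ij * x_j with w_ij = softmax_j(x_i . x_j)."""
    ys = []
    for x_i in xs:
        # One weight per input vector, based on its dot product with x_i.
        weights = softmax([dot(x_i, x_j) for x_j in xs])
        k = len(x_i)
        y_i = [sum(w * x_j[d] for w, x_j in zip(weights, xs)) for d in range(k)]
        ys.append(y_i)
    return ys

# Toy input: t = 3 vectors of dimension k = 2.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = basic_self_attention(xs)
```

Note that this version has no trainable parameters; the query, key and value weight matrices are what make the operation learnable.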


###**The Scaled Dot-Product Attention**
- The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$. We compute the dot product of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values.
- First we use the Q and K matrices to calculate the attention scores. The scores measure how much focus to place on other positions or words of the input sequence with respect to a word at a certain position. Each score is the dot product of the query vector of that position with the key vector of the word being scored. So, for position 1 we calculate the dot products $q_1 \cdot k_1$, $q_1 \cdot k_2$, $q_1 \cdot k_3$, and so on.

- Next we apply the scaling factor $1/\sqrt{d_k}$ to obtain more stable gradients. For large input values the softmax function saturates, which makes the gradients vanish and slows down learning. After the softmax we multiply by the Value matrix, which keeps the values of the words we want to focus on and minimizes or removes the values of irrelevant words (their softmax weights are very small).

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
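
The formula above can be traced step by step in plain Python. This is a hedged sketch rather than a reference implementation: the helper names, the toy input `X` and the weight matrices `W_q`, `W_k`, `W_v` are hypothetical (in a real model the weights are learned). Tokens are stored as rows here, so the per-vector projection $q_i=W_qx_i$ becomes $Q = XW_q^T$.

```python
import math

def matmul(a, b):
    """Matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(scores):
    largest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))  # entry (i, j) is q_i . k_j
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Toy input: t = 3 tokens with k = 2 features each (row i is x_i).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical k x k weight matrices, fixed by hand for illustration.
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.5, 0.0], [0.0, 0.5]]
W_v = [[1.0, 1.0], [0.0, 1.0]]

Q = matmul(X, transpose(W_q))
K = matmul(X, transpose(W_k))
V = matmul(X, transpose(W_v))
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to one, so each row of `output` is a weighted average of the rows of `V`.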

###**Multi-head Attention**

- Instead of performing a single attention function with $d_{model}$-dimensional keys, values and queries, it is beneficial to linearly project the queries, keys and values h times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively.
- On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding $d_v$-dimensional output values. These are concatenated and projected once more to give the final values.

- Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
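
A rough sketch of multi-head self-attention follows. Several points are assumptions made for illustration: the weights are random rather than learned, the sizes `d_model = 4` and `h = 2` are arbitrary, $d_k = d_v = d_{model}/h$ is one common choice, and the final output projection `W_o` (applied after concatenating the heads) is a name introduced here. The heads are computed in a simple loop; a real implementation would run them in parallel.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) projection triples, one per head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the h head outputs along the feature dimension.
    concatenated = [
        sum((head[i] for head in head_outputs), []) for i in range(len(X))
    ]
    return matmul(concatenated, W_o)

rng = random.Random(0)
d_model, h = 4, 2
d_k = d_v = d_model // h

X = random_matrix(3, d_model, rng)  # 3 tokens, one row each
heads = [
    (random_matrix(d_model, d_k, rng),
     random_matrix(d_model, d_k, rng),
     random_matrix(d_model, d_v, rng))
    for _ in range(h)
]
W_o = random_matrix(h * d_v, d_model, rng)
Y = multi_head_attention(X, heads, W_o)
```

The output `Y` has the same shape as the input `X` (one $d_{model}$-dimensional vector per token), which is what lets such layers be stacked.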

###**Feed-Forward Networks**

- In addition to attention sub-layers, each of the layers in our encoder and decoder contains a fully
connected feed-forward network, which is applied to each position separately and identically. This
consists of two linear transformations with a ReLU activation in between.
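
As a small illustration of "applied to each position separately and identically", the sketch below runs the same two linear transformations with a ReLU in between on every position. The function names, the bias terms `b1` and `b2`, and the hand-picked weights are assumptions for the example; real weights are learned and much larger.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(x, W, b):
    """Compute xW + b for a single position vector x."""
    return [sum(x_i * W[i][j] for i, x_i in enumerate(x)) + b[j]
            for j in range(len(b))]

def feed_forward(x, W1, b1, W2, b2):
    """Two linear transformations with a ReLU activation in between."""
    return linear(relu(linear(x, W1, b1)), W2, b2)

def position_wise_ffn(X, W1, b1, W2, b2):
    # The same weights are used at every position, and positions do not interact.
    return [feed_forward(x, W1, b1, W2, b2) for x in X]

# Hypothetical weights: model dimension 2, inner dimension 3.
W1 = [[1.0, -1.0, 0.5], [0.0, 1.0, -0.5]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]

X = [[1.0, 2.0], [1.0, 2.0], [-1.0, 0.0]]  # three positions
Y = position_wise_ffn(X, W1, b1, W2, b2)
```

The first two positions hold identical vectors and therefore produce identical outputs, regardless of where they sit in the sequence.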

###**Decoder**
- Each decoder consists of three major components: a self-attention mechanism, an attention mechanism over the encodings, and a feed-forward neural network. The decoder functions in a similar fashion to the encoder, but an additional attention mechanism is inserted, which draws relevant information from the encodings generated by the encoders.

- Like the first encoder, the first decoder takes positional information and embeddings of the output sequence as its input, rather than encodings. The transformer must not use the current or future output to predict an output, so the output sequence must be partially masked to prevent this reverse information flow. The last decoder is followed by a final linear transformation and softmax layer, to produce the output probabilities over the vocabulary.
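
One common way to realize this partial masking is to set the attention scores of disallowed positions to negative infinity before the softmax, so that they receive zero weight. The sketch below assumes that the decoder input is the output sequence shifted right by one position, so letting position i attend to positions up to and including i exposes only outputs that precede the one being predicted. The function names and the uniform toy scores are hypothetical.

```python
import math

def masked_softmax(scores, allowed):
    """Softmax over the allowed entries; disallowed entries get weight 0."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    largest = max(masked)  # finite, because each position may attend to itself
    exps = [math.exp(s - largest) for s in masked]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Row i may only attend to columns j <= i."""
    t = len(scores)
    return [
        masked_softmax(row, [j <= i for j in range(t)])
        for i, row in enumerate(scores)
    ]

# Toy scores for 4 decoder positions; all equal, to make the mask visible.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
```

With equal scores, row 0 puts all of its weight on position 0, row 1 splits its weight evenly over positions 0 and 1, and so on; every entry above the diagonal is zero.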

###**Embeddings and Softmax**

- Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{model}$. We also use the usual learned linear transformation and softmax function to convert the decoder output to predicted next-token probabilities.
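
The final step, from a decoder output vector to next-token probabilities, might look like the sketch below. The three-word `vocabulary`, the weights `W` and `b`, and the `decoder_output` vector are all hypothetical, and picking the most probable token (greedy decoding) is only one possible way to choose the next word; sampling and beam search are alternatives.

```python
import math

def softmax(logits):
    largest = max(logits)  # subtracted for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, W, b):
    """Compute xW + b: one logit per vocabulary entry."""
    return [sum(x_i * W[i][j] for i, x_i in enumerate(x)) + b[j]
            for j in range(len(b))]

# Hypothetical toy setup: model dimension 2, vocabulary of 3 tokens.
vocabulary = ["la", "casa", "roja"]
W = [[1.0, 0.0, -1.0], [0.0, 2.0, 1.0]]
b = [0.0, 0.0, 0.0]

decoder_output = [0.2, 1.0]  # output of the last decoder at one position
logits = linear(decoder_output, W, b)
probabilities = softmax(logits)

# Greedy choice: take the token with the highest probability.
predicted = vocabulary[probabilities.index(max(probabilities))]
```

Here the largest logit belongs to the second vocabulary entry, so `predicted` is `"casa"`.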