# Attention

### Attention is attractive in machine learning because it lets a model dynamically highlight and use the salient parts of the information at hand, much as attention does in the human brain.

### Think of an attention-based system consisting of three components:
1. A process that “reads” raw data (such as source words in a source sentence) and converts them into distributed representations, with one feature vector associated with each word position.
2. A list of feature vectors storing the output of the reader. This can be understood as a “memory” containing a sequence of facts, which can be retrieved later, not necessarily in the same order, without having to visit all of them.
3. A process that “exploits” the content of the memory to sequentially perform a task, at each time step having the ability to put attention on the content of one memory element (or a few, with a different weight).

### Let’s take the encoder-decoder framework as an example, since it is within this framework that the attention mechanism was first introduced. An input sequence of words is first fed into an encoder, which outputs a vector for every element in the sequence. This corresponds to the first component of our attention-based system, as explained above. The list of these vectors (the second component of the attention-based system above), together with the decoder’s previous hidden states, is then exploited by the attention mechanism to dynamically highlight which parts of the input will be used to generate the output.

### At each time step, the attention mechanism takes the previous hidden state of the decoder and the list of encoded vectors, and uses them to generate unnormalized score values that indicate how well the elements of the input sequence align with the current output. Since the score values only make sense relative to one another, they are normalized by passing them through a softmax function to generate the weights. Following the softmax normalization, all the weight values lie in the interval [0, 1] and add up to 1, meaning they can be interpreted as probabilities. Finally, the encoded vectors are scaled by the computed weights and summed to generate a context vector. This attention process forms the third component of the attention-based system above.
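
### The step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the method of any particular paper: the function names (`softmax`, `dot`, `attention_step`) are hypothetical, and a simple dot product is assumed as the scoring function, whereas real models typically learn their scoring function.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights in [0, 1] that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product, used here as an assumed (hypothetical) scoring function."""
    return sum(x * y for x, y in zip(a, b))

def attention_step(prev_decoder_state, encoded_vectors, score_fn=dot):
    """One decoding step: scores -> softmax weights -> context vector."""
    # Unnormalized alignment score between the decoder state and each encoded vector.
    scores = [score_fn(prev_decoder_state, h) for h in encoded_vectors]
    weights = softmax(scores)
    # Context vector: weighted sum of the encoded vectors.
    dim = len(encoded_vectors[0])
    context = [
        sum(w * h[k] for w, h in zip(weights, encoded_vectors))
        for k in range(dim)
    ]
    return weights, context

# Three encoded vectors (the "memory") and a previous decoder hidden state.
encoded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = attention_step(state, encoded)
```

### In this toy input, the first and third encoded vectors align equally well with the decoder state, so they receive equal weights, both larger than the weight of the second vector.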

### The idea is to be able to work with an artificial neural network that performs well on tasks where the input may be of variable length, size, or structure, or that can even handle several different tasks. It is in this spirit that attention mechanisms in machine learning are said to draw their inspiration from psychology, rather than to replicate the biology of the human brain.

### The task of the encoder is to generate a vector representation of the input, whereas the task of the decoder is to transform this vector representation into an output. The attention mechanism connects the two.

### Attention helps determine which of the encoded vectors should be used to generate the output. Because the output sequence is generated one element at a time, attention can dynamically highlight different encoded vectors at each time step. This allows the decoder to flexibly use the most relevant parts of the input sequence.

### More recently, Vaswani et al. (2017) proposed an entirely different architecture that has steered the field of machine translation in a new direction. Termed the Transformer, their architecture dispenses with recurrence and convolutions altogether and instead relies on a self-attention mechanism. Words in the source sequence are first encoded in parallel to generate key, query, and value representations. The keys and queries are combined to generate attention weightings that capture how each word relates to the others in the sequence. These attention weightings are then used to scale the values, in order to retain focus on the important words and drown out the irrelevant ones.

### The self-attention mechanism relies on the use of queries, keys, and values, which are generated by multiplying the encoder’s representation of the same input sequence with different weight matrices. The transformer uses dot product (or multiplicative) attention, where each query is matched against a database of keys by a dot product operation in the process of generating the attention weights. These weights are then multiplied by the values to generate a final attention vector.
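
### As a rough sketch of the computation described above, the following plain-Python example projects one input sequence into queries, keys, and values and applies dot-product attention. The helper names (`matmul`, `transpose`, `softmax_rows`, `self_attention`) and the identity weight matrices are assumptions made for illustration only; in a trained model the weight matrices are learned. The division by the square root of the key dimension follows the scaled dot-product form used in the Transformer.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row of a matrix."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(x, w_q, w_k, w_v):
    """Dot-product self-attention over one sequence x (one row per word)."""
    # Queries, keys, and values come from the same input, projected differently.
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d_k = len(k[0])
    # Match every query against every key, scaling by sqrt(d_k).
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)
    # Each output row is a weighted sum of the value vectors.
    return weights, matmul(weights, v)

# Toy sequence of three 2-dimensional word representations.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned weight matrices
weights, output = self_attention(x, identity, identity, identity)
```

### Each row of `weights` describes how strongly one word attends to every word in the sequence, and the matching row of `output` is the resulting attention vector for that word.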

## The Bahdanau Architecture

### To encode the input sentence, Bahdanau et al. employ a bidirectional RNN, which reads the input sentence in the forward direction to produce a forward hidden state $\overrightarrow{h_i}$, and then reads the input sentence in the reverse direction to produce a backward hidden state $\overleftarrow{h_i}$. The annotation for some particular word $x_i$ concatenates the two states:

$$h_i = \left[\overrightarrow{h_i}^{\top} ; \overleftarrow{h_i}^{\top}\right]^{\top}$$

### The idea behind generating each annotation in this manner was to capture a summary of both the preceding and succeeding words.
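
### A toy sketch can make the concatenation concrete. The example below assumes a deliberately simplified elementwise tanh RNN with scalar weights, which stands in for the learned recurrent units of the actual model; the names `run_rnn` and `annotate` and the weight values are hypothetical.

```python
import math

def run_rnn(inputs, w_in, w_rec):
    """Toy RNN (an assumption): h_t = tanh(w_in * x_t + w_rec * h_{t-1}), elementwise."""
    state = [0.0] * len(inputs[0])
    states = []
    for x in inputs:
        state = [math.tanh(w_in * xi + w_rec * si) for xi, si in zip(x, state)]
        states.append(state)
    return states

def annotate(inputs, w_in=0.5, w_rec=0.5):
    """Concatenate forward and backward hidden states for every position."""
    forward = run_rnn(inputs, w_in, w_rec)
    # Read the sentence in reverse, then flip the states back into source order.
    backward = run_rnn(inputs[::-1], w_in, w_rec)[::-1]
    return [f + b for f, b in zip(forward, backward)]

# Three word vectors of dimension 2 yield three annotations of dimension 4.
sentence = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
h = annotate(sentence)
```

### In each annotation, the forward half summarizes the words up to and including position i, while the backward half summarizes the words from position i to the end of the sentence.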

### The generated annotations are then passed to the decoder to generate the context vector.

## The Decoder

### The role of the decoder is to produce the target words by focusing on the most relevant information contained in the source sentence. For this purpose, it makes use of an attention mechanism.

## Luong Attention Mechanism

### Luong et al. (2015) propose two attentional models, a global one and a local one. The global attentional model resembles the Bahdanau et al. (2015) model in attending to all source words but aims to simplify it architecturally. The local attentional model is inspired by the hard and soft attention models of Xu et al. (2015) and attends to only a few of the source positions. The two attentional models share many of the steps in their prediction of the current word but differ mainly in their computation of the context vector. Let’s first take a look at the overarching Luong attention algorithm and then delve into the differences between the global and local attentional models afterward.