# Attention!

Many seq2seq tasks behave in a "non-holistic" way: while the output is being generated, not all of the input is equally important at every step. Certain elements are well worth "attending to" at some points and are practically irrelevant at others. Despite this, the plain encoder-decoder model is constrained to a single summarized representation of the input and cannot access the relevant earlier hidden states of the encoder. Early on, some tricks were applied to mitigate this effect, such as feeding the input twice or in reverse order, but the real solution proved to be the so-called **"attention mechanism"** (an idea coming from image processing).

<img src="https://cdn-images-1.medium.com/max/1600/0*SY3nv8-J6qX1GUxk.png">

In each step, the decoder receives its prior hidden state and output, as well as a _weighted sum_ of all hidden states of the encoder as context.

Context in the $i$-th step of the decoder:

$$ c_i = \sum_{j=1}^{T}\alpha_{ij}h_j$$

where the weight of each $h_k$ encoder hidden state is derived from a score generated by a trained feedforward network $A$:

$$e_{ik} = A(h_k, s_{i-1})$$

(where the inputs are the $h_k$ encoder state and $s_{i-1}$, the prior hidden state of the decoder), and the $\alpha_{ij}$ weights are obtained by applying a softmax to these scores:

$$\alpha_{ij} = \frac{\exp e_{ij}}{\sum_{k=1}^{T}\exp e_{ik}}$$
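
The three formulas above can be sketched in a few lines of NumPy. This is only an illustration: the scoring network $A$ is assumed to have the additive form $v^\top \tanh(W_h h_k + W_s s_{i-1})$ used by Bahdanau et al., the weights are random stand-ins for trained parameters, and all names (`additive_scores`, `W_h`, `W_s`, `v`) are made up for the example.

```python
import numpy as np

def softmax(x):
    # Subtracting the max does not change the result but avoids overflow.
    z = np.exp(x - np.max(x))
    return z / z.sum()

def additive_scores(H, s_prev, W_h, W_s, v):
    # e_ik = v^T tanh(W_h h_k + W_s s_{i-1}), one score per encoder state h_k.
    return np.tanh(H @ W_h.T + s_prev @ W_s.T) @ v

rng = np.random.default_rng(0)
T, enc_dim, dec_dim, att_dim = 5, 4, 3, 6
H = rng.normal(size=(T, enc_dim))            # encoder states h_1..h_T
s_prev = rng.normal(size=dec_dim)            # prior decoder state s_{i-1}
W_h = rng.normal(size=(att_dim, enc_dim))    # random stand-ins for trained weights
W_s = rng.normal(size=(att_dim, dec_dim))
v = rng.normal(size=att_dim)

e = additive_scores(H, s_prev, W_h, W_s, v)  # scores e_ik, shape (T,)
alpha = softmax(e)                           # attention weights alpha_ij
context = alpha @ H                          # c_i = sum_j alpha_ij h_j
```

The weights `alpha` sum to one, so the context is a convex combination of the encoder states.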


The classic paper about attention mechanisms is: [Bahdanau et al: "Neural machine translation by jointly learning to align and translate." (2014).](https://arxiv.org/pdf/1409.0473.pdf)

<a id="memnets"></a>
# Memory networks 

But the idea of attention mechanisms over a representation had far wider consequences than one could imagine at first, since some researchers started to generalize it into a general storage-retrieval method for differentiable computation.



## Gated RNN memory problems

### Comparatively small "working memory"

The size of the hidden state, that is, the "working memory" of LSTM-like models, is very limited.

<img src="http://drive.google.com/uc?export=view&id=1-LbAhO8U_sfr5ipSaPTEm_0niPxS39u8" width="700px">

If we take a layer width of 2000 and 64-bit floating point numbers, we get approximately 128 kbit = 16 kB, which was considered very limited even at the dawn of personal computing:

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/af/Commodore_16_002a.png/1920px-Commodore_16_002a.png" width="700px">


#### The storage requirement of weights

In spite of the limited capacity, the storage needed for the weights grows quadratically with the size of the memory, since the "gates" are dense layers.

So if the size of the hidden layer is $h$, the number of weights of an LSTM model is more than $8 h^2$ (four gate-like dense layers, each with roughly $2h$ inputs and $h$ outputs when the input is as wide as the hidden state). For the 2000-wide example this means more than **32 million weights**; calculating with 64-bit numbers again, the weights alone need at least 256 MB of storage (16 kB of information, at least 256 MB of retrieval mechanism!).
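
The back-of-the-envelope numbers can be checked with a few lines of Python. The sketch assumes that the input is as wide as the hidden state and ignores biases, which is what gives the $8h^2$ lower bound; the variable names are only illustrative.

```python
h = 2000        # hidden-state width
bits = 64       # bits per floating point number

state_bytes = h * bits // 8           # size of the "working memory"
# Four gate-like dense layers, each mapping [input; hidden] (2h values) to h values.
n_weights = 4 * h * (2 * h)           # = 8 * h**2
weight_bytes = n_weights * bits // 8

print(state_bytes)    # 16000 bytes, about 16 kB
print(n_weights)      # 32000000 weights
print(weight_bytes)   # 256000000 bytes, about 256 MB
```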


#### External memory

The question naturally arises: how can we increase the storage capacity without increasing the number of weights?

### Memory Networks

[Weston et al. (Facebook AI Research, 2014): Memory Networks](https://arxiv.org/pdf/1410.3916.pdf)

#### Abstract architecture: IGOR

- __Input__: Transforms the input into internal feature representations
- __Generalization__: Updates the memory based on the input; typically it compresses and generalizes the stored representations
- __Output__: Generates new output (in the feature representation space) based on input and memory state
- __Response__: Transforms output to the required output format

#### Modules more in detail

##### Input
In NLP cases, preprocessing and embedding can happen here.

##### Generalization
The most basic solution is to simply store the inner representation at a memory location depending on the $x$ input:

$$m_{H(x)} = I(x) $$

where $H(.)$ is the function selecting the appropriate memory location.

##### Output and response
- Typically the output module reads and composes data from the memory locations, which amounts to a form of reasoning
- Based on the output, the response layer gives the final response, e.g., with an RNN decoder
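
To make the division of labour between the four modules concrete, here is a deliberately tiny, non-neural sketch. Everything in it is an assumption made for illustration: the class name, a bag of words as "feature representation", $H(x)$ chosen as "the next free slot", and plain word overlap standing in for the learned scoring function of a real memory network.

```python
class ToyMemoryNetwork:
    """Toy I/G/O/R pipeline; word overlap stands in for a learned scoring function."""

    def __init__(self):
        self.memory = []    # list of memory slots

    def I(self, text):
        # Input: map raw text to an internal representation (here a bag of words).
        return frozenset(text.lower().replace("?", "").replace(".", "").split())

    def G(self, text):
        # Generalization: store I(x) in the next free slot, i.e. H(x) = next index.
        self.memory.append((self.I(text), text))

    def O(self, question):
        # Output: select the memory slot that matches the question best.
        q = self.I(question)
        # Ties are broken in favour of the most recently written slot.
        scores = [(len(q & feats), idx) for idx, (feats, _) in enumerate(self.memory)]
        return max(scores)[1]

    def R(self, slot):
        # Response: here simply the last word of the supporting memory.
        return self.memory[slot][1].rstrip(".").split()[-1]

net = ToyMemoryNetwork()
for sentence in ["Joe went to the kitchen.",
                 "Fred went to the garden.",
                 "Joe picked up the milk."]:
    net.G(sentence)

answer = net.R(net.O("Where is Fred?"))
print(answer)    # garden
```

In a real memory network the I, O and R components are trained models, and O may chain several memory lookups before R produces the answer.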

### One example architecture: Neural Turing Machine 2.0 - Differentiable Neural Computer

[Graves et al. (Google DeepMind, 2016): Hybrid computing using a neural network with dynamic external memory](https://www.dropbox.com/s/0a40xi702grx3dq/2016-graves.pdf)

#### Architecture

The architecture is based on that of the classic Turing machine: there are separate reading and writing network components ("heads") to interact with the external memory, which is an array of writable/readable numeric vectors.
The main method of addressing the memory is similarity based. Given a key, a similarity based attention mechanism accesses the memory by focusing on the memory cells whose content is most similar to the key. Among other uses, this mechanism enables using the memory as a key-value store.
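
Similarity based addressing can be sketched as follows. The sketch assumes cosine similarity and a scalar key strength `beta` that sharpens the softmax (the choices made in the DNC paper); the function name, the memory contents and the key are made up for the example.

```python
import numpy as np

def content_weights(memory, key, beta=1.0):
    # Cosine similarity between the key and every memory row.
    eps = 1e-8
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    # Softmax over the memory rows, sharpened by the key strength beta.
    z = np.exp(beta * (sim - sim.max()))
    return z / z.sum()

memory = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.7, 0.7, 0.0]])
key = np.array([0.0, 1.0, 0.1])

w = content_weights(memory, key, beta=10.0)   # attention over the memory cells
read_vector = w @ memory                      # soft read: weighted sum of the rows
```

Because the read is a weighted sum rather than a hard lookup, the whole operation stays differentiable with respect to both the key and the memory contents.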

The memory related components:
- a memory address adjacency matrix that stores which memory addresses were written after each other.
- a vector storing the list of already used memory addresses.
- a single "write head", which uses either a key/content based addressing scheme or writes to newly allocated memory places.
- two reading heads, both of which can read the external memory in three ways:
  - based on the similarity of content to a given key
  - sequentially forward, following the earlier writing order
  - sequentially backward, against the earlier writing order
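
How the adjacency matrix supports reading in writing order can be illustrated with a small sketch. It follows the temporal link update as we understand it from the DNC paper, but uses hard (one-hot) write weightings so that the result is easy to read; the function and variable names are illustrative.

```python
import numpy as np

def update_links(L, p, w):
    # L[i, j] ~ "slot i was written right after slot j"; p is the precedence
    # weighting (where the previous write went); w is the current write weighting.
    L = (1.0 - w[:, None] - w[None, :]) * L + np.outer(w, p)
    np.fill_diagonal(L, 0.0)          # a slot never follows itself
    p = (1.0 - w.sum()) * p + w
    return L, p

n = 4
L, p = np.zeros((n, n)), np.zeros(n)
for slot in [2, 0, 3]:                # write order: slot 2, then 0, then 3
    w = np.eye(n)[slot]               # hard (one-hot) write for clarity
    L, p = update_links(L, p, w)

read_w = np.eye(n)[0]                 # a read head currently focused on slot 0
forward = L @ read_w                  # shifts focus to the slot written after 0
backward = L.T @ read_w               # shifts focus to the slot written before 0
```

With soft write weightings the same update yields fractional link strengths, so following the writing order remains a differentiable operation.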

The above external memory machinery is driven by a recurrent network "controller" which reads the input, interacts with the external memory and produces the output.

<img src="http://drive.google.com/uc?export=view&id=1HQMkHgWYUL348DT86ZQe371snij1Ui6X">
  




### Results for Memory Networks

- __Large scale QA__: Search in a database of 14M (subject, relation, object) triplets stored in memory (e.g., milne authored winnie-the-pooh), based on a natural language query, e.g., "Who is pooh's creator?"

> "The results show that MemNNs are a viable approach for large scale QA in
terms of performance."

- __Simulated World QA__ 

Answering questions based on simple stories:

>we also built a simple simulation of 4 characters, 3 objects and 5 rooms – with characters moving around, picking up and dropping objects. The actions are transcribed into text using a simple automated grammar, and labeled questions are generated in a similar way.

<img src="http://drive.google.com/uc?export=view&id=1oxe_-Wm4s4K-Ax880NCu4Z_lK-PFnriH">

<img src="http://drive.google.com/uc?export=view&id=1BtY9jKwtt3xr4NS9huadwCTllP-GQCkp" width="700px">

(Difficulty: how many sentences back the entity asked about was last mentioned. Actor vs. actor + object experiment: in the former only "go" was allowed as an action, in the latter "get" and "drop" as well.)

### Mature benchmark - The bAbI dataset

bAbI tasks: synthetic "toy" QA dataset produced by simulation

([Weston et al (Facebook AI Research, 2016): Towards ai-complete question answering: A set of prerequisite toy tasks.](https://arxiv.org/pdf/1502.05698.pdf)):

>All of the tasks are noiseless and a human able to read that language can potentially achieve 100%
accuracy. We tried to choose tasks that are natural to a human: they are based on simple usual situations and no background in areas such as formal semantics, machine learning, logic or knowledge
representation is required for an adult to solve them.

>The data itself is produced using a simple simulation of characters and objects moving around and
interacting in locations, described in Section 4.  The simulation allows us to generate data in many
different scenarios where the true labels are known by grounding to the simulation.

Components:

- "entities"
  - places
  - objects
  - persons
- states
  - absolute/relative place
  - mental state
- attributes
  - size
  - colour
- actions:
  - go _location_, get _object_, get _object1_ from _object2_, put _object1_ in/on _object2_, give _object_ to _actor_, drop _object_, set _entity_ _state_, look, inventory and examine _object_.
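
The idea of producing stories, questions and ground-truth labels from a simulation can be illustrated with a toy generator. This is not the bAbI generator: it is a minimal sketch that only knows the "go" action, and the names, places and sentence template are invented for the example.

```python
import random

def generate_story(n_steps=6, seed=0):
    rng = random.Random(seed)
    actors = ["Mary", "John", "Sandra"]
    places = ["kitchen", "garden", "office", "hallway"]
    where = {}        # ground-truth world state: actor -> place
    lines = []
    for _ in range(n_steps):
        actor, place = rng.choice(actors), rng.choice(places)
        where[actor] = place    # the "go" action updates the simulated state
        lines.append(f"{actor} went to the {place}.")
    asked = rng.choice(sorted(where))       # only ask about actors that have moved
    question = f"Where is {asked}?"
    return lines, question, where[asked]    # the label is read off the simulation

story, question, answer = generate_story()
print("\n".join(story))
print(question, answer)
```

Since the label is taken from the simulated world state, it is correct by construction, which is what makes such data noiseless.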

"For each task, we describe it by giving a small sample of the dataset including statements, questions and the true
labels (in red) in Tables 1 and 2."

<img src="http://drive.google.com/uc?export=view&id=1zwM3eG-QuTkcyWWFzgius2r_xCMyTPOo" width="700px">

<img src="http://drive.google.com/uc?export=view&id=1GpyNrRy9B294D01SpspKX9mpxVFqp0D5" width="700px">




### Results for DNC

#### bAbI

Mean test error rate: 7.5% (previous best) $\rightarrow$ 3.8% (best DNC)

#### Randomly generated graph tasks

<img src="http://drive.google.com/uc?export=view&id=1ckenwxkD75EwfRJSED0O0xi20PtCEMJ3" width="700px">

<img src="http://drive.google.com/uc?export=view&id=1wr7CT728KYoVG44P89R5WnVmUVAr5WLC"  width="600px">

#### Moving objects (mini SHRDLU)

This is the domain of Reinforcement Learning...


# Attention is all you need! - rise of the Transformers

Although we have seen that attention mechanisms enable processing over elaborate external memory structures, later research showed that attention is extremely powerful in sequence modeling even without any external memory.


The __transformer__ is a powerful seq2seq encoder-decoder architecture which is built solely from "transformer modules" consisting of attention and feed-forward layers, without using RNNs. Despite the lack of recurrence, in most NLP tasks (e.g., language modeling, translation, question answering) transformer-based models have significantly outperformed the "more traditional" RNN-based encoder-decoders.

## Attention in general

The basic attention schema used in transformers can be described as follows: We want to "attend" to part(s) of a certain $\mathbf X=\langle \mathbf x_1,\dots,\mathbf x_n \rangle$ sequence of vectors (embeddings). In order to do that, we transform $\mathbf X$ into a sort of "key-value store" by calculating from $\mathbf X$

- a $\mathcal K(\mathbf X) = \mathbf K = \langle \mathbf k_1,\dots, \mathbf k_n \rangle$ sequence of key vectors, one for each $\mathbf x_i$,
- a $\mathcal V(\mathbf X) = \mathbf V = \langle \mathbf v_1,\dots,\mathbf v_n \rangle$ sequence of value vectors, one for each $\mathbf x_i$.

In addition, we generate (not necessarily from $\mathbf X$) a $\mathbf Q = \langle \mathbf q_1,\dots,\mathbf q_m\rangle$ sequence of query vectors. Using these, the "answer" to each $\mathbf q$ query can be calculated by

- first calculating a "relevance score" for each $\mathbf k_i$ key, which is simply the $\mathbf q \cdot \mathbf k_i$ dot product (in certain cases scaled by a constant),
- then taking the $\langle s_1,\dots,s_n\rangle$ softmax of the scores, which forms a probability distribution over the value vectors,
- finally, calculating the answer as the weighted sum of the values:

$$ \sum_{i} s_i \mathbf v_i$$
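
The three steps translate almost literally into NumPy. In this sketch the keys, values and queries are random placeholders, the function name is our own, and the optional scaling constant is assumed to be $1/\sqrt{d_k}$ as in the transformer paper.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def attention(Q, K, V, scale=True):
    # Q: (m, d_k) queries, K: (n, d_k) keys, V: (n, d_v) values.
    scores = Q @ K.T                              # q . k_i for every query/key pair
    if scale:
        scores = scores / np.sqrt(K.shape[-1])    # the optional scaling constant
    S = softmax(scores, axis=-1)                  # each row: distribution over the n values
    return S @ V, S                               # answers: sum_i s_i v_i, shape (m, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 3))
answers, S = attention(Q, K, V)
```

Note that the number of queries $m$ and the number of key-value pairs $n$ are independent of each other; only the key and query dimensions have to match.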

##  Attention as a layer

How can the above attention mechanism be used as a _layer_ in a network with an input sequence $\mathbf I = \langle \mathbf i_1,\dots, \mathbf i_n\rangle$, where the $\mathbf i_k$ elements are themselves vectors (embeddings)? The transformer solution is to calculate a query from each input:

$$
\mathbf Q = \mathcal Q(\mathbf I) = \langle \mathcal Q(\mathbf i_1),\dots,\mathcal Q(\mathbf i_n)\rangle 
$$
use these queries to attend to a sequence of vectors, and output simply the calculated answers.

The transformer uses two attention-layer variants, which differ only in what they attend to:

- __Self-attention__ layers attend (unsurprisingly) to their own input, while, in contrast,
- __Encoder-decoder attention__ layers, used in the decoder, attend to the output of the encoder.

## Self-attention

In a transformer self-attention layer, both the source of the queries and the target of the attention are the input embeddings. The mappings for queries, keys and values are learned projections:

<img src="http://jalammar.github.io/images/t/self-attention-matrix-calculation.png" width="400px">

(image source: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/))
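
A minimal single-head sketch of such a layer is given below. The projection matrices are random stand-ins for learned parameters, the dimensions are chosen arbitrarily, and the $1/\sqrt{d_k}$ scaling of the original paper is assumed.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Queries, keys and values are all projections of the same input X.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d_model, d_k = 3, 8, 4
X = rng.normal(size=(n, d_model))                    # input embeddings, one row per position
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Z = self_attention(X, W_q, W_k, W_v)                 # one output row per input position
```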

## Multi-headed attention

In order to be able to attend to different features on the basis of different queries, the transformer attention layers work with multiple learned query, key and value projections, which are collectively called "attention heads":

> "Multi-head attention allows the model to jointly attend to information from different representation
subspaces at different positions." 

([Attention is all you need](https://arxiv.org/abs/1706.03762))

<img src="http://jalammar.github.io/images/t/transformer_attention_heads_qkv.png" width="800">

(image source: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/))

The outputs are collected for each head separately:

<img src="http://jalammar.github.io/images/t/transformer_attention_heads_z.png" width="800">

(image source: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/))

concatenated, and, finally, projected back by another learned weight matrix into the basic model embedding dimension:

<img src="http://jalammar.github.io/images/t/transformer_multi-headed_self-attention-recap.png" width="800">

(In the original "Attention is all you need" paper the model embedding dimension is 512, there are 8 attention heads, and the query, key and value vectors are all 512/8 = 64 dimensional.)
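
The whole multi-head computation (separate projections per head, concatenation, final projection) can be sketched as follows, using the dimensions of the original paper. The weights are random placeholders for learned parameters and the names are illustrative; real implementations usually compute all heads in one batched matrix product instead of a Python loop.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v: (n_heads, d_model, d_head); W_o: (n_heads * d_head, d_model)
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        heads.append(softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V)
    # Concatenate the per-head outputs, then project back to d_model.
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 512, 8
d_head = d_model // n_heads                  # 64
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(n_heads, d_model, d_head)) * 0.02 for _ in range(3))
W_o = rng.normal(size=(n_heads * d_head, d_model)) * 0.02
out = multi_head_self_attention(X, W_q, W_k, W_v, W_o)
```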

## Transformer modules
Similarly to most CNN architectures, transformers are built up from identical modules that consist of two main components: one or two multi-headed attention layers and a position-wise feedforward network with one hidden layer whose dimensionality is larger than the model's basic embedding dimension (2048 in the original paper). The attention and FF layers are wrapped in residual (skip) connections and are normalized with layer norm. Two types of modules are used.

The modules in the encoder contain only self-attention:

<img src="http://jalammar.github.io/images/t/transformer_resideual_layer_norm_2.png" width="550">

(image source: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/))
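
A simplified encoder module might look like the sketch below. To keep it short we assume a single attention head, a ReLU feed-forward network, layer norm without learned gain and bias, and the `LayerNorm(x + Sublayer(x))` ordering of the original paper; the parameter names and sizes are made up and the weights are random.

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each position's vector; learned gain and bias are omitted.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def encoder_module(X, p):
    # Sub-layer 1: self-attention with residual connection and layer norm.
    Q, K, V = X @ p["W_q"], X @ p["W_k"], X @ p["W_v"]
    att = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V @ p["W_o"]
    X = layer_norm(X + att)
    # Sub-layer 2: position-wise feed-forward network with a wider hidden layer.
    ff = np.maximum(0.0, X @ p["W_1"] + p["b_1"]) @ p["W_2"] + p["b_2"]
    return layer_norm(X + ff)

rng = np.random.default_rng(0)
n, d_model, d_ff = 4, 16, 64
shapes = {"W_q": (d_model, d_model), "W_k": (d_model, d_model),
          "W_v": (d_model, d_model), "W_o": (d_model, d_model),
          "W_1": (d_model, d_ff), "b_1": (d_ff,),
          "W_2": (d_ff, d_model), "b_2": (d_model,)}
params = {name: rng.normal(size=shape) * 0.1 for name, shape in shapes.items()}
X = rng.normal(size=(n, d_model))
Y = encoder_module(X, params)    # same shape as X, so modules can be stacked
```

Since input and output have the same shape, several such modules (each with its own parameters) can simply be applied one after another.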

While the modules in the decoder also contain an "outward attention" layer attending to the output of the encoder:

<img src="https://lilianweng.github.io/lil-log/assets/images/transformer-decoder.png" width="400">

(image source: [Attention? Attention!](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html))

## Encoder-decoder architecture

The full encoder-decoder architecture has the following structure:

<img src="http://nlp.seas.harvard.edu/images/the-annotated-transformer_14_0.png" width="400">

(image source: [The annotated transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html))

Similarly to other (e.g., RNN-based) seq2seq architectures, the decoder part takes the previous outputs as input. In order to prevent access to information from "future outputs", the self-attention layers in the decoder use "masked attention", i.e., for each position, the positions to its right are forced to have a relevance score of $-\infty$ at the input of the self-attention softmax.
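
The masking itself is a small operation on the score matrix. The sketch below applies it to random scores; the function name is our own, and rows are assumed to index query positions while columns index key positions.

```python
import numpy as np

def causal_attention_weights(scores):
    # scores: (n, n) raw relevance scores; row = query position, column = key position.
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
    masked = np.where(future, -np.inf, scores)           # positions to the right get -inf
    z = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
W = causal_attention_weights(rng.normal(size=(4, 4)))
```

After the softmax the masked positions receive exactly zero weight, so the first position can only attend to itself, the second to the first two positions, and so on.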

The following animations show the whole transformer seq2seq architecture in action in a translation task:

<img src="http://jalammar.github.io/images/t/transformer_decoding_1.gif">

<img src="http://jalammar.github.io/images/t/transformer_decoding_2.gif">

(image source: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/))

## Further reading

+ The original transformer paper: [Attention is all you need](https://arxiv.org/abs/1706.03762)
+ A highly readable, illustrated dissection on which this discussion drew: [The illustrated transformer](http://jalammar.github.io/illustrated-transformer/)
+ An annotated version of the original paper with implementation in PyTorch: [The annotated transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
+ Perhaps the most important application, a special kind of language model: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)