In [None]:
%load_ext autoreload
%autoreload 2

%matplotlib inline

## In this presentation

- Sequence2sequence models
- Attention mechanism
- Transformers for NLP
- Transformers for images
- Set loss
- Transformer based object detection

## Sequence2sequence models

Sequence-to-sequence models were developed for machine translation, with an RNN (LSTM or GRU) encoder and decoder.
<br>
The first RNN, the encoder, encodes a sentence in one language (for instance French) into a fixed-size vector of dimension $300$, $400$, $1024$, $2048$, etc.
<br>
The second RNN, the decoder, decodes this vector into a sentence in another language, for instance English.

Example of sequence2sequence:
<img src="images/detr/seq2seq_1.png" height="1000" width="1000">

RNN models process sequences while remembering information from previous states:
<img src="images/detr/rnn_1.gif" height="1000" width="1000">

Visualization of the recurrent neural network
<img src="images/detr/rnn_anim_1.gif" height="1000" width="1000">

Each cell is a neural network with recurrent connections, i.e. connections from the hidden layer back to itself:
<img src="images/detr/rnn_anim_2.gif" height="600" width="600">
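The recurrence shown above can be sketched in a few lines of NumPy. This is only an illustration, assuming a plain tanh cell instead of an LSTM or GRU; the names `rnn_encode`, `W_xh`, `W_hh`, `b_h` and all sizes are made up for the example and the weights are random, not trained.

```python
import numpy as np

def rnn_encode(xs, W_xh, W_hh, b_h):
    """Run a plain tanh RNN over a sequence and return every hidden state."""
    h = np.zeros(W_hh.shape[0])          # initial hidden state
    states = []
    for x in xs:                         # one step per token embedding
        # the new state mixes the current input with the previous state
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
emb_dim, hidden_dim, seq_len = 8, 16, 5  # illustrative sizes
xs = rng.normal(size=(seq_len, emb_dim))
W_xh = rng.normal(scale=0.1, size=(hidden_dim, emb_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

hs = rnn_encode(xs, W_xh, W_hh, b_h)
sentence_vector = hs[-1]                 # fixed-size vector handed to the decoder
```

The last hidden state plays the role of the fixed-size sentence vector, whatever the length of the input sequence.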

Example of translation:
<img src="images/detr/seq2seq_2.gif" height="1000" width="1000">

It turns out that this model works quite well if we choose "good" pairs of sentences and train it for a long time.
<br>
As I remember, the original paper reported that reversing the input sentences significantly improved accuracy.

This model works not only for machine translation but also for other tasks such as image captioning:
<img src="images/detr/imcap_1.png" height="1000" width="1000">

The main idea here is to use a ConvNet as the encoder instead of an RNN, and to use an RNN only as the decoder.
<br>
Use a pre-trained model without its last layer to encode the image into a vector, and feed this vector to the decoder.

<img src="images/detr/imcap_3.png" height="1000" width="1000">

## Attention

It turns out that as sentences become longer, RNN models are not good enough at carrying information about the first words to the end, or even to the middle, and translation performance suffers.
<br>
To deal with this, we need to somehow pay attention to the parts of the sentence that are most influential for a given part (a particular word) of the output sentence.
<br>
Naturally, when we translate a long text, we first pay attention to a particular part of it, to particular words, and assemble the result part by part.

Let's create a probabilistic mask over the encoder hidden states $h^1, h^2, \dots, h^n$ (which are vectors), with weights:
$$
\begin{pmatrix}
\alpha^{<1, 1>} & \alpha^{<1, 2>} & \cdots & \alpha^{<1, n>} \\
\alpha^{<2, 1>} & \alpha^{<2, 2>} & \cdots & \alpha^{<2, n>} \\
\vdots  & \vdots  & \ddots & \vdots  \\
\alpha^{<m, 1>} & \alpha^{<m, 2>} & \cdots & \alpha^{<m, n>} \\
\end{pmatrix}
$$
<br>
where
$\sum_{j=1}^n{\alpha^{<i, j>}} = 1$

And for each decoder hidden state $s^t$ (which is itself a vector) we generate the context vector
$$
c^t = \sum_{i=1}^n{\alpha^{<t, i>} \cdot h^i}
$$
<br>
and concatenate $c^t$ with the original hidden state, $<c^t, s^t>$, to produce the output and the input to the next recurrent unit

Visualization of the attention:
<img src="images/detr/attnt_1.gif" height="1000" width="1000">

The question is how to generate these $\alpha$ values. We use softmax:
$$
\alpha^{<t, i>} = \frac{\exp(z^{<t, i>})}{\sum_{j=1}^{n}\exp(z^{<t, j>})}
$$
but how do we calculate the $z^{<t, i>}$ values?

We need a function $f$ that consumes $h^i$ and $s^{t-1}$ and outputs $z^{<t, i>}$ (we use $s^{t-1}$ because $s^t$ itself is computed from $s^{t-1}$); the same function is used for all data
<br>
$$
z^{<t, i>} = f(<h^i, s^{t-1}>)
$$
<br>
Here $<.,.>$ means concatenation
<br>
So we can use a "small" neural network with maybe $1$ or $2$ hidden layers as our function $f$
<br>
The encoder RNN, decoder RNN and $f$ are all trained together
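The scoring function $f$, the softmax and the context vector $c^t$ can be put together in a small NumPy sketch. It assumes $f$ is a one-hidden-layer network with a tanh activation; the names `attention_context`, `W1`, `b1`, `w2` and the sizes are illustrative assumptions, and the weights are random rather than trained.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_context(H, s_prev, W1, b1, w2):
    """Return (alpha, c) for one decoder step.

    H:      (n, d_h) encoder hidden states h^1..h^n
    s_prev: (d_s,)   previous decoder state s^{t-1}
    f is a one-hidden-layer network applied to the concatenation <h^i, s^{t-1}>.
    """
    n = H.shape[0]
    pairs = np.concatenate([H, np.tile(s_prev, (n, 1))], axis=1)  # (n, d_h + d_s)
    z = np.tanh(pairs @ W1 + b1) @ w2    # (n,) scores z^{<t, i>}
    alpha = softmax(z)                   # attention weights, sum to 1
    c = alpha @ H                        # context vector c^t, shape (d_h,)
    return alpha, c

rng = np.random.default_rng(0)
n, d_h, d_s, hidden = 5, 16, 16, 10      # illustrative sizes
H = rng.normal(size=(n, d_h))
s_prev = rng.normal(size=d_s)
W1 = rng.normal(scale=0.1, size=(d_h + d_s, hidden))
b1 = np.zeros(hidden)
w2 = rng.normal(scale=0.1, size=hidden)

alpha, c = attention_context(H, s_prev, W1, b1, w2)
```

In a real model `W1`, `b1` and `w2` would receive gradients through the same loss as the encoder and decoder.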

Attention weights are trained probabilities that say, for each decoder hidden state, how strongly each encoder hidden state should influence it.
<br>
The same idea can be applied to a transition between any two tensors, for example for images or graphs

Attention improved the translation of long sequences, and also image captioning, which produced longer and more precise descriptions

Attention heatmaps:
<img src="images/detr/attnt_2.png" height="600" width="600">

## Transformers for NLP (<a href="http://jalammar.github.io/illustrated-transformer/">source</a>)

<a href="https://arxiv.org/abs/1706.03762"> Attention is all you need</a>
<br>

<a href="http://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a>
<br>

<a href="https://kazemnejad.com/blog/transformer_architecture_positional_encoding/">Transformer Architecture: The Positional Encoding</a>
<br>

<a href="https://medium.com/lsc-psd/introduction-of-self-attention-layer-in-transformer-fc7bff63f3bc">Introduction of Self-Attention Layer in Transformer</a>

#### Transformer architecture

The Transformer consists of an encoder and a decoder and was first introduced as an alternative to sequence2sequence models for machine translation.
<br>

The key concept is that the input sentence is fed as a matrix with a fixed-dimensional word embedding in each row:

$$
\begin{pmatrix}
x_{1, 1} & x_{1, 2} & \cdots & x_{1, d} \\
x_{2, 1} & x_{2, 2} & \cdots & x_{2, d} \\
\vdots  & \vdots  & \ddots & \vdots  \\
x_{m, 1} & x_{m, 2} & \cdots & x_{m, d} \\
\end{pmatrix}
$$
<br>

and is multiplied by a weight matrix with a fixed number of rows (the word-embedding dimension $d$) and columns (a hyper-parameter $s$, $64$ in the original paper):
<br>

$$
\begin{pmatrix}
w_{1, 1} & w_{1, 2} & \cdots & w_{1, s} \\
w_{2, 1} & w_{2, 2} & \cdots & w_{2, s} \\
\vdots  & \vdots  & \ddots & \vdots  \\
w_{d, 1} & w_{d, 2} & \cdots & w_{d, s} \\
\end{pmatrix}
$$
<br>

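So the weight matrix always has the fixed size $\mathbb{R}^{d \times s}$ regardless of the sentence length $m$ (the product has shape $m \times s$), and theoretically the model can consume a sentence of any length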

#### Model architecture:

The encoder and decoder networks:
<img src="images/detr/transf_1.png" height="600" width="600">

The encoder consists of several layers stacked together, and so does the decoder ($6$ each in the paper):
<img src="images/detr/transf_2.png" height="600" width="600">

The architecture of each layer is similar: (multi-head) self-attention followed by feed-forward layers, for both encoder and decoder; in addition, the decoder has an encoder-decoder attention layer, as in sequence2sequence models:
<img src="images/detr/transf_3.png" height="600" width="600">

#### Self-attention

First step: create three different vectors from each input vector (word embedding) using three different weight matrices: query, key and value

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. They don’t HAVE to be smaller, this is an architecture choice to make the computation of multiheaded attention (mostly) constant
<img src="images/detr/selfatt_1.png" height="600" width="600">

The second step: calculate the score for each query with different keys:

<br>
The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2
<img src="images/detr/selfatt_2.png" height="600" width="600">

Third step: divide the scores by the square root of the dimension of the key vectors (the dimension is $64$ in the paper, so the scores are divided by $8$)
<br>
Fourth step: apply SoftMax to the results:
<img src="images/detr/selfatt_3.png" height="600" width="600">

Fifth step: multiply each value vector by its SoftMax score; this generates probability-masked values, as before
<br>
Sixth step: sum the weighted value vectors to get the output for the first embedding:
<img src="images/detr/selfatt_4.png" height="600" width="600">

The resulting vector is one we can send along to the feed-forward neural network

All of the above steps can be done as a matrix calculation:
<img src="images/detr/selfatt_5.png" height="600" width="600">

Then calculate attention outputs
<img src="images/detr/selfatt_6.png" height="600" width="600">
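The six steps above, in matrix form, fit into a short NumPy sketch. It uses the dimensions quoted from the paper ($512$ for embeddings, $64$ for queries, keys and values); the function name `self_attention` and the random weights are assumptions made for illustration only.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one sentence X of shape (m, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # step 1: queries, keys, values
    scores = Q @ K.T                          # step 2: score every query against every key
    scores = scores / np.sqrt(K.shape[-1])    # step 3: divide by sqrt(d_k), i.e. 8 for d_k = 64
    weights = softmax(scores, axis=-1)        # step 4: softmax over the keys
    return weights @ V, weights               # steps 5-6: weighted sum of the value vectors

rng = np.random.default_rng(0)
m, d_model, d_k = 4, 512, 64                  # 4 words in the sentence
X = rng.normal(size=(m, d_model))
W_q, W_k, W_v = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_k)) for _ in range(3))

Z, weights = self_attention(X, W_q, W_k, W_v)
```

Each row of `weights` is a probability distribution over the words of the sentence, and each row of `Z` is the attention output for one word.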

Instead of a single self-attention, multi-head attention is applied, each head with its own weights ($8$ heads in the original paper):
<img src="images/detr/selfatt_7.png" height="600" width="600">

Per embedding, different outputs are generated:
<img src="images/detr/selfatt_8.png" height="600" width="600">

Then the outputs are concatenated horizontally and an additional weight matrix is used to produce a single matrix:
<img src="images/detr/selfatt_9.png" height="600" width="600">
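A minimal sketch of the concatenate-and-project step, assuming $8$ heads of size $64$ as in the paper. The function name `multi_head_attention`, the `heads` list of weight tuples and the output matrix `W_o` are illustrative names with random values.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one tuple per attention head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(A @ V)                       # (m, d_k) per head
    concat = np.concatenate(outputs, axis=-1)       # (m, n_heads * d_k)
    return concat @ W_o                             # project back to (m, d_model)

rng = np.random.default_rng(0)
m, d_model, n_heads, d_k = 4, 512, 8, 64
X = rng.normal(size=(m, d_model))
heads = [
    tuple(rng.normal(scale=d_model ** -0.5, size=(d_model, d_k)) for _ in range(3))
    for _ in range(n_heads)
]
W_o = rng.normal(scale=d_model ** -0.5, size=(n_heads * d_k, d_model))

out = multi_head_attention(X, heads, W_o)
```

Because $8 \times 64 = 512$, the output has the same shape as the input, so layers can be stacked.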

Here is the big picture; performance is improved with multi-head attention (compare with feature maps):
<img src="images/detr/selfatt_10.png" height="600" width="600">

#### Positional encoding

There is no notion of word order (1st word, 2nd word, ...) in the transformer architecture, so positional encoding is applied. For $d_{model}$-dimensional embeddings:
$$
\text{PE}(pos,2i)=\sin\left(\frac{pos}{10000^{2i/d_{model}}}\right),
$$
<br>
and
$$
\text{PE}(pos,2i+1)=\cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$
generate $d_{model}$-dimensional vectors in $\mathbb{R}^{d_{model}}$ with encoded positional information
<br>
In the paper $d_{model}=512$, so $i \in [0, 255]$

<br>
This is not the only possible method for positional encoding. It, however, gives the advantage of being able to scale to unseen lengths of sequences (e.g. if our trained model is asked to translate a sentence longer than any of those in our training set).
<br>

The main point is that positional encodings should be distinguishable and periodic, so that sentences longer than those seen during training can still be encoded.
<br>

From the paper: "We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$."

These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention:
<img src="images/detr/selfatt_11.png" height="600" width="600">

For example:
<img src="images/detr/selfatt_12.png" height="600" width="600">

In the following figure, each row corresponds to a positional encoding of a vector. So the first row would be the vector we’d add to the embedding of the first word in an input sequence. Each row contains 512 values – each with a value between 1 and -1. We’ve color-coded them so the pattern is visible:
<img src="images/detr/selfatt_13.png" height="600" width="600">
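The sinusoidal encoding defined above can be generated with a few lines of NumPy. The function name `positional_encoding` and the choice of 50 positions are assumptions for the example. The last lines illustrate the relative-position property from the quote: for one sine/cosine pair, moving by a fixed offset $k$ is a rotation, i.e. a linear map.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Return an (n_positions, d_model) matrix; row pos is the encoding of position pos."""
    pos = np.arange(n_positions)[:, None]              # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even columns: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                       # odd columns:  PE(pos, 2i+1)
    return pe

pe = positional_encoding(50, 512)

# Relative-position check for the pair i = 3 (columns 6 and 7):
# the encoding at pos + k is a rotation of the encoding at pos.
w = 1.0 / 10000.0 ** (2 * 3 / 512)
p, k = 10, 7
rotation = np.array([[np.cos(k * w), np.sin(k * w)],
                     [-np.sin(k * w), np.cos(k * w)]])
shifted = rotation @ pe[p, 6:8]                        # should equal pe[p + k, 6:8]
```

The rotation matrix depends only on the offset `k`, not on the position `p`, which is what makes relative positions easy to express.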

Residual connections:
<img src="images/detr/selfatt_14.png" height="600" width="600">

#### Different types of normalization (not only batch normalization exists)

Instead of normalizing over the batch, let's compute the mean and standard deviation per instance, layer, group, etc:
<img src="images/detr/norm_1.png" height="1000" width="1000">
<br>
For images, anything besides batch normalization gives no improvement and sometimes degrades performance because of the channel structure; for the transformer architecture, however, given the inter-text context and (multi-head) attention, it (layer normalization in particular) significantly improves performance

More detailed illustration of layer normalization:
<img src="images/detr/norm_2.png" height="1000" width="1000">

Layer normalization in the transformer is applied after self-attention: the input embeddings $X$ and the output of the layer are summed (to preserve the original information) and then normalized:
<img src="images/detr/selfatt_15.png" height="600" width="600">

This goes for the sub-layers of the decoder as well. If we’re to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:
<img src="images/detr/selfatt_16.png" height="1000" width="1000">
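The "add and normalize" step around each sub-layer can be sketched as follows. Layer normalization computes the mean and variance over the feature axis of each word vector separately. The names `layer_norm` and `add_and_norm` are illustrative, and the sub-layer here is a random ReLU projection standing in for self-attention or the feed-forward network.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row (one word vector) over its features, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
m, d_model = 4, 512
X = rng.normal(size=(m, d_model))
W = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)    # learned in a real model

# stand-in sub-layer: a single ReLU projection (assumption for this example)
out = add_and_norm(X, lambda x: np.maximum(x @ W, 0.0), gamma, beta)
```

Unlike batch normalization, the statistics here do not depend on the other sentences in the batch or on the sentence length.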

#### Decoder

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its “encoder-decoder attention” layer which helps the decoder focus on appropriate places in the input sequence:
<img src="images/detr/selfatt_17.gif" height="1000" width="1000">

The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.

<img src="images/detr/selfatt_18.gif" height="1000" width="1000">
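The step-by-step generation described above amounts to a simple loop. The sketch below assumes greedy decoding (always taking the single predicted token) and uses a hypothetical `toy_step` function that replays a fixed sentence in place of a trained decoder; `greedy_decode`, `<sos>` and `<eos>` are names chosen for the example.

```python
def greedy_decode(step_fn, encoder_output, start_token, end_token, max_len=20):
    """Generate tokens one at a time until end_token appears or max_len is reached."""
    tokens = [start_token]
    for _ in range(max_len):
        # the decoder sees the encoder output and every token produced so far
        next_token = step_fn(encoder_output, tokens)
        tokens.append(next_token)
        if next_token == end_token:      # the special "finished" symbol
            break
    return tokens[1:]                    # drop the start token

def toy_step(encoder_output, tokens):
    """Hypothetical stand-in for a trained decoder: replays a fixed translation."""
    target = ["i", "am", "a", "student", "<eos>"]
    return target[len(tokens) - 1]

result = greedy_decode(toy_step, encoder_output=None,
                       start_token="<sos>", end_token="<eos>")
```

In a real model `step_fn` would run the full decoder stack and pick the most probable word from the final SoftMax.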

The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.
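A single-head sketch of this layer, assuming the same scaled dot-product form as self-attention: queries come from the decoder, keys and values from the encoder output. The function name `encoder_decoder_attention`, the sizes and the random weights are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def encoder_decoder_attention(dec_X, enc_out, W_q, W_k, W_v):
    Q = dec_X @ W_q                      # queries from the decoder layer below
    K = enc_out @ W_k                    # keys from the encoder stack output
    V = enc_out @ W_v                    # values from the encoder stack output
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V, A

rng = np.random.default_rng(0)
m, t, d_model, d_k = 6, 3, 512, 64       # 6 source words, 3 target words so far
enc_out = rng.normal(size=(m, d_model))
dec_X = rng.normal(size=(t, d_model))
W_q, W_k, W_v = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_k)) for _ in range(3))

out, A = encoder_decoder_attention(dec_X, enc_out, W_q, W_k, W_v)
```

The attention matrix `A` has one row per target position and one column per source word, so it is generally not square.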

## Output of transformer

The Transformer ends with a fully connected layer with SoftMax activation that outputs a vector of vocabulary length; its probabilities encode the predicted word, like a soft one-hot encoding:
<img src="images/detr/transf_4.png" height="1000" width="1000">
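A small sketch of this output layer with a toy five-word vocabulary. The names `predict_word`, `W_out`, `b_out` and the vocabulary itself are made up for the example, and the weights are random, so the predicted word is arbitrary.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

vocab = ["<eos>", "i", "am", "a", "student"]   # toy vocabulary

def predict_word(h, W_out, b_out):
    """Map a decoder output vector h to a probability for every word in vocab."""
    logits = h @ W_out + b_out                 # fully connected layer
    probs = softmax(logits)                    # one probability per vocabulary entry
    return vocab[int(np.argmax(probs))], probs

rng = np.random.default_rng(0)
d_model = 512
h = rng.normal(size=d_model)
W_out = rng.normal(scale=d_model ** -0.5, size=(d_model, len(vocab)))
b_out = np.zeros(len(vocab))

word, probs = predict_word(h, W_out, b_out)
```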

## Transformers for images

<a href="https://arxiv.org/abs/1802.05751">Image Transformer</a>
<br>

<a href="https://arxiv.org/abs/1904.09925">Attention Augmented Convolutional Networks</a>

The multi-head self-attention technique has been used for image decoding as well. The experiments show that a mixed approach, combining attention with ConvNet layers, increases accuracy and performs better than a fully attention-based model:
<br>
"We test our method on the CIFAR-100 and ImageNet classification [22, 9] and the COCO object detection [27] tasks, across a wide range of architectures at different computational budgets, including a state-of-the art resource constrained architecture [42]. Attention Augmentation yields systematic improvements with minimal additional computational burden and notably outperforms the popular Squeeze-and-Excitation [17] channelwise attention approach in all experiments. In particular, Attention Augmentation achieves a 1.3% top-1 accuracy ImageNet on top of a ResNet50 baseline and 1.4 mAP increase in COCO object detection on top of a RetinaNet baseline. Surprisingly, experiments also reveal that fully self-attentional models, a special case of Attention Augmentation, only perform slightly worse than their fully convolutional counterparts on ImageNet, indicating that self-attention is a powerful stand-alone computational primitive for image classification."
<br>

In my opinion, ConvNet layers have the property of forgetting inactive features, removing noise and extracting "word-level" features from continuous data, on which multi-head self-attention shines

Given an input tensor of shape
$$
(H, W, F_{in})
$$
flattened into a matrix $X \in \mathbb{R}^{HW \times F_{in}}$, the output of a single attention head $h$ is defined as:
<br>

$$
O_h=\text{SoftMax}\left(\frac{(X \cdot W_q)(X \cdot W_k)^T}{\sqrt{d_k^h}}\right) \cdot (X \cdot W_v)
$$
<br>

where $W_q, W_k \in \mathbb{R}^{F_{in} \times d_k^h}$ and $W_v \in \mathbb{R}^{F_{in} \times d_v^h}$ are learned linear transformations which map $X$ to queries $Q = X \cdot W_q$, keys $K = X \cdot W_k$ and values $V = X \cdot W_v$
<br>

The output of the multi-head self-attention layer is:
$$
\text{MHA}(X) = \text{concat}(O_1, O_2, \dots, O_{N_{h}}) \cdot W^O
$$
<br>

$$
W^O \in \mathbb{R}^{d_v \times d_v}
$$
<br>

where $d_v = N_h \cdot d_v^h$. Then the output is reshaped to $(H, W, d_v)$ to match the original spatial dimensions

Finally, the attention-augmented convolution is:
$$
\text{AAConv}(X) = \text{concat}( \text{Conv}(X),\text{MHA}(X))
$$
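A shape-level sketch of the concatenation above. To stay within plain NumPy it makes two simplifying assumptions that are not from the paper: the convolution is a $1 \times 1$ convolution (a per-pixel linear map) and no positional information is added to the attention. The names `mha_2d`, `aa_conv_1x1` and all sizes are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mha_2d(X, heads, W_o):
    """Multi-head self-attention over all pixels of an (H, W, F_in) tensor."""
    height, width, f_in = X.shape
    flat = X.reshape(height * width, f_in)          # every pixel becomes a "word"
    outs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = flat @ W_q, flat @ W_k, flat @ W_v
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outs.append(A @ V)
    out = np.concatenate(outs, axis=-1) @ W_o       # (H*W, d_v)
    return out.reshape(height, width, -1)           # back to (H, W, d_v)

def aa_conv_1x1(X, W_conv, heads, W_o):
    conv_out = X @ W_conv                           # 1x1 convolution as a per-pixel linear map
    return np.concatenate([conv_out, mha_2d(X, heads, W_o)], axis=-1)

rng = np.random.default_rng(0)
height, width, f_in = 4, 4, 16
n_heads, d_k_h, d_v_h, conv_channels = 2, 4, 4, 12
d_v = n_heads * d_v_h
X = rng.normal(size=(height, width, f_in))
heads = [(rng.normal(size=(f_in, d_k_h)),
          rng.normal(size=(f_in, d_k_h)),
          rng.normal(size=(f_in, d_v_h))) for _ in range(n_heads)]
W_o = rng.normal(size=(d_v, d_v))
W_conv = rng.normal(size=(f_in, conv_channels))

Y = aa_conv_1x1(X, W_conv, heads, W_o)
```

The output keeps the spatial size and stacks the convolutional channels and the attention channels along the last axis.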

Performance of attention augmented models:
<img src="images/detr/aaconv_1.png" height="1000" width="1000">

## Set prediction

Set prediction matches two sets: the set of predicted objects and the set of ground-truth objects

<img src="images/detr/dert_3.png" height="1000" width="1000">
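Matching two sets means finding the one-to-one assignment with the lowest total cost. The sketch below does this by brute force over all permutations, which is only practical for tiny sets; DETR uses the Hungarian algorithm for the same problem. The function name `match_sets` and the cost values are invented for the example.

```python
from itertools import permutations

def match_sets(cost):
    """Find the one-to-one assignment with minimal total cost.

    cost[i][j] is the cost of matching prediction i to target j (square matrix).
    Brute force over all permutations: fine for tiny sets only.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost

# made-up matching costs between 3 predictions (rows) and 3 targets (columns)
cost = [[0.9, 0.1, 0.8],
        [0.2, 0.7, 0.9],
        [0.6, 0.8, 0.3]]
assignment, total = match_sets(cost)   # assignment[i] is the target matched to prediction i
```

Here prediction 0 is matched to target 1, prediction 1 to target 0 and prediction 2 to target 2. The loss is then computed only between matched pairs, so the order in which predictions are produced does not matter.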

## Transformer based object detection

<a href="https://ai.facebook.com/research/publications/end-to-end-object-detection-with-transformers">End-to-end Object Detection with Transformers</a>
<br>

<a href="https://medium.com/lsc-psd/detr-object-detection-with-transformer-a97104ea1723">DETR, Object detection with Transformer</a>
<br>

<a href="https://www.youtube.com/watch?v=T35ba_VXkMY">Paper explained in video</a>
<br>

<a href="https://www.youtube.com/watch?v=LfUsGv-ESbc">[Code] How to use Facebook's DETR object detection algorithm in Python (Full Tutorial)</a>

The attention layers of the transformer architecture encode spatial information:
<img src="images/detr/detr_1.png" height="1000" width="1000">

<img src="images/detr/detr_2.png" height="1000" width="1000">

<img src="images/detr/detr_3.png" height="1000" width="1000">

<img src="images/detr/detr_4.png" height="1000" width="1000">

<img src="images/detr/detr_5.png" height="1000" width="1000">

<img src="images/detr/detr_6.png" height="1000" width="1000">

<img src="images/detr/detr_7.png" height="1000" width="1000">

<img src="images/detr/detr_8.png" height="1000" width="1000">

<img src="images/detr/detr_9.png" height="1000" width="1000">

<img src="images/detr/detr_10.png" height="1000" width="1000">

<img src="images/detr/detr_11.png" height="1000" width="1000">

<img src="images/detr/detr_12.png" height="1000" width="1000">

## Questions

<img src="images/detr/questions_1.png" height="600" width="600">

## Thank you