Convolutional Neural Networks (CNNs) (Kim, 2014) represent documents as a collection of weighted n-grams. Convolutional layers are fast to compute as they fully leverage GPUs, but they have no grammatical understanding of a document. On the other hand, Recurrent Neural Networks (RNNs) (Cho et al., 2014) are built to read text as a human would and to take long-term dependencies in text into account. However, they are by design not suited to GPU computing and are difficult to train consistently.
In the paper *Attention Is All You Need*, the authors propose to use only dense layers and self-attention layers to learn representations.
Attention was originally proposed by Bahdanau et al. (2014) as a means to align text in the source and target languages and to relieve the encoder of the burden of encoding the full sentence into a single vector.
The Scaled Dot-Product Attention computes attention over Values ($V$) based on the relation between Keys ($K$) and Queries ($Q$). Formally, $Q$, $K$ and $V$ are three sequences of vectors represented on $d_k$ (for $Q$ and $K$) and $d_v$ (for $V$) dimensions. The attention layer computes:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$, and $n$ and $m$ are the lengths of the sequences. The magnitude of the dot product scales with the dimension $d_k$ of $Q$ and $K$, hence the division by $\sqrt{d_k}$, which keeps the result away from the low-gradient regions of the softmax function.
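To make the formula concrete, here is a minimal NumPy sketch of this computation (function and variable names are ours, not the paper's):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> output: (n, d_v)."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    # to stay away from the low-gradient regions of the softmax.
    scores = Q @ K.T / np.sqrt(d_k)                       # (n, m)
    # Row-wise softmax: each query gets a distribution over the keys.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ V                                    # (n, d_v)
```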
To better understand this formulation, let's consider two examples with two sequences $A$ and $B$:

*Figure: attention of $A$ over $B$.*

*Figure: self-attention over $A$.*
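In code, the two cases differ only in which sequence plays which role (reusing the sketch above; shapes are arbitrary toy values):

```python
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))   # sequence A: 5 tokens of dimension 8
B = rng.normal(size=(7, 8))   # sequence B: 7 tokens of dimension 8

# Attention of A over B: A provides the queries, B the keys and values.
cross = scaled_dot_product_attention(A, B, B)     # (5, 8)
# Self-attention over A: A plays all three roles at once.
self_att = scaled_dot_product_attention(A, A, A)  # (5, 8)
```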
The authors compare self-attention against CNNs and RNNs on three criteria: Complexity per Layer; Sequential Operations, the number of operations that must be performed in sequential order (the lower, the more parallelisable the layer); and Maximum Path Length, the number of intermediary neurons between the first and last words of a sentence (i.e. the capacity to take long-term dependencies into account).
| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
|---|---|---|---|
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k(n))$ |

where $n$ is the sequence length, $d$ the representation dimension and $k$ the kernel size of convolutions.
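As a rough sanity check (toy numbers of our choosing), plugging typical values into the complexities above shows why self-attention compares favourably as long as sentences are shorter than the representation dimension:

```python
n, d, k = 70, 512, 3  # toy sentence length, representation dimension, kernel size
print("self-attention:", n * n * d)      # O(n^2 * d)     -> 2,508,800
print("recurrent:     ", n * d * d)      # O(n * d^2)     -> 18,350,080
print("convolutional: ", k * n * d * d)  # O(k * n * d^2) -> 55,050,240
```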
Based on those observations, the authors argue that self-attention is extremely efficient at encoding and decoding text. It is cheap to compute whenever the sequence length $n$ is smaller than the representation dimension $d$, which is typically the case, and it is highly parallelisable. Moreover, as we said earlier, stacking self-attention layers allows the network to build complex relationships between words. Finally, long-term dependencies are taken into account just as much as short-term ones.
Instead of using one instance of the previous module at every layer, Multi-Head Attention runs multiple heads of smaller dimensions in parallel and then concatenates their outputs:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O \quad \text{with} \quad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

The projection matrices are the parameters of the network. For all $i$, $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$ and $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, while $W^O \in \mathbb{R}^{h d_v \times d_{model}}$. The authors choose $h = 8$ heads and $d_k = d_v = d_{model}/h = 64$. $d_{model} = 512$ is the dimensionality of the text representation throughout the network.
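Continuing the earlier sketch (again with names and initialisation of our choosing, not the authors' code), multi-head attention can be written as:

```python
def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k: (h, d_model, d_k); W_v: (h, d_model, d_v); W_o: (h*d_v, d_model)."""
    # Each head projects Q, K and V into a smaller subspace before attending.
    heads = [scaled_dot_product_attention(Q @ W_qi, K @ W_ki, V @ W_vi)
             for W_qi, W_ki, W_vi in zip(W_q, W_k, W_v)]
    # Concatenate the h heads and project back to d_model dimensions.
    return np.concatenate(heads, axis=-1) @ W_o

# Toy usage with the paper's sizes: h=8 heads, d_model=512, d_k=d_v=64.
h, d_model, d_k = 8, 512, 64
rng = np.random.default_rng(1)
X = rng.normal(size=(10, d_model))                  # self-attention over 10 tokens
W_q, W_k, W_v = (0.05 * rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = 0.05 * rng.normal(size=(h * d_k, d_model))
print(multi_head_attention(X, X, X, W_q, W_k, W_v, W_o).shape)  # (10, 512)
```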