Convolutional Neural Networks (CNNs) (Kim, 2014) represent documents as a collection of weighted n-grams. Convolutional layers are fast to compute as they fully leverage GPUs, but they have no grammatical understanding of a document. On the other hand, Recurrent Neural Networks (RNNs) (Cho et al., 2014) are built to read text as a human would and to take long-term dependencies in text into account. However, they are by design not suited to GPU computing and are difficult to train consistently.
In the paper *Attention Is All You Need*, the authors propose to use only dense layers and self-attention layers to learn representations.
Attention was originally proposed by Bahdanau et al. (2014) as a means to align text in the source and target languages and to relieve the encoder of the burden of encoding the full sentence into a single vector.
The Scaled Dot-Product Attention computes attention over Values ($V$) based on the relation between Keys ($K$) and Queries ($Q$). Formally, $Q$, $K$ and $V$ are three sequences of vectors represented on $d_k$ (for $Q$ and $K$) and $d_v$ (for $V$) dimensions. The attention layer computes:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$, and $n$ and $m$ are the lengths of the sequences. The magnitude of the dot product scales with the dimension $d_k$ of $Q$ and $K$, hence the division by $\sqrt{d_k}$, which keeps the result away from the low-gradient regions of the softmax function.
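To make the formula concrete, here is a minimal NumPy sketch of this computation (function and variable names are ours, not the paper's):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> output: (n, d_v)."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    # to stay away from the low-gradient regions of the softmax.
    scores = Q @ K.T / np.sqrt(d_k)                       # (n, m)
    # Row-wise softmax: each query gets a distribution over the keys.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ V                                    # (n, d_v)
```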
To better understand this formulation, let's consider two examples with two sequences $A$ and $B$:

*Figure: attention of $A$ over $B$.*

*Figure: self-attention over $A$.*
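In code, the two cases differ only in which sequence plays which role (reusing the sketch above; shapes are arbitrary toy values):

```python
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))   # sequence A: 5 tokens of dimension 8
B = rng.normal(size=(7, 8))   # sequence B: 7 tokens of dimension 8

# Attention of A over B: A provides the queries, B the keys and values.
cross = scaled_dot_product_attention(A, B, B)     # (5, 8)
# Self-attention over A: A plays all three roles at once.
self_att = scaled_dot_product_attention(A, A, A)  # (5, 8)
```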
The authors compare self-attention against CNNs and RNNs on three criteria: Complexity per Layer; Sequential Operations, the number of operations that must be performed in sequential order (the lower, the more parallelisable the layer); and Maximum Path Length, the number of intermediary neurons between the first and last words of a sentence (i.e. the capacity to take long-term dependencies into account).
| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
|---|---|---|---|
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k(n))$ |

where $n$ is the sequence length, $d$ the representation dimension and $k$ the kernel size of convolutions.
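As a rough sanity check (toy numbers of our choosing), plugging typical values into the complexities above shows why self-attention compares favourably as long as sentences are shorter than the representation dimension:

```python
n, d, k = 70, 512, 3  # toy sentence length, representation dimension, kernel size
print("self-attention:", n * n * d)      # O(n^2 * d)     -> 2,508,800
print("recurrent:     ", n * d * d)      # O(n * d^2)     -> 18,350,080
print("convolutional: ", k * n * d * d)  # O(k * n * d^2) -> 55,050,240
```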
Based on those observations, the authors argue that self-attention is extremely efficient at encoding and decoding text. It is cheap to compute whenever the sequence length $n$ is smaller than the representation dimension $d$, which is typically the case, and it is highly parallelisable. Moreover, as we said earlier, stacking self-attention layers allows the network to build complex relationships between words. Finally, long-term dependencies are taken into account just as much as short-term ones.
Instead of using one instance of the previous module at every layer, Multi-Head Attention runs multiple heads of smaller dimensions in parallel and then concatenates their outputs:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O \quad \text{with} \quad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

The projection matrices are the parameters of the network. For all $i$, $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$ and $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, while $W^O \in \mathbb{R}^{h d_v \times d_{model}}$. The authors choose $h = 8$ heads and $d_k = d_v = d_{model}/h = 64$. $d_{model} = 512$ is the dimensionality of the text representation throughout the network.
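Continuing the earlier sketch (again with names and initialisation of our choosing, not the authors' code), multi-head attention can be written as:

```python
def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k: (h, d_model, d_k); W_v: (h, d_model, d_v); W_o: (h*d_v, d_model)."""
    # Each head projects Q, K and V into a smaller subspace before attending.
    heads = [scaled_dot_product_attention(Q @ W_qi, K @ W_ki, V @ W_vi)
             for W_qi, W_ki, W_vi in zip(W_q, W_k, W_v)]
    # Concatenate the h heads and project back to d_model dimensions.
    return np.concatenate(heads, axis=-1) @ W_o

# Toy usage with the paper's sizes: h=8 heads, d_model=512, d_k=d_v=64.
h, d_model, d_k = 8, 512, 64
rng = np.random.default_rng(1)
X = rng.normal(size=(10, d_model))                  # self-attention over 10 tokens
W_q, W_k, W_v = (0.05 * rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = 0.05 * rng.normal(size=(h * d_k, d_model))
print(multi_head_attention(X, X, X, W_q, W_k, W_v, W_o).shape)  # (10, 512)
```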