# Step By Step Transformer

*Attention Is All You Need* (Vaswani et al., 2017)

Convolutional Neural Networks (CNNs) (Kim, 2014) represent documents as a collection of weighted n-grams. Convolutional layers are fast to compute as they fully leverage GPUs, but have no grammatical understanding of a document. On the other hand, Recurrent Neural Networks (RNNs) (Cho et al., 2014) are built to read text as a human would and to take long-term dependencies in text into account. However, they are by design not suited to GPU computing and are difficult to train consistently.

In the paper *Attention Is All You Need*, the authors propose using only dense layers and self-attention layers to learn representations.

## Scaled Dot-Product Attention

Attention was originally proposed by Bahdanau et al. (2014) as a means to align text between the source and target languages and to relieve the encoder of the burden of encoding the full sentence into a single vector.

The Scaled Dot-Product Attention computes attention over Values ($V$) based on the relation between Keys ($K$) and Queries ($Q$). Formally, $Q$, $K$ and $V$ are three sequences represented in $d_k$, $d_k$ and $d_v$ dimensions respectively. The attention layer computes:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$ and $V \in \mathbb{R}^{m \times d_v}$, with $n$ and $m$ the lengths of the sequences. The dot product $QK^T$ scales with the dimension of $K$, hence the division by $\sqrt{d_k}$, which keeps the result away from the low-gradient regions of the softmax function.
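As a sketch of this formula, here is a minimal NumPy implementation (the function name and the random test shapes are illustrative, not taken from the repository):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for Q: (n, d_k), K: (m, d_k), V: (m, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, m) query/key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d_v) weighted mix of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 5))
out = scaled_dot_product_attention(Q, K, V)
assert out.shape == (4, 5)  # one d_v-dimensional mix of values per query
```

Each output row is a convex combination of the rows of $V$, with weights given by the softmaxed similarity between the corresponding query and every key.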

To better understand this formulation, let's consider two examples with two sequences $X$ and $Y$:

- Attention of $Y$ over $X$ ($Q = Y$, $K = V = X$), as in the classical encoder-decoder framework
- Self-attention over $X$ ($Q = K = V = X$)
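The two cases differ only in which sequence plays each role. A minimal sketch, assuming random toy sequences (the helper and shapes are illustrative):

```python
import numpy as np

def attend(Q, K, V):
    # minimal scaled dot-product attention
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))  # source sequence: 6 tokens
Y = rng.normal(size=(4, 16))  # target sequence: 4 tokens

cross = attend(Y, X, X)  # attention of Y over X: one context vector per target token
auto  = attend(X, X, X)  # self-attention: every token of X attends to all of X
assert cross.shape == (4, 16) and auto.shape == (6, 16)
```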

## Self-attention

The authors compare self-attention against CNNs and RNNs on three criteria: Complexity per Layer; Sequential Operations, the number of operations that must be done in sequential order (the lower, the more parallelisable the layer); and Maximum Path Length, the number of intermediary neurons between the first and last words of a sentence (i.e. the capacity to take long-term dependencies into account).

| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
| --- | --- | --- | --- |
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k(n))$ |

where $n$ is the sequence length, $d$ the representation dimension and $k$ the convolution kernel width.
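A back-of-the-envelope comparison of the per-layer counts, using illustrative values of $n$, $d$ and $k$ (self-attention is cheaper than recurrence whenever $n < d$, a point the paper also makes):

```python
# Rough per-layer operation counts for one illustrative setting:
# sequence length n, representation size d, convolution kernel width k.
n, d, k = 128, 512, 3

self_attention = n * n * d      # O(n^2 * d)
recurrent      = n * d * d      # O(n * d^2)
convolutional  = k * n * d * d  # O(k * n * d^2)

# With n < d, self-attention does the fewest operations per layer.
assert self_attention < recurrent < convolutional
```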

Based on these observations, the authors argue that self-attention is extremely efficient for encoding and decoding text: it is cheap to compute and highly parallelisable. Moreover, as noted earlier, stacking self-attention layers builds complex relationships between words. Finally, long-term dependencies are taken into account just as much as short-term ones.

## Multi-Head Attention

Instead of using a single instance of the previous module at every layer, Multi-Head Attention runs multiple attention heads of smaller dimension in parallel and then concatenates their outputs:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O$$

where

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

The projection matrices $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$ and $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ for all $i$, together with $W^O \in \mathbb{R}^{hd_v \times d_{model}}$, are the parameters of the network. The authors choose $h = 8$ and $d_k = d_v = d_{model}/h = 64$, where $d_{model} = 512$ is the dimensionality of the text representation throughout the network.
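Putting the pieces together, a NumPy sketch of the layer (parameter initialisation and names are illustrative; in practice these matrices are learned):

```python
import numpy as np

def multi_head_attention(Q, K, V, heads, W_O):
    """Concat(head_1, ..., head_h) @ W_O, head_i = Attention(Q W_Qi, K W_Ki, V W_Vi)."""
    def attend(q, k, v):
        s = q @ k.T / np.sqrt(q.shape[-1])
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return w @ v
    outs = [attend(Q @ W_Q, K @ W_K, V @ W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(outs, axis=-1) @ W_O

d_model, h = 512, 8
d_k = d_v = d_model // h  # 64, as chosen by the authors
rng = np.random.default_rng(0)
# One (W_Q, W_K, W_V) triple per head, plus the output projection W_O.
heads = [tuple(0.02 * rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
W_O = 0.02 * rng.normal(size=(h * d_v, d_model))

X = rng.normal(size=(10, d_model))  # 10 tokens of dimension d_model
out = multi_head_attention(X, X, X, heads, W_O)
assert out.shape == (10, d_model)   # dimensionality is preserved across the layer
```

Because each head projects down to $d_k = d_{model}/h$, the total cost is comparable to a single full-dimension attention head, while letting different heads attend to different relations.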
