# Recent advancements: The evolution of word embeddings
## Problems and limitations of the first generation of word embeddings

Although the first generation of successful word embedding methods like Word2vec and GloVe produced vector representations for text tokens that proved to be hugely beneficial for most NLP tasks, they have some important limitations:

+ __Context independence:__ A word has only one representation: The word "bank" will have the same embedding in the sentences "I went to my bank to withdraw some money." and  "We explored the river bank.", even though the two meanings are clearly different.
+ __Words are black boxes:__ In reality, words have internal structure: They consist of characters, and from the linguistic point of view, they can be composed of several morphemes (stem, inflections, etc.). Nevertheless, the representation learning methods of Word2vec, GloVe etc. ignore this internal structure and treat words as black boxes. The only information they use to produce the representations is word-level distribution, regardless of internal connections, like having the same stem etc.
+ __No useful representation for unseen or rare words:__ Because words are treated as black boxes, these models cannot produce useful representations for words that do not occur or are very rare in the training corpus, as there is no (or not enough) contextual information about them.
+ __Good vocabulary coverage requires huge model dimensions:__ Again because of the black box word handling, a word can get a meaningful representation only if it's explicitly included in the vocabulary of the model. But the memory consumption is typically a linear function of the covered vocabulary (one of the dimensions of the embedding matrices is the vocabulary size).
+ __Language dependence:__ Since embeddings are intended to express the *semantics* of words, one would expect that the embeddings associated with words that express the same meaning in different languages should be identical or at least highly similar. Unfortunately, for embeddings trained -- as usual -- on monolingual corpora this is not the case.

# Modeling word structure: Subword embeddings

The "black box word" problem has an obvious solution: we should use features that reflect the tokens' internal structure. In practice, this means that words are represented in terms of all or some of the constituents (characters, morphemes, character *n*-grams etc.) which they are composed of. The most important approaches are the following:

## Morphological modeling

These methods rely on the morphological decomposition of words. Fortunately, semantically or syntactically similar words often share some common morphemes such as roots, affixes, and syllables. For example, "probably" and "probability" share the same root, i.e., "probab", as well as the same syllables, i.e., "pro" and "ba". Therefore, morphological information can provide valuable knowledge to bridge the gap between rare or unknown words and well-known words in learning word representations.

__Morpheme embeddings.__ The most typical approach is to segment words into morphemes, add the morphemes to the vocabulary, and use the morphemes together with words for the language modeling task. As a concrete example, Qiu et al. (2014, [Co-learning of Word Representations and Morpheme Representations](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/5BCOLING20145D20Morphological20Word20Embedding.pdf)) extend the standard word-level CBOW word2vec architecture with morphemes in the following way (figure from the paper):



<a href="http://drive.google.com/uc?export=view&id=1DMbOLsoWamcrBUwfHEDKYgPKRAvKUlwe"><img src="https://drive.google.com/uc?export=view&id=1Lru58DtrjYmqZyd73iOOdokOmvlGBq6F"></a>

The model learns embeddings for all words and morphemes in the vocabulary and the context is modeled with the weighted sum of the embeddings of words and morphemes in the context. The prediction target is also modified: in addition to the missing word, the model predicts the morphemes of the missing word as well.

__Morphological tagging.__ An alternative approach, which requires morphological tagging as opposed to morphological segmentation, is to add finegrained morphological tags to the input and to the predictive task. See e.g. [Cotterell and Schütze: Morphological Word-Embeddings (2015)](http://www.aclweb.org/anthology/N15-1140).

__Advantages and problems.__ One of the main advantages of these methods is that they can be used to generate useful embeddings for unknown/out-of-vocabulary words: E.g., with morpheme embeddings one can add the embeddings of the constituent morphemes of unknown words to generate a useful representation on the fly.

The main problem of the approach is that it requires a morphological segmenter/analyzer both for training and for the handling of unknown/OOV words, which seriously limits its usability, since most of these tools are vocabulary-based as well, if they exist for the target language at all.

## Character-level modeling
Character-level modeling of words has been used successfully for many typical NLP tasks, among others for
- language modeling,
- part-of-speech tagging, and
- machine translation.

One of its advantages is that for some tasks, e.g., for language modeling and machine translation, it makes word segmentation unnecessary.

The typical architecture for generating word embeddings with character-level word modeling uses learnt character embeddings together with the usual sequence embedding methods, i.e., CNNs or LSTM variants. The produced sequence embeddings in turn can be used as word embeddings for the downstream task. For language modeling this results in an architecture along the lines of:

<a href="http://drive.google.com/uc?export=view&id=1pfeatk5u7ZGVwhnMjOEfuTqjJ6DqXLT7"><img src="https://drive.google.com/uc?export=view&id=1RuC_DXqHZ-FVjgUNJ-yY2eLzmSfowtIO" width=500px></a>

(figure from [Kim et al.: Character-Aware Neural Language Models](https://nlp.seas.harvard.edu/slides/aaai16.pdf))

### Character-word hybrid in MT

An important variant, successfully used for machine translation, is the character-word hybrid developed by Luong and Manning in 2016 ([Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models](https://arxiv.org/pdf/1604.00788.pdf)), which uses traditional word embeddings for in-vocabulary words, but a character-level model for unknown ones:
<a href="http://drive.google.com/uc?export=view&id=19KY2XwDARZISs2CkHKO6a2HkolTP2tby"><img src="https://drive.google.com/uc?export=view&id=1ViIA1O_JbywJ4cyHaunkss239jhfJCYv" width=300px></a>
(figure from [Luong and Manning 2016](https://arxiv.org/pdf/1604.00788.pdf))

Their experiments showed that wholly character-level models performed better than exclusively word-level models, but took prohibitively long to train (4 times longer). The hybrid model's training, in contrast, was only 10-20% slower and it performed even better than the entirely character-based one.

## Bag of *n*-grams (FastText)


[FastText](https://fasttext.cc/) is a word embedding method developed at Facebook's AI research lab (see [Bojanowski et al: Enriching Word Vectors with Subword Information, 2016](http://aclweb.org/anthology/Q17-1010)) which tries to characterize words in terms of their constituent parts, but -- unlike the morpheme based solutions -- these parts are generated in an unsupervised manner from the corpus.

More concretely, in FastText words are modeled as bags of character *n*-grams. For instance, the word "apple" with *n*=3 would be associated with the bag of trigrams "$<$ap", "app", "ppl", "ple", "le$>$" (where the special characters "$<$" and "$>$" are added to indicate the start and end of the word). In practice, FastText does not restrict the *n*-grams to a single particular *n* value, but associates each word with a bag containing
* all *n*-grams in the word where $3\leq$*n*$\leq 6$
* plus the whole word as an *n*-gram, e.g., "$<$apple$>$".
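The bag construction described above can be sketched in a few lines. This is only an illustration of the idea, not FastText's actual implementation; the function name `char_ngrams` and its parameters are made up for this example.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the bag of character n-grams for `word` (illustrative sketch)."""
    marked = f"<{word}>"  # add the word-start and word-end markers
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    # Add the whole marked word as well (skip it if it is already an n-gram).
    if marked not in grams:
        grams.append(marked)
    return grams


# With n fixed to 3 we get the trigrams from the text plus the whole word.
trigram_bag = char_ngrams("apple", n_min=3, n_max=3)
print(trigram_bag)

# With the 3..6 range used in practice the bag is considerably larger.
full_bag = char_ngrams("apple")
print(len(full_bag), full_bag)
```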

Using this representation and assuming that each occurring *n*-gram has an associated embedding, words can be represented as the sum of the embeddings of the $n$-grams in their associated bag of *n*-grams. These word representations, in turn, can be plugged into the standard word2vec models (either CBOW or skipgram) to build a model that learns embeddings for the whole vocabulary consisting of the occurring *n*-grams.

Apart from embeddings of higher quality ([Bojanowski et al](http://aclweb.org/anthology/Q17-1010) report that distances between FastText embeddings correlate more with intuitive similarity judgments than those between standard word2vec embeddings), trained FastText models can easily provide embeddings for OOV words by simply adding together the embeddings of their constituent $n$-grams.

__The "hashing trick".__ An important practical problem with FastText-like approaches is the increase of the vocabulary size resulting from the addition of *n*-grams. FastText adopts the frequently used, so-called "hashing trick" to keep the effective vocabulary size at a manageable level: $n$-grams are hashed to integers in 1 to *K*, where *K* is chosen on the basis of the available memory. E.g., in the [Bojanowski et al](http://aclweb.org/anthology/Q17-1010) experiments *K* was set to 2 million.
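A minimal sketch of how hashing and summation fit together is given below. Several things here are assumptions made for illustration: the FNV-1a hash is used simply as a convenient deterministic hash function, the bucket indices run from 0 to *K*-1 rather than 1 to *K*, the table is tiny, and the bucket vectors are random instead of trained.

```python
import random

K = 1000  # number of hash buckets (assumed toy value; 2 million in the paper)
DIM = 4   # embedding dimension (assumed toy value)


def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of the boundary-marked word."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]


# One vector per bucket; random stand-ins for trained embeddings.
rng = random.Random(0)
bucket_vectors = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(K)]


def embed(word):
    """Sum the bucket vectors of the word's n-grams (also works for OOV words)."""
    total = [0.0] * DIM
    for gram in char_ngrams(word):
        vec = bucket_vectors[fnv1a_32(gram) % K]
        total = [t + v for t, v in zip(total, vec)]
    return total


print(embed("apple"))
print(embed("applesauce"))  # never "seen", but still gets a vector
```

Different *n*-grams may collide in the same bucket and then share a vector; this is the price paid for the fixed memory footprint.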

## Byte Pair Encoding (BPE) and co.

### What is BPE?

BPE was originally developed as a simple compression technique for byte sequences. The compression procedure consists of a series of 'merge' operations, in which the most frequent consecutive byte-pair in the sequence is replaced everywhere with one byte (which didn't occur in the sequence).

The technique can be generalized to any sequence consisting of a finite number of symbols (an 'alphabet'). To generate an encoded version of a text over an alphabet
1. initialize the symbol list with the characters in the alphabet plus a special word-start symbol ('_'),
2. repeatedly count all symbol pairs and replace each occurrence of the most frequent pair ('A', 'B') with a new 'AB' element, and add 'AB' to the list of symbols. (The merged pairs cannot contain '_' anywhere else than the first position.)
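The two steps above can be sketched as follows. This is a simplified illustration that works on a list of already separated words (so '_' can only ever appear in the first position); the function names, the toy corpus and the tie-breaking rule are assumptions of the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Step 1: every word starts as '_' followed by its characters.
    corpus = Counter(tuple("_" + w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Step 2a: count all adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Step 2b: pick the most frequent pair (ties broken lexicographically,
        # an arbitrary choice made here only to keep the result deterministic).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Step 2c: apply the merge everywhere in the corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(words, 3)
print(merges)
```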

> "The number of merge operations *o* determines if the resulting encoding mostly creates short character sequences (e.g. *o* =  1000) or if it includes symbols for many frequently occurring words, e.g. *o* = 30,000.  Since the BPE algorithm works with any sequence of symbols, it requires no preprocessing and can be applied to untokenized text."

BPE encodings of a sentence depending on the number of merge operations:

<a href="http://drive.google.com/uc?export=view&id=1KF3BHRCmI30B5Gu4elGWtIAaeSKOfKiC"><img src="https://drive.google.com/uc?export=view&id=1R9XtYoMQohYpOI9Xg_SvqsEUsFY8ijZH" width="700px"></a>

(Quote and table both from [Heinzerling and Strube: BPEmb: Tokenization-free Pre-trained Subword Embeddings (2018)](http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf))

Having created the BPE vocabulary (alphabet) by merging, the segmentation/encoding of a given input text is based on simply performing all possible merges in the order they were performed during training on the input. (This means that the compression is solely based on the training distribution, not on the input.)
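Replaying the learned merges on new input can be sketched like this. The merge list below is a hypothetical one written by hand for the example, and `bpe_encode` is an illustrative name, not the API of any particular library.

```python
def bpe_encode(word, merges):
    """Segment `word` by replaying `merges` in the order they were learned."""
    symbols = ["_"] + list(word)  # word-start marker followed by the characters
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols


# Hypothetical merge list, in training order.
merges = [("s", "t"), ("e", "st"), ("o", "w"), ("l", "ow"), ("_", "low")]

print(bpe_encode("lowest", merges))
print(bpe_encode("slow", merges))
```

Note that the segmentation of "slow" is decided entirely by the merges learned at training time: the pair ('_', 'low') cannot fire because '_' is followed by 's'.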

BPE-based subword decomposition was first used, and became the _de facto_ standard, in machine translation. 

### BPE and tokenization

As the quote above notes, one of the big advantages of BPE is that it can be used without pretokenization: BPE does not merge across white space and punctuation, but it involves no additional, language-dependent logic. For writing systems without spaces (e.g., Chinese), it simply merges character pairs between punctuation marks without relying on any notion of word boundary.


### BPE and morphological segmentation

Despite being fully unsupervised, for unknown or rare words BPE segmentation can be surprisingly close to morphological segmentation, e.g.



In [None]:
%%capture
! pip install bpemb
from bpemb import BPEmb
bpemb_en = BPEmb(lang="en", dim=50)
bpemb_de = BPEmb(lang="en", dim=50)

In [5]:
bpemb_en.encode("supercalifragilisticexpialidocious")

['▁super',
 'cal',
 'if',
 'ra',
 'g',
 'il',
 'ist',
 'ice',
 'x',
 'p',
 'ial',
 'id',
 'oc',
 'ious']

In [7]:
bpemb_en.encode("These neologisms were quite widespread.")

['▁these', '▁ne', 'olog', 'isms', '▁were', '▁quite', '▁widespread', '.']

In [18]:
bpemb_de.encode("unglaublich")

['▁un', 'g', 'la', 'ubl', 'ich']

###  Unigram language-model based subword segmentation

An interesting recent alternative to BPE is the so-called unigram language-model based subword segmentation, which is a probabilistic approach to choosing the best subword vocabulary of a given size: It tries to find the subword vocabulary which maximises the probability of the corpus assuming a unigram language model. See the paper: [Kudo (2018): Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates](https://arxiv.org/pdf/1804.10959.pdf) for details.
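To make the "unigram language model" assumption concrete: once a vocabulary with subword probabilities is fixed, the probability of a segmentation is the product of the probabilities of its pieces, and the best segmentation can be found with dynamic programming. The sketch below shows only this segmentation step, not the vocabulary selection itself; the toy vocabulary and its probabilities are invented for the example.

```python
import math


def best_segmentation(word, log_probs):
    """Most probable segmentation of `word` under a unigram subword model (Viterbi)."""
    n = len(word)
    best_score = [0.0] + [-math.inf] * n  # best log-probability of word[:i]
    back = [0] * (n + 1)                  # start index of the last piece of word[:i]
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("word cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the word to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1], best_score[n]


# Invented toy vocabulary with unigram probabilities.
probs = {"un": 0.2, "able": 0.2, "u": 0.1, "n": 0.1,
         "a": 0.1, "b": 0.1, "l": 0.1, "e": 0.1}
log_probs = {piece: math.log(p) for piece, p in probs.items()}

pieces, score = best_segmentation("unable", log_probs)
print(pieces, math.exp(score))
```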

### Subword regularization

An important aspect (and perhaps the main motivation) of the unigram language-model approach is that it makes it possible to tackle the problem that most words (sentences) have several alternative segmentations according to a subword vocabulary. BPE encodes deterministically in a greedy fashion, which has the disadvantage that DL-models using BPE do not explore and learn from these alternatives.

Since the unigram language model-based subword segmentation assigns probabilities to the alternatives, it is easy to sample them during training as a form of regularization, see the paper for details.
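The sampling idea can be illustrated with a brute-force sketch that enumerates every segmentation of a short word and draws one in proportion to its probability. The toy vocabulary, the function names and the `alpha` smoothing exponent are assumptions of this example; exhaustive enumeration is only feasible for very short inputs and is not how it would be done in practice.

```python
import random


def all_segmentations(word, probs):
    """Yield (pieces, probability) for every segmentation of `word` into vocabulary items."""
    if not word:
        yield [], 1.0
        return
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in probs:
            for rest, p_rest in all_segmentations(word[end:], probs):
                yield [piece] + rest, probs[piece] * p_rest


def sample_segmentation(word, probs, rng, alpha=1.0):
    """Draw one segmentation with probability proportional to P(segmentation) ** alpha."""
    candidates = list(all_segmentations(word, probs))
    weights = [p ** alpha for _, p in candidates]
    return rng.choices([pieces for pieces, _ in candidates], weights=weights, k=1)[0]


# Invented toy vocabulary with unigram probabilities.
probs = {"un": 0.2, "able": 0.2, "u": 0.1, "n": 0.1,
         "a": 0.1, "b": 0.1, "l": 0.1, "e": 0.1}

candidates = list(all_segmentations("unable", probs))
for pieces, p in sorted(candidates, key=lambda c: -c[1]):
    print(pieces, p)

rng = random.Random(0)
samples = [sample_segmentation("unable", probs, rng, alpha=0.5) for _ in range(5)]
print(samples)
```

A smaller `alpha` flattens the distribution, so the less probable segmentations are seen more often during training.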

Although BPE is not a probabilistic algorithm, a BPE-based subword-regularization variant, the so-called BPE-dropout, has been introduced very recently in the paper [Provilkov et al (2019): BPE-Dropout: Simple and Effective Subword Regularization](https://arxiv.org/pdf/1910.13267.pdf).


<a href="http://drive.google.com/uc?export=view&id=1eXGoMrDcDEsPL3OU_5eNpLQPUoocdVf7"><img src="https://drive.google.com/uc?export=view&id=1eXGoMrDcDEsPL3OU_5eNpLQPUoocdVf7" width="700px"></a>



(Figure from [Provilkov et al (2019): BPE-Dropout: Simple and Effective Subword Regularization](https://arxiv.org/pdf/1910.13267.pdf))



### BPEmb: BPE-based subword embeddings (2018)

BPEmb ([GitHub project page](https://github.com/bheinzerling/bpemb)) is a BPE-based subword embedding project which publishes embeddings for 275 languages, trained on Wikipedia. The embeddings are produced with GloVe.

In the experimental application (fine-grained entity typing) BPEmb embeddings were competitive with (in fact, slightly better than) FastText regarding performance, but required radically less memory -- "6 GB for FastText’s 3 million embeddings with dimension 300 vs 11 MB for 100k BPE embeddings with dimension 25." ([Heinzerling and Strube: BPEmb](http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf)).

An important lesson is that the best performing architectures did not simply calculate the sum or average of the subword embedding vectors of a token's constituent subwords to produce its fixed-length representation, but used more sophisticated, RNN and CNN-based sequence embedding methods.

### See also
+ The most widely used subword-segmentation implementation, supporting both BPE and unigram language-model based segmentation (with a nice Python interface) is Google's [SentencePiece](https://github.com/google/sentencepiece) library.

# Contextual embeddings, Part 1

<img src="https://miro.medium.com/max/907/1*p8QR4tJivDxDNQUObkJfJA.jpeg" width="500px">

The embedding approaches discussed so far share an important characteristic: they are all _static_, i.e., they map tokens having the same form to the same embedding vector, regardless of their context. One of the most important recent developments in NLP has been the appearance of __contextual embeddings__, which, in contrast, can vary the embedding from context to context, and, therefore, produce embeddings that are more useful for downstream tasks than static embeddings. The first important contextual embedding model was ELMo:

## ELMo (Embeddings from Language Models, 2018, Allen Institute for Artificial Intelligence)

<img src="https://upload.wikimedia.org/wikipedia/en/7/74/Elmo_from_Sesame_Street.gif">

Similarly to Word2Vec, ELMo ([Peters et al: Deep contextual word representations](https://arxiv.org/pdf/1802.05365.pdf)) learns the word representations via a language modeling task: The architecture starts by producing context-independent embeddings using character-level convolutions, and then uses forward and backward LSTM layers to predict the next/previous token via weight-shared softmax layers:

<a href="https://cdn-images-1.medium.com/max/1600/1*ko2Ut74J_oMxF4jSo1VnCg.png"><img src="https://drive.google.com/uc?export=view&id=1-C6qGphj-5K89ipDeeWNv8EpQCKWO0p1" width="700"></a>

To a first approximation, the context-dependent embeddings produced by the model are all the intermediate representations produced by the model, i.e., the context-independent character-level embeddings and those produced by the LSTM layers: if there are $n$ LSTM layers for each direction then altogether $2n + 1$ vectors are produced.

Although these vectors together can be considered the "full" ELMo representation, for actual downstream NLP tasks ELMo's creators suggest using not this very high-dimensional representation but a lower-dimensional combination of these vectors. Suggested solutions are

+ simply taking the concatenation of the output of the top LSTM layers (forward and backward), and
+ learning a task-specific linear combination of the ELMo representations on the supervised task.
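The second option can be sketched as a "scalar mix": softmax-normalized layer weights combine the layer representations of a token, and the result is multiplied by a scaling factor. In the sketch below the weights and the scale are fixed numbers, whereas in the real setting they would be learned together with the downstream model; it is also assumed that the forward and backward vectors of each layer have already been concatenated, so there is one vector per layer. All names are illustrative.

```python
import math


def scalar_mix(layer_vectors, raw_weights, gamma=1.0):
    """Return gamma * sum_j softmax(raw_weights)[j] * layer_vectors[j]."""
    # Softmax over the per-layer weights (shifted by the max for numerical stability).
    m = max(raw_weights)
    exps = [math.exp(w - m) for w in raw_weights]
    z = sum(exps)
    s = [e / z for e in exps]
    dim = len(layer_vectors[0])
    return [gamma * sum(s_j * vec[d] for s_j, vec in zip(s, layer_vectors))
            for d in range(dim)]


# Toy representations of ONE token at three layers (character CNN + 2 LSTM layers).
layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Equal raw weights give a plain average of the layers.
averaged = scalar_mix(layers, raw_weights=[0.0, 0.0, 0.0])
print(averaged)

# A large weight on the last layer makes the mix (almost) equal to that layer.
top_heavy = scalar_mix(layers, raw_weights=[0.0, 0.0, 50.0])
print(top_heavy)
```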

The main novelty is that ELMo, and all later contextual word embeddings, are to be used as __full-fledged context-dependent feature extractor modules__, as opposed to the simple lookup tables provided by the traditional static embeddings (although subword embeddings already challenged this formula to a certain extent).  

### Flair (Zalando, 2018)
An interesting, character-based variant of contextual embeddings is the ["flair embeddings"](https://research.zalando.com/welcome/mission/research-projects/flair-nlp/) by Zalando (2018). These are basically recurrent character-level language models (a forward and a backward language model) that do not respect word boundaries; the contextual embedding of a word is formed from the LSTM hidden states at the first and last character of the word (first character from the backward and last from the forward LM). They proved to be pretty useful in sequence tagging tasks, and shallow models using them are currently 2nd best (!) in NER and POS-tagging.

<a href="https://s3-eu-central-1.amazonaws.com/zalando-wp-zalando-research-production/2018/09/image1.png"><img src="https://drive.google.com/uc?export=view&id=1sdpgiJtCok8yPcXJovvIlMNmh7I_-hmY"></a>

([Image source](https://research.zalando.com/welcome/mission/research-projects/flair-nlp/))

# Architectural Interlude:  The rise of  Transformers

Although we have seen that attention mechanisms make it possible to process elaborate external memory structures, later research showed that attention mechanisms are extremely powerful in sequence modeling even without any external memory.


The __transformer__ is a powerful seq2seq encoder-decoder architecture which is built solely from "transformer modules" consisting of attention and feed-forward layers, without using RNNs. Nonetheless, in most NLP tasks (e.g., language modeling, translation, question answering etc.) transformer-based models have recently significantly outperformed the "more traditional" RNN-based encoder-decoders.

## Attention in general

The basic attention schema used in transformers can be described as follows: We want to "attend" to part(s) of a certain $\mathbf X=\langle \mathbf x_1,\dots,\mathbf x_n \rangle$ sequence of vectors (embeddings). In order to do that, we transform $\mathbf X$ into a sort of "key-value store" by calculating from $\mathbf X$

- a $\mathcal K(\mathbf X) = \mathbf K = \langle \mathbf k_1,\dots, \mathbf k_n \rangle$ sequence of key vectors, one for each $\mathbf x_i$, and
- a $\mathcal V(\mathbf X) = \mathbf V = \langle \mathbf v_1,\dots,\mathbf v_n \rangle$ sequence of value vectors, one for each $\mathbf x_i$,

plus we generate (not necessarily from $\mathbf X$) a $\mathbf Q = \langle \mathbf q_1,\dots,\mathbf q_m\rangle$ sequence of query vectors. Using these values, the "answer" to each $\mathbf q$ query can be calculated by

- first calculating a "relevance score" for each $\mathbf k_i$ key, which is simply the $\mathbf q \cdot \mathbf k_i$ dot product (in certain cases scaled by a constant),
- taking the $\langle s_1,\dots,s_n\rangle$ softmax of the scores, which forms a probability distribution over the value vectors;
- finally, calculating the answer as the weighted sum of the values:
$$ \sum_{i} s_i \mathbf v_i $$
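The three steps translate almost directly into NumPy. The sketch below handles all queries at once by stacking them into a matrix; the function names are illustrative, and the scaling constant (the square root of the key dimension, as in the transformer paper) is optional.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attend(Q, K, V, scale=True):
    """Q: (m, d) queries, K: (n, d) keys, V: (n, d_v) values.

    Returns the (m, d_v) answers and the (m, n) attention weights.
    """
    scores = Q @ K.T                        # relevance score q . k_i for every pair
    if scale:
        scores = scores / np.sqrt(K.shape[-1])
    weights = softmax(scores)               # one distribution over the values per query
    return weights @ V, weights             # weighted sums of the value vectors


K = np.array([[10.0, 0.0], [0.0, 10.0]])
V = np.array([[1.0, 0.0], [0.0, 1.0]])
Q = np.array([[10.0, 0.0]])                 # a query that matches the first key

answer, weights = attend(Q, K, V)
print(weights)  # almost all weight on the first key
print(answer)   # therefore almost exactly the first value vector
```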

##  Attention as a layer

How can the above attention mechanism be used as a _layer_ in a network with an input sequence $\mathbf I = \langle \mathbf i_1,\dots, \mathbf i_n\rangle$, where the $\mathbf i_j$s are themselves vectors (embeddings)? The transformer solution is to calculate a query from each input:

$$
\mathbf Q = \mathcal Q(\mathbf I) = \langle \mathcal Q(\mathbf i_1),\dots,\mathcal Q(\mathbf i_n)\rangle 
$$
then use these queries to attend to a sequence of vectors, and simply output the calculated answers.

The transformer uses two attention-layer variants, which differ only in what they attend to:

- __Self-attention__ layers attend (unsurprisingly) to themselves, while, in contrast 
- __Encoder-decoder attention__ layers, used in the decoder, attend to the output of the encoder.

## Self-attention

In a transformer self-attention layer, both the source of the queries and the target of the attention are the input embeddings. The mappings for queries, keys and values are learned projections:

<a href="http://jalammar.github.io/images/t/self-attention-matrix-calculation.png"><img src="https://drive.google.com/uc?export=view&id=1fYqDpEpgejnUavIBanfhHgbdVp_WNVNY" width="400px"></a>

(image source: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/))

## Multi-headed attention

In order to be able to attend to different features on the basis of different queries, the transformer attention layers work with multiple learned query, key and value projections, which are collectively called "attention heads":

> "Multi-head attention allows the model to jointly attend to information from different representation
subspaces at different positions." 

([Attention is all you need](https://arxiv.org/abs/1706.03762))

<a href="http://jalammar.github.io/images/t/transformer_attention_heads_qkv.png"><img src="https://drive.google.com/uc?export=view&id=1zB0CP-GlenMj346g1UlryVe5SOB2cFYy" width="800"></a>
(image source: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/))

The outputs are collected for each head separately:

<a href="http://jalammar.github.io/images/t/transformer_attention_heads_z.png"><img src="https://drive.google.com/uc?export=view&id=15Eg3HEgeiUiW8YsaeExqD6Ra5IFo-7IA" width="800"></a>

(image source: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/))

concatenated, and, finally, projected back by another learned weight matrix into the basic model embedding dimension:

<a href="http://jalammar.github.io/images/t/transformer_multi-headed_self-attention-recap.png"><img src="https://drive.google.com/uc?export=view&id=190H1uu8SySbF0Yr3VNGjisWxagygjcxm" width="800"></a>

(In the original "Attention is all you need" paper the model embedding dimension is 512, there are 8 attention heads and the query, key and value vectors are all 512/8 = 64 dimensional.)
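The whole multi-headed self-attention computation can be sketched in NumPy as follows, using the dimensions just mentioned. The weights are random stand-ins for learned parameters, and the per-head projection matrices are stored side by side in single 512 x 512 matrices, which is equivalent to keeping 8 separate 512 x 64 matrices. The function and variable names are illustrative.

```python
import numpy as np


def softmax(x):
    """Softmax over the last axis, shifted by the maximum for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model) input embeddings; returns an (n, d_model) output."""
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, n, n)
    heads = softmax(scores) @ V                           # (num_heads, n, d_head)
    # Concatenate the heads and project back to the model dimension.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o


rng = np.random.default_rng(0)
n, d_model, num_heads = 10, 512, 8
X = rng.normal(size=(n, d_model))
# Random stand-ins for the learned projection matrices.
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * d_model ** -0.5
                      for _ in range(4))

out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, d_model // num_heads)
```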

## Transformer modules
Similarly to most CNN architectures, transformers are built up from identical modules, which consist of two main components: one or two multi-headed attention layers and a position-wise feed-forward network with one hidden layer whose dimensionality is larger than the model's basic embedding dimension (2048 in the original paper). The attention and feed-forward layers are wrapped in residual (skip) connections, and are normalized with layer norm. Two types of modules are used.

The modules in the encoder contain only self-attention:

<a href="http://jalammar.github.io/images/t/transformer_resideual_layer_norm_2.png"><img src="https://drive.google.com/uc?export=view&id=14sg-hZGrA6SF1ZadZPLKCd0yOIKu2Tcb" width="550"></a>

(image source: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/))

While the modules in the decoder also contain an "outward attention" layer attending to the output of the encoder:

<a href="https://lilianweng.github.io/lil-log/assets/images/transformer-decoder.png"><img src="https://drive.google.com/uc?export=view&id=1e8sNfkgD_qvTKZy_fzBx6r06cpPgmat2" width="400"></a>

(image source: [Attention? Attention!](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html))
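The feed-forward component that both module types share can be sketched as below, together with its residual connection and layer normalization. This is a simplification: the layer norm here has no learned gain and bias, and the weights are random stand-ins. The names are illustrative; the dimensions (512 and 2048) are the ones quoted above.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize every position (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def feed_forward(x, W1, b1, W2, b2):
    """Position-wise network: the same two linear maps (ReLU in between) for every row."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2


def feed_forward_sublayer(x, W1, b1, W2, b2):
    """Residual (skip) connection around the network, followed by layer norm."""
    return layer_norm(x + feed_forward(x, W1, b1, W2, b2))


rng = np.random.default_rng(0)
n, d_model, d_ff = 3, 512, 2048
x = rng.normal(size=(n, d_model))
# Random stand-ins for learned parameters.
W1 = rng.normal(size=(d_model, d_ff)) * d_model ** -0.5
W2 = rng.normal(size=(d_ff, d_model)) * d_ff ** -0.5
b1, b2 = np.zeros(d_ff), np.zeros(d_model)

y = feed_forward_sublayer(x, W1, b1, W2, b2)
print(y.shape)
```

Because the output has the same shape as the input, modules of this kind can be stacked on top of each other without any adapters.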

## Encoder-decoder architecture

The full encoder-decoder architecture has the following structure:

<a href="http://nlp.seas.harvard.edu/images/the-annotated-transformer_14_0.png"><img src="https://drive.google.com/uc?export=view&id=1ysJosUbOAOmYBzE6XRLrNWlFSU7-o1fh" width="400"></a>

(image source: [The annotated transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html))

Similarly to other (e.g., RNN-based) seq2seq architectures, the decoder part takes the previous outputs as input. In order to prevent access to information from "future outputs", the self-attention layers in the decoder use "masked attention", i.e., for each position, positions to the right are forced to have $-\infty$ input relevance score in the self-attention softmax layer.
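The masking step can be sketched in NumPy as follows: the scores of all positions to the right of the current one are set to $-\infty$ before the softmax, so they receive exactly zero weight. The function name is illustrative.

```python
import numpy as np


def causal_attention_weights(scores):
    """scores: (n, n) relevance scores; row i holds the scores of query position i."""
    n = scores.shape[0]
    # True strictly above the diagonal, i.e. for "future" positions.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Softmax; the diagonal is never masked, so every row has a finite maximum.
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)
    return e / e.sum(axis=-1, keepdims=True)


# With identical scores everywhere, each position attends uniformly
# to itself and to the positions on its left.
weights = causal_attention_weights(np.zeros((4, 4)))
print(weights)
```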

The following animations show the whole transformer seq2seq architecture in action in a translation task:

<a href="http://jalammar.github.io/images/t/transformer_decoding_1.gif"><img src="https://drive.google.com/uc?export=view&id=1oT-y2wQS8MLxzEj7umw1MZo5rDTD4uui"></a>

<a href="http://jalammar.github.io/images/t/transformer_decoding_2.gif"><img src="https://drive.google.com/uc?export=view&id=15XWP6IfFFUK7V9B_1D_QrSVeDOv8JmSP"></a>

(image source: [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/))

## Further reading

+ The original transformer paper: [Attention is all you need](https://arxiv.org/abs/1706.03762)
+ A highly readable, illustrated dissection on which this discussion drew: [The illustrated transformer](http://jalammar.github.io/illustrated-transformer/)
+ An annotated version of the original paper with implementation in Pytorch: [The annotated transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
+ For those interested in the bleeding/leading edge developments in transformer architecture research: Google used evolutionary neural architecture search (NAS) to find improved transformer variants. See their [blog post](https://ai.googleblog.com/2019/06/applying-automl-to-transformer.html) and [paper](https://arxiv.org/abs/1901.11117v2) for details.

# Contextual embeddings, Part 2: Transformer-based architectures

The first task in which the transformer architecture was used was translation (2017), but it was obvious that there was a large range of other potential application areas. Recently (starting from 2018), a series of transformer-based architectures producing contextual embeddings were developed.

## GPT (Generative Pre-Training, OpenAI, 2018)
Paper: [Radford et al: "Improving language understanding by generative pre-training."](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)

GPT is a BPE-based, _decoder only_ transformer model trained with a traditional "predict the next token" language modeling objective. The contextual embeddings are simply the outputs of the top transformer module.

Similarly to ELMo, the main goal of GPT is to provide a useful pretrained "feature extractor" module, which can be fine-tuned for supervised NLP tasks. In most cases it's enough to add single linear layer(s) to get models with state-of-the-art performance (structured inputs are transformed into token sequences with special delimiter tokens ["delim"]):

<a href="https://www.topbots.com/wp-content/uploads/2019/04/cover_GPT_web-1280x640.jpg"><img src="https://drive.google.com/uc?export=view&id=1KLxLHvYM75UbtJSMMIjaNyDrnf2Xk2Tx" width="800"></a>

(The figure is from the original paper.) 

## BERT (Bidirectional Encoder Representations  from Transformers, Google, 2018)

<img src="https://a1cf74336522e87f135f-2f21ace9a6cf0052456644b80fa06d4f.ssl.cf2.rackcdn.com/images/characters_opt/p-ses-bert.jpg" width="200px">

Paper: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

Like GPT, BERT is a transformer-based model (pre)trained with a "broadly language modeling" objective in order to produce useful contextual embeddings from texts. However, in contrast to GPT, BERT's task is not one- but bidirectional, and, therefore, it uses a transformer encoder architecture instead of a decoder.

BERT's main novelty is the modeling task: BERT is actually pretrained on two tasks:

### Task 1: Masked language model

The input is not a left context, but a "full" chunk of text (typically, several sentences), in which approximately 15% of the tokens (remember, these can be subwords!) are masked. The model's task is to predict the missing tokens:

<a href="https://www.lyrn.ai/wp-content/uploads/2018/11/MLM.png"><img src="https://drive.google.com/uc?export=view&id=1RMiGozsSlYbfHUu9QhZbv_8y6w_3jwPv" width="700"></a>

Initially, the masked tokens were selected completely at random, but it turned out that it is "too easy" to predict the missing subwords of a word if the rest of its subwords are known, so (in May 2019) "whole word masking" was introduced, which guarantees that all tokens belonging to a selected word are masked together (see the README.md at https://github.com/google-research/bert).

(Image source: [Horev: BERT – State of the Art Language Model for NLP](https://www.lyrn.ai/2018/11/07/explained-bert-state-of-the-art-language-model-for-nlp/))
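Whole word masking can be sketched as follows. The sketch assumes WordPiece-style tokens in which a "##" prefix marks the continuation of a word, and it always substitutes `[MASK]` for the selected tokens, which simplifies the actual pretraining procedure. The function name and the way the masking budget is rounded are assumptions of the example.

```python
import random


def whole_word_mask(tokens, mask_rate=0.15, rng=None, mask_token="[MASK]"):
    """Mask whole words until roughly `mask_rate` of the tokens are masked.

    Returns the masked token list and a dict {position: original token}
    containing the prediction targets.
    """
    rng = rng or random.Random()
    # Group token positions into words: "##" marks a continuation piece.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    budget = max(1, round(mask_rate * len(tokens)))
    rng.shuffle(words)
    masked, labels = list(tokens), {}
    for word in words:
        if len(labels) >= budget:
            break
        for i in word:  # mask every piece of the selected word
            labels[i] = tokens[i]
            masked[i] = mask_token
    return masked, labels


tokens = ["the", "phil", "##am", "##mon", "played", "well"]
masked, labels = whole_word_mask(tokens, rng=random.Random(0))
print(masked)
print(labels)
```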

### Task 2: Next sentence prediction (NSP)

The model is also trained on a binary classification task, in which the model decides whether two text segments follow each other immediately or not. (In half of the training data the input segment pairs are consecutive, in the other half the second segment is randomly sampled from the rest of the corpus.) The input consists of the two segments separated by a special separator token, and the fact that an input token belongs to the first or second segment is also encoded by two "segment embeddings":

<a href="https://miro.medium.com/max/1400/1*AcHwFPeBhMABqmUfxLAEUg.png"><img src="https://drive.google.com/uc?export=view&id=1v2Lf8G86-ejgem5oqsC2MEUwrvIpvurE" width="850"></a> 

(Figure from the original paper.)

It is worth noting that the input always contains a special [CLS] (from "classification") token, whose embedding is intended to represent/embed the entire input segment. To achieve this, the NSP classifier network's output layer is connected only to the embedding of this token. 
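The construction of the NSP training examples and of the corresponding model input can be sketched as follows. The helper names are hypothetical, and sampling the negative segment from the same short list of segments is a simplification made for the example.

```python
import random


def make_nsp_example(segments, index, rng):
    """Return (first, second, is_next) for the segment at `index`."""
    first = segments[index]
    if rng.random() < 0.5:
        return first, segments[index + 1], True
    # Negative example: any segment other than the first one and its true successor.
    candidates = [s for j, s in enumerate(segments) if j not in (index, index + 1)]
    return first, rng.choice(candidates), False


def build_pair_input(tokens_a, tokens_b):
    """[CLS] A [SEP] B [SEP], with segment id 0 for the first part and 1 for the second."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


segments = [["my", "dog", "is", "cute"], ["he", "likes", "playing"],
            ["the", "weather", "is", "nice"], ["we", "went", "out"]]

rng = random.Random(0)
first, second, is_next = make_nsp_example(segments, 0, rng)
tokens, segment_ids = build_pair_input(first, second)
print(tokens)
print(segment_ids)
print(is_next)
```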

### Fine-tuning for supervised tasks

Similarly to GPT, the addition of a single linear layer is sufficient for most supervised downstream tasks, and BERT developers simply used end-to-end training of all weights.

# Latest developments

For better or worse, the development of transformer-based, pretrained language models producing contextual embeddings did not stop; in fact, it is apparently becoming an industry. Since BERT, the following notable models and results were published, among others:


## RoBERTa (Facebook, 2019)

<img src="https://miro.medium.com/max/280/0*cce_546B_ES53WKs">

Paper: [RoBERTa: A Robustly Optimized BERT Pretraining Approach](https://arxiv.org/pdf/1907.11692.pdf)

This is, in essence, a strongly improved/tuned BERT: BERT's results were improved by changes such as

- Removing the next sentence objective (!)
- Radically increasing the data set size
- Increasing batch size and learning rate

## ERNIE (Baidu, 2019)

<img src="https://static.wikia.nocookie.net/muppet/images/2/2e/ErnieFullFigure-NEW.jpg/revision/latest/smart/width/200/height/200?cb=20100901224144">

Paper: [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/pdf/1904.09223.pdf)

The main innovation here is the introduction of two "knowledge-based" pretraining tasks to the original BERT objective (model is the same as BERT):

- Guessing random masked phrases (as opposed to random masked tokens),
- Guessing random masked named entities.

On certain tasks (e.g., dialog LM), the added tasks achieved a 1% improvement compared to the BERT objective.

## (the other) ERNIE (Tsinghua University, Huawei, 2019)

Paper: [ERNIE: Enhanced Language Representation with Informative Entities](https://arxiv.org/pdf/1905.07129.pdf)

A more radical take on the idea of using entity knowledge to produce contextual token embeddings: this model uses entity-based masked tasks for pretraining as well, and in addition explicitly fuses token embeddings and entity embeddings (figure from the paper):

<img src="https://storage.googleapis.com/groundai-web-prod/media/users/user_5084/project_365282/images/x2.png" width="800px">

As a consequence, it can also be considered a knowledge base/ontology--word vector aligner, but done in a context-dependent way. The results show a 1-2% improvement over BERT in several tasks.

## Transformer-XL (CMU--Google Brain, 2019)

Paper: [Dai, Zihang, et al. "Transformer-xl: Attentive language models beyond a fixed-length context."](https://arxiv.org/abs/1901.02860)

In order to address the relatively limited context size of transformer-based LMs, Transformer-XL introduces a kind of recurrence mechanism: when processing a sequence,

> the hidden state sequence computed for the previous segment is fixed and cached to be reused as an extended context when the model processes the next new segment. (From the paper)

<a href="https://2.bp.blogspot.com/--MRVzjIXx5I/XFCm-nmEDcI/AAAAAAAADuM/HoS7BQOmvrQyk833pMVHlEbdq_s_mXT2QCLcBGAs/s1600/GIF2.gif"><img src="https://drive.google.com/uc?export=view&id=1yUnTJLoXg2I7tw_SyninWRP19kpjVbz_"></a>

(Figure source: [Transformer-XL: Unleashing the Potential of Attention Models (Google AI Blog)](https://ai.googleblog.com/2019/01/transformer-xl-unleashing-potential-of.html))

Crucially, gradients are calculated only for the current segment! 

Adding the recurrence mechanism required a change in the positional encodings as well (the model couldn't distinguish tokens in identical positions of different segments), so in Transformer-XL the "traditional" absolute positional encodings are replaced with relative ones, which represent the relative distance between query and key positions during attention lookups.
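The recurrence and the relative distances can be sketched for a single attention head with NumPy. This is a strongly simplified illustration under several assumptions: one layer only, random stand-in weights, no learned relative position terms, and hypothetical function names. It shows only how the cached segment extends the keys and values, and how query-key distances are defined across the segment boundary.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def relative_distances(seg_len, mem_len):
    """distance[i, j] = (position of query i) - (position of key j) over [memory; segment]."""
    q_pos = np.arange(seg_len)[:, None] + mem_len
    k_pos = np.arange(mem_len + seg_len)[None, :]
    return q_pos - k_pos

def attend_with_memory(segment, memory, w_q, w_k, w_v):
    """Causal single-head attention; queries come from the current segment only."""
    context = segment if memory is None else np.concatenate([memory, segment], axis=0)
    q, k, v = segment @ w_q, context @ w_k, context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mem_len = context.shape[0] - segment.shape[0]
    # Negative distances point to future tokens and are masked out
    visible = relative_distances(segment.shape[0], mem_len) >= 0
    return softmax(np.where(visible, scores, -np.inf)) @ v

rng = np.random.default_rng(0)
d_model, seg_len = 8, 4
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
sequence = rng.normal(size=(3 * seg_len, d_model))

memory, outputs = None, []
for start in range(0, len(sequence), seg_len):
    segment = sequence[start:start + seg_len]
    outputs.append(attend_with_memory(segment, memory, w_q, w_k, w_v))
    # Cache the segment as a constant; in a real framework no gradient flows into it
    memory = segment.copy()
outputs = np.concatenate(outputs)
print(outputs.shape)
```

Because the distance matrix depends only on offsets between queries and keys, the same encoding applies to every segment, which is what the switch from absolute to relative encodings achieves.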

## GPT-2 (OpenAI, 2019)

Paper: [Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI Blog 1.8 (2019).](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)

GPT-2 is a larger variant of GPT. The developers made a few architectural improvements:
+ layer normalization is moved to the input of each block, and
+ "an additional layer normalization was added after the final self-attention block" (from the paper)

but the most important change was in the size of the model: for the largest model, the embedding dimensionality was increased to 1600 (from 768), the context size to 1024 (from 512), the vocabulary size to ~50K (from 40K), and the number of blocks to 48 (from 12).

(In)famously, the largest trained GPT-2 model was claimed to be so powerful (especially in text generation) that OpenAI decided not to release it to the public -- see https://openai.com/blog/better-language-models/ for some details.

## T5 (Google, 2019): Formulating all tasks as text-to-text problems

Paper: [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/pdf/1910.10683.pdf)

An interesting trend in the theoretically motivated studies of the performance effects of architecture parameters and pretraining is the attempt to unify the possible downstream tasks as much as possible. The T5 approach is to use a complete encoder-decoder transformer architecture, and to formulate all tasks as text-to-text problems (figure from the paper):

<img src="https://storage.googleapis.com/groundai-web-prod/media%2Fusers%2Fuser_11850%2Fproject_395412%2Fimages%2Fx1.png">

Even the masked pretraining objective is formulated this way (figure from the paper):

<img src="https://storage.googleapis.com/groundai-web-prod/media%2Fusers%2Fuser_11850%2Fproject_395412%2Fimages%2Fx2.png">
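A minimal sketch of the text-to-text formulation follows: a task prefix turns a supervised example into plain input text, and masked spans are replaced by sentinel tokens that reappear in the target. The function names and the `<extra_id_N>` sentinel spelling are illustrative assumptions (the paper's figure writes the sentinels as `<X>`, `<Y>`, `<Z>`). The spans are given explicitly here rather than sampled at random.

```python
def as_text_to_text(task_prefix, text):
    """Encode the task in the input text itself."""
    return f"{task_prefix}: {text}"

def span_corrupt(tokens, spans):
    """Replace each (start, end) span (end exclusive, sorted, non-overlapping)
    with a sentinel; the target lists each sentinel followed by the dropped tokens."""
    inputs, targets, cursor = [], [], 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{n}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return inputs, targets

print(as_text_to_text("translate English to German", "That is good."))

tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(corrupted))  # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(target))     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Both the supervised tasks and the pretraining objective thus reduce to mapping one string to another, so a single encoder-decoder model and loss can be used for all of them.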

## Completeness results for transformers (2019)

It has been proven that transformers are

- universal approximators for sequence-to-sequence functions: [Yun et al. 2019](https://arxiv.org/pdf/1912.10077)
- Turing complete (assuming arbitrary precision computations): [Pérez et al. 2019](https://arxiv.org/abs/1901.03429)

## Big Bird (Google, 2020) and the quest for sparse attention

<img src="https://cdn.vox-cdn.com/thumbor/5ySM42afkM_G1fk2sL-a5aHxQi0=/0x94:1494x1090/1400x1400/filters:focal(0x94:1494x1090):format(jpeg)/cdn.vox-cdn.com/uploads/chorus_image/image/45686844/big_bird_half.0.0.jpg" width="200px">

Paper: [Big Bird:  Transformers for Longer Sequences](https://arxiv.org/pdf/2007.14062.pdf)

One of the most important features of the original transformer modules is the use of __full__ (self) attention: each query attends to the full range of embeddings. This implies that the number of dot products to be calculated is quadratic in the number of input tokens, which has a huge impact on the memory requirements of training and inference.

A lot of research effort went into developing sparse attention variants that reduce the memory requirements without hurting performance. Currently, the attention variant used for Google Research's Big Bird architecture is the most sophisticated and theoretically best motivated. It combines three types of attention patterns into a sparse attention pattern whose memory requirements are __linear__ with respect to the number of input tokens:

<img src="https://miro.medium.com/max/816/1*4iAjyRtn65NAP-Oxm_PwLg.png">

(Figure from the paper)

- "A set of $g$ global tokens that attend on all parts of the sequence.
- For each query, a set of $r$ random keys that each query will attend to.
- A block of local neighbors $w$ so that each node attends on their local structure." (from the paper)

This sparse attention structure makes it possible to increase the number of input tokens significantly without a significant change in memory requirements.

The paper also shows that the class of transformers with these sparse attention mechanisms is still a universal approximator and Turing complete, similarly to the original transformers with full attention.
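The combination of global, local and random attention can be illustrated by building the corresponding boolean mask and counting its entries. This is only a sketch under simplifying assumptions: the function name and parameter values are hypothetical, the global tokens are simply the first $g$ positions, and the mask is materialized densely just for counting (an efficient implementation would never store the full matrix).

```python
import numpy as np

def bigbird_mask(seq_len, num_global=2, window=3, num_random=2, seed=0):
    """mask[i, j] is True if query i may attend to key j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    # Global tokens attend to everything and are attended to by every query
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    half = window // 2
    for i in range(seq_len):
        # Local neighbors: a sliding window centered on the query position
        mask[i, max(0, i - half): i + half + 1] = True
        # Random keys for this query
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True
    return mask

for n in (256, 512, 1024):
    sparse = int(bigbird_mask(n).sum())
    print(f"n={n:5d}  sparse entries={sparse:7d}  full attention={n * n:8d}")
```

Each non-global query attends to at most $g + w + r$ keys, so the number of attended pairs grows linearly with the sequence length, while the full attention count grows quadratically.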

## GPT-3 (OpenAI, 2020)

<img src="https://static.wikia.nocookie.net/muppet/images/5/51/Sweetumstms.jpg/revision/latest/scale-to-width-down/300?cb=20110426195725">

Paper: [Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165.pdf)

### Model architecture

On the architectural and pretraining front there is no large change: besides "using alternating dense and locally banded sparse attention patterns in the layers of the transformer" (from the paper), GPT-3 (the biggest) is "just" a scaled up version of GPT-2:

- the model's basic embedding dimensionality, $d_{model}$, is increased to 12288 (!!)
- context size ("input width") to 2048
- number of heads to 96
- number of blocks to 96
- the model has 175 billion parameters (!)
- and it was trained on a 300-billion-token corpus, which is actually smaller than some of the available corpora (~1000 billion tokens) on which smaller models were trained, so training is not necessarily longer
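The listed dimensions are roughly consistent with the parameter count, as a back-of-the-envelope calculation shows. The sketch rests on assumptions not stated above: each block has about $12 \cdot d_{model}^2$ weights ($4 d_{model}^2$ for the attention projections plus $8 d_{model}^2$ for a feed-forward layer with a $4 d_{model}$ hidden size), biases and layer normalization are ignored, and the vocabulary has 50,257 entries. The function name is hypothetical.

```python
def approx_decoder_params(d_model, n_layers, vocab_size, context_size):
    """Rough parameter count of a decoder-only transformer."""
    per_block = 12 * d_model ** 2                        # attention + feed-forward weights
    embeddings = (vocab_size + context_size) * d_model   # token + position embeddings
    return n_layers * per_block + embeddings

gpt2_largest = approx_decoder_params(d_model=1600, n_layers=48,
                                     vocab_size=50257, context_size=1024)
gpt3_largest = approx_decoder_params(d_model=12288, n_layers=96,
                                     vocab_size=50257, context_size=2048)
print(f"GPT-2 (largest): {gpt2_largest / 1e9:.2f}B parameters")
print(f"GPT-3 (largest): {gpt3_largest / 1e9:.2f}B parameters")
```

The estimate gives roughly 1.56 billion parameters for the largest GPT-2 and roughly 174.6 billion for the largest GPT-3, which agrees with the 175 billion figure. Almost all of the weights sit in the transformer blocks, not in the embeddings.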

### 0-, 1- and few-shot learning


In contrast to previous studies, the GPT-3 experiments focused on avoiding fine-tuning on downstream tasks and on solving them instead by 0-, 1- or few-shot learning. The motivation is the following problems with fine-tuning:

- it still requires a sizable supervised data set -- a few hundred or even thousands of data points
- because of the size of the pretrained models, it's hard to avoid overfitting: "poor generalization out-of-distribution, and the potential to exploit spurious features of the training data, potentially resulting in an unfair comparison with human performance" (from the paper).

Consequently, the GPT-3 work is focused on task-agnostic performance __without updating the weights at all__, and the authors also want to examine the claim that simply increasing the model size doesn't bring significant performance improvements.

A few examples of the formulation of the learning tasks (figure from the paper):

<img src="https://anantja.in/static/61c08ded6425a6082cf158eb56710833/09e48/learning-modes.png" width="750px">
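The 0-, 1- and few-shot settings differ only in how many solved examples are placed in the prompt before the query; the model weights are not touched. A minimal sketch of assembling such prompts, using a hypothetical helper and a separator modeled on the translation example in the paper's figure:

```python
def build_prompt(task_description, examples, query, separator=" => "):
    """Assemble an in-context learning prompt: description, K solved examples, open query."""
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source}{separator}{target}")
    # The query is left open; the model is expected to continue the text
    lines.append(f"{query}{separator}".rstrip())
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

zero_shot = build_prompt("Translate English to French:", [], "cheese")
one_shot = build_prompt("Translate English to French:", demos[:1], "cheese")
few_shot = build_prompt("Translate English to French:", demos, "cheese")
print(few_shot)
```

The prompt is then fed to the language model as ordinary input text, and the generated continuation is read off as the answer.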

### Results
The main finding is that increasing model size made it possible to produce competitive results in several tasks (various types of language modeling, question answering, simple calculations etc.) without fine-tuning. For example, on the long-distance dependency language modeling task LAMBADA, GPT-3 improved the SOTA by 18% using few-shot learning.

Title- and subtitle-driven news article generation also improved drastically: humans couldn't really distinguish human- and machine-produced texts.

Finally, GPT-3 achieved results close to the fine-tuning SOTA using few-shot learning in a number of broadly inference-based tasks, such as question answering, pronoun resolution etc.

# Transformers in a traditional pipeline: spaCy 3.0

How can pretrained transformer models be integrated in a library providing the elements of the traditional NLP pipeline? The main challenges are 

- to make the large context-dependent embedder modules sharable between individual pipeline components; 
- to provide the required "glue" between subword embeddings and the token-level embeddings required for the downstream NLP tasks.

The forthcoming spaCy 3.0 will have solutions for both problems:

- the context-dependent embedder models (e.g., transformers) are encapsulated in standardized token-to-vector components, and
- these components can be shared by pipeline elements using listeners: 

<img src="https://nightly.spacy.io/tok2vec-listener-8c4d53807708b270c07a085f4a2da75f.svg" width="600px">

(Image from the [spaCy 3.0/nightly documentation](https://nightly.spacy.io/usage/embeddings-transformers))
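The "glue" between subword and token-level embeddings can be sketched as a pooling step over an alignment between the two tokenizations. The sketch below uses mean pooling with a hypothetical helper function and random stand-in vectors. It is meant as one possible strategy for illustration, not as a description of spaCy's actual implementation.

```python
import numpy as np

def pool_subwords(subword_vectors, alignment):
    """Average the subword vectors aligned to each pipeline token.

    alignment[i] lists the row indices of the subwords belonging to token i.
    """
    return np.stack([subword_vectors[idx].mean(axis=0) for idx in alignment])

# Pipeline tokens:  ["unbelievable", "results"]
# Subword tokens:   ["un", "##believ", "##able", "results"]
alignment = [[0, 1, 2], [3]]

rng = np.random.default_rng(0)
subword_vectors = rng.normal(size=(4, 16))  # stand-in for transformer outputs
token_vectors = pool_subwords(subword_vectors, alignment)
print(token_vectors.shape)
```

Downstream components such as a tagger or parser can then consume one vector per token, regardless of how the transformer's tokenizer split the text.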

# An overview of embedding techniques in NLP

<a href="http://drive.google.com/uc?export=view&id=1cRkU88Mr4LfE3aTqzQShZslBFJIhK12s"><img src="https://drive.google.com/uc?export=view&id=1JPYkkwRJjAkcilA8h1K9XrEBpGTDvgdI"></a>