In [1]:
# Reveal.js
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'theme': 'white',
        'transition': 'none',
        'controls': 'false',
        'progress': 'true',
})

{'theme': 'white',
 'transition': 'none',
 'controls': 'false',
 'progress': 'true'}

In [2]:
%%capture
%load_ext autoreload
%autoreload 2
# %cd ..
import sys
sys.path.append("..")
import statnlpbook.util as util
util.execute_notebook('language_models.ipynb')

In [3]:
%%html
<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>

In [4]:
from IPython.display import Image
import random

# Transformer Language Models

In [5]:
Image(url='../img/transformer-encoder-decoder.png'+'?'+str(random.random()), width=1400)

## BERT

**B**idirectional **E**ncoder **R**epresentations from **T**ransformers ([Devlin et al., 2019](https://www.aclweb.org/anthology/N19-1423.pdf)).

<center>
    <img src="https://miro.medium.com/max/300/0*2XpE-VjhhLGkFDYg.jpg" width=40%/>
</center>

<center>
<a href="slides/mlm.pdf"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/08/Sesame_Street_logo.svg/500px-Sesame_Street_logo.svg.png"></a>
</center>

### BERT training objective (1): **masked** language model

Predict masked words given context on both sides:

<center>
    <img src="http://jalammar.github.io/images/BERT-language-modeling-masked-lm.png" width=50%/>
</center>

<div style="text-align: right;">
    (from <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT</a>)
</div>
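
A minimal sketch of how masked-LM training examples can be built, assuming the corruption scheme described by Devlin et al. (2019): about 15% of tokens are selected for prediction, and of those 80% are replaced by `[MASK]`, 10% by a random token and 10% are left unchanged. The function name `mask_tokens`, the toy sentence and the toy vocabulary are illustrative assumptions, and tokens are selected independently here rather than by sampling a fixed number of positions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    if rng is None:
        rng = random.Random()
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)                     # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(token)              # 10%: keep the token unchanged
        else:
            corrupted.append(token)                  # not selected: no prediction here
            labels.append(None)
    return corrupted, labels

# Toy example (hypothetical sentence and vocabulary).
tokens = "the man went to the store to buy a gallon of milk".split()
vocab = sorted(set(tokens))
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

The loss is computed only at positions whose label is not `None`; the model sees the context on both sides of each such position.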

### BERT training objective (2): next sentence prediction

**Conditional encoding** of both sentences:

<center>
    <img src="http://jalammar.github.io/images/bert-next-sentence-prediction.png" width=60%/>
</center>

<div style="text-align: right;">
    (from <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT</a>)
</div>
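
The input format for next sentence prediction can be sketched as follows, assuming the `[CLS] A [SEP] B [SEP]` layout with segment ids from Devlin et al. (2019), where sentence B is the true successor half of the time. The helper `make_nsp_example` and the toy sentences are hypothetical; the original setup draws the negative sentence from elsewhere in the corpus, which is simplified here to any other sentence in the list.

```python
import random

def make_nsp_example(sentences, index, rng):
    """Build one next-sentence-prediction example starting at sentences[index]."""
    first = sentences[index]
    if rng.random() < 0.5:
        second, is_next = sentences[index + 1], True
    else:
        # Simplification: any sentence other than the first one and its successor.
        candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
        second, is_next = rng.choice(candidates), False
    tokens = ["[CLS]"] + first + ["[SEP]"] + second + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(first) + 2) + [1] * (len(second) + 1)
    return tokens, segment_ids, is_next

# Toy corpus (hypothetical), already split into tokens.
sentences = [
    "the man went to the store".split(),
    "he bought a gallon of milk".split(),
    "penguins are flightless birds".split(),
]
tokens, segment_ids, is_next = make_nsp_example(sentences, 0, random.Random(0))
print(tokens)
print(segment_ids)
print("IsNext" if is_next else "NotNext")
```

The binary label is predicted from the final representation of `[CLS]`, which can attend to both sentences at once.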

### BERT architecture

Transformer encoder with $L$ layers, hidden size $H$, and $A$ self-attention heads per layer.

* BERT$_\mathrm{BASE}$: $L=12, H=768, A=12$
* BERT$_\mathrm{LARGE}$: $L=24, H=1024, A=16$
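
To get a feel for these sizes, here is a rough parameter count as a function of $L$ and $H$. It is a sketch under stated assumptions, not the official accounting: a feed-forward size of $4H$, 512 positions, 2 segment types, a pooler layer on top of `[CLS]`, and a vocabulary of 30,522 entries (the size of the released uncased English vocabulary, slightly more than the rounded 30,000). The function name `bert_parameter_count` is made up for this illustration.

```python
def bert_parameter_count(L, H, vocab_size=30522, max_positions=512, segment_types=2):
    """Count weights and biases of a BERT encoder with feed-forward size 4H."""
    layer_norm = 2 * H                                # scale and shift vectors
    embeddings = (vocab_size + max_positions + segment_types) * H + layer_norm
    attention = 4 * (H * H + H)                       # query, key, value, output projections
    feed_forward = (H * 4 * H + 4 * H) + (4 * H * H + H)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = H * H + H                                # dense layer on top of [CLS]
    return embeddings + L * per_layer + pooler

print(f"BERT-BASE:  {bert_parameter_count(12, 768):,}")
print(f"BERT-LARGE: {bert_parameter_count(24, 1024):,}")
```

Under these assumptions the counts come out at roughly 110M and 335M parameters. The number of heads $A$ does not appear in the formula, because the heads split the $H$ dimensions among themselves rather than adding new ones.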

(Many other variations available through [HuggingFace Transformers](https://huggingface.co/docs/transformers/index))

Pre-trained on 16GB of text from English Wikipedia and BookCorpus.

* BERT$_\mathrm{BASE}$: 4 Cloud TPUs (16 TPU chips) for 4 days
* BERT$_\mathrm{LARGE}$: 16 Cloud TPUs (64 TPU chips) for 4 days

### How is that different from ELMo and GPT-$n$?

<center>
    <img src="mt_figures/bert_gpt_elmo.png" width=100%/>
</center>

<div style="text-align: right;">
    (from <a href="https://www.aclweb.org/anthology/N19-1423.pdf">Devlin et al., 2019</a>)
</div>

### BERT tokenization: not words, but WordPieces

<center>
    <img src="https://vamvas.ch/assets/bert-for-ner/tokenizer.png" width=80%/>
</center>

<div style="text-align: right;">
    (from <a href="https://vamvas.ch/bert-for-ner">BERT for NER</a>)
</div>

* Vocabulary of 30,000 WordPieces
* (Almost) no unknown words: rare words are split into smaller pieces that are in the vocabulary
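
The splitting step can be sketched as greedy longest-match-first segmentation, which is how WordPiece tokenizers are commonly applied at inference time. The toy vocabulary below is hypothetical, and the function handles a single, already lower-cased word; a real tokenizer also deals with punctuation, casing and whitespace first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest matching pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                          # try a shorter candidate
        if piece is None:
            return [unk_token]                # no piece matches: the word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (hypothetical); a real one has about 30,000 entries.
toy_vocab = {"play", "##ing", "##ed", "un", "##believ", "##able", "##s"}
for word in ["playing", "unbelievable", "plays", "xyz"]:
    print(word, wordpiece_tokenize(word, toy_vocab))
```

Because a real vocabulary contains the individual characters as pieces, nearly every word can be decomposed this way, and `[UNK]` is only needed for characters the vocabulary has never seen.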

### Using BERT

<center>
    <img src="http://jalammar.github.io/images/bert-tasks.png" width=60%/>
</center>

<div style="text-align: right;">
    (from <a href="https://www.aclweb.org/anthology/N19-1423.pdf">Devlin et al., 2019</a>)
</div>

Feature extraction (❄️, BERT weights stay frozen) vs. fine-tuning (🔥, BERT weights are updated on the task)

<center>
    <img src="https://d3i71xaburhd42.cloudfront.net/8659bf379ca8756755125a487c43cfe8611ce842/1-Table1-1.png" width=80%/>
</center>

<div style="text-align: right;">
    (from <a href="https://www.aclweb.org/anthology/W19-4302.pdf">Peters et al. 2019</a>)
</div>

Don't stop pretraining: continue pre-training on in-domain and task data before fine-tuning.

<center>
    <img src="https://d3i71xaburhd42.cloudfront.net/e816f788767eec6a8ef0ea9eddd0e902435d4271/1-Figure1-1.png" width=80%/>
</center>

<div style="text-align: right;">
    (from <a href="https://www.aclweb.org/anthology/2020.acl-main.740.pdf">Gururangan et al. 2020</a>)
</div>

### Which layer to use?

<center>
    <img src="http://jalammar.github.io/images/bert-feature-extraction-contextualized-embeddings.png" width=80%/>
</center>

<div style="text-align: right;">
    (from <a href="http://jalammar.github.io/illustrated-bert/">The Illustrated BERT</a>)
</div>
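
When BERT is used as a feature extractor, the per-layer vectors of a token have to be combined somehow. The sketch below shows three common choices (last layer only, sum of the last four layers, concatenation of the last four layers) on plain Python lists. The toy hidden states and function names are assumptions for illustration; in practice the vectors would come from the model and the best choice depends on the task.

```python
def last_layer(layers):
    """Use only the top layer's vector."""
    return layers[-1]

def sum_last_four(layers):
    """Add the top four layers element-wise; the dimension stays H."""
    return [sum(values) for values in zip(*layers[-4:])]

def concat_last_four(layers):
    """Concatenate the top four layers; the dimension becomes 4H."""
    return [value for layer in layers[-4:] for value in layer]

# Toy hidden states for one token: 12 layers of dimension 3, layer i filled with i.
layers = [[float(i)] * 3 for i in range(1, 13)]

print(last_layer(layers))
print(sum_last_four(layers))
print(len(concat_last_four(layers)))
```

Summing keeps the feature size fixed, while concatenation preserves more information at the cost of a four times larger input to the downstream classifier.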

### RoBERTa

[Liu et al., 2019](https://arxiv.org/pdf/1907.11692.pdf): bigger is better.

BERT, with additional pre-training data:

- CC-News (76GB)
- OpenWebText (38GB)
- Stories (31GB)

and **no** next-sentence-prediction task (only masked LM).

Training: 1024 V100 GPUs for approximately one day.


## Multilingual BERT

* One model pre-trained on the 104 languages with the largest Wikipedias
* *Shared* vocabulary of 110k WordPieces
* Same architecture as BERT$_\mathrm{BASE}$: $L=12, H=768, A=12$
* Same training objectives, **no cross-lingual signal**

Details: [multilingual BERT README](https://github.com/google-research/bert/blob/master/multilingual.md)

<center>
    <img src="https://d3i71xaburhd42.cloudfront.net/5d8beeca1a2e3263b2796e74e2f57ffb579737ee/3-Figure1-1.png" width=80%/>
</center>

<div style="text-align: right;">
    (from <a href="https://arxiv.org/pdf/1911.03310.pdf">Libovický et al., 2019</a>)
</div>

### Other multilingual transformers

+ XLM ([Lample and Conneau, 2019](https://arxiv.org/pdf/1901.07291.pdf)) additionally uses a translation language modeling (TLM) objective on parallel sentences
+ DistilmBERT ([Sanh et al., 2020](https://arxiv.org/pdf/1910.01108.pdf)) is a lighter version of mBERT
+ Many monolingual BERTs for languages other than English
([CamemBERT](https://arxiv.org/pdf/1911.03894.pdf),
[BERTje](https://arxiv.org/pdf/1912.09582),
[Nordic BERT](https://github.com/botxo/nordic_bert)...)

# Summary #

* Static word embeddings assign each word the same vector regardless of context
* Contextualised representations are dynamic: a word's vector depends on the sentence it occurs in
* Popular pre-trained contextual representations:
    * ELMo: bidirectional language model with LSTMs
    * GPT: left-to-right transformer language model
    * BERT: transformer masked language model

# Outlook #

* Transformer models keep coming out: larger, trained on more data, languages and domains, etc.
* In the machine translation lecture, you will learn how to use them for cross-lingual tasks

# Additional Reading #

+ [Jurafsky & Martin Chapter 11](https://web.stanford.edu/~jurafsky/slp3/11.pdf)
+ Jay Alammar's blog posts:
    + [The Illustrated GPT-2](http://jalammar.github.io/illustrated-gpt2/)
    + [The Illustrated BERT](http://jalammar.github.io/illustrated-bert/)