# How to use BERT and RoBERTa

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

This notebook explains how you can use a transformer model. Transformer models are published regularly on the huggingface platform: https://huggingface.co/models

These models are very big (hundreds of megabytes to several gigabytes) and require a computer with sufficient memory to load. Loading them also takes some time. It is possible to copy such a model to your disk and load the local copy, but a substantial amount of memory is still needed to load it.
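The local-copy option can be sketched as follows. This is a minimal illustration, not a required setup: the helper name `resolve_model_source` and the directory `./models/bert-base-cased` are our own assumptions. The helper returns the local directory when it looks like a saved model and the hub name otherwise.

```python
from pathlib import Path


def resolve_model_source(hub_name, local_dir):
    """Return the local directory if it holds a saved model, else the hub name.

    Hypothetical helper for illustration; it only checks for a config.json,
    which save_pretrained writes next to the model weights.
    """
    local_dir = Path(local_dir)
    if (local_dir / "config.json").is_file():
        return str(local_dir)
    return hub_name


# './models/bert-base-cased' is an assumed location, not a required one.
source = resolve_model_source("bert-base-cased", "./models/bert-base-cased")
print(source)
```

With transformers installed, `pipeline('fill-mask', model=source)` loads from whichever source was chosen, and calling `save_pretrained('./models/bert-base-cased')` on the resulting pipeline writes a local copy after the first download. Loading from disk avoids the download, not the memory cost.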

There is a whole family of transformer models developed by different research groups and published on the Huggingface platform. We will look at two popular models: BERT and its sequel RoBERTa, specifically the crosslingual variant XLM-RoBERTa.

BERT (Bidirectional Encoder Representations from Transformers) is a transformer model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, without human labelling. BERT was pretrained with two objectives:

* Masked language modeling (MLM): taking a sentence, the training procedure randomly masks 15% of the words in the input. It then runs the entire masked sentence through the model to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which only mask the future/next tokens. It allows the model to learn a bidirectional representation of the sentence.
* Next sentence prediction (NSP): two masked sentences are concatenated as input during pretraining. Sometimes they are sentences that followed each other in the original text (positive example), sometimes the second one is a random sentence (negative example). The model then has to predict whether the two sentences followed each other or not.
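The two objectives can be sketched in plain Python. This is a simplified illustration under stated assumptions, not BERT's actual preprocessing code: tokens are whitespace-separated words rather than word pieces, every selected token is replaced by `[MASK]`, and negative sentence pairs are drawn with probability 0.5. The function names `mask_tokens` and `make_nsp_pair` are our own.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace roughly mask_rate of the tokens with mask_token.

    Simplification: every selected token becomes mask_token, and at least
    one token is always masked. Returns the masked sequence and a dict
    mapping each masked position to the original token (the training labels).
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, labels


def make_nsp_pair(sentences, index, rng):
    """Build one next-sentence-prediction example for sentences[index].

    Assumes index + 1 is a valid position and there are at least three
    sentences. With probability 0.5 the true next sentence is used
    (label True), otherwise a randomly drawn other sentence (label False).
    """
    first = sentences[index]
    if rng.random() < 0.5:
        return first, sentences[index + 1], True
    candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
    return first, rng.choice(candidates), False


tokens = "A keyboard and a mouse are connected to the computer .".split()
masked, labels = mask_tokens(tokens)
print(" ".join(masked))
print(labels)
```

During pretraining, the model only sees `masked` and is scored on how well it recovers the words stored in `labels`.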

The core trick of the model is **Attention**, hence the title of the seminal paper "Attention is all you need" (Vaswani et al 2017). Attention refers to the principle that the embedding representation of a word or token in a sequence is computed with the help of the other tokens in that sequence, for the purpose of e.g. predicting the masked tokens. In this way, the model learns which words in the context deserve the most attention when predicting a masked word. This is applied to all words in a sequence and over many layers (12 in the base models, 24 in the large models), each with multiple attention heads.
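The attention principle can be illustrated with a small sketch of single-head scaled dot-product attention in plain Python. It makes two simplifying assumptions: the toy vectors are made up, and they are used directly as queries, keys and values, whereas a real transformer first applies learned projections to the token embeddings.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors.

    For each query, score every key by dot product (scaled by sqrt(d)),
    normalise the scores with softmax, and return the weighted average
    of the value vectors together with the weights.
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Toy 2-dimensional vectors for three tokens; the numbers are made up.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(vectors, vectors, vectors)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row of `weights` shows how much one token attends to every token in the sequence, and each new representation in `outputs` is the corresponding weighted mix of the value vectors.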

After many passes over large amounts of data, the models learn an inner representation of the English language (a language model). This representation can then be used to represent sentences in texts as contextual vectors in downstream tasks, instead of feature-engineered vectors. If you have a dataset of labeled sentences, for instance, you can train a standard classifier using the representations produced by the BERT model as inputs.

RoBERTa is a robustly optimized sequel to BERT: next sentence prediction was dropped, and the model was trained longer, on more data, with larger batches and on longer sequences (Liu et al 2019). It often gives better performance than BERT in downstream tasks, but it is bigger and heavier to use.

XLM-RoBERTa is a multilingual version of RoBERTa. It is pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages (Conneau et al 2020). XLM-RoBERTa can represent text in any of these languages, and it can be fine-tuned in one language (or more) and applied to all of them. This makes it possible to profit from English training data for every language covered by the model, even if there is no labeled data in that specific language.


## References
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural information processing systems 30 (2017). https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

Conneau, Alexis, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. "Unsupervised cross-lingual representation learning at scale." In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics (2020).

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. "RoBERTa: A robustly optimized BERT pretraining approach." arXiv preprint arXiv:1907.11692 (2019).


## Loading a model in a prefabbed NLP pipeline

We will start with the English case-sensitive BERT model that is provided by the Huggingface platform. It is possible to load the model itself in combination with its tokenizer to get a representation of a text for all words/tokens across all layers. It is however rather complex to exploit these representations for specific tasks. Huggingface therefore provides an option to create a pipeline to perform an NLP task with a pretrained model:

"The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering."

More information can be found here: https://huggingface.co/transformers/v3.0.2/main_classes/pipelines.html

There you can find information on the different tasks and how to call them. We will use these pipelines in this class for a gentle introduction. In notebook 4.3, we will use the pipeline module to load fine-tuned models to perform sentiment analysis and emotion detection. In notebook 4.4, we try out fine-tuned crosslingual models.

This notebook requires installing the deep learning package **transformers**. For this lab you should use the transformers version 4.16.0. Once installed, you can comment out the next cell.

In [1]:
#!pip install transformers==4.16.0

Below, we use the two pretrained language models BERT and XLM-RoBERTa to perform the basic task of predicting a masked word in the context of a sentence. For this, we use the **fill-mask** pipeline as an interface for the model. The models are used as pretrained: they were not fine-tuned for another purpose.

We create an instance of the pipeline class, where we specify the task **fill-mask** and the name of the model. When creating the instance, the constructor scans the Huggingface platform for the model (or your local cache) and its configuration for the task. If you have loaded the model before, it will find it cached on your local disk. If not, it will download it from Huggingface, which may take some time.

In [2]:
from transformers import pipeline

pipe = pipeline('fill-mask', model='bert-base-cased')

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [3]:
for res in pipe('A keyboard and a [MASK] are connected to the computer.'):
    print(res['sequence'])

A keyboard and a mouse are connected to the computer.
A keyboard and a keyboard are connected to the computer.
A keyboard and a screen are connected to the computer.
A keyboard and a monitor are connected to the computer.
A keyboard and a microphone are connected to the computer.


In [5]:
for res in pipe('The cat chased the [MASK] for minutes.'):
    print(res['sequence'])

The cat chased the dog for minutes.
The cat chased the mouse for minutes.
The cat chased the cat for minutes.
The cat chased the bird for minutes.
The cat chased the rabbit for minutes.


We can see from the predictions for the masked position that different contexts trigger different words.

Instead of words, we can also prompt for names:

In [6]:
for res in pipe('Mr [MASK] was charged with murder.'):
    print(res['sequence'])

Mr Smith was charged with murder.
Mr Brown was charged with murder.
Mr Jones was charged with murder.
Mr Johnson was charged with murder.
Mr Williams was charged with murder.


And the other way around, we can prompt for what people are charged with...

In [7]:
for res in pipe('Mr Williams was charged with [MASK].'):
    print(res['sequence'])

Mr Williams was charged with murder.
Mr Williams was charged with assault.
Mr Williams was charged with theft.
Mr Williams was charged with fraud.
Mr Williams was charged with corruption.


Think about what these models do when they generate these answers! Did they index knowledge on individuals as facts or do they only have a hunch about what vocabulary items can be expected in this position?
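One way to explore this question is to look at the scores behind the answers. Besides `sequence`, each result of the fill-mask pipeline is a dictionary that also holds `token_str` (the predicted token) and `score` (the probability the model assigns to it). The helper below is a small sketch for listing them; the name `format_predictions` is our own and not part of transformers.

```python
def format_predictions(results):
    """Return one 'token<TAB>score' line per prediction, best first.

    `results` is expected to be the list of dicts returned by a
    fill-mask pipeline call; only 'token_str' and 'score' are used.
    """
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [f"{r['token_str'].strip()}\t{r['score']:.3f}" for r in ranked]


# Usage with the pipeline created earlier in this notebook:
# for line in format_predictions(pipe('Mr Williams was charged with [MASK].')):
#     print(line)
```

Comparing the scores for different names in the same sentence is one way to see whether the model treats the name as informative or mostly ranks words that fit the position.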

Next, we will load XLM-RoBERTa from huggingface in a fill-mask pipeline as we have done before for English BERT to perform word predictions in different languages covered by the model.

In [4]:
from transformers import pipeline

xlmpipe = pipeline('fill-mask', model='xlm-roberta-base')

In [9]:
for res in xlmpipe('I wish I was a <mask>.'):
    print(res['sequence'])

I wish I was a girl.
I wish I was a teenager.
I wish I was a poet.
I wish I was a child.
I wish I was a virgin.


Note that BERT and RoBERTa use a different representation for the masked token (`[MASK]` versus `<mask>`). This is an incidental difference. You can check the vocabulary file of the model to see what tokens are used.
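To avoid typing the wrong mask token for a given model, a prompt can be written once with a placeholder and filled in per model. The sketch below assumes a placeholder `{mask}` and a helper name `fill_template`, both our own choices.

```python
def fill_template(template, mask_token, slot="{mask}"):
    """Insert the model-specific mask token into a prompt template."""
    if template.count(slot) != 1:
        raise ValueError("template must contain exactly one mask slot")
    return template.replace(slot, mask_token)


# The two mask tokens used in this notebook.
print(fill_template("The {mask} was charged with murder.", "[MASK]"))
print(fill_template("The {mask} was charged with murder.", "<mask>"))
```

With a loaded pipeline, the token does not have to be typed by hand: it can be read from the tokenizer, for example `fill_template('The {mask} was charged with murder.', pipe.tokenizer.mask_token)`.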

You can find the actual files on huggingface.co. Go to a model of your choice and use the option "Files and versions" to get a listing of the files for a model. The tokenizer.json file contains the vocabulary (in some cases there is also a "vocab.txt" file):

* https://huggingface.co/bert-base-cased/raw/main/tokenizer.json
* https://huggingface.co/xlm-roberta-base/blob/main/tokenizer.json

These files are big but can be opened in a browser as raw data or downloaded individually.

So let us now try some texts in other languages with the same XLM model:

In [10]:
for res in xlmpipe('Ik wou dat ik een <mask> was.'):
    print(res['sequence'])

Ik wou dat ik een vrouw was.
Ik wou dat ik een meisje was.
Ik wou dat ik een engel was.
Ik wou dat ik een homo was.
Ik wou dat ik een mens was.


In [11]:
for res in xlmpipe('Ich mochte gerne ein <mask> sein.'):
    print(res['sequence'])

Ich mochte gerne ein Mädchen sein.
Ich mochte gerne ein Moderator sein.
Ich mochte gerne ein Mensch sein.
Ich mochte gerne ein Paar sein.
Ich mochte gerne ein Engel sein.


In [12]:
for res in xlmpipe('The <mask> was charged with murder.'):
    print(res['sequence'])

The man was charged with murder.
The suspect was charged with murder.
The woman was charged with murder.
The victim was charged with murder.
The officer was charged with murder.


In [13]:
for res in xlmpipe('De <mask> is aangeklaagd voor moord.'):
    print(res['sequence'])

De verdachte is aangeklaagd voor moord.
De man is aangeklaagd voor moord.
De vrouw is aangeklaagd voor moord.
De politie is aangeklaagd voor moord.
De bestuurder is aangeklaagd voor moord.


You can check which languages are covered in XLM-RoBERTa and try out other word predictions in other languages yourself.

## End of notebook