[Open in Colab](https://colab.research.google.com/github/JadeMaveric/CoinShift-Imaging-Box/blob/master/NLP/filling-in-masked-words-with-roberta.ipynb)

# Introduction

RoBERTa is a transformer model pretrained on a large corpus of English text in a self-supervised fashion. This means it was pretrained on raw text only, with no human labelling of any kind (which is why it can use lots of publicly available data). Instead, an automatic process generates the inputs and labels from those texts.


The model we'll look at in this notebook was trained using a masked language modeling (MLM) objective. It was introduced in this [paper](https://arxiv.org/abs/1907.11692) and first released in this [repository](https://github.com/pytorch/fairseq/tree/master/examples/roberta).
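To make the "automatic process" concrete, here is a minimal sketch of how masked language modeling turns raw text into inputs and labels. It is a simplification under stated assumptions, not RoBERTa's actual preprocessing: it splits on whitespace instead of using byte-level BPE, and it always replaces a selected word with the mask token. The helper name `make_mlm_example` is hypothetical; the default rate of 15% is the masking rate reported in the BERT and RoBERTa papers.

```python
import random

MASK_TOKEN = "<mask>"  # RoBERTa's mask token

def make_mlm_example(text, mask_prob=0.15, seed=0):
    """Turn raw text into (inputs, labels) with no human annotation.

    Assumptions for illustration: whitespace tokens, and every selected
    token is replaced by the mask token.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for token in text.split():
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)  # hide the word from the model
            labels.append(token)       # the hidden word becomes the target
        else:
            inputs.append(token)
            labels.append(None)        # nothing to predict at this position
    return inputs, labels

# A higher rate than the default so a short sentence is likely to get a mask.
inputs, labels = make_mlm_example(
    "RoBERTa learns by predicting words that were hidden from it", mask_prob=0.3
)
print(inputs)
print(labels)
```

The labels come for free from the text itself, which is why no human annotation is needed.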

# Installation
We'll be using the [Transformers](https://huggingface.co/transformers/) library by HuggingFace throughout this notebook.
It provides a simple interface for using NLP models with both PyTorch and TensorFlow.

To get started, use ```pip``` to install the package ```transformers```

In [1]:
!pip install -q transformers

# Quick Start

The easiest way to use a pretrained model with HuggingFace is to use ```pipeline()```. This gives us access to a wide variety of NLP [tasks](https://huggingface.co/transformers/main_classes/pipelines.html#transformers.pipeline). The one we're interested in is ```fill-mask```

In [2]:
from transformers import pipeline
unmask = pipeline('fill-mask')

Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We can now pass a string with a masked word to ```unmask()``` and it'll return a list of predictions. The mask token for this tokenizer is available as `unmask.tokenizer.mask_token`. We can type it in manually, or insert it with an `f"string"`.

In [3]:
unmask.tokenizer.mask_token

'<mask>'

In [4]:
predictions = unmask('Elon Musk is the founder of <mask>')
for prediction in predictions:
    # str.strip() removes a set of characters, not a substring, so use replace() instead
    print(prediction['sequence'].replace('<s>', '').replace('</s>', ''), end='\t--- ')
    print(f"{round(100*prediction['score'],2)}% confidence")

Elon Musk is the founder of SpaceX	--- 68.7% confidence
Elon Musk is the founder of Tesla	--- 28.07% confidence
Elon Musk is the founder of PayPal	--- 2.82% confidence
Elon Musk is the founder of Facebook	--- 0.09% confidence
Elon Musk is the founder of Alphabet	--- 0.06% confidence
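A detail worth knowing when cleaning special tokens out of a `sequence` string: `str.strip` treats its argument as a set of characters rather than a substring, so it can eat into the sentence itself. The sketch below shows the difference on a made-up string. The helper `remove_special_tokens` is hypothetical and not part of the `transformers` API.

```python
def remove_special_tokens(sequence, bos="<s>", eos="</s>"):
    """Remove one leading `bos` and one trailing `eos` marker, if present."""
    if sequence.startswith(bos):
        sequence = sequence[len(bos):]
    if sequence.endswith(eos):
        sequence = sequence[:-len(eos)]
    return sequence

raw = "<s>Musk founded several startups</s>"  # made-up example string

# strip() removes any of the characters '<', 's', '>', '/' from both ends,
# so the final "s" of "startups" disappears along with the markers.
stripped = raw.strip("<s>").strip("</s>")
cleaned = remove_special_tokens(raw)

print(repr(stripped))  # 'Musk founded several startup'
print(repr(cleaned))   # 'Musk founded several startups'
```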


Out of the box, the model fares pretty well. From the top 5 results returned, we see that we've got relatively high confidence for the correct answers, and significantly lower scores for the incorrect ones. The default model used by this pipeline task is `distilroberta-base`. From its info [page](https://huggingface.co/distilroberta-base) we see that it's a distilled version of the `roberta-base` model. We can use the parent model directly by passing it as an argument when creating the pipeline. HuggingFace offers multiple [models](https://huggingface.co/models), many of them fine-tuned for a specific task.

In [5]:
roberta_unmask = pipeline('fill-mask', model='roberta-base')
predictions = roberta_unmask('Elon Musk is the founder of <mask>')
for prediction in predictions:
    # str.strip() removes a set of characters, not a substring, so use replace() instead
    print(prediction['sequence'].replace('<s>', '').replace('</s>', ''), end='\t--- ')
    print(prediction['score'])

Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at roberta-base and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Elon Musk is the founder of Tesla	--- 0.6961442232131958
Elon Musk is the founder of SpaceX	--- 0.29525160789489746
Elon Musk is the founder of PayPal	--- 0.007312057539820671
Elon Musk is the founder of Twitter	--- 0.000530343793798238
Elon Musk is the founder of Facebook	--- 0.0002385072730248794
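To see the two checkpoints side by side, we can line up their top-5 scores. This is a small sketch using the numbers printed in the two outputs above (rounded to four decimals); the `compare` helper is hypothetical and only works on plain dictionaries, so it needs no model download.

```python
# Top-5 scores copied from the notebook outputs, rounded to four decimals.
distilroberta = {"SpaceX": 0.6870, "Tesla": 0.2807, "PayPal": 0.0282,
                 "Facebook": 0.0009, "Alphabet": 0.0006}
roberta = {"Tesla": 0.6961, "SpaceX": 0.2953, "PayPal": 0.0073,
           "Twitter": 0.0005, "Facebook": 0.0002}

def compare(first, second):
    """Return rows (token, score_in_first, score_in_second).

    A score of None means the token was not in that model's top 5.
    Rows are ordered by the best score either model gave the token.
    """
    tokens = sorted(set(first) | set(second),
                    key=lambda t: -max(first.get(t, 0.0), second.get(t, 0.0)))
    return [(t, first.get(t), second.get(t)) for t in tokens]

def fmt(score):
    """Format a score as a percentage, or a dash if it is missing."""
    return "     -" if score is None else f"{score:6.2%}"

rows = compare(distilroberta, roberta)
print(f"{'token':<10}{'distil':>8}{'base':>8}")
for token, a, b in rows:
    print(f"{token:<10}{fmt(a):>8}{fmt(b):>8}")
```

Both models agree on the same three leading candidates, but they swap the order of the top two.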


# Setting up our own workflow
The `pipeline()` method works well if you don't need a lot of customisation. But there will be times when you want more control over the process. In that case, we can instantiate, train and use our own model and tokenizer. The HuggingFace [docs](https://huggingface.co/transformers/task_summary.html#masked-language-modeling) give us a concise way of doing this.

1. Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a RoBERTa model and is loaded with the weights stored in the checkpoint.
2. Define a sequence with a masked token, placing the `tokenizer.mask_token` instead of a word.
3. Encode that sequence into a list of IDs and find the position of the masked token in that list.
4. Retrieve the predictions at the index of the mask token: this tensor has the same size as the vocabulary, and the values are the scores attributed to each token. The model gives higher score to tokens it deems probable in that context.
5. Retrieve the top 5 tokens using the PyTorch `topk` or TensorFlow `top_k` methods.
6. Replace the mask token with each of the predicted tokens and print the results.
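Before bringing in a real model, the mechanics of steps 2, 3, 5 and 6 can be sketched in plain Python. The token IDs below are the ones `roberta-base` produces for this sentence, while the scores in `toy_scores` are made up for illustration and stand in for the model output of step 4.

```python
import heapq

MASK_TOKEN = "<mask>"
MASK_TOKEN_ID = 50264  # RoBERTa's mask token ID

# Step 2: a sequence with the mask token in place of a word.
sequence = f"The world will end in {MASK_TOKEN}"

# Step 3: token IDs for that sequence and the position of the mask.
input_ids = [0, 133, 232, 40, 253, 11, MASK_TOKEN_ID, 2]
mask_index = input_ids.index(MASK_TOKEN_ID)

# Step 4 (stand-in): made-up scores for a handful of candidate tokens.
# A real model returns one score per vocabulary entry at mask_index.
toy_scores = {"2100": 4.2, "peace": 2.9, "destruction": 3.8,
              "cheese": -1.0, "2019": 3.1}

# Step 5: keep the k highest-scoring candidates.
top_tokens = heapq.nlargest(3, toy_scores, key=toy_scores.get)

# Step 6: substitute each candidate for the mask token.
filled = [sequence.replace(MASK_TOKEN, token) for token in top_tokens]
for line in filled:
    print(line)
```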


### PyTorch

In [6]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModelForMaskedLM.from_pretrained('roberta-base')

sequence = f"The world will end in {tokenizer.mask_token}" # "The world will end in <mask>"

input_seq = tokenizer.encode(sequence, return_tensors='pt') # tensor([[0, 133, 232, 40, 253, 11, 50264, 2]])
mask_token_index = torch.where(input_seq == tokenizer.mask_token_id)[1] # torch.where returns (tensor([0]), tensor([6])); [1] keeps the token positions

token_logits = model(input_seq).logits
masked_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = torch.topk(masked_token_logits, 5, dim=1).indices[0].tolist()

# print('sequence:', sequence)
# print('input_seq:', input_seq)
# print('mask_token_index:', mask_token_index)
# print('token_logits:', token_logits)
# print('masked_token_logits:', masked_token_logits)
# print('top_5_tokens:', top_5_tokens)

Some weights of RobertaForMaskedLM were not initialized from the model checkpoint at roberta-base and are newly initialized: ['lm_head.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

The world will end in  2100
The world will end in  destruction
The world will end in  2019
The world will end in  2018
The world will end in  peace
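Note that `masked_token_logits` holds raw, unnormalised scores, whereas the `score` reported by `pipeline()` earlier is a probability. A softmax converts one into the other. Here is a minimal pure-Python sketch with made-up logits; on the real tensor, `torch.softmax(masked_token_logits, dim=-1)` does the same job.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [4.2, 3.8, 3.1, 2.9, -1.0]  # made-up values, not model output
probs = softmax(toy_logits)
for logit, prob in zip(toy_logits, probs):
    print(f"logit {logit:5.1f} -> probability {prob:.3f}")
```

The ranking of the tokens is unchanged by the softmax, so `topk` on the logits and on the probabilities picks the same tokens.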


### TensorFlow

In [8]:
from transformers import TFAutoModelForMaskedLM, AutoTokenizer
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = TFAutoModelForMaskedLM.from_pretrained('roberta-base')

sequence = f"The world will end in {tokenizer.mask_token}" # "The world will end in <mask>"

input_seq = tokenizer.encode(sequence, return_tensors='tf') # tf.Tensor([[0, 133, 232, 40, 253, 11, 50264, 2]])
mask_token_index = tf.where(input_seq == tokenizer.mask_token_id)[0, 1] # tf.where returns [[0, 6]] (row, column); [0, 1] keeps the token position

token_logits = model(input_seq)[0]
masked_token_logits = token_logits[0, mask_token_index, :]

top_5_tokens = tf.math.top_k(masked_token_logits, 5).indices.numpy()

# print('sequence:', sequence)
# print('input_seq:', input_seq)
# print('mask_token_index:', mask_token_index)
# print('token_logits:', token_logits)
# print('masked_token_logits:', masked_token_logits)
# print('top_5_tokens:', top_5_tokens)

All model checkpoint layers were used when initializing TFRobertaForMaskedLM.

All the layers of TFRobertaForMaskedLM were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.


In [9]:
for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

The world will end in  2100
The world will end in  destruction
The world will end in  2019
The world will end in  2018
The world will end in  peace


# Fine tuning the model
Most of the models on HuggingFace are meant to be fine-tuned for specific tasks. To save valuable time, HuggingFace offers a `Trainer` that'll fine-tune our model on a dataset. All we need to do is provide it with a config. You can still [train directly](https://huggingface.co/transformers/training.html#fine-tuning-in-native-pytorch) through PyTorch or TensorFlow if you want, but there's very little benefit to doing that.

To make things even easier, HuggingFace offers [scripts](https://github.com/huggingface/transformers/tree/master/examples/language-modeling) that can be run to fine-tune a model. For our model, we'll use `run_mlm.py`. This script requires [version 4.5.0](https://github.com/huggingface/transformers/blob/3f48b2bc3e5b555a06492f1e7b999ff29bb6058a/examples/language-modeling/run_mlm.py#L51) which, at the time of writing, hasn't been released, so we'll need to install `transformers` from the master branch. It also requires the `datasets` package, which lets us load datasets using only their name.

In [10]:
!wget --quiet https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_mlm.py
!pip install --quiet datasets
!pip install --quiet git+https://github.com/huggingface/transformers

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
allennlp 2.0.1 requires transformers<4.3,>=4.1, but you have transformers 4.5.0.dev0 which is incompatible.


*Be sure to use an accelerator when training; without one, this can take a long time.*

In the cell below we're fine-tuning `roberta-base` on the `wikitext` dataset. Since this isn't an interactive shell, and we don't want to report the run to a logging service such as Weights & Biases, we pass in `none` for the `report_to` flag. Even with an accelerator, this can still take a couple of minutes, so we limit the number of training and validation samples.
Take note of the `output_dir`; we'll need it later.

In [11]:
# Clear GPU memory (sometimes needed on Kaggle/Colab)
from numba import cuda
cuda.select_device(0)
cuda.close()
cuda.select_device(0)

<weakproxy at 0x7f14794ed6b0 to Device at 0x7f1479707690>

In [12]:
!python './run_mlm.py' \
--model_name_or_path 'roberta-base' \
--dataset_name 'wikitext' \
--dataset_config_name 'wikitext-2-raw-v1' \
--do_train \
--do_eval \
--report_to none \
--max_train_samples 500 \
--max_val_samples 500 \
--output_dir './test-mlm'

2021-03-24 09:41:46.838330: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
Downloading: 8.39kB [00:00, 6.08MB/s]                                           
Downloading: 5.83kB [00:00, 4.90MB/s]                                           
Downloading and preparing dataset wikitext/wikitext-2-raw-v1 (download: 4.50 MiB, generated: 12.91 MiB, post-processed: Unknown size, total: 17.41 MiB) to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91...
Downloading: 100%|█████████████████████████| 4.72M/4.72M [00:00<00:00, 6.66MB/s]
Dataset wikitext downloaded and prepared to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/47c57a6745aa5ce8e16a5355aaa4039e3aa90d1adad87cef1ad4e0f29e74ac91. Subsequent calls will reuse this data.
[INFO|configuration_utils.py:472] 2021-03-24 09:41:52,833 >> loading configuration file https://hug
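If you want to rerun the script with different sample limits or output directories, it can help to assemble the command in Python rather than editing the shell cell by hand. The sketch below uses only the flags from the cell above; the helper name `build_mlm_command` and its parameters are hypothetical.

```python
import shlex
import sys

def build_mlm_command(output_dir, max_samples=500, script="./run_mlm.py"):
    """Assemble the argument list for the fine-tuning run shown above."""
    return [
        sys.executable, script,
        "--model_name_or_path", "roberta-base",
        "--dataset_name", "wikitext",
        "--dataset_config_name", "wikitext-2-raw-v1",
        "--do_train",
        "--do_eval",
        "--report_to", "none",
        "--max_train_samples", str(max_samples),
        "--max_val_samples", str(max_samples),
        "--output_dir", output_dir,
    ]

command = build_mlm_command("./test-mlm")
print(shlex.join(command))  # inspect the command before running it
# To launch it: import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting issues, and printing the joined command first makes it easy to check what will be run.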

We can now load in this model by passing in the `output_dir` to `from_pretrained()`

In [13]:
model = AutoModelForMaskedLM.from_pretrained('./test-mlm/')
tokenizer = AutoTokenizer.from_pretrained('roberta-base') # use the tokenizer of the checkpoint we fine-tuned

unmask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

predictions = unmask('Elon Musk is the founder of <mask>')
for prediction in predictions:
    # str.strip() removes a set of characters, not a substring, so use replace() instead
    print(prediction['sequence'].replace('<s>', '').replace('</s>', ''), end='\t--- ')
    print(f"{round(100*prediction['score'],2)}% confidence")

Elon Musk is the founder of Tesla	--- 72.62% confidence
Elon Musk is the founder of SpaceX	--- 25.97% confidence
Elon Musk is the founder of PayPal	--- 0.68% confidence
Elon Musk is the founder of Twitter	--- 0.45% confidence
Elon Musk is the founder of Facebook	--- 0.15% confidence
