# Day 3 - Masked Language Modeling

Welcome to day 3! We hope you're ready for some more modeling! 

We divided the notebook into three main parts, as described below. The first one focuses on tokenization and how different models preprocess the text you feed them as input. In the second part, we will experiment with two **masked language models (MLMs)** that are widely used in the NLP community. These models were pretrained on large English corpora using a *fill-in-the-blank* task. Finally, we will apply the same *fill-in-the-blank* task to a different model and assess its advantages and limitations compared with the BERT-based models.


## Agenda:

- Part 1. Tokenization
- Part 2. Masked Language Modeling using BERT-based models
- Part 3. Masked Language Modeling using T5



As usual, the main task in this notebook is to play with the models and find examples that surprise or intrigue you. If any example challenges your ideas, let us know! 🤯

Have fun!

## Setup

Standard installation. We'll use the Hugging Face Transformers library, then load some classes that make it easier to interact with the models.

In [None]:
%%time
%%capture
!pip install sentencepiece
!pip install transformers

CPU times: user 151 ms, sys: 40.6 ms, total: 192 ms
Wall time: 16.6 s


In [None]:
from transformers import AutoTokenizer, BertForMaskedLM, pipeline

In [None]:
def get_prediction(model, text: str, n=2, wordpieces=None):
  """Gets predictions for BERT-based models."""
  orig_text = text
  if isinstance(model.model, BertForMaskedLM):
    text = text.replace("<mask>", "[MASK]")
    prefix = "[bert]"
  else:
    prefix = "[roberta]"
  
  results = model(text, targets=wordpieces)

  for i, r in enumerate(results):
    if i >= n: return

    message = f"{prefix} {orig_text} -----> {r['token_str'].strip()}"
    message += f"\t {round(r['score'], 3)}"
    print(message)

## Part 1. Tokenization

Before we even start using Language Models, we'd like to point out how different models process words. 

In the next few cells, we will load three different models that you've learned about in class! We will see how they transform texts into different word pieces! 

In [None]:
%%time
%%capture
# Load 3 different tokenization models
wordp = AutoTokenizer.from_pretrained("bert-base-uncased")
sentp = AutoTokenizer.from_pretrained("xlnet-base-cased")
bytep = AutoTokenizer.from_pretrained("gpt2")

CPU times: user 2.04 s, sys: 197 ms, total: 2.24 s
Wall time: 9.02 s


### Exercise

We invite you to play around and try out your own (: 


Which one works best for your use case? 


In [None]:
text = "My name is annoyingly fantastic and egregious."
print("BERT tokenizer:", wordp.tokenize(text))  # word piece
print("XLNet tokenizer:", sentp.tokenize(text)) # sentence piece = word piece + " "
print("GPT2 tokenizer:", bytep.tokenize(text))  # byte pairs

BERT tokenizer: ['my', 'name', 'is', 'annoying', '##ly', 'fantastic', 'and', 'e', '##gre', '##gio', '##us', '.']
XLNet tokenizer: ['▁My', '▁name', '▁is', '▁annoying', 'ly', '▁fantastic', '▁and', '▁', 'eg', 'reg', 'ious', '.']
GPT2 tokenizer: ['My', 'Ġname', 'Ġis', 'Ġannoy', 'ingly', 'Ġfantastic', 'Ġand', 'Ġegregious', '.']
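
To see where pieces such as `annoying` + `##ly` come from, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece applies to a single word at tokenization time. The toy vocabulary is an assumption chosen to mirror the output above; the real `bert-base-uncased` vocabulary has roughly 30k entries, and the real tokenizer also handles lowercasing and punctuation.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single (already lowercased) word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry "##"
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration only).
toy_vocab = {"annoying", "##ly", "e", "##gre", "##gio", "##us", "fantastic"}

print(wordpiece_tokenize("annoyingly", toy_vocab))
print(wordpiece_tokenize("egregious", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A rare word like "egregious" falls apart into many small pieces simply because no longer piece is available in the vocabulary.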


In [None]:
text = "My name is समीर"
print("BERT tokenizer:", wordp.tokenize(text))
print("XLNet tokenizer:", sentp.tokenize(text))
print("GPT2 tokenizer:", bytep.tokenize(text))

BERT tokenizer: ['my', 'name', 'is', 'स', '##म', '##ी', '##र']
XLNet tokenizer: ['▁My', '▁name', '▁is', '▁', 'समर']
GPT2 tokenizer: ['My', 'Ġname', 'Ġis', 'Ġà¤', '¸', 'à¤', '®', 'à¥', 'Ģ', 'à¤', '°']
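
The odd-looking GPT-2 pieces are not noise. GPT-2 uses a byte-level tokenizer, and each Devanagari character takes three bytes in UTF-8, so the name is split into byte fragments that are displayed with stand-in characters. A small sketch using only the standard library to inspect those bytes:

```python
name = "समीर"

# Show the UTF-8 bytes behind each code point.
for ch in name:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} bytes: {encoded.hex(' ')}")

total_bytes = len(name.encode("utf-8"))
print(len(name), "code points,", total_bytes, "bytes")
```

Because any text can be written as bytes, a byte-level tokenizer never needs an unknown token, at the cost of producing many pieces for scripts that are rare in its training data.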


### [Extra] List of BERT Word Pieces

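If you are curious whether a token is in the BERT vocabulary, you can find the complete list for the cased model here: https://huggingface.co/bert-base-cased/raw/main/vocab.txt. Keep in mind that the tokenizer loaded above is the uncased variant, so its vocabulary differs.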

**Warning**: The file is pretty big! 😅

## Part 2. Masked Language Model using BERT-based models

In this section, we introduce two different language models - BERT and RoBERTa. Although they have both been trained using a masked language modeling objective, they have some differences in the way they were trained, including the datasets and different hyperparameters.


In [None]:
%%time
%%capture
# Loads the BERT language model
bert = pipeline('fill-mask', model='bert-base-uncased')
# Loads the RoBERTa language model
roberta = pipeline('fill-mask', model='roberta-base')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


CPU times: user 26.4 s, sys: 5.54 s, total: 31.9 s
Wall time: 36.7 s


To better understand what masked language modeling is, let us walk you through some examples. During training, these models were given *fill-in-the-blank* sentences in which randomly chosen words were masked out.
One such example is predicting the blank in the following sentence:

> my _____ is cute.

We will check the guesses of BERT and RoBERTa, but before that, what word do you think it should be? 😀
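
As an aside, the random masking used at training time can be sketched in a few lines. This is a simplified illustration, not the actual pretraining code: BERT masks about 15% of the tokens and additionally replaces some selected tokens with random words or leaves them unchanged, which is omitted here. The function name, the `seed` argument, and the example sentence are assumptions.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide tokens; return the masked sequence and the labels to predict."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)    # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)   # no prediction needed at this position
    return masked, labels


tokens = "my dog is cute and my cat is cute too".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```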


In [None]:
text = "2 + 2 = <mask>."
get_prediction(bert, text, n=5)
print()
get_prediction(roberta, text, n=5)

[bert] 2 + 2 = <mask>. -----> 0	 0.11
[bert] 2 + 2 = <mask>. -----> 2	 0.104
[bert] 2 + 2 = <mask>. -----> 3	 0.09
[bert] 2 + 2 = <mask>. -----> 1	 0.088
[bert] 2 + 2 = <mask>. -----> 4	 0.052

[roberta] 2 + 2 = <mask>. -----> 2	 0.212
[roberta] 2 + 2 = <mask>. -----> 1	 0.138
[roberta] 2 + 2 = <mask>. -----> 3	 0.13
[roberta] 2 + 2 = <mask>. -----> 4	 0.126
[roberta] 2 + 2 = <mask>. -----> 0	 0.086


For the prompt `"My <mask> is cute."`, BERT assigns its highest scores to **mom** and **dad**, which is a little weird. Let us see what scores it gives to other tokens like **cat** and **dog**.

In [None]:
text = "My <mask> is cute."
get_prediction(bert, text, n=2, wordpieces=["dog", "cat"])

[bert] My <mask> is cute. -----> dog	 0.003
[bert] My <mask> is cute. -----> cat	 0.002


Uff! That's low!
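
To make these numbers less mysterious, here is a minimal sketch of how such scores can be read. It assumes that the scores are softmax probabilities over the whole vocabulary at the masked position, and that `wordpieces` only looks up entries in that distribution without renormalizing them (an assumption about the pipeline's behavior, not something this notebook verifies). The logits are made up for illustration.

```python
import math


def softmax(logits):
    """Turn a {token: logit} mapping into probabilities that sum to 1."""
    largest = max(logits.values())
    exps = {tok: math.exp(v - largest) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k(probs, k):
    """Return the k most probable (token, probability) pairs."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Made-up logits for the masked position (not real model outputs).
toy_logits = {"mom": 2.0, "dad": 1.8, "dog": 0.5, "cat": 0.3, "car": -1.0}
probs = softmax(toy_logits)

print(top_k(probs, 2))
# Looking up specific words reuses the same distribution.
targets = ["dog", "cat"]
print({t: round(probs[t], 3) for t in targets})
```

Under this reading, a low score for **dog** does not mean the word is implausible, only that many other tokens share the probability mass.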

#### A Sentiment Classification example

Let us now try something more challenging. Sentiment classification is the task where we provide a review (of a movie, book, food, hotel, etc.) and ask the model to predict whether it is a positive or a negative review.

In [None]:
text = "The movie was a lot of fun. I this this movie was <mask>."

get_prediction(bert, text, n=5)
print()
get_prediction(roberta, text, n=5)

print("""\n
# -------------------------------------------------------------
# Check scores of some specific words by setting wordpieces
# -------------------------------------------------------------\n
"""
)
word_subset = ["good", "love", "hate", "bad"]

get_prediction(bert, text, n=5, wordpieces=word_subset)
print()
# get_prediction(roberta, text, n=5, wordpieces=word_subset)

[bert] The movie was a lot of fun. I this this movie was <mask>. -----> fun	 0.248
[bert] The movie was a lot of fun. I this this movie was <mask>. -----> good	 0.105
[bert] The movie was a lot of fun. I this this movie was <mask>. -----> great	 0.096
[bert] The movie was a lot of fun. I this this movie was <mask>. -----> awesome	 0.076
[bert] The movie was a lot of fun. I this this movie was <mask>. -----> amazing	 0.048

[roberta] The movie was a lot of fun. I this this movie was <mask>. -----> good	 0.296
[roberta] The movie was a lot of fun. I this this movie was <mask>. -----> great	 0.154
[roberta] The movie was a lot of fun. I this this movie was <mask>. -----> fun	 0.114
[roberta] The movie was a lot of fun. I this this movie was <mask>. -----> awesome	 0.109
[roberta] The movie was a lot of fun. I this this movie was <mask>. -----> funny	 0.029


# -------------------------------------------------------------
# Check scores of some specific words by setting wordpieces
# ------

In [None]:
text = "The movie was a lot of fun. I think this movie was <mask>."
word_subset = ["good", "love", "hate", "bad"]

get_prediction(bert, text, n=5, wordpieces=word_subset)
print()
get_prediction(roberta, text, n=5, wordpieces=word_subset)

[bert] The movie was a lot of fun. I this this movie was <mask>. -----> good	 0.105
[bert] The movie was a lot of fun. I this this movie was <mask>. -----> bad	 0.007
[bert] The movie was a lot of fun. I this this movie was <mask>. -----> love	 0.001
[bert] The movie was a lot of fun. I this this movie was <mask>. -----> hate	 0.0

[roberta] The movie was a lot of fun. I this this movie was <mask>. -----> good	 0.0
[roberta] The movie was a lot of fun. I this this movie was <mask>. -----> bad	 0.0
[roberta] The movie was a lot of fun. I this this movie was <mask>. -----> love	 0.0
[roberta] The movie was a lot of fun. I this this movie was <mask>. -----> hate	 0.0
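
The RoBERTa scores of 0.0 for all four words are suspicious, since **good** was its top unrestricted prediction above. One plausible explanation, stated here as an assumption, is that RoBERTa's byte-level vocabulary marks a word that follows a space with a leading `Ġ` (as in the GPT-2 tokenizer output in Part 1), so the bare string `good` refers to a different, word-internal piece. A minimal sketch with a hypothetical helper that adds the marker:

```python
SPACE_MARKER = "Ġ"  # how byte-level BPE vocabularies display a leading space


def with_space_marker(words):
    """Hypothetical helper: prefix each word with the leading-space marker."""
    return [w if w.startswith(SPACE_MARKER) else SPACE_MARKER + w for w in words]


word_subset = ["good", "love", "hate", "bad"]
roberta_targets = with_space_marker(word_subset)
print(roberta_targets)
# Untested suggestion: pass these instead of the bare words, e.g.
# get_prediction(roberta, text, n=5, wordpieces=roberta_targets)
```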


In [None]:
text = "The acting and directing was terrible. I <mask> this movie."
get_prediction(bert, text, n=5)
print()
get_prediction(roberta, text, n=5)

[bert] The acting and directing was terrible. I <mask> this movie. -----> loved	 0.513
[bert] The acting and directing was terrible. I <mask> this movie. -----> love	 0.14
[bert] The acting and directing was terrible. I <mask> this movie. -----> hated	 0.11
[bert] The acting and directing was terrible. I <mask> this movie. -----> liked	 0.054
[bert] The acting and directing was terrible. I <mask> this movie. -----> enjoyed	 0.045

[roberta] The acting and directing was terrible. I <mask> this movie. -----> hated	 0.738
[roberta] The acting and directing was terrible. I <mask> this movie. -----> hate	 0.092
[roberta] The acting and directing was terrible. I <mask> this movie. -----> loved	 0.066
[roberta] The acting and directing was terrible. I <mask> this movie. -----> disliked	 0.029
[roberta] The acting and directing was terrible. I <mask> this movie. -----> liked	 0.017


**Note**: 
Each blank is replaced with the token `<mask>` as shown above!

### Exercise

Play around with the models and try to find cases where they fail but humans wouldn't. For example, would you expect the models to return the same output for the following prompts? Why? Why not?

- `"my <mask> is cute."`
- `"my <mask> is cute"`



### Breaking Masked Language Models

We have seen in class that language model pretraining is a laborious process that goes through large amounts of text and tries to predict the right word in randomly generated *fill-in-the-blank* sentences. As a consequence, the model ends up learning word co-occurrences in English.
As such, we might observe big changes in the model's outputs in response to small changes to the input.


#### Warmup 

In this part of the notebook, we ask you to find some prompts where the model's outputs differ significantly. You can choose a single model or compare the outputs of the different models. Are both models equally brittle to these variations, or is one more robust?
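
If you would like a starting point, here is a small sketch that produces surface-level variations of a prompt (punctuation and capitalization only). The helper name and the choice of perturbations are assumptions; feel free to add your own.

```python
def make_variants(prompt):
    """Return small surface-level perturbations of a prompt, without duplicates."""
    stem = prompt.rstrip(". ")
    candidates = [
        prompt,
        stem,                              # drop the final period
        prompt[:1].upper() + prompt[1:],   # capitalize the first letter
        prompt[:1].lower() + prompt[1:],   # lowercase the first letter
        stem + "!",                        # swap the period for "!"
    ]
    variants = []
    for candidate in candidates:
        if candidate not in variants:
            variants.append(candidate)
    return variants


for variant in make_variants("my <mask> is cute."):
    print(variant)
```

Each variant can then be passed to `get_prediction` with the model of your choice to compare the top predictions side by side.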

In [None]:
model = ... # TODO: decide whether you want to use bert or roberta
text = ... # TODO: write the text

get_prediction(model, text, n=5)

#### Exercise 1. Finding evidence of Bias and Discrimination

A common concern nowadays is whether models are perpetuating biases and unwittingly causing harm to certain population groups. In this exercise, we ask you to come up with different *fill-in-the-blank* sentences for which you would expect the model to be (or not to be) biased.


**Hint**: A few such examples could be `"the doctor said <mask> was busy tonight."` or `"<mask> is a mechanic."`. What do you expect the output to be? Does the model meet your expectations? Why (or why not)?
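
To probe many occupations at once, one option is to expand a few templates over a list of jobs. This is only a sketch: the two templates come from the hint above, while the job list and the helper name are assumptions.

```python
from itertools import product

templates = [
    "<mask> is a {job}.",
    "the {job} said <mask> was busy tonight.",
]
jobs = ["mechanic", "doctor", "nurse", "teacher"]  # illustrative choices


def build_prompts(templates, jobs):
    """Fill every template with every job."""
    return [template.format(job=job) for template, job in product(templates, jobs)]


prompts = build_prompts(templates, jobs)
for prompt in prompts:
    print(prompt)
```

Comparing the scores of pronouns such as "he" and "she" across these prompts gives a first, rough picture of which occupations the model associates with which gender.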

In [None]:
model = ... # TODO: decide whether you want to use bert or roberta
text = ... # TODO: write the text

get_prediction(model, text, n=5)

Playing around with job occupations might be a fun starting point, but try to come up with other examples. Can you find examples of discrimination against other socio-demographic groups?

In [None]:
# TODO
# Try writing down other fill in the blank sentences
# Examples: "<mask> is a mechanic."

#### Exercise 2. Know your facts

Despite the biases, these models are quite powerful! In fact, they can serve as knowledge bases. For instance, if you wish to know where Barack Obama was born, a RoBERTa or BERT model can often provide that information, although, as the scores below show, not always with much confidence.



In [None]:
get_prediction(roberta, "Obama was born in <mask>.", n=5)

[roberta] Obama was born in <mask>. -----> Hawaii	 0.135
[roberta] Obama was born in <mask>. -----> Chicago	 0.127
[roberta] Obama was born in <mask>. -----> Kenya	 0.099
[roberta] Obama was born in <mask>. -----> 1963	 0.035
[roberta] Obama was born in <mask>. -----> 1961	 0.033


Your job is to **find impressive examples** or some examples where the models fail. Do both models fail or just one? 
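
When comparing many prompts, it can help to turn "did the model get it right?" into a simple check. The sketch below tests whether an expected answer appears among the top `k` predictions. The helper name is an assumption, and the prediction list is copied from the RoBERTa output above.

```python
def hit_at_k(predictions, gold, k):
    """True if `gold` is among the first k predictions (case-insensitive)."""
    gold = gold.strip().lower()
    return any(p.strip().lower() == gold for p in predictions[:k])


# Top-5 RoBERTa predictions for "Obama was born in <mask>." (from the output above).
roberta_top5 = ["Hawaii", "Chicago", "Kenya", "1963", "1961"]

print(hit_at_k(roberta_top5, "Hawaii", k=1))  # True
print(hit_at_k(roberta_top5, "1961", k=3))    # False
```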


In [None]:
model = ... # TODO: decide whether you want to use bert or roberta
text = ... # TODO: write the text

get_prediction(model, text) # TODO: 

#### Exercise 3. The sky is the \<mask>

In this section, we invite you to try different things, from common sense to temporal or numerical reasoning. We want to know what you find out about the models.


As a few examples to get you running, consider templates like: 

- `two plus two is <mask>`
- `Elephants are much <mask> than mice.`




In [None]:
model = ... # TODO: decide whether you want to use bert or roberta
text = ... # TODO: write the text

get_prediction(model, text) # TODO: 

## Part 3. What if we had more flexibility in Masked Language Modeling? 

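In this section, we will interact with a model that was not trained with the same single-token masked language modeling objective as BERT, but that can still be queried via *fill-in-the-blank* prompts.

This model is known as T5. It is an encoder-decoder model pretrained with a span-corruption objective: spans of the input are replaced by sentinel tokens such as `<extra_id_0>`, and the decoder generates the missing spans autoregressively, from left to right. Its encoder still reads the whole input, so the model does see the context on both sides of each blank.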
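
For reference, the Hugging Face T5 documentation listed under Resources describes T5's pretraining format as follows: masked spans in the input are replaced by sentinel tokens, and the target lists each sentinel followed by the text it hides, ending with one extra sentinel. A minimal sketch of that format, with the function name and the span positions as assumptions:

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) token span with a sentinel; return (input, target)."""
    inputs, targets, cursor = [], [], 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # hide the span in the input
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # reveal the span in the target
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel closes the target
    return " ".join(inputs), " ".join(targets)


tokens = "The cute dog walks in the park".split()
source, target = span_corrupt(tokens, spans=[(1, 3), (5, 6)])
print(source)
print(target)
```

This is why a single blank can be filled with several words: the text after a sentinel in the target can be as long as the hidden span.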

### Setup and define helper methods

In [None]:
%%time
%%capture
!pip install sentencepiece
!pip install transformers

from transformers import T5Tokenizer, T5ForConditionalGeneration

CPU times: user 92.5 ms, sys: 111 ms, total: 204 ms
Wall time: 8.28 s


In [None]:
import re

MASK_PATTERN = r"<extra_id_[0-9]{1,2}>"

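def get_masks(text: str) -> dict:
  """Map each sentinel ``<extra_id_N>`` in `text` to the text that follows it."""
  sentinels = re.findall(MASK_PATTERN, text)
  # Drop whatever precedes the first sentinel (e.g. a leftover pad token),
  # so that each remaining fragment lines up with its own sentinel
  values = re.split(MASK_PATTERN, text)[1:]
  results = {}
  for name, val in zip(sentinels, values):
    # Skip sentinels that are followed by no text
    if val:
      results[name] = val

  return results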


def replace_masks(text_orig: str, masked_outputs: list):
  """Replace the masks in the original text with the masked outputs."""
  final_texts = []

  for out in masked_outputs:
    text = text_orig
    # Get the masks in the model output 
    # (essentially scan for <extra_id_N> tags)
    masks = get_masks(out)

    # Replace each mask in the original text
    for mask_name, mask_value in masks.items():
      # Some tokens have some whitespace surrounding it
      mask_value = mask_value.strip()
      text = text.replace(mask_name, mask_value)

    final_texts.append(text)
  return final_texts

def to_t5_mask_format(text: str) -> str:
  fragments = text.split("<mask>")
  t5_masked = []

  for i, frag in enumerate(fragments):
    t5_masked.append(frag)

    if i < len(fragments)-1:
      t5_masked.append(f"<extra_id_{i}>")

  text = " ".join(t5_masked)
  return text

def generate_input(text, model, tokenizer, n=5, max_length=5, num_beams=200, debug=False):
  # Get T5 mask format using sentinels <extra_id_n>
  masked_text = to_t5_mask_format(text)
  if debug:
    print(masked_text)

  # Encode the masked text
  encoded = tokenizer.encode_plus(masked_text, add_special_tokens=True, return_tensors='pt')

  # Generating `n` sequences with maximum length set to `max_length`
  outputs = model.generate(input_ids=encoded['input_ids'],
                           num_beams=num_beams,
                           num_return_sequences=n,
                           max_length=max_length)
  
  # Note: skip_special_tokens=False, so that we preserve the sentinels
  # and can do the proper parsing of the string.
  outputs = tokenizer.batch_decode(outputs, skip_special_tokens=False, clean_up_tokenization_spaces=False)
  # Remove the pad token from the output (if present)
  outputs = [out.replace("<pad> ", "") for out in outputs]

  outputs = replace_masks(masked_text, outputs)
  for i, out in enumerate(outputs):
    print(out)



### Generate text with T5 


In this section, we'll try to coax some text out of T5. T5 is an encoder-decoder model: its encoder reads the whole prompt, and its decoder writes the output autoregressively, one token at a time. Although it was not trained in the same way as the BERT-style **masked language models**, it can still produce text when given a fill-in-the-blank prompt.

Let's try some, shall we? (: 


First, let us download a pre-trained T5 model! Starting small with `t5-small` is a good way to check that everything is up and running. The cell below loads the base version by specifying `t5-base`.

**Note that this will make your generations more interesting but also much slower!**

In [None]:
%%time
%%capture
model_name = 't5-base' # "t5-small", "t5-base"

# Load tokenizer and model
t5_tokenizer = T5Tokenizer.from_pretrained(model_name)
t5_model = T5ForConditionalGeneration.from_pretrained(model_name)

CPU times: user 24.2 s, sys: 4.53 s, total: 28.8 s
Wall time: 37.2 s


Now we just provide the fill-in-the-blank prompts we'd like! One key aspect is that T5 is no longer limited to a single word piece per mask: it can generate multiple word pieces for the same mask! This is due to the autoregressive nature of its decoder.

In [None]:
text = "I <mask> NLP!"
generate_input(text, model=t5_model, tokenizer=t5_tokenizer)

I  love  NLP!
I  LOVE  NLP!
I  love  NLP!
I  NEED  NLP!
I  LOVE  NLP!


### Exercise 

Like before, just play around with some patterns and try to see what you can get. Can you break the model? :)

We can frame it more similarly to other NLP tasks, like Sentiment Classification. If you're a foodie, you can give it a try 😜 Give us your wildest review and let us see what T5 has to say about that!

In [None]:
text = "The movie was a lot of fun. I <mask> this movie."
generate_input(text, model=t5_model, tokenizer=t5_tokenizer, max_length=10)

print()

text = "The acting and directing was terrible. I <mask> this movie."
generate_input(text, model=t5_model, tokenizer=t5_tokenizer, max_length=10)

The movie was a lot of fun. I  really enjoyed  this movie.
The movie was a lot of fun. I  loved  this movie.
The movie was a lot of fun. I  highly recommend  this movie.
The movie was a lot of fun. I  enjoyed  this movie.
The movie was a lot of fun. I  really enjoyed  this movie.

The acting and directing was terrible. I  didn't like  this movie.
The acting and directing was terrible. I  didn't like  this movie.
The acting and directing was terrible. I  did not like  this movie.
The acting and directing was terrible. I  didn't like  this movie.
The acting and directing was terrible. I  didn't like  this movie.


## [Optional] Teaser for tomorrow's class

If you'd like to explore a slightly different class of language models, head over to the [HuggingFace t5-small playground](https://huggingface.co/t5-small).

## Resources

- [HuggingFace Documentation on T5: Training section](https://huggingface.co/docs/transformers/model_doc/t5#training)
- [HuggingFace Github Issue](https://github.com/huggingface/transformers/issues/3985)