<a href="https://colab.research.google.com/github/ucbnlp24/hws4nlp24/blob/main/HW3/HW3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 3: Language Models, Contextual Embedding and BERT

In this homework, we will explore implementations of various language models we saw in lecture. We will explore BERT and measure perplexity.

## Set Up

If you're opening this notebook on Colab, you will probably need to install Transformers. Make sure your version of Transformers is at least 4.11.0.

In [None]:
! pip install transformers



In [None]:
import transformers
print(transformers.__version__)

4.37.2


IMPORTANT: For this assignment, GPU is not necessary. The following code block should show "Running on cpu".
Go to Runtime > Change runtime type > Hardware accelerator > None if otherwise.

In [None]:
import torch
torch.manual_seed(159259)
torch.cuda.manual_seed(159259)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Running on {}".format(device))

Running on cpu


# Masking

One of the core ideas to wrap your head around with transformer-based language models (and PyTorch) is the concept of *masking*---preventing a model from seeing specific tokens in the input during training.

* BERT training relies on the concept of *masked language modeling*: masking a random set of input tokens in a sequence and attempting to predict them.  Remember that BERT is *bidirectional*, so that it can use all of the other non-masked tokens in a sentence to make that prediction.

* The GPT class of models acts as a traditional left-to-right language model (sometimes called a "causal" LM).  This family also uses self-attention based transformers---but, when making a prediction for the word $w_i$ at position $i$, it can only use information about words $w_1, \ldots, w_{i-1}$ to do so.  All of the other tokens following position $i-1$ must be *masked* (hidden from view).


Think about a mask as a matrix that's applied to every input $w$ when generating an output $o$ that determines whether a given $o_i$ is allowed to access each token in $w$.  For example, when passing a three-word input sequence through a transformer (to yield a three-word output sequence), a mask is a $3 \times 3$ matrix where the cells are essentially answering the following questions:

\begin{bmatrix}
o_1 \; \textrm{hide} \; w_1\textrm{?} & o_1 \; \textrm{hide} \; w_2\textrm{?} & o_1 \; \textrm{hide} \; w_3\textrm{?} \\
o_2 \; \textrm{hide} \; w_1\textrm{?} & o_2 \; \textrm{hide} \; w_2\textrm{?} & o_2 \; \textrm{hide} \; w_3\textrm{?} \\
o_3 \; \textrm{hide} \; w_1\textrm{?} & o_3 \; \textrm{hide} \; w_2\textrm{?} & o_3 \; \textrm{hide} \; w_3\textrm{?} \\
\end{bmatrix}

In the masks we will consider below, 1 denotes that a position should be hidden; 0 denotes that it should be visible. Consider this mask:

\begin{bmatrix}
0 & 1 & 1 \\
1 & 0 & 1 \\
1 & 1 & 0
\end{bmatrix}

And consider this sequence:

\begin{bmatrix}
\textrm{John} & \textrm{likes}  & \textrm{dogs}  \\
\end{bmatrix}

When applying this mask to that sequence, we're saying that when we're generating the output for $o_1$ (*John*), we can only consider $w_1$ as an input (*John*).  Likewise, when we generate the output for $o_2$ (*likes*), we can only consider $w_2$ as an input (*likes*), and so on.  (This is a terrible mask!  But it illustrates what function a mask performs.)

The following code illustrates how this works for that particular mask.


In [None]:
import numpy as np
np.random.seed(159259)

def visualize_masking(sequences, mask):
  print(mask)
  for sequence in sequences:
    for i in range(len(sequence)):
      visible=[]
      for j in range(len(sequence)):
        if mask[i][j]==0:
          visible.append(sequence[j])
      print("for word %s, the following tokens are visible: %s" % (sequence[i], visible))
    print()

In [None]:
sequences=[["This", "is", "a", "sentence", "that", "has", "exactly", "ten", "tokens", "."], ["Here's", "another", "sequence", "with", "10", "words", "like", "the", "last", "."]]

seq_length=len(sequences[0])

test_mask=np.ones((seq_length,seq_length))
for i in range(seq_length):
  test_mask[i,i]=0

visualize_masking(sequences, test_mask)



[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 0. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 0. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 0. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 0. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 0. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 0. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 0.]]
for word This, the following tokens are visible: ['This']
for word is, the following tokens are visible: ['is']
for word a, the following tokens are visible: ['a']
for word sentence, the following tokens are visible: ['sentence']
for word that, the following tokens are visible: ['that']
for word has, the following tokens are visible: ['has']
for word exactly, the following tokens are visible: ['exactly']
for word ten, the following tokens are visible: ['ten']
for word tokens, the following tokens are visible: ['tokens']
for word ., the following tokens are visible: ['.']

for word Here's, the following tokens are visible: ["Here's"]
for word another, the follow

## Q1
As we discussed in class, BERT masks a random set of words in the input and attempts to reconstruct those words as output.  **Create a mask that always masks token positions 2 and 7 for a size 10 sequence input.** You should generate output representations for all 10 tokens (i.e., $[o_1, \ldots, o_{10}]$ in the notation above). Each representation must ignore input tokens at positions 2 and 7.

In [None]:
def create_bert_mask(seq_length):
  mask=np.ones((seq_length,seq_length))
  # implement BERT mask here

  # BEGIN SOLUTION
  for i in range(seq_length):
    for j in range(seq_length):
        mask[i][j] = 0
        if j == 2 or j == 7:
          mask[i][j] = 1

  # END SOLUTION

  return mask
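The same kind of mask can also be built without explicit loops. The sketch below is one possible vectorized version, assuming the positions are 0-indexed as in the loop above; `column_mask` and `hidden_positions` are hypothetical names introduced only for illustration.

```python
import numpy as np

def column_mask(seq_length, hidden_positions):
    """Build a (seq_length x seq_length) mask where 1 = hidden and 0 = visible.

    Every output position (row) hides the same input positions (columns).
    """
    mask = np.zeros((seq_length, seq_length))
    # Set whole columns to 1: no output may look at these input positions.
    mask[:, hidden_positions] = 1
    return mask

mask = column_mask(10, [2, 7])
print(mask)
```

Because the hidden positions are the same for every output, the mask is constant along each column, which is why a single column assignment is enough.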

In [None]:
visualize_masking(sequences, create_bert_mask(len(sequences[0])))

[[0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]]
for word This, the following tokens are visible: ['This', 'is', 'sentence', 'that', 'has', 'exactly', 'tokens', '.']
for word is, the following tokens are visible: ['This', 'is', 'sentence', 'that', 'has', 'exactly', 'tokens', '.']
for word a, the following tokens are visible: ['This', 'is', 'sentence', 'that', 'has', 'exactly', 'tokens', '.']
for word sentence, the following tokens are visible: ['This', 'is', 'sentence', 'that', 'has', 'exactly', 'tokens', '.']
for word that, the following tokens are visible: ['This', 'is', 'sentence', 'that', 'has', 'exactly', 'tokens', '.']
for word has, the following tokens are visible: ['This', 'is', 'sentence', 'that', 'h

## Q2
A left-to-right language model (such as GPT) can only use information from input words $[w_1, \ldots, w_{i}]$ when generating the representation for output $o_i$.  **Encode a representation that masks input from $[w_{i+1}, \ldots, w_{n}]$.**

*Example:*

For input of length 5, the beginning of the mask should be:

```
[[0, 1, 1, 1, 1],
 [0, 0, 1, 1, 1],
 ...             ]
```

In [None]:
def create_causal_mask(seq_length):
  mask=np.ones((seq_length,seq_length))
  # implement causal mask here

  # BEGIN SOLUTION
  for i in range(seq_length):
    for j in range(seq_length):
      if i == j or i > j:
        mask[i][j] = 0
  # END SOLUTION

  return mask
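A causal mask is the strict upper triangle of a matrix of ones, so it can also be written in one line. The sketch below uses `np.triu` with `k=1` (keep only the entries above the main diagonal); `causal_mask` is a hypothetical name used here for illustration.

```python
import numpy as np

def causal_mask(seq_length):
    """Return a mask where entry [i][j] is 1 (hidden) exactly when j > i."""
    # k=1 keeps the entries strictly above the diagonal and zeroes the rest,
    # so each output position sees itself and everything to its left.
    return np.triu(np.ones((seq_length, seq_length)), k=1)

small = causal_mask(5)
print(small)
```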

In [None]:
visualize_masking(sequences, create_causal_mask(len(sequences[0])))

[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
for word This, the following tokens are visible: ['This']
for word is, the following tokens are visible: ['This', 'is']
for word a, the following tokens are visible: ['This', 'is', 'a']
for word sentence, the following tokens are visible: ['This', 'is', 'a', 'sentence']
for word that, the following tokens are visible: ['This', 'is', 'a', 'sentence', 'that']
for word has, the following tokens are visible: ['This', 'is', 'a', 'sentence', 'that', 'has']
for word exactly, the following tokens are visible: ['This', 'is', 'a', 'sentence', 'that', 'has', 'exactly']
for word ten, the following tokens are visible: ['This', 'is', 'a', 'sentence', 'that', 'has', 'exactly'

Now let's go ahead and embed these masks within a model.  First, we'll load some textual data (from Austen's *Pride and Prejudice*).

In [None]:
!wget https://www.gutenberg.org/files/1342/1342-0.txt

--2024-02-23 22:20:43--  https://www.gutenberg.org/files/1342/1342-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 772186 (754K) [text/plain]
Saving to: ‘1342-0.txt.3’


2024-02-23 22:20:44 (6.57 MB/s) - ‘1342-0.txt.3’ saved [772186/772186]



In [None]:
import nltk
from nltk import word_tokenize
from collections import Counter

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Let's read in the data and tokenize it; for this homework, we'll only work with the first 10,000 tokens of that book; we'll keep only the most frequent 1,000 word types (all other tokens will be mapped to an [UNK] token).

In [None]:
def read_data(filename):
  with open(filename) as file:
    data=file.read().lower()
    first10K=' '.join(data.split(" ")[:10000])
    toks=nltk.word_tokenize(first10K)[:10000]
    vocab={"[PAD]":0, "[UNK]":1}
    counts=Counter()
    for tok in toks:
      counts[tok]+=1
    for v, _ in counts.most_common(1000):
      vocab[v]=len(vocab)
    tokids=[]
    for tok in toks:
      tokid=1
      if tok in vocab:
        tokid=vocab[tok]

      tokids.append(tokid)

    return tokids, vocab
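To see the vocabulary truncation on a small scale, here is a toy sketch of the same idea: keep the most frequent word types and map everything else to `[UNK]`. The helper names `build_vocab` and `encode`, and the example sentence, are hypothetical and only meant to illustrate the mapping.

```python
from collections import Counter

def build_vocab(tokens, max_types):
    """Assign ids to the max_types most frequent tokens; ids 0 and 1 are reserved."""
    vocab = {"[PAD]": 0, "[UNK]": 1}
    for tok, _ in Counter(tokens).most_common(max_types):
        vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Map each token to its id, falling back to the [UNK] id for unseen types."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

toks = "the cat saw the dog and the cat".split()
vocab = build_vocab(toks, max_types=2)  # keeps only "the" and "cat"
ids = encode(toks, vocab)
print(vocab)
print(ids)
```

With only two word types kept, every occurrence of "saw", "dog", and "and" collapses to the same `[UNK]` id.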

Now let's specify our model in PyTorch.

In [None]:
from torch import nn
import torch

class MaskedLM(nn.Module):
    def __init__(self, vocab, mask, d_model=512):
        super().__init__()
        self.vocab=vocab
        self.mask=mask
        vocab_size=len(vocab)
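        # size the embedding table from the vocabulary and model width rather than hard-coding them
        self.embeddings=nn.Embedding(vocab_size, d_model)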
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
        self.linear=torch.nn.Linear(d_model, vocab_size)
        self.rev_vocab={vocab[k]:k for k in vocab}

    def forward(self, input):
        # first we pass the input word IDS through an embedding layer to get embeddings for them
        input=self.embeddings(input)
        # then we pass those embeddings through a transformer to get contextual representations, masking the input where appropriate
        out = self.transformer_encoder.forward(input, mask=self.mask)
        # finally we pass those embeddings through a linear layer to transform it into the output space (the size of our vocabulary)
        h=self.linear(out)
        return h
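When a boolean mask is given to the encoder, PyTorch treats `True` positions as not allowed to attend. One way to picture the effect is that hidden positions have their attention scores set to negative infinity before the softmax, so they receive zero weight. The sketch below illustrates that idea in plain Python for a single row of scores; `masked_softmax` and the score values are hypothetical, and this is not PyTorch's actual implementation.

```python
import math

def masked_softmax(scores, hidden):
    """Softmax over one row of attention scores, ignoring hidden positions.

    scores: raw attention scores for one output position
    hidden: 0/1 flags in this notebook's convention (1 = hidden, 0 = visible)
    Assumes at least one position is visible.
    """
    # exp(-inf) is 0, so hidden positions contribute nothing to the weights.
    masked = [-math.inf if h else s for s, h in zip(scores, hidden)]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Second row of a causal mask for a 3-token input: o_2 may see w_1 and w_2, not w_3.
weights = masked_softmax([0.5, 1.5, 3.0], hidden=[0, 0, 1])
print(weights)
```

Even though the third score is the largest, its weight is exactly zero because the mask hides it.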

In [None]:
def get_batches(xs, ys, batch_size=32):
    batch_x=[]
    batch_y=[]
    for i in range(0, len(xs), batch_size):
        batch_x.append(torch.LongTensor(xs[i:i+batch_size]).to(device))
        batch_y.append(torch.LongTensor(ys[i:i+batch_size]).to(device))
    return batch_x, batch_y

In [None]:
tokids, vocab=read_data("1342-0.txt")

In [None]:
def train(mask, data_function, tokids, vocab):

    mask=torch.BoolTensor(mask).to(device)

    num_labels=len(vocab)
    model=MaskedLM(vocab, mask).to(device)
    optimizer=torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
    cross_entropy=nn.CrossEntropyLoss()
    losses=[]

    xs, ys=data_function(tokids)

    batch_x, batch_y=get_batches(xs, ys)

    for epoch in range(1):
        model.train()

        for x, y in list(zip(batch_x, batch_y)):
            x, y = x.to(device), y.to(device)
            y_pred=model.forward(x)
            loss=cross_entropy(y_pred.view(-1, num_labels), y.view(-1))
            losses.append(loss.item())
            print(loss)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Our model and training process are now all defined; all that remains is to pass our inputs and outputs through it to train.  Your job here is to create the correct inputs (x) and outputs (y) to train a left-to-right (causal) language model.

## Q3
**Write a function that takes in a sequence of token ids $[w_1, \ldots, w_n]$ and segments it into 8-token chunks and creates a corresponding $y_i$ for every $x_i$.** The x segmentation should operate like this: $x_1=[w_1, \ldots, w_8]$, $x_2=[w_9, \ldots, w_{16}]$, etc. Given this language modeling specification, each $y_i$ should also contain 8 values (one for each token in $x_i$).  Keep in mind this is a left-to-right causal language model; your job is to figure out the values of y that respect this design. At token position $i$, when a model has access to $[w_1, \ldots, w_i]$, which is the true $y_i$ for that position? Each element in $y$ should be a word ID (i.e., an integer).

Edge Case: When the length of `data` is not divisible by `max_len`, we discard the remainder of the input data. Here is an example for clarity:

 >  Given this sentence input:  
 > "This is a sentence for natural language processing homework that goes on and on"   
 > The function should produce this segmentation (with `max_len = 8`):  
 > `xs = [[This, is, a, sentence, for, natural, language, processing]]`  
 >  The same applies to `ys`.

 *Note*: `ys` will not be a copy of `xs`

In [None]:
def get_causal_xy(data, max_len=8):
    xs=[]
    ys=[]

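    # BEGIN SOLUTION
    # drop the trailing remainder so every chunk has exactly max_len tokens
    data_len = len(data) - (len(data) % max_len)
    for i in range(0, data_len, max_len):
      xi = data[i : i + max_len]
      # targets are the inputs shifted one position to the right (next-token prediction)
      yi = data[i+1 : i + max_len+1]
      # the final chunk has no next token for its last position when it ends the data; skip it
      if len(yi) < max_len:
        break

      xs.append(xi)
      ys.append(yi)
    # END SOLUTION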

    return xs, ys
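To make the relationship between `xs` and `ys` concrete, the sketch below applies the same chunk-and-shift idea to the example sentence from the question, using words instead of ids so the result is easy to read. `chunk_with_next_token_targets` is a hypothetical stand-alone helper, not the graded function.

```python
def chunk_with_next_token_targets(tokens, max_len=8):
    """Split tokens into max_len chunks; each target chunk is shifted right by one."""
    xs, ys = [], []
    # Stop early enough that every chunk has a next token for its last position.
    for start in range(0, len(tokens) - max_len, max_len):
        xs.append(tokens[start:start + max_len])
        ys.append(tokens[start + 1:start + max_len + 1])
    return xs, ys

words = ("This is a sentence for natural language processing "
         "homework that goes on and on").split()
xs, ys = chunk_with_next_token_targets(words)
print(xs)
print(ys)
```

Each `ys` chunk is its `xs` chunk with the first token dropped and the following token appended, so position $i$ in `ys` holds the word that comes right after position $i$ in `xs`.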

In [None]:
seq_length=8

train(create_causal_mask(seq_length=seq_length), get_causal_xy, tokids, vocab)

tensor(7.0286, grad_fn=<NllLossBackward0>)
tensor(6.4421, grad_fn=<NllLossBackward0>)
tensor(6.4656, grad_fn=<NllLossBackward0>)
tensor(6.0913, grad_fn=<NllLossBackward0>)
tensor(6.1931, grad_fn=<NllLossBackward0>)
tensor(6.1629, grad_fn=<NllLossBackward0>)
tensor(6.0787, grad_fn=<NllLossBackward0>)
tensor(6.1768, grad_fn=<NllLossBackward0>)
tensor(6.1566, grad_fn=<NllLossBackward0>)
tensor(6.5156, grad_fn=<NllLossBackward0>)
tensor(6.2127, grad_fn=<NllLossBackward0>)
tensor(5.6328, grad_fn=<NllLossBackward0>)
tensor(5.3481, grad_fn=<NllLossBackward0>)
tensor(5.1073, grad_fn=<NllLossBackward0>)
tensor(5.0457, grad_fn=<NllLossBackward0>)
tensor(5.1834, grad_fn=<NllLossBackward0>)
tensor(5.0366, grad_fn=<NllLossBackward0>)
tensor(4.7546, grad_fn=<NllLossBackward0>)
tensor(4.7268, grad_fn=<NllLossBackward0>)
tensor(5.1561, grad_fn=<NllLossBackward0>)
tensor(5.4131, grad_fn=<NllLossBackward0>)
tensor(5.2509, grad_fn=<NllLossBackward0>)
tensor(4.6203, grad_fn=<NllLossBackward0>)
tensor(4.89

*You should be seeing the cross entropy loss gradually decrease to be less than 5.*


# Perplexity
To evaluate how good our language model is, we use a metric called perplexity. The perplexity of a language model (PP) on a test set is the inverse probability of the test set, normalized by the number of words. Let $W = w_{1}w_{2}\dots w_{N}$. Then,

$$PP(W) = \sqrt[N]{\prod_{i = 1}^{N}\frac{1}{P(w_{i}|w_{1}\dots w_{i - 1})}}$$

However, since these probabilities are often small, taking the inverse and multiplying can be numerically unstable, so we often first compute these values in the log domain and then convert back. So this equation looks like:

$$\ln PP(W) = \frac{1}{N} \sum_{i = 1}^{N} -\ln P(w_{i}|w_{1}\dots w_{i - 1})$$

$$\implies PP(W) = e^{\frac{1}{N} \sum_{i = 1}^{N} -\ln P(w_{i}|w_{1}\dots w_{i - 1})}$$
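As a small numerical check of the two formulas, the sketch below computes perplexity both directly and in the log domain for a made-up list of per-token probabilities. The function names and the probability values are hypothetical; they are only meant to show that the two forms agree and why the log-domain form is safer.

```python
import math

def perplexity_direct(probs):
    """N-th root of the product of inverse probabilities (the first formula)."""
    product = 1.0
    for p in probs:
        product *= 1.0 / p
    return product ** (1.0 / len(probs))

def perplexity_log_domain(probs):
    """exp of the average negative log probability (the log-domain formula)."""
    avg_nll = sum(-math.log(p) for p in probs) / len(probs)
    return math.exp(avg_nll)

probs = [0.5, 0.25, 0.125, 0.5]   # made-up token probabilities
print(perplexity_direct(probs), perplexity_log_domain(probs))

# With many small probabilities the running product overflows to infinity,
# while the log-domain version stays finite.
tiny = [1e-5] * 100
print(perplexity_direct(tiny), perplexity_log_domain(tiny))
```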

Here we want to calculate the perplexity of a [pretrained BERT model](https://huggingface.co/bert-base-uncased) on text from different sources. When calculating perplexity with BERT, we'll use a related measure called pseudo-perplexity, which allows us to condition on the bidirectional context (and not just the left context, as in standard perplexity):

$$PP(W) = e^{\frac{1}{N} \sum_{i = 1}^{N} -\ln P(w_{i} \mid w_{1}\dots w_{i - 1}, w_{i+1}, \ldots, w_{N})}$$


First, let's instantiate a BERT model, along with its WordPiece tokenizer.

In [None]:
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
import numpy as np

base_model_name = 'bert-base-uncased'
base_model = AutoModelForMaskedLM.from_pretrained(base_model_name)
base_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model=base_model.to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identica

Let's see how the BERT tokenizer tokenizes a sentence into a sequence of WordPiece ids.  Note how BERT tokenization automatically wraps an input sentence with [CLS] and [SEP] tags.

In [None]:
sentence = "A dog landed on Mars"
tensor_input = base_tokenizer(sentence, return_tensors="pt")
print(tensor_input)
tensor_input_ids = tensor_input["input_ids"]
print(tensor_input_ids)
print(base_tokenizer.convert_ids_to_tokens(tensor_input_ids[0]))

{'input_ids': tensor([[ 101, 1037, 3899, 5565, 2006, 7733,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}
tensor([[ 101, 1037, 3899, 5565, 2006, 7733,  102]])
['[CLS]', 'a', 'dog', 'landed', 'on', 'mars', '[SEP]']


Now let's see how we can calculate output probabilities using this model.  The output of each token position $i$ gives us $P(w_i \mid w_1, \ldots, w_n)$---the probability of the word at that position over our vocabulary, given *all* of the words in the sentence.

In [None]:
with torch.no_grad():
  output = base_model(tensor_input_ids)
  logits = output.logits
  # logits here are the unnormalized scores, so let's pass them through the softmax
  # to get a probability distribution
  softmax = torch.nn.functional.softmax(logits, dim = -1)
  # for one input sequence, the shape of the resulting distribution is:
  # 1 x [length of input, in WordPiece tokens] x (the size of the BERT vocabulary)
  print(softmax.shape) # [1, 7, 30522]
  input_ints=tensor_input_ids.numpy()[0]
  # Let's print the probability of the true inputs
  wp_tokens=base_tokenizer.convert_ids_to_tokens(input_ints)
  for i in range(len(input_ints)):
    prob=softmax[0][i][input_ints[i]].numpy()
    print("%s\t%s\t%.5f" % (wp_tokens[i], input_ints[i], prob))

torch.Size([1, 7, 30522])
[CLS]	101	0.00000
a	1037	0.99281
dog	3899	0.99052
landed	5565	0.99809
on	2006	0.99874
mars	7733	0.00133
[SEP]	102	0.00000


Note that $w_i$ is in the range $[w_1, \ldots, w_n]$ -- clearly the probability of a word is going to be high when we can observe it in the input! Let's do some masking to calculate $P(w_i \mid w_1, \ldots, w_{i-1}, w_{i+1}, \ldots, w_n)$.  Now annoyingly, BERT's `attention_mask` argument is only meant for padding tokens; to mask input tokens, we need to intervene in the input and replace a WordPiece token that we're predicting with a special [MASK] token (BERT tokenizer word id `103`).

In [None]:
import copy

with torch.no_grad():
  # let's make a copy of the original word ids so we can mask one of the tokens
  masked_input_ids=copy.deepcopy(tensor_input_ids)
  # we'll mask the second word
  masked_input_ids[0][1]=base_tokenizer.convert_tokens_to_ids("[MASK]")

  print("The second word here now is [MASK] token ID '103': ", masked_input_ids)

  # now let's run that through BERT in the same way we did before
  output = base_model(masked_input_ids)
  logits = output.logits

  softmax = torch.nn.functional.softmax(logits, dim = -1)
  input_ints=tensor_input_ids.numpy()[0]

  wp_tokens=base_tokenizer.convert_ids_to_tokens(input_ints)
  i=1
  prob=softmax[0][i][input_ints[i]].numpy()
  print("%s\t%s\t%.5f" % (wp_tokens[i], input_ints[i], prob))

The second word here now is [MASK] token ID '103':  tensor([[ 101,  103, 3899, 5565, 2006, 7733,  102]])
a	1037	0.13965


You can see the probability of "a" as the second token has gone down to 0.13965 when we mask it.  This is $P(w_1 =\textrm{a} \mid w_0, w_2, \ldots, w_n)$, indexing positions from 0 so that $w_0$ is [CLS].  At this point you should have everything you need to calculate the BERT pseudo-perplexity of an input sentence.

## Q4
**Implement the pseudo-perplexity measure described above, calculating the perplexity for a given model, tokenizer, and sentence. You MUST comment your code to clarify the steps you have taken.**

The function exponentiates the average negative log probability of each token in the sentence given all the other tokens. To get each probability, mask the one word being predicted and read off the model's probability for the true word at that position. Note that you should not include the probabilities of the [CLS] and [SEP] tokens in your perplexity equation -- those tokens are not part of the original test sentence.
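The masking step can be pictured independently of the model. The sketch below takes the WordPiece ids printed earlier for "A dog landed on Mars" and produces one masked copy per real token, skipping [CLS] and [SEP]. `masked_variants` is a hypothetical helper that works on plain Python lists; the id `103` for [MASK] is the one stated above.

```python
def masked_variants(input_ids, mask_id):
    """Yield (position, masked_copy) for every token except the first and last."""
    for pos in range(1, len(input_ids) - 1):
        masked = list(input_ids)   # copy so the original ids stay untouched
        masked[pos] = mask_id
        yield pos, masked

# [CLS] a dog landed on mars [SEP], as printed by the tokenizer cell above
ids = [101, 1037, 3899, 5565, 2006, 7733, 102]
variants = list(masked_variants(ids, mask_id=103))
for pos, masked in variants:
    print(pos, masked)
```

Each masked copy would then be passed through the model, and the probability of the original id at `pos` is what enters the pseudo-perplexity sum.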



In [None]:
# This function calculates the perplexity of a language model, given a sentence and its corresponding tokenizer

# Inputs:
# model: language model being used to calculate the perplexity
# tokenizer: tokenizer that is used to preprocess the input sentence
# sentence: input sentence string for which perplexity is to be calculated

# Outputs:
# returns perplexity of the input sentence

def perplexity(model, tokenizer, sentence):

    # hints: you'll need to:
    # encode the input sentence using the tokenizer
    # for each WordPiece token in the sentence (except [CLS] and [SEP]), mask that single token and
    # calculate the probability of that true word at the masked position
    # don't calculate perplexity for the [CLS] and [SEP] tokens (which are not part of the original test sentence).

    perplexity=None
    # BEGIN SOLUTION

    #tokenize sentence into sequence of WordPiece ids. note: input sentence is wrapped with [CLS] and [SEP] tags
    tensor_input = tokenizer(sentence, return_tensors = 'pt')
    tensor_input_ids = tensor_input['input_ids']

    #calculate each word's probability when masked and store probabilities in probs
    probs = []
    #mask each word and find its probability.
    for i in range(1, len(tensor_input_ids[0])-1): #skipping [CLS] and [SEP] tags
      with torch.no_grad():
        #make copy of original word ids so we can mask one of the tokens. one at a time, collecting each of their probabilities
        masked_input_ids = copy.deepcopy(tensor_input_ids)
        masked_input_ids[0][i] = tokenizer.convert_tokens_to_ids('[MASK]')

        #output of token position i = the probability of the word at that position over our vocab, given all words in a sentence
        output = model(masked_input_ids)
        logits = output.logits
        #pass logit through softmax to get probability distribution
        softmax = torch.nn.functional.softmax(logits, dim = -1)
        input_ints = tensor_input_ids.numpy()[0]
        #store probability of the true word at masked position i in probs list
        probs.append(softmax[0][i][input_ints[i]].numpy())

    #now we work with these probabilities to find the perplexity.
    N = len(probs)
    summation = sum(-np.log(prob) for prob in probs)
    avg = summation / N
    perplexity = np.exp(avg)

    # END SOLUTION

    return perplexity

In [None]:
print(perplexity(sentence='London is the capital of the United Kingdom.', model=base_model, tokenizer=base_tokenizer))

1.0596669499203932


*Sanity Check: You should be seeing perplexity close to 1*

# Comparing Different Pretrained BERT using Perplexity

Note: the following section will be using your implementation of `perplexity()`.

In a previous question, we explored the perplexity of `bert-base-uncased` using the sentence "London is the capital of the United Kingdom." According to its [HuggingFace page](https://huggingface.co/bert-base-uncased), this model underwent pretraining with two significant datasets: **BookCorpus**—comprising 11,038 unpublished books—and **English Wikipedia**, specifically its text portions (lists, tables, and headers excluded). It's important to note that both datasets are in English. Given this background, it would be insightful to assess the model's performance with a Simplified Chinese translation of the sentence "London is the capital of the United Kingdom," considering its training data.


In [None]:
# Simplified Chinese translation of "London is the capital of the United Kingdom."
print(perplexity(sentence='伦敦是英国的首都。', model=base_model, tokenizer=base_tokenizer))

273.36646577452393


Luckily, we have another flavor of BERT just for that. [**BERT Multilingual**](https://huggingface.co/bert-base-multilingual-uncased) is pretrained on the 102 languages with the largest Wikipedias. This way, the model learns an inner representation of multiple languages. Now, let's see how it performs on the previous example in Chinese.

In [None]:
multilingual_model_name = 'bert-base-multilingual-cased' # TODO: Click on the Hugging Face link and fill in the model name
multilingual_model = AutoModelForMaskedLM.from_pretrained(multilingual_model_name)
multilingual_tokenizer = AutoTokenizer.from_pretrained(multilingual_model_name)
multilingual_model=multilingual_model.to(device)

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# Simplified Chinese translation of "London is the capital of the United Kingdom."
print(perplexity(sentence='伦敦是英国的首都。', model=multilingual_model, tokenizer=multilingual_tokenizer))

2.7009387957943582


The third model in our discussion, [**BERT Large**](https://huggingface.co/bert-large-uncased), significantly expands on its predecessors in terms of size and capacity. While [**BERT Base**](https://huggingface.co/bert-base-uncased) contains 110 million parameters, and [**BERT Multilingual**](https://huggingface.co/bert-base-multilingual-uncased) slightly increases this figure to 168 million parameters, **BERT Large** has 336 million parameters. This substantial increase in parameters enables the **BERT Large** model to handle more complex language tasks with greater efficacy. However, **BERT Large** is pretrained on the same dataset as **BERT Base**. In the following question, you will compare the three BERT models side by side using multilingual inputs.


In [None]:
large_model_name = 'bert-large-uncased' # TODO: Click on the Hugging Face link and fill in the model name
large_model = AutoModelForMaskedLM.from_pretrained(large_model_name)
large_tokenizer = AutoTokenizer.from_pretrained(large_model_name)
large_model=large_model.to(device)

Some weights of the model checkpoint at bert-large-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Q5 (Write up)

We will evaluate the performance of three BERT models by calculating their perplexity across various inputs. These inputs will differ along two dimensions:

1. Logical coherence (Rational vs. Irrational)
2. Language (English vs. Other languages)

Below are the four sample inputs provided for this analysis:

- For logical coherence in English:
  - Rational: "London is the capital of the United Kingdom"...
  - Irrational: "London is the capital of the United States"...
  
- For logical coherence in Simplified Chinese:
  - Rational: "伦敦是英国的首都。"... (London is the capital of the United Kingdom.)
  - Irrational: "伦敦是美国的首都。"... (London is the capital of the United States.)

The goal is to compute and compare the perplexity of our BERT models on these inputs. This comparison will help us understand how well each model can handle logical coherence and language variability. After running the two cells below, you will write a short answer in response to the results.

*Note: we are computing the average perplexity over 7 example sentences for each style of input. This is because, in practice, perplexity evaluation of LMs is done by averaging over a corpus, as the individual sentence scores can vary randomly.*

In [None]:
def compare_models(inputs: list[list[str]], model_dictionary: dict):
  for input in inputs:
    print(input[0] + "...")
    for model_name, model_meta in model_dictionary.items():
      ps = []
      for sentence in input:
        p = perplexity(
          sentence=sentence,
          model=model_meta["model"],
          tokenizer=model_meta["tokenizer"]
        )
        ps.append(p)
      avg_p = np.mean(ps)
      print(f" - {model_name}\t{avg_p}")
    print("==============")

english_rational = [
    "London is the capital of the United Kingdom.",
    "The Great Wall of China is visible from space.",
    "Shakespeare wrote the play 'Romeo and Juliet'.",
    "The human body has 206 bones.",
    "The Earth orbits the Sun once every 365.25 days.",
    "The chemical symbol for gold is Au.",
    "The Pacific Ocean is the largest ocean on Earth."
]
english_irrational = [
    "London the is United States capital the of.",
    "longest The Great slideof China Wall in the. world is the",
    "screenplay Shakespeare the for 'wrote The Terminator'.",
    "body has 206 The human pebbles.",
    "The Moon once 365.25 Earth orbits every the days.",
    "The symbol chemical for  Gl gold is  .",
    "The Pacific Ocean is the largest swimming pool on Earth."
]
chinese_rational = [
    "伦敦是英国的首都。",
    "长城可以从太空中看见。",
    "莎士比亚写了《罗密欧与朱丽叶》这部戏剧。",
    "人体有206块骨头。",
    "地球绕太阳转一圈大约需要365.25天。",
    "金的化学符号是Au。",
    "太平洋是地球上最大的海洋。"
]
chinese_irrational = [
    "伦敦的首是美国都。",
    "长滑梯城上最长是世界的。",
    "《终结者》了的剧本莎士比亚写。",
    "人体颗鹅卵有206石。",
    "地球365.25一圈大约需要天绕月球转。",
    "化学符号金Gl的是。",
    "最大太平洋上的是地球游泳池。"
]

In [None]:
model_dictionary = {
    "Base BERT" : {
        "model" : base_model,
        "tokenizer" : base_tokenizer
    },
    "Multilingual" : {
        "model" : multilingual_model,
        "tokenizer" : multilingual_tokenizer
    },
    "Large BERT" : {
        "model" : large_model,
        "tokenizer" : large_tokenizer
    }
}

inputs = [
    english_rational,
    english_irrational,
    chinese_rational, # Translation of `english_rational`
    chinese_irrational  # Translation of `english_irrational`
]

compare_models(inputs, model_dictionary)

# THIS CELL WILL TAKE ~5 MIN TO RUN

London is the capital of the United Kingdom....
 - Base BERT	9.64857491197
 - Multilingual	16.372904987229365
 - Large BERT	5.427852264905104
London the is United States capital the of....
 - Base BERT	1786.1066923203557
 - Multilingual	1383.8890177932167
 - Large BERT	3489.688384097369
伦敦是英国的首都。...
 - Base BERT	389.22307566249293
 - Multilingual	10.598953520046122
 - Large BERT	157.6115739212149
伦敦的首是美国都。...
 - Base BERT	318.44269447988165
 - Multilingual	127.09980372470706
 - Large BERT	130.72529683720492


**Observe the results and highlight patterns that reflect characteristics of the BERT models. For each observation, use the observed pattern to justify characteristics of one specific model or a group of models. You should provide 3 observations with 50 words each**. One example is provided for you. Please double click the cell below to enter your response.


1. (Example) The perplexity of both Base BERT and Large BERT increases on irrational English sentences compared to rational English sentences. This suggests that both models are better at predicting rational sentences due to knowledge learned from their pretraining corpus.
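2. The perplexity of Base BERT and Large BERT decreases on irrational Chinese sentences compared to rational Chinese sentences, while the perplexity of Multilingual BERT increases (from about 10.6 to 127.1). This suggests that the two English-only models cannot tell coherent Chinese from scrambled Chinese, whereas Multilingual BERT can.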
3. The Multilingual BERT model has lower perplexity than Base BERT and Large BERT on both rational and irrational Chinese sentences. Together with the earlier example of running a Simplified Chinese sentence through Base BERT vs. Multilingual BERT, this suggests that Multilingual BERT is the better-suited model for non-English languages.
4. Multilingual BERT has the lowest perplexity of the three models on irrational English sentences (about 1383.9), but the highest on rational English sentences (about 16.4, versus 9.6 for Base BERT and 5.4 for Large BERT). Spreading its capacity across 102 languages may leave it with a weaker model of well-formed English than the English-only models.