# Lab 3: Calculating perplexity

A language model can be used to calculate the probability of **a sequence** of text. This is very useful for a number of applications, e.g. judging whether a word was likely mistyped and the corrected word makes more sense in the context.

These probability calculations can also be used to broadly evaluate a language model. A language model should give a higher probability to a real block of text compared to one that contains random words. The measure used in this context is known as **perplexity**. Let's figure out how to calculate it.

First, we'll define a span of text whose probability under the language model we want to calculate.

In [2]:
real_text = "It was the best of times, it was the worst of times"


Let's load up the `distilgpt2` tokenizer and model that we've used before.

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

We want to calculate the probability of the whole sequence, which starts with the word "It". So what probability does `distilgpt2` give for "It" starting a sequence? To find out, we need to add the beginning-of-sequence (BOS) token at the start of the string. `distilgpt2` has only one special token (`<|endoftext|>`), which it uses to separate pieces of text, whereas other models (e.g. BERT) have dedicated tokens for the beginning of a sequence (`[CLS]` for BERT). We'll add GPT's special token at the beginning of this sequence.

In [4]:
bos_token = tokenizer.special_tokens_map['bos_token']
real_text_with_bos = f"{bos_token} {real_text}"
real_text_with_bos

'<|endoftext|> It was the best of times, it was the worst of times'

Now we tokenize it and ask for PyTorch tensors using `return_tensors='pt'` so that it's ready to be fed into the Transformer.

In [5]:
tokenized = tokenizer(real_text_with_bos, return_tensors='pt')
tokenized

{'input_ids': tensor([[50256,   632,   373,   262,  1266,   286,  1661,    11,   340,   373,
           262,  5290,   286,  1661]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

That gave us 14 tokens including the special token `<|endoftext|>` which has ID 50256.

We can now feed that tokenized text into the `distilgpt2` transformer model.

In [7]:
output = model(**tokenized)

In [8]:
output

CausalLMOutputWithCrossAttentions(loss=None, logits=tensor([[[-32.8782, -30.4127, -32.6386,  ..., -43.8683, -43.8586, -32.0607],
         [-51.8211, -52.7703, -54.7660,  ..., -61.3993, -55.5683, -54.8043],
         [-70.8942, -72.2311, -76.1172,  ..., -78.8249, -75.2432, -74.6483],
         ...,
         [-42.1591, -47.5681, -52.7264,  ..., -56.9564, -55.0613, -47.0133],
         [-44.3839, -46.1254, -50.4883,  ..., -51.0306, -52.4772, -47.1517],
         [-35.3398, -40.0403, -43.8940,  ..., -51.6798, -49.4999, -38.2232]]],
       grad_fn=<UnsafeViewBackward0>), past_key_values=((tensor([[[[-0.9310,  2.1511,  1.3583,  ..., -1.6581, -0.5595,  1.6256],
          [-2.3221,  2.2498,  2.3471,  ..., -1.0194, -1.5275,  1.8522],
          [-1.9049,  2.2993,  1.9366,  ..., -1.2037, -3.2026,  2.2419],
          ...,
          [-1.9888,  1.3952,  1.1575,  ..., -1.4404, -1.2264,  0.4281],
          [-2.5564,  2.6051,  1.5983,  ..., -0.5955, -1.9933,  2.6414],
          [-3.0031,  2.3393,  2.5966, 

And what size of output do we get?

In [9]:
logits = output.logits.detach().numpy()  # .detach().numpy() converts the PyTorch tensor to a NumPy array
logits.shape

(1, 14, 50257)

The output contains a score for every possible next token in the model's vocabulary (all 50257 of them) at each of the 14 input positions. Recall that `len(tokenizer.vocab)` is 50257.

We also need to apply the softmax function as before to make the scores look like probabilities.

In [10]:
from scipy.special import softmax

softmaxxed = softmax(logits, axis=2) # We use axis=2 so that the softmax is applied to each token's scores in turn and not to all the values together

Now the scores should look more like probabilities and be between 0 and 1.
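To make the role of the axis concrete, here is a minimal pure-Python sketch of the same idea. The helper name `softmax_1d` and the toy logits are made up for illustration (they are not part of the lab or of `scipy`); the point is only that each position's scores are normalised on their own, so every row sums to 1.

```python
import math

def softmax_1d(scores):
    """Turn one list of raw scores (logits) into probabilities."""
    # Subtracting the maximum does not change the result,
    # but it keeps exp() from overflowing on large scores.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: 2 positions, each scored over a 4-token vocabulary.
toy_logits = [
    [-32.9, -30.4, -32.6, -43.9],
    [-51.8, -52.8, -54.8, -61.4],
]

# Normalise each position separately (the analogue of axis=2 on the real array).
toy_probs = [softmax_1d(row) for row in toy_logits]

for row in toy_probs:
    print([round(p, 4) for p in row], "sum =", round(sum(row), 6))
```

Applying the softmax over all values at once would instead produce a single distribution spread across every position, which is not what we want here.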

In [33]:
softmaxxed.shape

(1, 14, 50257)

In [11]:
softmaxxed

array([[[5.2592589e-04, 6.1899549e-03, 6.6829839e-04, ...,
         8.8715462e-09, 8.9581622e-09, 1.1912026e-03],
        [2.2344657e-05, 8.6488117e-06, 1.1755636e-06, ...,
         1.5467685e-09, 5.2699352e-07, 1.1313784e-06],
        [3.8375958e-05, 1.0079329e-05, 2.0688992e-07, ...,
         1.3797350e-08, 4.9581871e-07, 8.9876431e-07],
        ...,
        [7.3675411e-03, 3.2976055e-05, 1.8967479e-07, ...,
         2.7602123e-09, 1.8363645e-08, 5.7434248e-05],
        [1.4915577e-06, 2.6141319e-07, 3.3308936e-09, ...,
         1.9366180e-09, 4.5578807e-10, 9.3672405e-08],
        [1.4167164e-02, 1.2878785e-04, 2.7304848e-06, ...,
         1.1348130e-09, 1.0038102e-08, 7.9255301e-04]]], dtype=float32)

So what is the probability that `distilgpt2` gives to a text sequence beginning with "It"? Well, we need the token IDs from our sequence to figure this out. We can remove the first one as it is the special token `<|endoftext|>`.

In [12]:
token_ids = tokenized['input_ids'][0]
token_ids = token_ids[1:]
token_ids

tensor([ 632,  373,  262, 1266,  286, 1661,   11,  340,  373,  262, 5290,  286,
        1661])

**"It"** has token ID **632**. To get its probability following the first token (the special beginning-of-sequence token), we can index into `softmaxxed`.

We set the first coordinate to 0 (it's the first and only sequence in the batch), the second coordinate to 0 (the prediction made after the first token, i.e. after `<|endoftext|>`) and the third coordinate to 632, the token ID for "It".

In [14]:
softmaxxed[0, 0, 632]

0.004289047
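The indexing pattern is easier to see on a tiny example. The sketch below uses a made-up 5-token vocabulary and hand-written probabilities (none of these numbers come from `distilgpt2`); it only illustrates that the distribution stored at position `i` is used to score the token at position `i + 1`.

```python
# Hypothetical 5-token vocabulary; token 0 plays the role of the start token.
# toy_probs[batch][position][token_id] = P(next token is token_id | tokens up to position)
toy_probs = [[
    [0.10, 0.60, 0.10, 0.10, 0.10],  # predictions after position 0 (the start token)
    [0.05, 0.05, 0.70, 0.10, 0.10],  # predictions after position 1
    [0.20, 0.20, 0.20, 0.30, 0.10],  # predictions after position 2 (nothing left to score)
]]

input_ids = [0, 1, 2]    # start token followed by two "real" tokens
targets = input_ids[1:]  # drop the start token: these are the tokens we score

# The distribution at position i scores the token that sits at position i + 1.
token_probs = [toy_probs[0][i][tok] for i, tok in enumerate(targets)]
print(token_probs)  # [0.6, 0.7]
```

Note that the last row is never used: it predicts a token that would come after the end of our input.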

That's a probability of about 0.004, which is actually quite high given that there are 50257 possible tokens. What's the probability of **starting with** a rare word such as 'mismatch'? Note that the `Ġ` prefix marks a token that begins a new word (it stands for the preceding space).

In [23]:
tokenizer.vocab['Ġmismatch']

46318

In [15]:
softmaxxed[0, 0, tokenizer.vocab['Ġmismatch']]

4.8842557e-09

As expected, that is much lower.

Practically, we often work with **log probabilities** as there are numerical problems when working with such tiny numbers. To calculate the probability of a sequence of tokens, we'd need to multiply together many probabilities which creates a **tiny** number. If we log-transform the probabilities, multiplication becomes addition and we can add the log probabilities together.
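A quick way to see the numerical problem is to multiply many small numbers directly. The probabilities below are invented for illustration (100 tokens, each with probability 1e-5); base-2 logs are used to match the rest of this lab.

```python
import math

# Hypothetical per-token probabilities: 100 tokens, each with probability 1e-5.
token_probs = [1e-5] * 100

# Multiplying directly: the true value (1e-500) is too small for a float.
product = 1.0
for p in token_probs:
    product *= p

# Adding log probabilities instead keeps the number in a comfortable range.
log_sum = sum(math.log2(p) for p in token_probs)

print(product)  # 0.0 (the product has underflowed)
print(log_sum)  # an ordinary negative number
```

Once the product has underflowed to zero, the information is gone, whereas the sum of logs can still be compared against other sequences.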

We'll use log **base 2** for this. Let's get the log probability of "It" starting the sequence:

In [16]:
import math

math.log(softmaxxed[0, 0, 632], 2)  # 2 is due to base 2.

-7.865127206017433
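As an aside, taking the log of an already-computed softmax can itself fail if a probability has underflowed to exactly zero. A common alternative, sketched below in pure Python, is to compute log probabilities straight from the logits using the log-sum-exp trick. This is not the approach used in this lab; the function name `log2_softmax` and the toy logits are assumptions made for illustration.

```python
import math

def log2_softmax(scores):
    """Return base-2 log probabilities for one list of raw scores (logits)."""
    highest = max(scores)
    # Log-sum-exp trick: natural log of the softmax denominator, computed stably.
    log_denominator = highest + math.log(sum(math.exp(s - highest) for s in scores))
    # (score - log_denominator) is the natural-log probability; divide by ln(2) for base 2.
    return [(s - log_denominator) / math.log(2) for s in scores]

# Hypothetical scores; the last token is wildly unlikely.
toy_logits = [2.0, 1.0, -1000.0]

log_probs = log2_softmax(toy_logits)
print(log_probs)
```

With these scores, the probability of the last token underflows to zero in a plain softmax, so `math.log` would raise an error on it, while the log-space version still returns a finite (very negative) value.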

Now that we can get the probabilities out, can you calculate the log probability of the whole sequence? This involves **adding up all the log probabilities**, i.e.
1. "It" (token ID=632) being after the first token
2. "was" (token ID=373) being after the second token, etc. 

The token IDs are available in `token_ids` and the [enumerate](https://docs.python.org/3/library/functions.html#enumerate) function may be helpful.



In [41]:
total_log_prob = 0
for i, token_id in enumerate(token_ids.tolist()):
    # softmaxxed[0, i, :] holds the predictions made after input position i,
    # so it scores the i-th entry of token_ids ('It' at i=0, 'was' at i=1, 'the' at i=2, ...).
    # The first index is always 0 because there is only one sequence.
    total_log_prob += math.log(softmaxxed[0, i, token_id], 2)
    
total_log_prob

-44.53607419410286

You should have added up 13 log probabilities to get a sum of approximately `-44.5360`. We could use this to calculate the probability of the whole sequence, but we typically think about the probability averaged over the tokens. This makes it easier to compare different sequences that may have different lengths.

Divide the sum of log probabilities by the number of predicted tokens (13) to get the average log probability per token.

In [38]:
avg_log_prob = total_log_prob/len(token_ids)
avg_log_prob

-3.425851861084835

That should be approximately `-3.426`. Take that out of log space by calculating two to the power of that number.

In [39]:
2**avg_log_prob

0.09304988251206257

This should be roughly `0.0931`. That's the average (strictly, the geometric mean) per-token probability that `distilgpt2` assigns to our input sequence, which is actually pretty high for a probability.
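Averaging in log space and then exponentiating gives a geometric mean rather than the usual arithmetic mean. The short check below uses three invented probabilities to show that the two routes agree; none of the numbers come from the model.

```python
import math

# Hypothetical per-token probabilities (powers of two keep the arithmetic easy to follow).
token_probs = [0.5, 0.25, 0.125]

# Route 1: average the base-2 log probabilities, then leave log space.
avg_log2 = sum(math.log2(p) for p in token_probs) / len(token_probs)
from_logs = 2 ** avg_log2

# Route 2: the geometric mean computed directly (n-th root of the product).
geometric_mean = math.prod(token_probs) ** (1 / len(token_probs))

# For comparison: the ordinary arithmetic mean is a different (larger) number.
arithmetic_mean = sum(token_probs) / len(token_probs)

print(from_logs)                 # 0.25
print(round(geometric_mean, 6))
print(round(arithmetic_mean, 6))
```

The geometric mean is pulled down strongly by any single very unlikely token, which is why one out-of-place word can change the final score so much.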

Dealing with these tiny numbers is tricky, so it is normal to take the reciprocal (1 over the value) to get the perplexity. Let's do that as the final step.

In [40]:
perplexity = 1.0/(2**avg_log_prob)
perplexity

10.74692383271268

You should have a perplexity of roughly `10.7469`. **Lower** perplexity means that the language model was more likely to generate that text and was **less "surprised"** by the text. If there were any **out-of-place tokens**, the probability would be much lower, and hence the perplexity would be much higher.
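The steps above can be summarised in a single formula. The notation is introduced here for convenience and does not appear elsewhere in the lab: $N$ is the number of predicted tokens, $x_i$ is the $i$-th token, and $P(x_i \mid x_{<i})$ is the probability the model gives to $x_i$ after seeing the tokens before it.

```latex
\mathrm{PPL}(x_1, \dots, x_N)
  = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(x_i \mid x_{<i})}
```

For our sequence, $N = 13$ and the sum of log probabilities is approximately $-44.5360$, so the exponent is $44.5360 / 13 \approx 3.4259$ and $2^{3.4259} \approx 10.75$, matching the value computed above.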

Congratulations, you've worked through calculating one of the main metrics for examining language models. Typically this is done on very large samples of text to get a more accurate estimate that can be used to compare language models.
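One plausible way to extend the calculation to several sequences is to pool the log probabilities and token counts before averaging, so that longer sequences count for more. The sketch below assumes that approach; the function name `corpus_perplexity` and the second sequence's numbers are hypothetical, while the first entry reuses the totals from our example.

```python
def corpus_perplexity(sequence_stats):
    """Pool (total_log2_prob, n_predicted_tokens) pairs into one perplexity."""
    stats = list(sequence_stats)  # allow any iterable, including generators
    total_log2 = sum(log_prob for log_prob, _ in stats)
    total_tokens = sum(n_tokens for _, n_tokens in stats)
    # Same formula as for one sequence, applied to the pooled totals.
    return 2 ** (-total_log2 / total_tokens)

stats = [
    (-44.53607419410286, 13),  # the sequence scored in this lab
    (-30.0, 6),                # hypothetical second sequence
]

print(corpus_perplexity(stats[:1]))  # a single sequence reproduces roughly 10.7469
print(corpus_perplexity(stats))      # pooled over both sequences
```

Pooling by token is a design choice: averaging the per-sequence perplexities instead would give short sequences the same weight as long ones.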

## Optional Extra:
- Try recalculating perplexity for a span of text composed of random words (e.g. `highway dentist coffee surf table`). Does `distilgpt2` give it a higher or lower average probability than our original input text? And what about perplexity?

In [43]:
real_text2 = 'highway dentist coffee surf table'
real_text_with_bos2 = f"{bos_token} {real_text2}"
tokenized2 = tokenizer(real_text_with_bos2, return_tensors='pt')
tokenized2

{'input_ids': tensor([[50256, 12763, 38408,  6891,  9053,  3084]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}

In [44]:
token_ids2 = tokenized2['input_ids'][0]
token_ids2 = token_ids2[1:]
token_ids2

tensor([12763, 38408,  6891,  9053,  3084])

In [45]:
output2 = model(**tokenized2)
logits2 = output2.logits.detach().numpy()

softmaxxed2 = softmax(logits2, axis=2) # We use axis=2 so that the softmax is applied to each token's scores in turn and not to all the values together

total_log_prob2 = 0
for i, token_id2 in enumerate(token_ids2.tolist()):
    # Score each token using this sequence's own probabilities and token IDs
    total_log_prob2 += math.log(softmaxxed2[0, i, token_id2], 2)

total_log_prob2

In [46]:
avg_log_prob2 = total_log_prob2/len(token_ids2)
perplexity2 = 1.0/(2**avg_log_prob2)
perplexity2