# Lab 2: Language Modelling with Transformers

We're going to feed some text into a Transformer and examine how it outputs the probabilities for the next word/token.

First let's load up the `distilgpt2` tokenizer as we did before.

In [6]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilgpt2')

We're going to be interested in predicting the next subword token. How many possible subword tokens are there?

In [2]:
len(tokenizer.vocab) # or we could just use len(tokenizer)

50257

When tokenizing, we'll use the tokenizer with the `return_tensors='pt'` parameter. This puts the data into the format of a [PyTorch](https://pytorch.org) tensor which is used as the input for a Transformer model. PyTorch is a commonly used library for deep learning and HuggingFace builds upon it. We won't use PyTorch directly.

Let's tokenize: `"A horse! a horse! my kingdom for a"`

In [3]:
tokenized = tokenizer('A horse! a horse! my kingdom for a', return_tensors='pt')
tokenized

{'input_ids': tensor([[   32,  8223,     0,   257,  8223,     0,   616, 13239,   329,   257]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Note that it has been tokenized into 10 tokens.

In [4]:
len(tokenized['input_ids'][0])

10

Now we need to load up the full Transformer model. We need to use the same one that matches our tokenizer (`distilgpt2`). Tokenizers and models must match.

We'll load it using `AutoModelForCausalLM`. CausalLM is causal language modelling, or predicting the next token. You can also load models for other purposes like document classification.

In [7]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('distilgpt2')

Now let's pass the tokenized text into the Transformer model. We could do this with `model(input_ids=tokenized['input_ids'], attention_mask=tokenized['attention_mask'])` but a tidied shorthand is:

In [8]:
output = model(**tokenized)
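To see why the `**` shorthand is equivalent to spelling out the arguments, here is a minimal sketch in plain Python. The function `toy_model` and the dictionary `toy_tokenized` are hypothetical stand-ins (they are not part of `transformers`); the point is only that `**` unpacks a dictionary-like object into keyword arguments.

```python
def toy_model(input_ids, attention_mask):
    # Hypothetical stand-in for a model call; it simply echoes its arguments.
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# A plain dict standing in for the tokenizer output (values are illustrative).
toy_tokenized = {"input_ids": [[32, 8223, 0]], "attention_mask": [[1, 1, 1]]}

# Spelling out each keyword argument by hand...
explicit = toy_model(input_ids=toy_tokenized["input_ids"],
                     attention_mask=toy_tokenized["attention_mask"])

# ...gives the same call as unpacking the dict with **.
unpacked = toy_model(**toy_tokenized)

print(explicit == unpacked)
```

The shorthand is convenient because it keeps working if the tokenizer returns additional keys that the model also accepts.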

In [9]:
output

CausalLMOutputWithCrossAttentions(loss=None, logits=tensor([[[-31.1439, -29.1282, -30.8418,  ..., -42.3130, -42.1440, -31.0009],
         [-59.5865, -60.5802, -64.7680,  ..., -70.8865, -65.8933, -63.0499],
         [-62.7691, -63.7442, -64.5699,  ..., -75.1833, -72.3489, -60.4002],
         ...,
         [-51.0393, -59.1055, -63.8448,  ..., -68.9364, -65.0198, -59.6002],
         [-56.1765, -60.0481, -63.8827,  ..., -66.6802, -65.5936, -61.3876],
         [-63.7612, -64.7149, -67.7764,  ..., -75.3739, -69.5853, -65.8060]]],
       grad_fn=<UnsafeViewBackward0>), past_key_values=((tensor([[[[-0.9069,  2.5481,  1.5372,  ..., -1.3799, -0.7807,  1.4542],
          [-1.6978,  3.0068,  1.2534,  ..., -1.8757, -1.7081,  0.4693],
          [-1.7586,  2.7333,  1.7106,  ..., -1.3476, -1.9361,  1.9352],
          ...,
          [-2.4914,  2.7562,  1.6152,  ..., -1.4161, -2.7310, -0.1493],
          [-2.7463,  2.6533,  1.5964,  ..., -0.7182, -1.6476,  1.7722],
          [-2.5939,  2.5006,  1.6788, 

For causal language modelling, what we care about is the predictions of the next token. This is captured by the `logits` which are the **scores** for each of the possible tokens.

In [10]:
output.logits

tensor([[[-31.1439, -29.1282, -30.8418,  ..., -42.3130, -42.1440, -31.0009],
         [-59.5865, -60.5802, -64.7680,  ..., -70.8865, -65.8933, -63.0499],
         [-62.7691, -63.7442, -64.5699,  ..., -75.1833, -72.3489, -60.4002],
         ...,
         [-51.0393, -59.1055, -63.8448,  ..., -68.9364, -65.0198, -59.6002],
         [-56.1765, -60.0481, -63.8827,  ..., -66.6802, -65.5936, -61.3876],
         [-63.7612, -64.7149, -67.7764,  ..., -75.3739, -69.5853, -65.8060]]],
       grad_fn=<UnsafeViewBackward0>)

This is a PyTorch tensor which is a grid of numbers. In this case, it's a 3D grid. You can see the dimensions of it using `.shape` as below:

In [11]:
output.logits.shape

torch.Size([1, 10, 50257])

Where do the different numbers come from?

Well, we only put in one sequence of ten tokens, so that explains the `[1, 10,...]`. The `50257` is the size of the vocabulary of the tokenizer:

In [12]:
len(tokenizer)

50257

That means we can get the score that the Transformer has given to the token `horse` after the final token in the sequence. First, what is the token index for horse? Recall that, as it starts a new word, it is prefixed with the special character `Ġ`.

In [13]:
tokenizer.vocab['Ġhorse']

8223

In [15]:
tokenizer.vocab['Ġkingdom']

13239

In [16]:
tokenizer.vocab['Ġmy']

616

Then to get the score from the first sequence (0), after the final token (-1) and for the token `horse` (8223), we would access it with:

In [17]:
output.logits[0,-1,8223]

tensor(-59.6237, grad_fn=<SelectBackward0>)
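The three-part index can be easier to follow on a tiny example. Below is a minimal sketch using nested Python lists in place of a PyTorch tensor; the name `toy_logits` and all of its numbers are made up for illustration and are not model output.

```python
# Toy "logits" with shape [1 sequence, 3 positions, 4 vocabulary entries].
toy_logits = [
    [
        [0.1, 0.2, 0.3, 0.4],  # scores for the token that follows position 0
        [1.1, 1.2, 1.3, 1.4],  # scores for the token that follows position 1
        [2.1, 2.2, 2.3, 2.4],  # scores for the token that follows the final position
    ]
]

# Same indexing idea as output.logits[0, -1, 8223]:
# first sequence, final position, one chosen token id.
sequence, position, token_id = 0, -1, 2
score = toy_logits[sequence][position][token_id]
print(score)

# The equivalent of .shape for this nested list.
shape = (len(toy_logits), len(toy_logits[0]), len(toy_logits[0][0]))
print(shape)
```

The only difference with a real tensor is that PyTorch accepts all three indices inside a single pair of brackets.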

Hmm, the logits are raw scores rather than probabilities, so they are difficult to interpret. We'll have to do a little work to make them interpretable.

Let's get all the scores out for predictions of tokens after our input (so using the index of -1 to get the final logits).

In [18]:
next_token_scores = output.logits[0,-1,:].tolist()
len(next_token_scores)

50257

As we already saw, they are not easy to interpret.

In [19]:
next_token_scores[:5]

[-63.76122283935547,
 -64.71493530273438,
 -67.77637481689453,
 -67.36962890625,
 -67.97136688232422]

So we shall use a softmax function. It takes a list of numbers, applies the equation below to them (using lots of exponentials) and returns a vector where all the values are between 0 and 1 and they all add up to 1.

$ \textrm{softmax}(z)_{i} = \frac{e^{z_{i}}}{\sum_{j=1}^K e^{z_{j}}} \ \ \ \textrm{for}\ i=1,2,\dots,K $
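To make the formula concrete, here is a minimal pure-Python sketch of it. The name `softmax_sketch` and the input values are made up for illustration. Subtracting the largest score before exponentiating is a common numerical-stability trick that is assumed here; it does not change the result mathematically.

```python
import math

def softmax_sketch(scores):
    # Subtract the largest score first. The result is mathematically unchanged,
    # but it stops math.exp from underflowing to zero for very negative scores.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    # Divide each exponential by the total so the outputs sum to 1.
    return [e / total for e in exps]

toy_scores = [-63.8, -64.7, -67.8]  # made-up values in the range of the logits above
toy_probs = softmax_sketch(toy_scores)
print(toy_probs)
print(sum(toy_probs))
```

Higher scores receive higher probabilities, and the ordering of the scores is preserved.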

There is a [function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.softmax.html) in the useful [scipy package](https://scipy.org/) that does this for us.

In [20]:
from scipy.special import softmax

Apply the `softmax` function to `next_token_scores` and output the first five values. You should see that they are between 0 and 1 and rather small.

In [21]:
softmax(next_token_scores)[:5]

array([9.37539204e-05, 3.61241284e-05, 1.69134066e-06, 2.54026248e-06,
       1.39170476e-06])

The probabilities should also add up to 1 (or very close to it, because of small floating-point rounding errors). Check if this is the case using the `sum` function.

In [23]:
sum(softmax(next_token_scores))

0.9999999999999977
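The tiny gap from 1 comes from floating-point rounding, not from the softmax itself. As a minimal illustration with made-up numbers (not model output), adding `0.1` ten times in a simple loop accumulates rounding error, so it is safer to compare against 1 with a tolerance than with `==`.

```python
import math

parts = [0.1] * 10

# Accumulate one value at a time; each addition can round slightly.
running_total = 0.0
for p in parts:
    running_total += p
print(running_total)

# math.fsum tracks the lost precision and returns an accurately rounded sum.
print(math.fsum(parts))

# Comparing with a tolerance avoids being tripped up by rounding.
print(math.isclose(running_total, 1.0))
```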

Let's see what the probability of horse (token id = 8223) is now:

In [26]:
softmax(next_token_scores)[8223]

0.005873694458246944

You should find that it has a probability of approximately `0.006`.

If we didn't already know that 8223 is horse, we could decode it with the tokenizer.

In [27]:
tokenizer.decode(8223)

' horse'

For comparison, the highest probability assigned to any single token is:

In [28]:
max(softmax(next_token_scores))

0.342668730590573

Now, the final task is to go through the probabilities given by `softmax(next_token_scores)`, find which one is the highest and work out the corresponding token using `tokenizer.decode`.

In [31]:
import numpy as np
index = np.argmax(softmax(next_token_scores))
index

890

In [32]:
tokenizer.decode(index)

' long'

You should find that `' long'` has the highest probability (`≈ 0.3427`).
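Looking only at the single best token hides the runners-up. As a minimal sketch, the helper below ranks a list of probabilities and returns the top few. The function `top_k`, the four-entry `toy_vocab` and the numbers in `toy_probs` are all hypothetical and chosen for illustration.

```python
def top_k(probs, k):
    # Pair each index with its probability, then sort by probability, highest first.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

toy_vocab = [" long", " horse", " kingdom", " man"]  # hypothetical tiny vocabulary
toy_probs = [0.5, 0.1, 0.3, 0.1]                     # made-up probabilities

for token_id, prob in top_k(toy_probs, 2):
    print(token_id, repr(toy_vocab[token_id]), prob)
```

With the real model output, the same idea would apply to `softmax(next_token_scores)`, with `tokenizer.decode` taking the place of the `toy_vocab` lookup.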

That's the end of this mini-lab.

## Optional Extra:
- Try a different input sentence

In [34]:
tokenized2 = tokenizer("Let's go to San Jose to eat ", return_tensors='pt')

In [35]:
output2 = model(**tokenized2)

In [36]:
next_token_scores2 = output2.logits[0,-1,:].tolist()

In [37]:
index2 = np.argmax(softmax(next_token_scores2))

In [38]:
tokenizer.decode(index2)

'iced'