# Lab 4: Language Generation with Transformers

When predicting the next token, a GPT model gives us a score for every possible next token, and those scores can be converted into probabilities. We can use the probabilities to generate new text, either by selecting the most likely next token or by sampling in proportion to the probabilities. Let's see how that works.

Let's say that we want to generate more text after the sequence below:

In [1]:
text = 'The quick brown fox jumped over'

We'll need to load the tokenizer and model for `distilgpt2`.

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

As before, we use the tokenizer to split the text into tokens and convert each token to its token ID. We will use the `.encode` function to get the token IDs back as a Python list, which is easier to manipulate: we'll want to append the extra token IDs that we generate!

In [3]:
input_ids = tokenizer.encode(text)
input_ids

[464, 2068, 7586, 21831, 11687, 625]

We can use the `tokenizer.decode` function to turn the token IDs back into text. This will be useful after we've generated further token IDs to add on the end.

In [4]:
tokenizer.decode(input_ids)

'The quick brown fox jumped over'

Now let's run the token IDs through the `distilgpt2` model and get the probabilities of the next token.

In [5]:
import torch # We'll load PyTorch so we can convert a list to a tensor
from scipy.special import softmax

as_tensor = torch.tensor(input_ids).reshape(1,-1) # This converts the token ID list to a tensor
output = model(input_ids=as_tensor) # We pass it into the model
next_token_scores = output.logits[0,-1,:].detach().numpy() # We get the scores for the next token from the end of the sequence (token index=-1)
next_token_probs = softmax(next_token_scores) # And we apply a softmax function

next_token_probs.shape

(50257,)

Now we've got a probability for each of the 50257 possible tokens that could follow our input text.
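
To see what the softmax step is doing, here is a minimal pure-Python sketch. It is not the `scipy` implementation, and the three-token `toy_scores` list is a made-up example rather than output from `distilgpt2`. It exponentiates each score and divides by the total, so the results are positive and sum to 1.

```python
import math

def softmax(scores):
    """Convert a list of raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; it does not change the result
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a three-token vocabulary (not from distilgpt2)
toy_scores = [2.0, 1.0, 0.0]
toy_probs = softmax(toy_scores)

print(toy_probs)       # the highest score gets the highest probability
print(sum(toy_probs))  # the probabilities add up to 1 (up to rounding)
```

Higher scores always map to higher probabilities, which is why `argmax` gives the same answer on the scores as on the probabilities.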

Let's get the one with the highest probability. For that we can use the `argmax` function.

In [6]:
next_token_id = next_token_probs.argmax()
next_token_id

262

Hmm, the token with ID=262 has the highest probability. But what token is that? `tokenizer.decode` can tell us:

In [7]:
tokenizer.decode(next_token_id)

' the'

Now we have all the parts we need. Your task is to calculate the next eight tokens after `input_ids` (including the one we calculated above). You'll be appending `262` to the input token IDs, running the longer list through the model again and deciding the next token. Try writing it as a loop that iterates eight times.
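
If the shape of the loop is unclear, the sketch below shows one possible structure using a toy stand-in for the model. The function `toy_next_token_scores` and its five-token vocabulary are hypothetical and exist only for illustration; in your solution you would replace it with the real model call and softmax from the cells above.

```python
def toy_next_token_scores(token_ids):
    """Hypothetical stand-in for the model: one score per token in a 5-token vocabulary."""
    vocab_size = 5
    # Favour the token ID that follows the last one, wrapping around at the end
    favourite = (token_ids[-1] + 1) % vocab_size
    return [1.0 if i == favourite else 0.0 for i in range(vocab_size)]

def greedy_generate(start_ids, steps):
    """Append the highest-scoring next token `steps` times."""
    token_ids = list(start_ids)  # copy so the caller's list is left untouched
    for _ in range(steps):
        scores = toy_next_token_scores(token_ids)
        # Pure-Python argmax: the index with the largest score
        next_token_id = max(range(len(scores)), key=lambda i: scores[i])
        token_ids.append(next_token_id)
    return token_ids

print(greedy_generate([0, 3], steps=8))
```

The key point is that each newly chosen token is appended before the next call, so every prediction sees everything generated so far.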

Picking the token with the highest probability every time often produces quite boring text. Sampling from the tokens can generate more interesting text. Sampling uses the probabilities as weights so that words with higher probabilities are more likely to be chosen. Let's see how that works.

Let's imagine we've got probabilities for four possible tokens (a very tiny vocabulary).

In [8]:
import numpy as np # We use a NumPy array so we can call its argmax method

next_token_probs = np.array([0.1, 0.2, 0.5, 0.2]) # Probabilities must sum to 1

As we saw above, we can use `argmax`, which tells us the index of the highest value. In this case, it's index 2.

In [9]:
next_token_probs.argmax()

2

However, let's say we want to sample randomly from the possible token indices (`[0, 1, 2, 3]`). First, let's create that list to sample from:

In [10]:
indices = list(range(len(next_token_probs)))
indices

[0, 1, 2, 3]

In [11]:
import random

next_token_id = random.choices(indices, k=1)[0]
next_token_id

0

With no weights, `random.choices` picks every index with equal probability. Instead, we can provide weights so that some tokens are more likely to be chosen than others. In this case, we provide `next_token_probs` as weights.

In [12]:
next_token_id = random.choices(indices, k=1, weights=next_token_probs)[0]
next_token_id

2

That would allow us to sample from the token probability distribution.
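
A single draw doesn't show the effect of the weights very clearly, so the sketch below draws many samples and counts how often each index comes up. The `weights` list is a hypothetical four-token distribution, and the seed value is arbitrary.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the counts are repeatable

indices = [0, 1, 2, 3]
weights = [0.1, 0.2, 0.5, 0.2]  # hypothetical probabilities that sum to 1

# Draw 10,000 indices in one call by setting k
samples = random.choices(indices, weights=weights, k=10_000)
counts = Counter(samples)

# The observed fraction for each index should be close to its weight
for index in indices:
    print(index, counts[index] / len(samples))
```

Index 2 is chosen about half the time, but the less likely indices still appear, which is what makes sampled text more varied than always taking the `argmax`.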

Your task is to generate some new text (starting from "The quick brown fox jumped over" as before) using sampling and the `random.choices` function to pick your next token. Try it with weighting and without weighting to see what happens.

Try running your code again and you should get a different output because sampling is random. There are many tweaks that can be made to the random sampling strategy.
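
If you want the same "random" output on every run (for example, while debugging), one option is to fix the seed of Python's `random` module before sampling. The sketch below uses a hypothetical four-token distribution and an arbitrary seed of 42. Note that this only affects code that samples with the `random` module.

```python
import random

indices = [0, 1, 2, 3]
weights = [0.1, 0.2, 0.5, 0.2]  # hypothetical probabilities

random.seed(42)  # any fixed integer works; 42 is arbitrary
first_run = random.choices(indices, weights=weights, k=8)

random.seed(42)  # resetting the seed replays the same random draws
second_run = random.choices(indices, weights=weights, k=8)

print(first_run == second_run)  # True
```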

Fortunately, we don't have to implement all the different text generation functions ourselves. The Hugging Face `transformers` library provides a `text-generation` pipeline to generate text.

For example, here is how to run it and request 30 extra tokens and 5 different generations.

In [13]:
from transformers import pipeline
generator = pipeline('text-generation', model="distilgpt2")
generator("Hello, I'm a language model,", max_new_tokens=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': "Hello, I'm a language model, but I just wanted to see how it works."},
 {'generated_text': "Hello, I'm a language model, one that is not going to be used by everyone on the web or in any other part of the web, but should be the default language. If"},
 {'generated_text': "Hello, I'm a language model, and I'd like to thank everyone who's mentioned and helped me create it, including you, the author, everyone at the local and corporate level."},
 {'generated_text': "Hello, I'm a language model,” I'd rather just say this: I hope you can learn more!Thanks~Kara\n\nAdvertisements"},
 {'generated_text': "Hello, I'm a language model, and I've been doing this for over twenty years, and I think that's really important in keeping my software stable on demand.\n\n\nSo"}]

There are a lot of different options, including ones that control how sampling is done. If we don't want sampling at all, we can turn it off with `do_sample=False`, so that the most likely token is picked at each step.

In [None]:
generator("Hello, I'm a language model,", max_new_tokens=30, do_sample=False)

Or we can keep sampling on but restrict it to the 10 most likely tokens with `do_sample=True` and `top_k=10`:

In [14]:
generator("Hello, I'm a language model,", max_new_tokens=30, do_sample=True, top_k=10)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


[{'generated_text': "Hello, I'm a language model, so I'll probably do a bit more in the future to help you understand how you're going to understand the concept of the Language.\n\n\n"}]
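
To get a feel for what restricting sampling to the most likely tokens means, here is a minimal sketch of the idea using `random.choices`. It is an illustration only, not how the `transformers` library implements `top_k` internally; the function name `sample_top_k` and the four-token `toy_probs` distribution are made up for this example.

```python
import random

def sample_top_k(probs, k):
    """Sample an index from only the k highest-probability entries."""
    # Indices sorted from most to least likely, truncated to the top k
    top_indices = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    top_weights = [probs[i] for i in top_indices]
    # random.choices accepts unnormalised weights, so no explicit renormalisation is needed
    return random.choices(top_indices, weights=top_weights, k=1)[0]

toy_probs = [0.1, 0.15, 0.5, 0.25]  # hypothetical four-token distribution

# With k=2 only the two most likely indices (2 and 3) can ever be chosen
draws = [sample_top_k(toy_probs, k=2) for _ in range(1000)]
print(sorted(set(draws)))
```

Setting `k=1` reduces this to always picking the most likely token, while setting `k` to the vocabulary size gives ordinary weighted sampling.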