# Lab 4: Language Generation with Transformers

When predicting the next token, a GPT model gives us a score for every possible next token. Once those scores are converted to probabilities, we can use them to generate new text, either by selecting the most likely next token or by sampling according to the probabilities. Let's see how that works.

Let's say that we want to generate more text after the sequence below:

In [2]:
text = 'The quick brown fox jumped over'

We'll need to load the tokenizer and model for `distilgpt2`.

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

As before, we use the tokenizer to tokenize the text and then convert each token to its token ID. 

We will use the `.encode` function to get the token IDs back as a Python list, which is easy to manipulate: we'll want to append the extra token IDs that we generate!

### Tokenize text to ids

In [4]:
text

'The quick brown fox jumped over'

In [5]:
input_ids = tokenizer.encode(text)
input_ids

[464, 2068, 7586, 21831, 11687, 625]

We can use the `tokenizer.decode` function to turn the token IDs back into text. This will be useful after we've generated further token IDs to add on the end.

In [6]:
tokenizer.decode(input_ids)

'The quick brown fox jumped over'

Now let's run the token IDs through the `distilgpt2` model and get the probabilities of the next token.

In [7]:
import torch # We'll load PyTorch so we can convert a list to a tensor
from scipy.special import softmax

as_tensor = torch.tensor(input_ids).reshape(1,-1) # This converts the token ID list to a tensor

print(as_tensor)

output = model(input_ids=as_tensor) # We pass it into the model

# print(type(output))

next_token_scores = output.logits[0,-1,:].detach().numpy() # We get the next-token scores at the end of the sequence (token index=-1)

print(len(next_token_scores))
print("next token score:", next_token_scores)

next_token_probs = softmax(next_token_scores) # And we apply a softmax function

next_token_probs.shape

tensor([[  464,  2068,  7586, 21831, 11687,   625]])
50257
next token score: [-59.50909  -63.043785 -65.99477  ... -71.91912  -69.816216 -62.994907]


(50257,)

Now we've got the probabilities for **all 50257 possible tokens** that could follow our input text sequence, so `next_token_scores` and `next_token_probs` each have 50257 entries, one per token in the vocabulary.

In [8]:
len(tokenizer.vocab) 

50257
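To make the softmax step less of a black box, here is a minimal pure-Python sketch of what it computes. The function name `softmax_sketch` and the four scores are made up for illustration (they are not real model output); the notebook itself uses `scipy.special.softmax`.

```python
import math

def softmax_sketch(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the largest score keeps exp() numerically stable
    # and does not change the resulting probabilities.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a four-token vocabulary
toy_scores = [-59.5, -63.0, -66.0, -62.9]
toy_probs = softmax_sketch(toy_scores)

print(toy_probs)       # one probability per token
print(sum(toy_probs))  # approximately 1.0
```

The highest score always maps to the highest probability, so taking the `argmax` of the scores or of the probabilities picks the same token.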

Let's get the one with the highest probability. For that we can use the `argmax` function.

In [9]:
next_token_id = next_token_probs.argmax()
next_token_id

262

Hmm, **the token with ID=262 has the highest probability**. But what token is that? `tokenizer.decode` can tell us:

In [10]:
tokenizer.decode(next_token_id)

' the'

### Greedy decoding to generate a sentence

Now we have all the parts we need. Your task is to calculate the next eight tokens after `input_ids` (including the one we calculated above). You'll be adding `262` to the input token IDs, running it through the model again and deciding the next token. Try writing it as a loop that iterates eight times.

In [14]:
def get_next_token(input_ids):
    '''
    using greedy decoding strategy to generate next token
    '''
    as_tensor = torch.tensor(input_ids).reshape(1,-1)
    output = model(input_ids=as_tensor)
    next_token_scores = output.logits[0,-1,:].detach().numpy()
    next_token_probs = softmax(next_token_scores)
    return next_token_probs.argmax()

input_ids = tokenizer.encode(text)
for _ in range(8):
    next_token_id = get_next_token(input_ids)
    input_ids.append(next_token_id)

print(input_ids)

[464, 2068, 7586, 21831, 11687, 625, 262, 13990, 290, 4966, 625, 262, 13990, 13]


In [15]:
tokenizer.decode(input_ids)

'The quick brown fox jumped over the fence and ran over the fence.'

With eight extra tokens, you should get a list with IDs = `[464, 2068, 7586, 21831, 11687, 625, 262, 13990, 290, 4966, 625, 262, 13990, 13]` which decodes to give the text: "The quick brown fox jumped over the fence and ran over the fence.".
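The shape of the greedy loop can also be seen without the model. The sketch below swaps `distilgpt2` for a hypothetical lookup table (`toy_model`, with made-up probabilities) that only looks at the last token, whereas the real model conditions on the whole sequence. It is an illustration of the strategy, not a replacement for the code above.

```python
# Hypothetical stand-in for the model: for each token, the candidate
# next tokens and their made-up probabilities.
toy_model = {
    "over":  {"the": 0.6, "a": 0.4},
    "the":   {"fence": 0.5, "top": 0.3, "dog": 0.2},
    "fence": {"and": 0.6, ".": 0.4},
    "and":   {"ran": 0.8, "jumped": 0.2},
    "ran":   {"over": 0.9, ".": 0.1},
}

def greedy_next(tokens):
    """Pick the single most likely token after the last token."""
    candidates = toy_model[tokens[-1]]
    return max(candidates, key=candidates.get)

def greedy_generate(prompt_tokens, steps):
    """Append the most likely next token `steps` times."""
    tokens = list(prompt_tokens)  # copy so the prompt is not modified
    for _ in range(steps):
        tokens.append(greedy_next(tokens))
    return tokens

generated = greedy_generate(["jumped", "over"], steps=8)
print(" ".join(generated))
```

Because there is no randomness, running this twice gives the same tokens, and once the sequence revisits a token it follows the same path again.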

Now picking the token with highest probability every time can often create quite boring text. Sampling from the tokens can generate more interesting text. Sampling uses the probabilities as weights so that words with higher probabilities are more likely to be chosen. Let's see how that works:

### Randomly generate a sentence

Let's imagine we've got probabilities for four possible tokens (a very tiny vocabulary).

In [17]:
import numpy as np # We're using numpy to use its argmax function

next_token_probs = np.array([0.1, 0.2, 0.5, 0.2]) # Probabilities must sum to 1

next_token_probs

array([0.1, 0.2, 0.5, 0.2])

As we saw above, we can use `argmax`, which tells us the index of the highest value. In this case, it's index=2.

In [18]:
next_token_probs.argmax()

2

However, let's say we want to sample randomly from the possible token indices (`[0, 1, 2, 3]`). First, let's create that list to sample from:

In [19]:
indices = list(range(len(next_token_probs)))
indices

[0, 1, 2, 3]

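We could use the [choices](https://docs.python.org/3/library/random.html#random.choices) function to pick token IDs with all four being equally likely to be chosen. The `k` argument sets how many draws to make (with replacement), so `k=3` returns three IDs and `k=1` returns a single one.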

In [29]:
import random

random.choices(indices, k=3)

[3, 0, 2]

In [24]:
import random

next_token_id = random.choices(indices, k=1)[0]
next_token_id

2

Or we could provide weights, such that some of the tokens are more likely to be chosen than others. In this case, we provide `next_token_probs` as weights.

In [22]:
next_token_id = random.choices(indices, k=1, weights=next_token_probs)[0]
next_token_id

1

That would allow us to sample from the token probability distribution.
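One way to convince yourself that the weights behave like probabilities is to draw many samples and count them. This is a small sketch with a hypothetical four-token distribution (`toy_probs`); it uses its own seeded `random.Random` instance so the notebook's random state is left alone.

```python
import random
from collections import Counter

rng = random.Random(0)  # private, seeded generator so the sketch is repeatable

toy_probs = [0.1, 0.2, 0.5, 0.2]  # hypothetical distribution over four tokens
toy_indices = list(range(len(toy_probs)))

# Draw 10,000 token indices, weighted by their probabilities
draws = rng.choices(toy_indices, weights=toy_probs, k=10_000)
counts = Counter(draws)

for index in toy_indices:
    # The observed frequency should be close to the weight for that index
    print(index, toy_probs[index], counts[index] / len(draws))
```

The observed frequencies should land close to the weights, with index 2 chosen about half the time.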

Your task is to generate some new text (starting from "The quick brown fox jumped over" as before) using sampling and the `random.choices` function to pick your next token. Try it with weighting and without weighting to see what happens.

In [31]:
def get_next_token_probs(input_ids):
    '''
    return the probability of every token in the vocabulary being the next token
    '''
    as_tensor = torch.tensor(input_ids).reshape(1,-1)
    output = model(input_ids=as_tensor)
    next_token_scores = output.logits[0,-1,:].detach().numpy()
    next_token_probs = softmax(next_token_scores)
    return next_token_probs

input_ids = tokenizer.encode(text)
indices = list(range(len(tokenizer.vocab)))
for _ in range(8):
    next_token_probs = get_next_token_probs(input_ids)
    next_token_id = random.choices(indices, k=1, weights=next_token_probs)[0]
    input_ids.append(next_token_id)

print(input_ids)

[464, 2068, 7586, 21831, 11687, 625, 262, 1353, 290, 17788, 2241, 284, 262, 2323]


In [32]:
tokenizer.decode(input_ids)

'The quick brown fox jumped over the top and lowered himself to the ground'

In [33]:
input_ids = tokenizer.encode(text)
indices = list(range(len(tokenizer.vocab)))
for _ in range(8):
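    next_token_probs = get_next_token_probs(input_ids)  # computed as in the weighted loop, but not used here
    next_token_id = random.choices(indices, k=1)[0]  # no weights: every token ID is equally likely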
    input_ids.append(next_token_id)

print(input_ids)
tokenizer.decode(input_ids)

[464, 2068, 7586, 21831, 11687, 625, 31266, 38654, 42038, 16215, 6275, 7817, 14628, 23419]


'The quick brown fox jumped overwikipediaitateseson defining excellent taught Happy Recon'

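Try running your code again and you should get a different output due to the random nature of the sampling. Without weights, every one of the 50257 tokens is equally likely at each step, which is why the unweighted output is gibberish. There are many tweaks that can be made to the random sampling strategy.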
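One commonly used tweak is temperature, which is usually described as dividing the scores by a constant before applying softmax. The sketch below assumes that formulation; `softmax_with_temperature` and the scores are hypothetical and are not taken from the HuggingFace code.

```python
import math

def softmax_with_temperature(scores, temperature=1.0):
    """Softmax over scores that have been divided by `temperature`."""
    scaled = [s / temperature for s in scores]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_scores = [2.0, 1.0, 0.5, 0.1]  # hypothetical scores for four tokens

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_scores, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

A temperature below 1 concentrates probability on the top token (closer to greedy decoding), while a temperature above 1 flattens the distribution (closer to unweighted sampling).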

### HuggingFace `text-generation` pipeline

Fortunately, we don't have to implement all the different text generation functions ourselves. The HuggingFace library provides a `text-generation` pipeline to generate text.

For example, here is how to run it and request 30 extra tokens and 5 different generations.

In [34]:
from transformers import pipeline
generator = pipeline('text-generation', model="distilgpt2")
generator("Hello, I'm a language model,", max_new_tokens=30, num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, but as an entrepreneur, I believe in a robust framework within what I do. I have a full knowledge of functional programming, and this is just one"},
 {'generated_text': "Hello, I'm a language model, I use a very similar language. I can now see that it comes along nicely – when interacting with your language, when it is interacting with other languages"},
 {'generated_text': "Hello, I'm a language model,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHere's a version of this from the old wiki,\n\n\n"},
 {'generated_text': 'Hello, I\'m a language model, to which I am a member (or a "my") (my) language. For many other programming languages, that\'s how I\'m usually using'},
 {'generated_text': "Hello, I'm a language model, trying to find people who speak English. But I don't have that. What I want is to be respectful of how we teach people to learn English"}]

There are a lot of different options, including ones that control how sampling is done. To turn sampling off and fall back to greedy decoding, we can pass `do_sample=False`.

In [35]:
generator("Hello, I'm a language model,", max_new_tokens=30, do_sample=False)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, and I'm a programmer. I'm a programmer. I'm a programmer. I'm a programmer. I'm a programmer. I'm a programmer"}]

Or, to turn sampling on but only sample from the 10 most likely tokens, we can use `do_sample=True` and `top_k=10`.

In [36]:
generator("Hello, I'm a language model,", max_new_tokens=30, do_sample=True, top_k=10)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, and you can see why I want to do so:\n\n\n\nThe main point, the first thing to look to is how to use a"}]
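The idea behind `top_k` can be sketched without the library: keep only the k most likely tokens and sample among them in proportion to their probabilities. This is a minimal illustration with made-up probabilities; `sample_top_k` and `toy_probs` are hypothetical names, and this is not the pipeline's own implementation.

```python
import random

def sample_top_k(probs, k, rng):
    """Sample one index from the k most likely entries of `probs`."""
    # Indices sorted from most to least likely, truncated to the top k
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top_indices = ranked[:k]
    top_weights = [probs[i] for i in top_indices]
    # choices() treats weights as relative, so no renormalisation is needed
    return rng.choices(top_indices, weights=top_weights, k=1)[0]

rng = random.Random(0)  # private, seeded generator for repeatability
toy_probs = [0.05, 0.4, 0.3, 0.15, 0.1]  # hypothetical five-token distribution

samples = [sample_top_k(toy_probs, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only indices 1 and 2 can ever be drawn
```

With `k=1` this reduces to greedy decoding, and with `k` equal to the vocabulary size it is the plain weighted sampling used earlier.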

That's the end of this mini-lab.

## Optional Extra
- Read about the generation methods implemented by HuggingFace in [this post](https://huggingface.co/blog/how-to-generate) and try some other parameters such as `top_p` and `temperature`.
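
As a starting point for experimenting with `top_p`, here is a rough sketch of the idea usually called nucleus sampling: keep the smallest set of most likely tokens whose probabilities add up to at least `p`, then sample within that set. The function `sample_top_p` and the probabilities are hypothetical, and the library's implementation may differ in detail.

```python
import random

def sample_top_p(probs, p, rng):
    """Sample one index from the smallest top-ranked set with total probability >= p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus = []
    cumulative = 0.0
    for index in ranked:
        nucleus.append(index)
        cumulative += probs[index]
        if cumulative >= p:
            break  # the nucleus now covers at least p of the probability mass
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

rng = random.Random(0)  # private, seeded generator for repeatability
toy_probs = [0.05, 0.4, 0.3, 0.15, 0.1]  # hypothetical five-token distribution

samples = [sample_top_p(toy_probs, p=0.8, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only indices 1, 2 and 3 can ever be drawn
```

Unlike a fixed `top_k`, the number of candidate tokens here changes with the shape of the distribution: a confident prediction yields a small set, an uncertain one a larger set.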