# Goal: Generating one token at a time

In this notebook we will see how an LLM generates text: one token at a time, using the previous tokens to predict the next one.


## Step 1. Load a tokenizer and a model

First we load a tokenizer and a model from Hugging Face's `transformers` library. A tokenizer splits a string into tokens and maps each token to an integer ID, producing the list of numbers that the model takes as input.


In [42]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# To load a pretrained model and a tokenizer using HuggingFace, we only need two lines of code!
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# We create a partial sentence and tokenize it.
text = "Levain is the best bakery for chocolate"
inputs = tokenizer(text, return_tensors="pt")

# Show the tokens as numbers, i.e. "input_ids"
inputs["input_ids"]

tensor([[32163,   391,   318,   262,  1266, 38164,   329, 11311]])

### Examine the tokenization

Let's explore what these tokens mean!

In [43]:
# Show how the sentence is tokenized
import pandas as pd


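def show_tokenization(inputs):
    """Return a DataFrame pairing each token ID with its decoded text."""
    # `token_id` avoids shadowing the built-in `id`.
    return pd.DataFrame(
        [(token_id, tokenizer.decode(token_id)) for token_id in inputs["input_ids"][0]],
        columns=["ID", "Token"],
    )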


show_tokenization(inputs)

|   | ID | Token |
|---|---|---|
| 0 | tensor(32163) | Lev |
| 1 | tensor(391) | ain |
| 2 | tensor(318) | is |
| 3 | tensor(262) | the |
| 4 | tensor(1266) | best |
| 5 | tensor(38164) | bakery |
| 6 | tensor(329) | for |
| 7 | tensor(11311) | chocolate |


### Subword tokenization

Notice that the tokens here are neither single letters nor always whole words. Common words such as "is" and "the" each get a single token, while a rarer word such as "Levain" is split into pieces ("Lev" and "ain"). A token can even be a single character. This is called subword tokenization.
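To make the idea concrete, here is a minimal sketch of subword splitting in plain Python. It is not how GPT-2 works internally: GPT-2 learns its vocabulary with byte-pair encoding, while this sketch simply matches the longest entry in a tiny made-up vocabulary. `TOY_VOCAB`, `toy_tokenize`, and `toy_decode` are hypothetical names introduced for illustration only.

```python
# Hypothetical toy vocabulary. The real GPT-2 vocabulary is learned with
# byte-pair encoding and is far larger; this sketch does not implement that.
TOY_VOCAB = {"Lev": 0, "ain": 1, " is": 2, " the": 3, " best": 4, " bakery": 5}


def toy_tokenize(text, vocab):
    """Greedily match the longest vocabulary entry at each position."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"No token covers the text at position {pos}: {text[pos:]!r}")
    return ids


def toy_decode(ids, vocab):
    """Map IDs back to their strings and join them."""
    id_to_piece = {i: piece for piece, i in vocab.items()}
    return "".join(id_to_piece[i] for i in ids)


ids = toy_tokenize("Levain is the best bakery", TOY_VOCAB)
print(ids)
print(toy_decode(ids, TOY_VOCAB))
```

In this toy vocabulary the leading space is part of the token (" is", " the"), so decoding is a plain join with no extra separator.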

## Step 2. Calculate the probability of the next token

Now let's use PyTorch to calculate the probability of the next token given the previous ones.

In [44]:
# Calculate the probabilities for the next token for all possible choices. We show the
# top 5 choices and the corresponding words or subwords for these tokens.

import torch

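with torch.no_grad():
    # Logits have shape (batch, sequence, vocabulary). We keep only the last
    # position, which scores every candidate for the next token.
    logits = model(**inputs).logits[:, -1, :]
    probabilities = torch.nn.functional.softmax(logits[0], dim=-1)


def show_next_token_choices(probabilities, top_n=5):
    """Return the top_n most probable next tokens as a DataFrame."""
    return pd.DataFrame(
        [
            (token_id, tokenizer.decode(token_id), p.item())
            for token_id, p in enumerate(probabilities)
        ],
        columns=["ID", "Token", "Probability"],
    ).sort_values("Probability", ascending=False).head(top_n)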


show_next_token_choices(probabilities)

|   | ID | Token | Probability |
|---|---|---|---|
| 20175 | 20175 | lovers | 0.131356 |
| 13 | 13 | . | 0.111288 |
| 290 | 290 | and | 0.091183 |
| 11 | 11 | , | 0.074657 |
| 11594 | 11594 | chip | 0.052126 |


The model's most likely next token is " lovers" (probability about 0.13), followed closely by "." (about 0.11).
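The probabilities in the table above come from applying softmax to the model's raw scores (logits). As a minimal sketch of that step in plain Python, the example below uses made-up logits for a hypothetical four-token vocabulary. The values of `toy_tokens` and `toy_logits` are illustrative assumptions, not model output.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a hypothetical four-token vocabulary.
toy_tokens = [" lovers", ".", " and", " chip"]
toy_logits = [2.0, 1.8, 1.6, 1.0]

toy_probs = softmax(toy_logits)

# Rank tokens by probability, highest first, as the table above does.
ranked = sorted(zip(toy_tokens, toy_probs), key=lambda pair: pair[1], reverse=True)
for token, p in ranked:
    print(f"{token!r}: {p:.3f}")

# Greedy choice: the index of the largest probability, the same quantity
# that torch.argmax returns for a probability tensor.
best_index = max(range(len(toy_probs)), key=toy_probs.__getitem__)
print(toy_tokens[best_index])
```

Softmax preserves the order of the logits, so the token with the highest logit is also the token with the highest probability.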

In [45]:
# Obtain the token id for the most probable next token
next_token_id = torch.argmax(probabilities).item()

print(f"Next Token ID: {next_token_id}")
print(f"Next Token: {tokenizer.decode(next_token_id)}")

Next Token ID: 20175
Next Token:  lovers


In [46]:
# We append the most likely token to the text, using the ID computed above
# rather than a hard-coded value.
text = text + tokenizer.decode(next_token_id)
text

'Levain is the best bakery for chocolate lovers'

## Step 3. Generate some more tokens

The following cell takes `text`, shows the most probable next tokens, and appends the most likely one to `text`. Run the cell repeatedly to watch the sentence grow one token at a time!

In [47]:


from IPython.display import Markdown, display

# Show the text
print(text)

# Convert to tokens
inputs = tokenizer(text, return_tensors="pt")

# Calculate the probabilities for the next token and show the top 5 choices
with torch.no_grad():
    logits = model(**inputs).logits[:, -1, :]
    probabilities = torch.nn.functional.softmax(logits[0], dim=-1)

display(Markdown("**Next token probabilities:**"))
display(show_next_token_choices(probabilities))

# Choose the most likely token id and add it to the text
next_token_id = torch.argmax(probabilities).item()
text = text + tokenizer.decode(next_token_id)

Levain is the best bakery for chocolate lovers


**Next token probabilities:**

|   | ID | Token | Probability |
|---|---|---|---|
| 13 | 13 | . | 0.318858 |
| 287 | 287 | in | 0.146327 |
| 11 | 11 | , | 0.131115 |
| 290 | 290 | and | 0.070244 |
| 8347 | 8347 | everywhere | 0.032601 |
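Re-running the cell by hand amounts to a loop: compute next-token probabilities, pick the most likely token, append it, and repeat. The sketch below shows that loop in plain Python under stated assumptions. `TOY_MODEL`, `toy_next_token_probs`, `greedy_generate`, and the `"<eos>"` marker are hypothetical stand-ins, and the toy "model" looks only at the last token, whereas a real LLM conditions on all previous tokens.

```python
# Hypothetical stand-in for the model: maps the last token to next-token
# probabilities. The numbers are made up for illustration.
TOY_MODEL = {
    "chocolate": {" lovers": 0.6, ".": 0.3, " and": 0.1},
    " lovers": {".": 0.5, " in": 0.3, ",": 0.2},
    ".": {"<eos>": 0.9, " and": 0.1},
}


def toy_next_token_probs(tokens):
    """Return next-token probabilities given the tokens so far (toy: last token only)."""
    return TOY_MODEL.get(tokens[-1], {"<eos>": 1.0})


def greedy_generate(tokens, next_token_probs, max_length, eos="<eos>"):
    """Append the most probable token until max_length tokens or eos is reached."""
    tokens = list(tokens)  # copy so the caller's list is left untouched
    while len(tokens) < max_length:
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)  # greedy: take the highest probability
        if best == eos:
            break
        tokens.append(best)
    return tokens


result = greedy_generate(["chocolate"], toy_next_token_probs, max_length=10)
print("".join(result))
```

In this sketch `max_length` counts the starting tokens as well as the generated ones, so a longer prompt leaves less room for new tokens.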


## Step 4. Use the `generate` method

In [48]:
from IPython.display import Markdown, display

# Start with some text and tokenize it
text = "Once upon a time, chocolate makers"
inputs = tokenizer(text, return_tensors="pt")

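# Use the `generate` method to extend the prompt. `max_length` caps the total
# length, prompt tokens included. GPT-2 has no pad token, so we reuse the
# end-of-sequence token for padding.
output = model.generate(**inputs, max_length=100, pad_token_id=tokenizer.eos_token_id)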

# Show the generated text
display(Markdown(tokenizer.decode(output[0])))

Once upon a time, chocolate makers would have been able to make chocolate with only a few ingredients. But now, with the advent of the Internet, it's possible to make chocolate with only a few ingredients.

The first step is to make a chocolate bar. The bar is made from a mixture of chocolate, sugar, and water. The water is added to the chocolate and then added to the chocolate. The chocolate is then mixed with the sugar and the water. The chocolate is then mixed