<a href="https://colab.research.google.com/github/DanielHolzwart/basic-text-generation-with-GPT2/blob/main/Text_Generation_with_GPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

A pretrained text-generation model such as GPT-2 can produce text quickly through its `generate` function, which also exposes decoding options such as greedy search, top-k sampling, and top-p sampling. Here we implement our own custom functions for a few of these options.

Our test model is GPT-2. We start by loading it from the Hugging Face Hub.

In [1]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")



In [2]:
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

My girlfriend's dog is called Joey, and he will provide our example text: 'The dog Joey is jumping over the'.

In [3]:
text = 'The dog Joey is jumping over the'

Firstly, we are going to play around with the tokenization of this text.

In [4]:
text_input_ids = tokenizer(text).input_ids #extract input_ids from tokenized text
print(text_input_ids)

[464, 3290, 26154, 318, 14284, 625, 262]


In [5]:
text_tokens = tokenizer.convert_ids_to_tokens(text_input_ids) # shows how GPT-2 tokenizes the text; the 'Ġ' prefix marks a token that starts with a space
print(text_tokens)

['The', 'Ġdog', 'ĠJoey', 'Ġis', 'Ġjumping', 'Ġover', 'Ġthe']


In [6]:
text_id_to_text = tokenizer.decode(text_input_ids) #convert input_ids back to the text
print(text_id_to_text)

The dog Joey is jumping over the


In [7]:
text_tokens_to_text = tokenizer.convert_tokens_to_string(text_tokens) #in the same way we can convert tokens back to the text
print(text_tokens_to_text)

The dog Joey is jumping over the
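
The `Ġ` prefix in the token list is how GPT-2's byte-level tokenizer displays a leading space. As a rough sketch (not the tokenizer's actual implementation, which maps every byte through a lookup table), the round trip for this ASCII-only example can be mimicked by joining the tokens and turning `Ġ` back into a space. The helper name `tokens_to_text` is made up for this illustration.

```python
# Simplified illustration: only valid for plain ASCII text.
SPACE_MARKER = "Ġ"  # U+0120, the printable stand-in for the space byte


def tokens_to_text(tokens):
    """Join GPT-2 style tokens and turn the space marker back into ' '."""
    return "".join(tokens).replace(SPACE_MARKER, " ")


tokens = ['The', 'Ġdog', 'ĠJoey', 'Ġis', 'Ġjumping', 'Ġover', 'Ġthe']
print(tokens_to_text(tokens))
```

For real text, `tokenizer.convert_tokens_to_string` remains the right tool, since it also handles non-ASCII bytes.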


--------------------------------------------------------------------------------
# **1.1 Construct a simple generate function**

Our aim now is to replicate the following output of the built-in `generate` function:

In [8]:
tokenizer.pad_token_id = tokenizer.eos_token_id # set padding token id to end of sentence token id for open-end generation
text_pt = tokenizer(text, return_tensors = 'pt').to(device)
output = model.generate(**text_pt,max_new_tokens = 50, pad_token_id = tokenizer.eos_token_id )
print(tokenizer.decode(output.squeeze(0)))

The dog Joey is jumping over the fence and into the water. He is about to jump over the fence when he is hit by a car. Joey is taken to the hospital where he is treated for his injuries.

The dog Joey is jumping over the fence and into the water


We see that the model soon starts repeating itself. It lacks variety because it is only doing greedy search, that is, always picking the most probable next token. We could change that with other options such as beam search or sampling, which we will talk about later.

In [9]:
def generate_words(max_new_tokens, text=text):
    generated = text # avoid shadowing the built-in input()
    for _ in range(max_new_tokens + 1): # note: this appends max_new_tokens + 1 tokens
        output = model(**tokenizer(generated, return_tensors = 'pt').to(device)) # output.logits has shape (batch_size, sequence_length, vocab_size)
        logits = output.logits[0,-1,:] # scores for the token that follows the last position (for the initial text: after 'the')
        next_token_id = torch.argmax(logits) # greedy search: id of the highest-scoring token
        generated = generated + tokenizer.decode(next_token_id)
    return generated

In [10]:
print(generate_words(50))

The dog Joey is jumping over the fence and into the water. He is about to jump over the fence when he is hit by a car. Joey is taken to the hospital where he is treated for his injuries.


The dog Joey is jumping over the fence and into the water
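
Why greedy search tends to loop can be seen on a toy model. The sketch below uses a hypothetical hand-written table `NEXT_TOKEN_SCORES` (not GPT-2) that maps the last token to scores for the next one. Always taking the arg-max makes the continuation deterministic, so as soon as a token repeats, everything after it repeats too. GPT-2 conditions on the whole context rather than only the last token, so this is a simplification of the effect seen above.

```python
# Hypothetical bigram "model": last token -> scores for the next token.
NEXT_TOKEN_SCORES = {
    "the": {"fence": 2.0, "water": 1.5},
    "fence": {"and": 1.2, ".": 1.0},
    "and": {"into": 0.9, "the": 0.5},
    "into": {"the": 3.0, "a": 1.0},
    "water": {".": 2.0, "and": 0.1},
    ".": {"the": 1.0},
}


def greedy_decode(start, max_new_tokens):
    """Append the highest-scoring next token max_new_tokens times."""
    tokens = [start]
    for _ in range(max_new_tokens):
        scores = NEXT_TOKEN_SCORES[tokens[-1]]
        tokens.append(max(scores, key=scores.get))  # arg-max, no randomness
    return tokens


print(" ".join(greedy_decode("the", 8)))
# the fence and into the fence and into the
```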


We can also store the five tokens with the highest probability at each step in a dataframe. For this, we slightly adjust the function above. This will also be useful for the sampling methods.

In [11]:
import pandas as pd

In [12]:
output = model(**tokenizer(text, return_tensors = 'pt').to(device)) # output.logits has shape (batch_size, sequence_length, vocab_size)
logits = output.logits[0,-1,:] # scores for the token that follows the last input token (here: after 'the')

probs_sorted = torch.sort(logits, dim = - 1, descending = True)
probs_sorted_indices = probs_sorted.indices

In [13]:
tokenizer.decode(probs_sorted_indices[0])

' fence'

In [14]:
def probability_df(max_new_tokens, text=text, top_probs = 5):
    dt_frame = {}
    input = text
    for i in range(max_new_tokens + 1):
        output = model(**tokenizer(input, return_tensors = 'pt').to(device)) # output.logits has shape (batch_size, sequence_length, vocab_size)
        logits = output.logits[0,-1,:] # scores for the token that follows the last position (for the initial text: after 'the')

        probs_sorted = torch.sort(logits, dim = - 1, descending = True)
        probs_sorted_indices = probs_sorted.indices
        probs_sorted_values = probs_sorted.values
        probs_sorted_values = torch.softmax(probs_sorted_values, dim = -1)

        dt_frame[f'Text step {i+1}'] = [input]

        for j in range(top_probs):
            word = tokenizer.decode(probs_sorted_indices[j])
            dt_frame[f'Text step {i+1}'].append(f'{word} ({probs_sorted_values[j]:.2%})')

        input = input + tokenizer.decode(probs_sorted_indices[0])

    return pd.DataFrame(dt_frame).T

Called with max_new_tokens = 4, the function probability_df appends 5 tokens to the initial sentence (the loop runs max_new_tokens + 1 times) and outputs a dataframe with the 5 most probable next tokens at every step. Each percentage is the probability with which that token would be drawn if we sampled at random from the full distribution.

In [15]:
probability_df(4)

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Text step 1 | The dog Joey is jumping over the | fence (23.04%) | railing (8.62%) | edge (3.04%) | wall (2.96%) | bridge (2.17%) |
| Text step 2 | The dog Joey is jumping over the fence | and (13.09%) | , (11.11%) | . (10.19%) | to (8.67%) | is (4.08%) |
| Text step 3 | The dog Joey is jumping over the fence and | into (5.23%) | is (4.44%) | onto (3.25%) | running (3.18%) | the (2.04%) |
| Text step 4 | The dog Joey is jumping over the fence and into | the (62.16%) | a (16.68%) | my (2.36%) | his (2.01%) | an (1.45%) |
| Text step 5 | The dog Joey is jumping over the fence and int... | water (13.76%) | bushes (4.06%) | river (3.33%) | yard (2.22%) | street (1.78%) |
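
The percentages in such a table come from a softmax over the logits. As a minimal, framework-free sketch of that step, the code below converts made-up logits for a six-token vocabulary (not GPT-2 values) into probabilities and formats the most likely tokens in the same `token (xx.xx%)` style. The helper names `softmax` and `top_n` are only illustrative.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - shift) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def top_n(logits, n):
    """Return the n most probable tokens as 'token (xx.xx%)' strings."""
    probs = softmax(logits)
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return [f"{tok} ({p:.2%})" for tok, p in ranked[:n]]


# Made-up logits for a six-token vocabulary (not GPT-2 values).
toy_logits = {" fence": 3.0, " railing": 2.0, " edge": 1.0,
              " wall": 0.9, " bridge": 0.6, " moon": -2.0}
print(top_n(toy_logits, 5))
```

Because softmax is monotonic, sorting the logits first and applying softmax afterwards (as in probability_df) gives the same ranking and the same probabilities.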


# **1.2 Create a generate function with top_k sampling**

Next we tackle top-k sampling. All the text generated above used greedy search, which means that at each step we appended the token with the highest probability. In top-k sampling we instead sample randomly from the k most probable tokens (k = 50, say), and the output tends to be less repetitive and closer to human-written text.
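
The idea can be sketched without any model: keep the k highest-scoring tokens, renormalize their scores, and draw one of them at random. The logits below are made up, and `sample_top_k` is a hypothetical helper written with the standard library only; it is meant as an illustration of the technique, not as a replacement for the GPT-2 version built in this section.

```python
import math
import random


def sample_top_k(logits, k, rng):
    """Keep the k highest-scoring tokens, renormalize, and draw one."""
    top = sorted(logits.items(), key=lambda item: item[1], reverse=True)[:k]
    shift = top[0][1]  # largest logit, subtracted for numerical stability
    weights = [math.exp(score - shift) for _, score in top]
    tokens = [tok for tok, _ in top]
    # random.choices normalizes the weights internally.
    return rng.choices(tokens, weights=weights, k=1)[0]


toy_logits = {"fence": 3.0, "railing": 2.0, "edge": 1.0, "wall": 0.9, "moon": -2.0}
rng = random.Random(0)  # seeded only to make the sketch reproducible
draws = [sample_top_k(toy_logits, k=3, rng=rng) for _ in range(1000)]
print({tok: draws.count(tok) for tok in ("fence", "railing", "edge", "wall")})
```

With k = 3, "wall" and "moon" can never be drawn, while with k = 1 the procedure reduces to greedy search.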

As before, we first use the convenient `generate` function to produce the output we want to replicate.

In [16]:
torch.manual_seed(42)
output = model.generate(**text_pt,max_new_tokens = 20, pad_token_id = tokenizer.eos_token_id, top_k =3, do_sample = True)
print(tokenizer.decode(output.squeeze(0)))

The dog Joey is jumping over the edge of the wall and is trying to get to the bottom of it, but it's too late


Via the Categorical class from torch.distributions we can sample indices according to their relative probabilities.

In [17]:
from torch.distributions.categorical import Categorical
logits = torch.tensor([[.1, .1, .8]]) # passed as logits below, so a softmax is applied first; use probs= to sample with 10%/10%/80%
m = Categorical(logits=logits)
samples = m.sample((10,))
print(samples.numpy())

[[0]
 [0]
 [0]
 [0]
 [2]
 [1]
 [2]
 [2]
 [0]
 [0]]
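
Note that the values `[.1, .1, .8]` are passed as `logits`, which Categorical treats as unnormalized log-probabilities, so index 2 is not drawn 80% of the time. The short check below, written with the standard library only, compares the two readings of the same numbers.

```python
import math

values = [0.1, 0.1, 0.8]

# Interpreted as logits: softmax turns them into probabilities.
exps = [math.exp(v) for v in values]
as_logits = [e / sum(exps) for e in exps]

# Interpreted as probabilities: they are already normalized.
as_probs = [v / sum(values) for v in values]

print([round(p, 3) for p in as_logits])  # [0.249, 0.249, 0.502]
print([round(p, 3) for p in as_probs])   # [0.1, 0.1, 0.8]
```

Read as logits, the three indices have probabilities of roughly 25%, 25%, and 50%, which fits the many zeros in the sample above better than 10%/10%/80% would.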


In [18]:
def generate_text_topk(max_new_tokens, top_k, text=text):
    torch.manual_seed(42)
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    for _ in range(max_new_tokens + 1):
        output = model(input_ids) # output.logits has shape (batch_size, sequence_length, vocab_size)
        logits = output.logits[0,-1,:] # scores for the token that follows the last position (for the initial text: after 'the')

        top_k_logits, top_k_indices = torch.topk(logits, k=top_k) # we can also use the topk function instead of ordering and taking the top k values
        top_k_probabilities = torch.softmax(top_k_logits, dim = -1)

        m = Categorical(probs=top_k_probabilities).sample()
        next_token_id = top_k_indices[m].unsqueeze(0)

        input_ids = torch.cat([input_ids, next_token_id.unsqueeze(0)], dim=-1)

    generated_text = tokenizer.decode(input_ids[0], skip_special_tokens=True)
    return generated_text

In [19]:
generate_text_topk(20,3)

'The dog Joey is jumping over the fence and into the bushes.\n\n"I\'m not sure how he got there, but it looks'

We see that the two outputs are not the same. My first assumption is that sampling in the `generate` function works differently from sampling via Categorical in the custom function. Possibly both select the same top_k token ids but assign them different probabilities. Let us store additional data in a dataframe, similar to the probability_df function above.
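
Another possible explanation, stated here as an assumption and not verified against the `generate` source, is that the two code paths lay out the same distribution differently, for example over the full vocabulary versus over only the k kept tokens. The sketch below shows that, with a simple inverse-CDF draw, one and the same random number can then select different tokens. The probabilities, the token order in `vocab_layout`, and the helper `inverse_cdf_draw` are all illustrative; inverse-CDF sampling is used for clarity and is not claimed to be what either library call does internally.

```python
def inverse_cdf_draw(tokens, probs, u):
    """Return the first token whose cumulative probability exceeds u."""
    cumulative = 0.0
    for token, p in zip(tokens, probs):
        cumulative += p
        if u < cumulative:
            return token
    return tokens[-1]  # guard against rounding when u is close to 1


u = 0.7  # stands in for the one uniform number a fixed seed would produce

# Layout A: only the top-k tokens, sorted by probability.
sorted_layout = inverse_cdf_draw(["fence", "railing", "edge"],
                                 [0.6640, 0.2484, 0.0876], u)

# Layout B: the same distribution in a hypothetical vocabulary order.
vocab_layout = inverse_cdf_draw(["edge", "fence", "railing"],
                                [0.0876, 0.6640, 0.2484], u)

print(sorted_layout, vocab_layout)  # railing fence
```

If something like this is at play, identical seeds would not be enough to reproduce the exact tokens, even though both approaches sample from the same distribution.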

In [20]:
def df_topk(max_new_tokens, top_k, text=text):
    dt_frame = {}
    torch.manual_seed(42)
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    for i in range(max_new_tokens + 1):
        output = model(input_ids) # output.logits has shape (batch_size, sequence_length, vocab_size)
        logits = output.logits[0,-1,:] # scores for the token that follows the last position (for the initial text: after 'the')

        top_k_logits, top_k_indices = torch.topk(logits, k=top_k) # we can also use the topk function instead of ordering and taking the top k values
        top_k_probabilities = torch.softmax(top_k_logits, dim = -1)

        dt_frame[f'Text step {i+1}'] = [tokenizer.decode(input_ids[0], skip_special_tokens=True)]

        for j in range(top_k):
            word = tokenizer.decode(top_k_indices[j])
            dt_frame[f'Text step {i+1}'].append(f'{word} ({top_k_probabilities[j]:.2%})')

        m = Categorical(logits=top_k_logits).sample()
        next_token_id = top_k_indices[m].unsqueeze(0)

        input_ids = torch.cat([input_ids, next_token_id.unsqueeze(0)], dim=-1)

    return pd.DataFrame(dt_frame).T

In [21]:
print(df_topk(5,3)[0].iloc[5]) #output of the full sentence
df_topk(5,3) #output probability dataframe

The dog Joey is jumping over the fence and into the bushes


| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| Text step 1 | The dog Joey is jumping over the | fence (66.40%) | railing (24.84%) | edge (8.76%) |
| Text step 2 | The dog Joey is jumping over the fence | and (38.06%) | , (32.31%) | . (29.64%) |
| Text step 3 | The dog Joey is jumping over the fence and | into (40.50%) | is (34.36%) | onto (25.14%) |
| Text step 4 | The dog Joey is jumping over the fence and into | the (76.55%) | a (20.54%) | my (2.91%) |
| Text step 5 | The dog Joey is jumping over the fence and int... | water (65.04%) | bushes (19.21%) | river (15.76%) |
| Text step 6 | The dog Joey is jumping over the fence and int... | . (66.35%) | , (27.01%) | to (6.64%) |


From the table we can see that in the 5th row our algorithm did not pick the most probable token but the second most probable one, 'bushes'. More importantly, our algorithm picks 'fence' in the first step, whereas the `generate` function picks 'edge'.
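
The percentages in this table differ from those in the greedy-search table because a softmax over only the top-k logits is the same as renormalizing the full-vocabulary probabilities over the kept tokens. A quick arithmetic check for the first step, using the three first-step percentages reported in section 1.1:

```python
# Full-vocabulary probabilities of the three most likely first tokens,
# copied from the greedy-search table (in percent).
full_vocab = {"fence": 23.04, "railing": 8.62, "edge": 3.04}

kept_mass = sum(full_vocab.values())  # probability mass that survives top_k=3
renormalized = {tok: p / kept_mass for tok, p in full_vocab.items()}

for tok, p in renormalized.items():
    print(f"{tok}: {p:.2%}")
# fence: 66.40%
# railing: 24.84%
# edge: 8.76%
```

The results match the first row of the top-k table, so the custom function does restrict sampling to the same three tokens with the expected weights.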

In [22]:
torch.manual_seed(42)
print(tokenizer.decode(model.generate(**text_pt,max_new_tokens = 1, top_k=3, do_sample = True, pad_token_id = tokenizer.eos_token_id)[0]))
print(tokenizer.decode(model.generate(**text_pt,max_new_tokens = 1, do_sample = True, pad_token_id = tokenizer.eos_token_id)[0]))

The dog Joey is jumping over the edge
The dog Joey is jumping over the river


We will leave it like this for the moment, but keep it in mind for the following tasks.

# **1.3 Create a generate function with top_p sampling**

Top-p sampling, also known as nucleus sampling, is another sampling approach for generating better text. Here we do not fix the number of tokens to sample from, but rather the cumulative probability they must cover. For example, with top_p = 0.95 we sort the tokens by probability and, starting with the most probable one, keep adding tokens until their cumulative probability reaches 95%; we then sample from this set. Hence the pool of tokens to choose from differs between time steps, and only the probability threshold remains constant.
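
To make the varying pool size concrete, here is a small standard-library sketch of the selection step. The two distributions are made up (not GPT-2 values), and `nucleus` and `sample_top_p` are hypothetical helper names; the rule used is the smallest set of most probable tokens whose cumulative probability reaches top_p.

```python
import random


def nucleus(probs, top_p):
    """Smallest set of most probable tokens whose total mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    return [(token, p / mass) for token, p in kept]  # renormalized


def sample_top_p(probs, top_p, rng):
    """Draw one token from the renormalized nucleus."""
    tokens, weights = zip(*nucleus(probs, top_p))
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up distributions: one peaked, one flat (not GPT-2 values).
peaked = {"the": 0.80, "a": 0.15, "my": 0.03, "his": 0.02}
flat = {"fence": 0.30, "wall": 0.25, "edge": 0.25, "river": 0.20}

print(len(nucleus(peaked, 0.9)), len(nucleus(flat, 0.9)))  # 2 4
print(sample_top_p(flat, 0.9, random.Random(0)))
```

With the same threshold, the peaked distribution keeps only two tokens while the flat one keeps all four, which is exactly the adaptive behaviour that distinguishes top-p from top-k.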

In [23]:
torch.manual_seed(42)
tokenizer.batch_decode(model.generate(**text_pt,max_new_tokens = 20, pad_token_id = tokenizer.eos_token_id, top_p =.9, do_sample = True))

['The dog Joey is jumping over the edge to take the picture as his mother and father look on in horror at his situation. The footage']

In [24]:
def generate_text_topp(max_new_tokens, top_p, text = text):
    text_ids = tokenizer(text, return_tensors = 'pt').input_ids.to(device)
    for i in range(max_new_tokens):
        torch.manual_seed(42) # note: reseeding inside the loop restarts the random stream at every step; seed once before the loop for independent draws
        output = model(text_ids).logits[0,-1,:] # logits for the next token, shape (vocab_size,)
        probs = torch.softmax(output, -1)
        probs_sorted_values, probs_sorted_indices = torch.sort(probs, dim = - 1, descending = True)
        j = 0
        while torch.sum(probs_sorted_values[:j], dim = -1) < top_p: # grow the pool until its cumulative probability reaches top_p
            j += 1
        probs_selected = probs_sorted_values[:j] /torch.sum(probs_sorted_values[:j]) # renormalize the selected probabilities

        m = Categorical(probs = probs_selected) # sample from the allowed tokens
        sampled_id = probs_sorted_indices[m.sample()]
        text_ids = torch.cat([text_ids[0],sampled_id.unsqueeze(0)]).unsqueeze(0).to(device)
    return text_ids # token ids of shape (1, sequence_length); decode them to get text

In [25]:
print(tokenizer.batch_decode(generate_text_topp(20,0.9)))

['The dog Joey is jumping over the carpet, always challenging the gardener to calm down.\n\n"He\'s going to end up']


It looks like the `generate` function outputs a different sentence than our solution. Similarly to top_k, I assume that sampling in the `generate` function simply works differently from our approach.