<a href="https://colab.research.google.com/github/Andrian0s/ML4NLP1-2023-Tutorial-Notebooks/blob/main/tutorial_notebooks/11_tutorial_notebook_text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to generate text from a language model

Generating Text from Language Models (Rycolab) Tutorial at ACL2023

Taken and adapted by Andrianos Michail

https://rycolab.io/classes/acl-2023-tutorial/
https://drive.google.com/file/d/1UHbGcjzBURG1n2DufC7iDTmGNjIz5Dp_/view

In [None]:
!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q torch



In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import locale
import transformers
import torch
import matplotlib.pyplot as plt
import numpy as np
import os
import random

device = "cuda" if torch.cuda.is_available() else "cpu"
locale.getpreferredencoding = lambda: "UTF-8"

model1 = "gpt2"
model2 = "gpt2-large"

tokenizer = GPT2Tokenizer.from_pretrained(model1)
model = GPT2LMHeadModel.from_pretrained(model1, pad_token_id=tokenizer.eos_token_id).to(device)


In [None]:
def get_next_word_probs(prefix):
    # Tokenize the input prefix string. The `encode` method converts the string into a sequence of tokens.
    # `return_tensors='pt'` specifies that the output should be PyTorch tensors.
    input_ids = tokenizer.encode(prefix, return_tensors='pt').to(device)

    # Disable gradient calculations. This is used during inference to reduce memory usage and improve speed.
    with torch.no_grad():
        # Pass the tokenized input through the model to get raw logits for the last token in the sequence.
        logits = model(input_ids).logits.squeeze()[-1]

    # Apply the softmax function to the logits to get probability distribution.
    # The softmax function converts logits into probabilities which sum to 1.
    probabilities = torch.nn.functional.softmax(logits, dim=0)

    # Return the calculated probabilities
    return probabilities

In [None]:
prefix = "My name"

In [None]:
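# Defined here so the cell runs top to bottom (the same prefix is reused in later cells).
prefix_with_context = "They drank a lot of water. As a result, they need to go to the"
probabilities = get_next_word_probs(prefix_with_context)
# torch.topk returns (values, indices): the top probabilities and their token IDs.
top_token_probs, top_token_ids = torch.topk(probabilities, 10)

for token, prob in zip(top_token_ids, top_token_probs):
  print("%.3f" % prob.item(), tokenizer.decode(token))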

0.474  bathroom
0.129  hospital
0.118  doctor
0.059  toilet
0.023  gym
0.015  restroom
0.015  clinic
0.010  emergency
0.007  doctors
0.007  dentist


In the context of language models like GPT, entropy measures how certain the model is about its next-token prediction. High entropy means the probability mass is spread over many tokens, so the model is uncertain; low entropy means the mass is concentrated on a few tokens, so the model is confident and generation becomes close to deterministic.
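
To make the definition concrete, here is a minimal sketch in plain Python (no model required) of the quantity that `Categorical(...).entropy()` computes: H = -sum(p_i * ln(p_i)), measured in nats. The helper name `entropy_nats` and the two distributions are made up for illustration; they are not model outputs.

```python
import math

def entropy_nats(probs):
    """Shannon entropy -sum(p * ln p) in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical next-token distributions over a 4-token vocabulary.
peaked = [0.97, 0.01, 0.01, 0.01]    # the model is nearly certain
uniform = [0.25, 0.25, 0.25, 0.25]   # the model has no preference

print(entropy_nats(peaked))    # small: most mass sits on one token
print(entropy_nats(uniform))   # the maximum for 4 outcomes, ln(4)
```

The uniform case gives the upper bound ln(V) for a vocabulary of size V, which is a useful yardstick when reading the entropy values reported later in the notebook.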

In [None]:
entropy = torch.distributions.Categorical(probs = probabilities).entropy()
entropy.item()

0.6693758964538574

In [None]:
# Calculate the entropy of the probability distribution.
# Entropy quantifies the amount of uncertainty or surprise associated with a probability distribution.
# Higher entropy means the distribution is more spread out (i.e., the model is less certain about its prediction).
entropy = torch.distributions.Categorical(probs=probabilities).entropy()

# Convert the entropy tensor to a Python float for easier interpretation.
entropy_value = entropy.item()

# Example usage
print(f"Entropy of the distribution: {entropy_value}")


In [None]:
prefix_no_context = 'They need to go to the'
prefix_with_context = "They drank a lot of water. As a result, they need to go to the"

In [None]:
from transformers import TopKLogitsWarper

# Define different contexts
prefix_no_context = 'They need to go to the'
prefix_with_context = "They drank a lot of water. As a result, they need to go to the"

# Instance of TopKLogitsWarper to consider top 100 probable tokens
topk_selecter = TopKLogitsWarper(100)

# Set random seed for reproducibility
torch.manual_seed(0)

# Choose the prefix to use
prefix = prefix_no_context

# Generate text for 30 iterations
for i in range(30):
    probabilities = get_next_word_probs(prefix)

    # Different strategies for word selection
    most_probable_token = torch.argmax(probabilities)
    sampled_token = torch.multinomial(probabilities, 1)
    topk_token_logits = topk_selecter(None, torch.log(probabilities))
    topk_sampled_token = torch.multinomial(torch.exp(topk_token_logits), 1)

    # Update the prefix with the new word and print the sequence
    prefix += tokenizer.decode(topk_sampled_token)
    print(prefix)

# Key points:
# 1. The loop generates text one token at a time with a pretrained model (GPT-2 here).
# 2. Three selection strategies are computed at each step: argmax (greedy), sampling from the full distribution, and top-k sampling.
# 3. Only the top-k sample (k=100) is appended to the prefix; the other two are computed for comparison.
# 4. Restricting sampling to the 100 most probable tokens keeps the output varied while excluding the low-probability tail.

They need to go to the top
They need to go to the top of
They need to go to the top of the
They need to go to the top of the country
They need to go to the top of the country to
They need to go to the top of the country to be
They need to go to the top of the country to be treated
They need to go to the top of the country to be treated from
They need to go to the top of the country to be treated from the
They need to go to the top of the country to be treated from the bottom
They need to go to the top of the country to be treated from the bottom up
They need to go to the top of the country to be treated from the bottom up rather
They need to go to the top of the country to be treated from the bottom up rather than
They need to go to the top of the country to be treated from the bottom up rather than that
They need to go to the top of the country to be treated from the bottom up rather than that from
They need to go to the top of the country to be treated from the bottom up rather than 

In [None]:
# Encode the initial prefix text into token IDs
input_ids = tokenizer.encode(prefix_no_context, return_tensors='pt').to(device)

# Generate text using the model
sample_output = model.generate(
    input_ids,
    do_sample=True,     # Enable random sampling of words based on probability distribution
    max_length=30,      # Set the maximum length of the generated sequence
    top_k=100           # Use Top-K sampling, considering only the top 100 tokens at each step
)

# Decode the generated token IDs back into text and print
print(tokenizer.decode(sample_output[0]))

They need to go to the public," said state Rep. Peter Jacoby (D), who introduced the bill. "It's something that needs to
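
The `top_k=100` argument used above restricts sampling to the 100 most probable tokens at each step. The following is a minimal sketch of that idea on a toy distribution, written in plain Python; it mirrors the concept rather than the library's implementation, and the function name `top_k_filter` and the probabilities are hypothetical.

```python
import random

def top_k_filter(probs, k):
    """Keep the k largest probabilities, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

# Hypothetical next-token distribution over a 4-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_k_filter(probs, k=2)
print(filtered)  # only the two most probable tokens keep non-zero mass

# Sampling from the filtered distribution can only return index 0 or 1.
rng = random.Random(0)
sampled_index = rng.choices(range(len(filtered)), weights=filtered)[0]
print(sampled_index)
```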


## Parts taken from the sampling notebook

In [None]:
def set_all_seeds(seed=317):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

set_all_seeds()

## Generation configs

For GPT-2, the default generation configuration is greedy decoding (taking the argmax at every step).

In [None]:
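from transformers import GenerationConfig
# Assumption: the cell creating `default_generation_config` is missing from this export; loading the checkpoint's config is one way to obtain it.
default_generation_config = GenerationConfig.from_pretrained(model1)
(
  default_generation_config.do_sample,
  default_generation_config.num_beams,
  default_generation_config.max_length
)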

(False, 1, 20)

This leads to highly repetitive generations:

In [None]:
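# Assumption: the cell defining the prompt is missing from this export. Every output
# below starts with 'I want to go to', so that string is used as the prompt here.
input_ids = tokenizer.encode('I want to go to', return_tensors='pt').to(device)

default_generation_config.max_new_tokens = 50

out = model.generate(
    input_ids,
    generation_config=default_generation_config,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=20  # takes precedence over the value of 50 set on the config above
)
tokenizer.decode(out[0])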

'I want to go to the next level. I want to be a better player. I want to be a better player.'
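
The repetition above is a general property of greedy decoding: once the argmax path returns to a state it has already visited, it repeats the same choices forever. A toy sketch with a hand-written bigram table (entirely hypothetical, not derived from GPT-2) shows the effect and contrasts it with sampling.

```python
import random

# Hypothetical bigram "language model": next-token probabilities given the current token.
BIGRAM = {
    "I": {"want": 0.6, "am": 0.4},
    "want": {"to": 0.7, "more": 0.3},
    "am": {"here": 0.6, "better": 0.4},
    "to": {"be": 0.5, "go": 0.3, "win": 0.2},
    "more": {".": 1.0},
    "be": {"better": 0.6, "here": 0.4},
    "go": {".": 1.0},
    "win": {".": 1.0},
    "better": {".": 1.0},
    "here": {".": 1.0},
    ".": {"I": 1.0},
}

def greedy_decode(start, steps):
    """Always append the most probable next token."""
    tokens = [start]
    for _ in range(steps):
        options = BIGRAM[tokens[-1]]
        tokens.append(max(options, key=options.get))
    return tokens

def sample_decode(start, steps, seed=0):
    """Append a next token drawn at random according to its probability."""
    rng = random.Random(seed)
    tokens = [start]
    for _ in range(steps):
        options = BIGRAM[tokens[-1]]
        tokens.append(rng.choices(list(options), weights=list(options.values()))[0])
    return tokens

print(" ".join(greedy_decode("I", 11)))  # I want to be better . I want to be better .
print(" ".join(sample_decode("I", 11)))  # varies with the seed
```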

Let's access the GenerationConfig object of our model directly instead, and see how we can use multinomial sampling as our generation algorithm

In [None]:
# Retrieve the current generation configuration settings for the model
model_generation_config = model.generation_config

# Set the maximum number of new tokens (words or parts of words) to generate to 50
model_generation_config.max_new_tokens = 50

# Set the minimum number of new tokens to generate to 30
model_generation_config.min_new_tokens = 30

# Enable probabilistic sampling for token generation for varied outputs
model_generation_config.do_sample = True

# Set the number of beams in beam search to 1, effectively disabling it
model_generation_config.num_beams = 1

# Renormalize the scores (log-softmax) after all logits processors have been applied
model_generation_config.renormalize_logits = True

# Return a structured output object (with .sequences and .scores) instead of only the token IDs
model_generation_config.return_dict_in_generate = True

# Also return the per-step scores, i.e. the processed logits, not plain probabilities
model_generation_config.output_scores = True

# GPT-2 has no padding token, so reuse the end-of-sequence (EOS) token ID for padding
model_generation_config.pad_token_id = tokenizer.eos_token_id

# Set top_k to the vocabulary size, which effectively disables top-k filtering (the library default is 50)
model_generation_config.top_k = tokenizer.vocab_size

### Helper functions

In [None]:
def token_entropy(scores, eps=1e-10):
    # Apply softmax to convert the scores into probabilities
    probs = torch.nn.functional.softmax(scores, dim=-1)
    # Compute the entropy for each token. Entropy is calculated as -sum(p * log(p))
    # 'eps' is added to prevent log(0) which is undefined
    # 'nansum()' ensures that NaNs (if any) are ignored in the summation
    return -(torch.log(probs + eps) * probs).nansum().item()

def avg_token_entropy(scores, eps=1e-10):
    # Average the per-step entropy over a generated sequence.
    # 'scores' is expected to be a sequence of tensors, one per generated token
    # (for example `out.scores`); 'eps' is forwarded to 'token_entropy'.
    return np.mean([token_entropy(score.squeeze(), eps) for score in scores])

### Changing scoring function through the GenerationConfig object

During generation, temperature rescales the model's logits before the softmax (each logit is divided by the temperature), which changes how peaked the next-token distribution is. Specifically:

Low Temperature (<1.0): The model becomes more confident in its choices, favoring words with higher predicted probabilities. This leads to more repetitive and predictable text, as it sticks closely to common patterns and frequent word pairings.

High Temperature (>1.0): The model's selection becomes more evenly distributed among the available choices. It's more likely to pick less probable words, leading to more diverse and creative text but with a higher chance of producing nonsensical or irrelevant content.
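
The effect described above can be checked by hand on a tiny example. The sketch below applies a temperature-scaled softmax to three made-up logits in plain Python; the function name `softmax_with_temperature` and the logit values are hypothetical and chosen only for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-token vocabulary.
logits = [2.0, 1.0, 0.0]

for t in (0.1, 1.0, 3.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At temperature 0.1 almost all mass lands on the highest-scoring token, while at 3.0 the three probabilities move much closer together, which matches the near-greedy and near-random generations shown in the following cells.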

In [None]:
model_generation_config.temperature = 0.1

out = model.generate(
    input_ids,
    generation_config=model_generation_config,
)
tokenizer.decode(out.sequences[0])

'I want to go to the next level. I want to be a better player. I want to be a better player. I want to be a better player. I want to be a better player. I want to be a better player. I want to be a better'

In [None]:
avg_token_entropy(out.scores)

0.04372068114789067

In [None]:
model_generation_config.temperature = 3.0

out = model.generate(
    input_ids,
    generation_config=model_generation_config,
)
tokenizer.decode(out.sequences[0])

'I want to go to connection glasses Cousins footpodcast depends occasional Cincinnati absorbing 777 DEA JUSTBug interview Frames Fiber VPNа\x1dsportsSport asliaite Founder olanglesRANT 2048 clansbuf burger Evans ® prominent apparent Useful Sponsor boobs CHRIST Pal Apprentice Kristnder Chespast Forums Lethalbrain 2500'

In [None]:
avg_token_entropy(out.scores)

10.554449100494384

In [None]:
# back to default temperature value
model_generation_config.temperature = 1.0

model_generation_config.top_k = 10

out = model.generate(
    input_ids,
    generation_config=model_generation_config,
)
tokenizer.decode(out.sequences[0])

"I want to go to a movie in a month and I'm going to go to a movie that will be a big hit for me.\n\nWhat are your goals when you start out as an actor?\n\nMy goal in the beginning was to be an actor that"

In [None]:
avg_token_entropy(out.scores)

1.4649289319803938

In [None]:
model_generation_config.top_k = tokenizer.vocab_size

model_generation_config.top_p = 0.8

out = model.generate(
    input_ids,
    generation_config=model_generation_config,
)
tokenizer.decode(out.sequences[0])

'I want to go to Damascus to try and fight the regime. But you don\'t really want to go there, do you? And you don\'t really want to stay there because you\'re afraid of the regime or are you afraid of the regime?\n\n"And then'

In [None]:
avg_token_entropy(out.scores)

2.0000393056869505
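
The `top_p = 0.8` setting used above is nucleus sampling: keep the smallest set of most probable tokens whose cumulative probability reaches the threshold, then renormalize. Below is a minimal plain-Python sketch of that idea; it is not the library's implementation, and the function name `top_p_filter` and the toy probabilities are hypothetical.

```python
def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept_set = set(kept)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept_set else 0.0 for i in range(len(probs))]

# Hypothetical next-token distribution over a 4-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]

nucleus = top_p_filter(probs, p=0.9)
print(nucleus)  # the least probable token is dropped, the other three are rescaled

narrow = top_p_filter(probs, p=0.4)
print(narrow)   # a single token already covers 0.4 of the mass
```

Unlike top-k, the number of surviving tokens adapts to the shape of the distribution: a peaked distribution keeps few tokens, a flat one keeps many.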

In [None]:
model_generation_config.top_p = 1.0
# model_generation_config.typical_p = 0.0
# model_generation_config.epsilon_cutoff = 0.0
# model_generation_config.eta_cutoff = 0.0

What else? Why settle for greedy search? Dive into Deep Learning has a chapter on beam search: https://colab.research.google.com/github/d2l-ai/d2l-pytorch-colab/blob/master/chapter_recurrent-modern/beam-search.ipynb

## Controlled Generation

In [None]:
from transformers import AutoTokenizer, OPTForCausalLM
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = OPTForCausalLM.from_pretrained("facebook/opt-1.3b", pad_token_id=tokenizer.eos_token_id).to(device)


In [None]:
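from transformers import LogitsProcessorList, MaxLengthCriteria, StoppingCriteriaList, TopPLogitsWarper

# Stop after 30 tokens in total and sample with nucleus (top-p) filtering at p = 0.92
stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=30)])
logits_warper = LogitsProcessorList(
    [
        TopPLogitsWarper(0.92),
    ]
)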

In [None]:
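input_ids = tokenizer.encode('two guys in the bar start a', return_tensors='pt').to(device)

# Note: `model.sample` is the API of the transformers version this notebook was written for;
# more recent releases deprecate calling it directly in favor of `model.generate(..., do_sample=True)`.
torch.manual_seed(3)
print("Output:\n" + 100 * '-')
for i in range(5):
  output = model.sample(
      input_ids,
      logits_warper=logits_warper,
      stopping_criteria=stopping_criteria,
  )
  print("{}: {}".format(i, tokenizer.decode(output[0], skip_special_tokens=True)))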

Output:
----------------------------------------------------------------------------------------------------
0: two guys in the bar start a fight. the second guy punches him in the face and walks out of the bar. The guy punches him in
1: two guys in the bar start a fight and then one of them gets punched.   after the fight the guy being punched has no idea who
2: two guys in the bar start a fight. 2 guys get beat up
Username checks out.
3: two guys in the bar start a fight, the other guy whips out a giant multitool and stomps on one guy's head for
4: two guys in the bar start a fight about who will get the last number.
While drunk - one tries to help and the other is really


# Prompting

In [None]:
nontoxic_prompt = 'The following text does not contain violent, rude, disrespectful or unreasonable language:\ntext: \"'

In [None]:
# MaxLengthCriteria counts tokens, so measure the prompt in tokens rather than whitespace-separated words.
nontoxic_prompt_len = len(tokenizer.encode(nontoxic_prompt))
max_len = nontoxic_prompt_len + 30
input_ids = tokenizer.encode(nontoxic_prompt+'two guys in the bar start a', return_tensors='pt').to(device)
stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=max_len)])

torch.manual_seed(3)
print("Output:\n" + 100 * '-')
for i in range(5):
  output = model.sample(
      input_ids,
      logits_warper=logits_warper,
      stopping_criteria=stopping_criteria,
  )
  out_all = tokenizer.decode(output[0], skip_special_tokens=True)
  print("{}: {}".format(i, out_all[out_all.find('two guys'):]))

Output:
----------------------------------------------------------------------------------------------------
0: two guys in the bar start a fight"

1) THE ORIGIN OF THE MOMENT

My
1: two guys in the bar start a fight"
http://www.turn1news.com/stevenberg/
2: two guys in the bar start a fight in front of me"
More than thirty people came out on Sunday evening after
3: two guys in the bar start a discussion...dick fag"
It really isn't all that disrespectful, but
4: two guys in the bar start a fight.."
Because this is not a true story, we have replaced the code


# Self-Debiasing

In [None]:
toxic_prompts = 'The following text contains violent, rude, disrespectful or unreasonable language:\ntext: \"'

In [None]:
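import torch.nn.functional as F
from transformers import LogitsProcessor

class ToxicityLogitsProcessor(LogitsProcessor):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # Score the same continuation again, this time prefixed with the toxic prompt.
        input_ids_toxic = tokenizer.encode(toxic_prompts, return_tensors='pt').to(input_ids.device)
        input_ids_toxic = torch.concat([input_ids_toxic, input_ids], dim=1)
        with torch.no_grad():
          toxic_scores = model(input_ids_toxic).logits[0, -1, :].reshape(1, -1)

        # Log-probabilities under the plain LM and under the toxic-prompted LM.
        plm = F.log_softmax(scores, dim=-1)
        plmt = F.log_softmax(toxic_scores, dim=-1)

        # delta < 0 marks tokens that the toxic prompt makes more likely.
        delta_toxic = plm - plmt
        delta_mask = ((delta_toxic) < 0).type(torch.int)

        # Penalize only those tokens, in proportion to how much the toxic prompt boosts them.
        return plm + 50.0*delta_mask*delta_toxic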

logits_processor = LogitsProcessorList(
    [
        ToxicityLogitsProcessor(),
    ]
)

In [None]:
input_ids = tokenizer.encode('two guys in the bar start a', return_tensors='pt').to("cuda")
stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=30)])


torch.manual_seed(3)
print("Output:\n" + 100 * '-')
for i in range(5):
  output = model.sample(
      input_ids,
      logits_warper=logits_warper,
      logits_processor=logits_processor,
      stopping_criteria=stopping_criteria,
  )
  print("{}: {}".format(i, tokenizer.decode(output[0], skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: two guys in the bar start a conversation about eminem.  "can i buy your services?"  "not for anything under $1000"
1: two guys in the bar start a fight and then one guy pulls a gun.    the other guy pulls a scott drive  
2: two guys in the bar start a business providing free haircuts to the local kids  they make so much money, that they get into some shit
3: two guys in the bar start a conversation about alchemy and one is obsessed with the fire flower.   one says, "Alchemy is
4: two guys in the bar start a table djs show - one player controls the music and the other tunes the music. so they each have a
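
The rescaling rule inside `ToxicityLogitsProcessor` can be traced by hand on a toy example. The sketch below reimplements the same arithmetic in plain Python on three made-up logits per "model"; the token names, the logit values, and the helper names `log_softmax` and `self_debias` are hypothetical, and only the formula (log-probability plus 50 times the negative part of the difference) follows the processor defined above.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def self_debias(lm_logits, toxic_logits, scale=50.0):
    """Lower the score of every token that the toxic-prompted model prefers more than the plain model."""
    plm = log_softmax(lm_logits)
    plmt = log_softmax(toxic_logits)
    adjusted = []
    for lp, lp_toxic in zip(plm, plmt):
        delta = lp - lp_toxic
        # delta < 0: the toxic prompt makes this token more likely, so penalize it.
        adjusted.append(lp + scale * delta if delta < 0 else lp)
    return adjusted

# Hypothetical vocabulary and logits.
vocab = ["conversation", "fight", "business"]
lm_logits = [1.0, 2.0, 0.5]      # plain prefix
toxic_logits = [0.5, 3.0, 0.0]   # same prefix behind the toxic prompt

plain = log_softmax(lm_logits)
debiased = self_debias(lm_logits, toxic_logits)
for token, before, after in zip(vocab, plain, debiased):
    print(f"{token:>12}: {before:8.3f} -> {after:8.3f}")
```

Only "fight", which the toxic prompt boosts, is pushed down; the other two tokens keep their original log-probabilities, so after renormalization they absorb the freed probability mass.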


# What can go wrong?

In [None]:
print(tokenizer.encode('start', return_tensors='pt'))
print(tokenizer.decode([4901]))
start_id = 13124

tensor([[    2, 13124]])
bar


In [None]:
import torch.nn.functional as F

# Next-token distribution after the plain prefix.
input_ids = tokenizer.encode('two guys in the bar', return_tensors='pt').to(device)
with torch.no_grad():
    scores = model(input_ids).logits[0, -1, :].reshape(1, -1)
scores = F.softmax(scores, dim=-1)
# Extract the probability of the token 'start' under the plain prefix.
# This is the token without a leading space, the same ID as `start_id = 13124` above.
start_id = tokenizer.convert_tokens_to_ids('start')
bar_normal = scores[0, start_id].item()

# Repeat the process with the toxic prompt prepended to the same prefix.
input_ids = tokenizer.encode(toxic_prompts + 'two guys in the bar', return_tensors='pt').to(device)
with torch.no_grad():
    toxic_scores = model(input_ids).logits[0, -1, :].reshape(1, -1)
toxic_scores = F.softmax(toxic_scores, dim=-1)
bar_toxic = toxic_scores[0, start_id].item()

print("Probability of generation 'start' for the prefix: 'two guys in the bar'")
print(f"LM: {bar_normal}")
print(f"Toxic LM: {bar_toxic}")
print(f"Delta is: {bar_normal - bar_toxic} > 0")

Probability of generation 'start' for the prefix: 'two guys in the bar'
LM: 6.407179171219468e-06
Toxic LM: 1.3567743373243957e-08
Delta is: 6.393611427846224e-06 > 0


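Overall, this output demonstrates how sensitive language models are to context. The probability of 'start' drops from about 6.4e-06 under the plain prefix to about 1.4e-08 once the toxic prompt is prepended, so the same token can look far more or less likely depending only on the preceding text. Because the difference is positive, the self-debiasing processor above would leave this token unpenalized.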