# Text Generation

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# load 1.5bn parameter version of GPT-2 with a language modelling head
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "gpt2-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

In [None]:
# iteratively feed inputs and run the decoder for 8 timesteps
# at each step, pick out the model's logits for the last token in the prompt
# and wrap them in a softmax to get a probability distribution
# then pick the token with the highest probability, add it to the input sequence and run again

import pandas as pd
input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
iterations = []
n_steps = 8
choices_per_step = 5

with torch.no_grad():
    for _ in range(n_steps):
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids=input_ids)

        # select logits of the first batch and the last token and apply softmax
        next_token_logits = output.logits[0, -1, :]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)
        
        # store tokens with highest probabilities
        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            iteration[f"Choice {choice_idx+1}"] = token_choice
            
        # append predicted next token to input
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)
        
pd.DataFrame(iterations)

- Interestingly, GPT-2 appears to have internalised the knowledge that Transformers is a media franchise
- Also shows the iterative nature of text generation: output tokens must be decoded and fed back in, one at a time

In [None]:
# use built-in generate() function to explore more sophisticated decoding methods
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False)
print(tokenizer.decode(output[0]))

In [None]:
# prompt to reproduce famous unicorn story
max_length = 128
input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length=max_length,
                               do_sample=False)
print(tokenizer.decode(output_greedy[0]))

Results differ from OpenAI's famous example, revealing a major drawback of greedy search decoding: it tends to produce repetitive output sequences, which is undesirable in longer texts. Greedy search can also miss sequences with a higher overall probability, because it takes the short-term gain of the most probable token at each step.

Fortunately, we can do better with a popular method called *beam search decoding*.

## Beam Search Decoding

Beam search keeps track of the *top-b* most probable next tokens, where *b* is the number of beams or partial hypotheses. The next set of beams is chosen by considering all possible next-token extensions of the existing set and selecting the *b* most likely extensions. This is repeated until we reach a maximum length or an EOS token; finally, the most likely sequence is selected by ranking the *b* beams according to their log probabilities.
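
The procedure above can be sketched on a toy example. The transition table `TOY_MODEL` and the helper `beam_search` are hypothetical, made up purely for illustration: a real model conditions on the whole prefix rather than just the last token, and this sketch leaves out EOS handling.

```python
import math

# Hypothetical toy "model": next-token probabilities keyed by the previous token.
TOY_MODEL = {
    "<s>": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "and": 0.05, "runs": 0.05},
    "car": {"is": 0.5, "drives": 0.3, "turns": 0.2},
}


def beam_search(model, start, num_beams, n_steps):
    """Return the best (tokens, log_prob) pair after n_steps extensions."""
    beams = [([start], 0.0)]  # each beam is (tokens so far, summed log prob)
    for _ in range(n_steps):
        candidates = []
        # Consider every next-token extension of every current beam.
        for tokens, score in beams:
            for token, prob in model[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the num_beams most likely extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]


# num_beams=1 is equivalent to greedy search.
greedy_tokens, greedy_score = beam_search(TOY_MODEL, "<s>", num_beams=1, n_steps=2)
beam_tokens, beam_score = beam_search(TOY_MODEL, "<s>", num_beams=2, n_steps=2)

print(greedy_tokens, round(math.exp(greedy_score), 2))
print(beam_tokens, round(math.exp(beam_score), 2))
```

With one beam the search is greedy and settles on `nice woman` (probability 0.5 × 0.4 = 0.20). With two beams it finds `dog has` (0.4 × 0.9 = 0.36), a more likely sequence that greedy search missed.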

We work with log probabilities because the probability of a sequence is a product of conditional probabilities, each in [0, 1]. Over many tokens this product becomes so small that it can underflow (the number can no longer be represented precisely), which causes numerical instability.

Taking the logarithm turns the product into a sum of log probabilities.
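
The sum of log probabilities can be written out as follows. The notation is assumed here rather than taken from these notes: $x$ is the prompt, $y_1, \dots, y_N$ are the generated tokens, and $y_{<t}$ denotes the tokens generated before step $t$.

$$
\log P(y_1, \dots, y_N \mid x) = \log \prod_{t=1}^{N} P(y_t \mid y_{<t}, x) = \sum_{t=1}^{N} \log P(y_t \mid y_{<t}, x)
$$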

In [None]:
import numpy as np

sum([np.log(0.5)] * 1024)

This is a number we can easily work with. Since we only want to compare relative probabilities, we can do so directly with log probabilities.
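
To make the underflow concrete, here is a small sketch in plain Python. The list `token_probs` is a made-up stand-in for per-token probabilities; 1,100 tokens are used (an arbitrary choice) so that the direct product falls below the smallest number a 64-bit float can represent.

```python
import math

# Hypothetical per-token probabilities for a 1,100-token sequence.
token_probs = [0.5] * 1100

# Multiplying the probabilities directly underflows to exactly 0.0.
product = 1.0
for p in token_probs:
    product *= p

# Summing the log probabilities stays in a comfortable numeric range.
log_sum = sum(math.log(p) for p in token_probs)

print(product)            # 0.0
print(round(log_sum, 2))  # about -762.46
```

Once the product has collapsed to zero, two sequences can no longer be compared, whereas their log probabilities still can.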

The next step is to calculate and compare the log probabilities of the greedy and beam search outputs, to see if beam search can increase the overall probability. As Transformers models return the unnormalized logits for the next token given the input tokens, we first need to normalize the logits to create a probability distribution over the whole vocabulary for each token in the sequence, then select only the probabilities of the tokens actually present in the sequence.

In [None]:
import torch.nn.functional as F

def log_probs_from_logits(logits, labels):
    # log probability of each label token
    logp = F.log_softmax(logits, dim=-1)
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label 

def sequence_logprob(model, labels, input_len=0):
    # sum log probabilities for each token
    with torch.no_grad():
        output = model(labels)
        # as model predicts next token; no need to get logit for the first
        # label; and no need for last logit because we have no ground truth token
        # for it. Ignores log p of input sequence as these are not generated by 
        # model
        log_probs = log_probs_from_logits(
            output.logits[:, :-1, :], labels[:, 1:]
        )
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()


In [None]:
# greedy decoder output (default option)
logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))
print(tokenizer.decode(output_greedy[0]))
print(f"\nlog-prob: {logp:.2f}")

In [None]:
# compare with beam search; use generate() and specify num_beams param
# more beams means better potential results; but slower generation process
# due to more parallel computations
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5, do_sample=False)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")

Beam search gives a much better (higher) log probability than greedy decoding, though it also suffers from repetitive text. We can address this with an n-gram penalty via the `no_repeat_ngram_size` parameter, which tracks the n-grams that have been seen and sets the next-token probability to zero if it would produce a previously seen n-gram.
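
The idea behind the n-gram penalty can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: the helper `banned_next_tokens` and the toy log probabilities are hypothetical.

```python
import math


def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already in `tokens`."""
    if len(tokens) < ngram_size:
        return set()  # no complete n-gram has been generated yet
    prefix_len = ngram_size - 1
    # The last n-1 generated tokens form the start of the next n-gram.
    prefix = tuple(tokens[len(tokens) - prefix_len:])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated = ["the", "cat", "sat", "on", "the"]
banned = banned_next_tokens(generated, ngram_size=2)

# Hypothetical next-token log probabilities. A banned token gets probability
# zero, which is a log probability of -inf.
next_token_logp = {"cat": math.log(0.6), "mat": math.log(0.3), "dog": math.log(0.1)}
filtered = {
    tok: (-math.inf if tok in banned else lp)
    for tok, lp in next_token_logp.items()
}
best = max(filtered, key=filtered.get)

print(banned)  # {'cat'}
print(best)    # mat
```

Because the bigram `the cat` has already been generated, `cat` is ruled out after the second `the`, and the next most likely token is picked instead.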

In [None]:
output_beam = model.generate(
    input_ids, max_length=max_length, num_beams=5, do_sample=False, no_repeat_ngram_size=2)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")

The score is lower but there is less repetition, so we can balance `num_beams` and the n-gram penalty to find a trade-off between favouring high-probability tokens and reducing repetitions. The n-gram penalty is commonly used in applications such as summarisation or machine translation, where factual correctness is important. When factual correctness is less important than the diversity of the generated output, e.g. in story-telling or chatbots, an alternative way to reduce repetitions while improving diversity is to use sampling. We can examine some of the most common sampling methods.

## Sampling Methods

Simplest: randomly sample from the probability distribution of the model's outputs over the full vocabulary at each timestep. We can easily control the diversity of the output by adding a temperature parameter T that rescales the logits before taking the softmax. When T << 1, the distribution becomes sharply peaked on the most likely tokens and rare tokens are suppressed. On the flip side, when T >> 1, the distribution flattens and the tokens approach being equally likely.
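
The effect of T can be checked on a few made-up logits. This is a minimal sketch in plain Python: the helper `softmax_with_temperature` and the logit values are assumptions for illustration, and each logit is simply divided by T before the softmax.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits that have been divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical logits for a three-token vocabulary

sharp = softmax_with_temperature(logits, 0.5)  # T < 1: more peaked
base = softmax_with_temperature(logits, 1.0)   # T = 1: plain softmax
flat = softmax_with_temperature(logits, 2.0)   # T > 1: flatter

for name, probs in [("T=0.5", sharp), ("T=1.0", base), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in probs])
```

The top token holds roughly 87% of the probability mass at T = 0.5, about 67% at T = 1 and about 51% at T = 2.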

In [None]:
# see how temperature influences text generation
output_temp = model.generate(
    input_ids, max_length=max_length, do_sample=True, temperature=2.0, top_k=0)
print(tokenizer.decode(output_temp[0]))

The high temperature has produced mostly gibberish by accentuating rare tokens. The output even has strange grammar and made-up words!

In [None]:
# much cooler temperature
output_temp = model.generate(
    input_ids, max_length=max_length, do_sample=True, temperature=0.5, top_k=0
)
print(tokenizer.decode(output_temp[0]))

Temperature allows us to control the quality of the samples, but there is a trade-off between coherence (low temperature) and diversity (high temperature) that one can tune for the case at hand.

Another way to adjust this trade-off is to truncate the distribution over the vocabulary. This lets us adjust the diversity freely with temperature, but in a more limited range that excludes words that would be too strange in the context (low-probability words). The two main ways to do this are *top-k* and nucleus (top-p) sampling.

## Top-k and Nucleus Sampling

*Top-k* and nucleus (top-p) sampling are two popular alternatives or extensions to using temperature. The basic idea is to restrict the number of possible tokens we can sample from at each timestep.

From plotting the cumulative probability distribution, we see that there is a very high chance (96%) of picking one of the top 1,000 tokens: the cumulative probability rises quickly above 90% and then saturates close to 100%. To illustrate further, there is only a 1% chance of picking a token outside the top 2,000 (a 99% chance of picking one within the top 2,000).

However, if we sample hundreds of times, there is a significant chance of picking an unlikely token at some point. Picking such tokens when sampling can badly influence the quality of the generated text, so we usually want to avoid these unlikely tokens.
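
A quick back-of-the-envelope check of this claim, under the simplifying assumptions (made here for illustration only) that the chance of an unlikely token is the same 1% at every step and that the steps are independent:

```python
p_unlikely = 0.01  # assumed chance of picking an unlikely token at one step
n_tokens = 100     # assumed number of sampled tokens

# Probability of picking at least one unlikely token over the whole sequence.
p_at_least_one = 1 - (1 - p_unlikely) ** n_tokens

print(f"{p_at_least_one:.3f}")  # 0.634
```

So even a 1% chance per step becomes more likely than not over a hundred sampled tokens.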

The idea behind top-k sampling is to avoid low-probability choices by sampling only from the *k* tokens with the highest probability. This puts a fixed cut on the long tail of the distribution and ensures that we only sample from likely choices.
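
As a rough sketch of the mechanism in plain Python (the helper `top_k_filter` and the toy distribution are hypothetical, and this is not how the library implements it): keep the *k* most likely tokens, renormalise, and sample from what is left.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise their probabilities."""
    top = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in top)
    return {t: probs[t] / total for t in top}


# Hypothetical next-token distribution over a four-token vocabulary.
probs = {"the": 0.5, "a": 0.3, "unicorn": 0.15, "zebra": 0.05}

filtered = top_k_filter(probs, k=2)
print(filtered)  # only "the" and "a" remain, rescaled to sum to 1

# Sample the next token from the truncated distribution.
rng = random.Random(0)
next_token = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(next_token)
```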

In [None]:
# generate provides a very easy method to achieve this with top_k argument
output_topk = model.generate(input_ids, max_length=max_length, do_sample=True, top_k=50)
print(tokenizer.decode(output_topk[0]))

*k* is chosen manually and is independent of the actual distribution; we can find a good *k* value by looking at text quality metrics, though a fixed cutoff may not be very satisfactory.

An alternative is a *dynamic cutoff*: instead of fixing the number of tokens, we set a condition for when to cut off, namely when a certain probability mass in the selection has been reached, e.g. 95%. We arrange the tokens in descending order of probability and add one after another from the top of the list until the sum of their probabilities reaches 95%, which leaves a smaller set of tokens to sample from.
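
The dynamic cutoff can be sketched in the same spirit. The helper `top_p_filter` and both toy distributions are hypothetical, and the sketch only illustrates the idea: the number of tokens kept adapts to how peaked the distribution is.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalise so the kept probabilities sum to 1.
    return {token: p / cumulative for token, p in kept}


# Two hypothetical next-token distributions: one spread out, one peaked.
spread = {"the": 0.5, "a": 0.3, "unicorn": 0.15, "zebra": 0.05}
peaked = {"the": 0.95, "a": 0.03, "unicorn": 0.015, "zebra": 0.005}

kept_spread = top_p_filter(spread, top_p=0.9)
kept_peaked = top_p_filter(peaked, top_p=0.9)

print(list(kept_spread))  # ['the', 'a', 'unicorn']
print(list(kept_peaked))  # ['the']
```

With the same threshold, three tokens survive in the spread-out case but only one in the peaked case, which is exactly what a fixed *k* cannot do.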

In [None]:
# again; generate() function has an argument to activate top-p sampling
output_topp = model.generate(input_ids, max_length=max_length, do_sample=True, top_p=0.90)
print(tokenizer.decode(output_topp[0]))

This produced a more coherent story. We can also combine the two approaches: setting `top_k=50` and `top_p=0.9` chooses tokens with a probability mass of 90% from a pool of at most 50 tokens. We can also apply beam search when sampling: the candidate tokens used to build up the beams are then sampled instead of selected greedily.

### Selecting the "Best" Decoding Method

There is no universally best decoding method; it is very much a trade-off that depends on the nature of the task you are generating text for.

For a precise task such as arithmetic, one should lower the temperature or use more deterministic methods like greedy search in combination with beam search, to favour the most likely answer.

For longer and more creative texts, switch to sampling methods and increase the temperature, or use a mix of top-k and nucleus sampling.

## Conclusion

We looked at text generation, which is very different from the NLU tasks covered previously. Generating text requires at least one forward pass per generated token, and more if we use beam search. This makes text generation computationally demanding, and one needs the right infrastructure to run it at scale.

Additionally, a good decoding method, which transforms the model's output probabilities into discrete tokens, can make a big difference to text quality. Finding the best one requires some experimentation and a subjective evaluation of the generated texts.

Though this doesn't have to be subjective! We can select a performance metric that reflects the problem we want to solve, and luckily, there is a wide range of such choices.