# Natural Language Generation with Transformer-based Language Models

# Summary

Natural Language Generation (NLG) is one of the key areas of Natural Language Processing, with applications such as dialogue generation, question answering, machine translation and summarisation. The current state of the art in language generation predominantly relies on pre-trained Transformer-based language models. Besides the Transformer architecture and unsupervised pre-training, the decoding technique used at generation time also has a large impact on the quality of the output.

# Quick Reminder

Language modelling concerns predicting the next word given the words generated so far.

$P(y_t | y_1, \ldots ,y_{t-1})$

To compute the probability of the sequence of words we use the chain rule of probability:


$P(y_1, y_2, \ldots ,y_n) = \prod_{t=1}^{n} P(y_t | y_1, \ldots ,y_{t-1})$
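
As a minimal sketch of the chain rule above, the snippet below multiplies a few conditional probabilities and compares the result with the sum of their logarithms. The probabilities in `conditional_probs` are made up for illustration and do not come from any model.

```python
import math

# Hypothetical conditional probabilities P(y_t | y_1, ..., y_{t-1}) for a
# three-word sequence; the numbers are invented for illustration.
conditional_probs = [0.2, 0.5, 0.4]

# Chain rule: the sequence probability is the product of the conditionals.
sequence_prob = math.prod(conditional_probs)

# In practice log-probabilities are summed instead, which avoids numerical
# underflow for long sequences.
sequence_logprob = sum(math.log(p) for p in conditional_probs)

print(sequence_prob)               # product of the conditionals
print(math.exp(sequence_logprob))  # same value, recovered from the log domain
```

Working in the log domain is also what the decoding functions later in this notebook do when they accumulate log-probabilities.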


**Transformer Model**

Transformers are based on the attention mechanism and are capable of handling long range dependencies between words in sentences.

See https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0 for more details.


**What is the difference between word embeddings and language modelling?**

*The main difference is that word embeddings do not take word order into account, whereas language models do. Word order is very important: if you ignore it, the following two sentences get the same representation: "It was really not good, on the contrary quite bad." and "It was really not bad, on the contrary quite good." However, the meaning of those two sentences is very different.*

**How is the output of the network computed?**

Typically the output layer produces a vector with the same dimensionality as the vocabulary size of our language model. These values (the logits) are passed to the softmax function, which normalises them into a probability distribution over the vocabulary:

$y_i = g(z)_i = \frac{e^{z_i}}{\sum^k_{j=1}e^{z_j}}$

where $k$ is the vocabulary size.

Suppose we have a vocabulary of three words and the output vector $\mathsf{z}= [2.0, \; 1.0, \; 0.1]$. Applying softmax gives:

$y = [\frac{e^{2.0}}{e^{2.0} + e^{1.0} + e^{0.1}}, \; \frac{e^{1.0}}{e^{2.0} + e^{1.0} + e^{0.1}}, \; \frac{e^{0.1}}{e^{2.0} + e^{1.0} + e^{0.1}}] = [0.66,\; 0.24, \; 0.1]$
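
The worked example can be checked with a few lines of plain Python. This is only a sketch: the helper name `softmax` is ours, and subtracting the maximum logit is a common numerical-stability trick rather than part of the formula above.

```python
import math

def softmax(z):
    """Normalise a list of logits into a probability distribution."""
    # Subtracting the maximum does not change the result but avoids overflow.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

y = softmax([2.0, 1.0, 0.1])
print([round(p, 2) for p in y])  # [0.66, 0.24, 0.1]
```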

**What is the training loss for LM tasks?**

Language models are typically trained to minimise cross-entropy. The entropy of a distribution P measures how many bits you need on average to encode data from P with the code that is optimal for P:

 $H(P)\ = - \sum_ iP(y_i)\log P(y_i)$

Encoding one of N equally likely outcomes takes $\log_2 N$ bits, which is why the logarithm is taken to base 2 when entropy is measured in bits.

The cross-entropy of a distribution P with respect to a distribution Q measures how many bits you need on average to encode data from P with the code that is optimal for Q:

$H(P,Q)\ = \ - \sum_ iP(y_i)\log Q(y_i)$

This is useful to measure how well the predicted distribution Q models P.
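
To make the two definitions concrete, here is a small sketch that evaluates both quantities in bits. The distributions `p` and `q` are hypothetical, chosen so that the logarithms come out as whole numbers.

```python
import math

def entropy(p):
    """H(P) in bits; terms with P(y_i) = 0 contribute nothing."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in bits; assumes Q(y_i) > 0 wherever P(y_i) > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # hypothetical "true" distribution P
q = [0.25, 0.25, 0.5]   # hypothetical model distribution Q

print(entropy(p))           # 1.5
print(cross_entropy(p, q))  # 1.75
```

The cross-entropy is never smaller than the entropy, and the two are equal only when Q matches P, which is why it works as a training objective.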

For prediction problems such as predicting words in language modelling, the true distribution P usually assigns probability 1 to the correct word and 0 to all other words (recall the one-hot encoding).

So we usually minimise the negative log-likelihood (NLL) of the correct words:

$NLL = -\sum_{t=1}^{n}\log Q(y_t)$

where $y_t$ is the correct word at time step $t$ (the word from the human-written reference sentence). This is equivalent to maximising the probability of the correct words.
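
The following sketch computes the NLL for a three-step sequence. The per-step distributions in `predicted` and the reference indices in `targets` are invented for illustration; a real model would produce one distribution over the full vocabulary at every step.

```python
import math

# Hypothetical model distributions Q over a 3-word vocabulary, one per time step.
predicted = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
]
# Indices of the correct (reference) words at each time step.
targets = [0, 1, 2]

# With one-hot targets the cross-entropy at step t reduces to -log Q(y_t),
# so the loss over the sequence is the sum of those terms.
nll = -sum(math.log(q[y]) for q, y in zip(predicted, targets))

# Minimising the NLL is the same as maximising the product of the
# probabilities assigned to the correct words.
likelihood = math.prod(q[y] for q, y in zip(predicted, targets))

print(nll, -math.log(likelihood))  # the two values agree up to rounding
```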

**How are language models trained?**

*Very often so-called "teacher forcing" is used. It feeds the ground-truth word from the training dataset at the current time step t as input to the next time step t+1, rather than the output generated by the network. This makes learning faster and the model more stable, since the model is not punished at every subsequent step for an earlier wrong prediction.*
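
A rough sketch of how teacher forcing shapes the training data: inputs and targets are the same reference sequence, shifted by one position. The function name `teacher_forcing_pairs` and the token ids are hypothetical.

```python
def teacher_forcing_pairs(token_ids):
    """Return (inputs, targets) for next-token prediction with teacher forcing."""
    # Inputs are the ground-truth tokens y_1..y_{n-1}; targets are the same
    # tokens shifted left by one position, y_2..y_n.
    return token_ids[:-1], token_ids[1:]

# Hypothetical token ids standing in for "<bos> the cat slept <eos>".
inputs, targets = teacher_forcing_pairs([0, 11, 12, 13, 1])

for t, target in enumerate(targets):
    # At step t the model sees the reference prefix inputs[: t + 1],
    # never its own earlier predictions, and is trained to predict `target`.
    print(f"step {t}: prefix={inputs[: t + 1]} -> target={target}")
```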

Let's quickly install `transformers` and load the model. We will use GPT-2 with TensorFlow 2 (pinned below 2.16 in the install cell) as an example of a Large Language Model (LLM).

In [1]:
!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q 'tensorflow<2.16'



In [2]:
import tensorflow as tf
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer


tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)




### **Greedy Search**

Greedy search simply selects the word with the highest probability as its next word: $y_t = \arg\max_{y} P(y \mid y_1, \ldots, y_{t-1})$ at each time step $t$.

Starting from the word $\text{"The"}$, the algorithm selects the most probable next word, e.g. $\text{"cat"}$, and continues in this manner. If $\text{"cat"}$ has probability $0.5$ given $\text{"The"}$ and $\text{"slept"}$ has probability $0.4$ given $\text{"The", "cat"}$, the resulting word sequence is $\text{"The", "cat", "slept"}$, with an overall probability of $0.5 \times 0.4 = 0.2$.
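
The example can be reproduced on a toy lookup table. This is only a sketch: `NEXT_WORD_PROBS` is a hypothetical stand-in for a language model, and it assumes that 0.5 and 0.4 are the probabilities of "cat" and "slept" respectively; all other numbers are made up.

```python
# Hypothetical next-word distributions, keyed by the words generated so far.
# Only 0.5 and 0.4 come from the example in the text; the rest are invented.
NEXT_WORD_PROBS = {
    ("The",): {"cat": 0.5, "feline": 0.4, "car": 0.1},
    ("The", "cat"): {"slept": 0.4, "rested": 0.35, "ran": 0.25},
}

def greedy_decode(prefix, steps):
    """Pick the most probable next word at every step."""
    sequence, prob = list(prefix), 1.0
    for _ in range(steps):
        dist = NEXT_WORD_PROBS[tuple(sequence)]
        word = max(dist, key=dist.get)  # argmax over the toy vocabulary
        prob *= dist[word]
        sequence.append(word)
    return sequence, prob

sequence, prob = greedy_decode(["The"], steps=2)
print(sequence, prob)  # ['The', 'cat', 'slept'] 0.2
```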

In the following we will generate word sequences using GPT2 given the context (prompt) $\text{"The", "cat", "slept", "on", "the"}$. Greedy search can be used in `transformers` as follows:

In [3]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('The cat slept on the', return_tensors='tf')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
The cat slept on the floor, and the cat was still asleep.

"I was just trying to get her to sleep," she said. "I was trying to get her to sleep, but she was still asleep."

The cat


**Task 1: Write the greedy search method.**

Complete the function below implementing the greedy search algorithm, and use it to:


1.   Generate the text following the prompt "The cat slept on the"
2.   Return the log-probability of generating this sequence (conditioned on the prompt).

Your output should match the result returned by the built-in Hugging Face 🤗 greedy search above.


In [4]:
import tensorflow as tf
import numpy as np

def greedy_search(model, tokenizer, input_ids, max_len=50):

    # Running sum of the log-probabilities of the generated tokens
    cumulative_logprobs = 0
    new_input_sequence = tf.convert_to_tensor(input_ids, dtype=tf.int32)

    while True:
        # Model forward pass over the whole sequence generated so far
        output = model(new_input_sequence)

        # GPT-2 generates a prediction per token; we want the last prediction (logits)
        current_predictions = output.logits[:, -1, :]

        # Compute log probabilities using softmax, then take log
        log_probs = tf.math.log(tf.nn.softmax(current_predictions, axis=-1))

        # Greedy search: select token with max log probability
        next_token_id = tf.argmax(log_probs, axis=-1, output_type=tf.int32)

        # Get log probability of the selected token
        next_token_logprob = tf.reduce_max(log_probs, axis=-1)

        # Update cumulative log probabilities
        cumulative_logprobs += next_token_logprob

        # Append selected token to the sequence
        new_input_sequence = tf.concat([new_input_sequence, tf.expand_dims(next_token_id, -1)], axis=-1)

        # We stop at max length (prompt tokens included)
        if new_input_sequence.shape[1] >= max_len:
            break

    return new_input_sequence.numpy(), cumulative_logprobs[0].numpy()



In [5]:
out_seq, proba = greedy_search(model, tokenizer, input_ids, max_len=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(out_seq[0], skip_special_tokens=True))
print("\n Its log-probability:\n" + 100 * '-')
print(proba)

Output:
----------------------------------------------------------------------------------------------------
The cat slept on the floor, and the cat was still asleep.

"I was just trying to get her to sleep," she said. "I was trying to get her to sleep, but she was still asleep."

The cat

 Its log-probability:
----------------------------------------------------------------------------------------------------
-62.585667


### **Beam search**

Beam search reduces the risk of overlooking high-probability word sequences by maintaining the top `num_beams` hypotheses at each time step and ultimately selecting the hypothesis with the highest overall probability.

To illustrate with `num_beams=2`: at time step $1$, in addition to the most probable hypothesis, e.g., $\text{"The", "cat"}$, beam search also retains the second most likely one, e.g., $\text{"The", "feline"}$. At time step $2$, beam search finds that the word sequence $\text{"The", "cat", "slept"}$ has a higher probability than $\text{"The", "cat", "rested"}$, which in turn has a higher probability than $\text{"The", "feline", "rested"}$. Hence only the former two are kept for further generation.

Let's see how beam search can be used in `transformers`. We specify `num_beams=3` and `early_stopping=True` to stop the generation process when all beam hypotheses reach the end-of-sequence (EOS) token.
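
The pruning step described above can be sketched on a toy lookup table. `NEXT_WORD_PROBS` is a hypothetical stand-in for a language model, and all probabilities are invented so that the ranking matches the example.

```python
import math

# Hypothetical next-word distributions; every number is made up so that the
# ranking of hypotheses matches the example in the text.
NEXT_WORD_PROBS = {
    ("The",): {"cat": 0.5, "feline": 0.4, "car": 0.1},
    ("The", "cat"): {"slept": 0.4, "rested": 0.35, "ran": 0.25},
    ("The", "feline"): {"rested": 0.4, "slept": 0.35, "ran": 0.25},
}

def beam_search_toy(prefix, steps, num_beams):
    """Keep the num_beams best hypotheses (by log-probability) at each step."""
    beams = [(list(prefix), 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logprob in beams:
            # Extend every surviving hypothesis with every possible next word.
            for word, p in NEXT_WORD_PROBS[tuple(seq)].items():
                candidates.append((seq + [word], logprob + math.log(p)))
        # Sort by cumulative log-probability and prune to the beam width.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

beams = beam_search_toy(["The"], steps=2, num_beams=2)
for seq, logprob in beams:
    print(seq, round(math.exp(logprob), 3))
```

With these numbers the two surviving hypotheses are "The cat slept" and "The cat rested", while "The feline rested" is pruned, as in the description above.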

In [6]:
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=3,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
The cat slept on the floor of the house.

"It was like a nightmare," she said.

"It was like a nightmare. It was like a nightmare. It was like a nightmare. It was like a nightmare. It


**Task 2: Write the beam search method.**

Complete the function below, implementing the beam search algorithm from scratch, and answer the following questions:


1.   Generate the text following the prompt "The cat slept on the" using `beam_size=3` and `max_len=50`. Is the output different from the output provided by the official implementation when run with the same parameters?
2.   Return the log-probability of generating this sequence (conditioned on the prompt). Is this probability different from the one you obtained for greedy search in Task 1? If so, why?
3.   Explain how the beam search hypothesis you generated differs from the greedy search hypothesis you generated in Task 1.

In [7]:
def beam_search(model, tokenizer, input_ids, beam_size=3, max_len=50):

    sequences = [[input_ids, 0]]  # List to store candidate sequences and their cumulative log probabilities

    while True:
        all_candidates = []

        # we loop over existing sequences
        for seq, culm_logprob in sequences:

            # Model forward pass
            output = model(seq)

            # Keep only the logits for the last position (next-token prediction)
            current_predictions = output.logits[:, -1, :]

            # Compute log probabilities
            log_probabilities = tf.nn.log_softmax(current_predictions, axis=-1)

            # Beam search: select top-k candidates (k = beam size)
            top_k_probs, top_k_indices = tf.math.top_k(log_probabilities, k=beam_size)


            # for each existing hypothesis we add beam_size more hypotheses
            for k in range(beam_size):
                next_seq = tf.concat([seq, [[top_k_indices[0][k].numpy()]]], axis=-1)
                next_score = culm_logprob + top_k_probs[0][k].numpy()
                all_candidates.append([next_seq, next_score])

            # TO DO

        # Sort candidates by cumulative log probability and pick the n-best hypotheses
        ordered = sorted(all_candidates, key=lambda tup:tup[1], reverse=True)
        sequences = ordered[:beam_size]

        # Print the new best output for the top sequence
        print(tokenizer.decode(sequences[0][0][0], skip_special_tokens=True))

        # Stopping criteria (EOS token)
        if sequences[0][0][0][-1] == tokenizer.eos_token_id or len(sequences[0][0][0]) >= max_len:
            break

    return sequences[0][0].numpy(), sequences[0][1]

In [8]:
beam_search_result, beam_search_logprobs = beam_search(model, tokenizer, input_ids, beam_size=3, max_len=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_search_result[0], skip_special_tokens=True))
print("\n Its log-probability:\n" + 100 * '-')
print(beam_search_logprobs)

The cat slept on the floor
The cat slept on the floor,
The cat slept on the floor of the
The cat slept on the floor of the house
The cat slept on the floor of the house,
The cat slept on the floor of the house, and
The cat slept on the floor of the house.


The cat slept on the floor of the house.

"
The cat slept on the floor of the house.

"I
The cat slept on the floor of the house.

"It was
The cat slept on the floor of the house.

"It was a
The cat slept on the floor of the house.

"It was like a
The cat slept on the floor of the house.

"It was like a nightmare
The cat slept on the floor of the house.

"It was like a nightmare,"
The cat slept on the floor of the house.

"It was like a nightmare," she
The cat slept on the floor of the house.

"It was like a nightmare," she said
The cat slept on the floor of the house.

"It was like a nightmare," she said.
The cat slept on the floor of the house.

"It was like a nightmare," she said. "
The cat slept on the floor of the house.

"It w

### **Repetition penalty**


While the result may look more fluent, the output still includes repetitions of the same word sequences.

One straightforward solution is to penalise repeated n-grams (sequences of $n$ words), a concept introduced by [Paulus et al. (2017)](https://arxiv.org/abs/1705.04304) and [Klein et al. (2017)](https://arxiv.org/abs/1701.02810). The simplest n-gram penalty excludes repeated n-grams by explicitly assigning a probability of zero to any next word that would complete a previously encountered n-gram.

Let's try it out by setting `no_repeat_ngram_size=2` so that no *2-gram* appears twice.

Note that there is also a `repetition_penalty` parameter, which only discounts the scores of previously generated tokens.
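
To illustrate the idea behind the n-gram penalty, the sketch below computes which next words would complete an n-gram that has already been generated. The helper `banned_next_tokens` is hypothetical and works on plain word lists; it is not the `transformers` implementation.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if n < 1 or len(tokens) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        # If an earlier n-gram starts with the same prefix, its last token
        # must not be generated again.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

history = "It was like a nightmare . It was".split()
print(banned_next_tokens(history, n=2))  # {'like'}
```

A decoder would then assign a probability of zero (a log-probability of minus infinity) to the banned tokens before choosing the next word.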


In [9]:
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=3,
    no_repeat_ngram_size=2,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
The cat slept on the floor of the house.

"I'm not going to tell you what happened," she said. "I don't want to talk about it. I'm just trying to make it clear that I didn't do anything


**Task 3: Write the beam search method with repetition penalty.**

Complete the `beam_search_no_repeat` function below by following these steps:

1.   Take the last 20 tokens of each new hypothesis
2.   In the newly generated distribution, assign a very low log-probability of -1e9 to the tokens that already occur within those last 20 tokens. Hint: you can use `tf.tensor_scatter_nd_update` for this.
3.   Proceed with beam search as before.

After completing the function, answer the following questions:

1.   Generate the text following the prompt "The cat slept on the" using `beam_size=3` and `max_len=50`. Is the output different from the output provided by your beam search implementation (Task 2)? If so, why?
2.  Is the output different from the output of the official Hugging Face *n*-gram repetition penalty implementation above (`no_repeat_ngram_size=2`)? If so, why?
3.   Return the log-probability of generating this sequence (conditioned on the prompt).

In [10]:
def beam_search_no_repeat(model, tokenizer, input_ids, beam_size=3, max_len=50, last_tokens_count=20):

    sequences = [[input_ids, 0]]  # List to store candidate sequences and their cumulative log probabilities

    while True:

        all_candidates = []

        for seq, culm_logprob in sequences:

            output = model(seq)

            # TO DO

            # Getting the last prediction
            current_predictions = output.logits[:, -1, :]

            # Computing the log probability
            log_probabilities = tf.math.log(tf.nn.softmax(current_predictions, axis=-1))


            # Penalise recently generated tokens. Note that the penalty only
            # applies once the hypothesis is longer than last_tokens_count tokens.
            if len(seq[-1]) > last_tokens_count:
                # Extracting the last last_tokens_count tokens from the sequence
                last_tokens = seq[-1][-last_tokens_count:]
                # Building a vocabulary-sized mask with -1e9 at the recent tokens
                # (tf.scatter_nd sums duplicates, so repeated tokens accumulate the penalty)
                mask_indices = tf.reshape(tf.cast(last_tokens, dtype=tf.int64), [-1, 1])
                updates = tf.fill([len(last_tokens)], -1e9)
                shape = tf.constant([log_probabilities.shape[-1]], dtype=tf.int64)
                mask_tensor = tf.scatter_nd(mask_indices, updates, shape)
                # Adding the mask pushes those tokens out of the top-k
                log_probabilities = log_probabilities + mask_tensor

            # Selecting top-k candidates
            top_k_probs, top_k_indices = tf.nn.top_k(log_probabilities, k=beam_size)


            for k in range(beam_size):
                next_token = top_k_indices[0, k]

                new_seq = tf.concat([seq, next_token[tf.newaxis, tf.newaxis]], axis=-1)
                new_culm_logprob = culm_logprob + top_k_probs[0, k]

                all_candidates.append([new_seq, new_culm_logprob])

            # TO DO

        # Sorting candidates by cumulative log probability
        sequences = sorted(all_candidates, key=lambda x: x[1], reverse=True)[:beam_size]

        # Printing new output for the top sequence
        print(tokenizer.decode(sequences[0][0][0], skip_special_tokens=True))

        # Stopping criteria
        if sequences[0][0][0][-1] == tokenizer.eos_token_id or len(sequences[0][0][0]) >= max_len:
            break

    return sequences[0][0], sequences[0][1]

In [11]:
beam_search_result, beam_search_logprobs = beam_search_no_repeat(model, tokenizer, input_ids, beam_size=3, max_len=50, last_tokens_count=20)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_search_result[0], skip_special_tokens=True))
print("Its log-probability:\n" + 100 * '-')
print(beam_search_logprobs)

The cat slept on the floor
The cat slept on the floor,
The cat slept on the floor of the
The cat slept on the floor of the house
The cat slept on the floor of the house,
The cat slept on the floor of the house, and
The cat slept on the floor of the house.


The cat slept on the floor of the house.

"
The cat slept on the floor of the house.

"I
The cat slept on the floor of the house.

"It was
The cat slept on the floor of the house.

"It was a
The cat slept on the floor of the house.

"It was like a
The cat slept on the floor of the house.

"It was like a nightmare
The cat slept on the floor of the house.

"It was like a nightmare,"
The cat slept on the floor of the house.

"It was like a nightmare," she
The cat slept on the floor of the house.

"It was like a nightmare," she said
The cat slept on the floor of the house.

"It was like a nightmare," she said,
The cat slept on the floor of the house.

"It was like a dream," she said, "
The cat slept on the floor of the house.

"It was l

# Other sampling strategies


We can alter the distributions provided by a model to create texts with specific characteristics. The required characteristics may vary with the task at hand, but the typical objective is to ensure that the generated text exhibits:

*   Coherence - text should be logically connected and make sense.
*   Diversity - generated samples are distinct.

It's evident that coherence is crucial. In this section, you will explore some popular generation strategies and their impact on the coherence and diversity of the generated samples:

* **Random sampling** draws the next token at random from the distribution predicted by the model.

* **Temperature sampling** divides the logits by the temperature before feeding them into softmax. The lower the temperature, the closer the sampling procedure is to greedy decoding.

* **Top k sampling** keeps only the k most probable tokens and removes the tail of the distribution.

* **Nucleus (Top p) sampling** samples from the smallest set of most probable tokens whose accumulated probability is equal to or higher than a pre-defined value ([Holtzman et al., 2019](https://arxiv.org/abs/1904.09751)).
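
The four strategies listed above can be sketched in plain Python on a toy distribution. The helper names (`softmax`, `top_k_filter`, `top_p_filter`, `sample`) and the logits are hypothetical; this illustrates the ideas and is not how `transformers` implements them.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

def sample(probs, rng):
    """Draw one token index from the distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical logits for a 4-word vocabulary
rng = random.Random(0)

probs = softmax(logits)
print("random sample:", sample(probs, rng))
print("T = 0.2      :", [round(q, 3) for q in softmax(logits, temperature=0.2)])
print("T = 2.0      :", [round(q, 3) for q in softmax(logits, temperature=2.0)])
print("top-k (k=2)  :", [round(q, 3) for q in top_k_filter(probs, k=2)])
print("top-p (0.9)  :", [round(q, 3) for q in top_p_filter(probs, p=0.9)])
```

A low temperature concentrates the mass on the most likely token, a high temperature flattens the distribution, and the two filters set the tail to zero before sampling.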



**Task 4: Implementation of different sampling methods.**

1.   Run the code below for the four strategies mentioned above and generate text for the prompt "The cat slept on the".
2.   Discuss the differences between the obtained outputs in terms of coherence and diversity.




In [12]:
# random sampling
tf.random.set_seed(0)

sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50
)

print("Random Sampling output:\n" + 100 * '-')
print(tokenizer.decode(sample_outputs[0], skip_special_tokens=True))


# High temperature

sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    temperature=2.0
)

print("\n High temperature output:\n" + 100 * '-')
print(tokenizer.decode(sample_outputs[0], skip_special_tokens=True))


# Low temperature
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    temperature=0.2
)

print("\n Low temperature output:\n" + 100 * '-')
print(tokenizer.decode(sample_outputs[0], skip_special_tokens=True))

# Top p sampling
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_p=0.92,
    top_k=0
)

print("\n Top p output:\n" + 100 * '-')
print(tokenizer.decode(sample_outputs[0], skip_special_tokens=True))


Random Sampling output:
----------------------------------------------------------------------------------------------------
The cat slept on the floor of the tent, she noticed that no one saw her step back in the hallway, when she walked in. It didn't help that this is when I realized that not one of that room's inhabitants has a cat.

 High temperature output:
----------------------------------------------------------------------------------------------------
The cat slept on the roof (a house cat being carried up from on down); that her belly crawled. When first taken, the crescent head is found to lie quite still inside the frame. It did at first belong more as it seemed as

 Low temperature output:
----------------------------------------------------------------------------------------------------
The cat slept on the floor, and the dog slept on the floor.

The cat was a little bit older than the dog, but she was still very young. She was a little bit older than the dog, but she w