# Overview

In this notebook, we will implement different decoding strategies using the Hugging Face Transformers library.


# Auto-regressive Language generation

Auto-regressive language generation is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next-word distributions:

$$P(w_{1:T}|W_0)=\prod_{t=1}^{T}P(w_{t}|w_{1:t-1},W_{0}), \text{ with } w_{1:0}=\emptyset$$

where $W_{0}$ is the initial context word sequence. The length $T$ of the word sequence is usually determined on-the-fly and corresponds to the timestep $t=T$ at which the EOS token is generated from $P(w_{t}|w_{1:t-1}, W_{0})$. Let's look at the currently most prominent decoding methods:

* Greedy search
* Beam search
* Sampling
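
Before turning to the decoding methods, the factorization above can be illustrated with a minimal sketch. The three conditional probabilities below are illustrative assumptions, not values produced by a real model; the point is only that the probability of a sequence is the product of its conditional next-word probabilities, and that summing log-probabilities gives the same result.

```python
import math

# Hypothetical conditional next-word probabilities P(w_t | w_{1:t-1}, W_0)
# for one generated sequence of length T=3 (illustrative numbers only).
conditional_probs = [0.5, 0.4, 0.9]

# Chain rule: the sequence probability is the product of the conditionals.
sequence_prob = math.prod(conditional_probs)

# In practice, log-probabilities are summed instead, which avoids
# numerical underflow for long sequences.
sequence_log_prob = sum(math.log(p) for p in conditional_probs)

print(sequence_prob)
print(math.exp(sequence_log_prob))
```

Both printed values agree up to floating-point rounding.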

In [1]:
%%capture
!pip install transformers==4.38.2



In [2]:
import torch
import warnings

if torch.cuda.is_available():
    torch.backends.cudnn.deterministic=True
    # https://github.com/huggingface/transformers/issues/28731
    torch.backends.cuda.enable_mem_efficient_sdp(False)
    device='cuda'
else:
    device='cpu'

warnings.filterwarnings('ignore')

print(device)

cuda


In [4]:
import torch
from transformers import AutoTokenizer

tokenizer=AutoTokenizer.from_pretrained("gpt2")


In [6]:
from transformers import AutoModelForCausalLM

model=AutoModelForCausalLM.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id).to(device)
model.config


GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "pad_token_id": 50256,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.38.2",
  "use_cache": true,
  "vocab_size": 50257
}

# Greedy Search

Greedy search is the simplest decoding method. It selects the word with the highest probability as its next word: $w_{t}=\arg\max_{w}P(w|w_{1:t-1})$ at each timestep $t$. The following sketch shows greedy search.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/112/093/437/219/901/721/original/a65abc1406cb4c72.png)

Starting from the word "The", the algorithm greedily chooses the next word of highest probability, "nice", and so on, so that the final generated word sequence is ("The", "nice", "woman"), which has an overall probability of $0.5 \times 0.4 = 0.2$. In the following, we will generate sequences using GPT2 on the context ("I", "enjoy", "walking", "with", "my", "cute", "dog"). Let's see how greedy search can be used in `transformers`:

In [7]:
# encode context the generation is conditioned on
model_inputs=tokenizer("I enjoy walking with my cute dog", return_tensors="pt").to(device)

# generate 40 new tokens
greedy_output=model.generate(**model_inputs, max_new_tokens=40)

print("Output:\n"+100*'-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure


The generated words following the context are reasonable, but the model quickly starts repeating itself! This is a very common problem in language generation and seems to be even more so in greedy and beam search. For more detail, see [Vijayakumar et al., 2016](https://arxiv.org/abs/1610.02424) and [Shao et al., 2017](https://arxiv.org/abs/1701.03185).

The major drawback of greedy search, though, is that it misses high-probability words hidden behind a low-probability word, as can be seen in our sketch above: the word "has", with its high conditional probability of 0.9, is hidden behind the word "dog", which has only the second-highest conditional probability, so that greedy search misses the word sequence ("The", "dog", "has").
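
The toy example can be reproduced with a minimal sketch in plain Python. The probabilities 0.5 for "nice", 0.4 for "woman", and 0.9 for "has" come from the text, and 0.4 for "dog" follows from the joint probability of 0.36 quoted for ("The", "dog", "has"); all remaining values, and the simplification that the next word depends only on the previous word, are assumptions made for illustration.

```python
# Toy next-word model keyed by the previous word.
# Values not quoted in the text are assumptions chosen so that each
# distribution sums to 1.
TOY_MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def greedy_search(start, steps):
    """Pick the single most probable next word at every step."""
    sequence, prob = [start], 1.0
    for _ in range(steps):
        next_words = TOY_MODEL[sequence[-1]]
        word = max(next_words, key=next_words.get)
        sequence.append(word)
        prob *= next_words[word]
    return sequence, prob


sequence, prob = greedy_search("The", steps=2)
print(sequence, round(prob, 2))  # ['The', 'nice', 'woman'] 0.2
```

Because "nice" wins the first step, the branch containing "has" is never explored.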

# Beam Search

Beam search reduces the risk of missing hidden high-probability word sequences by keeping the `num_beams` most likely hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability. For example, with `num_beams=2`:


![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/112/093/566/654/879/861/original/e9b86a2f188c635c.png)

At time step 1, besides the most likely hypothesis ("The", "nice"), beam search also keeps track of the second most likely one, ("The", "dog"). At time step 2, beam search finds that the word sequence ("The", "dog", "has"), with a probability of 0.36, is more likely than ("The", "nice", "woman"), which has 0.2. Great, it has found the most likely word sequence in our toy example! Beam search will always find an output sequence with a probability at least as high as the one found by greedy search, but it is not guaranteed to find the most likely output.
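
The two time steps described above can be traced with a small, self-contained sketch. It is not how `transformers` implements beam search; it only shows the keep-the-best-`num_beams` idea on the toy tree. Probabilities not quoted in the text are assumptions, and the model is simplified so that the next word depends only on the previous word.

```python
# Toy next-word model keyed by the previous word. The values 0.5, 0.4 (for
# "dog" and "woman") and 0.9 are taken from the text; the rest are assumed.
TOY_MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
    "car": {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def beam_search(start, steps, num_beams):
    """Keep the `num_beams` most probable partial sequences at each step."""
    beams = [([start], 1.0)]
    for _ in range(steps):
        candidates = []
        for sequence, prob in beams:
            for word, p in TOY_MODEL[sequence[-1]].items():
                candidates.append((sequence + [word], prob * p))
        # Prune to the best `num_beams` hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


best_sequence, best_prob = beam_search("The", steps=2, num_beams=2)[0]
print(best_sequence, round(best_prob, 2))  # ['The', 'dog', 'has'] 0.36
```

With `num_beams=1` the same function reduces to greedy search and returns ("The", "nice", "woman") again.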


## Beam Search in Transformers

Here we set `num_beams>1` and `early_stopping=True` so that generation is finished when all beam hypotheses have reached the EOS token.

In [9]:
beam_output=model.generate(**model_inputs, max_new_tokens=40, num_beams=5, early_stopping=True)

print("Output:\n"+100*'-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I'm not sure


While the result is arguably more fluent, the output still includes repetitions of the same word sequences. One of the available remedies is to introduce n-gram (i.e. word sequences of n words) penalties, as introduced by [Paulus et al., 2017](https://arxiv.org/abs/1705.04304) and [Klein et al., 2017](https://arxiv.org/abs/1701.02810). The most common n-gram penalty makes sure that no n-gram appears twice by manually setting the probability of next words that could create an already seen n-gram to 0.
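
The idea behind the penalty can be sketched in a few lines. The helper `banned_next_tokens` below is a hypothetical name used for illustration, not a function of the library, and it works on words for readability, whereas the library operates on token IDs. The next-word probabilities are made up.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an already seen n-gram."""
    if len(tokens) < ngram_size:
        return set()
    # The last (n - 1) tokens form the prefix of the next potential n-gram.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated = "I am not sure if I".split()
banned = banned_next_tokens(generated, ngram_size=2)
print(banned)  # {'am'}: "I am" has already been generated

# Hypothetical next-word distribution; banned words get probability 0
# and the remaining mass is renormalized.
next_word_probs = {"am": 0.6, "will": 0.3, "can": 0.1}
allowed = {w: p for w, p in next_word_probs.items() if w not in banned}
total = sum(allowed.values())
renormalized = {w: p / total for w, p in allowed.items()}
print(renormalized)
```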

In [10]:
# set no_repeat_ngram_size to 2
beam_output=model.generate(**model_inputs, max_new_tokens=40, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)

print("Output:\n"+100*'-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to


We can see that the repetition does not appear anymore. Nevertheless, n-gram penalties have to be used with care. An article generated about the city of Melbourne should not use a 2-gram penalty; otherwise, any two-word phrase containing the city's name, such as "in Melbourne", could appear only once in the whole text.

Another important feature about beam search is that we can compare the top beams after generation and choose the generated beam that fits our purpose best. We can simply set the parameter `num_return_sequences` to the number of highest scoring beams that should be returned. Make sure though that `num_return_sequences <= num_beams`.

In [12]:
# set num_return_sequences>1
beam_outputs=model.generate(**model_inputs, max_new_tokens=40, num_beams=5, no_repeat_ngram_size=2, num_return_sequences=5, early_stopping=True)

print("Output:\n"+100*"-")
for i, beam_output in enumerate(beam_outputs):
    print("{}:{}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
0:I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to
1:I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to
2:I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's a good idea to
3:I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time to take a
4:I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's a good idea.


As we can see, the five beam hypotheses are only marginally different from each other, which should not be too surprising when using only 5 beams.


# Beam Search is not Always the Best Option

Beam search can work very well in tasks where the length of the desired generation is more or less predictable, as in machine translation or summarization; see [Murray et al., 2018](https://arxiv.org/abs/1808.10006) and [Yang et al., 2018](https://arxiv.org/abs/1808.09582). But this is not the case for open-ended generation, where the desired output length can vary greatly, e.g. dialog and story generation. We have seen that beam search heavily suffers from repetitive generation. This is especially hard to control with n-gram or other penalties in story generation, since finding a good trade-off between inhibiting repetition and repeating cycles of identical n-grams requires a lot of fine-tuning. As argued in [Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751), high-quality human language does not follow a distribution of high-probability next words. In other words, as humans, we want generated text to surprise us and not to be boring or predictable. The authors show this nicely by plotting the probability a model would give to human text vs. what beam search does.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/112/096/804/301/777/915/small/a2ba7b70e0fa191c.png)


# Sampling

In its most basic form, sampling means randomly picking the next word $w_{t}$ according to its conditional probability distribution:

$$w_{t} \sim P(w|w_{1:t-1})$$

Taking the example from above, the following graphic visualizes language generation when sampling.

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/112/096/815/799/840/566/original/0cf20071a21e631c.png)

It becomes obvious that language generation using sampling is not deterministic anymore. The word "car" is sampled from the conditional probability distribution $P(w|\text{"The"})$, followed by sampling "drives" from $P(w|\text{"The"},\text{"car"})$.
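
A minimal sketch of this sampling step, using only the standard library, could look as follows. The distribution for the word after "The" reuses 0.5 for "nice" and 0.4 for "dog" from the earlier example; the value 0.1 for "car" is an assumption. Drawing many samples shows that even the unlikely word "car" is picked from time to time.

```python
import random

# Assumed next-word distribution P(w | "The").
next_word_probs = {"nice": 0.5, "dog": 0.4, "car": 0.1}

rng = random.Random(42)  # fixed seed so the run is reproducible
words = list(next_word_probs)
weights = list(next_word_probs.values())

# Draw one next word, as a single sampling step would.
next_word = rng.choices(words, weights=weights, k=1)[0]
print(next_word)

# Draw many samples: empirical frequencies approach the probabilities.
samples = rng.choices(words, weights=weights, k=10_000)
frequencies = {w: samples.count(w) / len(samples) for w in words}
print(frequencies)
```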

In Hugging Face Transformers, we set `do_sample=True` and deactivate Top-K sampling via `top_k=0`. In the following, we will fix the random seed for illustration purposes. Feel free to change the `set_seed` argument to obtain different results, or to remove it for non-determinism.

In [None]:
# set the seed to reproduce results; feel free to change it to get different results
from transformers import set_seed

set_seed(42)

# activate sampling and deactivate Top-K sampling by setting top_k to 0
sample_output=model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=0
)

print("Output:\n"+100*'-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Taking a closer look at the generated text, it is not very coherent and doesn't sound like it was written by a human. That is the big problem when sampling word sequences: **The models often generate incoherent gibberish, cf. [Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751).**


# Sampling with Temperature

A trick is to make the distribution $P(w|w_{1:t-1})$ sharper (increasing the likelihood of high-probability words and decreasing the likelihood of low-probability words) by lowering the so-called `temperature` of the softmax.

An illustration of applying temperature to our example from above could look as follows

![](https://cdn.masto.host/sigmoidsocial/media_attachments/files/112/096/871/548/228/841/small/695491daf7135f33.png)

The conditional next-word distribution of step $t=1$ becomes much sharper, leaving almost no chance for the word "car" to be selected. We can cool down the distribution in the library by setting `temperature=0.6`:

In [None]:
set_seed(42)

sample_output=model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=0,
    temperature=0.6
)

print("Output:\n"+100*'-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

There are fewer weird n-grams and the output is a bit more coherent now! While applying temperature can make a distribution less random, in its limit, when setting `temperature` → 0, temperature-scaled sampling becomes equal to greedy decoding and will suffer from the same problems as before.
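
The effect of the temperature can be checked numerically with a short sketch. The logits are derived from an assumed distribution (0.5, 0.4, 0.1) over "nice", "dog", and "car", so that a temperature of 1.0 reproduces it; the helper name `softmax_with_temperature` is chosen for this example only.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a stable softmax."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


words = ["nice", "dog", "car"]
# Assumed logits: log-probabilities of the distribution (0.5, 0.4, 0.1).
logits = [math.log(0.5), math.log(0.4), math.log(0.1)]

for temperature in (1.0, 0.6, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, {w: round(p, 3) for w, p in zip(words, probs)})
```

Lowering the temperature moves probability mass towards "nice"; as the temperature approaches 0, nearly all mass sits on the most likely word, which is the greedy choice.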


# Top-k Sampling

[Fan et al. (2018)](https://arxiv.org/pdf/1805.04833.pdf) introduced a simple but very powerful sampling scheme called **Top-K** sampling. In Top-K sampling, the K most likely next words are kept and the probability mass is redistributed among only those K next words. GPT-2 adopted this sampling scheme, which was one of the reasons for its success in story generation.
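
The filtering and redistribution step can be sketched as follows. The helper `top_k_filter` is a hypothetical name for this illustration, and the distribution is the assumed toy distribution (0.5, 0.4, 0.1) over "nice", "dog", and "car" rather than real model output.

```python
def top_k_filter(next_word_probs, k):
    """Keep the k most likely words and renormalize their probabilities."""
    top = sorted(next_word_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}


next_word_probs = {"nice": 0.5, "dog": 0.4, "car": 0.1}
filtered = top_k_filter(next_word_probs, k=2)
print(filtered)
```

With `k=2`, "car" is removed and its probability mass is shared between "nice" and "dog" in proportion to their original probabilities. Sampling then proceeds from the filtered distribution; in the library, the corresponding knob is the `top_k` argument that was set to 0 above to switch the filter off.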

# Acknowledgements

* https://huggingface.co/blog/how-to-generate
* https://huggingface.co/blog/ray-rag