## Text Generation Decoding Intro  

### **Introduction**

In recent years, there has been an increased interest in language generation due to the rise of transformer-based language models trained on millions of webpages, such as OpenAI's famous [GPT2 model](https://openai.com/blog/better-language-models/).

The results on conditioned open-ended language generation are quite good, e.g. [GPT2 on unicorns](https://openai.com/blog/better-language-models/#samples), [XLNet](https://medium.com/@amanrusia/xlnet-speaks-comparison-to-gpt-2-ea1a4e9ba39e), and [XLNet @ GitHub](https://github.com/zihangdai/xlnet/).


Besides the improved transformer architecture and massive unsupervised training data, **better decoding methods** have also played an important role.

This notebook is an intro to some of the decoding methods used for text generation. It gives a brief overview of different decoding strategies and shows how to implement them with very little effort using the popular [Hugging Face Transformers Library](https://github.com/huggingface/transformers).

The notebook briefly mentions the more general topic known as [auto-regressive language generation](http://jalammar.github.io/illustrated-gpt2/).

In short, **auto-regressive** language generation is based on the assumption that the probability distribution of a word sequence can be decomposed into the product of conditional next word distributions as in the formula below. Note that the formula is remarkably similar to the **autoregressive** [formulation](https://docs.google.com/presentation/d/1SB1fYiOAzDqZS_NSjb7zONKqdGS-vkXB5g_5eE8dt_0/edit#slide=id.g1ec32d6190d_0_137) we used for **RNNs** :

$$ P(w_{1:T} | W_0 ) = \prod_{t=1}^T P(w_{t} | w_{1: t-1}, W_0) \text{ ,with }  w_{1: 0} = \emptyset, $$

where $W_0$ is the initial *context* word sequence. The length $T$ of the word sequence is usually determined *on-the-fly* and corresponds to the timestep $t=T$ at which the EOS token is generated from $P(w_{t} | w_{1: t-1}, W_{0})$.
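To make the factorization concrete, here is a minimal sketch that scores a sequence by summing the logs of its conditional next-word probabilities. The table `TOY_MODEL` and the helper `sequence_log_prob` are hypothetical names, and the probabilities are made-up illustrative values, not outputs of a real model.

```python
import math

# Hypothetical toy model: conditional next-word probabilities keyed by history.
# The numbers are illustrative assumptions, not outputs of a real model.
TOY_MODEL = {
    ("I",): {"enjoy": 0.6, "like": 0.4},
    ("I", "enjoy"): {"walking": 0.7, "reading": 0.3},
    ("I", "enjoy", "walking"): {"<eos>": 0.8, "home": 0.2},
}

def sequence_log_prob(context, words):
    """Sum log P(w_t | w_{1:t-1}, W_0) over the generated words."""
    history = tuple(context)
    total = 0.0
    for word in words:
        total += math.log(TOY_MODEL[history][word])
        history += (word,)  # the chosen word becomes part of the history
    return total

log_p = sequence_log_prob(["I"], ["enjoy", "walking", "<eos>"])
print(math.exp(log_p))  # product of the three conditional probabilities
```

Working in log space is a common design choice because products of many small probabilities underflow quickly for long sequences.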

Auto-regressive language generation is available for `GPT2`, `XLNet`, `OpenAi-GPT`, `CTRL`, `TransfoXL`, `XLM`, `Bart`, `T5` in both PyTorch 2.5+ and TF.

Here we visit the following decoding methods: 

1. *Greedy search*
2. *Beam search*
3. *Sampling*
4. *Top-K sampling*
5. *Top-p sampling*.

### **Greedy Search**

Greedy search selects the word with the highest probability as its next word: $w_t = \arg\max_{w} P(w | w_{1:t-1})$ at each timestep $t$.

The figure below shows how greedy search navigates through a toy word trellis:

![Greedy Search](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/greedy_search.png)

Starting from the word **"The"**, greedy search chooses the next word with the highest probability **"nice"** and so on, until it reaches the last  word. 

Once the search is complete, the sequence is **"The", "nice", "woman"** with an overall probability of $0.5 \times 0.4 = 0.2$.
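The walk through the trellis can be reproduced with a few lines of plain Python. This is only a sketch: `TRELLIS` and `greedy_search` are hypothetical names, and apart from the probabilities quoted in the text for "nice", "dog", "woman" and "has", the values are assumptions chosen so that each row sums to 1.

```python
# Toy trellis: history -> {next word: conditional probability}.
# Only the values for "nice", "dog", "woman" and "has" come from the text;
# the remaining numbers are assumptions so that each row sums to 1.
TRELLIS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}

def greedy_search(start, steps):
    """Pick the most probable next word at every step."""
    sequence = [start]
    prob = 1.0
    for _ in range(steps):
        candidates = TRELLIS.get(tuple(sequence))
        if not candidates:
            break  # no continuation known for this history
        word = max(candidates, key=candidates.get)
        prob *= candidates[word]
        sequence.append(word)
    return sequence, prob

sequence, prob = greedy_search("The", steps=2)
print(sequence, round(prob, 2))
```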

In the next sections we will generate word sequences using **HF GPT2**  with the input text as context **( "I", "enjoy", "walking", "with", "my", "cute", "dog")**.

We will also see how different decoding and search methods can be used with **HF Transformers**.


[Hugging Face](https://github.com/huggingface)

[Hugging Face Transformers](https://github.com/huggingface/transformers)

In [1]:
from IPython.display import Image
from IPython.display import HTML

# The code here is using PyTorch 2.6

import torch
import torch.nn as nn
import torch.nn.functional as F



In [2]:
def remove_after_char(text, char):
  """Removes all text from a string after the first occurrence of a given character.

  Args:
    text: The input string.
    char: The character to search for.

  Returns:
    The string with all text after the character removed, or the original string if the character is not found.
  """
  try:
    index = text.index(char)
    return text[:index]  # Slice the string up to the character's index
  except ValueError:
    return text  # Return the original string if the character is not found

In [3]:
lofas = "this is a text that contains two lines separated by \n\n  just like the lines that the decoders produce"

print( lofas ) 
lufas = remove_after_char(lofas, "\n")
print("this is lufas:",  lufas ) 

this is a text that contains two lines separated by 

  just like the lines that the decoders produce
this is lufas: this is a text that contains two lines separated by 


In [4]:
# The APIs here come from Hugging Face. The HF people provide versions for TF and Torch. 
# The TF versions have names that start with TF. 
# See below 

# from transformers import TFGPT2LMHeadModel, GPT2Tokenizer # TF for TensorFlow

from transformers import GPT2LMHeadModel, GPT2Tokenizer # For Torch 
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)



In [5]:
# KEEP CELL TO SHOW ERROR  

#from transformers import AutoTokenizer, AutoModelForCausalLM #or other model class

#tokenizer = AutoTokenizer.from_pretrained('gpt2') # Or your model
#model = AutoModelForCausalLM.from_pretrained('gpt2') # Or your model

#input_ids_list = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='pt')
#input_ids = torch.tensor([input_ids_list])  # Convert to PyTorch tensor

# generate text until the output length (which includes the context length) reaches 50
#greedy_output = model.generate(input_ids, max_length=50)

#print("Output:\n" + 100 * '-')
#print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

### HF AutoModelForCausalLM

HF Transformers is a library that gives access to several pre-trained GPT-2 language models.

**AutoModelForCausalLM** is a class from the HF Transformers library. "Auto" means that the library automatically selects the correct model class based on the provided pre-trained checkpoint name ("gpt2" in this case).

**ModelForCausalLM** specifies that we will use a model designed for causal language modeling. Causal language modeling is the task of predicting the next token in a sequence, given the previous tokens. This is the core functionality of models like GPT-2.

**from_pretrained('gpt2')** is a method of the AutoModelForCausalLM class. It loads a pre-trained model and its configuration from a specified checkpoint. 

**'gpt2'** is the identifier for the GPT-2 model. It corresponds to a model that has been pretrained and is stored on the HF hub. The library will download the model's weights and configuration files.


In essence


        model = AutoModelForCausalLM.from_pretrained('gpt2')

does the following:

1. Identifies the model, i.e., it recognizes that we are using the GPT-2 model.

2. Downloads the model, its pre-trained weights and configuration files for GPT-2 from the HF Model Hub (if they are not already cached locally).

3. Instantiates the model, i.e., creates an instance of the appropriate model class (in this case, a causal language model) and loads the downloaded weights into it.


In [6]:

# code in this cell is using greedy search under PyTorch (note the use of 'pt')
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt')

# extract the token ids and the attention mask tensors from the tokenizer output
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']


# generate text until the output length (which includes the context length) reaches 25
greedy_output = model.generate(input_ids, attention_mask=attention_mask, max_length=25)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog.


Congratulations. We have generated a short text with GPT2 😊.

The generated words following the context are reasonable, but the model quickly starts repeating itself! This is a very common problem in language generation in general and seems to be even more so in greedy and beam search - check out [Vijayakumar et al., 2016](https://arxiv.org/abs/1610.02424) and [Shao et al., 2017](https://arxiv.org/abs/1701.03185).

The major problem of greedy search is that it misses high probability words hidden behind a low probability word as can be seen in the figure above.

If we go back to that figure, we can observe that the probability for the word **"has"** is a high $0.9$ but the word is hidden behind the word **"dog"**, which has only the second-highest conditional probability. Thus greedy search misses the word sequence **"The", "dog", "has"**.

Beam search is next.

### **Beam search**

Beam search reduces the risk of missing hidden word sequences of high probability by keeping the `num_beams` most likely hypotheses at each step and eventually choosing the hypothesis that has the overall highest probability.

Let's use  `num_beams=2`:

![Beam search](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/beam_search.png)

At step $1$, besides the most likely hypothesis **"The", "nice"**, beam search also keeps track of the second most likely one **"The", "dog"**.

At step $2$, beam search finds that the word sequence **"The", "dog", "has"**, with probability $0.4 \times 0.9 = 0.36$, is more likely than **"The", "nice", "woman"**, which has $0.5 \times 0.4 = 0.2$. 

Beam search found the most likely word sequence in this toy example. In general, beam search usually finds an output sequence with higher probability than greedy search, but it is not guaranteed to find the most likely output.
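The toy example can be sketched in plain Python as follows. `TRELLIS` and `beam_search` are hypothetical names; the probabilities for "nice", "dog", "woman" and "has" are taken from the text, while the others are assumptions chosen so that each row sums to 1. This is an illustration of the idea, not the implementation used inside `transformers`.

```python
# Toy trellis: history -> {next word: conditional probability}.
# Values not quoted in the text are assumptions so that each row sums to 1.
TRELLIS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}

def beam_search(start, steps, num_beams):
    """Keep the num_beams most probable hypotheses after every step."""
    beams = [((start,), 1.0)]
    for _ in range(steps):
        candidates = []
        for seq, prob in beams:
            for word, p in TRELLIS.get(seq, {}).items():
                candidates.append((seq + (word,), prob * p))
        if not candidates:
            break  # no hypothesis can be extended any further
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

best_seq, best_prob = beam_search("The", steps=2, num_beams=2)[0]
print(best_seq, round(best_prob, 2))
```

With `num_beams=1` the same function degenerates to greedy search, which makes it easy to compare both strategies on the same trellis.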

Let's see how beam search can be used in `transformers`. We set `num_beams > 1` and `early_stopping=True` so that generation finishes when all beam hypotheses have reached the EOS token.

In [7]:

# code in cell uses Beam Search under PyTorch.
# Q: Where may we find more information about this code and what it does???? 
# A: At the link below:

# https://huggingface.co/docs/transformers/main//generation_strategies

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt')
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']

# activate beam search and early_stopping.
# How do we know that we are using beam search? Because the param num_beams > 1 
# Add attention_mask 
beam_output = model.generate(
    input_ids,
    attention_mask=attention_mask, # Add attention_mask here
    max_length=60, # Note change from 50 -> 60 
    num_beams=5,
    early_stopping=True
)


lofas = tokenizer.decode(beam_output[0])

lufas = remove_after_char(lofas, "\n")
print( lufas ) 

#print("Output:\n" + 100 * '-')
#print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.


While the result is arguably more fluent, the output still includes repetitions of the same word sequences.  

A simple remedy is to introduce *n-grams* (*a.k.a* word sequences of $n$ words) penalties as introduced by [Paulus et al. (2017)](https://arxiv.org/abs/1705.04304) and [Klein et al. (2017)](https://arxiv.org/abs/1701.02810). 

The most common *n-grams* penalty makes sure that no *n-gram* appears twice by manually setting the probability of next words that could create an already seen *n-gram* to $0$.
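The idea behind this penalty can be sketched in plain Python. The helpers `banned_next_tokens` and `apply_ngram_ban` are hypothetical names written for this illustration; they are not the functions used inside `transformers`, and the word-level tokens and probabilities are made up.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already present in tokens."""
    banned = set()
    if ngram_size <= 0 or len(tokens) < ngram_size:
        return banned  # no complete n-gram has been seen yet
    prefix_len = ngram_size - 1
    # The last (n - 1) tokens form the prefix of the n-gram we are about to create.
    prefix = tuple(tokens[len(tokens) - prefix_len:])
    for i in range(len(tokens) - ngram_size + 1):
        if tuple(tokens[i:i + prefix_len]) == prefix:
            banned.add(tokens[i + prefix_len])
    return banned

def apply_ngram_ban(next_probs, tokens, ngram_size):
    """Set the probability of every banned next token to 0."""
    banned = banned_next_tokens(tokens, ngram_size)
    return {w: (0.0 if w in banned else p) for w, p in next_probs.items()}

generated = ["I", "walk", "my", "dog", "and", "I"]
next_probs = {"walk": 0.6, "run": 0.3, "sleep": 0.1}  # assumed toy values
print(apply_ngram_ban(next_probs, generated, ngram_size=2))
```

Here the 2-gram ("I", "walk") has already been generated, so "walk" is excluded as the next word after the second "I".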

Let's try it out by setting `no_repeat_ngram_size=2` so that no *2-gram* appears twice:

In [8]:

inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt')
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']

# set no_repeat_ngram_size to 2
beam_output = model.generate(
    input_ids,
    attention_mask=attention_mask, # Add attention_mask here
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True
)

lofas = tokenizer.decode(beam_output[0]) 
print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a break


That looks better. We can see that the repetition does not appear anymore. Nevertheless, *n-gram* penalties have to be used carefully. 

For example, articles generated about the city *New York* should not use a *2-gram* penalty, otherwise the name of the city would only appear once in the whole text!

Another important feature of beam search is that we can compare the top beams after generation and choose the generated beam that best fits our purpose.

In HF `transformers`, this is done by setting the parameter `num_return_sequences` in the call to `model.generate(...)` 
to the number of highest-scoring beams that the model should return.

Make sure though that `num_return_sequences <= num_beams`!

In [9]:

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt')
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']

# set num_return_sequences > 1
beam_outputs = model.generate(
    input_ids,
    attention_mask=attention_mask,  # <--- Add attention_mask here
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True
)

# now we have 5 output sequences
print("Output:\n" + 100 * '-')

for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
    print("\n")


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a break


1: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to get back to


2: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to take a break


3: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to get back to


4: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinkin

The beam hypotheses shown above are only marginally different from each other. This should not be surprising because we are using only 5 beams.

When moving from simple examples to production open-ended generation, there are several reasons why beam search might not be the best option:

- Beam search can work very well in tasks where the length of the desired generation is more or less predictable, as in machine translation or summarization; see for example [Murray et al. (2018)](https://arxiv.org/abs/1808.10006) and [Yang et al. (2018)](https://arxiv.org/abs/1808.09582). This is not the case for open-ended generation, where the desired output length can vary greatly, e.g. dialog and story generation.

- We have seen that beam search suffers from repetitive word generation. This is hard to control with *n-gram* or other penalties in story generation, because finding a good trade-off between forced "no-repetition" and repeating cycles of identical *n-grams* requires significant fine-tuning.

- As argued in the paper "The Curious Case of Neural Text Degeneration" by [Ari Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751), high-quality human language does not strictly follow a distribution of high-probability next words.

![The Curious Case of Neural Text Degeneration](CuriousCaseOfNeuralTextDeGeneration.png)

In other words, as humans, we often prefer text with some elements of surprise, so that the dialog is less boring and less predictable.

In the figure below, the authors compare the probability a model assigns to human-written text with the probability of the text that beam search produces.

From that figure we can clearly conclude that the element of surprise is absent in beam search.

![alt text](https://blog.fastforwardlabs.com/images/2019/05/Screen_Shot_2019_05_08_at_3_06_36_PM-1557342561886.png)


So let's stop being boring and introduce some randomness 🤪.

In [10]:

# HTML('<img src="CuriousCaseOfNeuralTextDeGeneration.png" alt=" My PNG", width="600">')
# Image(filename='/drv3/hm3/code/python/torch2.6/local/Transformer/CuriousCaseOfNeuralTextDeGeneration.png', width=300, height=200)


### **Sampling**

In most contexts, sampling means randomly picking the next word $w_t$ according to a conditional probability distribution that takes into consideration recent history:

$$w_t \sim P(w|w_{1:t-1})$$

Taking the example from above, the following figure shows language generation when sampling.

![vanilla_sampling](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/sampling_search.png)

Language generation using sampling is no longer *deterministic*. 

The word **"car"** is sampled from the conditioned probability distribution P(w | "The"), followed by sampling "drives" from the distribution P(w | "The", "car").
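A single sampling step can be sketched with the standard library alone. The distribution `NEXT_WORD_PROBS` and the helper `sample_next_word` are hypothetical; the probabilities are assumed toy values for P(w | "The").

```python
import random

# Assumed toy distribution P(w | "The"); not produced by a real model.
NEXT_WORD_PROBS = {"nice": 0.5, "dog": 0.4, "car": 0.1}

def sample_next_word(probs, rng):
    """Draw one word at random, weighted by its conditional probability."""
    words = list(probs)
    weights = [probs[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so that the run is reproducible
draws = [sample_next_word(NEXT_WORD_PROBS, rng) for _ in range(10_000)]
for word in NEXT_WORD_PROBS:
    # Empirical frequencies should be close to the probabilities above.
    print(word, draws.count(word) / len(draws))
```

Unlike greedy search, a low-probability word such as "car" is picked from time to time, which is exactly what makes the output non-deterministic.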

In `transformers`, we set `do_sample=True` and deactivate *Top-K* sampling (explained further below) via `top_k=0`. 

In the following cell, we fix the random seed with the helper `set_seed` for illustration purposes. To get a better feel for what is happening, change the seed value and rerun the cell.


In [18]:
# set seed to reproduce results. Feel free to change the seed though to get different results

import random
import numpy as np

def set_seed(seed_value=42):
    """Sets the seed for reproducibility."""
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value) # gpu vars
    torch.backends.cudnn.deterministic = True  #needed if using cuda
    torch.backends.cudnn.benchmark = False

# Change the seed several times to see the effect
set_seed(1)  # plays the role that tf.random.set_seed plays in TensorFlow

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt')
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids,
    attention_mask=attention_mask, # Add attention mask here
    do_sample=True,
    max_length=50,
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog when there are other people around, though.

No, ladies, enjoying your dog and publicly embracing her is not my thing. It doesn't even bother me, woman-like. I'm happy you think


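The text reads reasonably well at first, but on closer inspection it is not very coherent. This is the main problem when sampling word sequences: the model often produces incoherent text, *cf.* [Ari Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751).

A common trick is to make the distribution $P(w|w_{1:t-1})$ sharper, increasing the likelihood of high-probability words and decreasing the likelihood of low-probability words, by lowering the `temperature` of the softmax. In `transformers`, this is the `temperature` parameter of `generate`.

While applying temperature can make a distribution less random, in the limit `temperature` $\to 0$ temperature-scaled sampling becomes equal to greedy decoding and suffers from the same problems as before.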
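The effect of temperature can be illustrated with a small softmax written in plain Python. The function `softmax_with_temperature` and the logits are assumptions made for this sketch; they do not come from the model used in this notebook.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature before applying the softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # assumed toy logits for three candidate words
for temperature in (1.0, 0.7, 0.01):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature decreases, more and more probability mass moves to the word with the largest logit, so sampling behaves increasingly like greedy decoding.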


### **Top-K Sampling**

[Fan et al. (2018)](https://arxiv.org/pdf/1805.04833.pdf) introduced a simple but powerful sampling scheme called ***Top-K*** sampling. 

In *Top-K* sampling, the *K* most likely next words are filtered and the probability mass is redistributed among only those *K* next words.

GPT2 adopted this sampling scheme, which was one of the reasons for its success in story generation.

To better illustrate *Top-K* sampling, we extend the range of words used for both sampling steps in the example above from 3 words to 10 words.

![top_k_sampling](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/top_k_sampling.png)

Having set $K = 6$, in both sampling steps we limit our sampling pool to 6 words. 

Note that in step 1 the 6 most likely words, denoted $V_{\text{top-K}}$, encompass only around two-thirds of the whole probability mass.

In the second step the method includes almost all of the probability mass.

Observe that the method successfully eliminates the other candidates *"not", "the", "small", "told"* in the second sampling step.
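The filtering step can be sketched in plain Python. The helper `top_k_filter` is a hypothetical name, and the ten probabilities are assumed toy values that do not reproduce the numbers in the figure.

```python
def top_k_filter(probs, k):
    """Keep the k most likely words and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}

# Assumed toy distribution over 10 candidate words (sums to 1).
next_probs = {
    "nice": 0.30, "dog": 0.20, "car": 0.15, "woman": 0.10, "guy": 0.08,
    "man": 0.07, "people": 0.04, "big": 0.03, "house": 0.02, "cat": 0.01,
}

filtered = top_k_filter(next_probs, k=6)
print(filtered)  # only 6 words remain; their probabilities sum to 1 again
```

Sampling then proceeds exactly as before, but only over the filtered and renormalized pool.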

Let's see how *Top-K* can be used in the library by setting `top_k=50`:

In [20]:
# set seed to reproduce results. Change the seed though to get different results
# tf.random.set_seed(0)

set_seed(0) # Equivalent to tf.random.set_seed(0)


tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt')
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']

# set top_k to 50
sample_output = model.generate(
    input_ids,
    attention_mask=attention_mask, # Add attention mask here
    do_sample=True,
    max_length=50,
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog," she says. "You get a lot of love and support out of it. It has helped me to be open and see why and what I have to do to be successful."

I'd say the


Now the text is arguably the most *human-sounding* text so far.

One concern with *Top-K* sampling is that it does not dynamically adapt the number of words that are filtered from the next word probability distribution $P(w|w_{1:t-1})$.

This can be problematic as some words might be sampled from a very sharp distribution (distribution on the right in the graph above), whereas others from a much more flat distribution (distribution on the left in the graph above).

In step $t=1$, *Top-K* eliminates the possibility to sample *"people", "big", "house", "cat"*,  which seem like reasonable candidates. 

On the other hand, in step $t=2$ the method includes the arguably ill-fitted words *"down", "a"* in the sample pool of words.

Thus, limiting the sample pool to a fixed size *K* risks producing gibberish for sharp distributions and limits the model's creativity for flat distributions.

This intuition led [Ari Holtzman et al. (2019)](https://arxiv.org/abs/1904.09751) to create ***Top-p***- or ***nucleus***-sampling.


### **Top-p (nucleus) sampling**

Instead of sampling only from the most likely *K* words, *Top-p* sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability *p*. 

The probability mass is then redistributed among this set of words. In this way, the size of the set of words (*a.k.a* the number of words in the set) can dynamically increase or decrease according to the next word's probability distribution.

The figure below illustrates the concept:

![top_p_sampling](https://github.com/patrickvonplaten/scientific_images/blob/master/top_p_sampling.png?raw=true)

Having set $p=0.92$, *Top-p* sampling picks the *minimum* number of words to exceed a cumulative value of $p=92\%$ of the probability mass, defined as $V_{\text{top-p}}$. 

In the first example, this included the 9 most likely words, whereas it only has to pick the top 3 words in the second example to exceed 92%. It can be seen that this method keeps a wide range of words where the next word is arguably less predictable, *e.g.* P(w | "The"), and only a few words when the next word seems more predictable, *e.g.* P(w | "The", "car").
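The dynamic pool size can be reproduced with a short sketch. The helper `top_p_filter` and both toy distributions are assumptions made for illustration; a word is added to the pool until the cumulative probability reaches `p`.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most likely words whose mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus is complete
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}

# Assumed toy distributions: one flat, one sharp.
flat = {"nice": 0.30, "dog": 0.25, "car": 0.20, "woman": 0.15, "guy": 0.10}
sharp = {"drives": 0.80, "is": 0.15, "turns": 0.03, "stops": 0.01, "jumps": 0.01}

print(len(top_p_filter(flat, p=0.92)))   # many words are kept
print(len(top_p_filter(sharp, p=0.92)))  # only a few words are kept
```

The same value of `p` yields a large pool for the flat distribution and a small pool for the sharp one, which is the adaptivity that a fixed *K* lacks.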

We activate *Top-p* sampling by setting `0 < top_p < 1`:

In [28]:
# set seed to reproduce results. Change the seed though to get different results
# tf.random.set_seed(0)
# Note the  params top_p = 0.92 and top_k=0
# set them to different values and compare results

set_seed(0)

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt')
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']

# deactivate top_k sampling and sample only from 92% most likely words
sample_output = model.generate(
    input_ids,
    attention_mask=attention_mask, # Add attention mask here
    do_sample=True,
    max_length=50,
    top_p=0.92,
    top_k=0
)

print("Output:\n" + 100 * '-')

print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog," she says. "You get a lot of love and eventually a great guy comes in with your national credentials. He gives you a virtual identity as a dog owner. You get second chances. It's a fascinating


The text now reads better. 

While in theory, *Top-p* seems more elegant than *Top-K*, both methods work well in practice.

*Top-p* can also be used in combination with *Top-K*, which can avoid very low ranked words while allowing for some dynamic selection.

Finally, to get multiple independently sampled outputs, we can *again* set the parameter `num_return_sequences > 1`:

In [None]:
# set seed to reproduce results. Change the seed though to get different results
# tf.random.set_seed(0)
# change params to compare results 

set_seed(0) # Equivalent to tf.random.set_seed(0)

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

inputs = tokenizer('I enjoy walking with my cute dog', return_tensors='pt')
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    input_ids,
    attention_mask=attention_mask,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog," she says. "You get a lot of love and support out of it. It has helped me to be open and see what's really cool. I'm happy to see people are supporting my cause and just
1: I enjoy walking with my cute dog. I would also like to see a new feature for our cats, the cute bear, that is called 'Spend Your Sunday, Beating Dogs, by Feeding Dogs'.

Please see our page for
2: I enjoy walking with my cute dog, but I would definitely encourage anyone that will play around with your dog's ears to use a bit of patience and patience.

The dog's ears should be removed right away. After they are gone from the


Now we have some tools we could use to write stories with `transformers`!

### **Conclusions, Part 1**

*Top-p* and *Top-K* sampling are *ad-hoc* decoding methods that seem to produce more fluent text than traditional *greedy* and *beam* search on open-ended language generation.

Recent evidence has revealed that the problems of *greedy* and *beam* search (mainly generating repetitive word sequences) are caused by the model (especially the way the model is trained), rather than the decoding methods, *cf.* [Welleck et al. (2019)](https://arxiv.org/pdf/1908.04319.pdf). 

Moreover, [Welleck et al. (2020)](https://arxiv.org/abs/2002.02492) show that *top-K* and *top-p* sampling also suffer from generating repetitive word sequences.

In [Welleck et al. (2019)](https://arxiv.org/pdf/1908.04319.pdf), the authors show that according to human evaluations, *beam* search can generate more fluent text than *Top-p* sampling, when adapting the model's training objective.

Open-ended language generation is a rapidly evolving field of research. As is often the case in rapidly evolving fields, there is no one-size-fits-all solution.

Therefore the best method is often the one discovered by experimenting with the specific use case. 

Fortunately, it is now possible (and easy) to try out several decoding methods using `transformers` 🤗.

That was a short introduction on how to use different decoding methods in `transformers` and recent trends in open-ended language generation.

### **Conclusions, Part 2**

There are a couple of additional parameters for the `generate` method that were not mentioned above. They are explained briefly here.

- `min_length` can be used to force the model to not produce an EOS token (= not finish the sentence) before `min_length` is reached. This is used quite frequently in summarization, but can be useful in general if the user wants to have longer outputs.

- `repetition_penalty` can be used to penalize words that were already generated or belong to the context. It was first introduced by [Keskar et al. (2019)](https://arxiv.org/abs/1909.05858) and is also used in the training objective in [Welleck et al. (2019)](https://arxiv.org/pdf/1908.04319.pdf). It can be quite effective at preventing repetitions, but seems to be very sensitive to different models and use cases, *e.g.* see this [discussion](https://github.com/huggingface/transformers/pull/2303) on GitHub.

- `attention_mask` can be used to mask padded tokens.

- `pad_token_id`, `bos_token_id`, `eos_token_id`: If the model does not have those tokens by default, the user can manually choose other token ids to represent them.
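
To give a feel for what `repetition_penalty` does, here is a minimal sketch, assuming the common formulation in which the logit of every already seen token is divided by the penalty when it is positive and multiplied by the penalty when it is negative. The helper `apply_repetition_penalty` and the toy logits are hypothetical and only illustrate the idea; they are not the library's code.

```python
def apply_repetition_penalty(logits, seen_tokens, penalty):
    """Lower the logits of tokens that already appear in the context or output."""
    adjusted = dict(logits)
    for token in set(seen_tokens):
        if token in adjusted:
            score = adjusted[token]
            # Assumed rule: both branches make the seen token less likely.
            adjusted[token] = score * penalty if score < 0 else score / penalty
    return adjusted

logits = {"dog": 2.0, "cat": 1.0, "the": -1.0}  # assumed toy logits
print(apply_repetition_penalty(logits, seen_tokens=["dog", "the"], penalty=2.0))
```

A penalty of 1.0 leaves the logits unchanged, while larger values discourage repeated tokens more strongly.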
