In [15]:
!pip install -U pip
!pip install -q tensorflow
!pip install -q transformers
!pip install -q torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
%cd /content/drive/MyDrive/Code
#!wget https://transformers-models.obs.cn-north-4.myhuaweicloud.com/gpt/cn/pretrain/gpt2_L-12_H-768_A-12_CN.zip
!unzip gpt2_L-12_H-768_A-12_CN.zip

/content/drive/MyDrive/Code
Archive:  gpt2_L-12_H-768_A-12_CN.zip
   creating: gpt2_L-12_H-768_A-12_CN/
  inflating: gpt2_L-12_H-768_A-12_CN/config.json  
  inflating: gpt2_L-12_H-768_A-12_CN/vocab.txt  
  inflating: gpt2_L-12_H-768_A-12_CN/pytorch_model.bin  


In [6]:
import tensorflow as tf
import torch
from transformers import GPT2LMHeadModel, BertTokenizer


tokenizer = BertTokenizer.from_pretrained("gpt2_L-12_H-768_A-12_CN")

# use token id 0 as the PAD token to avoid warnings during generation
model = GPT2LMHeadModel.from_pretrained("gpt2_L-12_H-768_A-12_CN", pad_token_id=0)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'GPT2Tokenizer'. 
The class this function is called from is 'BertTokenizer'.


#Greedy Search
Starting from the word "The", the algorithm greedily chooses the next word of highest probability, "nice", and so on, so that the final generated word sequence is ("The","nice","woman"), with an overall probability of 0.5×0.4=0.2.
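The greedy procedure can be sketched on a toy next-word table. This is only an illustration, not the transformers implementation: the table `NEXT_WORD_PROBS` and the function `greedy_decode` are hypothetical names, and every probability other than the 0.5 and 0.4 quoted above is an assumed value.

```python
# Toy conditional distributions P(next word | words so far).
# Only 0.5 ("nice") and 0.4 ("woman") come from the text; the rest are assumed.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def greedy_decode(start, steps):
    """Pick the single most probable next word at every step."""
    sequence, probability = tuple(start), 1.0
    for _ in range(steps):
        candidates = NEXT_WORD_PROBS.get(sequence)
        if not candidates:  # no known continuation: stop early
            break
        word, p = max(candidates.items(), key=lambda item: item[1])
        sequence, probability = sequence + (word,), probability * p
    return sequence, probability


sequence, probability = greedy_decode(("The",), steps=2)
print(sequence, round(probability, 2))
```

With these assumed numbers, the sketch follows "The" with "nice" and then "woman", matching the sequence and the overall probability of 0.2 described above.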

In [10]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('远上寒山石径斜，', return_tensors='pt')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
远 上 寒 山 石 径 斜 ， 人 咫 尺 起 天 涯 。 云 间 若 有 招 巢 士 ， 好 看 嵇 生 削 鹄 华 。 说 天 边 产 宝 峰 ， 云 霞 彷 佛 洞 天 空 。 莫


The generated words following the context are reasonable, but the model quickly starts repeating itself! This is a very common problem in language generation in general and seems to be even more so in greedy and beam search - check out Vijayakumar et al., 2016 and Shao et al., 2017.

The word "has", with its high conditional probability of 0.9, is hidden behind the word "dog", which has only the second-highest conditional probability, so greedy search misses the word sequence "The","dog","has".

#Beam Search

In [11]:
# activate beam search and early_stopping
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
远 上 寒 山 石 径 斜 ， 平 今 日 已 无 家 。 十 年 一 觉 河 山 梦 ， 独 羡 随 州 几 树 花 。 想 天 台 与 广 寒 ， 琼 楼 玉 宇 路 漫 漫 。 青


Beam search reduces the risk of missing hidden high-probability word sequences by keeping the num_beams most likely hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability.

Beam search typically finds an output sequence with a probability at least as high as that of greedy search, but it is not guaranteed to find the most likely output.
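To make the mechanism concrete, here is a minimal beam search over a toy next-word table. It is a sketch under stated assumptions, not the transformers implementation: `NEXT_WORD_PROBS` and `beam_search` are hypothetical names, and all probabilities are assumed for illustration except the 0.9 for "has" after "dog" mentioned earlier.

```python
# Toy conditional distributions P(next word | words so far); values are assumed,
# except that "has" follows "dog" with probability 0.9 as in the text.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def beam_search(start, steps, num_beams):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [(tuple(start), 1.0)]
    for _ in range(steps):
        candidates = []
        for sequence, probability in beams:
            for word, p in NEXT_WORD_PROBS.get(sequence, {}).items():
                candidates.append((sequence + (word,), probability * p))
        if not candidates:  # nothing left to expand
            break
        # Sort all expansions by probability and keep only the best num_beams.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


# num_beams=1 reduces to greedy search; num_beams=2 also explores "dog".
greedy_sequence, greedy_probability = beam_search(("The",), 2, num_beams=1)[0]
best_sequence, best_probability = beam_search(("The",), 2, num_beams=2)[0]
print(greedy_sequence, round(greedy_probability, 2))
print(best_sequence, round(best_probability, 2))
```

With two beams, the hypothesis starting with "dog" survives the first step, so the sequence ending in "has" can overtake the greedy choice in the second step.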

In transformers, beam search is enabled by setting num_beams > 1. With early_stopping=True, generation finishes once all beam hypotheses have reached the EOS token.

#Fluency

While the result is arguably more fluent, the output can still include repetitions of the same word sequences.
A simple remedy is to introduce n-gram penalties (an n-gram is a sequence of n words), as introduced by Paulus et al. (2017) and Klein et al. (2017). The most common n-gram penalty makes sure that no n-gram appears twice by manually setting to 0 the probability of any next word that would create an already seen n-gram.
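The n-gram penalty can be sketched as a function that, given the words generated so far, returns the next words that would complete an already seen n-gram. The name `banned_next_tokens` is hypothetical, and this is an illustration of the idea rather than the library's code.

```python
def banned_next_tokens(tokens, n):
    """Return the next tokens that would repeat an n-gram already in `tokens`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(tokens) < n:
        return set()  # no complete n-gram has been generated yet
    # The last n-1 tokens form the prefix of the n-gram we are about to complete.
    prefix = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])  # this token would recreate a seen n-gram
    return banned


generated = ["the", "dog", "saw", "the"]
print(banned_next_tokens(generated, n=2))
```

In a decoder, the probability of every banned token would be set to 0 before choosing the next word. Here "dog" is banned because the 2-gram ("the", "dog") has already appeared.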

In [12]:
# set no_repeat_ngram_size to 2
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
远 上 寒 山 石 径 斜 ， 平 今 日 已 无 家 。 十 年 一 觉 河 山 梦 ， 独 羡 随 州 几 树 花 。 想 天 台 与 广 寒 ， 琼 楼 玉 宇 路 漫 漫 。 青


Another important feature about beam search is that we can compare the top beams after generation and choose the generated beam that fits our purpose best.

In transformers, we simply set the parameter num_return_sequences to the number of highest scoring beams that should be returned. Make sure though that num_return_sequences <= num_beams!

In [13]:
# set num_return_sequences > 1
beam_outputs = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True
)

# now we have 5 output sequences
print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: 远 上 寒 山 石 径 斜 ， 平 今 日 已 无 家 。 十 年 一 觉 河 山 梦 ， 独 羡 随 州 几 树 花 。 想 天 台 与 广 寒 ， 琼 楼 玉 宇 路 漫 漫 。 青
1: 远 上 寒 山 石 径 斜 ， 平 今 日 已 无 家 。 十 年 一 觉 河 山 梦 ， 独 羡 随 州 几 树 花 。 想 天 台 与 广 寒 ， 琼 楼 玉 宇 路 漫 漫 。 真
2: 远 上 寒 山 石 径 斜 ， 平 今 日 已 无 家 。 十 年 一 觉 河 山 梦 ， 独 羡 随 州 几 树 花 。 想 天 台 与 广 寒 ， 琼 楼 玉 宇 路 漫 漫 。 相
3: 远 上 寒 山 石 径 斜 ， 平 今 日 已 无 家 。 十 年 一 觉 河 山 梦 ， 独 羡 随 州 几 树 花 。 想 天 台 与 广 寒 ， 琼 楼 玉 宇 路 漫 漫 。 秋
4: 远 上 寒 山 石 径 斜 ， 平 今 日 已 无 家 。 十 年 一 觉 河 山 梦 ， 独 羡 随 州 几 树 花 。 想 天 台 与 广 寒 ， 琼 楼 玉 宇 路 漫 漫 。 多


In open-ended generation, a couple of reasons have recently been brought forward why beam search might not be the best possible option:

Beam search can work very well in tasks where the length of the desired generation is more or less predictable as in machine translation or summarization - see Murray et al. (2018) and Yang et al. (2018). But this is not the case for open-ended generation where the desired output length can vary greatly, e.g. dialog and story generation.

We have seen that beam search heavily suffers from repetitive generation. This is especially hard to control with n-gram- or other penalties in story generation since finding a good trade-off between forced "no-repetition" and repeating cycles of identical n-grams requires a lot of finetuning.

As argued in Ari Holtzman et al. (2019), high-quality human language does not follow a distribution of high-probability next words. In other words, as humans, we want generated text to surprise us and not to be boring or predictable. The authors show this nicely by plotting the probability a model would give to human text vs. what beam search does.

#Sampling

In its most basic form, sampling means randomly picking the next word $w_t$ according to its conditional probability distribution: $w_t \sim P(w \mid w_{1:t-1})$.

In transformers, we set do_sample=True and deactivate Top-K sampling (more on this later) via top_k=0. In the following, we fix the random seed to 0 via torch.random.manual_seed(0) for illustration purposes. Feel free to change the seed to play around with the model.
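The basic sampling step described above can be sketched with the standard library. The distribution and the helper `sample_next_word` are assumptions made for illustration; a real model would supply the probabilities at every step.

```python
import random


def sample_next_word(distribution, rng):
    """Draw one word at random, weighted by its conditional probability."""
    words = list(distribution)
    weights = [distribution[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is reproducible
distribution = {"nice": 0.5, "dog": 0.4, "car": 0.1}  # assumed toy values
samples = [sample_next_word(distribution, rng) for _ in range(10_000)]
frequencies = {word: samples.count(word) / len(samples) for word in distribution}
print(frequencies)
```

Over many draws the empirical frequencies approach the given probabilities, but any single draw can return a low-probability word, which is why pure sampling can go off track.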

Interesting! The text seems alright - but when taking a closer look, it is not very coherent. The 3-grams "new hand sense" and "local batte harness" are very weird and don't sound like they were written by a human. That is the big problem when sampling word sequences: the models often generate incoherent gibberish, cf. Ari Holtzman et al. (2019).

In [18]:
# set seed to reproduce results. Feel free to change the seed though to get different results
torch.random.manual_seed(0)

# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0
)

print("Output:\n" + 50 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
--------------------------------------------------
远 上 寒 山 石 径 斜 ， 边 刚 住 野 人 家 。 竹 笼 宿 露 生 秋 后 ， 木 笔 春 风 起 阵 斜 。 妇 女 残 妆 溪 涧 侧 ， 樵 苏 归 担 野 田 斜 。 道


A trick is to make the distribution sharper (increasing the likelihood of high-probability words and decreasing the likelihood of low-probability words) by lowering the so-called temperature of the softmax.

An illustration of applying temperature to our example from above could look as follows.
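As a numeric sketch of the same idea, the function below divides the logits by the temperature before applying the softmax. The logits are made-up values and `softmax_with_temperature` is a hypothetical helper, not a transformers function.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; temperature < 1 sharpens the result."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_scaled) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # assumed toy scores for three candidate words
plain = softmax_with_temperature(logits, temperature=1.0)
sharp = softmax_with_temperature(logits, temperature=0.7)
print([round(p, 3) for p in plain])
print([round(p, 3) for p in sharp])
```

At temperature 0.7 the most likely word gains probability and the least likely word loses it. As the temperature approaches 0, sampling behaves more and more like greedy decoding.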

In [19]:
# set seed to reproduce results. Feel free to change the seed though to get different results
torch.random.manual_seed(0)

# use temperature to decrease the sensitivity to low probability candidates
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0,
    temperature=0.7
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
远 上 寒 山 石 径 斜 ， 边 刚 住 野 人 家 。 竹 笼 宿 雾 悬 秋 月 ， 松 阁 钟 声 下 晓 。 行 尽 水 乡 人 不 见 ， 寻 来 僧 舍 客 犹 赊 。 请


#Top-K Sampling
Fan et al. (2018) introduced a simple but very powerful sampling scheme called Top-K sampling. In Top-K sampling, only the K most likely next words are kept and the probability mass is redistributed among those K words. GPT2 adopted this sampling scheme, which was one of the reasons for its success in story generation.

We extend the range of words used for both sampling steps in the example above from 3 words to 10 words to better illustrate Top-K sampling.
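The filtering step can be sketched in a few lines. The toy distribution and the helper name `top_k_filter` are assumptions for illustration; the library applies the same idea to the model's scores over the full vocabulary.

```python
def top_k_filter(distribution, k):
    """Keep the k most likely words and renormalize their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}


# Assumed toy next-word distribution.
distribution = {"nice": 0.4, "dog": 0.3, "car": 0.15, "house": 0.1, "cat": 0.05}
filtered = top_k_filter(distribution, k=2)
print(filtered)
```

After filtering with K=2, only "nice" and "dog" can be sampled, and their probabilities are rescaled so that they sum to 1.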

In [20]:
# set seed to reproduce results. Feel free to change the seed though to get different results
torch.random.manual_seed(0)

# set top_k to 50
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
远 上 寒 山 石 径 斜 ， 边 刚 住 野 人 家 。 竹 笼 宿 露 生 秋 后 ， 草 流 萤 拾 涧 花 。 隔 坞 泉 声 如 隐 隐 ， 隔 林 风 送 似 云 霞 。 白


In step t=1, Top-K eliminates the possibility of sampling ("people","big","house","cat"), which seem like reasonable candidates. On the other hand, in step t=2 the method includes the arguably ill-fitted words ("down","a") in the sample pool of words. Thus, limiting the sample pool to a fixed size K could lead the model to produce gibberish for sharp distributions and limit the model's creativity for flat distributions. This intuition led Ari Holtzman et al. (2019) to create Top-p or nucleus sampling.

#Top-p (nucleus) sampling
Instead of sampling only from the most likely K words, Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p. The probability mass is then redistributed among this set of words. This way, the size of the set of words (i.e. the number of words in the set) can dynamically increase and decrease according to the next word's probability distribution. Ok, that was very wordy, let's visualize.
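A small numeric sketch shows how the set size adapts. The two distributions are assumed toy values, `top_p_filter` is a hypothetical helper, and stopping as soon as the cumulative probability reaches p (using >=) is a choice made for this sketch.

```python
def top_p_filter(distribution, p):
    """Keep the smallest set of top words whose cumulative probability reaches p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:  # the nucleus is complete
            break
    # Redistribute the probability mass among the kept words.
    return {word: prob / cumulative for word, prob in kept}


# Assumed toy distributions: one sharp, one flat.
sharp = {"car": 0.9, "drives": 0.06, "is": 0.03, "a": 0.01}
flat = {"nice": 0.3, "dog": 0.25, "car": 0.2, "house": 0.15, "cat": 0.1}
sharp_nucleus = top_p_filter(sharp, p=0.92)
flat_nucleus = top_p_filter(flat, p=0.92)
print(len(sharp_nucleus), len(flat_nucleus))
```

With p=0.92, the sharp distribution keeps only two words while the flat one keeps all five, which is the adaptive behavior a fixed K cannot provide.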


In [21]:
# set seed to reproduce results. Feel free to change the seed though to get different results
torch.random.manual_seed(0)

# deactivate top_k sampling and sample only from the smallest set of words whose cumulative probability exceeds 0.92
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_p=0.92,
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
远 上 寒 山 石 径 斜 ， 边 刚 住 野 人 家 。 竹 笼 宿 露 生 秋 色 ， 松 闪 轻 风 伴 日 华 。 海 曲 争 先 人 去 少 ， 伊 川 不 待 客 思 遐 。 好


Great, that sounds like it could have been written by a human. Well, maybe not quite yet.

While in theory, Top-p seems more elegant than Top-K, both methods work well in practice. Top-p can also be used in combination with Top-K, which can avoid very low ranked words while allowing for some dynamic selection.

Finally, to get multiple independently sampled outputs, we can again set the parameter num_return_sequences > 1:

In [22]:
# set seed to reproduce results. Feel free to change the seed though to get different results
torch.random.manual_seed(0)

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: 远 上 寒 山 石 径 斜 ， 边 刚 住 野 人 家 。 竹 笼 宿 露 生 秋 色 ， 松 闪 轻 风 伴 日 华 。 海 曲 争 先 人 去 少 ， 茅 檐 相 近 鸟 啼 哗 。 好
1: 远 上 寒 山 石 径 斜 ， 人 惆 怅 未 还 家 。 不 辞 满 酌 重 阳 酒 ， 多 少 清 明 未 落 花 。 笑 声 名 动 京 洛 ， 欲 将 身 世 问 烟 霞 。 故
2: 远 上 寒 山 石 径 斜 ， 平 今 日 墓 田 家 。 风 云 惨 澹 愁 眉 雨 ， 草 木 荒 凉 宿 夜 。 何 日 铭 功 归 夜 帐 ， 谁 人 扶 立 看 槐 花 。 一


#Conclusion
As ad-hoc decoding methods, top-p and top-K sampling seem to produce more fluent text than traditional greedy and beam search on open-ended language generation. Recently, though, there has been more evidence that the apparent flaws of greedy and beam search - mainly generating repetitive word sequences - are caused by the model (especially the way the model is trained), rather than the decoding method, cf. Welleck et al. (2019). Also, as demonstrated in Welleck et al. (2020), it looks as if top-K and top-p sampling also suffer from generating repetitive word sequences.

In Welleck et al. (2019), the authors show that according to human evaluations, beam search can generate more fluent text than Top-p sampling, when adapting the model's training objective.

Open-ended language generation is a rapidly evolving field of research and, as is often the case, there is no one-size-fits-all method here, so one has to see what works best in one's specific use case.

Good thing that you can try out all the different decoding methods in transformers 🤗.

That was a short introduction on how to use different decoding methods in transformers and recent trends in open-ended language generation.

Feedback and questions are very welcome on the GitHub repository.

For more fun generating stories, please take a look at Writing with Transformers.

Thanks to everybody who has contributed to the blog post: Alexander Rush, Julien Chaumond, Thomas Wolf, Victor Sanh, Sam Shleifer, Clément Delangue, Yacine Jernite, Oliver Åstrand and John de Wasseige.

#Appendix
There are a couple of additional parameters for the generate method that were not mentioned above. We will explain them here briefly!

min_length can be used to force the model to not produce an EOS token (= not finish the sentence) before min_length is reached. This is used quite frequently in summarization, but can be useful in general if the user wants to have longer outputs.

repetition_penalty can be used to penalize words that were already generated or belong to the context. It was first introduced by Keskar et al. (2019) and is also used in the training objective in Welleck et al. (2019). It can be quite effective at preventing repetitions, but seems to be very sensitive to different models and use cases, e.g. see this discussion on Github.

attention_mask can be used to mask padded tokens.

pad_token_id, bos_token_id, eos_token_id: If the model does not have those tokens by default, the user can manually choose other token ids to represent them.
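To give a feel for repetition_penalty, the sketch below rescales the scores of tokens that have already been generated. It is one common formulation written from an understanding of the idea in Keskar et al. (2019), not a transcription of the library code; the helper name `apply_repetition_penalty`, the handling of negative scores, and the toy logits are all assumptions.

```python
def apply_repetition_penalty(logits, previous_token_ids, penalty):
    """Lower the scores of already generated tokens (penalty > 1 discourages repeats)."""
    if penalty <= 0:
        raise ValueError("penalty must be positive")
    adjusted = list(logits)
    for token_id in set(previous_token_ids):
        score = adjusted[token_id]
        # Positive scores are divided and negative scores multiplied,
        # so the score moves down in both cases when penalty > 1.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


logits = [2.0, -1.0, 0.5, 3.0]  # assumed toy scores, indexed by token id
adjusted = apply_repetition_penalty(logits, previous_token_ids=[0, 1, 0], penalty=2.0)
print(adjusted)
```

Tokens 0 and 1 were generated before, so their scores drop, while the scores of unseen tokens 2 and 3 stay unchanged. A penalty of 1.0 leaves all scores as they are.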

For more information please also look into the generate function docstring.