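### Text Generation

#### The Challenge with Generating Coherent Text

> Unlike tasks such as token or sentence classification, where taking the argmax of the model output directly yields a prediction, text generation needs a dedicated decoding method to turn the model's probabilistic output into text. This brings its own challenges:
> - Decoding is done iteratively, so it requires significantly more compute than a single forward pass through the model.
> - The quality and diversity of the generated text depend on the choice of decoding method and its associated hyperparameters.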

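### How GPT-2 Works
> Like other *autoregressive* or *causal language models*, GPT-2 is trained to estimate the probability $P(y|x)$ of a sequence of tokens $y = y_1,y_2,\ldots,y_N$, given an initial prompt or context sequence $x = x_1,x_2,\ldots,x_k$ supplied by the user. Since it is impractical to acquire enough training data to estimate $P(y|x)$ directly, it is common to use the chain rule of probability to factorize it as a *product of conditional probabilities*:
$$
P(y_1,\ldots,y_N|x) = \prod \limits_{t=1}^{N} P(y_t|y_{<t},x)
$$
where $y_{<t}$ is shorthand for the sequence $y_1,y_2,\ldots,y_{t-1}$. This factorization is where the autoregressive picture comes from: each token is predicted from all the tokens that precede it, which is exactly what the right-hand side of the equation describes.  
In summary, GPT-2 inference roughly proceeds as follows:  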
<img src="./GPTWorkflow.png " width="800" height="400" alt="GPT Workflow"/>  
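At the heart of this process lies the decoding method, which determines which token is selected at each timestep. The model head produces a logit $Z_{t,i}$ for every token $w_i$ in the vocabulary at each step, and applying softmax turns these logits into a probability distribution over the next token:
$$
P(y_t = w_i|y_{<t},x) = \mathrm{softmax}(Z_{t,i})
$$
The goal of most decoding methods is to search for the most likely overall sequence $\hat{y}$, that is:
$$
\hat{y} = \operatorname*{argmax}_{y} P(y|x)
$$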
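
The two formulas above can be illustrated with a small pure-Python sketch. The three-token vocabulary, the logits, and the chosen tokens are all made up for illustration; a real model such as GPT-2 would supply one logit per vocabulary entry at every timestep.

```python
import math

def softmax(logits):
    """Turn a list of raw scores (logits) into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits for two consecutive timesteps.
vocab = ["cat", "dog", "fish"]
logits_per_step = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
chosen = [0, 1]  # indices of the selected tokens: "cat", then "dog"

sequence_prob = 1.0
for step_logits, idx in zip(logits_per_step, chosen):
    probs = softmax(step_logits)
    sequence_prob *= probs[idx]  # chain rule: multiply the conditional probabilities
    print(vocab[idx], round(probs[idx], 3))

print("P(sequence) =", sequence_prob)
```

Each conditional probability is below 1, so the sequence probability shrinks with every additional token.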

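### Greedy Search Decoding
> This is the simplest way to obtain discrete tokens from the model's continuous output: at each timestep, greedily select the token with the highest probability:
$$
\hat{y}_t = \operatorname*{argmax}_{y_t} P(y_t|y_{<t},x)
$$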

In [1]:
# Load the GPT2-XL model to try out greedy search decoding
import torch
from transformers import AutoTokenizer,AutoModelForCausalLM # tokenizer class and causal language model class
from tqdm.auto import tqdm
# Use the first CUDA GPU if available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model_name = "gpt2-xl" # model checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name) # load the tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name).to(device) # load the model and move it to the device

In [2]:
# Before studying decoding methods, let the model predict step by step to see how it completes a sentence
import pandas as pd

input_txt = "Transformers are the"
input_ids = tokenizer(input_txt,return_tensors="pt")["input_ids"].to(device)
iterations = []
n_steps = 8
choices_per_steps = 5

# Run the predictions
with torch.no_grad():
    with tqdm(total=n_steps,unit="steps") as pbar:
        for i in range(n_steps):
            pbar.set_description(f"Executing inference steps {i+1}")
            iteration = {}
            iteration["Input"] = tokenizer.decode(input_ids[0])
            output = model(input_ids=input_ids)
            # Take the logits of the first batch item at the last position, then apply softmax to get next-token probabilities
            next_token_logits = output.logits[0,-1,:]
            next_token_probs = torch.softmax(next_token_logits,dim=-1)
            # Sort the token IDs by descending probability
            sorted_ids = torch.argsort(next_token_probs,dim=-1,descending=True)
            # Store the tokens with the highest probabilities
            for choice_idx in range(choices_per_steps):
                token_id = sorted_ids[choice_idx]
                token_prob = next_token_probs[token_id].cpu().numpy()
                token_choice = (
                    f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f})"
                )
                iteration[f"Choice {choice_idx+1}"] = token_choice
            # Append the most probable next token to the input for the next step
            input_ids = torch.cat([input_ids,sorted_ids[None,0,None]],dim=-1)
            iterations.append(iteration)
            pbar.update(1)

In [3]:
pd.DataFrame(iterations)

| | Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | Transformers are the | most (8.53) | only (4.96) | best (4.65) | Transformers (4.37) | ultimate (2.16) |
| 1 | Transformers are the most | popular (16.78) | powerful (5.37) | common (4.96) | famous (3.72) | successful (3.20) |
| 2 | Transformers are the most popular | toy (10.63) | toys (7.23) | Transformers (6.60) | of (5.46) | and (3.76) |
| 3 | Transformers are the most popular toy | line (34.38) | in (18.20) | of (11.71) | brand (6.10) | line (2.69) |
| 4 | Transformers are the most popular toy line | in (46.28) | of (15.09) | , (4.94) | on (4.40) | ever (2.72) |
| 5 | Transformers are the most popular toy line in | the (65.99) | history (12.42) | America (6.91) | Japan (2.44) | North (1.40) |
| 6 | Transformers are the most popular toy line in the | world (69.26) | United (4.55) | history (4.29) | US (4.23) | U (2.30) |
| 7 | Transformers are the most popular toy line in ... | , (39.73) | . (30.64) | and (9.87) | with (2.32) | today (1.74) |


In [4]:
# Generate text with the model.generate() function
input_ids = tokenizer(input_txt,return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids,max_new_tokens=n_steps,do_sample=False) # max_new_tokens sets how many new tokens are generated
print(tokenizer.decode(output[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Transformers are the most popular toy line in the world,


In [5]:
# Generate a longer story
max_length = 128 # maximum total length in tokens (prompt plus generated text)
input_txt = """In a shocking finding,scientists discovered \
a herd of unicorns living in a remove,previously unexpected \
valley,in the Andes Mountains.Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English.\n\n
"""
input_ids = tokenizer(input_txt,return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids,max_length=max_length,do_sample=False)

print(tokenizer.decode(output_greedy[0]))


In a shocking finding,scientists discovered a herd of unicorns living in a remove,previously unexpected valley,in the Andes Mountains.Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers, from the University of California, Santa Cruz, and the University of Arizona, studied the unicorns' behavior and found that they were not only able to communicate with each other, but also with humans.


The researchers believe that the unicorns are able to communicate with humans because they are able to sense the emotions of humans.


The researchers believe that the unicorns are able


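> Greedy search decoding is fast, but it has a serious drawback: it *tends to produce repetitive text*, which is clearly undesirable in something like a news article. Because it only ever picks the single most probable token at each timestep, it can miss sequences whose overall probability is higher simply because they start with a less probable token.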
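
How a greedy choice can miss the most probable sequence is easiest to see on a tiny example. The sketch below uses a hypothetical two-step "language model" whose probabilities are invented for illustration, and compares the greedy result with an exhaustive search over all two-token sequences.

```python
# Hypothetical next-token probabilities; all numbers are made up.
first_step = {"the": 0.5, "a": 0.4, "an": 0.1}
second_step = {
    "the": {"cat": 0.4, "dog": 0.3, "end": 0.3},
    "a": {"cat": 0.9, "dog": 0.05, "end": 0.05},
    "an": {"cat": 0.1, "dog": 0.1, "end": 0.8},
}

# Greedy search: take the most probable token at every step.
g1 = max(first_step, key=first_step.get)
g2 = max(second_step[g1], key=second_step[g1].get)
greedy_prob = first_step[g1] * second_step[g1][g2]

# Exhaustive search: score every possible two-token sequence.
all_seqs = {
    (t1, t2): p1 * p2
    for t1, p1 in first_step.items()
    for t2, p2 in second_step[t1].items()
}
best_seq = max(all_seqs, key=all_seqs.get)

print("greedy:", (g1, g2), greedy_prob)
print("best:  ", best_seq, all_seqs[best_seq])
```

Greedy search commits to "the" at the first step, while the most probable sequence overall starts with the slightly less probable "a". Exhaustive search is only feasible here because the toy vocabulary is tiny.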

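### Beam Search Decoding
> Instead of blindly taking the most probable token at each timestep as greedy search does, beam search keeps track of the top-$b$ most probable next tokens, where $b$ is the number of beams, or partial hypotheses. This means beam search explores more candidate sequences. The next set of beams is chosen by considering all possible next-token extensions of the existing set and selecting the $b$ most likely extensions. The process repeats until the maximum length is reached or an `EOS (end of sequence)` token is produced, and the most likely sequence is selected by ranking the $b$ beams by their log probabilities.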
![BeamSearch](BeamSearch.png)
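
The procedure described above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the implementation used by `model.generate()`: the `NEXT` table is a hypothetical toy model whose next-token distribution depends only on the previous token, and no length normalization is applied.

```python
import math

# Hypothetical toy model: "<s>" starts a sequence and "<eos>" ends it.
NEXT = {
    "<s>": {"the": 0.5, "a": 0.4, "<eos>": 0.1},
    "the": {"cat": 0.4, "dog": 0.3, "<eos>": 0.3},
    "a": {"cat": 0.9, "dog": 0.05, "<eos>": 0.05},
    "cat": {"<eos>": 1.0},
    "dog": {"<eos>": 1.0},
}

def beam_search(num_beams, max_len=4):
    """Return the (tokens, log_prob) hypothesis with the best score."""
    beams = [(["<s>"], 0.0)]
    finished = []
    for _ in range(max_len):
        # Extend every live beam with every possible next token.
        candidates = []
        for tokens, score in beams:
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the num_beams highest-scoring extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            (finished if tokens[-1] == "<eos>" else beams).append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

for b in (1, 2):
    tokens, score = beam_search(num_beams=b)
    print(b, tokens, round(math.exp(score), 2))
```

With a single beam the search degenerates to greedy decoding and follows "the"; with two beams it keeps "a" alive long enough to find the more probable sequence.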

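> Why score sequences with log probabilities rather than the probabilities themselves? One reason is that computing the overall probability of a sequence, $P(y_1,y_2,\ldots,y_N|x)$, involves a product of conditional probabilities $P(y_t|y_{<t},x)$. Each factor is a number between 0 and 1, so the product quickly becomes too small to represent and underflows.  
> To avoid this numerical instability, we work with the log probability instead, which gives:
> $$
> \log P(y_1,\ldots,y_N|x) = \sum_{t=1}^{N} \log P(y_t|y_{<t},x)
> $$
> In other words, the product of probabilities we saw earlier becomes a sum of log probabilities, which is far less prone to numerical instability. For example, for a sequence of 1,024 tokens that each have probability 0.5, the sum of log probabilities is: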

In [6]:
import numpy as np

sum([np.log(0.5) * 1024]) # log(0.5) multiplied by the sequence length

-709.782712893384
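
For contrast, the sketch below accumulates both the raw product and the sum of logs for a longer sequence. The sequence length of 2,048 and the constant per-token probability of 0.5 are assumptions chosen only to make the underflow visible in double precision.

```python
import math

seq_len = 2048     # hypothetical sequence length
token_prob = 0.5   # assume every token has the same probability

product = 1.0
log_sum = 0.0
for _ in range(seq_len):
    product *= token_prob            # shrinks towards zero and eventually underflows
    log_sum += math.log(token_prob)  # stays a moderate negative number

print("product of probabilities:", product)
print("sum of log probabilities:", log_sum)
```

The product collapses to exactly `0.0`, which makes all long sequences indistinguishable, while the log sum remains an ordinary floating-point number that can still be compared.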

In [7]:
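# That is a number we can work with. Hugging Face models return unnormalized logits, so we need a function that normalizes them and picks out the log probability of each label token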
import torch.nn.functional as F

def log_probs_from_logits(logits,labels):
    logp = F.log_softmax(logits,dim=-1) # log-softmax over the last (vocabulary) dimension
    logp_label = torch.gather(logp,2,labels.unsqueeze(2)).squeeze(-1) # pick the log probability of each label token
    return logp_label

In [8]:
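# The function above gives the log probability of each individual token; next we sum them to get the log probability of the whole sequence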
def sequence_logprob(model,labels,input_len=0):
    with torch.no_grad():
        output = model(labels)
        log_probs = log_probs_from_logits(
            output.logits[:,:-1,:],labels[:,1:]
        )
        seg_log_prob = torch.sum(log_probs[:,input_len:])
        return seg_log_prob.cpu().numpy()

In [9]:
# Check the function by computing the sequence log probability of the greedy output
logp = sequence_logprob(model,output_greedy,input_len=len(input_ids[0]))
print(tokenizer.decode(output_greedy[0]))
print(f"\nlog-prob: {logp:.2f}")

In a shocking finding,scientists discovered a herd of unicorns living in a remove,previously unexpected valley,in the Andes Mountains.Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers, from the University of California, Santa Cruz, and the University of Arizona, studied the unicorns' behavior and found that they were not only able to communicate with each other, but also with humans.


The researchers believe that the unicorns are able to communicate with humans because they are able to sense the emotions of humans.


The researchers believe that the unicorns are able

log-prob: -82.38


In [10]:
# Next, decode with beam search and compute the log probability of its output
output_beam = model.generate(input_ids,max_length=max_length,num_beams=5,do_sample=False)

logp = sequence_logprob(model,output_beam,input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")


In a shocking finding,scientists discovered a herd of unicorns living in a remove,previously unexpected valley,in the Andes Mountains.Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


According to the researchers, the unicorns are the descendants of a group of animals that lived in the Andes Mountains thousands of years ago.


The researchers believe that the unicorns are descendants of a group of animals that lived in the Andes Mountains thousands of years ago.


The researchers believe that the unicorns are descendants of a group of animals that lived in the Andes Mountains thousands

log-prob: -43.26


In [11]:
# Use the no_repeat_ngram_size parameter to keep the model from repeating itself
output_beam = model.generate(input_ids,max_length=max_length,num_beams=5,do_sample=False,no_repeat_ngram_size=2)

logp = sequence_logprob(model,output_beam,input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nlog-prob: {logp:.2f}")


In a shocking finding,scientists discovered a herd of unicorns living in a remove,previously unexpected valley,in the Andes Mountains.Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The researchers, from the University of California, Santa Cruz, and the National Geographic Society, were conducting a study on the evolution of language in animals when they stumbled upon the unicorn herd.

"We were surprised to find a group of animals that spoke a language that we had never seen before," said lead author Dr. David Carrier, a professor of linguistics at UCSC. "It was

log-prob: -97.61
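
Conceptually, `no_repeat_ngram_size=n` forbids any next token that would recreate an n-gram already present in the generated text. The helper below is a hypothetical pure-Python sketch of that idea working on word strings; it is not the library's actual implementation, which operates on token IDs.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if n < 1 or len(tokens) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

history = "the researchers believe that the".split()
print(banned_next_tokens(history, 2))
```

Here the bigram "the researchers" has already appeared, so after the second "the" the word "researchers" is ruled out. Forcing the decoder away from its preferred continuation in this way is consistent with the lower log probability reported above.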


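### Sampling Methods
> The simplest sampling method is to sample randomly from the model's probability distribution over the full vocabulary at each timestep:
$$
P(y_t=w_i|y_{<t},x) = \mathrm{softmax}(Z_{t,i}) = \frac{\exp (Z_{t,i})}{\textstyle \sum_{j=1}^{|V|} \exp (Z_{t,j})}
$$
> where $|V|$ denotes the cardinality of the vocabulary. We can control the diversity of the output by adding a `temperature` parameter *(T)* that rescales the logits before the softmax is taken: $T > 1$ flattens the distribution and makes the output more diverse, while $T < 1$ sharpens it. The sampling formula then becomes:
$$
P(y_t=w_i|y_{<t},x) = \frac{\exp (Z_{t,i}/T)}{\textstyle \sum_{j=1}^{|V|} \exp (Z_{t,j}/T)}
$$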
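
The effect of dividing the logits by a temperature can be checked on a handful of made-up scores. The function name and the logits below are illustrative assumptions, not part of the notebook.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits that are first divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0, 0.5]  # made-up scores for four tokens
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

A low temperature concentrates almost all of the probability mass on the top token, while a high temperature spreads it out and makes rare tokens much more likely to be sampled.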


In [12]:
# Model output with T=2
output_temp = model.generate(
    input_ids,max_length=max_length,do_sample=True,
    temperature=2.0,top_k=0
)
print(tokenizer.decode(output_temp[0]))


In a shocking finding,scientists discovered a herd of unicorns living in a remove,previously unexpected valley,in the Andes Mountains.Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


 stains EXPINO trainblue credited activist Antarctica757 Wooden Dom*iscovery Indenne Lua illustrates LemurateEvery February and nickel amazgeon life line uses 2 die hopesCOMPLESGatherTtry Horse sealszip Capital Predator frozen Seeds responsibleNine e Charge Tuniterator states moved surprisingly mapped series Vladwriter sq Attribution Census link generally impairedAdditionally topic postcapital Color year curses of price Odafi640 mile Situation Natalie Nichols arm


> As you can see, with $T=2$ the output is complete gibberish, so let us lower the temperature:

In [13]:
output_temp = model.generate(input_ids,max_length=max_length,do_sample=True,temperature=0.5,top_k=0)

print(tokenizer.decode(output_temp[0]))


In a shocking finding,scientists discovered a herd of unicorns living in a remove,previously unexpected valley,in the Andes Mountains.Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


"The unicorns had the ability to speak English with perfect pitch, and they have an amazing vocabulary," said the leader of the study, Dr. John Mackey.


The study was carried out by the National Geographic Society's (NGS) Institute of Vertebrate Zoology.The research team included researchers from the University of Utah, the University of Utah, the University of Colorado,


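> The text is now more coherent and looks plausible at first glance, but the details show that the quality is still not high: the University of Utah is listed twice, for instance (and however you look at it, unicorns would hardly speak English).  
Another way to adjust the trade-off between coherence and diversity is to truncate the distribution over the vocabulary. This lets us tune the diversity freely with $T$, but within a more limited range that excludes tokens that would be too strange or rare in the context. There are two main ways to do this: *top-k* and *top-p* sampling.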

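### Top-k and Top-p Sampling
> The basic idea behind top-k and top-p sampling is to restrict the number of tokens we can sample from at each timestep.  
Take GPT2-XL as an example, with the *top-k* threshold set to $k=2000$ and the *top-p* threshold set to $p=0.95$: tokens that fall outside the threshold are never sampled, so only the highest-ranked tokens remain candidates. This cuts off the long tail of low-probability tokens, which is exactly what truncating the vocabulary means.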
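
Both truncation schemes can be sketched on a toy distribution. The helper names and the five probabilities below are assumptions made for illustration; the boundary handling in `model.generate()` may differ in detail.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]  # made-up next-token distribution
print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.9))

# Sample one token index from the truncated, renormalized distribution.
filtered = top_p_filter(probs, 0.9)
token = random.choices(list(filtered), weights=list(filtered.values()))[0]
print(token)
```

Top-k always keeps a fixed number of tokens, whereas top-p keeps more tokens when the distribution is flat and fewer when it is peaked.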

In [14]:
# Use top_k to restrict the tokens the model can sample from
output_topk = model.generate(input_ids,max_length=max_length,do_sample=True,top_k=50)

print(tokenizer.decode(output_topk[0]))


In a shocking finding,scientists discovered a herd of unicorns living in a remove,previously unexpected valley,in the Andes Mountains.Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The scientists who studied the herds from the viewpoint of nature said that they believe that both these animals are of the unicorn family, a race of animals belonging to animals named "The Supernaturals".The most important point to be noticed by the researchers was their size – up to 120' in the mountains! This made the scientists think that the unicorns are not related to the small unicorns they


In [15]:
# When it is unclear what value top_k should take, top_p cuts the distribution dynamically based on a cumulative probability threshold
output_topp = model.generate(input_ids,max_length=max_length,do_sample=True,top_p=0.90)

print(tokenizer.decode(output_topp[0]))


In a shocking finding,scientists discovered a herd of unicorns living in a remove,previously unexpected valley,in the Andes Mountains.Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The unicorns were the result of one of the largest natural experiments in the world. A large herd of elk wandered through a valley in the Ecuadorian Andes, which were eventually brought here by a rancher. The rancher had also captured these unique creatures and wanted to breed them in captivity.


Unfortunately for the rancher, his ranch and the nearby villagers had been raided by local
