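Masked language modeling is the task of replacing tokens in a sequence with a mask token and prompting the model to fill each mask with an appropriate token. This lets the model attend to both the right context (tokens to the right of the mask) and the left context (tokens to the left of the mask). Such training creates a strong basis for downstream tasks that require bidirectional context, such as SQuAD.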
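
To make the left/right context idea concrete, here is a minimal sketch that masks one position in a whitespace-split sentence. The helper `mask_at` and the literal `[MASK]` string are illustrative assumptions; a real tokenizer works on subword ids and supplies its own `mask_token`.

```python
def mask_at(tokens, index, mask_token="[MASK]"):
    """Return (masked_tokens, left_context, right_context) for one position."""
    masked = list(tokens)          # copy so the input list is not modified
    masked[index] = mask_token     # hide the token the model must predict
    return masked, tokens[:index], tokens[index + 1:]

tokens = "the cat sat on the mat".split()
masked, left, right = mask_at(tokens, 2)

print(" ".join(masked))
print("left context:", left)
print("right context:", right)
```

A bidirectional model sees both `left` and `right` when predicting the hidden token, whereas a left-to-right language model would see only `left`.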

In [1]:
!pip install transformers
import torch

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/a3/78/92cedda05552398352ed9784908b834ee32a0bd071a9b32de287327370b7/transformers-2.8.0-py3-none-any.whl (563kB)

In [2]:
torch.cuda.get_device_name(0)

'Tesla T4'

In [3]:
from transformers import pipeline

nlp = pipeline("fill-mask")
print(nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks."))


[{'sequence': '<s> HuggingFace is creating a tool that the community uses to solve NLP tasks.</s>', 'score': 0.1572076976299286, 'token': 3944}, {'sequence': '<s> HuggingFace is creating a framework that the community uses to solve NLP tasks.</s>', 'score': 0.11565146595239639, 'token': 7208}, {'sequence': '<s> HuggingFace is creating a library that the community uses to solve NLP tasks.</s>', 'score': 0.05949191749095917, 'token': 5560}, {'sequence': '<s> HuggingFace is creating a database that the community uses to solve NLP tasks.</s>', 'score': 0.04147905111312866, 'token': 8503}, {'sequence': '<s> HuggingFace is creating a prototype that the community uses to solve NLP tasks.</s>', 'score': 0.025827907025814056, 'token': 17715}]


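### Below is an example of masked language modeling using a model and a tokenizer. The process is as follows:

– Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a DistilBERT model and is loaded with the weights stored in the checkpoint.

– Define a sequence containing a mask token, placing `tokenizer.mask_token` where a word would otherwise go.

– Encode the sequence into a list of ids and find the position of the mask token in that list.

– Retrieve the predictions at the index of the mask token: this tensor has the same size as the vocabulary, and its values are the scores assigned to each token. The model gives higher scores to tokens it considers probable in that context.

– Retrieve the top 5 tokens using the PyTorch `topk` method.

– Replace the mask token with each predicted token and print the results.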
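
Before running the real model, the last three steps can be sketched on toy data without any downloads. Everything here is an assumption made for illustration: the five-word `vocab`, the invented `mask_scores`, and the use of `heapq.nlargest` as a stand-in for the role `torch.topk` plays later.

```python
import heapq

# Hypothetical toy vocabulary and made-up scores; not taken from any real model.
vocab = ["reduce", "increase", "banana", "offset", "the"]
mask_token = "[MASK]"
sequence = f"Smaller models help {mask_token} our carbon footprint."

# Find the position of the mask in the (whitespace) token list.
tokens = sequence.split()
mask_index = tokens.index(mask_token)

# One score per vocabulary entry for the masked position (invented numbers).
mask_scores = [9.1, 7.4, -3.0, 6.2, 0.5]

# Indices of the k highest scores, best first.
k = 3
top_ids = heapq.nlargest(k, range(len(vocab)), key=lambda i: mask_scores[i])

# Substitute each candidate back into the original sequence.
candidates = [sequence.replace(mask_token, vocab[i]) for i in top_ids]
for text in candidates:
    print(text)
```

The cells that follow do the same thing with real token ids and a tensor of logits over the full vocabulary.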

In [6]:
from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased")

sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."

input = tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input == tokenizer.mask_token_id)[1]  # column index of the mask token (id 103)

print(input)
print(tokenizer.mask_token_id)
print(torch.where(input == tokenizer.mask_token_id))

tensor([[  101, 12120,  2050,  8683,  1181,  3584,  1132,  2964,  1190,  1103,
          3584,  1152, 27180,   119,  7993,  1172,  1939,  1104,  1103,  1415,
          3827,  1156,  1494,   103,  1412,  6302,  2555, 10988,   119,   102]])
103
(tensor([0]), tensor([23]))


In [7]:
token_logits = model(input)[0]
mask_token_logits = token_logits[0, mask_token_index, :]

print(token_logits)
print(mask_token_index)

tensor([[[ -6.6732,  -6.6450,  -6.7923,  ...,  -5.5930,  -5.2783,  -5.6559],
         [ -6.3221,  -5.6379,  -5.8990,  ...,  -4.6864,  -4.1499,  -5.3507],
         [ -5.9863,  -6.0991,  -5.8089,  ...,  -5.2297,  -4.3015,  -6.5971],
         ...,
         [ -7.8892,  -7.6718,  -7.6357,  ...,  -6.9083,  -5.5853,  -6.2459],
         [-14.7710, -14.2714, -14.1642,  ..., -11.4770, -12.1692, -13.1041],
         [-14.3694, -13.9838, -13.6330,  ..., -11.2066, -11.6753, -12.7083]]],
       grad_fn=<AddBackward0>)
tensor([23])


In [0]:
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()


In [10]:
for token in top_5_tokens:
  print(token)
  print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

4851
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
2773
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
9711
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
18134
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
4607
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
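
The values printed for `token_logits` are raw, unnormalized scores, whereas the `fill-mask` pipeline output earlier reports `score` values between 0 and 1. A softmax over the vocabulary dimension is the usual way to turn logits into probabilities. The sketch below shows the computation in plain Python on made-up numbers; with the tensors above, the equivalent call would presumably be `torch.softmax(mask_token_logits, dim=1)`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a three-token vocabulary, for illustration only.
probs = softmax([2.0, 1.0, 0.0])
print([round(p, 3) for p in probs])
```

Softmax preserves the ranking of the logits, so the top 5 tokens are the same whether they are selected before or after normalization.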
