# The Dataset for Pretraining BERT
:label:`sec_bert-dataset`

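To pretrain the BERT model as implemented in :numref:`sec_bert`, we need to generate the dataset in the ideal format to facilitate the two pretraining tasks: masked language modeling and next sentence prediction. On the one hand, the original BERT model is pretrained on the concatenation of two huge corpora, BookCorpus and English Wikipedia (see :numref:`subsec_bert_pretraining_tasks`), which makes it hard to run for most readers of this book. On the other hand, the off-the-shelf pretrained BERT model may not fit applications from specific domains such as medicine. Thus, it is getting popular to pretrain BERT on a customized dataset. To facilitate the demonstration of BERT pretraining, we use a smaller corpus, WikiText-2 :cite:`Merity.Xiong.Bradbury.ea.2016`.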

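Compared with the PTB dataset used for pretraining word2vec in :numref:`sec_word2vec_data`, WikiText-2 (1) retains the original punctuation, making it suitable for next sentence prediction; (2) retains the original case and numbers; (3) is over twice as large.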


In [247]:
import os
import random
import torch
from d2l import torch as d2l

In the WikiText-2 dataset, each line represents a paragraph where space is inserted between any punctuation and its preceding token. Paragraphs with at least two sentences are retained. To split sentences, we only use the period as the delimiter for simplicity. We leave discussions of more complex sentence splitting techniques in the exercises at the end of this section.
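
As a rough illustration of the splitting rule described above, the sketch below applies the same `' . '` delimiter to two made-up lines. The helper name `split_paragraphs` and the sample text are assumptions for illustration only; they are not part of the dataset code.

```python
def split_paragraphs(lines):
    """Lowercase each line and split it into sentences on ' . '.

    Lines that yield fewer than two sentences are dropped.
    """
    return [line.strip().lower().split(' . ')
            for line in lines if len(line.split(' . ')) >= 2]


# Made-up sample lines in the WikiText-2 style (space before punctuation)
lines = [
    'The route opened in 1928 . It was extended later .',
    'A heading without a second sentence',
]
toy_paragraphs = split_paragraphs(lines)
print(toy_paragraphs)
```

Note that the last sentence of a paragraph keeps its trailing period, because the delimiter `' . '` requires a space on both sides.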


In [248]:
import pyarrow.parquet as pq

def _read_wiki(data_dir):
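    # The Parquet file is expected to be named wiki.train.tokens.parquet
    file_name = os.path.join(data_dir, 'wiki.train.tokens.parquet')

    # Read the file directly with pyarrow
    table = pq.read_table(file_name)
    df = table.to_pandas()

    # Use the 'text' column if present; otherwise fall back to the first column
    if 'text' in df.columns:
        lines = df['text'].tolist()
    else:
        first_column = df.columns[0]
        lines = df[first_column].tolist()

    # Convert uppercase letters to lowercase and split each line on ' . '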
    paragraphs = [line.strip().lower().split(' . ')
                  for line in lines if len(line.split(' . ')) >= 2]
    random.shuffle(paragraphs)
    return paragraphs
paragraphs = _read_wiki("../data")

In [249]:
len(paragraphs), paragraphs[0], len(paragraphs[0]), paragraphs[0][0], paragraphs[0][1]

(15496,
 ['" don \'t you wanna stay " was covered by colton dixon and <unk> <unk> in the eleventh season of american idol',
  'natalie finn of e ! gave a mixed review of the pair \'s performance , writing " <unk> handled kelly clarkson better than colton played jason aldean on " don \'t you wanna stay , " but she \'s the country girl , so it made sense',
  '" brian mansfield of usa today felt that the song was out of dixon \'s comfort zone and a little out of <unk> \'s range',
  'gil kaufman of mtv remarked that the chemistry between the pair was more like cold fusion',
  'jennifer still of digital spy said the performance " isn \'t anything incredible " .'],
 5,
 '" don \'t you wanna stay " was covered by colton dixon and <unk> <unk> in the eleventh season of american idol',
 'natalie finn of e ! gave a mixed review of the pair \'s performance , writing " <unk> handled kelly clarkson better than colton played jason aldean on " don \'t you wanna stay , " but she \'s the country girl , 

## Defining Helper Functions for Pretraining Tasks

In the following, we begin by implementing helper functions for the two BERT pretraining tasks: next sentence prediction and masked language modeling. These helper functions will be invoked later when transforming the raw text corpus into the dataset of the ideal format to pretrain BERT.

### Generating the Next Sentence Prediction Task

According to the description in :numref:`subsec_nsp`, the `_get_next_sentence` function generates a training example for the binary classification task.


In [250]:
#@save
def _get_next_sentence(sentence, next_sentence, paragraphs):
    if random.random() < 0.5:
        is_next = True
    else:
        # `paragraphs` is a list of lists of lists
        next_sentence = random.choice(random.choice(paragraphs))
        is_next = False
    return sentence, next_sentence, is_next
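
To see how the sampling behaves, the following self-contained sketch repeats the same logic on a toy corpus and measures the fraction of positive pairs. The function name `get_next_sentence` (without the leading underscore), the toy corpus, and the seed are assumptions made for this illustration.

```python
import random


def get_next_sentence(sentence, next_sentence, paragraphs):
    """Keep the true next sentence half of the time, else sample a random one."""
    if random.random() < 0.5:
        is_next = True
    else:
        # Pick a random paragraph, then a random sentence inside it
        next_sentence = random.choice(random.choice(paragraphs))
        is_next = False
    return sentence, next_sentence, is_next


# Toy corpus: 2 paragraphs, each a list of tokenized sentences
toy_paragraphs = [
    [['a', 'b'], ['c', 'd'], ['e', 'f']],
    [['g', 'h'], ['i', 'j']],
]
random.seed(0)
labels = []
for _ in range(1000):
    _, sampled, is_next = get_next_sentence(['a', 'b'], ['c', 'd'],
                                            toy_paragraphs)
    labels.append(is_next)
positive_fraction = sum(labels) / len(labels)
print(positive_fraction)  # fraction of positive pairs, close to 0.5
```

One consequence of this design is that a randomly drawn "negative" sentence can occasionally coincide with the true next sentence, which introduces a small amount of label noise.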

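The following function generates training examples for next sentence prediction from the input `paragraph` by invoking the `_get_next_sentence` function. Here `paragraph` is a list of sentences, where each sentence is a list of tokens. The argument `max_len` specifies the maximum length of a BERT input sequence during pretraining.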

[

    [token1, token2, token3, ..., tokenN],

    [token1, token2, token3, ..., tokenN]

]


In [251]:
#@save
def _get_nsp_data_from_paragraph(paragraph, paragraphs, vocab, max_len, verbose=False):
    nsp_data_from_paragraph = []
    for i in range(len(paragraph) - 1):
        tokens_a, tokens_b, is_next = _get_next_sentence(paragraph[i], paragraph[i + 1], paragraphs)
        if verbose:
            print(f'---')
            print(f'paragraph: {paragraph}\n')
            print(f'tokens_a: {tokens_a}, tokens_b: {tokens_b}, is_next: {is_next}')

        # Limit the length of the BERT input sequence:
        # account for 1 '<cls>' token and 2 '<sep>' tokens
        if len(tokens_a) + len(tokens_b) + 3 > max_len:
            continue

        tokens, segments = d2l.get_tokens_and_segments(tokens_a, tokens_b)
        nsp_data_from_paragraph.append((tokens, segments, is_next))
        if verbose:
            print(f'---')
            print(f'nsp_data_from_paragraph: {nsp_data_from_paragraph}')
    return nsp_data_from_paragraph
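
The `+ 3` in the length check comes from the special tokens added when two sentences are joined. The sketch below re-implements what `d2l.get_tokens_and_segments` is assumed to do (it is defined in :numref:`sec_bert`, so treat this version as an approximation rather than the library's code) and checks the resulting length on made-up sentences.

```python
def get_tokens_and_segments(tokens_a, tokens_b=None):
    """Build a BERT input sequence and its segment IDs (sketch)."""
    tokens = ['<cls>'] + tokens_a + ['<sep>']
    # Segment 0 covers '<cls>', the first sentence, and the first '<sep>'
    segments = [0] * (len(tokens_a) + 2)
    if tokens_b is not None:
        tokens += tokens_b + ['<sep>']
        # Segment 1 covers the second sentence and the final '<sep>'
        segments += [1] * (len(tokens_b) + 1)
    return tokens, segments


tokens_a, tokens_b = ['this', 'movie', 'is', 'great'], ['i', 'like', 'it']
tokens, segments = get_tokens_and_segments(tokens_a, tokens_b)
print(tokens)
print(segments)
print(len(tokens) == len(tokens_a) + len(tokens_b) + 3)
```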

### Generating the Masked Language Modeling Task
:label:`subsec_prepare_mlm_data`

In order to generate training examples for the masked language modeling task from a BERT input sequence, we define the following `_replace_mlm_tokens` function.

In its inputs, `tokens` is a list of tokens representing a BERT input sequence, `candidate_pred_positions` is a list of token indices of the BERT input sequence excluding those of special tokens (special tokens are not predicted in the masked language modeling task), and `num_mlm_preds` indicates the number of predictions (recall that 15% of the tokens are chosen at random for prediction).

Following the definition of the masked language modeling task in :numref:`subsec_mlm`, at each prediction position, the input may be replaced by a special “&lt;mask&gt;” token or a random token, or remain unchanged. In the end, the function returns
* the input tokens after possible replacement
* the token indices where predictions take place
* labels for these predictions


In [252]:
#@save
def _replace_mlm_tokens(tokens, candidate_pred_positions, num_mlm_preds,
                        vocab):
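    # Make a new copy of the tokens for the masked language modeling input,
    # where the input may contain a replaced '<mask>' or random tokens.
    # Each entry of `mlm_input_tokens` is '<mask>', a random token, or the
    # original token.
    mlm_input_tokens = [token for token in tokens]

    # `pred_positions_and_labels` holds (token index, original token) pairs;
    # special tokens are never included
    pred_positions_and_labels = []
    # Shuffle so that a random 15% of the tokens are picked for prediction
    # in the masked language modeling task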
    random.shuffle(candidate_pred_positions)

    for mlm_pred_position in candidate_pred_positions:
        if len(pred_positions_and_labels) >= num_mlm_preds:
            break
        masked_token = None
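        # 80% of the time: replace the word with the '<mask>' token
        if random.random() < 0.8:
            masked_token = '<mask>'
        else:
            # 10% of the time: keep the word unchanged
            if random.random() < 0.5:
                masked_token = tokens[mlm_pred_position]
            # 10% of the time: replace the word with a random word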
            else:
                masked_token = random.choice(vocab.idx_to_token)

        mlm_input_tokens[mlm_pred_position] = masked_token
        pred_positions_and_labels.append((mlm_pred_position, tokens[mlm_pred_position]))

    return mlm_input_tokens, pred_positions_and_labels
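
The nested `random.random()` calls above produce the 80%/10%/10% split because the inner test only runs in the remaining 20% of cases and halves it. A small simulation can make this concrete; the helper `choose_replacement`, the toy vocabulary, and the seed below are assumptions for illustration.

```python
import random
from collections import Counter


def choose_replacement(original, vocab_tokens):
    """Return (kind, token) following the 80/10/10 rule."""
    if random.random() < 0.8:
        return 'mask', '<mask>'
    # Reached 20% of the time; split evenly between keep and random
    if random.random() < 0.5:
        return 'keep', original
    return 'random', random.choice(vocab_tokens)


random.seed(0)
vocab_tokens = ['the', 'cat', 'sat', 'on', 'mat']  # hypothetical toy vocabulary
num_trials = 10000
counts = Counter(choose_replacement('cat', vocab_tokens)[0]
                 for _ in range(num_trials))
fractions = {kind: n / num_trials for kind, n in counts.items()}
print(fractions)  # roughly 0.8 / 0.1 / 0.1
```

A randomly chosen replacement may happen to equal the original token, so the share of positions that are truly altered is slightly below 90%.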

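By invoking the aforementioned `_replace_mlm_tokens` function, the following function takes a BERT input sequence (`tokens`) as an input and returns indices of the input tokens (after possible token replacement as described in :numref:`subsec_mlm`), the token indices where predictions take place, and the label indices for these predictions.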


In [253]:
#@save
def _get_mlm_data_from_tokens(tokens, vocab, verbose=False):
    # `tokens` is a list of strings

    # Collect the candidate positions that may be chosen for prediction
    candidate_pred_positions = []
    for i, token in enumerate(tokens):
        # Special tokens are not predicted in the masked language
        # modeling task
        if token in ['<cls>', '<sep>']:
            continue
        candidate_pred_positions.append(i)

    # Number of tokens to predict: 15% of all tokens, and at least 1
    num_mlm_preds = max(1, round(len(tokens) * 0.15))
    if verbose:
        print(f"\n15% of the total number of tokens:\n {num_mlm_preds}")
        print(f"\ntokens:\n {tokens}")
        print(f"\ncandidate_pred_positions:\n {candidate_pred_positions}")
        print(f"\nvocab:\n{vocab}")
    mlm_input_tokens, pred_positions_and_labels = _replace_mlm_tokens(tokens, candidate_pred_positions, num_mlm_preds, vocab)

    # Sort the predicted positions and labels by position index
    pred_positions_and_labels = sorted(pred_positions_and_labels, key=lambda x: x[0])
    pred_positions = [v[0] for v in pred_positions_and_labels]
    mlm_pred_labels = [v[1] for v in pred_positions_and_labels]
    if verbose:
        print(f"\nmlm_input_tokens:\n{mlm_input_tokens}")
        print(f"\npred_positions_and_labels:\n{pred_positions_and_labels}")
        print(f"\npred_positions:\n{pred_positions}")
        print(f"\nmlm_pred_labels:\n{mlm_pred_labels}")
        print(f"\nvocab[mlm_input_tokens]:\n{vocab[mlm_input_tokens]}")
        print(f"\nvocab[mlm_pred_labels]:\n{vocab[mlm_pred_labels]}")
    return vocab[mlm_input_tokens], pred_positions, vocab[mlm_pred_labels]

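Here `vocab[mlm_input_tokens]` works as follows:
* `mlm_input_tokens` is a list of string tokens, such as `['<cls>', 'i', '<mask>', 'bert', '<sep>']`
* `vocab[mlm_input_tokens]` converts each token into its index in the vocabulary
* the result is a list of indices, such as `[2, 156, 234, 5, 3]` (illustrative numbers; the actual indices depend on how the vocabulary was built)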
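
The list-indexing behavior can be sketched with a tiny stand-in class. `ToyVocab` below is a hypothetical simplification written for this example; it is not the `d2l.Vocab` implementation, and its index assignment is an assumption.

```python
class ToyVocab:
    """A tiny stand-in for a vocabulary that maps tokens to indices."""

    def __init__(self, idx_to_token):
        self.idx_to_token = list(idx_to_token)
        self.token_to_idx = {tok: i for i, tok in enumerate(self.idx_to_token)}

    def __getitem__(self, tokens):
        # A single token returns one index; a list returns a list of indices.
        # Unknown tokens fall back to the index of '<unk>'.
        if isinstance(tokens, (list, tuple)):
            return [self[token] for token in tokens]
        return self.token_to_idx.get(tokens, self.token_to_idx['<unk>'])


toy_vocab = ToyVocab(['<unk>', '<pad>', '<mask>', '<cls>', '<sep>', 'i', 'bert'])
ids = toy_vocab[['<cls>', 'i', '<mask>', 'bert', '<sep>']]
print(ids)  # [3, 5, 2, 6, 4]
```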

## Transforming Text into the Pretraining Dataset

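Now we are almost ready to customize a `Dataset` class for pretraining BERT.

Before that, we still need to define a helper function `_pad_bert_inputs` to append the special “&lt;pad&gt;” tokens to the inputs.
Its argument `examples` contains the outputs from the helper functions `_get_nsp_data_from_paragraph` and `_get_mlm_data_from_tokens` for the two pretraining tasks.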


In [254]:
#@save
def _pad_bert_inputs(examples, max_len, vocab):
    # Pad the BERT inputs so that all examples in a minibatch have the
    # same length

    max_num_mlm_preds = round(max_len * 0.15)
    all_token_ids, all_segments, valid_lens = [], [], []
    all_pred_positions, all_mlm_weights, all_mlm_labels = [], [], []
    nsp_labels = []
    for (token_ids, pred_positions, mlm_pred_label_ids, segments, is_next) in examples:

        all_token_ids.append(torch.tensor(token_ids + [vocab['<pad>']] * (max_len - len(token_ids)),
                                          dtype=torch.long))

        all_segments.append(torch.tensor(segments + [0] * (max_len - len(segments)),
                                         dtype=torch.long))

        # `valid_lens` excludes the count of '<pad>' tokens
        valid_lens.append(torch.tensor(len(token_ids),
                                       dtype=torch.float32))

        all_pred_positions.append(torch.tensor(pred_positions + [0] * (max_num_mlm_preds - len(pred_positions)),
                                               dtype=torch.long))

        # Predictions of padded tokens are filtered out of the loss by multiplying by 0 weights
        all_mlm_weights.append(torch.tensor([1.0] * len(mlm_pred_label_ids) + [0.0] * (max_num_mlm_preds - len(pred_positions)),
                                            dtype=torch.float32))
        all_mlm_labels.append(torch.tensor(mlm_pred_label_ids + [0] * (max_num_mlm_preds - len(mlm_pred_label_ids)),
                                           dtype=torch.long))
        nsp_labels.append(torch.tensor(is_next,
                                       dtype=torch.long))

    return (all_token_ids, all_segments,
            valid_lens, all_pred_positions,
            all_mlm_weights, all_mlm_labels, nsp_labels)
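
To make the padding scheme easier to follow, here is a minimal sketch that pads a single example using plain Python lists instead of tensors. The helper `pad_example`, the sample IDs, and the pad index `1` are assumptions chosen for illustration.

```python
def pad_example(token_ids, pred_positions, mlm_label_ids, max_len, pad_id):
    """Pad one example to fixed lengths using plain Python lists."""
    max_num_mlm_preds = round(max_len * 0.15)
    num_preds = len(pred_positions)
    padded_ids = token_ids + [pad_id] * (max_len - len(token_ids))
    valid_len = len(token_ids)  # excludes the '<pad>' tokens
    padded_positions = pred_positions + [0] * (max_num_mlm_preds - num_preds)
    # Weight 1.0 for real predictions, 0.0 for padded ones
    mlm_weights = [1.0] * num_preds + [0.0] * (max_num_mlm_preds - num_preds)
    padded_labels = mlm_label_ids + [0] * (max_num_mlm_preds - num_preds)
    return padded_ids, valid_len, padded_positions, mlm_weights, padded_labels


# Hypothetical example: 6 token ids, 1 prediction, max_len = 20, pad id = 1
out = pad_example([3, 10, 2, 12, 13, 4], [2], [11], max_len=20, pad_id=1)
padded_ids, valid_len, padded_positions, mlm_weights, padded_labels = out
print(len(padded_ids), valid_len, padded_positions, mlm_weights, padded_labels)
```

The padded prediction positions point at index 0, but their weights are 0.0, so they contribute nothing to the masked language modeling loss.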

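Putting together the helper functions for generating training examples of the two pretraining tasks and the helper function for padding inputs, we customize the following `_WikiTextDataset` class as the WikiText-2 dataset for pretraining BERT. By implementing the `__getitem__` function, we can arbitrarily access the pretraining (masked language modeling and next sentence prediction) examples generated from a pair of sentences from the WikiText-2 corpus.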

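The original BERT model uses WordPiece embeddings whose vocabulary size is 30000 :cite:`Wu.Schuster.Chen.ea.2016`. The tokenization method of WordPiece is a slight modification of the original byte pair encoding algorithm in :numref:`subsec_Byte_Pair_Encoding`. For simplicity, we use the `d2l.tokenize` function for tokenization. Infrequent tokens that appear fewer than five times are filtered out.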
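
The effect of the frequency threshold can be sketched with `collections.Counter`. The function `build_token_list` and its ordering rule are assumptions for illustration and do not necessarily match how `d2l.Vocab` orders its tokens.

```python
from collections import Counter


def build_token_list(sentences, min_freq, reserved_tokens):
    """Return '<unk>', the reserved tokens, then tokens with freq >= min_freq."""
    counter = Counter(token for sentence in sentences for token in sentence)
    idx_to_token = ['<unk>'] + list(reserved_tokens)
    # Most frequent first; ties are broken alphabetically for determinism
    for token, freq in sorted(counter.items(), key=lambda x: (-x[1], x[0])):
        if freq >= min_freq and token not in idx_to_token:
            idx_to_token.append(token)
    return idx_to_token


sentences = [['the', 'bay', 'city'], ['the', 'bay'], ['the', 'route']]
reserved = ['<pad>', '<mask>', '<cls>', '<sep>']
filtered = build_token_list(sentences, min_freq=2, reserved_tokens=reserved)
unfiltered = build_token_list(sentences, min_freq=1, reserved_tokens=reserved)
print(filtered)
print(len(filtered), len(unfiltered))
```

Tokens below the threshold are not lost from the text; they are simply mapped to the unknown token when looked up.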


In [255]:
#@save
class _WikiTextDataset(torch.utils.data.Dataset):
    def __init__(self, paragraphs, max_len, verbose = False):

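        # Input `paragraphs[i]` is a list of sentence strings representing a
        # paragraph; output `paragraphs[i]` is a list of sentences
        # representing a paragraph, where each sentence is a list of tokens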
        print(f"before tokenize, paragraphs:\n{paragraphs}")
        paragraphs = [d2l.tokenize(
            paragraph, token='word') for paragraph in paragraphs]
        print(f"after tokenize, paragraphs:\n{paragraphs}")

        sentences = [sentence for paragraph in paragraphs
                     for sentence in paragraph]
        print(f'\nsentences:\n{sentences}')
        self.vocab = d2l.Vocab(tokens=sentences, min_freq=5,
                               reserved_tokens=['<pad>', '<mask>', '<cls>', '<sep>'])

        # Get data for the next sentence prediction task
        examples = []
        print(f"\n len(paragraphs): {len(paragraphs)}")
        for i, paragraph in enumerate(paragraphs):
            #if i == 0:
                #verbose = True
            #else:
                #verbose = False
            examples.extend(
                _get_nsp_data_from_paragraph(paragraph, paragraphs, self.vocab, max_len, verbose=verbose))

        # Get data for the masked language modeling task
        examples = [(_get_mlm_data_from_tokens(tokens, self.vocab, verbose = verbose) + (segments, is_next))
                     for tokens, segments, is_next in examples]
        print(f'\nlen of examples\n: {len(examples)}')
        print(f'\nexamples[:1]\n: {examples[:1]}')

        # Pad the inputs
        (self.all_token_ids, self.all_segments, self.valid_lens,
         self.all_pred_positions, self.all_mlm_weights,
         self.all_mlm_labels, self.nsp_labels) = _pad_bert_inputs(examples, max_len, self.vocab)

    def __getitem__(self, idx):
        return (self.all_token_ids[idx], self.all_segments[idx],
                self.valid_lens[idx], self.all_pred_positions[idx],
                self.all_mlm_weights[idx], self.all_mlm_labels[idx],
                self.nsp_labels[idx])

    def __len__(self):
        return len(self.all_token_ids)

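By using the `_read_wiki` function and the `_WikiTextDataset` class, we define the following `load_data_wiki` to load the WikiText-2 dataset and generate pretraining examples from it. In this version the corpus is read from a local `../data` directory; the download call is commented out.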


In [256]:
#@save
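def load_data_wiki(batch_size, max_len):
    """Load the WikiText-2 dataset."""
    num_workers = 0 # d2l.get_dataloader_workers()
    #data_dir = d2l.download_extract('wikitext-2', 'wikitext-2')
    data_dir = "../data"
    paragraphs = _read_wiki(data_dir)
    # NOTE: `paragraphs[:1][:2]` equals `paragraphs[:1]`, so only the first
    # paragraph is used here (a small debugging subset); pass `paragraphs`
    # to build the dataset from the whole corpus
    train_set = _WikiTextDataset(paragraphs[:1][:2], max_len, verbose=False)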
    print(train_set[0])
    train_iter = torch.utils.data.DataLoader(train_set, batch_size,
                                        shuffle=True, num_workers=num_workers)
    return train_iter, train_set.vocab

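Setting the batch size to 512 and the maximum length of a BERT input sequence to 64, we print out the shapes of a minibatch of BERT pretraining examples. Note that in each BERT input sequence, room is reserved for $10$ prediction positions ($64 \times 0.15 = 9.6$, rounded to the nearest integer) for the masked language modeling task.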


In [257]:
batch_size, max_len = 512, 64
train_iter, vocab = load_data_wiki(batch_size, max_len)

for (tokens_X, segments_X, valid_lens_x, pred_positions_X, mlm_weights_X,
     mlm_Y, nsp_y) in train_iter:
    print(tokens_X.shape, segments_X.shape, valid_lens_x.shape,
          pred_positions_X.shape, mlm_weights_X.shape, mlm_Y.shape,
          nsp_y.shape)
    break

before tokenize, paragraphs:
[['in 1928 , m @-@ 111 was assigned to a route connecting m @-@ 13 ( later signed as us 23 for a time ) north of bay city to bay city state park on saginaw bay', 'the original route consisted of what is today euclid avenue', 'in the early 1930s , a return leg towards bay city was added to the east of the original route along what is now state park road , giving the route an upside @-@ down @-@ u <unk> 1933 , the western leg along euclid avenue from midland road to beaver road was designated as m @-@ 47', 'in 1938 , all of m @-@ 111 was re @-@ designated as m @-@ 47 — thus making m @-@ 47 double back to bay city .']]
after tokenize, paragraphs:
[[['in', '1928', ',', 'm', '@-@', '111', 'was', 'assigned', 'to', 'a', 'route', 'connecting', 'm', '@-@', '13', '(', 'later', 'signed', 'as', 'us', '23', 'for', 'a', 'time', ')', 'north', 'of', 'bay', 'city', 'to', 'bay', 'city', 'state', 'park', 'on', 'saginaw', 'bay'], ['the', 'original', 'route', 'consisted', 'of',
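
The shapes that the loop above prints follow directly from `batch_size` and `max_len`. The sketch below computes them without building the dataset; `expected_batch_shapes` is a hypothetical helper, and it assumes a full minibatch (a batch is smaller when the dataset holds fewer than `batch_size` examples).

```python
def expected_batch_shapes(batch_size, max_len):
    """Shapes of one full minibatch, in the order yielded by the iterator."""
    max_num_mlm_preds = round(max_len * 0.15)
    return {
        'tokens_X': (batch_size, max_len),
        'segments_X': (batch_size, max_len),
        'valid_lens_x': (batch_size,),
        'pred_positions_X': (batch_size, max_num_mlm_preds),
        'mlm_weights_X': (batch_size, max_num_mlm_preds),
        'mlm_Y': (batch_size, max_num_mlm_preds),
        'nsp_y': (batch_size,),
    }


shapes = expected_batch_shapes(batch_size=512, max_len=64)
for name, shape in shapes.items():
    print(name, shape)
```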

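Finally, let's take a look at the vocabulary size. On the full corpus, even after filtering out infrequent tokens, it is still over twice as large as that of the PTB dataset. Note that `load_data_wiki` above builds the dataset from the first paragraph only, so the value printed below reflects that one-paragraph subset rather than the full corpus.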


In [258]:
len(vocab)

11

## Summary

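* Compared with the PTB dataset, the WikiText-2 dataset retains the original punctuation, case, and numbers, and is over twice as large.
* We can arbitrarily access the pretraining (masked language modeling and next sentence prediction) examples generated from a pair of sentences from the WikiText-2 corpus.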

## Exercises

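1. For simplicity, the period is used as the only delimiter for splitting sentences. Try other sentence splitting techniques, such as spaCy and NLTK. Take NLTK as an example. You need to install NLTK first: `pip install nltk`. In the code, first `import nltk`. Then, download the Punkt sentence tokenizer: `nltk.download('punkt')`. To split sentences such as `sentences = 'This is great ! Why not ?'`, invoking `nltk.tokenize.sent_tokenize(sentences)` will return a list of two sentence strings: `['This is great !', 'Why not ?']`.
1. What is the vocabulary size if we do not filter out any infrequent tokens?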
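
As a possible starting point for the first exercise that needs only the standard library, the sketch below splits after `.`, `!`, or `?` with a regular expression. The helper `split_sentences` is an assumption for illustration; it is much cruder than Punkt and will, for instance, split inside names that contain these characters.

```python
import re


def split_sentences(text):
    """Split whitespace-tokenized text after '.', '!' or '?'."""
    # The lookbehind keeps the punctuation attached to its sentence
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


result = split_sentences('This is great ! Why not ?')
print(result)  # ['This is great !', 'Why not ?']
```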
