Modified version of the Hugging Face tutorial.  
https://huggingface.co/learn/nlp-course/en/chapter7/6?fw=pt

# Recipe
1. Data pre-processing: Gathering, filtering, splitting and tokenizing.
2. Randomly initializing a model
3. Configuring training arguments
4. Training
5. Verifying


# Gathering the data
Training on the full corpus is time- and compute-consuming, and we only need the subset of the dataset concerned with the Python data science stack (`pandas`, `scikit-learn`, `matplotlib`, `seaborn`). So, let’s start by filtering the codeparrot dataset for all files that use any of these libraries. Because of the dataset’s size, we want to avoid downloading it; instead, we’ll use the streaming feature to filter it on the fly. The following function helps us find the code samples that mention those libraries:

The dataset is too large to use in full, so we filter it.

In [None]:
def any_keyword_in_string(string, keywords):
    for keyword in keywords:
        if keyword in string:
            return True
    return False
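
As a quick sanity check, the helper can be tried on two toy strings. This is a minimal sketch: the function is restated with `any()` so the block runs on its own, and `example_1` and `example_2` are illustrative inputs rather than dataset samples.

```python
def any_keyword_in_string(string, keywords):
    """Return True if at least one keyword occurs as a substring of `string`."""
    return any(keyword in string for keyword in keywords)


filters = ["pandas", "sklearn", "matplotlib", "seaborn"]
example_1 = "import numpy as np"   # mentions none of the keywords
example_2 = "import pandas as pd"  # contains "pandas"

print(any_keyword_in_string(example_1, filters), any_keyword_in_string(example_2, filters))
```

Note that this is a plain substring test, so a keyword that only appears in a comment or a string literal also counts as a match.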

In [None]:
from collections import defaultdict
from tqdm import tqdm
from datasets import Dataset

def filter_streaming_dataset(dataset, filters):
    filtered_dict = defaultdict(list)
    total = 0
    # Streaming loop
    # Iterate over the streaming dataset explicitly, wrapped in a tqdm progress bar
    for sample in tqdm(iter(dataset)): 
        total += 1
        if any_keyword_in_string(sample["content"], filters):
            for k, v in sample.items():
                filtered_dict[k].append(v) # append each feature's value to its column list
    print(f"{len(filtered_dict['content'])/total:.2%} of data after filtering.")
    return Dataset.from_dict(filtered_dict) # build a Dataset object from the collected columns

In [None]:
# This cell will take a very long time to execute, so you should skip it and go to
# the next one!
from datasets import load_dataset

split = "train"  # "valid"
filters = ["pandas", "sklearn", "matplotlib", "seaborn"]
# Load the data in streaming mode.
data = load_dataset(f"transformersbook/codeparrot-{split}", split=split, streaming=True)
filtered_data = filter_streaming_dataset(data, filters)

Filtering the full dataset can take 2-3h depending on your machine and bandwidth. If you don’t want to go through this lengthy process yourself, we provide the filtered dataset on the Hub for you to download:



In [None]:
from datasets import load_dataset, DatasetDict
# Load the training and validation data separately via the `split` argument.
# The filtered data is already divided into train and validation; if you build your own dataset, you need to create the splits yourself.
ds_train = load_dataset("huggingface-course/codeparrot-ds-train", split="train")
ds_valid = load_dataset("huggingface-course/codeparrot-ds-valid", split="validation")
# Construct a DatasetDict object from the two Dataset objects.
raw_datasets = DatasetDict(
    {
        "train": ds_train,  # .shuffle().select(range(50000)),
        "valid": ds_valid,  # .shuffle().select(range(500))
    }
)
raw_datasets

Let’s look at an example from the dataset. We’ll just show the first 200 characters of each field:

In [None]:
for key in raw_datasets["train"][0]: # raw_datasets["train"][0] is a dictionary
    print(f"{key.upper()}: {raw_datasets['train'][0][key][:200]}")

# Preparing the dataset

Most documents contain many more than 128 tokens, so simply truncating the inputs to the maximum length would eliminate a large fraction of our dataset. Instead, we’ll use the `return_overflowing_tokens` option to tokenize the whole input and split it into several chunks, as we did in Chapter 6. We’ll also use the `return_length` option to return the length of each created chunk automatically. Often the last chunk will be smaller than the context size, and we’ll get rid of these pieces to avoid padding issues; we don’t really need them as we have plenty of data anyway.



In [None]:
from transformers import AutoTokenizer

context_length = 128
tokenizer = AutoTokenizer.from_pretrained("huggingface-course/code-search-net-tokenizer")

outputs = tokenizer(
    raw_datasets["train"][:2]["content"], # Take the content of the first two samples.
    truncation=True,
    max_length=context_length,
    return_overflowing_tokens=True, # Split a long input into chunks instead of discarding the overflow.
    return_length=True,
)

print(f"Input IDs length: {len(outputs['input_ids'])}") # 34 chunks
print(f"Input chunk lengths: {(outputs['length'])}")
# With the overflow_to_sample_mapping field, we can also reconstruct which chunks belonged to which input samples.
# It works like a segment mask: it marks which input string each chunk came from.
print(f"Chunk mapping: {outputs['overflow_to_sample_mapping']}")

Use the `.map()` function to process the dataset in batches.

In [None]:
def tokenize(element):
    outputs = tokenizer(
        element["content"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    for chunk_size, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if chunk_size == context_length: # drop chunks shorter than the context length
            input_batch.append(input_ids)
    return {"input_ids": input_batch} # return a dictionary
# Tokenize in batches and drop all the original columns.
# After this step only input_ids remain; there is no attention mask.
tokenized_datasets = raw_datasets.map(
    tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
)
tokenized_datasets

✏️ Try it out! Getting rid of all the chunks that are smaller than the context size wasn’t a big issue here because we’re using small context windows. As you increase the context size (or if you have a corpus of short documents), the fraction of chunks that are thrown away will also grow. A more efficient way to prepare the data is to join all the tokenized samples in a batch with an `eos_token_id` token in between, and then perform the chunking on the concatenated sequences. As an exercise, modify the `tokenize()` function to make use of that approach. Note that you’ll want to set `truncation=False` and remove the other arguments from the tokenizer to get the full sequence of token IDs.

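Hint: the larger the context window, the more data is thrown away. In that case, tokenize the texts first, append an `eos_token_id` to the end of each `input_ids` sequence, concatenate the samples, and only then perform the chunking.
Note that truncation must be turned off to get the full sequence of token IDs for each text.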
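
The concatenate-then-chunk idea can be sketched in plain Python. This is one possible approach under stated assumptions, not the tutorial's reference solution: `concat_and_chunk` is a hypothetical helper name, and the token IDs below are toy values rather than real tokenizer output.

```python
def concat_and_chunk(tokenized_samples, eos_token_id, context_length):
    """Join samples with an EOS separator, then cut the stream into fixed-size chunks."""
    stream = []
    for input_ids in tokenized_samples:
        stream.extend(input_ids)
        stream.append(eos_token_id)  # mark the boundary between documents
    # Keep only full chunks, so at most one partial chunk per batch is dropped.
    n_full = len(stream) // context_length
    return [stream[i * context_length:(i + 1) * context_length] for i in range(n_full)]


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # toy token IDs (hypothetical)
chunks = concat_and_chunk(samples, eos_token_id=0, context_length=4)
print(chunks)
```

Inside `tokenize()`, the samples would come from `tokenizer(element["content"], truncation=False)["input_ids"]` and the separator from `tokenizer.eos_token_id`. With this scheme short documents are no longer discarded; they are packed together with their neighbours.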

# Initializing a new model
Our first step is to freshly initialize a GPT-2 model. We’ll use the same configuration for our model as for the small GPT-2 model, so we load the pretrained configuration, make sure that the tokenizer size matches the model vocabulary size and pass the bos and eos (beginning and end of sequence) token IDs:

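How do we get a randomly initialized model? Simply build a model object from the pretrained model's configuration (its `config.json`) instead of loading its weights.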


In [None]:
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig
# Load the pretrained GPT-2 configuration and override some of its values,
# such as vocab_size and the context window size n_ctx.
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

In [None]:
model = GPT2LMHeadModel(config) # Randomly initialize a model from the configuration.
model_size = sum(t.numel() for t in model.parameters()) # Count the model parameters.
print(f"GPT-2 size: {model_size/1000**2:.1f}M parameters")
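
To see where the parameter count comes from, it can be reproduced by hand from the architecture. The sketch below assumes the small GPT-2 layout (12 layers, hidden size 768, 1024 positions, output head tied to the input embeddings) and a vocabulary of 50,000 tokens. The vocabulary size is an assumption for illustration (use `len(tokenizer)` in practice), and `gpt2_param_count` is a hypothetical helper, not a library function.

```python
def gpt2_param_count(vocab_size, n_positions=1024, n_embd=768, n_layer=12):
    """Count trainable parameters of a GPT-2 style LM with tied input/output embeddings."""
    embeddings = vocab_size * n_embd + n_positions * n_embd  # token + position tables
    layer_norm = 2 * n_embd  # weight + bias
    # Attention: fused QKV projection plus output projection (weights + biases).
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back (weights + biases).
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    per_layer = 2 * layer_norm + attention + mlp
    return embeddings + n_layer * per_layer + layer_norm  # plus the final layer norm


total = gpt2_param_count(vocab_size=50_000)  # vocabulary size is an assumption
print(f"{total / 1000**2:.1f}M parameters")
```

Under these assumptions, the token embedding table alone accounts for roughly a third of the parameters.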

## Data collator
Before we can start training, we need to set up a data collator that will take care of creating the batches. We can use the `DataCollatorForLanguageModeling` collator, which is designed specifically for language modeling (as the name subtly suggests). Besides stacking and padding batches, it also takes care of creating the language model labels — in causal language modeling the inputs serve as labels too (just shifted by one element), and this data collator creates them on the fly during training so we don’t need to duplicate the `input_ids`.

Note that `DataCollatorForLanguageModeling` supports both `masked language modeling (MLM)` and `causal language modeling (CLM)`. By default it prepares data for MLM, but we can switch to CLM by setting the argument `mlm=False`:

Note that the data collator does several jobs: it pads dynamically, builds the batches, and creates the labels.

In [None]:
from transformers import DataCollatorForLanguageModeling
# DataCollatorForLanguageModeling also returns an attention mask for the input token IDs.
# Set up the data collator for causal language modeling.
tokenizer.pad_token = tokenizer.eos_token
# mlm=False turns off masked language modeling and switches to causal language modeling.
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False) 

In [None]:
# Test the data collator.
# The dataset contains only input_ids, so the collator also produces the attention mask for us.
# The input to data_collator is a batch, given as a list of examples.
out = data_collator([tokenized_datasets["train"][i] for i in range(5)]) 
for key in out:
    print(f"{key} shape: {out[key].shape}")
'''
input_ids shape: torch.Size([5, 128])
attention_mask shape: torch.Size([5, 128])
labels shape: torch.Size([5, 128])
'''
'''
⚠️ Shifting the inputs and labels to align them happens inside the model, 
so the data collator just copies the inputs to create the labels.
'''
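
The alignment described in the warning above can be illustrated without a model. In this minimal sketch the labels are a plain copy of the inputs, as the collator produces them, and the shift is applied afterwards. The token IDs are toy values, and `next_token_pairs` is a hypothetical helper, not part of 🤗 Transformers.

```python
def next_token_pairs(input_ids, labels):
    """Pair the input token at each position with the label it is trained to predict."""
    # Position i predicts token i + 1, so drop the last input and the first label.
    return list(zip(input_ids[:-1], labels[1:]))


input_ids = [10, 11, 12, 13]  # toy token IDs (hypothetical)
labels = list(input_ids)      # the collator copies the inputs to create the labels
print(next_token_pairs(input_ids, labels))
```

A sequence of 128 tokens therefore yields 127 prediction targets, because the last token has no successor to predict.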

# Training
All that’s left to do is configure the training arguments and fire up the Trainer. We’ll use a cosine learning rate schedule with some warmup and an effective batch size of 256 (`per_device_train_batch_size * gradient_accumulation_steps`). Gradient accumulation is used when a single batch does not fit into memory, and incrementally builds up the gradient through several forward/backward passes. We’ll see this in action when we create the training loop with 🤗 Accelerate.
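
To make the gradient accumulation idea concrete, here is a toy sketch in plain Python. It assumes a one-parameter model `y_hat = w * x` with a mean squared error loss and made-up data, and it checks that averaging the gradients of several micro-batches reproduces the gradient of the full batch. The effective batch size computed at the top assumes a single device.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error with respect to w, for the model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


per_device_train_batch_size = 32
gradient_accumulation_steps = 8
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Toy data (hypothetical): 8 samples, processed as 4 micro-batches of 2.
xs = [float(i) for i in range(1, 9)]
ys = [2.0 * x for x in xs]
w = 0.5
micro_batch, accum_steps = 2, 4

full_grad = mse_grad(w, xs, ys)

accumulated = 0.0
for step in range(accum_steps):
    lo, hi = step * micro_batch, (step + 1) * micro_batch
    # Scale each micro-batch gradient so that the sum equals the full-batch mean.
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps

print(effective_batch_size, full_grad, accumulated)
```

The optimizer step is taken only once the loop finishes, so memory use is bounded by the micro-batch while the update behaves like one on the larger batch.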



In [None]:
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="codeparrot-ds", # model name and saving directory
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps", # evaluate by step count; can also be "epoch" (newer transformers releases name this argument eval_strategy)
    eval_steps=5_000, # evaluate the model every 5000 steps
    logging_steps=5_000,
    gradient_accumulation_steps=8, # accumulate gradients over 8 passes before each update, for when a full batch does not fit in memory
    num_train_epochs=1, # number of training epochs; a single pass over the data
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    fp16=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
)
# No metric function is passed, so only the loss is computed during evaluation.
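
The shape of the learning rate schedule configured above (`lr_scheduler_type="cosine"`, `warmup_steps=1_000`, `learning_rate=5e-4`) can be sketched as a small function: linear warmup to the peak, then a half-cosine decay to zero. This is an illustration of the usual formula, not the library's code. `cosine_lr_with_warmup` is a hypothetical helper, and `total_steps=50_000` is an assumed value, since the real number depends on the dataset size and batch size.

```python
import math


def cosine_lr_with_warmup(step, peak_lr=5e-4, warmup_steps=1_000, total_steps=50_000):
    """Linear warmup to peak_lr, followed by a half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed so far, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 500, 1_000, 25_500, 50_000):
    print(step, f"{cosine_lr_with_warmup(step):.2e}")
```

The warmup avoids large updates while the randomly initialized weights are still far from any reasonable solution; the cosine decay then lowers the rate smoothly towards the end of training.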

Now we can just start the Trainer and wait for training to finish. Depending on whether you run it on the full or a subset of the training set this will take 20 or 2 hours, respectively, so grab a few coffees and a good book to read!

💡 If you have access to a machine with multiple GPUs, try to run the code there. The Trainer automatically handles multiple GPUs, and this can speed up training tremendously.



In [None]:
trainer.train()

# Code generation with a pipeline

Now is the moment of truth: let’s see how well the trained model actually works! We can see in the logs that the loss went down steadily, but to put the model to the test let’s take a look at how well it works on some prompts. To do that we’ll wrap the model in a text generation pipeline, and we’ll put it on the GPU for fast generations if there is one available:

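The advantage of using a pipeline is that it samples from the model output automatically and runs the generation loop for us, so text generation works out of the box.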
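
The loop that a generation pipeline runs for us can be sketched with a toy stand-in for the model. Everything here is hypothetical: `TOY_TABLE` is a hand-written table of next-token scores, `toy_model` plays the role of the language model, and the sketch uses greedy decoding (always pick the best-scoring token) instead of sampling, so that the result is deterministic.

```python
def greedy_generate(next_token_scores, prompt_ids, eos_token_id, max_new_tokens=10):
    """Append the highest-scoring next token until EOS or the length limit is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)        # maps candidate token -> score
        next_id = max(scores, key=scores.get)  # greedy choice
        ids.append(next_id)
        if next_id == eos_token_id:
            break
    return ids


# Toy "model" (hypothetical): scores depend only on the last token.
TOY_TABLE = {1: {2: 0.9, 3: 0.1}, 2: {3: 0.8, 0: 0.2}, 3: {0: 0.7, 1: 0.3}}


def toy_model(ids):
    return TOY_TABLE[ids[-1]]


print(greedy_generate(toy_model, [1], eos_token_id=0))
```

A real pipeline additionally tokenizes the prompt before this loop and decodes the generated IDs back into text afterwards.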

In [None]:
import torch
from transformers import pipeline

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# Create a text-generation pipeline. The tutorial pushed the trained model to the Hugging Face Hub,
# so the model is given as a Hub repository name here; a local path works just as well.
pipe = pipeline(
    "text-generation", model="huggingface-course/codeparrot-ds", device=device
)

In [None]:
txt = """\
# create some data
x = np.random.randn(100)
y = np.random.randn(100)

# create scatter plot with x, y
"""
print(pipe(txt, num_return_sequences=1)[0]["generated_text"])