# RACE数据集，英语阅读理解多项选择QA

1. 使用RoBERTa，基于multichoice头，完成模型微调（不同类型的QA任务是不同的）
2. 尝试更换为MLM任务进行模型微调
3. 了解DataCollator
4. 自定义Pipeline


# 加载数据集
使用 datasets 库加载 imdb 数据集。这个库会自动下载并缓存数据。
也可以下载到本地，这样可以防止网络问题导致代码执行失败（尽管已经下载到缓存）


In [1]:
from datasets import load_dataset

dataset_name = "/data/Dataset/LLM-dataset/race"
raw_dataset = load_dataset(dataset_name, "all")

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
'''可以直接print 看一下数据集的概况'''
print(raw_dataset)

'''由于返回的是train 和 test 两个split 也可以直接像字典一样进行索引'''
print(type(raw_dataset['train']))
print(raw_dataset['train'])

print(raw_dataset['train'][0])

DatasetDict({
    test: Dataset({
        features: ['example_id', 'article', 'answer', 'question', 'options'],
        num_rows: 4934
    })
    train: Dataset({
        features: ['example_id', 'article', 'answer', 'question', 'options'],
        num_rows: 87866
    })
    validation: Dataset({
        features: ['example_id', 'article', 'answer', 'question', 'options'],
        num_rows: 4887
    })
})
<class 'datasets.arrow_dataset.Dataset'>
Dataset({
    features: ['example_id', 'article', 'answer', 'question', 'options'],
    num_rows: 87866
})
{'example_id': 'high19088.txt', 'article': 'Last week I talked with some of my students about what they wanted to do after they graduated, and what kind of job prospects  they thought they had.\nGiven that I teach students who are training to be doctors, I was surprised do find that most thought that they would not be able to get the jobs they wanted without "outside help". "What kind of help is that?" I asked, expecting them to tell me th

# 数据预处理 (Tokenization)
这里的处理稍微有所不一样，针对Decoder only的模型，我们期望其对答案进行生成
1. 我们需要构造：article + question + options + answer的Prompt
2. 对于labels，我们不计算answer之前的token预测loss，专注于答案生成的loss

In [216]:
from transformers import GPT2Tokenizer

model_checkpoint = "/data/Weights/gpt2/gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_checkpoint)



def preprocess(instances):
    article = instances["article"]
    question = instances["question"]
    options = instances["options"]
    answer = instances['answer']

    results = {
        'input_ids': [],
        'attention_mask': [],
        'labels': []
    }
    for art,ques,opt,ans in zip(article,question,options,answer):
        opt_str = f"\nA:{opt[0]}\nB:{opt[1]}\nC:{opt[2]}\nD:{opt[3]}"
        prompt = f"contex:{art} \nquestion:{ques} \noptions:{opt_str}. \nPlease select the best option for the question based on the content in the context, answer:"        

        tokenized_context = tokenizer(
            prompt,
            max_length=512,
            padding=False,            
            truncation="only_first",  
        ) 
        # answer之前的loss，不计算
        labels = [-100]*len(tokenized_context['input_ids'])
        
        # 对Answer进行encode，并补充进labels
        tokenized_ans = tokenizer(
            ans,
            max_length=512,
            padding=False,            
            truncation="only_first",  
        ) 
        labels.extend(tokenized_ans['input_ids'])
        
        for k in tokenized_context.keys():
            tokenized_context[k].extend(tokenized_ans[k])
            
        results['input_ids'].append(tokenized_context['input_ids'])
        results['attention_mask'].append(tokenized_context['attention_mask'])
        results['labels'].append(labels)
    
    return results

c = preprocess(raw_dataset["train"][0:4])

print(c)
print(f"shape of input_ids :({len(c['input_ids'])},{len(c['input_ids'][0])})")


{'input_ids': [[1102, 16886, 25, 5956, 1285, 314, 6619, 351, 617, 286, 616, 2444, 546, 644, 484, 2227, 284, 466, 706, 484, 18303, 11, 290, 644, 1611, 286, 1693, 13285, 220, 484, 1807, 484, 550, 13, 198, 15056, 326, 314, 4545, 2444, 508, 389, 3047, 284, 307, 7519, 11, 314, 373, 6655, 466, 1064, 326, 749, 1807, 326, 484, 561, 407, 307, 1498, 284, 651, 262, 3946, 484, 2227, 1231, 366, 43435, 1037, 1911, 366, 2061, 1611, 286, 1037, 318, 326, 1701, 314, 1965, 11, 12451, 606, 284, 1560, 502, 326, 484, 561, 761, 257, 220, 220, 393, 1641, 1545, 284, 1037, 606, 503, 13, 198, 1, 14214, 7076, 42911, 530, 8712, 13, 198, 40, 373, 2495, 32064, 416, 326, 2882, 13, 632, 2331, 326, 262, 19087, 286, 1909, 389, 6481, 4684, 284, 467, 739, 262, 9845, 284, 651, 4058, 286, 1854, 618, 340, 2058, 284, 1972, 257, 1693, 764, 198, 3198, 2576, 1297, 502, 326, 673, 373, 6402, 8185, 284, 2620, 607, 6001, 13, 366, 2990, 2270, 534, 7405, 11, 1234, 287, 2041, 16610, 23742, 11, 290, 6364, 4292, 262, 7625, 1022, 262, 734

In [70]:
print(tokenizer.decode(c["input_ids"][0]))

contex:Last week I talked with some of my students about what they wanted to do after they graduated, and what kind of job prospects  they thought they had.
Given that I teach students who are training to be doctors, I was surprised do find that most thought that they would not be able to get the jobs they wanted without "outside help". "What kind of help is that?" I asked, expecting them to tell me that they would need a   or family friend to help them out.
"Surgery ," one replied.
I was pretty alarmed by that response. It seems that the graduates of today are increasingly willing to go under the knife to get ahead of others when it comes to getting a job .
One girl told me that she was considering surgery to increase her height. "They break your legs, put in special extending screws, and slowly expand the gap between the two ends of the bone as it re-grows, you can get at least 5 cm taller!"
At that point, I was shocked. I am short, I can't deny that, but I don't think I would put my

In [71]:
# 进行批处理，这里我们直接在map函数中将文本信息全部删除即可
tokenized_dataset = raw_dataset.map(preprocess, batched=True,remove_columns=raw_dataset["train"].column_names, num_proc=4)

Map (num_proc=4): 100%|██████████| 4934/4934 [00:04<00:00, 1160.78 examples/s]
Map (num_proc=4): 100%|██████████| 87866/87866 [01:01<00:00, 1428.62 examples/s]
Map (num_proc=4): 100%|██████████| 4887/4887 [00:04<00:00, 1179.60 examples/s]


In [72]:
tokenized_dataset['train'][0]

{'input_ids': [1102,
  16886,
  25,
  5956,
  1285,
  314,
  6619,
  351,
  617,
  286,
  616,
  2444,
  546,
  644,
  484,
  2227,
  284,
  466,
  706,
  484,
  18303,
  11,
  290,
  644,
  1611,
  286,
  1693,
  13285,
  220,
  484,
  1807,
  484,
  550,
  13,
  198,
  15056,
  326,
  314,
  4545,
  2444,
  508,
  389,
  3047,
  284,
  307,
  7519,
  11,
  314,
  373,
  6655,
  466,
  1064,
  326,
  749,
  1807,
  326,
  484,
  561,
  407,
  307,
  1498,
  284,
  651,
  262,
  3946,
  484,
  2227,
  1231,
  366,
  43435,
  1037,
  1911,
  366,
  2061,
  1611,
  286,
  1037,
  318,
  326,
  1701,
  314,
  1965,
  11,
  12451,
  606,
  284,
  1560,
  502,
  326,
  484,
  561,
  761,
  257,
  220,
  220,
  393,
  1641,
  1545,
  284,
  1037,
  606,
  503,
  13,
  198,
  1,
  14214,
  7076,
  42911,
  530,
  8712,
  13,
  198,
  40,
  373,
  2495,
  32064,
  416,
  326,
  2882,
  13,
  632,
  2331,
  326,
  262,
  19087,
  286,
  1909,
  389,
  6481,
  4684,
  284,
  467,
  739,
  262,
 

In [73]:
# 设置数据集格式为 PyTorch tensors (如果使用 TensorFlow 则设置为 "tf")
tokenized_dataset.set_format("torch")

`DataCollatorForLanguageModeling` 是用于处理输入数据的动态填充（padding）和标签（labels）的生成，其会复制input_ids 作为labels，这里与我们的操作有所不一样，所有要自定义一个dataCollator

In [None]:
import torch
from transformers import DataCollatorForLanguageModeling
tokenizer.pad_token = tokenizer.eos_token

class QADataCollator(DataCollatorForLanguageModeling):
    def __call__(self, features):
        batch = self.tokenizer.pad(
            {"input_ids": [f["input_ids"] for f in features]},
            padding=True,
            return_tensors="pt",
        )
        
        labels = [f["labels"] for f in features]
        padded_labels = []
        for i, label_seq in enumerate(labels):
            # 计算需要填充的长度
            pad_len = batch["input_ids"].shape[1] - len(label_seq)
            # 在 labels 的填充部分用 -100（PyTorch 会忽略这些位置的loss）
            padded_seq = torch.cat([
                torch.tensor(label_seq),
                torch.full((pad_len,), -100, dtype=torch.long)
            ])
            padded_labels.append(padded_seq)
        batch["labels"] = torch.stack(padded_labels)
        
        return batch

data_collator = QADataCollator(
    tokenizer=tokenizer,
    mlm=False,
    pad_to_multiple_of=8,
    return_tensors="pt"
)

In [101]:
# 传入四个数据进行查看
batch = data_collator([tokenized_dataset['train'][i] for i in range(4)])
batch

  torch.tensor(label_seq),


{'input_ids': tensor([[ 1102, 16886,    25,  ..., 50256, 50256, 50256],
        [ 1102, 16886,    25,  ..., 50256, 50256, 50256],
        [ 1102, 16886,    25,  ...,  3280,    25,    35],
        [ 1102, 16886,    25,  ..., 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0]]), 'labels': tensor([[-100, -100, -100,  ..., -100, -100, -100],
        [-100, -100, -100,  ..., -100, -100, -100],
        [-100, -100, -100,  ..., -100, -100,   35],
        [-100, -100, -100,  ..., -100, -100, -100]])}

#  加载RoBERTa预训练模型
加载带有适合下游任务头部的模型。对于文本分类，我们使用 AutoModelForSequenceClassification。其就是在最后加了层MLP

In [102]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
import numpy as np
import evaluate

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
print(model)


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)


In [228]:
def inference(instances, model):
    article = instances["article"]
    question = instances["question"]
    options = instances["options"]
    answer = instances['answer']
    model = model.to('cuda')
    
    opt_str = f"\nA:{options[0]}\nB:{options[1]}\nC:{options[2]}\nD:{options[3]}"
    prompt = f"contex:{article} \nquestion:{question} \noptions:{opt_str}. \nPlease select the best option for the question based on the content in the context, answer:"     
    results = tokenizer(
        prompt,
        max_length=512,
        padding=False,            
        truncation="only_first",  
        return_tensors="pt"
    )    
    
    outputs = model.generate(
                        input_ids=results["input_ids"].to(model.device),
                        attention_mask=results["attention_mask"].to(model.device),
                        num_beams=1,
                        max_new_tokens=1,  # 最大生成长度
                        num_return_sequences=1,  # 返回1个候选
                        do_sample=False,  # 启用随机采样
                        top_p=1.0,                     
                        temperature=1.0, 
                        pad_token_id = tokenizer.eos_token_id, 
                        eos_token_id = tokenizer.eos_token_id,
                        )
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    return answer


In [214]:
respone = inference(raw_dataset['train'][2],model)
print(respone)

contex:Last week I talked with some of my students about what they wanted to do after they graduated, and what kind of job prospects  they thought they had.
Given that I teach students who are training to be doctors, I was surprised do find that most thought that they would not be able to get the jobs they wanted without "outside help". "What kind of help is that?" I asked, expecting them to tell me that they would need a   or family friend to help them out.
"Surgery ," one replied.
I was pretty alarmed by that response. It seems that the graduates of today are increasingly willing to go under the knife to get ahead of others when it comes to getting a job .
One girl told me that she was considering surgery to increase her height. "They break your legs, put in special extending screws, and slowly expand the gap between the two ends of the bone as it re-grows, you can get at least 5 cm taller!"
At that point, I was shocked. I am short, I can't deny that, but I don't think I would put my

In [157]:
dummy_input = tokenized_dataset['train'][0]
for k, v in dummy_input.items():
    dummy_input[k] = v.unsqueeze(0)
dummy_input

{'input_ids': tensor([[ 1102, 16886,    25,  5956,  1285,   314,  6619,   351,   617,   286,
            616,  2444,   546,   644,   484,  2227,   284,   466,   706,   484,
          18303,    11,   290,   644,  1611,   286,  1693, 13285,   220,   484,
           1807,   484,   550,    13,   198, 15056,   326,   314,  4545,  2444,
            508,   389,  3047,   284,   307,  7519,    11,   314,   373,  6655,
            466,  1064,   326,   749,  1807,   326,   484,   561,   407,   307,
           1498,   284,   651,   262,  3946,   484,  2227,  1231,   366, 43435,
           1037,  1911,   366,  2061,  1611,   286,  1037,   318,   326,  1701,
            314,  1965,    11, 12451,   606,   284,  1560,   502,   326,   484,
            561,   761,   257,   220,   220,   393,  1641,  1545,   284,  1037,
            606,   503,    13,   198,     1, 14214,  7076, 42911,   530,  8712,
             13,   198,    40,   373,  2495, 32064,   416,   326,  2882,    13,
            632,  2331,   3

In [158]:
# 可以看看输出
import torch

dummy_input = tokenized_dataset['train'][0]
for k, v in dummy_input.items():
    dummy_input[k] = v.unsqueeze(0)

with torch.no_grad():
    outputs = model(**dummy_input)

print(outputs.logits.shape)

torch.Size([1, 462, 50257])


#  定义评估指标
这里做的是语言建模，是不需要compute_metrics

# GPT2模型训练

In [221]:
training_args = TrainingArguments(
    output_dir="./results/GPT2-epoch5",              # 输出目录，保存模型和日志
    eval_strategy="epoch",         # 每个 epoch 结束后进行评估
    save_strategy="epoch",               # 每个 epoch 结束后保存模型
    learning_rate=2e-5,                  # 学习率
    per_device_train_batch_size=16,      # 训练批次大小
    per_device_eval_batch_size=16,       # 评估批次大小
    num_train_epochs=5,                  # 训练轮数
    weight_decay=0.01,                   # 权重衰减
    load_best_model_at_end=True,         # 训练结束后加载最佳模型
    metric_for_best_model="eval_loss",
    report_to="tensorboard",             # 可以选择 tensorboard, wandb 等
    save_total_limit=2,
    label_names = ["labels"],
)

这里我们使用P-tuning v2进行微调

In [166]:
from peft import PrefixTuningConfig, get_peft_model, TaskType, PeftModel

# 定义 Prefix Tuning 配置（P-Tuning v2）
peft_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    num_virtual_tokens=20,         # prefix 长度，常用 10-50
    encoder_hidden_size=768,       # GPT-2 hidden size
)

model = get_peft_model(model, peft_config)

# 记得解冻lm_head的梯度计算
for name, param in model.named_parameters():
    if 'lm_head' in name:
        param.requires_grad = True

In [222]:
print(model)

PeftModelForCausalLM(
  (base_model): GPT2LMHeadModel(
    (transformer): GPT2Model(
      (wte): Embedding(50257, 768)
      (wpe): Embedding(1024, 768)
      (drop): Dropout(p=0.1, inplace=False)
      (h): ModuleList(
        (0-11): 12 x GPT2Block(
          (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (attn): GPT2Attention(
            (c_attn): Conv1D(nf=2304, nx=768)
            (c_proj): Conv1D(nf=768, nx=768)
            (attn_dropout): Dropout(p=0.1, inplace=False)
            (resid_dropout): Dropout(p=0.1, inplace=False)
          )
          (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): GPT2MLP(
            (c_fc): Conv1D(nf=3072, nx=768)
            (c_proj): Conv1D(nf=768, nx=3072)
            (act): NewGELUActivation()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    )
    (lm_head): Linear(in_features=768,

In [223]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
)

In [219]:
# 查看一下初始精度，并验证能否正常运行
evalres = trainer.evaluate()
print(evalres)

  torch.tensor(label_seq),


{'eval_loss': 4.26106071472168, 'eval_runtime': 77.4211, 'eval_samples_per_second': 63.122, 'eval_steps_per_second': 1.976}


In [224]:
# 4090 * 2 训练约3h
trainer.train()

  torch.tensor(label_seq),


Epoch,Training Loss,Validation Loss
1,0.8457,1.590944


  torch.tensor(label_seq),


KeyboardInterrupt: 

# 模型推理

这里测试一下能否正常生成答案

In [232]:
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
model = PeftModel.from_pretrained(model, "./results/GPT2-epoch5/checkpoint-2746")

respone = inference(raw_dataset['test'][8],model)
print(respone)
print(f"\nthe GT: {raw_dataset['test'][8]['answer']}")

contex:Little Tommy was doing very badly in math. His parents had tried everything--tutors, cards, special learning centers--in short, everything they could think of. Finally they took Tommy to a catholic  school.
After the first day, little Tommy came home with a very serious look on his face. He didn't kiss his mother hello. Instead, he went straight to his room and started studying. Books and papers were spread out all over the room and little Tommy was hard at work. His mother was surprised. She called him down to dinner and as soon as he finished eating, he went back to his room, without a word. In no time he was back hitting the books as hard as before. This went on for some time, day after day while the mother tried to understand what was happening.
Finally, little Tommy brought home his report card. He quietly put it on the table and went up to his room and hit the books. His mom looked at it and to her surprise, little Tommy got an A in math. She could no longer hold her curio

接下来我们写一个函数来测试一下准确率

In [250]:
def eval(model, dataset):
    right=0.
    for idx,instance in enumerate(dataset):
        gt = instance["answer"].lower()
        respone = inference(instance, model)
        pred = respone[-1].lower()
        if pred==gt:
            right +=1
    print(f"accurancy: {right/len(dataset)}")
        


In [251]:
eval(model, raw_dataset["test"])

accurancy: 0.25678962302391567
