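# RACE Dataset: Multiple-Choice QA for English Reading Comprehension

1. Use GPT-2 to solve the multiple-choice QA task by generation
2. Fine-tune the model with P-Tuning v2
3. Use CoT (chain of thought) to enhance the model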

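# Loading the Dataset
Use the `datasets` library to load the RACE dataset. The library downloads and caches the data automatically.
The data can also be downloaded to a local directory, as done here, so that network problems cannot break the run (which can happen even when the data is already cached).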


In [1]:
from datasets import load_dataset

dataset_name = "/data/Dataset/LLM-dataset/race"
raw_dataset = load_dataset(dataset_name, "all")



In [2]:
# Print the DatasetDict for an overview of the dataset
print(raw_dataset)

# The result holds the train, validation, and test splits; index it like a dictionary
print(type(raw_dataset['train']))
print(raw_dataset['train'])

print(raw_dataset['train'][0])

DatasetDict({
    test: Dataset({
        features: ['example_id', 'article', 'answer', 'question', 'options'],
        num_rows: 4934
    })
    train: Dataset({
        features: ['example_id', 'article', 'answer', 'question', 'options'],
        num_rows: 87866
    })
    validation: Dataset({
        features: ['example_id', 'article', 'answer', 'question', 'options'],
        num_rows: 4887
    })
})
<class 'datasets.arrow_dataset.Dataset'>
Dataset({
    features: ['example_id', 'article', 'answer', 'question', 'options'],
    num_rows: 87866
})
{'example_id': 'high19088.txt', 'article': 'Last week I talked with some of my students about what they wanted to do after they graduated, and what kind of job prospects  they thought they had.\nGiven that I teach students who are training to be doctors, I was surprised do find that most thought that they would not be able to get the jobs they wanted without "outside help". "What kind of help is that?" I asked, expecting them to tell me th

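# Data Preprocessing (Tokenization)
The processing here differs slightly: with a decoder-only model, we want the model to generate the answer.
1. Build a prompt of the form article + question + options + answer.
2. For the labels, skip the prediction loss on every token before the answer and focus on the loss of generating the answer (computing the loss over all tokens is also worth trying).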
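
The label masking described in step 2 can be sketched in plain Python. This is a minimal illustration on toy token ids, not the notebook's actual preprocessing; `build_example` and `IGNORE_INDEX` are hypothetical names introduced here.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_example(prompt_ids: List[int], answer_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and answer, masking the prompt part of the labels."""
    input_ids = prompt_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return input_ids, labels


# Toy ids: four prompt tokens followed by one answer token.
input_ids, labels = build_example([1102, 16886, 3280, 25], [34])
print(input_ids)  # [1102, 16886, 3280, 25, 34]
print(labels)     # [-100, -100, -100, -100, 34]
```

To train on all tokens instead, the labels would simply be a copy of `input_ids`.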

In [None]:
from transformers import GPT2Tokenizer

model_checkpoint = "/data/Weights/gpt2/gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_checkpoint)
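# Set padding_side here: the QA task is solved by generation, and autoregressive
# models are usually left-padded so that pad tokens never sit between the prompt
# and the generated tokens.
tokenizer.padding_side = "left"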


def preprocess(instances):
    article = instances["article"]
    question = instances["question"]
    options = instances["options"]
    answer = instances['answer']

    results = {
        'input_ids': [],
        'attention_mask': [],
        'labels': []
    }
    
    for art,ques,opt,ans in zip(article,question,options,answer):
        # Build the prompt
        opt_str = f"\nA:{opt[0]}\nB:{opt[1]}\nC:{opt[2]}\nD:{opt[3]}"
        prompt = f"contex:{art} \nquestion:{ques} \noptions:{opt_str}. \nPlease select the best option for the question based on the content in the context, answer:"        

        tokenized_context = tokenizer(
            prompt,
            max_length=512,
            padding=False,            
            truncation="only_first",  
        ) 
        # Do not compute the loss on tokens before the answer: set their labels to -100
        labels = [-100]*len(tokenized_context['input_ids'])
        
        # Encode the answer and append it to the labels
        tokenized_ans = tokenizer(
            ans,
            max_length=512,
            padding=False,            
            truncation="only_first",  
        ) 
        labels.extend(tokenized_ans['input_ids'])
        
        for k in tokenized_context.keys():
            tokenized_context[k].extend(tokenized_ans[k])
            
        results['input_ids'].append(tokenized_context['input_ids'])
        results['attention_mask'].append(tokenized_context['attention_mask'])
        results['labels'].append(labels)
    
    return results

c = preprocess(raw_dataset["train"][0:4])

print(c)
print(f"shape of input_ids :({len(c['input_ids'])},{len(c['input_ids'][0])})")


{'input_ids': [[1102, 16886, 25, 5956, 1285, 314, 6619, 351, 617, 286, 616, 2444, 546, 644, 484, 2227, 284, 466, 706, 484, 18303, 11, 290, 644, 1611, 286, 1693, 13285, 220, 484, 1807, 484, 550, 13, 198, 15056, 326, 314, 4545, 2444, 508, 389, 3047, 284, 307, 7519, 11, 314, 373, 6655, 466, 1064, 326, 749, 1807, 326, 484, 561, 407, 307, 1498, 284, 651, 262, 3946, 484, 2227, 1231, 366, 43435, 1037, 1911, 366, 2061, 1611, 286, 1037, 318, 326, 1701, 314, 1965, 11, 12451, 606, 284, 1560, 502, 326, 484, 561, 761, 257, 220, 220, 393, 1641, 1545, 284, 1037, 606, 503, 13, 198, 1, 14214, 7076, 42911, 530, 8712, 13, 198, 40, 373, 2495, 32064, 416, 326, 2882, 13, 632, 2331, 326, 262, 19087, 286, 1909, 389, 6481, 4684, 284, 467, 739, 262, 9845, 284, 651, 4058, 286, 1854, 618, 340, 2058, 284, 1972, 257, 1693, 764, 198, 3198, 2576, 1297, 502, 326, 673, 373, 6402, 8185, 284, 2620, 607, 6001, 13, 366, 2990, 2270, 534, 7405, 11, 1234, 287, 2041, 16610, 23742, 11, 290, 6364, 4292, 262, 7625, 1022, 262, 734

In [5]:
print(tokenizer.decode(c["input_ids"][0]))

contex:Last week I talked with some of my students about what they wanted to do after they graduated, and what kind of job prospects  they thought they had.
Given that I teach students who are training to be doctors, I was surprised do find that most thought that they would not be able to get the jobs they wanted without "outside help". "What kind of help is that?" I asked, expecting them to tell me that they would need a   or family friend to help them out.
"Surgery ," one replied.
I was pretty alarmed by that response. It seems that the graduates of today are increasingly willing to go under the knife to get ahead of others when it comes to getting a job .
One girl told me that she was considering surgery to increase her height. "They break your legs, put in special extending screws, and slowly expand the gap between the two ends of the bone as it re-grows, you can get at least 5 cm taller!"
At that point, I was shocked. I am short, I can't deny that, but I don't think I would put my
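
One thing to watch in `preprocess`: the whole prompt is passed to the tokenizer as a single sequence, so, assuming the tokenizer's default right-side truncation, any prompt longer than 512 tokens loses its tail, which is where the question, the options, and the `answer:` cue are. A possible alternative is to tokenize the parts separately and trim only the article. The sketch below shows the idea on placeholder token ids; `fit_to_budget` and its parameters are hypothetical names, not part of the notebook.

```python
from typing import List


def fit_to_budget(
    article_ids: List[int],
    prefix_ids: List[int],
    suffix_ids: List[int],
    max_length: int,
) -> List[int]:
    """Trim only the article so that prefix + article + suffix fits in max_length."""
    room = max_length - len(prefix_ids) - len(suffix_ids)
    if room < 0:
        raise ValueError("prefix and suffix alone exceed max_length")
    return prefix_ids + article_ids[:room] + suffix_ids


prompt_ids = fit_to_budget(
    article_ids=list(range(100, 110)),  # 10 placeholder article tokens
    prefix_ids=[1],                     # stands for the "contex:" prefix
    suffix_ids=[7, 8, 9],               # stands for question, options and "answer:"
    max_length=8,
)
print(prompt_ids)  # [1, 100, 101, 102, 103, 7, 8, 9]
```

With this layout the answer cue is always the last part of the prompt, at the cost of dropping the end of long articles.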

In [36]:
# Batch-process every split; remove_columns in map drops all the raw text columns
tokenized_dataset = raw_dataset.map(preprocess, batched=True,remove_columns=raw_dataset["train"].column_names, num_proc=4)

Map (num_proc=4): 100%|██████████| 4934/4934 [00:04<00:00, 1144.41 examples/s]
Map (num_proc=4): 100%|██████████| 87866/87866 [01:02<00:00, 1411.50 examples/s]
Map (num_proc=4): 100%|██████████| 4887/4887 [00:04<00:00, 1180.32 examples/s]


In [37]:
tokenized_dataset['train'][0]

{'input_ids': [1102,
  16886,
  25,
  5956,
  1285,
  314,
  6619,
  351,
  617,
  286,
  616,
  2444,
  546,
  644,
  484,
  2227,
  284,
  466,
  706,
  484,
  18303,
  11,
  290,
  644,
  1611,
  286,
  1693,
  13285,
  220,
  484,
  1807,
  484,
  550,
  13,
  198,
  15056,
  326,
  314,
  4545,
  2444,
  508,
  389,
  3047,
  284,
  307,
  7519,
  11,
  314,
  373,
  6655,
  466,
  1064,
  326,
  749,
  1807,
  326,
  484,
  561,
  407,
  307,
  1498,
  284,
  651,
  262,
  3946,
  484,
  2227,
  1231,
  366,
  43435,
  1037,
  1911,
  366,
  2061,
  1611,
  286,
  1037,
  318,
  326,
  1701,
  314,
  1965,
  11,
  12451,
  606,
  284,
  1560,
  502,
  326,
  484,
  561,
  761,
  257,
  220,
  220,
  393,
  1641,
  1545,
  284,
  1037,
  606,
  503,
  13,
  198,
  1,
  14214,
  7076,
  42911,
  530,
  8712,
  13,
  198,
  40,
  373,
  2495,
  32064,
  416,
  326,
  2882,
  13,
  632,
  2331,
  326,
  262,
  19087,
  286,
  1909,
  389,
  6481,
  4684,
  284,
  467,
  739,
  262,
 

In [38]:
# Set the dataset format to PyTorch tensors (use "tf" for TensorFlow)
tokenized_dataset.set_format("torch")

`DataCollatorForLanguageModeling` handles dynamic padding of the inputs and label generation: it copies `input_ids` to use as `labels`. That differs from what we need here, so we define a custom data collator.

In [51]:
import torch
from transformers import DataCollatorForLanguageModeling
tokenizer.pad_token = tokenizer.eos_token

class QADataCollator(DataCollatorForLanguageModeling):
    def __call__(self, features):
        batch = self.tokenizer.pad(
            {"input_ids": [f["input_ids"] for f in features]},
            return_attention_mask=True,
            padding=True,
            return_tensors="pt",
        )
        
        labels = [f["labels"] for f in features]
        padded_labels = []
        for label_seq in labels:
            # Number of positions to pad on the left
            pad_len = batch["input_ids"].shape[1] - len(label_seq)
            # Fill the padded label positions with -100 so the loss ignores them
            padded_seq = torch.cat([
                torch.full((pad_len,), -100, dtype=torch.long),
                # as_tensor avoids the copy warning when label_seq is already a tensor
                torch.as_tensor(label_seq, dtype=torch.long),
            ])
            padded_labels.append(padded_seq)
        batch["labels"] = torch.stack(padded_labels)
        
        return batch

data_collator = QADataCollator(
    tokenizer=tokenizer,
    mlm=False,
    pad_to_multiple_of=8,
    return_tensors="pt"
)

In [52]:
# Pass in four examples to inspect the resulting batch
batch = data_collator([tokenized_dataset['train'][i] for i in range(4)])
batch



{'input_ids': tensor([[50256, 50256, 50256,  ...,  3280,    25,    34],
        [50256, 50256, 50256,  ...,  3280,    25,    34],
        [ 1102, 16886,    25,  ...,  3280,    25,    35],
        [50256, 50256, 50256,  ...,  3280,    25,    33]]), 'attention_mask': tensor([[0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1]]), 'labels': tensor([[-100, -100, -100,  ..., -100, -100,   34],
        [-100, -100, -100,  ..., -100, -100,   34],
        [-100, -100, -100,  ..., -100, -100,   35],
        [-100, -100, -100,  ..., -100, -100,   33]])}
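
The left-padding behaviour visible in this batch can be reproduced without any library. The following is a minimal pure-Python sketch of the same idea, not the collator itself; `left_pad_batch` is a hypothetical helper, and 50256 is the GPT-2 end-of-text id that appears as padding in the output above.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # padded label positions must not contribute to the loss


def left_pad_batch(features: List[Dict[str, List[int]]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Left-pad input_ids and labels to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad_len = max_len - len(f["input_ids"])
        batch["input_ids"].append([pad_id] * pad_len + f["input_ids"])
        batch["attention_mask"].append([0] * pad_len + [1] * len(f["input_ids"]))
        batch["labels"].append([IGNORE_INDEX] * pad_len + f["labels"])
    return batch


toy = [
    {"input_ids": [3280, 25, 34], "labels": [-100, -100, 34]},
    {"input_ids": [25, 35], "labels": [-100, 35]},
]
batch = left_pad_batch(toy, pad_id=50256)
print(batch["input_ids"])       # [[3280, 25, 34], [50256, 25, 35]]
print(batch["attention_mask"])  # [[1, 1, 1], [0, 1, 1]]
print(batch["labels"])          # [[-100, -100, 34], [-100, -100, 35]]
```

Because the padding goes on the left, the answer token stays in the last column of every row, which matches the printed batch.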

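# Loading the Pretrained GPT-2 Model
Load the model with a head suited to the downstream task. Since the answer is generated, we use `AutoModelForCausalLM`, which puts a linear language-modeling head (`lm_head`) on top of the transformer.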

In [41]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
import numpy as np
import evaluate

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
print(model)


GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)


In [42]:
def inference(instances, model):
    article = instances["article"]
    question = instances["question"]
    options = instances["options"]
    answer = instances['answer']
    model = model.to('cuda')
    
    opt_str = f"\nA:{options[0]}\nB:{options[1]}\nC:{options[2]}\nD:{options[3]}"
    prompt = f"contex:{article} \nquestion:{question} \noptions:{opt_str}. \nPlease select the best option for the question based on the content in the context, answer:"     
    results = tokenizer(
        prompt,
        max_length=512,
        padding=False,            
        truncation="only_first",  
        return_tensors="pt"
    )    
    
    outputs = model.generate(
                        input_ids=results["input_ids"].to(model.device),
                        attention_mask=results["attention_mask"].to(model.device),
                        num_beams=1,
                        max_new_tokens=1,  # generate a single token (the option letter)
                        num_return_sequences=1,  # return one candidate
                        do_sample=False,  # greedy decoding, no random sampling
                        top_p=1.0,                     
                        temperature=1.0, 
                        pad_token_id = tokenizer.eos_token_id, 
                        eos_token_id = tokenizer.eos_token_id,
                        )
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    return answer


Try it out: before fine-tuning, the model does not seem able to generate an answer.

In [43]:
response = inference(raw_dataset['train'][2], model)
print(response)

contex:Last week I talked with some of my students about what they wanted to do after they graduated, and what kind of job prospects  they thought they had.
Given that I teach students who are training to be doctors, I was surprised do find that most thought that they would not be able to get the jobs they wanted without "outside help". "What kind of help is that?" I asked, expecting them to tell me that they would need a   or family friend to help them out.
"Surgery ," one replied.
I was pretty alarmed by that response. It seems that the graduates of today are increasingly willing to go under the knife to get ahead of others when it comes to getting a job .
One girl told me that she was considering surgery to increase her height. "They break your legs, put in special extending screws, and slowly expand the gap between the two ends of the bone as it re-grows, you can get at least 5 cm taller!"
At that point, I was shocked. I am short, I can't deny that, but I don't think I would put my
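
Free generation can emit any token in the vocabulary. One possible way to guarantee that the prediction is always one of the four letters is to compare only the next-token logits of the option tokens. The sketch below shows that selection step on a toy logit vector; `pick_option` and `OPTION_TOKEN_IDS` are hypothetical names, and in the notebook the logits would have to come from the last position of the model output.

```python
from typing import Dict, Sequence

# GPT-2 token ids for the single-character tokens "A" to "D". The ids 33, 34 and 35
# appear as labels in the batch printed earlier; 32 for "A" is inferred from them.
OPTION_TOKEN_IDS: Dict[str, int] = {"A": 32, "B": 33, "C": 34, "D": 35}


def pick_option(
    next_token_logits: Sequence[float],
    option_ids: Dict[str, int] = OPTION_TOKEN_IDS,
) -> str:
    """Return the option letter whose token has the highest logit."""
    return max(option_ids, key=lambda letter: next_token_logits[option_ids[letter]])


# Toy logits over a 50257-token vocabulary: an unrelated token wins overall,
# but among the four option tokens "C" scores highest.
logits = [0.0] * 50257
logits[198] = 9.0  # an unrelated token with the overall highest score
logits[34] = 2.5   # "C"
logits[33] = 1.0   # "B"
print(pick_option(logits))  # C
```

Greedy decoding would pick token 198 here, whereas the restricted comparison still returns a valid option.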

In [53]:
# Inspect the shape of the model output
import torch

dummy_input = tokenized_dataset['train'][0]
for k, v in dummy_input.items():
    dummy_input[k] = v.unsqueeze(0).to(model.device)

with torch.no_grad():
    outputs = model(**dummy_input)

print(outputs.logits.shape)

torch.Size([1, 462, 50257])


# Defining Evaluation Metrics
This is language modeling, so `compute_metrics` is not needed.

# Training the GPT-2 Model

In [45]:
training_args = TrainingArguments(
    output_dir="./results/GPT2-epoch5",  # output directory for checkpoints and logs
    eval_strategy="epoch",               # evaluate at the end of every epoch
    save_strategy="epoch",               # save a checkpoint at the end of every epoch
    learning_rate=2e-5,                  # learning rate
    per_device_train_batch_size=16,      # training batch size per device
    per_device_eval_batch_size=16,       # evaluation batch size per device
    num_train_epochs=5,                  # number of training epochs
    weight_decay=0.01,                   # weight decay
    load_best_model_at_end=True,         # reload the best checkpoint after training
    metric_for_best_model="eval_loss",
    report_to="tensorboard",             # tensorboard, wandb, etc.
    save_total_limit=2,
    label_names=["labels"],
)

Here we fine-tune with P-Tuning v2.

In [46]:
from peft import PrefixTuningConfig, get_peft_model, TaskType, PeftModel

# Define the Prefix Tuning configuration (P-Tuning v2)
peft_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    num_virtual_tokens=20,         
    encoder_hidden_size=768,      
)

model = get_peft_model(model, peft_config)

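# Remember to re-enable gradient computation for lm_head.
# (These weights are not part of the prefix adapter, so it is worth checking
# that they are saved and reloaded together with the checkpoint.)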
for name, param in model.named_parameters():
    if 'lm_head' in name:
        param.requires_grad = True
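
To get a feel for what is being trained, the parameter counts can be estimated by hand. The arithmetic below is a rough sketch that assumes the prefix is stored directly as one key and one value vector per virtual token and per layer (no reparameterization MLP); the layer count, hidden size, and vocabulary size are read off the printed model.

```python
num_virtual_tokens = 20  # from the PrefixTuningConfig above
num_layers = 12          # number of GPT2Block modules in the printed model
hidden_size = 768        # embedding width in the printed model
vocab_size = 50257       # rows of wte / lm_head in the printed model

# One key vector and one value vector per virtual token and per layer.
prefix_params = num_virtual_tokens * num_layers * 2 * hidden_size
lm_head_params = vocab_size * hidden_size

print(prefix_params)                             # 368640
print(lm_head_params)                            # 38597376
print(round(lm_head_params / prefix_params, 1))  # 104.7
```

Under these assumptions the unfrozen `lm_head` holds roughly a hundred times more trainable weights than the prefix itself, so this setup is closer to partial fine-tuning than to pure prefix tuning.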

In [47]:
print(model)

PeftModelForCausalLM(
  (base_model): GPT2LMHeadModel(
    (transformer): GPT2Model(
      (wte): Embedding(50257, 768)
      (wpe): Embedding(1024, 768)
      (drop): Dropout(p=0.1, inplace=False)
      (h): ModuleList(
        (0-11): 12 x GPT2Block(
          (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (attn): GPT2Attention(
            (c_attn): Conv1D(nf=2304, nx=768)
            (c_proj): Conv1D(nf=768, nx=768)
            (attn_dropout): Dropout(p=0.1, inplace=False)
            (resid_dropout): Dropout(p=0.1, inplace=False)
          )
          (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): GPT2MLP(
            (c_fc): Conv1D(nf=3072, nx=768)
            (c_proj): Conv1D(nf=768, nx=3072)
            (act): NewGELUActivation()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
      (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    )
    (lm_head): Linear(in_features=768,

In [54]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
)

In [55]:
# Check the initial loss and verify that evaluation runs correctly
evalres = trainer.evaluate()
print(evalres)



{'eval_loss': 5.523458957672119, 'eval_runtime': 73.3823, 'eval_samples_per_second': 66.596, 'eval_steps_per_second': 2.085}


In [56]:
# Training takes about 3 hours on 2x RTX 4090
trainer.train()



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.9919 | 1.785209 |
| 2 | 0.8505 | 1.594922 |
| 3 | 0.8029 | 1.538443 |
| 4 | 0.7845 | 1.515149 |
| 5 | 0.7819 | 1.509089 |




TrainOutput(global_step=13730, training_loss=0.9582138861693003, metrics={'train_runtime': 9607.9617, 'train_samples_per_second': 45.726, 'train_steps_per_second': 1.429, 'total_flos': 1.15016556503808e+17, 'train_loss': 0.9582138861693003, 'epoch': 5.0})
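
A useful reference point for these losses is the cross-entropy of guessing uniformly among the four letters, which is ln 4. The comparison below is a back-of-the-envelope check; it assumes the reported validation loss is averaged over answer tokens only, which is how the labels were masked.

```python
import math

final_eval_loss = 1.509089  # epoch 5 validation loss from the table above
uniform_loss = math.log(4)  # loss of assigning probability 1/4 to each of A/B/C/D

print(round(uniform_loss, 4))                # 1.3863
print(round(math.exp(-final_eval_loss), 3))  # 0.221
print(final_eval_loss > uniform_loss)        # True
```

Under that assumption, the final validation loss is still above the uniform-guess level: the geometric-mean probability given to the correct letter is about 0.22, below 0.25.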

# Model Inference

Check whether the fine-tuned model can now generate an answer.

In [59]:
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
model = PeftModel.from_pretrained(model, "./results/GPT2-epoch5/checkpoint-13730")

response = inference(raw_dataset['test'][0], model)
print(response)
print(f"\nthe GT: {raw_dataset['test'][0]['answer']}")

contex:The rain had continued for a week and the flood had created a big river which were running by Nancy Brown's farm. As she tried to gather her cows to a higher ground, she slipped and hit her head on a fallen tree trunk. The fall made her unconscious for a moment or two. When she came to, Lizzie, one of her oldest and favorite cows, was licking her face. 
At that time, the water level on the farm was still rising. Nancy gathered all her strength to get up and began walking slowly with Lizzie. The rain had become much heavier, and the water in the field was now waist high. Nancy's pace got slower and slower because she felt a great pain in her head. Finally, all she could do was to throw her arm around Lizzie's neck and try to hang on. About 20 minutes later, Lizzie managed to pull herself and Nancy out of the rising water and onto a bit of high land, which seemed like a small island in the middle of a lake of white water. 
Even though it was about noon, the sky was so dark and the

Next, we write a function to measure the accuracy.

In [60]:
def evaluate_accuracy(model, dataset):
    """Print the fraction of examples whose last generated character matches the gold answer."""
    right = 0
    for instance in dataset:
        gt = instance["answer"].lower()
        response = inference(instance, model)
        pred = response[-1].lower()
        if pred == gt:
            right += 1
    print(f"accuracy: {right / len(dataset)}")


In [61]:
evaluate_accuracy(model, raw_dataset["validation"])

accuracy: 0.2578268876611418
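
With four options per question, random guessing gives an expected accuracy of 0.25, so it helps to check how far the measured value is from chance. The sketch below uses a simple binomial approximation; it treats the 4887 validation examples as independent, which is an assumption.

```python
import math

accuracy = 0.2578268876611418  # value printed above
n = 4887                       # number of validation examples
chance = 0.25                  # four options per question

# Standard error of the accuracy of a random guesser on n examples.
std_err = math.sqrt(chance * (1 - chance) / n)
z = (accuracy - chance) / std_err

print(round(std_err, 4))  # 0.0062
print(round(z, 2))        # 1.26
```

A gap of about 1.3 standard errors is within what random guessing would produce, so the fine-tuned model is not clearly better than chance on this measurement.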


# Improvements
1. CoT
2. Adjust the generation parameters
3. Change the fine-tuning method
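
For the CoT item, one possible starting point is a prompt that asks for reasoning first and a final `answer:X` line, plus a parser for that line. The sketch below is only an illustration of the prompt and parsing logic; `build_cot_prompt`, `parse_answer`, and the prompt wording are hypothetical, and generation would need a larger `max_new_tokens` than the single token used in `inference`.

```python
from typing import Sequence


def build_cot_prompt(article: str, question: str, options: Sequence[str]) -> str:
    """Build a prompt that asks for step-by-step reasoning before the answer."""
    opt_str = "\n".join(f"{letter}:{text}" for letter, text in zip("ABCD", options))
    return (
        f"context:{article}\n"
        f"question:{question}\n"
        f"options:\n{opt_str}\n"
        "Let's think step by step, then finish with a line of the form 'answer:X'.\n"
        "reasoning:"
    )


def parse_answer(generated: str) -> str:
    """Return the letter after the last 'answer:' marker, or '' if none is valid."""
    marker = "answer:"
    pos = generated.rfind(marker)
    if pos == -1:
        return ""
    first = generated[pos + len(marker):].strip()[:1].upper()
    return first if first and first in "ABCD" else ""


prompt = build_cot_prompt(
    "Tom has a red bike.",
    "What color is Tom's bike?",
    ["Blue", "Red", "Green", "Black"],
)
print(prompt.endswith("reasoning:"))                 # True
print(parse_answer("The text says red. answer: b"))  # B
print(parse_answer(prompt))                          # prints an empty line
```

Parsing the last `answer:` marker returns an empty string when the model never commits to a letter, so such cases can be counted as wrong instead of being matched by accident.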