<font size="5">SQuAD Dataset: QA Fine-Tuning</font>

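Stanford Question Answering Dataset (SQuAD)  
SQuAD 1.1: contains roughly 100,000 question-answer pairs built from more than 500 Wikipedia articles.  
Every question has a definite answer, and that answer is a contiguous span of the source passage, so the task is extractive question answering.  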

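SQuAD 2.0: adds more than 50,000 unanswerable questions to the roughly 100,000 answerable ones, for a total of about 150,000 questions.  
The new questions look plausible but cannot be answered from the passage, so the model must decide whether a question is answerable in addition to extracting the answer.  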
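
The difference between the two versions shows up directly in the `answers` field. The sketch below assumes the Hugging Face `datasets` layout used in this notebook, where an unanswerable question carries empty `text` and `answer_start` lists; the helper name `is_answerable` and the second record are illustrative only.

```python
def is_answerable(example):
    """Return True when the record carries at least one gold answer span."""
    return len(example["answers"]["answer_start"]) > 0


# Record shaped like a SQuAD 1.1 row (values taken from the first training example).
answerable = {
    "answers": {"text": ["Saint Bernadette Soubirous"], "answer_start": [515]}
}
# Hypothetical SQuAD 2.0-style row with no answer in the passage.
unanswerable = {"answers": {"text": [], "answer_start": []}}

print(is_answerable(answerable), is_answerable(unanswerable))
```

The preprocessing function later in this notebook relies on the same test (`len(answers["answer_start"]) == 0`) to label such rows with the CLS position.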

<font size="4">1 Loading the data</font>

In [1]:
import os

os.environ['http_proxy'] = 'http://127.0.0.1:1087'
os.environ['https_proxy'] = 'http://127.0.0.1:1087'

squad_v2 = False  # use SQuAD 1.1
batch_size = 16
max_length = 384 
doc_stride = 128 

from datasets import load_dataset
datasets = load_dataset("squad_v2" if squad_v2 else "squad")
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [2]:
datasets["train"][0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

In [45]:
def dp(d):
    # Render a dict as an HTML table
    display(HTML(pd.DataFrame(d).to_html()))

In [3]:
import pandas as pd
from IPython.display import display, HTML
dp(datasets["train"][[0]])

Unnamed: 0,id,title,context,question,answers
0,5733be284776f41900661182,University_of_Notre_Dame,"Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend ""Venite Ad Me Omnes"". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.",To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?,"{'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}"


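<font size="4">2 Model overview and data preprocessing</font>

RoBERTa is an improved variant of BERT, pretrained on a large corpus (about 160 GB of text) that includes BookCorpus, Wikipedia, and other datasets. It is well suited to fine-tuning on downstream tasks such as question answering.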

In [4]:
from transformers import AutoTokenizer
model_checkpoint = "/home/cc/.cache/huggingface/hub/roberta-base"  # local copy of roberta-base
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

import transformers
print(isinstance(tokenizer, transformers.PreTrainedTokenizerFast))

True


In [4]:
eg = tokenizer("What is your name?", "My name is Sylvain.")  # text -> sequence of token IDs (vocabulary indices) that the model consumes
print(eg)  # input_ids: the token IDs; attention_mask: 1 for real tokens, 0 for padding
print(tokenizer.decode(eg['input_ids']))  # decode input_ids back into text, special tokens included

{'input_ids': [0, 2264, 16, 110, 766, 116, 2, 2, 2387, 766, 16, 28856, 1851, 4, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
<s>What is your name?</s></s>My name is Sylvain.</s>


In [15]:
for i, example in enumerate(datasets["train"]):
    if len(tokenizer(example["question"], example["context"])["input_ids"]) > 384:
        break

dp(datasets["train"][[i]])
example = datasets["train"][i]

tokenized_example = tokenizer(
    example["question"],
    example["context"],
    max_length=384,  # cap each feature (question + context) at 384 tokens
    truncation="only_second",  # truncate only the second input (the context), never the question
    return_overflowing_tokens=True,  # return the overflow as additional features (each <= max_length) instead of discarding it
    return_offsets_mapping=True,  # return each token's (start, end) character offsets in its source string (question or context)
    stride=128  # consecutive features overlap by 128 tokens
)
dp(dict(tokenized_example))

Unnamed: 0,id,title,context,question,answers
0,5733caf74776f4190066124c,University_of_Notre_Dame,"The men's basketball team has over 1,600 wins, one of only 12 schools who have reached that mark, and have appeared in 28 NCAA tournaments. Former player Austin Carr holds the record for most points scored in a single game of the tournament with 61. Although the team has never won the NCAA Tournament, they were named by the Helms Athletic Foundation as national champions twice. The team has orchestrated a number of upsets of number one ranked teams, the most notable of which was ending UCLA's record 88-game winning streak in 1974. The team has beaten an additional eight number-one teams, and those nine wins rank second, to UCLA's 10, all-time in wins against the top team. The team plays in newly renovated Purcell Pavilion (within the Edmund P. Joyce Center), which reopened for the beginning of the 2009–2010 season. The team is coached by Mike Brey, who, as of the 2014–15 season, his fifteenth at Notre Dame, has achieved a 332-165 record. In 2009 they were invited to the NIT, where they advanced to the semifinals but were beaten by Penn State who went on and beat Baylor in the championship. The 2010–11 team concluded its regular season ranked number seven in the country, with a record of 25–5, Brey's fifth straight 20-win season, and a second-place finish in the Big East. During the 2014-15 season, the team went 32-6 and won the ACC conference tournament, later advancing to the Elite 8, where the Fighting Irish lost on a missed buzzer-beater against then undefeated Kentucky. Led by NBA draft picks Jerian Grant and Pat Connaughton, the Fighting Irish beat the eventual national champion Duke Blue Devils twice during the season. The 32 wins were the most by the Fighting Irish team since 1908-09.",How many wins does the Notre Dame men's basketball team have?,"{'text': ['over 1,600'], 'answer_start': [30]}"


Unnamed: 0,input_ids,attention_mask,offset_mapping,overflow_to_sample_mapping
0,"[0, 6179, 171, 2693, 473, 5, 10579, 9038, 604, 18, 2613, 165, 33, 116, 2, 2, 133, 604, 18, 2613, 165, 34, 81, 112, 6, 4697, 2693, 6, 65, 9, 129, 316, 1304, 54, 33, 1348, 14, 2458, 6, 8, 33, 1382, 11, 971, 5248, 11544, 4, 3531, 869, 4224, 8902, 3106, 5, 638, 13, 144, 332, 1008, 11, 10, 881, 177, 9, 5, 1967, 19, 5659, 4, 2223, 5, 165, 34, 393, 351, 5, 5248, 7647, 6, 51, 58, 1440, 30, 5, 6851, 4339, 8899, 2475, 25, 632, 4739, 2330, 4, 20, 165, 34, 24830, 10, 346, 9, 12744, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","[(0, 0), (0, 3), (4, 8), (9, 13), (14, 18), (19, 22), (23, 28), (29, 33), (34, 37), (37, 39), (40, 50), (51, 55), (56, 60), (60, 61), (0, 0), (0, 0), (0, 3), (4, 7), (7, 9), (10, 20), (21, 25), (26, 29), (30, 34), (35, 36), (36, 37), (37, 40), (41, 45), (45, 46), (47, 50), (51, 53), (54, 58), (59, 61), (62, 69), (70, 73), (74, 78), (79, 86), (87, 91), (92, 96), (96, 97), (98, 101), (102, 106), (107, 115), (116, 118), (119, 121), (122, 126), (127, 138), (138, 139), (140, 146), (147, 153), (154, 160), (161, 165), (166, 171), (172, 175), (176, 182), (183, 186), (187, 191), (192, 198), (199, 205), (206, 208), (209, 210), (211, 217), (218, 222), (223, 225), (226, 229), (230, 240), (241, 245), (246, 248), (248, 249), (250, 258), (259, 262), (263, 267), (268, 271), (272, 277), (278, 281), (282, 285), (286, 290), (291, 301), (301, 302), (303, 307), (308, 312), (313, 318), (319, 321), (322, 325), (326, 329), (329, 331), (332, 340), (341, 351), (352, 354), (355, 363), (364, 373), (374, 379), (379, 380), (381, 384), (385, 389), (390, 393), (394, 406), (407, 408), (409, 415), (416, 418), (419, 422), ...]",0
1,"[0, 6179, 171, 2693, 473, 5, 10579, 9038, 604, 18, 2613, 165, 33, 116, 2, 2, 20, 1824, 2383, 1225, 165, 4633, 63, 1675, 191, 4173, 346, 707, 11, 5, 247, 6, 19, 10, 638, 9, 564, 2383, 245, 6, 5811, 219, 18, 1998, 1359, 291, 12, 5640, 191, 6, 8, 10, 200, 12, 6406, 2073, 11, 5, 1776, 953, 4, 1590, 5, 777, 12, 996, 191, 6, 5, 165, 439, 2107, 12, 401, 8, 351, 5, 10018, 1019, 1967, 6, 423, 11511, 7, 5, 15834, 290, 6, 147, 5, 18563, 3445, 685, 15, 10, 2039, 8775, 254, 12, 1610, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","[(0, 0), (0, 3), (4, 8), (9, 13), (14, 18), (19, 22), (23, 28), (29, 33), (34, 37), (37, 39), (40, 50), (51, 55), (56, 60), (60, 61), (0, 0), (0, 0), (1107, 1110), (1111, 1115), (1115, 1116), (1116, 1118), (1119, 1123), (1124, 1133), (1134, 1137), (1138, 1145), (1146, 1152), (1153, 1159), (1160, 1166), (1167, 1172), (1173, 1175), (1176, 1179), (1180, 1187), (1187, 1188), (1189, 1193), (1194, 1195), (1196, 1202), (1203, 1205), (1206, 1208), (1208, 1209), (1209, 1210), (1210, 1211), (1212, 1215), (1215, 1216), (1216, 1218), (1219, 1224), (1225, 1233), (1234, 1236), (1236, 1237), (1237, 1240), (1241, 1247), (1247, 1248), (1249, 1252), (1253, 1254), (1255, 1261), (1261, 1262), (1262, 1267), (1268, 1274), (1275, 1277), (1278, 1281), (1282, 1285), (1286, 1290), (1290, 1291), (1292, 1298), (1299, 1302), (1303, 1307), (1307, 1308), (1308, 1310), (1311, 1317), (1317, 1318), (1319, 1322), (1323, 1327), (1328, 1332), (1333, 1335), (1335, 1336), (1336, 1337), (1338, 1341), (1342, 1345), (1346, 1349), (1350, 1353), (1354, 1364), (1365, 1375), (1375, 1376), (1377, 1382), (1383, 1392), (1393, 1395), (1396, 1399), (1400, 1405), (1406, 1407), (1407, 1408), (1409, 1414), (1415, 1418), (1419, 1427), 
(1428, 1433), (1434, 1438), (1439, 1441), (1442, 1443), (1444, 1450), (1451, 1455), (1455, 1457), (1457, 1458), (1458, 1460), ...]",0
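
The way a long context is cut into overlapping features can be approximated with plain arithmetic. The sketch below is an assumption about the chunking, not the tokenizer's own code: it supposes that each feature holds `max_length` minus the question tokens minus 4 special tokens (`<s>`, `</s></s>`, `</s>` for a RoBERTa pair), and that each new window starts `stride` tokens before the previous one ended. The function name `context_windows` is illustrative.

```python
def context_windows(n_context_tokens, n_question_tokens,
                    max_length=384, stride=128, n_special=4):
    """Return (start, end) context-token ranges, end exclusive, one per feature."""
    window = max_length - n_question_tokens - n_special  # context tokens per feature
    if window <= stride:
        raise ValueError("stride must be smaller than the context window")
    step = window - stride  # how far each new window advances
    spans = []
    start = 0
    while True:
        end = min(start + window, n_context_tokens)
        spans.append((start, end))
        if end == n_context_tokens:
            break
        start += step
    return spans


# A 13-token question leaves 367 context tokens per feature,
# which matches the context token indices 16..382 found in this notebook.
print(context_windows(400, 13))
```

With these numbers a 400-token context yields two features that share 128 context tokens, the same pattern as the two rows in the table above.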


In [67]:
print(tokenized_example["offset_mapping"][0][:100])

[(0, 0), (0, 3), (4, 8), (9, 13), (14, 18), (19, 22), (23, 28), (29, 33), (34, 37), (37, 39), (40, 50), (51, 55), (56, 60), (60, 61), (0, 0), (0, 0), (0, 3), (4, 7), (7, 9), (10, 20), (21, 25), (26, 29), (30, 34), (35, 36), (36, 37), (37, 40), (41, 45), (45, 46), (47, 50), (51, 53), (54, 58), (59, 61), (62, 69), (70, 73), (74, 78), (79, 86), (87, 91), (92, 96), (96, 97), (98, 101), (102, 106), (107, 115), (116, 118), (119, 121), (122, 126), (127, 138), (138, 139), (140, 146), (147, 153), (154, 160), (161, 165), (166, 171), (172, 175), (176, 182), (183, 186), (187, 191), (192, 198), (199, 205), (206, 208), (209, 210), (211, 217), (218, 222), (223, 225), (226, 229), (230, 240), (241, 245), (246, 248), (248, 249), (250, 258), (259, 262), (263, 267), (268, 271), (272, 277), (278, 281), (282, 285), (286, 290), (291, 301), (301, 302), (303, 307), (308, 312), (313, 318), (319, 321), (322, 325), (326, 329), (329, 331), (332, 340), (341, 351), (352, 354), (355, 363), (364, 373), (374, 379), (37

In [66]:
first_token_id = tokenized_example["input_ids"][0][1]  
offsets = tokenized_example["offset_mapping"][0][1]
print(first_token_id, offsets)
print(tokenizer.convert_ids_to_tokens([first_token_id])[0], example["question"][offsets[0]:offsets[1]])

6179 (0, 3)
How How


Next we convert the answer's character positions in the original text into token indices in the tokenized sequence, which is the form the model is trained to predict.

In [22]:
answers = example["answers"]  # {'text': ['over 1,600'], 'answer_start': [30]}
start_char = answers["answer_start"][0]  # 30
end_char = start_char + len(answers["text"][0])  # 40

# Which sequence (question or context) each token belongs to
sequence_ids = tokenized_example.sequence_ids()  # defaults to the first feature; pass an index, e.g. sequence_ids(1), for the second
# [None, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, None, None, 1, 1, ..... 1, 1, None]

token_start_index = 0
while sequence_ids[token_start_index] != 1:
    token_start_index += 1
# 16
token_end_index = len(tokenized_example["input_ids"][0]) - 1
while sequence_ids[token_end_index] != 1:
    token_end_index -= 1
# 382

# Move token_start_index and token_end_index to the two ends of the answer span
offsets = tokenized_example["offset_mapping"][0]  
# print(offsets[token_start_index], offsets[token_end_index])
# (0, 3) (1682, 1685)
if (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
    while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
        token_start_index += 1
    start_position = token_start_index - 1
    while offsets[token_end_index][1] >= end_char:
        token_end_index -= 1
    end_position = token_end_index + 1
    print(start_position, end_position)  # 22, 25
else:
    print("The answer is not in this feature.")

22 25


In [26]:
# Decode the answer from the context using the positions found through the offset mapping
print(tokenizer.decode(tokenized_example["input_ids"][0][start_position: end_position+1]))
# Reference answer from the dataset (answers["text"])
print(answers["text"][0])

 over 1,600
over 1,600
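
The character-to-token search used above can be isolated into a small function that needs only the offsets. This is a minimal sketch on a hand-built toy offset list (the context offsets are copied from the printed `offset_mapping`, the surrounding special-token entries are shortened); the function name and argument names are illustrative.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Map a character span to inclusive token indices, or None if it is not in this feature.

    offsets: (start, end) character offsets per token.
    context_start / context_end: index of the first / last context token.
    """
    if not (offsets[context_start][0] <= start_char
            and offsets[context_end][1] >= end_char):
        return None
    i = context_start
    while i <= context_end and offsets[i][0] <= start_char:
        i += 1  # walk right past every token that starts at or before the answer
    j = context_end
    while j >= context_start and offsets[j][1] >= end_char:
        j -= 1  # walk left past every token that ends at or after the answer
    return i - 1, j + 1


toy_offsets = [
    (0, 0), (0, 3), (0, 0), (0, 0),              # <s>, one question token, </s></s>
    (0, 3), (4, 7), (7, 9), (10, 20), (21, 25),  # context tokens start at index 4
    (26, 29), (30, 34), (35, 36), (36, 37), (37, 40), (41, 45),
    (0, 0),                                      # closing </s>
]
print(char_span_to_token_span(toy_offsets, 4, 14, 30, 40))
```

Characters 30 to 40 ("over 1,600") land on the four tokens with offsets (30, 34) through (37, 40), the same relative span as positions 22 to 25 in the real feature.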


In [5]:
pad_on_right = tokenizer.padding_side == "right"
print(pad_on_right)

True


In [6]:
def prepare_train_features(examples):
    examples["question"] = [q.lstrip() for q in examples["question"]]  # strip leading whitespace, which would otherwise distort tokenization

    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,  # maximum token length of a feature (longer inputs are split)
        stride=doc_stride,  # overlap, in tokens, between consecutive features when a long context is split
        return_overflowing_tokens=True,  # return the overflow as extra features and record which example each one came from
        return_offsets_mapping=True,  # record each token's character offsets in the original text
        padding="max_length",
    )

    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")  # index of the source example for every feature; e.g. if example 0 is split into 3 features, their entries in sample_mapping are all 0
    offset_mapping = tokenized_examples.pop("offset_mapping")  # character range of each token in the original text; e.g. (5, 8) covers characters 5 through 7 (end exclusive)

    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)  # index of the CLS token (<s> for RoBERTa)
        sequence_ids = tokenized_examples.sequence_ids(i)
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        if len(answers["answer_start"]) == 0:
            # No answer in the original example (answer_start is empty): label both positions with cls_index
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Index of the first context token
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1
            # Index of the last context token
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                # The answer is not inside this feature: label it with the CLS index
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Map the answer's character positions (start_char/end_char) to token indices
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)

                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

# Apply prepare_train_features to every example in the dataset
tokenized_datasets = datasets.map(prepare_train_features,
                                  batched=True,  # process examples in batches
                                  remove_columns=datasets["train"].column_names  # drop the original columns (question, context, ...) and keep only the tokenized features (input_ids, attention_mask, start_positions, ...)
                                 )

Map:   0%|          | 0/87599 [00:00<?, ? examples/s]

Map:   0%|          | 0/10570 [00:00<?, ? examples/s]

In [7]:
dp(dict(tokenized_datasets['train'][[0]]))

Unnamed: 0,input_ids,attention_mask,start_positions,end_positions
0,"[0, 3972, 2661, 222, 5, 9880, 2708, 2346, 2082, 11, 504, 4432, 11, 226, 2126, 10067, 1470, 116, 2, 2, 37848, 37471, 28108, 6, 5, 334, 34, 10, 4019, 2048, 4, 497, 1517, 5, 4326, 6919, 18, 1637, 31346, 16, 10, 9030, 9577, 9, 5, 9880, 2708, 4, 29261, 11, 760, 9, 5, 4326, 6919, 8, 2114, 24, 6, 16, 10, 7621, 9577, 9, 4845, 19, 3701, 62, 33161, 19, 5, 7875, 22, 39043, 1459, 1614, 1464, 13292, 4977, 845, 4130, 7, 5, 4326, 6919, 16, 5, 26429, 2426, 9, 5, 25095, 6924, 4, 29261, 639, 5, 32394, 2426, 16, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]",135,142


<font size="4">3 Fine-tuning the model</font>

In [24]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

Some weights of RobertaForQuestionAnswering were not initialized from the model checkpoint at /home/cc/.cache/huggingface/hub/roberta-base and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [21]:
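model_dir = "/home/cc/models/finetuned-models/roberta-base-finetuned-1"

args = TrainingArguments(
    output_dir=model_dir,
    per_device_train_batch_size=batch_size,  # per-device training batch size (16)
    per_device_eval_batch_size=32,  # per-device evaluation batch size
    gradient_accumulation_steps=2,  # gradient accumulation (effective batch = 16 * 2 * 2 GPUs = 64)
    save_total_limit=3,  # keep at most 3 checkpoints
    fp16=True,  # mixed-precision training (uses the RTX 4090's Tensor Cores)
    remove_unused_columns=False,  # do not drop unused columns automatically (keeps every feature)
    gradient_checkpointing=False,  # gradient checkpointing is off; enabling it would trade compute for memory

    metric_for_best_model="eval_loss",
    greater_is_better=False,  # lower validation loss is better
    evaluation_strategy="epoch",  # renamed eval_strategy in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,  # reload the best checkpoint when training ends
    learning_rate=2e-5,
    num_train_epochs=5,
    weight_decay=0.01,
)  # training arguments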

In [14]:
from transformers import default_data_collator

data_collator = default_data_collator  # collates a list of samples into a batch in the format the model expects

In [26]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],  # training set
    eval_dataset=tokenized_datasets["validation"],  # validation set
    data_collator=data_collator,  # batch collator
    tokenizer=tokenizer  
)

In [27]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.9987,0.887529
2,0.7701,0.852657
3,0.6376,0.851553
4,0.5488,0.88581
5,0.4987,0.914835




TrainOutput(global_step=6920, training_loss=0.7303404416652084, metrics={'train_runtime': 4426.0717, 'train_samples_per_second': 100.053, 'train_steps_per_second': 1.563, 'total_flos': 8.678449181472768e+16, 'train_loss': 0.7303404416652084, 'epoch': 5.0})

The training loss keeps falling, but the validation loss reaches its minimum at epoch 3 (0.851553) and rises again in epochs 4 and 5, which indicates overfitting.

TrainOutput(  
    global_step=6920,                  # total number of optimizer steps  
    training_loss=0.7303404416652084,  # average training loss  
    metrics={   
        'train_runtime': 4426.0717,    # total training time (seconds)  
        'train_samples_per_second': 100.053,  # training samples processed per second  
        'train_steps_per_second': 1.563,      # optimizer steps per second  
        'total_flos': 8.678449181472768e+16,  # total floating-point operations  
        'train_loss': 0.7303404416652084,     # average training loss (same as training_loss)  
        'epoch': 5.0                         # number of epochs trained  
    }  
)  
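
The reported `global_step` can be cross-checked against the batch settings. The sketch below rests on several inferences rather than logged facts: about 88,568 training features (runtime × samples per second ÷ 5 epochs), 2 GPUs (the evaluation batch of 64 printed in section 4 is twice the per-device 32), and 16 samples per device as stated in the batch-size comment. The function name `optimizer_steps` is illustrative.

```python
import math


def optimizer_steps(n_features, per_device_batch, n_devices, grad_accum, epochs):
    """Estimate the number of optimizer updates for a run."""
    # Batches the dataloader yields per epoch (the last one may be partial).
    batches_per_epoch = math.ceil(n_features / (per_device_batch * n_devices))
    # One optimizer update every grad_accum batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


print(optimizer_steps(88568, 16, 2, 2, 5))  # assumed values, see the note above
```

Under these assumptions the estimate reproduces the 6920 steps in the log, which corresponds to an effective batch of 64; a per-device batch of 24 would give a different count.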

In [28]:
trainer.save_model(model_dir)  # save the model (save_model returns None, so there is nothing to assign)

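<font size="4">4 Inspecting the model's predictions</font>

First we take one validation batch from the trainer, run a single forward pass through the model, and look at the keys of the model output.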

In [29]:
import torch

for batch in trainer.get_eval_dataloader():  # take the first batch from the validation dataloader
    break
batch = {k: v.to(trainer.args.device) for k, v in batch.items()}  # move the inputs to the model's device; otherwise a device-mismatch error is raised
with torch.no_grad():  # disable gradient tracking: the forward pass neither computes nor stores gradients
    output = trainer.model(**batch)  # forward pass
output.keys()

odict_keys(['loss', 'start_logits', 'end_logits'])

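where  
start_logits: the unnormalized score (logit) of each token being the start of the answer.  
end_logits: the unnormalized score (logit) of each token being the end of the answer.  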
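
Logits become probabilities only after a softmax over the token positions. The sketch below uses plain Python on made-up numbers to show the relationship; it is not part of the notebook's pipeline, which works with the raw logits directly.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_start_logits = [-7.5, 9.8, -9.0, 2.1]  # hypothetical values for a 4-token input
probs = softmax(toy_start_logits)
print(probs.index(max(probs)), round(sum(probs), 6))
```

Because softmax preserves order, the argmax of the logits and the argmax of the probabilities are the same position, which is why the cells below can rank candidates without normalizing.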

In [30]:
output.start_logits.shape, output.end_logits.shape

(torch.Size([64, 384]), torch.Size([64, 384]))

In [31]:
output.start_logits

tensor([[-7.4791, -9.0262, -9.0625,  ..., -9.6502, -9.6502, -9.6502],
        [-7.5004, -9.0389, -9.0948,  ..., -9.6503, -9.6503, -9.6503],
        [-7.4040, -9.3254, -9.2946,  ..., -9.6689, -9.6689, -9.6689],
        ...,
        [-7.6094, -8.6967, -9.1115,  ..., -9.6348, -9.6348, -9.6348],
        [-7.7614, -9.0377, -8.9848,  ..., -9.6395, -9.6395, -9.6395],
        [-7.5818, -8.5148, -8.6430,  ..., -9.6020, -9.6020, -9.6020]],
       device='cuda:0')

In [32]:
output.start_logits.argmax(dim=-1), output.end_logits.argmax(dim=-1)  # position of the largest logit

(tensor([ 48,  60,  81,  45, 123, 111,  75,  36, 111,  35,  76,  43,  83,  93,
         157,  36,  86,  93,  83,  62,  81,  77,  44,  56,  43,  36,  44,  80,
          12,  46,  29, 135,  68,  42,  89,  46,  87,  85, 129,  26,  29,  34,
          88, 129,  97,  26,  45,  61,  86,  31,  88,  48,  25,  47,  67,  57,
          80,  15,  58,  71,  25,  36,  56,  42], device='cuda:0'),
 tensor([ 49,  61,  94,  46, 123, 113,  78,  38, 113,  37,  79,  44,  85,  96,
         159,  36,  86,  96,  85,  64,  84,  77,  45,  57,  44,  36,  45,  93,
          14,  47,  30, 135,  68,  43,  91,  47,  89,  87, 129,  27,  31,  35,
          90, 129,  99,  27,  46, 134,  88,  32,  90,  49,  26,  48,  67,  58,
          80,  15,  59,  71,  25,  36,  57,  42], device='cuda:0'))

Next, let's see how to turn the logits above into candidate answers.

In [33]:
n_best_size = 20  # keep the 20 most likely candidates

In [37]:
import numpy as np

# Take the first row of the batch
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()

# Indices of the best start and end positions:
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()  # sort descending and keep the top n_best_size indices
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()

valid_answers = []

# Drop combinations with start_index > end_index and score the rest
for start_index in start_indexes:
    for end_index in end_indexes:
        if start_index <= end_index:  # needs a further check that the answer lies inside the context
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],  # the sum of the two logits is the candidate's score
                    "text": ""  # we still need a way to recover the original substring of the context for this answer
                }
            )
print(valid_answers[:4])

[{'score': 19.589907, 'text': ''}, {'score': 12.265194, 'text': ''}, {'score': 11.992574, 'text': ''}, {'score': 9.303947, 'text': ''}]
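
The slice `np.argsort(x)[-1 : -n - 1 : -1]` simply picks the indices of the `n` largest values. A pure-Python sketch of the same candidate search, on made-up scores and with illustrative function names, may make the pairing logic easier to follow (tie ordering can differ from NumPy's).

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def candidate_spans(start_scores, end_scores, k):
    """Pair the top-k starts with the top-k ends, keep start <= end, best score first."""
    spans = []
    for s in top_k_indices(start_scores, k):
        for e in top_k_indices(end_scores, k):
            if s <= e:
                spans.append((start_scores[s] + end_scores[e], s, e))
    return sorted(spans, reverse=True)


toy_start = [0.5, 2.5, -1.0, 1.5]   # hypothetical start logits
toy_end = [0.0, 0.25, 2.0, 1.0]     # hypothetical end logits
print(candidate_spans(toy_start, toy_end, k=2))
```

Here the pair (start 3, end 2) is discarded because the span would end before it starts, leaving three candidates ranked by summed score.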


In [75]:
def prepare_validation_features(examples):
    examples["question"] = [q.lstrip() for q in examples["question"]]

    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

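        # One example can yield several features; example_id records the id of the example that produced this feature
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set the offsets of tokens outside the context to None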
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples

In [76]:
dp(datasets["validation"][[0]])
example = prepare_validation_features(datasets["validation"][[0]])
dp(dict(example))

Unnamed: 0,id,title,context,question,answers
0,56be4db0acb8001400a502ec,Super_Bowl_50,"Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the ""golden anniversary"" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as ""Super Bowl L""), so that the logo could prominently feature the Arabic numerals 50.",Which NFL team represented the AFC at Super Bowl 50?,"{'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answer_start': [177, 177, 177]}"


Unnamed: 0,input_ids,attention_mask,offset_mapping,example_id
0,"[0, 32251, 1485, 165, 4625, 5, 9601, 23, 1582, 2616, 654, 116, 2, 2, 16713, 2616, 654, 21, 41, 470, 1037, 177, 7, 3094, 5, 2234, 9, 5, 496, 3910, 815, 36, 12048, 43, 13, 5, 570, 191, 4, 20, 470, 3910, 2815, 36, 250, 5268, 43, 2234, 4465, 7609, 5125, 5, 496, 3910, 2815, 36, 487, 5268, 43, 2234, 1961, 6495, 706, 2383, 698, 7, 4073, 49, 371, 1582, 2616, 1270, 4, 20, 177, 21, 702, 15, 902, 262, 6, 336, 6, 23, 20050, 18, 2689, 11, 5, 764, 2659, 1501, 4121, 23, 2005, 13606, 6, 886, 4, 287, ...]","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...]","[None, None, None, None, None, None, None, None, None, None, None, None, None, None, (0, 5), (6, 10), (11, 13), (14, 17), (18, 20), (21, 29), (30, 38), (39, 43), (44, 46), (47, 56), (57, 60), (61, 69), (70, 72), (73, 76), (77, 85), (86, 94), (95, 101), (102, 103), (103, 106), (106, 107), (108, 111), (112, 115), (116, 120), (121, 127), (127, 128), (129, 132), (133, 141), (142, 150), (151, 161), (162, 163), (163, 164), (164, 166), (166, 167), (168, 176), (177, 183), (184, 191), (192, 200), (201, 204), (205, 213), (214, 222), (223, 233), (234, 235), (235, 236), (236, 238), (238, 239), (240, 248), (249, 257), (258, 266), (267, 269), (269, 270), (270, 272), (273, 275), (276, 280), (281, 286), (287, 292), (293, 298), (299, 303), (304, 309), (309, 310), (311, 314), (315, 319), (320, 323), (324, 330), (331, 333), (334, 342), (343, 344), (344, 345), (346, 350), (350, 351), (352, 354), (355, 359), (359, 361), (362, 369), (370, 372), (373, 376), (377, 380), (381, 390), (391, 394), (395, 399), (400, 402), (403, 408), (409, 414), (414, 415), (416, 426), (426, 427), (428, 430), ...]",56be4db0acb8001400a502ec


In [101]:
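# Build two views of the validation features: one with offset_mapping (for answer extraction and evaluation) and one without (for model prediction)
# 1. Full features with offset_mapping, which maps token positions back to character positions in the original text so the answer string can be sliced out of the context
validation_features_with_offset = datasets["validation"].map(
    prepare_validation_features,
    batched=True,
    remove_columns=datasets["validation"].column_names
)
# 2. Features for model prediction (offset_mapping removed); the model predicts the answer's token positions as start_logits and end_logits
validation_features = validation_features_with_offset.remove_columns(["offset_mapping"])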


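Next we obtain the raw predictions. They are later turned into answer text (using the offset_mapping kept earlier) and compared against the reference answers to evaluate the model, for example with exact match (EM) and F1.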
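
For reference, EM and F1 compare normalized strings token by token. The sketch below follows the usual SQuAD-style normalization (lowercase, strip punctuation, drop the articles a/an/the, collapse whitespace); it is a simplified illustration, not the official evaluation script, and it does not handle the empty-answer case of SQuAD 2.0.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()
    common = collections.Counter(pred_tokens) & collections.Counter(true_tokens)
    overlap = sum(common.values())  # tokens shared by prediction and reference
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Denver Broncos", "Denver Broncos"))
print(f1_score("champion Denver Broncos", "Denver Broncos"))
```

When an example has several reference answers, the usual convention is to take the maximum score over the references.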

In [102]:
raw_predictions = trainer.predict(validation_features)

In [103]:
print(validation_features_with_offset.format["type"], list(validation_features_with_offset.features.keys()))

None ['input_ids', 'attention_mask', 'offset_mapping', 'example_id']


In [104]:
validation_features_with_offset.set_format(type=validation_features_with_offset.format["type"], columns=list(validation_features_with_offset.features.keys()))

In [105]:
max_answer_length = 30

In [106]:
start_logits = output.start_logits[0].cpu().numpy()
end_logits = output.end_logits[0].cpu().numpy()
offset_mapping = validation_features_with_offset[0]["offset_mapping"]

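# The first feature comes from the first example. In the general case, example_id must be matched to an example index
context = datasets["validation"][0]["context"]

# Collect the indices of the best start/end logits: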
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
valid_answers = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Skip out-of-scope answers: the index is out of range, or it points to input IDs outside the context.
        if (
            start_index >= len(offset_mapping)
            or end_index >= len(offset_mapping)
            or offset_mapping[start_index] is None
            or offset_mapping[end_index] is None
        ):
            continue
        # Skip answers with negative length or longer than max_answer_length.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        if start_index <= end_index:  # already guaranteed by the checks above, which also confirm the span lies in the context
            start_char = offset_mapping[start_index][0]
            end_char = offset_mapping[end_index][1]
            valid_answers.append(
                {
                    "score": start_logits[start_index] + end_logits[end_index],
                    "text": context[start_char: end_char]
                }
            )
valid_answers = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[:n_best_size]
valid_answers

[{'score': 19.589907, 'text': 'Denver Broncos'},
 {'score': 15.610378, 'text': 'Broncos'},
 {'score': 15.31563,
  'text': 'The American Football Conference (AFC) champion Denver Broncos'},
 {'score': 13.710703,
  'text': 'American Football Conference (AFC) champion Denver Broncos'},
 {'score': 12.650904, 'text': 'AFC) champion Denver Broncos'},
 {'score': 12.265194, 'text': 'Denver'},
 {'score': 11.992574,
  'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 11.460337, 'text': 'champion Denver Broncos'},
 {'score': 9.303947,
  'text': 'Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10'},
 {'score': 8.013045,
  'text': 'Broncos defeated the National Football Conference (NFC) champion Carolina Panthers'},
 {'score': 7.9909163,
  'text': 'The American Football Conference (AFC) champion Denver'},
 {'score': 7.9181066,
  'text': 'Denver Broncos defeated the National Football Conferenc

In [107]:
datasets["validation"][0]["answers"]  # the reference answers; they match our top-scoring candidate

{'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'],
 'answer_start': [177, 177, 177]}

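Next we build a mapping from each original example to all of the features split from it, for example:

In [None]:
# Hypothetical output structure
features_per_example = {
    0: [2, 3],   # example 0 maps to feature indices 2 and 3 (split into 2 features)
    1: [5],      # example 1 maps to feature index 5 (not split)
    2: [7, 8, 9] # example 2 maps to feature indices 7, 8, 9 (split into 3 features)
}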

In [108]:
import collections

examples = datasets["validation"]
features = validation_features

example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
    features_per_example[example_id_to_index[feature["example_id"]]].append(i)

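In question answering, a long context may be truncated into several pieces (each piece is one feature), and the model predicts an answer for every piece separately. With features_per_example we can gather the predictions from all features of the same original example, pick the best answer among them (for instance, the one with the highest score), and align the final prediction with the ground-truth answer of the original example so that model performance can be evaluated.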
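
As a minimal sketch of this aggregation step, the toy example below groups features by their original example and keeps the highest-scoring candidate per example. The IDs, scores, and the `feature_best` list are hypothetical values made up for illustration; they are not produced by the model in this notebook.

```python
import collections

# Hypothetical toy data: two original examples, three features after truncation.
example_ids = ["q0", "q1"]
feature_example_ids = ["q0", "q0", "q1"]  # which example each feature came from
# Hypothetical best (score, text) candidate found in each feature.
feature_best = [(3.2, "Denver"), (7.5, "Denver Broncos"), (1.1, "Carolina Panthers")]

# Same grouping logic as in the cell above.
example_id_to_index = {k: i for i, k in enumerate(example_ids)}
features_per_example = collections.defaultdict(list)
for i, example_id in enumerate(feature_example_ids):
    features_per_example[example_id_to_index[example_id]].append(i)

# For every example, keep the candidate with the highest score across its features.
best_per_example = {}
for example_index, feature_indices in features_per_example.items():
    score, text = max(feature_best[i] for i in feature_indices)
    best_per_example[example_ids[example_index]] = text

print(dict(features_per_example))  # {0: [0, 1], 1: [2]}
print(best_per_example)            # {'q0': 'Denver Broncos', 'q1': 'Carolina Panthers'}
```

Example `q0` was split into two features, and the candidate from the second feature wins because its score is higher.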

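Next, we convert the model's raw predictions (logits) over the individual features into final answer text, and handle the aggregation across features that long-context truncation makes necessary.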

In [109]:
import numpy as np
from tqdm.auto import tqdm

def postprocess_qa_predictions(examples, features, raw_predictions, n_best_size = 20, max_answer_length = 30):
    all_start_logits, all_end_logits = raw_predictions  # Split the predictions into start and end logits
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}  # Map each example ID to its index
    features_per_example = collections.defaultdict(list)  # Feature indices belonging to each example
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Loop over the original examples and process all features of each one
    for example_index, example in enumerate(tqdm(examples)):
        feature_indices = features_per_example[example_index]  # All feature indices for the current example

        min_null_score = None # Only used when squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Process the predictions of each feature
        for feature_index in feature_indices:
            start_logits = all_start_logits[feature_index]  # Start logits for this feature
            end_logits = all_end_logits[feature_index]      # End logits for this feature
            offset_mapping = features[feature_index]["offset_mapping"]  # Token-to-character offsets in the original text

            # Score of the null answer (for SQuAD v2); keep the smallest one across features
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score > feature_null_score:
                min_null_score = feature_null_score

            # Collect valid candidate answers
            # Take the n_best_size highest-scoring start and end positions
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()

            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Skip invalid positions (out of range, or not inside the context)
                    if (start_index >= len(offset_mapping) or end_index >= len(offset_mapping) or
                        offset_mapping[start_index] is None or offset_mapping[end_index] is None):
                        continue
                    # Skip answers with an invalid length (end before start, or too long)
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue
                    # Convert to character positions in the original text and slice out the answer
                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append({
                        "score": start_logits[start_index] + end_logits[end_index],
                        "text": context[start_char: end_char]
                    })
        # Pick the best answer
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare case where we have no non-null prediction, create a fake prediction to avoid failure.
            best_answer = {"text": "", "score": 0.0}

        # Decide the final answer (SQuAD v1 vs. v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]  # v1: use the best answer directly
        else:
            # v2: compare the best answer with the null answer and keep the higher-scoring one
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions
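
The core of the function above is the search over the top `n_best_size` start and end positions. The sketch below replays that step on made-up logits for a five-token feature, so the effect of the `[-1 : -n_best_size - 1 : -1]` slice and of the length filter is easy to follow. All numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical logits for one feature with five tokens.
start_logits = np.array([0.1, 2.0, 0.3, 1.5, -1.0])
end_logits = np.array([0.0, 0.2, 3.0, 0.1, 1.0])
n_best_size = 2
max_answer_length = 3

# argsort sorts ascending, so the reversed slice yields the top-n indices, best first.
start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
print(start_indexes, end_indexes)  # [1, 3] [2, 4]

candidates = []
for start_index in start_indexes:
    for end_index in end_indexes:
        # Drop spans that end before they start or that are too long.
        if end_index < start_index or end_index - start_index + 1 > max_answer_length:
            continue
        score = float(start_logits[start_index] + end_logits[end_index])
        candidates.append((start_index, end_index, score))

candidates.sort(key=lambda c: c[2], reverse=True)
print(candidates)  # [(1, 2, 5.0), (3, 4, 2.5)]
```

Of the four start/end combinations, `(1, 4)` is rejected for exceeding `max_answer_length` and `(3, 2)` because the end precedes the start, which leaves two valid spans ranked by the sum of their logits.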


In [110]:
final_predictions = postprocess_qa_predictions(datasets["validation"], validation_features_with_offset, raw_predictions.predictions)

Post-processing 10570 example predictions split into 10790 features.



In [111]:
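# Note: load_metric is deprecated in recent releases of `datasets`;
# evaluate.load("squad") from the `evaluate` library is its replacement.
from datasets import load_metric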

metric = load_metric("squad_v2" if squad_v2 else "squad")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.


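SQuAD v2: supports the "no answer" case, so every prediction must also carry a no_answer_probability (the probability that the question has no answer; simplified to 0.0 here, although in practice it can be computed from the model output).  
SQuAD v1: an answer always exists, so each prediction only needs the example ID and the predicted text.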
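
If a real value is wanted instead of the fixed 0.0, one possible heuristic is to squash the gap between the null-answer score and the best span score through a logistic function. The helper below is a sketch under that assumption: the name `no_answer_probability_from_scores` and the choice of the logistic function are illustrative and do not come from the SQuAD metric or from this notebook.

```python
import math

def no_answer_probability_from_scores(null_score: float, best_span_score: float) -> float:
    """Hypothetical heuristic: logistic function of (null score - best span score)."""
    gap = null_score - best_span_score
    # Numerically stable logistic function for both signs of the gap.
    if gap >= 0:
        return 1.0 / (1.0 + math.exp(-gap))
    e = math.exp(gap)
    return e / (1.0 + e)

print(no_answer_probability_from_scores(4.0, 4.0))   # 0.5 when both scores are equal
print(no_answer_probability_from_scores(2.0, 10.0))  # close to 0: the span is much more likely
print(no_answer_probability_from_scores(10.0, 2.0))  # close to 1: the null answer is much more likely
```

The value could then replace the constant 0.0 in the v2 prediction dictionaries; whether it is well calibrated would have to be checked on validation data.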

In [112]:
if squad_v2:
    formatted_predictions = [{"id": k, "prediction_text": v, "no_answer_probability": 0.0} for k, v in final_predictions.items()]
else:
    formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in datasets["validation"]]
metric.compute(predictions=formatted_predictions, references=references)

{'exact_match': 85.38315988647115, 'f1': 91.86383353880744}

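EM (Exact Match): the percentage of predictions that match a ground-truth answer exactly.  
F1: the token-level overlap between the predicted and ground-truth answers (partial matches count), computed as the harmonic mean of precision and recall.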
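
To make the two metrics concrete, here is a simplified sketch of how EM and F1 can be computed for a single prediction against a single reference. It follows the usual SQuAD-style normalization (lowercasing, removing punctuation and English articles, collapsing whitespace); the function names are chosen for illustration, and the real metric additionally takes the best score over all reference answers and averages over the dataset.

```python
import collections
import re
import string

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, ground_truth: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))

def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection counts the tokens shared by both answers.
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Denver Broncos", "Denver Broncos"))                 # 1.0
print(exact_match("champion Denver Broncos", "Denver Broncos"))            # 0.0
print(round(f1_score("champion Denver Broncos", "Denver Broncos"), 2))     # 0.8
```

For the prediction "champion Denver Broncos", precision is 2/3 and recall is 1, which gives F1 = 0.8 even though the exact match is 0. This is why F1 is always at least as high as EM.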