# How To Train a Model for the Open Book Q&A Technique
In this notebook we demonstrate how to train a model for use with the top-scoring Open Book Q&A method. The Open Book method was first presented by JJ (@jjinho) [here][1], then Quangteo (@quangbk) reduced its RAM usage [here][2], and Anil (@nlztrk) combined it with Q&A [here][3]. Radek (@radek1) demonstrated the strength of Q&A [here][5]. Next, Mgoksu (@mgoksu) showed how to achieve a top public LB score of 0.807 [here][4] by finetuning DeBERTa large with this method.

In order to train a model for use with Open Book Q&A, we need a CSV that contains `prompt` (i.e. the question), `A, B, C, D, E` (i.e. the answer choices), and a `context` column extracted from Wikipedia pages for each question. To generate the `context` column, we run Mgoksu's notebook [here][4]. In code cell #5, we load our CSV without the `context` column using `trn = pd.read_csv(OUR_DATASET.CSV)`. Then in code cell #21 our dataset is saved to disk as `test_context.csv` with the `context` column added.
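
As a quick sanity check before training, a sketch like the following can confirm that a CSV has the columns described above. The helper name `missing_columns` and the sample row are hypothetical. The `answer` column is included as an assumption because the training code in this notebook reads it when building labels.

```python
import csv
import io

# Columns described above, plus `answer`, which the training code reads for labels.
REQUIRED_COLUMNS = ["prompt", "context", "A", "B", "C", "D", "E", "answer"]

def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in required if name not in header]

# Hypothetical one-row CSV that has not yet been given a `context` column.
sample = "prompt,A,B,C,D,E,answer\nWhat is H2O?,Water,Salt,Iron,Gold,Neon,A\n"
print(missing_columns(sample))  # ['context']
```

An empty list means the file has the expected layout and is ready for the training steps in this notebook.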

I searched for and concatenated all publicly shared datasets into one 60k-row CSV, then ran Mgoksu's notebook with `NUM_TITLES_INCLUDE = 5` and `NUM_SENTENCES_INCLUDE = 20` to add the `context` column. I uploaded the resulting CSV file to a Kaggle dataset [here][6]. If you enjoy the notebook you are reading, please upvote the dataset too. Thanks!

![](https://miro.medium.com/v2/resize:fit:800/format:webp/1*bTGY3fKIgNefQxNsOYpnBw.png)
 
(image source [here][7])

[1]: https://www.kaggle.com/code/jjinho/open-book-llm-science-exam
[2]: https://www.kaggle.com/code/quangbk/open-book-llm-science-exam-reduced-ram-usage
[3]: https://www.kaggle.com/code/nlztrk/openbook-debertav3-large-baseline-single-model
[4]: https://www.kaggle.com/code/mgoksu/0-807-sharing-my-trained-with-context-model
[5]: https://www.kaggle.com/code/radek1/new-dataset-deberta-v3-large-training
[6]: https://www.kaggle.com/datasets/cdeotte/60k-data-with-context-v2
[7]: https://blog.gopenai.com/enrich-llms-with-retrieval-augmented-generation-rag-17b82a96b6f0

# Load CSV
We will load the 60k CSV of `prompt`, `A,B,C,D,E`, and `context` from my Kaggle dataset [here][1]. This dataset is all publicly shared datasets concatenated and then processed with Mgoksu's notebook [here][2] to create a `context` column. (To learn more about the datasets within, read my discussion post.) This Kaggle dataset also contains the competition `train.csv` with an added `context` column (to be used as a validation dataset).

In this train notebook, we have internet turned on and can choose whatever model we wish to download and train. After we finetune this model, we will create a second notebook with the Open Book Q&A technique and load the finetuned model from the output of this notebook. The second notebook will have internet turned off so that it can be submitted to Kaggle's competition.

[1]: https://www.kaggle.com/datasets/cdeotte/60k-data-with-context-v2
[2]: https://www.kaggle.com/code/mgoksu/0-807-sharing-my-trained-with-context-model

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0,1"  # Select which GPUs (by CUDA device ID) are visible for computation

from typing import Optional, Union
import pandas as pd, numpy as np, torch
from datasets import Dataset
from dataclasses import dataclass
from transformers import AutoTokenizer
from transformers import EarlyStoppingCallback  # Optional callback for stopping training early
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

VER=2  # Version tag used in output paths
# Train with a subset of the 60k rows
NUM_TRAIN_SAMPLES = 1_024
# Parameter Efficient Fine Tuning (PEFT)
# PEFT requires 1xP100 GPU, not 2xT4
USE_PEFT = False
# Number of layers to freeze
# DeBERTa large has 24 layers in total
FREEZE_LAYERS = 18
# Whether to freeze the embeddings
FREEZE_EMBEDDINGS = True
# Maximum input length for context plus question plus answer
MAX_INPUT = 256
# Hugging Face model
MODEL = 'microsoft/deberta-v3-large'

In [None]:
# The validation set is the 200 questions provided by the competition
df_valid = pd.read_csv('/kaggle/input/60k-data-with-context-v2/train_with_context2.csv')
print('Validation data size:', df_valid.shape )
df_valid.head()

In [None]:
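# Training set: drop the `source` column, fill NaNs with empty strings,
# and take a random subset of NUM_TRAIN_SAMPLES rows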
df_train = pd.read_csv('/kaggle/input/60k-data-with-context-v2/all_12_with_context2.csv')
df_train = df_train.drop(columns="source")
df_train = df_train.fillna('').sample(NUM_TRAIN_SAMPLES)
print('Train data size:', df_train.shape )
df_train.head()

# Data Loader
Code is from Radek's notebook [here][1] with modifications to the tokenization process.

[1]: https://www.kaggle.com/code/radek1/new-dataset-deberta-v3-large-training
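
To make the input layout concrete, here is a minimal sketch of the text pairs that the `preprocess` function in this section hands to the tokenizer, plus a toy illustration of what `truncation='only_first'` is meant to achieve. The helper names `build_pairs` and `truncate_only_first` and the example row are hypothetical, and whitespace splitting is only a stand-in for the real subword tokenizer.

```python
def build_pairs(example, options="ABCDE"):
    """Return the (first, second) text pairs that are passed to the tokenizer."""
    first = ["[CLS] " + example["context"]] * len(options)
    second = [
        " #### " + example["prompt"] + " [SEP] " + example[opt] + " [SEP]"
        for opt in options
    ]
    return first, second

def truncate_only_first(first_tokens, second_tokens, max_length):
    """Toy stand-in for truncation='only_first': shorten only the context side."""
    keep = max(max_length - len(second_tokens), 0)
    return first_tokens[:keep], second_tokens

# Hypothetical example row.
example = {
    "context": "Water is a molecule made of hydrogen and oxygen.",
    "prompt": "What is H2O?",
    "A": "Water", "B": "Salt", "C": "Iron", "D": "Gold", "E": "Neon",
}
first, second = build_pairs(example)
print(first[0] + second[0])

# Whitespace tokens stand in for subword tokens here.
ctx, qa = truncate_only_first(first[0].split(), second[0].split(), max_length=10)
print(len(ctx), len(qa))
```

The point of truncating only the first segment is that a long context gets cut while the question and the answer option are kept whole.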

In [None]:
# Label mappings
option_to_index = {option: idx for idx, option in enumerate('ABCDE')}  # {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
index_to_option = {v: k for k,v in option_to_index.items()}  # {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E'}

# Preprocessing function
def preprocess(example):
    # First segment: the context, repeated 5 times (once per answer option)
    first_sentence = ["[CLS] " + example['context']] * 5
    # Second segment for each option: prompt, option text, and separator tokens
    second_sentences = [" #### " + example['prompt'] + " [SEP] " + example[option] + " [SEP]" for option in 'ABCDE']
    # Tokenize the pairs; if an input is too long, truncate only the context
    tokenized_example = tokenizer(first_sentence, second_sentences, truncation='only_first',
                                  max_length=MAX_INPUT, add_special_tokens=False)
    # Map the correct answer letter to its index
    tokenized_example['label'] = option_to_index[example['answer']]
    return tokenized_example

# Data collator for the multiple-choice task
@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase  # Pretrained tokenizer
    padding: Union[bool, str, PaddingStrategy] = True  # Padding strategy
    max_length: Optional[int] = None  # Maximum length to pad to
    pad_to_multiple_of: Optional[int] = None  # Pad length to a multiple of this value

    # Collate a batch of features
    def __call__(self, features):
        label_name = 'label' if 'label' in features[0].keys() else 'labels'  # Key under which labels are stored
        labels = [feature.pop(label_name) for feature in features]  # Extract and remove the labels
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])  # Number of answer options
        # Split each example into one feature dict per option
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])  # Flatten into a single list

        # Pad with the tokenizer's pad method and convert to tensors
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        # Reshape to (batch_size, num_choices, sequence_length)
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)  # Add the label tensor
        return batch

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL)
dataset_valid = Dataset.from_pandas(df_valid)
dataset = Dataset.from_pandas(df_train)
dataset = dataset.remove_columns(["__index_level_0__"])
dataset

In [None]:
tokenized_dataset_valid = dataset_valid.map(preprocess, remove_columns=['prompt', 'context', 'A', 'B', 'C', 'D', 'E', 'answer'])
tokenized_dataset = dataset.map(preprocess, remove_columns=['prompt', 'context', 'A', 'B', 'C', 'D', 'E', 'answer'])
tokenized_dataset

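# Build Model
We will use Hugging Face's `AutoModelForMultipleChoice`. For the list of possible models, see Hugging Face's repository [here][1].

We can optionally use PEFT to accelerate training and reduce memory usage, but I have noticed that validation accuracy is lower. (Note that PEFT requires us to use 1xP100 instead of 2xT4 GPUs. I am not sure why.)

We can also optionally freeze layers. This also accelerates training and reduces memory usage, but validation accuracy may decrease.

[1]: https://huggingface.co/models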
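
To make the layer-freezing option concrete, the sketch below shows which encoder layers a slice such as `encoder.layer[:FREEZE_LAYERS]` would cover. It uses plain integers as stand-ins for layers. The 24-layer count is taken from the comment in this notebook's configuration cell, and `split_layers` is a hypothetical helper.

```python
def split_layers(num_layers, freeze_layers):
    """Return (frozen, trainable) layer indices for a [:freeze_layers] slice."""
    layer_ids = list(range(num_layers))
    return layer_ids[:freeze_layers], layer_ids[freeze_layers:]

frozen, trainable = split_layers(num_layers=24, freeze_layers=18)
print(len(frozen), len(trainable))  # 18 6
print(trainable)                    # [18, 19, 20, 21, 22, 23]
```

With these settings the layers closest to the input are frozen, and only the last six encoder layers (plus whatever sits on top of the encoder) receive gradient updates.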

In [None]:
# Build the model
model = AutoModelForMultipleChoice.from_pretrained(MODEL)  # Pretrained encoder with a multiple-choice head

# PEFT lowered validation accuracy in my runs, so leave it off if GPU memory allows
if USE_PEFT:
    !pip install --no-index --no-deps /kaggle/input/llm-whls/peft-0.4.0-py3-none-any.whl
    print('We are using PEFT.')
    from peft import LoraConfig, get_peft_model, TaskType
    # Configure LoRA
    peft_config = LoraConfig(
        r=8, lora_alpha=4, task_type=TaskType.SEQ_CLS, lora_dropout=0.1,
        bias="none", inference_mode=False,
        target_modules=["query_proj", "value_proj"],
        modules_to_save=['classifier','pooler'],
    )
    # Wrap the model with PEFT
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()  # Print the number of trainable parameters

# Optionally freeze the embedding layer
if FREEZE_EMBEDDINGS:
    print('Freezing embeddings.')
    for param in model.deberta.embeddings.parameters():
        param.requires_grad = False  # No gradients, so the embeddings stay fixed

# Optionally freeze the first FREEZE_LAYERS encoder layers
if FREEZE_LAYERS > 0:
    print(f'Freezing {FREEZE_LAYERS} layers.')
    for layer in model.deberta.encoder.layer[:FREEZE_LAYERS]:
        for param in layer.parameters():
            param.requires_grad = False  # No gradients, so these layers stay fixed


# MAP@3 Metric
The competition metric is MAP@3, so we write a custom metric function to pass to Hugging Face's `Trainer`. Discussion [here][1].

[1]: https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/435602
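
As a worked example of the metric, here is a minimal pure-Python sketch. It assumes each question has exactly one correct option, in which case the per-question score reduces to 1/rank when the answer appears in the top 3 and 0 otherwise. The function names and the toy rankings are hypothetical.

```python
def average_precision_at_3(ranked_options, correct):
    """Score one question: 1/rank if the answer is in the top 3, else 0."""
    for rank, option in enumerate(ranked_options[:3], start=1):
        if option == correct:
            return 1.0 / rank
    return 0.0

def mean_average_precision_at_3(all_rankings, all_correct):
    """Average the per-question scores over all questions."""
    scores = [average_precision_at_3(r, c) for r, c in zip(all_rankings, all_correct)]
    return sum(scores) / len(scores)

# Toy predictions: answer ranked 1st, 2nd, 3rd, and missing from the top 3.
rankings = [["A", "B", "C"], ["B", "A", "D"], ["C", "D", "E"], ["D", "E", "B"]]
answers = ["A", "A", "E", "A"]
score = mean_average_precision_at_3(rankings, answers)
print(round(score, 4))  # 0.4583
```

The four questions score 1, 1/2, 1/3, and 0, so the mean is (1 + 0.5 + 0.3333) / 4.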

In [None]:
# Evaluation metric
def map_at_3(predictions, labels):
    map_sum = 0  # Running sum of per-question scores
    pred = np.argsort(-1 * np.array(predictions), axis=1)[:, :3]  # Sort logits in descending order and keep the top 3 options
    for x, y in zip(pred, labels):  # Loop over predictions and true labels
        # Score 1/rank where the predicted option matches the true label, otherwise 0
        z = [1 / i if y == j else 0 for i, j in zip([1, 2, 3], x)]
        map_sum += np.sum(z)  # Accumulate this question's score
    return map_sum / len(predictions)  # Mean over all questions

def compute_metrics(p):
    predictions = p.predictions.tolist()  # Convert predictions to a list
    labels = p.label_ids.tolist()  # Convert true labels to a list
    return {"map@3": map_at_3(predictions, labels)}  # Dictionary holding the map@3 metric

In [None]:
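# Training arguments
training_args = TrainingArguments(
    warmup_ratio=0.1,  # Fraction of training steps used for linear learning-rate warmup
    learning_rate=2e-5,  # Peak learning rate
    per_device_train_batch_size=1,  # Training batch size per device
    per_device_eval_batch_size=2,  # Evaluation batch size per device
    num_train_epochs=2,  # Number of training epochs
    report_to='none',  # Do not report to any logging platform
    output_dir=f'./checkpoints_{VER}',  # Directory for checkpoints and logs
    overwrite_output_dir=True,  # Overwrite the output directory if it exists
    fp16=True,  # Train in 16-bit floating point to save memory and time
    gradient_accumulation_steps=8,  # Accumulate gradients over 8 steps to emulate a larger batch
    logging_steps=25,  # Log every 25 steps
    evaluation_strategy='steps',  # Evaluate on a step schedule
    eval_steps=25,  # Evaluate every 25 steps
    save_strategy="steps",  # Save checkpoints on a step schedule
    save_steps=25,  # Save a checkpoint every 25 steps
    load_best_model_at_end=False,  # Do not reload the best checkpoint when training ends
    metric_for_best_model='map@3',  # Metric used to pick the best model
    lr_scheduler_type='cosine',  # Cosine learning-rate schedule
    weight_decay=0.01,  # Weight decay
    save_total_limit=2,  # Keep at most 2 checkpoints on disk
)
# Create the trainer
trainer = Trainer(
    model=model,  # Model to train
    args=training_args,  # Training arguments
    tokenizer=tokenizer,  # Tokenizer
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),  # Collator that pads and reshapes batches
    train_dataset=tokenized_dataset,  # Training dataset
    eval_dataset=tokenized_dataset_valid,  # Evaluation dataset
    compute_metrics=compute_metrics,  # Function that computes the evaluation metric
    #callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],  # Early stopping; uncomment if needed (requires load_best_model_at_end=True)
)

# Start training
trainer.train()
# Save the model
trainer.save_model(f'model_v{VER}')  # Save the model to the given path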

# Verify Saved Model
During training, we see the MAP@3 validation score above. Let's load the saved model and compute it again here to verify that our model is saved correctly.
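
To estimate how many validation scores the training run prints, a rough sketch of the step arithmetic follows. The argument values mirror this notebook's settings. `num_gpus=2` is an assumption based on `CUDA_VISIBLE_DEVICES="0,1"`, the helper `training_schedule` is hypothetical, and the rounding may differ slightly from what `Trainer` does internally.

```python
import math

def training_schedule(num_samples, per_device_batch, num_gpus, grad_accum,
                      epochs, eval_steps, warmup_ratio):
    """Estimate optimizer steps, warmup steps, and evaluation points."""
    batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_gpus))
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": per_device_batch * num_gpus * grad_accum,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
        "eval_at": list(range(eval_steps, total_steps + 1, eval_steps)),
    }

# Values from this notebook; num_gpus=2 is an assumption.
plan = training_schedule(num_samples=1024, per_device_batch=1, num_gpus=2,
                         grad_accum=8, epochs=2, eval_steps=25, warmup_ratio=0.1)
print(plan)
```

Under these assumptions the effective batch size is 16, training runs for 128 optimizer steps, and the validation score is reported at steps 25, 50, 75, 100, and 125.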

In [None]:
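# Free the trained model and trainer, then reload the weights saved to disk
del model, trainer
if USE_PEFT:
    # Rebuild the PEFT-wrapped model and load the saved state dict into it
    model = AutoModelForMultipleChoice.from_pretrained(MODEL)
    model = get_peft_model(model, peft_config)
    checkpoint = torch.load(f'model_v{VER}/pytorch_model.bin')
    model.load_state_dict(checkpoint)
else:
    model = AutoModelForMultipleChoice.from_pretrained(f'model_v{VER}')
# A Trainer with default arguments, used here only for prediction
trainer = Trainer(model=model)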

In [None]:
test_df = pd.read_csv('/kaggle/input/60k-data-with-context-v2/train_with_context2.csv')
tokenized_test_dataset = Dataset.from_pandas(test_df).map(
    preprocess, remove_columns=['prompt', 'context', 'A', 'B', 'C', 'D', 'E'])

# Rank the five options by descending logit and keep the top 3 letters per question
test_predictions = trainer.predict(tokenized_test_dataset).predictions
predictions_as_ids = np.argsort(-test_predictions, 1)
predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_as_ids]
predictions_as_string = test_df['prediction'] = [
    ' '.join(row) for row in predictions_as_answer_letters[:, :3]
]

# Compute Validation Score
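The scoring function in this section expects each prediction as a space-separated string of up to three option letters, best guess first. A minimal pure-Python sketch of that conversion, using made-up logits and a hypothetical helper `top3_letters`, looks like this:

```python
def top3_letters(logits, letters="ABCDE"):
    """Return the three highest-scoring option letters as a space-separated string."""
    order = sorted(range(len(logits)), key=lambda i: -logits[i])
    return " ".join(letters[i] for i in order[:3])

logits = [0.1, 2.3, -1.0, 1.7, 0.4]  # made-up scores for options A..E
print(top3_letters(logits))  # B D E
```

If the true answer for this question were `D`, the prediction string `B D E` would contribute 1/2 to the MAP@3 sum.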

In [None]:
# https://www.kaggle.com/code/philippsinger/h2ogpt-perplexity-ranking
import numpy as np
def precision_at_k(r, k):
    """Precision at k"""
    assert k <= len(r)
    assert k != 0
    return sum(int(x) for x in r[:k]) / k

def MAP_at_3(predictions, true_items):
    """Score is mean average precision at 3"""
    U = len(predictions)
    map_at_3 = 0.0
    for u in range(U):
        user_preds = predictions[u].split()
        user_true = true_items[u]
        user_results = [1 if item == user_true else 0 for item in user_preds]
        for k in range(min(len(user_preds), 3)):
            map_at_3 += precision_at_k(user_results, k+1) * user_results[k]
    return map_at_3 / U

In [None]:
m = MAP_at_3(test_df.prediction.values, test_df.answer.values)
print( 'CV MAP@3 =',m )