# How To Train Model for Open Book Q&A Technique - Part 2
The notebook you are reading is a fork of Mgoksu's great notebook [here][1]. Mgoksu (@mgoksu) demonstrated how to achieve a top public LB score of 0.807 using the Open Book technique. The Open Book method was first presented by JJ (@jjinho) [here][2], then Quangteo (@quangbk) reduced its RAM usage [here][3], and Anil (@nlztrk) combined it with Q&A [here][4]. Radek (@radek1) demonstrated the strength of Q&A [here][5].

In my previous notebook [here][6] (i.e. Part 1), we demonstrated how to train a model for Open Book. The model was trained using my 60k Kaggle dataset [here][7]. If you enjoy the notebook you are reading, please upvote the dataset too. Thanks!

In this notebook, we load the trained model output by my previous notebook. Before running inference with it, we run the code from Mgoksu's public notebook, which uses Open Book to search Wikipedia for context. For each test sample in the hidden dataset, we append the retrieved Wikipedia context. Our trained model then predicts the multiple-choice answer from both the question and the appended context. The final prediction is a 50/50 ensemble of the new Q&A model we trained and Mgoksu's original model. Here is a diagram showing the Open Book method:

![](https://miro.medium.com/v2/resize:fit:800/format:webp/1*bTGY3fKIgNefQxNsOYpnBw.png)

(image source [here][8])

[1]: https://www.kaggle.com/code/mgoksu/0-807-sharing-my-trained-with-context-model
[2]: https://www.kaggle.com/code/jjinho/open-book-llm-science-exam
[3]: https://www.kaggle.com/code/quangbk/open-book-llm-science-exam-reduced-ram-usage
[4]: https://www.kaggle.com/code/nlztrk/openbook-debertav3-large-baseline-single-model
[5]: https://www.kaggle.com/code/radek1/new-dataset-deberta-v3-large-training
[6]: https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model
[7]: https://www.kaggle.com/datasets/cdeotte/60k-data-with-context-v2
[8]: https://blog.gopenai.com/enrich-llms-with-retrieval-augmented-generation-rag-17b82a96b6f0

# OpenBook DeBERTaV3-Large with an updated model

This work is based on the great [work](https://www.kaggle.com/code/nlztrk/openbook-debertav3-large-baseline-single-model) of [nlztrk](https://www.kaggle.com/nlztrk).

I trained a model offline using the dataset I shared [here](https://www.kaggle.com/datasets/mgoksu/llm-science-exam-dataset-w-context). I just added my model to the original notebook. The model is available [here](https://www.kaggle.com/datasets/mgoksu/llm-science-run-context-2).

I also addressed the problem of [CSV Not Found at submission](https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/434228) with this notebook by clipping the context like so:

`test_df["prompt"] = test_df["context"].apply(lambda x: x[:1500]) + " #### " +  test_df["prompt"]`

You can probably keep more than 1500 characters of context without running out of memory (OOM); the Inference section of this notebook clips at 1750.
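
To make the clipping concrete, here is a minimal, dependency-free sketch of the same idea on plain strings. The helper name `build_prompt` and the sample texts are hypothetical; the ` #### ` separator and the default character limit follow the line quoted above.

```python
SEPARATOR = " #### "  # same separator as in the clipping line above


def build_prompt(context: str, question: str, max_context_chars: int = 1500) -> str:
    """Clip the context to a character budget, then prepend it to the question."""
    return context[:max_context_chars] + SEPARATOR + question


# Invented sample data: a context far longer than the budget
long_context = "Photosynthesis converts light into chemical energy. " * 100
question = "Which process converts light into chemical energy?"

prompt = build_prompt(long_context, question)
print(len(long_context), len(prompt))
```

Character clipping is only a rough proxy for token count, so a suitable limit depends on the tokenizer and on the available GPU memory.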

In [None]:
# installing offline dependencies
!pip install -U /kaggle/input/faiss-gpu-173-python310/faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
!cp -rf /kaggle/input/sentence-transformers-222/sentence-transformers /kaggle/working/sentence-transformers
!pip install -U /kaggle/working/sentence-transformers
!pip install -U /kaggle/input/blingfire-018/blingfire-0.1.8-py3-none-any.whl

!pip install --no-index --no-deps /kaggle/input/llm-whls/transformers-4.31.0-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/peft-0.4.0-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/datasets-2.14.3-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/trl-0.5.0-py3-none-any.whl

In [None]:
from __future__ import annotations

import ctypes
import gc
import os
import re
from collections.abc import Iterable
from dataclasses import dataclass
from typing import Optional, Union

import blingfire as bf
import faiss
import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from faiss import read_index, write_index
from sentence_transformers import SentenceTransformer
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AutoModelForMultipleChoice, AutoTokenizer, Trainer, TrainingArguments
from transformers.tokenization_utils_base import PaddingStrategy, PreTrainedTokenizerBase

# Handle to glibc so that malloc_trim() can hand freed heap memory back to the OS
libc = ctypes.CDLL("libc.so.6")

In [None]:
# Retrieval pipeline
# These helpers split documents into individual sentences, so that the sentences can later be matched to each test question and added as context for inference
def process_documents(documents: Iterable[str],
                      document_ids: Iterable,
                      split_sentences: bool = True,
                      filter_len: int = 3,
                      disable_progress_bar: bool = False) -> pd.DataFrame:
    """
    Main helper function to process documents. (It was originally written for EMR
    documents; in this notebook the documents are Wikipedia articles.)

    :param documents: Iterable containing documents which are strings
    :param document_ids: Iterable containing document unique identifiers
    :param split_sentences: Flag to determine whether to further split sections into sentences
    :param filter_len: Minimum character length of a sentence (otherwise filter out)
    :param disable_progress_bar: Flag to disable the tqdm progress bar
    :return: Pandas DataFrame containing the columns `document_id`, `text`, `offset`
    """

    df = sectionize_documents(documents, document_ids, disable_progress_bar)

    if split_sentences:
        df = sentencize(df.text.values,
                        df.document_id.values,
                        df.offset.values,
                        filter_len,
                        disable_progress_bar)
    return df


def sectionize_documents(documents: Iterable[str],
                         document_ids: Iterable,
                         disable_progress_bar: bool = False) -> pd.DataFrame:
    """
    Treats each document as a single section spanning the whole text. (The original
    helper extracted selected sections such as FINDINGS, IMPRESSION and ADDENDUM
    from imaging reports; no such filtering happens here.)

    :param documents: Iterable containing documents which are strings
    :param document_ids: Iterable containing document unique identifiers
    :param disable_progress_bar: Flag to disable the tqdm progress bar
    :return: Pandas DataFrame containing the columns `document_id`, `text`, `offset`
    """
    processed_documents = []
    for document_id, document in tqdm(zip(document_ids, documents), total=len(documents), disable=disable_progress_bar):
        row = {}
        text, start, end = (document, 0, len(document))
        row['document_id'] = document_id
        row['text'] = text
        row['offset'] = (start, end)

        processed_documents.append(row)

    _df = pd.DataFrame(processed_documents)
    if _df.shape[0] > 0:
        return _df.sort_values(['document_id', 'offset']).reset_index(drop=True)
    else:
        return _df


def sentencize(documents: Iterable[str],
               document_ids: Iterable,
               offsets: Iterable[tuple[int, int]],
               filter_len: int = 3,
               disable_progress_bar: bool = False) -> pd.DataFrame:
    """
    Split documents into sentences. Can be used with `sectionize_documents`
    to further split documents into more manageable pieces. Takes in offsets
    to ensure that, after splitting, the sentences can be matched to their
    location in the original documents.

    :param documents: Iterable containing documents which are strings
    :param document_ids: Iterable containing document unique identifiers
    :param offsets: Iterable of tuples holding the start and end indices
    :param filter_len: Minimum character length of a sentence (otherwise filter out)
    :param disable_progress_bar: Flag to disable the tqdm progress bar
    :return: Pandas DataFrame containing the columns `document_id`, `text`, `offset`
    """

    document_sentences = []
    for document, document_id, offset in tqdm(zip(documents, document_ids, offsets), total=len(documents), disable=disable_progress_bar):
        try:
            _, sentence_offsets = bf.text_to_sentences_and_offsets(document)
            for o in sentence_offsets:
                if o[1]-o[0] > filter_len:
                    sentence = document[o[0]:o[1]]
                    # Shift sentence offsets so they are relative to the original document
                    abs_offsets = (o[0]+offset[0], o[1]+offset[0])
                    row = {}
                    row['document_id'] = document_id
                    row['text'] = sentence
                    row['offset'] = abs_offsets
                    document_sentences.append(row)
        except Exception:
            # Skip documents that the sentence splitter cannot handle
            continue
    return pd.DataFrame(document_sentences)
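
To illustrate what `sentencize` produces, the sketch below splits a document into sentences with character offsets and drops very short ones. It deliberately swaps in a simple regular expression for `blingfire`, so the split points will differ from the real sentence splitter; `split_with_offsets` is a hypothetical helper, not part of the notebook.

```python
import re


def split_with_offsets(document: str, filter_len: int = 3) -> list[dict]:
    """Split text into sentences, keeping (start, end) character offsets."""
    rows = []
    # A "sentence" here is a run of characters up to and including ., ! or ?
    # (or the remaining tail). This is a crude stand-in for blingfire.
    for match in re.finditer(r"[^.!?]+[.!?]?", document):
        raw = match.group()
        sentence = raw.strip()
        if not sentence:
            continue
        # Skip leading whitespace so the offsets point at the sentence itself
        start = match.start() + (len(raw) - len(raw.lstrip()))
        end = start + len(sentence)
        if end - start > filter_len:  # same length filter as `sentencize`
            rows.append({"text": sentence, "offset": (start, end)})
    return rows


doc = "DNA stores genetic information. Ok. RNA carries it to the ribosome."
rows = split_with_offsets(doc)
for row in rows:
    print(row["offset"], row["text"])
```

Keeping the offsets means every sentence can be traced back to its position in the source text, and the length filter removes fragments such as "Ok." that carry little information.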

In [None]:
SIM_MODEL = '/kaggle/input/sentencetransformers-allminilml6v2/sentence-transformers_all-MiniLM-L6-v2'
DEVICE = 0
MAX_LENGTH = 384
BATCH_SIZE = 16

WIKI_PATH = "/kaggle/input/wikipedia-20230701"
wiki_files = os.listdir(WIKI_PATH)

# Relevant Title Retrieval

In [None]:
trn = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/test.csv").drop(columns="id")
trn.head()

In [None]:
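# This model encodes text into embedding vectors that can be compared for similarity; it is used to embed both the questions and the Wikipedia text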
model = SentenceTransformer(SIM_MODEL, device='cuda')
model.max_seq_length = MAX_LENGTH
model = model.half()

In [None]:
# Inference time is limited, so the Wikipedia embeddings were computed offline and stored in a FAISS index, which we load here
sentence_index = read_index("/kaggle/input/wikipedia-2023-07-faiss-index/wikipedia_202307.index")

In [None]:
# The hidden multiple-choice questions cannot be embedded offline, so we encode each prompt into a vector here at inference time
prompt_embeddings = model.encode(trn.prompt.values, batch_size=BATCH_SIZE, device=DEVICE, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True)
prompt_embeddings = prompt_embeddings.detach().cpu().numpy()
_ = gc.collect()

In [None]:
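# Search the index with the prompt embeddings to get the indices of the 5 most relevant Wikipedia articles per question; the article text is looked up from these indices below and added to the hidden data as context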
search_score, search_index = sentence_index.search(prompt_embeddings, 5)

In [None]:
del sentence_index
del prompt_embeddings
_ = gc.collect()
libc.malloc_trim(0)

# Getting Sentences from the Relevant Titles

In [None]:
df = pd.read_parquet("/kaggle/input/wikipedia-20230701/wiki_2023_index.parquet",
                     columns=['id', 'file'])

In [None]:
## Get the article and associated file location using the index
wikipedia_file_data = []

for i, (scr, idx) in tqdm(enumerate(zip(search_score, search_index)), total=len(search_score)):
    scr_idx = idx
    _df = df.loc[scr_idx].copy()
    _df['prompt_id'] = i
    wikipedia_file_data.append(_df)
wikipedia_file_data = pd.concat(wikipedia_file_data).reset_index(drop=True)
wikipedia_file_data = wikipedia_file_data[['id', 'prompt_id', 'file']].drop_duplicates().sort_values(['file', 'id']).reset_index(drop=True)

## Save memory - delete df since it is no longer necessary
del df
_ = gc.collect()
libc.malloc_trim(0)

In [None]:
## Get the full text data
# Assumption: `wiki_text_data` is not built anywhere else in this notebook, so this cell
# reconstructs it from how it is used below. Every file listed in `wikipedia_file_data.file`
# is assumed to be a parquet shard under WIKI_PATH with string `id` and `text` columns.
wiki_text_data = []

for file in tqdm(wikipedia_file_data.file.unique(), total=len(wikipedia_file_data.file.unique())):
    # IDs of the retrieved articles that are stored in this shard
    _id = [str(i) for i in wikipedia_file_data[wikipedia_file_data['file'] == file]['id'].tolist()]
    _df = pd.read_parquet(f"{WIKI_PATH}/{file}", columns=['id', 'text'])

    _df_temp = _df[_df['id'].isin(_id)].copy()
    del _df
    _ = gc.collect()
    libc.malloc_trim(0)
    wiki_text_data.append(_df_temp)

wiki_text_data = pd.concat(wiki_text_data).drop_duplicates().reset_index(drop=True)
_ = gc.collect()

In [None]:
## Parse documents into sentences
processed_wiki_text_data = process_documents(wiki_text_data.text.values, wiki_text_data.id.values)

In [None]:
## Get embeddings of the wiki text data
wiki_data_embeddings = model.encode(processed_wiki_text_data.text,
                                    batch_size=BATCH_SIZE,
                                    device=DEVICE,
                                    show_progress_bar=True,
                                    convert_to_tensor=True,
                                    normalize_embeddings=True)#.half()
wiki_data_embeddings = wiki_data_embeddings.detach().cpu().numpy()

In [None]:
_ = gc.collect()

In [None]:
## Combine all answers
trn['answer_all'] = trn.apply(lambda x: " ".join([x['A'], x['B'], x['C'], x['D'], x['E']]), axis=1)


## Search using the prompt and answers to guide the search
trn['prompt_answer_stem'] = trn['prompt'] + " " + trn['answer_all']

In [None]:
question_embeddings = model.encode(trn.prompt_answer_stem.values, batch_size=BATCH_SIZE, device=DEVICE, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True)
question_embeddings = question_embeddings.detach().cpu().numpy()

# Extracting Matching Prompt-Sentence Pairs

In [None]:
## Parameter that determines how many relevant sentences to include
NUM_SENTENCES_INCLUDE = 20

## List containing only the contexts
contexts = []

for r in tqdm(trn.itertuples(), total=len(trn)):

    prompt_id = r.Index  # row index of the current question

    # Rows of processed_wiki_text_data that belong to the articles retrieved for this question
    prompt_indices = processed_wiki_text_data[processed_wiki_text_data['document_id'].isin(wikipedia_file_data[wikipedia_file_data['prompt_id']==prompt_id]['id'].values)].index.values

    # Start from an empty context so that a question without candidate sentences
    # never reuses the previous question's context
    context = ""

    if prompt_indices.shape[0] > 0:
        # Build a small exact-search FAISS index over the candidate sentence embeddings
        prompt_index = faiss.index_factory(wiki_data_embeddings.shape[1], "Flat")
        prompt_index.add(wiki_data_embeddings[prompt_indices])

        ## Get the top matches
        # Search with this question's embedding only (prompt + answers)
        ss, ii = prompt_index.search(question_embeddings[prompt_id:prompt_id + 1], NUM_SENTENCES_INCLUDE)
        candidate_texts = processed_wiki_text_data.loc[prompt_indices]['text']
        for _s, _i in zip(ss[0], ii[0]):
            # FAISS pads the result with -1 when fewer candidates than requested exist
            if _i < 0:
                continue
            context += candidate_texts.iloc[_i] + " "

    contexts.append(context)
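
A note on the ranking used here: `faiss.index_factory(d, "Flat")` defaults to L2 distance, and the embeddings were produced with `normalize_embeddings=True`. For unit vectors, ||a - b||² = 2 - 2·(a·b), so ranking by smallest L2 distance gives the same order as ranking by highest cosine similarity. The brute-force sketch below checks this in pure Python; the vectors and helper names are made up for illustration, and FAISS itself is far more efficient.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, vectors, k):
    """Brute-force search: indices of the k vectors closest to the query (L2)."""
    order = sorted(range(len(vectors)), key=lambda i: squared_l2(query, vectors[i]))
    return order[:k]


# Made-up 3-dimensional "sentence embeddings", normalized to unit length
raw_vectors = ([1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.9, 0.3, 0.1], [0.0, 0.1, 1.0])
sentences = [normalize(v) for v in raw_vectors]
query = normalize([1.0, 0.2, 0.0])

by_l2 = top_k(query, sentences, k=2)
by_cosine = sorted(range(len(sentences)), key=lambda i: -dot(query, sentences[i]))[:2]
print(by_l2, by_cosine)
```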


In [None]:
trn['context'] = contexts

In [None]:
# The hidden test data now has the same layout (prompt, context and options) as the training data
trn[["prompt", "context", "A", "B", "C", "D", "E"]].to_csv("./test_context.csv", index=False)

# Inference

In [None]:
test_df = pd.read_csv("test_context.csv")
test_df.index = list(range(len(test_df)))
test_df['id'] = list(range(len(test_df)))
test_df["prompt"] = test_df["context"].apply(lambda x: x[:1750]) + " #### " +  test_df["prompt"]
test_df['answer'] = 'A'

In [None]:
model_dir = "/kaggle/input/llm-science-run-context-2"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
model.eval()

In [None]:
# Build dictionaries that map option names (A, B, C, D, E) to indices and back
options = 'ABCDE'  # option names
indices = list(range(5))  # matching indices, 0 to 4

# Option name -> index
option_to_index = {option: index for option, index in zip(options, indices)}
# Index -> option name
index_to_option = {index: option for option, index in zip(options, indices)}

def preprocess(example):
    # AutoModelForMultipleChoice expects a set of question/answer pairs,
    # so we repeat the question 5 times before tokenizing
    first_sentence = [example['prompt']] * 5
    second_sentence = []  # text of each option
    for option in options:
        second_sentence.append(example[option])

    # The tokenizer converts the text pairs into token IDs the model understands
    # (`tokenizer` must already be loaded at this point)
    tokenized_example = tokenizer(first_sentence, second_sentence, truncation=True)

    # Convert the answer letter to its index and store it as the label
    tokenized_example['label'] = option_to_index[example['answer']]

    return tokenized_example
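
To show the input layout that `preprocess` prepares, the sketch below builds the five (context + question, option) pairs for one example without calling a tokenizer. The sample row is invented; only the pairing logic mirrors the function above.

```python
options = "ABCDE"
option_to_index = {option: index for index, option in enumerate(options)}

# Invented example row: clipped context, separator, then the question
example = {
    "prompt": "Water boils at 100 C at sea level. #### At what temperature does water boil at sea level?",
    "A": "0 C", "B": "50 C", "C": "100 C", "D": "150 C", "E": "200 C",
    "answer": "C",
}

# One (first_sentence, second_sentence) pair per option; the model scores each pair
pairs = [(example["prompt"], example[option]) for option in options]
label = option_to_index[example["answer"]]

for option, (first, second) in zip(options, pairs):
    print(option, "|", first[-30:], "|", second)
print("label index:", label)
```

At inference time the label is only a placeholder (the notebook sets every `answer` to `'A'`), because the data collator expects a label field to be present.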

In [None]:
@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase  # pretrained tokenizer used for padding
    padding: Union[bool, str, PaddingStrategy] = True  # padding strategy: bool, string or PaddingStrategy value
    max_length: Optional[int] = None  # optional maximum length passed to the tokenizer's pad method
    pad_to_multiple_of: Optional[int] = None  # if set, sequence lengths are padded to a multiple of this value

    def __call__(self, features):
        # The label key may be named "label" or "labels"
        label_name = "label" if 'label' in features[0].keys() else 'labels'
        # Pull the labels out of the features
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        # Turn each example into one feature dict per choice, as expected by tokenizer.pad
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])  # flatten the nested list

        # Pad all sequences to a common length and return PyTorch tensors
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        # Reshape to (batch_size, num_choices, seq_len)
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add the labels back as a tensor
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch
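
The reshaping done by the collator can be traced on plain lists. The sketch below uses invented token IDs and a hypothetical pad ID of 0; it flattens a ragged batch of shape (batch, choices, variable length), pads every sequence to the longest one, and regroups to (batch, choices, seq_len), which is roughly what `tokenizer.pad` followed by `view` achieves on tensors.

```python
PAD_ID = 0  # hypothetical padding token ID

# Two questions with three choices each (the notebook uses five), ragged lengths
features = [
    {"input_ids": [[1, 5, 2], [1, 6, 7, 2], [1, 8, 2]]},
    {"input_ids": [[1, 9, 9, 9, 2], [1, 4, 2], [1, 3, 3, 2]]},
]

batch_size = len(features)
num_choices = len(features[0]["input_ids"])

# Flatten to batch_size * num_choices independent sequences
flat = [ids for feature in features for ids in feature["input_ids"]]

# Pad every sequence (on the right) to the longest length in the batch
max_len = max(len(ids) for ids in flat)
padded = [ids + [PAD_ID] * (max_len - len(ids)) for ids in flat]

# Regroup into (batch_size, num_choices, max_len)
batch = [padded[i * num_choices:(i + 1) * num_choices] for i in range(batch_size)]
print(batch_size, num_choices, max_len)
```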

In [None]:
tokenized_test_dataset = Dataset.from_pandas(test_df[['id', 'prompt', 'A', 'B', 'C', 'D', 'E', 'answer']].drop(columns=['id'])).map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
tokenized_test_dataset = tokenized_test_dataset.remove_columns(["__index_level_0__"])
data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
test_dataloader = DataLoader(tokenized_test_dataset, batch_size=1, shuffle=False, collate_fn=data_collator)

In [None]:
test_predictions = []
for batch in test_dataloader:
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    test_predictions.append(outputs.logits.cpu().detach())

test_predictions = torch.cat(test_predictions)
test_predictions = test_predictions.numpy()

# Load Model From Our Train Notebook

In [None]:
model_dir = "/kaggle/input/how-to-train-open-book-model/model_v2"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
model.eval()

In [None]:
test_predictions2 = []
for batch in test_dataloader:
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    test_predictions2.append(outputs.logits.cpu().detach())

test_predictions2 = torch.cat(test_predictions2)
test_predictions = (test_predictions+test_predictions2.numpy()) / 2.0

predictions_as_ids = np.argsort(-test_predictions, 1)

predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_as_ids]
# predictions_as_answer_letters[:3]

predictions_as_string = test_df['prediction'] = [
    ' '.join(row) for row in predictions_as_answer_letters[:, :3]
]
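
The averaging and ranking step can be checked by hand on small numbers. The logits below are invented; the sketch averages two models' scores with equal weight, sorts the options by descending score, and keeps the top three as a space-separated string, mirroring what the cell above does with NumPy.

```python
OPTIONS = "ABCDE"

# Invented logits for one question from two models
logits_model_1 = [0.2, 2.0, -1.0, 0.5, 1.0]
logits_model_2 = [1.5, 0.5, -0.5, 1.5, 0.8]

# Equal-weight (50/50) ensemble of the raw logits
averaged = [(a + b) / 2.0 for a, b in zip(logits_model_1, logits_model_2)]

# Rank option indices by descending averaged score and keep the top three
ranked = sorted(range(len(OPTIONS)), key=lambda i: -averaged[i])
prediction = " ".join(OPTIONS[i] for i in ranked[:3])
print(prediction)
```

Here model 1 prefers B and model 2 is split between A and D, yet the averaged scores still put B first, which is the kind of smoothing an ensemble is meant to provide.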

In [None]:
submission = test_df[['id', 'prediction']]
submission.to_csv('submission.csv', index=False)