# How To Train Model for Open Book Q&A Technique
In this notebook we demonstrate how to train a model for use with the top-scoring Open Book Q&A method. The Open Book method was first presented by JJ (@jjinho) [here][1], then Quangteo (@quangbk) reduced its RAM usage [here][2], and Anil (@nlztrk) combined it with Q&A [here][3]. Radek (@radek1) demonstrated the strength of Q&A [here][5]. Next, Mgoksu (@mgoksu) showed how to achieve a top public LB=0.807 with this method [here][4] by finetuning DeBERTa large.

In order to train a model for use with Open Book Q&A, we need a CSV that contains `prompt` (i.e. the question), `A, B, C, D, E` (i.e. the answer choices), and a `context` column extracted from Wikipedia pages for each question. To generate the `context` column, we run Mgoksu's notebook [here][4]. In code cell #5, we load our CSV without the `context` column using `trn = pd.read_csv(OUR_DATASET.CSV)`. Then in code cell #21, our dataset is saved to disk as `test_context.csv` with the `context` column added.
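
As a quick sanity check before and after running the context-generation notebook, the sketch below verifies that a CSV header has the columns described above. The helper name `missing_columns` and the one-row sample are illustrative assumptions, not part of the original notebooks; the `answer` column is included because it appears in the datasets loaded later.

```python
import csv
import io

# Columns the training CSV is expected to have (from the description above).
REQUIRED_COLUMNS = ["prompt", "context", "A", "B", "C", "D", "E", "answer"]

def missing_columns(header, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `header`."""
    present = set(header)
    return [name for name in required if name not in present]

# Hypothetical one-row CSV, used only to exercise the check.
sample = io.StringIO(
    "prompt,A,B,C,D,E,answer\n"
    "What is 2+2?,1,2,3,4,5,D\n"
)
header = next(csv.reader(sample))
print(missing_columns(header))  # the `context` column has not been added yet
```

An empty list means the file is ready for the training steps in this notebook.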

I searched for and concatenated all publicly shared datasets into one 60k CSV and then ran Mgoksu's notebook with `NUM_TITLES_INCLUDE = 5` and `NUM_SENTENCES_INCLUDE = 20`. This added a `context` column. I uploaded the resulting CSV file to a Kaggle dataset [here][6]. If you enjoy the notebook you are reading, please upvote the dataset too. Thanks!

![](https://miro.medium.com/v2/resize:fit:800/format:webp/1*bTGY3fKIgNefQxNsOYpnBw.png)

(image source [here][7])

[1]: https://www.kaggle.com/code/jjinho/open-book-llm-science-exam
[2]: https://www.kaggle.com/code/quangbk/open-book-llm-science-exam-reduced-ram-usage
[3]: https://www.kaggle.com/code/nlztrk/openbook-debertav3-large-baseline-single-model
[4]: https://www.kaggle.com/code/mgoksu/0-807-sharing-my-trained-with-context-model
[5]: https://www.kaggle.com/code/radek1/new-dataset-deberta-v3-large-training
[6]: https://www.kaggle.com/datasets/cdeotte/60k-data-with-context-v2
[7]: https://blog.gopenai.com/enrich-llms-with-retrieval-augmented-generation-rag-17b82a96b6f0

# Load CSV
We will load the 60k CSV of `prompt`, `A,B,C,D,E`, and `context` from my Kaggle dataset [here][1]. This dataset is all publicly shared datasets concatenated and then processed with Mgoksu's notebook [here][2] to create a `context` column. (To learn more about the datasets it contains, read my discussion post.) This Kaggle dataset also contains the competition `train.csv` with an added `context` column (to be used as a validation dataset).

In this train notebook, we have internet turned on and can choose whatever model we wish to download and train. After we finetune this model, we will create a second notebook with the Open Book Q&A technique and load the finetuned model from the output of this notebook. The second notebook will have internet turned off so that it can be submitted to Kaggle's competition.

[1]: https://www.kaggle.com/datasets/cdeotte/60k-data-with-context-v2
[2]: https://www.kaggle.com/code/mgoksu/0-807-sharing-my-trained-with-context-model

In [1]:
%load_ext autoreload
%autoreload 2
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

from typing import Optional, Union
import pandas as pd, numpy as np, torch
from datasets import Dataset # Hugging Face has a tutorial on how to use this
from dataclasses import dataclass
from transformers import AutoTokenizer
from transformers import EarlyStoppingCallback
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

# VER='repeat_V3'
# TRAIN WITH SUBSET OF 60K
# NUM_TRAIN_SAMPLES = 1_024
NUM_TRAIN_SAMPLES = None

# PARAMETER EFFICIENT FINE TUNING
# PEFT REQUIRES 1XP100 GPU NOT 2XT4
USE_PEFT = True
# USE_PEFT = True # PEFT stands for parameter-efficient fine-tuning; Hugging Face has a tutorial

# NUMBER OF LAYERS TO FREEZE
# DEBERTA LARGE HAS TOTAL OF 24 LAYERS
FREEZE_LAYERS = 18
# FREEZE_LAYERS = 24
# FREEZE_LAYERS = 20


# BOOLEAN TO FREEZE EMBEDDINGS
FREEZE_EMBEDDINGS = True
# LENGTH OF CONTEXT PLUS QUESTION ANSWER
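# I need to work out what this length actually covers, especially how it is split
# between the context and the question. 256 cannot possibly cover everything.
# If the model is not trained on long enough inputs, a model with weak positional
# encoding will extrapolate poorly.
# MAX_INPUT = 256
MAX_INPUT = 786 # whenever this value changes, the dataset must be re-tokenized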

# HUGGING FACE MODEL
MODEL = 'microsoft/deberta-v3-large'
VER=f'{FREEZE_LAYERS}_{MAX_INPUT}_original_data'


checkpoint_folder = MODEL.split('/')[-1] + '_checkpoints'
dataset_folder = MODEL.split('/')[-1] + '_datasets'

In [2]:
df_valid = pd.read_csv('../input/60k-data-with-context-v2/train_with_context2.csv')
print('Validation data size:', df_valid.shape )
df_valid

Validation data size: (200, 8)


Unnamed: 0,prompt,context,A,B,C,D,E,answer
0,Which of the following statements accurately d...,The presence of a clustered thick disk-like co...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,D
1,Which of the following is an accurate definiti...,Many of these systems evolve in a self-similar...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A
2,Which of the following statements accurately d...,It is possible that this usage is related with...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A
3,What is the significance of regularization in ...,Renormalization is distinct from regularizatio...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,C
4,Which of the following statements accurately d...,Several qualitative observations can be made o...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,D
...,...,...,...,...,...,...,...,...
195,What is the relation between the three moment ...,The second equation is more general as it does...,The three moment theorem expresses the relatio...,The three moment theorem is used to calculate ...,The three moment theorem describes the relatio...,The three moment theorem is used to calculate ...,The three moment theorem is used to derive the...,C
196,"What is the throttling process, and why is it ...",A throttle is the mechanism by which fluid flo...,The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...,The throttling process is a steady adiabatic f...,The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...,B
197,What happens to excess base metal as a solutio...,"Furthermore, this melting may begin at a tempe...","The excess base metal will often solidify, bec...",The excess base metal will often crystallize-o...,"The excess base metal will often dissolve, bec...","The excess base metal will often liquefy, beco...","The excess base metal will often evaporate, be...",B
198,"What is the relationship between mass, force, ...",Newton first set out the definition of mass Th...,Mass is a property that determines the weight ...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is a property that determines the size of...,D


In [3]:
df_train = pd.read_csv('../input/60k-data-with-context-v2/all_12_with_context2.csv')
# df_train = pd.read_csv('../input/99k-context/RACE_with_context_original.csv')

print('size of dataset', len(df_train))
# df_train = df_train.drop(columns="source")
if 'source' in df_train.columns:
    df_train = df_train.drop(columns="source")
df_train = df_train.fillna('')
if NUM_TRAIN_SAMPLES:
    df_train = df_train.sample(NUM_TRAIN_SAMPLES) # taken NUM_TRAIN_SAMPLES of samples here
print('Train data size:', df_train.shape )
df_train

size of dataset 60347
Train data size: (60347, 8)


Unnamed: 0,prompt,context,A,B,C,D,E,answer
0,"In relation to Eunice Fay McKenzie's career, w...","Eunice Fay McKenzie (February 19, 1918 – April...",McKenzie showcased her singing talents in nume...,McKenzie is primarily remembered for her starr...,McKenzie gained recognition for her role as a ...,McKenzie's collaborations with director Blake ...,McKenzie's successful career in sound films co...,B
1,How does Modified Newtonian Dynamics (MOND) im...,The presence of a clustered thick disk-like co...,MOND is a theory that increases the discrepanc...,MOND explains the missing baryonic mass in gal...,MOND is a theory that reduces the observed mis...,MOND is a theory that eliminates the observed ...,MOND's impact on the observed missing baryonic...,E
2,Which of the following statements accurately d...,Woody Hartman is a retired American soccer goa...,Ray Montgomerie is a former footballer who pla...,Ray Montgomerie is a former footballer who pla...,Ray Montgomerie is a former footballer who pla...,Ray Montgomerie is a former footballer who pla...,Ray Montgomerie is a former footballer who pla...,B
3,What is the significance of the Museum of the ...,The Museum of the Occupation of Latvia () is a...,The Museum of the Occupation of Latvia is a me...,The Museum of the Occupation of Latvia showcas...,The Museum of the Occupation of Latvia was est...,The Museum of the Occupation of Latvia primari...,The Museum of the Occupation of Latvia is a mu...,C
4,What was the previous name of the Christian Sc...,It was named the Evangelical School for the De...,The Christian School for the Deaf (CSD),The Christian School for the Blind (CSB),The Evangelical School and Chapel for the Deaf...,The Evangelical School for the Deaf (ESD),The Evangelical School for the Blind (ESB),D
...,...,...,...,...,...,...,...,...
60342,"The outer ear, or ear canal, carries sound to ...","The ear canal (external acoustic meatus, exter...",aorta,ear lobe,eardrum,lungs,,C
60343,What sport involves people quickly finding des...,Orienteering sports in which route choice is a...,mapping,,orienteering,patterning,sticking,C
60344,Almost all earthquakes occur at which place?,This subduction zone led to the formation of t...,mountains,land boundaries,plate boundaries,continental shelf,,C
60345,"Melting glaciers, rising temperatures and drou...",Impacts include changes in regional rainfall p...,nature's natural cycle,air pollution,global warming,sudden warming,,C


# Data Loader
Code is from Radek's notebook [here][1] with modifications to the tokenization process.

[1]: https://www.kaggle.com/code/radek1/new-dataset-deberta-v3-large-training
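
To make the input layout concrete, here is a minimal sketch of what the tokenization in the next cell does for a single answer choice. It uses whitespace-separated "tokens" instead of the real DeBERTa tokenizer, which is an assumption made only to keep the example dependency-free; the helper names are hypothetical. Each choice becomes one sequence of the form `[CLS] context #### prompt [SEP] option [SEP]`, and `truncation='only_first'` means only the context is shortened when the pair is too long.

```python
def build_pair(context, prompt, option):
    """Mimic the two text segments built in `preprocess` (whitespace tokens)."""
    first = ("[CLS] " + context).split()
    second = (" #### " + prompt + " [SEP] " + option + " [SEP]").split()
    return first, second

def truncate_only_first(first, second, max_length):
    """Drop tokens from the end of `first` until the pair fits in `max_length`."""
    budget = max_length - len(second)
    if budget < 0:
        # Behavior chosen for this sketch; not a claim about the real tokenizer.
        raise ValueError("second segment alone exceeds max_length")
    return first[:budget] + second

first, second = build_pair(
    context="one two three four five six seven eight",
    prompt="How many words?",
    option="eight",
)
tokens = truncate_only_first(first, second, max_length=12)
print(tokens)
```

In other words, the question and the answer choice are always kept in full, and whatever remains of `MAX_INPUT` is the budget for the context.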

In [4]:
option_to_index = {option: idx for idx, option in enumerate('ABCDE')}
index_to_option = {v: k for k,v in option_to_index.items()}

# In other words, the model sees the context during training, which keeps training consistent with inference
def preprocess(example, tokenizer):
    first_sentence = [ "[CLS] " + example['context'] ] * 5
    second_sentences = [" #### " + example['prompt'] + " [SEP] " + example[option] + " [SEP]" for option in 'ABCDE']
    tokenized_example = tokenizer(first_sentence, second_sentences, truncation='only_first',
                                  max_length=MAX_INPUT, add_special_tokens=False)
    tokenized_example['label'] = option_to_index[example['answer']]

    return tokenized_example

def race_preprocess(example, tokenizer, max_length=None):
    context = example["context"].replace("\n", " ")
    first_sentence = ["[CLS] " + context] * 4
    second_sentences = [
        " #### " + example["prompt"] + " [SEP] " + example[option] + " [SEP]"
        for option in "ABCD"
    ]
    tokenized_example = tokenizer(
        first_sentence,
        second_sentences,
        truncation="only_first",
        max_length=max_length,
        add_special_tokens=False,
    )
    tokenized_example["label"] = option_to_index[example["answer"]]

    return tokenized_example

@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    
    def __call__(self, features):
        label_name = 'label' if 'label' in features[0].keys() else 'labels'
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        
        # Huggingface tokenizer padding
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of, # this is related to mixed precision training
            return_tensors='pt',
        )

        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch
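
The collator above flattens every (example, choice) pair into one row, pads all rows to a common length, and then regroups them per example. A minimal pure-Python sketch of that reshaping, assuming only `input_ids` and a pad id of 0 (both simplifications; the function name `collate` is hypothetical):

```python
def collate(features, pad_id=0):
    """Flatten choices, pad to the longest sequence, then regroup per example."""
    batch_size = len(features)
    num_choices = len(features[0]["input_ids"])
    # Flatten: one row per (example, choice) pair.
    flat = [ids for feature in features for ids in feature["input_ids"]]
    longest = max(len(ids) for ids in flat)
    padded = [ids + [pad_id] * (longest - len(ids)) for ids in flat]
    # Regroup: batch_size lists, each holding num_choices padded rows.
    return [padded[i * num_choices:(i + 1) * num_choices] for i in range(batch_size)]

features = [
    {"input_ids": [[1, 5, 2], [1, 6, 7, 2]]},  # example 0, two choices
    {"input_ids": [[1, 8, 2], [1, 9, 2]]},     # example 1, two choices
]
batch = collate(features)
print(len(batch), len(batch[0]), len(batch[0][0]))  # batch, choices, sequence length
```

The resulting nesting corresponds to the `(batch_size, num_choices, seq_len)` shape that the `view` call produces on tensors.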

In [5]:
# Each tokenizer is the one its pretrained model used during pretraining
# Among the tokenizer's special tokens, note that [CLS] is both the bos and the cls token; cls_token tells the model a classification is expected
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


DebertaV2TokenizerFast(name_or_path='microsoft/deberta-v3-large', vocab_size=128000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

In [6]:
dataset_valid = Dataset.from_pandas(df_valid)
dataset = Dataset.from_pandas(df_train)
if '__index_level_0__' in dataset.column_names: # guard added to avoid a crash
    print('removing __index_level_0__')
    dataset = dataset.remove_columns(["__index_level_0__"])
dataset

Dataset({
    features: ['prompt', 'context', 'A', 'B', 'C', 'D', 'E', 'answer'],
    num_rows: 60347
})

In [7]:
from functools import partial
preprocess = partial(preprocess, tokenizer=tokenizer)
race_preprocess = partial(race_preprocess, tokenizer=tokenizer)

tokenized_dataset_valid = dataset_valid.map(preprocess, remove_columns=['prompt', 'context', 'A', 'B', 'C', 'D', 'E', 'answer'])

# Compare against a list of letters so that only exact single-letter column names match
num_choices = len([x for x in dataset.column_names if x in list('ABCDEFG')])
dataset_path = f'./{dataset_folder}/tokenized_dataset_{MAX_INPUT}_{num_choices}'
if os.path.exists(dataset_path):
    print(f'getting tokenized_dataset_{MAX_INPUT} from disk')
    tokenized_dataset = Dataset.load_from_disk(dataset_path)
else:
    if num_choices == 5:
        tokenized_dataset = dataset.map(preprocess, remove_columns=dataset.column_names)
    elif num_choices == 4:
        tokenized_dataset = dataset.map(race_preprocess, remove_columns=dataset.column_names)
        
    tokenized_dataset.save_to_disk(dataset_path)

# changed by cxzheng: the map call got stuck here

tokenized_dataset # it hangs at around example 21100, which is a bit odd

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

getting tokenized_dataset_786 from disk


Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
    num_rows: 60347
})

# MAP@3 Metric
The competition metric is MAP@3, so we write a custom metric function to pass to Hugging Face's Trainer. Discussion [here][1]

[1]: https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/435602
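
Because each question has exactly one correct answer, MAP@3 reduces to 1, 1/2 or 1/3 depending on the rank of the correct choice among the top three predictions, and 0 otherwise. A small worked example, with made-up predictions (the function name `map_at_3_single` is hypothetical):

```python
def map_at_3_single(ranked_choices, answer):
    """Score one question: 1, 1/2 or 1/3 by rank of the correct choice, else 0."""
    for rank, choice in enumerate(ranked_choices[:3], start=1):
        if choice == answer:
            return 1.0 / rank
    return 0.0

# Three hypothetical questions: correct at rank 1, at rank 3, and outside the top 3.
examples = [
    (["D", "A", "B"], "D"),
    (["A", "C", "E"], "E"),
    (["B", "C", "A"], "D"),
]
scores = [map_at_3_single(ranked, answer) for ranked, answer in examples]
mean_score = sum(scores) / len(scores)
print(scores, mean_score)
```

The mean here is (1 + 1/3 + 0) / 3, i.e. 4/9.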

In [8]:
def map_at_3(predictions, labels):
    map_sum = 0
    pred = np.argsort(-1*np.array(predictions),axis=1)[:,:3]
    for x,y in zip(pred,labels):
        z = [1/i if y==j else 0 for i,j in zip([1,2,3],x)]
        map_sum += np.sum(z)
    return map_sum / len(predictions)

def compute_metrics(p):
    predictions = p.predictions.tolist()
    labels = p.label_ids.tolist()
    return {"map@3": map_at_3(predictions, labels)}

# Train and Save
We will now train and save our model using Hugging Face's easy-to-use Trainer. By adjusting the parameters in this notebook, we can achieve `CV MAP@3 = 0.915+` and a corresponding single model `LB MAP@3 = 0.830+` wow!

If we run this notebook outside of Kaggle, then we can train longer and with more RAM. If we run this notebook on Kaggle, then we need to use tricks to train models efficiently. Here are some ideas:
* use fp16 (this speeds up T4 not P100)
* use gradient_accumulation_steps (this simulates larger batch sizes)
* use gradient_checkpointing (this recomputes activations during the backward pass to save GPU memory, at the cost of extra compute)
* use 2xT4 instead of 1xP100 (this doubles GPUs)
* freeze model embeddings (this reduces weights to train)
* freeze some model layers (this reduces weights to train)
* use PEFT (this reduces weights to train)
* increase LR and decrease epochs (this reduces work)
* use smaller models (this reduces weights to train)

We will use a Hugging Face AutoModelForMultipleChoice. For the list of possible models, see Hugging Face's repository [here][1]. We can optionally use PEFT to accelerate training and use less memory. However, I have noticed that validation accuracy is lower. (Note that PEFT requires us to use 1xP100 not 2xT4 GPU. I'm not sure why.) We can also optionally freeze layers. This also accelerates training and uses less memory. However, validation accuracy may drop.

[1]: https://huggingface.co/models
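
To illustrate the gradient accumulation idea from the list above, here is a rough sketch of the arithmetic using the values set later in this notebook (per-device batch size 4, target effective batch size 512, 60,347 training rows). The function name is hypothetical, and the number of updates per epoch is a simple ceiling; the Trainer may round the last partial batch differently.

```python
import math

def accumulation_plan(num_samples, per_device_batch, target_batch, num_devices=1):
    """Return (accumulation steps, effective batch size, optimizer updates per epoch)."""
    per_step = per_device_batch * num_devices
    accum = max(1, target_batch // per_step)
    effective = per_step * accum
    updates = math.ceil(num_samples / effective)  # simple ceiling, an approximation
    return accum, effective, updates

plan = accumulation_plan(num_samples=60_347, per_device_batch=4, target_batch=512)
print(plan)
```

With so few optimizer updates per epoch, evaluating every 4 steps (as the training cell does) still gives a fairly fine-grained validation curve.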

In [None]:
# # NOTE PEFT REQUIRES US TO USE 1XP100 NOT 2XT4. I'M NOT SURE WHY.
# if USE_PEFT:
#     !pip install --no-index --no-deps ../input/llm-whls/peft-0.4.0-py3-none-any.whl

In [9]:
model = AutoModelForMultipleChoice.from_pretrained(MODEL)

if USE_PEFT:
    print('We are using PEFT.')
    from peft import LoraConfig, get_peft_model, TaskType
    peft_config = LoraConfig(
        r=8, lora_alpha=4, task_type=TaskType.SEQ_CLS, lora_dropout=0.1,
        bias="none", inference_mode=False,
        target_modules=["query_proj", "value_proj"],
        modules_to_save=['classifier','pooler'],
    )
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()


Some weights of DebertaV2ForMultipleChoice were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['pooler.dense.bias', 'classifier.weight', 'pooler.dense.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We are using PEFT.
trainable params: 2,887,682 || all params: 436,899,842 || trainable%: 0.6609482820549979
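
The trainable parameter count printed above can be reproduced by hand. The sketch below assumes a hidden size of 1024 for `deberta-v3-large` and assumes that the `classifier` and `pooler` listed in `modules_to_save` are counted twice (possibly because an original and a trainable copy are both kept). I have not checked this against the PEFT source, so treat it as one reading that happens to match the number exactly.

```python
HIDDEN = 1024      # assumed hidden size of deberta-v3-large
NUM_LAYERS = 24    # stated in the config cell above
RANK = 8           # r in LoraConfig

# LoRA adds two matrices per adapted Linear: (RANK x in) and (out x RANK).
lora_per_module = RANK * (HIDDEN + HIDDEN)
lora_total = lora_per_module * 2 * NUM_LAYERS   # query_proj and value_proj per layer

pooler = HIDDEN * HIDDEN + HIDDEN   # pooler.dense weight + bias
classifier = HIDDEN + 1             # one logit per choice: weight + bias
head = pooler + classifier

print(lora_total)             # LoRA matrices only
print(lora_total + 2 * head)  # with the head counted twice (assumption)
```

Under these assumptions the LoRA matrices themselves account for fewer than 0.8M of the trainable parameters; the rest is the classification head.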


In [10]:

if FREEZE_EMBEDDINGS:
    print('Freezing embeddings.')
    for param in model.deberta.embeddings.parameters():
        param.requires_grad = False

if FREEZE_LAYERS>0:
    print(f'Freezing {FREEZE_LAYERS} layers.')
    for layer in model.deberta.encoder.layer[:FREEZE_LAYERS]:
        for param in layer.parameters():
            param.requires_grad = False
    # Newly added for v3
    for layer in model.deberta.encoder.layer[FREEZE_LAYERS:]:
        for param in layer.parameters():
            param.requires_grad = True

    total_params = sum(p.numel() for p in model.parameters())
    model_parameters = filter(lambda p: p.requires_grad, model.parameters())
    trainable_params = sum([np.prod(p.size()) for p in model_parameters])
    print(trainable_params, total_params, trainable_params/total_params)

Freezing embeddings.
Freezing 18 layers.
77875202 436899842 0.17824497633945127
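
Note what the cell above does when `USE_PEFT` is on: the second loop sets `requires_grad = True` on every parameter of the last `24 - FREEZE_LAYERS` layers, so the full base weights of those layers become trainable, not only their LoRA matrices. The sketch below reproduces the printed count under assumed `deberta-v3-large` sizes (hidden 1024, feed-forward 4096) and the same assumption as before that the classifier and pooler are counted twice.

```python
HIDDEN, INTERMEDIATE, RANK = 1024, 4096, 8   # assumed deberta-v3-large sizes
FREEZE_LAYERS, NUM_LAYERS = 18, 24

def linear(n_in, n_out):
    """Parameters of a Linear layer with bias."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN
# One encoder layer: Q, K, V, attention output, two feed-forward Linears, two LayerNorms.
base_layer = (4 * linear(HIDDEN, HIDDEN) + linear(HIDDEN, INTERMEDIATE)
              + linear(INTERMEDIATE, HIDDEN) + 2 * layer_norm)
lora_layer = 2 * RANK * (HIDDEN + HIDDEN)    # query_proj and value_proj adapters
head_twice = 2 * (linear(HIDDEN, HIDDEN) + linear(HIDDEN, 1))  # assumption: counted twice

unfrozen = NUM_LAYERS - FREEZE_LAYERS
trainable = unfrozen * (base_layer + lora_layer) + head_twice
print(base_layer, trainable)
```

So with these settings the run is closer to "fine-tune the top 6 layers plus LoRA" than to a LoRA-only run, which is worth keeping in mind when comparing memory use and accuracy.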


In [11]:
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
from transformers import TrainerState, TrainerControl, TrainerCallback
class SavePeftModelCallback(TrainerCallback):
    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        checkpoint_folder = os.path.join(
            args.output_dir, f"{PREFIX_CHECKPOINT_DIR}-{state.global_step}"
        )

        peft_model_path = os.path.join(checkpoint_folder, "torch_model")
        # peft_model_path = os.path.join(checkpoint_folder)
        kwargs["model"].base_model.save_pretrained(peft_model_path)

        # pytorch_model_path = os.path.join(checkpoint_folder, "pytorch_model.bin")
        # if os.path.exists(pytorch_model_path):
        #     os.remove(pytorch_model_path)
        return control

In [15]:
# batch_size = 4
# target_batch_size = 512
# GRAD_ACCUM = target_batch_size // batch_size
# # SAVING_STEP = 10
# SAVING_STEP = 4

# LOGGING_STEPS = SAVING_STEP
# print(SAVING_STEP)
# training_args = TrainingArguments(
#     warmup_ratio=0.03,
#     learning_rate=1e-5, # maybe 1e5
#     # per_device_train_batch_size=16,
#     per_device_train_batch_size=batch_size,
#     # per_device_eval_batch_size=32,
#     per_device_eval_batch_size=batch_size*2,
#     num_train_epochs=3,  # 2.5
#     report_to='none',
#     output_dir = f'./checkpoints_{VER}',
#     overwrite_output_dir=True,
#     fp16=True,
#     # gradient_accumulation_steps=8,
#     gradient_accumulation_steps=GRAD_ACCUM,
#     logging_steps=LOGGING_STEPS,
#     evaluation_strategy='steps',
#     eval_steps=SAVING_STEP,
#     save_strategy="steps",
#     save_steps=SAVING_STEP,
#     load_best_model_at_end=False,
#     metric_for_best_model='map@3',
#     lr_scheduler_type='cosine',
#     # weight_decay=0.01,
#     weight_decay=1e-3,
#     save_total_limit=5,

# )

4


In [12]:
# batch_size = 8 # Try this if possible for 18 512
batch_size = 4 # for 16 512

# effective_batch_size = 1024
effective_batch_size = 512
# effective_batch_size = 256

# effective_batch_size = 128
GRAD_ACCUM = effective_batch_size // batch_size
SAVING_STEP = 4
LOGGING_STEPS = SAVING_STEP
print(SAVING_STEP)
training_args = TrainingArguments(
    # warmup_ratio=0.1, 
    # warmup_ratio=0.0, 
    warmup_ratio = 0.03,
    # warmup_ratio = 0.0,
    # learning_rate = 1e-4,
    learning_rate = 2.28e-5,
    # learning_rate = 2.28e-5 * 1.6,

    # max_grad_norm = 2.0,
    max_grad_norm = 1.0,

    # max_grad_norm = 0.3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,  # 2.5
    report_to='none',
    output_dir = f'./{checkpoint_folder}/{VER}',
    overwrite_output_dir=True,
    fp16=True,
    # gradient_accumulation_steps=8,
    gradient_accumulation_steps=GRAD_ACCUM,
    logging_steps=LOGGING_STEPS,
    evaluation_strategy='steps',
    eval_steps=SAVING_STEP,
    save_strategy="steps",
    save_steps=SAVING_STEP,
    load_best_model_at_end=False,
    # metric_for_best_model='map@3',
    metric_for_best_model='eval_loss',
    seed=666,
    # lr_scheduler_type='linear',
    lr_scheduler_type='cosine',
    # lr_scheduler_type='cosine_with_restarts',    
    # lr_scheduler_type='reduce_lr_on_plateau',
    # weight_decay=0.01,
    # weight_decay=1e-6, # set this slightly higher to reduce oscillation
    weight_decay=1e-3, # set this slightly higher to reduce oscillation
    # weight_decay=3e-4, # set this slightly higher to reduce oscillation
    save_total_limit=5,
    
)
# training_args = training_args.set_optimizer(name="adamw_torch", beta1=0.9, beta2=0.98, weight_decay=training_args.weight_decay)
# training_args = training_args.set_lr_scheduler(name="reduce_lr_on_plateau", )

4
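
For intuition about `warmup_ratio=0.03` combined with `lr_scheduler_type='cosine'`, here is a small sketch of the learning-rate curve: linear warmup to the peak, then a half-cosine decay to zero. I believe this matches the shape of the scheduler used by the Trainer, but I have not checked it line by line, and the total step count below is a made-up value for illustration.

```python
import math

def lr_at(step, total_steps, peak_lr=2.28e-5, warmup_ratio=0.03):
    """Linear warmup to `peak_lr`, then a single half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL = 354  # hypothetical number of optimizer updates, for illustration only
for step in (0, 11, 177, 354):
    print(step, lr_at(step, TOTAL))
```

With only a few hundred updates in total, the warmup lasts roughly ten steps, so the first few evaluation points in the log are taken at a learning rate well below the peak.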


In [66]:
del model, trainer
import gc
gc.collect()
torch.cuda.empty_cache()

In [13]:
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_dataset_valid,
    compute_metrics = compute_metrics,
    callbacks=[SavePeftModelCallback] if USE_PEFT else None,
    # resume
    #callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)

# trainer.train(resume_from_checkpoint = True)
trainer.train()


if USE_PEFT:
    trainer.model.save_pretrained(f'model_v{VER}') # I changed this
else:
    trainer.save_model(f'model_v{VER}')

# I think I read in the discussion forum that the input length used during training can be changed.
# Training on longer inputs will hopefully cover even the longest sequence seen at test time.
# This is a model extrapolation problem:
# if the model does not use very good positional encoding, it will perform badly on sequences longer than those it was trained on.

You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss,Map@3
4,1.6143,1.608413,0.398333
8,1.6155,1.610454,0.31
12,1.6112,1.609639,0.31
16,1.6145,1.609487,0.325833
20,1.6122,1.609429,0.340833
24,1.6134,1.609424,0.3875
28,1.6118,1.609355,0.468333
32,1.6127,1.609175,0.544167
36,1.6139,1.608804,0.613333
40,1.6102,1.606958,0.699167
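
One thing worth noticing in the log above: both losses sit at about 1.609, which is the cross-entropy of a model that spreads probability evenly over five choices. The sketch below just computes that baseline (the function name is hypothetical). MAP@3 can still rise while the loss stays near this value, because the metric depends only on the ranking of the logits and not on how confident they are.

```python
import math

def uniform_cross_entropy(num_choices):
    """Cross-entropy of a model that assigns equal probability to every choice."""
    return -math.log(1.0 / num_choices)

baseline = uniform_cross_entropy(5)
print(round(baseline, 4))  # ln(5) = 1.6094
```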


# Verify Saved Model
During training, we see the MAP@3 validation score above. Let's load the saved model and compute it again here to verify that our model is saved correctly.

In [23]:
del model, trainer
from peft import get_peft_model, set_peft_model_state_dict
if USE_PEFT:
    print('loading peft')
    model = AutoModelForMultipleChoice.from_pretrained(MODEL)
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = get_peft_model(model, peft_config)
    # checkpoint = torch.load(f'model_v{VER}/adapter_model.bin')
    # checkpoint = torch.load(f'./checkpoints_3/checkpoint-160/adapter_model.bin')
    # checkpoint = torch.load(f'./checkpoints_3/checkpoint-224/adapter_model.bin')
    # checkpoint = torch.load(f'./checkpoints_5/checkpoint-3900/adapter_model.bin')

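    # The path must point to a checkpoint saved under the training output_dir
    # (f'./{checkpoint_folder}/{VER}'), and the step number must be one that was actually saved.
    checkpoint = torch.load(f'./{checkpoint_folder}/{VER}/checkpoint-30/torch_model/pytorch_model.bin')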
    model.base_model.model.load_state_dict(checkpoint)


    # print('loading state dict')
    set_peft_model_state_dict(model, checkpoint)
    # if FREEZE_EMBEDDINGS:
    #     print('Freezing embeddings.')
    #     for param in model.deberta.embeddings.parameters():
    #         param.requires_grad = False
    # if FREEZE_LAYERS>0:
    #     print(f'Freezing {FREEZE_LAYERS} layers.')
    #     for layer in model.deberta.encoder.layer[:FREEZE_LAYERS]:
    #         for param in layer.parameters():
    #             param.requires_grad = False
    #     # Newly added for v3
    #     for layer in model.deberta.encoder.layer[FREEZE_LAYERS:]:
    #         for param in layer.parameters():
    #             param.requires_grad = True
    model.eval()
    model.print_trainable_parameters()
else:
    model = AutoModelForMultipleChoice.from_pretrained(f'model_v{VER}')
trainer = Trainer(model=model,
                data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
                tokenizer=tokenizer,
                train_dataset=tokenized_dataset,
                eval_dataset=tokenized_dataset_valid,
                compute_metrics = compute_metrics,)
# trainer.evaluate()

loading peft


Some weights of DebertaV2ForMultipleChoice were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


FileNotFoundError: [Errno 2] No such file or directory: './checkpoints_repeat_V3/checkpoint-30/torch_model/pytorch_model.bin'

In [None]:
MAX_INPUT = 512
# MAX_INPUT = 256

from functools import partial
test_df = pd.read_csv('../input/60k-data-with-context-v2/train_with_context2.csv')

# tokenized_dataset_valid = dataset_valid.map(preprocess, remove_columns=['prompt', 'context', 'A', 'B', 'C', 'D', 'E', 'answer'])
tokenized_test_dataset = Dataset.from_pandas(test_df).map(
        preprocess, remove_columns=['prompt', 'context', 'A', 'B', 'C', 'D', 'E'])

# with torch.no_grad():
test_predictions = trainer.predict(tokenized_test_dataset).predictions
predictions_as_ids = np.argsort(-test_predictions, 1)
predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_as_ids]
predictions_as_string = test_df['prediction'] = [
    ' '.join(row) for row in predictions_as_answer_letters[:, :3]
]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

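Method 1:
1. Use the question to match contexts in the wiki by cosine similarity of sentence embeddings, giving k contexts, Ctx.
2. Match each answer against Ctx by sentence similarity, normalize the similarities across answers, and treat them as prior probabilities.
3. Have the model produce logits in the same way it was trained, then normalize them to get posterior probabilities.
4. Multiply the prior and the posterior, then apply softmax, ready for ensembling.

Method 2:
1. Use the question to match contexts in the wiki by cosine similarity of sentence embeddings, giving k contexts, Ctx.
2. Decompose the answers in some way to extract entities, look those entities up in the context to compute a weight for each context sentence, pick the s most important sentences, and feed them to the model in the order they appear in the wiki.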
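
A minimal sketch of steps 2 to 4 of the first method listed above, assuming the answer-to-context similarities and the model logits are already available. The numbers are made up, the similarities are assumed to be positive, and the function names are hypothetical; this is one way to read the note, not a tested recipe.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def combine(similarities, logits):
    """Prior from normalized similarities, posterior from logits, then combine."""
    sim_total = sum(similarities)                        # assumes positive similarities
    prior = [s / sim_total for s in similarities]        # step 2: normalize
    posterior = softmax(logits)                          # step 3
    product = [p * q for p, q in zip(prior, posterior)]  # step 4: multiply ...
    return softmax(product)                              # ... then softmax

# Hypothetical similarities and logits for choices A-E.
scores = combine([0.2, 0.9, 0.4, 0.3, 0.2], [0.1, 1.5, 1.4, 0.5, 0.0])
ranking = sorted("ABCDE", key=lambda c: -scores["ABCDE".index(c)])
print(ranking[:3])
```

The final softmax does not change the ranking, so it matters only when these scores are averaged with other models in an ensemble.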

In [None]:
for x in tokenized_dataset[0]['input_ids']:
    print(len(x))
    print(tokenizer.decode(x)) # may need to add sentence trimming here
    # print(x)
# I think tokenized_dataset[0]['input_ids'] is a good example for checking the model's prediction,
# because its information retrieval happened to find the correct context, so we can see whether the model extracts the right answer from it.
# We should also check how many of the other retrieved contexts are actually relevant.
# Could we do multi-hop retrieval?

512
[CLS] Eunice Fay McKenzie (February 19, 1918 – April 16, 2019) was an American actress and singer. She also entertained the troops with her former screen partner, Gene Autry. ===Later career=== After World War II, McKenzie retired from films to raise her two children. She was briefly billed as Fay Shannon. ==Biography== ===Early life and silent film=== McKenzie was born on February 19, 1918, in Hollywood, California, to show business parents, film actor Eva (née Heazlitt) and Irish American actor/director Robert McKenzie.Mike Fitzgerald, "An Interview with... She starred in silent films as a child, and then sound films as an adult, but perhaps she is best known for her leading roles opposite Gene Autry in the early 1940s in five horse opera features. Fay's sister Ida Mae McKenzie, cousin Ella McKenzie, and brother-in-law Billy Gilbert, were also actors. McKenzie sang duets with Autry in each of these films. Ida Mae also played the character of Sarah Lincoln in The Dramatic Life of 

# Compute Validation Score

In [None]:
# https://www.kaggle.com/code/philippsinger/h2ogpt-perplexity-ranking
import numpy as np
def precision_at_k(r, k):
    """Precision at k"""
    assert k <= len(r)
    assert k != 0
    return sum(int(x) for x in r[:k]) / k

def MAP_at_3(predictions, true_items):
    """Score is mean average precision at 3"""
    U = len(predictions)
    map_at_3 = 0.0
    for u in range(U):
        user_preds = predictions[u].split()
        user_true = true_items[u]
        user_results = [1 if item == user_true else 0 for item in user_preds]
        for k in range(min(len(user_preds), 3)):
            map_at_3 += precision_at_k(user_results, k+1) * user_results[k]
    return map_at_3 / U

In [None]:
m = MAP_at_3(test_df.prediction.values, test_df.answer.values)
print( 'CV MAP@3 =',m )

CV MAP@3 = 0.6274999999999998
