# How To Train Model for Open Book Q&A Technique - Part 2
The notebook you are reading is a fork of Mgoksu's great notebook [here][1]. Mgoksu (@mgoksu) demonstrated how to achieve a top public LB score of 0.807 using the Open Book technique. The Open Book method was first presented by JJ (@jjinho) [here][2], then Quangteo (@quangbk) reduced its RAM usage [here][3], and Anil (@nlztrk) combined it with Q&A [here][4]. Radek (@radek1) demonstrated the strength of Q&A [here][5].

In my previous notebook [here][6] (i.e. Part 1), we demonstrated how to train a model for Open Book. The model was trained using my 60k Kaggle dataset [here][7]. If you enjoy the notebook you are reading, please upvote the dataset too. Thanks!

In this notebook, we load the trained model output by my previous notebook. Before running inference with it, we run the code from Mgoksu's public notebook, which uses Open Book to search Wikipedia for context. For each test sample in the hidden dataset, we append the retrieved Wikipedia context. Our trained model then infers the multiple-choice answer from both the question and the appended context. To predict the answer, this notebook uses a 50%/50% ensemble of the new Q&A model we trained and Mgoksu's original model. Here is a diagram showing the Open Book method:

![](https://miro.medium.com/v2/resize:fit:800/format:webp/1*bTGY3fKIgNefQxNsOYpnBw.png)

(image source [here][8])
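
The 50%/50% ensemble described above can be sketched in a few lines. This is a dependency-free illustration, not the notebook's exact code: the function name `ensemble_top3`, the `weight_a` parameter, and the logit values are all made up for the example. It averages the per-option scores of two models and keeps the three best letters, which is the format the competition expects.

```python
def ensemble_top3(logits_a, logits_b, weight_a=0.5):
    """Blend two models' per-option scores and return the top 3 letters per question."""
    letters = "ABCDE"
    predictions = []
    for row_a, row_b in zip(logits_a, logits_b):
        # Weighted average of the two models' scores for each option
        blended = [weight_a * a + (1.0 - weight_a) * b for a, b in zip(row_a, row_b)]
        # Sort option indices by blended score, highest first
        ranked = sorted(range(len(blended)), key=lambda i: -blended[i])
        predictions.append(" ".join(letters[i] for i in ranked[:3]))
    return predictions


# Hypothetical logits for two questions from two models
model_a = [[2.0, 0.5, 0.1, -1.0, 0.0], [0.0, 0.2, 3.0, 0.1, 0.4]]
model_b = [[0.0, 2.5, 0.3, -0.5, 0.1], [0.1, 0.0, 2.0, 0.3, 1.0]]
print(ensemble_top3(model_a, model_b))  # ['B A C', 'C E D']
```

With `weight_a=0.5` both models contribute equally; other weights are a natural thing to tune on a validation set.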

[1]: https://www.kaggle.com/code/mgoksu/0-807-sharing-my-trained-with-context-model
[2]: https://www.kaggle.com/code/jjinho/open-book-llm-science-exam
[3]: https://www.kaggle.com/code/quangbk/open-book-llm-science-exam-reduced-ram-usage
[4]: https://www.kaggle.com/code/nlztrk/openbook-debertav3-large-baseline-single-model
[5]: https://www.kaggle.com/code/radek1/new-dataset-deberta-v3-large-training
[6]: https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model
[7]: https://www.kaggle.com/datasets/cdeotte/60k-data-with-context-v2
[8]: https://blog.gopenai.com/enrich-llms-with-retrieval-augmented-generation-rag-17b82a96b6f0

# OpenBook DeBERTaV3-Large with an updated model

This work is based on the great [work](https://www.kaggle.com/code/nlztrk/openbook-debertav3-large-baseline-single-model) of [nlztrk](https://www.kaggle.com/nlztrk).

I trained a model offline using the dataset I shared [here](https://www.kaggle.com/datasets/mgoksu/llm-science-exam-dataset-w-context). I just added my model to the original notebook. The model is available [here](https://www.kaggle.com/datasets/mgoksu/llm-science-run-context-2).

I also addressed the problem of [CSV Not Found at submission](https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/434228) with this notebook by clipping the context like so:

`test_df["prompt"] = test_df["context"].apply(lambda x: x[:1500]) + " #### " +  test_df["prompt"]`

You can probably keep more than 1500 characters of context without hitting an OOM (out-of-memory) error.
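
As a small illustration of the clipping above, here is a plain-Python sketch. The helper name `build_prompt` and its parameters are hypothetical; only the `" #### "` separator and the 1500-character limit come from the line shown earlier. Note that the clip counts characters, not tokens, so the tokenizer may still truncate further.

```python
def build_prompt(context, question, max_chars=1500, separator=" #### "):
    """Prepend a character-clipped context to the question."""
    return context[:max_chars] + separator + question


# A deliberately oversized context to show the effect of the clip
long_context = "x" * 5000
prompt = build_prompt(long_context, "What is dark matter?")
print(len(prompt))  # 1500 context chars + 6 separator chars + 20 question chars
```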

In [1]:
# installing offline dependencies
!pip install -U /kaggle/input/faiss-gpu-173-python310/faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
!cp -rf /kaggle/input/sentence-transformers-222/sentence-transformers /kaggle/working/sentence-transformers
!pip install -U /kaggle/working/sentence-transformers
!pip install -U /kaggle/input/blingfire-018/blingfire-0.1.8-py3-none-any.whl

!pip install --no-index --no-deps /kaggle/input/llm-whls/transformers-4.31.0-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/peft-0.4.0-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/datasets-2.14.3-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/trl-0.5.0-py3-none-any.whl

Processing /kaggle/input/faiss-gpu-173-python310/faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2
Processing ./sentence-transformers
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=126134 sha256=b16fb534b977453600e05a6402b3a3566d0e1d8927d7b87a166298f6c72a9dbf
  Stored in directory: /root/.cache/pip/wheels/6c/ea/76/d9a930b223b1d3d5d6aff69458725316b0fe205b854faf1812
Successfully built sentence-transformers
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.2.2
Processing /kaggle/input/blingfire-018/blingfire-0.1.8-py3-none-any.whl
Installing collected packages: blingfire
Successfully installed blingfir

In [2]:
%load_ext autoreload
%autoreload 2
import os
import gc
import pandas as pd
import numpy as np
import re
from tqdm.auto import tqdm
import blingfire as bf
from __future__ import annotations

from collections.abc import Iterable

import faiss
from faiss import write_index, read_index

from sentence_transformers import SentenceTransformer

import torch
import ctypes
libc = ctypes.CDLL("libc.so.6")

from dataclasses import dataclass
from typing import Optional, Union

from datasets import Dataset
from transformers import AutoTokenizer
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from torch.utils.data import DataLoader

caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']


In [None]:
def process_documents(documents: Iterable[str],
                      document_ids: Iterable,
                      split_sentences: bool = True,
                      filter_len: int = 3,
                      disable_progress_bar: bool = False) -> pd.DataFrame:
    """
    Main helper function to process documents.

    :param documents: Iterable containing documents which are strings
    :param document_ids: Iterable containing document unique identifiers
    :param split_sentences: Flag to determine whether to further split documents into sentences
    :param filter_len: Minimum character length of a sentence (otherwise filter out)
    :param disable_progress_bar: Flag to disable tqdm progress bar
    :return: Pandas DataFrame containing the columns `document_id`, `text`, `offset`
    """
    
    df = sectionize_documents(documents, document_ids, disable_progress_bar)

    if split_sentences:
        df = sentencize(df.text.values, 
                        df.document_id.values,
                        df.offset.values, 
                        filter_len, 
                        disable_progress_bar)
    return df


def sectionize_documents(documents: Iterable[str],
                         document_ids: Iterable,
                         disable_progress_bar: bool = False) -> pd.DataFrame:
    """
    Wraps each document as a single section spanning its full text, so that
    the result has the layout expected by `sentencize`.

    :param documents: Iterable containing documents which are strings
    :param document_ids: Iterable containing document unique identifiers
    :param disable_progress_bar: Flag to disable tqdm progress bar
    :return: Pandas DataFrame containing the columns `document_id`, `text`, `offset`
    """
    processed_documents = []
    for document_id, document in tqdm(zip(document_ids, documents), total=len(documents), disable=disable_progress_bar):
        row = {}
        text, start, end = (document, 0, len(document))
        row['document_id'] = document_id
        row['text'] = text
        row['offset'] = (start, end)

        processed_documents.append(row)

    _df = pd.DataFrame(processed_documents)
    if _df.shape[0] > 0:
        return _df.sort_values(['document_id', 'offset']).reset_index(drop=True)
    else:
        return _df


def sentencize(documents: Iterable[str],
               document_ids: Iterable,
               offsets: Iterable[tuple[int, int]],
               filter_len: int = 3,
               disable_progress_bar: bool = False) -> pd.DataFrame:
    """
    Split a document into sentences. Can be used with `sectionize_documents`
    to further split documents into more manageable pieces. Takes in offsets
    to ensure that after splitting, the sentences can be matched to the
    location in the original documents.

    :param documents: Iterable containing documents which are strings
    :param document_ids: Iterable containing document unique identifiers
    :param offsets: Iterable of (start, end) index tuples, one per document
    :param filter_len: Minimum character length of a sentence (otherwise filter out)
    :param disable_progress_bar: Flag to disable tqdm progress bar
    :return: Pandas DataFrame containing the columns `document_id`, `text`, `offset`
    """

    document_sentences = []
    for document, document_id, offset in tqdm(zip(documents, document_ids, offsets), total=len(documents), disable=disable_progress_bar):
        try:
            _, sentence_offsets = bf.text_to_sentences_and_offsets(document)
            for o in sentence_offsets:
                if o[1]-o[0] > filter_len:
                    sentence = document[o[0]:o[1]]
                    abs_offsets = (o[0]+offset[0], o[1]+offset[0])
                    row = {}
                    row['document_id'] = document_id
                    row['text'] = sentence
                    row['offset'] = abs_offsets
                    document_sentences.append(row)
        except Exception:
            # Skip documents that blingfire cannot split
            continue
    return pd.DataFrame(document_sentences)
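
The offset bookkeeping in `sentencize` can be seen in isolation with the sketch below. It is only an illustration: the regex is a naive stand-in for `blingfire`, and the names `split_with_offsets` and `section_start` are invented for the example. The point is that each sentence's local span is shifted by the start of its section, so the sentence can be located in the original document.

```python
import re


def split_with_offsets(text, section_start=0, filter_len=3):
    """Split text into sentences and report absolute (start, end) offsets."""
    rows = []
    # Naive stand-in for blingfire: a sentence ends at '.', '!' or '?'
    for match in re.finditer(r"[^.!?]+[.!?]?", text):
        raw = match.group()
        stripped = raw.strip()
        # Drop fragments that are too short, mirroring the filter_len check
        if len(stripped) <= filter_len:
            continue
        # Local span of the stripped sentence inside `text`
        start = match.start() + (len(raw) - len(raw.lstrip()))
        end = start + len(stripped)
        # Shift by the section start to get offsets in the original document
        rows.append({"text": stripped,
                     "offset": (start + section_start, end + section_start)})
    return rows


text = "Faiss is fast. It scales well. Ok."
rows = split_with_offsets(text, section_start=100)
for row in rows:
    print(row)
```

Here the last fragment, `Ok.`, is filtered out because it is not longer than `filter_len`.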

In [8]:
SIM_MODEL = '/kaggle/input/sentencetransformers-allminilml6v2/sentence-transformers_all-MiniLM-L6-v2'
DEVICE = 0
# MAX_LENGTH = 384
MAX_LENGTH = 512
# BATCH_SIZE = 16
BATCH_SIZE = 8
MAX_INPUT =  2048

WIKI_PATH = "/kaggle/input/wikipedia-20230701"
wiki_files = os.listdir(WIKI_PATH)

# Relevant Title Retrieval

In [None]:
trn = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/test.csv").drop(columns="id")
trn.head()

In [None]:
model = SentenceTransformer(SIM_MODEL, device='cuda')
model.max_seq_length = MAX_LENGTH
model = model.half()

Faiss contains several methods for similarity search. It assumes that the instances are represented as vectors and are identified by an integer, and that the vectors can be compared with L2 (Euclidean) distances or dot products. Vectors that are similar to a query vector are those that have the lowest L2 distance or the highest dot product with the query vector. It also supports cosine similarity, since this is a dot product on normalized vectors.

Some of the methods, like those based on binary vectors and compact quantization codes, solely use a compressed representation of the vectors and do not require keeping the original vectors. This generally comes at the cost of a less precise search, but these methods can scale to billions of vectors in main memory on a single server. Other methods, like HNSW and NSG, add an indexing structure on top of the raw vectors to make searching more efficient.

The GPU implementation can accept input from either CPU or GPU memory. On a server with GPUs, the GPU indexes can be used as a drop-in replacement for the CPU indexes (e.g., replace IndexFlatL2 with GpuIndexFlatL2), and copies to/from GPU memory are handled automatically. Results will be faster, however, if both input and output remain resident on the GPU. Both single and multi-GPU usage is supported.
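
To make the similarity measures above concrete, here is a dependency-free sketch of an exact (brute-force) search. It does not use Faiss; the helper names `normalize`, `dot`, and `search` and the toy vectors are invented for illustration. It shows that on normalized vectors the dot product is the cosine similarity, and that ranking by highest dot product agrees with ranking by lowest L2 distance, because for unit vectors the squared distance equals `2 - 2 * dot`.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query, vectors, k):
    """Brute-force top-k by inner product; returns (score, index) pairs."""
    scores = [(dot(query, v), i) for i, v in enumerate(vectors)]
    scores.sort(key=lambda s: -s[0])
    return scores[:k]


# Toy corpus of unit vectors and a unit-length query
corpus = [normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0])]
query = normalize([2.0, 1.0])
top = search(query, corpus, k=2)
print([i for _, i in top])  # [1, 0]

# For unit vectors: squared L2 distance == 2 - 2 * dot product
for v in corpus:
    l2_sq = sum((q - x) ** 2 for q, x in zip(query, v))
    assert abs(l2_sq - (2.0 - 2.0 * dot(query, v))) < 1e-9
```

This is why the notebook passes `normalize_embeddings=True` when encoding: the scores returned by the index can then be read as cosine similarities.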

In [None]:
sentence_index = read_index("/kaggle/input/wikipedia-2023-07-faiss-index/wikipedia_202307.index")

In [None]:
prompt_embeddings = model.encode(trn.prompt.values, batch_size=BATCH_SIZE, device=DEVICE, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True)
prompt_embeddings = prompt_embeddings.detach().cpu().numpy()
_ = gc.collect()

In [None]:
## Get the top 5 pages that are likely to contain the topic of interest
search_score, search_index = sentence_index.search(prompt_embeddings, 5)

In [None]:
## Save memory - delete sentence_index since it is no longer necessary
del sentence_index
del prompt_embeddings
_ = gc.collect()
libc.malloc_trim(0)

# Getting Sentences from the Relevant Titles

In [None]:
df = pd.read_parquet("/kaggle/input/wikipedia-20230701/wiki_2023_index.parquet",
                     columns=['id', 'file'])

In [None]:
## Get the article and associated file location using the index
wikipedia_file_data = []

for i, (scr, idx) in tqdm(enumerate(zip(search_score, search_index)), total=len(search_score)):
    scr_idx = idx
    _df = df.loc[scr_idx].copy()
    _df['prompt_id'] = i
    wikipedia_file_data.append(_df)
wikipedia_file_data = pd.concat(wikipedia_file_data).reset_index(drop=True)
wikipedia_file_data = wikipedia_file_data[['id', 'prompt_id', 'file']].drop_duplicates().sort_values(['file', 'id']).reset_index(drop=True)

## Save memory - delete df since it is no longer necessary
del df
_ = gc.collect()
libc.malloc_trim(0)

In [None]:
## Get the full text data
wiki_text_data = []

for file in tqdm(wikipedia_file_data.file.unique(), total=len(wikipedia_file_data.file.unique())):
    _id = [str(i) for i in wikipedia_file_data[wikipedia_file_data['file']==file]['id'].tolist()]
    _df = pd.read_parquet(f"{WIKI_PATH}/{file}", columns=['id', 'text'])

    _df_temp = _df[_df['id'].isin(_id)].copy()
    del _df
    _ = gc.collect()
    libc.malloc_trim(0)
    wiki_text_data.append(_df_temp)
wiki_text_data = pd.concat(wiki_text_data).drop_duplicates().reset_index(drop=True)
_ = gc.collect()

In [None]:
## Parse documents into sentences
processed_wiki_text_data = process_documents(wiki_text_data.text.values, wiki_text_data.id.values)

In [None]:
## Get embeddings of the wiki text data
wiki_data_embeddings = model.encode(processed_wiki_text_data.text,
                                    batch_size=BATCH_SIZE,
                                    device=DEVICE,
                                    show_progress_bar=True,
                                    convert_to_tensor=True,
                                    normalize_embeddings=True)#.half()
wiki_data_embeddings = wiki_data_embeddings.detach().cpu().numpy()

In [None]:
_ = gc.collect()

In [None]:
## Combine all answers
trn['answer_all'] = trn.apply(lambda x: " ".join([x['A'], x['B'], x['C'], x['D'], x['E']]), axis=1)


## Search using the prompt and answers to guide the search
trn['prompt_answer_stem'] = trn['prompt'] + " " + trn['answer_all']

In [None]:
question_embeddings = model.encode(trn.prompt_answer_stem.values, batch_size=BATCH_SIZE, device=DEVICE, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True)
question_embeddings = question_embeddings.detach().cpu().numpy()

# Extracting Matching Prompt-Sentence Pairs

In [None]:
## Parameter to determine how many relevant sentences to include
NUM_SENTENCES_INCLUDE = 20 # this is also a very important factor

## List containing just Context
contexts = []

for r in tqdm(trn.itertuples(), total=len(trn)):

    prompt_id = r.Index

    prompt_indices = processed_wiki_text_data[processed_wiki_text_data['document_id'].isin(wikipedia_file_data[wikipedia_file_data['prompt_id']==prompt_id]['id'].values)].index.values

    ## Reset per prompt so an empty retrieval cannot reuse the previous prompt's context
    context = ""

    if prompt_indices.shape[0] > 0:
        prompt_index = faiss.index_factory(wiki_data_embeddings.shape[1], "Flat")
        prompt_index.add(wiki_data_embeddings[prompt_indices])

        ## Get the top matches for this prompt only
        ss, ii = prompt_index.search(question_embeddings[prompt_id:prompt_id+1], NUM_SENTENCES_INCLUDE)
        for _s, _i in zip(ss[0], ii[0]):
            if _i < 0:  # Faiss pads with -1 when fewer than k sentences are indexed
                continue
            context += processed_wiki_text_data.loc[prompt_indices]['text'].iloc[_i] + " "
        
    contexts.append(context)

In [None]:
trn['context'] = contexts

In [None]:
trn[["prompt", "context", "A", "B", "C", "D", "E"]].to_csv("./test_context.csv", index=False)

# Inference

In [3]:
test_df = pd.read_csv("test_context.csv")

test_df.index = list(range(len(test_df)))
test_df['id'] = list(range(len(test_df)))


# fillna guards against an empty context, which read_csv loads back as NaN
test_df["prompt"] = test_df["context"].fillna("").apply(lambda x: x[:MAX_INPUT+16]) + " #### " +  test_df["prompt"]
# test_df["prompt"] = test_df["context"].apply(lambda x: x[:1750]) + " #### " +  test_df["prompt"]
# test_df["prompt"] = test_df["context"].apply(lambda x: x[:2500]) + " #### " +  test_df["prompt"]

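# Placeholder label: preprocess() and the collator require one, but the hidden test set has no answers
test_df['answer'] = 'A'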

In [5]:
# We'll create a dictionary to convert option names (A, B, C, D, E) into indices and back again
options = 'ABCDE'
indices = list(range(5))

option_to_index = {option: index for option, index in zip(options, indices)}
index_to_option = {index: option for option, index in zip(options, indices)}



In [10]:
@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    
    def __call__(self, features):
        label_name = "label" if 'label' in features[0].keys() else 'labels'
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch
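
The reshaping done by the collator above can be followed with a plain-Python sketch. It is an illustration only: it uses lists instead of tensors, two choices instead of five, and the helper names `flatten_choices` and `pad_and_group` and the token IDs are made up. The idea is to flatten every (example, choice) pair into one list so all sequences are padded to a common length, then regroup them as `[batch_size][num_choices][seq_len]`.

```python
def flatten_choices(features):
    """Turn [example][field][choice] into a flat list of per-choice dicts."""
    num_choices = len(features[0]["input_ids"])
    flat = []
    for feature in features:
        for i in range(num_choices):
            flat.append({k: v[i] for k, v in feature.items()})
    return flat, num_choices


def pad_and_group(flat, num_choices, pad_id=0):
    """Pad every sequence to the longest one, then regroup by example."""
    longest = max(len(item["input_ids"]) for item in flat)
    padded = [item["input_ids"] + [pad_id] * (longest - len(item["input_ids"]))
              for item in flat]
    return [padded[i:i + num_choices] for i in range(0, len(padded), num_choices)]


# Two examples, each with two tokenized (prompt, option) pairs
features = [
    {"input_ids": [[1, 5, 2], [1, 6, 7, 2]]},
    {"input_ids": [[1, 8, 2], [1, 9, 2]]},
]
flat, n = flatten_choices(features)
batch = pad_and_group(flat, n)
print(len(batch), len(batch[0]), len(batch[0][0]))  # 2 2 4
```

Padding after flattening is what lets a single call to `tokenizer.pad` handle all choices of all examples at once.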

# Model A

In [None]:
# model_dir = "/kaggle/input/llm-science-run-context-2"

# tokenizer = AutoTokenizer.from_pretrained(model_dir)
# model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
# model.eval()

In [None]:
# tokenized_test_dataset = Dataset.from_pandas(test_df[['id', 'prompt', 'A', 'B', 'C', 'D', 'E', 'answer']].drop(columns=['id'])).map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
# tokenized_test_dataset = tokenized_test_dataset.remove_columns(["__index_level_0__"])
# data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
# test_dataloader = DataLoader(tokenized_test_dataset, batch_size=1, shuffle=False, collate_fn=data_collator)

In [None]:
# test_predictions = []
# for batch in test_dataloader:
#     for k in batch.keys():
#         batch[k] = batch[k].cuda()
#     with torch.no_grad():
#         outputs = model(**batch)
#     test_predictions.append(outputs.logits.cpu().detach())

# test_predictions = torch.cat(test_predictions)
# test_predictions = test_predictions.numpy()

# Load Model From Our Train Notebook

In [None]:
# model_dir = "/kaggle/input/how-to-train-open-book-model/model_v2"
# tokenizer = AutoTokenizer.from_pretrained(model_dir)
# model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
# model.eval()

In [None]:
# test_predictions2 = []
# for batch in test_dataloader:
#     for k in batch.keys():
#         batch[k] = batch[k].cuda()
#     with torch.no_grad():
#         outputs = model(**batch)
#     test_predictions2.append(outputs.logits.cpu().detach())

# test_predictions2 = torch.cat(test_predictions2)
# test_predictions = (test_predictions+test_predictions2.numpy()) / 2.0

# predictions_as_ids = np.argsort(-test_predictions, 1)

# predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_as_ids]
# # predictions_as_answer_letters[:3]

# predictions_as_string = test_df['prediction'] = [
#     ' '.join(row) for row in predictions_as_answer_letters[:, :3]
# ]

In [None]:
# submission = test_df[['id', 'prediction']]
# submission.to_csv('submission.csv', index=False)

# My Model

In [4]:
from peft import LoraConfig, get_peft_model

# deberta-v3-large
# model_weights = f'../input/jason-trained-models-v3-512/checkpoint-550/checkpoint-550/torch_model'
# checkpoint_path = f'../input/jason-trained-models-v3-512/checkpoint_18_256-550/checkpoint-550'
# checkpoint_path = f'../input/jason-trained-models-v3-512/checkpoint_4-450/checkpoint-450'
# checkpoint_path = f'../input/jason-trained-models-v3-512/checkpoint_12_256-180/checkpoint-180'
# checkpoint_path = f'/kaggle/input/jason-trained-models-v3-512/checkpoint-290/checkpoint-290' # 18_512
# checkpoint_path = f'/kaggle/input/jason-trained-models-v3-512/checkpoint_18_512-550/checkpoint-550' # 18_512 
# checkpoint_path = f'/kaggle/input/jason-trained-models-v3-512/checkpoint-18_256_emb-550/checkpoint-550' # 18_512
# checkpoint_path = '../input/jason-trained-models-v3-512/18_512_v1_checkpoint-1140/checkpoint-1140'
# checkpoint_path = '../input/jason-trained-models-v3-512/checkpoint-72/checkpoint-72' # 18 256 v1
# checkpoint_path = '/kaggle/input/jason-trained-models-v3-512/checkpoint_18_512_v2-86/checkpoint-86' # 18 256 v1
# checkpoint_path = '/kaggle/input/jason-trained-models-v3-512/checkpoint-266/checkpoint-266' # 16 512 v2 
checkpoint_path = '/kaggle/input/jason-trained-models-v3-512/checkpoint_16_512_v6-350/checkpoint-350' # 16 512 v2 

## deberta-v2-xxlarge
# checkpoint_path = '/kaggle/input/jason-trained-models-v3-512/checkpoint_44_256_v1-480/checkpoint-480'


peft_config = LoraConfig.from_pretrained(checkpoint_path)
model_weights = f'{checkpoint_path}/torch_model'
model = AutoModelForMultipleChoice.from_pretrained(model_weights, device_map='cuda:0')

# from peft import LoraConfig, TaskType, get_peft_model
# peft_config = LoraConfig(
#     r=8, lora_alpha=4, task_type=TaskType.SEQ_CLS, lora_dropout=0.1, 
#     bias="none", inference_mode=False, 
#     target_modules=["query_proj", "value_proj"],
#     modules_to_save=['classifier','pooler'],
# )
tokenizer_weights = checkpoint_path
tokenizer = AutoTokenizer.from_pretrained(tokenizer_weights)
model = get_peft_model(model, peft_config)
checkpoint = torch.load(f'{model_weights}/pytorch_model.bin')
model.eval()  # inference only: disables dropout
model.base_model.model.load_state_dict(checkpoint)

Some weights of the model checkpoint at /kaggle/input/jason-trained-models-v3-512/checkpoint_16_512_v6-350/checkpoint-350/torch_model were not used when initializing DebertaV2ForMultipleChoice: ['deberta.encoder.layer.10.attention.self.query_proj.lora_B.default.weight', 'deberta.encoder.layer.18.attention.self.key_proj.lora_B.default.weight', 'deberta.encoder.layer.13.intermediate.dense.lora_A.default.weight', 'deberta.encoder.layer.16.output.dense.lora_B.default.weight', 'deberta.encoder.layer.3.attention.self.value_proj.lora_A.default.weight', 'deberta.encoder.layer.6.output.dense.lora_A.default.weight', 'deberta.encoder.layer.22.attention.output.dense.lora_B.default.weight', 'deberta.encoder.layer.11.attention.output.dense.lora_B.default.weight', 'deberta.encoder.layer.13.attention.self.key_proj.lora_B.default.weight', 'deberta.encoder.layer.9.attention.self.query_proj.lora_A.default.weight', 'deberta.encoder.layer.20.output.dense.lora_A.default.weight', 'deberta.encoder.layer.3.att

<All keys matched successfully>

In [7]:
del checkpoint
_ = gc.collect()

In [None]:
# model = model.base_model.model

In [11]:
# MAX_INPUT =  512


# MAX_INPUT =  4096

# MAX_INPUT = 256 
# MAX_INPUT = 512
## commenting out max length means no truncation

## This block gets the validation set
# test_df = pd.read_csv('../input/60k-data-with-context-v2/train_with_context2.csv')
# test_df.index = list(range(len(test_df)))
# test_df['id'] = list(range(len(test_df)))
# # test_df["prompt"] = test_df["context"].apply(lambda x: x[:1750]) + " #### " +  test_df["prompt"]
# test_df["prompt"] = test_df["context"].apply(lambda x: x[:2500]) + " #### " +  test_df["prompt"]


def preprocess(example):
    # The AutoModelForMultipleChoice class expects a set of question/answer pairs
    # so we'll copy our question 5 times before tokenizing
    first_sentence = [example['prompt']] * 5
    second_sentence = []
    for option in options:
        second_sentence.append(example[option])
    # Our tokenizer will turn our text into token IDs DeBERTa can understand.
    # truncation='only_first' trims only the context + question, never the answer option.
    # tokenized_example = tokenizer(first_sentence, second_sentence, truncation=True)
    tokenized_example = tokenizer(first_sentence, second_sentence, max_length=MAX_INPUT, truncation='only_first')
    # tokenized_example = tokenizer(first_sentence, second_sentence, truncation='only_first')

    tokenized_example['label'] = option_to_index[example['answer']]
    return tokenized_example

tokenized_test_dataset = Dataset.from_pandas(test_df[['id', 'prompt', 'A', 'B', 'C', 'D', 'E', 'answer']].drop(columns=['id'])).map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
try:
    tokenized_test_dataset = tokenized_test_dataset.remove_columns(["__index_level_0__"])
except Exception:
    # The index column is not always present; nothing to remove in that case
    print('no need to remove column')



# def preprocess(example):
#     first_sentence = [ "[CLS] " + example['context'] ] * 5
#     second_sentences = [" #### " + example['prompt'] + " [SEP] " + example[option] + " [SEP]" for option in 'ABCDE']
#     tokenized_example = tokenizer(first_sentence, second_sentences, truncation='only_first',
#                                   max_length=MAX_INPUT, add_special_tokens=False)
#     tokenized_example['label'] = option_to_index[example['answer']]
#     return tokenized_example

# tokenized_test_dataset = Dataset.from_pandas(test_df).map(
#         preprocess, remove_columns=['prompt', 'context', 'A', 'B', 'C', 'D', 'E'])



data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
test_dataloader = DataLoader(tokenized_test_dataset, batch_size=1, shuffle=False, collate_fn=data_collator)

test_predictions = []
with torch.no_grad():
    for batch in test_dataloader:
        for k in batch.keys():
            batch[k] = batch[k].cuda()
        outputs = model(**batch)
        test_predictions.append(outputs.logits.cpu().detach())

test_predictions = torch.cat(test_predictions)
test_predictions = test_predictions.numpy()


predictions_as_ids = np.argsort(-test_predictions, 1)

predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_as_ids]
# predictions_as_answer_letters[:3]

predictions_as_string = test_df['prediction'] = [
    ' '.join(row) for row in predictions_as_answer_letters[:, :3]
]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [None]:
submission = test_df[['id', 'prediction']]
submission.to_csv('submission.csv', index=False)

In [12]:
# https://www.kaggle.com/code/philippsinger/h2ogpt-perplexity-ranking
import numpy as np
def precision_at_k(r, k):
    """Precision at k"""
    assert k <= len(r)
    assert k != 0
    return sum(int(x) for x in r[:k]) / k

def MAP_at_3(predictions, true_items):
    """Score is mean average precision at 3"""
    U = len(predictions)
    map_at_3 = 0.0
    for u in range(U):
        user_preds = predictions[u].split()
        user_true = true_items[u]
        user_results = [1 if item == user_true else 0 for item in user_preds]
        for k in range(min(len(user_preds), 3)):
            map_at_3 += precision_at_k(user_results, k+1) * user_results[k]
    return map_at_3 / U

m = MAP_at_3(test_df.prediction.values, test_df.answer.values)
print( 'CV MAP@3 =',m )

CV MAP@3 = 0.9108333333333333
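
Because every question has exactly one correct option, the metric above simplifies: each question scores 1 if the answer is ranked first, 1/2 if second, 1/3 if third, and 0 otherwise. The sketch below works through that on four made-up predictions. The function name `map_at_3` and the data are hypothetical, and it assumes a prediction never repeats a letter.

```python
def map_at_3(predictions, answers):
    """MAP@3 when each question has exactly one correct option."""
    total = 0.0
    for prediction, answer in zip(predictions, answers):
        ranked = prediction.split()[:3]
        if answer in ranked:
            # Reciprocal of the 1-based rank of the correct option
            total += 1.0 / (ranked.index(answer) + 1)
    return total / len(predictions)


# Correct answer ranked 1st, 2nd, 3rd, and not at all
preds = ["A B C", "B A C", "C D A", "D E B"]
truth = ["A", "A", "A", "A"]
print(map_at_3(preds, truth))  # (1 + 1/2 + 1/3 + 0) / 4
```

A score near 0.91, as reported above, therefore means the correct option is ranked first for most questions and second or third for most of the rest.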
