# **BERT Question Answering Model**

- **Mount drive**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


- **Move inside directory**

In [None]:
cd "/content/drive/MyDrive/chaii-hindi-and-tamil-question-answering"

/content/drive/MyDrive/chaii-hindi-and-tamil-question-answering


- **Install required packages**

In [None]:
# install required packages
!pip install torchtext==0.10.0
!pip install Dataset
!pip install sentencepiece
!pip install transformers
!pip install datasets
!pip install indic-nlp-library
!pip install deep-translator

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchtext==0.10.0
  Downloading torchtext-0.10.0-cp37-cp37m-manylinux1_x86_64.whl (7.6 MB)
Collecting torch==1.9.0
  Downloading torch-1.9.0-cp37-cp37m-manylinux1_x86_64.whl (831.4 MB)
Installing collected packages: torch, torchtext
  Attempting uninstall: torch
    Found existing installation: torch 1.12.1+cu113
    Uninstalling torch-1.12.1+cu113:
      Successfully uninstalled torch-1.12.1+cu113
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.13.1
    Uninstalling torchtext-0.13.1:
      Successfully uninstalled torchtext-0.13.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.13.1

- **Import libraries**

In [None]:
import numpy as np 
import pandas as pd 
from transformers import default_data_collator, Trainer
from transformers import BertForQuestionAnswering, AutoTokenizer, TrainingArguments

- **Data collator object**

In [None]:
data_collator = default_data_collator

In [None]:
MODEL = "mrm8488/bert-multi-cased-finetuned-xquadv1"

- **Load pretrained model**

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = BertForQuestionAnswering.from_pretrained(MODEL)

In [None]:
padding_style = tokenizer.padding_side == "right"

- **Read dataset**

In [None]:
train = pd.read_csv("train.csv", index_col=0)
print(train.shape)

(1114, 5)


- **Check for null values**

In [None]:
# check for null values
train.isna().sum()

context         0
question        0
answer_text     0
answer_start    0
language        0
dtype: int64

- **Data cleaning**

In [None]:
def remove_space(text):
  # strip leading whitespace and return the cleaned string
  text = text.lstrip()
  return text

In [None]:
train['question'] = train['question'].apply(remove_space)

- **Drop null values if there are any**

In [None]:
# drop null values and reset index
train = train.dropna().reset_index(drop=True)

- **Map answers to dictionary**
- **Find end answer position**

In [None]:
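def map_answers_todict(col):
    # col is a row holding the 'answer_start' and 'answer_text' columns
    s_position = col['answer_start']
    answer_text = col['answer_text']
    # the end position is the start offset plus the answer length (exclusive)
    e_position = s_position + len(answer_text)
    answer_dict = {'answer_start': [s_position], 'answer_text': [answer_text], 'answer_end': [e_position]}
    return answer_dict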
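
As a quick check of the span arithmetic above, here is a minimal sketch that recomputes the end offset for the first training row displayed further down (answer `206` starting at character 53). The helper name `answer_span` is hypothetical and is not part of the notebook; it only mirrors the dictionary layout built by `map_answers_todict`.

```python
def answer_span(answer_start: int, answer_text: str) -> dict:
    """Build the answer dictionary with an exclusive end offset."""
    answer_end = answer_start + len(answer_text)  # one past the last character
    return {
        "answer_start": [answer_start],
        "answer_text": [answer_text],
        "answer_end": [answer_end],
    }


span = answer_span(53, "206")
print(span["answer_end"])  # [56]
```

Because the end offset is exclusive, `context[answer_start:answer_end]` should reproduce the answer text whenever the annotation is consistent with the context.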

- **Train, validation and test split**

In [None]:
# split data into train, test and validation sets
train_data = train.iloc[:914].reset_index(drop=True)
test_data = train.iloc[914:1014].reset_index(drop=True)
val_data = train.iloc[1014:].reset_index(drop=True)
print("Train dataset size: ", train_data.shape)
print("Test dataset size: ", test_data.shape)
print("Validation dataset size: ", val_data.shape)

Train dataset size:  (914, 5)
Test dataset size:  (100, 5)
Validation dataset size:  (100, 5)


In [None]:
train_data['answers'] = train_data[['answer_start', 'answer_text']].apply(map_answers_todict, axis=1)
test_data['answers'] = test_data[['answer_start', 'answer_text']].apply(map_answers_todict, axis=1)
val_data['answers'] = val_data[['answer_start', 'answer_text']].apply(map_answers_todict, axis=1)

In [None]:
train_data.head(2)

Unnamed: 0,context,question,answer_text,answer_start,language,answers
0,ஒரு சாதாரண வளர்ந்த மனிதனுடைய எலும்புக்கூடு பின...,மனித உடலில் எத்தனை எலும்புகள் உள்ளன?,206,53,tamil,"{'answer_start': [53], 'answer_text': ['206'],..."
1,காளிதாசன் (தேவநாகரி: कालिदास) சமஸ்கிருத இலக்கி...,காளிதாசன் எங்கு பிறந்தார்?,காசுமீரில்,2358,tamil,"{'answer_start': [2358], 'answer_text': ['காசு..."


In [None]:
test_data.head(1)

Unnamed: 0,context,question,answer_text,answer_start,language,answers
0,"राजधानी: बगदाद\nजनसंख्या: 30,39 9, 572 (...",ईराक का क्षेत्रफल कितना है?,"16 9, 250 वर्ग मील",76,hindi,"{'answer_start': [76], 'answer_text': ['16 9, ..."


In [None]:
val_data.head(2)

Unnamed: 0,context,question,answer_text,answer_start,language,answers
0,"अमिताभ बच्चन (जन्म-११ अक्टूबर, १९४२) बॉलीवुड क...",अमिताभ बच्चन के पिता का नाम क्या था?,डॉ॰ हरिवंश राय बच्चन,1084,hindi,"{'answer_start': [1084], 'answer_text': ['डॉ॰ ..."
1,यह जेल अंडमान निकोबार द्वीप की राजधानी पोर्ट ब...,काला पानी जेल कहाँ पर स्थित है?,पोर्ट ब्लेयर,39,hindi,"{'answer_start': [39], 'answer_text': ['पोर्ट ..."


- **Data tokenization** 

In [None]:
from torch.utils.data import Dataset, DataLoader
class DataTokenization(Dataset):
     # initialize all required parameters in initialization function
    def __init__(
        self,
        data: pd.DataFrame,
        tokenizer: AutoTokenizer,
        text_max_length: int
        ):

        self.tokenizer = tokenizer
        self.data = data
        self.text_max_length = text_max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index: int):
        data_row = self.data.iloc[index]
        # use the tokenizer passed to the constructor rather than the global one
        tokenized_text = self.tokenizer(
            data_row["question" if padding_style else "context"],
            data_row["context" if padding_style else "question"],
            padding="max_length", truncation="only_second" if padding_style else "only_first",
            max_length=self.text_max_length, return_attention_mask=True,
            add_special_tokens=True, return_offsets_mapping=True, return_tensors=None)
        # extract offset mapping from tokenized dictionary
        offset_tokens = tokenized_text["offset_mapping"]
        tokenized_text["start_positions"] = []
        tokenized_text["end_positions"] = []
        # extract input ids from tokenized dictionary
        input_ids = tokenized_text["input_ids"]
        tokens_style = tokenized_text.sequence_ids()
        answers = data_row["answers"]
        # proceed only if the row has an annotated answer
        if len(answers["answer_start"]) != 0:
            loop_begin = 0
            ans_begin = answers["answer_start"][0]
            ans_end = answers["answer_end"][0]
            # advance loop_begin to the first token of the context
            while tokens_style[loop_begin] != (1 if padding_style else 0):
                loop_begin += 1
            loop_end = len(input_ids) - 1
            # move loop_end back to the last token of the context
            while tokens_style[loop_end] != (1 if padding_style else 0):
                loop_end -= 1
            # answer lies outside the truncated window: label it with index 0
            if not (offset_tokens[loop_begin][0] <= ans_begin and offset_tokens[loop_end][1] >= ans_end):
                tokenized_text["start_positions"].append(0)
                tokenized_text["end_positions"].append(0)
            else:
                while loop_begin < len(offset_tokens) and offset_tokens[loop_begin][0] <= ans_begin:
                    loop_begin += 1
                tokenized_text["start_positions"].append(loop_begin - 1)
                # if the answer exists in the context iterate to update end positions
                while offset_tokens[loop_end][1] >= ans_end:
                    loop_end -= 1
                tokenized_text["end_positions"].append(loop_end + 1)
        else:
            tokenized_text["start_positions"].append(0)
            tokenized_text["end_positions"].append(0)
        
        # make dictionary of all extracted ids and return
        return dict(
            input_ids = tokenized_text['input_ids'],
            attention_mask = tokenized_text['attention_mask'],
            start_positions = tokenized_text['start_positions'],
            end_positions = tokenized_text['end_positions']
            )
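
The label construction in `__getitem__` can be hard to follow inside the class, so here is a minimal standalone sketch of the same idea on hand-written offsets, with no tokenizer involved. The function name `char_span_to_token_span` and the toy offsets are assumptions made for illustration. Unlike the cell above, this version bounds both scans to the context segment, so it never walks into padding tokens.

```python
from typing import List, Optional, Tuple


def char_span_to_token_span(
    offsets: List[Tuple[int, int]],
    sequence_ids: List[Optional[int]],
    ans_start: int,
    ans_end: int,
    context_id: int = 1,
) -> Tuple[int, int]:
    """Map a character span in the context to (start_token, end_token).

    Returns (0, 0) when the answer is not fully inside the tokenized window.
    """
    # First and last token indices that belong to the context segment.
    first = sequence_ids.index(context_id)
    last = len(sequence_ids) - 1 - sequence_ids[::-1].index(context_id)

    # Answer was truncated away: point both labels at index 0.
    if offsets[first][0] > ans_start or offsets[last][1] < ans_end:
        return 0, 0

    # Move right while tokens still start at or before the answer start.
    start = first
    while start <= last and offsets[start][0] <= ans_start:
        start += 1
    # Move left while tokens still end at or after the answer end.
    end = last
    while end >= first and offsets[end][1] >= ans_end:
        end -= 1
    return start - 1, end + 1


# Toy layout: [CLS] q1 q2 [SEP] "the" "cat" "sat" [SEP]
toy_sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
toy_offsets = [(0, 0), (0, 4), (5, 8), (0, 0), (0, 3), (4, 7), (8, 11), (0, 0)]

# "cat" occupies characters 4..7 of the context "the cat sat".
print(char_span_to_token_span(toy_offsets, toy_sequence_ids, 4, 7))    # (5, 5)
# A span beyond the window falls back to (0, 0).
print(char_span_to_token_span(toy_offsets, toy_sequence_ids, 20, 25))  # (0, 0)
```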

In [None]:
DataTokenization(train_data, tokenizer, 384)

<__main__.DataTokenization at 0x7f932c748a50>

In [None]:
import torch
train_dataset = list(DataTokenization(train_data, tokenizer, 384))
val_dataset = list(DataTokenization(val_data, tokenizer, 384))

In [None]:
train_dataset[:2]

[{'input_ids': [0,
   69535,
   81049,
   37368,
   153264,
   12095,
   52989,
   21883,
   1629,
   145615,
   32,
   2,
   2,
   3219,
   224013,
   124335,
   5966,
   69535,
   4930,
   74149,
   12095,
   52989,
   21883,
   182394,
   3686,
   51833,
   57210,
   101912,
   15,
   6161,
   2912,
   70597,
   52989,
   21883,
   102080,
   54512,
   91585,
   1962,
   212933,
   18599,
   16242,
   94236,
   16,
   198236,
   29160,
   12095,
   52989,
   21883,
   173139,
   23618,
   72817,
   5,
   5894,
   198236,
   81049,
   37334,
   144257,
   7827,
   82890,
   84853,
   80517,
   114452,
   232094,
   3686,
   17984,
   11830,
   62001,
   182394,
   4167,
   5,
   203312,
   10753,
   50667,
   2650,
   4,
   45303,
   1962,
   163062,
   198236,
   29160,
   176030,
   15453,
   4,
   3219,
   171093,
   5944,
   2650,
   8120,
   10175,
   12095,
   52989,
   21883,
   15,
   2650,
   24183,
   5638,
   14861,
   16,
   56735,
   3219,
   171093,
   5944,
   2650,
  

In [None]:
def map__data(features):
  # collect each key's values across all examples into one list per key
  data_ = {}
  for k in features[0].keys():
    data_[k] = [example[k] for example in features]
  return data_
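
To make the reshaping step concrete, here is a small sketch of the same list-of-dicts to dict-of-lists conversion on two made-up examples. The name `collate_to_columns` and the sample values are hypothetical; the behaviour is meant to match `map__data` above.

```python
from typing import Any, Dict, List


def collate_to_columns(features: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of per-example dicts into one dict of per-key lists."""
    return {key: [example[key] for example in features] for key in features[0]}


rows = [
    {"input_ids": [0, 5, 2], "start_positions": [1]},
    {"input_ids": [0, 7, 2], "start_positions": [0]},
]
columns = collate_to_columns(rows)
print(columns["input_ids"])        # [[0, 5, 2], [0, 7, 2]]
print(columns["start_positions"])  # [[1], [0]]
```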

In [None]:
train_df = map__data(train_dataset)
val_df = map__data(val_dataset)

In [None]:
train_df.keys()

dict_keys(['input_ids', 'attention_mask', 'start_positions', 'end_positions'])

In [None]:
val_df.keys()

dict_keys(['input_ids', 'attention_mask', 'start_positions', 'end_positions'])

In [None]:
from datasets import Dataset
def convert_df_todict(data):
  # convert a pandas DataFrame into a Hugging Face Dataset
  data_dict = Dataset.from_pandas(data)
  return data_dict

In [None]:
train_loader = convert_df_todict(pd.DataFrame.from_dict(train_df,orient='index').transpose())
val_loader = convert_df_todict(pd.DataFrame.from_dict(val_df,orient='index').transpose())

In [None]:
train_loader

Dataset({
    features: ['input_ids', 'attention_mask', 'start_positions', 'end_positions'],
    num_rows: 914
})

In [None]:
val_loader

Dataset({
    features: ['input_ids', 'attention_mask', 'start_positions', 'end_positions'],
    num_rows: 100
})

In [None]:
params = TrainingArguments( output_dir = 'bert_best_model', overwrite_output_dir = True,
                           evaluation_strategy = 'epoch', learning_rate = 0.0001, 
                           gradient_accumulation_steps = 8,
                           per_device_train_batch_size = 4,
                           per_device_eval_batch_size = 4,
                           num_train_epochs = 4, weight_decay = 0.01,
                           save_strategy = 'epoch', no_cuda = False,
                           logging_strategy = 'steps')

In [None]:
params

TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=epoch,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=8,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_n

In [None]:
model_trainer = Trainer( model = model, args = params, train_dataset = train_loader, eval_dataset = val_loader,
                   data_collator = data_collator, tokenizer = tokenizer)

- **Model Training**

In [None]:
# start model training
model_trainer.train()

***** Running training *****
  Num examples = 914
  Num Epochs = 4
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 8
  Total optimization steps = 112


Epoch,Training Loss,Validation Loss
0,No log,1.718499
1,No log,1.61126
2,No log,1.861912
3,No log,2.155155


***** Running Evaluation *****
  Num examples = 100
  Batch size = 4
Saving model checkpoint to bert_best_model/checkpoint-28
Configuration saved in bert_best_model/checkpoint-28/config.json
Model weights saved in bert_best_model/checkpoint-28/pytorch_model.bin
tokenizer config file saved in bert_best_model/checkpoint-28/tokenizer_config.json
Special tokens file saved in bert_best_model/checkpoint-28/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 100
  Batch size = 4
Saving model checkpoint to bert_best_model/checkpoint-56
Configuration saved in bert_best_model/checkpoint-56/config.json
Model weights saved in bert_best_model/checkpoint-56/pytorch_model.bin
tokenizer config file saved in bert_best_model/checkpoint-56/tokenizer_config.json
Special tokens file saved in bert_best_model/checkpoint-56/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 100
  Batch size = 4
Saving model checkpoint to bert_best_model/checkpoint-84
Configuration save

TrainOutput(global_step=112, training_loss=1.040583883013044, metrics={'train_runtime': 313.6646, 'train_samples_per_second': 11.656, 'train_steps_per_second': 0.357, 'total_flos': 712948200754176.0, 'train_loss': 1.040583883013044, 'epoch': 3.98})
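
The step counts in the training log can be reproduced with a little arithmetic. The sketch below assumes a single device and assumes that leftover batches at the end of an epoch do not trigger an extra optimizer update, which is how the numbers line up with the log here. The function name `optimization_steps` is hypothetical.

```python
import math


def optimization_steps(num_examples: int, per_device_batch: int,
                       grad_accum: int, epochs: int):
    """Return (updates per epoch, total updates) under the stated assumptions."""
    # Number of mini-batches the data loader yields per epoch.
    batches_per_epoch = math.ceil(num_examples / per_device_batch)
    # One optimizer update per `grad_accum` mini-batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch, updates_per_epoch * epochs


per_epoch, total = optimization_steps(914, 4, 8, 4)
print(per_epoch, total)  # 28 112
print(4 * 8)             # effective batch size: 32
```

The 28 updates per epoch match the checkpoint names (`checkpoint-28`, `checkpoint-56`, `checkpoint-84`), and the total matches `Total optimization steps = 112`.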

- **Evaluation phase**

In [None]:
# make test data tokenization class
from torch.utils.data import Dataset, DataLoader
class TestDataTokenization(Dataset):
    def __init__(self, data: pd.DataFrame, tokenizer: AutoTokenizer, text_max_length: int):
        self.tokenizer = tokenizer
        self.data = data
        self.text_max_length = text_max_length


    def __len__(self):
        return len(self.data)

    def __getitem__(self, index: int):
        data_row = self.data.iloc[index]

        
        # use the tokenizer passed to the constructor rather than the global one
        tokenized_text = self.tokenizer(
            data_row["question" if padding_style else "context"],
            data_row["context" if padding_style else "question"],
            padding="max_length", truncation="only_second" if padding_style else "only_first",
            max_length=self.text_max_length, return_attention_mask=True,
            add_special_tokens=True, return_offsets_mapping=True, return_tensors=None)
        # extract sequence id
        tokens_style = tokenized_text.sequence_ids()
        c_id = 1 if padding_style else 0
        # extract all offset mapping 
        tokenized_text["offset_mapping"] = [
            (offset if tokens_style[value] == c_id else None)
            for value, offset in enumerate(tokenized_text["offset_mapping"])]
        
        return dict(
            input_ids = tokenized_text['input_ids'],
            attention_mask = tokenized_text['attention_mask'],
            offset_mapping = tokenized_text['offset_mapping']
            )

In [None]:
test_dataset = list(TestDataTokenization(test_data, tokenizer, 384))

In [None]:
test_df = map__data(test_dataset)

In [None]:
test_df.keys()

dict_keys(['input_ids', 'attention_mask', 'offset_mapping'])

In [None]:
from datasets import Dataset
test_loader = convert_df_todict(pd.DataFrame.from_dict(test_df,orient='index').transpose())

In [None]:
test_loader_model = test_loader.map(remove_columns=[ 'offset_mapping'])

  0%|          | 0/100 [00:00<?, ?ex/s]

- **Model predictions**

In [None]:
model_predictions = model_trainer.predict(test_loader_model)

***** Running Prediction *****
  Num examples = 100
  Batch size = 4


In [None]:
import pickle
with open('bert_predictions.pickle', 'wb') as handle:
    pickle.dump(model_predictions, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
with open('bert_predictions.pickle', 'rb') as handle:
    model_predictions = pickle.load(handle)

In [None]:
start_prediction_seq, end_prediction_seq = model_predictions.predictions

In [None]:
def extract_predicted_answers(raw_data, tokenizer_features, start_prediction_seq, end_prediction_seq, top, target_length):
    # apply processing on generated predictions of model
    pred_ans_list = []
    # iterate over test data instances
    for id, data_row in enumerate(raw_data):
      context = data_row["context"]
      m_value = None 
      best_predicted_answers = []
      start_prediction = start_prediction_seq[id]
      end_prediction = end_prediction_seq[id]
      offset_tokens = tokenizer_features[id]["offset_mapping"]
      start_index = tokenizer_features[id]["input_ids"].index(tokenizer.cls_token_id)
      # score of the null answer: start + end logits at the [CLS] token
      feature_min_value = start_prediction[start_index] + end_prediction[start_index]
      if m_value is None or m_value < feature_min_value:
        m_value = feature_min_value
      start_prediction_seq_idx = np.argsort(start_prediction)[-1 : -top - 1 : -1].tolist()
      end_prediction_seq_idx = np.argsort(end_prediction)[-1 : -top - 1 : -1].tolist()
      for start_prediction_id in start_prediction_seq_idx:
        # iterate over the top-scoring end indices for this instance
        for end_prediction_id in end_prediction_seq_idx:
            if (start_prediction_id >= len(offset_tokens) or end_prediction_id >= len(offset_tokens)
                or offset_tokens[end_prediction_id] is None or offset_tokens[start_prediction_id] is None):
              continue
            if end_prediction_id - start_prediction_id + 1 > target_length or end_prediction_id < start_prediction_id:
              continue
            ans_start_position = offset_tokens[start_prediction_id][0]
            ans_end_position = offset_tokens[end_prediction_id][1]
            ans_prob = start_prediction[start_prediction_id] + end_prediction[end_prediction_id]
            result_ans_dict = { "pred_prob": ans_prob, "pred_answer_text": context[ans_start_position: ans_end_position]}
            best_predicted_answers.append(result_ans_dict)
      if len(best_predicted_answers) <= 0:
        # assign probability 0 and return predicted answer as empty string
        predicted_answer = {"pred_answer_text": "", "pred_prob": 0.0}
      else:
        predicted_answer = sorted(best_predicted_answers, key=lambda y: y["pred_prob"], reverse=True)[0]  
      pred_ans_list.append(predicted_answer["pred_answer_text"])
    # return predictions list
    return pred_ans_list
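
The core of the post-processing above is a search over the top-scoring start and end indices. Here is a minimal pure-Python sketch of that search on made-up scores. The name `best_span` is hypothetical, and the sketch leaves out the offset-mapping checks that the full function performs to discard non-context tokens.

```python
from typing import List, Optional, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              top: int, max_len: int) -> Optional[Tuple[int, int, float]]:
    """Return (start, end, score) of the best valid span, or None if none exists."""
    # Indices of the `top` highest scores, best first.
    start_ids = sorted(range(len(start_scores)),
                       key=lambda i: start_scores[i], reverse=True)[:top]
    end_ids = sorted(range(len(end_scores)),
                     key=lambda i: end_scores[i], reverse=True)[:top]
    best = None
    for s in start_ids:
        for e in end_ids:
            # Skip spans that end before they start or exceed the length limit.
            if e < s or e - s + 1 > max_len:
                continue
            score = start_scores[s] + end_scores[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


toy_start = [0.1, 2.0, 0.3, 1.5]
toy_end = [0.2, 0.1, 3.0, 0.4]
print(best_span(toy_start, toy_end, top=2, max_len=3))  # (1, 2, 5.0)
```

With `max_len=1` the same inputs yield the single-token span at index 3, since the higher-scoring span from 1 to 2 is too long.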

In [None]:
predictions_list = extract_predicted_answers(convert_df_todict(test_data), test_loader, start_prediction_seq, end_prediction_seq, 15, 30)

In [None]:
# function to evaluate BLEU scores of predicted answers against the original answers
def evaluate_blue_score(actual, prediction):
  results = dict()
  bleu_score1 = 0
  bleu_score2 = 0
  bleu_score3 = 0
  bleu_score4 = 0
  if len(actual) == len(prediction):
    # iterate over all predictions
    for i in range(len(actual)):
      # an empty prediction scores 0 for every BLEU order, so skip it
      if prediction[i] == "":
        continue
      actual_tokenized = list(map(lambda x: indic_tokenize.trivial_tokenize(x), actual[i]))
      pred_tokenized = indic_tokenize.trivial_tokenize(prediction[i])
      chencherry = SmoothingFunction()
      # nltk functions to calculate bleu1, bleu2, bleu3 and bleu4 score
      bleu_1 = sentence_bleu(actual_tokenized, pred_tokenized, weights=(1, 0, 0, 0), smoothing_function=chencherry.method2)
      bleu_2 = sentence_bleu(actual_tokenized, pred_tokenized, weights=(0.5, 0.5, 0, 0), smoothing_function=chencherry.method2)
      bleu_3 = sentence_bleu(actual_tokenized, pred_tokenized, weights=(0.33, 0.33, 0.33, 0), smoothing_function=chencherry.method2)
      bleu_4 = sentence_bleu(actual_tokenized, pred_tokenized, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=chencherry.method2)
      # add up scores of each instance
      bleu_score1 +=bleu_1
      bleu_score2 +=bleu_2
      bleu_score3 +=bleu_3
      bleu_score4 +=bleu_4
    # convert BLEU scores from fractions to percentages
    results["bleu_1"] = [round(bleu_score1 / len(actual) * 100, 2)]
    results["bleu_2"] = [round(bleu_score2 / len(actual) * 100, 2)]
    results["bleu_3"] = [round(bleu_score3 / len(actual) * 100, 2)]
    results["bleu_4"] = [round(bleu_score4 / len(actual) * 100, 2)]
    # return total evaluated results
    return results
  else:
    print("Error: Actual values and predictions are not of same length....")
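
To show what the BLEU-1 figure measures, here is a simplified sketch of unigram BLEU with a brevity penalty, applied to one answer pair from the test output. It uses plain whitespace splitting instead of `indic_tokenize.trivial_tokenize` and a single reference, so its values are illustrative and will not necessarily equal those from `sentence_bleu`. The name `bleu1` is hypothetical.

```python
import math
from collections import Counter
from typing import List


def bleu1(reference_tokens: List[str], hypothesis_tokens: List[str]) -> float:
    """Clipped unigram precision multiplied by the brevity penalty."""
    if not hypothesis_tokens:
        return 0.0
    ref_counts = Counter(reference_tokens)
    hyp_counts = Counter(hypothesis_tokens)
    # Each hypothesis token is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[tok]) for tok, count in hyp_counts.items())
    precision = clipped / len(hypothesis_tokens)
    c, r = len(hypothesis_tokens), len(reference_tokens)
    # Penalize hypotheses that are not longer than the reference.
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * precision


reference = "जेद्दाह से १९कि.मी उत्तर में स्थित".split()
hypothesis = "सऊदी अरब के शहर जेद्दाह".split()
print(round(bleu1(reference, hypothesis), 4))  # 0.1637
print(bleu1(["चीन"], ["चीन"]))                  # 1.0
```

Only one of the five predicted tokens appears in the six-token reference, so the precision is 0.2, and the shorter hypothesis is scaled down further by the brevity penalty.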


In [None]:
import pandas as pd
import numpy as np
import nltk
nltk.download('punkt')
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
from indicnlp.tokenize import indic_tokenize 
from deep_translator import GoogleTranslator

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
actual_answers = list(test_data['answer_text'].values)

In [None]:
pred_answers = [i.strip(" ").strip("\n") for i in predictions_list]

In [None]:
# evaluate bleu scores
actual_answers = [[i] for i in actual_answers]
ans_results = evaluate_blue_score(actual_answers, pred_answers)
pd.DataFrame(ans_results)

Unnamed: 0,bleu_1,bleu_2,bleu_3,bleu_4
0,45.51,41.48,37.79,34.76


In [None]:
for i in range(len(actual_answers)):
  print("\n----------------------------------\n")
  print("Actual answers: ",actual_answers[i][0])
  print("Predicted answer: ",pred_answers[i])
  try:
    print("Translated Answer: ",GoogleTranslator(source='auto', target='en').translate(pred_answers[i]))
  except Exception:
    # fall back to the untranslated answer if the translation request fails
    print("Translated Answer: ",pred_answers[i])



----------------------------------

Actual answers:  16 9, 250 वर्ग मील
Predicted answer:  16 9, 250 वर्ग मील
Translated Answer:  169,250 square miles

----------------------------------

Actual answers:  नाथूराम गोडसे
Predicted answer:  नाथूराम गोडसे
Translated Answer:  Nathuram Godse

----------------------------------

Actual answers:  स्त्री-रोग विशेषज्ञ
Predicted answer:  स्त्रीरोगविज्ञान
Translated Answer:  gynecology

----------------------------------

Actual answers:  लक्ष्मी
Predicted answer:  लक्ष्मी
Translated Answer:  Laxmi

----------------------------------

Actual answers:  जेद्दाह से १९कि.मी उत्तर में स्थित
Predicted answer:  सऊदी अरब के शहर जेद्दाह
Translated Answer:  Saudi Arabian city Jeddah

----------------------------------

Actual answers:  चीन
Predicted answer:  चीन
Translated Answer:  China

----------------------------------

Actual answers:  चार्ल्स बैबेज
Predicted answer:  चार्ल्स बैबेज
Translated Answer:  Charles Babbage

---------------------------------