# About

This is a modified version of the HuggingFace tutorial [Fine-tuning a model on a question-answering task](https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb), which explains the features and the training loop in more detail. I adapted it to the [chaii competition](https://www.kaggle.com/c/chaii-hindi-and-tamil-question-answering) data and added some processing logic. Hopefully, this will make your experiments easier and more efficient while keeping track of your results. I also recommend the [notebook](https://www.kaggle.com/thedrcat/chaii-eda-baseline) created by @thedrcat from the same master example, where I found the best model so far and picked up a few useful tricks.

**Notebooks connected to this work:**

1. [Converting original and external data to SQuAD format](https://www.kaggle.com/oleksandrsirenko/chaii-dataframe-and-external-data-to-squad)
2. [Inference notebook](https://www.kaggle.com/oleksandrsirenko/chaii-inference-finetuned-model)

There are many possible improvements and tweaks at every level, from data processing to the training loop. Feel free to fork and customize this notebook to fit your requirements and vision, and ***don't forget to upvote if you like this kernel*** 🤗

In [None]:
%%capture
!conda install --yes -c huggingface -c conda-forge datasets
!pip install jiwer

In [None]:
import pandas as pd
import numpy as np
import os
import json
import collections

from pathlib import Path
from typing import List, Dict, Optional, Union
from pydantic import BaseModel

import datasets
from transformers.trainer_utils import set_seed
from transformers import (AutoTokenizer, PreTrainedTokenizerFast,
                          AutoModelForQuestionAnswering, TrainingArguments,
                          Trainer, default_data_collator, DataCollatorWithPadding)
import wandb
import jiwer

from tqdm.auto import tqdm
import gc

from IPython.display import FileLink

# Configuration

In [None]:
class Config:
    # Path
    model_path: str = '../input/xlm-roberta-squad2/deepset/xlm-roberta-large-squad2'
    test_path: str = '../input/chaii-hindi-and-tamil-question-answering/test.csv'
    output_dir: str = './'
    
    # Base
    model_name: str = 'xlm-roberta-large-squad2'
    version: str = 'v11'
    seed: int = 42
    test_size: float = 0.1
    debug: bool = False
    
    # Tokenizer
    max_length: int = 256
    doc_stride: int = 128
    
    # Trainer
    batch_size: int = 8
    learning_rate: float = 3e-5
    warmup_ratio: float = 0.1
    gradient_accumulation_steps: int = 4
    num_train_epochs: int = 1
    weight_decay: float = 0.01
    
    # Postprocess
    n_best_size: int = 20
    max_answer_length: int = 30
    squad_v2: bool = False
    
    # Notes
    scores: Dict[str,float] = {}
    notes: str = "Some important findings"
    LB: float = 0.0
    
    @staticmethod
    def save_config(file_name: str, output_dir: str = output_dir) -> None:
        config_dict = {}
        for key, value in vars(Config).items():
            if key.startswith('_') or isinstance(value, staticmethod):
                continue
            config_dict[key] = value
        
        out_path = f'{output_dir}{file_name}.json'
        with open(out_path, 'w') as out_file:
            json.dump(config_dict, out_file, indent=2, sort_keys=False)

In [None]:
set_seed(Config.seed)

In [None]:
def read_json(from_path: Path) -> dict:
    with open(from_path, 'r', encoding='utf-8') as in_file:
        return json.load(in_file)
        
def write_json(data: dict, out_path: Path) -> None:
    with open(out_path, 'w', encoding='utf-8') as out_file:
        json.dump(data, out_file, indent=2, sort_keys=True, ensure_ascii=False)

# Prepare the Datasets

First, we need to convert the pandas dataframe to a JSON object of the appropriate format:

```python
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'}
 ```
 
I have already performed this task for all currently available datasets, so I skip this step here and use the prepared data. You can find all the helper functions for converting data to SQuAD format in [this notebook](https://www.kaggle.com/oleksandrsirenko/chaii-dataframe-to-squad-and-external-data). The [chaii-squad dataset](https://www.kaggle.com/oleksandrsirenko/chaii-squad) will be updated during the competition.
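
As a rough illustration of the target format, the sketch below converts one flat row into a SQuAD-style record. It is not the helper from the linked notebook: the name `row_to_squad` is made up, and the column names (`id`, `context`, `question`, `answer_text`, `answer_start`) are an assumption about the competition's `train.csv`.

```python
from typing import Any, Dict


def row_to_squad(row: Dict[str, Any], title: str = "") -> Dict[str, Any]:
    """Convert one flat row into a SQuAD-style record (hypothetical helper)."""
    return {
        "id": str(row["id"]),
        "title": title,
        "context": row["context"],
        "question": row["question"],
        "answers": {
            # SQuAD stores answers as parallel lists, even for a single answer.
            "text": [row["answer_text"]],
            "answer_start": [int(row["answer_start"])],
        },
    }


# Toy row with assumed column names.
example_row = {
    "id": "abc123",
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answer_text": "Paris",
    "answer_start": 0,
}
record = row_to_squad(example_row)

# The answer text should be recoverable from the context by its character offset.
start = record["answers"]["answer_start"][0]
answer = record["answers"]["text"][0]
recovered = record["context"][start:start + len(answer)]
print(recovered)
```

Since the cells below load the files with `field='data'`, such records would be stored as a list under a top-level `data` key.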

### Facebook MLQA Hindi

In [None]:
mlqa_dev_hindi_path = '../input/chaii-squad/mlqa_dev_hindi.json'
mlqa_dev_hindi_dataset = datasets.load_dataset(
    'json',
    data_files=mlqa_dev_hindi_path,
    field='data',
    split="train"
)
mlqa_dev_hindi_dataset

In [None]:
mlqa_test_hindi_path = '../input/chaii-squad/mlqa_test_hindi.json'
mlqa_test_hindi_dataset = datasets.load_dataset(
    'json',
    data_files=mlqa_test_hindi_path,
    field='data',
    split="train"
)
mlqa_test_hindi_dataset

In [None]:
mlqa_hindi_dataset = datasets.concatenate_datasets([mlqa_dev_hindi_dataset, mlqa_test_hindi_dataset])
mlqa_hindi_dataset

### XQuAD Hindi

In [None]:
xquad_hindi_path = '../input/chaii-squad/xquad_hindi.json'
xquad_hindi_dataset = datasets.load_dataset(
    'json',
    data_files=xquad_hindi_path,
    field='data',
    split="train"
)
xquad_hindi_dataset

### CHAII

In [None]:
chaii_path = '../input/chaii-squad/chaii_train.json'
chaii_dataset = datasets.load_dataset(
    'json',
    data_files=chaii_path,
    field='data',
    split='train'
)
chaii_dataset

In [None]:
chaii_dataset = chaii_dataset.train_test_split(
    test_size=Config.test_size,
    shuffle=True,
    seed=Config.seed
)
chaii_dataset

### Concatenate Train Split with External Data

In [None]:
chaii_dataset['train'] = datasets.concatenate_datasets([chaii_dataset['train'], xquad_hindi_dataset, mlqa_hindi_dataset])

In [None]:
chaii_dataset

# Tokenize

In [None]:
tokenizer = AutoTokenizer.from_pretrained(Config.model_path)
assert isinstance(tokenizer, PreTrainedTokenizerFast)
pad_on_right = tokenizer.padding_side == "right"

In [None]:
def prepare_train_features(examples):
    # Some of the questions have a lot of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=Config.max_length,
        stride=Config.doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
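
The labeling step above boils down to mapping a character span onto token indices through the offsets. Below is a simplified sketch of that idea, assuming `offsets` holds only the context tokens of one feature in order; `char_span_to_token_span` is an illustrative name, and returning `None` stands in for labeling the feature with the CLS index.

```python
from typing import List, Optional, Tuple


def char_span_to_token_span(
    offsets: List[Tuple[int, int]], start_char: int, end_char: int
) -> Optional[Tuple[int, int]]:
    """Return inclusive (start_token, end_token), or None if the answer
    is not fully contained in this window of context tokens."""
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    # Last token that begins at or before the first answer character.
    start_token = max(i for i, (s, _) in enumerate(offsets) if s <= start_char)
    # First token that ends at or after the last answer character.
    end_token = min(i for i, (_, e) in enumerate(offsets) if e >= end_char)
    return start_token, end_token


# Toy context "the quick brown fox" split into four word-level tokens.
toy_offsets = [(0, 3), (4, 9), (10, 15), (16, 19)]
inside = char_span_to_token_span(toy_offsets, 4, 15)    # "quick brown"
outside = char_span_to_token_span(toy_offsets, 16, 25)  # runs past the window
print(inside, outside)
```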

In [None]:
tokenized_chaii = chaii_dataset.map(
    prepare_train_features,
    batched=True,
    remove_columns=chaii_dataset['train'].column_names
)

In [None]:
tokenized_chaii
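
Each split can contain more features than examples, because a long context is cut into overlapping windows. The sketch below imitates this on plain token counts. It assumes that the stride is the number of tokens shared by consecutive windows, and it ignores the question and special tokens that also take up part of `max_length`; `split_with_overlap` is a hypothetical helper, not a tokenizer API.

```python
from typing import List, Tuple


def split_with_overlap(n_tokens: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) token ranges covering n_tokens with overlapping windows."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap  # how far each new window advances
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += step
    return spans


# A context of 600 tokens with the window and overlap used in Config.
windows = split_with_overlap(600, window=256, overlap=128)
for span in windows:
    print(span)
```

With these numbers every context token except those at the very beginning and end lands in two windows, which is why an answer can be found in more than one feature.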

# Training

In [None]:
model = AutoModelForQuestionAnswering.from_pretrained(Config.model_path)

In [None]:
args = TrainingArguments(
    output_dir=Config.output_dir,
    evaluation_strategy="epoch",
    learning_rate=Config.learning_rate,
    warmup_ratio=Config.warmup_ratio,
    gradient_accumulation_steps=Config.gradient_accumulation_steps,
    per_device_train_batch_size=Config.batch_size,
    per_device_eval_batch_size=Config.batch_size,
    num_train_epochs=Config.num_train_epochs,
    weight_decay=Config.weight_decay,
    seed=Config.seed
)
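
To make the interplay of these settings concrete, here is a small arithmetic sketch. The helper names are made up, the total of 1000 optimizer steps is an invented number for illustration, and rounding the warmup up to a whole step is an assumption about how the ratio is applied.

```python
import math


def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         n_devices: int = 1) -> int:
    """Number of examples contributing to a single optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * n_devices


def warmup_steps(total_optimizer_steps: int, warmup_ratio: float) -> int:
    """Optimizer steps spent in warmup (rounded up; an assumption)."""
    return math.ceil(total_optimizer_steps * warmup_ratio)


# Values from Config: batch_size=8, gradient_accumulation_steps=4, warmup_ratio=0.1.
batch = effective_batch_size(8, 4)
print(batch)
print(warmup_steps(1000, 0.1))  # 1000 total steps is a made-up figure
```

Gradient accumulation is what lets a large model train with a bigger effective batch than fits in GPU memory at once.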

[Weights & Biases](https://wandb.ai/site) is a service for tracking experiments, versioning datasets, and managing models. It is incorporated by default into the Trainer API and has its own behavior. Sometimes it is useful and convenient, sometimes it can be a hassle :) Personally, I would prefer this service to be disabled by default and enabled manually when needed. Feel free to check it out :)

In [None]:
%%capture
# Disable W&B logging. Use mode="offline" instead to log locally without syncing.
wandb.init(mode="disabled")

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_chaii['train'],
    eval_dataset=tokenized_chaii['test'],
    data_collator=default_data_collator,
    tokenizer=tokenizer
)

In [None]:
trainer.train()

# Evaluation

In [None]:
def prepare_validation_features(examples):
    # Some of the questions have a lot of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples["question"] = [q.lstrip() for q in examples["question"]]

    # Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples["question" if pad_on_right else "context"],
        examples["context" if pad_on_right else "question"],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=Config.max_length,
        stride=Config.doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")

    # We keep the example_id that gave us this feature and we will store the offset mappings.
    tokenized_examples["example_id"] = []

    for i in range(len(tokenized_examples["input_ids"])):
        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        tokenized_examples["example_id"].append(examples["id"][sample_index])

        # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
        # position is part of the context or not.
        tokenized_examples["offset_mapping"][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples["offset_mapping"][i])
        ]

    return tokenized_examples
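
The masking at the end of this function is easier to see on a toy feature. The sketch below is only an illustration: the token layout and offsets are made up, and `mask_non_context` is a hypothetical name for the list comprehension used above.

```python
from typing import List, Optional, Tuple

Offset = Tuple[int, int]


def mask_non_context(offsets: List[Offset],
                     sequence_ids: List[Optional[int]],
                     context_index: int = 1) -> List[Optional[Offset]]:
    """Keep offsets of context tokens and replace all others with None."""
    return [o if sid == context_index else None
            for o, sid in zip(offsets, sequence_ids)]


# Made-up layout: special, question, special, special, context, context, special.
toy_sequence_ids = [None, 0, None, None, 1, 1, None]
toy_feature_offsets = [(0, 0), (0, 4), (0, 0), (0, 0), (0, 5), (6, 11), (0, 0)]
masked = mask_non_context(toy_feature_offsets, toy_sequence_ids)
print(masked)
```

During post-processing, a `None` entry is enough to reject a predicted start or end position that falls on the question or on a special token.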

In [None]:
validation_features = chaii_dataset['test'].map(
    prepare_validation_features,
    batched=True,
    remove_columns=chaii_dataset['test'].column_names
)

In [None]:
validation_features

In [None]:
raw_predictions = trainer.predict(validation_features)

In [None]:
validation_features.set_format(type=validation_features.format["type"], columns=list(validation_features.features.keys()))

# Post-processing

In [None]:
def postprocess_qa_predictions(examples, features, raw_predictions, tokenizer=tokenizer,
                               squad_v2=Config.squad_v2, n_best_size=Config.n_best_size, 
                               max_answer_length=Config.max_answer_length):
    
    all_start_logits, all_end_logits = raw_predictions
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(features):
        features_per_example[example_id_to_index[feature["example_id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")

    # Let's loop over all the examples!
    for example_index, example in enumerate(tqdm(examples)):
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]

        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example["context"]
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some of the positions in our logits to spans of text in the
            # original context.
            offset_mapping = features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Let's pick our final answer: the best one or the null answer (only for squad_v2)
        if not squad_v2:
            predictions[example["id"]] = best_answer["text"]
        else:
            answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
            predictions[example["id"]] = answer

    return predictions
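
The core of the function is the search over pairs of high-scoring start and end positions. Here is a stripped-down sketch of that search for a single feature, using plain lists instead of NumPy arrays and skipping the offset checks; `top_indices` and `best_span` are illustrative names, and the logits are invented.

```python
from typing import List, Optional, Tuple


def top_indices(logits: List[float], n_best: int) -> List[int]:
    """Indices of the n_best largest logits, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:n_best]


def best_span(start_logits: List[float], end_logits: List[float],
              n_best_size: int = 20,
              max_answer_length: int = 30) -> Optional[Tuple[int, int, float]]:
    """Return (start, end, score) of the best valid span, or None if there is none."""
    best = None
    for s in top_indices(start_logits, n_best_size):
        for e in top_indices(end_logits, n_best_size):
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_logits[s] + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best


toy_start_logits = [0.1, 2.0, 0.3, 0.2]
toy_end_logits = [0.0, 0.5, 3.0, 1.0]
unrestricted = best_span(toy_start_logits, toy_end_logits)
single_token = best_span(toy_start_logits, toy_end_logits, max_answer_length=1)
print(unrestricted, single_token)
```

Restricting the search to the `n_best_size` best starts and ends keeps the number of candidate pairs small even for long features.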

In [None]:
final_predictions = postprocess_qa_predictions(chaii_dataset['test'], validation_features, raw_predictions.predictions)

# Scores

In [None]:
metric = datasets.load_metric("squad")

In [None]:
formatted_predictions = [{"id": k, "prediction_text": v} for k, v in final_predictions.items()]
references = [{"id": ex["id"], "answers": ex["answers"]} for ex in chaii_dataset['test']]
scores = metric.compute(predictions=formatted_predictions, references=references)
scores

In [None]:
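def jaccard(row: pd.Series) -> float:
    # Word-level Jaccard similarity between the two strings in the row.
    # Use positional access explicitly, since the row is labeled by column names.
    str1 = row.iloc[0]
    str2 = row.iloc[1]
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))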

[Word error rate (WER)](https://huggingface.co/metrics/wer) is a common metric of the performance of an automatic speech recognition system. The general difficulty of measuring performance lies in the fact that the recognized word sequence can have a different length from the reference word sequence (supposedly the correct one). The WER is derived from the Levenshtein distance, working at the word level instead of the phoneme level. The WER is a valuable tool for comparing different systems as well as for evaluating improvements within one system. 

Thus, it can be used as one of our metrics and, possibly, as an alternative to the word-level Jaccard score. The lower the WER, the better. It is not confined to the 0-1 range: the best WER is zero, but it can exceed 1 when the prediction contains many extra words.
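
For intuition, here is a minimal pure-Python sketch of WER as the word-level Levenshtein distance divided by the number of reference words. It is not the implementation used by the `wer` metric or by `jiwer`, and `word_error_rate` is a made-up name.

```python
def word_error_rate(reference: str, prediction: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j predicted words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("a b", "a b"))      # identical strings
print(word_error_rate("a b c", "a x c"))  # one substitution in three words
print(word_error_rate("a", "a b c d"))    # three insertions for one reference word
```

The last call shows why the score is unbounded: extra predicted words count as insertions, while the denominator depends only on the reference.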

In [None]:
wer_metric = datasets.load_metric("wer")

In [None]:
formatted_predictions[:2]

In [None]:
references[:2]

In [None]:
eval_df = pd.DataFrame(references)
eval_df['answers'] = eval_df.answers.apply(lambda x: x['text'][0])
eval_df['predictions'] = eval_df['id'].apply(lambda x: final_predictions[x])
eval_df['jaccard'] = eval_df[['answers', 'predictions']].apply(jaccard, axis=1)
eval_df['wer'] = eval_df[['predictions', 'answers']].apply(lambda x: wer_metric.compute(predictions=[x['predictions']], references=[x['answers']]), axis=1)
eval_df.head()

In [None]:
jaccard_wer_scores = {'jaccard': eval_df.jaccard.mean(), 'wer': eval_df.wer.mean()}
scores.update(jaccard_wer_scores)
jaccard_wer_scores

# Save Model and Experiment Configuration

In [None]:
training_output = f"{Config.model_name}-finetuned-{Config.version}"
trainer.save_model(training_output)

In [None]:
scores

In [None]:
notes = 'XLM Roberta large model, add external data. Change max len to 256.'
Config.notes = notes
Config.scores = scores
Config.save_config(training_output.replace('-', '_'))

In [None]:
%cd ./
%ls

In [None]:
# verify config
read_json('xlm_roberta_large_squad2_finetuned_v11.json')

# Download 

**NOTE:** This training notebook works only in online mode, while the competition does not allow internet access when you submit your predictions. You can modify this code to run without internet access and submit your predictions directly, but the more convenient way is to create an inference notebook without all this training stuff that only applies the model to the test dataset. To do this, you need to download the directory that you created in the previous step when calling

```python
trainer.save_model(training_output)
```

Then you need to create a new (or update an existing) Kaggle dataset, upload the whole directory, and connect the dataset to your inference kernel. Of course, do this only if the results are worth it; otherwise, just download the experiment config and keep looking for a solution :)

In [None]:
!zip -r  xlm-roberta-large-squad2-finetuned-v11.zip  xlm-roberta-large-squad2-finetuned-v11

In [None]:
FileLink(r'xlm-roberta-large-squad2-finetuned-v11.zip')

In [None]:
FileLink('xlm_roberta_large_squad2_finetuned_v11.json')

# Inference 
You can find out how this works at the inference stage [here](https://www.kaggle.com/oleksandrsirenko/chaii-inference-finetuned-model).