## **Introduction**

Hindi and Tamil are among the most widely spoken languages in India. Even without going by the data **(*a sacrilegious remark on Kaggle*)**, I have not personally encountered anyone in Bangalore *(Karnataka, India)* who does not speak at least one of the two languages.


Since these languages are spoken far and wide, a competition with Hindi and Tamil at the helm is very practical. Looking forward to learning a lot!


This notebook is an earnest attempt at learning 4 things that are very new to me -

*   Natural Language Processing
*   Keras
*   HuggingFace
*   Tamil and Hindi *(I can croon some melodies in both but god help the one who wants to talk with me using these languages)*

### **Notebook Details**

Mostly an amalgamation of two excellent resources - 

- The [Keras SQuAD](https://keras.io/examples/nlp/text_extraction_with_bert/) example
- The [HuggingFace SQuAD](https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb) walkthrough

Also, I really liked the notebook series from [Julián Peller](https://www.kaggle.com/julian3833/1-the-competition-qa-for-qa-noobs), which I used to understand the whole rigmarole of QA in NLP, and Shahules's [kernel](https://www.kaggle.com/shahules/chaii-custom-qa-training-pytorch), which has a PyTorch implementation with HF additions. 

On the EDA side I really liked the word clouds from [Shivam Ralli](https://www.kaggle.com/hoshi7/chaii-interactive-wordclouds) and [AK Nain](https://www.kaggle.com/aakashnain/chaii-explore-the-data).

That being said, all the notebooks are awesome!🙂

## **Setup**

In [None]:
import collections

import pandas as pd
import numpy as np

from sklearn.model_selection import StratifiedKFold

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from transformers import TFXLMRobertaModel,AutoTokenizer

In [None]:
base_model = "../input/jplu-tf-xlm-roberta-base-2"
tokenizer = AutoTokenizer.from_pretrained(base_model, add_special_tokens=True)

## **Dealing with Data**

Basic things like splitting the data into folds and pre-processing are performed here.

### **Loading the data**

In [None]:
# Not training across all folds in this notebook. The folds are just a really convoluted way to get one train/validation split
df = pd.read_csv("../input/chaii-hindi-and-tamil-question-answering/train.csv")
folds = 4
df["kfold"] = -1

kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
for f, (t_, v_) in enumerate(kf.split(X=df, y=df.language.values)):
    df.loc[v_, 'kfold'] = f
    
fold_val = 3
train = df[df['kfold']!=fold_val]
valid = df[df['kfold']==fold_val]

# Converting the dataframe to a list of dictionaries.
# NOT IDEAL. But this was the best way to get the data closer to the 
# Keras example. Please let me know if I can do it in a better way.
train_records = train.to_dict('records')
valid_records = valid.to_dict('records')

print(f'There are {len(train_records)} training records and {len(valid_records)} validation records.')

### **Preprocessing the data** 


Two steps happen in this section -

1. Go through each record and create a `ChaiiExample` object from it.
2. Use the HF approach to deal with long documents.



**Dealing with very long documents.** 
- In other NLP tasks, documents longer than the model's maximum sequence length are usually truncated.
- However, in Question-Answering, removing part of the *context* might result in losing the answer we are looking for. 
- To deal with this, we allow one (long) example in our dataset to produce several input features, each no longer than the maximum length of the model *(in this notebook, set by the `max_length` hyper-parameter)*. 
- In case the answer lies at the point where we split a long context, we allow some overlap between the generated features, controlled by the hyper-parameter `doc_stride`.
- We do allow truncation, BUT only of the context, never of the question.
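The windowing idea above can be illustrated with a minimal sketch in plain Python. This is not the tokenizer's actual implementation: `split_into_windows`, `window_size` and `overlap` are hypothetical names. It assumes `overlap` plays the role of `doc_stride`, and that `window_size` stands for whatever is left of `max_length` once the question and special tokens are accounted for.

```python
def split_into_windows(context_tokens, window_size, overlap):
    """Split a token list into overlapping windows.

    Each window holds at most `window_size` tokens and starts with the
    last `overlap` tokens of the previous window.
    """
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    step = window_size - overlap
    windows = []
    start = 0
    while True:
        windows.append(context_tokens[start:start + window_size])
        # Stop once the current window reaches the end of the context
        if start + window_size >= len(context_tokens):
            break
        start += step
    return windows


tokens = list(range(10))  # stand-in for 10 context tokens
windows = split_into_windows(tokens, window_size=4, overlap=2)
for window in windows:
    print(window)
```

With these toy numbers, every token shared between two neighbouring windows appears in both, so an answer cut by one window boundary is still whole in the next window as long as it is not longer than the overlap.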

In [None]:
max_length = 384
doc_stride = 128

In [None]:
class ChaiiExample:
    def __init__(self, question, context, start_char_idx=0, answer_text='', id=''):
        self.question = question
        self.context = context
        self.start_char_idx = start_char_idx
        self.answer_text = answer_text
        self.id = id

        # The tokenizer returns a list of features from a single long doc.
        # This is enabled by setting return_overflowing_tokens=True.

        # We need to find out which features actually contain the answer.
        # The answer is given by start and end character positions, but we
        # have split the document!
        # To map the answer positions in each feature back to the original
        # context, we need 'offsets', obtained by setting
        # return_offsets_mapping=True.

        # Every feature is padded to max_length
        self.tokenized_sample = tokenizer(
            question, context, max_length=max_length, stride=doc_stride,
            truncation="only_second",
            return_overflowing_tokens=True,
            return_offsets_mapping=True,
            padding="max_length"
        )
    
 
    def prepare_train_sample(self):
        # Mapping from each feature to the example it came from (unused here)
        sample_mapping = self.tokenized_sample.pop("overflow_to_sample_mapping")
        # Retrieve the character offsets of every token
        offset_mapping = self.tokenized_sample.pop("offset_mapping")

        # Obtaining start and end positions of the answer
        self.tokenized_sample["start_positions"] = []
        self.tokenized_sample["end_positions"] = []

        for i, offsets in enumerate(offset_mapping):
            input_ids = self.tokenized_sample["input_ids"][i]
            # If the answer is not present in this feature, it is labelled
            # as an impossible answer by pointing both positions at the CLS token
            cls_index = input_ids.index(tokenizer.cls_token_id)

            # Grab the sequence corresponding to that sample. 
            # Sequence id's help to know what is the context and 
            # what is the question (basically sets 0 to question 
            # and 1 to context - based on the order we used when 
            # calling the tokenizer)
            sequence_ids = self.tokenized_sample.sequence_ids(i)

            # Start/end character index of the answer in the text.
            answer_start = self.start_char_idx
            answer_end = answer_start + len(self.answer_text)

            # Start token index of the current span in the context.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # End token index of the current span in the context.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # Check the position of the answer in the context
            # 1. If answer is out of span, then mark the feature with CLS index
            if not (offsets[token_start_index][0] <= answer_start and\
                    offsets[token_end_index][1] >= answer_end):
                self.tokenized_sample["start_positions"].append(cls_index)
                self.tokenized_sample["end_positions"].append(cls_index)
            # 2. Else move the token_start_index and token_end_index to the two 
            #    ends of the answer.
            else:
                while token_start_index < len(offsets) and\
                      offsets[token_start_index][0] <= answer_start:
                    token_start_index += 1
                self.tokenized_sample["start_positions"].append(token_start_index - 1)
                
                while offsets[token_end_index][1] >= answer_end:
                    token_end_index -= 1
                self.tokenized_sample["end_positions"].append(token_end_index + 1)

            # Note that there may be an edge case where the answer could go after 
            # the last offset if the answer is the last word.

        self.split_the_dicts()

    def split_the_dicts(self):
        # Splits the list of dictionary values into separate dictionaries
        if isinstance(self.tokenized_sample['input_ids'][0],list):
            keys = self.tokenized_sample.keys()
            vals = zip(*[self.tokenized_sample[k] for k in keys])
            self.tokenized_sample = [dict(zip(keys, v)) for v in vals]


    def prepare_validation_sample(self):
        # Mapping from each feature to the example it came from (unused here)
        sample_mapping = self.tokenized_sample.pop("overflow_to_sample_mapping")

        # Keep the example_id that gave us this feature.
        # We will also keep the offset mappings.
        self.tokenized_sample["id"] = []

        for i in range(len(self.tokenized_sample["input_ids"])):
            # Grab the sequence corresponding to that example
            sequence_ids = self.tokenized_sample.sequence_ids(i)
            context_index = 1 

            self.tokenized_sample["id"].append(self.id)

            # Set to None the offset_mapping that are not part of the context 
            # so it's easy to determine if a token
            # position is part of the context or not.
            self.tokenized_sample["offset_mapping"][i] = [
                (o if sequence_ids[k] == context_index else None)
                for k, o in enumerate(self.tokenized_sample["offset_mapping"][i])
            ]

        self.split_the_dicts()
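
To make the offset logic in `prepare_train_sample` more concrete, here is a toy sketch that maps a character-level answer to token positions. It assumes whitespace "tokens" instead of the real subword tokenizer, and `char_span_to_token_span` is a hypothetical helper that is not part of the notebook. It returns `None` where the notebook would label the feature with the CLS index.

```python
import re


def char_span_to_token_span(offsets, answer_start, answer_text):
    """Map a character-level answer span to (start_token, end_token).

    `offsets` is a list of (char_start, char_end) pairs, one per context
    token in this window. Returns None when the answer is not fully
    contained in the window.
    """
    answer_end = answer_start + len(answer_text)
    if not (offsets[0][0] <= answer_start and offsets[-1][1] >= answer_end):
        return None
    # Last token that starts at or before the first answer character
    start_token = max(i for i, (s, _) in enumerate(offsets) if s <= answer_start)
    # First token that ends at or after the last answer character
    end_token = min(i for i, (_, e) in enumerate(offsets) if e >= answer_end)
    return start_token, end_token


context = "Delhi is the capital of India"
# Toy offsets: one (start, end) pair per whitespace-separated word
offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", context)]

answer_text = "capital of India"
answer_start = context.index(answer_text)

print(char_span_to_token_span(offsets, answer_start, answer_text))
# A window holding only the first three words does not contain the answer
print(char_span_to_token_span(offsets[:3], answer_start, answer_text))
```

The first call finds the answer between the fourth and sixth words (indices 3 and 5), while the second returns `None` because the window ends before the answer does.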


Helper functions to create the training and validation data from the `ChaiiExample` class.

In [None]:
def create_chaii_examples(raw_data, train_flag=True):
    chaii_examples = []
    for data in raw_data:
        if train_flag:
            chaii = ChaiiExample(data['question'],data['context'],\
                                 data['answer_start'],data['answer_text'])
            chaii.prepare_train_sample()
        else:
            # Keeping track of id for validation post processing
            chaii = ChaiiExample(data['question'],data['context'],id=data['id'])
            chaii.prepare_validation_sample()

        # Since we are creating multiple features from the same document,
        # it is essential to take care while adding the record for training
        if isinstance(chaii.tokenized_sample, dict):
            chaii_examples.append(chaii.tokenized_sample)
        else:
            chaii_examples.extend(chaii.tokenized_sample)

    return chaii_examples

In [None]:
def create_inputs_targets(chaii_examples):
    # Creating the training data dictionary
    dataset_dict = {
        "input_ids": [],
        "attention_mask": [],
        "start_positions": [],
        "end_positions": [],
    }
    
    for item in chaii_examples:
        for key in dataset_dict:
            dataset_dict[key].append(item[key])

    # Converting all to numpy array
    for key in dataset_dict:
        dataset_dict[key] = np.array(dataset_dict[key])

    x = [
        dataset_dict["input_ids"],
        dataset_dict["attention_mask"],
    ]
    y = [dataset_dict["start_positions"], dataset_dict["end_positions"]]
    return x, y

In [None]:
def create_eval_inputs(chaii_examples):
    # The evaluation consists of only the inputs
    # The targets are directly taken from the validation records
    dataset_dict = {
        "input_ids": [],
        "attention_mask": [],
        "id": [],
        "offset_mapping": [],
    }
    
    for item in chaii_examples:
        for key in dataset_dict:
            dataset_dict[key].append(item[key])

    x = [
        np.array(dataset_dict["input_ids"]),
        np.array(dataset_dict["attention_mask"]),
    ]
    # No y is returned here
    return x

In [None]:
train_chaii_examples = create_chaii_examples(train_records)
x_train, y_train = create_inputs_targets(train_chaii_examples)
print(f"{len(train_chaii_examples)} training points created.")

eval_chaii_examples = create_chaii_examples(valid_records, train_flag=False)
x_eval = create_eval_inputs(eval_chaii_examples)
print(f"{len(eval_chaii_examples)} evaluation points created.")

## **Model Creation, Post-processing and Evaluation**

### **Creating the model** 
Create the Question-Answering model using HF and the Keras Functional API.

In [None]:
def create_model():
    ## This code is directly from the Keras example ##
    encoder = TFXLMRobertaModel.from_pretrained(base_model)
    input_ids = layers.Input(shape=(max_length,), dtype=tf.int32)
    attention_mask = layers.Input(shape=(max_length,), dtype=tf.int32)
    embedding = encoder(
        input_ids, attention_mask=attention_mask
    )[0]

    start_logits = layers.Dense(1, name="start_logit", use_bias=False)(embedding)
    start_logits = layers.Flatten()(start_logits)

    end_logits = layers.Dense(1, name="end_logit", use_bias=False)(embedding)
    end_logits = layers.Flatten()(end_logits)

    start_probs = layers.Activation(keras.activations.softmax)(start_logits)
    end_probs = layers.Activation(keras.activations.softmax)(end_logits)

    model = keras.Model(
        inputs=[input_ids, attention_mask],
        outputs=[start_probs, end_probs],
    )
    loss = keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    optimizer = keras.optimizers.Adam(learning_rate=5e-5)
    model.compile(optimizer=optimizer, loss=[loss, loss])
    return model
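
For intuition about what the two heads compute, here is a rough NumPy sketch of the start head only. The encoder output is replaced by random numbers, the sizes are toy values, and `true_start` is a made-up label, so this only illustrates the shapes and the loss, not the real model.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 6, 8  # toy sizes, not the real ones

# Stand-in for the encoder output: one hidden vector per token
embedding = rng.normal(size=(batch, seq_len, hidden))
# Weights of a Dense(1, use_bias=False) layer
w_start = rng.normal(size=(hidden, 1))

# Dense(1) gives one score per token; Flatten drops the trailing axis
start_logits = (embedding @ w_start).reshape(batch, seq_len)
# Softmax over the token positions of each feature
start_probs = softmax(start_logits, axis=-1)

# Sparse categorical cross-entropy against hypothetical start positions
true_start = np.array([1, 4])
loss = -np.log(start_probs[np.arange(batch), true_start]).mean()

print(start_probs.shape, float(loss))
```

The end head works the same way with its own weights, and the two losses are added during training.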

In [None]:
model = create_model()

In [None]:
model.summary()

### **Output post-processing** 

* The model outputs one start score and one end score for every token of every feature *(softmax probabilities here, since the model ends with a softmax)*. One can take the `argmax` *(index of the max element)* of each to pick the answer span.
* But taking the two `argmax` values independently can give a *start_index* that is greater than the *end_index*. To avoid such cases, it is advised to rank candidate spans by the score obtained by **adding** the start and end scores, and to discard the invalid ones.
* To avoid checking all possible combinations, only the top-k start and end predictions are considered. This is controlled by a parameter, called `n_best_size` in this notebook.
* It is also imperative to create a mapping between the validation inputs and the predictions to enable this processing. This is where the `id` stored with the validation features comes in handy.
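
The span selection described in the bullets above can be sketched on a single toy feature. The scores below are made up, and `best_span` is a hypothetical helper that leaves out the offset and `id` bookkeeping of the full function.

```python
import numpy as np


def best_span(start_scores, end_scores, n_best_size=3, max_answer_length=4):
    """Return (score, start, end) of the best valid span, or None."""
    # Indices of the n_best_size highest scores, best first
    start_candidates = np.argsort(start_scores)[::-1][:n_best_size]
    end_candidates = np.argsort(end_scores)[::-1][:n_best_size]
    best = None
    for s in start_candidates:
        for e in end_candidates:
            # Skip spans that end before they start or are too long
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_scores[s] + end_scores[e]
            if best is None or score > best[0]:
                best = (float(score), int(s), int(e))
    return best


start_scores = np.array([0.05, 0.10, 0.60, 0.05, 0.20])
end_scores = np.array([0.05, 0.50, 0.05, 0.30, 0.10])

# Independent argmax gives start=2, end=1, which is not a valid span
naive = (int(np.argmax(start_scores)), int(np.argmax(end_scores)))
print(naive)

# Scoring pairs jointly picks the best span that is actually valid
print(best_span(start_scores, end_scores))
```

Here the independent `argmax` pair is unusable because the end comes before the start, while the joint search settles on tokens 2 to 3.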

In [None]:
def post_process_results(all_pred_start, all_pred_end, original_records, hf_features, n_best_size = 20, max_answer_length = 30):
    # Build a map example to its corresponding features.
    example_id_to_index = {k['id']: i for i, k in enumerate(original_records)}
    features_per_example = collections.defaultdict(list)
    for i, feature in enumerate(hf_features):
        features_per_example[example_id_to_index[feature["id"]]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()
    min_null_score = None

    for idx, example in enumerate(original_records):
        feature_indices = features_per_example[idx]

        valid_answers = []
        context = example['context']

        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # Get the predictions of the model for this feature
            pred_start = all_pred_start[feature_index]
            pred_end = all_pred_end[feature_index]

            # Offset to map the position of predicted span with the span of the 
            # answer in the context
            offset_mapping = hf_features[feature_index]["offset_mapping"]

            # Update minimum null prediction.
            cls_index = hf_features[feature_index]["input_ids"].index(tokenizer.cls_token_id)
            feature_null_score = pred_start[cls_index] + pred_end[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score


            # Go through all possibilities for the `n_best_size` greater start and end.
            start_indexes = np.argsort(pred_start)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(pred_end)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": pred_start[start_index] + pred_end[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key=lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        # The final answer
        predictions[example["id"]] = best_answer["text"]

    return predictions
            

### **Create the Jaccard evaluation callback**

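Jaccard similarity is widely used in NLP tasks. Think of it as a metric that measures how close the predicted answer is to the exact answer, here by comparing the two strings as sets of whitespace-separated words. The higher it is, the better the model fares.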

**Jaccard similarity coefficient score.**

Definition - 

> The Jaccard index, or Jaccard similarity coefficient, is the size of the intersection divided by the size of the union of two label sets. It is used to compare the set of predicted labels for a sample to the corresponding set of true labels.
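
As a small worked example of this definition applied to words, here is a standalone sketch. `word_jaccard` is a hypothetical helper that follows the same convention as the callback's `safe_div`, where two empty strings count as a perfect match.

```python
def word_jaccard(truth, prediction):
    """Word-level Jaccard similarity between two strings."""
    a = set(truth.split())
    b = set(prediction.split())
    union = a | b
    if not union:
        # Both strings are empty: treated as a perfect match
        return 1.0
    return len(a & b) / len(union)


# Shared words: {quick, brown}; all words: {the, quick, brown, fox, dog}
print(word_jaccard("the quick brown fox", "quick brown dog"))  # 0.4
```

Two of the five distinct words are shared, so the score is 2 / 5.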




In [None]:
class JaccardScore(keras.callbacks.Callback):
    def __init__(self, x_eval):
        # Initialise the base Callback before storing our own state
        super().__init__()
        self.x_eval = x_eval
        
    def safe_div(self, x,y):
        if y == 0:
            return 1
        return x / y

    def jaccard(self, str1, str2): 
        a = set(str1.split()) 
        b = set(str2.split())
        c = a.intersection(b)
        return self.safe_div(float(len(c)) , (len(a) + len(b) - len(c)))

    def get_jaccard_score(self, y_true,y_pred):
        assert len(y_true)==len(y_pred)
        score=0.0
        for i in range(len(y_true)):
            score += self.jaccard(y_true[i], y_pred[i])
            
        return score

    def on_epoch_end(self, epoch, logs=None):
        y_pred = []
        y_true = []
        pred_start, pred_end = self.model.predict(self.x_eval)
        # The post processing function to select only the best output
        predictions = post_process_results(pred_start, pred_end, \
                                           valid_records, eval_chaii_examples)
        for idx, record in enumerate(valid_records):
            y_true.append(record['answer_text'])
            y_pred.append(predictions[record['id']])

        score = self.get_jaccard_score(y_true,y_pred)
        epoch_jaccard = score/len(predictions)

        print(f"epoch={epoch+1}, jaccard score={epoch_jaccard:.2f}")

## **Train and Evaluate**

Finally! Never thought I would get here.

In [None]:
epochs = 1  # training only for one epoch to quickly check the flow
batch_size = 4

jaccard_score_callback = JaccardScore(x_eval)
model.fit(
    x_train,
    y_train,
    epochs=epochs,
    verbose=2,
    batch_size=batch_size,
    callbacks=[jaccard_score_callback],
)

## **Submit** 

Not a very good score. In fact, not up to the baseline at all!😢

In [None]:
test = pd.read_csv("../input/chaii-hindi-and-tamil-question-answering/test.csv")
test_records = test.to_dict('records')

test_chaii_examples = create_chaii_examples(test_records, train_flag=False)
x_test = create_eval_inputs(test_chaii_examples)
print(f"{len(test_chaii_examples)} evaluation points created.")

In [None]:
pred_start, pred_end = model.predict(x_test)
predictions = post_process_results(pred_start, pred_end, test_records, test_chaii_examples)

In [None]:
submit_df = pd.DataFrame({'id': list(predictions.keys()), 'PredictionString': list(predictions.values())})
submit_df.to_csv('submission.csv', index=False)
submit_df.head()

## Closing notes

- Training takes a veeeerrrry long time.🙁
- Comments on various things that can make this notebook better are always welcome. One is surely getting rid of record conversion.
- Despite a bad score, I think I am better equipped on the 4 things I set out to learn *(not sure of the last one though!)*. 

