# Brief Introduction

### Tamil

- Tamil is a Dravidian language spoken by Tamils in southern India, Sri Lanka, and elsewhere
- Tamil originated from Proto-Dravidian around 450 BCE
- Tamil belongs to the Dravidian language family and is written in the Tamil script. It is one of the four major Dravidian languages, along with Telugu, Malayalam, and Kannada
- It is widely regarded as the oldest of the Dravidian languages
- Tamil has been attested for more than 2000 years, making it one of the oldest and longest-surviving classical languages in the world
- The Tamil language is spoken widely in India, Sri Lanka, Malaysia, Singapore, South Africa and Mauritius

### Hindi

- Hindi is an Indic language of northern India that derives from Vedic Sanskrit
- Hindi is written in the Devanagari script
- Hindi belongs to the Indo-Aryan language family and is said to have originated in the 17th century CE
- It is one of the official languages of India, as is Tamil

# Competition Overview:

In this competition, the goal is to predict answers to real questions about Wikipedia articles. You will use chaii-1, a new question answering dataset with question-answer pairs. The dataset covers Hindi and Tamil, collected without the use of translation. It provides a realistic information-seeking task with questions written by native-speaking expert data annotators. 


# Competition Rules:
- CPU Notebook <= 5 hours run-time
- GPU Notebook <= 5 hours run-time
- Internet access disabled
- Freely & publicly available external data is allowed, including pre-trained models
- Submission file must be named submission.csv

# Competition Metrics:
The metric in this competition is the word-level Jaccard score:

```python
def jaccard(str1, str2):
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
```
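
As a quick sanity check of the metric, the sketch below repeats the function and scores a few made-up English string pairs (the strings are illustrative only, not competition data). One thing worth keeping in mind: as written, the function would raise a `ZeroDivisionError` if both strings were empty, so an empty prediction is only safe when the ground truth is non-empty.

```python
def jaccard(str1, str2):
    # Word-level Jaccard: |A ∩ B| / |A ∪ B| on lower-cased whitespace tokens
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

# Same words, different case: full overlap
exact = jaccard("Tamil Nadu", "tamil nadu")

# 3 shared words out of 4 distinct words in total
partial = jaccard("the quick brown fox", "the brown fox")

# No shared words
miss = jaccard("Chennai", "Delhi")

print(exact, partial, miss)
```

Because the score is computed on sets of words, word order and repeated words do not affect it.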

### Ref:
- https://keras.io/examples/nlp/text_extraction_with_bert/
- https://github.com/huggingface/notebooks/blob/master/examples/question_answering.ipynb

### External Datasets
- This notebook uses @rhtsingh's Hindi dataset: [Hindi External](https://www.kaggle.com/rhtsingh/external-data-mlqa-xquad-preprocessing/data)
- For Tamil, it uses a dataset I created: [Tamil External](https://www.kaggle.com/msafi04/squad-translated-to-tamil-for-chaii)

In [None]:
%matplotlib inline

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from tqdm import tqdm
tqdm.pandas()

import gc

from sklearn.model_selection import StratifiedKFold

import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

from tqdm.auto import tqdm
import collections

import os

from pathlib import Path

import json

plt.rcParams["figure.figsize"] = (12, 8)
plt.rcParams['axes.titlesize'] = 16
sns.set_palette('Set3_r')

pd.set_option("display.max_rows", 20, "display.max_columns", None)

print(os.listdir('../input/'))
        
from time import time, strftime, gmtime
start = time()
import datetime
print(str(datetime.datetime.now()))

import warnings
warnings.simplefilter(action = 'ignore', category = Warning)

In [None]:
train = pd.read_csv('/kaggle/input/chaii-hindi-and-tamil-question-answering/train.csv')
print(train.shape)
train.head()

In [None]:
test = pd.read_csv('/kaggle/input/chaii-hindi-and-tamil-question-answering/test.csv')
print(test.shape)
test.head()

In [None]:
sub = pd.read_csv('/kaggle/input/chaii-hindi-and-tamil-question-answering/sample_submission.csv')
print(sub.shape)
sub.head()

In [None]:
external_hindi1 = pd.read_csv('/kaggle/input/mlqa-hindi-processed/mlqa_hindi.csv')
print(external_hindi1.shape)

In [None]:
external_hindi2 = pd.read_csv('/kaggle/input/mlqa-hindi-processed/xquad.csv')
print(external_hindi2.shape)

In [None]:
print('External hindi dataset...')
external_hindi = pd.concat([external_hindi1, external_hindi2])
print(external_hindi.shape)

In [None]:
print('External Tamil dataset...')
external_tamil = pd.read_csv('/kaggle/input/squad-translated-to-tamil-for-chaii/squad_translated_tamil.csv')
external_tamil['language'] = 'tamil'
print(external_tamil.shape)

In [None]:
print('Combined External dataset...')
external_df = pd.concat([external_hindi, external_tamil])
external_df = external_df.sample(frac = 1).reset_index(drop = True)
print(external_df.shape)
external_df.head()

In [None]:
del external_hindi1, external_hindi2, external_hindi, external_tamil
gc.collect()

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2)
sns.countplot(x = 'language', data = external_df, ax = ax1).set_title('External Dataset Language Counts')
for p in ax1.patches:
    ax1.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005))
sns.countplot(x = 'language', data = test, ax = ax2).set_title('Test Language Counts')
for p in ax2.patches:
    ax2.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005))

# Hugging Face TF XLM-RoBERTa

In [None]:
import yaml

hparams = {
    'DEVICE': 'TPU',
    'EPOCHS': 3,
    'MODEL_2': '../input/jplu-tf-xlm-roberta-large',
    'N_FOLDS': 5,
    'SEED': 777,
    'VERBOSE': 1,
    'BATCH_SIZE': 16,
    'MAX_LENGTH': 512,
    'DOC_STRIDE': 128
    
}

In [None]:
import tensorflow as tf
import tensorflow.keras.backend as K

import transformers
from transformers import AutoTokenizer, TFXLMRobertaForQuestionAnswering, TFXLMRobertaModel

print(tf.__version__)
print(transformers.__version__)

In [None]:
SEED = hparams['SEED']
tf.random.set_seed(SEED)
np.random.seed(SEED)

In [None]:
model_checkpoint = hparams['MODEL_2']
batch_size = hparams['BATCH_SIZE']

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_special_tokens = True)
print(tokenizer)

In [None]:
train['context_num_tokens'] = train['context'].apply(lambda x: len(tokenizer(x)['input_ids']))
train['context_num_tokens'].max()

In [None]:
train['context_num_tokens'].hist();

- Many contexts are longer than the model's maximum input length, so they have to be split into pieces before processing
- In most NLP tasks long documents are simply truncated, but in QA tasks truncating the `context` could cut off the answer
- To avoid this, a long context is split into several input features, each no longer than the `max_length` parameter
- In case the answer falls on a split boundary, consecutive features overlap; the size of the overlap is controlled by the `doc_stride` parameter
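
The splitting idea can be illustrated on a plain list of token ids. This is only a minimal sketch: `split_with_overlap` is a hypothetical helper, not part of the tokenizer API, and the real tokenizer additionally prepends the question and special tokens to every window. Here `window` plays the role of `max_length` and `overlap` the role of `doc_stride`.

```python
def split_with_overlap(token_ids, window, overlap):
    """Split a token list into windows of at most `window` tokens.

    Consecutive windows share `overlap` tokens, so an answer that straddles
    one boundary is still fully contained in the neighbouring window
    (as long as it is not longer than the overlap).
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the text
        start += step
    return chunks

# Toy "context" of 10 tokens, windows of 4 tokens, 2 tokens of overlap
tokens = list(range(10))
chunks = split_with_overlap(tokens, window=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

A larger overlap makes it less likely that an answer is cut in half, at the cost of producing more features per example.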

In [None]:
train = train.sample(frac = 1, random_state = 2021).reset_index(drop = True)
print(train.shape)
train.head()

In [None]:
#Split data to folds
n_folds = hparams['N_FOLDS']
train['kfold'] = -1

skf = StratifiedKFold(n_splits = n_folds, shuffle = True, random_state = SEED)
for fold, (trn_idx, val_idx) in enumerate(skf.split(X = train, y = train['language'].values)):
    train.loc[val_idx, 'kfold'] = fold
train.head(2)

# Set TPU

In [None]:
DEVICE = hparams['DEVICE'] #'TPU' by default, see hparams above

if DEVICE == "TPU":
    print("connecting to TPU...")
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
        print('Running on TPU ', tpu.master())
    except ValueError:
        print("Could not connect to TPU")
        tpu = None

    if tpu:
        try:
            print("initializing  TPU ...")
            tf.config.experimental_connect_to_cluster(tpu)
            tf.tpu.experimental.initialize_tpu_system(tpu)
            strategy = tf.distribute.experimental.TPUStrategy(tpu)
            print("TPU initialized")
        except Exception as e:
            print(f"failed to initialize TPU: {e}")
            DEVICE = "GPU" #fall back so that `strategy` is still defined below
    else:
        DEVICE = "GPU"

if DEVICE != "TPU":
    print("Using default strategy for CPU and single GPU")
    strategy = tf.distribute.get_strategy()

if DEVICE == "GPU":
    print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
    

AUTO     = tf.data.experimental.AUTOTUNE
REPLICAS = strategy.num_replicas_in_sync

print(f'REPLICAS: {REPLICAS}')

In [None]:
max_length = hparams['MAX_LENGTH'] #The maximum length of a feature (question and context)
doc_stride = hparams['DOC_STRIDE'] #The allowed overlap between two parts of the context when it has to be split

pad_on_right = tokenizer.padding_side == "right"

In [None]:
def prepare_training(examples):
    examples['question'] = [q.lstrip() for q in examples['question']] #remove leading white space
    examples['question'] = [q.rstrip('?') for q in examples['question']] #remove '?' from the questions
    
    #Tokenize our examples with truncation and padding, but keep the overflows using a stride. This results
    #in one example possibly giving several features when a context is long, each of those features having a
    #context that slightly overlaps the context of the previous feature.
    
    tokenized_examples = tokenizer(
                list(examples['question' if pad_on_right else 'context'].values),
                list(examples['context' if pad_on_right else 'question'].values),
                truncation = 'only_second' if pad_on_right else 'only_first',
                max_length = max_length,
                stride = doc_stride,
                return_overflowing_tokens = True,
                return_offsets_mapping = True,
                padding = 'max_length'
            )
    #Since one example might give us several features if it has a long context, we need a map from a feature to
    #its corresponding example. This key gives us just that.
    
    sample_mapping = tokenized_examples.pop('overflow_to_sample_mapping')
    
    #The offset mappings will give us a map from token to character position in the original context. This will
    #help us compute the start_positions and end_positions.
    
    offset_mapping = tokenized_examples.pop('offset_mapping')
    
    tokenized_examples['start_positions'] = []
    tokenized_examples['end_positions'] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples['input_ids'][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)
        
        sequence_ids = tokenized_examples.sequence_ids(i)
        
        #One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples.loc[sample_index, 'answer_text']
        start_char = examples.loc[sample_index, 'answer_start']
        
        # If no answer is given, set the cls_index as answer.
        # (a missing value read from CSV is NaN rather than None, hence pd.isna)
        if pd.isna(start_char):
            tokenized_examples['start_positions'].append(cls_index)
            tokenized_examples['end_positions'].append(cls_index)
        else:
            # Start/end character idx of the answer in the text.
            end_char = start_char + len(answers)
            
            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1
            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1
            #Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples['start_positions'].append(cls_index)
                tokenized_examples['end_positions'].append(cls_index)
            else:
                #Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                #Note: we could go after the last offset if the answer is the last word (edge case).
                
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples['start_positions'].append(token_start_index - 1)
                
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples['end_positions'].append(token_end_index + 1)

    return tokenized_examples
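
The character-to-token alignment used in `prepare_training` can be tried on a toy input. The sketch below is an illustration under simplifying assumptions: whitespace-separated words stand in for the real subword tokens, and `char_span_to_token_span` is a hypothetical helper that mirrors the two `while` loops above. It returns `None` where the function above would fall back to the CLS index.

```python
import re

def char_span_to_token_span(offsets, start_char, end_char):
    """Map an answer's character span to inclusive token indices.

    `offsets` holds one (start, end) character pair per token of a window.
    Returns None when the window does not fully contain the answer.
    """
    if not offsets or offsets[0][0] > start_char or offsets[-1][1] < end_char:
        return None
    # Walk right until a token starts after the answer start, then step back
    start_token = 0
    while start_token < len(offsets) and offsets[start_token][0] <= start_char:
        start_token += 1
    start_token -= 1
    # Walk left until a token ends before the answer end, then step forward
    end_token = len(offsets) - 1
    while end_token >= 0 and offsets[end_token][1] >= end_char:
        end_token -= 1
    end_token += 1
    return start_token, end_token

context = "Chennai is in Tamil Nadu"
answer = "Tamil Nadu"
# Whitespace "tokens" with their character offsets in the context
offsets = [m.span() for m in re.finditer(r"\S+", context)]
start_char = context.index(answer)
end_char = start_char + len(answer)

# The full window contains the answer; a window cut after 3 tokens does not
full_window = char_span_to_token_span(offsets, start_char, end_char)
cut_window = char_span_to_token_span(offsets[:3], start_char, end_char)
print(full_window, cut_window)
```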

In [None]:
def prepare_validation(examples):
    examples['question'] = [q.lstrip() for q in examples['question']]
    examples['question'] = [q.rstrip('?') for q in examples['question']]
    
    tokenized_examples = tokenizer(
                list(examples['question' if pad_on_right else 'context'].values),
                list(examples['context' if pad_on_right else 'question'].values),
                truncation = 'only_second' if pad_on_right else 'only_first',
                max_length = max_length,
                stride = doc_stride,
                return_overflowing_tokens = True,
                return_offsets_mapping = True,
                padding = 'max_length'
            )
    
    sample_mapping = tokenized_examples.pop('overflow_to_sample_mapping')
    
    #id column from the dataset
    tokenized_examples['example_id'] = []

    for i in range(len(tokenized_examples['input_ids'])):
        sequence_ids = tokenized_examples.sequence_ids(i)
        context_index = 1 if pad_on_right else 0
        sample_index = sample_mapping[i]
        tokenized_examples['example_id'].append(examples.loc[sample_index, 'id'])
        tokenized_examples['offset_mapping'][i] = [
            (o if sequence_ids[k] == context_index else None)
            for k, o in enumerate(tokenized_examples['offset_mapping'][i])
        ]

    return tokenized_examples

In [None]:
def build_tf_dataset(df, batch_size = 4, flag = 'train'):
    
    if flag == 'train':
        features = prepare_training(df)
    else:
        features = prepare_validation(df)
    
    input_ids = features['input_ids']
    attn_masks = features['attention_mask']
    
    if flag == 'train':
        start_positions = features['start_positions']
        end_positions = features['end_positions']
        
        train_dataset = tf.data.Dataset.from_tensor_slices((input_ids, attn_masks, start_positions, end_positions))
        train_dataset = train_dataset.map(lambda x1, x2, y1, y2: ({'input_ids': x1, 'attention_mask': x2}, {'start_positions': y1, 'end_positions': y2}))
        train_dataset = train_dataset.shuffle(1000) #shuffle individual features before batching
        train_dataset = train_dataset.batch(batch_size)
        train_dataset = train_dataset.prefetch(AUTO)
        
        return train_dataset, features
    
    elif flag == 'valid':
        dataset = tf.data.Dataset.from_tensor_slices((input_ids, attn_masks))
        dataset = dataset.map(lambda x1, x2: ({'input_ids': x1, 'attention_mask': x2}))
        dataset = dataset.batch(batch_size)
        dataset = dataset.prefetch(buffer_size = AUTO)
        
        return dataset, features

# TF XLM-RoBERTa Model

In [None]:
def build_model():
    roberta = TFXLMRobertaModel.from_pretrained(model_checkpoint)
    
    input_ids = tf.keras.layers.Input(shape = (max_length, ), name = 'input_ids', dtype = tf.int32)
    attention_mask = tf.keras.layers.Input(shape = (max_length, ), name = 'attention_mask', dtype = tf.int32)
    
    embeddings = roberta(input_ids = input_ids, attention_mask = attention_mask)[0]
    
    x1 = tf.keras.layers.Dropout(0.1)(embeddings) 
    x1 = tf.keras.layers.Dense(1, use_bias = False)(x1)
    x1 = tf.keras.layers.Flatten()(x1)
    x1 = tf.keras.layers.Activation('softmax', name = 'start_positions')(x1)
    
    x2 = tf.keras.layers.Dropout(0.1)(embeddings) 
    x2 = tf.keras.layers.Dense(1, use_bias = False)(x2)
    x2 = tf.keras.layers.Flatten()(x2)
    x2 = tf.keras.layers.Activation('softmax', name = 'end_positions')(x2)

    model = tf.keras.models.Model(inputs = [input_ids, attention_mask], outputs = [x1, x2])
    
    optimizer = tf.keras.optimizers.Adam(learning_rate = 3e-5)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = False)
    
    model.compile(loss = [loss, loss], optimizer = optimizer)

    return model

# Metrics

In [None]:
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

# Post Process Predictions

In [None]:
def post_process_predictions(examples, features, start, end, n_best_size = 20, max_answer_length = 30):
    
    all_start_logits, all_end_logits = start, end
    # Build a map example to its corresponding features.
    example_id_to_index = {k: i for i, k in enumerate(examples['id'])}
    features_per_example = collections.defaultdict(list)
    
    for i, feature in enumerate(features['example_id']):
        features_per_example[example_id_to_index[feature]].append(i)

    # The dictionaries we have to fill.
    predictions = collections.OrderedDict()

    # Logging.
    print(f"Post-processing {len(examples)} example predictions split into {len(features['input_ids'])} features.")

    # Let's loop over all the examples!
    for example_index, example in examples.iterrows():
        # Those are the indices of the features associated to the current example.
        feature_indices = features_per_example[example_index]
        min_null_score = None # Only used if squad_v2 is True.
        valid_answers = []
        
        context = example['context']
        # Looping through all the features associated to the current example.
        for feature_index in feature_indices:
            # We grab the predictions of the model for this feature.
            start_logits = all_start_logits[feature_index]
            end_logits = all_end_logits[feature_index]
            # This is what will allow us to map some the positions in our logits to span of texts in the original
            # context.
            offset_mapping = features['offset_mapping'][feature_index]

            # Update minimum null prediction.
            cls_index = features['input_ids'][feature_index].index(tokenizer.cls_token_id)
            feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or min_null_score < feature_null_score:
                min_null_score = feature_null_score

            # Go through all possibilities for the `n_best_size` greater start and end logits.
            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()
            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
                    # to part of the input_ids that are not in the context.
                    if (
                        start_index >= len(offset_mapping)
                        or end_index >= len(offset_mapping)
                        or offset_mapping[start_index] is None
                        or offset_mapping[end_index] is None
                    ):
                        continue
                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
                        continue

                    start_char = offset_mapping[start_index][0]
                    end_char = offset_mapping[end_index][1]
                    valid_answers.append(
                        {
                            "score": start_logits[start_index] + end_logits[end_index],
                            "text": context[start_char: end_char]
                        }
                    )
        
        if len(valid_answers) > 0:
            best_answer = sorted(valid_answers, key = lambda x: x["score"], reverse=True)[0]
        else:
            # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
            # failure.
            best_answer = {"text": "", "score": 0.0}
        
        # Pick the final answer. The null-answer comparison from the SQuAD v2 recipe is not applied here,
        # so the best-scoring span is always used.
        answer = best_answer["text"]
        predictions[example['id']] = answer

    return predictions
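
The core of the span search in `post_process_predictions` can be sketched on toy scores. This is a simplified illustration, not the function above: `best_span` is a hypothetical helper, it works on a single feature, it uses plain Python sorting instead of `np.argsort`, and it skips the offset-mapping checks.

```python
def best_span(start_scores, end_scores, n_best_size=3, max_answer_length=2):
    """Return (score, start_index, end_index) of the best valid span, or None."""
    # Indices of the n_best_size highest start and end scores
    order = lambda scores: sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    start_indexes = order(start_scores)[:n_best_size]
    end_indexes = order(end_scores)[:n_best_size]

    best = None
    for s in start_indexes:
        for e in end_indexes:
            # Skip spans that end before they start or that are too long
            if e < s or e - s + 1 > max_answer_length:
                continue
            score = start_scores[s] + end_scores[e]
            if best is None or score > best[0]:
                best = (score, s, e)
    return best

# Toy per-token scores for a 4-token feature (made-up numbers)
start_scores = [0.05, 0.70, 0.15, 0.10]
end_scores = [0.05, 0.10, 0.60, 0.25]

best = best_span(start_scores, end_scores)
print(best)
```

Here the span from token 1 to token 2 wins: the pair (1, 3) is rejected for exceeding `max_answer_length`, and every other valid pair has a lower summed score.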

# Fine Tune Model

In [None]:
start_probs, end_probs = [], []
epochs = hparams['EPOCHS']
jaccard_scores = []

for fold in range(n_folds):
    print('#########' * 15)
    print(f"Fold: {fold + 1}")
    print('#########' * 15)
    train_df = train[train['kfold'] != fold]
    valid_df = train[train['kfold'] == fold]
    
    #concat external_df to train_df for training, no change in valid_df
    #iloc[:, 1:-1] drops the first ('id') and last ('kfold') columns
    train_df = pd.concat([train_df.iloc[:, 1:-1], external_df])

    train_df = train_df.reset_index(drop = True)
    valid_df = valid_df.reset_index(drop = True)
    print(train_df.shape, valid_df.shape)
    
    train_dataset, train_enc = build_tf_dataset(train_df, batch_size = batch_size, flag = 'train')
    valid_dataset, valid_enc = build_tf_dataset(valid_df, batch_size = batch_size, flag = 'valid')
    
    K.clear_session()
    
    with strategy.scope():
        model = build_model()
    if fold == 0:
        model.summary()
    
    checkpoint = tf.keras.callbacks.ModelCheckpoint(f'qa_model_{fold + 1}.h5', verbose = 1, monitor = 'loss', mode = 'min', save_best_only = True, 
                                                save_weights_only = True)

    #the datasets are already batched, so batch_size is not passed to fit() or predict()
    history = model.fit(train_dataset,
                        epochs = epochs,
                        callbacks = [checkpoint],
                        verbose = 1
                        )
    print('Predicting valid dataset...')
    start_pred, end_pred = model.predict(valid_dataset, verbose = 1)
    print(start_pred.shape, end_pred.shape)
    print('Post-process predictions...')
    valid_preds = post_process_predictions(valid_df, valid_enc, start_pred, end_pred)
    
    score = []
    for idx in range(len(valid_df)):
        str1 = valid_df['answer_text'].values[idx]
        str2 = valid_preds[valid_df.loc[idx, 'id']]
        score.append(jaccard(str1, str2))
    print(f'Jaccard Score for fold {fold + 1}: {np.mean(score)}')
    jaccard_scores.append(np.mean(score))
    
    start_probs.append(start_pred)
    end_probs.append(end_pred)
    
    del train_dataset, valid_dataset, model
    gc.collect()

Save the hyperparameters and the validation scores for the prediction notebook.

In [None]:
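#cast NumPy floats to plain Python floats so that the YAML file stays human-readable
hparams['JAC_SCORES'] = [float(score) for score in jaccard_scores]

with open('hparams.yaml', 'w') as f:
    yaml.dump(hparams, f)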

# Predict on Test Data

In [None]:
test_dataset, test_enc = build_tf_dataset(test, batch_size = batch_size, flag = 'valid')

In [None]:
start_probs, end_probs = [], []
for fold in range(n_folds):
    with strategy.scope():
        model = build_model()
    print('Loading trained model weights...')
    model.load_weights(f'qa_model_{fold + 1}.h5')
    print(f'Predicting testset - Fold: {fold + 1}...')
    start_pred, end_pred = model.predict(test_dataset, verbose = 1) #test_dataset is already batched
    print(start_pred.shape, end_pred.shape)
    start_probs.append(start_pred)
    end_probs.append(end_pred)

In [None]:
test_start_probs, test_end_probs = np.mean(start_probs, axis = 0), np.mean(end_probs, axis = 0)
predictions = post_process_predictions(test, test_enc, test_start_probs, test_end_probs)

In [None]:
sub_df = pd.DataFrame({'id': list(predictions.keys()), 'PredictionString': list(predictions.values())})
sub_df.to_csv('./submission.csv', index = False)
sub_df.head()

In [None]:
finish = time()
print(strftime("%H:%M:%S", gmtime(finish - start)))

Thanks to Kaggle and fellow Kagglers for all the learning. Nothing beats doing and learning!