# Objective

The goal of this notebook is to use **data augmentation** to improve the textual entailment classification across under-represented languages. 

For a beginner's tutorial on implementing a baseline model for textual entailment recognition, you can check this [notebook](https://www.kaggle.com/wchowdhu/hands-on-nli-w-transformers-m-bert-xlm-roberta).

If you are interested in exploring this topic further, you can check my [Udacity capstone project](https://github.com/wchowdhu/udacity-capstone-project).

# Data Augmentation
The amount of labeled data available to train a machine learning model can limit the model's performance. This is especially true of deep learning-based NLP models, which generally need large amounts of annotated training examples to distinguish between the output classes. However, manually annotating additional data is expensive and time-consuming. To increase the number of training examples in low-resource languages, **data augmentation**, in the form of **back-translation**, is used to generate additional synthetic data from the original train data.

In **back-translation**, the input text is translated into another language and then translated back into the original language. This generates text that uses different words while preserving the meaning of the input.


An example of back-translation is provided below.

**Input**:
*The rules developed in the interim were put together with these comments in mind.* 

**Back-Translation**:
*The rules developed in the meantime have been drawn up taking these comments into account.* 
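The round trip itself is easy to sketch. The block below is a minimal illustration, not the translation setup used later in this notebook: `toy_translate`, `TOY_TABLE`, and `back_translate_text` are hypothetical names, and the phrase table stands in for a real translation service so that the example runs offline.

```python
from typing import Callable

# Hypothetical stand-in for a real translation service: a tiny phrase table.
# A real pipeline would call a translation API here instead.
TOY_TABLE = {
    ("en", "fr"): {"in the interim": "entre-temps"},
    ("fr", "en"): {"entre-temps": "in the meantime"},
}


def toy_translate(text: str, src: str, dest: str) -> str:
    """Replace the phrases known for this language pair; leave the rest untouched."""
    for phrase, replacement in TOY_TABLE.get((src, dest), {}).items():
        text = text.replace(phrase, replacement)
    return text


def back_translate_text(text: str, src: str, pivot: str,
                        translate: Callable[[str, str, str], str]) -> str:
    """Translate src -> pivot -> src and return the round-tripped text."""
    pivot_text = translate(text, src, pivot)
    return translate(pivot_text, pivot, src)


original = "The rules developed in the interim were put together."
paraphrase = back_translate_text(original, "en", "fr", toy_translate)
print(paraphrase)
```

Passing the translator in as a function keeps the round-trip logic independent of whichever translation backend is actually used.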

# Problem
In the [Contradictory, My Dear Watson](https://www.kaggle.com/c/contradictory-my-dear-watson/overview) competition, the task is to build a system that automatically classifies how pairs of sentences are related from texts in 15 different natural languages. 

Let's look at the distribution of examples across all the languages in the original train data.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

train_df = pd.read_csv("../input/contradictory-my-dear-watson/train.csv")
labels, frequencies = np.unique(train_df.language.values, return_counts=True)
plt.figure(figsize = (10,10))
plt.pie(frequencies,labels = labels, autopct = '%1.1f%%')
plt.show()

From the above plot, we observe that more than half of the training examples are in English, as data resources are abundant in this language. The rest of the data is spread fairly evenly across the other 14 languages.



As the majority of the languages are under-represented in the original train data, this might affect the classification performance per language. To alleviate data scarcity in these languages, an augmented dataset is generated before training, concatenated with the original train subset, and then fed into the data loaders to train the model.
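As a rough sketch of that concatenation step, assuming both datasets share the same columns: the two small data frames below are made-up examples, and the de-duplication key (`premise` plus `hypothesis`) is an assumption chosen for illustration.

```python
import pandas as pd

# Made-up original examples (label 2 = contradiction in this toy setup).
original = pd.DataFrame({
    "premise": ["A man sleeps.", "Un homme dort."],
    "hypothesis": ["A man is awake.", "Un homme est réveillé."],
    "language": ["English", "French"],
    "label": [2, 2],
})

# Made-up synthetic examples; the second row repeats an original pair.
synthetic = pd.DataFrame({
    "premise": ["Un homme est endormi.", "Un homme dort."],
    "hypothesis": ["Un homme est éveillé.", "Un homme est réveillé."],
    "language": ["French", "French"],
    "label": [2, 2],
})

combined = (
    pd.concat([original, synthetic], ignore_index=True)
    .drop_duplicates(subset=["premise", "hypothesis"])  # drop pairs that add nothing new
    .sample(frac=1, random_state=0)                     # shuffle the rows
    .reset_index(drop=True)
)
print(combined.language.value_counts())
```

Back-translation sometimes returns the input unchanged, so removing exact duplicates before training avoids over-weighting those pairs.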

The data augmentation process is depicted in the flowchart below.

In [None]:
from IPython import display
display.Image("../input/figures/data_augmentation_workflow.png")

# Split the Training Data

We split the training dataset into two parts: the data used to train the model and a validation set. The split is stratified to preserve the original distribution of the target classes in both parts.
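To see what stratification preserves, here is a small check on made-up data (the `toy` frame and its 50/30/20 class counts are assumptions for illustration). It compares the class proportions of the full set with those of the held-out part.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data set with an imbalanced 50/30/20 class distribution.
toy = pd.DataFrame({
    "text": [f"example {i}" for i in range(100)],
    "label": [0] * 50 + [1] * 30 + [2] * 20,
})

train_part, valid_part = train_test_split(
    toy, test_size=0.20, stratify=toy.label, random_state=42
)

# With stratify set, both parts keep the class proportions of the full set.
print(toy.label.value_counts(normalize=True).sort_index())
print(valid_part.label.value_counts(normalize=True).sort_index())
```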

In [None]:
from sklearn.model_selection import train_test_split

# Create train-val split data
print ("Creating training and validation split csv files...")
# Stratify ensures that each sub-set contains approximately the same percentage of samples of each target class as the original set.
train_df, validation_df = train_test_split(train_df, stratify=train_df.label.values, 
                                                      random_state=42, 
                                                      test_size=0.20, shuffle=True)


train_df.reset_index(drop=True, inplace=True)
validation_df.reset_index(drop=True, inplace=True)
    
# check the number of rows and columns in the subsets after split
print("Training data shape after split: {}".format(train_df.shape))
print("Validation data shape after split: {}".format(validation_df.shape))

# Implement Baseline

We will first implement the baseline model using XLM-RoBERTa with the original train data. Then the train data will be augmented with back-translated data, and a new model will be fine-tuned on this augmented train data.

In [None]:
# install the latest version of the following libraries
!pip install datasets #to load xnli dataset from huggingface library
!pip install googletrans==3.1.0a0 # for back-translation

In [None]:
# import libraries
import os
from transformers import AutoTokenizer, AutoConfig, TFAutoModel    
from transformers import XLMRobertaConfig, XLMRobertaTokenizer, TFXLMRobertaModel  
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
import tensorflow as tf
import tensorflow.keras.backend as K
import os.path
from os import path
from tensorflow.keras.layers import Input, Dropout, Dense
from sklearn.model_selection import train_test_split
from datasets import load_dataset, list_datasets
from tqdm import tqdm
from googletrans import Translator
import time
import glob
import seaborn as sns

os.environ["WANDB_API_KEY"] = "0" # to silence warning

np.random.seed(0)

In [None]:
# configure tpu settings
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except ValueError:
    strategy = tf.distribute.get_strategy() # for CPU and single GPU
print('Number of replicas:', strategy.num_replicas_in_sync)

In [None]:
# Hyperparameter Settings
EPOCHS = 4
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
MAX_LEN = 120
PATIENCE = 1
LEARNING_RATE = 1e-5

In [None]:
# set up the tokenizer
PRETRAINED_MODEL_TYPES = {
    'xlmroberta': (XLMRobertaConfig, TFXLMRobertaModel, XLMRobertaTokenizer, 'jplu/tf-xlm-roberta-large')
}

config_class, model_class, tokenizer_class, model_name = PRETRAINED_MODEL_TYPES['xlmroberta']

# Download vocabulary from huggingface.co and cache.
# tokenizer = tokenizer_class.from_pretrained(model_name) 
tokenizer = AutoTokenizer.from_pretrained(model_name) #fast tokenizer

tokenizer

In [None]:
def encode(df, tokenizer, max_len=50):
    ''' Function to vectorize the input data '''
    pairs = df[['premise','hypothesis']].values.tolist() #shape=[num_examples]
    
    print ("Encoding...")
    # pad every pair to max_len so the tensors match the model's fixed input shape
    encoded_dict = tokenizer.batch_encode_plus(pairs, max_length=max_len, padding='max_length', truncation=True, 
                                               add_special_tokens=True, return_attention_mask=True)
    print ("Complete")
    
    input_word_ids = tf.convert_to_tensor(encoded_dict['input_ids'], dtype=tf.int32) #shape=[num_examples, max_len]
    input_mask = tf.convert_to_tensor(encoded_dict['attention_mask'], dtype=tf.int32) #shape=[num_examples, max_len]
    
    inputs = {
        'input_word_ids': input_word_ids,
        'input_mask': input_mask}    
    
    return inputs

In [None]:
# process input data
print("Processing Training Data:")
train_input = encode(train_df, tokenizer=tokenizer, max_len=MAX_LEN)
print("\nProcessing Validation Data:")
validation_input = encode(validation_df, tokenizer=tokenizer, max_len=MAX_LEN)

In [None]:
def build_model(max_len, model_name, model_class):
    """
    Define the model architecture
    """
    
    tf.random.set_seed(123) # For reproducibility
    
    print('starting:')
    # The bare XLM-RoBERTa Model transformer outputting raw hidden-states without any specific head on top.
    encoder = model_class.from_pretrained(model_name)
#     encoder = TFAutoModel.from_pretrained(model_name)
    print('pretrained model:')
    
    input_word_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    input_type_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_type_ids")
    
    # Extract final layer feature vectors
    features = encoder([input_word_ids, input_mask])[0] # shape=(batch_size, max_len, output_size)
    
    # We pass only the vector of the first token (<s>, XLM-R's counterpart of [CLS], at index=0) to the classification layer
    sequence = features[:,0,:] #shape=(batch_size, output_size)
#     sequence = features.pooler_output #shape: [batch_size, output_size]
   
    # Add a classification layer
    output = tf.keras.layers.Dense(3, activation="softmax")(sequence)  
    model = tf.keras.Model(inputs=[input_word_ids, input_mask], outputs=output)        
    model.compile(tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model


# Instantiating the model in the strategy scope creates the model on the TPU
with strategy.scope():
    model = build_model(MAX_LEN, model_name, model_class)
    model.summary()

In [None]:
# Fit the model to original train data
callbacks = [EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=PATIENCE)]
train_history = model.fit(x=train_input, y=train_df.label.values, validation_data=(validation_input, validation_df.label.values), epochs=EPOCHS, verbose=1, batch_size=BATCH_SIZE, callbacks=callbacks)

In [None]:
# save validation predictions to compare the results later
val_predictions = [np.argmax(i) for i in model.predict(validation_input)]
validation_df['prediction'] = val_predictions
validation_df.to_csv("validation_predictions_original.csv", index=False)

In [None]:
del model #to free up space

In [None]:
# Resets all state generated by Keras
K.clear_session()

# Fine-tune the Model with Augmented Data

Next, we generate back-translations of the original train data and retrain an XLM-R model on the augmented and auxiliary data. Because generating back-translations takes some time, I generated them in advance with a Google Translate API and saved them in a CSV file. You can access the dataset [here](https://www.kaggle.com/wchowdhu/backtranslations-for-data-augmentation-in-nlp). If you want to generate new back-translations, set `BT_DIR = ''`.

In [None]:
# Hyper-parameter Settings
LOAD_XNLI = True #use auxiliary XNLI data
BACK_TRANSLATE = True #enable data augmentation
# directory containing the back-translations of input training data
BT_DIR = '../input/backtranslations-for-data-augmentation-in-nlp'

In [None]:
def back_translate(train_df, target_lang='fr', sample=True, num_samples_per_lang=1000):
    ''' Function to back-translate data examples '''
    if sample: #sample input training data to back translate
        train_df = train_df.groupby('language', group_keys=False).apply(lambda x: x.sample(min(len(x), num_samples_per_lang))).reset_index(drop=True)  

    df_list = []
    limit_before_timeout = 100
    timeout = 5
    
    translator = Translator() 
    
    # Add functions to back translate input sentences
    def target_translate(x, target_lang):
        translation = translator.translate(x, dest=target_lang)
        return translation.text
    def source_translate(x, source_lang):
        translation = translator.translate(x, dest=source_lang) 
        return translation.text 
    
    for i in tqdm(range(len(train_df))):
        entry = train_df.loc[[i]]
        source_lang = entry.lang_abv.values.tolist()[0]
        if source_lang == 'zh':
            #print(googletrans.LANGUAGES) 
            source_lang = 'zh-cn' #'zh' not in googletrans.LANGUAGES        
        if (i!=0) and (i%limit_before_timeout == 0): #apply timeout after every 100 iterations 
            print('Iteration {} of {}'.format(i, len(train_df)))
            time.sleep(timeout)      
        # Back translate premise sentence
        entry['premise'] = entry['premise'].apply(lambda x: target_translate(x, target_lang))
#         time.sleep(0.2)
        entry['premise'] = entry['premise'].apply(lambda x: source_translate(x, source_lang))
#         time.sleep(0.2)       
        # Back translate hypothesis sentence
        entry['hypothesis'] = entry['hypothesis'].apply(lambda x: target_translate(x, target_lang))
#         time.sleep(0.2)
        entry['hypothesis'] = entry['hypothesis'].apply(lambda x: source_translate(x, source_lang))
#         time.sleep(0.2)
        df_list.append(entry)
    
    train_bt = pd.concat(df_list, ignore_index=True)
    print("Shape of back-translated training data: {}".format(train_bt.shape))
    return train_bt

In [None]:
def process_xnli_data(all_keys=False, only_train=False): 
    ''' Function to process XNLI data '''
    if only_train:
        print ("Splitting by machine-translated train data only")
        split = 'train'
    elif all_keys:
        print ("Splitting by all keys")
        split = 'validation+test+train[:5%]'
    else:
        print ("Splitting by human-translated validation and test data")
        split = 'validation+test'
    
    print("Loading XNLI data...")
    print("Split: ", split)
    
    dataset = load_dataset('xnli', 'all_languages', split=split) #returns a Dataset object  
    print(dataset)
    
    entries = []   
    for entry in tqdm(dataset): 
        hypothesis_langs = entry['hypothesis']['language'] #list of 15 lang string values
        hypothesis_values = entry['hypothesis']['translation'] #list of 15 hypothesis string values

        premise_langs = list(entry['premise'].keys()) #list of 15 lang string values
        premise_values = list(entry['premise'].values()) #list of 15 premise string values

        labels = [entry['label']]*len(hypothesis_langs) #all 15 languages for the same example have same label 

        if premise_langs == hypothesis_langs: #the languages in premise and hypothesis are in same order
            values = list(zip(premise_values, hypothesis_values, hypothesis_langs, hypothesis_langs, labels))
            entries += values

    xnli_df = pd.DataFrame(entries, columns=['premise', 'hypothesis', 'lang_abv', 'language', 'label']) #create dataframe for each key
    
    xnli_df['language'].replace({"en": "English", "ar": "Arabic", "sw": "Swahili", "th": "Thai", "vi": "Vietnamese", "es": "Spanish", "bg": "Bulgarian", "zh": "Chinese", "ur": "Urdu", "ru": "Russian", "hi": "Hindi", "fr": "French", "tr": "Turkish", "el": "Greek", "de": "German"}, inplace=True)

    xnli_df['id'] = xnli_df.index + 1
    column_names = ['id', 'premise', 'hypothesis', 'lang_abv', 'language', 'label']
    xnli_df = xnli_df.reindex(columns=column_names)
    
    # Get the number of missing data points per column
    missing_values_xnli = xnli_df.isnull().sum() 

    print("Number of missing data points per column in XNLI corpus:")
    print (missing_values_xnli)

    # Drop the missing value rows
    xnli_df.dropna(axis=0, inplace=True)
#     print("Total number of data examples in XNLI corpus after dropping NA values: {}".format(xnli_df.shape[0]))
    
    print("XNLI corpus shape: {}".format(xnli_df.shape))
    
    del dataset #free up space
    
    return xnli_df

In [None]:
def augment_data(train_df, use_xnli=True, use_bt=True, bt_dir=''):
    ''' Function for adding augmented and auxiliary train data '''
    df_list = []  
    if use_bt: #use back-translation
        if path.isdir(bt_dir):
            files = glob.glob(bt_dir+'/*.csv')
            bt_list = []
            for filename in files:
                bt_list.append(pd.read_csv(filename))
            bt_df = pd.concat(bt_list, ignore_index=True)
#             bt_df.isnull().sum() # we get the number of missing values
            bt_df.dropna(inplace=True) #remove missing values
            bt_df.drop_duplicates(inplace=True) #drop duplicate rows  
            bt_df = bt_df.sample(frac=1) #shuffle the back-translated data (frac=1 keeps every row)
            print("Shape of back-translated training data: {}".format(bt_df.shape))
            bt_df.to_csv('back_translation_all.csv', index=False)
        else:
            bt_df = back_translate(train_df) #no saved back-translations found, so generate them
        df_list.append(bt_df)
        del bt_df #free up space 
    if use_xnli: #use auxiliary data
        xnli_df = process_xnli_data()
        df_list.append(xnli_df)
        del xnli_df #free up space 
        
    if len(df_list) > 0: #augment data
        augmented_df = pd.concat(df_list, ignore_index=True)
        train_df = pd.concat([train_df, augmented_df], ignore_index=True) #DataFrame.append was removed in pandas 2.0
        
    train_df.drop_duplicates(inplace=True) #drop duplicate rows
    train_df = train_df.sample(frac=1).reset_index(drop=True) #shuffle data and rebuild a contiguous index
    return train_df


augmented_df = augment_data(train_df, LOAD_XNLI, BACK_TRANSLATE, BT_DIR) 
augmented_df.head(100)

In [None]:
# check the number of rows and columns in the augmented train data
print("Augmented train data shape: {}".format(augmented_df.shape))

The chart below compares the distribution of languages in the train data before and after data augmentation. Adding augmented and auxiliary data reduces the share of English in the train data from ~56% to ~10%, while increasing the number of samples in all other languages. The augmented train data is therefore much more evenly balanced across languages.
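The same comparison can also be read as a table. The sketch below uses made-up language columns (`before` and `after` are hypothetical, and their counts do not come from the competition data) to show how the per-language shares and their change can be lined up side by side.

```python
import pandas as pd

# Made-up language columns; the counts are for illustration only.
before = pd.Series(["English"] * 6 + ["Swahili"] * 1 + ["Urdu"] * 1, name="language")
after = pd.Series(["English"] * 8 + ["Swahili"] * 6 + ["Urdu"] * 6, name="language")

# Share of each language before and after, aligned on the language name.
coverage = pd.concat(
    [before.value_counts(normalize=True), after.value_counts(normalize=True)],
    axis=1,
    keys=["original", "augmented"],
).sort_index()
coverage["change"] = coverage["augmented"] - coverage["original"]
print(coverage.round(3))
```

A negative `change` means the language lost share even if its absolute number of examples grew, which is what happens to English here.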

In [None]:
orig_lang_dist = train_df.language.value_counts(normalize=True).sort_index()
aug_lang_dist = augmented_df.language.value_counts(normalize=True).sort_index()

# fig = plt.figure() # Create matplotlib figure
fig = plt.figure(figsize = (10,10))
ax = fig.add_subplot(111) # Create matplotlib axes
width = 0.4
orig_lang_dist.plot(kind='bar', color='blue', ax=ax, width=width, position=1, label='Original Train Data')
aug_lang_dist.plot(kind='bar', color='green', ax=ax, width=width, position=0, label='Augmented Train Data')
ax.set_xlabel('Language')
ax.set_ylabel('Train Data Coverage')
plt.legend(loc="upper right")
plt.tight_layout()
plt.savefig('data_coverage_comparison_bt.png')
plt.show()

In [None]:
# process the new augmented train data
train_input = encode(augmented_df, tokenizer=tokenizer, max_len=MAX_LEN)
train_input

In [None]:
# instantiate a new XLM-R model
with strategy.scope():
    model = build_model(MAX_LEN, model_name, model_class)
    model.summary()

In [None]:
# fine-tune model with augmented train data
checkpoint_filepath='best_checkpoint.hdf5' #save the best checkpoint
callbacks = [EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=PATIENCE), ModelCheckpoint(filepath=checkpoint_filepath, save_best_only=True, save_weights_only=True, monitor='val_loss', mode='min', verbose=1)]
train_history = model.fit(x=train_input, y=augmented_df.label.values, validation_data=(validation_input, validation_df.label.values), epochs=EPOCHS, verbose=1, batch_size=BATCH_SIZE, callbacks=callbacks)

In [None]:
# save validation predictions to compare results later
val_predictions = [np.argmax(i) for i in model.predict(validation_input)]
validation_df['prediction'] = val_predictions
validation_df.to_csv("validation_predictions_augmented.csv", index=False)

# Visualize and Compare Results

To compare the performance of the two models across all the languages, we calculate the number of correct predictions for each language in the validation data. 

The figure below gives an overview of the validation accuracy, in percent, across all the languages for each model. Augmenting the original train data with back-translations increases the classification performance for the majority of the languages, and using the auxiliary XNLI corpus notably reduces the gap in accuracy between the languages. The scores of most languages now cluster in a relatively small range, and we see significant gains in low-resource languages like Swahili, Urdu, and Thai. 
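Per-language accuracy can also be computed with a single `groupby`. This is a minimal sketch on made-up predictions, assuming the same `language`, `label`, and `prediction` columns as the saved validation files; it is an alternative formulation, not the function used in the next cell.

```python
import pandas as pd

# Made-up validation predictions for three languages.
validation = pd.DataFrame({
    "language": ["English", "English", "Thai", "Thai", "Urdu", "Urdu"],
    "label": [0, 1, 2, 0, 1, 2],
    "prediction": [0, 1, 2, 1, 0, 0],
})

# The mean of a boolean column is the fraction of correct predictions.
per_language = (
    validation.assign(correct=validation.label == validation.prediction)
    .groupby("language")["correct"]
    .mean()
    .mul(100)
    .round(1)
)
print(per_language)
```

Because every row stays in the group, a language with no correct predictions still shows up with a score of 0 instead of silently dropping out.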


In [None]:
def accuracy(x):
    return float(round(100 * x[2] / x[1])) # whole-number percentage; avoids float artifacts such as 56.99999999999999

def calculate(file):
    ''' Function to calculate accuracy per language '''
    validation = pd.read_csv(file)  
    # Calculate the total number of examples per language
    lang_counts = validation.language.value_counts().sort_index()

    # Calculate the number of correct predictions per language
    tp_per_lang = validation[validation['label'] == validation['prediction']].groupby('language').agg({'language': ['count']}).sort_index()

    lang_names = lang_counts.index.tolist()
    lang_tuples = list(zip(lang_names, lang_counts.values.tolist(), tp_per_lang.iloc[:, 0].values.tolist()))
    acc = map(accuracy, lang_tuples)
    acc_list = []
    lang_list = []
    for i, score in enumerate(acc):
        acc_list.append(score)
        lang_list.append(lang_tuples[i][0])
#         print ("Accuracy of {} is {} ".format(lang_tuples[i][0], score))
    df = pd.DataFrame({'language': lang_list, 'validation_accuracy': acc_list})
    return df


val_df_original = calculate("./validation_predictions_original.csv")
val_df_augmented = calculate("./validation_predictions_augmented.csv")

In [None]:
DF1 = val_df_original
DF2 = val_df_augmented

DF = pd.concat([DF1, DF2])
DF['model'] = ['Baseline']*15 + ['Baseline+BT+XNLI']*15

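# g = sns.catplot(data=DF, x='language', y='validation_accuracy', hue='model', kind="bar")
# factorplot was renamed to catplot in seaborn 0.9; kind="point" reproduces its old default
g = sns.catplot(data=DF, x='language', y='validation_accuracy', hue='model', kind="point")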
g.fig.set_size_inches(10, 10)
g.set_xticklabels(DF2.language, rotation=40, ha="right")
plt.xlabel("Language", size=12)
plt.ylabel("Score", size=12)
plt.savefig('model_comparison_per_lang_barplot.png')
plt.show()

We also analyze the performance of the models based on different language families. The table below shows the 15 languages grouped by language families and the next figure plots the results. 

In [None]:
display.Image("../input/figures/language_families.png")

We observe that some branches of the Indo-European language family, like Romance, Germanic, and Slavic, perform fairly well with both models. Overall, the difference in performance is highest between the Indo-European language families and low-resource language families like Niger-Congo and Tai-Kadai. Once the original train data is extended with back-translations and the XNLI corpus, the performance of these low-resource language families improves by a large margin, and the final best model yields at least 90% accuracy for most of the under-represented languages.
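To put numbers on the per-family gap, the scores can be averaged by family and model. The sketch below uses a hypothetical four-language `FAMILY` mapping and made-up accuracy values, so its output only illustrates the mechanics.

```python
import pandas as pd

# Hypothetical subset of the language-to-family mapping.
FAMILY = {
    "English": "Indo-European: Germanic",
    "German": "Indo-European: Germanic",
    "Swahili": "Niger-Congo",
    "Thai": "Tai-Kadai",
}

# Made-up validation accuracies for two models.
scores = pd.DataFrame({
    "language": ["English", "German", "Swahili", "Thai"] * 2,
    "model": ["Baseline"] * 4 + ["Baseline+BT+XNLI"] * 4,
    "validation_accuracy": [90.0, 88.0, 70.0, 72.0, 92.0, 90.0, 85.0, 86.0],
})
scores["language_family"] = scores["language"].map(FAMILY)

# One row per family, one column per model, then the difference between them.
by_family = scores.pivot_table(
    index="language_family", columns="model",
    values="validation_accuracy", aggfunc="mean",
)
by_family["gain"] = by_family["Baseline+BT+XNLI"] - by_family["Baseline"]
print(by_family)
```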

In [None]:
DF = pd.concat([DF1, DF2])
DF['Model'] = ['Baseline']*15 + ['Baseline+BT+XNLI']*15

# create a list of our conditions to assign language families
conditions = [
    (DF['language'] == 'Arabic'),
    (DF['language'] == 'Bulgarian') | (DF['language'] == 'Russian'),
    (DF['language'] == 'German') | (DF['language'] == 'English'),
    (DF['language'] == 'Greek'),
    (DF['language'] == 'Spanish') | (DF['language'] == 'French'),
    (DF['language'] == 'Hindi') | (DF['language'] == 'Urdu'),
    (DF['language'] == 'Chinese'),
    (DF['language'] == 'Swahili'),
    (DF['language'] == 'Thai'),
    (DF['language'] == 'Turkish'),
    (DF['language'] == 'Vietnamese'),
    ]

# create a list of the values we want to assign for each condition
values = ['Afro-Asiatic', 'Indo-European: Slavic', 'Indo-European: Germanic', 'Indo-European: Greek', 'Indo-European: Romance', 'Indo-European: Indo-Aryan', 'Sino-Tibetan', 'Niger-Congo', 'Tai-Kadai', 'Turkic', 'Austro-Asiatic']

# create a new column and use np.select to assign values to it using our lists as arguments
# default must be a string so that it shares a dtype with the string choices
DF['language_family'] = np.select(conditions, values, default='Unknown')

x_labels = ['Afro-Asiatic', 'Indo-European: Slavic', 'Sino-Tibetan', 'Indo-European: Germanic', 'Indo-European: Romance', 'Indo-European: Greek', 'Indo-European: Indo-Aryan', 'Niger-Congo', 'Tai-Kadai', 'Turkic', 'Austro-Asiatic']

f, ax = plt.subplots(figsize=(10, 10))
g = sns.scatterplot(data=DF, x='language_family', y='validation_accuracy', hue='Model', style='Model', palette='dark', hue_order=['Baseline', 'Baseline+BT+XNLI'])
g.set_xticklabels(x_labels, rotation=40, ha="right")
plt.xlabel("Language Family", size=12)
plt.ylabel("Score", size=12)
# place the legend outside the figure/plot
plt.legend(bbox_to_anchor=(1.01, 1),borderaxespad=0)
plt.tight_layout()
plt.savefig('model_comparison_valscore_per_lang_family.png')
plt.show()

# Generate Predictions on Test Data and Submit

In [None]:
# The model weights (that are considered the best) are loaded into the model.
model.load_weights(checkpoint_filepath)

In [None]:
#encode the test-input sequences
test_df = pd.read_csv("../input/contradictory-my-dear-watson/test.csv")
test_input = encode(test_df, tokenizer=tokenizer, max_len=MAX_LEN)

In [None]:
predictions = [np.argmax(i) for i in model.predict(test_input)]

submission = test_df.id.copy().to_frame()
submission['prediction'] = predictions
submission.head()

In [None]:
submission.to_csv("submission.csv", index = False)

That's it! The submission file has been created. For more information on how to submit to the competition, please visit the following [link](https://www.kaggle.com/c/contradictory-my-dear-watson/overview/evaluation).




<span style="color:blue">If you find this notebook helpful, please kindly upvote:-)</span>