Natural Language Inference (NLI) is a classic NLP (Natural Language Processing) problem: given two sentences (the _premise_ and the _hypothesis_), decide how they are related, that is, whether the premise entails the hypothesis, contradicts it, or neither.

* [Imports](#imports)
* [Download competition data](#download-data)
* [Exploratory data analysis](#eda)
    - [Sanity checks](#sanity-checks-eda)
    - [Distributions](#data-dist-eda)
    - [Premise-hypothesis length relationship](#premise-hypothesis-eda)
* [Download more data (caveats)](#download-more-data)
* [Set up TPU](#set-up-tpu)
* [Parameters](#parameters)
* [Prepare training data](#prepare-data)
* [Create & train model](#create-train-model)
    - [Small scale training](#small-training)
    - [Main model training](#main-training)
* [Generate & submit predictions](#submit-predictions)

<a id='imports'></a>
## Imports

In [None]:
!pip install -q transformers==3.0.2
!pip install -q nlp

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import gc
import os

from transformers import BertTokenizer, AutoTokenizer, TFBertModel, TFXLMRobertaModel
import tensorflow as tf
from tensorflow.keras import Input, Model, Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, LSTM, Embedding, GlobalAveragePooling1D
from tensorflow.keras.optimizers import Adam

from nlp import load_dataset

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

np.random.seed(12345)

In [None]:
pd.set_option('display.max_colwidth', 100) ## to display more characters in a pandas dataframe column 
os.environ["WANDB_API_KEY"] = "0" ## to silence warning
sns.set_context("talk", font_scale=1.05)

<a id='download-data'></a>
## Download competition data

The training set contains a premise, a hypothesis, a label (0 = entailment, 1 = neutral, 2 = contradiction), and the language of the text. For more information about what these mean and how the data is structured, check out the data page: https://www.kaggle.com/c/contradictory-my-dear-watson/data
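
As a quick illustration of the three labels, the sketch below pairs one premise with three hypotheses. The sentences are invented for this example and do not come from the competition data; `LABEL_NAMES` and `describe` are hypothetical names used only here.

```python
# Label ids as listed in the paragraph above.
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Invented sentences, for illustration only (not taken from train.csv).
premise = "A man is playing a guitar on stage."
examples = [
    ("A person is performing music.", 0),  # follows from the premise
    ("The concert is sold out.", 1),       # may or may not be true
    ("The stage is empty.", 2),            # cannot be true if the premise is
]

def describe(premise, hypothesis, label):
    """Format one premise-hypothesis pair together with its label name."""
    return f"{premise!r} -> {hypothesis!r}: {LABEL_NAMES[label]}"

for hypothesis, label in examples:
    print(describe(premise, hypothesis, label))
```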

In [None]:
# Input data files are available in the read-only "../input/" directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train = pd.read_csv("../input/contradictory-my-dear-watson/train.csv")
test = pd.read_csv("../input/contradictory-my-dear-watson/test.csv")

In [None]:
train.head()

<a id='eda'></a>
## Exploratory data analysis

<a id='sanity-checks-eda'></a>
### Sanity checks

In [None]:
## check for duplicate ids
print("Any duplicate ids (train or test): ",
      train['id'].duplicated().any() or test['id'].duplicated().any())

In [None]:
print("Train and test datasets have unique, non-overlapping ids: ",
      pd.merge(train['id'], test['id'], on = 'id', how = 'inner').shape[0] == 0)

In [None]:
print("Training data contains missing values: ", train.dropna().shape != train.shape)
print("Test data contains missing values: ", test.dropna().shape != test.shape)

<a id='data-dist-eda'></a>
### Distributions

In [None]:
print("# of examples in the training dataset: ", train.shape[0])
print("# of examples in the test dataset: ", test.shape[0])

In [None]:
print("% distribution by language - training dataset")
train['language'].value_counts(normalize=True) * 100.

In [None]:
print("% distribution by language - test dataset")
test['language'].value_counts(normalize=True) * 100.

In [None]:
## convert numeric labels to strings
train['label_str'] = train['label'].map({0 : "entailment", 1 : "neutral", 2 : "contradiction"})

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(y ='label_str', data = train, alpha=.5, palette="muted")

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(y ='language', hue = "label_str", data = train, alpha=.5, palette="muted")

<a id='premise-hypothesis-eda'></a>
### Premise-hypothesis length relationship

In [None]:
def _get_word_count(snt):
    return len(str(snt).split())

In [None]:
train['premise_len'] = train['premise'].apply(_get_word_count)
train['hypothesis_len'] = train['hypothesis'].apply(_get_word_count)
train['relative_diff'] = (train['hypothesis_len'] - train['premise_len']) * 1. / train['premise_len']

In [None]:
train.head(2)

In [None]:
train[['premise_len', 'hypothesis_len', 'relative_diff']].describe()

In [None]:
for label in ('entailment', 'neutral', 'contradiction'):
    g = sns.jointplot(x="premise_len", y="hypothesis_len",
                      kind='scatter', alpha=.5, 
                      ylim= [-5, 50], xlim= [-5, 225],
                      height=6, data=train[train['label_str'] == label])
    g.fig.subplots_adjust(top=0.9)
    g.fig.suptitle("Sentence lengths in case of " + label, fontsize=16)

In [None]:
plt.figure(figsize=(8, 5))
at_premise_length = 7
for label in ('entailment', 'neutral', 'contradiction'):
    ax = sns.distplot(train[(train['label_str'] == label) & (train['premise_len'] < at_premise_length)]['relative_diff'], 
                 hist = False, 
                 kde = True, 
                 kde_kws = {'cumulative': True}, 
                 label = label)
plt.axvline(x=0, color='k', linestyle='--')
ax.set_xlim([-10, 25])
plt.ylabel('CDF(relative_diff)')
plt.title("Premise < " + str(at_premise_length) + " words")

The CDFs suggest that if the hypothesis is shorter than the premise, the sentence pair is more likely to be tagged as "entailment" or "contradiction". As the hypothesis gets longer relative to the premise, the sentence pair is more likely to be tagged as "neutral".
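
To check a claim like this numerically rather than by eye, one option is to bucket the pairs by the sign of `relative_diff` and compare label shares per bucket. The sketch below does this in plain Python on a handful of toy rows; the rows are invented, and `label_share_by_bucket` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter, defaultdict

def relative_diff(premise, hypothesis):
    """(hypothesis_len - premise_len) / premise_len, using whitespace word counts."""
    p_len, h_len = len(premise.split()), len(hypothesis.split())
    return (h_len - p_len) / p_len

def label_share_by_bucket(rows):
    """Return {bucket: {label: share}} for shorter vs. longer-or-equal hypotheses."""
    counts = defaultdict(Counter)
    for premise, hypothesis, label in rows:
        bucket = "shorter" if relative_diff(premise, hypothesis) < 0 else "longer_or_equal"
        counts[bucket][label] += 1
    return {
        bucket: {label: n / sum(c.values()) for label, n in c.items()}
        for bucket, c in counts.items()
    }

# Toy rows invented for illustration; they are not competition data.
toy_rows = [
    ("the cat sat on the mat", "the cat sat", "entailment"),
    ("the cat sat on the mat", "no cat sat", "contradiction"),
    ("the cat sat", "the cat sat because it was tired", "neutral"),
    ("the cat sat", "the cat sat down on the mat", "neutral"),
]
shares = label_share_by_bucket(toy_rows)
print(shares)
```

On the real data, the same grouping could be applied to the `relative_diff` and `label_str` columns of `train`.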

In [None]:
train[(train['label_str'] == 'neutral') & (train['relative_diff'] > 2) & (train['language'] == 'English')].tail(10)

<a id='download-more-data'></a>
## Download more data (caveats)

Caveats: Adding SNLI data made the validation metrics erratic and lowered the accuracy of the submission, possibly because SNLI was generated differently from the competition data. Other datasets, such as XNLI, overlap with the competition dataset, so using them would not be fair.

In [None]:
# About SNLI: https://nlp.stanford.edu/projects/snli/
snli = load_dataset(path='snli')

In [None]:
result_snli = []
for k in ['train', 'validation']:
    for record in snli[k]:
        c1, c2, c3 = record['premise'], record['hypothesis'], record['label']
        if c1 and c2 and c3 in {0,1,2}:
            result_snli.append((c1,c2,'en','English',c3))
snli_df = pd.DataFrame(result_snli, columns=['premise','hypothesis','lang_abv', 'language', 'label'])
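
The `c3 in {0,1,2}` check above matters because SNLI examples without a gold label carry the label `-1` in the Hugging Face version of the dataset. A minimal sketch of the same filter on hand-written records follows; the records are invented and `keep_record` is a hypothetical helper.

```python
VALID_LABELS = {0, 1, 2}

def keep_record(record):
    """Keep a record only if both sentences are non-empty and the label is valid."""
    return (
        bool(record["premise"])
        and bool(record["hypothesis"])
        and record["label"] in VALID_LABELS
    )

# Invented records, for illustration only.
records = [
    {"premise": "A dog runs.", "hypothesis": "An animal moves.", "label": 0},
    {"premise": "A dog runs.", "hypothesis": "A cat sleeps.", "label": -1},  # no gold label
    {"premise": "", "hypothesis": "Something happens.", "label": 1},         # empty premise
]
kept = [r for r in records if keep_record(r)]
print(len(kept))
```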

In [None]:
## To avoid duplication, check if premises in SNLI and the Kaggle training dataset overlap.
pd.merge(train['premise'], snli_df['premise'], on = 'premise', how='inner').shape[0] != 0

In [None]:
train.columns

In [None]:
## if we wanted to combine the datasets
# combined_df = pd.concat([train.drop(columns=['id', 'label_str', 'premise_len', 'hypothesis_len', 'relative_diff']), snli_df], axis=0)
# combined_df = shuffle(combined_df).reset_index(drop = True)
# assert combined_df.shape[0] == train.shape[0] + snli_df.shape[0]

In [None]:
## sticking to the data provided by Kaggle
final_df = shuffle(train.drop(columns=['id', 'label_str', 'premise_len', 'hypothesis_len', 'relative_diff'])).reset_index(drop = True)

In [None]:
X, y = final_df[['premise', 'hypothesis']].values.tolist(), final_df['label']
x_train, x_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=12345)

In [None]:
## delete snli
del snli
## collect garbage
gc.collect()

<a id='set-up-tpu'></a>
## Set up TPU 

In [None]:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
except ValueError:
    strategy = tf.distribute.get_strategy() # for CPU and single GPU

print('Number of replicas:', strategy.num_replicas_in_sync)

<a id='parameters'></a>
## Parameters

In [None]:
encoder_handle = 'jplu/tf-xlm-roberta-large'

In [None]:
!curl https://s3.amazonaws.com/models.huggingface.co/bert/jplu/tf-xlm-roberta-large/config.json

In [None]:
tokenizer = AutoTokenizer.from_pretrained(encoder_handle)

In [None]:
max_len = 64 # max sequence length
# random_seed = 2021
random_seed = 11887
learning_rate = 1e-5 # Controls how large a step is taken when updating model weights during training.
epochs = 5
batch_size = 16 * strategy.num_replicas_in_sync # The number of examples that will be processed in parallel during training. Tailored for TPUs.
loss = 'sparse_categorical_crossentropy'
metrics = ['accuracy']
steps_per_epoch = 1000

auto = tf.data.experimental.AUTOTUNE
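
Because the training dataset is built with `.repeat()` in `build_dataset`, one Keras "epoch" here is simply `steps_per_epoch` batches, not one pass over the data. The sketch below works out how many passes that amounts to. The replica count and dataset size are assumptions chosen for illustration, and `passes_per_epoch` is a hypothetical helper.

```python
def passes_per_epoch(n_examples, per_replica_batch, n_replicas, steps_per_epoch):
    """Number of full passes over the data made in one Keras 'epoch'."""
    global_batch = per_replica_batch * n_replicas
    return steps_per_epoch * global_batch / n_examples

# Hypothetical numbers: 8 replicas and 9,000 training examples are assumptions.
print(passes_per_epoch(n_examples=9000, per_replica_batch=16, n_replicas=8, steps_per_epoch=1000))
```

With these assumed numbers each "epoch" covers the training examples about 14 times, which is worth keeping in mind when reading the per-epoch curves.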

In [None]:
print(batch_size)

<a id='prepare-data'></a>
## Prepare training data

In [None]:
def encode_sentence(s, tokenizer):
    """
    Turn a sentence into a list of token ids using the selected tokenizer, with a separator token appended.
    Args:
        s (str) - Input string.
        tokenizer - XLM-R tokenizer.
    Returns:
        (list of int) - Tokenized string.

    """
    tokens = list(tokenizer.tokenize(s))
    tokens.append(tokenizer.sep_token)
    return tokenizer.convert_tokens_to_ids(tokens)

def tokenize(data, tokenizer, max_len):
    """
    Encode hypotheses and premises into arrays of numbers using a selected tokenizer. 
    Args:
        data - An array consisting of [premise (str), hypothesis (str)] pairs.
        tokenizer - Tokenizer handle.
        max_len - Max sequence length.
    Returns: (dictionary of tensors)
        input_word_ids - Indices of input sequence tokens in the vocabulary, truncated to max_len.
        input_mask - Real input indices mapped to ones. Padding indices mapped to zeroes.
        input_type_ids - Segment token indices to indicate first and second portions of the inputs.
    """

    PAD_ID = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
 
    # Append a separator to each sentence, tokenize, and concatenate.
    tokens1 = tf.ragged.constant([encode_sentence(s[0], tokenizer) for s in data], dtype=tf.int32) # ENCODED_SEQUENCE_A [SEP]
    tokens2 = tf.ragged.constant([encode_sentence(s[1], tokenizer) for s in data], dtype=tf.int32) # ENCODED_SEQUENCE_B [SEP]
    cls_label = [tokenizer.convert_tokens_to_ids([tokenizer.cls_token])]*tokens1.shape[0] # one [CLS] id per example
    tokens = tf.concat([cls_label, tokens1, tokens2], axis=-1) # [CLS] ENCODED_SEQUENCE_A [SEP] ENCODED_SEQUENCE_B [SEP]

    # Truncate to max_len.
    tokens = tokens[:, :max_len]

    # Pad with PAD_ID if len < max_len.
    tokens = tokens.to_tensor(default_value=PAD_ID)
    pad = max_len - tf.shape(tokens)[1]
    tokens = tf.pad(tokens, [[0, 0], [0, pad]], constant_values=PAD_ID)
    input_word_ids = tf.reshape(tokens, [-1, max_len])

    # The input mask allows the model to cleanly differentiate between the content and the padding. 
    input_mask = tf.cast(input_word_ids != PAD_ID, tf.int32)
    input_mask = tf.reshape(input_mask, [-1, max_len])

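    # Map [CLS] and tokens1 positions to zeroes and tokens2 positions to ones.
    # Note: unlike input_word_ids, this tensor is neither truncated nor padded to max_len.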
    input_type_ids = tf.concat([tf.zeros_like(cls_label), tf.zeros_like(tokens1), tf.ones_like(tokens2)], axis=-1).to_tensor()


    inputs = {
      'input_word_ids': input_word_ids,
      'input_mask': input_mask,
      'input_type_ids': input_type_ids}

    return inputs
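
The layout produced by `tokenize` can be traced by hand on small integer ids. The sketch below uses plain lists instead of ragged tensors; the ids passed in and the default special-token ids are chosen for illustration only, and `pack_pair` is a hypothetical helper.

```python
def pack_pair(ids_a, ids_b, max_len, cls_id=0, sep_id=2, pad_id=1):
    """Build [CLS] A [SEP] B [SEP], truncate to max_len, pad, and derive the mask."""
    tokens = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    tokens = tokens[:max_len]                             # truncate the joined sequence
    tokens = tokens + [pad_id] * (max_len - len(tokens))  # right-pad to max_len
    mask = [int(t != pad_id) for t in tokens]             # 1 = real token, 0 = padding
    return tokens, mask

word_ids, mask = pack_pair([11, 12, 13], [21, 22], max_len=10)
print(word_ids)  # [0, 11, 12, 13, 2, 21, 22, 2, 1, 1]
print(mask)      # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

Truncation is applied to the joined sequence, so a long first sentence can push part or all of the second sentence out of the `max_len` window.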

In [None]:
def build_dataset(x, y, mode, batch_size):
    """
    Build a batched TF training, validation, or test dataset.
    
    (This function is borrowed from some of the other notebooks in this competition -
    not sure who to credit exactly so thanks all!)
    """
    if mode == "train":
        dataset = (
            tf.data.Dataset
            .from_tensor_slices((x, y))
            .repeat()
            .shuffle(5678)
            .batch(batch_size)
            .prefetch(auto)
        )
    elif mode == "valid":
        dataset = (
            tf.data.Dataset
            .from_tensor_slices((x, y))
            .batch(batch_size)
            .cache()
            .prefetch(auto)
        )
    elif mode == "test":
        dataset = (
            tf.data.Dataset
            .from_tensor_slices(x)
            .batch(batch_size)
            )
    else:
        raise NotImplementedError
    return dataset

In [None]:
x_train_ = tokenize(x_train, tokenizer, max_len)
x_valid_ = tokenize(x_valid, tokenizer, max_len)

In [None]:
print('Shape Word Ids : ', x_train_['input_word_ids'].shape)
print('Word Ids       : ', x_train_['input_word_ids'][0, :max_len])
print('Shape Mask     : ', x_train_['input_mask'].shape)
print('Input Mask     : ', x_train_['input_mask'][0, :max_len])
print('Shape Type Ids : ', x_train_['input_type_ids'].shape)
print('Type Ids       : ', x_train_['input_type_ids'][0, :max_len])

In [None]:
train_dataset = build_dataset(x_train_, y_train, "train", batch_size)
valid_dataset = build_dataset(x_valid_, y_valid, "valid", batch_size)

<a id='create-train-model'></a>
## Create & train model

In [None]:
def build_model(encoder_handle, random_seed, learning_rate, loss, metrics, max_len):
    
    tf.keras.backend.clear_session()
    tf.random.set_seed(random_seed)
    
    with strategy.scope():
        
        input_word_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
        input_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
        # RoBERTa doesn’t use token_type_ids.
        
        #  Create an instance of a model defined in encoder_handle
        roberta = TFXLMRobertaModel.from_pretrained(encoder_handle)
        roberta = roberta([input_word_ids, input_mask])[0]
        out = GlobalAveragePooling1D()(roberta)
        out = Dense(3, activation='softmax')(out)
        
        model = Model(inputs=[input_word_ids, input_mask], outputs = out)
        model.compile(optimizer=Adam(learning_rate=learning_rate), loss=loss, metrics=metrics)
    
    model.summary()
    
    return model

In [None]:
model = build_model(encoder_handle, random_seed, learning_rate, loss, metrics, max_len)

In [None]:
# Early stopping is a technique used to prevent machine learning models from overfitting to training data.
# The general idea is to terminate training once the model stops improving its performance on the validation data.
# The patience is how many epochs to wait before termination. With a patience of 2, we will terminate training
# if the validation loss does not improve for 2 consecutive epochs.
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', 
                                                  verbose=1,
                                                  patience=2,
                                                  mode='min',
                                                  restore_best_weights=True)
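
The patience rule described in the comments above can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: it ignores `min_delta` and weight restoration, the loss values are invented, and `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (epoch at which training stops, epoch with the best loss), both 1-based."""
    best = float("inf")
    best_epoch = None
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training runs for all epochs.
    return len(val_losses), best_epoch

# Invented validation losses: no improvement in epochs 3 and 4.
print(early_stopping_epoch([0.9, 0.8, 0.85, 0.82, 0.7]))  # (4, 2)
```

In this example training stops after epoch 4, and with `restore_best_weights=True` the weights from epoch 2 would be the ones kept.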

<a id='small-training'></a>
### Small scale training

Sanity check: when trained on a very small dataset (here, a single batch), the model should be able to overfit it and reach perfect training accuracy.

In [None]:
x_small_train, y_small_train = X[:batch_size], y[:batch_size]

In [None]:
x_small_train = tokenize(x_small_train, tokenizer, max_len)
small_train_dataset = build_dataset(x_small_train, y_small_train, "train", batch_size)

In [None]:
print('Shape Word Ids : ', x_small_train['input_word_ids'].shape)
print('Word Ids       : ', x_small_train['input_word_ids'][0, :max_len])
print('Shape Mask     : ', x_small_train['input_mask'].shape)
print('Input Mask     : ', x_small_train['input_mask'][0, :max_len])
print('Shape Type Ids : ', x_small_train['input_type_ids'].shape)
print('Type Ids       : ', x_small_train['input_type_ids'][0, :max_len])

In [None]:
history_small_train = model.fit(small_train_dataset,
                                steps_per_epoch=100,
                                epochs=3)

In [None]:
ep_nbr = np.arange(1, len(history_small_train.history['loss']) + 1)
plt.plot(ep_nbr, history_small_train.history['loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.show()

In [None]:
plt.plot(ep_nbr, history_small_train.history['accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.show()

<a id='main-training'></a>
### Main model training

In [None]:
model = build_model(encoder_handle, random_seed, learning_rate, loss, metrics, max_len)

In [None]:
history = model.fit(train_dataset,
                    validation_data=valid_dataset,
                    steps_per_epoch=steps_per_epoch,
                    epochs=epochs,
                    callbacks=[early_stopping])

In [None]:
# list all data in history
print(history.history.keys())

In [None]:
# summarize history for loss
ep_nbr = np.arange(1, len(history.history['accuracy']) + 1)
plt.plot(ep_nbr, history.history['loss'])
plt.plot(ep_nbr, history.history['val_loss'])
plt.title('Unadjusted Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

# Training loss is continually reported over the course of an entire epoch.
# Validation metrics are computed over the validation set only once the current training epoch is completed.
# This means that, on average, training losses are measured half an epoch earlier.

# plot the *shifted* training and validation loss
plt.plot(ep_nbr - 0.5, history.history['loss'], label="train_loss")
plt.plot(ep_nbr, history.history['val_loss'], label="val_loss")
plt.title("Shifted Loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()

# summarize history for accuracy
plt.plot(ep_nbr, history.history['accuracy'])
plt.plot(ep_nbr, history.history['val_accuracy'])
plt.title('Unadjusted Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

# plot the *shifted* training and validation accuracy
plt.plot(ep_nbr - 0.5, history.history['accuracy'], label="train_accuracy")
plt.plot(ep_nbr, history.history['val_accuracy'], label="val_accuracy")
plt.title("Shifted Accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()

<a id='submit-predictions'></a>
## Generate & submit predictions

In [None]:
test.head(2)

In [None]:
x_test = tokenize(test[['premise', 'hypothesis']].values.tolist(), tokenizer, max_len)
test_dataset  = build_dataset(x_test, None, "test", batch_size)

In [None]:
#model predictions
predictions_prob = model.predict(test_dataset)
final = predictions_prob.argmax(axis=-1)   

submission = pd.DataFrame()    
submission['id'] = test['id']
submission['prediction'] = final.astype(np.int32)

In [None]:
assert submission.shape[0] == test.shape[0]

In [None]:
submission.head()

In [None]:
submission.to_csv("submission.csv", index = False)