# "Contradictory, my dear Watson" Test Set Submission

This notebook contains the submission for the Kaggle challenge. It uses the BERT architecture with the `bert_base_en` preset from `keras-nlp`.

### Required packages

In [None]:
!pip install -q keras-nlp --upgrade


### Imports

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy
import keras_nlp
import keras
import tensorflow as tf
import os
import gc

keras.mixed_precision.set_global_policy("mixed_float16")


### Dataset fetch

In [5]:
DATA_DIR = '/kaggle/input/contradictory-my-dear-watson/'
for dirname, _, filenames in os.walk(DATA_DIR):
    for filename in filenames:
        print(os.path.join(dirname, filename))


### Dataset exploration and visualization

In [None]:
df_train = pd.read_csv(DATA_DIR + "train.csv")
df_train.head()


#### Descriptions and summaries of the data

In [None]:
df_train.id.count()


In [None]:
df_train.hypothesis.describe()


In [None]:
for i in range(10):
    print(df_train.hypothesis[i])


- All hypotheses are unique and follow the structure shown above. Relative to its paired premise, a hypothesis may follow directly from it, contradict it, or be undecidable because the premise does not contain enough information to draw a conclusion.

In [None]:
df_train.premise.describe()


In [None]:
for i in range(10):
    print(df_train.premise[i])


- Note: The premises **are not necessarily unique**. The same premise can be paired with several hypotheses, so the model can learn to **recognize several conclusions that may be tested against one premise**.
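
The uniqueness observations above can be checked directly with pandas. The following is a minimal sketch on a small invented frame that only mimics the `premise` and `hypothesis` columns of `train.csv`; the rows are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for df_train; the rows are invented for illustration only.
toy = pd.DataFrame(
    {
        "premise": ["A man is cooking.", "A man is cooking.", "It is raining."],
        "hypothesis": [
            "Someone prepares food.",
            "Nobody is in the kitchen.",
            "The ground is wet.",
        ],
    }
)

# Number of distinct texts in each column.
unique_counts = toy[["premise", "hypothesis"]].nunique()

# Number of hypotheses attached to each premise.
hypotheses_per_premise = toy.groupby("premise")["hypothesis"].count()

print(unique_counts)
print(hypotheses_per_premise)
```

Running the same two expressions on `df_train` would show how many premises are shared between rows.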

In [None]:
def get_length_of_text(_text):
    return len(_text)


In [None]:
length_of_hypothesis_texts = df_train.hypothesis.apply(get_length_of_text)

length_of_premise_texts = df_train.premise.apply(get_length_of_text)


In [None]:
plt.figure(figsize=(12, 8))

plt.boxplot(length_of_hypothesis_texts, vert=False)

for pos in ["top", "right", "left"]:
    plt.gca().spines[pos].set_visible(False)

plt.gca().get_yaxis().set_visible(False)

plt.grid(axis="x", alpha=0.33, color="black")

plt.xlabel("Character Count")

plt.title("Box-Whisker Plot Summary for Hypothesis Character Count")

plt.show()


In [None]:
plt.figure(figsize=(12, 8))

plt.boxplot(length_of_premise_texts, vert=False)

for pos in ["top", "right", "left"]:
    plt.gca().spines[pos].set_visible(False)

plt.gca().get_yaxis().set_visible(False)

plt.grid(axis="x", alpha=0.33, color="black")

plt.xlabel("Character Count")

plt.title("Box-Whisker Plot Summary for Premise Character Count")

plt.show()


- The box plots of the character counts show many outliers: several texts are much longer than the upper limit of the bulk of the distribution. This may affect how the texts are encoded, because inputs longer than the preprocessor's sequence length are truncated.

- Otherwise, the premises tend to be longer than the hypotheses.
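
The outlier observation can be quantified with the 1.5 × IQR rule, which is the default that `plt.boxplot` uses to place its whiskers. The sketch below is an assumption-laden illustration: `count_iqr_outliers` is a hypothetical helper and the character counts are made up rather than taken from the dataset.

```python
import numpy as np

def count_iqr_outliers(lengths):
    """Count values above the upper limit Q3 + 1.5 * IQR and return that limit."""
    values = np.asarray(lengths, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    upper_limit = q3 + 1.5 * (q3 - q1)
    return int((values > upper_limit).sum()), float(upper_limit)

# Made-up character counts, not taken from the dataset.
example_lengths = [40, 45, 50, 55, 60, 65, 70, 400]
n_outliers, upper_limit = count_iqr_outliers(example_lengths)
print(n_outliers, upper_limit)
```

Passing `length_of_premise_texts` or `length_of_hypothesis_texts` to the helper would give the corresponding counts for the training data.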

### Training loop

#### Data preprocessing

In [None]:
def split_labels(x, y):
    return (x[0], x[1]), y


In [None]:
batch_size = 36
buffer_size = batch_size * 10

training_dataset = (
    tf.data.Dataset.from_tensor_slices(
        (
            df_train[["hypothesis", "premise"]].values,
            df_train["label"].values
        )
    )
)

train_preprocessed = (
    training_dataset
    .shuffle(buffer_size=buffer_size)
    .map(split_labels, tf.data.AUTOTUNE)
    .batch(batch_size, drop_remainder=True)
    .prefetch(tf.data.AUTOTUNE)
)
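
With `drop_remainder=True`, the final partial batch is discarded. The arithmetic can be sketched as follows; `batch_summary` is a hypothetical helper, and the row count of 1000 is an assumption for illustration, not the real size of `train.csv`.

```python
def batch_summary(num_examples, batch_size):
    """Return (full batches, examples dropped) when drop_remainder=True."""
    full_batches, dropped = divmod(num_examples, batch_size)
    return full_batches, dropped

# 1000 is a hypothetical dataset size chosen for illustration.
full_batches, dropped = batch_summary(1000, 36)
print(full_batches, dropped)  # 27 28
```

Since the pipeline is later cached after shuffling, the same examples would likely be dropped in every epoch.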


#### Model initialization

In [None]:
# Tokenize each (hypothesis, premise) pair into fixed-length BERT inputs.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset("bert_base_en", sequence_length=195)
bert_train_set = train_preprocessed.map(preprocessor, tf.data.AUTOTUNE).cache().prefetch(tf.data.AUTOTUNE)

classifier = keras_nlp.models.BertClassifier.from_preset("bert_base_en", preprocessor=None, num_classes=3)
classifier.compile(
    optimizer=keras.optimizers.Adam(4e-05),
    # The classifier outputs logits, so the loss is stated explicitly.
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
history = classifier.fit(bert_train_set, epochs=5)


### Predictions on test data for submission

In [None]:
df_test = pd.read_csv(DATA_DIR + "test.csv")
print(df_test.head())

# The test set has no labels; add a dummy column so split_labels can be reused.
df_test["label"] = 0


In [None]:
testing_dataset = (
    tf.data.Dataset.from_tensor_slices(
        (
            df_test[["hypothesis", "premise"]].values,
            df_test["label"].values
        )
    )
)

test_preprocessed = (
    testing_dataset
    .map(split_labels, tf.data.AUTOTUNE)
    .batch(1, drop_remainder=False)
    .cache()
    .prefetch(tf.data.AUTOTUNE)
)


In [None]:
preprocessor = keras_nlp.models.BertPreprocessor.from_preset("bert_base_en", sequence_length=195)

bert_test_set = (test_preprocessed.map(preprocessor, tf.data.AUTOTUNE).cache().prefetch(tf.data.AUTOTUNE))


In [None]:
predictions = classifier.predict(bert_test_set)
predicted_classes = tf.argmax(predictions, axis=1)
predicted_classes_np = predicted_classes.numpy()
print(predicted_classes_np)
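
`classifier.predict` returns one row of class scores per example, and taking the argmax over the class axis selects the predicted label. A minimal NumPy sketch with made-up scores, assuming the three classes configured above:

```python
import numpy as np

# Made-up class scores (logits) for four examples and three classes.
example_logits = np.array(
    [
        [2.1, -0.3, 0.4],
        [-1.0, 0.2, 1.5],
        [0.1, 0.9, -0.7],
        [0.3, 0.3, -2.0],
    ]
)

# Index of the largest score in each row; ties resolve to the lowest index.
example_classes = example_logits.argmax(axis=1)
print(example_classes)  # [0 2 1 0]
```

Because argmax depends only on the ordering of the scores, applying a softmax first would not change the predicted classes.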

### Creation of submission file

In [None]:
submission = df_test.id.copy().to_frame()
submission["prediction"] = predicted_classes_np
submission.to_csv("submission.csv", index=False)

submission
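
Before uploading, it can help to sanity-check the frame. The sketch below is one possible check, not a requirement of the competition: `find_submission_problems` is a hypothetical helper, the label set `(0, 1, 2)` is an assumption based on `num_classes=3`, and the ids are made up.

```python
import pandas as pd

def find_submission_problems(frame, expected_rows, valid_labels=(0, 1, 2)):
    """Return a list of problems found; an empty list means the frame looks fine."""
    problems = []
    if list(frame.columns) != ["id", "prediction"]:
        problems.append(f"unexpected columns: {list(frame.columns)}")
        return problems
    if len(frame) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(frame)}")
    if frame["id"].duplicated().any():
        problems.append("duplicate ids")
    if not frame["prediction"].isin(valid_labels).all():
        problems.append("predictions outside the valid label set")
    return problems

# Made-up ids and predictions for illustration.
toy_submission = pd.DataFrame({"id": ["a1", "b2", "c3"], "prediction": [0, 2, 1]})
print(find_submission_problems(toy_submission, expected_rows=3))  # []
```

In the notebook, the call would be `find_submission_problems(submission, expected_rows=len(df_test))`.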