# Contradictory, My Dear Watson

---

A Kaggle competition whose objective is to classify pairs of sentences into three categories:

- 0 == entailment
- 1 == neutral
- 2 == contradiction

The input data consist of premises, hypotheses, and labels describing how each pair is related. The sentence pairs come in 15 different languages.

Source: [Kaggle.com - Contradictory, My Dear Watson](https://www.kaggle.com/competitions/contradictory-my-dear-watson/overview/code-requirements)
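
To make the label scheme concrete, here is a small sketch that maps the integer labels to their names. The three sentence pairs are invented for illustration and are not taken from the competition data; `LABEL_NAMES` and `describe` are hypothetical names.

```python
# Label ids as listed above
LABEL_NAMES = {0: "entailment", 1: "neutral", 2: "contradiction"}

# Invented examples (not from the dataset): (premise, hypothesis, label)
examples = [
    ("A man is playing a guitar on stage.", "A person is making music.", 0),
    ("A man is playing a guitar on stage.", "The concert is sold out.", 1),
    ("A man is playing a guitar on stage.", "Nobody is on the stage.", 2),
]

def describe(premise, hypothesis, label):
    """Return a readable description of one labelled sentence pair."""
    return f"{LABEL_NAMES[label]}: '{premise}' -> '{hypothesis}'"

for premise, hypothesis, label in examples:
    print(describe(premise, hypothesis, label))
```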

## About this notebook

This is my first NLP project apart from a few smaller tasks in TensorFlow courses, and my first hands-on experience with training HuggingFace models.

- I have explored how encoders work, what kind of input data they need, and which models would be best for multiclass classification tasks.

- I have chosen to compare three models: `TFBertForSequenceClassification`, `TFXLMRobertaForSequenceClassification`, and `TFDistilBertForSequenceClassification`. I decided to use `TFBertForSequenceClassification`.
    - The results in this code are based on the mentioned model, but the code allows trying any of them.

- I have tried various combinations of layers to learn and observe how they work:
    - Convolutional/Pooling layers
    - Bidirectional LSTM or GRU layers
    - Additional Dense layers

- I have experimented with optimizers, batch size, and callbacks.

- Finally, I decided to go with a simple Dropout and Dense layer following the BERT layer. The results are not perfect, and I believe there are ways to achieve better accuracy, but for now I am happy to share this working, easy-to-follow notebook for those who are just a step behind me and can benefit from it.
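
As a rough picture of the input an encoder expects for a sentence pair, the sketch below builds the token sequence and the attention mask by hand. It is a toy illustration only: it splits on whitespace instead of using a real subword tokenizer, and `encode_pair` is a hypothetical helper, not a HuggingFace function.

```python
def encode_pair(first, second, max_len):
    """Toy BERT-style pair encoding: [CLS] first [SEP] second [SEP] + padding."""
    tokens = ["[CLS]"] + first.split() + ["[SEP]"] + second.split() + ["[SEP]"]
    tokens = tokens[:max_len]                   # naive truncation from the right
    attention_mask = [1] * len(tokens)          # 1 = real token
    padding = max_len - len(tokens)
    tokens += ["[PAD]"] * padding
    attention_mask += [0] * padding             # 0 = padding, ignored by attention
    return tokens, attention_mask

tokens, mask = encode_pair("He is asleep", "He is awake", max_len=10)
print(tokens)
print(mask)
```

A real tokenizer returns integer ids rather than strings and truncates more carefully (it keeps the special tokens), but the overall layout of the two sentences, separators, and padding is the same idea.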

### Challenges I faced

- **Computational power** - Initially, I worked on my local machine with TensorFlow connected to my GPU. However, that was not enough to fine-tune BERT on this task in a reasonable time. Training only the output layers did not give good results, so I moved to Colab, where I purchased some GPU/TPU compute units because waiting for Kaggle's TPU was a problem.

- **Tokenization** - Only a few samples in the dataset are much longer than the rest, but the code still returned warnings for thousands of them no matter which kind of truncation I tried. So I raised `max_len` to around 200 (the code below uses `MAX_LEN = 213`).

- **The task itself** - Choosing multiclass classification of pairs of sentences as my first Transformers project was quite ambitious. I spent hours studying all related information:
    - TensorFlow, Keras, and HuggingFace documentation
    - Many notebooks related to this or similar tasks
    
    
- **Kaggle environment** - Google Colab and Kaggle provide different environments for running code. In Kaggle, you may need to install specific versions of TensorFlow, Keras, and Transformers if they are not already installed, by running the installation commands in code cells of the notebook. To use GPU resources, make sure a GPU accelerator is selected in the Kaggle notebook settings.
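
One way to approach the `max_len` choice from the tokenization point above is to check how many pairs fit under a given length before settling on a value. The sketch below is an assumption about how this could be done, not what the notebook does: `pair_lengths` and `coverage` are hypothetical helpers, the sentences are made up, and the whitespace tokenizer stands in for a real one only so that the example runs offline.

```python
def pair_lengths(tokenize, first_texts, second_texts, n_special=3):
    """Token count of each encoded pair; n_special covers [CLS] and two [SEP]."""
    return [len(tokenize(a)) + len(tokenize(b)) + n_special
            for a, b in zip(first_texts, second_texts)]

def coverage(lengths, max_len):
    """Fraction of pairs that fit into max_len tokens without truncation."""
    return sum(length <= max_len for length in lengths) / len(lengths)

# Stand-in tokenizer (assumption): a real subword tokenizer yields more tokens
tokenize = str.split

hypotheses = ["He is awake", "The shop is closed today", "It rains"]
premises = ["He is asleep", "The shop opens at nine every single morning", "The sky is clear"]

lengths = pair_lengths(tokenize, hypotheses, premises)
for max_len in (8, 12, 16):
    print(max_len, coverage(lengths, max_len))
```

With a HuggingFace tokenizer, `tokenizer.tokenize` could be passed in place of `str.split` to count subword tokens instead of words.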


### Next steps

- Study Pipeline - Learn how to use it to experiment with different models easily
- Data Augmentation - Learn how to use Synonym Replacement and Backtranslation
- Other LLMs - I am curious about how Llama would perform on this task
- NLP Tools - Study NLTK, Datasets etc.


# Methodology

* Explore the dataset
* Try various models to check which one performs best and is suitable for fine-tuning
* Prevent overfitting and tune the selected model's hyperparameters






In [None]:
!pip install tf-keras
!pip install transformers==4.37.2

In [None]:
# import statements

import os

# TF_USE_LEGACY_KERAS must be set before TensorFlow is imported,
# otherwise it has no effect on which Keras version is used
os.environ['TF_USE_LEGACY_KERAS'] = '1'

import warnings
warnings.filterwarnings("ignore")

import tensorflow as tf
import pandas as pd
import numpy as np

from transformers import (DistilBertTokenizer, TFDistilBertForSequenceClassification,
                          XLMRobertaTokenizer, TFXLMRobertaForSequenceClassification,
                          BertTokenizer, TFBertForSequenceClassification)

from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras import layers, Model, Input

from sklearn.model_selection import train_test_split

import plotly.express as px


for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))




# Exploratory Data Analysis

In [None]:
# reading training dataset

train_df = pd.read_csv('../input/contradictory-my-dear-watson/train.csv')
train_df.head()


In [None]:
# reading test dataset

test_df = pd.read_csv('../input/contradictory-my-dear-watson/test.csv')
test_df.head()

In [None]:
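# check label distribution

# value_counts() sorts by frequency, so order by label id and map ids to names explicitly
label_names = {0: 'entailment', 1: 'neutral', 2: 'contradiction'}
label_counts = train_df['label'].value_counts().sort_index()
fig = px.pie(values=label_counts.values, names=label_counts.index.map(label_names), title='Label distribution', hole=0.4)
fig.show()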

In [None]:
# check language distribution in both train and test data

fig1 = px.pie(values=train_df['language'].value_counts(), names=train_df.language.value_counts().index, title='Training data language distribution', hole=0.4)
fig2 = px.pie(values=test_df['language'].value_counts(), names=test_df.language.value_counts().index, title='Testing data language distribution', hole=0.4)
fig1.show()
fig2.show()


In [None]:
# check the maximum sentence length in whitespace-separated words (subword token counts are typically higher)

hypothesis_word_counts = train_df['hypothesis'].apply(lambda x: len(x.split()))
premise_word_counts = train_df['premise'].apply(lambda x: len(x.split()))

hypothesis_word_counts.max(), premise_word_counts.max()

In [None]:
train_df.isna().sum()

# Data preprocessing and model building

In [None]:
# splitting the training dataset into training and validation set

train_hypothesis, val_hypothesis, train_premise, val_premise, train_labels, val_labels = train_test_split(
    train_df['hypothesis'], train_df['premise'], train_df['label'],
    test_size=0.2, random_state=42, stratify=train_df['label'])
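
Because the split is stratified on the label, each class should appear in roughly the same proportion in both parts. A quick sanity check could look like the sketch below; `label_proportions` is a hypothetical helper and the label lists are toy data, not the competition labels.

```python
from collections import Counter

def label_proportions(labels):
    """Return {label: share of rows} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}

# Toy stand-ins for train_labels and val_labels
toy_train = [0] * 40 + [1] * 32 + [2] * 8
toy_val = [0] * 10 + [1] * 8 + [2] * 2

print(label_proportions(toy_train))
print(label_proportions(toy_val))
```

In the notebook, the same helper could be called as `label_proportions(train_labels)` and `label_proportions(val_labels)`.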


In [None]:
# defining model layers

def build_model(model_class, model_pretrained, max_len, optimizer, dropout_rate):
    encoder = model_class.from_pretrained(model_pretrained, num_labels=3)

    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")



    # Get the logits directly from the model
    outputs = encoder(input_word_ids, attention_mask=input_mask)
    logits = outputs.logits

    x = layers.Dropout(dropout_rate)(logits)


    # Output layer
    output = layers.Dense(3, activation='softmax')(x)

    model = Model(inputs=[input_word_ids, input_mask], outputs=output)
    model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

    return model

# Encoding data and training the model

In [None]:
# defining variables

models_dict = {
    # 'distilbert': ('distilbert-base-multilingual-cased', DistilBertTokenizer, TFDistilBertForSequenceClassification),
    #'xlm-roberta': ('jplu/tf-xlm-roberta-base', XLMRobertaTokenizer, TFXLMRobertaForSequenceClassification),
    'bert': ('bert-base-multilingual-uncased', BertTokenizer, TFBertForSequenceClassification)
}


history_dict = {}


MAX_LEN = 213
LEARNING_RATE = 2e-5
EPOCHS = 10
BATCH_SIZE = 32
DROPOUT_RATE = 0.5


OPTIMIZER = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
EARLY_STOPPING = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
LR_SCHEDULER = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=1, verbose=1)
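
To see how the two callbacks above interact, the following sketch replays a sequence of validation losses through a simplified model of their behaviour. This is a simplified re-implementation based on the documented semantics of `EarlyStopping` and `ReduceLROnPlateau`, not Keras code: it ignores options such as `cooldown` and `min_lr`, `simulate_callbacks` is a hypothetical helper, and the loss values are made up.

```python
import math

def simulate_callbacks(val_losses, lr=2e-5, factor=0.2, lr_patience=1,
                       stop_patience=2, min_delta=1e-4):
    """Simplified model of ReduceLROnPlateau + EarlyStopping on val_loss.

    Returns (stop_epoch, best_epoch, learning rate after the last epoch).
    """
    lr_best, lr_wait = math.inf, 0
    stop_best, stop_wait, best_epoch = math.inf, 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        # Early stopping: count epochs without a new best val_loss
        if loss < stop_best:
            stop_best, stop_wait, best_epoch = loss, 0, epoch
        else:
            stop_wait += 1
        # LR scheduler: shrink the learning rate once val_loss plateaus
        if loss < lr_best - min_delta:
            lr_best, lr_wait = loss, 0
        else:
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr, lr_wait = lr * factor, 0
        if stop_wait >= stop_patience:
            return epoch, best_epoch, lr
    return len(val_losses), best_epoch, lr

# Made-up validation losses, for illustration only
stop_epoch, best_epoch, final_lr = simulate_callbacks([0.90, 0.80, 0.82, 0.85, 0.79])
print(stop_epoch, best_epoch, final_lr)
```

With these settings the learning rate is cut after the first epoch without improvement and training stops after the second consecutive one, so the reduced learning rate gets only a single epoch to help. With `restore_best_weights=True`, the weights from `best_epoch` are the ones kept.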


In [None]:

# encode the inputs and train them on selected models

for model_name, (model_pretrained, tokenizer_class, model_class) in models_dict.items():

    tokenizer = tokenizer_class.from_pretrained(model_pretrained)


    # Tokenize the input data; pad every sequence to MAX_LEN so the tensors
    # always match the fixed Input(shape=(max_len,)) defined in build_model
    train_tokens = tokenizer(list(train_hypothesis), list(train_premise),
                    truncation=True, padding='max_length', return_tensors='tf', max_length=MAX_LEN)

    validation_tokens = tokenizer(list(val_hypothesis), list(val_premise),
                    truncation=True, padding='max_length', return_tensors='tf', max_length=MAX_LEN)

    train_ids = train_tokens['input_ids']
    train_mask = train_tokens['attention_mask']
    train_labels = np.asarray(train_labels)

    validation_ids = validation_tokens['input_ids']
    validation_mask = validation_tokens['attention_mask']
    validation_labels = np.asarray(val_labels)

    model = build_model(model_class, model_pretrained, MAX_LEN, OPTIMIZER, DROPOUT_RATE)
    model.summary()

    print('Training', model_name)
    history = model.fit(
      [train_ids, train_mask], train_labels,
      epochs=EPOCHS,
      verbose=1,
      batch_size=BATCH_SIZE,
      validation_data = ([validation_ids, validation_mask], validation_labels),
      callbacks = [EARLY_STOPPING, LR_SCHEDULER])
    history_dict[model_name] = history.history

print(history_dict)



# Visualize the results

In [None]:
for model_name, (model_name_str, tokenizer, model_class) in models_dict.items():

    # Check if the model name exists in history_dict (avoid errors)
    if model_name in history_dict:
        epochs = range(1, len(history_dict[model_name]['loss']) + 1)
        data = {
            'Epoch': epochs,
            'Loss': history_dict[model_name]['loss'],
            'Accuracy': history_dict[model_name]['accuracy'],
            'Val Loss': history_dict[model_name]['val_loss'],
            'Val Accuracy': history_dict[model_name]['val_accuracy'],
            'Learning Rate': history_dict[model_name].get('lr', None)  # Handle models without learning rate
        }
        df = pd.DataFrame(data)

        # Plot loss and accuracy
        fig = px.line(df, x='Epoch', y=['Loss', 'Val Loss'],
                      title=f'{model_name_str} Training and Validation Loss / Accuracy')
        fig.add_scatter(x=df['Epoch'], y=df['Accuracy'], mode='lines', name='Accuracy')
        fig.add_scatter(x=df['Epoch'], y=df['Val Accuracy'], mode='lines', name='Val Accuracy')
        fig.update_layout(yaxis_title='Loss / Accuracy', xaxis_title='Epoch')

        # Plot learning rate (if available)
        if 'lr' in history_dict[model_name]:
            lr_fig = px.line(df, x='Epoch', y='Learning Rate', 
                             title=f'{model_name_str} Learning Rate')
            lr_fig.update_layout(yaxis_title='Learning Rate', xaxis_title='Epoch')
            lr_fig.show()

        fig.show()

    else:
        print(f"WARNING: Model '{model_name}' not found in history_dict. Skipping plots.")

# Prediction and creating submission file

In [None]:
predictions_dict = {}

for model_name, (model_pretrained, tokenizer_class, model_class) in models_dict.items():

    tokenizer = tokenizer_class.from_pretrained(model_pretrained)

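    # Tokenize the input data (padded to MAX_LEN to match the model's input shape)
    test_tokens = tokenizer(list(test_df['hypothesis']), list(test_df['premise']),
                    truncation=True, padding='max_length', return_tensors='tf', max_length=MAX_LEN)

    test_ids = test_tokens['input_ids']
    test_mask = test_tokens['attention_mask']

    # Keys must match the names of the Input layers in build_model
    test_dataset = tf.data.Dataset.from_tensor_slices({
        'input_word_ids': test_ids,
        'input_mask': test_mask
    }).batch(BATCH_SIZE)

    # NOTE: `model` is the last model trained in the loop above. With more than one
    # entry in models_dict, keep each trained model and look it up by model_name here.
    predictions = model.predict(test_dataset)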
    predicted_class_indices = tf.argmax(predictions, axis=-1).numpy()

    # Store predictions in the dictionary
    predictions_dict[model_name] = predicted_class_indices

final_predictions = predictions_dict[list(models_dict.keys())[-1]]

# Create the submission DataFrame
submission_df = pd.DataFrame({
    'id': test_df['id'],
    'prediction': final_predictions
})

# Save to CSV
submission_df.to_csv('submission.csv', index=False)

print("Predictions saved to submission.csv")
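
Before uploading, it can be worth checking that the file has the expected shape. The sketch below is one possible check, assuming the `id` and `prediction` columns written above and labels in 0..2; `validate_submission` is a hypothetical helper, and the toy file with made-up ids exists only so that the example runs without the competition data.

```python
import csv
import os
import tempfile

def validate_submission(path, expected_rows=None):
    """Check the header, label values and (optionally) row count of a submission CSV."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    assert rows, "submission is empty"
    assert list(rows[0].keys()) == ["id", "prediction"], "unexpected columns"
    assert all(row["prediction"] in {"0", "1", "2"} for row in rows), "invalid label"
    if expected_rows is not None:
        assert len(rows) == expected_rows, "row count differs from the test set"
    return len(rows)

# Toy file with made-up ids so the sketch runs without the competition data
toy_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
with open(toy_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([["id", "prediction"], ["abc123", 0], ["def456", 2]])

print(validate_submission(toy_path, expected_rows=2))
```

In the notebook this could be called as `validate_submission('submission.csv', expected_rows=len(test_df))`.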

