# Contradictory, My Dear Watson


This notebook provides a solution framework for the Kaggle competition
["Contradictory, My Dear Watson"](https://www.kaggle.com/competitions/contradictory-my-dear-watson).

The goal of the competition is to classify pairs of sentences as `entailment`, `contradiction`,
or `neutral`. This solution uses a pretrained transformer model like `mBERT` or `XLM-Roberta`.


## Problem Description

This notebook provides a solution framework for the Kaggle competition
["Contradictory, My Dear Watson"](https://www.kaggle.com/competitions/contradictory-my-dear-watson).

### Goal of the Competition

The objective is to classify pairs of sentences into one of three categories:
- **Entailment**: The meaning of the second sentence logically follows from the first.
- **Contradiction**: The second sentence contradicts the meaning of the first.
- **Neutral**: There is no logical relationship between the two sentences.

#### Example:
1. **Entailment**:
   - Premise: "The dog is sleeping."
   - Hypothesis: "The animal is sleeping."
   - Label: **Entailment**

2. **Contradiction**:
   - Premise: "The dog is sleeping."
   - Hypothesis: "The dog is running."
   - Label: **Contradiction**

3. **Neutral**:
   - Premise: "The dog is sleeping."
   - Hypothesis: "The dog is hungry."
   - Label: **Neutral**

### Approach in This Notebook
We use pretrained transformer models like `mBERT` or `XLM-Roberta` to solve the problem. These models are well-suited for multilingual tasks and can understand the relationships between pairs of sentences.

The pipeline includes:
1. Loading and exploring the dataset.
2. Preprocessing and tokenizing the data for use with transformer models.
3. Fine-tuning a pretrained model on the competition dataset.
4. Making predictions on the test dataset and preparing a submission.


## Installing Required Libraries

To implement the solution, we rely on powerful Python libraries, including:
- **`transformers`**: Provides access to pretrained transformer models and tokenizers.
- **`datasets`**: Handles the dataset in an efficient and easy-to-use format.

Installing these libraries ensures a streamlined process for handling data and models.

In [1]:

!pip install transformers datasets




## Importing Libraries

Here, we import the necessary libraries for the solution pipeline:
- **`pandas`**: For loading and manipulating tabular data (training and test datasets).
- **`numpy`**: For numerical operations and array manipulations.
- **`transformers`**: Provides models, tokenizers, and utilities for working with transformer-based models.
- **`datasets`**: Facilitates dataset handling and preprocessing.
- **`torch`**: PyTorch, the deep learning framework used for training and fine-tuning models.

These libraries form the backbone of the solution pipeline.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import torch
import os

In [3]:
# @title Install Kaggle API and Download Dataset
!pip install kaggle
from google.colab import files
files.upload()  # Upload kaggle.json
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c contradictory-my-dear-watson
!unzip -q contradictory-my-dear-watson.zip -d data



Saving kaggle.json to kaggle (1).json
contradictory-my-dear-watson.zip: Skipping, found more recently modified local copy (use --force to force download)
replace data/sample_submission.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
replace data/test.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
replace data/train.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y


## Loading the Dataset

We load the dataset provided for the competition:
- **Training Dataset**: Contains labeled examples of sentence pairs (`premise`, `hypothesis`, and `label`). This is used to train the model.
- **Test Dataset**: Contains unlabeled sentence pairs for which the model will predict labels.

#### Initial Data Exploration
After loading, we inspect the dataset to:
1. Verify its structure and content.
2. Check for missing values or anomalies.
3. Understand the distribution of sentence lengths and labels.

In [3]:
# Load Kaggle dataset
import os
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

# Display sample data
train_df.head()

Unnamed: 0,id,premise,hypothesis,lang_abv,language,label
0,5130fd2cb5,and these comments were considered in formulat...,The rules developed in the interim were put to...,en,English,0
1,5b72532a0b,These are issues that we wrestle with in pract...,Practice groups are not permitted to work on t...,en,English,2
2,3931fbe82a,Des petites choses comme celles-là font une di...,J'essayais d'accomplir quelque chose.,fr,French,0
3,5622f0c60b,you know they can't really defend themselves l...,They can't defend themselves because of their ...,en,English,0
4,86aaa48b45,ในการเล่นบทบาทสมมุติก็เช่นกัน โอกาสที่จะได้แสด...,เด็กสามารถเห็นได้ว่าชาติพันธุ์แตกต่างกันอย่างไร,th,Thai,1


In [4]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5195 entries, 0 to 5194
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          5195 non-null   object
 1   premise     5195 non-null   object
 2   hypothesis  5195 non-null   object
 3   lang_abv    5195 non-null   object
 4   language    5195 non-null   object
dtypes: object(5)
memory usage: 203.1+ KB


## Preprocessing and Tokenization

Transformers require numerical inputs, so we preprocess the data:
1. **Tokenizing Sentences**:
   - Sentences are converted into token IDs using a pretrained tokenizer.
   - Tokenization creates two key features:
     - **Input IDs**: Numerical representations of the words in each sentence.
     - **Attention Masks**: Indicators of valid tokens vs. padding.

2. **Handling Sentence Pairs**:
   - Both `premise` and `hypothesis` are tokenized together to ensure the model understands the relationship between them.

3. **Batch Tokenization**:
   - Multiple sentence pairs are processed together to improve efficiency.

4. **Dataset Format**:
   - The tokenized dataset is converted into a PyTorch-friendly format for compatibility with the training process.

In [5]:
# @title Preprocessing and Tokenization cell
# Tokenization
def tokenize_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], padding="max_length", truncation=True)

train_data = Dataset.from_pandas(train_df[['premise', 'hypothesis', 'label']])
test_data = Dataset.from_pandas(test_df[['premise', 'hypothesis']])

train_data = train_data.map(tokenize_function, batched=True)
test_data = test_data.map(tokenize_function, batched=True)

train_data = train_data.rename_column("label", "labels")
train_data.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
test_data.set_format("torch", columns=["input_ids", "attention_mask"])


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/12120 [00:00<?, ? examples/s]

Map:   0%|          | 0/5195 [00:00<?, ? examples/s]

## Defining the Model

We load a pretrained transformer model designed for sequence classification tasks:
- **Model Checkpoint**: A model like `distilbert-base-multilingual-cased` is loaded from Hugging Face.
- **Classification Head**: The model is modified to include a classification layer that predicts one of the three target labels: `entailment`, `contradiction`, or `neutral`.

#### Why Use Pretrained Models?
Pretrained models have already been trained on massive datasets, enabling them to understand complex language structures. Fine-tuning them for this specific task allows us to leverage their capabilities with minimal computational cost.


In [None]:
# Choose model checkpoint
model_checkpoint = "distilbert-base-multilingual-cased"
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=3)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

## Training Arguments

This section defines hyperparameters and configurations for training:
- **Learning Rate**: Determines how quickly the model updates its weights.
- **Batch Size**: The number of samples processed in one training step.
- **Number of Epochs**: The number of complete passes through the training dataset.
- **Weight Decay**: Adds regularization to prevent overfitting.
- **Evaluation Strategy**: Specifies when to evaluate the model on validation data.

These settings are critical for achieving optimal performance and controlling resource usage during training.
Also reduced the train data and test data to first 1000 and 200 entries respectively to reduce memory requirement and fasten this thraining process.

In [8]:
# @title Training Arguments cell
os.environ["WANDB_MODE"] = "disabled"
train_data = train_data.select(range(1000))  # Use only the first 10,000 examples
test_data = test_data.select(range(200))      # Use only the first 2,000 examples
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-4,
    per_device_train_batch_size=10,  # Reduce batch size
    num_train_epochs=1,
    weight_decay=0.01,
    save_steps=10_000,
    save_total_limit=2,
    fp16=True,  # Enable mixed precision
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=test_data, # Pass the test_data as eval_dataset
)



## Training the Model

We use the Hugging Face `Trainer` API to handle the training process. Key features:
- **Automatic Batching**: The data is processed in batches to make training efficient.
- **Loss Calculation**: The API computes the loss for each batch and updates the model weights.
- **Validation**: The model is evaluated on unseen data after each epoch to monitor progress.
- **Checkpointing**: Model checkpoints are saved periodically, allowing training to resume if interrupted.

This step fine-tunes the pretrained model on the competition dataset to adapt it for the NLI task.


In [9]:
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,No log


TrainOutput(global_step=100, training_loss=1.0771829986572266, metrics={'train_runtime': 3852.5669, 'train_samples_per_second': 0.26, 'train_steps_per_second': 0.026, 'total_flos': 132469761024000.0, 'train_loss': 1.0771829986572266, 'epoch': 1.0})

## Making Predictions and Creating Submission

In this final step:
1. The fine-tuned model is used to predict labels for the test dataset.
2. The predictions are post-processed to match the required format (`id` and `prediction`).
3. A submission file is created, which can be uploaded to Kaggle for evaluation.

This step completes the solution pipeline and generates results for leaderboard scoring.

In [14]:
predictions = trainer.predict(test_data)
predicted_labels = np.argmax(predictions.predictions, axis=1)

# Create submission file
submission = pd.DataFrame({
    "id": test_df["id"][:200],
    "prediction": predicted_labels
})
submission.to_csv("submission.csv", index=False)


In [15]:
# @title Save model, tokenizer and trainer
# Define the directory where the model and tokenizer will be saved
model_save_path = "./saved_model"

# Save the trained model
trainer.save_model(model_save_path)

# Save the tokenizer
tokenizer.save_pretrained(model_save_path)
import shutil
shutil.make_archive("model_tokenizer", 'zip', "./saved_model")

"""
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the saved model
saved_model_path = "./saved_model"
model = AutoModelForSequenceClassification.from_pretrained(saved_model_path)
tokenizer = AutoTokenizer.from_pretrained(saved_model_path)
# Tokenize new data
inputs = tokenizer("The dog is sleeping.", "The animal is resting.", return_tensors="pt", truncation=True)

# Make predictions
outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, axis=1)
print("Predicted label:", predictions.item())  # Outputs the predicted class
"""



cp: cannot create directory '/content/drive/MyDrive/saved_model': No such file or directory


'/content/saved_model.zip'

In [16]:
shutil.make_archive("model_tokenizer", 'zip', "./saved_model")


'/content/model_tokenizer.zip'

In [13]:
print(f"Number of test samples: {len(test_df)}")
print(f"Number of predictions: {len(predicted_labels)}")


Number of test samples: 5195
Number of predictions: 200
