# **0. Dependencies**

In [1]:
#@title 0.1 Install Dependencies

!pip install pytorch-lightning
!pip install torchmetrics
!pip install scikit-optimize



In [None]:
#@title 0.2 Import Dependencies
import numpy as np
import json
import matplotlib.pyplot as plt
from tqdm import tqdm
import seaborn as sns
import pandas as pd

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, random_split, SubsetRandomSampler
import pytorch_lightning as pl
from torchmetrics.functional import accuracy, precision, recall, auroc

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix

from torchmetrics import F1Score
from torchmetrics.classification import Accuracy, Precision, Recall
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import CSVLogger

import requests
import io
from io import BytesIO

In [None]:
#@title 0.3 Set Device

device='cuda' if torch.cuda.is_available() else 'cpu'

print(device)

# **1. Reading File**

In [None]:
import gdown


urls = {
    "train": "https://drive.google.com/uc?id=1he-dSeLCYl5lLIEbndfXOu4195jLgxrC",
    "val": "https://drive.google.com/uc?id=18qITfL87wdNX0EAq-OwLFliRv6KmCEVA",
    "test": "https://drive.google.com/uc?id=1aVxq8xkR4fFUpb4vw9jyyFlmC0woASNd"
}

# Download each file
for name, url in urls.items():
    output_path = f"/content/{name}.jsonl"  # Save each file with its respective name
    gdown.download(url, output_path, quiet=False)

print("All files downloaded successfully!")



We implement a function that reads data from a `.jsonl` file and creates a DataFrame from it.

In [None]:
def load_jsonl(file_path):
    data = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            data.append(json.loads(line))
    return pd.DataFrame(data)
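
As a quick check of the format, here is a minimal sketch of the same line-by-line parsing on a tiny in-memory example. The two records are made up for illustration and only mimic the fields used later; `parse_jsonl` is a hypothetical helper, not part of the notebook.

```python
import io
import json

def parse_jsonl(handle):
    """Parse one JSON object per non-empty line (hypothetical helper)."""
    return [json.loads(line) for line in handle if line.strip()]

# Two made-up records in JSON Lines format: one JSON object per line.
sample = io.StringIO(
    '{"sentence1": "A dog runs.", "sentence2": "An animal moves.", "gold_label": "entailment"}\n'
    '{"sentence1": "A dog runs.", "sentence2": "A cat sleeps.", "gold_label": "contradiction"}\n'
)

records = parse_jsonl(sample)
print(len(records), records[0]["gold_label"])
```

Passing such a list of dictionaries to `pd.DataFrame` gives one row per line and one column per key, which is what `load_jsonl` relies on.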

Now, using the previous function, we create DataFrames for:

1.   Train Set
2.   Validation Set
3.   Test Set



In [None]:
train_df = load_jsonl("train.jsonl")
val_df = load_jsonl("val.jsonl")
test_df = load_jsonl("test.jsonl")

We display the DataFrames, and as we can observe, they contain the following columns:


1.  `annotator_labels` – A list of labels representing the annotators' votes (neutral, entailment, contradiction). The length of this list varies from 1 to 5, depending on the number of votes cast.

2.  `captionID` - The ID of the first sentence (premise).

3.  `gold_label` - The label determined by the majority vote of the annotators (see `annotator_labels`)

4. `pairID` – The ID of the sentence pair (premise and hypothesis).

5. `sentence1` – The first sentence (premise).

6. `sentence1_binary_parse` - The first sentence formatted for tree-structured neural networks with no unary nodes and no labels.

7. `sentence1_parse` - The first sentence parsed using the Stanford Parser in Penn Treebank format.

8. `sentence2` - The second sentence (hypothesis).

9. `sentence2_binary_parse` - The second sentence formatted for tree-structured neural networks with no unary nodes and no labels.

10. `sentence2_parse` - The second sentence parsed using the Stanford Parser in Penn Treebank format.

In [None]:
train_df.head()

In [None]:
val_df.head()

In [None]:
test_df.head()

Now, we create three DataFrames containing only the columns needed by our models. We then filter them to keep only the rows with a valid label, specifically one of `['entailment', 'neutral', 'contradiction']`.
The resulting DataFrames will include the following columns:

1.   `sentence1` – The first sentence (premise)

2.   `sentence2` – The second sentence (hypothesis).

3.   `gold_label` – The majority vote label from the annotators.

This ensures that we work only with relevant and properly labeled data for training and evaluation.

In [None]:
train_df = train_df[['sentence1', 'sentence2', 'gold_label']]
val_df = val_df[['sentence1', 'sentence2', 'gold_label']]
test_df = test_df[['sentence1', 'sentence2', 'gold_label']]

In [None]:
valid_labels = ['entailment', 'neutral', 'contradiction']

train_df_invalid = train_df[~train_df['gold_label'].isin(valid_labels)]
val_df_invalid = val_df[~val_df['gold_label'].isin(valid_labels)]
test_df_invalid = test_df[~test_df['gold_label'].isin(valid_labels)]

print(f"Invalid Train: {len(train_df_invalid)}, Invalid Validation: {len(val_df_invalid)}, Invalid Test: {len(test_df_invalid)}")

In [None]:
train_df_invalid.head()

In [None]:
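# .copy() gives independent DataFrames, so the column assignments in later cells
# do not trigger pandas' SettingWithCopyWarning.
train_df = train_df[train_df['gold_label'].isin(valid_labels)].copy()
val_df = val_df[val_df['gold_label'].isin(valid_labels)].copy()
test_df = test_df[test_df['gold_label'].isin(valid_labels)].copy()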

print(f"Train: {len(train_df)}, Validation: {len(val_df)}, Test: {len(test_df)}")

Now, we assign a numerical value to each label in `gold_label` using the following mapping:

`{'entailment': 0, 'contradiction': 1, 'neutral': 2}`

We apply this transformation to the training, validation, and test sets and then display the updated training DataFrame.


In [None]:
label_mapping = {'entailment': 0, 'contradiction': 1, 'neutral': 2}
train_df['gold_label'] = train_df['gold_label'].map(label_mapping)
val_df['gold_label'] = val_df['gold_label'].map(label_mapping)
test_df['gold_label'] = test_df['gold_label'].map(label_mapping)
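
The order of the two steps (filter first, then map) matters. The sketch below uses a made-up Series to show what `Series.map` does with a label that is missing from the mapping. The `-` value is an assumption standing in for an invalid label (SNLI uses `-` when annotators reach no consensus).

```python
import pandas as pd

label_mapping = {'entailment': 0, 'contradiction': 1, 'neutral': 2}

# Made-up labels, including one that is not in the mapping.
labels = pd.Series(['entailment', '-', 'neutral', 'contradiction'])

# Mapping without filtering: the unknown label becomes NaN and the dtype turns to float.
mapped_unfiltered = labels.map(label_mapping)

# Filtering first keeps only valid labels, so the mapped result stays integer.
valid = labels[labels.isin(list(label_mapping))]
mapped_valid = valid.map(label_mapping)

print(mapped_unfiltered.tolist())
print(mapped_valid.tolist())
```

A NaN label would later fail when the labels are converted to a `torch.long` tensor, which is why invalid rows are dropped before mapping.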

In [None]:
train_df.head()

## 1.1 Data Analysis

We visualize the length distribution of the two sentences using the graphs below.

In [None]:
train_df['sentence1_length'] = train_df['sentence1'].apply(lambda x: len(x.split()))
train_df['sentence2_length'] = train_df['sentence2'].apply(lambda x: len(x.split()))


plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.histplot(train_df['sentence1_length'], bins=50, kde=True)
plt.title('Histogram of Sentence 1 Length')
plt.xlabel('Sentence Length')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
sns.histplot(train_df['sentence2_length'], bins=50, kde=True)
plt.title('Histogram of Sentence 2 Length')
plt.xlabel('Sentence Length')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()


In [None]:
all_lengths = pd.concat([train_df['sentence1_length'], train_df['sentence2_length']], ignore_index=True)

plt.figure(figsize=(8, 6))

sns.histplot(all_lengths, bins=50, kde=True)
plt.title('Histogram of All Sentence Lengths')
plt.xlabel('Sentence Length')
plt.ylabel('Frequency')

In [None]:
train_df[['sentence1_length','sentence2_length']].describe().T


We can observe that the two distributions are very similar and right-skewed. The table above reports their main summary statistics.

Below, we show the shortest and longest sentences for `sentence1` and `sentence2`, respectively.

In [None]:
shortest_sentence1 = train_df['sentence1_length'].min()
shortest_sentence1_row = train_df[train_df['sentence1_length'] == shortest_sentence1]
print(f"Shortest sentence in sentence1 (length: {shortest_sentence1}):")
print(shortest_sentence1_row[['sentence1', 'sentence1_length']].iloc[0]['sentence1'])


shortest_sentence2 = train_df['sentence2_length'].min()
shortest_sentence2_row = train_df[train_df['sentence2_length'] == shortest_sentence2]
print(f"\nShortest sentence in sentence2 (length: {shortest_sentence2}):")
print(shortest_sentence2_row[['sentence2', 'sentence2_length']].iloc[0]['sentence2'])


In [None]:
longest_sentence1 = train_df['sentence1_length'].max()
longest_sentence1_row = train_df[train_df['sentence1_length'] == longest_sentence1]
print(f"Longest sentence in sentence1 (length: {longest_sentence1}):")
print(longest_sentence1_row[['sentence1', 'sentence1_length']].iloc[0]['sentence1'])

longest_sentence2 = train_df['sentence2_length'].max()
longest_sentence2_row = train_df[train_df['sentence2_length'] == longest_sentence2]
print(f"\nLongest sentence in sentence2 (length: {longest_sentence2}):")
print(longest_sentence2_row[['sentence2', 'sentence2_length']].iloc[0]['sentence2'])

We mark the 95th percentile of the sentence-length distributions to see which maximum length covers almost all sentences.

The red dashed line marks the 95th percentile, which is **23 words for the premise** and **18 words for the hypothesis**. When premises and hypotheses are pooled together, the 95th percentile is **20 words**.

In [None]:
percentile_95_sentence1 = train_df['sentence1_length'].quantile(0.95)
percentile_95_sentence2 = train_df['sentence2_length'].quantile(0.95)

print(f"95th percentile of sentence1 length: {percentile_95_sentence1}")
print(f"95th percentile of sentence2 length: {percentile_95_sentence2}")


plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.histplot(train_df['sentence1_length'], bins=50, kde=True)
plt.axvline(percentile_95_sentence1, color='r', linestyle='--', label='95th Percentile')
plt.title('Histogram of Sentence 1 Length with 95th Percentile')
plt.xlabel('Sentence Length')
plt.ylabel('Frequency')
plt.legend()

plt.subplot(1, 2, 2)
sns.histplot(train_df['sentence2_length'], bins=50, kde=True)
plt.axvline(percentile_95_sentence2, color='r', linestyle='--', label='95th Percentile')
plt.title('Histogram of Sentence 2 Length with 95th Percentile')
plt.xlabel('Sentence Length')
plt.ylabel('Frequency')
plt.legend()

plt.tight_layout()
plt.show()


In [None]:
all_lengths = pd.concat([train_df['sentence1_length'], train_df['sentence2_length']], ignore_index=True)

percentile_95_all = all_lengths.quantile(0.95)

print(f"95th percentile of all sentence lengths: {percentile_95_all}")

plt.figure(figsize=(8, 6))
sns.histplot(all_lengths, bins=50, kde=True)
plt.axvline(percentile_95_all, color='r', linestyle='--', label='95th Percentile')

plt.title('Histogram of All Sentence Lengths with 95th Percentile')
plt.xlabel('Sentence Length')
plt.ylabel('Frequency')
plt.legend()

plt.show()
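
To make the percentile cut-off concrete, here is a small sketch on invented sentence lengths. It computes the 95th percentile with linear interpolation (the default in both NumPy and pandas) and the share of sentences that would fit under a maximum length derived from it. The data and the `max_len` rounding rule are assumptions for illustration only.

```python
import numpy as np

# Invented sentence lengths (in words): mostly short, with two long outliers.
lengths = np.array([5] * 10 + [10] * 8 + [30, 40])

# 95th percentile with linear interpolation between neighbouring order statistics.
p95 = np.percentile(lengths, 95)

# Round up to an integer length and measure how many sentences fit without truncation.
max_len = int(np.ceil(p95))
coverage = np.mean(lengths <= max_len)

print(p95, max_len, coverage)
```

Choosing the cut-off this way keeps the input short while truncating only the longest few percent of sentences.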


As the pie chart below shows, the label distribution is essentially balanced across the three classes (shown with their numeric codes). This is because each premise is repeated **three times**, with a different hypothesis assigned to each instance.

In [None]:
label_counts = train_df['gold_label'].value_counts()
plt.figure(figsize=(8, 8))
plt.pie(label_counts, labels=label_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Labels')
plt.show()

In [None]:
train_df.head(9)

# **2. Model 1**

The first implemented model, explained in more detail in the report, is based on six key steps:

1.   **Feature Extraction**
2.   **DataLoader**
3.   **Model Definition**
4.   **Hyperparameter Tuning**
5.   **Training**
6.   **Evaluation**


## 2.1 Feature Extraction


In this step, we transform each sentence (premise and hypothesis) into a fixed-length vector using scikit-learn's **`HashingVectorizer`** with **10,007 features**, so that every input to the model has the same dimension.

Due to the large dataset size and the high number of features, the result will be a **sparse matrix**, as storing it as a dense matrix would require excessive RAM.
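
A back-of-the-envelope estimate shows why a dense matrix is impractical. The row count below is an assumption (roughly the size of the training split; the exact number is printed by an earlier cell), while the feature count comes from the vectorizer configuration.

```python
# Rough memory estimate for ONE dense feature matrix stored as float32.
n_rows = 550_000        # assumption: approximate number of training pairs
n_features = 10_007     # number of hashed features used in this notebook
bytes_per_value = 4     # float32

dense_bytes = n_rows * n_features * bytes_per_value
dense_gb = dense_bytes / 1e9

print(f"Dense float32 matrix: about {dense_gb:.1f} GB")
```

Since there is one such matrix for the premises and one for the hypotheses, the dense representation would need about twice this amount, whereas a sparse matrix stores only the few non-zero counts per sentence.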

In [None]:
vectorizer = HashingVectorizer(n_features=10007, alternate_sign=False, norm=None)

X_train_premise_hashed = vectorizer.fit_transform(train_df['sentence1'])
X_train_hypothesis_hashed = vectorizer.fit_transform(train_df['sentence2'])
X_val_premise_hashed = vectorizer.transform(val_df['sentence1'])
X_val_hypothesis_hashed = vectorizer.transform(val_df['sentence2'])
X_test_premise_hashed = vectorizer.transform(test_df['sentence1'])
X_test_hypothesis_hashed = vectorizer.transform(test_df['sentence2'])

y_train = train_df['gold_label'].values
y_val = val_df['gold_label'].values
y_test = test_df['gold_label'].values
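
The idea behind the hashing trick can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's implementation: it splits on whitespace and uses MD5 for a deterministic hash, whereas `HashingVectorizer` has its own tokenizer and uses MurmurHash3. The helper names are hypothetical.

```python
import hashlib

N_FEATURES = 10_007

def hash_index(token, n_features=N_FEATURES):
    """Map a token to a column index (MD5 here; scikit-learn uses MurmurHash3)."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features

def hash_vectorize(sentence, n_features=N_FEATURES):
    """Return a dense count vector; tokens that collide simply add up."""
    vector = [0] * n_features
    for token in sentence.lower().split():
        vector[hash_index(token, n_features)] += 1
    return vector

vec = hash_vectorize("a dog chases a ball")
print(len(vec), sum(vec), vec[hash_index("a")])
```

No vocabulary has to be stored or fitted: the same token always lands in the same column, which is why the vectorizer can be applied to the validation and test sets without any learned state.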

## 2.2 DataLoader

We implement our dataset using the `class SNLIDataset(Dataset)`, which includes:

1. **Input**: Two vectors, `X1` and `X2`, corresponding to the two sentences (premise and hypothesis).
2. **Output**: The target variable `y`, corresponding to `gold_label`.

Additionally, each sparse row is converted into a dense tensor only when it is requested in `__getitem__`; the `DataLoader` then stacks these rows into a batch. Because only the rows of the current batch are ever dense, memory usage stays low.

Next, we create three datasets corresponding to the **training, validation, and test sets** using the class defined above. Finally, we implement the respective data loaders using `DataLoader` with a **batch size of 512** for the training set and **64** for the validation and test sets.

Moreover, we shuffle (**shuffle=True**) only the training set to prevent the model from learning any specific order in the data.


In [None]:
class SNLIDataset(Dataset):
    def __init__(self, X1, X2, y):
        self.X1 = X1  # Keep the feature matrices sparse
        self.X2 = X2
        self.y = torch.tensor(y, dtype=torch.long)  # Only the labels are converted up front

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        # Convert only the requested row to a dense array
        premise_dense = torch.tensor(self.X1[idx].toarray(), dtype=torch.float32).squeeze(0)
        hypothesis_dense = torch.tensor(self.X2[idx].toarray(), dtype=torch.float32).squeeze(0)

        return premise_dense, hypothesis_dense, self.y[idx]

In [None]:
train_dataset = SNLIDataset(X_train_premise_hashed, X_train_hypothesis_hashed, y_train)
val_dataset = SNLIDataset(X_val_premise_hashed, X_val_hypothesis_hashed, y_val)
test_dataset = SNLIDataset(X_test_premise_hashed, X_test_hypothesis_hashed, y_test)

train_loader = DataLoader(train_dataset, batch_size=512, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

Finally, we verify the dimensions of the two inputs (premise and hypothesis) in the training set and the corresponding output dimension.  

As we can observe, the dimensions match:  
- **Batch size** = 512  
- **Number of features** = 10,007 (which corresponds to the length of each vector representing a sentence).  

This confirms that the input structure is correctly formatted for the model.


In [None]:
premise, hypothesis, labels = next(iter(train_loader))
print(f"Premise shape: {premise.shape}, Hypothesis shape: {hypothesis.shape}, Labels shape: {labels.shape}")


## 2.3 Model Definition

In [None]:
class SNLIMLP(pl.LightningModule):
    def __init__(self, input_dim = 10007, dropout = 0.1, lr = 1e-4):
        super(SNLIMLP, self).__init__()
        self.lr = lr

        self.training_losses = []
        self.validation_accuracies = []
        self.validation_losses = []

        self.first_net = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.BatchNorm1d(1024),
            nn.LeakyReLU(),
            nn.Dropout(dropout),
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.LeakyReLU(),
            nn.Dropout(dropout),
        )

        self.combined_net = nn.Sequential(
            nn.Linear(512 * 2, 256),
            nn.BatchNorm1d(256),
            nn.LeakyReLU(),
            nn.Dropout(dropout),
            nn.Linear(256, 128),
            nn.BatchNorm1d(128),
            nn.LeakyReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, 3),
        )

        self.accuracy = Accuracy(task="multiclass", num_classes=3)
        self.f1 = F1Score(task="multiclass", num_classes=3, average="macro")

    def forward(self, premise, hypothesis):
        premise_feat = self.first_net(premise)
        hypothesis_feat = self.first_net(hypothesis)
        combined_feat = torch.cat((premise_feat, hypothesis_feat), dim=1)
        return self.combined_net(combined_feat)

    def training_step(self, batch, batch_idx):
        premise, hypothesis, labels = batch
        logits = self(premise, hypothesis)
        loss = nn.CrossEntropyLoss()(logits, labels)
        preds = torch.argmax(logits, dim=1)
        acc = self.accuracy(preds, labels)
        f1 = self.f1(preds, labels)

        self.log("train_loss", loss, prog_bar=True)
        self.log("train_acc", acc, prog_bar=True)

        return loss

    def validation_step(self, batch, batch_idx):
        premise, hypothesis, labels = batch
        logits = self(premise, hypothesis)
        loss = nn.CrossEntropyLoss()(logits, labels)
        preds = torch.argmax(logits, dim=1)
        acc = self.accuracy(preds, labels)
        f1 = self.f1(preds, labels)

        self.log("val_loss", loss, prog_bar=True)
        self.log("val_acc", acc, prog_bar=True)
        try:
            if self.trainer.logged_metrics["train_loss"].cpu().item() not in self.training_losses:
                self.training_losses.append(self.trainer.logged_metrics["train_loss"].cpu().item())
        except:
            pass
        return loss

    def test_step(self, batch, batch_idx):
        premise, hypothesis, labels = batch
        logits = self(premise, hypothesis)
        acc = self.accuracy(logits, labels)
        self.log("test_acc", acc, prog_bar=True)

    def on_train_epoch_end(self):
        train_loss = self.trainer.callback_metrics["train_loss"].item()
        print(f"Epoch {self.current_epoch} - Training Loss: {train_loss:.4f}")

    def on_validation_epoch_end(self):
        avg_val_loss = self.trainer.logged_metrics["val_loss"].cpu().item()

        avg_val_acc = self.trainer.logged_metrics["val_acc"].cpu().item()

        try:
            if self.trainer.logged_metrics["val_acc"].cpu().item() not in self.validation_accuracies:
                self.validation_accuracies.append(self.trainer.logged_metrics["val_acc"].cpu().item())
            if self.trainer.logged_metrics["val_loss"].cpu().item() not in self.validation_losses:
                self.validation_losses.append(self.trainer.logged_metrics["val_loss"].cpu().item())
        except:
            pass


        print(f"[Validation] Epoch {self.current_epoch+1} - Loss: {avg_val_loss:.4f}, Accuracy: {avg_val_acc:.4f}")

    def configure_optimizers(self):
        optimizer = optim.AdamW(self.parameters(), self.lr, weight_decay=1e-3)
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=2, factor=0.5)
        return {"optimizer": optimizer, "lr_scheduler": scheduler, "monitor": "val_loss"}
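
The key structural idea of `SNLIMLP` is that one encoder (`first_net`) is applied to both sentences before their features are concatenated. The scaled-down sketch below illustrates only that pattern; `TinySiamese` and its dimensions are hypothetical and far smaller than the real model.

```python
import torch
import torch.nn as nn

class TinySiamese(nn.Module):
    """Scaled-down illustration: one shared encoder, then a classifier on the concatenation."""

    def __init__(self, input_dim=32, hidden_dim=8, num_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.LeakyReLU())
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, premise, hypothesis):
        # The same encoder (same weights) processes both sentences.
        premise_feat = self.encoder(premise)
        hypothesis_feat = self.encoder(hypothesis)
        return self.classifier(torch.cat((premise_feat, hypothesis_feat), dim=1))

torch.manual_seed(0)
tiny_model = TinySiamese()
logits = tiny_model(torch.rand(4, 32), torch.rand(4, 32))
n_params = sum(p.numel() for p in tiny_model.parameters())

print(logits.shape, n_params)
```

Because the encoder is shared, its parameters are counted once even though it runs twice per forward pass, and both sentences are embedded in the same feature space.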


In [None]:
#@title Model Visualization
from graphviz import Digraph
from IPython.display import display

def draw_snli_classifier():
    dot = Digraph(format="png")
    dot.attr(rankdir='LR')  # Left-to-right layout


    dot.node("P", "Premise Input\n(input_dim = 10007)", shape="rectangle", style="filled", fillcolor="lightblue")
    dot.node("H", "Hypothesis Input\n(input_dim = 10007)", shape="rectangle", style="filled", fillcolor="lightblue")

    dot.node("F1_P", "Linear(10007->1024)\nBatchNorm\nLeakyReLU\nDropout", shape="rectangle", style="filled", fillcolor="lightgreen")
    dot.node("F2_P", "Linear(1024->512)\nBatchNorm\nLeakyReLU\nDropout", shape="rectangle", style="filled", fillcolor="lightgreen")

    dot.node("F1_H", "Linear(10007->1024)\nBatchNorm\nLeakyReLU\nDropout", shape="rectangle", style="filled", fillcolor="lightgreen")
    dot.node("F2_H", "Linear(1024->512)\nBatchNorm\nLeakyReLU\nDropout", shape="rectangle", style="filled", fillcolor="lightgreen")

    dot.node("Concat", "Linear(512 * 2->256)\nBatchNorm\nLeakyReLU\nDropout", shape="rectangle", style="filled", fillcolor="orange")


    dot.node("C1", "Linear(256->128)\nBatchNorm\nLeakyReLU\nDropout", shape="rectangle", style="filled", fillcolor="lightyellow")
    dot.node("C2", "Linear(128->3)", shape="rectangle", style="filled", fillcolor="lightyellow")

    dot.node("Class1", "Entailment (class 0) \n(Softmax)", shape="rectangle", style="filled", fillcolor="lightcoral")
    dot.node("Class2", "Contradiction (class 1)\n(Softmax)", shape="rectangle", style="filled", fillcolor="lightcoral")
    dot.node("Class3", "Neutral (class 2)\n(Softmax)", shape="rectangle", style="filled", fillcolor="lightcoral")

    dot.edge("P", "F1_P")
    dot.edge("H", "F1_H")
    dot.edge("F1_P", "F2_P")
    dot.edge("F1_H", "F2_H")
    dot.edge("F2_P", "Concat")
    dot.edge("F2_H", "Concat")
    dot.edge("Concat", "C1")
    dot.edge("C1", "C2")
    dot.edge("C2", "Class1")
    dot.edge("C2", "Class2")
    dot.edge("C2", "Class3")

    display(dot)

draw_snli_classifier()


## 2.4 Hyperparameter Tuning

In [None]:
def load_model_from_github(model_class, url, hidden_dim = 200, is_MLP = True, device = device):
    response = requests.get(url)
    response.raise_for_status()
    if is_MLP:
        model = model_class()
    else:
        model = model_class(hidden_dim)
    model.load_state_dict(torch.load(BytesIO(response.content), map_location=torch.device(device), weights_only=False))
    model.to(device)
    return model

If you want to **retrain the models**, set `load_models=False`.  

Otherwise, if `load_models=True`, the pre-trained models will be loaded using the `load_model_from_github` function defined above.

In [None]:
load_models = True

In [None]:
param_grid = {
    'learning_rate': [0.001, 0.0001],
    'dropout': [0.1, 0.3]
}

# Track best results
best_val_acc = 0
best_params = {}


if not load_models:
    best_val_acc = 0
    best_params = {}
    for lr in param_grid['learning_rate']:
        for do in param_grid['dropout']:
                print(f"\nlearning_rate={lr}, dropout={do}")

                model = SNLIMLP(
                    input_dim=10007,
                    dropout=do,
                    lr=lr
                )

                trainer = Trainer(
                    max_epochs=5,
                    accelerator="gpu" if torch.cuda.is_available() else "cpu",
                    devices=1,
                    logger=CSVLogger("logs/", name=f"gridsearch_lr{lr}_do{do}"),
                    enable_progress_bar=True
                )

                trainer.fit(model, train_loader, val_loader)

                val_acc = trainer.logged_metrics.get("val_acc")
                model_state = model.state_dict()
                if val_acc and val_acc.item() > best_val_acc:
                    best_val_acc = val_acc.item()
                    best_params = {'learning_rate': lr, 'dropout': do}

                model_path = f"model_LR{lr}_DO{do}.pth"
                torch.save(model_state, model_path)
                print(f"Model saved at {model_path}")

if load_models:
    for lr in param_grid['learning_rate']:
        for do in param_grid['dropout']:
            model = load_model_from_github(SNLIMLP, f'https://raw.github.com/FRAMAX444/Textual-Entailment/main/Hyperparamter_Tuning_1st_Model/model_LR{lr}_DO{do}.pth')
            trainer = Trainer(
                max_epochs=5,
                accelerator="gpu" if torch.cuda.is_available() else "cpu",
                devices=1
            )
            results = trainer.test(model, val_loader)



            val_acc = results[0]['test_acc']

            if val_acc and val_acc > best_val_acc:
                best_val_acc = val_acc
                best_params = {'learning_rate': lr, 'dropout': do}

# Print the Best Hyperparameters
print("\nBest Hyperparameters:", best_params)
print("Best Validation Accuracy:", best_val_acc)

Thus, the best model in terms of **accuracy** is the one with the following hyperparameters:

```{'learning_rate': 0.0001, 'dropout': 0.1}```

It achieved an **accuracy** of ```0.7448689341545105``` on the **Validation Set**.
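
The nested loops of the grid search enumerate the Cartesian product of the parameter lists. A compact sketch of the same selection logic is shown below; `fake_validation_accuracy` is a hypothetical stand-in for "train the model and read `val_acc`", and the scores it returns are invented, not results from this notebook.

```python
from itertools import product

param_grid = {"learning_rate": [0.001, 0.0001], "dropout": [0.1, 0.3]}

def fake_validation_accuracy(learning_rate, dropout):
    """Hypothetical stand-in for training plus validation; the numbers are invented."""
    return 0.70 - 10 * learning_rate - 0.05 * dropout

best_score, best_params = float("-inf"), None
for lr, do in product(param_grid["learning_rate"], param_grid["dropout"]):
    score = fake_validation_accuracy(lr, do)
    if score > best_score:
        best_score, best_params = score, {"learning_rate": lr, "dropout": do}

print(best_params, round(best_score, 3))
```

With two values per hyperparameter the grid contains four configurations, so the cost of the search grows multiplicatively with every value added to either list.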

## 2.5 Training

We now proceed with training the **best model** for **10 epochs** with **Early Stopping** to prevent overfitting and then evaluate its performance on the **Test Set**.


In [None]:
EPOCHS=10
torch.manual_seed(0)
model = SNLIMLP(input_dim=10007, dropout = 0.1 , lr = 0.0001)
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",
    save_top_k=2,
    monitor="val_loss",
    mode="min"
)
early_stopping = EarlyStopping(
    monitor="val_loss",
    mode="min",
    min_delta=0.00,
    patience=3,
    verbose=False
)

trainer = pl.Trainer(
    max_epochs=EPOCHS,
    accelerator="gpu" if torch.cuda.is_available() else "cpu",
    callbacks=[early_stopping, checkpoint_callback],
    log_every_n_steps=1,
    deterministic=True
)

trainer.fit(model, train_loader, val_loader)

trainer.test(model, test_loader)

In [None]:
def plot_training_metrics(model):
    if not model.training_losses or not model.validation_accuracies:
        print("Not enough data to plot. Ensure that at least one epoch has been completed.")
        return

    plt.figure(figsize=(10, 5))
    loss_training = model.training_losses
    loss_validation = model.validation_losses
    accuracy = model.validation_accuracies

    epochs = range(0, len(loss_validation))

    plt.subplot(1, 2, 1)
    plt.plot(epochs[1:],loss_training, marker='o', label="Training Loss",color='blue')
    plt.plot(epochs,loss_validation, marker='o', label="Validation Loss", color='orange')
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.title("Loss Over Epochs")
    plt.legend()


    plt.subplot(1, 2, 2)
    plt.plot(epochs, accuracy, marker='o', label="Validation Accuracy", color='orange')
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy")
    plt.title("Validation Accuracy Over Epochs")
    plt.legend()

    plt.show()


plot_training_metrics(model)

Epoch 0 represents the validation loss and accuracy before the first training epoch.

---
As we can see, the training loss decreases, while the validation loss initially drops but starts to rise again after the third epoch. This indicates that after this point, the model begins to overfit. Similarly, the right-hand graph shows that accuracy increases rapidly before epoch 3 and then stabilizes.

For this reason, we train with Early Stopping (a patience of 3 epochs on the validation loss), which halts training once the validation loss stops improving.
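
The patience rule can be sketched in plain Python. This is a simplified re-implementation for illustration, not PyTorch Lightning's code, and the validation losses are invented; `early_stopping_epoch` is a hypothetical helper.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_loss); stop_epoch is None if training runs to the end."""
    best_loss = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1          # no improvement this epoch
            if wait >= patience:
                return epoch, best_loss
    return None, best_loss

# Invented validation losses: improvement until epoch 2, then three worse epochs.
losses = [0.70, 0.65, 0.63, 0.64, 0.66, 0.67, 0.60]
stop_epoch, best_loss = early_stopping_epoch(losses)

print(stop_epoch, best_loss)
```

Training stops after the third consecutive epoch without improvement, so the later value of 0.60 is never reached; this is the trade-off that the patience setting controls.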

## 2.6 Evaluation

Now, we evaluate the model using the following metrics:

1.   `Accuracy`
2.   `Precision`
3.   `Recall`
4.   `F1-score`



In [None]:
def evaluate_model(model, test_loader, device):

    model.eval()
    model.to(device)
    predictions = []
    true_labels = []

    with torch.no_grad():
        for premise_tokens, hypothesis_tokens, labels in test_loader:

            premise_tokens, hypothesis_tokens, labels = (
                premise_tokens.to(device),
                hypothesis_tokens.to(device),
                labels.to(device),
            )


            logits = model(premise_tokens, hypothesis_tokens)
            preds = torch.argmax(logits, dim=1)


            predictions.extend(preds.cpu().numpy())
            true_labels.extend(labels.cpu().numpy())


    accuracy = accuracy_score(true_labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(true_labels, predictions, average="macro")

    # Confusion Matrix
    conf_matrix = confusion_matrix(true_labels, predictions)

    # Results
    print("Evaluation Metrics:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")

    # Plot Confusion Matrix
    plt.figure(figsize=(6, 5))
    # Tick labels follow label_mapping: 0 = entailment, 1 = contradiction, 2 = neutral
    class_names = ["Entailment", "Contradiction", "Neutral"]
    sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues",
                xticklabels=class_names,
                yticklabels=class_names)
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.title("Confusion Matrix for SNLI Model")
    plt.show()

    return accuracy, precision, recall, f1, conf_matrix


In [None]:
accuracy, precision, recall, f1, conf_matrix = evaluate_model(model, test_loader, device)
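
To clarify what `average="macro"` computes, the sketch below derives the macro-averaged metrics by hand from a confusion matrix. The matrix is invented for illustration; rows are true classes and columns are predicted classes, as in scikit-learn's `confusion_matrix`.

```python
# Invented 3x3 confusion matrix: rows = true class, columns = predicted class.
conf = [
    [9, 1, 0],
    [2, 6, 2],
    [1, 3, 6],
]
n_classes = len(conf)

precisions, recalls, f1s = [], [], []
for k in range(n_classes):
    tp = conf[k][k]
    predicted_k = sum(conf[i][k] for i in range(n_classes))  # column sum
    actual_k = sum(conf[k])                                   # row sum
    p = tp / predicted_k
    r = tp / actual_k
    precisions.append(p)
    recalls.append(r)
    f1s.append(2 * p * r / (p + r))

# Macro averaging: unweighted mean of the per-class values.
macro_precision = sum(precisions) / n_classes
macro_recall = sum(recalls) / n_classes
macro_f1 = sum(f1s) / n_classes
accuracy = sum(conf[k][k] for k in range(n_classes)) / sum(map(sum, conf))

print(accuracy, macro_precision, macro_recall, round(macro_f1, 4))
```

Note that the macro F1 is the mean of the per-class F1 scores, which generally differs from the harmonic mean of the macro precision and macro recall.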

# **3. Improved Model**

The second implemented model, explained in more detail in the report, is based on six key steps:

1.   **Tokenization**
2.   **DataLoader**
3.   **Model Definition**
4.   **Hyperparameter Tuning**
5.   **Training**
6.   **Evaluation**

## 3.1 Tokenization

We transform each sentence into a list of lowercase word tokens, removing every character that is neither alphanumeric nor whitespace.

In [None]:
def tokenize(sentence):
    sentence = sentence.lower()
    cleaned_sentence = "".join(char if char.isalnum() or char.isspace() else "" for char in sentence)
    return cleaned_sentence.split()

In [None]:
tokenize('Ciao, mi chi^=)(&$""/£)"(amo Francesco e sono uno studente!')

We create a vocabulary where:

1. Each word in the **training set** is assigned a unique number.
2. All **unknown words** appearing in the **test/validation set** are mapped to **1**.
3. Each sentence is represented as a **vector of numbers**, where each number corresponds to a word. The maximum length is **20 words**, since the 95th percentile of sentence length over all sentences is **20**, so about **95% of the sentences** fit without truncation.
4. If a sentence has **more than 20 words**, it is truncated to **20**.
5. If a sentence has **fewer than 20 words**, it is **padded with 0s**.

In [None]:
class Vocabulary:

    def __init__(self):
        self.word2idx = {"<PAD>": 0, "<UNK>": 1}
        self.idx2word = ["<PAD>", "<UNK>"]

    def add_sentence(self, sentence):
        for word in tokenize(sentence):
            if word not in self.word2idx:
                self.word2idx[word] = len(self.idx2word)
                self.idx2word.append(word)

    def encode(self, sentence, max_len=20):
        tokens = [self.word2idx.get(word, 1) for word in tokenize(sentence)]
        return tokens[:max_len] + [0] * (max_len - len(tokens))

    def __str__(self):
        vocab_size = len(self.idx2word)

        preview_size = min(10, vocab_size)
        vocab_preview = ", ".join(self.idx2word[:preview_size])
        idx_preview = ", ".join(str(self.word2idx[word]) for word in self.idx2word[:preview_size])

        return (f"Vocabulary(size={vocab_size}):\n"
                f"Words: [{vocab_preview}, ...]\n"
                f"Indices: [{idx_preview}, ...]")


We build the vocabulary and print it.


In [None]:
vocab = Vocabulary()

for _, row in train_df.iterrows():
    vocab.add_sentence(row['sentence1'])
    vocab.add_sentence(row['sentence2'])

print(vocab)

We visualize how a sentence is encoded into the vector described above.


In [None]:
i=37521
test_sentence = train_df.loc[i, 'sentence1']

encoded_sentence = vocab.encode(test_sentence)

print(f"Encoded '{test_sentence}': {encoded_sentence}")
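
The three cases handled by the encoding (padding, truncation, unknown words) can be checked on a toy vocabulary. The sketch below is a standalone simplification of the encoding logic that works on already tokenized input; `encode_tokens` and the toy vocabulary are hypothetical, and `max_len=5` is used only to keep the output short.

```python
PAD, UNK = 0, 1

def encode_tokens(tokens, word2idx, max_len):
    """Map tokens to ids, truncate to max_len, then right-pad with PAD."""
    ids = [word2idx.get(tok, UNK) for tok in tokens][:max_len]
    return ids + [PAD] * (max_len - len(ids))

# Toy vocabulary; ids 0 and 1 are reserved for padding and unknown words.
word2idx = {"<PAD>": 0, "<UNK>": 1, "a": 2, "dog": 3, "runs": 4}

short = encode_tokens(["a", "dog"], word2idx, max_len=5)                            # padded
long = encode_tokens(["a", "dog", "runs", "a", "dog", "runs"], word2idx, max_len=5)  # truncated
unknown = encode_tokens(["a", "cat", "runs"], word2idx, max_len=5)                   # "cat" is unseen

print(short, long, unknown)
```

Every encoded sentence has exactly `max_len` entries, which is what allows the `DataLoader` to stack them into a single tensor per batch.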


## 3.2 DataLoader

We implement our dataset using the `class SNLIDataset(Dataset)`, where the data consists of:

1. **Input**: Two vectors corresponding to the two sentences (premise and hypothesis).
2. **Output**: The `gold_label`.

Additionally, we include the **vocabulary** and set the **maximum sentence length to 20**, as described earlier.

Next, we create three datasets corresponding to the **training, validation, and test sets** using the defined class. Finally, we implement the respective data loaders using `DataLoader` with a **batch size of 512** for the training set and **64** for the validation and test sets.

Moreover, we shuffle (**shuffle=True**) only the **training set** to prevent the model from learning any specific order in the data.


In [None]:
data_train=[
    (row['sentence1'], row['sentence2'], row['gold_label'])
    for _, row in train_df.iterrows()
]

data_val=[
    (row['sentence1'], row['sentence2'], row['gold_label'])
    for _, row in val_df.iterrows()
]

data_test=[
    (row['sentence1'], row['sentence2'], row['gold_label'])
    for _, row in test_df.iterrows()
]

In [None]:
class SNLIDataset(Dataset):
    def __init__(self, data, vocab, max_len=20):
        self.data = data
        self.vocab = vocab
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        premise, hypothesis, label = self.data[idx]
        return (
            torch.tensor(self.vocab.encode(premise, self.max_len), dtype=torch.long),
            torch.tensor(self.vocab.encode(hypothesis, self.max_len), dtype=torch.long),
            torch.tensor(label, dtype=torch.long),
        )

In [None]:
train_dataset = SNLIDataset(data_train, vocab)
val_dataset = SNLIDataset(data_val, vocab)
test_dataset = SNLIDataset(data_test, vocab)

In [None]:
batch_size = 512
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False, num_workers=2, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False, num_workers=2, pin_memory=True)

## 3.3 Model

### Neural Network Architecture

For simplicity, when specifying tensor dimensions, we omit the **batch size** (e.g., with a batch size of 512, a tensor of shape `(512, 20)` is written as `(20)`).

The model takes as **input** two vectors of size **20**, corresponding to the **premise** and the **hypothesis**.

To better understand the architecture, we analyze **each input separately**, starting with the **premise**:

#### **1. Embedding Layer**
Each element (word) in the input vector is passed through an **Embedding layer**, which transforms it into a vector of size **300**.  
Thus, we obtain a tensor of **size (20, 300)**, where:
- `20` is the number of words in the sentence.
- `300` is the embedding dimension for each word.

#### **2. Bidirectional LSTM**
The resulting tensor is passed through a **Bidirectional LSTM**, which processes the sequence **both forward and backward** to capture contextual meaning from previous and subsequent words.  
The LSTM consists of **two layers**, with **dropout applied only after the first layer** (not on the final output).  

The output tensor has size **(20, 400)**, where:
- `20` is the sequence length.
- `200 * 2 = 400` (since the LSTM is bidirectional, it concatenates forward and backward outputs).
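The shapes described in the first two steps can be checked with a short, standalone sketch. The vocabulary size of 1000 and the batch of 4 random sentences are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

# Assumed vocabulary size (1000); embedding and hidden sizes follow the text above
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=300)
lstm = nn.LSTM(300, 200, num_layers=2, batch_first=True,
               bidirectional=True, dropout=0.3)

tokens = torch.randint(0, 1000, (4, 20))   # 4 sentences of 20 token ids each
embedded = embedding(tokens)               # (4, 20, 300)
outputs, (h_n, c_n) = lstm(embedded)       # outputs: (4, 20, 400)

print(embedded.shape, outputs.shape, h_n.shape)
# torch.Size([4, 20, 300]) torch.Size([4, 20, 400]) torch.Size([4, 4, 200])
```

The last dimension of `outputs` is `2 * 200` because the forward and backward hidden states are concatenated at every time step; `h_n` has one entry per layer and direction (`2 * 2 = 4`).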

#### **3. Mean-Max Pooling**
To reduce dimensionality while retaining the most important information, we apply **Mean-Max Pooling**, which performs:

- **Mean Pooling**: Computes the average across the sequence dimension, producing a vector of size **400**.
- **Max Pooling**: Computes the maximum value across the sequence dimension, also producing a vector of size **400**.

We then **concatenate** the two outputs.  
The final output after Mean-Max Pooling has size **400 * 2 = 800**.

We repeat the **same process** for the **hypothesis**.
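As a minimal sketch of the pooling step, the snippet below applies mean and max pooling to a random tensor that stands in for the BiLSTM output (the random values are an assumption used only to show the shapes).

```python
import torch

lstm_output = torch.randn(4, 20, 400)          # (batch, seq_len, 2 * hidden_dim)

mean_pool = lstm_output.mean(dim=1)            # average over the 20 positions -> (4, 400)
max_pool = lstm_output.max(dim=1).values       # maximum over the 20 positions -> (4, 400)
sentence_vector = torch.cat((mean_pool, max_pool), dim=1)

print(sentence_vector.shape)  # torch.Size([4, 800])
```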

---

### **Final Feature Combination**
After processing both inputs, we obtain two vectors of **size 800** (one for the premise and one for the hypothesis). We then compute:

1. **Premise representation** (size **800**).
2. **Hypothesis representation** (size **800**).
3. **Absolute difference** between the two vectors to highlight differences (**size 800**).
4. **Element-wise product** between the two vectors to emphasize similarity (**size 800**).

We **concatenate** these four vectors, resulting in a final **feature vector of size (800 + 800 + 800 + 800 = 3200)**.
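The combination of the four vectors can be sketched as follows. `combine_features` is a hypothetical helper name, and the random inputs stand in for the pooled premise and hypothesis vectors.

```python
import torch

def combine_features(u, v):
    """Concatenate [u, v, |u - v|, u * v] along the feature dimension."""
    return torch.cat((u, v, torch.abs(u - v), u * v), dim=1)

u = torch.randn(4, 800)   # pooled premise vectors
v = torch.randn(4, 800)   # pooled hypothesis vectors
combined = combine_features(u, v)

print(combined.shape)  # torch.Size([4, 3200])
```

With a hidden size of 200, the width is `200 * 2 (directions) * 2 (mean and max) * 4 (features) = 200 * 16 = 3200`, which is why the first linear layer in the model takes `hidden_dim * 16` inputs.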

---

### **Final Classification**
The concatenated vector is passed through a **Multi-Layer Perceptron (MLP)**:
1. First, the **MLP** reduces the vector to size **300**, applying:
   - **ReLU activation**
   - **Dropout**
2. Finally, a fully connected **output layer** maps the **300-dimensional vector** to the **three target classes**:  
   `{Entailment, Contradiction, Neutral}`.


In [None]:
class SNLI_LSTM(pl.LightningModule):
    def __init__(self, hidden_dim=200, vocab_size = len(vocab.idx2word), embedding_dim=300, output_dim=3,
                 num_layers=2, dropout=0.3, learning_rate=1e-3):
        super(SNLI_LSTM, self).__init__()

        self.learning_rate = learning_rate
        self.training_losses = []
        self.validation_accuracies = []
        self.validation_losses = []

        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True, dropout=dropout)

        self.fc = nn.Sequential(
            nn.Linear(hidden_dim * 16, 300),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(300, output_dim)
        )

        self.criterion = nn.CrossEntropyLoss()
        self.accuracy = Accuracy(task="multiclass", num_classes=output_dim)

    def mean_max_pooling(self, lstm_output):
        mean_pool = torch.mean(lstm_output, dim=1)
        max_pool, _ = torch.max(lstm_output, dim=1)
        return torch.cat((mean_pool, max_pool), dim=1)

    def forward(self, premise, hypothesis):
        premise_embed = self.embedding(premise)
        hypothesis_embed = self.embedding(hypothesis)

        premise_output, _ = self.lstm(premise_embed)
        hypothesis_output, _ = self.lstm(hypothesis_embed)

        premise_vector = self.mean_max_pooling(premise_output)
        hypothesis_vector = self.mean_max_pooling(hypothesis_output)

        combined = torch.cat((premise_vector, hypothesis_vector,
                              torch.abs(premise_vector - hypothesis_vector),
                              premise_vector * hypothesis_vector), dim=1)

        output = self.fc(combined)

        return output

    def training_step(self, batch, batch_idx):
        premise, hypothesis, labels = batch
        logits = self(premise, hypothesis)
        loss = self.criterion(logits, labels)
        acc = self.accuracy(logits, labels)
        self.log("train_loss", loss, prog_bar=True)
        self.log("train_acc", acc, prog_bar=True)

        return loss

    def validation_step(self, batch, batch_idx):
        premise, hypothesis, labels = batch
        logits = self(premise, hypothesis)
        loss = self.criterion(logits, labels)
        acc = self.accuracy(logits, labels)
        self.log("val_loss", loss, prog_bar=True)
        self.log("val_acc", acc, prog_bar=True)
        try:
            if self.trainer.logged_metrics["train_loss"].cpu().item() not in self.training_losses:
                self.training_losses.append(self.trainer.logged_metrics["train_loss"].cpu().item())
        except:
            pass

        return loss

    def test_step(self, batch, batch_idx):
        premise, hypothesis, labels = batch
        logits = self(premise, hypothesis)
        acc = self.accuracy(logits, labels)
        self.log("test_acc", acc, prog_bar=True)

    def on_train_epoch_end(self):
        avg_train_loss = self.trainer.logged_metrics["train_loss"].cpu().item()
        print(f"[Train] Epoch {self.current_epoch+1} - Loss: {avg_train_loss:.4f}")

    def on_validation_epoch_end(self):
        avg_val_loss = self.trainer.logged_metrics["val_loss"].cpu().item()
        avg_val_acc = self.trainer.logged_metrics["val_acc"].cpu().item()
        try:
            if self.trainer.logged_metrics["val_acc"].cpu().item() not in self.validation_accuracies:
                self.validation_accuracies.append(self.trainer.logged_metrics["val_acc"].cpu().item())
            if self.trainer.logged_metrics["val_loss"].cpu().item() not in self.validation_losses:
                self.validation_losses.append(self.trainer.logged_metrics["val_loss"].cpu().item())
        except:
            pass

        print(f"[Validation] Epoch {self.current_epoch+1} - Loss: {avg_val_loss:.4f}, Accuracy: {avg_val_acc:.4f}")

    def configure_optimizers(self):
        optimizer = optim.AdamW(self.parameters(), self.learning_rate, weight_decay=1e-3)
        scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=2, factor=0.5)
        return {"optimizer": optimizer, "lr_scheduler": scheduler, "monitor": "val_loss"}


In [None]:
#@title Model Visualization
from graphviz import Digraph
from IPython.display import display

def draw_snli_lstm_classifier():
    dot = Digraph(format="png")
    dot.attr(rankdir='LR')  # Horizontal (left-to-right) layout

    # Input Layers
    dot.node("P", "Premise Input\n(input_dim = 20)", shape="rectangle", style="filled", fillcolor="lightblue")
    dot.node("H", "Hypothesis Input\n(input_dim = 20)", shape="rectangle", style="filled", fillcolor="lightblue")

    # Embedding Layer
    dot.node("Emb_P", "Embedding Layer\n(20,embedding_size = 300)", shape="rectangle", style="filled", fillcolor="lightgreen")
    dot.node("Emb_H", "Embedding Layer\n(20,embedding_size = 300)", shape="rectangle", style="filled", fillcolor="lightgreen")


    # BiLSTM Layer
    dot.node("LSTM_P", "BiLSTM Layer\n(20,200*2)\n2 Layer\nDropout", shape="rectangle", style="filled", fillcolor="lightyellow")
    dot.node("LSTM_H", "BiLSTM Layer\n(20,200*2)\n2 Layer\nDropout", shape="rectangle", style="filled", fillcolor="lightyellow")

    # Pooling
    dot.node("Pool_P", "Mean-Max Pooling (2*400)", shape="rectangle", style="filled", fillcolor="orange")
    dot.node("Pool_H", "Mean-Max Pooling (2*400)", shape="rectangle", style="filled", fillcolor="orange")



    # Feature Combination
    dot.node("Concat", "Feature Concatenation\n(800+800+800+800)", shape="rectangle", style="filled", fillcolor="lightgrey")

    # Fully Connected Layers
    dot.node("FC1", "Linear (3200)\nReLU\nDropout", shape="rectangle", style="filled", fillcolor="lightpink")
    dot.node("FC2", "Linear (300)\n", shape="rectangle", style="filled", fillcolor="lightcoral")

    # Output Classes
    dot.node("Class1", "Entailment (class 0)", shape="rectangle", style="filled", fillcolor="red")
    dot.node("Class2", "Contradiction (class 1)", shape="rectangle", style="filled", fillcolor="red")
    dot.node("Class3", "Neutral (class 2)", shape="rectangle", style="filled", fillcolor="red")

    # Edges
    dot.edge("P", "Emb_P")
    dot.edge("H", "Emb_H")
    dot.edge("Emb_P", "LSTM_P")
    dot.edge("Emb_H", "LSTM_H")
    dot.edge("LSTM_P", "Pool_P")
    dot.edge("LSTM_H", "Pool_H")
    dot.edge("Pool_P", "Concat")
    dot.edge("Pool_H", "Concat")
    dot.edge("Concat", "FC1")
    dot.edge("FC1", "FC2")
    dot.edge("FC2", "Class1")
    dot.edge("FC2", "Class2")
    dot.edge("FC2", "Class3")

    display(dot)

draw_snli_lstm_classifier()


## 3.4 Hyperparameter Tuning

If you want to **retrain the models**, set `load_models=False`.  

Otherwise, if `load_models=True`, the pre-trained models will be loaded using the `load_model_from_github` function defined above.
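The grid searched in this section has three hyperparameters with two values each, i.e. `2 * 2 * 2 = 8` configurations. As an illustration, the same combinations can be enumerated with `itertools.product`; this is an alternative sketch to nested loops, not the code used in the notebook.

```python
from itertools import product

param_grid = {
    "hidden_dim": [200, 300],
    "learning_rate": [0.001, 0.0001],
    "dropout": [0.1, 0.3],
}

keys = list(param_grid)
# One dict per combination, in the same order as three nested for-loops
configs = [dict(zip(keys, values)) for values in product(*param_grid.values())]

print(len(configs))   # 8
print(configs[0])     # {'hidden_dim': 200, 'learning_rate': 0.001, 'dropout': 0.1}
```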


In [None]:
# Track best results
best_val_acc = 0
best_params = {}

param_grid = {
    'hidden_dim': [200, 300],
    'learning_rate': [0.001, 0.0001],
    'dropout': [0.1, 0.3]
}


if not load_models:
    for h_dim in param_grid['hidden_dim']:
        for lr in param_grid['learning_rate']:
            for do in param_grid['dropout']:
                print(f"\nTraining with hidden_dim={h_dim}, learning_rate={lr}, dropout={do}")


                model = SNLI_LSTM(
                    hidden_dim=h_dim,
                    vocab_size=len(vocab.idx2word),
                    dropout=do,
                    learning_rate=lr
                )


                trainer = Trainer(
                    max_epochs=5,
                    accelerator="gpu" if torch.cuda.is_available() else "cpu",
                    devices=1,
                    logger=CSVLogger("logs/", name=f"gridsearch_h{h_dim}_lr{lr}_do{do}"),
                    enable_progress_bar=True,
                )


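                trainer.fit(model, train_loader, val_loader)

                # Evaluate on the validation set so that this configuration
                # can be compared with the others (test_step logs 'test_acc')
                results = trainer.test(model, val_loader)
                val_acc = results[0]['test_acc']
                if val_acc > best_val_acc:
                    best_val_acc = val_acc
                    best_params = {'hidden_dim': h_dim, 'learning_rate': lr, 'dropout': do}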

                # Save Model after training
                model_path = f"model_HD{h_dim}_LR{lr}_DO{do}.pth"
                torch.save(model.state_dict(), model_path)
                print(f"Model saved at {model_path}")

                # Define the path where you want to save the model on Google Drive
                drive_model_path = '/content/drive/MyDrive/model_HD{}_LR{}_DO{}.pth'

                # Save model to Google Drive
                drive_path = drive_model_path.format(h_dim, lr, do)
                torch.save(model.state_dict(), drive_path)
                print(f"Model saved to Google Drive at {drive_path}")

if load_models:
    for h_dim in param_grid['hidden_dim']:
        for lr in param_grid['learning_rate']:
            for do in param_grid['dropout']:
                model = load_model_from_github(SNLI_LSTM, f'https://raw.github.com/FRAMAX444/Textual-Entailment/main/Hyperparamter_Tuning_2nd_Model/model_HD{h_dim}_LR{lr}_DO{do}.pth', h_dim, False)
                trainer = Trainer(
                    max_epochs=5,
                    accelerator="gpu" if torch.cuda.is_available() else "cpu",
                    devices=1
                )
                results = trainer.test(model, val_loader)

                # test_step logs the metric as 'test_acc'; because we pass
                # val_loader, this value is the validation accuracy
                val_acc = results[0]['test_acc']
                print(f"Validation Accuracy for hidden_dim={h_dim}, learning_rate={lr}, dropout={do}: {val_acc}")

                if val_acc and val_acc > best_val_acc:
                    best_val_acc = val_acc
                    best_params = {'hidden_dim': h_dim, 'learning_rate': lr, 'dropout': do}

# Print the Best Hyperparameters
print("\nBest Hyperparameters:", best_params)
print("Best Validation Accuracy:", best_val_acc)


Thus, the best model is the one with the following hyperparameters:

```{'hidden_dim': 300, 'learning_rate': 0.0001, 'dropout': 0.1}```


It achieved an **accuracy** of ```0.8204632997512817``` on the **Validation Set**.


## 3.5 Training

We now proceed with **training the best model for up to 10 epochs** (early stopping may end training sooner) and then evaluate its performance on the **Test Set**.

In [None]:
EPOCHS=10

torch.manual_seed(0)

model = SNLI_LSTM(hidden_dim=300, vocab_size = len(vocab.idx2word), dropout=0.1, learning_rate=0.001)
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",
    save_top_k=2,
    monitor="val_loss",
    mode="min"
)
early_stopping = EarlyStopping(
    monitor="val_loss",
    mode="min",
    min_delta=0.00,
    patience=3,
    verbose=False
)

trainer = Trainer(
    max_epochs=EPOCHS,
    accelerator="gpu" if torch.cuda.is_available() else "cpu",
    callbacks=[early_stopping, checkpoint_callback],
    log_every_n_steps=1,
)

trainer.fit(model, train_loader, val_loader)

trainer.test(model, test_loader)

In [None]:
plot_training_metrics(model)

Epoch 0 represents the validation loss and accuracy before the first training epoch.

---
As we can see, the training loss decreases, while the validation loss initially drops but starts to rise again after the third epoch. This indicates that after this point, the model begins to overfit. Similarly, the right-hand graph shows that accuracy increases rapidly before epoch 3 and then stabilizes.

To limit overfitting, we apply Early Stopping on the validation loss with a patience of 3 epochs.
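The patience rule behind early stopping can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not Lightning's implementation; `early_stopping_epoch` is a hypothetical helper and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss      # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1        # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: the best value is at epoch 2, then the loss rises
losses = [0.70, 0.60, 0.55, 0.56, 0.58, 0.61, 0.65]
print(early_stopping_epoch(losses))  # 5
```

With a patience of 3, training stops after three consecutive epochs without improvement over the best validation loss seen so far.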

## 3.6 Evaluation

Now, we evaluate the model using the following metrics:

1. **Accuracy**
2. **Precision**
3. **Recall**
4. **F1-score**
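As a rough illustration of how these four metrics can be computed with scikit-learn, the sketch below uses made-up labels and predictions. Macro averaging is an assumption here; the averaging used by `evaluate_model` may differ.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Made-up labels and predictions (0 = entailment, 1 = contradiction, 2 = neutral)
y_true = [0, 1, 2, 0, 1, 2, 0, 1]
y_pred = [0, 1, 2, 0, 2, 2, 1, 1]

acc = accuracy_score(y_true, y_pred)
# Macro averaging (assumed): every class contributes equally to the score
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

print(acc)  # 0.75
print(cm)   # rows are true classes, columns are predicted classes
```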


In [None]:
accuracy, precision, recall, f1, conf_matrix = evaluate_model(model, test_loader, device)

# **Conclusions and final report**

## **Objective**

The goal of this project is to determine the relationship between two sentences, referred to as the **premise** and the **hypothesis**, and to classify each pair into one of three categories:

- **Entailment**: The second sentence logically follows from the first.
- **Contradiction**: The second sentence contradicts the first.
- **Neutral**: The second sentence is neither entailed nor contradicted by the first.

To achieve this, we implemented and compared two machine learning models:

- **Baseline Model**: A Multi-Layer Perceptron (MLP) that takes as input sentence vectors generated using **Hashing Vectorization**.
- **Improved Model**: A BiLSTM + MLP model that processes sentences at the **word level**, where each word is passed through an **embedding layer**.

---

## **Data Preprocessing**

The dataset consists of sentence pairs labeled as **entailment, contradiction, or neutral**. The dataset was split as follows:

- **Training set**: 550,152 samples
- **Validation set**: 10,000 samples
- **Test set**: 10,000 samples

Each row contains the **premise, hypothesis**, and **target class**, labeled as *sentence1*, *sentence2*, and *gold_label*, respectively.

### **Filtering Invalid Labels**
We first removed any invalid labels (i.e., those not belonging to the three predefined categories), ensuring **data consistency**. The updated dataset sizes were:

- **Training set**: 549,367 samples
- **Validation set**: 9,842 samples
- **Test set**: 9,824 samples

### **Mapping Labels to Numerical Values**
To facilitate model training, we mapped each label to a numerical value:

- *entailment* → `0`
- *contradiction* → `1`
- *neutral* → `2`
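The filtering and mapping steps can be sketched with pandas as follows. The toy DataFrame and the `-` placeholder for an invalid label are assumptions made for illustration.

```python
import pandas as pd

label_map = {"entailment": 0, "contradiction": 1, "neutral": 2}

# Toy data; "-" stands in for a row without a valid gold label (assumed)
df = pd.DataFrame({"gold_label": ["entailment", "-", "neutral", "contradiction"]})

# Keep only rows whose label is one of the three valid classes, then map to integers
df = df[df["gold_label"].isin(list(label_map))].copy()
df["gold_label"] = df["gold_label"].map(label_map)

print(df["gold_label"].tolist())  # [0, 2, 1]
```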

### **Sentence Length Analysis**
We analyzed sentence lengths and observed that:

- The **95th percentile** of sentence lengths was **23 words** for premises and **13 words** for hypotheses.
- When considering **both premises and hypotheses**, the **95th percentile** of sentence lengths was **20 words**.

This analysis helped determine the **maximum sequence length** for tokenization in the improved model.
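A percentile of sentence lengths can be computed as in the sketch below. `length_percentile` is a hypothetical helper, lengths are counted by whitespace splitting (an assumption), and the toy sentences are not from the dataset.

```python
import numpy as np

def length_percentile(sentences, q=95):
    """Return the q-th percentile of sentence lengths, measured in whitespace-separated words."""
    lengths = [len(s.split()) for s in sentences]
    return float(np.percentile(lengths, q))

# Toy sentences with 1, 2, 3, 4 and 5 words
toy = ["a", "a b", "a b c", "a b c d", "a b c d e"]
print(length_percentile(toy, q=50))  # 3.0
```

Sentences longer than the chosen maximum length are truncated, so picking a high percentile keeps most sentences intact while bounding the padded input size.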

### **Dataset Balance**
A **pie chart** visualization showed that the class distribution is **balanced**, which reduces the risk of the model being biased toward any particular class.

---

## **Baseline Model**

### **Data Handling**
We used **Hashing Vectorization** to convert both sentences into **10,007-dimensional sparse vectors**, ensuring a consistent input size for the model.
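A minimal sketch of this step with scikit-learn's `HashingVectorizer` is shown below. Only `n_features=10007` comes from the text; the remaining settings are library defaults and the example sentences are made up.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# HashingVectorizer is stateless, so no fitting is required
vectorizer = HashingVectorizer(n_features=10007)

sentences = ["A man is playing a guitar.", "A person makes music."]
X = vectorizer.transform(sentences)   # SciPy sparse matrix

print(X.shape)  # (2, 10007)
```

Every sentence is mapped to a vector of the same width regardless of its length, at the cost of possible hash collisions between different words.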

### **Model Architecture**
The model consists of two main networks:

#### **1. First Network (`first_net`)**
This network processes the **premise and hypothesis separately**, reducing the dimensionality of the input representations:

- **Linear Layers**: Project the **original 10,007-dimensional** feature space to **1,024**, then **512 dimensions**.
- **Batch Normalization**: Improves training stability.
- **LeakyReLU Activation**: Introduces non-linearity.
- **Dropout**: Prevents overfitting.

#### **2. Combined Network (`combined_net`)**
This network combines the **premise and hypothesis representations** (each of size **512**) to produce the final classification:

- **Linear Layers**: Transform the concatenated representation (512 × 2 = **1,024**) to **256**, then **128 dimensions**.
- **Batch Normalization**: Stabilizes training.
- **LeakyReLU Activation**: Introduces non-linearity.
- **Dropout**: Reduces overfitting.
- **Final Linear Layer**: Maps the **128-dimensional** representation to a **3-class output**.

---

## **Results**

After training, the baseline model reaches roughly **0.75** on all four metrics (accuracy, precision, recall, and F1-score), as shown in Section 2.6.

---

## **Improved Model**

To improve performance, we implemented a **Bidirectional LSTM (BiLSTM)** with **word embeddings**, as it captures **sequential relationships between words**, unlike the hashing vectorized model.

### **Data Handling**
We performed the following preprocessing steps for each sentence:

1. **Lowercasing text**
2. **Removing special characters**
3. **Splitting sentences into words**
4. **Building a vocabulary dictionary**, mapping words to numerical values
5. **Padding/truncating** each sentence to the **maximum sequence length (20 words)**
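The five steps above can be sketched in plain Python. The special tokens `<pad>` and `<unk>`, the regular expression, and the function names are assumptions for illustration and may differ from the vocabulary class used in the notebook.

```python
import re

PAD, UNK = "<pad>", "<unk>"   # hypothetical special tokens

def tokenize(sentence):
    """Lowercase, drop non-alphanumeric characters, and split on whitespace."""
    sentence = re.sub(r"[^a-z0-9\s]", "", sentence.lower())
    return sentence.split()

def build_vocab(sentences):
    """Assign an integer id to every word, reserving 0 for padding and 1 for unknown words."""
    word2idx = {PAD: 0, UNK: 1}
    for sentence in sentences:
        for word in tokenize(sentence):
            word2idx.setdefault(word, len(word2idx))
    return word2idx

def encode(sentence, word2idx, max_len=20):
    """Map words to ids, truncate to max_len, and right-pad with the padding id."""
    ids = [word2idx.get(w, word2idx[UNK]) for w in tokenize(sentence)][:max_len]
    return ids + [word2idx[PAD]] * (max_len - len(ids))

vocab_demo = build_vocab(["A man plays.", "A dog!"])
print(encode("A cat plays", vocab_demo, max_len=5))  # [2, 1, 4, 0, 0]
```

In this sketch, the unseen word `cat` maps to the unknown id `1`, and the sentence is padded with `0` up to the requested length.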

---

## **Architecture**

For simplicity, we omit the batch size in tensor shapes (e.g., with a batch size of 64, a tensor of shape `(64, 20)` is written as `(20)`).

The model takes as input **two vectors of size 20** (one for the **premise** and one for the **hypothesis**).

We analyze each input separately, starting with the **premise**:

### **1. Embedding Layer**
Each word is passed through an **embedding layer**, converting it into a **300-dimensional vector**.  
The output tensor has shape **(20, 300)**, where:
- `20` = sentence length
- `300` = embedding dimension

### **2. Bidirectional LSTM**
The output is passed to a **BiLSTM**, which reads the sequence **both forward and backward**.  
The final output has shape **(20, 400)**, where:
- `200 × 2 = 400` (since the LSTM is bidirectional).

### **3. Mean-Max Pooling**
To extract the most relevant information, we apply **Mean-Max Pooling**, which computes:
- **Mean Pooling** → size **400**
- **Max Pooling** → size **400**

The two outputs are concatenated, resulting in a **final size of 800**.

The same process is applied to the **hypothesis**.

---

### **Final Feature Combination**
We obtain two feature vectors of size **800** each. We then compute:

1. **Premise representation** → size **800**
2. **Hypothesis representation** → size **800**
3. **Absolute difference** between the two vectors (**800**)
4. **Element-wise product** between the two vectors (**800**)

After concatenation, the final vector has size **3,200**, which is passed through:

- **MLP** (with ReLU activation & Dropout) → reduces to **300 dimensions**.
- **Final Linear Layer** → maps **300** to the **3-class output** `{Entailment, Contradiction, Neutral}`.

---

## **Results**

After training, the improved model reaches roughly **0.81** on all four metrics (accuracy, precision, recall, and F1-score), as shown in Section 3.6.

---

## **Model Comparison**

The **BiLSTM model** outperformed the baseline for four main reasons:

1. **Contextual Understanding**  
   BiLSTMs process text **bidirectionally**, allowing them to capture word meaning based on **preceding and following words**.

2. **Sequential Dependencies**  
   Unlike the **hashing vectorized MLP**, which treats sentences as **bags of words**, the BiLSTM **preserves word order**, improving sentence representation.

3. **Improved Sentence Representations**  
   - Embeddings + BiLSTM produce **dense, meaningful representations**.
   - Mean-Max Pooling extracts the **most relevant** sentence features.
   - The **hashing vectorizer** used in the baseline **introduces noise** due to hash collisions.

4. **Enhanced Relationship Modeling**  
   The BiLSTM+MLP model explicitly calculates:
   - **Absolute difference** between sentence representations (highlighting differences).
   - **Element-wise product** (capturing similarity).
   The baseline model **lacks this information**.

---

## **Limitations**
- **Long Sentences**: BiLSTMs struggle with very long sequences.
- **Ambiguous Sentences**: The model may misclassify ambiguous cases.
- **Out-of-Distribution Data**: Performance drops on unseen sentence structures.
- **Colab GPU Constraints**: Limited hyperparameter tuning due to compute restrictions.

---
## **Potential Improvements**

In the processes described above, improvements can be made at different levels:

### **1. Data Handling** ###  

- **Word Frequency Consideration**: Rare words might provide more information than frequent ones. For example, we could try to eliminate words that appear too often, as they are likely articles or words with little meaningful contribution.

### **2. Hyperparameter Tuning** ###

- **Bayesian Optimization Approach**: The models described above use Grid Search, but Colab's limited runtime and the long training times force a small hyperparameter grid. Approaches such as Bayesian optimization could explore the search space more efficiently within the same compute budget.

- **Expand the Hyperparameter Search Space**: A more exhaustive grid search over a broader range of parameters, including learning rates, dropout rates, and weight decay values, could lead to superior model configurations.

### **3. Model** ###  

- **Transformers**: Transformer models are often more effective at capturing context because self-attention lets every word attend to every other word directly, instead of passing information step by step as an LSTM does. This mitigates information loss over long sentences and the vanishing-gradient issues of recurrent models.

- **Attention Mechanism**: An LSTM with an attention mechanism can improve performance by better capturing the most important words in a sentence. Instead of treating each word equally, the model can focus on key words.

- **Use of Pre-Trained Word Embeddings (e.g., GloVe, FastText, Word2Vec)**  
  Instead of randomly initializing word embeddings, we can use pre-trained vectors. This helps the model understand word meanings better and faster.

- **Word Frequency**: We could consider incorporating word frequency not only in data handling but also in the model, giving more importance to less frequently occurring words.
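As one possible illustration of the attention idea mentioned above, the sketch below replaces mean-max pooling with a learned weighted average over the BiLSTM outputs. `AttentionPooling` is a hypothetical module that was not part of the experiments in this notebook, and the random input only demonstrates the shapes.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Hypothetical pooling layer: a learned weighted average over the sequence."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # one relevance score per time step

    def forward(self, lstm_output):
        scores = self.score(lstm_output).squeeze(-1)     # (batch, seq_len)
        weights = torch.softmax(scores, dim=1)           # weights sum to 1 per sentence
        pooled = torch.bmm(weights.unsqueeze(1), lstm_output).squeeze(1)
        return pooled, weights                           # (batch, dim), (batch, seq_len)

pool = AttentionPooling(dim=400)
lstm_output = torch.randn(4, 20, 400)    # stand-in for the BiLSTM output
pooled, weights = pool(lstm_output)

print(pooled.shape, weights.shape)  # torch.Size([4, 400]) torch.Size([4, 20])
```

The weights indicate how much each position contributes to the sentence vector, so words that the model finds more informative can receive a larger share than padding or function words.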

