# **A4: Do you AGREE?**



In [None]:
import os, math, random
from typing import Dict
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader

from datasets import load_dataset
from transformers import (
    BertConfig, BertForMaskedLM, BertModel, BertTokenizerFast,
    DataCollatorForLanguageModeling, get_linear_schedule_with_warmup
)
from sklearn.metrics import classification_report

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device


device(type='cuda')

## **Task 1: Train BERT from scratch with MLM**
We create a fresh `BertConfig` and train weights from scratch on a public dataset subset.


## **Dataset**
The Masked Language Modeling task uses the English Wikipedia corpus from the Hugging Face Datasets library (`wikimedia/wikipedia`, snapshot `20231101.en`). Wikipedia is a widely used public corpus of collaboratively written articles on a broad range of topics, and its size and diversity make it suitable for pretraining language models such as BERT. To reduce computational cost, only the first 2% of the training split is loaded, articles with 200 characters or fewer are dropped, and up to 30,000 articles are sampled after a seeded shuffle. Loading the data through `load_dataset` with a fixed seed keeps the subset reproducible.

**Source:** Wikimedia Foundation. Wikipedia. Retrieved from https://www.wikipedia.org



In [None]:
from datasets import load_dataset

# 1.1 Load dataset for MLM
DATASET_NAME = "wikimedia/wikipedia"
DATASET_CONFIG = "20231101.en"
MAX_SAMPLES_MLM = 30_000 #sample size
MAX_LENGTH = 128

raw = load_dataset(DATASET_NAME, DATASET_CONFIG, split="train[:2%]")
raw = raw.filter(lambda x: x["text"] is not None and len(x["text"]) > 200)
raw = raw.shuffle(seed=SEED).select(range(min(MAX_SAMPLES_MLM, len(raw))))
raw[0]


{'id': '4322487',
 'url': 'https://en.wikipedia.org/wiki/1985%20World%20Artistic%20Gymnastics%20Championships',
 'title': '1985 World Artistic Gymnastics Championships',
 'text': 'The 23rd Artistic Gymnastics World Championships were held in Montreal, Quebec, Canada, 3 to 10 November 1985.\n\nResults\n\nMen\n\nTeam Final\n\nAll-around\n\nFloor Exercise\n\nPommel Horse\n\nRings\n\nVault\n\nParallel Bars\n\nHorizontal Bar\n\nWomen\n\nTeam Final\n\nAll-around\n\nNeither Shushunova nor Omelianchik had originally qualified to the individual all-around final. However, the Soviet coaches felt they would have a good shot at medalling, so their teammates Olga Mostepanova and Irina Baraksanova were pulled from all individual finals under the guise of injury.\n\nVault\n\nUneven bars\n\nBalance beam\n\nFloor exercise\n\nMedals\n\nReferences\n\nExternal links\nGymn Forum: World Championships Results\nGymnastics\n\nWorld Artistic Gymnastics Championships\nG\nWorld Artistic Gymnastics Championships\n

## **Tokenizer**
The `bert-base-uncased` WordPiece tokenizer converts raw Wikipedia text into BERT inputs. WordPiece breaks words into subword units, which helps the model handle rare or unseen words. Each text sample is tokenized into input IDs, token type IDs, and an attention mask, truncated to `MAX_LENGTH` tokens and padded to that same fixed length so that batches have consistent shapes. The dataset `map` function runs in batched mode for faster preprocessing, and the original text columns are removed so that only the fields needed for MLM training remain.
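To make the subword idea concrete, here is a minimal sketch of the greedy longest-match-first splitting that WordPiece applies to a single word. The vocabulary `TOY_VOCAB` and the helper `wordpiece` are hypothetical and exist only for illustration; the real tokenizer uses the full `bert-base-uncased` vocabulary and also handles normalisation and punctuation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into subword pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocab
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (not the real BERT vocabulary)
TOY_VOCAB = {"gym", "##nas", "##tics", "play", "##ing", "the"}

print(wordpiece("gymnastics", TOY_VOCAB))
print(wordpiece("playing", TOY_VOCAB))
print(wordpiece("zebra", TOY_VOCAB))
```

A word that is not in the vocabulary as a whole can still be represented by pieces that are, which is why rare words rarely collapse to `[UNK]` with a large vocabulary.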


In [None]:

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def tokenize_for_mlm(batch):
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=MAX_LENGTH,
        padding="max_length",
    )

tokenized = raw.map(tokenize_for_mlm, batched=True, remove_columns=raw.column_names)
tokenized


Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 30000
})

## **Build a small BERT model from scratch and train with MLM**

### **Training**

In this step, we create our own small BERT model from scratch using BertConfig, instead of downloading a pretrained BERT. We make the model smaller (fewer layers and smaller hidden size) so it can train faster on our computer. Then we use it for Masked Language Modeling, where the model learns by guessing missing words in sentences. This training helps the model learn general language patterns, and we later reuse the trained encoder for the Sentence BERT NLI task in the project.
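The "guessing missing words" step can be sketched in plain Python. The function below is a simplified illustration of the masking rule used by BERT-style MLM, assuming the usual defaults: 15% of non-special tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. The helper name `mask_tokens` and the toy token IDs are assumptions for this example; the notebook itself relies on `DataCollatorForLanguageModeling` for this step.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids,
                mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) for one sequence of token IDs."""
    rng = rng or random.Random(0)
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        # Special tokens are never selected; unselected positions get label -100,
        # which the cross entropy loss ignores.
        if tok in special_ids or rng.random() >= mlm_probability:
            labels.append(-100)
            continue
        labels.append(tok)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token in the input
    return inputs, labels


# Toy sequence: [CLS]=101, [SEP]=102, [MASK]=103, [PAD]=0 as in bert-base-uncased
sequence = [101, 2023, 2003, 1037, 2460, 6251, 102, 0, 0]
masked, targets = mask_tokens(sequence, mask_id=103, vocab_size=30522,
                              special_ids={0, 101, 102})
print(masked)
print(targets)
```

Only positions whose label is not `-100` contribute to the loss, so each batch trains on roughly 15% of its real tokens.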

In [None]:

bert_config = BertConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
    max_position_embeddings=MAX_LENGTH,
    type_vocab_size=2,
)

mlm_model = BertForMaskedLM(bert_config).to(device)
print("Parameters (M):", round(sum(p.numel() for p in mlm_model.parameters()) / 1e6, 2))


Parameters (M): 11.1
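The reported parameter count can be checked by hand from the configuration. The sketch below assumes the standard BERT layout as implemented in `BertForMaskedLM`: no pooler, and an MLM decoder whose weight matrix is tied to the word embeddings, so only its bias adds parameters. The function name `bert_mlm_param_count` is made up for this example, and the vocabulary size 30,522 is that of `bert-base-uncased`.

```python
def bert_mlm_param_count(vocab, hidden, layers, intermediate, max_pos, type_vocab):
    """Count trainable parameters of a BERT MLM model with tied decoder weights."""
    layer_norm = 2 * hidden  # scale and shift vectors

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix plus bias

    # Word, position and token-type embeddings, followed by one LayerNorm
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

    per_layer = (
        4 * linear(hidden, hidden)      # query, key, value, attention output
        + layer_norm                    # LayerNorm after attention
        + linear(hidden, intermediate)  # feed-forward expansion
        + linear(intermediate, hidden)  # feed-forward projection
        + layer_norm                    # LayerNorm after feed-forward
    )

    # MLM head: dense transform + LayerNorm + decoder bias (decoder weight is tied)
    mlm_head = linear(hidden, hidden) + layer_norm + vocab

    return embeddings + layers * per_layer + mlm_head


total = bert_mlm_param_count(
    vocab=30522, hidden=256, layers=4, intermediate=1024, max_pos=128, type_vocab=2
)
print(total, round(total / 1e6, 2))
```

About 7.8 of the 11.1 million parameters sit in the word embedding table, so at this model size the vocabulary dominates the parameter budget rather than the four Transformer layers.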


In [None]:
BATCH_SIZE = 16
EPOCHS_MLM = 3
LR = 5e-4
WARMUP_RATIO = 0.06
WEIGHT_DECAY = 0.01

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

train_loader = DataLoader(
    tokenized,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=data_collator
)

optimizer = torch.optim.AdamW(mlm_model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)

total_steps = len(train_loader) * EPOCHS_MLM
warmup_steps = int(total_steps * WARMUP_RATIO)

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)

mlm_model.train()
global_step = 0

for epoch in range(EPOCHS_MLM):
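    running = 0.0
    running_steps = 0
    epoch_loss_sum = 0.0
    epoch_steps = 0

    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}

        out = mlm_model(**batch)
        loss = out.loss

        loss.backward()
        torch.nn.utils.clip_grad_norm_(mlm_model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

        loss_val = loss.item()
        running += loss_val
        running_steps += 1
        epoch_loss_sum += loss_val
        epoch_steps += 1
        global_step += 1

        if global_step % 200 == 0:
            # Divide by the steps actually accumulated: `running` resets at each
            # epoch start, so the first window of an epoch can be shorter than 200
            # and dividing by a fixed 200 would under-report the loss there.
            print(f"epoch {epoch+1} step {global_step} avg_loss {running/running_steps:.4f}")
            running = 0.0
            running_steps = 0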

    # print end-of-epoch result
    print(f"Epoch {epoch+1} finished | avg_loss {epoch_loss_sum/epoch_steps:.4f}")

print("MLM training finished")


epoch 1 step 200 avg_loss 8.5980
epoch 1 step 400 avg_loss 7.3643
epoch 1 step 600 avg_loss 7.1840
epoch 1 step 800 avg_loss 7.0428
epoch 1 step 1000 avg_loss 6.9440
epoch 1 step 1200 avg_loss 6.8507
epoch 1 step 1400 avg_loss 6.7844
epoch 1 step 1600 avg_loss 6.6817
epoch 1 step 1800 avg_loss 6.6595
Epoch 1 finished | avg_loss 7.1038
epoch 2 step 2000 avg_loss 4.1114
epoch 2 step 2200 avg_loss 6.5493
epoch 2 step 2400 avg_loss 6.5028
epoch 2 step 2600 avg_loss 6.4561
epoch 2 step 2800 avg_loss 6.4380
epoch 2 step 3000 avg_loss 6.4157
epoch 2 step 3200 avg_loss 6.4014
epoch 2 step 3400 avg_loss 6.3442
epoch 2 step 3600 avg_loss 6.3491
Epoch 2 finished | avg_loss 6.4355
epoch 3 step 3800 avg_loss 1.5705
epoch 3 step 4000 avg_loss 6.2604
epoch 3 step 4200 avg_loss 6.2261
epoch 3 step 4400 avg_loss 6.2099
epoch 3 step 4600 avg_loss 6.1865
epoch 3 step 4800 avg_loss 6.2157
epoch 3 step 5000 avg_loss 6.1524
epoch 3 step 5200 avg_loss 6.1371
epoch 3 step 5400 avg_loss 6.1195
epoch 3 step 560
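The learning rate used during MLM training is not constant: `get_linear_schedule_with_warmup` raises it linearly from zero to the peak value over the warmup steps and then lowers it linearly back to zero. A minimal sketch of that shape, using this notebook's settings (30,000 samples, batch size 16, 3 epochs, warmup ratio 0.06, peak 5e-4), is given below. The function `linear_warmup_lr` is a hand-written stand-in for illustration, not the library code.

```python
def linear_warmup_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


STEPS_PER_EPOCH = 30_000 // 16             # 30,000 samples in batches of 16
TOTAL_STEPS = STEPS_PER_EPOCH * 3          # 3 epochs
WARMUP_STEPS = int(TOTAL_STEPS * 0.06)     # same rounding as in the training cell

for step in (0, 100, WARMUP_STEPS, 2000, 4000, TOTAL_STEPS):
    lr = linear_warmup_lr(step, 5e-4, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>5}  lr {lr:.6f}")
```

With these settings the peak rate is reached after a few hundred updates, so most of training runs on the decaying part of the schedule.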

In [None]:
#Saving for Task-2
SAVE_DIR_BERT = "models/bert_mlm_small"
os.makedirs(SAVE_DIR_BERT, exist_ok=True)

mlm_model.save_pretrained(SAVE_DIR_BERT)
tokenizer.save_pretrained(SAVE_DIR_BERT)

print("Saved:", SAVE_DIR_BERT)


Saved: models/bert_mlm_small


## **Task 2: Sentence BERT style Siamese NLI with SoftmaxLoss**
The model predicts one of three possible NLI labels: entailment, neutral, or contradiction. Premise and hypothesis are encoded separately by the same (shared) encoder into sentence embeddings u and v, and a linear classifier operates on the concatenation [u, v, |u - v|]. The classifier outputs a score for each class, called logits. These logits are passed through the softmax function, which converts them into probabilities that sum to one. Softmax allows us to interpret the model's output as how confident it is in each class, and the class with the highest probability is selected as the final prediction.

During training, cross entropy loss (SoftmaxLoss) compares the predicted probabilities with the true label. This loss penalises the model when it assigns a low probability to the correct class and encourages it to increase confidence in the right prediction over time. SoftmaxLoss is well suited for multi-class classification tasks like Natural Language Inference, where each input belongs to exactly one class.
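As a small worked example of the two steps described above, the following plain-Python sketch computes softmax probabilities and the cross entropy loss for a single example. The logit values are made up for illustration; in the notebook the same computation is performed by `nn.CrossEntropyLoss` on batches of logits.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target])


# Hypothetical logits for (entailment, neutral, contradiction)
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
print([round(p, 3) for p in probs])
print("loss if label is entailment:   ", round(cross_entropy(logits, 0), 4))
print("loss if label is contradiction:", round(cross_entropy(logits, 2), 4))

# A model with no preference between the three classes
print("uninformed loss:", round(cross_entropy([0.0, 0.0, 0.0], 0), 4))
```

With three classes, a model that assigns equal probability to every label has a loss of ln 3, about 1.0986, which is a handy reference point when reading the first logged losses of the NLI training run.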



In [None]:

# 2.1 Load SNLI
snli = load_dataset("snli")

def keep_valid(ex):
    return ex["label"] != -1 and ex["premise"] and ex["hypothesis"]

snli_train = snli["train"].filter(keep_valid)
snli_val = snli["validation"].filter(keep_valid)
snli_test = snli["test"].filter(keep_valid)

TRAIN_SAMPLES = 30_000 #sample size
VAL_SAMPLES = 3_000
TEST_SAMPLES = 6_000

snli_train = snli_train.shuffle(seed=SEED).select(range(min(TRAIN_SAMPLES, len(snli_train))))
snli_val = snli_val.shuffle(seed=SEED).select(range(min(VAL_SAMPLES, len(snli_val))))
snli_test = snli_test.shuffle(seed=SEED).select(range(min(TEST_SAMPLES, len(snli_test))))

snli_train[0]


{'premise': 'A group of people riding a yellow roller coaster.',
 'hypothesis': 'A group of people are riding a roller coaster.',
 'label': 0}

In [None]:

# 2.2 Model: shared encoder, mean pooling, classifier on [u, v, |u - v|]
encoder = BertModel.from_pretrained(SAVE_DIR_BERT).to(device)

class MeanPooling(nn.Module):
    def forward(self, last_hidden_state, attention_mask):
        mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)
        summed = (last_hidden_state * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-9)
        return summed / counts

class SiameseNLI(nn.Module):
    def __init__(self, encoder, hidden_size, num_labels=3, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.pool = MeanPooling()
        self.drop = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size * 3, num_labels)

    def encode(self, input_ids, attention_mask, token_type_ids=None):
        out = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        return self.pool(out.last_hidden_state, attention_mask)

    def forward(self, a, b):
        u = self.encode(**a)
        v = self.encode(**b)
        feats = torch.cat([u, v, torch.abs(u - v)], dim=1)
        logits = self.fc(self.drop(feats))
        return logits, u, v

sbert_nli = SiameseNLI(encoder, hidden_size=bert_config.hidden_size).to(device)


BertModel LOAD REPORT from: models/bert_mlm_small
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
pooler.dense.weight                        | MISSING    | 
pooler.dense.bias                          | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
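The two building blocks of the Siamese model, masked mean pooling and the `[u, v, |u - v|]` feature vector, can be illustrated on tiny hand-written vectors. This is a plain-Python sketch with made-up numbers, and the helper names `masked_mean` and `pair_features` are assumptions for the example; the model above does the same thing on batched tensors.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0 (padding)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    count = max(len(kept), 1)  # guard against an all-padding sequence
    return [sum(vec[d] for vec in kept) / count for d in range(dim)]


def pair_features(u, v):
    """Concatenate u, v and their element-wise absolute difference."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]


# Toy "premise": two real tokens and one padding token with arbitrary values
premise_tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
premise_mask = [1, 1, 0]
u = masked_mean(premise_tokens, premise_mask)

# Toy "hypothesis": a single real token
v = masked_mean([[1.0, 5.0]], [1])

features = pair_features(u, v)
print(u, v, features)
```

The padding vector has no influence on `u`, and the feature vector is three times the embedding size, which is why the classifier in `SiameseNLI` is defined as `nn.Linear(hidden_size * 3, num_labels)`.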


In [None]:
# 2.3 Tokenize pairs (final run settings)
# Goal: train properly without being too heavy on a normal laptop
# Uses a moderate SNLI subset that is big enough to learn patterns but still manageable

NLI_MAX_LEN = 16

def tokenize_pair(batch):
    a = tokenizer(batch["premise"], truncation=True, max_length=NLI_MAX_LEN, padding="max_length")
    b = tokenizer(batch["hypothesis"], truncation=True, max_length=NLI_MAX_LEN, padding="max_length")
    return {
        "a_input_ids": a["input_ids"],
        "a_attention_mask": a["attention_mask"],
        "a_token_type_ids": a.get("token_type_ids", [[0]*len(x) for x in a["input_ids"]]),
        "b_input_ids": b["input_ids"],
        "b_attention_mask": b["attention_mask"],
        "b_token_type_ids": b.get("token_type_ids", [[0]*len(x) for x in b["input_ids"]]),
        "labels": batch["label"],
    }

# Moderate sizes for real learning (still feasible on a laptop)
TRAIN_N = 30_000
VAL_N   = 3_000
TEST_N  = 6_000

snli_train_small = snli_train.shuffle(seed=SEED).select(range(min(TRAIN_N, len(snli_train))))
snli_val_small   = snli_val.shuffle(seed=SEED).select(range(min(VAL_N, len(snli_val))))
snli_test_small  = snli_test.shuffle(seed=SEED).select(range(min(TEST_N, len(snli_test))))

train_tok = snli_train_small.map(tokenize_pair, batched=True, remove_columns=snli_train_small.column_names)
val_tok   = snli_val_small.map(tokenize_pair, batched=True, remove_columns=snli_val_small.column_names)
test_tok  = snli_test_small.map(tokenize_pair, batched=True, remove_columns=snli_test_small.column_names)

train_tok.set_format(type="torch")
val_tok.set_format(type="torch")
test_tok.set_format(type="torch")

def collate(features):
    return {k: torch.stack([f[k] for f in features]) for k in features[0].keys()}

# Batch size for training
# If you have GPU, you can try 16 or 32
NLI_BATCH = 16

train_loader = DataLoader(train_tok, batch_size=NLI_BATCH, shuffle=True, collate_fn=collate)
val_loader   = DataLoader(val_tok, batch_size=NLI_BATCH, shuffle=False, collate_fn=collate)
test_loader  = DataLoader(test_tok, batch_size=NLI_BATCH, shuffle=False, collate_fn=collate)

print("Train examples:", len(train_tok), "| steps per epoch:", len(train_loader))
print("Val examples:", len(val_tok), "| steps:", len(val_loader))
print("Test examples:", len(test_tok), "| steps:", len(test_loader))


Train examples: 30000 | steps per epoch: 1875
Val examples: 3000 | steps: 188
Test examples: 6000 | steps: 375


In [None]:
import os, torch

SAVE_DIR = "models/sbert_nli"
os.makedirs(SAVE_DIR, exist_ok=True)

# save tokenizer
tokenizer.save_pretrained(SAVE_DIR)

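# save model weights
# (note: this cell sits before the fine-tuning in 2.4, so the weights saved here are not yet NLI-trained)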
torch.save(sbert_nli.state_dict(), os.path.join(SAVE_DIR, "sbert_nli.pt"))

print("Saved SBERT NLI model to:", SAVE_DIR)


Saved SBERT NLI model to: models/sbert_nli


In [None]:
# show a few labels to confirm the mapping
# Convert to plain ints first: tensors hash by identity, so a set of tensors never de-duplicates
print("Unique labels in train:", sorted({int(y) for y in train_tok["labels"][:2000]}))
print("Example label values:", [int(y) for y in train_tok["labels"][:20]])

In [None]:
# 2.4 Train with SoftmaxLoss (cross entropy)
EPOCHS_NLI = 3        # change this to 1, 2, 3, 5 etc
LR_NLI = 2e-5
PRINT_EVERY = 100       # prints progress every 100 batches

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(sbert_nli.parameters(), lr=LR_NLI)

def run_epoch(model, loader, train=True, print_every=100):
    model.train() if train else model.eval()

    total_loss = 0.0
    total = 0
    correct = 0

    running_loss = 0.0
    running_total = 0
    running_correct = 0

    for step, batch in enumerate(loader, start=1):
        y = batch["labels"].to(device)

        a = {
            "input_ids": batch["a_input_ids"].to(device),
            "attention_mask": batch["a_attention_mask"].to(device),
            "token_type_ids": batch["a_token_type_ids"].to(device),
        }
        b = {
            "input_ids": batch["b_input_ids"].to(device),
            "attention_mask": batch["b_attention_mask"].to(device),
            "token_type_ids": batch["b_token_type_ids"].to(device),
        }

        with torch.set_grad_enabled(train):
            logits, _, _ = model(a, b)
            loss = criterion(logits, y)

            if train:
                optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                optimizer.step()

        bs = y.size(0)
        loss_val = loss.item()

        total_loss += loss_val * bs
        total += bs
        correct += (logits.argmax(dim=1) == y).sum().item()

        running_loss += loss_val * bs
        running_total += bs
        running_correct += (logits.argmax(dim=1) == y).sum().item()

        if train and (step % print_every == 0):
            print(
                f"  step {step:>5} | "
                f"avg_loss {running_loss/running_total:.4f} | "
                f"acc {running_correct/running_total:.4f}"
            )
            running_loss = 0.0
            running_total = 0
            running_correct = 0

    return total_loss / total, correct / total


for e in range(EPOCHS_NLI):
    print(f"\nEpoch {e+1}/{EPOCHS_NLI}")
    tr_loss, tr_acc = run_epoch(sbert_nli, train_loader, train=True, print_every=PRINT_EVERY)
    va_loss, va_acc = run_epoch(sbert_nli, val_loader, train=False)
    print(
        f"Epoch {e+1} finished | "
        f"train_loss {tr_loss:.4f} train_acc {tr_acc:.4f} | "
        f"val_loss {va_loss:.4f} val_acc {va_acc:.4f}"
    )



Epoch 1/3
  step   100 | avg_loss 1.0958 | acc 0.3762
  step   200 | avg_loss 1.0841 | acc 0.3744
  step   300 | avg_loss 1.0785 | acc 0.4169
  step   400 | avg_loss 1.0516 | acc 0.4450
  step   500 | avg_loss 1.0480 | acc 0.4587
  step   600 | avg_loss 1.0396 | acc 0.4556
  step   700 | avg_loss 1.0273 | acc 0.4681
  step   800 | avg_loss 1.0368 | acc 0.4644
  step   900 | avg_loss 0.9976 | acc 0.5044
  step  1000 | avg_loss 1.0150 | acc 0.4819
  step  1100 | avg_loss 0.9932 | acc 0.4894
  step  1200 | avg_loss 1.0019 | acc 0.5025
  step  1300 | avg_loss 0.9746 | acc 0.5162
  step  1400 | avg_loss 0.9929 | acc 0.5081
  step  1500 | avg_loss 0.9630 | acc 0.5413
  step  1600 | avg_loss 0.9856 | acc 0.5194
  step  1700 | avg_loss 0.9729 | acc 0.5356
  step  1800 | avg_loss 0.9806 | acc 0.5075
Epoch 1 finished | train_loss 1.0160 train_acc 0.4787 | val_loss 0.9413 val_acc 0.5537

Epoch 2/3
  step   100 | avg_loss 0.9393 | acc 0.5481
  step   200 | avg_loss 0.9375 | acc 0.5556
  step   30

In [None]:

# 2.5 Save SBERT NLI model
SAVE_DIR_SBERT = "models/sbert_nli_small"
os.makedirs(SAVE_DIR_SBERT, exist_ok=True)

torch.save(
    {
        "bert_dir": SAVE_DIR_BERT,
        "state_dict": sbert_nli.state_dict(),
        "hidden_size": bert_config.hidden_size,
        "max_len": NLI_MAX_LEN,
        "labels": ["entailment", "neutral", "contradiction"],
    },
    os.path.join(SAVE_DIR_SBERT, "sbert_nli.pt"),
)

tokenizer.save_pretrained(SAVE_DIR_SBERT)
print("Saved:", SAVE_DIR_SBERT)


Saved: models/sbert_nli_small


## **Task 3: Evaluation**

The trained Sentence BERT NLI model is evaluated on the test dataset. The model is set to evaluation mode and predictions are generated without computing gradients to improve efficiency. For each batch, the model predicts the most likely NLI label by selecting the class with the highest score. The true labels and predicted labels are collected across the entire test set and combined into final arrays. A classification report is then generated to measure precision, recall, F1 score, and accuracy for entailment, neutral, and contradiction. This evaluation shows how well the model generalises to unseen data and provides a clear summary of its performance.
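For reference, the per-class numbers in a classification report follow directly from counts of true positives, false positives, and false negatives. The sketch below computes them by hand on a small made-up set of labels; the function `per_class_metrics` and the toy data are illustrative assumptions, while the notebook itself uses `sklearn.metrics.classification_report`.

```python
def per_class_metrics(y_true, y_pred, num_classes):
    """Return {class: (precision, recall, f1)} computed from raw label lists."""
    report = {}
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = (precision, recall, f1)
    return report


# Toy labels: 0 = entailment, 1 = neutral, 2 = contradiction
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

toy_report = per_class_metrics(toy_true, toy_pred, num_classes=3)
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

for label, (p, r, f1) in toy_report.items():
    print(f"class {label}: precision {p:.2f} recall {r:.2f} f1 {f1:.2f}")
print(f"accuracy {toy_accuracy:.2f}")
```

Precision asks how many predictions of a class were correct, recall asks how many true members of the class were found, and F1 is their harmonic mean, so a class only scores well when both are high.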


In [None]:

label_names = ["entailment", "neutral", "contradiction"]

def predict_all(model, loader):
    model.eval()
    ys, ps = [], []
    with torch.no_grad():
        for batch in loader:
            y = batch["labels"].to(device)
            a = {
                "input_ids": batch["a_input_ids"].to(device),
                "attention_mask": batch["a_attention_mask"].to(device),
                "token_type_ids": batch["a_token_type_ids"].to(device),
            }
            b = {
                "input_ids": batch["b_input_ids"].to(device),
                "attention_mask": batch["b_attention_mask"].to(device),
                "token_type_ids": batch["b_token_type_ids"].to(device),
            }
            logits, _, _ = model(a, b)
            p = logits.argmax(dim=1)

            ys.append(y.cpu().numpy())
            ps.append(p.cpu().numpy())

    return np.concatenate(ys), np.concatenate(ps)

y_true, y_pred = predict_all(sbert_nli, test_loader)
print(classification_report(y_true, y_pred, target_names=label_names, digits=2))


               precision    recall  f1-score   support

   entailment       0.66      0.64      0.65      2047
      neutral       0.59      0.64      0.62      1941
contradiction       0.62      0.59      0.60      2012

     accuracy                           0.62      6000
    macro avg       0.62      0.62      0.62      6000
 weighted avg       0.62      0.62      0.62      6000



## **Conclusion**
The evaluation results show that the Sentence BERT based NLI model achieves an overall accuracy of 62% on the 6,000 test pairs, well above the roughly 33% expected from guessing among three nearly balanced classes. Precision, recall, and F1 are similar across the three classes (F1 between 0.60 and 0.65), so the model does not strongly favour a single label. Entailment has the highest F1 (0.65) and contradiction the lowest (0.60), plausibly because contradiction depends on fine-grained semantic differences and negation.

The macro averaged and weighted metrics agree at 0.62, which reflects the nearly balanced class supports (2,047, 1,941, and 2,012). Although the results do not reach the performance of large scale pretrained models, they are in line with what can be expected from the reduced training data, lightweight encoder architecture, and limited training epochs. Overall, the results indicate that the two stage approach of Masked Language Modeling followed by Sentence BERT fine tuning works end to end and meets the objectives of the assignment.

In [None]:
import os, torch

SAVE_DIR = "/content/models/sbert_nli"
os.makedirs(SAVE_DIR, exist_ok=True)

# save tokenizer
tokenizer.save_pretrained(SAVE_DIR)

# save config file for encoder
sbert_nli.encoder.config.save_pretrained(SAVE_DIR)

# save weights
torch.save(sbert_nli.state_dict(), os.path.join(SAVE_DIR, "sbert_nli.pt"))

print("Saved to:", SAVE_DIR)
!ls -la /content/models/sbert_nli


Saved to: /content/models/sbert_nli
total 44004
drwxr-xr-x 2 root root     4096 Feb 15 04:23 .
drwxr-xr-x 5 root root     4096 Feb 15 04:23 ..
-rw-r--r-- 1 root root      673 Feb 15 04:47 config.json
-rw-r--r-- 1 root root 44328870 Feb 15 04:47 sbert_nli.pt
-rw-r--r-- 1 root root      322 Feb 15 04:47 tokenizer_config.json
-rw-r--r-- 1 root root   711659 Feb 15 04:47 tokenizer.json


In [None]:
!cd /content/models && zip -r -X sbert_nli_mac.zip sbert_nli
!ls -lh /content/models/sbert_nli_mac.zip

from google.colab import files
files.download("/content/models/sbert_nli_mac.zip")


updating: sbert_nli/ (stored 0%)
updating: sbert_nli/config.json (deflated 50%)
updating: sbert_nli/tokenizer.json (deflated 71%)
updating: sbert_nli/tokenizer_config.json (deflated 42%)
updating: sbert_nli/sbert_nli.pt (deflated 8%)
-rw-r--r-- 1 root root 40M Feb 15 04:47 /content/models/sbert_nli_mac.zip


### **Limitations and improvements**

In this project, I was able to build a complete pipeline that trains a small BERT model from scratch using Masked Language Modeling and then fine tunes it into a Sentence BERT style NLI classifier. The final evaluation achieved an overall accuracy of 62%, and the class wise results were fairly balanced across entailment, neutral, and contradiction. This tells me the model learned meaningful patterns rather than predicting only one label, and it can reasonably distinguish different sentence relationships.
At the same time, I can see clear limits in the results, especially for contradiction, which is usually harder because it requires stronger understanding of negation and fine meaning differences. Because of time and compute constraints, I could not train on a much larger corpus or run many more epochs, and I also had to cap dataset sizes to keep the workflow practical. These constraints likely prevented the model from reaching higher accuracy.
Overall, I’m satisfied that the project meets the assignment goals and demonstrates the full process from pretraining to fine tuning to deployment. If I had more time, I would train on a larger subset of the MLM corpus, increase the SNLI training samples, run more epochs, and experiment with pooling strategies or a slightly larger encoder to improve generalisation, especially for contradiction cases.