# Transformer Model — DistilRoBERTa for ANLI R2 NLI

**Goal:** Fine-tune a transformer model (DistilRoBERTa) for 3-way NLI classification (entailment, neutral, contradiction) on ANLI Round 2.

**Workflow:**
1. Load ANLI R2
2. Tokenize text
3. Train using HuggingFace Trainer
4. Evaluate (accuracy, macro F1, confusion matrix)
5. Error analysis
6. Save model

In [1]:
import transformers
print("Transformers version:", transformers.__version__)

from transformers import Trainer, TrainingArguments
print("Trainer and TrainingArguments imported successfully!")


Transformers version: 4.57.3
Trainer and TrainingArguments imported successfully!


## 1. Imports

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
import numpy as np
import torch

# Try to import local preprocessing utilities
try:
    from src.data_loading import load_anli_r2
    from src.preprocessing import get_tokenizer, tokenize_batch
except ImportError:
    # Fallback for Colab, where the local src/ package is not available
    def load_anli_r2():
        ds = load_dataset("facebook/anli", "plain_text")
        return ds["train_r2"], ds["dev_r2"], ds["test_r2"]

    def get_tokenizer(model_name="distilroberta-base"):
        return AutoTokenizer.from_pretrained(model_name)

    def tokenize_batch(batch, tokenizer):
        return tokenizer(
            batch["premise"],
            batch["hypothesis"],
            truncation=True,
            padding="max_length",
            max_length=256
        )
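
The fallback `tokenize_batch` pads every premise/hypothesis pair to 256 tokens. As a rough sketch of what that can cost compared with padding each batch only to its longest member (which `DataCollatorWithPadding` in transformers can do), the snippet below counts processed tokens under both schemes. The token lengths and the batch size of 4 are made-up values for illustration, not measurements on ANLI.

```python
# Hypothetical token counts for eight premise/hypothesis pairs (not measured on ANLI).
pair_lengths = [48, 75, 102, 60, 256, 33, 91, 120]
MAX_LENGTH = 256   # same cap as tokenize_batch above
BATCH_SIZE = 4     # illustrative only; the notebook trains with 16


def fixed_padding_cost(lengths, max_length):
    """Tokens processed when every example is padded to max_length."""
    return len(lengths) * max_length


def dynamic_padding_cost(lengths, batch_size, max_length):
    """Tokens processed when each batch is padded to its own longest example."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        # Truncate to the cap first, then pad the batch to its longest member.
        batch = [min(n, max_length) for n in lengths[start:start + batch_size]]
        total += max(batch) * len(batch)
    return total


fixed_cost = fixed_padding_cost(pair_lengths, MAX_LENGTH)
dynamic_cost = dynamic_padding_cost(pair_lengths, BATCH_SIZE, MAX_LENGTH)
print(f"fixed: {fixed_cost} tokens, dynamic: {dynamic_cost} tokens")
```

Whether the saving matters here depends on the real length distribution of ANLI R2, which this notebook does not measure.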

## 2. Load Dataset

In [3]:
train, val, test = load_anli_r2()
train

Dataset({
    features: ['uid', 'premise', 'hypothesis', 'label', 'reason'],
    num_rows: 45460
})

## 3. Tokenization

In [4]:
tokenizer = get_tokenizer("distilroberta-base")

tokenized_train = train.map(lambda b: tokenize_batch(b, tokenizer), batched=True)
tokenized_val = val.map(lambda b: tokenize_batch(b, tokenizer), batched=True)
tokenized_test = test.map(lambda b: tokenize_batch(b, tokenizer), batched=True)

# HF Trainer expects labels column named "labels"
tokenized_train = tokenized_train.rename_column("label", "labels")
tokenized_val = tokenized_val.rename_column("label", "labels")
tokenized_test = tokenized_test.rename_column("label", "labels")

tokenized_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
tokenized_val.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
tokenized_test.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

## 4. Define Metrics

In [5]:
def compute_metrics(pred):
    logits, labels = pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, average="macro")
    return {"accuracy": acc, "macro_f1": f1}
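
To make the `argmax` step in `compute_metrics` concrete, here is a small dependency-free sketch. The logits and labels are invented for illustration; the column order (entailment, neutral, contradiction) follows the label names used later in the classification report.

```python
# Hypothetical logits for four examples over (entailment, neutral, contradiction).
logits = [
    [2.1, 0.3, -1.0],
    [0.2, 0.1, 1.5],
    [-0.5, 1.2, 0.4],
    [0.9, 1.0, 0.8],
]
labels = [0, 2, 0, 1]  # hypothetical gold labels


def argmax(row):
    """Index of the largest value in a list (first index wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


# Same idea as np.argmax(logits, axis=-1): one predicted class id per example.
preds = [argmax(row) for row in logits]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(preds, accuracy)
```

The third example is the only miss: its largest logit is in the neutral column while the gold label is entailment.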

## 5. Initialize Model (DistilRoBERTa)

In [6]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base",
    num_labels=3
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 6. TrainingArguments + Trainer

In [17]:
training_args = TrainingArguments(
    output_dir="./checkpoints_roberta",
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=50,
    report_to="none",
    metric_for_best_model="macro_f1",   # must match a key returned by compute_metrics
    greater_is_better=True,
)




trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    processing_class=tokenizer,  # `tokenizer=` is deprecated in recent transformers releases
    compute_metrics=compute_metrics,
)


## 7. Train Model

In [18]:
import os
os.environ["WANDB_DISABLED"] = "true"
os.environ["WANDB_MODE"] = "offline"
os.environ["WANDB_SILENT"] = "true"



In [19]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,Macro F1
1,0.3178,2.129012,0.434,0.425686
2,0.3624,1.760993,0.448,0.440987
3,0.3182,2.036632,0.436,0.433434


TrainOutput(global_step=8526, training_loss=0.34644750946988084, metrics={'train_runtime': 2882.0976, 'train_samples_per_second': 47.32, 'train_steps_per_second': 2.958, 'total_flos': 9033113004226560.0, 'train_loss': 0.34644750946988084, 'epoch': 3.0})
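
The reported `global_step=8526` can be cross-checked against the dataset size and batch size. The sketch below assumes a single device and no gradient accumulation; neither is stated in the notebook, but both are consistent with the step count.

```python
import math

num_train_examples = 45460         # num_rows of the train_r2 split
per_device_train_batch_size = 16   # from TrainingArguments
num_train_epochs = 3               # from TrainingArguments
num_devices = 1                    # assumption: one GPU
gradient_accumulation_steps = 1    # assumption: not set in the notebook

effective_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)
# The last, partially filled batch still counts as one optimizer step.
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)
```

Under these assumptions each epoch is 2842 steps, and three epochs give 8526, matching `global_step`.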

## 8. Evaluate on Test Set

In [20]:
test_results = trainer.evaluate(tokenized_test)
test_results

{'eval_loss': 1.8456960916519165,
 'eval_accuracy': 0.446,
 'eval_macro_f1': 0.4382978422911638,
 'eval_runtime': 6.6676,
 'eval_samples_per_second': 149.98,
 'eval_steps_per_second': 9.449,
 'epoch': 3.0}
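
Because training used `load_best_model_at_end=True` with `metric_for_best_model="macro_f1"`, these test numbers should come from the checkpoint with the highest validation macro F1 rather than from the final epoch. A minimal sketch of that selection over the validation log from training (the `eval_` prefix mirrors the key names in the results dictionary above):

```python
# Validation metrics copied from the per-epoch training log.
eval_log = [
    {"epoch": 1, "eval_loss": 2.129012, "eval_accuracy": 0.434, "eval_macro_f1": 0.425686},
    {"epoch": 2, "eval_loss": 1.760993, "eval_accuracy": 0.448, "eval_macro_f1": 0.440987},
    {"epoch": 3, "eval_loss": 2.036632, "eval_accuracy": 0.436, "eval_macro_f1": 0.433434},
]

# greater_is_better=True: keep the epoch with the largest macro F1.
best = max(eval_log, key=lambda row: row["eval_macro_f1"])
print(f"best epoch: {best['epoch']} (macro F1 = {best['eval_macro_f1']:.4f})")
```

Epoch 2 also has the lowest validation loss, so selecting on loss would pick the same checkpoint in this run.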

In [21]:
# Detailed evaluation
raw_preds = trainer.predict(tokenized_test)
pred_labels = np.argmax(raw_preds.predictions, axis=-1)
true_labels = raw_preds.label_ids

print(classification_report(true_labels, pred_labels, target_names=["entailment", "neutral", "contradiction"]))
print(confusion_matrix(true_labels, pred_labels))

               precision    recall  f1-score   support

   entailment       0.43      0.57      0.49       334
      neutral       0.46      0.47      0.47       333
contradiction       0.46      0.29      0.36       333

     accuracy                           0.45      1000
    macro avg       0.45      0.45      0.44      1000
 weighted avg       0.45      0.45      0.44      1000

[[190  96  48]
 [109 158  66]
 [146  89  98]]
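
The per-class figures in the report can be re-derived from the confusion matrix alone (rows are true labels, columns are predictions). The following dependency-free sketch does that by hand; it is a sanity check, not a replacement for `classification_report`.

```python
LABELS = ["entailment", "neutral", "contradiction"]

# Confusion matrix copied from the output above: rows = true, columns = predicted.
confusion = [
    [190, 96, 48],
    [109, 158, 66],
    [146, 89, 98],
]


def per_class_scores(matrix):
    """Return (precision, recall, f1) for each class of a square confusion matrix."""
    n = len(matrix)
    scores = []
    for k in range(n):
        true_positive = matrix[k][k]
        predicted_k = sum(matrix[row][k] for row in range(n))  # column sum
        actual_k = sum(matrix[k])                              # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores


scores = per_class_scores(confusion)
macro_f1 = sum(f1 for _, _, f1 in scores) / len(scores)
accuracy = sum(confusion[k][k] for k in range(len(confusion))) / sum(map(sum, confusion))

for name, (p, r, f1) in zip(LABELS, scores):
    print(f"{name:>13}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")
print(f"accuracy={accuracy:.3f}  macro_f1={macro_f1:.4f}")
```

The weakest figure is contradiction recall (98 of 333), and the matrix shows why: 146 contradiction examples were predicted as entailment.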


## 9. Error Analysis (Misclassified Examples)

In [22]:
import pandas as pd

test_df = test.to_pandas()
test_df["true_label"] = true_labels
test_df["pred_label"] = pred_labels

errors = test_df[test_df["true_label"] != test_df["pred_label"]]
errors.head(10)

index,uid,premise,hypothesis,label,reason,true_label,pred_label
1,27a054fa-cd64-4925-bcf5-e8406114ac35,"""Look at Me (When I Rock Wichoo)"" is a song by...",The song was released in America in September ...,1,It doesn't state if it was released anywhere o...,1,0
4,fc439129-505b-48cc-8f17-a7b2ccddacdd,Things Happen at Night is a 1947 British super...,Frank Harvey Jnr. wrote Things Happen at Night .,2,"It is based off of the play, but he did not ac...",2,0
6,4a76effc-1221-45fb-99f3-933cf96a3f01,"""Beez in the Trap"" is a song by rapper Nicki M...","The song was released on the last day of May, ...",2,"It was released on May 29, 2012",2,0
7,a0c54d89-cb4a-4596-94bc-9ebdd41165ab,Bullitt East High School is a high school loca...,Bullitt East High School is not in Washington.,0,Bullitt East High School is in Kentucky. The s...,0,2
10,7477697c-0cf8-484f-8bd3-c9accbff8969,"William V. Bidwill Sr. (born July 31, 1931) is...",William V. Bidwill Sr. had more than one brother,1,it is not stated whether or not that is the case,1,0
11,e451cd5e-d012-4de9-9af1-bfc279cd1a1b,The Kilpatrick and Beatty text-messaging scand...,Kilpatrick was a police officer,2,No Kilpatrick was the mayor,2,0
12,47edc5b1-9474-498b-81ec-7ae2293cfaa6,"""Crawling"" is a song by American rock band Lin...","""Crawling"" was written by Linkin Park for the ...",2,"""Crawling"" was written for the album ""Hybrid T...",2,0
14,ddbef566-e9e9-4664-a45d-d77dec01649c,"Lincoln is a town in Providence County, Rhode ...","The population of Lincoln is over 21,105,",1,Because the system cannot confirm the populati...,1,0
17,22ad9c38-3ca8-4576-bd75-4e31f409ac0f,John (Johnnie) White (died 2007) was a high-ra...,John White is serving time for his involvment ...,2,He died in 2007 so he can't be serving time ri...,2,1
18,e84e3eed-7a3e-4c7d-b559-974188538a7b,Sverre Peak ( ) is a small peak 0.5 nautical m...,A nautical mile is 1.8 kilometers.,0,"Reason: If .5 nautical milers is .9km, 1 naut...",0,2
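
One way to summarize misclassifications is to count (true, predicted) label pairs. The sketch below does this for only the ten rows displayed above, so it illustrates the technique rather than describing the full test set; on the real data the same counting could be applied to the `true_label` and `pred_label` columns of `errors`.

```python
from collections import Counter

LABELS = ["entailment", "neutral", "contradiction"]

# (true_label, pred_label) pairs copied from the ten error rows shown above.
shown_errors = [
    (1, 0), (2, 0), (2, 0), (0, 2), (1, 0),
    (2, 0), (2, 0), (1, 0), (2, 1), (0, 2),
]

# Count how often each confusion direction occurs.
pair_counts = Counter(
    (LABELS[true], LABELS[pred]) for true, pred in shown_errors
)

for (true_name, pred_name), count in pair_counts.most_common():
    print(f"{true_name} -> {pred_name}: {count}")
```

In this small sample the most frequent direction is contradiction predicted as entailment, in line with the largest off-diagonal cell of the confusion matrix.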


## 10. Save Model

In [23]:
trainer.save_model("roberta_anli_r2")
tokenizer.save_pretrained("roberta_anli_r2")
print("Saved roberta_anli_r2 model.")

Saved roberta_anli_r2 model.


In [24]:
!zip -r roberta_anli_r2.zip roberta_anli_r2/


  adding: roberta_anli_r2/ (stored 0%)
  adding: roberta_anli_r2/tokenizer_config.json (deflated 75%)
  adding: roberta_anli_r2/training_args.bin (deflated 53%)
  adding: roberta_anli_r2/config.json (deflated 52%)
  adding: roberta_anli_r2/vocab.json (deflated 59%)
  adding: roberta_anli_r2/model.safetensors (deflated 7%)
  adding: roberta_anli_r2/tokenizer.json (deflated 82%)
  adding: roberta_anli_r2/merges.txt (deflated 53%)
  adding: roberta_anli_r2/special_tokens_map.json (deflated 52%)


In [25]:
from google.colab import files
files.download("roberta_anli_r2.zip")

