# Fine-Tuning vs Prompting for Natural Language Inference

This project compares supervised fine-tuning (RoBERTa-base) with zero-shot and few-shot prompting (LLaMA-3-8B-Instruct) on the Natural Language Inference (NLI) task.

Natural Language Inference involves determining the semantic relationship between a *premise* and a *hypothesis*, classifying the pair as entailment, contradiction, or neutral.

## Task Definition

- Task type: Multi-class classification
- Number of labels: 3 (entailment, contradiction, neutral)
- Evaluation metric: Accuracy

In [None]:
#!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.1/84.1 kB[0m [31m6.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: evaluate
Successfully installed evaluate-0.4.6


In [2]:
from collections import Counter
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset
from transformers import DataCollatorWithPadding
import torch
import evaluate
import numpy as np

In [3]:
ds = load_dataset("stanfordnlp/snli")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

plain_text/test-00000-of-00001.parquet:   0%|          | 0.00/412k [00:00<?, ?B/s]

plain_text/validation-00000-of-00001.par(…):   0%|          | 0.00/413k [00:00<?, ?B/s]

plain_text/train-00000-of-00001.parquet:   0%|          | 0.00/19.6M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/550152 [00:00<?, ? examples/s]

In [4]:
torch.cuda.is_available()

True

In [6]:
#remove examples with ambiguous labels (label == -1)
ds = ds.filter(lambda example: example['label'] != -1)

#get the label names and create label2id and id2label mappings
labels = ds["train"].features["label"].names
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for i, l in enumerate(labels)}

#initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base")


Filter:   0%|          | 0/10000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/10000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/550152 [00:00<?, ? examples/s]

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [7]:
#tokenize the dataset and map the labels to IDs
def preprocess_function(examples):
    #tokenize the texts
    inputs = [ex for ex in examples["premise"]]
    targets = [ex for ex in examples["hypothesis"]]
    model_inputs = tokenizer(inputs, text_pair=targets, truncation=True, padding=False)

    #map labels to IDs
    model_inputs["labels"] = examples["label"]
    return model_inputs



In [8]:
#map the preprocessing function to the entire training set
train_data = ds['train'].map(preprocess_function,
    batched=True,
)

#map the preprocessing function to the entire validation set
val_data = ds['validation'].map(preprocess_function,
    batched=True,
)
#apply padding with the data_collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)



Map:   0%|          | 0/549367 [00:00<?, ? examples/s]

Map:   0%|          | 0/9842 [00:00<?, ? examples/s]

In [9]:
#creating the evaluation metric for Trainer
accuracy = evaluate.load("accuracy")

def compute_metrics(p):
    preds = p.predictions.argmax(-1) #choose the class with the highest score as prediction
    return accuracy.compute(predictions=preds, references=p.label_ids) #return accuracy

Downloading builder script: 0.00B [00:00, ?B/s]

## Model Configuration

We use `roberta-base` from Hugging Face.

To reduce training cost, we freeze the lower encoder layers while keeping higher layers and the classification head trainable.

In [12]:
#Training
#load the pretrained model
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=len(labels), id2label=id2label, label2id=label2id)

#freeze embeddings
for p in model.roberta.embeddings.parameters():
    p.requires_grad = False

#freeze bottom 10 layers of the encoder
for layer in model.roberta.encoder.layer[:6]:
    for p in layer.parameters():
        p.requires_grad = False

#define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    fp16=True,
    report_to="none",

)
#define trainer parameters
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

#train the model
trainer.train()

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.3643,0.296758,0.892705
2,0.3181,0.279603,0.904796
3,0.3016,0.280861,0.907844


TrainOutput(global_step=103008, training_loss=0.3530738001198049, metrics={'train_runtime': 4365.3503, 'train_samples_per_second': 377.542, 'train_steps_per_second': 23.597, 'total_flos': 3.742705065061633e+16, 'train_loss': 0.3530738001198049, 'epoch': 3.0})

In [13]:
import time

start = time.time()
eval_metrics = trainer.evaluate()
eval_time = time.time() - start

print(f"Validation time: {eval_time:.2f} seconds")
print(eval_metrics)


Validation time: 9.27 seconds
{'eval_loss': 0.27960312366485596, 'eval_accuracy': 0.9047957732168258, 'eval_runtime': 9.2651, 'eval_samples_per_second': 1062.265, 'eval_steps_per_second': 66.486, 'epoch': 3.0}


## Results

Validation accuracy is: 0.905 and validation runtime is estimated at 9.27 seconds.

Training time is estimated to be 4365 seconds, which is approx. 1 hour.

Freezing lower layers reduced training cost while maintaining competitive validation performance.

In [14]:
#run validation only in the first 100 samples for the prompt model comparison
val_subset = val_data.select(range(100))  # take first 100 samples

start_time = time.time()
results_100 = trainer.evaluate(eval_dataset=val_subset)
subset_eval_time = time.time() - start

print(f"Validation time: {subset_eval_time:.2f} seconds")
print("Accuracy on first 100 validation samples:", results_100["eval_accuracy"])


Validation time: 19.91 seconds
Accuracy on first 100 validation samples: 0.96


Validation runitime in 100 samples took approx. 20s, while the accuracy on these 100 samples is reported to be 96%.