# Prompting Large Language Models for Natural Language Inference

This notebook evaluates zero-shot and few-shot prompting strategies using LLaMA-3-8B-Instruct for the Natural Language Inference (NLI) task.

The goal is to compare prompting-based adaptation with supervised fine-tuning (RoBERTa-base) in terms of performance, stability, and computational cost.

## Task Definition

- Task: Natural Language Inference (NLI)
- Labels: entailment, contradiction, neutral
- Evaluation metric: Accuracy
- Output constraint: Model responses restricted to predefined label set
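
The accuracy metric can be sketched as follows. This is a minimal illustration, assuming integer label ids and `-1` for outputs that could not be mapped to a label; the helper name `accuracy` and its `count_unparsed_as_wrong` flag are hypothetical and not part of the notebook. The evaluation code further down corresponds to the lenient variant, which drops unparseable outputs from the denominator.

```python
def accuracy(preds, refs, count_unparsed_as_wrong=True):
    """Fraction of predictions that match the reference labels.

    preds and refs are equal-length sequences of integer label ids;
    -1 in preds marks an output that could not be mapped to a label.
    """
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    pairs = list(zip(preds, refs))
    if not count_unparsed_as_wrong:
        # Lenient variant: drop unparseable outputs from the denominator.
        pairs = [(p, y) for p, y in pairs if p != -1]
    if not pairs:
        return 0.0
    return sum(p == y for p, y in pairs) / len(pairs)


# Toy data (not results from this notebook): 0=entailment, 1=neutral, 2=contradiction.
toy_preds = [0, 1, -1, 2]
toy_refs = [0, 2, 1, 2]
strict = accuracy(toy_preds, toy_refs)
lenient = accuracy(toy_preds, toy_refs, count_unparsed_as_wrong=False)
print(strict, lenient)
```

The two variants only differ when some outputs cannot be parsed; reporting both makes it visible when a prompt gains accuracy merely by producing fewer usable answers.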

In [42]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
import torch
import numpy as np
import time
from tqdm import tqdm
import string

In [3]:
ds = load_dataset("stanfordnlp/snli")


In [4]:
ds = ds.filter(lambda example: example['label'] != -1) #remove -1 labeled instances

labels = ds['train'].features['label'].names
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for i, l in enumerate(labels)}

val_ds = ds["validation"].select(range(100))


In [5]:
from huggingface_hub import login
login()


In [7]:
#load the model and its tokenizer
model_to_use = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_to_use)
model = AutoModelForCausalLM.from_pretrained(
    model_to_use,
    torch_dtype=torch.float16,
    device_map="auto"
)




## Prompt Design

Two prompting strategies were evaluated:

1. Zero-shot prompting
2. Few-shot prompting (with 7 in-context examples)

Prompts were structured to explicitly restrict the output to one of the predefined labels.
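
A few-shot prompt is essentially a sequence of labeled premise/hypothesis pairs followed by the query. The sketch below shows one way such a block could be assembled from data instead of being hard-coded. The helper name `format_few_shot_block` is hypothetical; the notebook itself writes the examples directly into the prompt string.

```python
def format_few_shot_block(examples):
    """Render (premise, hypothesis, label) triples as prompt text."""
    parts = []
    for premise, hypothesis, label in examples:
        parts.append(
            f"Premise: {premise}\n"
            f"Hypothesis: {hypothesis}\n"
            f"Answer: {label}"
        )
    # Separate consecutive examples with a blank line.
    return "\n\n".join(parts)


examples = [
    ("A soccer game with multiple males playing.", "Some men are playing a sport.", "entailment"),
    ("A woman is outdoors.", "A woman is inside a building.", "contradiction"),
]
block = format_few_shot_block(examples)
print(block)
```

Keeping the examples in a list makes it easier to vary their number, order, and class balance between runs.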

In [116]:
#zero-shot prompt
def build_zero_shot_prompt(premise, hypothesis):
    messages = [
        {
            "role": "user",
            "content": f"""Determine the logical relationship between the premise and hypothesis.

Premise: {premise}
Hypothesis: {hypothesis}

IMPORTANT: Look carefully for contradictions - situations where the hypothesis is definitely false given the premise.

The relationship must be one of:
- entailment: the hypothesis is definitely true given the premise
- contradiction: the hypothesis is definitely false given the premise
- neutral: the hypothesis might or might not be true given the premise

Answer with only one word (entailment, contradiction, or neutral):"""
        }
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

In [117]:
#few-shot prompt
def build_few_shot_prompt(premise, hypothesis):
  messages = [
        {
            "role": "user",
            "content": f"""Classify the relationship as entailment, contradiction, or neutral.

IMPORTANT: Look carefully for contradictions - situations where the hypothesis is definitely false given the premise.

Premise: A soccer game with multiple males playing.
Hypothesis: Some men are playing a sport.
Answer: entailment

Premise: A child wearing a red coat is pointing into a lighted window.
Hypothesis: A child is wearing a jacket.
Answer: entailment

Premise: A man in a black shirt is looking at a bike in a workshop.
Hypothesis: A man is shopping for a car.
Answer: contradiction

Premise: A woman is outdoors.
Hypothesis: A woman is inside a building.
Answer: contradiction

Premise: The children are playing in the park.
Hypothesis: The children are sleeping in their beds.
Answer: contradiction

Premise: An older and younger man smiling.
Hypothesis: Two men are smiling and laughing at the cats playing on the floor.
Answer: neutral

Premise: A person on a horse jumps over a broken down airplane.
Hypothesis: A person is training his horse for a competition.
Answer: neutral

Premise: {premise}
Hypothesis: {hypothesis}
Answer with only one word:"""
        }
    ]
  return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

In [109]:
#prediction function
def predict_label(prompt):
    #note: the chat template already adds a BOS token, so add_special_tokens=False may be needed here to avoid a duplicate
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    input_len = inputs.input_ids.shape[1]

    #disable gradient tracking, since this is inference only
    with torch.no_grad():
        #greedy generation of the answer
        outputs = model.generate(
            **inputs, #unpacks input_ids and attention_mask
            max_new_tokens=20, #a label needs only a few tokens; 20 leaves headroom
            do_sample=False, #greedy decoding: always pick the highest-probability token
            pad_token_id=tokenizer.eos_token_id
        )

    #convert the generated token IDs back to text
    generated_tokens = outputs[0, input_len:]  #skip the prompt tokens
    text = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip().lower() #decode, strip whitespace and lowercase

    #print for ensuring that the generated text is the respective label
    print(f"Raw output: '{text}'")

    #parsing the text
    text = text.split()[0] if text.split() else ""  #take first word only

    #encode the generated label as in the dataset
    if "entail" in text:
        return 0
    elif "neutral" in text:
        return 1
    elif "contra" in text:
        return 2
    else:
        return -1


In [110]:
#evaluation function
def evaluate(prompt_type, dataset):
    preds, refs = [], [] #store predictions and references

    start = time.time() #measure runtime
    #loop through the dataset and get predictions
    for ex in tqdm(dataset):
        prompt = prompt_type(ex["premise"], ex["hypothesis"]) #generate the prompt
        pred = predict_label(prompt) #get the prediction
        preds.append(pred) #add prediction to the list
        refs.append(ex["label"]) #add reference label to the list
    runtime = time.time() - start

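    #calculate accuracy over the parseable predictions only (outputs mapped to -1 are excluded)
    correct = sum(p == y for p, y in zip(preds, refs) if p != -1)
    total = sum(p != -1 for p in preds)

    accuracy = correct / total if total else 0.0 #avoid division by zero when nothing could be parsed
    return accuracy, runtime, preds #return accuracy, runtime and predictions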


In [118]:
#zero-shot runtime and accuracy
zero_acc, zero_time, zero_preds = evaluate(
    build_zero_shot_prompt,
    val_ds
)

print("Zero-shot accuracy:", zero_acc)
print("Zero-shot runtime (s):", zero_time)


  1%|          | 1/100 [00:03<05:52,  3.56s/it]

Raw output: 'entailment'


  2%|▏         | 2/100 [00:06<05:33,  3.40s/it]

Raw output: 'entailment'


  3%|▎         | 3/100 [00:08<04:15,  2.63s/it]

Raw output: 'neutral'


  4%|▍         | 4/100 [00:11<04:40,  2.92s/it]

Raw output: 'entailment'


  5%|▌         | 5/100 [00:13<03:58,  2.51s/it]

Raw output: 'neutral'


  6%|▌         | 6/100 [00:15<03:34,  2.28s/it]

Raw output: 'neutral'


  7%|▋         | 7/100 [00:17<03:14,  2.10s/it]

Raw output: 'neutral'


  8%|▊         | 8/100 [00:20<03:49,  2.49s/it]

Raw output: 'entailment'


  9%|▉         | 9/100 [00:23<04:11,  2.76s/it]

Raw output: 'entailment'


 10%|█         | 10/100 [00:27<04:29,  2.99s/it]

Raw output: 'entailment'


 11%|█         | 11/100 [00:29<03:52,  2.61s/it]

Raw output: 'neutral'


 12%|█▏        | 12/100 [00:30<03:26,  2.34s/it]

Raw output: 'neutral'


 13%|█▎        | 13/100 [00:34<03:50,  2.65s/it]

Raw output: 'entailment'


 14%|█▍        | 14/100 [00:37<04:07,  2.87s/it]

Raw output: 'entailment'


 15%|█▌        | 15/100 [00:39<03:37,  2.56s/it]

Raw output: 'neutral'


 16%|█▌        | 16/100 [00:42<03:55,  2.80s/it]

Raw output: 'entailment'


 17%|█▋        | 17/100 [00:46<04:07,  2.98s/it]

Raw output: 'entailment'


 18%|█▊        | 18/100 [00:48<03:33,  2.60s/it]

Raw output: 'neutral'


 19%|█▉        | 19/100 [00:51<03:53,  2.88s/it]

Raw output: 'entailment'


 20%|██        | 20/100 [00:53<03:23,  2.54s/it]

Raw output: 'neutral'


 21%|██        | 21/100 [00:55<03:01,  2.30s/it]

Raw output: 'neutral'


 22%|██▏       | 22/100 [00:58<03:22,  2.60s/it]

Raw output: 'entailment'


 23%|██▎       | 23/100 [01:01<03:37,  2.83s/it]

Raw output: 'entailment'


 24%|██▍       | 24/100 [01:03<03:11,  2.52s/it]

Raw output: 'neutral'


 25%|██▌       | 25/100 [01:05<02:50,  2.27s/it]

Raw output: 'neutral'


 26%|██▌       | 26/100 [01:08<03:10,  2.58s/it]

Raw output: 'entailment'


 27%|██▋       | 27/100 [01:10<02:48,  2.31s/it]

Raw output: 'neutral'


 28%|██▊       | 28/100 [01:11<02:33,  2.13s/it]

Raw output: 'neutral'


 29%|██▉       | 29/100 [01:13<02:23,  2.02s/it]

Raw output: 'neutral'


 30%|███       | 30/100 [01:17<02:51,  2.45s/it]

Raw output: 'contradiction'


 31%|███       | 31/100 [01:20<03:07,  2.72s/it]

Raw output: 'entailment'


 32%|███▏      | 32/100 [01:22<02:44,  2.42s/it]

Raw output: 'neutral'


 33%|███▎      | 33/100 [01:25<03:01,  2.70s/it]

Raw output: 'entailment'


 34%|███▍      | 34/100 [01:28<03:13,  2.94s/it]

Raw output: 'entailment'


 35%|███▌      | 35/100 [01:32<03:19,  3.06s/it]

Raw output: 'contradiction'


 36%|███▌      | 36/100 [01:34<02:50,  2.66s/it]

Raw output: 'neutral'


 37%|███▋      | 37/100 [01:37<02:59,  2.86s/it]

Raw output: 'entailment'


 38%|███▊      | 38/100 [01:39<02:37,  2.54s/it]

Raw output: 'neutral'


 39%|███▉      | 39/100 [01:42<02:49,  2.78s/it]

Raw output: 'entailment'


 40%|████      | 40/100 [01:45<02:57,  2.95s/it]

Raw output: 'entailment'


 41%|████      | 41/100 [01:49<03:01,  3.07s/it]

Raw output: 'entailment'


 42%|████▏     | 42/100 [01:52<03:06,  3.22s/it]

Raw output: 'entailment'


 43%|████▎     | 43/100 [01:54<02:37,  2.76s/it]

Raw output: 'neutral'


 44%|████▍     | 44/100 [01:57<02:43,  2.91s/it]

Raw output: 'entailment'


 45%|████▌     | 45/100 [02:01<02:46,  3.03s/it]

Raw output: 'entailment'


 46%|████▌     | 46/100 [02:04<02:51,  3.17s/it]

Raw output: 'contradiction'


 47%|████▋     | 47/100 [02:07<02:50,  3.23s/it]

Raw output: 'entailment'


 48%|████▊     | 48/100 [02:09<02:24,  2.78s/it]

Raw output: 'neutral'


 49%|████▉     | 49/100 [02:12<02:29,  2.94s/it]

Raw output: 'entailment'


 50%|█████     | 50/100 [02:14<02:09,  2.59s/it]

Raw output: 'neutral'


 51%|█████     | 51/100 [02:18<02:18,  2.83s/it]

Raw output: 'entailment'


 52%|█████▏    | 52/100 [02:21<02:23,  2.98s/it]

Raw output: 'entailment'


 53%|█████▎    | 53/100 [02:23<02:02,  2.61s/it]

Raw output: 'neutral'


 54%|█████▍    | 54/100 [02:26<02:10,  2.84s/it]

Raw output: 'entailment'


 55%|█████▌    | 55/100 [02:28<01:54,  2.54s/it]

Raw output: 'neutral'


 56%|█████▌    | 56/100 [02:31<02:02,  2.79s/it]

Raw output: 'entailment'


 57%|█████▋    | 57/100 [02:33<01:46,  2.47s/it]

Raw output: 'neutral'


 58%|█████▊    | 58/100 [02:36<01:54,  2.73s/it]

Raw output: 'entailment'


 59%|█████▉    | 59/100 [02:40<02:01,  2.97s/it]

Raw output: 'contradiction'


 60%|██████    | 60/100 [02:42<01:43,  2.59s/it]

Raw output: 'neutral'


 61%|██████    | 61/100 [02:45<01:50,  2.82s/it]

Raw output: 'contradiction'


 62%|██████▏   | 62/100 [02:47<01:34,  2.49s/it]

Raw output: 'neutral'


 63%|██████▎   | 63/100 [02:48<01:23,  2.26s/it]

Raw output: 'neutral'


 64%|██████▍   | 64/100 [02:50<01:16,  2.12s/it]

Raw output: 'neutral'


 65%|██████▌   | 65/100 [02:52<01:11,  2.04s/it]

Raw output: 'neutral'


 66%|██████▌   | 66/100 [02:55<01:22,  2.44s/it]

Raw output: 'entailment'


 67%|██████▋   | 67/100 [02:59<01:29,  2.71s/it]

Raw output: 'contradiction'


 68%|██████▊   | 68/100 [03:00<01:17,  2.42s/it]

Raw output: 'neutral'


 69%|██████▉   | 69/100 [03:02<01:09,  2.23s/it]

Raw output: 'neutral'


 70%|███████   | 70/100 [03:06<01:17,  2.59s/it]

Raw output: 'entailment'


 71%|███████   | 71/100 [03:07<01:07,  2.34s/it]

Raw output: 'neutral'


 72%|███████▏  | 72/100 [03:11<01:14,  2.64s/it]

Raw output: 'contradiction'


 73%|███████▎  | 73/100 [03:14<01:17,  2.87s/it]

Raw output: 'entailment'


 74%|███████▍  | 74/100 [03:16<01:06,  2.56s/it]

Raw output: 'neutral'


 75%|███████▌  | 75/100 [03:18<00:57,  2.32s/it]

Raw output: 'neutral'


 76%|███████▌  | 76/100 [03:21<01:03,  2.63s/it]

Raw output: 'entailment'


 77%|███████▋  | 77/100 [03:24<01:05,  2.85s/it]

Raw output: 'entailment'


 78%|███████▊  | 78/100 [03:26<00:55,  2.52s/it]

Raw output: 'neutral'


 79%|███████▉  | 79/100 [03:30<00:58,  2.80s/it]

Raw output: 'contradiction'


 80%|████████  | 80/100 [03:33<00:59,  2.96s/it]

Raw output: 'entailment'


 81%|████████  | 81/100 [03:36<00:58,  3.08s/it]

Raw output: 'contradiction'


 82%|████████▏ | 82/100 [03:40<00:57,  3.21s/it]

Raw output: 'entailment'


 83%|████████▎ | 83/100 [03:42<00:47,  2.77s/it]

Raw output: 'neutral'


 84%|████████▍ | 84/100 [03:43<00:39,  2.45s/it]

Raw output: 'neutral'


 85%|████████▌ | 85/100 [03:47<00:40,  2.73s/it]

Raw output: 'entailment'


 86%|████████▌ | 86/100 [03:50<00:40,  2.91s/it]

Raw output: 'entailment'


 87%|████████▋ | 87/100 [03:54<00:40,  3.10s/it]

Raw output: 'entailment'


 88%|████████▊ | 88/100 [03:57<00:38,  3.17s/it]

Raw output: 'entailment'


 89%|████████▉ | 89/100 [03:59<00:30,  2.74s/it]

Raw output: 'neutral'


 90%|█████████ | 90/100 [04:02<00:29,  2.93s/it]

Raw output: 'entailment'


 91%|█████████ | 91/100 [04:06<00:27,  3.10s/it]

Raw output: 'contradiction'


 92%|█████████▏| 92/100 [04:09<00:25,  3.17s/it]

Raw output: 'entailment'


 93%|█████████▎| 93/100 [04:11<00:19,  2.73s/it]

Raw output: 'neutral'


 94%|█████████▍| 94/100 [04:12<00:14,  2.43s/it]

Raw output: 'neutral'


 95%|█████████▌| 95/100 [04:16<00:13,  2.74s/it]

Raw output: 'entailment'


 96%|█████████▌| 96/100 [04:19<00:11,  2.93s/it]

Raw output: 'entailment'


 97%|█████████▋| 97/100 [04:22<00:09,  3.06s/it]

Raw output: 'entailment'


 98%|█████████▊| 98/100 [04:26<00:06,  3.14s/it]

Raw output: 'entailment'


 99%|█████████▉| 99/100 [04:29<00:03,  3.25s/it]

Raw output: 'entailment'


100%|██████████| 100/100 [04:33<00:00,  2.73s/it]

Raw output: 'contradiction'
Zero-shot accuracy: 0.55
Zero-shot runtime (s): 273.16975712776184





In [119]:
#few-shot runtime and accuracy
few_acc, few_time, few_preds = evaluate(
    build_few_shot_prompt,
    val_ds
)

print("Few-shot accuracy:", few_acc)
print("Few-shot runtime (s):", few_time)


  1%|          | 1/100 [00:03<05:46,  3.50s/it]

Raw output: 'entailment'


  2%|▏         | 2/100 [00:07<05:49,  3.56s/it]

Raw output: 'entailment'


  3%|▎         | 3/100 [00:10<05:46,  3.58s/it]

Raw output: 'contradiction'


  4%|▍         | 4/100 [00:14<05:40,  3.55s/it]

Raw output: 'entailment'


  5%|▌         | 5/100 [00:17<05:35,  3.54s/it]

Raw output: 'entailment'


  6%|▌         | 6/100 [00:19<04:43,  3.02s/it]

Raw output: 'neutral'


  7%|▋         | 7/100 [00:21<04:06,  2.65s/it]

Raw output: 'neutral'


  8%|▊         | 8/100 [00:25<04:29,  2.93s/it]

Raw output: 'entailment'


  9%|▉         | 9/100 [00:28<04:43,  3.11s/it]

Raw output: 'entailment'


 10%|█         | 10/100 [00:32<04:55,  3.28s/it]

Raw output: 'entailment'


 11%|█         | 11/100 [00:34<04:13,  2.85s/it]

Raw output: 'neutral'


 12%|█▏        | 12/100 [00:37<04:28,  3.05s/it]

Raw output: 'entailment'


 13%|█▎        | 13/100 [00:41<04:37,  3.19s/it]

Raw output: 'entailment'


 14%|█▍        | 14/100 [00:44<04:46,  3.33s/it]

Raw output: 'entailment'


 15%|█▌        | 15/100 [00:46<04:05,  2.89s/it]

Raw output: 'neutral'


 16%|█▌        | 16/100 [00:50<04:18,  3.08s/it]

Raw output: 'entailment'


 17%|█▋        | 17/100 [00:53<04:26,  3.21s/it]

Raw output: 'entailment'


 18%|█▊        | 18/100 [00:55<03:52,  2.83s/it]

Raw output: 'neutral'


 19%|█▉        | 19/100 [00:59<04:06,  3.05s/it]

Raw output: 'entailment'


 20%|██        | 20/100 [01:02<04:14,  3.18s/it]

Raw output: 'contradiction'


 21%|██        | 21/100 [01:06<04:19,  3.28s/it]

Raw output: 'entailment'


 22%|██▏       | 22/100 [01:09<04:24,  3.39s/it]

Raw output: 'entailment'


 23%|██▎       | 23/100 [01:13<04:23,  3.42s/it]

Raw output: 'entailment'


 24%|██▍       | 24/100 [01:15<03:44,  2.95s/it]

Raw output: 'neutral'


 25%|██▌       | 25/100 [01:18<03:54,  3.13s/it]

Raw output: 'entailment'


 26%|██▌       | 26/100 [01:22<04:02,  3.28s/it]

Raw output: 'entailment'


 27%|██▋       | 27/100 [01:24<03:28,  2.86s/it]

Raw output: 'neutral'


 28%|██▊       | 28/100 [01:26<03:04,  2.57s/it]

Raw output: 'neutral'


 29%|██▉       | 29/100 [01:28<02:47,  2.36s/it]

Raw output: 'neutral'


 30%|███       | 30/100 [01:29<02:34,  2.21s/it]

Raw output: 'neutral'


 31%|███       | 31/100 [01:33<03:03,  2.65s/it]

Raw output: 'entailment'


 32%|███▏      | 32/100 [01:35<02:44,  2.42s/it]

Raw output: 'neutral'


 33%|███▎      | 33/100 [01:39<03:04,  2.75s/it]

Raw output: 'entailment'


 34%|███▍      | 34/100 [01:42<03:17,  2.99s/it]

Raw output: 'entailment'


 35%|███▌      | 35/100 [01:46<03:26,  3.18s/it]

Raw output: 'contradiction'


 36%|███▌      | 36/100 [01:49<03:30,  3.28s/it]

Raw output: 'entailment'


 37%|███▋      | 37/100 [01:53<03:30,  3.34s/it]

Raw output: 'entailment'


 38%|███▊      | 38/100 [01:55<03:01,  2.92s/it]

Raw output: 'neutral'


 39%|███▉      | 39/100 [01:58<03:11,  3.13s/it]

Raw output: 'entailment'


 40%|████      | 40/100 [02:02<03:14,  3.24s/it]

Raw output: 'entailment'


 41%|████      | 41/100 [02:05<03:15,  3.32s/it]

Raw output: 'entailment'


 42%|████▏     | 42/100 [02:09<03:18,  3.43s/it]

Raw output: 'entailment'


 43%|████▎     | 43/100 [02:12<03:16,  3.45s/it]

Raw output: 'contradiction'


 44%|████▍     | 44/100 [02:16<03:13,  3.46s/it]

Raw output: 'entailment'


 45%|████▌     | 45/100 [02:20<03:12,  3.50s/it]

Raw output: 'entailment'


 46%|████▌     | 46/100 [02:21<02:43,  3.03s/it]

Raw output: 'neutral'


 47%|████▋     | 47/100 [02:25<02:48,  3.17s/it]

Raw output: 'entailment'


 48%|████▊     | 48/100 [02:28<02:50,  3.27s/it]

Raw output: 'entailment'


 49%|████▉     | 49/100 [02:32<02:52,  3.38s/it]

Raw output: 'entailment'


 50%|█████     | 50/100 [02:34<02:26,  2.93s/it]

Raw output: 'neutral'


 51%|█████     | 51/100 [02:37<02:31,  3.10s/it]

Raw output: 'entailment'


 52%|█████▏    | 52/100 [02:41<02:34,  3.22s/it]

Raw output: 'entailment'


 53%|█████▎    | 53/100 [02:43<02:12,  2.83s/it]

Raw output: 'neutral'


 54%|█████▍    | 54/100 [02:47<02:20,  3.06s/it]

Raw output: 'entailment'


 55%|█████▌    | 55/100 [02:50<02:23,  3.19s/it]

Raw output: 'entailment'


 56%|█████▌    | 56/100 [02:53<02:24,  3.28s/it]

Raw output: 'entailment'


 57%|█████▋    | 57/100 [02:55<02:04,  2.88s/it]

Raw output: 'neutral'


 58%|█████▊    | 58/100 [02:59<02:10,  3.10s/it]

Raw output: 'entailment'


 59%|█████▉    | 59/100 [03:03<02:12,  3.22s/it]

Raw output: 'contradiction'


 60%|██████    | 60/100 [03:06<02:12,  3.31s/it]

Raw output: 'entailment'


 61%|██████    | 61/100 [03:10<02:13,  3.42s/it]

Raw output: 'contradiction'


 62%|██████▏   | 62/100 [03:13<02:11,  3.45s/it]

Raw output: 'entailment'


 63%|██████▎   | 63/100 [03:17<02:08,  3.47s/it]

Raw output: 'entailment'


 64%|██████▍   | 64/100 [03:20<02:06,  3.52s/it]

Raw output: 'entailment'


 65%|██████▌   | 65/100 [03:24<02:03,  3.52s/it]

Raw output: 'entailment'


 66%|██████▌   | 66/100 [03:27<01:59,  3.51s/it]

Raw output: 'entailment'


 67%|██████▋   | 67/100 [03:31<01:56,  3.53s/it]

Raw output: 'contradiction'


 68%|██████▊   | 68/100 [03:33<01:38,  3.06s/it]

Raw output: 'neutral'


 69%|██████▉   | 69/100 [03:36<01:39,  3.20s/it]

Raw output: 'contradiction'


 70%|███████   | 70/100 [03:40<01:38,  3.29s/it]

Raw output: 'entailment'


 71%|███████   | 71/100 [03:42<01:23,  2.87s/it]

Raw output: 'neutral'


 72%|███████▏  | 72/100 [03:46<01:26,  3.10s/it]

Raw output: 'contradiction'


 73%|███████▎  | 73/100 [03:49<01:27,  3.22s/it]

Raw output: 'entailment'


 74%|███████▍  | 74/100 [03:53<01:25,  3.30s/it]

Raw output: 'entailment'


 75%|███████▌  | 75/100 [03:56<01:24,  3.39s/it]

Raw output: 'contradiction'


 76%|███████▌  | 76/100 [04:00<01:22,  3.44s/it]

Raw output: 'entailment'


 77%|███████▋  | 77/100 [04:03<01:19,  3.46s/it]

Raw output: 'entailment'


 78%|███████▊  | 78/100 [04:05<01:05,  2.98s/it]

Raw output: 'neutral'


 79%|███████▉  | 79/100 [04:09<01:06,  3.18s/it]

Raw output: 'contradiction'


 80%|████████  | 80/100 [04:12<01:05,  3.28s/it]

Raw output: 'entailment'


 81%|████████  | 81/100 [04:14<00:54,  2.87s/it]

Raw output: 'neutral'


 82%|████████▏ | 82/100 [04:18<00:55,  3.06s/it]

Raw output: 'entailment'


 83%|████████▎ | 83/100 [04:20<00:46,  2.72s/it]

Raw output: 'neutral'


 84%|████████▍ | 84/100 [04:23<00:47,  2.99s/it]

Raw output: 'contradiction'


 85%|████████▌ | 85/100 [04:27<00:47,  3.14s/it]

Raw output: 'entailment'


 86%|████████▌ | 86/100 [04:30<00:45,  3.25s/it]

Raw output: 'entailment'


 87%|████████▋ | 87/100 [04:34<00:43,  3.37s/it]

Raw output: 'entailment'


 88%|████████▊ | 88/100 [04:37<00:40,  3.41s/it]

Raw output: 'entailment'


 89%|████████▉ | 89/100 [04:41<00:37,  3.44s/it]

Raw output: 'contradiction'


 90%|█████████ | 90/100 [04:44<00:34,  3.47s/it]

Raw output: 'entailment'


 91%|█████████ | 91/100 [04:48<00:31,  3.50s/it]

Raw output: 'contradiction'


 92%|█████████▏| 92/100 [04:51<00:27,  3.50s/it]

Raw output: 'entailment'


 93%|█████████▎| 93/100 [04:55<00:24,  3.50s/it]

Raw output: 'entailment'


 94%|█████████▍| 94/100 [04:57<00:18,  3.05s/it]

Raw output: 'neutral'


 95%|█████████▌| 95/100 [05:00<00:15,  3.19s/it]

Raw output: 'entailment'


 96%|█████████▌| 96/100 [05:04<00:13,  3.28s/it]

Raw output: 'entailment'


 97%|█████████▋| 97/100 [05:07<00:10,  3.35s/it]

Raw output: 'contradiction'


 98%|█████████▊| 98/100 [05:11<00:06,  3.43s/it]

Raw output: 'entailment'


 99%|█████████▉| 99/100 [05:15<00:03,  3.46s/it]

Raw output: 'entailment'


100%|██████████| 100/100 [05:18<00:00,  3.19s/it]

Raw output: 'contradiction'
Few-shot accuracy: 0.52
Few-shot runtime (s): 318.61034321784973





## Prompting discussion, decisions and instantiated prompts

Designing effective prompts for the Natural Language Inference task proved challenging, particularly when attempting to constrain the model to produce only a single label as output.

Zero-shot prompting
-------------------

The initial zero-shot prompt was relatively long and descriptive:


'''

You are performing Natural Language Inference task.

Given a Premise and a Hypothesis, classify their relationship as:
- entailment
- contradiction
- neutral

Definitions:
- entailment: the hypothesis must be true if the premise is true
- contradiction: the hypothesis cannot be true if the premise is true
- neutral: the hypothesis may or may not be true

Premise: {premise}
Hypothesis: {hypothesis}

Answer with one word only.

'''

However, with this prompt the model frequently produced long explanations or multiple labels instead of a single class. After experimenting with different formulations, I adapted the prompt to the instruction format recommended for LLaMA models in the official LLaMA documentation (https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/). The prompt was adjusted as follows:


'''

"role": "user",

"content": f"""Determine the logical relationship between the premise and hypothesis.

Premise: {premise}

Hypothesis: {hypothesis}

The relationship must be one of:
- entailment: the hypothesis is definitely true given the premise
- contradiction: the hypothesis is definitely false given the premise  
- neutral: the hypothesis might or might not be true given the premise

Answer with only one word (entailment, contradiction, or neutral):

'''


This version successfully constrained the model output to a single label. With this prompt, the zero-shot model reached an accuracy of 47% on the first 100 validation examples, with a runtime of approximately 4 minutes. This is substantially faster than training a fine-tuned model, but the accuracy is much lower.
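
Whether a prompt actually constrains the output can be checked by measuring how many raw generations consist of exactly one label. The sketch below is an illustration under assumptions: the names `is_single_label` and `compliance_rate` are hypothetical, and the sample outputs are made up to mimic the failure modes described above rather than copied from a run.

```python
LABELS = ("entailment", "contradiction", "neutral")


def is_single_label(raw_output):
    """True if the output is exactly one label, ignoring case, whitespace and surrounding punctuation."""
    cleaned = raw_output.strip().lower().strip(".,!:;\"'")
    return cleaned in LABELS


def compliance_rate(raw_outputs):
    """Share of outputs that consist of a single valid label."""
    if not raw_outputs:
        return 0.0
    return sum(is_single_label(o) for o in raw_outputs) / len(raw_outputs)


# Made-up outputs for illustration only.
toy_outputs = [
    "entailment",
    "Neutral.",
    "The relationship is contradiction because the premise says ...",
    "entailment or neutral",
]
rate = compliance_rate(toy_outputs)
print(rate)
```

Tracking this rate alongside accuracy separates two different problems: a prompt that fails to control the output format, and a prompt that controls the format but picks the wrong label.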

Few-shot prompting
------------------

A similar issue occurred when applying few-shot prompting with a generic prompt structure of the form:


'''

You are performing Natural Language Inference task.

Given a Premise and a Hypothesis, classify their relationship as:
- entailment
- contradiction
- neutral

{few_shot_examples}

Now classify the following:

Premise: {premise}
Hypothesis: {hypothesis}

Answer with one word only.

'''

Therefore, the few-shot prompt was also reformulated to follow the LLaMA instruction style:


'''

"role": "user",

"content": f"""Classify the relationship as entailment, contradiction, or neutral.

Example 1:
Premise: A person on a horse jumps over a broken down airplane.
Hypothesis: A person is training his horse for a competition.
Answer: neutral

Example 2:
Premise: Children smiling and waving at camera
Hypothesis: There are children present
Answer: entailment

Example 3:
Premise: A boy is jumping on skateboard in the middle of a red bridge.
Hypothesis: The boy skates down the sidewalk.
Answer: contradiction

Premise: {premise}

Hypothesis: {hypothesis}

Answer with only one word (entailment, contradiction, or neutral):

'''

This approach resulted in slightly lower accuracy than the zero-shot version. Since the examples above were taken directly from the dataset, I experimented with manually written examples designed to be clearer and more general. The first set of manual examples was:

'''

Example 1:
Premise: A man is playing guitar.
Hypothesis: A man is playing a musical instrument.
Answer: entailment

Example 2:
Premise: A woman is indoors.
Hypothesis: A woman is outside.
Answer: contradiction

Example 3:
Premise: A dog is running in the park.
Hypothesis: The dog is brown.
Answer: neutral

'''

This modification led to a minor improvement. Inspecting the predicted labels showed that the model was biased toward the entailment class and rarely predicted contradiction.
To address this, I added the following explicit instruction to both the zero-shot and few-shot prompts:

Instruction:

'''

IMPORTANT: Look carefully for contradictions - situations where the hypothesis is definitely false given the premise.

'''

In addition, the few-shot prompt was expanded to include more contradiction examples (three contradictions, two entailments, and two neutral examples). With these changes, accuracy improved to 55% for zero-shot prompting and 52% for few-shot prompting, while runtime remained low at approximately 5 minutes per evaluation.
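
The kind of label inspection that revealed the entailment bias can be sketched by comparing the distribution of predicted labels with the distribution of reference labels. This is an illustrative sketch: the helper `label_distribution` is hypothetical, the id mapping follows the one used in `predict_label`, and the lists are toy values rather than the notebook's predictions.

```python
from collections import Counter

# Same id convention as predict_label: 0=entailment, 1=neutral, 2=contradiction.
ID2LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}


def label_distribution(label_ids):
    """Share of each label among label_ids; ids outside ID2LABEL are reported as 'unparsed'."""
    counts = Counter(ID2LABEL.get(i, "unparsed") for i in label_ids)
    total = sum(counts.values())
    return {name: counts[name] / total for name in counts} if total else {}


# Toy values for illustration only.
toy_preds = [0, 0, 0, 1, 0, 2, 0, 1]
toy_refs = [0, 2, 0, 1, 2, 2, 1, 1]
pred_dist = label_distribution(toy_preds)
ref_dist = label_distribution(toy_refs)
print(pred_dist)
print(ref_dist)
```

A predicted share for one class that is much larger than its share in the references, as with entailment in the toy data, points to the kind of bias discussed here.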

Interestingly, few-shot prompting did not outperform zero-shot prompting, suggesting that the additional examples may have preserved the bias toward entailment or failed to generalize well to the SNLI dataset.
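
With only 100 evaluation examples, a three-point gap between the two prompting strategies is small relative to sampling noise. The sketch below estimates this with a normal-approximation (Wald) interval. The function name `accuracy_interval` is hypothetical, and the method is a rough assumption: both prompts were scored on the same examples, so a paired comparison would be a more precise tool.

```python
import math


def accuracy_interval(acc, n, z=1.96):
    """Normal-approximation (Wald) interval for an accuracy measured on n examples."""
    se = math.sqrt(acc * (1.0 - acc) / n)
    return max(0.0, acc - z * se), min(1.0, acc + z * se)


# Accuracies reported in this notebook, each measured on 100 examples.
zero_lo, zero_hi = accuracy_interval(0.55, 100)
few_lo, few_hi = accuracy_interval(0.52, 100)

# True when the two intervals share at least one point.
overlap = zero_lo <= few_hi and few_lo <= zero_hi
print((zero_lo, zero_hi), (few_lo, few_hi), overlap)
```

Each interval spans roughly ten percentage points on either side of the measured accuracy, so the two largely overlap and the ordering of zero-shot and few-shot should be read with caution.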


Some instructive examples from the validation subset, with the labels predicted by each prompt:

1. Example 3 in the validation set:

Premise: 'Two women are embracing while holding to go packages.'
Hypothesis: 'The men are fighting outside a deli.'
True label: Contradiction
Zero-shot label: Neutral
Few-shot label: Contradiction

The few-shot prompt yields the correct label, while the zero-shot prompt does not.

2. Example 5 in the validation set:

Premise: 'Two young children in blue jerseys, one with the number 9 and one with the number 2 are standing on wooden steps in a bathroom and washing their hands in a sink.'
Hypothesis: 'Two kids at a ballgame wash their hands.'
True label: Neutral
Zero-shot label: Neutral
Few-shot label: Entailment

Here, the zero-shot prompt predicts correctly, while the few-shot prompt does not.

3. Example 30 in the validation set:

Premise: 'Families waiting in line at an amusement park for their turn to ride.'
Hypothesis: 'People are waiting to see a movie.'
True label: Contradiction
Zero-shot label: Contradiction
Few-shot label: Neutral

Again, the zero-shot prompt correctly predicts the contradiction, while the few-shot prompt mistakes it for neutral.
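
Cases like these can be located by comparing the two prediction lists position by position. The following is a minimal sketch, assuming the lists are aligned with the dataset order; the helper `find_disagreements` is hypothetical and the values are illustrative, not the notebook's full predictions.

```python
def find_disagreements(preds_a, preds_b, refs):
    """Return (position, pred_a, pred_b, ref) for examples where the two prediction lists differ.

    Positions are 1-based to match the example numbering used in the text.
    """
    rows = []
    for i, (a, b, y) in enumerate(zip(preds_a, preds_b, refs), start=1):
        if a != b:
            rows.append((i, a, b, y))
    return rows


# Illustrative values only: 0=entailment, 1=neutral, 2=contradiction.
toy_zero = [0, 0, 1, 0, 1]
toy_few = [0, 0, 2, 0, 0]
toy_refs = [0, 0, 2, 0, 1]
rows = find_disagreements(toy_zero, toy_few, toy_refs)
print(rows)
```

In the notebook, the same comparison could be run on `zero_preds`, `few_preds`, and the labels of `val_ds`.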

Overall, these examples highlight the difficulty of controlling LLM behavior through prompting alone. The observed bias toward entailment likely contributes to errors, particularly for contradiction cases. Due to time constraints, further prompt engineering was not explored.

## Comparison of fine-tuning with prompting on the 100-sample validation subset

Accuracy of fine-tuned model: 96%

Accuracy of zero-shot model: 55%

Accuracy of few-shot model: 52%


Inference runtime of fine-tuned model: 20s

Inference runtime of zero-shot model: 273s

Inference runtime of few-shot model: 318s


The fine-tuned RoBERTa model outperforms both the zero-shot and few-shot prompting approaches in accuracy and inference runtime when all models are evaluated on the same 100 examples from the SNLI validation set. This result is expected, as fine-tuning explicitly adapts the model parameters to the NLI task and the label distribution of the dataset.
However, fine-tuning has a high up-front computational cost. Fine-tuning RoBERTa (freezing the embeddings and half of the encoder layers, and training only the remaining encoder layers and the classification head) required approximately 1 hour of training time on a GPU in Google Colab. In contrast, the prompting approaches required no training time and no training data beyond the few in-context examples. Although prompting is more flexible and data-efficient, its accuracy in this task (52-55%) was above the roughly 33% expected from random guessing over three classes but far below the fine-tuned model (96%), indicating that the LLM struggled to consistently infer the correct NLI labels, even when few-shot examples were provided. This suggests that, for structured classification tasks such as NLI, prompting without task-specific training may be insufficient.
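
The trade-off between up-front training cost and per-example inference cost can be made concrete with a break-even estimate. The sketch below uses the runtimes reported above and rests on simplifying assumptions: runtime scales linearly with the number of examples, the approximately 1 hour of training is taken as 3600 seconds, and hardware and memory costs are ignored. The function name `break_even_examples` is hypothetical.

```python
import math


def break_even_examples(train_seconds, tuned_sec_per_ex, prompt_sec_per_ex):
    """Smallest number of examples at which training plus tuned inference is no slower than prompting."""
    saving = prompt_sec_per_ex - tuned_sec_per_ex
    if saving <= 0:
        return None  # prompting is never slower per example, so there is no break-even point
    return math.ceil(train_seconds / saving)


# Per-example times derived from the 100-example runs reported above.
n = break_even_examples(
    train_seconds=3600,          # approximately 1 hour of fine-tuning
    tuned_sec_per_ex=20 / 100,   # fine-tuned model: 20 s for 100 examples
    prompt_sec_per_ex=273 / 100, # zero-shot prompting: 273 s for 100 examples
)
print(n)
```

Under these assumptions, the training cost is recovered after on the order of 1,400 classified examples, so prompting is mainly attractive in terms of time when only a small number of examples needs to be labeled.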
