[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Shravani018/llm-audit-bench/blob/main/notebooks/04_robustness_score.ipynb)

#### 04: Robustness Score
This notebook measures how stable each model is when its input is perturbed, using the relative perplexity shift under typos, word deletions, and synonym substitutions.

In [1]:
!pip install -q -r requirements.txt

In [2]:
# Importing necessary libraries
import json
import os
import math
import random
import string
import torch
import pandas as pd
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from tqdm import tqdm
import warnings
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet
warnings.filterwarnings("ignore")
random.seed(42)

In [3]:
# LLMs used
models=[
    "gpt2",
    "distilgpt2",
    "facebook/opt-125m",
    "EleutherAI/gpt-neo-125m",
    "bigscience/bloom-560m",
]


In [4]:
# Load the first 100 sentences of the SST-2 validation split
sst2=load_dataset("sst2", split="validation")
sentences=[row["sentence"] for row in sst2.select(range(100))]



In [5]:
# Return the first WordNet lemma that differs from the word, or the word itself if none is found
def get_synonym(word):
    synsets=wordnet.synsets(word)
    for syn in synsets:
        for lemma in syn.lemmas():
            candidate=lemma.name().replace("_", " ")
            if candidate.lower()!=word.lower():
                return candidate
    return word

In [6]:
# Replace the first word that has a WordNet synonym with that synonym
# (the replaced word is lowercased and loses any attached punctuation)
def perturb_synonym(sentence):
    words=sentence.split()
    result=words[:]
    for i, word in enumerate(words):
        clean=word.lower().strip(string.punctuation)
        syn=get_synonym(clean)
        if syn!=clean:
            result[i]=syn
            break
    return " ".join(result)

In [7]:
# Pick a random word and swap two adjacent characters to create a typo
# (words of two characters or fewer are left unchanged)
def perturb_typo(sentence):
    words=sentence.split()
    if not words:
        return sentence
    idx=random.randint(0, len(words) - 1)
    word=list(words[idx])
    if len(word) > 2:
        swap=random.randint(0, len(word) - 2)
        word[swap], word[swap + 1] = word[swap + 1], word[swap]
    words[idx]="".join(word)
    return " ".join(words)
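
One consequence of the swap above is that when the randomly chosen word has two or fewer characters, or the two swapped characters are identical, the sentence comes back unchanged and contributes a shift of exactly zero. The following is a minimal sketch of a variant that samples only from positions where a swap changes the text. The name `perturb_typo_eligible` and the example sentence are hypothetical and not part of the notebook, and a local `random.Random` is used so the example is reproducible on its own.

```python
import random

def perturb_typo_eligible(sentence, rng):
    """Swap two adjacent, differing characters in one randomly chosen word.

    Hypothetical variant of perturb_typo: only words longer than two
    characters that contain a differing adjacent pair are candidates, so
    the output differs from the input whenever a candidate exists.
    """
    words = sentence.split()
    candidates = []
    for i, w in enumerate(words):
        if len(w) > 2:
            positions = [j for j in range(len(w) - 1) if w[j] != w[j + 1]]
            if positions:
                candidates.append((i, positions))
    if not candidates:
        return sentence  # nothing can be swapped
    idx, positions = rng.choice(candidates)
    j = rng.choice(positions)
    chars = list(words[idx])
    chars[j], chars[j + 1] = chars[j + 1], chars[j]
    words[idx] = "".join(chars)
    return " ".join(words)

rng = random.Random(42)
original = "a charming and often affecting journey"  # illustrative sentence
perturbed = perturb_typo_eligible(original, rng)
unchanged = perturb_typo_eligible("a an", rng)  # no word is long enough
print(perturbed)
print(unchanged)
```

Whether to exclude no-op samples is a design choice: keeping them lowers the measured mean shift for the typo perturbation, while excluding them means every sample reflects an actual edit.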

In [8]:
# removes a random word from the sentence
def perturb_delete(sentence):
    words=sentence.split()
    if len(words) <= 1:
        return sentence
    idx=random.randint(0, len(words) - 1)
    words.pop(idx)
    return " ".join(words)

In [9]:
perturbations={
    "typo":    perturb_typo,
    "deletion": perturb_delete,
    "synonym": perturb_synonym
}

In [10]:
# Run the sentence through the model and return its perplexity, exp(mean token loss);
# returns None when the input is shorter than two tokens
def get_perplexity(model, tokenizer, sentence, device):
    inputs = tokenizer(
        sentence, return_tensors="pt",
        truncation=True, max_length=128
    ).to(device)
    if inputs["input_ids"].shape[1] < 2:
        return None
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return math.exp(outputs.loss.item())
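
When `labels` are passed, `outputs.loss` is the mean cross-entropy over the predicted tokens, so exponentiating it gives the perplexity. A minimal sketch of the same arithmetic in plain Python, using made-up per-token probabilities rather than model output (`perplexity_from_token_probs` is a hypothetical helper):

```python
import math

def perplexity_from_token_probs(token_probs):
    """Perplexity = exp(mean negative log-likelihood) over predicted tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Illustrative probabilities assigned to each correct next token.
uniform = perplexity_from_token_probs([0.25, 0.25, 0.25, 0.25])
mixed = perplexity_from_token_probs([0.5, 0.125])
print(round(uniform, 4), round(mixed, 4))
```

Both inputs come out at a perplexity of 4 up to floating-point rounding: the first because every token has probability 1/4, the second because the geometric mean of 0.5 and 0.125 is also 1/4.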

In [11]:
# Apply all three perturbations to each sentence, measure the relative perplexity shift
# before and after, and return an overall robustness score plus a per-perturbation breakdown.
def compute_robustness(model, tokenizer, sentences, device):
    per_type_shifts={k: [] for k in perturbations}
    all_shifts=[]
    for sentence in tqdm(sentences, desc="scoring sentences"):
        ppl_orig=get_perplexity(model, tokenizer, sentence, device)
        if ppl_orig is None or ppl_orig==0:
            continue
        for ptype, pfunc in perturbations.items():
            perturbed=pfunc(sentence)
            ppl_pert=get_perplexity(model, tokenizer, perturbed, device)
            if ppl_pert is None:
                continue
            shift=abs(ppl_pert - ppl_orig) / ppl_orig
            per_type_shifts[ptype].append(shift)
            all_shifts.append(shift)
    mean_shift=float(np.mean(all_shifts)) if all_shifts else 1.0
    robustness_score=round(max(0.0, 1.0 - min(mean_shift, 1.0)), 4)
    per_type_scores={
        ptype: round(max(0.0, 1.0 - min(float(np.mean(shifts)), 1.0)), 4)
        if shifts else None
        for ptype, shifts in per_type_shifts.items()
    }
    return robustness_score, mean_shift, per_type_scores
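
The scoring arithmetic in `compute_robustness` can be isolated as two small functions. This is a sketch with hypothetical helper names (`relative_shift`, `score_from_mean_shift`); the perplexities in the first call are made up, and the three mean shifts are the values printed for `gpt2`, `facebook/opt-125m`, and `bigscience/bloom-560m` in the run output of this notebook.

```python
def relative_shift(ppl_orig, ppl_pert):
    """Absolute change in perplexity, relative to the original perplexity."""
    return abs(ppl_pert - ppl_orig) / ppl_orig

def score_from_mean_shift(mean_shift):
    """Map a mean relative shift to [0, 1]; shifts of 1.0 or more score 0."""
    return round(max(0.0, 1.0 - min(mean_shift, 1.0)), 4)

# Made-up example: perplexity rises from 50 to 80, a 60% relative shift.
example_shift = relative_shift(50.0, 80.0)
print(example_shift)

# Mean shifts taken from the printed results.
scores = {ms: score_from_mean_shift(ms) for ms in (0.5423, 0.799, 4.6668)}
print(scores)
```

Because the shift is capped at 1.0, any model whose perplexity doubles on average scores 0, and the score cannot distinguish a mean shift of 1.0 from one of 4.67. The shift is also asymmetric: a drop in perplexity can contribute at most 1.0, while a rise is unbounded, so a few large spikes can dominate the mean.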

In [12]:
# Load one model, compute its robustness scores, and free its memory
def evaluate_robustness(model_id, sentences):
    print(f"\nEvaluating: {model_id}")
    device="cuda" if torch.cuda.is_available() else "cpu"
    tokenizer=AutoTokenizer.from_pretrained(model_id)
    model=AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.float32)
    model=model.to(device)
    model.eval()
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    robustness_score, mean_shift, per_type = compute_robustness(
        model, tokenizer, sentences, device)
    del model
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    print(f"Robustness:{robustness_score}|mean shift: {round(mean_shift, 4)}")
    print(f"per type:{per_type}")
    return {
        "model_id":        model_id,
        "robustness_score": robustness_score,
        "mean_shift":       round(mean_shift, 4),
        "sentences_tested": len(sentences),
        "per_perturbation": per_type,
    }

In [13]:
results = [evaluate_robustness(m, sentences) for m in models]
print(f"\nEvaluated {len(results)} models.")


Evaluating: gpt2



GPT2LMHeadModel LOAD REPORT from: gpt2
Key                  | Status     |  | 
---------------------+------------+--+-
h.{0...11}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.
scoring sentences: 100%|██████████| 100/100 [02:16<00:00,  1.36s/it]


Robustness:0.4577|mean shift: 0.5423
per type:{'typo': 0.3625, 'deletion': 0.4892, 'synonym': 0.5213}

Evaluating: distilgpt2



GPT2LMHeadModel LOAD REPORT from: distilgpt2
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
transformer.h.{0, 1, 2, 3, 4, 5}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
scoring sentences: 100%|██████████| 100/100 [01:15<00:00,  1.33it/s]


Robustness:0.4535|mean shift: 0.5465
per type:{'typo': 0.2557, 'deletion': 0.4658, 'synonym': 0.6389}

Evaluating: facebook/opt-125m



scoring sentences: 100%|██████████| 100/100 [01:20<00:00,  1.24it/s]

Robustness:0.201|mean shift: 0.799
per type:{'typo': 0.0186, 'deletion': 0.4343, 'synonym': 0.1503}

Evaluating: EleutherAI/gpt-neo-125m






GPTNeoForCausalLM LOAD REPORT from: EleutherAI/gpt-neo-125m
Key                                                   | Status     |  | 
------------------------------------------------------+------------+--+-
transformer.h.{0...11}.attn.attention.masked_bias     | UNEXPECTED |  | 
transformer.h.{0, 2, 4, 6, 8, 10}.attn.attention.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



scoring sentences: 100%|██████████| 100/100 [01:25<00:00,  1.17it/s]

Robustness:0.3278|mean shift: 0.6722
per type:{'typo': 0.2035, 'deletion': 0.3312, 'synonym': 0.4486}

Evaluating: bigscience/bloom-560m






scoring sentences: 100%|██████████| 100/100 [06:28<00:00,  3.89s/it]


Robustness:0.0|mean shift: 4.6668
per type:{'typo': 0.0, 'deletion': 0.638, 'synonym': 0.7409}

Evaluated 5 models.


In [16]:
with open("./robustness_scores.json", "w") as f:
    json.dump({"robustness": results}, f, indent=2)

- `gpt2` and `distilgpt2` have the highest overall scores at 0.46 and 0.45, although a score of 0.46 still corresponds to a mean relative perplexity shift of about 54%.
- `distilgpt2`, a distilled version of `gpt2`, scores almost the same overall. It is weaker on typos (0.26 vs 0.36) and stronger on synonyms (0.64 vs 0.52), so distillation does not appear to have hurt overall robustness on this sample.
- `gpt-neo-125m` sits in the middle at 0.33, with per-type scores ranging from 0.20 (typo) to 0.45 (synonym).
- `opt-125m` scores 0.20 overall, pulled down by a typo score of 0.02 and a synonym score of 0.15, indicating extreme sensitivity to character-level changes.
- `bloom-560m` scores 0.0 overall because its mean shift of 4.67 is far beyond the 1.0 normalisation cap. The typo score of 0.0 shows that typos drive this, while its deletion (0.64) and synonym (0.74) scores are the highest of the five models.
- **Synonym substitution** shows the widest spread across models, from 0.15 to 0.74, so it separates the models most clearly.
- Deletion scores span the narrowest range, 0.33 to 0.64, suggesting word removal is handled more uniformly across architectures than the other perturbations.
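
The figures in these observations can be cross-checked from the printed results alone. The sketch below copies the per-type scores and mean shifts from the run output, computes each perturbation's spread across models, and estimates the uncapped typo shift for `bloom-560m`. It assumes each perturbation type contributed the same number of samples to the pooled mean; that assumption reproduces the reported mean shifts of the four uncapped models to within rounding, but it is an inference from the output, not something the notebook records.

```python
# Per-perturbation scores and mean shifts copied from the printed results.
per_type = {
    "gpt2": {"typo": 0.3625, "deletion": 0.4892, "synonym": 0.5213},
    "distilgpt2": {"typo": 0.2557, "deletion": 0.4658, "synonym": 0.6389},
    "facebook/opt-125m": {"typo": 0.0186, "deletion": 0.4343, "synonym": 0.1503},
    "EleutherAI/gpt-neo-125m": {"typo": 0.2035, "deletion": 0.3312, "synonym": 0.4486},
    "bigscience/bloom-560m": {"typo": 0.0, "deletion": 0.638, "synonym": 0.7409},
}
mean_shift = {
    "gpt2": 0.5423,
    "distilgpt2": 0.5465,
    "facebook/opt-125m": 0.799,
    "EleutherAI/gpt-neo-125m": 0.6722,
    "bigscience/bloom-560m": 4.6668,
}

# Spread (max - min) of each perturbation's score across the five models.
spread = {}
for ptype in ("typo", "deletion", "synonym"):
    values = [scores[ptype] for scores in per_type.values()]
    spread[ptype] = round(max(values) - min(values), 4)
print(spread)

# Assumption: equal sample counts per type, so the pooled mean shift is the
# average of the three per-type mean shifts. A score above 0 was not capped,
# so its mean shift is 1 - score. Only the typo shift for bloom is unknown.
bloom = per_type["bigscience/bloom-560m"]
typo_shift = (
    3 * mean_shift["bigscience/bloom-560m"]
    - (1 - bloom["deletion"])
    - (1 - bloom["synonym"])
)
print(round(typo_shift, 2))
```

Under that assumption the mean typo shift for `bloom-560m` comes out at roughly 13.4, which is why the pooled mean exceeds the cap even though the other two perturbation types score well.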


Next: 05_explainability_score.ipynb
Measures how interpretable each model's predictions are by computing SHAP token attribution scores over a set of input sentences.
