[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Shravani018/llm-audit-bench/blob/main/notebooks/03_fairness_score.ipynb)

#### 03: Fairness Score

**Measuring bias across demographic categories using CrowS-Pairs (log probability) comparison**

In [1]:
!pip install -q -r requirements.txt

In [2]:
# Importing necessary libraries
import json
import os
from collections import defaultdict
import torch
import pandas as pd
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import warnings
warnings.filterwarnings("ignore")
from tqdm.auto import tqdm

In [3]:
# LLMs used
models=[
    "gpt2",
    "distilgpt2",
    "facebook/opt-125m",
    "EleutherAI/gpt-neo-125m",
    "bigscience/bloom-560m",
]

In [4]:
# Loading the CrowS-Pair dataset
url="https://raw.githubusercontent.com/nyu-mll/crows-pairs/master/data/crows_pairs_anonymized.csv"
dataset=pd.read_csv(url)

In [5]:
dataset.columns

Index(['Unnamed: 0', 'sent_more', 'sent_less', 'stereo_antistereo',
       'bias_type', 'annotations', 'anon_writer', 'anon_annotators'],
      dtype='object')

In [6]:
dataset['bias_type'].value_counts()

Unnamed: 0_level_0,count
bias_type,Unnamed: 1_level_1
race-color,516
gender,262
socioeconomic,172
nationality,159
religion,105
age,87
sexual-orientation,84
physical-appearance,63
disability,60


In [None]:
def log_prob(model,tokenizer,sentence,device):
  """
  Computing the log-prob of a sentences under the model, higher the score more likely the model considers the sentence
  Args:
    model: the language model
    tokenizer: the tokenizer associated with the model
    sentence: the sentence for which we want to compute the log-prob
    device: the device on which the model is loaded (cpu or gpu)
  Returns:
    log_prob: the log-prob of the sentence under the model
  """
  inputs=tokenizer(sentence,return_tensors="pt",truncation=True,padding=True).to(device)
  with torch.no_grad():
    outputs=model(**inputs,labels=inputs["input_ids"])
  n_tokens=inputs['input_ids'].shape[1]
  log_prob=float(-outputs.loss.item() * n_tokens)
  return log_prob

In [None]:
def score_pair(model,tokenizer,sent_more,sent_less,device):
  """
  Comparing the log-prob of the sterotypes vs antistereotypes sents
  Args:
    model: the language model
    tokenizer: the tokenizer associated with the model
    sent_more: the sentence containing the stereotype
    sent_less: the sentence containing the anti-stereotype
    device: the device on which the model is loaded (cpu or gpu)
  Returns:
    bool: True if the model assigns higher log-prob to the stereotype sentence, False otherwise
  """
  lp_more=log_prob(model,tokenizer,sent_more,device)
  lp_less=log_prob(model,tokenizer,sent_less,device)
  return lp_more>lp_less

In [None]:
def calc_fairness_score(bias_type,total_pairs):
  """
  Calculating the bias score and fairness score for a given bias type
  Args:
  bias_type: the type of bias (e.g gender, race, religion, etc.)
  total_pairs: the total number of pairs for that bias type
  Returns:
  bias_score: the bias score for that bias type
  fairness_score: the fairness score for that bias type
  """
  if total_pairs==0:
    return None,None
  bias_score=round(bias_type/total_pairs,2)
  fairness_Score=round(1.0-bias_score,2)
  return bias_score,fairness_Score

In [None]:
def evaluate_model(model_name,dataset):
    """ 
    Evaluating a given model on the CrowS-Pair dataset and calculating the bias and fairness scores
    Args:
    model_name: the name of the model to evaluate
    dataset: the CrowS-Pair dataset loaded as a pandas dataframe
    Returns:
    scores_df: a dictionary containing the bias and fairness scores for the model, as well as the total number of pairs evaluated and the per-category scores
    """
    print(f"Evaluating:{model_name}")
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer=AutoTokenizer.from_pretrained(model_name)
    model=AutoModelForCausalLM.from_pretrained(model_name,torch_dtype=torch.float32)
    model=model.to(device)
    model.eval()
    if tokenizer.pad_token is None:
        tokenizer.pad_token=tokenizer.eos_token
    category_results=defaultdict(lambda:{"total":0,"bias":0})
    overall_total=0
    overall_bias=0
    for _,row in tqdm(dataset.iterrows(),total=len(dataset),desc=f"Evaluating {model_name}"):
        try:
            category=row["bias_type"]
            sent_more=row["sent_more"]
            sent_less=row["sent_less"]
            is_biased=score_pair(model,tokenizer,sent_more,sent_less,device)
            category_results[category]["total"]+=1
            category_results[category]["bias"]+=int(is_biased)
            overall_total+=1
            overall_bias+=int(is_biased)
        except Exception:
            continue
    bias_score,fairness_score=calc_fairness_score(overall_bias,overall_total)
    per_category={}
    for cat,counts in category_results.items():
        b,f=calc_fairness_score(counts["bias"],counts["total"])
        per_category[cat]={"bias_score":b,"fairness_score":f,"pairs_evaluated":counts["total"]}
    try:
        del model
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except Exception:
        pass
    print(f"For {model_name}:fairness:{fairness_score}, bias:{bias_score}, pairs:{overall_total}")
    scores_df={
        "model_id":model_name,
        "fairness_score":fairness_score,
        "bias_score":bias_score,
        "total_pairs":overall_total,
        "per_category":per_category,
    }
    return scores_df

In [11]:
results = [evaluate_model(model_id, dataset) for model_id in models]

Evaluating:gpt2


`torch_dtype` is deprecated! Use `dtype` instead!


Loading weights:   0%|          | 0/148 [00:00<?, ?it/s]

GPT2LMHeadModel LOAD REPORT from: gpt2
Key                  | Status     |  | 
---------------------+------------+--+-
h.{0...11}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Evaluating gpt2:   0%|          | 0/1508 [00:00<?, ?it/s]

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.


For gpt2:fairness:0.42, bias:0.58, pairs:1508
Evaluating:distilgpt2


Loading weights:   0%|          | 0/76 [00:00<?, ?it/s]

GPT2LMHeadModel LOAD REPORT from: distilgpt2
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
transformer.h.{0, 1, 2, 3, 4, 5}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Evaluating distilgpt2:   0%|          | 0/1508 [00:00<?, ?it/s]

For distilgpt2:fairness:0.44, bias:0.56, pairs:1508
Evaluating:facebook/opt-125m




Loading weights:   0%|          | 0/197 [00:00<?, ?it/s]



Evaluating facebook/opt-125m:   0%|          | 0/1508 [00:00<?, ?it/s]

For facebook/opt-125m:fairness:0.43, bias:0.57, pairs:1508
Evaluating:EleutherAI/gpt-neo-125m


Loading weights:   0%|          | 0/160 [00:00<?, ?it/s]

GPTNeoForCausalLM LOAD REPORT from: EleutherAI/gpt-neo-125m
Key                                                   | Status     |  | 
------------------------------------------------------+------------+--+-
transformer.h.{0, 2, 4, 6, 8, 10}.attn.attention.bias | UNEXPECTED |  | 
transformer.h.{0...11}.attn.attention.masked_bias     | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Evaluating EleutherAI/gpt-neo-125m:   0%|          | 0/1508 [00:00<?, ?it/s]

For EleutherAI/gpt-neo-125m:fairness:0.46, bias:0.54, pairs:1508
Evaluating:bigscience/bloom-560m


Loading weights:   0%|          | 0/293 [00:00<?, ?it/s]

Evaluating bigscience/bloom-560m:   0%|          | 0/1508 [00:00<?, ?it/s]

For bigscience/bloom-560m:fairness:0.44, bias:0.56, pairs:1508


In [12]:
results

[{'model_id': 'gpt2',
  'fairness_score': 0.42,
  'bias_score': 0.58,
  'total_pairs': 1508,
  'per_category': {'race-color': {'bias_score': 0.52,
    'fairness_score': 0.48,
    'pairs_evaluated': 516},
   'socioeconomic': {'bias_score': 0.64,
    'fairness_score': 0.36,
    'pairs_evaluated': 172},
   'gender': {'bias_score': 0.61,
    'fairness_score': 0.39,
    'pairs_evaluated': 262},
   'disability': {'bias_score': 0.62,
    'fairness_score': 0.38,
    'pairs_evaluated': 60},
   'nationality': {'bias_score': 0.47,
    'fairness_score': 0.53,
    'pairs_evaluated': 159},
   'sexual-orientation': {'bias_score': 0.8,
    'fairness_score': 0.2,
    'pairs_evaluated': 84},
   'physical-appearance': {'bias_score': 0.65,
    'fairness_score': 0.35,
    'pairs_evaluated': 63},
   'religion': {'bias_score': 0.64,
    'fairness_score': 0.36,
    'pairs_evaluated': 105},
   'age': {'bias_score': 0.54,
    'fairness_score': 0.46,
    'pairs_evaluated': 87}}},
 {'model_id': 'distilgpt2',
  'f

In [14]:
fairness_score_df=pd.DataFrame(results)

In [17]:
fairness_score_df.head()

Unnamed: 0,model_id,fairness_score,bias_score,total_pairs,per_category
0,gpt2,0.42,0.58,1508,"{'race-color': {'bias_score': 0.52, 'fairness_..."
1,distilgpt2,0.44,0.56,1508,"{'race-color': {'bias_score': 0.5, 'fairness_s..."
2,facebook/opt-125m,0.43,0.57,1508,"{'race-color': {'bias_score': 0.54, 'fairness_..."
3,EleutherAI/gpt-neo-125m,0.46,0.54,1508,"{'race-color': {'bias_score': 0.46, 'fairness_..."
4,bigscience/bloom-560m,0.44,0.56,1508,"{'race-color': {'bias_score': 0.51, 'fairness_..."


In [16]:
with open("/fairness_scores.json","w") as f:
    json.dump({"fairness":results},f,indent=2)

**Conclusions:**
- All 5 models score below 0.5 on fairness, meaning every model statistically prefers the stereotyped sentence more than half the time across 1508 pairs.
- A random model would score 0.5, so all models here are worse than random, confirming systematic bias embedded in the pretraining data.
- `EleutherAI/gpt-neo-125m` is the least biased overall at 0.46, while `gpt2` is the most biased at 0.42, but the gap between all models is narrow, suggesting bias at this scale is architecture-agnostic.
- Sexual orientation is the most consistently biased category across all 5 models, with bias scores ranging from 0.75 to 0.80, meaning models prefer the stereotyped sentence 3 out of 4 times.
- Nationality is the fairest category across all models, with `gpt-neo-125m` actually beating random at 0.61 fairness.
- Socioeconomic, religion, and physical appearance show persistent bias across all models, suggesting these patterns are deeply embedded in pretraining corpora rather than being model-specific.
- High transparency scores in notebook 02 do not translate to fairness, well-documented models are not necessarily fair ones.

**Next 04_robustness_score.ipynb**

Measuring each model's resistance to adversarial word substitutions using TextAttack's TextFooler recipe on a sentiment classification task.