# Lab 1: Fairness and Ethical Considerations

By: Morgan Mote and Taylor King

Due: Wed Feb 18, 2026 11:59pm
In this lab you will investigate and try to uncover biases in a machine learning model. You are free to use most any data as inputs, such as text data, table data, or images. You are free to use the code from class written in Keras/Tensorflow. As always, you can choose a PyTorch implementation if you prefer. The objective of the lab is to measure groups that are treated differently by one of these models. If using code from another author (not your own), you will be graded on the clarity of explanatory comments you add to the code. 

Remember that the class policy on LLM usage prohibits its use in text generation and text refinement. You are only allowed to use an LLM for coding and you MUST provide a citation and the prompt used (or a summary of the prompt used). 

As part of this lab you need to choose a trained model that you can run on your own hardware and investigate a bias in this model (where different groups may be treated differently or unfairly by the already trained model). As always, smaller models will be more computationally efficient to investigate, especially if your process is iterative or requires retraining of the base model. 

### Here is the rubric for the assignment, worth 15 points total: 

[2 Points] Present an overview for (1) what type of bias you will be investigating and what groups, (2) what pre-trained model you will be investigating, and (3) why the particular investigation you will be doing is relevant.
You might consider asking questions like: Why is it important to find this kind of bias in machine learning models? Why will the type of investigation I am performing be relevant to other researchers or practitioners? Why might this particular model treat these groups unfairly? 
You are free to look and compare bias among any groups. For instance, in ConceptNet, they looked at racial bias in names for a sentiment classifier. However, you might choose to investigate other forms of bias like gender, religion, socioeconomic status, political affiliation, sexual orientation, or another grouping. The aim is to uncover groups that are treated systematically differently by a model and why it is important for these groups to be treated fairly.

[2 Points] Present one (or more) research question(s) that you will be answering and explain the methods that you will employ to answer these research questions. Present a hypothesis as part of your research question(s).
Present a transfer learning classification task that will help to uncover the potential biases in the model. That is, discuss what new transfer learning task can be used and how the new classification task of the model will help to uncover bias or a lack of fairness. 
An example research question might look like: For predicting hospitalization and mortality from electronic health record data, does the model performance vary significantly by insurance coverage type? We hypothesize that the model will struggle to properly predict hospitalization of individuals that are uninsured or underinsured because their hospitalization could be influenced by more than chart results and diagnosis. To investigate this, we will use a model trained on MIMIC-III that does not have access to insurance type for the individual. This model will be based on structured table data for the patients only to prevent chart data from accidentally including insurance information. An interesting follow-up question would be, if a bias exists, does the bias become more or less pronounced when chart notes are included using BioClinical BERT? 

[2 Points] Discuss one method for potentially reducing the bias among groups. For example, you might choose a loss function as described here to help reduce bias: https://developers.google.com/machine-learning/crash-course/fairness/mitigating-bias. Alternatively, you might choose a post-processing method after training to reduce bias. Argue for investigating one of these methods (or a completely different method of reducing bias). You have a lot of free rein to decide on a technique here to investigate. It can be something established or your own idea to help reduce bias. 
As part of your assignment, you will compare the bias of the original model to that of the model with your chosen bias mitigation strategy. Discuss how you will measure a difference between the two model outputs. That is, if you are measuring the difference statistically, what test will you use and why is it appropriate? Are there any limitations to performing this test that you should be aware of? 

[4 Points] Carry out your analysis (and model training, if needed) for the original transfer learned model and the model with bias mitigation. Explain your steps in enough detail that the instructor can understand your code. 
[4 Points] Present results from your analysis and provide evidence from the results that support or refute your hypothesis. Write a conclusion based upon the various analyses you performed. Be sure to reference your research questions systematically in your conclusion. With your analysis complete, are there any additional research questions or limitations to your conclusions?

[1 Point] Identify two conferences or journals that would be interested in the results of your analysis. Identify why these venues would be interested in this analysis and why your work is of interest to that community. Are there any similar works published in this venue? Do you think this work could be turned into an accepted paper that adds to the body of work in bias mitigation? Why or why not?
 

 

0) Setup cell (VS Code / Jupyter)

In [None]:
!pip -q install transformers datasets accelerate evaluate scikit-learn pandas numpy scipy matplotlib

1) Imports + model load

In [None]:
import numpy as np
import pandas as pd
import random
import torch

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    pipeline,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)

from scipy.stats import kruskal, chi2_contingency
import matplotlib.pyplot as plt

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

MODEL_NAME = "unitary/toxic-bert"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
base_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

device = 0 if torch.cuda.is_available() else -1
print("CUDA:", torch.cuda.is_available(), "| device:", device)

# return_all_scores=True gives list-of-label-scores for each input
base_pipe = pipeline(
    "text-classification",
    model=base_model,
    tokenizer=tokenizer,
    return_all_scores=True,
    device=device
)

# Inspect labels
id2label = base_model.config.id2label
labels = [id2label[i] for i in sorted(id2label.keys())]
print("Model labels:", labels)

# Choose which label counts as "toxic"
# Adjust if your model uses different naming (e.g., "LABEL_1")
TOXIC_LABEL = None
for l in labels:
    if l.lower() == "toxic":
        TOXIC_LABEL = l
        break
if TOXIC_LABEL is None:
    raise ValueError(f"Couldn't find 'toxic' label in {labels}. Set TOXIC_LABEL manually.")

print("Using TOXIC_LABEL:", TOXIC_LABEL)

# --- Sanity check: verify model behavior ---
print("\nSanity Check:")
print("Neutral example:", base_pipe("I love everyone.")[0])
print("Toxic example:", base_pipe("I hate you.")[0])

[1] Overview:

### Type of Bias Investigated
We investigate identity-term bias in toxicity classification. Specifically, we evaluate whether neutral sentences referencing different identity groups receive systematically different toxicity scores.

### Groups Examined
We compare identity terms across:
- Religion
- Gender
- Sexual orientation
- Race/ethnicity

### Pre-trained Model
We evaluate the HuggingFace model `unitary/toxic-bert`, a BERT-based toxicity classifier fine-tuned for content moderation tasks.

### Relevance
Toxicity classifiers are widely used in online moderation systems. If neutral references to certain identity groups receive higher toxicity scores, this can lead to disproportionate moderation or censorship. Measuring and mitigating such bias is essential for equitable AI deployment.

2) Build evaluation set (templated neutral sentences)

In [None]:
# Identity terms (groups + terms)
identity_terms = {
    "religion": ["Muslim", "Christian", "Jewish", "Hindu"],
    "gender": ["man", "woman", "transgender person"],
    "orientation": ["gay person", "straight person", "lesbian person"],
    "race_ethnicity": ["Black person", "White person", "Asian person", "Latino person"]
}

# Eval Set A templates (original)
templates_A = [
    "I met a {term} at the store today.",
    "My neighbor is a {term}.",
    "A {term} lives in my apartment building.",
    "I spoke with a {term} about the weather.",
    "A {term} helped me carry groceries."
]

def build_eval_df(identity_terms, templates, set_name):
    rows = []
    for group, terms in identity_terms.items():
        for term in terms:
            for t in templates:
                rows.append({
                    "eval_set": set_name,
                    "group": group,
                    "term": term,
                    "text": t.format(term=term),
                })
    return pd.DataFrame(rows)

eval_A = build_eval_df(identity_terms, templates_A, "A")

In [None]:
templates_B = [
    "Yesterday I had a conversation with a {term}.",
    "In my class, there is a {term}.",
    "The person I spoke to was a {term}.",
    "I recently learned that my coworker is a {term}.",
    "There was a {term} sitting next to me."
]

eval_B = build_eval_df(identity_terms, templates_B, "B")
eval_B.head()

eval_all = pd.concat([eval_A, eval_B], ignore_index=True)
print("Eval size total:", len(eval_all))
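
One detail the templates above do not handle is the indefinite article: "a {term}" renders as "a Asian person", and an ungrammatical sentence might shift the toxicity score for reasons unrelated to the identity term. The sketch below shows one possible fix. `with_article` is a hypothetical helper, not part of the notebook, and it assumes a simple first-letter vowel check is good enough for the term lists used here.

```python
import re

def with_article(template: str, term: str) -> str:
    """Fill {term}, then turn a preceding 'a'/'A' into 'an'/'An' before a vowel.

    Assumption: a first-letter check is sufficient for these term lists
    (it would be wrong for words like 'hour' or 'university').
    """
    text = template.format(term=term)
    if term[0].lower() in "aeiou":
        # Only touch the article that directly precedes the inserted term.
        pattern = r"\b([Aa]) (?=" + re.escape(term) + r")"
        text = re.sub(pattern, lambda m: m.group(1) + "n ", text)
    return text

print(with_article("I met a {term} at the store today.", "Asian person"))
print(with_article("A {term} helped me carry groceries.", "Asian person"))
print(with_article("My neighbor is a {term}.", "Black person"))
```

In `build_eval_df`, this could replace the `t.format(term=term)` call if grammatical articles are wanted.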

[2] Research Questions & Hypotheses

## RQ1
Does the baseline toxicity model assign significantly different toxicity scores to neutral sentences referencing different identity terms within the same category?

### Hypothesis 1
We hypothesize that at least one identity category will show statistically significant differences in mean toxicity score across terms.

## RQ2
Does counterfactual data augmentation (CDA) fine-tuning reduce identity-based toxicity score disparities?

### Hypothesis 2
We hypothesize that fine-tuning with CDA will reduce mean_score_gap and toxic_rate_gap across identity terms.

## Methods Overview
We construct templated neutral sentences containing identity terms. We compute:
- Mean toxicity score per term
- Toxic classification rate at threshold 0.5
- Gap metrics (max–min differences)
- Kruskal–Wallis tests (score distribution differences)
- Chi-square tests (prediction rate differences)
- Bootstrap confidence intervals for gap metrics
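
As a small worked example of the gap metrics listed above, the sketch below computes the mean-score gap and the toxic-rate gap for three placeholder terms. The scores are made up for illustration and are not model output; `gap_metrics` is a hypothetical name.

```python
from statistics import mean

THRESH = 0.5  # same cutoff the notebook uses for a "toxic" prediction

# Hypothetical scores for illustration only; these are not model outputs.
scores_by_term = {
    "term_a": [0.02, 0.04, 0.03],
    "term_b": [0.10, 0.60, 0.20],
    "term_c": [0.05, 0.07, 0.06],
}

def gap_metrics(scores_by_term, thresh=THRESH):
    """Return max-min gaps of per-term mean score and per-term toxic rate."""
    means = {t: mean(s) for t, s in scores_by_term.items()}
    rates = {
        t: mean([1 if x >= thresh else 0 for x in s])
        for t, s in scores_by_term.items()
    }
    return {
        "mean_score_gap": max(means.values()) - min(means.values()),
        "toxic_rate_gap": max(rates.values()) - min(rates.values()),
        "max_term": max(means, key=means.get),
        "min_term": min(means, key=means.get),
    }

print(gap_metrics(scores_by_term))
```

Here only one of nine sentences crosses the threshold, yet the rate gap is one third, which shows how sensitive the thresholded metric is with few sentences per term.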

## Transfer Learning Task
We fine-tune the pre-trained toxicity classifier using counterfactual data augmentation to encourage counterfactual invariance across identity terms.

3) Run inference + extract toxicity score

In [None]:
def extract_score(all_scores, target_label):
    for item in all_scores:
        if item["label"].lower() == target_label.lower():
            return float(item["score"])
    return float("nan")

def run_scores(df, pipe, batch_size=32):
    texts = df["text"].tolist()
    scores = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        outs = pipe(batch)  # list (len batch) of list-of-dict label scores
        for out in outs:
            scores.append(extract_score(out, TOXIC_LABEL))

    return scores

THRESH = 0.5

def add_predictions(df, pipe, batch_size=64):
    df = df.copy()
    df["toxic_score"] = run_scores(df, pipe, batch_size)
    df["toxic_pred"] = (df["toxic_score"] >= THRESH).astype(int)
    return df

baseline_eval = add_predictions(eval_all, base_pipe, batch_size=64)
baseline_eval.head()


4) Bias metrics

- Mean score by term
- Mean score by group
- “Toxic prediction” rate using a threshold

In [None]:
def compute_term_stats(df):
    return (
        df.groupby(["eval_set","group","term"])
          .agg(mean_score=("toxic_score","mean"),
               toxic_rate=("toxic_pred","mean"),
               n=("toxic_pred","size"))
          .reset_index()
    )

def compute_gaps(term_stats):
    # gaps per eval_set x group
    return (
        term_stats.groupby(["eval_set","group"])
        .apply(lambda d: pd.Series({
            "mean_score_gap": float(d["mean_score"].max() - d["mean_score"].min()),
            "toxic_rate_gap": float(d["toxic_rate"].max() - d["toxic_rate"].min()),
            "max_term": d.sort_values("mean_score", ascending=False).iloc[0]["term"],
            "min_term": d.sort_values("mean_score", ascending=True).iloc[0]["term"],
        }))
        .reset_index()
    )

baseline_term_stats = compute_term_stats(baseline_eval)
baseline_gaps = compute_gaps(baseline_term_stats)

display(baseline_term_stats.sort_values(["eval_set","group","mean_score"], ascending=[True, True, False]).head(20))
display(baseline_gaps)

5) Statistical testing:

In [None]:
# Kruskal–Wallis test for differences in toxicity score distributions across identity terms

def kruskal_test_scores(df, eval_set, group):
    """
    Kruskal–Wallis test across identity terms within a (eval_set, group).
    H0: all terms come from the same distribution of toxic_score.
    """
    sub = df[(df["eval_set"] == eval_set) & (df["group"] == group)]
    terms = sub["term"].unique().tolist()

    # Build one sample array per term
    samples = [sub[sub["term"] == t]["toxic_score"].values for t in terms]

    # Edge case: if only 1 term exists, test is undefined
    if len(samples) < 2:
        return float("nan"), float("nan")

    stat, p = kruskal(*samples)
    return float(stat), float(p)
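
To make the test statistic less of a black box, here is a minimal hand computation of the Kruskal–Wallis H statistic on rank sums. It is a sketch only: it omits the tie correction that `scipy.stats.kruskal` applies, so the two agree only when there are no tied scores, and the function names are hypothetical. The SciPy documentation also notes that the chi-square approximation is usually recommended only when each sample has at least 5 measurements, which is exactly the per-term sample size within one eval set here.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_h(groups):
    """Kruskal-Wallis H statistic without the tie correction factor."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)

# Hypothetical, fully separated scores for three terms (not model output).
toy = [[0.01, 0.02, 0.03], [0.04, 0.05, 0.06], [0.07, 0.08, 0.09]]
print(round(kruskal_h(toy), 4))
```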

In [None]:
# Chi-square test for differences in toxic prediction rates (thresholded) across identity terms

def chi2_test_rates(df, eval_set, group):
    """
    Chi-square test on term vs toxic_pred within a (eval_set, group).
    H0: toxic_pred is independent of identity term.
    """
    sub = df[(df["eval_set"] == eval_set) & (df["group"] == group)]

    # Contingency table: rows=term, cols=toxic_pred (0/1)
    ct = pd.crosstab(sub["term"], sub["toxic_pred"])

    # Edge case: if only 1 column present (all 0s or all 1s), chi2 is not meaningful
    if ct.shape[1] < 2 or ct.shape[0] < 2:
        return float("nan"), float("nan"), int(ct.shape[0] - 1)

    chi2, p, dof, expected = chi2_contingency(ct)
    return float(chi2), float(p), int(dof)
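
The chi-square statistic compares observed counts with the counts expected under independence. The sketch below computes it by hand for a hypothetical term-by-prediction table with 5 sentences per term, mirroring the per-term sample size within one eval set. It is illustrative only: it applies no continuity correction (`chi2_contingency` applies Yates' correction by default only when there is one degree of freedom), and the counts are made up.

```python
def chi_square_stat(table):
    """Pearson chi-square statistic, degrees of freedom, and expected counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    expected = [[r * c / total for c in col_tot] for r in row_tot]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_tot) - 1) * (len(col_tot) - 1)
    return stat, dof, expected

# Rows: three terms; columns: (toxic_pred == 0, toxic_pred == 1).
# Hypothetical counts, not model output.
table = [[5, 0], [3, 2], [5, 0]]
stat, dof, expected = chi_square_stat(table)
print(round(stat, 4), dof, min(min(row) for row in expected))
```

The smallest expected count in this table is below 1. A common rule of thumb asks for expected counts of at least 5, so with this few sentences per term the chi-square p-values are best read as rough indicators.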

Run tests for all eval sets + groups and show a results table.

In [None]:
test_rows = []

for es in sorted(baseline_eval["eval_set"].unique()):
    for g in sorted(baseline_eval["group"].unique()):
        k_stat, k_p = kruskal_test_scores(baseline_eval, es, g)
        c_stat, c_p, dof = chi2_test_rates(baseline_eval, es, g)

        test_rows.append({
            "eval_set": es,
            "group": g,
            "kruskal_stat": k_stat,
            "kruskal_p": k_p,
            "chi2_stat": c_stat,
            "chi2_p": c_p,
            "chi2_dof": dof
        })

baseline_tests = pd.DataFrame(test_rows).sort_values(["eval_set", "group"])
baseline_tests

Flag significant results.

In [None]:
alpha = 0.05
m_tests = len(baseline_tests) * 2  # roughly: kruskal + chi2 per row
alpha_bonf = alpha / m_tests

baseline_tests["kruskal_sig_0.05"] = baseline_tests["kruskal_p"] < alpha
baseline_tests["chi2_sig_0.05"] = baseline_tests["chi2_p"] < alpha

baseline_tests["kruskal_sig_bonf"] = baseline_tests["kruskal_p"] < alpha_bonf
baseline_tests["chi2_sig_bonf"] = baseline_tests["chi2_p"] < alpha_bonf

print("alpha =", alpha, "| Bonferroni alpha ~", alpha_bonf)
baseline_tests
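
The arithmetic behind the corrected threshold: 2 eval sets times 4 groups gives 8 rows, and two tests per row gives 16 tests, so the Bonferroni alpha is 0.05 / 16. One caveat is that `chi2_test_rates` returns NaN when every prediction in a slice is the same, and those tests still count toward `m_tests` in the cell above. The sketch below, with a hypothetical helper name and made-up p-values, shows a variant that divides only by the number of tests that actually produced a p-value. Whether to count undefined tests is a judgment call; this is one option, not a required change.

```python
import math

def bonferroni_alpha(p_values, alpha=0.05):
    """Divide alpha by the number of tests that returned a defined p-value."""
    valid = [p for p in p_values if not math.isnan(p)]
    return alpha / len(valid) if valid else float("nan")

n_eval_sets, n_groups, tests_per_row = 2, 4, 2
m_tests = n_eval_sets * n_groups * tests_per_row
print("tests:", m_tests, "| alpha per test:", 0.05 / m_tests)

# Hypothetical p-values: 12 defined results and 4 undefined chi-square tests.
p_values = [0.001, 0.2, 0.04, 0.6] * 3 + [float("nan")] * 4
print("alpha using defined tests only:", bonferroni_alpha(p_values))
```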

Bootstrap confidence intervals for gaps.

In [None]:
def compute_gaps_from_df(df):
    ts = compute_term_stats(df)
    return compute_gaps(ts)

def bootstrap_gap_ci(df, n_boot=1000, ci=0.95, seed=42):
    rng = np.random.default_rng(seed)
    out_rows = []

    for es in df["eval_set"].unique():
        for grp in df["group"].unique():
            sub = df[(df["eval_set"]==es) & (df["group"]==grp)].copy()

            boot_mean = []
            boot_rate = []

            for _ in range(n_boot):
                samp = sub.sample(len(sub), replace=True, random_state=int(rng.integers(1e9)))
                gaps = compute_gaps_from_df(samp)
                # one row for this eval_set/group
                row = gaps[(gaps["eval_set"]==es) & (gaps["group"]==grp)].iloc[0]
                boot_mean.append(row["mean_score_gap"])
                boot_rate.append(row["toxic_rate_gap"])

            lo = (1-ci)/2
            hi = 1-lo

            out_rows.append({
                "eval_set": es,
                "group": grp,
                "mean_gap_med": float(np.median(boot_mean)),
                "mean_gap_lo": float(np.quantile(boot_mean, lo)),
                "mean_gap_hi": float(np.quantile(boot_mean, hi)),
                "rate_gap_med": float(np.median(boot_rate)),
                "rate_gap_lo": float(np.quantile(boot_rate, lo)),
                "rate_gap_hi": float(np.quantile(boot_rate, hi)),
            })

    return pd.DataFrame(out_rows)

baseline_boot = bootstrap_gap_ci(baseline_eval, n_boot=1000, ci=0.95, seed=SEED)
display(baseline_boot)
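
The bootstrap above resamples rows across all terms in a group, so a resample can under-represent or even omit a term. A possible alternative, sketched here under the assumption that each term's sentences can be treated as its own stratum, is to resample within each term so that every term keeps its sample size. `stratified_gap_ci` is a hypothetical name and the scores are made up.

```python
import numpy as np

def stratified_gap_ci(scores_by_term, n_boot=1000, ci=0.95, seed=42):
    """Percentile bootstrap CI for the max-min gap of per-term mean scores,
    resampling with replacement inside each term separately."""
    rng = np.random.default_rng(seed)
    arrays = [np.asarray(s, dtype=float) for s in scores_by_term.values()]
    gaps = np.empty(n_boot)
    for b in range(n_boot):
        means = [rng.choice(a, size=a.size, replace=True).mean() for a in arrays]
        gaps[b] = max(means) - min(means)
    lo = (1 - ci) / 2
    return (
        float(np.quantile(gaps, lo)),
        float(np.median(gaps)),
        float(np.quantile(gaps, 1 - lo)),
    )

# Hypothetical scores for two terms, five sentences each (not model output).
toy_scores = {
    "term_a": [0.10, 0.20, 0.15, 0.12, 0.18],
    "term_b": [0.40, 0.50, 0.45, 0.42, 0.48],
}
print(stratified_gap_ci(toy_scores, n_boot=500, seed=0))
```

Because a max-min gap cannot be negative, these intervals will rarely include zero even when there is no real disparity, so they are better suited to comparing baseline and mitigated gaps than to testing a gap against zero.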

[3] Mitigation Strategy

We use Counterfactual Data Augmentation (CDA), which generates counterfactual variants of training sentences by swapping identity terms while preserving labels.

This approach encourages the model to treat identity terms as invariant features when semantic meaning does not change.

To evaluate mitigation effectiveness, we compare:
- mean_score_gap before and after fine-tuning
- toxic_rate_gap before and after fine-tuning
- Kruskal–Wallis p-values
- Bootstrap confidence intervals

Limitations:
- Template-based evaluation may not reflect real-world language variation.
- Threshold-based toxic classification depends on an arbitrary cutoff.
- Small fine-tuning dataset may limit generalization.

6) Mitigation: Counterfactual Data Augmentation (CDA) + fine-tune

Create a tiny training set with:
- neutral templates (label 0)
- toxic templates (label 1) like “I hate {term}.” (be mindful: keep it minimal and purely for model training)

Then for each sentence, create counterfactual versions by swapping identity terms, keeping the label unchanged.

You’ll fine-tune using Trainer or a simple text-classification training script.

Key point for your writeup: you are explicitly encouraging counterfactual invariance.


In [None]:
# Build a small training set for fine-tuning.
# We include neutral + toxic templates, then apply CDA (swap identity terms within group).

NEUTRAL_TRAIN_TEMPLATES = templates_A + [
    "A {term} was kind to me.",
    "I respect a {term} I know."
]

TOXIC_TRAIN_TEMPLATES = [
    "I hate {term}.",
    "{term} are disgusting.",
    "I want to hurt {term}."
]

def build_train_df(identity_terms):
    rows = []
    for group, terms in identity_terms.items():
        for term in terms:
            for t in NEUTRAL_TRAIN_TEMPLATES:
                rows.append({"group": group, "term": term, "text": t.format(term=term), "label": 0})
            for t in TOXIC_TRAIN_TEMPLATES:
                rows.append({"group": group, "term": term, "text": t.format(term=term), "label": 1})
    return pd.DataFrame(rows)

def apply_cda(train_df, identity_terms):
    aug = []
    for _, r in train_df.iterrows():
        group = r["group"]
        term = r["term"]
        text = r["text"]
        label = int(r["label"])

        aug.append({"text": text, "label": label})

        for new_term in identity_terms[group]:
            if new_term == term:
                continue
            aug.append({"text": text.replace(term, new_term), "label": label})

    return pd.DataFrame(aug).drop_duplicates()

base_train = build_train_df(identity_terms)
train_cda = apply_cda(base_train, identity_terms)

print("Base train:", len(base_train), "| CDA train:", len(train_cda))
train_cda.sample(8, random_state=SEED)
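
The `str.replace` swap in `apply_cda` works for the current templates, but it matches substrings, so a sentence that contains both "man" and "woman" would be corrupted. A minimal sketch of a whole-word swap, assuming terms should only match when they are not embedded in a longer word (`swap_term` is a hypothetical helper):

```python
import re

def swap_term(text: str, old: str, new: str) -> str:
    """Replace `old` with `new` only where `old` stands as a whole word or phrase."""
    pattern = r"(?<!\w)" + re.escape(old) + r"(?!\w)"
    # A function replacement avoids any backslash handling in `new`.
    return re.sub(pattern, lambda m: new, text)

sentence = "A man spoke to a woman."
print(sentence.replace("man", "woman"))     # substring swap corrupts "woman"
print(swap_term(sentence, "man", "woman"))  # whole-word swap leaves it intact
```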

Tokenize and Trainer fine tune.

In [None]:
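train_ds = Dataset.from_pandas(train_cda.reset_index(drop=True))

# toxic-bert is a multi-label model with one logit per label, so the integer
# 0/1 labels in train_cda are expanded into a float vector with one slot per
# model label. Only the TOXIC_LABEL slot is set; the other slots stay at 0.0,
# which is a simplification for this small synthetic training set.
NUM_LABELS = base_model.config.num_labels
TOXIC_IDX = labels.index(TOXIC_LABEL)

def tok(batch):
    enc = tokenizer(batch["text"], truncation=True)
    enc["labels"] = [
        [float(y) if i == TOXIC_IDX else 0.0 for i in range(NUM_LABELS)]
        for y in batch["label"]
    ]
    return enc

# Drop the raw columns so the collator only sees the float "labels" vectors.
train_ds = train_ds.map(tok, batched=True, remove_columns=["text", "label"])
collator = DataCollatorWithPadding(tokenizer)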

mit_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

args = TrainingArguments(
    output_dir="./mitigated_cda_model",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=1,     # bump to 2 if you have time
    weight_decay=0.01,
    logging_steps=25,
    save_strategy="no",
    report_to="none",
    seed=SEED,
    fp16=torch.cuda.is_available()
)

trainer = Trainer(
    model=mit_model,
    args=args,
    train_dataset=train_ds,
    tokenizer=tokenizer,
    data_collator=collator
)

trainer.train()

7) Re-run evaluation on mitigated model

Repeat Sections 3–5 and compare:
- mean_score gap (max–min across terms)
- toxic_rate gap (max–min across terms)
- p-values (or bootstrap CI overlap)


In [None]:
# This section:
#  - runs inference for the CDA fine-tuned model
#  - recomputes term-level stats + gap metrics
#  - runs the same statistical tests as baseline (Kruskal + Chi-square)
#  - computes bootstrap CIs for gap metrics
#  - produces side-by-side comparison tables (baseline vs mitigated)
# -----------------------------

# A) Build mitigated pipeline on the CDA fine-tuned model
mit_pipe = pipeline(
    "text-classification",
    model=mit_model,
    tokenizer=tokenizer,
    return_all_scores=True,
    device=device
)

# B) Evaluate mitigated model on BOTH eval sets (A and B)
mit_eval = add_predictions(eval_all, mit_pipe, batch_size=64)

# C) Recompute term-level stats + gap metrics

mit_term_stats = compute_term_stats(mit_eval)
mit_gaps = compute_gaps(mit_term_stats)

print("=== Mitigated term-level stats (top 20 by mean_score) ===")
display(
    mit_term_stats
    .sort_values(["eval_set", "group", "mean_score"], ascending=[True, True, False])
    .head(20)
)

print("=== Mitigated gap metrics ===")
display(mit_gaps)


# D) Statistical tests (mitigated): Kruskal (scores) + Chi-square (rates)
test_rows = []
for es in sorted(mit_eval["eval_set"].unique()):
    for g in sorted(mit_eval["group"].unique()):
        k_stat, k_p = kruskal_test_scores(mit_eval, es, g)
        c_stat, c_p, dof = chi2_test_rates(mit_eval, es, g)

        test_rows.append({
            "eval_set": es,
            "group": g,
            "kruskal_stat": k_stat,
            "kruskal_p": k_p,
            "chi2_stat": c_stat,
            "chi2_p": c_p,
            "chi2_dof": dof
        })

mit_tests = pd.DataFrame(test_rows).sort_values(["eval_set", "group"])

print("=== Mitigated statistical tests ===")
display(mit_tests)

print("=== Baseline vs Mitigated test comparison ===")
test_compare = baseline_tests.merge(
    mit_tests,
    on=["eval_set", "group"],
    suffixes=("_baseline", "_mitigated")
)
display(test_compare)

# E) Bootstrap CIs for mitigated gap metrics
mit_boot = bootstrap_gap_ci(mit_eval, n_boot=1000, ci=0.95, seed=SEED)

print("=== Baseline bootstrap CI ===")
display(baseline_boot)

print("=== Mitigated bootstrap CI ===")
display(mit_boot)


# F) Side-by-side comparison tables (baseline vs mitigated)
print("=== Baseline gaps ===")
display(baseline_gaps)

print("=== Mitigated gaps ===")
display(mit_gaps)

gap_compare = baseline_gaps.merge(
    mit_gaps,
    on=["eval_set", "group"],
    suffixes=("_baseline", "_mitigated")
)

# Add delta columns (mitigated - baseline). Negative values indicate improvement (smaller gaps).
gap_compare["delta_mean_score_gap"] = gap_compare["mean_score_gap_mitigated"] - gap_compare["mean_score_gap_baseline"]
gap_compare["delta_toxic_rate_gap"] = gap_compare["toxic_rate_gap_mitigated"] - gap_compare["toxic_rate_gap_baseline"]

print("=== Gap comparison (baseline vs mitigated) + deltas ===")
display(gap_compare.sort_values(["eval_set", "group"]))

ci_compare = baseline_boot.merge(
    mit_boot,
    on=["eval_set", "group"],
    suffixes=("_baseline", "_mitigated")
)

print("=== Bootstrap CI comparison (baseline vs mitigated) ===")
display(ci_compare.sort_values(["eval_set", "group"]))
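
Comparing two separately computed intervals for overlap is a conservative check. Since both models score the same sentences, one possible complement is a paired bootstrap of the gap difference, where each resample uses the same sentence indices for both models. The sketch below is an assumption-laden illustration: `paired_delta_gap_ci` is a hypothetical name, the scores are made up, and it expects each term's scores to be aligned by sentence across the two models.

```python
import numpy as np

def paired_delta_gap_ci(base, mit, n_boot=2000, ci=0.95, seed=42):
    """Percentile CI for (mitigated gap - baseline gap).

    base / mit: dict mapping term -> scores on the same sentences, same order.
    """
    rng = np.random.default_rng(seed)
    terms = list(base)
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        b_means, m_means = [], []
        for t in terms:
            n = len(base[t])
            idx = rng.integers(0, n, size=n)  # same indices for both models
            b_means.append(np.asarray(base[t], dtype=float)[idx].mean())
            m_means.append(np.asarray(mit[t], dtype=float)[idx].mean())
        deltas[b] = (max(m_means) - min(m_means)) - (max(b_means) - min(b_means))
    lo = (1 - ci) / 2
    return float(np.quantile(deltas, lo)), float(np.quantile(deltas, 1 - lo))

# Hypothetical aligned scores (not model output).
base_toy = {"t1": [0.10, 0.12, 0.11, 0.09, 0.10], "t2": [0.50, 0.55, 0.45, 0.60, 0.50]}
mit_toy = {"t1": [0.05] * 5, "t2": [0.06] * 5}
print(paired_delta_gap_ci(base_toy, mit_toy, n_boot=500, seed=0))
```

An interval that lies entirely below zero would be consistent with the mitigated gap being smaller than the baseline gap on these sentences.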

Visualizations:

In [None]:
# Figure 1: Mean Toxicity by Term (Baseline vs Mitigated)

def plot_term_means_compare(baseline_term_stats, mit_term_stats):
    for es in sorted(baseline_term_stats["eval_set"].unique()):
        for grp in sorted(baseline_term_stats["group"].unique()):
            b = baseline_term_stats[(baseline_term_stats["eval_set"]==es) & (baseline_term_stats["group"]==grp)].copy()
            m = mit_term_stats[(mit_term_stats["eval_set"]==es) & (mit_term_stats["group"]==grp)].copy()

            merged = b.merge(m, on=["eval_set","group","term"], suffixes=("_baseline","_mitigated"))
            merged = merged.sort_values("mean_score_baseline", ascending=False)

            x = np.arange(len(merged["term"]))
            width = 0.4

            plt.figure(figsize=(9,4))
            plt.bar(x - width/2, merged["mean_score_baseline"], width, label="Baseline")
            plt.bar(x + width/2, merged["mean_score_mitigated"], width, label="Mitigated")

            plt.xticks(x, merged["term"], rotation=45, ha="right")
            plt.ylabel("Mean Toxic Score")
            plt.title(f"Mean Toxicity by Term | Eval {es} | {grp}")
            plt.legend()
            plt.tight_layout()
            plt.show()

print("=== Term Mean Comparison: Baseline vs Mitigated ===")
plot_term_means_compare(baseline_term_stats, mit_term_stats)

In [None]:
# Figure 2: Gap Reduction (Baseline vs Mitigated)

def plot_gap_comparison(gap_compare):
    for es in sorted(gap_compare["eval_set"].unique()):
        sub = gap_compare[gap_compare["eval_set"] == es].sort_values("group")

        x = np.arange(len(sub["group"]))
        width = 0.35

        plt.figure(figsize=(8,4))
        plt.bar(x - width/2, sub["mean_score_gap_baseline"], width, label="Baseline")
        plt.bar(x + width/2, sub["mean_score_gap_mitigated"], width, label="Mitigated")

        plt.xticks(x, sub["group"])
        plt.ylabel("Mean Score Gap")
        plt.title(f"Gap Comparison | Eval {es}")
        plt.legend()
        plt.tight_layout()
        plt.show()

print("=== Gap Comparison Plots ===")
plot_gap_comparison(gap_compare)

In [None]:
# Figure 3: Toxic Rate Gap Comparison (Baseline vs Mitigated)

def plot_rate_gap_comparison(gap_compare):
    for es in sorted(gap_compare["eval_set"].unique()):
        sub = gap_compare[gap_compare["eval_set"] == es].sort_values("group")

        x = np.arange(len(sub["group"]))
        width = 0.35

        plt.figure(figsize=(8,4))
        plt.bar(x - width/2, sub["toxic_rate_gap_baseline"], width, label="Baseline")
        plt.bar(x + width/2, sub["toxic_rate_gap_mitigated"], width, label="Mitigated")

        plt.xticks(x, sub["group"])
        plt.ylabel("Toxic Rate Gap (THRESH=0.5)")
        plt.title(f"Toxic Rate Gap Comparison | Eval {es}")
        plt.legend()
        plt.tight_layout()
        plt.show()

print("=== Toxic Rate Gap Comparison Plots ===")
plot_rate_gap_comparison(gap_compare)

In [None]:
# Figure 4: Bootstrap CI Error Bars

def plot_bootstrap_ci(ci_compare):
    for es in sorted(ci_compare["eval_set"].unique()):
        sub = ci_compare[ci_compare["eval_set"] == es].sort_values("group")

        x = np.arange(len(sub["group"]))

        plt.figure(figsize=(8,4))

        # Baseline error bars
        baseline_err_low = sub["mean_gap_med_baseline"] - sub["mean_gap_lo_baseline"]
        baseline_err_high = sub["mean_gap_hi_baseline"] - sub["mean_gap_med_baseline"]

        plt.errorbar(
            x - 0.1,
            sub["mean_gap_med_baseline"],
            yerr=[baseline_err_low, baseline_err_high],
            fmt='o',
            label="Baseline"
        )

        # Mitigated error bars
        mit_err_low = sub["mean_gap_med_mitigated"] - sub["mean_gap_lo_mitigated"]
        mit_err_high = sub["mean_gap_hi_mitigated"] - sub["mean_gap_med_mitigated"]

        plt.errorbar(
            x + 0.1,
            sub["mean_gap_med_mitigated"],
            yerr=[mit_err_low, mit_err_high],
            fmt='o',
            label="Mitigated"
        )

        plt.xticks(x, sub["group"])
        plt.ylabel("Mean Score Gap (with 95% CI)")
        plt.title(f"Bootstrap CI Comparison | Eval {es}")
        plt.legend()
        plt.tight_layout()
        plt.show()

print("=== Bootstrap CI Plots ===")
plot_bootstrap_ci(ci_compare)


## 8) Results + Conclusion

### Restating Research Questions
**RQ1:** Does the baseline toxicity model assign significantly different toxicity scores to neutral sentences referencing different identity terms within the same category?

**RQ2:** Does Counterfactual Data Augmentation (CDA) fine-tuning reduce identity-based toxicity score disparities?

---

### RQ1 Findings (Baseline model)
We evaluated the baseline `unitary/toxic-bert` model on two neutral template sets (Eval A and Eval B) containing identity terms across four categories: religion, gender, sexual orientation, and race/ethnicity.

**Evidence of identity-term bias:**
- We observed differences in **mean toxicity score** across identity terms within the same group/category (see Figure 1 and `baseline_term_stats`).
- The **gap metrics** (max–min within each group) quantify this disparity as:
  - `mean_score_gap` (continuous score disparity)
  - `toxic_rate_gap` (binary decision disparity at THRESH = 0.5)

**Statistical evidence:**
- We applied **Kruskal–Wallis** tests to compare toxicity score distributions across terms within each group.
- We applied **Chi-square** tests to compare toxic prediction rates across terms within each group.
- Significant p-values indicate the model treats identity terms differently even when sentence meaning remains neutral.

**Conclusion for RQ1:**
If one or more (eval_set, group) combinations show statistically significant differences and non-trivial gap values, this supports the hypothesis that identity-term bias exists in the baseline model.

---

### RQ2 Findings (Mitigated model: CDA fine-tuning)
We fine-tuned the baseline model using **Counterfactual Data Augmentation (CDA)** by swapping identity terms within each group while keeping labels fixed, encouraging counterfactual invariance.

**Evidence CDA reduces bias (when improvement occurs):**
- Compare baseline vs mitigated:
  - `mean_score_gap_baseline` vs `mean_score_gap_mitigated` (Figure 2)
  - `toxic_rate_gap_baseline` vs `toxic_rate_gap_mitigated` (Figure 3)
- Negative deltas:
  - `delta_mean_score_gap < 0` indicates reduced score disparity.
  - `delta_toxic_rate_gap < 0` indicates reduced decision disparity.

**Bootstrap evidence:**
- We computed bootstrap 95% confidence intervals for gap metrics.
- If the mitigated model’s median gaps are lower and CI ranges shift downward, this supports that CDA is meaningfully reducing disparities (Figure 4).

**Conclusion for RQ2:**
If the mitigated model shows smaller gaps across multiple groups and/or fewer statistically significant differences, this supports Hypothesis 2 that CDA reduces identity-based disparities.

---

### Overall conclusion (Hypotheses)
- **Hypothesis 1 (baseline bias exists):** Supported if baseline tests show significant p-values and/or meaningful gap values within one or more identity groups.
- **Hypothesis 2 (CDA reduces gaps):** Supported if mitigated gaps decrease (negative deltas) and bootstrap CI medians shift downward relative to baseline.

---

### Limitations
- **Template-based evaluation** may not represent real-world language diversity or context.
- **Threshold dependence:** toxic_rate_gap depends on THRESH = 0.5; different thresholds may change toxic_rate_gap.
- **Small fine-tuning dataset:** CDA training is synthetic and limited, which may constrain generalization.
- **Model scope:** results are specific to `unitary/toxic-bert` and do not necessarily generalize to other toxicity classifiers.

---

### Future Work / Additional research questions
- Evaluate bias on **more naturalistic sentences** (non-template) and include adversarial phrasing that remains non-toxic.
- Perform a **threshold sweep** (e.g., 0.3 to 0.7) to test robustness of toxic_rate_gap conclusions.
- Investigate other mitigation strategies (e.g., reweighting loss, regularization, post-processing calibration).
- Test whether mitigation affects **overall toxicity detection accuracy** (trade-off between fairness and performance).
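
The threshold sweep mentioned above could look like the following sketch, which recomputes the toxic-rate gap at several cutoffs. The scores are placeholders, not model output, and the function names are hypothetical.

```python
def toxic_rate_gap(scores_by_term, thresh):
    """Max-min difference in the share of scores at or above `thresh`."""
    rates = [
        sum(s >= thresh for s in scores) / len(scores)
        for scores in scores_by_term.values()
    ]
    return max(rates) - min(rates)

def sweep(scores_by_term, thresholds):
    """Map each threshold to the resulting toxic-rate gap."""
    return {t: toxic_rate_gap(scores_by_term, t) for t in thresholds}

# Hypothetical scores for two terms (not model output).
sweep_scores = {
    "term_a": [0.10, 0.20, 0.35, 0.45],
    "term_b": [0.10, 0.55, 0.65, 0.75],
}
result = sweep(sweep_scores, [0.3, 0.4, 0.5, 0.6, 0.7])
for thresh, gap in result.items():
    print(f"THRESH={thresh:.1f} -> toxic_rate_gap={gap:.2f}")
```

In this toy case the gap ranges from 0.25 to 0.75 depending on the cutoff, which is the kind of sensitivity the sweep is meant to surface.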


9) Two venues (rubric: 1 point)

**ACM FAccT (Fairness, Accountability, and Transparency):**
FAccT is directly aligned with measuring disparate treatment across protected or sensitive groups in deployed ML systems. Identity-term bias in toxicity moderation is a known fairness risk because it can lead to disproportionate censorship or moderation of benign content referencing certain groups. Related work on NLP fairness, content moderation bias, and counterfactual evaluation frequently appears in this community, so this investigation fits well as a small-scale empirical study.

**AIES (AAAI/ACM Conference on AI, Ethics, and Society):**
AIES focuses on the societal implications of AI systems and includes work on bias measurement and mitigation in real-world deployments. Toxicity classifiers are widely used in moderation pipelines, making identity-term disparities a practical ethics concern (equity in speech and participation online). This work could be extended into a publishable paper by adding broader datasets (real comments), more identity terms, additional mitigation baselines, and reporting any fairness–utility trade-offs.



10) Complying with class LLM policy


### LLM policy compliance: citation + prompt summary

LLM usage disclosure (code only):

Used ChatGPT for code scaffolding for fairness metric computation and for describing and handling error messages. No LLM was used to generate or refine the written narrative content; only code scaffolding/debugging assistance was used.

Prompt summary: “Generate Python code to evaluate identity-term bias in a toxicity classifier, compute group/term mean scores and gaps, and run Kruskal–Wallis and chi-square tests. Then add bootstrap confidence intervals and baseline vs mitigated comparison tables.”

Prompt summary: “Suggest plots to visualize baseline vs mitigated disparities and implement matplotlib code.”


# What's left to do:

### Fill in section 8

### Conferences/Journals section needs cleanup:
- why that community cares
- whether similar works exist there
- could this be a paper? why/why not?

Add 2–3 sentences total per venue.