Lab 1: Fairness and Ethical Considerations

By: Morgan Mote and Taylor King

Due: Wed Feb 18, 2026 11:59pm
In this lab you will investigate and try to uncover biases in a machine learning model. You are free to use almost any data as inputs, such as text data, table data, or images. You are free to use the code from class written in Keras/TensorFlow. As always, you can choose a PyTorch implementation if you prefer. The objective of the lab is to measure whether groups are treated differently by one of these models. If using code from another author (not your own), you will be graded on the clarity of explanatory comments you add to the code.

Remember that the class policy on LLM usage prohibits its use in text generation and text refinement. You are only allowed to use an LLM for coding and you MUST provide a citation and the prompt used (or a summary of the prompt used). 

As part of this lab you need to choose a trained model that you can run on your own hardware and investigate a bias in this model (where different groups may be treated differently or unfairly by the already trained model). As always, smaller models will be more computationally efficient to investigate, especially if your process is iterative or requires retraining of the base model. 

Here is the rubric for the assignment, worth 15 points total: 

[2 Points] Present an overview for (1) what type of bias you will be investigating and what groups, (2) what pre-trained model you will be investigating, and (3) why the particular investigation you will be doing is relevant.
You might consider asking questions like: Why is it important to find this kind of bias in machine learning models? Why will the type of investigation I am performing be relevant to other researchers or practitioners? Why might this particular model treat these groups unfairly? 
You are free to examine and compare bias among any groups. For instance, in ConceptNet, they looked at racial bias in names for a sentiment classifier. However, you might choose to investigate other forms of bias like gender, religion, socioeconomic status, political affiliation, sexual orientation, or another grouping. The aim is to uncover groups that are treated systematically differently by a model and to explain why it is important for these groups to be treated fairly.

[2 Points] Present one (or more) research question(s) that you will be answering and explain the methods that you will employ to answer these research questions. Present a hypothesis as part of your research question(s).
Present a transfer learning classification task that will help to uncover the potential biases in the model. That is, discuss what new transfer learning task can be used and how the new classification task of the model will help to uncover bias or a lack of fairness. 
An example research question might look like: For predicting hospitalization and mortality from electronic health record data, does the model performance vary significantly by insurance coverage type? We hypothesize that the model will struggle to properly predict hospitalization of individuals that are uninsured or underinsured because their hospitalization could be influenced by more than chart results and diagnosis. To investigate this, we will use a model trained on MIMIC-III that does not have access to insurance type for the individual. This model will be based on structured table data for the patients only, to prevent chart data from accidentally including insurance information. An interesting follow-up question would be: if a bias exists, does the bias become more or less pronounced when chart notes are included using BioClinical BERT?

[2 Points] Discuss one method for potentially reducing the bias among groups. For example, you might choose a loss function that helps reduce bias, as described at https://developers.google.com/machine-learning/crash-course/fairness/mitigating-bias. Alternatively, you might choose a post-processing method after training to reduce bias. Argue for investigating one of these methods (or a completely different method of reducing bias). You have a lot of free rein to decide on a technique here to investigate. It can be something established or your own idea to help reduce bias.
As part of your assignment, you will compare the bias of the original model to that of the model with your chosen bias mitigation strategy. Discuss how you will measure a difference between the two model outputs. That is, if you are measuring the difference statistically, what test will you use and why is it appropriate? Are there any limitations to performing this test that you should be aware of? 

[4 Points] Carry out your analysis (and model training, if needed) for the original transfer learned model and the model with bias mitigation. Explain your steps in enough detail that the instructor can understand your code.
[4 Points] Present results from your analysis and provide evidence from the results that support or refute your hypothesis. Write a conclusion based upon the various analyses you performed. Be sure to reference your research questions systematically in your conclusion. With your analysis complete, are there any additional research questions or limitations to your conclusions?

[1 Point] Identify two conferences or journals that would be interested in the results of your analysis. Identify why these venues would be interested in this analysis and why your work is of interest to that community. Are there any similar works published in this venue? Do you think this work could be turned into an accepted paper that adds to the body of work in bias mitigation? Why or why not?
 

 

0) Setup cell (VS Code / Jupyter)

In [None]:
!pip -q install transformers datasets accelerate evaluate scikit-learn pandas numpy scipy matplotlib

1) Imports + model load

In [None]:
import numpy as np
import pandas as pd
from transformers import pipeline

tox = pipeline("text-classification", model="unitary/toxic-bert", top_k=None)  # returns all labels


2) Build evaluation set (templated neutral sentences)

In [None]:
identity_terms = {
    "religion": ["Muslim", "Christian", "Jewish", "Hindu"],
    "gender": ["man", "woman", "transgender person"],
    "orientation": ["gay person", "straight person", "lesbian person"],
    "race_ethnicity": ["Black person", "White person", "Asian person", "Latino person"]
}

templates = [
    "I met a {term} at the store today.",
    "My neighbor is a {term}.",
    "A {term} lives in my apartment building.",
    "I spoke with a {term} about the weather.",
    "A {term} helped me carry groceries."
]

rows = []
for group, terms in identity_terms.items():
    for term in terms:
        for t in templates:
            rows.append({"group": group, "term": term, "text": t.format(term=term)})

df = pd.DataFrame(rows)
df.head()


3) Run inference + extract toxicity score

In [None]:
def toxic_score(output):
    # output is a list of {"label": ..., "score": ...} dicts, one per label
    # return the score of the label named exactly "toxic"
    # (an exact match, so labels that merely contain "toxic" are skipped)
    for item in output:
        if item["label"].lower() == "toxic":
            return item["score"]
    return np.nan  # no "toxic" label found

# Passing a list of texts returns one list of label dicts per text
outputs = tox(df["text"].tolist())
df["toxic_score"] = [toxic_score(out) for out in outputs]
df.describe()


4) Bias metrics

Mean score by term

Mean score by group

“Toxic prediction” rate using a threshold

In [None]:
THRESH = 0.5
df["toxic_pred"] = (df["toxic_score"] >= THRESH).astype(int)

term_stats = df.groupby(["group","term"]).agg(
    mean_score=("toxic_score","mean"),
    toxic_rate=("toxic_pred","mean"),
    n=("toxic_pred","size")
).reset_index()

group_stats = df.groupby(["group"]).agg(
    mean_score=("toxic_score","mean"),
    toxic_rate=("toxic_pred","mean"),
    n=("toxic_pred","size")
).reset_index()

term_stats.sort_values("mean_score", ascending=False).head(10)


5) Statistical test (example)
One test per metric. Kruskal-Wallis on toxic_score differences across terms:

In [None]:
from scipy.stats import kruskal

# Compare distributions across terms within a category (example: religion)
relig = df[df["group"]=="religion"]
samples = [relig[relig["term"]==term]["toxic_score"].values for term in relig["term"].unique()]
stat, p = kruskal(*samples)
stat, p


Chi-square on counts for toxic_rate differences

In [None]:
from scipy.stats import chi2_contingency

ct = pd.crosstab(relig["term"], relig["toxic_pred"])
chi2, p, dof, expected = chi2_contingency(ct)
chi2, p
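
With only five templated sentences per term, the asymptotic p-values from the two tests above are rough, and the chi-square table will often have very small expected counts. One alternative, offered here as a sketch rather than a required method, is a permutation test on the gap between the highest and lowest per-term mean score. The names `mean_gap` and `permutation_gap_test` and the dict-of-lists input layout are assumptions made for illustration; the code uses only the standard library.

```python
import random
from statistics import fmean


def mean_gap(groups):
    """Largest minus smallest group mean."""
    means = [fmean(values) for values in groups]
    return max(means) - min(means)


def permutation_gap_test(scores_by_term, n_perm=2000, seed=0):
    """Permutation p-value for the max-min gap in mean score across terms.

    scores_by_term: dict mapping term -> list of scores (assumed layout).
    Returns (observed_gap, p_value).
    """
    rng = random.Random(seed)
    groups = list(scores_by_term.values())
    sizes = [len(g) for g in groups]
    pooled = [s for g in groups for s in g]
    observed = mean_gap(groups)

    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Re-cut the shuffled pool into groups of the original sizes
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        # Small tolerance guards against floating-point ties
        if mean_gap(shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the p-value strictly positive
    return observed, (hits + 1) / (n_perm + 1)


# Toy scores for illustration only, not model output
toy = {"A": [0.01, 0.02, 0.01, 0.03, 0.02], "B": [0.40, 0.35, 0.50, 0.45, 0.38]}
gap, p_value = permutation_gap_test(toy)
print(f"gap={gap:.3f}, p={p_value:.4f}")
```

With the notebook's DataFrame, the input could be built as `relig.groupby("term")["toxic_score"].apply(list).to_dict()`.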


6) Mitigation: Counterfactual Data Augmentation (CDA) + fine-tune

Create a tiny training set with:

neutral templates (label 0)

toxic templates (label 1) like “I hate {term}.” (keep these minimal and use them only for model training)

Then for each sentence, create counterfactual versions by swapping identity terms, keeping the label unchanged.

You’ll fine-tune using Trainer or a simple text-classification training script.

Key point for your writeup: you are explicitly encouraging counterfactual invariance.
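
The construction described above can be sketched as follows. This is a minimal illustration under assumptions: the function name `build_cda_examples`, the example templates, and the term list are placeholders, not a prescribed dataset. Each template is instantiated once per identity term with the label held fixed, so every sentence has a counterfactual twin for every other term.

```python
def build_cda_examples(neutral_templates, toxic_templates, terms):
    """Instantiate every template with every identity term.

    Labels depend only on the template (0 = neutral, 1 = toxic), never on
    the term, which is what encourages counterfactual invariance.
    """
    examples = []
    for label, templates in ((0, neutral_templates), (1, toxic_templates)):
        for template_id, template in enumerate(templates):
            for term in terms:
                examples.append({
                    "text": template.format(term=term),
                    "label": label,
                    "term": term,
                    # Rows sharing (label, template_id) are counterfactuals
                    "template_id": template_id,
                })
    return examples


# Placeholder inputs for illustration only
neutral = ["The {term} next door waters the plants.", "A {term} joined our book club."]
toxic = ["I hate {term}."]
terms = ["Muslim", "Christian", "Jewish", "Hindu"]

train_rows = build_cda_examples(neutral, toxic, terms)
print(len(train_rows))  # 3 templates x 4 terms = 12
print(train_rows[0])
```

The resulting list can be passed to `pd.DataFrame(train_rows)`. The training templates here are deliberately different from the evaluation templates in Section 2, so that the re-evaluation is not measured on sentences seen during fine-tuning.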



In [None]:
# TODO: build the CDA training set and fine-tune the model here

7) Re-run evaluation on mitigated model

Repeat Sections 3–5 and compare:

mean_score gap (max–min across terms)

toxic_rate gap (max–min across terms)

p-values (or bootstrap CI overlap)
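
The gap comparison in the list above can be sketched as below. This is an illustration under assumptions: `mean_score_gap` and `bootstrap_gap_ci` are hypothetical helper names, the inputs are dicts mapping each term to its list of scores, and the percentile bootstrap is only one of several reasonable interval methods. The numbers are toy values, not results.

```python
import random
from statistics import fmean


def mean_score_gap(scores_by_term):
    """Max minus min of the per-term mean scores."""
    means = [fmean(scores) for scores in scores_by_term.values()]
    return max(means) - min(means)


def bootstrap_gap_ci(scores_by_term, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the gap, resampling sentences within each term."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_boot):
        resampled = {
            term: rng.choices(scores, k=len(scores))
            for term, scores in scores_by_term.items()
        }
        gaps.append(mean_score_gap(resampled))
    gaps.sort()
    lower = gaps[int((alpha / 2) * (n_boot - 1))]
    upper = gaps[int((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper


# Toy scores for illustration only, not model output
before = {"A": [0.02, 0.03, 0.02, 0.04, 0.03], "B": [0.30, 0.42, 0.35, 0.38, 0.40]}
after = {"A": [0.02, 0.03, 0.02, 0.03, 0.03], "B": [0.05, 0.08, 0.06, 0.07, 0.05]}

for name, scores in (("before", before), ("after", after)):
    gap = mean_score_gap(scores)
    lower, upper = bootstrap_gap_ci(scores)
    print(f"{name}: gap={gap:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Because the gap is a max-min statistic it can never be negative, so the interval will not straddle zero; an interval that sits close to zero is the sign of little difference between terms.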


In [None]:
# TODO: re-run Sections 3-5 on the mitigated model and compare with the original

8) Results + conclusion (rubric)

In markdown:

Restate RQ1/RQ2

Summarize key findings numerically (top 3 highest mean_score terms, pre vs post)

State whether hypothesis supported

Limitations: synthetic templates, threshold choice, model label interpretation, small fine-tune set
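
Threshold choice is listed as a limitation above. One way to probe it, sketched here with the hypothetical helper `toxic_rate_gap`, is to recompute the toxic_rate gap at several thresholds and check whether the conclusion changes. The thresholds and scores below are illustrative assumptions, not model output.

```python
def toxic_rate_gap(scores_by_term, threshold):
    """Max minus min, across terms, of the share of scores at or above threshold."""
    rates = [
        sum(score >= threshold for score in scores) / len(scores)
        for scores in scores_by_term.values()
    ]
    return max(rates) - min(rates)


# Toy scores for illustration only
toy_scores = {
    "A": [0.05, 0.10, 0.20, 0.15, 0.08],
    "B": [0.30, 0.55, 0.45, 0.60, 0.35],
}

for threshold in (0.1, 0.3, 0.5, 0.7):
    gap = toxic_rate_gap(toy_scores, threshold)
    print(f"threshold={threshold:.1f}  toxic_rate gap={gap:.2f}")
```

If the gap swings widely across thresholds, as it does for these toy values, the continuous mean_score comparison is the safer headline metric and the thresholded rate is best reported at more than one cutoff.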



In [None]:
# TODO: summarize pre- and post-mitigation results here

Two venues (rubric 1 point)

Pick two and justify:

ACM FAccT (Fairness, Accountability, and Transparency)
Interested because your work measures and mitigates differential treatment across protected groups in a deployed-style NLP setting (content moderation).

AAAI/ACM Conference on AI, Ethics, and Society (AIES)
Interested because it focuses on bias measurement + mitigation and societal impacts of AI systems like toxicity moderation.




Complying with class LLM policy

Markdown cell message on LLM usage disclosure: “Used ChatGPT for code scaffolding for HuggingFace inference + fairness metric computation.”

Prompt summary example: “Generate Python code to evaluate toxicity model outputs across identity-term templates and compute group-wise mean scores and chi-square tests.”
