# FIT5230 Week 10: Generative AI Bias & Safety

## 1. Understanding Bias in Generative AI

Bias in generative AI systems is not a random flaw; it's a reflection of the data and design choices behind them. Models like ChatGPT are trained on vast amounts of public online text and, as a result, inherit the societal biases, stereotypes, and prejudices present in that data.

---

### Safety Concerns Arising from Bias

* **Reinforcing Stereotypes**: When trained on biased datasets, generative AI can produce content that reinforces harmful stereotypes. For example, a model trained on text in which nurses are predominantly female and engineers predominantly male might generate job descriptions that unintentionally favor one gender, undermining diversity efforts.
* **Discriminatory Outcomes**: Biased AI can lead to decisions that unfairly discriminate against specific groups. An AI-powered loan approval system, for instance, might learn from biased historical data to deny loans to individuals from certain racial backgrounds, thus perpetuating financial inequality.

---

### Types of Biases in AI

1.  **Cognitive and Societal Bias**: This bias originates from human psychology and culture. Our personal experiences and preferences influence how we process information. This can manifest as favoring data from one's own country over a global sample or perpetuating cultural prejudices and stereotypes.
2.  **Training Data Bias**: This occurs when the data used to train a model is not representative of the real world. A common example is a facial recognition algorithm trained on a dataset that over-represents white faces, which can cause it to have a higher error rate when identifying people of color.
3.  **Algorithmic Bias**: This happens when the algorithm itself produces unfair outcomes, either by amplifying biases present in the training data or because developers unintentionally build their own conscious or unconscious biases into the model's logic. For example, weighting factors like income or vocabulary, which can act as proxies for protected attributes, could inadvertently cause the algorithm to discriminate based on race or gender.
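
The facial recognition example under training data bias can be made concrete by comparing error rates across groups. The following is a minimal sketch, not a standard benchmark: the function name `error_rate_by_group`, the group labels, and the records are all hypothetical and made up for illustration.

```python
from collections import defaultdict


def error_rate_by_group(records):
    """Return the fraction of wrong predictions for each group.

    `records` is an iterable of (group, predicted, actual) tuples.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        if predicted != actual:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Hypothetical evaluation results (invented data, for illustration only).
records = [
    ("group_a", "match", "match"),
    ("group_a", "match", "match"),
    ("group_a", "no_match", "no_match"),
    ("group_a", "match", "no_match"),
    ("group_b", "match", "no_match"),
    ("group_b", "no_match", "match"),
    ("group_b", "match", "match"),
    ("group_b", "no_match", "no_match"),
]

rates = error_rate_by_group(records)
# The gap between the worst and best group is one simple disparity signal.
gap = max(rates.values()) - min(rates.values())
print(rates, gap)
```

A large gap does not by itself identify the cause, but it flags where an unrepresentative training set may be hurting one group.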

---

## 2. Detecting and Measuring Bias

Identifying bias is the first step toward mitigating it. The methods differ for text and for visual data.

### Detecting Bias in Text

A key method is **toxicity analysis**, especially in language models assigned a specific "persona".

* **Persona-Assigned Toxicity**: When a model like ChatGPT is instructed to speak like a certain person (e.g., Muhammad Ali), its toxicity can increase significantly compared to its default state. This creates a risk that malicious users can exploit these personas to generate harmful content.
* **Analysis Across Groups**: Researchers test for bias by prompting persona-assigned models to generate text about various professions, genders, races, and orientations. Studies have shown that toxicity levels can be significantly higher when discussing certain races, though it's often difficult to pinpoint the exact cause in the model's development.
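
The group-wise analysis described above amounts to averaging toxicity scores per target category and comparing the averages. The sketch below assumes the scores have already been produced by some toxicity classifier; the helper name `mean_toxicity_by_target`, the category names, and the numbers are hypothetical, not results from any study.

```python
from collections import defaultdict
from statistics import mean


def mean_toxicity_by_target(samples):
    """Average toxicity score per target category.

    `samples` is an iterable of (target_category, toxicity_score) pairs,
    where each score is assumed to lie in [0, 1].
    """
    grouped = defaultdict(list)
    for target, score in samples:
        grouped[target].append(score)
    return {target: mean(scores) for target, scores in grouped.items()}


# Hypothetical scores for generations from one persona (invented data).
samples = [
    ("profession", 0.125),
    ("profession", 0.375),
    ("race", 0.5),
    ("race", 0.75),
]

averages = mean_toxicity_by_target(samples)
# The category with the highest average is where to look more closely.
most_toxic_target = max(averages, key=averages.get)
print(averages, most_toxic_target)
```

Repeating this per persona gives a table of persona by target category, which makes it easier to see which combinations raise toxicity.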

### Measuring Text Bias

* **Perspective API**: An ML-based tool that evaluates text and generates a **toxicity score** between 0 (not toxic) and 1 (highly toxic). This score helps developers and moderators filter abusive language and can be used to create non-toxic datasets for model training.
* **Probability of Responding (POR)**: This metric measures how often a model like ChatGPT responds to a potentially toxic prompt versus declining with a safety message (e.g., "I am sorry, but as an AI language model..."). A high POR for toxic queries indicates the model is more inclined to generate harmful content. POR is calculated as the fraction of times the model provides a substantive response instead of a refusal.
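
As a rough illustration of the POR definition above, the sketch below counts a response as a refusal when it starts with a known safety phrase. The marker list and the function names `is_refusal` and `probability_of_responding` are assumptions made for this example; a real evaluation would likely need a more robust refusal detector.

```python
# Assumed refusal phrases (a heuristic for illustration, not an official list).
REFUSAL_MARKERS = (
    "i am sorry, but as an ai language model",
    "i'm sorry, but as an ai language model",
)


def is_refusal(response):
    """Heuristically decide whether a response is a safety refusal."""
    text = response.strip().lower()
    return any(text.startswith(marker) for marker in REFUSAL_MARKERS)


def probability_of_responding(responses):
    """Fraction of responses that are substantive rather than refusals."""
    if not responses:
        raise ValueError("need at least one response to compute POR")
    answered = sum(1 for response in responses if not is_refusal(response))
    return answered / len(responses)


# Hypothetical model outputs for the same toxic prompt (invented data).
responses = [
    "I am sorry, but as an AI language model I cannot help with that.",
    "Here is what I think about them...",
    "Sure, let me tell you...",
    "Well, in my opinion...",
]

por = probability_of_responding(responses)
print(por)
```

Comparing POR for the default model against POR under an assigned persona shows how much the persona weakens the model's tendency to refuse.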

### Detecting Bias in Visuals

Biases and stereotypes are often glaringly obvious in AI-generated images.

* **Cultural Misrepresentation**: A significant bias is the projection of a dominant (often Western) cultural context onto other cultures. For example, AI-generated images of various cultural groups often depict them smiling broadly, which reflects an American cultural norm where smiling emphasizes social calm.
* **Context of a Smile**: However, the meaning of a smile is deeply cultural. Research shows that in cultures with high "uncertainty avoidance," smiling can be perceived as a sign of unintelligence, as confidence in an uncertain world is seen as foolish. In Russia, public officials are expected to maintain solemn expressions to reflect the seriousness of their work, in contrast to American expectations. This leads to AI generating culturally inaccurate images, such as broadly smiling Native American chiefs, which contrasts sharply with historical photographs.


---

## 3. Mitigating Bias in Generative AI

Several strategies can be employed to reduce bias in AI systems.

* **Diverse Data Collection**: Ensure training data is representative of diverse populations and carefully evaluate it for hidden biases.
* **Algorithmic Auditing**: Conduct impartial evaluations of an algorithm's output to identify and correct biases that could harm specific groups. Tools like **AI Fairness 360** can assist with this.
* **Algorithmic Fairness Techniques**: Build methods like **adversarial debiasing** and **fairness constraints** directly into the model's training process.
* **Transparency and Explainability**: Make AI models less of a "black box" to better understand how they make decisions, which makes it easier to spot and correct bias.
* **Human-in-the-Loop Systems**: Integrate human oversight to review and correct AI-driven decisions, especially in high-stakes applications.
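
To give a sense of what an algorithmic audit measures, the sketch below computes two common group fairness metrics on binary decisions: statistical parity difference and disparate impact. It is written in plain Python and does not use the AI Fairness 360 API; the function names, the group split, and the loan decisions are hypothetical.

```python
def selection_rate(decisions):
    """Fraction of favorable outcomes (1 = approved, 0 = denied)."""
    if not decisions:
        raise ValueError("need at least one decision")
    return sum(decisions) / len(decisions)


def statistical_parity_difference(unprivileged, privileged):
    """Difference in selection rates; 0 means both groups are treated alike."""
    return selection_rate(unprivileged) - selection_rate(privileged)


def disparate_impact(unprivileged, privileged):
    """Ratio of selection rates; 1 means both groups are treated alike."""
    return selection_rate(unprivileged) / selection_rate(privileged)


# Hypothetical loan decisions per group (invented data, for illustration).
privileged_decisions = [1, 1, 1, 0]
unprivileged_decisions = [1, 0, 0, 0]

spd = statistical_parity_difference(unprivileged_decisions, privileged_decisions)
di = disparate_impact(unprivileged_decisions, privileged_decisions)
print(spd, di)
```

A negative difference, or a ratio well below 1, suggests the unprivileged group receives favorable outcomes less often and that the system deserves closer review.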

### In-Depth: Adversarial Debiasing

Adversarial debiasing is a mitigation technique that uses an adversarial training process to reduce bias during model development.

* **How It Works**: A "main network" is trained on its primary task (e.g., face recognition). Simultaneously, an "adversary network" is trained to predict a protected attribute (e.g., gender) from the main network's output. The main network is then updated to not only succeed at its task but also to fool the adversary, making its output independent of the protected attribute.
* **AGENDA (Example)**: In face recognition, this method is used to create gender-neutral face descriptors. The total loss function guides the training:
    $$L_{br}(\phi_C , \phi_M, \phi_B) = L_{class}(\phi_C , \phi_M) + \lambda L_{deb}(\phi_M, \phi_B)$$
    * **Conceptual Meaning**: The model's training is driven by two competing goals. It tries to minimize the classification loss `L_class` (to be accurate at its main task) while also minimizing the debiasing loss `L_deb`. The weight `λ` controls how strongly debiasing is traded off against accuracy. The debiasing loss penalizes the model whenever the adversary successfully predicts gender. The ultimate goal is to train the main model until the adversary's prediction is no better than a random guess (a 0.5 probability).
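
The combined objective can be sketched numerically. The example below is a simplified illustration, not the AGENDA implementation: it assumes `L_class` is an ordinary cross-entropy on the main task and that `L_deb` is the cross-entropy between the adversary's predicted gender probabilities and a uniform distribution, so that it is smallest when the adversary outputs 0.5 for each class. All function names and numbers are hypothetical.

```python
import math


def cross_entropy(probs, target_index):
    """Classification loss: negative log-probability of the true class."""
    return -math.log(probs[target_index])


def uniform_cross_entropy(probs):
    """Assumed debiasing loss: cross-entropy against a uniform target.

    Minimized when every class has the same probability, i.e. when the
    adversary can do no better than a random guess.
    """
    return -sum(math.log(p) for p in probs) / len(probs)


def total_loss(class_probs, true_class, gender_probs, lam=1.0):
    """L_class + lam * L_deb, mirroring the structure of the formula above."""
    return cross_entropy(class_probs, true_class) + lam * uniform_cross_entropy(gender_probs)


# Hypothetical outputs: identity prediction and adversary's gender prediction.
class_probs = [0.7, 0.2, 0.1]

# The adversary is confident about gender: the debiasing term is large.
skewed = total_loss(class_probs, 0, [0.9, 0.1], lam=1.0)
# The adversary is at chance level: the debiasing term is at its minimum.
balanced = total_loss(class_probs, 0, [0.5, 0.5], lam=1.0)
print(skewed, balanced)
```

With the same classification output, the total loss is lower when the adversary's prediction is at chance, which is the pressure that pushes the main network toward gender-neutral descriptors. Raising `lam` strengthens that pressure at the possible cost of task accuracy.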

---

## 4. Challenges and Nuances

Resolving bias in AI is a persistent challenge for several reasons:

* **Data Complexity**: Real-world data is multifaceted, and identifying all sources of bias is difficult.
* **Model Complexity**: Modern AI models are often "black boxes" with billions of parameters, making them hard to explain or audit.
* **No Universal Solution**: Biases are often context-dependent (e.g., racial, gender, cultural), meaning there is no one-size-fits-all framework for fairness.

Interestingly, while generative models contain biases, they can also be used as tools to **help identify bias** in human-written content by analyzing framing, language choices, and source selection.