# Part 7: Challenges and Recommendations for Fairness in Machine Learning

## Why Improving Fairness is Hard

Fairness in machine learning is not a one-time technical fix.  
It is a **continuous, context-sensitive process** that must balance competing values and account for the real-world effects of algorithmic decisions.

This section highlights **five major challenges** in achieving fairness — each one illustrated through the three case studies:  
**student dropout prediction**, **gender classification**, and **predictive policing**.

---

### 1. Conceptual limitations of fairness definitions

Fairness in machine learning is often defined using mathematical metrics like **Equalized Odds** or **Calibration**.  
These definitions are useful diagnostic tools but they come with **mathematical and conceptual limitations**.

---

#### Correlation vs. Causality<sup>16</sup>

Observational fairness metrics are based on **statistical correlations**, not **causal relationships**. 
Machine learning models trained on observational data often learn patterns that **reflect existing inequalities**, rather than their causes.

**Example: Predictive Policing**

A model might learn that certain neighborhoods have higher crime rates — not because they are more dangerous, but because they are more heavily policed.
Treating such correlations as causal can reinforce biased enforcement strategies and fail to address the actual causes of crime.

>Fairness interventions that ignore causal mechanisms risk producing **harmful or misleading outcomes**.
Whenever possible, **causal reasoning** should complement statistical fairness metrics — even though it requires strong **domain knowledge** and may not always be feasible.

---

#### Incompatibility of metrics

As discussed in previous sections, some fairness metrics are **mathematically incompatible**. Separation and Sufficiency, for example, cannot both be satisfied unless the model predicts perfectly or the group base rates are identical.<sup>1,</sup><sup>2</sup>

**Example: Student Dropout Prediction**

Enforcing **Separation** via threshold adjustment led to a loss in **Sufficiency**.  
This meant that predicted risk scores no longer had a consistent meaning across groups.  
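
The incompatibility can be checked with a few lines of arithmetic. The sketch below uses an identity that follows directly from the confusion-matrix definitions: once base rate, positive predictive value (PPV), and true positive rate (TPR) are fixed, the false positive rate (FPR) is determined. The two groups and all numbers are hypothetical and are not taken from the dropout case study.

```python
def implied_fpr(base_rate: float, ppv: float, tpr: float) -> float:
    """False positive rate implied by base rate, PPV and TPR.

    Follows from the confusion-matrix definitions:
    FPR = p / (1 - p) * (1 - PPV) / PPV * TPR
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * tpr


# Hypothetical groups: identical PPV and TPR, different dropout base rates.
shared_ppv, shared_tpr = 0.7, 0.8
fpr_group_a = implied_fpr(base_rate=0.3, ppv=shared_ppv, tpr=shared_tpr)
fpr_group_b = implied_fpr(base_rate=0.5, ppv=shared_ppv, tpr=shared_tpr)

print(f"Group A implied FPR: {fpr_group_a:.3f}")
print(f"Group B implied FPR: {fpr_group_b:.3f}")
```

Because the base rates differ, the two implied false positive rates differ as well: holding PPV equal (a Sufficiency-style requirement) forces unequal error rates, which violates Separation.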

---

#### Inframarginality<sup>3</sup>

Many fairness metrics, such as Statistical Parity or Predictive Equality, operate on **infra-marginal statistics**. They summarize model performance **across the whole distribution**, rather than focusing on **decisions near the threshold**, where real-world consequences are greatest.

**Example: Diabetes Screening (Corbett-Davies et al., 2023)**  
Suppose a single threshold (e.g., 1.5% risk) is used to screen for diabetes.  
Groups with **higher baseline risk** (e.g. Asian Americans) will be screened more often — maximizing **overall health outcomes**.  
But this **violates Statistical Parity**, because not all groups are screened at the same rate.

Now, if we **adjust the thresholds** to **enforce parity**, we must:
- Raise the threshold for high-risk groups → they lose access to needed care.
- Lower the threshold for low-risk groups → they receive unnecessary tests.

**The result: every group is worse off.**

This fairness intervention leads to a situation where a different policy (e.g. keeping a uniform threshold) would produce better outcomes **for all**. 
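
The argument can be reproduced with a small simulation. This is a sketch under stated assumptions, not the analysis from Corbett-Davies et al.: the risk scores are synthetic draws from arbitrary Beta distributions, the group names are placeholders, and the benefit of screening a person is assumed to be their risk minus a fixed cost equal to the 1.5% threshold.

```python
import random

rng = random.Random(0)
COST = 0.015  # assumed: screening pays off when risk exceeds this value

# Synthetic risk scores; the Beta parameters are arbitrary illustrative choices.
risks = {
    "higher_risk_group": [rng.betavariate(2, 60) for _ in range(5000)],
    "lower_risk_group": [rng.betavariate(2, 150) for _ in range(5000)],
}


def screening_rate(scores, threshold):
    """Share of people whose risk lies above the threshold."""
    return sum(r > threshold for r in scores) / len(scores)


def net_benefit(scores, threshold):
    """Sum of (risk - COST) over everyone screened at this threshold."""
    return sum(r - COST for r in scores if r > threshold)


def threshold_for_rate(scores, rate):
    """Threshold that screens (approximately) the given share of a group."""
    ranked = sorted(scores, reverse=True)
    n_screened = round(rate * len(ranked))
    return ranked[min(n_screened, len(ranked) - 1)]


# Policy 1: one threshold for everyone.
uniform = {group: COST for group in risks}

# Policy 2: group-specific thresholds so that both groups are screened at the same rate.
all_scores = [r for scores in risks.values() for r in scores]
target_rate = screening_rate(all_scores, COST)
parity = {g: threshold_for_rate(s, target_rate) for g, s in risks.items()}

for group, scores in risks.items():
    print(
        group,
        f"rate uniform={screening_rate(scores, uniform[group]):.2f}",
        f"rate parity={screening_rate(scores, parity[group]):.2f}",
        f"benefit uniform={net_benefit(scores, uniform[group]):.1f}",
        f"benefit parity={net_benefit(scores, parity[group]):.1f}",
    )
```

Under this assumed benefit model, the parity policy equalizes screening rates but lowers the net benefit in both groups: the higher-risk group loses screenings that would have paid off, and the lower-risk group gains screenings that cost more than they are worth.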

---

#### Pareto-dominated outcomes<sup>3</sup>

As Corbett-Davies et al. (2023) argue, fairness interventions that **optimize infra-marginal metrics** (like Statistical Parity) can lead to **Pareto-dominated policies**.

A policy is Pareto-dominated when an **alternative exists that would improve outcomes for all groups** but is ruled out by the fairness constraint.

**See example above: Diabetes Screening**  
Enforcing statistical parity led to **worse outcomes for every group**, even though the metric was satisfied.  
The fairness intervention was **Pareto-inefficient**: all groups could have been better off under the original threshold.

**General insight:**  
This shows that **fairness metrics detached from real-world utility** can be misleading.  
> **Satisfying a metric ≠ Fair outcome**

Fairness should not be judged solely by mathematical parity but by **whether interventions improve outcomes for the intended groups** — especially at the decision margin.

---

**Key Insight:**  
- Many standard fairness metrics rely on simplified assumptions and ignore what happens at the decision boundary.  
- This can lead to **Pareto-dominated outcomes**, where **no group benefits**, even if fairness metrics are satisfied.  
- Therefore, **fairness interventions should not be judged solely by mathematical definitions**, but by whether they **actually improve outcomes for the people involved**.

This calls for a **consequentialist perspective** on fairness:<sup>3</sup>  
Instead of optimizing metrics in isolation, we must ask:  
> *“What are the real-world effects of this intervention – and who benefits or is harmed?”*

Only by evaluating fairness in terms of **context, utility, and impact** can machine learning systems be designed to support equitable outcomes.

---

### 2. Context dependence of fairness metrics

Fairness definitions are **not universal**: they depend on the **application domain**, the potential **harms**, and the **social context**.
What is a fair decision in one setting may be inappropriate in another.<sup>4,</sup><sup>5</sup>

**Example: Student Dropout Prediction**

Choosing between fairness metrics (e.g. Equal Opportunity vs. Predictive Parity) required ethical reflection about the **consequences** for students.

**Example: Gender Classification**  

Standard metrics like Statistical Parity fail here.  
**Representational harms** call for other forms of assessment, such as intersectional accuracy, that take visibility into account.

**Example: Predictive Policing**  

Even if a model minimizes overall error, it may reinforce **historical over-policing**.  
Fairness here must account for political and social history. Evaluating a clustering model also requires a different procedure, since it cannot be captured by standard classification fairness metrics.

---

### 3. Representation and measurement issues

Bias often stems from **how the world is represented** in data and labels — not just how the model learns.<sup>6</sup>

**Example: Student Dropout Prediction**  

The label “dropout” may reflect financial or structural **disadvantages** — not actual academic ability.  
Models risk learning that "disadvantage = failure".

**Example: Gender Classification**  

Gender is not always binary or observable. Labels in datasets like FairFace are often **annotator-assigned**, not **self-identified**.  
This raises issues of **label validity** and representational fairness.<sup>27</sup>

**Example: Predictive Policing**  

Training on historical police data reflects where police were active, not necessarily where crime occurred.  
Models predict **policing behavior**, not **crime risk**, reinforcing **feedback loops**.
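
The feedback loop can be illustrated with a deliberately simple, deterministic toy simulation. It is not the model used in the case study or in the cited literature. The two districts, the starting counts, and the allocation rule (always patrol the district with the most recorded incidents) are assumptions chosen purely for illustration.

```python
def simulate_patrols(days, recorded, true_daily_crime):
    """Send one patrol per day to the district with the most recorded incidents.

    Crime is only recorded where the patrol is present, so the records
    measure police presence rather than underlying crime.
    """
    recorded = list(recorded)
    for _ in range(days):
        target = max(range(len(recorded)), key=lambda d: recorded[d])
        recorded[target] += true_daily_crime[target]
    return recorded


# Two hypothetical districts with identical underlying crime,
# but district 0 starts with slightly more historical records.
history = [11, 10]
final = simulate_patrols(days=100, recorded=history, true_daily_crime=[1, 1])

share_before = history[0] / sum(history)
share_after = final[0] / sum(final)
print(f"Recorded incidents after 100 days: {final}")
print(f"District 0 share of records: {share_before:.2f} -> {share_after:.2f}")
```

Although both districts have the same underlying crime rate, the small initial difference in records determines where every patrol goes, and the data collected afterwards appears to confirm that choice.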

---

### 4. Lack of standards, transparency, and accountability

There are **no consistent standards** for evaluating fairness.<sup>7,</sup><sup>8</sup>  
Without transparency or clear responsibilities, fairness becomes difficult to assess or enforce.

**Example: Student Dropout Prediction**  

There is **no agreement** on which metric to use.  
Without standards, fairness choices become arbitrary, and students may be unable to understand or contest the decision.

**Example: Gender Classification**  

Systems are often **black boxes**. Candidates might be sorted based on gender without knowing it.  
This undermines autonomy and prevents contestation.

**Example: Predictive Policing**  

Police officers themselves sometimes do not know how risk predictions are made.<sup>9</sup>  
Lack of **explainability** and **accountability** limits meaningful oversight.

Tools like **Model Cards**<sup>10</sup> and **Datasheets for Datasets**<sup>11</sup> help improve transparency but are rarely adopted unless legally required.

---

### 5. Sociotechnical and institutional constraints

Fairness is not just a technical issue — it is embedded in **sociotechnical systems** shaped by organizations, stakeholders, and processes.<sup>5</sup>

> A **sociotechnical system** refers to the interplay between **technical components** (e.g. models, data, software) and the **social structures** around them — including institutions, regulations and user behavior.  
In such systems, technical decisions are never neutral. They are shaped by and impact human contexts.

**Example: Predictive Policing**  

Even a technically fair model still operates within **biased policing practices**.  
ML systems can **legitimize unjust structures** if context is ignored.<sup>12</sup>

**Example: Student Dropout Prediction** 

Universities may prioritize **cost-efficiency** over full support for all high-risk students.  
Fairness interventions can clash with **institutional goals**.

**Example: Gender Classification**  

HR systems often rely on **third-party black-box models**.  
Institutions may have no control over data or modeling decisions.

---

### The Five Abstraction Traps<sup>5</sup>

Fairness problems often result from **abstracting away social context** in order to build generalizable, modular systems.  
**Selbst et al.** (2019) describe five “abstraction traps” that lead to failed fairness efforts.

| Trap | Description | Example |
|:-----|:------------|:--------|
| **Framing Trap** | Treating fairness as a modeling problem only | Ignoring why students drop out in the first place |
| **Portability Trap** | Assuming fairness solutions are generalizable | Using standard metrics for gender classification |
| **Formalism Trap** | Reducing fairness to math | Overemphasis on metric optimization (e.g. Equalized Odds) |
| **Ripple Effect Trap** | Ignoring how ML systems change behavior | Predictive policing influences policing patterns |
| **Solutionism Trap** | Assuming ML is always the best solution | Automating gender recognition without clear benefit |

**Using the Traps as Reflective Questions**

Selbst et al. suggest that these traps can be **turned into a critical checklist** by reversing their order and framing them as questions.  
This helps structure ethical reflection in the development or assessment process.

> **Start with the most fundamental question and work upward:**

1. **Solutionism** → *Should we build this system at all? Is machine learning the right tool here?*  
2. **Ripple Effect** → *How will the system change the environment it operates in?*  
3. **Formalism** → *Which notions of fairness are relevant – and are they contested?*  
4. **Portability** → *Does the fairness approach actually fit the specific context?*  
5. **Framing** → *Are we addressing the right problem – or abstracting away key social dynamics?*

> Applying fairness is not just about choosing a metric. It's about **asking the right questions**, in the right order.  
Using these reversed abstraction traps as a guide helps ensure that fairness efforts are not only technical, but also socially meaningful.

---

## Practical Implications and Tools

Improving fairness in machine learning requires more than technical interventions.  
It involves institutional, procedural, and ethical considerations and faces **real obstacles** in practice.

Despite growing awareness, many fairness recommendations are **not widely implemented**, unless supported by **regulatory enforcement**.  
> Some companies adopt ethical guidelines mainly to **avoid binding oversight**, not out of intrinsic commitment.<sup>13</sup>

In the following section, we highlight practical implications that can help support fairness.

---

### Fairness as a Process, Not a Fix

Fairness is **not a static goal** or a property of the algorithm alone.  
It is shaped by how the system is **designed, deployed, monitored** and situated within its **social context**.<sup>5</sup>

This process-oriented view emphasizes:
- **Continuous reflection** across the ML lifecycle
- The need to assess **real-world consequences**
- The importance of **stakeholder inclusion** and **interdisciplinary input**

As discussed earlier, fairness metrics can be misleading when detached from impact.  
> A **consequentialist perspective** asks: *Who benefits? Who is harmed? What actually changes for people involved?*

The shift from solution-based to **process-based approaches** is supported by many authors. It shows that fairness is not a final state to be achieved, but more a commitment to **continuous reflection**.<sup>3,</sup><sup>5,</sup><sup>11,</sup><sup>14</sup>

---

### Fairness across the ML lifecycle

Fairness interventions can happen at different stages:<sup>15</sup>

- **Pre-processing:** Modify data before training (e.g. resampling, reweighting)
- **In-processing:** Adjust algorithms (e.g. fairness constraints, adversarial learning)
- **Post-processing:** Modify outputs after training (e.g. Fairlearn's ThresholdOptimizer)

Each method requires choosing which type of fairness matters most and being aware of **contextual consequences**.
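
As an illustration of the pre-processing stage, the following sketch implements one commonly used reweighting scheme: each training example is weighted by the ratio of the expected to the observed frequency of its (group, label) combination, where "expected" assumes that group and label are independent. The toy data and the function name `reweighing_weights` are hypothetical, and this is not the code used in the case studies.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Weight each (group, label) cell by expected / observed frequency.

    Expected frequency assumes group and label are independent:
    w(g, y) = P(g) * P(y) / P(g, y)
    """
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    cell_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * cell_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


def weighted_positive_rate(groups, labels, weights, group):
    """Weighted share of positive labels within one group."""
    rows = [(y, w) for g, y, w in zip(groups, labels, weights) if g == group]
    return sum(w for y, w in rows if y == 1) / sum(w for _, w in rows)


# Toy data: group "a" has 4 of 6 positive labels, group "b" only 1 of 4.
groups = ["a"] * 6 + ["b"] * 4
labels = [1, 1, 1, 1, 0, 0] + [1, 0, 0, 0]
weights = reweighing_weights(groups, labels)

for g in ("a", "b"):
    print(g, round(weighted_positive_rate(groups, labels, weights, g), 3))
```

After reweighting, both groups have the same weighted positive rate, so a model trained with these weights no longer sees group membership as predictive of the label through base rates alone. Many scikit-learn estimators accept such weights through the `sample_weight` argument of `fit`.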

---

### Transparency, explainability, and accountability

Improving fairness in machine learning is about making systems **understandable, accessible, and contestable**.

A key factor for this is **communication**:<sup>6,</sup><sup>7,</sup><sup>16</sup>
> Fair ML systems must enable both **developers** and **affected individuals** to understand how decisions are made and to respond when they are unfair.

#### Tools for transparency and documentation:

- **Model Cards**:<sup>10</sup>
  Document the model’s purpose, assumptions, limitations, and performance across demographic groups.  
  → Should include **intended use**, **intersectional evaluation**, and **risk of misuse**.

- **Datasheets for Datasets**:<sup>11</sup>
  Describe dataset origin, structure, collection methods, and potential biases.  
  → Help evaluate whether a dataset is suitable for a given task or context.

These tools increase visibility but they only help **if the information is accessible and communicated clearly**.
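
The "performance across demographic groups" entry of a model card can be produced with a small disaggregated evaluation. The sketch below computes accuracy for every combination of attribute values, which is one simple way to report intersectional results. The helper `accuracy_by_subgroup`, the attribute names, and the toy predictions are hypothetical and are not taken from the gender classification case study.

```python
from collections import defaultdict


def accuracy_by_subgroup(y_true, y_pred, *attributes):
    """Accuracy for every combination of the given attribute values."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, *attrs in zip(y_true, y_pred, *attributes):
        key = tuple(attrs)
        totals[key] += 1
        hits[key] += int(truth == pred)
    return {key: hits[key] / totals[key] for key in totals}


# Toy predictions for eight hypothetical images.
y_true = ["f", "f", "m", "m", "f", "f", "m", "m"]
y_pred = ["f", "f", "m", "m", "m", "f", "m", "m"]
gender = y_true
skin_tone = ["lighter"] * 4 + ["darker"] * 4

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = accuracy_by_subgroup(y_true, y_pred, gender, skin_tone)

print(f"overall accuracy: {overall:.3f}")
for subgroup, accuracy in sorted(report.items()):
    print(subgroup, f"{accuracy:.2f}")
```

In this toy data the overall accuracy looks high, while the errors are concentrated in a single intersectional subgroup. A model card that reports only the aggregate number would hide exactly this pattern.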

#### Explainability<sup>7</sup>

Even when models are documented, they may remain opaque.  
**Explainability** means making the system’s behavior **understandable for humans**, especially those affected by its decisions.

- For end users: explanations should be **simple and meaningful**
- For domain experts: tools like **LIME** or **SHAP** can offer technical insights
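
To give an intuition for what attribution tools compute, the sketch below calculates exact Shapley values, the quantity that SHAP approximates, for a tiny hand-written scoring function. It does not call the LIME or SHAP libraries. The model, the feature values, and the all-zero baseline are assumptions for illustration, and the brute-force enumeration is only feasible for a handful of features.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings.

    Features that are not yet "switched on" take their value from `baseline`.
    The number of orderings grows factorially with the number of features.
    """
    n = len(instance)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for feature in order:
            current[feature] = instance[feature]
            output = model(current)
            contributions[feature] += output - previous_output
            previous_output = output
    return [c / len(orderings) for c in contributions]


def toy_risk_model(x):
    # Hypothetical score with an interaction between features 0 and 1;
    # feature 2 has no influence on the output.
    return 0.5 * x[0] + 0.25 * x[1] + 0.5 * x[0] * x[1] + 0.0 * x[2]


student = [1.0, 1.0, 1.0]
reference = [0.0, 0.0, 0.0]
phi = shapley_values(toy_risk_model, student, reference)

print("attributions:", phi)
print("sum of attributions:", sum(phi))
print("score difference:", toy_risk_model(student) - toy_risk_model(reference))
```

The attributions add up to the difference between the model output for the instance and for the baseline, the interaction effect is split evenly between the two interacting features, and the unused feature receives zero.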

However, explainability often comes with trade-offs:
- High-performing models may be hard to explain
- Privacy concerns may limit what can be disclosed<sup>17</sup>

#### Accountability

Transparency and explainability are **necessary, but not sufficient**.  
> Without **clear responsibility**, even a well-documented system lacks fairness.

- Who decided which features to use?
- Who defined the target variable?
- Who can be contacted to contest a decision?

Accountability requires **defined roles**, **oversight mechanisms**, and (ideally) the ability to **appeal or challenge** automated decisions.

In all three case studies, affected individuals had little understanding of the system and no way to contest outcomes:
- **Students** receiving dropout risk labels without explanation
- **Job applicants** being filtered by black-box gender classifiers
- **Citizens** being policed based on opaque hotspot predictions

> Fairness depends on more than open data or transparent models.  
It requires **communication structures** that enable understanding, trust, and recourse for **all stakeholders**, especially those affected by the system.<sup>4,</sup><sup>16</sup>

---

### Tools and Institutional Practices

Several tools and practices can support fairness. However, these tools should not be used in isolation. They must be applied with an understanding of the social context and stakeholder needs.

**Example Toolkits:**
- **AIF360 (IBM)**:<sup>18</sup> Metrics and pre/in/post-processing methods for fairness evaluation
- **Fairlearn (Microsoft)**:<sup>19</sup> Model diagnostics, fairness-accuracy trade-offs, and threshold tuning
- **SageMaker Clarify (Amazon)**:<sup>20</sup> Bias detection and explainability tools

**Institutional Practices:**

Fairness in machine learning is not only a question of metrics or tools; it also depends on **how organizations develop and govern these systems**.

To move from abstract principles to meaningful practice, institutions can adopt several strategies:

- Establish **fairness review boards**, ethics committees, or internal audit processes  
  → These structures ensure that fairness decisions are **not left to individual developers**, but handled collectively and transparently.<sup>21</sup>

- Provide **training on fairness, bias awareness, and societal impacts**<sup>16</sup>  
  → Teams need to understand how their decisions affect real people — beyond just technical performance.

- Promote **diverse team composition**<sup>22,</sup><sup>23</sup>  
  → Homogeneous teams often fail to anticipate harms experienced by marginalized groups.  
  **Example:** Lee (2018) shows that the lack of diversity in the tech industry can lead to serious blind spots. She describes a photo-tagging algorithm that labeled Black individuals as “gorillas”. This failure could likely have been avoided with greater team diversity.

- Enable **interdisciplinary collaboration**<sup>6</sup>  
  → Fairness challenges require knowledge from law, sociology, ethics, and domain-specific fields.  
  → Including these perspectives helps detect risks and design systems that are better aligned with societal values.

- Involve **affected communities and domain experts** early in the process<sup>16,</sup><sup>24</sup>  
  → Fairness cannot be defined solely by engineers.  
  → Participatory design improves contextual understanding and **legitimizes fairness goals**.

As the case studies illustrate, many harms arise not from malicious intent, but from **limited perspectives and missing expertise**.  
> Addressing fairness requires institutions to invest in **inclusive practices, interdisciplinary thinking, and stakeholder engagement** — not just tools.

---

### Z-Inspection®: A practical fairness framework<sup>13,</sup><sup>25,</sup><sup>26</sup>

Z-Inspection® offers a **holistic and interdisciplinary framework** to evaluate fairness in high-impact AI systems.


| Phase | Description |
|-------|-------------|
| **Set-up** | Define scope and form an independent, diverse expert team |
| **Assess** | Develop sociotechnical scenarios, identify ethical tensions, map to EU AI guidelines |
| **Resolve** | Generate concrete recommendations — including technical fixes or stopping deployment |

![](Images/Z-Inspection.png)

**Example:**  
An AI system for detecting cardiac arrest in emergency calls showed reduced performance for non-native speakers.  
Fairness concerns were not visible through metrics alone — they emerged through contextual assessment.

> Z-Inspection® emphasizes that **fairness is embedded in systems, not just models**.  
It requires **interdisciplinary collaboration**, **continuous monitoring**, and **ethical reflection** throughout the lifecycle.

---

### Summary

- **Fairness is not a fixed goal** but a **continuous process** shaped by context and consequences.
- Many standard fairness metrics have **mathematical and conceptual limitations**.
- Each of the three case studies shows different challenges — and why **technical fixes alone aren’t enough**.
- Addressing fairness means looking at the complete sociotechnical system and not just numbers.

---

### Looking Ahead

Hopefully this notebook has provided you with a solid foundation for understanding the **ethical challenges of fairness in machine learning**.
The goal was not to present exhaustive solutions, but to introduce core concepts, raise awareness of common pitfalls, and encourage critical reflection.
By combining theory, code, and case studies, the notebook offers an **entry point** into a complex and evolving field.
Throughout the three integrated studies on **student dropout prediction**, **gender classification**, and **predictive policing**, you have seen that fairness:

- cannot be reduced to a single metric or technical fix
- requires attention to representation, context, and long-term effects
- must be assessed within the **broader sociotechnical environment** in which systems operate

The notebook is designed to help you build a **conceptual framework** that you can expand over time — as technologies evolve, debates progress, and regulatory frameworks emerge.
> **Note**: Legal and regulatory aspects (such as the EU AI Act) were not addressed in this notebook, but they are increasingly influencing real-world practice.


If you are interested in learning more, here are some recommendations:

| Source | Description | Link |
|----------------|-------------|------|
| Barocas et al. (2023) | Comprehensive overview of fairness, bias, and algorithmic decision-making | [Fair ML Book](https://fairmlbook.org) |
| Verma & Rubin (2018) | Summary of different fairness metrics | [Fairness Metrics](https://doi.org/10.1145/3194770.3194776) |
| Zicari et al. (2021) & Boonstra et al. (2024) | Real-world applications of the Z-Inspection® framework — shows what fairness assessments look like in practice | [AI in healthcare](https://doi.org/10.3389/fhumd.2021.673104) & [AI in nature monitoring](https://doi.org/10.48550/arXiv.2404.14366) |
| Angwin et al. (2016) | Original investigation of the COMPAS case  | [COMPAS](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) |
| Corbett-Davies et al. (2023) | Highlights key challenges like Pareto domination in fairness metrics | [Challenges in Fair ML](http://jmlr.org/papers/v24/22-1511.html) |
| Kaggle: Intro to AI Ethics | Practical exercises, interactive tutorials, and further links such as [Google's interactive explainer](https://research.google.com/bigpicture/attacking-discrimination-in-ml/) | [Kaggle course](https://www.kaggle.com/learn/intro-to-ai-ethics) |
| Mehrabi et al. (2021) | A more compact overview of fairness and bias in ML (compared to Barocas et al.) | [Survey on Bias and Fairness](https://doi.org/10.1145/3457607) |
| Lum & Isaac (2016) | Introduction to predictive policing | [Intro Predictive Policing](https://doi.org/10.1111/j.1740-9713.2016.00960.x) |
| Robinson & Koepke (2016) | Builds on insights from Lum & Isaac | [More Predictive Policing](https://www.upturn.org/work/stuck-in-a-pattern) |

Other references used in this notebook are listed in the user guide and cited throughout the sections.  
For a deeper exploration, you can also access the master's thesis on which this notebook is based: *[GitHub](https://www.github.com/LukasWel/ethical-challenges-in-ml)*.

Fairness in machine learning cannot be fully solved, but it can be better **understood**, more **transparently discussed**, and more **responsibly addressed**.
We hope this notebook helped you take the first step in that direction.

---

### Quiz

**1. True or False:**
Fairness metrics can always tell us whether a system is fair in practice.

1. [ ] True
2. [ ] False

**2. What does inframarginality describe?**
*(Select one option)*

1. [ ] Errors that occur due to biased sampling
2. [ ] Ignoring fairness concerns at the decision boundary
3. [ ] When a fairness metric applies only to large groups
4. [ ] When predictions are perfectly calibrated across all subgroups

**3. What is a Pareto-dominated outcome?**
*(Select one option)*

1. [ ] A system that performs equally across all metrics
2. [ ] An outcome that benefits one group at the expense of another
3. [ ] An outcome where a better alternative exists for all groups
4. [ ] A fairness intervention that leads to identical thresholds for all groups
   
---

#### Sources:
1. Chouldechova, 2017
2. Kleinberg et al., 2016
3. Corbett-Davies et al., 2023
4. Osoba & Welser, 2017
5. Selbst et al., 2019
6. Suresh & Guttag, 2021
7. Ntoutsi et al., 2020
8. Olteanu et al., 2019
9. Robinson & Koepke, 2016
10. Mitchell et al., 2019
11. Gebru et al., 2018
12. Lum & Isaac, 2016
13. Boonstra et al., 2024
14. Binns, 2018
15. Calegari et al., 2023
16. Barocas et al., 2023
17. Schmidt et al., 2024
18. LF AI & Data Foundation, n.d.
19. Fairlearn Organization, n.d.
20. Amazon Web Services, n.d.
21. Raji et al., 2020
22. Lee, 2018
23. Yapo & Weiss, 2018
24. Howard & Borenstein, 2018
25. Zicari et al., 2021
26. Zicari et al., 2022
27. Scheuerman et al., 2020