This project notebook documents a hands-on approach to identifying, analyzing, and mitigating unfair bias in a machine learning model. Using the classic Adult Income dataset, this analysis demonstrates a full responsible AI workflow, from establishing a biased baseline to exploring and implementing various mitigation strategies. The ultimate goal is to build a model that performs equitably across different demographic groups, even when the underlying data is flawed.
- Identifying Bias: We use subgroup analysis to expose performance disparities in a model, proving that relying solely on overall accuracy can be misleading.
- Data-Centric Mitigation: This project explores the challenges and trade-offs of a data-centric approach by attempting to mitigate bias through oversampling.
- Model-Centric Mitigation: We implement a more advanced, model-centric approach using the Fairlearn library to directly enforce fairness during the prediction phase.
- The Iterative Nature of Responsible AI: The notebook's narrative highlights that building fair technology is not a one-time fix but a continuous process of testing, learning, and making deliberate trade-offs.
The project follows a step-by-step methodology to illustrate the complete process:
- Exploratory Data Analysis (EDA): Initial analysis is performed to identify data imbalances and potential sources of selection bias.
- Baseline Model Training: A Logistic Regression model is trained on the original, unbalanced data to establish a performance benchmark.
- Subgroup Analysis: The baseline model's performance is meticulously evaluated on sensitive subgroups (
sexandrace) to quantify the extent of the bias. - Data-Centric Mitigation Attempt: Oversampling is applied to the training data, and a new model is trained to demonstrate the unintended consequences of this approach.
- Model-Centric Mitigation with Fairlearn: The
ThresholdOptimizerfrom the Fairlearn library is implemented to fine-tune the model's predictions for a more equitable outcome. - Final Analysis and Conclusion: The results of all models are compared side-by-side to draw conclusions on the most effective mitigation strategy for this project.
To run this notebook, you'll need a Python environment with the following dependencies:
pandasscikit-learnmatplotlibfairlearnimblearn
It is highly recommended to use a virtual environment like Conda to manage these dependencies and avoid conflicts.