# Detecting and Mitigating Unfairness in Models

Machine learning models can incorporate unintentional bias, which can lead to issues with *fairness*. For example, a model that predicts the likelihood of diabetes might work well for some age groups, but not for others - subjecting a subset of patients to unnecessary tests, or depriving them of tests that would confirm a diabetes diagnosis.

In this exercise, you'll use the **FairLearn** package to analyze a model and find any disparity in prediction performance for different subsets of patients based on age.

## Before You Start

Before you start this lab, ensure that you have completed the *Create an Azure Machine Learning Workspace* and *Create a Compute Instance* tasks in [Lab 1: Getting Started with Azure Machine Learning](./labdocs/Lab01.md). Then open Jupyter on your Compute Instance and create a new **Terminal**.

The FairLearn package used in this exercise has dependencies on specific versions of common Python packages. To avoid potential conflicts, you're going to create a separate Conda environment and associated Jupyter kernel specifically for this exercise.

In the terminal, run the following commands to create a new Conda environment and Jupyter kernel that includes all of the packages you need to work with FairLearn.

```
conda create -y -n fair python=3.6 scikit-learn pandas numpy
conda activate fair
pip install azureml-sdk[notebooks] azureml-contrib-fairness fairlearn==0.4.6
conda install -y ipykernel
ipython kernel install --user --name=aml-fair
conda deactivate
```

Then open this notebook and select the **aml-fair** kernel before running the cells below.
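
If you want to confirm that the selected kernel can see the packages installed above, a quick check like the following sketch may help. The `missing_packages` helper is hypothetical (it is not part of any of these libraries), and the list of top-level import names is an assumption based on the install commands.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that this kernel cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level import names assumed to correspond to the packages installed above.
required = ["fairlearn", "azureml", "sklearn", "pandas", "numpy"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

If anything is listed as missing, the notebook is probably running on a different kernel than **aml-fair**.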

## Train a model

You'll start by training a classification model to predict the likelihood of diabetes. In addition to splitting the data into training and test sets of features and labels, you'll extract *sensitive* features that are used to define subpopulations of the data for which you want to compare fairness. In this case, you'll use the **Age** column to define two categories of patient: those over 50 years old, and those 50 or younger.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# load the diabetes dataset
print("Loading Data...")
data = pd.read_csv('data/diabetes.csv')

# Separate features and labels
features = ['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']
X, y = data[features].values, data['Diabetic'].values

# Get sensitive features
A = data[['Age']].astype(int)
# Change value to represent age groups
A['Age'] = np.where(A.Age > 50, 'Over 50', '50 or younger')

# Split data into training set and test set
X_train, X_test, y_train, y_test, A_train, A_test = train_test_split(X, y, A, test_size=0.20, random_state=0, stratify=y)

# Train a classification model
print("Training model...")
base_model = LogisticRegression(solver='liblinear').fit(X_train, y_train)

print("Model trained.")

Now you can use the sensitive features with the FairLearn package to compare the model's predictive performance for the different patient categories. To do this, you'll use the FairLearn dashboard:

1. Run the cell below.
2. When the widget is displayed, use the **Get started** link to start configuring your visualization.
3. Select the sensitive features you want to compare (in this case, there's only one: **Age**).
4. Select the model performance metric you want to compare (in this case, it's a binary classification model so the options are *Accuracy*, *Balanced accuracy*, *Precision*, and *Recall*).
5. View the dashboard visualization, which shows:
    - **Disparity in performance** - how the selected performance metric compares for the subpopulations, including *underprediction* (false negatives) and *overprediction* (false positives).
    - **Disparity in predictions** - A comparison of the number of positive cases per subpopulation.
6. Edit the configuration to compare the predictions based on a different performance metric.

> **Note**: We're deliberately not training an optimal model so you can see the disparity in the prediction performance for the two age groups in this dataset.

In [None]:
from fairlearn.widget import FairlearnDashboard

# View this model in Fairlearn's fairness dashboard, and see the disparities which appear:
FairlearnDashboard(sensitive_features=A_test, 
                   sensitive_feature_names=['Age'],
                   y_true=y_test,
                   y_pred={"diabetes_model": base_model.predict(X_test)})

Whichever performance metric you choose, the model predicts a positive result more often for patients who are over 50 years old, potentially subjecting this subpopulation to a larger volume of unnecessary diabetes tests.

> **Note**: In reality, age is a genuine factor in diabetes, so you would expect more positive cases among older patients; however, the model exhibits some evidence of overpredicting positive cases for the older subpopulation.
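
To make the dashboard's terms concrete, the following sketch computes per-group rates by hand with pandas. It is only an illustration under stated assumptions: the labels and predictions are made-up toy values (not the diabetes data), and `group_rates` is a hypothetical helper, not a FairLearn function. Here the false positive rate stands in for *overprediction*, and one minus recall for *underprediction*.

```python
import pandas as pd

# Toy labels, predictions, and age groups (made-up values for illustration only).
toy = pd.DataFrame({
    "group":  ["Over 50"] * 5 + ["50 or younger"] * 5,
    "y_true": [1, 1, 0, 0, 0, 1, 1, 0, 0, 0],
    "y_pred": [1, 1, 1, 1, 0, 1, 0, 0, 0, 0],
})


def group_rates(rows):
    """Compute selection rate, recall, and false positive rate for one group."""
    positives = rows[rows["y_true"] == 1]
    negatives = rows[rows["y_true"] == 0]
    return pd.Series({
        "selection_rate": rows["y_pred"].mean(),            # share predicted positive
        "recall": positives["y_pred"].mean(),               # true positives found
        "false_positive_rate": negatives["y_pred"].mean(),  # negatives flagged as positive
    })


# One row per age group, one column per rate.
by_group = pd.DataFrame(
    {name: group_rates(rows) for name, rows in toy.groupby("group")}
).T
print(by_group)
```

In this toy data the older group has both a higher selection rate and a higher false positive rate, which is the kind of pattern the dashboard reports as disparity.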

Let's see what happens if we exclude the **Age** feature when training the model.

In [None]:
# Separate features and labels
features2 = ['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree']
X2, y2 = data[features2].values, data['Diabetic'].values

# Get sensitive features
A2 = data[['Age']].astype(int)
# Change value to represent age groups
A2['Age'] = np.where(A2.Age > 50, 'Over 50', '50 or younger')

# Split data into training set and test set
X_train2, X_test2, y_train2, y_test2, A_train2, A_test2 = train_test_split(X2, y2, A2, test_size=0.20, random_state=0, stratify=y2)

# Train a classification model
print("Training model...")
model2 = LogisticRegression(solver='liblinear').fit(X_train2, y_train2)

print("Model trained.")

# View this model in Fairlearn's fairness dashboard, and see the disparities which appear:
FairlearnDashboard(sensitive_features=A_test2, 
                   sensitive_feature_names=['Age'],
                   y_true=y_test2,
                   y_pred={"diabetes_model2": model2.predict(X_test2)})

Now the model significantly underpredicts positive cases for older patients, showing that even though **Age** was not a feature used in training, the model still exhibits disparity in how well it predicts for older and younger patients. 

The overall predictive performance of the model has diminished, so clearly **Age** *is* a predictive feature - we just need to mitigate the tendency for the model to overpredict positive labels for older patients.
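
One plausible reason why disparity can persist after a sensitive feature is removed is that other features correlate with it and act as proxies. This is an assumption for illustration; the exercise does not verify it for the diabetes data. The sketch below uses synthetic data with hypothetical column names (`proxy_feature`, `independent_feature`) to show how you might check such correlations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic data: "proxy_feature" is built to depend on age,
# while "independent_feature" is generated without reference to age.
age = rng.integers(21, 81, size=n)
synthetic = pd.DataFrame({
    "Age": age,
    "proxy_feature": 60 + 0.4 * age + rng.normal(0, 5, size=n),
    "independent_feature": rng.normal(100, 15, size=n),
})

# Correlation of each remaining feature with the excluded sensitive column.
correlation_with_age = synthetic.drop(columns="Age").corrwith(synthetic["Age"])
print(correlation_with_age.sort_values(ascending=False))
```

A feature that correlates strongly with the excluded column can still carry much of its information into the model.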

## Register the model and upload the dashboard to Azure Machine Learning

You've trained the model and reviewed the dashboard locally in this notebook, but it might be useful to register the model in your Azure Machine Learning workspace and create an experiment to analyze it for fairness so you can keep track of your mitigation strategy.

Let's start by registering the original model (which included **Age** as a feature).

> **Note**: If prompted, follow the link and enter the authentication code provided to sign into your Azure subscription.

In [None]:
from azureml.core import Workspace, Experiment, Model
import joblib
import os

# Load the Azure ML workspace from the saved config file
ws = Workspace.from_config()
print('Ready to work with', ws.name)

# Save the trained model
model_file = 'diabetes_model.pkl'
joblib.dump(value=base_model, filename=model_file)

# Register the model
print('Registering model...')
registered_model = Model.register(model_path=model_file,
                                    model_name='diabetes_classifier',
                                    workspace=ws)
model_id = registered_model.id

print('Model registered.', model_id)

Now you can use the FairLearn package to create a group of metrics for one or more models, and use an Azure Machine Learning experiment to upload the metrics.

In [None]:
from fairlearn.metrics._group_metric_set import _create_group_metric_set
from azureml.contrib.fairness import upload_dashboard_dictionary, download_dashboard_by_upload_id

#  Create a dictionary of model(s) you want to assess for fairness 
sf = { 'Age': A_test.Age}
ys_pred = { model_id:base_model.predict(X_test) }
dash_dict = _create_group_metric_set(y_true=y_test,
                                    predictions=ys_pred,
                                    sensitive_features=sf,
                                    prediction_type='binary_classification')

exp = Experiment(ws, "Diabetes_Fairness")
print(exp)

run = exp.start_logging()

# Upload the dashboard to Azure Machine Learning
try:
    dashboard_title = "Fairness insights of Diabetes Classifier"
    upload_id = upload_dashboard_dictionary(run,
                                            dash_dict,
                                            dashboard_name=dashboard_title)
    print("\nUploaded to id: {0}\n".format(upload_id))

    # To test the dashboard, you can download it back and ensure it contains the right information
    downloaded_dict = download_dashboard_by_upload_id(run, upload_id)
    print(downloaded_dict)
finally:
    run.complete()

The preceding code downloaded the metrics generated in the experiment just to confirm it completed successfully. The real benefit of uploading the metrics to an experiment is that you can now view the FairLearn dashboard in Azure Machine Learning studio.

Run the cell below to see the output of the experiment, and click the link to see the run in Azure Machine Learning studio. Then view the **Fairness** tab of the experiment run to view the dashboard, which behaves the same way as the widget you viewed previously in this notebook.

In [None]:
from azureml.widgets import RunDetails

RunDetails(run).show()

You can also find the fairness dashboard by selecting a model in the **Models** page of Azure Machine Learning studio and reviewing its **Fairness** tab. This enables your organization to maintain a log of fairness analysis for the models you train and register.

## Mitigate Unfairness in the Model

Now that you've analyzed the model for fairness, you can use any of the *mitigation* techniques supported by the FairLearn package to find a model that achieves the best balance of predictive performance and fairness.

In this exercise, we'll use the **GridSearch** feature, which trains multiple models in an attempt to minimize the disparity of predictive performance for the sensitive features in the dataset (in this case, the age groups).

> *This may take some time to run*

In [None]:
from fairlearn.reductions import GridSearch, DemographicParity, ErrorRate

print('Finding mitigated models...')

# Train multiple models
sweep = GridSearch(LogisticRegression(solver='liblinear'),
                   constraints=DemographicParity(),
                   grid_size=71)

sweep.fit(X_train, y_train, sensitive_features=A_train.Age)
models = sweep._predictors

# Examine the models, finding the error and disparities in each based on the sensitive features
errors, disparities = [], []
for m in models:
    classifier = lambda X: m.predict(X)
    
    error = ErrorRate()
    error.load_data(X_train, pd.Series(y_train), sensitive_features=A_train.Age)
    disparity = DemographicParity() #use the DemographicParity constraint to define the mitigation strategy
    disparity.load_data(X_train, pd.Series(y_train), sensitive_features=A_train.Age)
    
    errors.append(error.gamma(classifier)[0])
    disparities.append(disparity.gamma(classifier).max())
    
all_results = pd.DataFrame( {"model": models, "error": errors, "disparity": disparities})

# Retain only the dominant models (those whose error is no higher than that of any model with the same or lower disparity)
dominant_models_dict = dict()
base_name_format = "diabetes_model_{0}"
row_id = 0
for row in all_results.itertuples():
    model_name = base_name_format.format(row_id)
    errors_for_lower_or_eq_disparity = all_results["error"][all_results["disparity"]<=row.disparity]
    if row.error <= errors_for_lower_or_eq_disparity.min():
        dominant_models_dict[model_name] = row.model
    row_id = row_id + 1

# Create dictionaries of the mitigated models and predictions (plus the original unmitigated one for comparison)
predictions_dominant = {"diabetes_unmitigated": base_model.predict(X_test)}
models_dominant = {"diabetes_unmitigated": base_model}
for name, model in dominant_models_dict.items():
    value = model.predict(X_test)
    predictions_dominant[name] = value
    models_dominant[name] = model

# Display the model names
for model_name in models_dominant:
    print(model_name)
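
The sweep above uses a demographic parity constraint, which broadly asks that each group receives positive predictions at a similar rate. As a rough illustration of that idea, the sketch below computes the gap between the highest and lowest selection rate across groups. The helper `demographic_parity_difference` is written here by hand on toy values; it is not necessarily the exact quantity returned by `disparity.gamma` in the cell above.

```python
import numpy as np
import pandas as pd


def demographic_parity_difference(y_pred, groups):
    """Largest gap in selection rate (share of positive predictions) between groups."""
    rates = pd.Series(np.asarray(y_pred)).groupby(np.asarray(groups)).mean()
    return rates.max() - rates.min()


# Toy predictions for two age groups (made-up values for illustration only).
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0]
toy_groups = ["Over 50"] * 4 + ["50 or younger"] * 4
print(demographic_parity_difference(toy_pred, toy_groups))
```

A value of 0 would mean both groups are predicted positive at the same rate; larger values indicate greater disparity in predictions.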

Now you can use the FairLearn dashboard to compare the mitigated models:

Run the cell below and use the wizard to display the models. They are shown as a scatter plot that helps you compare the model performance (based on your chosen metric) with the level of disparity in the model's predictions for the sensitive feature groups (in this case, age ranges). You can select an individual point to see a breakdown of the predictive performance and disparity by sensitive feature for that model.

In [None]:
FairlearnDashboard(sensitive_features=A_test, 
                   sensitive_feature_names=['Age'],
                   y_true=y_test.tolist(),
                   y_pred=predictions_dominant)

Based on the model comparison, you can choose the right balance of predictive performance and fairness for your particular needs, and deploy the most appropriate model.
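
If you prefer to make that choice programmatically, one possible approach is to discard dominated models and then pick the lowest-error model within a disparity limit. The sketch below assumes made-up error and disparity values, hypothetical model names, and a hypothetical limit of 0.10. Its dominance test is a simplified variant of the filter used earlier, not a copy of it.

```python
import pandas as pd

# Made-up error and disparity values for four hypothetical candidate models.
candidates = pd.DataFrame({
    "model": ["model_a", "model_b", "model_c", "model_d"],
    "error": [0.20, 0.22, 0.25, 0.30],
    "disparity": [0.15, 0.08, 0.10, 0.02],
})


def is_dominated(row, table):
    """True if another model is at least as good on both measures and better on one."""
    at_least_as_good = (table["error"] <= row.error) & (table["disparity"] <= row.disparity)
    strictly_better = (table["error"] < row.error) | (table["disparity"] < row.disparity)
    return bool((at_least_as_good & strictly_better).any())


# Keep only the models that no other model dominates.
keep = [not is_dominated(row, candidates) for row in candidates.itertuples()]
non_dominated = candidates.loc[keep]

# Hypothetical requirement: disparity must not exceed 0.10.
max_disparity = 0.10
eligible = non_dominated[non_dominated["disparity"] <= max_disparity]

# Among eligible models, choose the one with the lowest error.
chosen = eligible.sort_values("error").iloc[0]
print(chosen["model"])
```

The appropriate limit depends on your scenario, so treat the threshold as a value to agree with stakeholders rather than a default.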

## Upload the Mitigation Dashboard to Azure Machine Learning

As before, you might want to keep track of your mitigation experimentation. To do this, you can:

1. Register the models found by the GridSearch process.
2. Compute the performance and disparity metrics for the models.
3. Upload the metrics in an Azure Machine Learning experiment.

In [None]:
# Register the models
os.makedirs('models', exist_ok=True)
registered_models = dict()
for name, model in models_dominant.items():
    model_file = "models/{0}.pkl".format(name)
    joblib.dump(value=model, filename=model_file)
    registered_model = Model.register(model_path=model_file,
                                    model_name=name,
                                    workspace=ws)
    registered_models[name] = registered_model.id

# Map each registered model's ID to the predictions computed previously
prediction_metrics = dict()
for name, y_pred in predictions_dominant.items():
    prediction_metrics[registered_models[name]] = y_pred

#  Create a dictionary of model(s) you want to assess for fairness 
sf = { 'Age': A_test.Age}
dash_dict = _create_group_metric_set(y_true=y_test,
                                     predictions=prediction_metrics,
                                     sensitive_features=sf,
                                     prediction_type='binary_classification')

exp = Experiment(ws, "Diabetes_Fairness_Mitigation")
print(exp)

run = exp.start_logging()

# Upload the dashboard to Azure Machine Learning
try:
    dashboard_title = "Fairness Comparison of Diabetes Models"
    upload_id = upload_dashboard_dictionary(run,
                                            dash_dict,
                                            dashboard_name=dashboard_title)
    print("\nUploaded to id: {0}\n".format(upload_id))

    # To test the dashboard, you can download it back and ensure it contains the right information
    downloaded_dict = download_dashboard_by_upload_id(run, upload_id)
    print(downloaded_dict)
finally:
    run.complete()
    RunDetails(run).show()

When the experiment has finished running, click the link in the widget at the bottom of the output to view the run in Azure Machine Learning studio, and view the FairLearn dashboard on the **Fairness** tab.