# Evaluating Machine Learning Models

Today, we will use machine learning tools to train models while paying close attention to model fairness.

First, we will use **fairkit-learn** to train and evaluate models using the ProPublica COMPAS Dataset.

Next, we will use **scikit-learn** to train and evaluate models using the German Credit Dataset. 

Finally, we will use **AI Fairness 360** to train and evaluate models using the Adult Census Income Dataset. 

Along with the provided tooling and resources within this notebook, you will be allowed to use outside resources (e.g. Google) to help you complete this exercise.

Please plan to complete the entire exercise in one sitting. Make sure you have enough time and that your computer is plugged into power before you start; you'll be running machine learning algorithms, which will drain your battery.

Responses for this exercise will be entered in the <a href="https://form.jotform.com/92474488429169" target="_blank">Evaluating ML Models Exercise Response Form</a>. You will first be asked some demographic questions; each page that follows corresponds to one of the tasks you complete. You are expected to enter responses for each task, and you must submit the form for your assignment to be graded.



## Models

Because there are a variety of models provided by scikit-learn and AI Fairness 360, we will only use a subset for this assignment. The models you will be evaluating are as follows:

* **Logistic Regression**: a machine learning algorithm used for classification problems. It is a predictive analysis algorithm based on the concept of probability. [More info here.](https://machinelearningmastery.com/logistic-regression-for-machine-learning/) [Scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
* **K Nearest Neighbor Classifier**: a model that classifies data points based on the points that are most similar to them. It uses the labeled training data to make an “educated guess” about how an unclassified point should be classified. [More info here.](https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761) [Scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
* **Random Forest**: an ensemble machine learning algorithm that is used for classification and regression problems. Random forest applies the technique of bagging (bootstrap aggregating) to decision tree learners. [More info here.](https://towardsdatascience.com/understanding-random-forest-58381e0602d2) [Scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
* **Support Vector Classifier**: a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space, this hyperplane is a line dividing the plane into two parts, with each class lying on either side. [More info here.](https://medium.com/machine-learning-101/chapter-2-svm-support-vector-machine-theory-f0812effc72) [Scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
* **Adversarial Debiasing**: learns a classifier to maximize prediction accuracy and simultaneously reduce an adversary's ability to determine the protected attribute from the predictions. [Documentation.](https://aif360.readthedocs.io/en/latest/modules/inprocessing.html#adversarial-debiasing)

The Adversarial Debiasing model is only available for use when using AI Fairness 360 or fairkit-learn.


## Bias Mitigating Algorithms

When using AI Fairness 360 and fairkit-learn, you will have access to the following bias mitigating pre- and post-processing algorithms:

* **Pre-processing algorithms**
    - *Disparate Impact Remover*: a pre-processing technique that edits feature values to increase group fairness while preserving rank-ordering within groups
    - *Reweighing*: a pre-processing technique that weights the examples in each (group, label) combination differently to ensure fairness before classification
    
    
* **Post-processing algorithms**
    - *Calibrated Equalized Odds*: a post-processing technique that optimizes over calibrated classifier score outputs to find probabilities with which to change output labels with an equalized odds objective
    - *Reject Option Classification*: a post-processing technique that gives favorable outcomes to unprivileged groups and unfavorable outcomes to privileged groups in a confidence band around the decision boundary with the highest uncertainty
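
To make the idea behind *Reweighing* more concrete, here is a minimal sketch in plain Python. It assumes the commonly used weighting rule, where each example in a (group, label) combination receives the weight P(group) * P(label) / P(group, label). The function name `reweighing_weights` and the toy data are made up for illustration; this is not the AI Fairness 360 implementation.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Return one weight per example: P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    weights = []
    for g, y in zip(groups, labels):
        # Probability of this combination if group and label were independent
        expected = (group_counts[g] / n) * (label_counts[y] / n)
        # Probability of this combination actually observed in the data
        observed = joint_counts[(g, y)] / n
        weights.append(expected / observed)
    return weights

# Toy data (made up): group 1 = privileged, label 1 = favorable outcome
groups = [1, 1, 1, 1, 0, 0, 0, 0]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
weights = reweighing_weights(groups, labels)
print([round(w, 3) for w in weights])
```

In this toy data, favorable labels are over-represented in the privileged group, so those examples are down-weighted, while the rarer combinations are up-weighted.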


## Model Evaluation Metrics

To evaluate your trained models, you will be using one or more of the following metrics:

* **Performance metrics**:
    - *Accuracy Score* (UnifiedMetricLibrary.accuracy_score). When evaluating a model with this metric, the goal is to *maximize* the value.
    
    
* **Fairness Metrics**:
    - *Equal Opportunity Difference* (UnifiedMetricLibrary.equal_opportunity_difference), also known as "true positive rate difference". When evaluating a model with this metric, the goal is to *minimize* the value.
    - *Average Odds Difference* (UnifiedMetricLibrary.average_odds_difference). When evaluating a model with this metric, the goal is to *minimize* the value.
    - *Statistical Parity Difference* (UnifiedMetricLibrary.mean_difference), also known as "mean difference". When evaluating a model with this metric, the goal is to *minimize* the value.
    - *Disparate Impact* (UnifiedMetricLibrary.disparate_impact). When evaluating a model with this metric, the goal is to *maximize* the value.
    
    
* **Overall Model Quality**:
    - *Classifier Quality Score* (classifier_quality_score). When evaluating a model with this metric, the goal is to *maximize* the value.
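
If it helps to see what the fairness metrics measure, the sketch below computes them by hand in plain Python. It assumes the usual definitions, with each value computed as "unprivileged minus privileged" (or "unprivileged divided by privileged" for disparate impact). The helper names and toy data are made up for illustration, and the classifier quality score is not covered because its formula is not given in this notebook.

```python
def rate(values):
    """Mean of a list of 0/1 values; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0

def group_rates(y_true, y_pred, group, g):
    """Selection rate, true positive rate and false positive rate for group g."""
    rows = [(t, p) for t, p, a in zip(y_true, y_pred, group) if a == g]
    selection = rate([p for _, p in rows])
    tpr = rate([p for t, p in rows if t == 1])
    fpr = rate([p for t, p in rows if t == 0])
    return selection, tpr, fpr

def fairness_report(y_true, y_pred, group):
    """Metrics computed as unprivileged (group 0) versus privileged (group 1)."""
    sel_u, tpr_u, fpr_u = group_rates(y_true, y_pred, group, 0)
    sel_p, tpr_p, fpr_p = group_rates(y_true, y_pred, group, 1)
    return {
        "accuracy": rate([int(t == p) for t, p in zip(y_true, y_pred)]),
        "statistical_parity_difference": sel_u - sel_p,
        "disparate_impact": sel_u / sel_p if sel_p else float("nan"),
        "equal_opportunity_difference": tpr_u - tpr_p,
        "average_odds_difference": 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p)),
    }

# Toy data (made up): 1 = favorable outcome, group 1 = privileged
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = [1, 1, 1, 1, 0, 0, 0, 0]
report = fairness_report(y_true, y_pred, group)
for name, value in report.items():
    print(name, round(value, 3))
```

Under these definitions, the difference metrics are signed, so 0 indicates parity between the two groups, and a disparate impact of 1 indicates equal selection rates. Whether UnifiedMetricLibrary reports signed or absolute values is not shown in this notebook, so check its output before comparing models.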

## Getting started 

Before beginning task 1, make sure to run the following cell to import all necessary packages. If you need any additional packages, add the import statement(s) to the cell below and re-run the cell before adding and running code that uses the additional packages.


In [None]:
# Load all necessary packages
import numpy as np
import sklearn as skl
import six
import tensorflow as tf

# dataset
from aif360.datasets import CompasDataset

# metrics
from fklearn.metric_library import UnifiedMetricLibrary, classifier_quality_score

# models
from fklearn.scikit_learn_wrapper import LogisticRegression, KNeighborsClassifier, RandomForestClassifier, SVC
from aif360.algorithms.inprocessing import AdversarialDebiasing

# pre/post-processing algorithms
from aif360.algorithms.preprocessing import DisparateImpactRemover, Reweighing
from aif360.algorithms.postprocessing import CalibratedEqOddsPostprocessing, RejectOptionClassification

# search
from fklearn.fair_selection_aif import ModelSearch, DEFAULT_ADB_PARAMS

# Tutorial 1: fairkit-learn

First, we show you how to train and evaluate models using fairkit-learn. You will use the knowledge from this tutorial to complete Task 1, so please read thoroughly and execute the code cells in order.

## Step 1: Import the dataset

First we need to import the dataset we will use for training and testing our model.

Below, we provide code that imports the COMPAS recidivism dataset. **Note: a warning may pop up when you run this cell. As long as you don't see any errors in the code, it is fine to continue.**

In [None]:
data_orig = CompasDataset()

## Step 2: Set protected attributes

To use the grid search functionality provided by fairkit-learn, we need to specify the privileged and unprivileged (protected) attributes.

Below we provide code that stores the protected attributes (*race* is 0 for "Not Caucasian", *sex* is 0 for "Male").
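
As an aside (separate from the notebook cell that follows), the sketch below illustrates how such group specifications select rows. It assumes that each dictionary describes one group and that a row belongs to that group only if *all* of its key/value pairs match, which is our reading of the AI Fairness 360 convention. The helper `in_any_group` and the example rows are hypothetical.

```python
def in_any_group(row, group_specs):
    """True if row matches every key/value pair of at least one spec."""
    return any(
        all(row.get(name) == value for name, value in spec.items())
        for spec in group_specs
    )

# Same encoding as this notebook: race 0 = "Not Caucasian", sex 0 = "Male"
example_unprivileged = [{'race': 0, 'sex': 0}]
example_privileged = [{'race': 1, 'sex': 1}]

# Hypothetical rows covering all four race/sex combinations
rows = [
    {'race': 0, 'sex': 0},
    {'race': 1, 'sex': 1},
    {'race': 0, 'sex': 1},
    {'race': 1, 'sex': 0},
]

def describe(row):
    """Label a row by the first group specification it matches."""
    if in_any_group(row, example_unprivileged):
        return 'unprivileged'
    if in_any_group(row, example_privileged):
        return 'privileged'
    return 'neither'

membership = [describe(row) for row in rows]
print(membership)
```

Under this assumption, rows that match only one of the two attributes fall into neither group, which is worth keeping in mind when interpreting the fairness metrics.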

In [None]:
unprivileged = [{'race': 0, 'sex': 0}]
privileged = [{'race': 1, 'sex': 1}]

## Step 3: Specify parameters for grid search

Now we need to specify the various parameters required for the grid search provided by fairkit-learn. Each search parameter is a dictionary of options to include in the search. For each search parameter, you can input one or multiple options to consider. 
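
To get a feel for how quickly a grid grows, the sketch below expands a dictionary of hyperparameter options into every combination, using the same `penalty` and `C` values as the Logistic Regression entry in the cell further down. The helper `expand_grid` is made up for illustration; how `ModelSearch` enumerates configurations internally is not shown in this notebook.

```python
from itertools import product

def expand_grid(param_grid):
    """Return every combination of values in a {name: [values]} dictionary."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

grid = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1]}
combos = expand_grid(grid)
for combo in combos:
    print(combo)
print(len(combos), "hyperparameter combinations")
```

Two penalties times three values of `C` give six combinations for a single model. Each additional model, threshold, or pre-/post-processing option can multiply the work further, which is why smaller searches finish faster.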

Below we provide code that sets parameters for a simple grid search across different hyper-parameter values for the Logistic Regression model, with and without the specified pre-/post-processing algorithms. We specify all performance and fairness metrics for the search -- given the way the classifier quality score is calculated, this cannot be added to the grid search and will be calculated later.

In [None]:
# we use one model here
models = {'LogisticRegression': LogisticRegression}

# here we add all the metrics we want to evaluate on (performance and fairness)
metrics = {'UnifiedMetricLibrary': [UnifiedMetricLibrary,
                                    'accuracy_score',
                                    'average_odds_difference',
                                    'statistical_parity_difference',
                                    'equal_opportunity_difference',
                                    'disparate_impact'
                                   ]
          }

# Hyperparameters may either be specified as a dictionary mapping strings to lists, or by an empty dictionary to
# use the default ones set by sklearn (or AIF360). The keys are the names of the hyperparameters, and the
# values are lists of possible values to form a grid search over (example shown with LogisticRegression)

# For the AdversarialDebiasing classifier, you would specify hyperparameters using the following dictionary
# entry:
# 'AdversarialDebiasing' : DEFAULT_ADB_PARAMS(unprivileged=unprivileged, privileged=privileged)

hyperparameters = {'LogisticRegression':{'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1]}}

# this parameter is needed for the search and does not need to be modified
thresholds = [i * 10.0/100 for i in range(5)]

# Specify pre/post-processors as a list of initialized AIF360 pre/post-processing instances; 
# you can also run without any pre/post-processing algorithms (empty list)

# Options: DisparateImpactRemover(), Reweighing(unprivileged_groups=unprivileged,privileged_groups=privileged), or both
preprocessors=[DisparateImpactRemover()]
# Options: CalibratedEqOddsPostprocessing(unprivileged_groups=unprivileged,privileged_groups=privileged), RejectOptionClassification(unprivileged_groups=unprivileged,privileged_groups=privileged), or both
postprocessors=[CalibratedEqOddsPostprocessing(unprivileged_groups=unprivileged,privileged_groups=privileged)]


## Step 4: Run the grid search

Now that we've set all the parameters necessary for the grid search, we're ready to run it. The output of the grid search is saved to a .csv file.

Below we provide code that creates and uses the `ModelSearch` object to run a grid search over the parameters we specified and saves the output to a .csv file in the specified directory.  **The search takes a while to complete. Wait until the search completes before attempting to execute more cells.**

**Note: warnings may appear during the search; as long as you don't see any code errors, it is fine to continue.**

In [None]:
Search = ModelSearch(models, metrics, hyperparameters, thresholds)
Search.grid_search(data_orig, privileged=privileged, unprivileged=unprivileged, preprocessors=preprocessors, postprocessors=postprocessors)

Search.to_csv("fklearn/interface/static/data/test-file.csv")

## Step 5: Render visualization of search results

Along with the ability to run a grid search, fairkit-learn also provides functionality to visualize the results of the grid search. Fairkit-learn uses Bokeh to render a visualization within the notebook, which you can use when completing the next task to explore trained models' performance and fairness.

The visualization includes a graph that plots the search results that are Pareto optimal. Each data point in the graph is a model with its own settings (e.g., hyper-parameters, pre/post processing). Each model class has its own color to make it easier to see which models are being shown in the visualization. To get more information on each model's settings, hover over the data point of interest; a tooltip will pop up with model settings.

Within the visualization, you can control what metrics and models are being included in the visualization. The drop down menus allow you to specify the x and y axes for the graph. The checklist below the list of models allows you to select which metrics can be considered in the graph.

To view the *Pareto frontier* for any two metrics (e.g., accuracy and disparate impact), select those two metrics from the drop down menu and **only** check those boxes in the checklist.
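
If the term is new to you, a model is Pareto optimal when no other model is at least as good on every selected metric and strictly better on at least one. The sketch below shows this for two metrics in plain Python. The function `pareto_front` and the result values are made up for illustration and are not part of fairkit-learn.

```python
def pareto_front(points, maximize=(True, True)):
    """Return the points that are not dominated by any other point."""
    def oriented(point):
        # Flip the sign of metrics that should be minimized
        return tuple(v if m else -v for v, m in zip(point, maximize))

    def dominates(a, b):
        a, b = oriented(a), oriented(b)
        at_least_as_good = all(x >= y for x, y in zip(a, b))
        strictly_better = any(x > y for x, y in zip(a, b))
        return at_least_as_good and strictly_better

    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (accuracy, disparate impact) pairs for five trained models
results = [(0.70, 0.95), (0.75, 0.80), (0.72, 0.78), (0.68, 0.90), (0.75, 0.85)]
front = pareto_front(results)
print(front)
```

Here both metrics are maximized, matching the goals stated for accuracy and disparate impact. For a metric you want to minimize, pass `False` in the matching position of `maximize`.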

Below we provide code to load Bokeh and plot the results from the search in the interactive plot.

In [None]:
# Import packages for visualization
from bokeh.io import output_notebook
from bokeh.application.handlers import FunctionHandler
from bokeh.application import Application

# load Bokeh
output_notebook()

In [None]:
from fklearn.interface.plot import *

# Define function that takes in a document and attaches the bokeh server to it
def modify_doc(doc):
    
    # Load custom styles (for notebook only)
    custom_css = Div(text="<link rel='stylesheet' type='text/css' href='fklearn/interface/static/css/styles-notebook.css'>")
    add_btn = Button(label="Add Plot", button_type="success")
    remove_btn = Button(label="Remove Plot", button_type="danger")

    # Construct our viewport
    l = layout([
        [custom_css],
        create_plot("fklearn/interface/static/data/test-file.csv")
    ], sizing_mode="fixed", css_classes=["layout-container"])

    doc.add_root(l)
    
# Set up the application
handler = FunctionHandler(modify_doc)
app = Application(handler)

# Render visualization in the notebook
show(app)

## Step 6: Export visualization (optional)

The visualization can be viewed within this notebook and re-rendered as many times as needed, but can also be exported for future viewing and comparison to other plots. You can export the visualization and relevant information by clicking the ``Export Plot`` button in the visualization. 

This will create two files: plot.png and plot.json.
plot.png is an image of the plot.
plot.json is a JSON file with the informational bits from the plot, such as what models are being shown and what metrics are selected.

Each time the export button is clicked, if plot.png and plot.json exist they are overwritten. If you wish to save plots for comparison, make sure you rename each file after export.
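
If you would rather rename the exported files from code than by hand, a small helper along these lines may be useful. It is only a sketch: the function `keep_export` is hypothetical, and the directory where the export lands is an assumption you will need to adjust. The demo runs in a temporary directory so no real exports are touched.

```python
import tempfile
from pathlib import Path

def keep_export(label, directory="."):
    """Rename plot.png / plot.json to plot-<label>.png / plot-<label>.json."""
    renamed = []
    for name in ("plot.png", "plot.json"):
        source = Path(directory) / name
        if source.exists():
            target = source.with_name(f"{source.stem}-{label}{source.suffix}")
            # Note: behavior when the target already exists depends on the OS
            source.rename(target)
            renamed.append(target.name)
    return renamed

# Demo with empty placeholder files in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "plot.png").write_bytes(b"")
    (Path(tmp) / "plot.json").write_text("{}")
    renamed = keep_export("logreg-run1", tmp)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(renamed)
```

Calling the helper with a different label after each export keeps earlier plots from being overwritten.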

## Step 7: Evaluate overall model quality

Now that we've explored the various model configurations and their performance and fairness, we are ready to select the model(s) that we want to evaluate for overall quality.

To do so, you will need to create and train the model(s) (with proper hyperparameters, pre-processing, and post-processing as specified in the search output) you selected and then evaluate overall model quality.

Below we provide (commented) code that shows how to initialize the various models and algorithms you have access to. You can use the code provided or modify it as you see fit when completing the task.


In [None]:
# split dataset for evaluation
# data_orig_train, data_orig_test = data_orig.split([0.7], shuffle=False)

# model is populated with default values; modifying parameters is allowed but optional
# model = LogisticRegression(penalty='l2', dual=False,tol=0.0001,C=1.0,
#                       fit_intercept=True,intercept_scaling=1,class_weight=None,
#                       random_state=None,solver='liblinear',max_iter=100, 
#                       multi_class='warn',verbose=0,warm_start=False,
#                       n_jobs=None)

#model = KNeighborsClassifier(n_neighbors=5,weights='uniform',algorithm='auto',
#                          leaf_size=30,p=2,metric='minkowski',metric_params=None,
#                          n_jobs=None)

#model = RandomForestClassifier(n_estimators='warn',criterion='gini',max_depth=None,
#                            min_samples_leaf=1,min_weight_fraction_leaf=0.0,
#                            min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, 
#                             random_state=None, verbose=0, warm_start=False, class_weight=None)

#model = SVC(C=1.0, kernel='rbf', degree=3, gamma='auto_deprecated', coef0=0.0, shrinking=True, 
#          probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, 
#          max_iter=-1, decision_function_shape='ovr', random_state=None)

# If this is not your first time creating the Adversarial Debiasing model, to avoid future errors,
# uncomment the code below before running the code that initializes the TensorFlow session and model:
# sess.close()
# tf.reset_default_graph()

#sess = tf.Session()
#model = AdversarialDebiasing(privileged_groups=privileged,
#                          unprivileged_groups=unprivileged,
#                          scope_name='debiased_classifier',
#                          debias=True,
#                          sess=sess)


# Disparate Impact Remover
# you can modify repair level (optional)
#pre_alg = DisparateImpactRemover(repair_level=1.0)
# training data
#pre_train_data = pre_alg.fit_transform(data_orig_train)
# test data
#pre_test_data = pre_alg.fit_transform(data_orig_test)




# Reweighing

#pre_alg = Reweighing(unprivileged_groups=unprivileged, privileged_groups=privileged)
# train (fit and transform must use the same variable names as the split above)
#pre_alg = pre_alg.fit(data_orig_train)
#pre_train_data = pre_alg.transform(data_orig_train)
# test
#pre_alg = pre_alg.fit(data_orig_test)
#pre_test_data = pre_alg.transform(data_orig_test)


# train model with pre-processed data
#model.fit(pre_train_data)

# train model with original data 
# model.fit(data_orig_train)


# process trained model
# Calibrated Equalized Odds
#post_alg = CalibratedEqOddsPostprocessing(unprivileged_groups=unprivileged,
#                                         privileged_groups=privileged,
#                                         cost_constraint='weighted',
#                                         seed=None)



# Reject Option Classification 
# With this algorithm, you can specify "metric_name" with the metric you want to optimize for.
# The options are "Statistical parity difference", "Average odds difference", or "Equal opportunity difference"

# post_alg = RejectOptionClassification(unprivileged_groups=unprivileged,
#                                      privileged_groups=privileged,
#                                      low_class_thresh=0.01,
#                                      high_class_thresh=0.99,num_class_thresh=100, 
#                                      num_ROC_margin=50,metric_name='Statistical parity difference',
#                                      metric_ub=0.05, metric_lb=-0.05)



# test with pre-processed test data
#predictions = model.predict(pre_test_data)

# test with original test data
# predictions = model.predict(data_orig_test)


# fit with post-processing model using pre-processed data
#post_model = post_alg.fit(pre_test_data, predictions)

# fit with post-processing model using original data
# post_model = post_alg.fit(data_orig_test, predictions)



# update predictions using post-processed model
#predictions = post_model.predict(pre_test_data)

# evaluate overall model quality on post-processed model
#quality_score = classifier_quality_score(post_model, predictions, 
#                                             unprivileged_groups=unprivileged, 
#                                             privileged_groups=privileged)

# evaluate overall model quality on model without post-processing
#quality_score = classifier_quality_score(model, predictions, 
#                                             unprivileged_groups=unprivileged, 
#                                             privileged_groups=privileged)

#print("Overall quality = " + str(quality_score))

# Task 1: Model evaluation with fairkit-learn

Your turn! Use what you learned in the above tutorial to train and evaluate models for performance, fairness, and overall quality. You will use functionality provided by fairkit-learn to meet the following goals:

1. **Describe a model you believe will perform the best (e.g., have the highest accuracy score).** 

2. **Describe a model you believe will be the most fair, regardless of performance (e.g., minimizes the value of difference fairness metrics or maximizes disparate impact).** 

3. **Describe a model you believe will best balance both performance and fairness (e.g., have the highest classifier quality score).** 

Make sure you include any modifications to model hyper-parameters and any pre-/post-processing algorithms used. **As a reminder, there is no "absolute best" model for each of the above goals. You are expected to explore the space of model configurations available to find a model that best meets the above goals.**

**Keep in mind, training machine learning models is often a time-intensive endeavor.** One way to reduce the time needed for this task is to shrink the search space (e.g., the number of models included in a single search). You can also reduce the number of times you have to train a given model before evaluating it by putting the code that initializes and trains your model(s) in its own separate cell and executing that cell only when needed.

## Submitting your response 

Once you feel you've met the above goals, go to the Evaluating ML Models Exercise Response Form to enter your responses under the section labeled 'Task 1'. 

If you haven't opened/started a response form yet, click <a href="https://form.jotform.com/92474488429169" target="_blank">here</a> to get started.

If you accidentally closed your response form, check your email for the link to re-open it.

In [1]:
# TODO : Use this cell to write code for completing task 1



When you're ready to go on to the next task, open a new tab and click <a href="http://localhost:8888/notebooks/Task_2.ipynb" target="_blank">here</a>.