# Tutorial: Build Trusted ML models with Certifai on Azure Notebooks

This tutorial picks up after part one of the Azure regression tutorial, in which you prepared the NYC taxi data for regression modeling. Parts 1 and 2 of that tutorial can be found under My Projects on the [Azure Notebooks Portal](https://notebooks.azure.com/).

In this tutorial, you learn:

> * How to set up a Certifai scan from scratch
> * How to run Explainability, Fairness and Robustness scans
> * How to explore their results
> * How to log Certifai results to the Azure portal

If you don't have an Azure subscription, create a [free account](https://aka.ms/AMLfree) before you begin. 

If you don't have the Certifai Toolkit, get it now from our [Certifai page](https://www.cognitivescale.com/download-certifai/).

## Usage/Preparation Steps
There are two ways to run this tutorial:
1. On Azure Notebooks, by logging in to the [Azure Notebooks Portal](https://notebooks.azure.com/)
2. Locally


### 1. Running on Azure Notebooks Portal
1. In the [Azure Notebooks Portal](https://notebooks.azure.com/), go to My Projects and use the "Clone Repo" button on the top right to clone this project's [repo](https://github.com/mdungarov-cs/cortex_certifai_azure_notebooks_ny_taxi.git)
2. Download the toolkit from the [Certifai page](https://www.cognitivescale.com/download-certifai/) and unzip it
3. Upload the cat_encoder.py from certifai_toolkit/examples/notebooks
4. Upload the scanner and engine packages (and any other packages you need) from the certifai_toolkit/packages folder
5. Upload the requirements.txt file from the certifai_toolkit folder
6. Use the terminal or your notebook to `pip install` the packages from step 4, and remember to `pip install -r` the requirements file from step 5.  
NB: If using the hosted terminal to install dependencies, ensure it is in the right environment and running the right Python version.  
NB: If using the notebook to install dependencies, you might need to restart the notebook after installation.
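
To confirm that the notebook kernel is the environment you installed into, a quick check along these lines can help. It is only a sketch: the helper name `check_environment` is made up for illustration, and the module names are taken from the imports used later in this tutorial.

```python
import importlib.util
import sys


def check_environment(modules=("certifai", "azureml", "cat_encoder")):
    """Print the active interpreter and report which top-level modules can be found."""
    print(f"Interpreter: {sys.executable}")
    print(f"Python version: {sys.version.split()[0]}")
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        status[name] = importlib.util.find_spec(name) is not None
        print(f"{name}: {'found' if status[name] else 'MISSING'}")
    return status


status = check_environment()
```

If a module is reported as missing even though you installed it, the terminal and the notebook kernel are probably using different Python environments.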

You are all set!

### 2. Running locally

1. Ensure you have a Python 3.x notebook server
2. Install the Azure SDK dependencies: `pip install --upgrade azureml-sdk[notebooks,automl,widgets]`
3. Follow the instructions to install the Certifai toolkit and its dependencies from the [documentation page](https://cognitivescale.github.io/cortex-certifai/docs/toolkit/setup/install-certifai-cli-lib)

You are all set!

# Tutorial Contents

1. Data prep
2. Training an AutoML model
3. Model Selection for Certifai scan
4. Certifai Scan Setup
5. Review of Results and Evaluation

# Data prep

This part closely follows part 2 of the tutorial mentioned above, with minor changes to the data prep that Certifai needs to run properly.

We start by loading the data from part 1, selecting the relevant columns, and storing the result as a CSV file, since Certifai consumes the data directly from CSV. Some of the runs can make do with a smaller dataset, so we also prepare a `_sample` dataset with only 500 rows to shorten the run time.

In [None]:
import os
import azureml.dataprep as dprep

file_path = os.path.join(os.getcwd(), "dflows.dprep")
dflow_prepared = dprep.Dataflow.open(file_path)

dflow_reduced = dflow_prepared.keep_columns(['pickup_weekday','pickup_hour', 'distance',
                                             'passengers', 'vendor', 'cost'])

df=dflow_reduced.to_pandas_dataframe()

# NOTE: DATASET CANNOT HAVE AN INDEX COLUMN FOR CERTIFAI
df.to_csv('all_data_NY_Taxi.csv',index=False)
df.sample(500,random_state=0).to_csv('all_data_NY_Taxi_sample.csv',index=False)
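
The two details that matter in the cell above are `index=False` and the fixed `random_state`. The sketch below illustrates both on a made-up four-row frame (the values are placeholders, not taxi data): without `index=False`, pandas writes the index as an extra unnamed column, which is exactly what the note above warns against.

```python
import io

import pandas as pd

# Toy frame with placeholder values, only to illustrate the CSV behaviour
toy = pd.DataFrame({"distance": [1.2, 3.4, 0.8, 5.0],
                    "cost": [6.5, 14.0, 5.0, 19.5]})

# Default to_csv() writes the index as an unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
# index=False writes only the data columns
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))

# A fixed random_state makes the sample reproducible across runs
sample_a = toy.sample(2, random_state=0)
sample_b = toy.sample(2, random_state=0)
print(sample_a.equals(sample_b))
```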

Here, we split the data into train and test sets and prepare the `CatEncoder` by specifying which columns are categorical. The encoder lives in a separate file (`cat_encoder.py`), which is what enables multi-processing in the context of a notebook.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import numpy as np
import random
from cat_encoder import CatEncoder

In [None]:

base_path = '.'
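# File names must match the ones written during data prep exactly (paths are case-sensitive on Linux)
all_data_file = f"{base_path}/all_data_NY_Taxi.csv"
sample_data_file = f"{base_path}/all_data_NY_Taxi_sample.csv"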
df = pd.read_csv(all_data_file)

label_column = 'cost'

# Separate outcome
y = df[label_column]
X = df.drop(label_column, axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

cat_columns = [
    'vendor',
    'pickup_weekday',
    'passengers'
    ]
# Note - to support python multi-processing in the context of a notebook the encoder MUST
# be in a separate file, which is why `CatEncoder` is defined outside of this notebook
encoder = CatEncoder(cat_columns, X)
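
For readers who want a feel for what an encoder of this kind does, here is a minimal, hypothetical stand-in. `ToyCatEncoder` is not the toolkit's `CatEncoder`, and the real class may differ in column ordering, normalization and other details. The sketch only assumes what the cell above shows: the encoder is built from the categorical column names plus the data, and is then called on a NumPy array.

```python
import numpy as np
import pandas as pd


class ToyCatEncoder:
    """Hypothetical stand-in: one-hot encodes the named columns, passes the rest through."""

    def __init__(self, cat_columns, data):
        self.cat_idx = [data.columns.get_loc(c) for c in cat_columns]
        self.num_idx = [i for i in range(data.shape[1]) if i not in self.cat_idx]
        # Remember the sorted categories seen in each categorical column
        self.categories = [sorted(data.iloc[:, i].unique()) for i in self.cat_idx]

    def __call__(self, x):
        x = np.asarray(x, dtype=object)
        # Numeric columns come first, unchanged
        blocks = [x[:, self.num_idx].astype(float)]
        for i, cats in zip(self.cat_idx, self.categories):
            # One column per known category, 1.0 where the row matches it
            onehot = np.array([[1.0 if v == c else 0.0 for c in cats] for v in x[:, i]])
            blocks.append(onehot)
        return np.hstack(blocks)


toy = pd.DataFrame({"vendor": ["A", "B", "A"], "distance": [1.0, 2.5, 0.7]})
toy_encoder = ToyCatEncoder(["vendor"], toy)
encoded = toy_encoder(np.array(toy))
print(encoded)
```

A class defined inside a notebook cell lives in `__main__`, which worker processes generally cannot import, so keeping the encoder in its own module is what lets it be passed to other processes.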


# Training AutoML

The cells below run Azure AutoML in the same way as the Azure regression tutorial, after the minor modifications to the data and model encoder described above.

We will:
- Log in to our Azure Workspace
- Set up and run AutoML
- Review results and select models for further review using Certifai

### Azure Workspace Login

- Type in your workspace credentials
- Use those to create a config file
- Build the workspace from the config file

Please note that you need to populate the appropriate credentials.

In [None]:
subscription_id = os.getenv("SUBSCRIPTION_ID", default="<USER-SUBSCRIPTION-ID>")
resource_group = os.getenv("RESOURCE_GROUP", default="<USER-RESOURCE-GROUP>")
workspace_name = os.getenv("WORKSPACE_NAME", default="<USER-WORKSPACE-NAME>")
workspace_region = os.getenv("WORKSPACE_REGION", default="<USER-WORKSPACE-REGION>")  # e.g. 'eastus'

In [None]:
from azureml.core import Workspace

try:
    ws = Workspace.from_config()
except Exception:
    # No usable config file was found. Assuming you have set up a fresh resource group and workspace
    # in the Azure console, deploy with the following
    ws = Workspace.create(name=workspace_name,
                   subscription_id=subscription_id,
                   resource_group=resource_group,
                   create_resource_group=False,
                   location=workspace_region
                   )
    # write the details of the workspace to a configuration file to the notebook library
    ws.write_config()
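
If you would rather create the config file up front, the sketch below writes one by hand. The layout (three keys in `.azureml/config.json`) is our understanding of what `Workspace.from_config()` looks for, so treat it as an assumption and compare it with a config file downloaded from the Azure portal. The helper name `write_workspace_config` is made up for illustration, and the demo writes to a temporary directory so that no real configuration is overwritten.

```python
import json
import os
import tempfile


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write .azureml/config.json under `directory` and return its path."""
    config_dir = os.path.join(directory, ".azureml")
    os.makedirs(config_dir, exist_ok=True)
    config_path = os.path.join(config_dir, "config.json")
    with open(config_path, "w") as fh:
        json.dump(
            {
                "subscription_id": subscription_id,
                "resource_group": resource_group,
                "workspace_name": workspace_name,
            },
            fh,
            indent=4,
        )
    return config_path


# Demo in a temporary directory, using the same environment variables as above
demo_dir = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_dir,
    os.getenv("SUBSCRIPTION_ID", "<USER-SUBSCRIPTION-ID>"),
    os.getenv("RESOURCE_GROUP", "<USER-RESOURCE-GROUP>"),
    os.getenv("WORKSPACE_NAME", "<USER-WORKSPACE-NAME>"),
)
with open(config_path) as fh:
    print(fh.read())
```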

In [None]:
from azureml.core.experiment import Experiment
experiment = Experiment(ws, "taxi-experiment")

### Auto ML

Here we set up and run AutoML.

In [None]:
import logging

automl_settings = {
    "iteration_timeout_minutes": 2,
    "iterations": 20,
    "primary_metric": 'spearman_correlation',
    "featurization": 'auto',
    "verbosity": logging.INFO,
    "n_cross_validations": 5
}

In [None]:
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='regression',
                             debug_log='automated_ml_errors.log',
                             X=encoder(np.array(X_train)),
                             y=y_train.values.flatten(),
                             **automl_settings)

In [None]:

from azureml.core.experiment import Experiment
experiment = Experiment(ws, "taxi-experiment")
local_run = experiment.submit(automl_config, show_output=True)

In [None]:
from azureml.widgets import RunDetails
RunDetails(local_run).show()


# Model Selection for Certifai Scan

After training a range of models, we can now select a pair of the best-performing ones to evaluate with Certifai. Note that you can select any number of models to evaluate; we pick only two here for illustration.

From the above we see that the best-performing model is the Voting Ensemble, followed closely by a Stack Ensemble model. We will consider the Voting Ensemble to be our "Champion" model.

For our challenger, we will use the best-performing Extreme Random Trees model (Model 12). This is simply to demonstrate that in a Certifai context the model type does not make a difference, which allows us to compare any selection of models.

In [None]:
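# With no arguments, get_output() returns the best run and its fitted model
champ_run, champion_model = local_run.get_output()

# Iteration 12 is the Extreme Random Trees model selected above
challenger_run, challenger_model = local_run.get_output(iteration=12)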


# Certifai Scan Setup

The sections below take us through the scan setup. We start with the library imports.

In [None]:
from certifai.scanner.builder import (CertifaiScanBuilder, CertifaiPredictorWrapper, CertifaiModel, CertifaiModelMetric,
                                      CertifaiDataset, CertifaiGroupingFeature, CertifaiDatasetSource,
                                      CertifaiPredictionTask, CertifaiTaskOutcomes, CertifaiOutcomeValue)

### Task definition

This is the core of what we will evaluate. Specifically, the task has these characteristics:

- It is a regression task, as opposed to classification. The task is defined as `CertifaiTaskOutcomes.regression`.
- The favorable outcome is a reduction of the outcome variable, i.e. we consider it favorable for the taxi ride to cost less rather than more: `increased_favorable=False`.
- A significant change is 50% of the empirical standard deviation: `change_std_deviation=0.5`. In this case, the empirical standard deviation is about $9.6, hence we consider a significant change to be about $5.
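
The arithmetic behind the last point can be spelled out as below. The helper `significant_change` is purely illustrative, and how Certifai applies the threshold internally is not shown here. The standard deviation of 9.6 is the figure quoted in the text; on your own data you could obtain it with `df['cost'].std()`.

```python
def significant_change(std_dev, change_std_deviation):
    """Size of a change in the outcome that counts as significant, in outcome units."""
    return std_dev * change_std_deviation


# Empirical standard deviation of the fare (about $9.6) times change_std_deviation=0.5
threshold = significant_change(9.6, 0.5)
print(f"Significant change: ${threshold:.2f}")
```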


We start defining a scan by assigning it this prediction task. In the remaining steps we will add further characteristics to the scan.

In [None]:

task = CertifaiPredictionTask(CertifaiTaskOutcomes.regression(
        increased_favorable=False,
        change_std_deviation=0.5),
    prediction_description='Predict taxi fare based on features of the trip')

scan = CertifaiScanBuilder.create('test_user_case',
                                  prediction_task=task)

### Adding Models & Data to the Scan

We add the selected models to the scan. Here, each model is wrapped in a `CertifaiPredictorWrapper` together with the encoder.

We also add the datasets by specifying their locations. This is done via the `all_data_file` and `sample_data_file` values we set earlier when preparing the data for training. Notice that these are simply pointers to the location of each dataset.

In [None]:
# Wrap the model up for use by Certifai as a local model
champion_model_proxy = CertifaiPredictorWrapper(champion_model, encoder=encoder)
challenger_model_proxy = CertifaiPredictorWrapper(challenger_model, encoder=encoder)

# Add our local models
first_model = CertifaiModel('champion',
                            local_predictor=champion_model_proxy)
scan.add_model(first_model)

second_model = CertifaiModel('challenger',
                            local_predictor=challenger_model_proxy)
scan.add_model(second_model)

# Add the eval dataset
eval_dataset = CertifaiDataset('evaluation',
                               CertifaiDatasetSource.csv(all_data_file))
scan.add_dataset(eval_dataset)
explanation_dataset = CertifaiDataset('explanation',
                                      CertifaiDatasetSource.csv(sample_data_file))
scan.add_dataset(explanation_dataset)

scan.evaluation_dataset_id = 'evaluation'
# Explanations are generated for every row of the 500-row sample dataset, which gives a good number
# on which to base statistical measures while keeping the run time short
scan.explanation_dataset_id = 'explanation'

### Adding Evaluations

Adding evaluations is now very simple: list the ones you need, and they will be run by the scan and included in the result object.

Notice that for 'explanation' and 'robustness', simply adding the evaluation type is sufficient to have the report run. For 'fairness', we also need to specify the sensitive feature we want to assess fairness for, in this case the `passengers` feature.

In [None]:
# Set up explanation, robustness and fairness evaluations on the datasets and models above
scan.add_evaluation_type('explanation')
scan.add_evaluation_type('robustness')
scan.add_evaluation_type('fairness')

scan.add_fairness_grouping_feature(CertifaiGroupingFeature('passengers'))


# Run the scan.
# By default this will write the results into individual report files (one per model and evaluation
# type) in the 'reports' directory relative to the Jupyter root.  This may be disabled by specifying
# `write_reports=False` as below
# The result is a dictionary of dictionaries of reports.  The top level dict key is the evaluation type
# and the second level key is model id.
# Reports saved as JSON (which `write_reports=True` will do) may be visualized in the console app
result = scan.run(write_reports=False)

# Review of Certifai Scan Results

The results of the scan are stored in an extensive result variable. Here we would like to just preview the high-level outcomes.

## Explanations
We start with model-level explanations to gauge the main drivers of taxi fares.

Here we count how many times each feature was changed to generate a counterfactual, which gives a measure of its overall importance to the model.

The outcome is not surprising: by far the most important variable is distance traveled. Second, at only about a third of the frequency of the first, is pickup hour, which also makes sense because in some cases late evening and early morning hours are charged differently. The number of passengers and the vendor used seem to play the smallest role in determining the fare. All of this is good, as it validates our expectations of how taxi fares work and the sensibility of the models we have in place.

Crucially, this level of detail is rarely available from simply training a model, and interpretability might be lost due to latent and/or dummy features generated along the way. This analysis, however, is key to our understanding of, confidence in, and ultimately trust in the behavior of the model.
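
The counting idea can be sketched on toy data as follows. This is not the implementation in `tutorial_utils`, whose internals are not shown here: the function `feature_change_frequency` and all values are made up for illustration, assuming each original row is paired with one counterfactual row.

```python
import pandas as pd


def feature_change_frequency(originals, counterfactuals):
    """Count, per feature, how many counterfactuals differ from their original row."""
    changed = originals.reset_index(drop=True).ne(counterfactuals.reset_index(drop=True))
    return changed.sum().sort_values(ascending=False)


# Placeholder rows, not scan output
originals = pd.DataFrame({
    "distance": [2.0, 5.0, 1.0, 8.0],
    "pickup_hour": [9, 23, 14, 7],
    "passengers": [1, 2, 1, 4],
})
counterfactuals = pd.DataFrame({
    "distance": [1.2, 3.5, 0.6, 8.0],
    "pickup_hour": [9, 18, 14, 10],
    "passengers": [1, 2, 1, 3],
})

freq = feature_change_frequency(originals, counterfactuals)
print(freq)
```

A feature that has to change in most counterfactuals is one the model relies on heavily, which is why the resulting counts can be read as a rough importance ranking.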

In [None]:
%load_ext autoreload

In [None]:
%autoreload 2
from tutorial_utils import get_feature_frequency, plot_histogram, plot_fairness_burden

In [None]:
# Plot a histogram of frequency of occurrence of changes to each feature in counterfactuals
%matplotlib inline
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=[15,6])
fig.suptitle('Feature occurrence frequency by model', fontsize=20)

plot_histogram(ax1, 'champion', result)
plot_histogram(ax2, 'challenger', result)

# Put them both on the same scale
ylim = max(ax1.get_ylim()[1], ax2.get_ylim()[1])
ax1.set_ylim(top=ylim)
ax2.set_ylim(top=ylim)

plt.show()

## Fairness to different sized groups

Finally, we set out to determine the fairness of taxi fares between differently sized groups of riders.

Looking at the results, the burden on all groups is largely the same, with groups of 4 passengers seemingly receiving a slightly higher burden than others. However, those groups also come with a much wider confidence interval which, when taken into account, makes them not significantly different from other groups. Additionally, looking at group counts, groups of 4 are far less present in the sample: only 41 examples out of approximately 6,250.
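
The link between group size and interval width can be illustrated with a small sketch. It uses a normal-approximation interval for the mean, which is an assumption made here for illustration and not necessarily how Certifai computes its bounds, and the burden values are made up rather than taken from the scan.

```python
import math
import statistics


def mean_confidence_interval(values, z=1.96):
    """Normal-approximation 95% confidence interval for the mean of `values`."""
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Made-up burden values, NOT scan output: a small, noisy group and a large, stable one
small_group = [0.02, 0.09] * 20     # 40 examples, mean 0.055
large_group = [0.04, 0.06] * 1000   # 2000 examples, mean 0.050

ci_small = mean_confidence_interval(small_group)
ci_large = mean_confidence_interval(large_group)
print("small group CI:", ci_small)
print("large group CI:", ci_large)
print("overlap:", intervals_overlap(ci_small, ci_large))
```

The small group has the higher mean, but its interval is wide enough to contain the large group's interval entirely, so the difference between the two cannot be called significant.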


In [None]:
%matplotlib inline

from certifai.scanner.report_utils import scores, construct_scores_dataframe


df_rslt=construct_scores_dataframe(scores('fairness', result, max_depth=1))
display(df_rslt)

group_categories=[f"({i}.0)" for i in range(1,6)]
group_xlabels=['single']+[f'{i} passengers' for i in range(2,6)]

plot_fairness_burden(df_rslt,group_categories,group_xlabels)

# Logging Results

Finally, using the workspace we already set up, we assign a dedicated experiment to these runs. Here, we can log some key metrics for future reference.

In [None]:

experiment = Experiment(ws, "certifai-rslt")

run = experiment.start_logging()
run.log('Fairness-Champion',value=result['fairness']['champion']['fairness']['score'])
run.log('Fairness-Challenger',value=result['fairness']['challenger']['fairness']['score'])
run.complete()
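
When there are more models to log, the metric names and values can be collected in a loop first. The sketch below assumes the nesting used in the cell above, `result[evaluation][model_id][evaluation]['score']`, which this tutorial only shows for fairness. The helper `collect_scores` and the placeholder scores are hypothetical.

```python
def collect_scores(result, evaluation, model_ids):
    """Build {metric_name: score} from the nested result layout used above."""
    metrics = {}
    for model_id in model_ids:
        report = result.get(evaluation, {}).get(model_id)
        if report is None:
            continue  # this evaluation was not run for this model
        name = f"{evaluation.capitalize()}-{model_id.capitalize()}"
        metrics[name] = report[evaluation]["score"]
    return metrics


# Placeholder values with the same nesting as `result`; these are not real scan scores
placeholder_result = {
    "fairness": {
        "champion": {"fairness": {"score": 80.0}},
        "challenger": {"fairness": {"score": 75.0}},
    }
}

metrics = collect_scores(placeholder_result, "fairness", ["champion", "challenger"])
print(metrics)
```

Each entry can then be passed to `run.log(name, value=score)` inside the logging run shown above.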