Copyright (c) 2020. Cognitive Scale Inc. All rights reserved.
Licensed under CognitiveScale Example Code [License](https://github.com/CognitiveScale/cortex-certifai-examples/blob/7998b8a481fccd467463deb1fc46d19622079b0e/LICENSE.md)

# Introduction

This fourth notebook shows how to use Certifai to scan models for their trust scores, starting from a previously created scan definition.

If you have not already done so, please run the [first notebook](patient-readmission-train.ipynb) to train the models to be explained and the [second notebook](patient-readmission-explain-scan.ipynb) to create the `explain-scan-def.yaml` scan definition.

In this notebook, we will:
1. Load the previously saved `explain-scan-def.yaml` scan definition and models
2. Modify this scan definition to scan for the trust scores (fairness, explainability, robustness)
3. Run a preflight check and then the scan
4. View the results in the Console


In [1]:
import numpy as np
import pandas as pd
import pickle
import pprint

from certifai.scanner.builder import (CertifaiScanBuilder, CertifaiGroupingFeature, 
                                      CertifaiPredictorWrapper, CertifaiModel, 
                                      CertifaiDataset, CertifaiDatasetSource)

# Loading the Certifai Scan object

In this section, we load the previously defined scan definition to use as a starting point. This avoids having to recreate the information about the prediction task, datasets and feature schema.

Load the scan definition from file.

In [2]:
scan = CertifaiScanBuilder.from_file('explain-scan-def.yaml')



Because we're running with local models in the notebook, we need to reload the models and reassociate them with the scan. If the models were running externally, we would instead update the scan definition with the correct `predict_endpoint` URL.

In [3]:
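for model_name in ['logit', 'mlp']:
    # Each pickle file holds a dict-like object; the trained model is stored under 'model'
    with open(f'readmission_{model_name}.pkl', 'rb') as f:
        saved = pickle.load(f)
    model = CertifaiPredictorWrapper(saved.get('model'))
    # Replace the model entry from the scan definition with the local predictor
    scan.remove_model(model_name)
    scan.add_model(CertifaiModel(model_name, local_predictor=model))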

# Create and run the Trust Scan

In this section, we modify the scan definition to run a scan for fairness, explainability and robustness. We could include explanations in this new scan, but for now we'll omit them.

First, remove the explanation evaluation and add fairness, explainability and robustness.

In [4]:
scan.remove_evaluation_type('explanation')
scan.add_evaluation_type('fairness')
scan.add_evaluation_type('robustness')
scan.add_evaluation_type('explainability')

Note that in this analysis, 'fairness' is less about discrimination against protected groups and more about detecting whether specific groups are more likely to receive unfavorable predictions.

To scan for fairness, we need to provide some additional information on the groups to be scanned. We'll choose race, gender and age.

In [5]:
scan.add_fairness_grouping_feature(CertifaiGroupingFeature('race'))
scan.add_fairness_grouping_feature(CertifaiGroupingFeature('gender'))
scan.add_fairness_grouping_feature(CertifaiGroupingFeature('age'))

These results need to be interpreted with care.  [Racial/Ethnic Disparities in Readmissions in US Hospitals: The Role of Insurance Coverage](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5946640/) shows that lower readmission rates may not always be construed as a good outcome, and could relate to a lack of insurance coverage and poor access to care. 

Further, for a use case such as this, an alternative measure such as equalized odds or equal opportunity may be more appropriate than the default 'burden' based fairness metric. These measures show where the model has different success rates in prediction across groups, and Certifai can compute them provided that ground truth is available. The [fairness metrics notebook](https://github.com/CognitiveScale/cortex-certifai-examples/blob/master/notebooks/fairness_metrics/FairnessMetrics.ipynb) illustrates using alternative metrics.
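To make the idea of "different success rates across groups" concrete, here is a minimal sketch in plain Python. It is not Certifai's implementation. The helper `rates_by_group` and the toy data are hypothetical and exist only for illustration. Equal opportunity compares the true positive rate (TPR) across groups, and equalized odds compares both the TPR and the false positive rate (FPR).

```python
from collections import defaultdict


def rates_by_group(y_true, y_pred, groups, positive=1):
    """Return {group: (true_positive_rate, false_positive_rate)}."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == positive:
            key = "tp" if pred == positive else "fn"
        else:
            key = "fp" if pred == positive else "tn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        pos = c["tp"] + c["fn"]
        neg = c["fp"] + c["tn"]
        # A rate is undefined (NaN) when a group has no examples of that class
        tpr = c["tp"] / pos if pos else float("nan")
        fpr = c["fp"] / neg if neg else float("nan")
        rates[group] = (tpr, fpr)
    return rates


# Toy data, invented for illustration only
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = rates_by_group(y_true, y_pred, groups)
for group, (tpr, fpr) in sorted(rates.items()):
    print(f"group {group}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

A large gap in TPR between groups would indicate an equal opportunity concern. A gap in either TPR or FPR would indicate an equalized odds concern.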

When running a fairness analysis, it is important that there are sufficient samples of each class within the grouping features. We can check this and other potential issues using a preflight analysis. It will also give us a time estimate for each evaluation.

In [6]:
preflight_result = scan.run_preflight()
pprint.pprint(preflight_result)

Starting Preflight Scan
[--------------------] 2020-11-02 13:15:38.681479 - 0 of 8 checks (0.0% complete) - Running model nondeterminism preflight check for model logit
[##------------------] 2020-11-02 13:15:38.717233 - 1 of 8 checks (12.5% complete) - Running scan time estimate preflight check for model logit




[#####---------------] 2020-11-02 13:16:30.810780 - 2 of 8 checks (25.0% complete) - Running unknown outcome class preflight check for model logit
[#######-------------] 2020-11-02 13:16:30.831550 - 3 of 8 checks (37.5% complete) - Running fairness class samples preflight check for model logit
[##########----------] 2020-11-02 13:16:31.255051 - 4 of 8 checks (50.0% complete) - Finished all preflight checks for model logit
[##########----------] 2020-11-02 13:16:31.255189 - 4 of 8 checks (50.0% complete) - Running model nondeterminism preflight check for model mlp
[############--------] 2020-11-02 13:16:31.306776 - 5 of 8 checks (62.5% complete) - Running scan time estimate preflight check for model mlp




[###############-----] 2020-11-02 13:17:28.652213 - 6 of 8 checks (75.0% complete) - Running unknown outcome class preflight check for model mlp
[#################---] 2020-11-02 13:17:28.673776 - 7 of 8 checks (87.5% complete) - Running fairness class samples preflight check for model mlp
[####################] 2020-11-02 13:17:29.055711 - 8 of 8 checks (100.0% complete) - Finished all preflight checks for model mlp
{'logit': {'errors': [],
           'messages': ['Passed model non determinism check',
                        'Expected time for fairness analysis is 387 seconds',
                        'Expected time for robustness analysis is 48 seconds',
                        'Expected time for explainability analysis is 48 '
                        'seconds',
                        'Model logit passed time estimation check',
                        'Passed unknown outcome classes check'],
                        "size for 'Unknown/Invalid' with 3 examples"]},
 'mlp': {'errors': [

In the warnings, 'gender' has a small sample size (3) for 'Unknown/Invalid'. We will address this by dropping those rows, since they are a tiny proportion of the data and are not a useful class for analysis.
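The class sizes behind a warning like this can also be inspected directly with pandas. The following is a minimal sketch, assuming the processed data one-hot encodes each categorical feature into columns named `<feature>_<class>`, as the `gender_Unknown/Invalid` column suggests. The helper `one_hot_class_counts` and the toy columns `gender_Female` and `gender_Male` are hypothetical and are not taken from the dataset.

```python
import pandas as pd


def one_hot_class_counts(df, feature):
    """Count rows per class for a one-hot encoded feature such as 'gender'."""
    prefix = f"{feature}_"
    columns = [c for c in df.columns if c.startswith(prefix)]
    counts = df[columns].sum().astype(int)
    # Strip the feature prefix so the index shows only the class names
    counts.index = [c[len(prefix):] for c in columns]
    return counts.sort_values()


# Toy frame, invented for illustration; column names are assumptions
toy = pd.DataFrame({
    "gender_Female": [1, 0, 1, 0, 0],
    "gender_Male": [0, 1, 0, 1, 0],
    "gender_Unknown/Invalid": [0, 0, 0, 0, 1],
})

counts = one_hot_class_counts(toy, "gender")
print(counts)
```

Sorting in ascending order puts the smallest classes first, which are the ones most likely to trigger a sample size warning.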

In other cases (e.g. for underrepresented age ranges), a better approach is to combine smaller classes into one larger class using bucketing. This is illustrated in the [Practical Issues](https://github.com/CognitiveScale/cortex-certifai-examples/blob/master/notebooks/practical_issues/PracticalIssues.ipynb) notebook.
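As a rough illustration of the bucketing idea, the sketch below greedily merges adjacent ordered classes until each bucket reaches a minimum sample count. It only plans the grouping. How buckets are declared to Certifai is covered in the linked notebook and is not shown here. The helper `bucket_adjacent`, the class labels, the counts, and the threshold are all hypothetical.

```python
def bucket_adjacent(classes, counts, min_samples):
    """Greedily merge adjacent ordered classes into buckets of at least min_samples rows.

    Returns a list of (member_classes, total_count) tuples.
    """
    buckets = []
    current, total = [], 0
    for name, count in zip(classes, counts):
        current.append(name)
        total += count
        if total >= min_samples:
            buckets.append((current, total))
            current, total = [], 0
    if current:
        if buckets:
            # Trailing classes are too small on their own: fold them into the last bucket
            prev_names, prev_total = buckets.pop()
            buckets.append((prev_names + current, prev_total + total))
        else:
            # Every class combined is still below the threshold
            buckets.append((current, total))
    return buckets


# Toy age ranges and counts, invented for illustration only
classes = ["[0-10)", "[10-20)", "[20-30)", "[30-40)", "[40-50)"]
counts = [3, 12, 40, 200, 5]

buckets = bucket_adjacent(classes, counts, min_samples=50)
for members, total in buckets:
    print(f"{' + '.join(members)}: {total} rows")
```

Merging only adjacent classes keeps the buckets meaningful for an ordered feature such as age. An unordered feature might be better served by a single catch-all bucket.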

Drop rows with 'gender_Unknown/Invalid' and replace the evaluation dataset in the scan.

In [7]:
df = pd.read_csv('diabetic_data_processed.csv')
# Keep only rows that are not in the 'Unknown/Invalid' gender class
known_gender = df[df['gender_Unknown/Invalid'] == 0]
scan.remove_dataset('evaluation')
eval_dataset = CertifaiDataset('evaluation',
                               CertifaiDatasetSource.dataframe(known_gender))
scan.add_dataset(eval_dataset)

Save the scan definition and run the scan.

In [8]:
with open('trust-scan-def.yaml', "w") as f:
    scan.save(f)
results = scan.run(write_reports=True)



Starting scan with model_use_case_id: 'readmission' and scan_id: 'eafa5614d892', total estimated time is 17 minutes
[--------------------] 2020-11-02 13:17:35.132169 - 0 of 6 reports (0.0% complete) - Running fairness evaluation for model: logit, estimated time is 387 seconds




[###-----------------] 2020-11-02 13:48:51.449418 - 1 of 6 reports (16.67% complete) - Running robustness evaluation for model: logit, estimated time is 48 seconds
[######--------------] 2020-11-02 13:49:43.340452 - 2 of 6 reports (33.33% complete) - Running explainability evaluation for model: logit, estimated time is 48 seconds
[##########----------] 2020-11-02 13:50:52.592953 - 3 of 6 reports (50.0% complete) - Running fairness evaluation for model: mlp, estimated time is 422 seconds




[#############-------] 2020-11-02 14:25:14.558623 - 4 of 6 reports (66.67% complete) - Running robustness evaluation for model: mlp, estimated time is 51 seconds
[################----] 2020-11-02 14:26:08.660852 - 5 of 6 reports (83.33% complete) - Running explainability evaluation for model: mlp, estimated time is 51 seconds
[####################] 2020-11-02 14:27:17.787133 - 6 of 6 reports (100.0% complete) - Completed all evaluations


# View the Results

To view the results in the Certifai Console, run the CLI command `certifai console` from this folder, then go to `http://localhost:8000` in your browser.

The results can also be analyzed in a notebook. See the [fifth notebook](patient-readmission-trust-results.ipynb) for how to load and work with the results of the trust score scan in a separate notebook.