# Introduction

This fourth notebook shows how to scan models for their trust scores with Certifai, using a previously created scan definition as the starting point.

If you have not already done so, please run the [first notebook](patient-readmission-train.ipynb) to train the models to be explained and the [second notebook](patient-readmission-explain-scan.ipynb) to create the `explain-scan-def.yaml` scan definition.

In this notebook, we will:
1. Load the previously saved `explain-scan-def.yaml` scan definition and models
2. Modify this scan definition to scan for the trust scores (fairness, explainability, robustness)
3. Run the scan
4. View the results in the Console


In [1]:
import numpy as np
import pandas as pd
import pickle
import pprint

from certifai.scanner.builder import (CertifaiScanBuilder, CertifaiGroupingFeature, 
                                      CertifaiPredictorWrapper, CertifaiModel, 
                                      CertifaiDataset, CertifaiDatasetSource)

# Loading the Certifai Scan object

In this section, we load the previously defined scan definition to use as a starting point. This saves us from recreating the information about the prediction task, datasets and feature schema.

Load the scan definition from file.

In [2]:
scan = CertifaiScanBuilder.from_file('explain-scan-def.yaml')

Because we're running with local models in the notebook, we need to reload the models and reassociate them with the scan. This isn't necessary if the models are running externally.

In [3]:
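from encoder import Encoder

for model_name in ['logit', 'mlp']:
    # Load the pickled model from file
    with open(f'readmission_{model_name}.pkl', 'rb') as f:
        saved = pickle.load(f)
    # Wrap the model together with its encoder for use as a local predictor
    model = CertifaiPredictorWrapper(saved.get('model'), encoder=Encoder())
    # Replace the model entry from the loaded definition with the local predictor
    scan.remove_model(model_name)
    scan.add_model(CertifaiModel(model_name, local_predictor=model))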

# Create and run the Trust Scan

In this section, we modify the scan definition to run a scan for fairness, explainability and robustness. We could keep explanations in this new scan, but for now we'll omit them.

First, remove the explanation evaluation and add fairness, explainability and robustness.

In [4]:
scan.remove_evaluation_type('explanation')
scan.add_evaluation_type('fairness')
scan.add_evaluation_type('robustness')
scan.add_evaluation_type('explainability')

Note that in this use case, 'fairness' is less about bias against protected groups and more about detecting whether specific groups are more prone to unfavorable outcomes. To scan for these disparities, we need to provide some additional information on the groups to be scanned. We'll choose race, gender and age.

These results need to be interpreted with care. [Racial/Ethnic Disparities in Readmissions in US Hospitals: The Role of Insurance Coverage](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5946640/) shows that lower readmission rates may not always be construed as a good outcome, and could relate to a lack of insurance coverage and poor access to care.

In [5]:
scan.add_fairness_grouping_feature(CertifaiGroupingFeature('race'))
scan.add_fairness_grouping_feature(CertifaiGroupingFeature('gender'))
scan.add_fairness_grouping_feature(CertifaiGroupingFeature('age'))

When running a fairness analysis, it is important that there are sufficient samples of each class within the grouping features. We can check this and other potential issues using a preflight analysis. It will also give us a time estimate for each evaluation.

In [6]:
preflight_result = scan.run_preflight()
pprint.pprint(preflight_result)

Starting Preflight Scan
[--------------------] 2020-10-04 12:23:04.361895 - 0 of 8 checks (0.0% complete) - Running model nondeterminism preflight check for model logit
[##------------------] 2020-10-04 12:23:04.408516 - 1 of 8 checks (12.5% complete) - Running scan time estimate preflight check for model logit




[#####---------------] 2020-10-04 12:24:04.340559 - 2 of 8 checks (25.0% complete) - Running unknown outcome class preflight check for model logit
[#######-------------] 2020-10-04 12:24:04.362417 - 3 of 8 checks (37.5% complete) - Running fairness class samples preflight check for model logit
[##########----------] 2020-10-04 12:24:04.760288 - 4 of 8 checks (50.0% complete) - Finished all preflight checks for model logit
[##########----------] 2020-10-04 12:24:04.760464 - 4 of 8 checks (50.0% complete) - Running model nondeterminism preflight check for model mlp
[############--------] 2020-10-04 12:24:04.808456 - 5 of 8 checks (62.5% complete) - Running scan time estimate preflight check for model mlp




[###############-----] 2020-10-04 12:25:21.620083 - 6 of 8 checks (75.0% complete) - Running unknown outcome class preflight check for model mlp
[#################---] 2020-10-04 12:25:21.643150 - 7 of 8 checks (87.5% complete) - Running fairness class samples preflight check for model mlp
[####################] 2020-10-04 12:25:22.039223 - 8 of 8 checks (100.0% complete) - Finished all preflight checks for model mlp
{'logit': {'errors': [],
           'messages': ['Passed model non determinism check',
                        'Expected time for fairness analysis is 540 seconds',
                        'Expected time for robustness analysis is 63 seconds',
                        'Expected time for explainability analysis is 58 '
                        'seconds',
                        'Model logit passed time estimation check',
                        'Passed unknown outcome classes check'],
                        "size for 'Unknown/Invalid' with 3 examples"]},
 'mlp': {'errors': [

In the warnings, 'gender' has a small sample size (3) for 'Unknown/Invalid'. We will address this by dropping those rows, since they are a tiny proportion of the data and are not a useful class for analysis.
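With several models, it can help to pull only the errors and warnings out of the preflight result. The helper below is a minimal sketch, not part of the Certifai API. It assumes the result is a dict keyed by model name whose values hold lists under `'errors'` and `'warnings'`. The `'warnings'` key is inferred from the discussion above, and the `example` dict is hypothetical data in that assumed shape.

```python
def summarize_preflight(preflight_result):
    """Return one (model, level, text) tuple per error or warning."""
    findings = []
    for model_name, report in preflight_result.items():
        for level in ("errors", "warnings"):
            # .get() tolerates a report that lacks one of the keys
            for text in report.get(level, []):
                findings.append((model_name, level, text))
    return findings


# Hypothetical result in the assumed shape, not real Certifai output
example = {
    "logit": {
        "errors": [],
        "messages": ["Passed unknown outcome classes check"],
        "warnings": ["small sample size for a grouping class"],
    },
    "mlp": {"errors": [], "messages": [], "warnings": []},
}

for model_name, level, text in summarize_preflight(example):
    print(f"{model_name} [{level}]: {text}")
```

In the notebook, the same helper could be called on `preflight_result`, provided it has the assumed structure.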

In other cases (e.g. for underrepresented age ranges), a better approach is to combine smaller classes into one larger class using bucketing. This is illustrated in the [Practical Issues](cortex-certifai-examples/notebooks/practical_issues/PracticalIssues.ipynb) notebook.
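As a rough illustration of bucketing, the sketch below merges every class with fewer than `min_count` samples into a single class using pandas. It is not taken from the Practical Issues notebook. The helper name `bucket_small_classes`, the `"Other"` label, the threshold and the toy age values are all hypothetical, and it assumes a raw categorical column rather than a one-hot encoded one.

```python
import pandas as pd


def bucket_small_classes(series, min_count, other_label="Other"):
    """Replace classes with fewer than min_count samples by other_label."""
    counts = series.value_counts()
    small = counts[counts < min_count].index
    # Keep values from large classes; overwrite the rest with other_label
    return series.where(~series.isin(small), other_label)


# Toy data: two well-populated age ranges and two with a single sample each
ages = pd.Series(["[50-60)"] * 5 + ["[60-70)"] * 4 + ["[0-10)", "[10-20)"])
bucketed = bucket_small_classes(ages, min_count=3)
print(bucketed.value_counts())
```

For ordered classes such as age ranges, merging adjacent ranges is likely to give a more meaningful group than a catch-all class. The catch-all is used here only to keep the sketch short.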

Drop rows with 'gender_Unknown/Invalid' and replace the evaluation dataset in the scan.

In [7]:
df = pd.read_csv('diabetic_data_processed.csv')
scan.remove_dataset('evaluation')
eval_dataset = CertifaiDataset('evaluation',
                                CertifaiDatasetSource.dataframe(df[df['gender_Unknown/Invalid'] == 0]))
scan.add_dataset(eval_dataset)
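Before rerunning the scan, it can be worth confirming that the filter removed the small class. The sketch below counts samples per class for one-hot encoded columns on a toy DataFrame. Only `gender_Unknown/Invalid` comes from this notebook. The helper `one_hot_class_counts` and the columns `gender_Female` and `gender_Male` are assumptions made for illustration.

```python
import pandas as pd


def one_hot_class_counts(df, prefix):
    """Count rows per class for one-hot columns sharing the given prefix."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    return df[cols].sum().astype(int).to_dict()


# Toy data with assumed column names; the real dataset has many more rows
toy = pd.DataFrame({
    "gender_Female": [1, 0, 1, 0],
    "gender_Male": [0, 1, 0, 0],
    "gender_Unknown/Invalid": [0, 0, 0, 1],
})

# Same filter as in the cell above
filtered = toy[toy["gender_Unknown/Invalid"] == 0]

print(one_hot_class_counts(toy, "gender_"))
print(one_hot_class_counts(filtered, "gender_"))
```

On the real data, the same call on the filtered DataFrame should report zero rows for `gender_Unknown/Invalid`.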

Save the scan definition and run the scan.

In [8]:
# Write the modified scan definition to file
with open('trust-scan-def.yaml', 'w') as f:
    scan.save(f)
# Run the scan and write the reports
results = scan.run(write_reports=True)

Starting scan with model_use_case_id: 'readmission' and scan_id: 'e9ffd5d26bb7', total estimated time is 25 minutes
[--------------------] 2020-10-04 12:25:28.110357 - 0 of 6 reports (0.0% complete) - Running fairness evaluation for model: logit, estimated time is 540 seconds




[###-----------------] 2020-10-04 13:11:21.181134 - 1 of 6 reports (16.67% complete) - Running robustness evaluation for model: logit, estimated time is 63 seconds
[######--------------] 2020-10-04 13:13:07.149930 - 2 of 6 reports (33.33% complete) - Running explainability evaluation for model: logit, estimated time is 58 seconds
[##########----------] 2020-10-04 13:14:29.533358 - 3 of 6 reports (50.0% complete) - Running fairness evaluation for model: mlp, estimated time is 659 seconds




[#############-------] 2020-10-04 14:04:26.739811 - 4 of 6 reports (66.67% complete) - Running robustness evaluation for model: mlp, estimated time is 75 seconds
[################----] 2020-10-04 14:05:44.046731 - 5 of 6 reports (83.33% complete) - Running explainability evaluation for model: mlp, estimated time is 75 seconds
[####################] 2020-10-04 14:07:25.433319 - 6 of 6 reports (100.0% complete) - Completed all evaluations


# View the Results

To view the results in the Certifai Console, run the CLI command `certifai console` from this folder.
Then go to `http://localhost:8000` in your browser.

The results can also be analyzed in a notebook. See the [fifth notebook](patient-readmission-trust-results.ipynb) for how to load and work with the results of the trust score scan in a separate notebook.