# Introduction

This is the second notebook in this example of how to scan models using Certifai. If you have not already done so, please run the [first notebook](patient-readmission-train) to train the models to be explained.

In this notebook, we will:
1. Create a Certifai scan object with the information Certifai needs to explain the models
2. Run the explanations scan and save its definition for future use
3. Run a second scan, this time to get the trust scores (fairness, explainability, robustness)
4. View the results in the Console


In [1]:
import numpy as np
import pandas as pd
import pickle
import pprint

from certifai.scanner.builder import (CertifaiScanBuilder, CertifaiPredictorWrapper, CertifaiModel,
                                      CertifaiDataset, CertifaiDatasetSource, CertifaiGroupingFeature,
                                      CertifaiPredictionTask, CertifaiTaskOutcomes, CertifaiOutcomeValue,
                                      CertifaiFeatureDataType, CertifaiFeatureSchema, CertifaiDataSchema)

# Creating the Certifai Scan object

In this section, we create a Certifai scan object containing the information Certifai needs to run a scan that explains the models. This information consists of:
* Metadata about the prediction task being performed
* What evaluations to run
* The models to be scanned
* The datasets to be used
* Metadata about the datasets that is needed for the scan

Create a Certifai scan object, providing metadata about the prediction task that is performed by the models. Define the evaluations to be performed, which in this case is just 'explanation'.

In [2]:
task = CertifaiPredictionTask(CertifaiTaskOutcomes.classification(
    [
        CertifaiOutcomeValue(0, name='Not Readmitted', favorable=True),
        CertifaiOutcomeValue(1, name='Readmitted')
    ]),
    prediction_description='Determine whether a patient will be readmitted')

scan = CertifaiScanBuilder.create('readmission',
                                  prediction_task=task)
scan.add_evaluation_type('explanation')

Load the two models we saved in the first notebook, and wrap them so that they can be called by Certifai. Add these models into the scan object.

In [3]:
from encoder import Encoder

for model_name in ['logit', 'mlp']:
    with open(f'readmission_{model_name}.pkl', 'rb') as f:
        saved = pickle.load(f)
        model = CertifaiPredictorWrapper(saved.get('model'), encoder=Encoder())
        scan.add_model(CertifaiModel(model_name, local_predictor=model))

Add two datasets to the scan. The evaluation dataset is used by Certifai to create an initial population for the genetic algorithm used in the scan, and needs to be a representative sample of the expected data (at least about 1K rows, ideally 10-50K rows; larger is fine). The explanation dataset contains the points to be explained. Note that the time to run the scan grows linearly with the size of the explanation dataset, so it is best to keep it relatively small (in this case, 100 rows).

In [4]:
eval_dataset = CertifaiDataset('evaluation',
                               CertifaiDatasetSource.csv('diabetic_data_processed.csv'))
scan.add_dataset(eval_dataset)
scan.evaluation_dataset_id = 'evaluation'

df = pd.read_csv('diabetic_data_processed.csv')
explan_dataset = CertifaiDataset('explanation',
                                CertifaiDatasetSource.dataframe(df.sample(100)))
scan.add_dataset(explan_dataset)
scan.explanation_dataset_id = 'explanation'
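
Because `df.sample(100)` is called without a seed, a different set of rows is explained each time the notebook runs. If repeatable explanations are wanted, one option is to pass `random_state` to pandas. The sketch below uses a small synthetic dataframe; the column values and the seed are illustrative assumptions, not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the processed dataset (hypothetical values).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    'time_in_hospital': rng.integers(1, 15, size=1000),
    'readmitted': rng.integers(0, 2, size=1000),
})

# Fixing random_state makes the sample identical across runs.
sample_a = demo_df.sample(100, random_state=42)
sample_b = demo_df.sample(100, random_state=42)

print(len(sample_a), sample_a.index.equals(sample_b.index))
```

The same argument could be added to the `df.sample(100)` call in the cell above.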

Read in the metadata about one-hot encoding that we saved in the first notebook and use this to define the feature schema in the scan object. This tells Certifai how categorical values map to columns, both for the analysis and for presenting explanations.

In [6]:
with open('cat_value_mappings.pkl', 'rb') as f:
    cat_value_mappings = pickle.load(f)

cat_features = []
for feature, value_columns in cat_value_mappings.items():
    data_type = CertifaiFeatureDataType.categorical(value_columns=value_columns.items())
    feature_schema = CertifaiFeatureSchema(name=feature, data_type=data_type)
    cat_features.append(feature_schema)
schema = CertifaiDataSchema(features=cat_features)
scan.dataset_schema = schema
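
For readers who have not inspected `cat_value_mappings.pkl`, the sketch below shows one way such a mapping could be built from one-hot column names. It assumes that each feature maps to a dict of category value to column name and that columns follow a `feature_value` naming convention. Both are assumptions for illustration, and `build_value_mappings` is a hypothetical helper, not part of Certifai or the first notebook.

```python
import pandas as pd

# Hypothetical one-hot encoded frame; real column names come from the first notebook.
encoded = pd.DataFrame({
    'gender_Female': [1, 0, 0],
    'gender_Male': [0, 1, 1],
    'race_Asian': [0, 0, 1],
    'race_Other': [1, 1, 0],
    'time_in_hospital': [3, 5, 2],
})

def build_value_mappings(columns, features):
    """Map each categorical feature to {category value: one-hot column name}."""
    mappings = {}
    for feature in features:
        prefix = feature + '_'
        mappings[feature] = {
            col[len(prefix):]: col for col in columns if col.startswith(prefix)
        }
    return mappings

demo_mappings = build_value_mappings(encoded.columns, ['gender', 'race'])
print(demo_mappings)
```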


Tell Certifai about the label/outcome column in the dataset, so that it won't be passed in the predict calls or used in the genetic algorithm. 

In [7]:
scan.dataset_schema.outcome_feature_name = 'readmitted'

# Run the Explanations Scan

Run the scan, saving the results in the `reports` folder. 

In [8]:
results = scan.run(write_reports=True)

Starting scan with model_use_case_id: 'readmission' and scan_id: '9e5292a83fae', total estimated time is 2 minutes
[--------------------] 2020-10-03 20:06:54.056010 - 0 of 2 reports (0.0% complete) - Running explanation evaluation for model: logit, estimated time is 45 seconds
[##########----------] 2020-10-03 20:07:37.539557 - 1 of 2 reports (50.0% complete) - Running explanation evaluation for model: mlp, estimated time is 60 seconds
[####################] 2020-10-03 20:08:32.690934 - 2 of 2 reports (100.0% complete) - Completed all evaluations


Save the scan definition as a YAML file so that it can be rerun in the future, either in a notebook or from the CLI. This is useful, for example, to get explanations for additional datapoints, for updated models, or for a model that has been deployed as a service.

In [9]:
with open('explain-scan-def.yaml', "w") as f:
    scan.save(f)

The scan definition can be loaded into a new notebook using `CertifaiScanBuilder.from_file('explain-scan-def.yaml')`.

# View the Results

The results can be viewed in the Certifai console using the CLI command `certifai console`, run from this folder. 
Go to `http://localhost:8000` in your browser. 

The results can also be analyzed in this notebook, or later in a separate notebook.
TODO LINK TO ANALYSIS NOTEBOOK

# Create and run a Trust Scan

In this section, we modify the scan definition to run a scan for fairness, explainability and robustness. We could include explanations in this new scan, but for now we'll omit them.

First, remove the explanation evaluation and add fairness, explainability and robustness.

In [10]:
scan.remove_evaluation_type('explanation')
scan.add_evaluation_type('fairness')
scan.add_evaluation_type('robustness')
scan.add_evaluation_type('explainability')

Note that in this use case, 'fairness' is less about bias against protected groups and more about detecting whether specific groups are more likely to receive unfavorable outcomes. To scan for these disparities, we need to provide some additional information on the groups to be scanned. We'll choose race, gender and age.

These results need to be interpreted with care.  [Racial/Ethnic Disparities in Readmissions in US Hospitals: The Role of Insurance Coverage](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5946640/) shows that lower readmission rates may not always be construed as a good outcome, and could relate to a lack of insurance coverage and poor access to care. 

In [11]:
scan.add_fairness_grouping_feature(CertifaiGroupingFeature('race'))
scan.add_fairness_grouping_feature(CertifaiGroupingFeature('gender'))
scan.add_fairness_grouping_feature(CertifaiGroupingFeature('age'))
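
To make the idea of disparity between groups concrete, the sketch below compares the share of favorable predictions across groups with plain pandas. This is a simple rate comparison on made-up predictions, shown only for intuition. It is not how Certifai computes its fairness score, and the data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical predictions: 0 = 'Not Readmitted' (favorable), 1 = 'Readmitted'.
preds = pd.DataFrame({
    'gender': ['Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Female'],
    'predicted': [0, 0, 1, 0, 1, 1, 0, 0],
})

# Share of each group receiving the favorable outcome.
favorable_rate = (preds['predicted'] == 0).groupby(preds['gender']).mean()

# Gap between the highest and lowest group rates; 0 means equal rates.
disparity = favorable_rate.max() - favorable_rate.min()

print(favorable_rate)
print(disparity)
```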

When running a fairness analysis, it is important that there are sufficient samples of each class within the grouping features. We can check this and other potential issues using a preflight analysis. It will also give us a time estimate for each evaluation.

In [12]:
preflight_result = scan.run_preflight()
pprint.pprint(preflight_result)

Starting Preflight Scan
[--------------------] 2020-10-03 20:08:39.280893 - 0 of 8 checks (0.0% complete) - Running model nondeterminism preflight check for model logit
[##------------------] 2020-10-03 20:08:39.322757 - 1 of 8 checks (12.5% complete) - Running scan time estimate preflight check for model logit




[#####---------------] 2020-10-03 20:09:42.265016 - 2 of 8 checks (25.0% complete) - Running unknown outcome class preflight check for model logit
[#######-------------] 2020-10-03 20:09:42.287747 - 3 of 8 checks (37.5% complete) - Running fairness class samples preflight check for model logit
[##########----------] 2020-10-03 20:09:42.771684 - 4 of 8 checks (50.0% complete) - Finished all preflight checks for model logit
[##########----------] 2020-10-03 20:09:42.771800 - 4 of 8 checks (50.0% complete) - Running model nondeterminism preflight check for model mlp
[############--------] 2020-10-03 20:09:42.813925 - 5 of 8 checks (62.5% complete) - Running scan time estimate preflight check for model mlp




[###############-----] 2020-10-03 20:11:00.715962 - 6 of 8 checks (75.0% complete) - Running unknown outcome class preflight check for model mlp
[#################---] 2020-10-03 20:11:00.739645 - 7 of 8 checks (87.5% complete) - Running fairness class samples preflight check for model mlp
[####################] 2020-10-03 20:11:01.182076 - 8 of 8 checks (100.0% complete) - Finished all preflight checks for model mlp
{'logit': {'errors': [],
           'messages': ['Passed model non determinism check',
                        'Expected time for fairness analysis is 529 seconds',
                        'Expected time for robustness analysis is 62 seconds',
                        'Expected time for explainability analysis is 59 '
                        'seconds',
                        'Model logit passed time estimation check',
                        'Passed unknown outcome classes check'],
                        "size for 'Unknown/Invalid' with 3 examples",
                      

In the warnings, 'gender' has a small sample size (3) for 'Unknown/Invalid'. We will address this by dropping those rows, given they are a tiny proportion and are not a useful class for analysis. 
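
The same kind of check can be done by hand before running preflight. The sketch below counts rows per class for a one-hot encoded grouping feature and flags classes under a threshold. The data is synthetic and `MIN_SAMPLES` is an illustrative value, not the threshold Certifai applies.

```python
import pandas as pd

# Hypothetical one-hot columns for 'gender'; real data has many more rows.
demo = pd.DataFrame({
    'gender_Female': [1, 1, 0, 0, 1, 0],
    'gender_Male': [0, 0, 1, 1, 0, 0],
    'gender_Unknown/Invalid': [0, 0, 0, 0, 0, 1],
})

MIN_SAMPLES = 2  # illustrative threshold, not Certifai's

# Summing a one-hot column counts the rows in that class.
class_counts = demo.filter(like='gender_').sum()
small_classes = class_counts[class_counts < MIN_SAMPLES].index.tolist()

print(class_counts)
print(small_classes)
```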

In other cases (e.g. for underrepresented age ranges), a better approach is to combine smaller classes into one larger class using bucketing. This is illustrated in the [Practical Issues](cortex-certifai-examples/notebooks/practical_issues/PracticalIssues.ipynb) notebook.
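
As a rough illustration of bucketing, the sketch below merges every class with fewer than `MIN_SAMPLES` rows into a single combined class. The data, the threshold and the `'other'` label are assumptions for illustration, and the linked notebook may take a different approach. For ordered features such as age, merging adjacent ranges may be more meaningful than a catch-all bucket.

```python
import pandas as pd

# Hypothetical age brackets with two sparsely populated classes.
ages = pd.Series(['[50-60)'] * 5 + ['[60-70)'] * 4 + ['[0-10)'] + ['[10-20)'] * 2)

MIN_SAMPLES = 3  # illustrative threshold

counts = ages.value_counts()
rare = counts[counts < MIN_SAMPLES].index

# Replace every rare class with a single combined bucket.
bucketed = ages.where(~ages.isin(rare), 'other')

print(bucketed.value_counts())
```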

There are also warnings about 'nan' classes in 'age' and 'gender' with 0 examples. This is an artifact of including a nan column for one-hot encodings with no null values. These warnings can be ignored as a class with 0 examples will not be considered in the analysis. However, it would make sense to change the data pipeline to eliminate these columns and not pass them to the models. 

Drop rows with 'gender_Unknown/Invalid' and replace the evaluation dataset in the scan.

In [13]:
scan.remove_dataset('evaluation')
eval_dataset = CertifaiDataset('evaluation',
                                CertifaiDatasetSource.dataframe(df[df['gender_Unknown/Invalid'] == 0]))
scan.add_dataset(eval_dataset)

Save the scan definition and run the scan.

In [None]:
with open('trust-scan-def.yaml', "w") as f:
    scan.save(f)
results = scan.run(write_reports=True)

[--------------------] 2020-10-03 20:11:06.961680 - 0 of 6 reports (0.0% complete) - Starting scan with model_use_case_id: 'readmission' and scan_id: 'f9ec20548880', total estimated time is 24 minutes
[--------------------] 2020-10-03 20:11:06.961851 - 0 of 6 reports (0.0% complete) - Running fairness evaluation for model: logit, estimated time is 529 seconds




In [None]:

#scan.add_fairness_metric('demographic_parity')
scan.fairness_metrics



In [None]:
#scan.add_fairness_metric('demographic_parity')
#scan.primary_fairness_metric = 'demographic_parity'
#scan.remove_fairness_metric('burden')
#scan.remove_evaluation_type('robustness')
#scan.remove_evaluation_type('explainability')

In [None]:
# results = scan.run(write_reports=True)