# Fast bulk explanations

This notebook illustrates a technique for scaling Certifai counterfactual explanations to large explanation
datasets.  To do this the process is split into two stages:

1. A single global (expensive but one-off) pre-calculation step is performed
2. Explanation scans may then be performed with a fast approximation mechanism that utilizes the results of (1)

Subsequent explanation scans can continue to use the fast mechanism.  Step (1) needs to be repeated only
if the model to be explained is changed (or a new one is added), or if the data distribution has shifted
significantly.
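
The rule for when to repeat step (1) can be written down as a small check. This is only a sketch: `PrecalcRecord`, `needs_precalculation`, the `drift_score` input and its threshold are hypothetical names introduced for illustration and are not part of Certifai. How drift is measured is left to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PrecalcRecord:
    """Hypothetical bookkeeping for the last precalculation that was run."""
    model_version: str  # any tag that changes whenever the model is retrained or replaced


def needs_precalculation(record: Optional[PrecalcRecord],
                         current_model_version: str,
                         drift_score: float,
                         drift_threshold: float = 0.2) -> bool:
    """Return True if step (1) should be repeated before running fast explanations."""
    if record is None:
        # No precalculation has been run yet (for example, a newly added model)
        return True
    if record.model_version != current_model_version:
        # The model being explained has changed
        return True
    # Otherwise repeat only if the data distribution has shifted significantly
    return drift_score > drift_threshold


last_run = PrecalcRecord(model_version="rf-v1")
print(needs_precalculation(last_run, "rf-v1", drift_score=0.05))
print(needs_precalculation(last_run, "rf-v2", drift_score=0.05))
```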

*Note* - this example requires Certifai version 1.3.7 or above

In [1]:
import pandas as pd
import numpy as np

from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression  # used by the 'logistic' branch of build_model
from sklearn.model_selection import GridSearchCV
from sklearn import svm

from certifai.scanner.builder import (CertifaiScanBuilder, CertifaiPredictorWrapper, CertifaiModel, CertifaiModelMetric,
                                      CertifaiDataset, CertifaiGroupingFeature, CertifaiDatasetSource,
                                      CertifaiPredictionTask, CertifaiTaskOutcomes, CertifaiOutcomeValue)
from certifai.common.utils.encoding import CatEncoder
from certifai.scanner.report_utils import scores, construct_scores_dataframe
from certifai.scanner.explanation_utils import explanations, ExplainedPrediction
from sklearn.preprocessing import StandardScaler

%matplotlib inline
import matplotlib.pyplot as plt

# Train a model

We'll use the adult income dataset for this example, which has just under 50,000 examples.  The fast approximation technique is only recommended when
large numbers of explanations are required and the datasets involved are large.  In particular, the pre-calculation phase
benefits strongly from having a large evaluation dataset to train from.

In [2]:
base_path = "../datasets"
dataset_file = f"{base_path}/adult_income_eval.csv"

df = pd.read_csv(dataset_file)

In [3]:
np.random.seed(0)

cat_columns = [
    'workclass',
    'education',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'gender',
    'native-country'
]

label_column = 'income'

# Separate outcome
y = df[label_column]
X = df.drop(label_column, axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

encoder = CatEncoder(cat_columns, X)

MAX_TRAIN_SAMPLE_SIZE = 5000

def build_model(data, name, model_family, test=None):
    if test is None:
        test = data
    
    X = data[0][:MAX_TRAIN_SAMPLE_SIZE]
    y = data[1][:MAX_TRAIN_SAMPLE_SIZE]
    X_test = test[0]
    if encoder is not None:
        X = encoder(X.values)
        X_test = encoder(X_test.values)

    if model_family == 'SVM':
        parameters = {'kernel': ('linear', 'rbf', 'poly'), 'C': [0.1, .5, 1, 2, 4, 10],
                      'gamma': ['auto']}
        m = svm.SVC()
    elif model_family == 'logistic':
        parameters = {'C': (0.5, 1.0, 2.0), 'solver': ['lbfgs'], 'max_iter': [1000]}
        m = LogisticRegression()
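    elif model_family == 'RF':
        parameters = {'n_estimators': [100, 200]}
        m = RandomForestClassifier()
    else:
        # Fail early rather than hitting a NameError on `m` below
        raise ValueError(f"Unknown model family: {model_family}")
    model = GridSearchCV(m, parameters, cv=3)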
    model.fit(X, y)

    # Assess on the test data
    accuracy = model.score(X_test, test[1].values)
    print(f"Model '{name}' accuracy is {accuracy}")
    return model

rf_model = build_model((X_train, y_train),
                        'Random Forest',
                        'RF',
                        test=(X_test, y_test))

Model 'Random Forest' accuracy is 0.8582249974408844


In [4]:
rf_model_proxy = CertifaiPredictorWrapper(rf_model, encoder=encoder)

# Set up the explanation set

For the purposes of this notebook we'll run 1000 explanations.  This is fewer than would typically justify use of the fast explainer in a production environment, but serves as an example.  More typically one would be running
tens or hundreds of thousands of explanations to gain the full benefit of amortizing the cost of the precalculation step.
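
To make the amortization argument concrete, the sketch below computes the number of explanations at which precalculation plus fast explanations becomes cheaper than explaining every row without precalculation. The function name and all three timings are placeholders chosen for illustration. They are not measurements from Certifai, so substitute figures from your own runs.

```python
import math
from typing import Optional


def break_even_count(precalc_seconds: float,
                     standard_seconds_each: float,
                     fast_seconds_each: float) -> Optional[int]:
    """Smallest n for which precalc + n * fast is strictly cheaper than n * standard.

    Returns None if the fast mechanism is not cheaper per explanation,
    in which case the precalculation cost is never recovered.
    """
    saving_each = standard_seconds_each - fast_seconds_each
    if saving_each <= 0:
        return None
    # precalc + n * fast < n * standard  <=>  n > precalc / saving_each
    return math.floor(precalc_seconds / saving_each) + 1


# Placeholder timings (assumptions, not measured values)
n_star = break_even_count(precalc_seconds=1800,
                          standard_seconds_each=2.0,
                          fast_seconds_each=0.5)
print(f"Precalculation pays for itself from {n_star} explanations onwards")
```

Below the break-even count the one-off cost dominates, which is why the technique is only worthwhile for large explanation sets or for repeated scans against the same model.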

In [5]:
# Make sure we are evaluating on a set that is held out from the clustering training set
X_cluster_train, X_cluster_test, y_cluster_train, y_cluster_test = train_test_split(X, y, test_size=0.5, random_state=42)
clustering_df = X_cluster_train

EXPLANATION_SIZE = 1000

explanation_set = X_cluster_test[:EXPLANATION_SIZE]
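
One way to confirm that the explanation rows really are held out from the clustering rows is to compare the dataframe indexes, since `train_test_split` preserves the original index labels. The sketch below uses a small toy frame as a stand-in for `X`; the toy column values and the `_toy`/`_part` names are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the feature frame X used in the notebook
X_toy = pd.DataFrame({"age": range(20), "hours-per-week": range(20, 40)})
y_toy = pd.Series([0, 1] * 10)

train_part, test_part, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=42)
explanation_part = test_part[:5]

# Rows keep their original index labels, so an empty intersection means
# no row is used both for clustering and for explanation
overlap = train_part.index.intersection(explanation_part.index)
print(f"Rows shared between clustering and explanation sets: {len(overlap)}")
```

The same check applied to the notebook's own frames would compare `X_cluster_train.index` with `explanation_set.index`.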


# Set up and run the precalculation step

Scan setup is much the same as for any other analysis.  The key difference is that when we run it we
use `run_explain()` and pass the `precalculate=True` flag.

In [6]:
# First define the possible prediction outcomes
task = CertifaiPredictionTask(CertifaiTaskOutcomes.classification(
    [
        CertifaiOutcomeValue(1, favorable=True),
        CertifaiOutcomeValue(0)
    ]))

scan = CertifaiScanBuilder.create('stock',
                                  prediction_task=task)

# Add our local model
model = CertifaiModel('rf', local_predictor=rf_model_proxy)
scan.add_model(model)

# Add the eval dataset
precalc_dataset = CertifaiDataset('precalc',
                                  CertifaiDatasetSource.dataframe(X_cluster_train))
scan.add_dataset(precalc_dataset)
scan.evaluation_dataset_id = 'precalc'

# Because the dataset contains a ground truth outcome column which the model does not
# expect to receive as input we need to state that in the dataset schema (since it cannot
# be inferred from the CSV)
scan.dataset_schema.outcome_feature_name = label_column

# First we need to run the pre-calculation for the use-case and model.  This step will perform
# the relatively expensive one-time compute necessary to support subsequent fast explanation
# of bulk data
scan.run_explain(precalculate=True)

Starting Fast Explanations Precalculate Step
[--------------------] 2020-11-24 09:35:17.291400 - 0 of 1 checks (0.0% complete) - Computing clustering information for model: rf
[####################] 2020-11-24 10:06:14.373596 - 1 of 1 checks (100.0% complete) - Finished fast explanations precalculate step for all models


{'rf': {'status': <ScanStatusEnum.completed: 'Completed'>,
  'error': None,
  'location': '/Users/sdraper/projects/cortex-certifai-examples/notebooks/fast_explanations/reports/stock/certifai-precalculate-rf.pkl'}}

# Run the bulk explanations

We can now run the fast explanation step.  Note that we again use `run_explain()`, but this time with a different flag, `fast=True`.

In [7]:
# Add the final explanation dataset
expl_dataset = CertifaiDataset('explanation',
                               CertifaiDatasetSource.dataframe(explanation_set))
scan.add_dataset(expl_dataset)
scan.explanation_dataset_id = 'explanation'

# Run the fast explanation mechanism on the (large) explanation set.
# Note - this requires the precalculation to have been run previously, though
# not necessarily in the same session: the precalculation results are cached to disk
# and can be reused by as many subsequent calls with 'fast=True' as desired
fast_result = scan.run_explain(fast=True)

[--------------------] 2020-11-24 10:06:14.508406 - 0 of 1 checks (0.0% complete) - Starting scan with model_use_case_id: 'stock' and scan_id: '55ec412c5328'
[--------------------] 2020-11-24 10:06:14.508530 - 0 of 1 checks (0.0% complete) - Running fast explanation evaluation for model: rf
[####################] 2020-11-24 10:07:09.419479 - 1 of 1 checks (100.0% complete) - Completed all evaluations
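
The timestamps in the two progress logs above give a rough idea of the relative cost of each stage. The sketch below parses them with the standard library. It treats the first and last progress lines of each stage as approximate start and end times, which is an assumption; the helper name `elapsed_seconds` is ours.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two timestamps in the format used by the progress log."""
    delta = (datetime.strptime(end, LOG_TIME_FORMAT)
             - datetime.strptime(start, LOG_TIME_FORMAT))
    return delta.total_seconds()


# Timestamps copied from the precalculation and fast explanation logs above
precalc_seconds = elapsed_seconds("2020-11-24 09:35:17.291400",
                                  "2020-11-24 10:06:14.373596")
fast_seconds = elapsed_seconds("2020-11-24 10:06:14.508530",
                               "2020-11-24 10:07:09.419479")

EXPLANATION_COUNT = 1000  # size of the explanation set used in this notebook
print(f"Precalculation: {precalc_seconds / 60:.1f} minutes")
print(f"Fast explanations: {fast_seconds:.1f} s in total, "
      f"{1000 * fast_seconds / EXPLANATION_COUNT:.0f} ms per explanation")
```

On the run shown here, precalculation took roughly 31 minutes and the 1000 fast explanations took about 55 seconds, or around 55 ms each. Timings will vary with hardware, model and dataset.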


# Display some results

Below we show the first 10 explanations

In [8]:
from certifai.scanner.explanation_utils import explanations, construct_explanations_dataframe, counterfactual_changes

pd.set_option('display.max_columns', None)

# Using Certifai's explanation utilities we can programmatically explore counterfactuals produced
# during the explanation evaluation. Below we examine only the first 10 explanations
# by displaying the original input data followed by what features were changed by each
# counterfactual.
rf_explanations = construct_explanations_dataframe(explanations(fast_result, model_id='rf'))


def display_explanation(df):
    # Show the original row, then only the features the counterfactual changed
    df_original = df[df['instance']=='original']
    display(df_original)
    changes = counterfactual_changes(df)
    display(changes)

print("Explanations for random forest Model:\n")
for row in range(10):
    display_explanation(rf_explanations[rf_explanations['row'] == row+1])


Explanations for random forest Model:



Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
0,rf,1,original,0,original prediction,0,0.0,56,workclass_Private,33115,education_HS-grad,9,marital-status_Divorced,occupation_Other-service,relationship_Unmarried,race_White,gender_Female,0,0,40,native-country_United-States


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,occupation,relationship,capital-gain
0,original,0,original prediction,0,0.0,occupation_Other-service,relationship_Unmarried,0
0,counterfactual,1,prediction changed,1,0.33,occupation_Adm-clerical,relationship_Husband,7589


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
2,rf,2,original,0,original prediction,0,0.0,25,workclass_Private,112847,education_HS-grad,9,marital-status_Married-civ-spouse,occupation_Transport-moving,relationship_Own-child,race_Other,gender_Male,0,0,40,native-country_United-States


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,capital-gain
0,original,0,original prediction,0,0.0,0
0,counterfactual,1,prediction changed,1,0.99,7780


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
4,rf,3,original,0,original prediction,1,0.0,43,workclass_Private,170525,education_Bachelors,13,marital-status_Divorced,occupation_Prof-specialty,relationship_Not-in-family,race_White,gender_Female,14344,0,40,native-country_United-States


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,capital-gain
0,original,0,original prediction,1,0.0,14344
0,counterfactual,1,prediction changed,0,1.07,7172


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
6,rf,4,original,0,original prediction,0,0.0,32,workclass_Private,186788,education_HS-grad,9,marital-status_Married-civ-spouse,occupation_Transport-moving,relationship_Husband,race_White,gender_Male,0,0,40,native-country_United-States


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,capital-gain
0,original,0,original prediction,0,0.0,0
0,counterfactual,1,prediction changed,1,1.41,5448


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
8,rf,5,original,0,original prediction,0,0.0,39,workclass_Private,277886,education_Bachelors,13,marital-status_Married-civ-spouse,occupation_Sales,relationship_Wife,race_White,gender_Female,0,0,30,native-country_United-States


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,age
0,original,0,original prediction,0,0.0,39
0,counterfactual,1,prediction changed,1,10.0,40


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
10,rf,6,original,0,original prediction,0,0.0,20,workclass_Private,323009,education_HS-grad,9,marital-status_Never-married,occupation_Adm-clerical,relationship_Unmarried,race_White,gender_Female,0,0,40,native-country_Germany


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,occupation,relationship,capital-gain
0,original,0,original prediction,0,0.0,occupation_Adm-clerical,relationship_Unmarried,0
0,counterfactual,1,prediction changed,1,0.32,occupation_Priv-house-serv,relationship_Husband,8500


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
12,rf,7,original,0,original prediction,0,0.0,54,workclass_Private,146834,education_HS-grad,9,marital-status_Divorced,occupation_Transport-moving,relationship_Not-in-family,race_White,gender_Male,0,0,45,native-country_United-States


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,capital-gain
0,original,0,original prediction,0,0.0,0
0,counterfactual,1,prediction changed,1,0.84,9123


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
14,rf,8,original,0,original prediction,1,0.0,25,workclass_Private,166977,education_Bachelors,13,marital-status_Married-civ-spouse,occupation_Prof-specialty,relationship_Wife,race_White,gender_Female,0,1887,40,native-country_United-States


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,capital-loss
0,original,0,original prediction,1,0.0,1887
0,counterfactual,1,prediction changed,0,4.08,1786


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
16,rf,9,original,0,original prediction,0,0.0,30,workclass_Private,209317,education_HS-grad,9,marital-status_Never-married,occupation_Machine-op-inspct,relationship_Not-in-family,race_White,gender_Male,0,0,50,native-country_Dominican-Republic


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,capital-gain
0,original,0,original prediction,0,0.0,0
0,counterfactual,1,prediction changed,1,0.84,9122


Unnamed: 0,model,row,instance,cf_num,cf_type,prediction,fitness,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country
18,rf,10,original,0,original prediction,0,0.0,33,workclass_Private,92865,education_Some-college,10,marital-status_Never-married,occupation_Adm-clerical,relationship_Own-child,race_White,gender_Female,0,0,40,native-country_United-States


Unnamed: 0,instance,cf_num,cf_type,prediction,fitness,capital-gain
0,original,0,original prediction,0,0.0,0
0,counterfactual,1,prediction changed,1,0.84,9126
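
A quick way to summarize the ten explanations above is to count how often each feature appears among the counterfactual changes. In the sketch below the changed features are transcribed by hand from the change tables displayed above, rather than read from the results object, so it stays self-contained.

```python
from collections import Counter

# Features changed by the counterfactual for each displayed row,
# transcribed from the change tables above
changed_features = {
    1: ["occupation", "relationship", "capital-gain"],
    2: ["capital-gain"],
    3: ["capital-gain"],
    4: ["capital-gain"],
    5: ["age"],
    6: ["occupation", "relationship", "capital-gain"],
    7: ["capital-gain"],
    8: ["capital-loss"],
    9: ["capital-gain"],
    10: ["capital-gain"],
}

# Count, per feature, the number of explanations in which it was changed
counts = Counter(feature
                 for features in changed_features.values()
                 for feature in features)

for feature, n in counts.most_common():
    print(f"{feature}: changed in {n} of {len(changed_features)} explanations")
```

For this sample, `capital-gain` is changed in 8 of the 10 counterfactuals, which suggests the model's prediction is particularly sensitive to that feature for these rows.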
