In this notebook, we work with the look-up tables created for Section 4.3 to compute Shapley values for the four CatBoost models from Section 4.2, which were trained on public datasets. For each model, a sample of 100 points from the test set is provided in the folder `Samples`.

The look-up tables were created with proprietary code of Discover Financial Services, a fast implementation of Algorithm 3.12. The tables are located in the folders `Shapley_values_loc_1`, `Shapley_values_loc_2`, etc. Each folder contains one `.csv` file per tree of the ensemble under consideration, holding all Shapley values arising from that tree. The rows are indexed by the leaves of the oblivious tree, and the columns correspond to the features on which the tree splits. Non-realizable leaves, which correspond to vacuous regions, are excluded. The `.json` file in each folder relates the local enumeration of the features appearing in a tree to their global indices in the training data.
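
As a rough illustration of how one such table might be consumed, the sketch below scatters a single row of a per-tree table into a vector indexed by global feature position. The function name `scatter_to_global`, the toy table, and the list `toy_map` standing in for the contents of the `.json` file are all assumptions made for illustration; the actual `.json` schema is not described here.

```python
import numpy as np

def scatter_to_global(table, leaf_index, local_to_global, n_features):
    """Place the Shapley values of one leaf into a length-n_features vector.

    table           : array of shape (n_leaves, n_local_features), one row per leaf
    leaf_index      : row of the table to read
    local_to_global : hypothetical list; entry k is the global index of local feature k
    n_features      : number of features in the training data
    """
    row = np.asarray(table, dtype=float)[leaf_index]
    out = np.zeros(n_features)
    # np.add.at accumulates correctly even if a global index were repeated.
    np.add.at(out, local_to_global, row)
    return out

# Toy table for a depth-2 tree splitting on two features (4 leaves, 2 columns).
toy_table = np.array([[-0.3, -0.1],
                      [ 0.2, -0.2],
                      [-0.1,  0.3],
                      [ 0.4,  0.1]])
toy_map = [5, 2]  # assumed: local feature 0 is global feature 5, local feature 1 is global feature 2
global_row = scatter_to_global(toy_table, leaf_index=3,
                               local_to_global=toy_map, n_features=7)
```

Features on which the tree does not split keep the value zero. By linearity of the Shapley value, summing such vectors over all trees of an ensemble would give the values for the ensemble as a whole.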

We verify these precomputed Shapley values by checking the 
[efficiency axiom](https://christophm.github.io/interpretable-ml-book/shapley.html#the-shapley-value-in-detail): for a tree chosen at random from one of the four ensembles and for each data sample, the difference between the tree's output (i.e. the leaf value) and the sum of the Shapley values associated with the corresponding leaf is constant, and this constant should equal the average output of that tree over the whole training data:

$$\sum_i\varphi_i[g](\mathbf{x})=g(\mathbf{x})-\mathbb{E}[g]\quad \forall\mathbf{x}.$$
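
To make the identity concrete, here is a minimal brute-force sketch on a toy depth-2 oblivious tree. It does not use the look-up tables or Algorithm 3.12. The toy tree, the synthetic background data, and the choice of the marginal game (features outside the coalition are averaged over the background data) are assumptions made for illustration, and the game underlying the precomputed tables may differ.

```python
from itertools import combinations
from math import factorial
import numpy as np

def shapley_values(value, n):
    """Exact Shapley values of an n-player game given by value(frozenset)."""
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Toy depth-2 oblivious tree: feature 0 is split at 0.5, feature 1 at 0.0.
leaf_values = np.array([1.0, -2.0, 0.5, 3.0])

def g(X):
    X = np.atleast_2d(X)
    idx = (X[:, 0] > 0.5).astype(int) + 2 * (X[:, 1] > 0.0).astype(int)
    return leaf_values[idx]

rng = np.random.default_rng(0)
background = rng.normal(size=(200, 2))   # stands in for the training data
mean_output = g(background).mean()       # empirical E[g]

def marginal_game(x):
    """v(S): average of g with the features in S fixed to their values in x."""
    def value(S):
        Z = background.copy()
        cols = list(S)
        Z[:, cols] = x[cols]
        return g(Z).mean()
    return value

points = rng.normal(size=(5, 2))
# g(x) minus the sum of Shapley values, one entry per point.
residuals = np.array([g(x)[0] - shapley_values(marginal_game(x), 2).sum()
                      for x in points])
```

Every entry of `residuals` agrees with `mean_output` up to floating-point error, since the game satisfies v(all features) = g(x) and v(empty set) = E[g].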

In [1]:
import pandas as pd
import numpy as np
import glob
import random
import pickle

import catboost
from catboost import CatBoostClassifier, CatBoostRegressor

We first load the CatBoost model and the corresponding data sample. Only `experiment_number` needs to be set (an integer between 1 and 4).

In [2]:
experiment_number=4

if experiment_number==1 or experiment_number==2:
    model_type='Regressor'
elif experiment_number==3 or experiment_number==4:
    model_type='Classifier'
else:
    raise ValueError('experiment_number should be 1,2,3 or 4.')
    
sample_path='./Samples/Sample_'+str(experiment_number)+'.csv'
sample=pd.read_csv(sample_path)
n_samples=sample.shape[0]

model_path='./Models/'+model_type+'_CatBoost_'+str(experiment_number)
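#Load the pickled model; the context manager closes the file afterwards.
with open(model_path,'rb') as model_file:
    model_cat=pickle.load(model_file)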

local_shapley_folder_path='./Shapley_values_loc_'+str(experiment_number)
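#Count the per-tree tables (glob.glob1 is undocumented and deprecated, so glob.glob is used instead).
n_trees=len(glob.glob(local_shapley_folder_path+'/*.csv'))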
print(f'We consider the CatBoost ensemble from experiment {experiment_number} which has {n_trees} trees.')

We consider the CatBoost ensemble from experiment 4 which has 1000 trees.


The script `EnsembleParser.py` derives various statistics from a trained CatBoost ensemble. Below, we import it to compute, for each tree, the average of its leaf values over the training data, which we need for the efficiency test.
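
The internals of `EnsembleParser.py` are not reproduced here. One plausible way to obtain such averages is a weighted mean of each tree's leaf values, where the weight of a leaf is the amount of training data landing in it. The function `tree_averages` and the toy arrays below are assumptions for illustration. CatBoost models expose accessors such as `get_leaf_values`, `get_leaf_weights` and `get_tree_leaf_counts` that could supply arrays of this shape, though whether `Parser` relies on them is not known.

```python
import numpy as np

def tree_averages(leaf_values, leaf_weights, leaf_counts):
    """Weighted mean leaf value of each tree.

    leaf_values  : flat array of leaf values, trees concatenated in order
    leaf_weights : flat array, training weight (e.g. row count) landing in each leaf
    leaf_counts  : number of leaves in each tree
    """
    leaf_values = np.asarray(leaf_values, dtype=float)
    leaf_weights = np.asarray(leaf_weights, dtype=float)
    bounds = np.cumsum(leaf_counts)[:-1]   # split points between consecutive trees
    values_per_tree = np.split(leaf_values, bounds)
    weights_per_tree = np.split(leaf_weights, bounds)
    return np.array([np.average(v, weights=w)
                     for v, w in zip(values_per_tree, weights_per_tree)])

# Two toy trees: the first has 4 leaves, the second has 2.
toy_values = [1.0, -1.0, 0.5, 2.0, 0.3, -0.3]
toy_weights = [10, 30, 40, 20, 50, 50]
toy_averages = tree_averages(toy_values, toy_weights, leaf_counts=[4, 2])
```

For the first toy tree the weighted mean is (10 - 30 + 20 + 40)/100 = 0.4, and for the second it is 0.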

In [3]:
import sys
sys.path.insert(0, '..')
import EnsembleParser
from EnsembleParser import Parser
averages=Parser(model_cat).tree_average()

In [4]:
#We pick a random tree. Since leaves corresponding to degenerate regions are not considered in look-up tables, 
#we only consider tables with (number of rows)=2**(number of columns), that is, trees without repeated features. 
#For such trees, the internal enumeration of leaves matches the order of rows. 

while True:
    tree_index=random.randint(0,n_trees-1)
    local_shapley=pd.read_csv(local_shapley_folder_path+'/game_value_tree_'+str(tree_index)+'.csv',
                         header=None)
    if local_shapley.shape[0]==2**(local_shapley.shape[1]):
        break
print(f'The tree of index {tree_index} was chosen randomly from the CatBoost ensemble.')
print(f'The average of its outputs over the training data is {averages[tree_index]}.')
        
#The outputs of the chosen tree at the sample points. These are leaf values (contributions to the log-odds in the case of classifiers).
outputs=model_cat.predict(sample,prediction_type='RawFormulaVal',
                          ntree_start=tree_index,ntree_end=tree_index+1)
        

#Determining leaves of the tree at which sample points land:
leaf_indices=model_cat.calc_leaf_indexes(sample,ntree_start=tree_index,ntree_end=tree_index+1).reshape(n_samples)

#Adding the sum of rows to the table of Shapley values
local_shapley['sum']=local_shapley.sum(axis=1)

#Subtracting the sum of Shapley values at the leaf corresponding to a sample point from the leaf value:
difference=outputs-np.asarray(local_shapley['sum'][leaf_indices].to_list())
print(f'\nVerifying the efficiency axiom: the output minus the sum of local Shapley values should be the same for all {n_samples} sample data points; this difference always coincides with the average output of the tree.')
difference

The tree of index 182 was chosen randomly from the CatBoost ensemble.
The average of its outputs over the training data is -8.096037103613889e-05.

Verifying the efficiency axiom: the output minus the sum of local Shapley values should be the same for all 100 sample data points; this difference always coincides with the average output of the tree.


array([-8.0960371e-05, -8.0960371e-05, -8.0960371e-05, -8.0960371e-05,
       -8.0960371e-05, -8.0960371e-05, -8.0960371e-05, -8.0960371e-05,
       -8.0960371e-05, -8.0960371e-05, -8.0960371e-05, -8.0960371e-05,
       -8.0960371e-05, -8.0960371e-05, -8.0960371e-05, -8.0960371e-05,
       -8.0960371e-05, -8.0960371e-05, -8.0960371e-05, -8.0960371e-05,
       -8.0960371e-05, -8.0960371e-05, -8.0960371e-05, -8.0960371e-05,
       -8.0960371e-05, -8.0960371e-05, -8.0960371e-05, -8.0960371e-05,
       -8.0960371e-05, -8.0960371e-05, -8.0960371e-05, -8.0960371e-05,
       -8.0960371e-05, -8.0960371e-05, -8.0960371e-05, -8.0960371e-05,
       -8.0960371e-05, -8.0960371e-05, -8.0960371e-05, -8.0960371e-05,
       -8.0960371e-05, -8.0960371e-05, -8.0960371e-05, -8.0960371e-05,
       -8.0960371e-05, -8.0960371e-05, -8.0960371e-05, -8.0960371e-05,
       -8.0960371e-05, -8.0960371e-05, -8.0960371e-05, -8.0960371e-05,
       -8.0960371e-05, -8.0960371e-05, -8.0960371e-05, -8.0960371e-05,
      

Similarly, the efficiency axiom can be confirmed for Owen values. For the same models, and for appropriate partitions of their respective features (see the folder `MIC_based_grouping`), we have generated look-up tables of Owen values using a proprietary implementation of Theorem F.1. The tables are saved in the folders `Owen_values_loc_1`, `Owen_values_loc_2`, etc., which have the same structure as before: a `.csv` file for each tree recording the Owen values at its (realizable) leaves for the features on which the tree splits, along with a `.json` file recording the features relevant to each tree, their global indices, and the partition at hand.

These precomputed Owen values are verified, again, by checking the 
[efficiency axiom](https://christophm.github.io/interpretable-ml-book/shapley.html#the-shapley-value-in-detail): for a tree chosen at random from one of the four ensembles and for each data sample, the difference between the tree's output (i.e. the leaf value) and the sum of the Owen values associated with the corresponding leaf is constant, and this constant should equal the average output of that tree over the whole training data:

$$\sum_iOw_i[g](\mathbf{x})=g(\mathbf{x})-\mathbb{E}[g]\quad \forall\mathbf{x}.$$
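
The same identity can be illustrated by brute force on a toy example. The sketch below evaluates the standard two-level Owen formula directly. It is not the implementation of Theorem F.1. The toy depth-3 tree, the synthetic background data, the marginal game, and the partition `[(0, 1), (2,)]` are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial
import numpy as np

def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def owen_values(value, partition):
    """Exact Owen values for a game value(frozenset) and a partition into groups."""
    n = sum(len(group) for group in partition)
    m = len(partition)
    ow = np.zeros(n)
    for j, group in enumerate(partition):
        other_groups = [k for k in range(m) if k != j]
        for i in group:
            for R in subsets(other_groups):
                # Players of every group in R enter the coalition as whole blocks.
                Q = frozenset().union(*(partition[k] for k in R))
                w_outer = factorial(len(R)) * factorial(m - len(R) - 1) / factorial(m)
                for T in subsets(set(group) - {i}):
                    w_inner = (factorial(len(T)) * factorial(len(group) - len(T) - 1)
                               / factorial(len(group)))
                    ow[i] += w_outer * w_inner * (value(Q | T | {i}) - value(Q | T))
    return ow

# Toy depth-3 oblivious tree: each of the three features is split at 0.
leaf_values = np.arange(8, dtype=float) ** 1.5

def g(X):
    X = np.atleast_2d(X)
    idx = (X > 0).astype(int) @ np.array([1, 2, 4])
    return leaf_values[idx]

rng = np.random.default_rng(1)
background = rng.normal(size=(300, 3))   # stands in for the training data

def marginal_game(x):
    """v(S): average of g with the features in S fixed to their values in x."""
    def value(S):
        Z = background.copy()
        cols = list(S)
        Z[:, cols] = x[cols]
        return g(Z).mean()
    return value

partition = [(0, 1), (2,)]               # assumed grouping of the three features
x = rng.normal(size=3)
ow = owen_values(marginal_game(x), partition)
gap = g(x)[0] - ow.sum()                 # output minus the sum of Owen values
```

Here `gap` agrees with `g(background).mean()` up to floating-point error, mirroring the check performed on the look-up tables below.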

In [5]:
#We pick a random tree. Since leaves corresponding to degenerate regions are not considered in look-up tables, 
#we only consider tables with (number of rows)=2**(number of columns), that is, trees without repeated features. 
#For such trees, the internal enumeration of leaves matches the order of rows. 

local_owen_folder_path='./Owen_values_loc_'+str(experiment_number)

while True:
    tree_index=random.randint(0,n_trees-1)
    local_owen=pd.read_csv(local_owen_folder_path+'/game_value_tree_'+str(tree_index)+'.csv',
                         header=None)
    if local_owen.shape[0]==2**(local_owen.shape[1]):
        break
print(f'The tree of index {tree_index} was chosen randomly from the CatBoost ensemble.')
print(f'The average of its outputs over the training data is {averages[tree_index]}.')
        
#The outputs of the chosen tree at the sample points. These are leaf values (contributions to the log-odds in the case of classifiers).
outputs=model_cat.predict(sample,prediction_type='RawFormulaVal',
                          ntree_start=tree_index,ntree_end=tree_index+1)
        

#Determining leaves of the tree at which sample points land:
leaf_indices=model_cat.calc_leaf_indexes(sample,ntree_start=tree_index,ntree_end=tree_index+1).reshape(n_samples)

#Adding the sum of rows to the table of Owen values
local_owen['sum']=local_owen.sum(axis=1)

#Subtracting the sum of Owen values at the leaf corresponding to a sample point from the leaf value:
difference=outputs-np.asarray(local_owen['sum'][leaf_indices].to_list())
print(f'\nVerifying the efficiency axiom: the output minus the sum of local Owen values should be the same for all {n_samples} sample data points; this difference always coincides with the average output of the tree.')
difference

The tree of index 217 was chosen randomly from the CatBoost ensemble.
The average of its outputs over the training data is -3.341477025607011e-05.

Verifying the efficiency axiom: the output minus the sum of local Owen values should be the same for all 100 sample data points; this difference always coincides with the average output of the tree.


array([-3.34147703e-05, -3.34147703e-05, -3.34147703e-05, -3.34147703e-05,
       -3.34147703e-05, -3.34147703e-05, -3.34147703e-05, -3.34147703e-05,
       -3.34147703e-05, -3.34147703e-05, -3.34147703e-05, -3.34147703e-05,
       -3.34147703e-05, -3.34147703e-05, -3.34147703e-05, -3.34147703e-05,
       -3.34147703e-05, -3.34147703e-05, -3.34147703e-05, -3.34147703e-05,
       -3.34147703e-05, -3.34147703e-05, -3.34147703e-05, -3.34147703e-05,
       -3.34147703e-05, -3.34147703e-05, -3.34147703e-05, -3.34147703e-05,
       -3.34147703e-05, -3.34147703e-05, -3.34147703e-05, -3.34147703e-05,
       -3.34147703e-05, -3.34147703e-05, -3.34147703e-05, -3.34147703e-05,
       -3.34147703e-05, -3.34147703e-05, -3.34147703e-05, -3.34147703e-05,
       -3.34147703e-05, -3.34147703e-05, -3.34147703e-05, -3.34147703e-05,
       -3.34147703e-05, -3.34147703e-05, -3.34147703e-05, -3.34147703e-05,
       -3.34147703e-05, -3.34147703e-05, -3.34147703e-05, -3.34147703e-05,
       -3.34147703e-05, -