# Calibrated Explanations for Binary Classification
## Stability and Robustness

Author: Tuwe Löfström (tuwe.lofstrom@ju.se)  
Copyright 2023 Tuwe Löfström  
License: BSD 3 clause
Sources:
1. ["Calibrated Explanations: with Uncertainty Information and Counterfactuals"](https://arxiv.org/abs/2305.02305) by [Helena Löfström](https://github.com/Moffran), [Tuwe Löfström](https://github.com/tuvelofstrom), Ulf Johansson, and Cecilia Sönströd.

### 1. Import packages

In [252]:
%load_ext autoreload
%autoreload 2

In [253]:
import pickle
import numpy as np
from scipy import stats as st

### 2. Import results from the pickled result file

In [254]:
with open('results_sota.pkl', 'rb') as f:
    results = pickle.load(f)
with open('results_stab_rob.pkl', 'rb') as f:
    ce = pickle.load(f)
data_characteristics = {'colic': 60, 
                        'creditA': 43, 
                        'diabetes': 9, 
                        'german': 28, 
                        'haberman': 4,
                        'heartC': 23,
                        'heartH': 21,
                        'heartS': 14,
                        'hepati': 20,
                        'iono': 34,
                        'je4042': 9,
                        'je4243': 9, 
                        'kc1': 22,
                        'kc2': 22,
                        'kc3': 40,
                        'liver': 7,
                        'pc1req': 9,
                        'pc4': 38,
                        'sonar': 61,
                        'spect': 23,
                        'spectf': 45,
                        'transfusion': 5,
                        'ttt': 28,
                        'vote': 17,
                        'wbc': 10,}

### 3. Stability and Robustness

To verify some of the claims of CE, two experiments have been run, evaluating the stability and the robustness of the method. Stability is evaluated through experiments where the same model, calibration set and test set were explained 30 times per data set. The only source of variation was the random seed. Robustness is evaluated through experiments where the training and calibration sets were randomly resampled before a new model was trained and explained. The experiments were run 30 times per data set, with the same test set for all runs. A stratified test set of 20 instances (with both classes equally represented) was used. Robustness is measured in this way to avoid inferring perturbed instances that are not from the same distribution as the test instances being explained. The expectation is that a stable and robust explanation method should result in low variance in the feature weights.
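The stratified test set described above can be illustrated with a minimal sketch. It assumes a binary label vector and draws the same number of instances from each class. The function name `stratified_test_indices`, the seed, and the toy labels are hypothetical and not taken from the experiment script.

```python
import numpy as np

def stratified_test_indices(y, n_test=20, seed=0):
    """Return n_test indices with an equal number of instances from each of two classes."""
    y = np.asarray(y)
    classes = np.unique(y)
    if len(classes) != 2:
        raise ValueError("expected a binary label vector")
    rng = np.random.default_rng(seed)
    per_class = n_test // 2
    # Sample without replacement inside each class so both are equally represented
    chosen = [rng.choice(np.flatnonzero(y == c), size=per_class, replace=False)
              for c in classes]
    return np.sort(np.concatenate(chosen))

# Toy labels (hypothetical): 70 negatives and 30 positives
y = np.array([0] * 70 + [1] * 30)
test_idx = stratified_test_indices(y)
print(len(test_idx), int((y[test_idx] == 1).sum()))
```

Keeping `test_idx` fixed across all 30 runs is what makes the per-instance variances comparable between runs.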

Both random forests and XGBoost are used, and both factual and counterfactual explanations are evaluated. The probability estimates of each model were computed on the same test set, as a comparison to the robustness results. Furthermore, the probability estimates from each model calibrated using Venn-Abers (VA) were also computed on the same test set for comparison.

The two state-of-the-art (sota) techniques LIME (version 0.2.0.1, using the `LimeTabularExplainer` class) and SHAP (version 0.44.0, using the `Explainer` class) were also evaluated in the same way, using the same instances and models. These two techniques were selected as the two obvious sota techniques based on their accessibility (e.g., through `pip` installation) and their large user base. No obvious sota technique for counterfactuals was identified, as the proposed algorithms, e.g., LORE or MACE, seem to lack accessibility, a user base, or both.

Everything was run on 25 datasets. See `Classification_Experiment_sota.py` for details on the experiment.

The tabulated results are the mean variance for stability and robustness, measured over the 30 runs and 20 instances. For each instance, the variance is computed over the 30 runs on the feature importance weight of the most influential feature, defined as the feature that most often has the highest absolute feature importance weight. This variance is then averaged over the 20 instances. The most influential feature is used because it is the feature most likely to be used in a decision, and also the feature with the greatest expected variation (a consequence of its weights having the highest absolute values).
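The metric described above can be sketched in vectorized NumPy. This is an illustration under assumptions, not the notebook's own code: it assumes the weights are available as one array of shape `(runs, instances, features)`, and the name `top_feature_variance` and the toy values are hypothetical. Ties for the largest absolute weight are broken towards the lowest feature index here, which may differ from the `argsort`-based selection in the cells below.

```python
import numpy as np

def top_feature_variance(weights):
    """weights: array of shape (runs, instances, features) with feature importance weights."""
    weights = np.asarray(weights, dtype=float)
    n_runs, n_instances, n_features = weights.shape
    # Feature with the largest absolute weight, per run and instance
    top = np.argmax(np.abs(weights), axis=2)              # shape (runs, instances)
    # Most influential feature per instance: the one that is most often on top
    most_influential = np.array([
        np.bincount(top[:, j], minlength=n_features).argmax()
        for j in range(n_instances)
    ])
    # Weight of that feature in every run: shape (runs, instances)
    selected = weights[:, np.arange(n_instances), most_influential]
    per_instance_var = selected.var(axis=0)               # population variance over runs
    return most_influential, per_instance_var, per_instance_var.mean()

# Toy example (hypothetical values): 3 runs, 2 instances, 4 features
weights = np.zeros((3, 2, 4))
weights[:, 0, 2] = 0.5                # identical in every run, so zero variance
weights[:, 1, 1] = [0.2, 0.4, 0.6]    # varies between runs
most_influential, per_instance_var, mean_var = top_feature_variance(weights)
print(most_influential, per_instance_var, mean_var)
```

The last returned value corresponds to one cell of the tables: the variance per instance, averaged over the instances of one dataset and model.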

The results are printed as a LaTeX table for inclusion in the paper.

In [255]:
stab_rank = {}
stab_val = {}
average_results = {}
for a in ['xGB','RF']:
    average_results[a+'_stab_ce'] = {}
    average_results[a+'_stab_cce'] = {}
    average_results[a+'_stab_lime'] = {}
    average_results[a+'_stab_lime_va'] = {}
    average_results[a+'_stab_shap'] = {}
    average_results[a+'_stab_shap_va'] = {}

n = results['test_size']
r = results['num_rep']
print('Stability & xGB & xGB & xGB & xGB & xGB & xGB \\\\')
print('Dataset & CE & CCE & L_C & S_C & L_U & S_U \\\\\n\\hline')
for d in np.sort([k for k in results.keys()]):
    if d in ['test_size', 'num_rep']:
        continue
    print(d, end='')
    algorithms = results[d].keys()
    for a in algorithms:
        stability = results[d][a]['stability']
        
        for key in ['ce', 'cce']:    
            # if key == 'cce':
            #     stability = ce[d][a]['stability']
            # else:
            #     stability = results[d][a]['stability']
            ranks = []
            for j in range(n):
                rank = []
                for i in range(r):
                    rank.append(np.argsort(np.abs(stability[key][i][j]['predict']))[-1:][0])
                ranks.append(rank)
            stab_rank[key] = st.mode(ranks, axis=1)[0] # Find most important feature per instance
            value = []
            for j in range(n):
                value.append([np.mean([stability[key][i][j]['predict'][stab_rank[key][j]] for i in range(r)]), np.var([stability[key][i][j]['predict'][stab_rank[key][j]] for i in range(r)])])
            stab_val[key] = value 

        stability = results[d][a]['stability']
                    
        for key in ['lime', 'lime_va', 'shap', 'shap_va']:    
            ranks = []
            for j in range(n):
                rank = []
                for i in range(r):
                    rank.append(np.argsort(np.abs(stability[key][i][j]))[-1:][0])
                ranks.append(rank)
            stab_rank[key] = st.mode(ranks, axis=1)[0] # Find most important feature per instance
            value = []
            for j in range(n):
                value.append([np.mean([stability[key][i][j][stab_rank[key][j]] for i in range(r)]), np.var([stability[key][i][j][stab_rank[key][j]] for i in range(r)])])
            stab_val[key] = value 
        
        for key in ['ce', 'cce', 'lime', 'lime_va', 'shap', 'shap_va']:
            average_results[a+'_stab_'+key][d] = np.mean([t[1] for t in stab_val[key]])
        # print(f'{np.mean([t[1] if t[1] > 1e-20 else 0 for t in stab_val["ce"]]):.1e} & {np.var([t[1] if t[1] > 1e-20 else 0 for t in stab_val["cce"]]):.1e} & ',end='')
        
        for key in ['ce', 'cce', 'lime_va', 'shap_va', 'lime', 'shap']:
            print(f' & {average_results[a+"_stab_"+key][d]:.0e}',end='')
    print(' \\\\')
print('\\hline\nAverage', end='')
for a in algorithms:
    for key in ['ce', 'cce', 'lime_va', 'shap_va', 'lime', 'shap']:
        print(f' & {np.mean([v for v in average_results[a+"_stab_"+key].values()]):.0e}',end='')
print(' \\\\')

Stability & xGB & xGB & xGB & xGB & xGB & xGB \\
Dataset & CE & CCE & L_C & S_C & L_U & S_U \\
\hline
colic & 0e+00 & 0e+00 & 1e-08 & 3e-33 & 4e-05 & 1e-04 & 3e-32 & 3e-32 & 1e-08 & 1e-33 & 5e-06 & 8e-06 \\
creditA & 0e+00 & 2e-34 & 1e-61 & 0e+00 & 4e-05 & 1e-04 & 1e-32 & 1e-32 & 7e-61 & 0e+00 & 8e-06 & 1e-05 \\
diabetes & 2e-34 & 2e-33 & 4e-11 & 0e+00 & 9e-05 & 3e-33 & 1e-33 & 4e-33 & 2e-11 & 3e-36 & 3e-05 & 1e-33 \\
german & 4e-35 & 2e-35 & 6e-11 & 9e-33 & 8e-05 & 8e-05 & 2e-33 & 2e-33 & 6e-11 & 5e-33 & 3e-05 & 4e-05 \\
haberman & 4e-34 & 5e-34 & 1e-63 & 0e+00 & 1e-04 & 3e-33 & 1e-33 & 9e-34 & 9e-64 & 0e+00 & 4e-05 & 1e-33 \\
heartC & 8e-34 & 0e+00 & 2e-10 & 1e-32 & 7e-05 & 1e-04 & 4e-33 & 3e-33 & 2e-10 & 5e-33 & 1e-05 & 5e-06 \\
heartH & 6e-37 & 6e-37 & 2e-10 & 0e+00 & 5e-05 & 9e-05 & 5e-33 & 5e-33 & 2e-10 & 2e-32 & 8e-06 & 8e-06 \\
heartS & 2e-34 & 2e-33 & 7e-11 & 6e-33 & 6e-05 & 5e-05 & 4e-33 & 5e-33 & 7e-11 & 1e-32 & 1e-05 & 3e-06 \\
hepati & 6e-34 & 0e+00 & 7e-05 & 5e-33 & 3e-05

As can be seen above, the variance across runs is practically 0 for both factual CE (CE) and counterfactual CE (CCE), illustrating that the method is stable by definition. The variance of explanations extracted using SHAP from calibrated models (S_C) is also practically 0. LIME on calibrated models (L_C) and both LIME (L_U) and SHAP (S_U) on uncalibrated models are clearly less stable. Note that the printed header labels only the xGB columns, whereas each row holds two groups of six values, one group per model (xGB and RF).

In [256]:
rob_rank = {}
rob_val = {}
rob_proba = []
rob_proba_va = []
# average_results = {}
for a in ['xGB','RF']:
    average_results[a+'_rob_ce'] = {}
    average_results[a+'_rob_cce'] = {}
    average_results[a+'_rob_lime'] = {}
    average_results[a+'_rob_lime_va'] = {}
    average_results[a+'_rob_shap'] = {}
    average_results[a+'_rob_shap_va'] = {}
    average_results[a+'_rob_proba'] = {}
    average_results[a+'_rob_proba_va'] = {}

n = results['test_size']
r = results['num_rep']

print('Robustness & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB \\\\')
print('Dataset & CE & CCE & L_C & S_C & C & L_U & S_U & U \\\\\n\\hline')
for d in np.sort([k for k in results.keys()]):
    if d in ['test_size', 'num_rep']:
        continue
    print(d, end='')
    algorithms = results[d].keys()
    for a in algorithms:
        robustness = results[d][a]['robustness']
        
        for key in ['ce', 'cce']:  
            # if key == 'cce':
            #     robustness = ce[d][a]['robustness']
            # else:
            #     robustness = results[d][a]['robustness']               
            ranks = []
            for j in range(n):
                rank = []
                for i in range(r):
                    rank.append(np.argsort(np.abs(robustness[key][i][j]['predict']))[-1:][0])
                ranks.append(rank)
            rob_rank[key] = st.mode(ranks, axis=1)[0] # Find most important feature per instance
            value = []
            for j in range(n):
                value.append([np.mean([robustness[key][i][j]['predict'][rob_rank[key][j]] for i in range(r)]), np.var([robustness[key][i][j]['predict'][rob_rank[key][j]] for i in range(r)])])
            rob_val[key] = value

        robustness = results[d][a]['robustness']
            
        for key in ['lime', 'lime_va', 'shap', 'shap_va']:    
            ranks = []
            for j in range(n):
                rank = []
                for i in range(r):
                    rank.append(np.argsort(np.abs(robustness[key][i][j]))[-1:][0])
                ranks.append(rank)
            rob_rank[key] = st.mode(ranks, axis=1)[0] # Find most important feature per instance
            value = []
            for j in range(n):
                value.append([np.mean([robustness[key][i][j][rob_rank[key][j]] for i in range(r)]), np.var([robustness[key][i][j][rob_rank[key][j]] for i in range(r)])])
            rob_val[key] = value 
        
        for inst in range(n):
            rob_proba.append(np.var([robustness['proba'][j][inst] for j in range(r)]))
            rob_proba_va.append(np.var([robustness['proba_va'][j][inst] for j in range(r)]))
        
        for key in ['ce', 'cce', 'lime', 'lime_va', 'shap', 'shap_va']:
            average_results[a+'_rob_'+key][d] = np.mean([t[1] for t in rob_val[key]])
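        # Note: rob_proba and rob_proba_va are never reset, so these means
        # accumulate over every dataset and model processed so far.
        average_results[a+'_rob_proba'][d] = np.mean(rob_proba)
        average_results[a+'_rob_proba_va'][d] = np.mean(rob_proba_va)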
        # print(f'{np.mean([t[1] if t[1] > 1e-20 else 0 for t in rob_val["ce"]]):.1e} & {np.mean([t[1] if t[1] > 1e-20 else 0 for t in rob_val["cce"]]):.1e} & ',end='')
        
        for key in ['ce', 'cce', 'lime_va', 'shap_va', 'proba_va', 'lime', 'shap', 'proba']:
            print(f' & {average_results[a+"_rob_"+key][d]:.3f}',end='')
    print(' \\\\')
print('\\hline\nAverage', end='')
for a in algorithms:
    for key in ['ce', 'cce', 'lime_va', 'shap_va', 'proba_va', 'lime', 'shap', 'proba']:
        print(f' & {np.mean([v for v in average_results[a+"_rob_"+key].values()]):.3f}',end='')
print(' \\\\')

Robustness & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB \\
Dataset & CE & CCE & L_C & S_C & C & L_U & S_U & U \\
\hline
colic & 0.017 & 0.017 & 0.001 & 0.000 & 0.000 & 0.006 & 0.002 & 0.015 & 0.017 & 0.017 & 0.001 & 0.000 & 0.000 & 0.001 & 0.000 & 0.009 \\
creditA & 0.023 & 0.022 & 0.000 & 0.000 & 0.000 & 0.002 & 0.003 & 0.010 & 0.015 & 0.015 & 0.000 & 0.000 & 0.000 & 0.000 & 0.001 & 0.009 \\
diabetes & 0.017 & 0.009 & 0.000 & 0.000 & 0.001 & 0.004 & 0.005 & 0.014 & 0.015 & 0.007 & 0.000 & 0.000 & 0.001 & 0.001 & 0.001 & 0.012 \\
german & 0.003 & 0.003 & 0.006 & 0.002 & 0.001 & 0.006 & 0.006 & 0.014 & 0.005 & 0.004 & 0.002 & 0.001 & 0.001 & 0.001 & 0.001 & 0.014 \\
haberman & 0.011 & 0.009 & 0.000 & 0.000 & 0.002 & 0.008 & 0.011 & 0.016 & 0.011 & 0.009 & 0.000 & 0.000 & 0.003 & 0.004 & 0.004 & 0.016 \\
heartC & 0.014 & 0.014 & 0.001 & 0.001 & 0.003 & 0.006 & 0.004 & 0.017 & 0.012 & 0.011 & 0.001 & 0.001 & 0.002 & 0.001 & 0.001 & 0.016 \\
heartH & 0.016 & 0.015 & 0.171 & 0.002 & 0.00

Here, the picture is somewhat different, as CE and CCE have higher variability in their feature weights than LIME and SHAP. Each row lists the eight columns once per model (xGB and RF). The results for CE and CCE are clearly lower than those of the uncalibrated model (U) from which they have been extracted, and slightly higher than those of the VA-calibrated model (C). While these results could be interpreted as indicating low robustness, we argue that the experiment shows that the method updates its feature weights in accordance with how much the underlying model changes its predictions. A similar pattern can be seen when comparing LIME and SHAP explanations from calibrated vs. uncalibrated models: explanations of calibrated models, whose predictions themselves vary less, show low variability, and vice versa.
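The baseline columns C and U report how much the probability estimates themselves vary between runs. A minimal sketch of that quantity for a single dataset and model is shown below. It assumes the estimates are stored as an array of shape `(runs, instances)`, and the name `mean_probability_variance` and the toy numbers are hypothetical. Unlike the cell above, where `rob_proba` and `rob_proba_va` are shared across datasets, this sketch uses only the runs passed to it.

```python
import numpy as np

def mean_probability_variance(proba_runs):
    """proba_runs: array of shape (runs, instances) with probability estimates
    for the same test instances across repeated runs."""
    proba_runs = np.asarray(proba_runs, dtype=float)
    # Variance over runs for each instance, then the mean over instances
    return proba_runs.var(axis=0).mean()

# Toy example (hypothetical values): 3 runs, 2 test instances
proba_runs = [[0.5, 0.2],
              [0.5, 0.4],
              [0.5, 0.6]]
print(mean_probability_variance(proba_runs))
```

Comparing this value with the explanation variance for the same dataset and model is one way to read the tables: feature weights that vary about as much as the predictions are tracking the model rather than adding noise of their own.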