# Calibrated Explanations for Binary Classification
## Ablation Analysis

Author: Tuwe Löfström (tuwe.lofstrom@ju.se)  
Copyright 2023 Tuwe Löfström  
License: BSD 3 clause
Sources:
1. ["Calibrated Explanations: with Uncertainty Information and Counterfactuals"](https://arxiv.org/abs/2305.02305) by [Helena Löfström](https://github.com/Moffran), [Tuwe Löfström](https://github.com/tuvelofstrom), Ulf Johansson, and Cecilia Sönströd.

### 1 Import packages

In [110]:
%load_ext autoreload
%autoreload 2



In [111]:
import pickle
import pandas as pd
import numpy as np
from scipy import stats as st

### 2 Import results from the pickled result file

In [112]:
with open('results_ablation.pkl', 'rb') as f:
    results = pickle.load(f)
data_characteristics = {'colic': 60, 
                        'creditA': 43, 
                        'diabetes': 9, 
                        'german': 28, 
                        'haberman': 4,
                        'heartC': 23,
                        'heartH': 21,
                        'heartS': 14,
                        'hepati': 20,
                        'iono': 34,
                        'je4042': 9,
                        'je4243': 9, 
                        'kc1': 22,
                        'kc2': 22,
                        'kc3': 40,
                        'liver': 7,
                        'pc1req': 9,
                        'pc4': 38,
                        'sonar': 61,
                        'spect': 23,
                        'spectf': 45,
                        'transfusion': 5,
                        'ttt': 28,
                        'vote': 17,
                        'wbc': 10,}
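The cells that follow index `results` in a consistent way, which suggests the nested layout sketched below. The layout is inferred from the indexing used in this notebook rather than from the experiment script. The helper `make_toy_results`, the dataset name `toy_dataset`, the assumption that `test_size` is stored as a fraction, and all numbers are hypothetical and only serve to make the structure concrete.

```python
import numpy as np

def make_toy_results(n_instances=3, n_features=4, seed=0):
    """Build a tiny stand-in for the unpickled `results` dict (layout is an assumption)."""
    rng = np.random.default_rng(seed)
    cal_sizes = [0.1, 0.2]
    percentiles = [[50], [25, 50, 75]]
    toy = {'test_size': 0.1,  # assumed to be a fraction
           'calibration_sizes': cal_sizes,
           'sample_percentiles': percentiles}
    toy['toy_dataset'] = {}
    for model in ['xGB', 'RF']:
        ablation, timer = {}, {}
        for key in ['ce', 'cce']:
            ablation[key], timer[key] = {}, {}
            for cal in cal_sizes:
                ablation[key][cal], timer[key][cal] = {}, {}
                for p in percentiles:
                    # One repetition, holding one dict of feature weights per test instance.
                    run = [{'predict': rng.normal(size=n_features)}
                           for _ in range(n_instances)]
                    ablation[key][cal][str(p)] = [run]
                    # One total runtime (in seconds) per repetition.
                    timer[key][cal][str(p)] = [rng.uniform(0.1, 1.0)]
        toy['toy_dataset'][model] = {'ablation': ablation, 'timer': timer}
    return toy

toy = make_toy_results()
# dataset -> model -> 'ablation' -> ce/cce -> calibration size -> percentiles -> repetition -> instance
weights = toy['toy_dataset']['xGB']['ablation']['ce'][0.1]['[25, 50, 75]'][0][0]['predict']
print(weights.shape)
```

Note that the percentile setting is used as a string key (`str(p)`), while the calibration size is used as a key directly, mirroring the lookups in the cells below.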

### 3 Ablation analysis
The ablation analysis evaluates how the algorithm is affected by the calibration size and by the number of percentiles sampled for numerical features. It uses a setup similar to the stability experiment, with the following changes:
* The number of percentiles sampled for numerical features is varied between 1, 2, 3 (default), 4, and 9.
* The calibration size is varied between 10%, 20% and 40% of the data not used for testing.
* Test size is fixed to 10% of the data. 
* Only one repetition per percentile and calibration size is used.

All experiments were run on 25 datasets. See `Classification_Experiment_Ablation.py` for details on the experiment.
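The experimental grid described above can be sketched as follows. This is a minimal illustration, not the code in `Classification_Experiment_Ablation.py`: the helpers `evenly_spaced_percentiles` and `split_indices` are hypothetical, and the evenly spaced scheme is an assumption chosen because it reproduces the stated default `[25, 50, 75]` for three percentiles. Only the settings 1 to 4 are enumerated here; larger settings follow the same pattern.

```python
import numpy as np

def evenly_spaced_percentiles(k):
    """Return k percentiles that split [0, 100] into k + 1 equal parts (assumed scheme)."""
    return [round(100 * i / (k + 1)) for i in range(1, k + 1)]

def split_indices(n_rows, test_frac, cal_frac, seed=0):
    """Shuffle row indices into test, calibration, and proper-training parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    n_test = int(round(test_frac * n_rows))
    # The calibration fraction applies to the rows not used for testing.
    n_cal = int(round(cal_frac * (n_rows - n_test)))
    return idx[:n_test], idx[n_test:n_test + n_cal], idx[n_test + n_cal:]

# One run per (calibration size, percentile list) combination.
grid = [(cal, evenly_spaced_percentiles(k))
        for cal in (0.1, 0.2, 0.4)
        for k in (1, 2, 3, 4)]

test_idx, cal_idx, train_idx = split_indices(n_rows=300, test_frac=0.1, cal_frac=0.2)
print(len(grid), len(test_idx), len(cal_idx), len(train_idx))
```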

The tabulated results are mean variances, reported per calibration size or per percentile sampling. The variance is measured per instance, over the runs that share the same calibration size (or percentile sampling), on the feature importance weight of the most influential feature. The most influential feature is defined as the feature that most often has the highest absolute feature importance weight. The mean variance is the average of these per-instance variances over the entire test set. The most influential feature is used because it is the feature most likely to be used in a decision and also the feature with the greatest expected variation, since its weights have the highest absolute values.
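To make the metric concrete, here is a small worked sketch on a toy array. The function name `mean_variance_of_top_feature` and the array layout `(n_runs, n_instances, n_features)` are assumptions made for this illustration; the cells below compute the same kind of quantity directly from the nested `results` dictionary.

```python
import numpy as np

def mean_variance_of_top_feature(weights):
    """Mean over instances of the across-run variance of the top feature's weight.

    `weights` has shape (n_runs, n_instances, n_features).
    """
    weights = np.asarray(weights, dtype=float)
    n_runs, n_instances, n_features = weights.shape
    # Feature with the largest absolute weight, per run and instance.
    top_per_run = np.argmax(np.abs(weights), axis=2)
    variances = np.empty(n_instances)
    for j in range(n_instances):
        # Most frequent top feature across runs (ties go to the lowest index).
        counts = np.bincount(top_per_run[:, j], minlength=n_features)
        top = np.argmax(counts)
        # Population variance (ddof=0), as np.var computes by default.
        variances[j] = np.var(weights[:, j, top])
    return variances.mean()

toy_weights = np.array([
    [[0.5, 0.1], [0.0, -0.4]],   # run 1: two instances, two features
    [[0.3, 0.2], [0.1, -0.2]],   # run 2
])
print(mean_variance_of_top_feature(toy_weights))
```

In this toy case the top feature is feature 0 for the first instance (weights 0.5 and 0.3) and feature 1 for the second (weights -0.4 and -0.2), so both per-instance variances are approximately 0.01, as is their mean.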

#### 3.1 Calibration Size
The first table shows results per calibration size. Since different sampling sizes may give different results for numerical features, the mean variance is only expected to be 0 for categorical-only datasets. The results are printed as a LaTeX table.
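The table rows are assembled with repeated `print(..., end='')` calls. As a compact illustration of the row format, a hypothetical helper `latex_row` (not part of the notebook) could build one row in a single call:

```python
def latex_row(label, values, fmt='.1e'):
    """Format one LaTeX table row: the label, then each value, joined by ' & '."""
    cells = [label] + [format(v, fmt) for v in values]
    return ' & '.join(cells) + ' \\\\'

row = latex_row('haberman', [0.0015, 0.0024, 0.0019])
print(row)
```

The default format `.1e` matches the scientific notation used for the calibration size table.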

In [113]:
ranking = {}
val = {}
average_results = {}
n = results['test_size']
cal_sizes = results['calibration_sizes']
perc_samples = results['sample_percentiles']

for a in ['xGB', 'RF']:
    for cal in cal_sizes:
        average_results[a+'_'+str(cal)+'_ce'] = []
        average_results[a+'_'+str(cal)+'_cce'] = []

print(' & xGB & xGB & xGB & xGB & xGB & xGB & RF & RF & RF & RF & RF & RF \\\\\nCalibration Size', end='')
for i in range(2):
    for key in ['ce', 'cce']:  
        for cal in cal_sizes:
            print(f' & {cal}',end='')
print('\\\\')
print('Dataset & CE & CE & CE & CCE & CCE & CCE & CE & CE & CE & CCE & CCE & CCE \\\\\n\\hline')
for d in np.sort([k for k in results.keys()]):
    if d in ['test_size', 'calibration_sizes', 'sample_percentiles']:
        continue
    print(d, end='')
    for a in results[d]:
        ablation = results[d][a]['ablation']
        
        for key in ['ce', 'cce']:    
            n = len(ablation[key][cal_sizes[0]][str(perc_samples[0])][0])
            ranks = []
            values = []
            for j in range(n):
                rank = []
                value = []
                for cal in ablation[key]:
                    for p in ablation[key][cal]:
                        rank.append(np.argsort(np.abs(ablation[key][cal][p][0][j]['predict']))[-1:][0])
                ranks.append(rank)
                values.append(value)
            # Most important feature per instance; keepdims=True (SciPy >= 1.9) keeps the 2-D shape indexed below
            ranking[key] = st.mode(ranks, axis=1, keepdims=True)[0]
            # print(ranking[key])
            
            value = []
            for cal in ablation[key]: 
                print(' & ', end='')  
                for j in range(n):
                    values = [ablation[key][cal][p][0][j]['predict'][ranking[key][j][0]] for p in ablation[key][cal]]
                    value.append([np.mean(values), np.var(values)])
                val[key] = value 

                res = np.mean([t[1] for t in val[key]])
                average_results[a+'_'+str(cal)+'_'+key].append(res)
                print(f'{res:.1e}',end='')
    print(' \\\\')
print('\\hline\nAverage', end='')
for a in ['xGB', 'RF']:
    for key in ['ce', 'cce']:  
        for cal in ablation[key]:
            print(' & ', end='')
            print(f'{np.mean(average_results[a+"_"+str(cal)+"_"+key]):.1e}',end='')
print(' \\\\')
# df = pd.DataFrame.from_dict(average_results, orient='index')
# display (df)


 & xGB & xGB & xGB & xGB & xGB & xGB & RF & RF & RF & RF & RF & RF \\
Calibration Size & 0.1 & 0.2 & 0.4 & 0.1 & 0.2 & 0.4 & 0.1 & 0.2 & 0.4 & 0.1 & 0.2 & 0.4\\
Dataset & CE & CE & CE & CCE & CCE & CCE & CE & CE & CE & CCE & CCE & CCE \\
\hline
haberman & 1.5e-03 & 2.4e-03 & 1.9e-03 & 1.5e-03 & 2.4e-03 & 1.9e-03 & 7.3e-04 & 2.1e-03 & 2.1e-03 & 7.3e-04 & 2.1e-03 & 2.1e-03 \\
heartC & 1.3e-06 & 4.0e-05 & 5.1e-05 & 1.3e-06 & 4.0e-05 & 5.1e-05 & 1.2e-05 & 3.8e-05 & 3.4e-05 & 1.2e-05 & 3.8e-05 & 3.4e-05 \\
heartH & 9.1e-05 & 8.8e-05 & 7.9e-05 & 9.1e-05 & 8.8e-05 & 7.9e-05 & 3.7e-07 & 9.7e-06 & 7.2e-06 & 3.7e-07 & 9.7e-06 & 7.2e-06 \\
heartS & 1.5e-04 & 2.2e-04 & 2.1e-04 & 1.5e-04 & 2.2e-04 & 2.1e-04 & 1.3e-05 & 2.3e-05 & 1.5e-05 & 1.3e-05 & 2.3e-05 & 1.5e-05 \\
hepati & 7.6e-05 & 4.6e-05 & 3.6e-05 & 7.6e-05 & 4.6e-05 & 3.6e-05 & 7.0e-05 & 5.4e-05 & 4.9e-05 & 7.0e-05 & 5.4e-05 & 4.9e-05 \\
je4243 & 9.8e-05 & 2.8e-04 & 2.3e-04 & 9.8e-05 & 2.8e-04 & 2.3e-04 & 3.1e-04 & 4.9e-04 & 3.9e-04 & 3.1e

The most interesting observation from the results above is that the difference in mean variance between calibration sizes is fairly small. This indicates that the calibration size does not have a large impact on the feature importance weights. In fact, a smaller calibration set even tends to have a lower mean variance.

#### 3.2 Percentile Sampling
Below are the results per percentile sampling. The results are printed as a LaTeX table.

In [114]:
ranking = {}
val = {}
average_results = {}
n = results['test_size']
cal_sizes = results['calibration_sizes']
perc_samples = results['sample_percentiles']

for a in ['xGB', 'RF']:
    for p in perc_samples:
        average_results[a+'_'+str(p)+'_ce'] = []
        average_results[a+'_'+str(p)+'_cce'] = []

# Header rows: one column per (model, explanation type, percentile setting)
print(' & ' + ' & '.join(['xGB'] * 2 * len(perc_samples) + ['RF'] * 2 * len(perc_samples)) + ' \\\\\nSample Size', end='')
for i in range(2):
    for key in ['ce', 'cce']:  
        for p in perc_samples:
            print(f' & {str(len(p))}',end='')
print('\\\\')
print('Dataset & ' + ' & '.join((['CE'] * len(perc_samples) + ['CCE'] * len(perc_samples)) * 2) + ' \\\\\n\\hline')
for d in np.sort([k for k in results.keys()]):
    if d in ['test_size', 'calibration_sizes', 'sample_percentiles']:
        continue
    print(d, end='')
    for a in results[d]:
        ablation = results[d][a]['ablation']
        
        for key in ['ce', 'cce']:    
            n = len(ablation[key][cal_sizes[0]][str(perc_samples[0])][0])
            ranks = []
            values = []
            for j in range(n):
                rank = []
                value = []
                for cal in ablation[key]:
                    for p in ablation[key][cal]:
                        rank.append(np.argsort(np.abs(ablation[key][cal][p][0][j]['predict']))[-1:][0])
                ranks.append(rank)
                values.append(value)
            # Most important feature per instance; keepdims=True (SciPy >= 1.9) keeps the 2-D shape indexed below
            ranking[key] = st.mode(ranks, axis=1, keepdims=True)[0]
            # print(ranking[key])
            
            value = []
            for p in ablation[key][cal]: 
                print(' & ', end='')  
                for j in range(n):
                    values = [ablation[key][cal][p][0][j]['predict'][ranking[key][j][0]] for p in ablation[key][cal]]
                    value.append([np.mean(values), np.var(values)])
                val[key] = value 

                res = np.mean([t[1] for t in val[key]]) # mean of instance variance
                average_results[a+'_'+p+'_'+key].append(res)
                print(f'{res:.2}',end='')
    print(' \\\\')
print('\\hline\nAverage', end='')
for a in ['xGB', 'RF']:
    for key in ['ce', 'cce']:  
        for p in ablation[key][cal]:
            print(' & ', end='')
            print(f'{np.mean(average_results[a+"_"+p+"_"+key]):.2}',end='')
print(' \\\\')
# df = pd.DataFrame.from_dict(average_results, orient='index')
# display (df)


 & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & RF & RF & RF & RF & RF & RF & RF & RF & RF & RF \\
Sample Size & 1 & 2 & 3 & 4 & 9 & 1 & 2 & 3 & 4 & 9 & 1 & 2 & 3 & 4 & 9 & 1 & 2 & 3 & 4 & 9\\
Dataset & CE & CE & CE & CE & CE & CCE & CCE & CCE & CCE & CCE & CE & CE & CE & CE & CE & CCE & CCE & CCE & CCE & CCE \\
\hline
haberman & 0.00084 & 0.00084 & 0.00084 & 0.00084 & 0.00084 & 0.00084 & 0.00084 & 0.00084 & 0.00084 & 0.00084 & 0.002 & 0.002 & 0.002 & 0.002 & 0.002 & 0.002 & 0.002 & 0.002 & 0.002 & 0.002 \\
heartC & 7.3e-05 & 7.3e-05 & 7.3e-05 & 7.3e-05 & 7.3e-05 & 7.3e-05 & 7.3e-05 & 7.3e-05 & 7.3e-05 & 7.3e-05 & 2.6e-05 & 2.6e-05 & 2.6e-05 & 2.6e-05 & 2.6e-05 & 2.6e-05 & 2.6e-05 & 2.6e-05 & 2.6e-05 & 2.6e-05 \\
heartH & 6.1e-05 & 6.1e-05 & 6.1e-05 & 6.1e-05 & 6.1e-05 & 6.1e-05 & 6.1e-05 & 6.1e-05 & 6.1e-05 & 6.1e-05 & 2e-06 & 2e-06 & 2e-06 & 2e-06 & 2e-06 & 2e-06 & 2e-06 & 2e-06 & 2e-06 & 2e-06 \\
heartS & 0.00019 & 0.00019 & 0.00019 & 0.00019 & 0.00019 & 0.00019 & 0.00019 & 0.00019 & 0.00019 & 0.00019 & 7.5e-07 & 7.5e-07 & 7.5e-07 & 7.5e-07 & 7.5e-07 & 7

Although the mean variance differs somewhat when the percentile sampling is varied, the difference can be attributed solely to the underlying ML algorithm. This indicates that the percentile sampling does not have a large impact on the feature importance weights.

### 4 Computing time
Now, let's look at the time taken to compute the explanations. The tabulated runtimes are the average time in seconds per instance. First, detailed results per combination of calibration size and percentile sampling are shown; then results aggregated per calibration size and per percentile sampling are shown separately. The results are printed as LaTeX tables.
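The stored timings are divided by the number of test instances to obtain a per-instance runtime. A minimal sketch of that normalization is shown below. The helper `time_per_instance` and the stand-in `dummy_explain` are hypothetical; how the experiment script actually records its timings is not shown in this notebook.

```python
import time

def time_per_instance(explain_fn, X_test):
    """Run explain_fn once on X_test and return elapsed seconds per instance."""
    start = time.perf_counter()
    explain_fn(X_test)
    elapsed = time.perf_counter() - start
    return elapsed / len(X_test)

def dummy_explain(X):
    # Hypothetical stand-in for an explanation call; it only sums each row.
    return [sum(row) for row in X]

X_toy = [[0.1, 0.2, 0.3]] * 50
seconds = time_per_instance(dummy_explain, X_toy)
print(f'{seconds:.2e} s per instance')
```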

In [116]:
timer = []
n = results['test_size']
cal_sizes = results['calibration_sizes']
perc_samples = results['sample_percentiles']
average_time = {}
average_time['num_features'] = []
print('Runtime & ', end='')
for a in ['xGB', 'RF']:
    for key in ['ce', 'cce']:            
        for cal in ablation[key]:
            for p in perc_samples:
                average_time[a+"_"+key+"_"+str(cal)+"_"+str(p)] = []
                print(f'{a+"_"+key+"_"+str(cal)+"_"+str(len(p))} & ',end='')
print(' \\\\\n\\hline')

for d in np.sort([k for k in results.keys()]):
    if d in ['test_size', 'calibration_sizes', 'sample_percentiles']:
        continue
    print(d, end='')
    for a in results[d]:
        n = len(results[d][a]['ablation']['ce'][cal_sizes[0]][str(perc_samples[0])][0])
        for key in ['ce', 'cce']:  
            a_time = results[d][a]['timer']
            for cal in ablation[key]:
                for p in ablation[key][cal]:          
                    print(' & ', end='')
                    res = np.mean([t/n for t in a_time[key][cal][p]])
                    average_time[a+"_"+key+"_"+str(cal)+"_"+p].append(res)
                    average_time['num_features'].append(data_characteristics[d])
                    print(f'{res:.2f}',end='')
    print(f' & {data_characteristics[d]}', end='')
    print(' \\\\')
print('\\hline\nAverage', end='')
for a in ['xGB', 'RF']:
    for key in ['ce', 'cce']: 
            for cal in ablation[key]:
                for p in ablation[key][cal]:           
                    print(' & ', end='')
                    print(f'{np.mean(average_time[a+"_"+key+"_"+str(cal)+"_"+p]):.2f}',end='')
print(f' & {np.mean(average_time["num_features"]):.1f}', end='')
print(' \\\\')


print('\nRuntime Calibrations Sizes & ', end='')
for a in ['xGB', 'RF']:
    for key in ['ce', 'cce']:            
        for cal in ablation[key]:
            print(f'{a+"_"+key+"_"+str(cal)} & ',end='')
print(' \\\\\n\\hline\nAverage', end='')
for a in ['xGB', 'RF']:
    for key in ['ce', 'cce']: 
        for cal in ablation[key]:      
            print(' & ', end='')
            print(f'{np.mean([average_time[a+"_"+key+"_"+str(cal)+"_"+p] for p in ablation[key][cal]]):.2f}',end='')
print(' \\\\')

print('\nRuntime Sample Sizes & ', end='')
for a in ['xGB', 'RF']:
    for key in ['ce', 'cce']:            
        for p in perc_samples:
            print(f'{a+"_"+key+"_"+str(len(p))} & ',end='')
print(' \\\\\n\\hline\nAverage', end='')
for a in ['xGB', 'RF']:
    for key in ['ce', 'cce']:           
        for p in perc_samples:     
            print(' & ', end='')
            print(f'{np.mean([average_time[a+"_"+key+"_"+str(cal)+"_"+str(p)] for cal in ablation[key]]):.2f}',end='')
print(' \\\\')

Runtime & xGB_ce_0.1_1 & xGB_ce_0.1_2 & xGB_ce_0.1_3 & xGB_ce_0.1_4 & xGB_ce_0.1_9 & xGB_ce_0.2_1 & xGB_ce_0.2_2 & xGB_ce_0.2_3 & xGB_ce_0.2_4 & xGB_ce_0.2_9 & xGB_ce_0.4_1 & xGB_ce_0.4_2 & xGB_ce_0.4_3 & xGB_ce_0.4_4 & xGB_ce_0.4_9 & xGB_cce_0.1_1 & xGB_cce_0.1_2 & xGB_cce_0.1_3 & xGB_cce_0.1_4 & xGB_cce_0.1_9 & xGB_cce_0.2_1 & xGB_cce_0.2_2 & xGB_cce_0.2_3 & xGB_cce_0.2_4 & xGB_cce_0.2_9 & xGB_cce_0.4_1 & xGB_cce_0.4_2 & xGB_cce_0.4_3 & xGB_cce_0.4_4 & xGB_cce_0.4_9 & RF_ce_0.1_1 & RF_ce_0.1_2 & RF_ce_0.1_3 & RF_ce_0.1_4 & RF_ce_0.1_9 & RF_ce_0.2_1 & RF_ce_0.2_2 & RF_ce_0.2_3 & RF_ce_0.2_4 & RF_ce_0.2_9 & RF_ce_0.4_1 & RF_ce_0.4_2 & RF_ce_0.4_3 & RF_ce_0.4_4 & RF_ce_0.4_9 & RF_cce_0.1_1 & RF_cce_0.1_2 & RF_cce_0.1_3 & RF_cce_0.1_4 & RF_cce_0.1_9 & RF_cce_0.2_1 & RF_cce_0.2_2 & RF_cce_0.2_3 & RF_cce_0.2_4 & RF_cce_0.2_9 & RF_cce_0.4_1 & RF_cce_0.4_2 & RF_cce_0.4_3 & RF_cce_0.4_4 & RF_cce_0.4_9 &  \\
\hline
haberman & 0.02 & 0.04 & 0.05 & 0.07 & 0.15 & 0.02 & 0.04 & 0.05 & 0.07 & 0.15 

The runtime results are as expected. The observations are summarized below:
* The runtime increases with the number of percentiles sampled for numerical features.
* The runtime increases with the calibration size, although the difference in runtime is fairly small.
* CE is faster than CCE, as expected. The reason is that CCE generally requires more calculations than CE, at least for numerical features.
* The runtime tends to increase with the number of features, although not linearly. This is because categorical features with many categories are more expensive to compute, at least as long as the sampling size is small. Consequently, the number of categorical features together with the number of categories per feature is more important than the total number of features, especially when the sampling size (which only affects numerical features) is small.
* The greatest difference in runtime can be attributed to the underlying model, indicating that the choice of model has a great impact on runtime.
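Comparisons like the ones listed above amount to averaging the per-setting runtimes over one factor at a time. The sketch below does this for labels of the form `model_kind_calsize_npercentiles`, the same pattern as the column names in the runtime table. The function `summarize_runtimes`, the model names `A` and `B`, and all numbers are made up for illustration and are not results from the experiment.

```python
from collections import defaultdict
from statistics import mean

FACTORS = ('model', 'kind', 'cal_size', 'n_percentiles')

def summarize_runtimes(avg_seconds):
    """Average runtimes by each factor encoded in labels such as 'A_ce_0.1_3'."""
    groups = {factor: defaultdict(list) for factor in FACTORS}
    for label, seconds in avg_seconds.items():
        # Each label splits into exactly one level per factor.
        for factor, level in zip(FACTORS, label.split('_')):
            groups[factor][level].append(seconds)
    return {factor: {level: mean(vals) for level, vals in levels.items()}
            for factor, levels in groups.items()}

# Made-up numbers, for illustration only.
toy_times = {'A_ce_0.1_1': 0.02, 'A_ce_0.1_3': 0.05,
             'A_cce_0.1_1': 0.03, 'A_cce_0.1_3': 0.07,
             'B_ce_0.1_1': 0.20, 'B_ce_0.1_3': 0.50,
             'B_cce_0.1_1': 0.30, 'B_cce_0.1_3': 0.70}
summary = summarize_runtimes(toy_times)
print(summary['model'])
print(summary['n_percentiles'])
```

The factor levels are kept as strings (for example `'0.1'` and `'3'`), since they are read directly from the labels.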
### 5 Conclusion
Consequently, the individual parameter that influences runtime the most is the number of percentiles sampled for numerical features. As this tends to have a very small impact on the feature importance weights, it may be worth considering a lower default number of percentiles for numerical features (the current default is 3: [25, 50, 75]).
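To illustrate what a smaller percentile setting means in terms of sampled values, the sketch below compares the default `[25, 50, 75]` with a single median on a toy feature. The helper `sampled_values` is hypothetical and only wraps `np.percentile`; how the library uses these values internally is not shown here.

```python
import numpy as np

def sampled_values(feature_values, percentiles=(25, 50, 75)):
    """Return the values of a numerical feature at the requested percentiles."""
    return np.percentile(feature_values, percentiles)

x = np.arange(1, 102)  # toy feature: the integers 1..101
default = sampled_values(x)
reduced = sampled_values(x, percentiles=(50,))
print(len(default), len(reduced))
```

With the default setting three values are sampled per numerical feature, while the reduced setting samples only one.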