# Calibrated Explanations for Binary Classification
## Ablation Analysis

Author: Tuwe Löfström (tuwe.lofstrom@ju.se)  
Copyright 2023 Tuwe Löfström  
License: BSD 3 clause
Sources:
1. ["Calibrated Explanations: with Uncertainty Information and Counterfactuals"](https://arxiv.org/abs/2305.02305) by [Helena Löfström](https://github.com/Moffran), [Tuwe Löfström](https://github.com/tuvelofstrom), Ulf Johansson, and Cecilia Sönströd.

### 1 Import packages

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pickle
import pandas as pd
import numpy as np
from scipy import stats as st

### 2 Import results from the pickled result file

In [3]:
with open('results_ablation.pkl', 'rb') as f:
    results = pickle.load(f)
data_characteristics = {'colic': 60, 
                        'creditA': 43, 
                        'diabetes': 9, 
                        'german': 28, 
                        'haberman': 4, 
                        'heartC': 23,
                        'heartH': 21,
                        'heartS': 14,
                        'hepati': 20,
                        'iono': 34,
                        'je4042': 9,
                        'je4243': 9, 
                        'kc1': 22,
                        'kc2': 22,
                        'kc3': 40,
                        'liver': 7,
                        'pc1req': 9,
                        'pc4': 38,
                        'sonar': 61,
                        'spect': 23,
                        'spectf': 45,
                        'transfusion': 5,
                        'ttt': 28,
                        'vote': 17,
                        'wbc': 10,}
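
The cells below repeatedly skip the three metadata keys when looping over `results`. A small helper can make that explicit. This is only a sketch: `dataset_names` and `META_KEYS` are hypothetical names, and the toy dictionary merely mimics the top-level key layout implied by the code in this notebook, with placeholder values.

```python
# Metadata keys stored alongside the per-dataset entries in the result dictionary.
META_KEYS = ('test_size', 'calibration_sizes', 'sample_percentiles')

def dataset_names(results):
    """Return the dataset names in sorted order, skipping the metadata keys."""
    return sorted(k for k in results if k not in META_KEYS)

# Toy stand-in for the unpickled dictionary (all values are placeholders).
toy_results = {
    'test_size': 0.1,
    'calibration_sizes': [0.1, 0.2, 0.4],
    'sample_percentiles': [[50], [25, 50, 75]],
    'wbc': {},
    'colic': {},
}
names = dataset_names(toy_results)
print(names)
```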

### 3 Ablation analysis
The ablation analysis evaluates how the algorithm is affected by the calibration size and the number of percentiles sampled for numerical features. It uses a setup similar to the stability experiment, with the following changes:
* The number of percentiles sampled for numerical features is varied between 1, 2, 3 (default), 4, and 9. The sets of percentiles used are: [50], [33, 67], [25, 50, 75], [20, 40, 60, 80], [10, 20, 30, 40, 50, 60, 70, 80, 90]
* The calibration size is varied between 10%, 20% and 40% of the data not used for testing.
* Test size is fixed to 10% of the data. 
* Only one repetition per percentile and calibration size is used.

Everything was run on 25 datasets. See `Classification_Experiment_Ablation.py` for details on the experiment.
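
To make the setup concrete, the sketch below enumerates the combinations of calibration size and percentile set listed above and shows which values each percentile set would pick from a numerical feature. `np.percentile` is used for illustration only; how the library selects the sampled values internally is not shown in this notebook, and the feature values are synthetic.

```python
import itertools
import numpy as np

calibration_sizes = [0.1, 0.2, 0.4]
sample_percentiles = [[50], [33, 67], [25, 50, 75], [20, 40, 60, 80],
                      [10, 20, 30, 40, 50, 60, 70, 80, 90]]

# Every (calibration size, percentile set) pair evaluated in the ablation.
grid = list(itertools.product(calibration_sizes, sample_percentiles))
print(len(grid))

rng = np.random.default_rng(0)
feature = rng.normal(size=200)  # synthetic numerical feature (assumption)
for percentiles in sample_percentiles:
    sampled = np.percentile(feature, percentiles)
    print(len(percentiles), np.round(sampled, 2))
```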

The tabulated results are the mean variance measured per calibration size or percentile sampling. The variance is computed per instance, over the runs sharing the same calibration size or percentile sampling, on the feature importance weight of the most influential feature, defined as the feature that most often has the highest absolute feature importance weight. The variance is then averaged over the entire test set. The most influential feature is used because it is the feature most likely to be used in a decision, and also the feature with the greatest expected variation (as a consequence of its weights having the highest absolute values).
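
The metric described above can be sketched on synthetic data. The array `weights`, with shape (runs, instances, features), is a hypothetical stand-in for the per-instance `'predict'` weights stored in the result file, and the sketch uses a single group of runs, whereas the cells below pick the most influential feature over all runs and then compute the variance within each calibration size or percentile sampling.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(42)
n_runs, n_instances, n_features = 5, 20, 4
# Synthetic feature weights: feature 2 dominates, with small run-to-run noise.
base = np.array([0.05, -0.1, 0.4, 0.02])
weights = base + rng.normal(scale=0.01, size=(n_runs, n_instances, n_features))

# Index of the feature with the highest absolute weight, per run and instance.
top = np.argmax(np.abs(weights), axis=2)            # shape (n_runs, n_instances)
# Most influential feature per instance: the index that wins most often across runs.
most_influential = st.mode(top, axis=0)[0].ravel()  # shape (n_instances,)

# Weight of that feature in every run, then variance across runs per instance,
# averaged over all instances in the (synthetic) test set.
picked = weights[:, np.arange(n_instances), most_influential]  # (n_runs, n_instances)
mean_variance = np.var(picked, axis=0).mean()
print(f'{mean_variance:.2}')
```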

#### 3.1 Calibration Size
First is a table with results per calibration size. Since different sampling sizes may give different results for numerical features, the mean variance is only expected to be 0 for categorical-only datasets. The results are printed as a LaTeX table.

In [4]:
ranking = {}
val = {}
average_results = {}
n = results['test_size']
cal_sizes = results['calibration_sizes']
perc_samples = results['sample_percentiles']

for a in ['xGB', 'RF']:
    for cal in cal_sizes:
        average_results[a+'_'+str(cal)+'_ce'] = []
        average_results[a+'_'+str(cal)+'_cce'] = []

print(' & xGB & xGB & xGB & xGB & xGB & xGB & RF & RF & RF & RF & RF & RF \\\\\nCalibration Size', end='')
for i in range(2):
    for key in ['ce', 'cce']:  
        for cal in cal_sizes:
            print(f' & {cal}',end='')
print('\\\\')
print('Dataset & CE & CE & CE & CCE & CCE & CCE & CE & CE & CE & CCE & CCE & CCE \\\\\n\\hline')
for d in np.sort([k for k in results.keys()]):
    if d in ['test_size', 'calibration_sizes', 'sample_percentiles']:
        continue
    print(d, end='')
    for a in results[d]:
        ablation = results[d][a]['ablation']
        
        for key in ['ce', 'cce']:    
            n = len(ablation[key][cal_sizes[0]][str(perc_samples[0])][0])
            ranks = []
            values = []
            for j in range(n):
                rank = []
                value = []
                for cal in ablation[key]:
                    for p in ablation[key][cal]:
                        rank.append(np.argsort(np.abs(ablation[key][cal][p][0][j]['predict']))[-1:][0])
                ranks.append(rank)
                values.append(value)
            ranking[key] = st.mode(ranks, axis=1)[0] # Find most important feature per instance
            # print(ranking[key])
            
            value = []
            for cal in ablation[key]: 
                print(' & ', end='')  
                for j in range(n):
                    values = [ablation[key][cal][p][0][j]['predict'][ranking[key][j]] for p in ablation[key][cal]]
                    value.append([np.mean(values), np.var(values)])
                val[key] = value 

                res = np.mean([t[1] for t in val[key]])
                average_results[a+'_'+str(cal)+'_'+key].append(res)
                print(f'{res:.2}',end='')
    print(' \\\\')
print('\\hline\nAverage', end='')
for a in ['xGB', 'RF']:
    for key in ['ce', 'cce']:  
        for cal in ablation[key]:
            print(' & ', end='')
            print(f'{np.mean(average_results[a+"_"+str(cal)+"_"+key]):.2}',end='')
print(' \\\\')
# df = pd.DataFrame.from_dict(average_results, orient='index')
# display (df)


 & xGB & xGB & xGB & xGB & xGB & xGB & RF & RF & RF & RF & RF & RF \\
Calibration Size & 0.1 & 0.2 & 0.4 & 0.1 & 0.2 & 0.4 & 0.1 & 0.2 & 0.4 & 0.1 & 0.2 & 0.4\\
Dataset & CE & CE & CE & CCE & CCE & CCE & CE & CE & CE & CCE & CCE & CCE \\
\hline
colic & 2.4e-05 & 1.8e-05 & 2.7e-05 & 2.4e-05 & 1.8e-05 & 2.7e-05 & 6.9e-08 & 1.1e-06 & 7.4e-07 & 6.9e-08 & 1.1e-06 & 7.4e-07 \\
creditA & 0.00056 & 0.00043 & 0.0004 & 0.00056 & 0.00043 & 0.0004 & 2.4e-05 & 4.1e-05 & 6e-05 & 2.4e-05 & 4.1e-05 & 6e-05 \\
diabetes & 0.00024 & 0.00024 & 0.00027 & 0.00024 & 0.00024 & 0.00027 & 0.00037 & 0.00032 & 0.0003 & 0.00037 & 0.00032 & 0.0003 \\
german & 2.2e-05 & 1.1e-05 & 2.2e-05 & 2.2e-05 & 1.1e-05 & 2.2e-05 & 5.1e-06 & 2.6e-06 & 1.8e-05 & 5.1e-06 & 2.6e-06 & 1.8e-05 \\
haberman & 0.0014 & 0.00084 & 0.00062 & 0.0014 & 0.00084 & 0.00062 & 0.00034 & 0.00094 & 0.00083 & 0.00034 & 0.00094 & 0.00083 \\
heartC & 7.6e-06 & 6.4e-05 & 4.7e-05 & 7.6e-06 & 6.4e-05 & 4.7e-05 & 2.7e-07 & 1.5e-05 & 1.4e-05 & 2.7e-07 & 1.

The most interesting observation from the results above is that the difference in mean variance between the calibration sizes is fairly small. This indicates that the calibration size does not have a large impact on the feature importance weights. In fact, a smaller calibration set even tends to have a lower mean variance.

#### 3.2 Percentile Sampling
Below are the results per percentile sampling. The results are printed as a LaTeX table.

In [5]:
ranking = {}
val = {}
average_results = {}
n = results['test_size']
cal_sizes = results['calibration_sizes']
perc_samples = results['sample_percentiles']

for a in ['xGB', 'RF']:
    for p in perc_samples:
        average_results[a+'_'+str(p)+'_ce'] = []
        average_results[a+'_'+str(p)+'_cce'] = []

print(' & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & RF & RF & RF & RF & RF & RF & RF & RF & RF & RF \\\\\nSample Size', end='')
for i in range(2):
    for key in ['ce', 'cce']:  
        for p in perc_samples:
            print(f' & {str(len(p))}',end='')
print('\\\\')
print('Dataset & CE & CE & CE & CE & CE & CCE & CCE & CCE & CCE & CCE & CE & CE & CE & CE & CE & CCE & CCE & CCE & CCE & CCE \\\\\n\\hline')
for d in np.sort([k for k in results.keys()]):
    if d in ['test_size', 'calibration_sizes', 'sample_percentiles']:
        continue
    print(d, end='')
    for a in results[d]:
        ablation = results[d][a]['ablation']
        
        for key in ['ce', 'cce']:    
            n = len(ablation[key][cal_sizes[0]][str(perc_samples[0])][0])
            ranks = []
            values = []
            for j in range(n):
                rank = []
                value = []
                for cal in ablation[key]:
                    for p in ablation[key][cal]:
                        rank.append(np.argsort(np.abs(ablation[key][cal][p][0][j]['predict']))[-1:][0])
                ranks.append(rank)
                values.append(value)
            ranking[key] = st.mode(ranks, axis=1)[0] # Find most important feature per instance
            # print(ranking[key])
            
            value = []
            for p in ablation[key][cal]: 
                print(' & ', end='')  
                for j in range(n):
                    values = [ablation[key][cal][p][0][j]['predict'][ranking[key][j]] for cal in ablation[key]]
                    value.append([np.mean(values), np.var(values)])
                val[key] = value 

                res = np.mean([t[1] for t in val[key]]) # mean of instance variance
                average_results[a+'_'+p+'_'+key].append(res)
                print(f'{res:.2}',end='')
    print(' \\\\')
print('\\hline\nAverage', end='')
for a in ['xGB', 'RF']:
    for key in ['ce', 'cce']:  
        for p in ablation[key][cal]:
            print(' & ', end='')
            print(f'{np.mean(average_results[a+"_"+p+"_"+key]):.2}',end='')
print(' \\\\')
# df = pd.DataFrame.from_dict(average_results, orient='index')
# display (df)


 & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & RF & RF & RF & RF & RF & RF & RF & RF & RF & RF \\
Sample Size & 1 & 2 & 3 & 4 & 9 & 1 & 2 & 3 & 4 & 9 & 1 & 2 & 3 & 4 & 9 & 1 & 2 & 3 & 4 & 9\\
Dataset & CE & CE & CE & CE & CE & CCE & CCE & CCE & CCE & CCE & CE & CE & CE & CE & CE & CCE & CCE & CCE & CCE & CCE \\
\hline
colic & 0.0026 & 0.0027 & 0.0027 & 0.0027 & 0.0027 & 0.0026 & 0.0027 & 0.0027 & 0.0027 & 0.0027 & 0.0043 & 0.0043 & 0.0043 & 0.0043 & 0.0043 & 0.0043 & 0.0043 & 0.0043 & 0.0043 & 0.0043 \\
creditA & 0.007 & 0.0066 & 0.0064 & 0.0062 & 0.0061 & 0.007 & 0.0066 & 0.0064 & 0.0062 & 0.0061 & 0.003 & 0.003 & 0.003 & 0.003 & 0.0029 & 0.003 & 0.003 & 0.003 & 0.003 & 0.0029 \\
diabetes & 0.0045 & 0.0044 & 0.0043 & 0.0042 & 0.0041 & 0.0045 & 0.0044 & 0.0043 & 0.0042 & 0.0041 & 0.0096 & 0.0097 & 0.0096 & 0.0095 & 0.0094 & 0.0096 & 0.0097 & 0.0096 & 0.0095 & 0.0094 \\
german & 0.0011 & 0.0011 & 0.0011 & 0.0011 & 0.0011 & 0.0011 & 0.0011 & 0.0011 & 0.0011 & 0.0011 & 0.0

Even if there is some difference in the mean variance when varying the percentile sampling, the difference can mainly be attributed to the difference in underlying ML algorithm. A larger set of percentiles tends to reduce the mean variance, which is expected. However, the tendency is not very strong.

### 4 Computing time
Now, let's look at the runtime taken to compute the explanations. The tabulated runtimes are the average time in seconds per instance. First, detailed results per calibration size and percentile sampling are shown, then results aggregated per calibration size and per percentile sampling are shown separately. The results are printed as LaTeX tables.

In [14]:
timer = []
n = results['test_size']
cal_sizes = results['calibration_sizes']
perc_samples = results['sample_percentiles']
average_time = {}
average_time['num_features'] = []
for a in ['xGB', 'RF']:
    print('Learner & ', end='')
    for key in ['ce', 'cce']:            
        for cal in ablation[key]:
            for p in perc_samples:
                average_time[a+"_"+key+"_"+str(cal)+"_"+str(p)] = []
                print(f'{a} & ',end='')
    print(' \\\\')
    print('CE & ', end='')
    for key in ['ce', 'cce']:            
        for cal in ablation[key]:
            for p in perc_samples:
                print(f'{key} & ',end='')
    print(' \\\\')
    print('Cal.Size & ', end='')
    for key in ['ce', 'cce']:            
        for cal in ablation[key]:
            for p in perc_samples:
                print(f'{str(cal)} & ',end='')
    print(' \\\\')
    print('Sample Size & ', end='')
    for key in ['ce', 'cce']:            
        for cal in ablation[key]:
            for p in perc_samples:
                print(f'{str(len(p))} & ',end='')
    print(' \\\\\n\\hline')

    for d in np.sort([k for k in results.keys()]):
        if d in ['test_size', 'calibration_sizes', 'sample_percentiles']:
            continue
        print(d, end='')
    # for a in results[d]:
        n = len(results[d][a]['ablation']['ce'][cal_sizes[0]][str(perc_samples[0])][0])
        for key in ['ce', 'cce']:  
            a_time = results[d][a]['timer']
            for cal in ablation[key]:
                for p in ablation[key][cal]:          
                    print(' & ', end='')
                    res = np.mean([t/n for t in a_time[key][cal][p]])
                    average_time[a+"_"+key+"_"+str(cal)+"_"+p].append(res)
                    average_time['num_features'].append(data_characteristics[d])
                    print(f'{res:.2f}',end='')
        print(f' & {data_characteristics[d]}', end='')
        print(' \\\\')
    print('\\hline\nAverage', end='')

    for key in ['ce', 'cce']: 
            for cal in ablation[key]:
                for p in ablation[key][cal]:           
                    print(' & ', end='')
                    print(f'{np.mean(average_time[a+"_"+key+"_"+str(cal)+"_"+p]):.2f}',end='')
    print(f' & {np.mean(average_time["num_features"]):.1f}', end='')
    print(' \\\\\n\n')

Learner & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB & xGB &  \\
CE & ce & ce & ce & ce & ce & ce & ce & ce & ce & ce & ce & ce & ce & ce & ce & cce & cce & cce & cce & cce & cce & cce & cce & cce & cce & cce & cce & cce & cce & cce &  \\
Cal.Size & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.2 & 0.2 & 0.2 & 0.2 & 0.2 & 0.4 & 0.4 & 0.4 & 0.4 & 0.4 & 0.1 & 0.1 & 0.1 & 0.1 & 0.1 & 0.2 & 0.2 & 0.2 & 0.2 & 0.2 & 0.4 & 0.4 & 0.4 & 0.4 & 0.4 &  \\
Sample Size & 1 & 2 & 3 & 4 & 9 & 1 & 2 & 3 & 4 & 9 & 1 & 2 & 3 & 4 & 9 & 1 & 2 & 3 & 4 & 9 & 1 & 2 & 3 & 4 & 9 & 1 & 2 & 3 & 4 & 9 &  \\
\hline
colic & 0.12 & 0.13 & 0.15 & 0.16 & 0.24 & 0.13 & 0.14 & 0.15 & 0.17 & 0.24 & 0.13 & 0.14 & 0.16 & 0.18 & 0.25 & 0.15 & 0.17 & 0.18 & 0.21 & 0.32 & 0.15 & 0.17 & 0.19 & 0.22 & 0.31 & 0.16 & 0.17 & 0.20 & 0.22 & 0.34 & 60 \\
creditA & 0.16 & 0.18 & 0.20 & 0.21 & 0.31 & 0.20 & 0.22 & 0.24 & 0.

In [7]:
print('\nRuntime Calibrations Sizes & ', end='')
for a in ['xGB', 'RF']:
    for key in ['ce', 'cce']:            
        for cal in ablation[key]:
            print(f'{a+"_"+key+"_"+str(cal)} & ',end='')
print(' \\\\\n\\hline\nAverage', end='')
for a in ['xGB', 'RF']:
    for key in ['ce', 'cce']: 
        for cal in ablation[key]:      
            print(' & ', end='')
            print(f'{np.mean([average_time[a+"_"+key+"_"+str(cal)+"_"+p] for p in ablation[key][cal]]):.2f}',end='')
print(' \\\\')

print('\nRuntime Sample Sizes & ', end='')
for a in ['xGB', 'RF']:
    for key in ['ce', 'cce']:            
        for p in perc_samples:
            print(f'{a+"_"+key+"_"+str(len(p))} & ',end='')
print(' \\\\\n\\hline\nAverage', end='')
for a in ['xGB', 'RF']:
    for key in ['ce', 'cce']:           
        for p in perc_samples:     
            print(' & ', end='')
            print(f'{np.mean([average_time[a+"_"+key+"_"+str(cal)+"_"+str(p)] for cal in ablation[key]]):.2f}',end='')
print(' \\\\')


Runtime Calibrations Sizes & xGB_ce_0.1 & xGB_ce_0.2 & xGB_ce_0.4 & xGB_cce_0.1 & xGB_cce_0.2 & xGB_cce_0.4 & RF_ce_0.1 & RF_ce_0.2 & RF_ce_0.4 & RF_cce_0.1 & RF_cce_0.2 & RF_cce_0.4 &  \\
\hline
Average & 0.11 & 0.11 & 0.11 & 0.20 & 0.21 & 0.22 & 0.11 & 0.11 & 0.11 & 0.20 & 0.21 & 0.22 \\

Runtime Sample Sizes & xGB_ce_1 & xGB_ce_2 & xGB_ce_3 & xGB_ce_4 & xGB_ce_9 & xGB_cce_1 & xGB_cce_2 & xGB_cce_3 & xGB_cce_4 & xGB_cce_9 & RF_ce_1 & RF_ce_2 & RF_ce_3 & RF_ce_4 & RF_ce_9 & RF_cce_1 & RF_cce_2 & RF_cce_3 & RF_cce_4 & RF_cce_9 &  \\
\hline
Average & 0.05 & 0.06 & 0.09 & 0.11 & 0.25 & 0.08 & 0.12 & 0.16 & 0.21 & 0.49 & 0.04 & 0.06 & 0.08 & 0.10 & 0.25 & 0.07 & 0.12 & 0.16 & 0.21 & 0.48 \\


The runtime results are as expected. The observations are summarized below:
* The runtime increases almost linearly with the number of percentiles sampled for numerical features.
* The runtime increases with the calibration size, even if the difference in runtime is fairly small.
* CE is faster than CCE, as expected. The reason is that CCE generally requires more calculations than CE, at least for numerical features.
* The runtime tends to increase with the number of features, even if the increase is not linear. This is because categorical features with many categories are more expensive to compute, at least as long as the sampling size is small. Consequently, the number of categorical features together with the number of categories per feature is more important than the total number of features, especially when the sampling size (only affecting numerical features) is small.
* A great part of the difference in runtime can be attributed to the underlying model, indicating that the choice of model used will have a great impact on runtime.
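
The near-linear growth with the number of percentiles can be checked against the aggregated averages printed above. The sketch below fits a straight line to the xGB CE averages per sample size (values copied from the "Runtime Sample Sizes" output). Using `np.polyfit` and R² as a summary is an illustrative choice, not part of the original experiment, and the inputs are rounded to two decimals.

```python
import numpy as np

sample_sizes = np.array([1, 2, 3, 4, 9])
xgb_ce_runtime = np.array([0.05, 0.06, 0.09, 0.11, 0.25])  # seconds per instance

# Least-squares straight line: runtime = slope * sample_size + intercept.
slope, intercept = np.polyfit(sample_sizes, xgb_ce_runtime, deg=1)
fitted = slope * sample_sizes + intercept

# Coefficient of determination as a rough measure of linearity.
ss_res = np.sum((xgb_ce_runtime - fitted) ** 2)
ss_tot = np.sum((xgb_ce_runtime - xgb_ce_runtime.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f'slope={slope:.3f} s per percentile, R^2={r_squared:.3f}')
```
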
### 5 Conclusion
The individual algorithmic parameter that influences runtime the most is the number of percentiles sampled for numerical features. As this parameter tends to have a fairly small impact on the feature importance, there is reason to consider decreasing the number of percentiles sampled by default for numerical features (currently the default is 3: [25, 50, 75]). Using only the median ([50]) would on average reduce the runtime by almost half.
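
The "almost half" estimate can be reproduced roughly from the "Runtime Sample Sizes" averages printed in section 4. The numbers below are copied from that output for sample sizes 3 and 1. Because they are rounded to two decimals, the resulting percentages are approximate.

```python
import numpy as np

# Average runtime (seconds per instance) with the default percentiles [25, 50, 75].
runtime_default = {'xGB_ce': 0.09, 'xGB_cce': 0.16, 'RF_ce': 0.08, 'RF_cce': 0.16}
# Average runtime (seconds per instance) with only the median [50].
runtime_median = {'xGB_ce': 0.05, 'xGB_cce': 0.08, 'RF_ce': 0.04, 'RF_cce': 0.07}

# Relative runtime reduction when going from three percentiles to one.
reduction = {k: 1 - runtime_median[k] / runtime_default[k] for k in runtime_default}
for setup, r in reduction.items():
    print(f'{setup}: {r:.0%}')
mean_reduction = float(np.mean(list(reduction.values())))
print(f'mean: {mean_reduction:.0%}')
```
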
#### Final note
The core algorithm was updated in early August 2024. The results reported here use the code committed in [chore: redirected __call__() to explain()](https://github.com/Moffran/calibrated_explanations/commit/068091186b6c383eafdae538b6b02defc78857b9). Average speedups range from 2x to 9x depending on setup, with substantially higher speedups of up to almost 40x for individual datasets and setups. The previous results are shown below:


Runtime & xGB_ce_0.1_1 & xGB_ce_0.1_2 & xGB_ce_0.1_3 & xGB_ce_0.1_4 & xGB_ce_0.1_9 & xGB_ce_0.2_1 & xGB_ce_0.2_2 & xGB_ce_0.2_3 & xGB_ce_0.2_4 & xGB_ce_0.2_9 & xGB_ce_0.4_1 & xGB_ce_0.4_2 & xGB_ce_0.4_3 & xGB_ce_0.4_4 & xGB_ce_0.4_9 & xGB_cce_0.1_1 & xGB_cce_0.1_2 & xGB_cce_0.1_3 & xGB_cce_0.1_4 & xGB_cce_0.1_9 & xGB_cce_0.2_1 & xGB_cce_0.2_2 & xGB_cce_0.2_3 & xGB_cce_0.2_4 & xGB_cce_0.2_9 & xGB_cce_0.4_1 & xGB_cce_0.4_2 & xGB_cce_0.4_3 & xGB_cce_0.4_4 & xGB_cce_0.4_9 & RF_ce_0.1_1 & RF_ce_0.1_2 & RF_ce_0.1_3 & RF_ce_0.1_4 & RF_ce_0.1_9 & RF_ce_0.2_1 & RF_ce_0.2_2 & RF_ce_0.2_3 & RF_ce_0.2_4 & RF_ce_0.2_9 & RF_ce_0.4_1 & RF_ce_0.4_2 & RF_ce_0.4_3 & RF_ce_0.4_4 & RF_ce_0.4_9 & RF_cce_0.1_1 & RF_cce_0.1_2 & RF_cce_0.1_3 & RF_cce_0.1_4 & RF_cce_0.1_9 & RF_cce_0.2_1 & RF_cce_0.2_2 & RF_cce_0.2_3 & RF_cce_0.2_4 & RF_cce_0.2_9 & RF_cce_0.4_1 & RF_cce_0.4_2 & RF_cce_0.4_3 & RF_cce_0.4_4 & RF_cce_0.4_9 &  \\
\hline
colic & 0.32 & 0.38 & 0.40 & 0.44 & 0.61 & 0.35 & 0.38 & 0.43 & 0.47 & 0.66 & 0.36 & 0.39 & 0.43 & 0.47 & 0.66 & 0.35 & 0.39 & 0.45 & 0.50 & 0.74 & 0.37 & 0.42 & 0.47 & 0.52 & 0.76 & 0.38 & 0.42 & 0.48 & 0.52 & 0.77 & 0.81 & 0.89 & 0.96 & 1.03 & 1.53 & 0.79 & 0.88 & 0.97 & 1.06 & 1.49 & 0.79 & 0.87 & 0.96 & 1.06 & 1.48 & 0.84 & 0.99 & 1.05 & 1.24 & 1.72 & 0.82 & 0.95 & 1.06 & 1.17 & 1.72 & 0.82 & 0.94 & 1.08 & 1.18 & 1.76 & 60 \\
creditA & 0.26 & 0.29 & 0.33 & 0.36 & 0.51 & 0.30 & 0.34 & 0.37 & 0.40 & 0.56 & 0.32 & 0.36 & 0.39 & 0.42 & 0.56 & 0.28 & 0.31 & 0.34 & 0.38 & 0.55 & 0.31 & 0.36 & 0.40 & 0.44 & 0.64 & 0.34 & 0.38 & 0.42 & 0.45 & 0.64 & 0.64 & 0.79 & 0.86 & 0.94 & 1.31 & 0.79 & 0.86 & 0.96 & 1.04 & 1.43 & 0.86 & 0.94 & 0.99 & 1.08 & 1.47 & 0.73 & 0.81 & 0.94 & 1.00 & 1.47 & 0.83 & 0.91 & 1.02 & 1.13 & 1.66 & 0.90 & 0.98 & 1.08 & 1.21 & 1.70 & 43 \\
diabetes & 0.05 & 0.10 & 0.15 & 0.19 & 0.42 & 0.06 & 0.10 & 0.15 & 0.19 & 0.43 & 0.06 & 0.11 & 0.16 & 0.20 & 0.43 & 0.07 & 0.13 & 0.18 & 0.24 & 0.51 & 0.08 & 0.14 & 0.20 & 0.26 & 0.61 & 0.08 & 0.15 & 0.21 & 0.26 & 0.56 & 0.13 & 0.27 & 0.36 & 0.48 & 1.04 & 0.13 & 0.25 & 0.36 & 0.50 & 1.05 & 0.13 & 0.24 & 0.36 & 0.47 & 1.03 & 0.20 & 0.32 & 0.47 & 0.62 & 1.29 & 0.18 & 0.33 & 0.48 & 0.67 & 1.35 & 0.17 & 0.32 & 0.47 & 0.60 & 1.35 & 9 \\
german & 0.16 & 0.17 & 0.18 & 0.18 & 0.20 & 0.16 & 0.17 & 0.18 & 0.18 & 0.21 & 0.17 & 0.17 & 0.19 & 0.19 & 0.21 & 0.16 & 0.17 & 0.18 & 0.18 & 0.20 & 0.16 & 0.17 & 0.17 & 0.18 & 0.21 & 0.17 & 0.18 & 0.19 & 0.19 & 0.22 & 0.42 & 0.45 & 0.46 & 0.50 & 0.54 & 0.41 & 0.43 & 0.45 & 0.47 & 0.54 & 0.42 & 0.44 & 0.44 & 0.47 & 0.59 & 0.43 & 0.46 & 0.47 & 0.49 & 0.53 & 0.42 & 0.43 & 0.44 & 0.45 & 0.53 & 0.43 & 0.43 & 0.44 & 0.50 & 0.60 & 28 \\
haberman & 0.02 & 0.04 & 0.05 & 0.07 & 0.15 & 0.02 & 0.04 & 0.05 & 0.07 & 0.15 & 0.02 & 0.04 & 0.06 & 0.07 & 0.15 & 0.02 & 0.04 & 0.05 & 0.07 & 0.15 & 0.02 & 0.04 & 0.06 & 0.07 & 0.15 & 0.02 & 0.04 & 0.06 & 0.07 & 0.15 & 0.05 & 0.09 & 0.13 & 0.18 & 0.39 & 0.05 & 0.09 & 0.14 & 0.18 & 0.38 & 0.05 & 0.09 & 0.13 & 0.18 & 0.38 & 0.05 & 0.09 & 0.13 & 0.17 & 0.39 & 0.05 & 0.09 & 0.13 & 0.18 & 0.38 & 0.05 & 0.09 & 0.13 & 0.17 & 0.38 & 4 \\
heartC & 0.12 & 0.15 & 0.18 & 0.21 & 0.34 & 0.14 & 0.16 & 0.19 & 0.22 & 0.35 & 0.14 & 0.17 & 0.20 & 0.22 & 0.36 & 0.14 & 0.17 & 0.20 & 0.24 & 0.41 & 0.15 & 0.18 & 0.22 & 0.26 & 0.44 & 0.16 & 0.19 & 0.23 & 0.27 & 0.46 & 0.32 & 0.39 & 0.47 & 0.53 & 0.85 & 0.35 & 0.40 & 0.47 & 0.56 & 0.86 & 0.34 & 0.41 & 0.47 & 0.55 & 0.89 & 0.34 & 0.44 & 0.53 & 0.61 & 1.08 & 0.37 & 0.45 & 0.54 & 0.62 & 1.08 & 0.38 & 0.47 & 0.56 & 0.68 & 1.12 & 23 \\
heartH & 0.10 & 0.13 & 0.15 & 0.18 & 0.30 & 0.12 & 0.15 & 0.17 & 0.20 & 0.31 & 0.12 & 0.15 & 0.17 & 0.20 & 0.32 & 0.11 & 0.14 & 0.18 & 0.20 & 0.36 & 0.13 & 0.16 & 0.20 & 0.23 & 0.40 & 0.13 & 0.17 & 0.20 & 0.24 & 0.43 & 0.27 & 0.33 & 0.40 & 0.45 & 0.77 & 0.29 & 0.36 & 0.42 & 0.49 & 0.79 & 0.29 & 0.36 & 0.43 & 0.47 & 0.79 & 0.29 & 0.37 & 0.45 & 0.53 & 0.90 & 0.32 & 0.40 & 0.50 & 0.58 & 1.01 & 0.31 & 0.42 & 0.49 & 0.58 & 1.04 & 21 \\
heartS & 0.09 & 0.12 & 0.15 & 0.18 & 0.31 & 0.10 & 0.13 & 0.15 & 0.18 & 0.32 & 0.10 & 0.13 & 0.16 & 0.18 & 0.32 & 0.10 & 0.13 & 0.18 & 0.21 & 0.40 & 0.11 & 0.15 & 0.18 & 0.22 & 0.39 & 0.11 & 0.15 & 0.19 & 0.23 & 0.42 & 0.23 & 0.30 & 0.37 & 0.44 & 0.77 & 0.24 & 0.31 & 0.38 & 0.45 & 0.77 & 0.24 & 0.31 & 0.37 & 0.44 & 0.78 & 0.25 & 0.35 & 0.44 & 0.53 & 0.98 & 0.26 & 0.35 & 0.45 & 0.53 & 0.95 & 0.27 & 0.36 & 0.45 & 0.54 & 1.00 & 14 \\
hepati & 0.10 & 0.13 & 0.16 & 0.19 & 0.35 & 0.12 & 0.15 & 0.18 & 0.21 & 0.36 & 0.12 & 0.15 & 0.19 & 0.22 & 0.38 & 0.11 & 0.15 & 0.18 & 0.24 & 0.42 & 0.13 & 0.17 & 0.21 & 0.25 & 0.44 & 0.13 & 0.18 & 0.22 & 0.26 & 0.47 & 0.26 & 0.35 & 0.41 & 0.53 & 0.92 & 0.30 & 0.37 & 0.44 & 0.49 & 0.89 & 0.28 & 0.35 & 0.44 & 0.53 & 0.93 & 0.28 & 0.38 & 0.48 & 0.60 & 1.09 & 0.32 & 0.43 & 0.51 & 0.59 & 1.07 & 0.30 & 0.42 & 0.52 & 0.63 & 1.14 & 20 \\
iono & 0.22 & 0.39 & 0.57 & 0.72 & 1.54 & 0.22 & 0.41 & 0.62 & 0.75 & 1.52 & 0.23 & 0.38 & 0.53 & 0.68 & 1.40 & 0.30 & 0.51 & 0.74 & 0.96 & 2.07 & 0.29 & 0.52 & 0.82 & 1.10 & 2.13 & 0.31 & 0.54 & 0.76 & 1.02 & 2.06 & 0.50 & 0.96 & 1.44 & 1.82 & 3.77 & 0.49 & 0.92 & 1.29 & 1.78 & 3.91 & 0.50 & 0.88 & 1.32 & 1.63 & 3.46 & 0.64 & 1.51 & 1.85 & 2.36 & 5.15 & 0.64 & 1.23 & 1.84 & 2.49 & 5.40 & 0.67 & 1.25 & 1.85 & 2.42 & 5.16 & 34 \\
je4042 & 0.06 & 0.10 & 0.13 & 0.17 & 0.35 & 0.06 & 0.10 & 0.13 & 0.17 & 0.35 & 0.07 & 0.11 & 0.15 & 0.20 & 0.38 & 0.06 & 0.10 & 0.13 & 0.17 & 0.35 & 0.06 & 0.10 & 0.13 & 0.17 & 0.36 & 0.07 & 0.12 & 0.15 & 0.20 & 0.42 & 0.17 & 0.27 & 0.37 & 0.46 & 0.96 & 0.16 & 0.29 & 0.39 & 0.50 & 1.01 & 0.18 & 0.29 & 0.40 & 0.52 & 1.02 & 0.17 & 0.27 & 0.36 & 0.47 & 1.00 & 0.16 & 0.28 & 0.38 & 0.48 & 1.05 & 0.18 & 0.30 & 0.40 & 0.51 & 1.04 & 9 \\
je4243 & 0.06 & 0.09 & 0.13 & 0.17 & 0.35 & 0.06 & 0.10 & 0.13 & 0.17 & 0.37 & 0.07 & 0.10 & 0.15 & 0.18 & 0.38 & 0.06 & 0.10 & 0.14 & 0.17 & 0.36 & 0.06 & 0.10 & 0.14 & 0.17 & 0.35 & 0.07 & 0.11 & 0.15 & 0.18 & 0.38 & 0.15 & 0.25 & 0.34 & 0.43 & 0.96 & 0.15 & 0.26 & 0.35 & 0.46 & 0.94 & 0.16 & 0.26 & 0.36 & 0.45 & 0.94 & 0.15 & 0.24 & 0.34 & 0.43 & 0.96 & 0.15 & 0.26 & 0.35 & 0.45 & 0.96 & 0.16 & 0.26 & 0.35 & 0.44 & 0.94 & 9 \\
kc1 & 0.14 & 0.26 & 0.38 & 0.51 & 1.07 & 0.15 & 0.28 & 0.37 & 0.50 & 1.06 & 0.14 & 0.26 & 0.36 & 0.48 & 1.04 & 0.18 & 0.34 & 0.48 & 0.63 & 1.34 & 0.18 & 0.33 & 0.46 & 0.61 & 1.28 & 0.17 & 0.31 & 0.45 & 0.58 & 1.29 & 0.32 & 0.60 & 0.91 & 1.18 & 2.57 & 0.31 & 0.59 & 0.89 & 1.17 & 2.50 & 0.32 & 0.59 & 0.87 & 1.16 & 2.69 & 0.41 & 0.79 & 1.14 & 1.49 & 3.22 & 0.40 & 0.75 & 1.09 & 1.45 & 3.08 & 0.38 & 0.72 & 1.07 & 1.33 & 3.12 & 22 \\
kc2 & 0.12 & 0.23 & 0.34 & 0.45 & 0.96 & 0.13 & 0.25 & 0.35 & 0.45 & 0.99 & 0.15 & 0.26 & 0.37 & 0.47 & 1.00 & 0.15 & 0.29 & 0.41 & 0.55 & 1.18 & 0.17 & 0.30 & 0.44 & 0.58 & 1.22 & 0.19 & 0.34 & 0.49 & 0.63 & 1.30 & 0.30 & 0.57 & 0.83 & 1.09 & 2.46 & 0.33 & 0.58 & 0.88 & 1.14 & 2.44 & 0.33 & 0.59 & 0.86 & 1.12 & 2.43 & 0.36 & 0.70 & 1.02 & 1.33 & 3.42 & 0.39 & 0.73 & 1.13 & 1.44 & 2.97 & 0.42 & 0.78 & 1.12 & 1.48 & 3.29 & 22 \\
kc3 & 0.24 & 0.46 & 0.66 & 0.86 & 1.85 & 0.25 & 0.46 & 0.66 & 0.87 & 1.82 & 0.26 & 0.46 & 0.67 & 0.85 & 1.83 & 0.30 & 0.56 & 0.81 & 1.04 & 2.22 & 0.30 & 0.55 & 0.79 & 1.00 & 2.13 & 0.31 & 0.57 & 0.81 & 1.04 & 2.18 & 0.58 & 1.10 & 1.53 & 2.09 & 4.48 & 0.60 & 1.09 & 1.56 & 2.02 & 4.41 & 0.59 & 1.11 & 1.52 & 1.98 & 4.25 & 0.70 & 1.31 & 1.96 & 2.50 & 5.57 & 0.68 & 1.28 & 1.81 & 2.37 & 5.13 & 0.72 & 1.29 & 1.82 & 2.38 & 5.11 & 40 \\
liver & 0.04 & 0.07 & 0.10 & 0.13 & 0.28 & 0.04 & 0.07 & 0.10 & 0.13 & 0.26 & 0.04 & 0.07 & 0.10 & 0.14 & 0.28 & 0.05 & 0.09 & 0.13 & 0.17 & 0.37 & 0.05 & 0.09 & 0.13 & 0.16 & 0.34 & 0.05 & 0.10 & 0.14 & 0.19 & 0.39 & 0.10 & 0.18 & 0.26 & 0.34 & 0.74 & 0.09 & 0.17 & 0.25 & 0.33 & 0.66 & 0.09 & 0.18 & 0.26 & 0.35 & 0.72 & 0.12 & 0.24 & 0.34 & 0.44 & 0.98 & 0.11 & 0.22 & 0.32 & 0.41 & 0.89 & 0.14 & 0.24 & 0.37 & 0.48 & 1.04 & 7 \\
pc1req & 0.07 & 0.07 & 0.07 & 0.08 & 0.10 & 0.07 & 0.07 & 0.08 & 0.09 & 0.11 & 0.08 & 0.09 & 0.10 & 0.10 & 0.13 & 0.06 & 0.07 & 0.07 & 0.07 & 0.10 & 0.07 & 0.08 & 0.08 & 0.09 & 0.11 & 0.08 & 0.08 & 0.10 & 0.10 & 0.14 & 0.16 & 0.18 & 0.19 & 0.20 & 0.27 & 0.18 & 0.19 & 0.21 & 0.22 & 0.30 & 0.20 & 0.21 & 0.23 & 0.24 & 0.31 & 0.17 & 0.17 & 0.19 & 0.20 & 0.26 & 0.18 & 0.19 & 0.22 & 0.23 & 0.30 & 0.20 & 0.21 & 0.23 & 0.25 & 0.31 & 9 \\
pc4 & 0.22 & 0.40 & 0.58 & 0.76 & 1.54 & 0.25 & 0.41 & 0.58 & 0.75 & 1.60 & 0.26 & 0.46 & 0.63 & 0.87 & 1.76 & 0.28 & 0.51 & 0.73 & 0.95 & 1.96 & 0.30 & 0.53 & 0.74 & 0.99 & 2.09 & 0.33 & 0.57 & 0.81 & 1.09 & 2.22 & 0.58 & 1.04 & 1.50 & 1.95 & 3.91 & 0.60 & 1.02 & 1.47 & 1.90 & 3.88 & 0.56 & 0.98 & 1.39 & 1.96 & 4.22 & 0.72 & 1.28 & 1.86 & 2.44 & 5.03 & 0.73 & 1.31 & 1.89 & 2.46 & 4.92 & 0.70 & 1.25 & 1.77 & 2.47 & 5.16 & 38 \\
sonar & 0.40 & 0.71 & 1.04 & 1.33 & 2.91 & 0.38 & 0.71 & 1.03 & 1.35 & 3.01 & 0.39 & 0.74 & 1.07 & 1.42 & 3.08 & 0.52 & 0.92 & 1.35 & 1.74 & 3.79 & 0.51 & 0.94 & 1.38 & 1.81 & 4.05 & 0.53 & 0.99 & 1.45 & 1.84 & 4.12 & 0.85 & 1.64 & 2.41 & 3.17 & 7.06 & 0.87 & 1.71 & 2.52 & 3.24 & 7.32 & 0.93 & 2.03 & 2.61 & 3.39 & 7.49 & 1.09 & 2.12 & 3.13 & 4.17 & 9.47 & 1.17 & 2.42 & 3.27 & 4.36 & 10.48 & 1.28 & 2.41 & 3.36 & 4.56 & 10.70 & 61 \\
spect & 0.12 & 0.13 & 0.13 & 0.13 & 0.13 & 0.14 & 0.13 & 0.14 & 0.14 & 0.13 & 0.13 & 0.14 & 0.14 & 0.13 & 0.13 & 0.12 & 0.12 & 0.13 & 0.13 & 0.14 & 0.13 & 0.14 & 0.14 & 0.13 & 0.13 & 0.14 & 0.13 & 0.13 & 0.14 & 0.13 & 0.31 & 0.32 & 0.32 & 0.32 & 0.32 & 0.31 & 0.33 & 0.31 & 0.31 & 0.31 & 0.31 & 0.31 & 0.32 & 0.33 & 0.32 & 0.33 & 0.34 & 0.33 & 0.31 & 0.31 & 0.31 & 0.31 & 0.31 & 0.31 & 0.31 & 0.30 & 0.32 & 0.31 & 0.32 & 0.32 & 23 \\
spectf & 0.27 & 0.51 & 0.75 & 0.98 & 2.18 & 0.30 & 0.55 & 0.80 & 1.06 & 2.35 & 0.31 & 0.58 & 0.80 & 1.08 & 2.33 & 0.27 & 0.52 & 0.74 & 0.99 & 2.24 & 0.30 & 0.56 & 0.79 & 1.07 & 2.32 & 0.31 & 0.56 & 0.81 & 1.05 & 2.26 & 0.65 & 1.23 & 1.87 & 2.62 & 6.05 & 0.64 & 1.31 & 1.92 & 2.43 & 5.66 & 0.65 & 1.31 & 1.98 & 2.63 & 5.75 & 0.64 & 1.24 & 1.87 & 2.44 & 5.97 & 0.64 & 1.30 & 1.86 & 2.46 & 5.79 & 0.70 & 1.35 & 2.07 & 2.56 & 5.84 & 45 \\
transfusion & 0.03 & 0.05 & 0.07 & 0.10 & 0.21 & 0.03 & 0.05 & 0.08 & 0.10 & 0.22 & 0.03 & 0.05 & 0.07 & 0.10 & 0.22 & 0.03 & 0.05 & 0.08 & 0.10 & 0.21 & 0.03 & 0.05 & 0.08 & 0.12 & 0.21 & 0.03 & 0.05 & 0.08 & 0.10 & 0.22 & 0.07 & 0.12 & 0.19 & 0.24 & 0.53 & 0.07 & 0.12 & 0.18 & 0.24 & 0.52 & 0.06 & 0.12 & 0.18 & 0.23 & 0.50 & 0.07 & 0.13 & 0.18 & 0.24 & 0.52 & 0.07 & 0.12 & 0.18 & 0.24 & 0.51 & 0.07 & 0.12 & 0.18 & 0.23 & 0.51 & 5 \\
ttt & 0.15 & 0.16 & 0.16 & 0.16 & 0.16 & 0.16 & 0.16 & 0.16 & 0.16 & 0.16 & 0.16 & 0.17 & 0.16 & 0.17 & 0.16 & 0.16 & 0.16 & 0.16 & 0.16 & 0.16 & 0.16 & 0.16 & 0.16 & 0.16 & 0.16 & 0.16 & 0.16 & 0.17 & 0.16 & 0.17 & 0.42 & 0.43 & 0.43 & 0.43 & 0.40 & 0.39 & 0.39 & 0.39 & 0.40 & 0.39 & 0.39 & 0.39 & 0.39 & 0.39 & 0.39 & 0.42 & 0.44 & 0.42 & 0.40 & 0.40 & 0.39 & 0.39 & 0.39 & 0.40 & 0.40 & 0.39 & 0.40 & 0.39 & 0.39 & 0.39 & 28 \\
vote & 0.09 & 0.09 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.09 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.10 & 0.23 & 0.23 & 0.23 & 0.23 & 0.24 & 0.23 & 0.23 & 0.23 & 0.23 & 0.23 & 0.23 & 0.23 & 0.23 & 0.23 & 0.23 & 0.23 & 0.23 & 0.23 & 0.24 & 0.24 & 0.23 & 0.23 & 0.23 & 0.23 & 0.23 & 0.22 & 0.22 & 0.23 & 0.23 & 0.23 & 17 \\
wbc & 0.07 & 0.11 & 0.15 & 0.18 & 0.35 & 0.07 & 0.11 & 0.14 & 0.17 & 0.35 & 0.07 & 0.11 & 0.14 & 0.18 & 0.35 & 0.07 & 0.12 & 0.16 & 0.19 & 0.37 & 0.08 & 0.13 & 0.17 & 0.22 & 0.44 & 0.09 & 0.14 & 0.18 & 0.23 & 0.44 & 0.17 & 0.27 & 0.38 & 0.45 & 0.91 & 0.16 & 0.26 & 0.35 & 0.44 & 0.90 & 0.17 & 0.28 & 0.37 & 0.48 & 0.92 & 0.18 & 0.30 & 0.40 & 0.48 & 0.97 & 0.19 & 0.32 & 0.43 & 0.57 & 1.14 & 0.20 & 0.34 & 0.45 & 0.59 & 1.09 & 10 \\
\hline

Summary:
Runtime Calibrations Sizes & xGB_ce_0.1 & xGB_ce_0.2 & xGB_ce_0.4 & xGB_cce_0.1 & xGB_cce_0.2 & xGB_cce_0.4 & RF_ce_0.1 & RF_ce_0.2 & RF_ce_0.4 & RF_cce_0.1 & RF_cce_0.2 & RF_cce_0.4 &  \\
\hline
Average & 0.34 & 0.35 & 0.36 & 0.40 & 0.41 & 0.43 & 0.84 & 0.85 & 0.86 & 1.00 & 1.01 & 1.03 \\

Runtime Sample Sizes & xGB_ce_1 & xGB_ce_2 & xGB_ce_3 & xGB_ce_4 & xGB_ce_9 & xGB_cce_1 & xGB_cce_2 & xGB_cce_3 & xGB_cce_4 & xGB_cce_9 & RF_ce_1 & RF_ce_2 & RF_ce_3 & RF_ce_4 & RF_ce_9 & RF_cce_1 & RF_cce_2 & RF_cce_3 & RF_cce_4 & RF_cce_9 &  \\
\hline
Average & 0.15 & 0.22 & 0.29 & 0.36 & 0.71 & 0.17 & 0.26 & 0.35 & 0.43 & 0.85 & 0.36 & 0.54 & 0.71 & 0.89 & 1.75 & 0.40 & 0.63 & 0.83 & 1.05 & 2.14 \\
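
The average speedups mentioned in the final note can be approximated by dividing the previous calibration-size averages by the updated ones from section 4. Both sets of numbers are copied from the printed summaries and rounded to two decimals, so the ratios are indicative only.

```python
import numpy as np

labels = ['xGB_ce', 'xGB_cce', 'RF_ce', 'RF_cce']
# Average runtime per calibration size (0.1, 0.2, 0.4), seconds per instance.
new = np.array([[0.11, 0.11, 0.11], [0.20, 0.21, 0.22],
                [0.11, 0.11, 0.11], [0.20, 0.21, 0.22]])
old = np.array([[0.34, 0.35, 0.36], [0.40, 0.41, 0.43],
                [0.84, 0.85, 0.86], [1.00, 1.01, 1.03]])

# Speedup factor of the updated algorithm over the previous one.
speedup = old / new
for label, row in zip(labels, speedup):
    print(label, np.round(row, 1))
```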