# Calibrated Explanations for Binary Classification
## Stability and Robustness

Author: Tuwe Löfström (tuwe.lofstrom@ju.se)  
Copyright 2023 Tuwe Löfström  
License: BSD 3-Clause  
Sources:
1. ["Calibrated Explanations: with Uncertainty Information and Counterfactuals"](https://doi.org/10.1016/j.eswa.2024.123154) by [Helena Löfström](https://github.com/Moffran), [Tuwe Löfström](https://github.com/tuvelofstrom), Ulf Johansson, and Cecilia Sönströd.

### 1 Import packages

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import pickle
import numpy as np
from scipy import stats as st

### 2 Import results from the pickled result file

In [4]:
with open('results_stab_rob.pkl', 'rb') as f:
    results = pickle.load(f)
data_characteristics = {'colic': 60, 
                        'creditA': 43, 
                        'diabetes': 9, 
                        'german': 28, 
                        'haberman': 4,
                        'heartC': 23,
                        'heartH': 21,
                        'heartS': 14,
                        'hepati': 20,
                        'iono': 34,
                        'je4042': 9,
                        'je4243': 9, 
                        'kc1': 22,
                        'kc2': 22,
                        'kc3': 40,
                        'liver': 7,
                        'pc1req': 9,
                        'pc4': 38,
                        'sonar': 61,
                        'spect': 23,
                        'spectf': 45,
                        'transfusion': 5,
                        'ttt': 28,
                        'vote': 17,
                        'wbc': 10,}

### 3 Stability and Robustness
Create a table with the robustness and stability results. The stability results stem from experiments where the same test set has been explained 30 times using the same model and calibration set. The only source of variation is the random seed. The robustness results stem from experiments where the training and calibration sets have been randomly resampled before a new model has been trained and explained. The experiment was run 30 times, and the test set was the same for all models. Robustness is measured in this way to avoid using perturbed instances that do not come from the same distribution as the test instances being explained. The probability estimates of each model were computed on the same test set, as a reference for the robustness results. The expectation is that a stable and robust explanation method should result in low variance in the feature importance weights.

Everything was run on 25 datasets. See the `Classification_Experiment_stab_rob.py` for details on the experiment.
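
The experiment script is not reproduced in this notebook. As a minimal sketch of the robustness design described above, the following generator keeps one test set fixed and redraws the training and calibration sets for each repetition. The helper name `resampled_splits`, the 50/50 division of the remaining data, and the use of reshuffling rather than bootstrapping are assumptions made for illustration, not details taken from the script.

```python
import numpy as np

def resampled_splits(n_samples, test_size, n_rep, seed=0):
    """Yield (train_idx, cal_idx, test_idx) with a fixed test set (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    test_idx = indices[:test_size]           # identical in every repetition
    remaining = indices[test_size:]
    for _ in range(n_rep):
        shuffled = rng.permutation(remaining)
        half = len(shuffled) // 2            # assumed 50/50 train/calibration split
        yield shuffled[:half], shuffled[half:], test_idx

splits = list(resampled_splits(n_samples=200, test_size=20, n_rep=30))
print(len(splits), len(splits[0][2]))
```

Because the test indices never change, any variation in the explanations across repetitions can be attributed to the resampled training and calibration data rather than to the instances being explained.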

The tabulated results are the mean variance for stability and robustness, measured over the 30 runs and 20 instances. The variance is measured per instance and computed over the 30 runs on the feature importance weight of the most influential feature, defined as the feature that most often has the highest absolute feature importance weight. The average variance is then computed over the 20 instances. The most influential feature is used since it is the feature most likely to be used in a decision, and also the feature with the greatest expected variation (as a consequence of its weights having the highest absolute values).

The results are printed as a LaTeX table.
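
The variance measure described above can be sketched on synthetic data. The function name `mean_top_feature_variance` and the array layout `(runs, instances, features)` are assumptions for illustration; the actual results are stored as nested lists and dictionaries. Ties for the most influential feature are resolved towards the lowest feature index here.

```python
import numpy as np

def mean_top_feature_variance(weights):
    """weights: array of shape (runs, instances, features) (assumed layout)."""
    weights = np.asarray(weights, dtype=float)
    n_runs, n_instances, n_features = weights.shape
    # Feature with the highest absolute weight, per run and instance.
    top = np.argmax(np.abs(weights), axis=2)           # shape (runs, instances)
    variances = np.empty(n_instances)
    for j in range(n_instances):
        # Most influential feature: the one that is most often on top.
        most_influential = np.bincount(top[:, j], minlength=n_features).argmax()
        # Variance of that feature's weight over the runs.
        variances[j] = np.var(weights[:, j, most_influential])
    return variances.mean()                             # average over instances

rng = np.random.default_rng(0)
base = rng.normal(size=(1, 20, 5))                      # one explanation per instance
stable = np.repeat(base, 30, axis=0)                    # identical across 30 runs
noisy = stable + rng.normal(scale=0.1, size=stable.shape)

stable_score = mean_top_feature_variance(stable)
noisy_score = mean_top_feature_variance(noisy)
print(f"{stable_score:.1e} {noisy_score:.1e}")
```

Identical explanations give a variance that is zero up to floating-point rounding, while explanations perturbed between runs give a clearly positive value.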

In [5]:
stab_rank = {}
stab_val = {}
average_results = {}
for a in ['xGB', 'RF']:
    average_results[a+'_stab_ce'] = []
    average_results[a+'_stab_cce'] = []

n = results['test_size']
r = results['num_rep']
print('Stability & xGB & xGB & RF & RF \\\\')
print('Dataset & CE & CCE & CE & CCE \\\\\n\\hline')
for d in np.sort([k for k in results.keys()]):
    if d in ['test_size', 'num_rep']:
        continue
    print(d, end='')
    for a in results[d]:
        print(' & ', end='')
        stability = results[d][a]['stability']
        
        for key in ['ce', 'cce']:    
            ranks = []
            values = []
            for j in range(n):
                rank = []
                value = []
                for i in range(r):
                    rank.append(np.argsort(np.abs(stability[key][i][j]['predict']))[-1:][0])
                ranks.append(rank)
                values.append(value)
            stab_rank[key] = st.mode(ranks, axis=1)[0] # Find most important feature per instance
            value = []
            for j in range(n):
                value.append([np.mean([stability[key][i][j]['predict'][stab_rank[key][j][0]] for i in range(r)]), np.var([stability[key][i][j]['predict'][stab_rank[key][j][0]] for i in range(r)])])
            stab_val[key] = value 
        
        average_results[a+'_stab_ce'].append(np.mean([t[1] for t in stab_val["ce"]]))
        average_results[a+'_stab_cce'].append(np.mean([t[1] for t in stab_val["cce"]]))
        # print(f'{np.mean([t[1] if t[1] > 1e-20 else 0 for t in stab_val["ce"]]):.1e} & {np.var([t[1] if t[1] > 1e-20 else 0 for t in stab_val["cce"]]):.1e} & ',end='')
        print(f'{np.mean([t[1] for t in stab_val["ce"]]):.1e} & {np.mean([t[1] for t in stab_val["cce"]]):.1e}',end='')
    print(' \\\\')
print('\\hline\nAverage', end='')
for a in ['xGB', 'RF']:
    print(' & ', end='')
    print(f'{np.mean(average_results[a+"_stab_ce"]):.1e} & {np.mean(average_results[a+"_stab_cce"]):.1e}',end='')
print(' \\\\')

Stability & xGB & xGB & RF & RF \\
Dataset & CE & CCE & CE & CCE \\
\hline
colic & 3.9e-35 & 3.9e-35 & 7.8e-33 & 7.8e-33 \\
creditA & 1.5e-34 & 1.5e-34 & 1.5e-32 & 1.5e-32 \\
diabetes & 1.2e-33 & 1.2e-33 & 5.0e-33 & 5.0e-33 \\
german & 1.2e-34 & 1.2e-34 & 9.6e-34 & 9.6e-34 \\
haberman & 5.2e-34 & 5.2e-34 & 1.2e-33 & 1.2e-33 \\
heartC & 0.0e+00 & 0.0e+00 & 2.7e-33 & 2.7e-33 \\
heartH & 2.6e-33 & 2.6e-33 & 4.9e-33 & 4.9e-33 \\
heartS & 1.6e-33 & 1.6e-33 & 4.3e-33 & 4.3e-33 \\
hepati & 1.9e-34 & 1.9e-34 & 5.2e-33 & 5.2e-33 \\
iono & 4.1e-35 & 4.1e-35 & 3.6e-33 & 3.6e-33 \\
je4042 & 1.5e-33 & 1.5e-33 & 6.4e-33 & 6.4e-33 \\
je4243 & 5.3e-34 & 5.3e-34 & 5.0e-33 & 5.0e-33 \\
kc1 & 5.4e-34 & 5.4e-34 & 1.7e-33 & 1.7e-33 \\
kc2 & 7.7e-34 & 7.7e-34 & 5.0e-34 & 5.0e-34 \\
kc3 & 1.4e-35 & 1.4e-35 & 5.7e-34 & 5.7e-34 \\
liver & 2.2e-33 & 2.2e-33 & 8.7e-33 & 8.7e-33 \\
pc1req & 7.7e-35 & 7.7e-35 & 1.8e-33 & 1.8e-33 \\
pc4 & 3.7e-34 & 3.7e-34 & 2.5e-33 & 2.5e-33 \\
sonar & 1.8e-34 & 1.8e-34 & 2.2e-33 

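As can be seen above, the variance is practically 0 for both factual CE (CE) and counterfactual CE (CCE); nonzero values on the order of 1e-33 are consistent with floating-point rounding. This illustrates that the method is stable by definition: with the model, calibration set and test set fixed, the random seed has no practical effect on the explanations.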

In [6]:
rob_rank = {}
rob_val = {}
rob_proba = []
average_results = {}
for a in ['xGB', 'RF']:
    average_results[a+'_rob_ce'] = []
    average_results[a+'_rob_cce'] = []
    average_results[a+'_rob_proba'] = []

n = results['test_size']
r = results['num_rep']

print('Robustness & xGB & xGB & xGB & RF & RF & RF \\\\')
print('Dataset & CE & CCE & Model & CE & CCE & Model \\\\\n\\hline')
for d in np.sort([k for k in results.keys()]):
    if d in ['test_size', 'num_rep']:
        continue
    print(d, end='')
    for a in results[d]:
        print(' & ', end='')
        robustness = results[d][a]['robustness']
        
        for key in ['ce', 'cce']:                
            ranks = []
            values = []
            for j in range(n):
                rank = []
                value = []
                for i in range(r):
                    rank.append(np.argsort(np.abs(robustness[key][i][j]['predict']))[-1:][0])
                ranks.append(rank)
                values.append(value)
            rob_rank[key] = st.mode(ranks, axis=1)[0] # Find most important feature per instance
            value = []
            for j in range(n):
                value.append([np.mean([robustness[key][i][j]['predict'][rob_rank[key][j][0]] for i in range(r)]), np.var([robustness[key][i][j]['predict'][rob_rank[key][j][0]] for i in range(r)])])
            rob_val[key] = value
        
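        # Note: rob_proba is never reset, so np.mean(rob_proba) below is a running
        # mean over all datasets and models processed so far.
        for inst in range(n):
            rob_proba.append(np.var([robustness['proba'][j][inst] for j in range(r)]))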
        
        average_results[a+'_rob_ce'].append(np.mean([t[1] for t in rob_val["ce"]]))
        average_results[a+'_rob_cce'].append(np.mean([t[1] for t in rob_val["cce"]]))
        average_results[a+'_rob_proba'].append(np.mean(rob_proba))
        # print(f'{np.mean([t[1] if t[1] > 1e-20 else 0 for t in rob_val["ce"]]):.1e} & {np.mean([t[1] if t[1] > 1e-20 else 0 for t in rob_val["cce"]]):.1e} & ',end='')
        print(f'{np.mean([t[1] for t in rob_val["ce"]]):.3f} & {np.mean([t[1] for t in rob_val["cce"]]):.3f} & ',end='')
        print(f'{np.mean(rob_proba):.3f}', end='')
    print(' \\\\')
print('\\hline\nAverage', end='')
for a in ['xGB', 'RF']:
    print(' & ', end='')
    print(f'{np.mean(average_results[a+"_rob_ce"]):.3f} & {np.mean(average_results[a+"_rob_cce"]):.3f} & ',end='')
    print(f'{np.mean(average_results[a+"_rob_proba"]):.3f}', end='')
print(' \\\\')

Robustness & xGB & xGB & xGB & RF & RF & RF \\
Dataset & CE & CCE & Model & CE & CCE & Model \\
\hline
colic & 0.015 & 0.015 & 0.015 & 0.017 & 0.017 & 0.010 \\
creditA & 0.022 & 0.022 & 0.011 & 0.015 & 0.015 & 0.009 \\
diabetes & 0.017 & 0.017 & 0.014 & 0.015 & 0.015 & 0.013 \\
german & 0.003 & 0.003 & 0.015 & 0.005 & 0.005 & 0.014 \\
haberman & 0.010 & 0.010 & 0.016 & 0.011 & 0.011 & 0.016 \\
heartC & 0.012 & 0.012 & 0.017 & 0.012 & 0.012 & 0.016 \\
heartH & 0.017 & 0.017 & 0.017 & 0.011 & 0.011 & 0.016 \\
heartS & 0.019 & 0.019 & 0.018 & 0.017 & 0.017 & 0.017 \\
hepati & 0.022 & 0.022 & 0.018 & 0.014 & 0.014 & 0.017 \\
iono & 0.028 & 0.028 & 0.017 & 0.013 & 0.013 & 0.016 \\
je4042 & 0.018 & 0.018 & 0.017 & 0.015 & 0.015 & 0.017 \\
je4243 & 0.010 & 0.010 & 0.018 & 0.010 & 0.010 & 0.017 \\
kc1 & 0.011 & 0.011 & 0.017 & 0.009 & 0.009 & 0.017 \\
kc2 & 0.018 & 0.018 & 0.018 & 0.007 & 0.007 & 0.017 \\
kc3 & 0.007 & 0.007 & 0.017 & 0.005 & 0.005 & 0.017 \\
liver & 0.025 & 0.025 & 0.017 & 0.

In [8]:
stab_rank = {}
stab_val = {}
average_results = {}
for a in ['xGB', 'RF']:
    average_results[a+'_stab_ce'] = []
    average_results[a+'_stab_cce'] = []
rob_rank = {}
rob_val = {}
rob_proba = []
for a in ['xGB', 'RF']:
    average_results[a+'_rob_ce'] = []
    average_results[a+'_rob_cce'] = []
    average_results[a+'_rob_proba'] = []

n = results['test_size']
r = results['num_rep']

print(' & xGB & xGB & RF & RF & xGB & xGB & xGB & RF & RF & RF \\\\')
print('Dataset & CE & CCE & CE & CCE & CE & CCE & Model & CE & CCE & Model \\\\\n\\hline')

for d in np.sort([k for k in results.keys()]):
    if d in ['test_size', 'num_rep']:
        continue
    print(d, end='')
    for a in results[d]:
        print(' & ', end='')
        stability = results[d][a]['stability']
        
        for key in ['ce', 'cce']:    
            ranks = []
            values = []
            for j in range(n):
                rank = []
                value = []
                for i in range(r):
                    rank.append(np.argsort(np.abs(stability[key][i][j]['predict']))[-1:][0])
                ranks.append(rank)
                values.append(value)
            stab_rank[key] = st.mode(ranks, axis=1)[0] # Find most important feature per instance
            value = []
            for j in range(n):
                value.append([np.mean([stability[key][i][j]['predict'][stab_rank[key][j][0]] for i in range(r)]), np.var([stability[key][i][j]['predict'][stab_rank[key][j][0]] for i in range(r)])])
            stab_val[key] = value 
        
        average_results[a+'_stab_ce'].append(np.mean([t[1] for t in stab_val["ce"]]))
        average_results[a+'_stab_cce'].append(np.mean([t[1] for t in stab_val["cce"]]))
        # print(f'{np.mean([t[1] if t[1] > 1e-20 else 0 for t in stab_val["ce"]]):.1e} & {np.var([t[1] if t[1] > 1e-20 else 0 for t in stab_val["cce"]]):.1e} & ',end='')
        print(f'{np.mean([t[1] for t in stab_val["ce"]]):.1e} & {np.mean([t[1] for t in stab_val["cce"]]):.1e}',end='')
        
    for a in results[d]:
        print(' & ', end='')
        robustness = results[d][a]['robustness']
        
        for key in ['ce', 'cce']:                
            ranks = []
            values = []
            for j in range(n):
                rank = []
                value = []
                for i in range(r):
                    rank.append(np.argsort(np.abs(robustness[key][i][j]['predict']))[-1:][0])
                ranks.append(rank)
                values.append(value)
            rob_rank[key] = st.mode(ranks, axis=1)[0] # Find most important feature per instance
            value = []
            for j in range(n):
                value.append([np.mean([robustness[key][i][j]['predict'][rob_rank[key][j][0]] for i in range(r)]), np.var([robustness[key][i][j]['predict'][rob_rank[key][j][0]] for i in range(r)])])
            rob_val[key] = value
        
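        # Note: rob_proba is never reset, so np.mean(rob_proba) below is a running
        # mean over all datasets and models processed so far.
        for inst in range(n):
            rob_proba.append(np.var([robustness['proba'][j][inst] for j in range(r)]))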
        
        average_results[a+'_rob_ce'].append(np.mean([t[1] for t in rob_val["ce"]]))
        average_results[a+'_rob_cce'].append(np.mean([t[1] for t in rob_val["cce"]]))
        average_results[a+'_rob_proba'].append(np.mean(rob_proba))
        # print(f'{np.mean([t[1] if t[1] > 1e-20 else 0 for t in rob_val["ce"]]):.1e} & {np.mean([t[1] if t[1] > 1e-20 else 0 for t in rob_val["cce"]]):.1e} & ',end='')
        print(f'{np.mean([t[1] for t in rob_val["ce"]]):.3f} & {np.mean([t[1] for t in rob_val["cce"]]):.3f} & ',end='')
        print(f'{np.mean(rob_proba):.3f}', end='')
    print(' \\\\')
print('\\hline\nAverage', end='')
for a in ['xGB', 'RF']:
    print(' & ', end='')
    print(f'{np.mean(average_results[a+"_stab_ce"]):.1e} & {np.mean(average_results[a+"_stab_cce"]):.1e}',end='')
for a in ['xGB', 'RF']:
    print(' & ', end='')
    print(f'{np.mean(average_results[a+"_rob_ce"]):.3f} & {np.mean(average_results[a+"_rob_cce"]):.3f} & ',end='')
    print(f'{np.mean(average_results[a+"_rob_proba"]):.3f}', end='')
print(' \\\\')

 & xGB & xGB & RF & RF & xGB & xGB & xGB & RF & RF & RF \\
Dataset & CE & CCE & CE & CCE & CE & CCE & Model & CE & CCE & Model \\
\hline
colic & 3.9e-35 & 3.9e-35 & 7.8e-33 & 7.8e-33 & 0.015 & 0.015 & 0.015 & 0.017 & 0.017 & 0.010 \\
creditA & 1.5e-34 & 1.5e-34 & 1.5e-32 & 1.5e-32 & 0.022 & 0.022 & 0.011 & 0.015 & 0.015 & 0.009 \\
diabetes & 1.2e-33 & 1.2e-33 & 5.0e-33 & 5.0e-33 & 0.017 & 0.017 & 0.014 & 0.015 & 0.015 & 0.013 \\
german & 1.2e-34 & 1.2e-34 & 9.6e-34 & 9.6e-34 & 0.003 & 0.003 & 0.015 & 0.005 & 0.005 & 0.014 \\
haberman & 5.2e-34 & 5.2e-34 & 1.2e-33 & 1.2e-33 & 0.010 & 0.010 & 0.016 & 0.011 & 0.011 & 0.016 \\
heartC & 0.0e+00 & 0.0e+00 & 2.7e-33 & 2.7e-33 & 0.012 & 0.012 & 0.017 & 0.012 & 0.012 & 0.016 \\
heartH & 2.6e-33 & 2.6e-33 & 4.9e-33 & 4.9e-33 & 0.017 & 0.017 & 0.017 & 0.011 & 0.011 & 0.016 \\
heartS & 1.6e-33 & 1.6e-33 & 4.3e-33 & 4.3e-33 & 0.019 & 0.019 & 0.018 & 0.017 & 0.017 & 0.017 \\
hepati & 1.9e-34 & 1.9e-34 & 5.2e-33 & 5.2e-33 & 0.022 & 0.022 & 0.018 & 0.

The variance in the robustness experiments is also low, although clearly larger than in the stability experiments. It is comparable to the variance of the probability estimates used as reference. This indicates that the method is fairly robust to perturbations such as variations of the calibration set and the model. Since the method explains the calibrated probability estimates of the model, it must be expected to be sensitive to changes in the model.
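
As a minimal sketch of the reference measure, the variance of the probability estimates can be computed per instance over the runs and then averaged over the instances. The function name `mean_proba_variance`, the `(runs, instances)` array layout, and the synthetic probabilities are assumptions for illustration. Unlike the cells above, which append to a single `rob_proba` list, this sketch treats one dataset and model in isolation.

```python
import numpy as np

def mean_proba_variance(proba):
    """proba: array of shape (runs, instances), one probability estimate per cell."""
    proba = np.asarray(proba, dtype=float)
    # Variance over runs for each instance, then the mean over instances.
    return np.var(proba, axis=0).mean()

rng = np.random.default_rng(1)
true_p = rng.uniform(0.2, 0.8, size=20)                 # synthetic "true" probabilities
# 30 resampled models, each giving a slightly different estimate per instance.
proba = np.clip(true_p + rng.normal(scale=0.12, size=(30, 20)), 0.0, 1.0)

reference = mean_proba_variance(proba)
print(f"{reference:.3f}")
```

A feature-weight variance of the same order as this reference suggests that the explanations vary about as much as the model's own predictions do.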

### 4 Computing time
Now, let's look at the runtime taken to compute the explanations. The tabulated runtimes are the average time in seconds per instance. The results are printed as a LaTeX table.

In [79]:
s_timer = []
r_timer = []
n = results['test_size']
r = results['num_rep']
average_time = {}
average_time['num_features'] = []
for a in ['xGB', 'RF']:
    average_time[a+'_stab_ce'] = []
    average_time[a+'_stab_cce'] = []
    average_time[a+'_rob_ce'] = []
    average_time[a+'_rob_cce'] = []

print('Runtime & xGB & xGB & xGB & xGB & RF & RF & RF & RF & \\\\')
print('Dataset & CE Stability & CCE Stability & CE Robustness & CCE Robustness & CE Stability & CCE Stability & CE Robustness & CCE Robustness & #Features\\\\\n\\hline')
for d in np.sort([k for k in results.keys()]):
    if d in ['test_size', 'num_rep']:
        continue
    print(d, end='')
    for a in results[d]:
        print(' & ', end='')
        s_time = results[d][a]['stab_timer']
        r_time = results[d][a]['rob_timer']
        average_time[a+'_stab_ce'].append(np.mean([t/n for t in s_time["ce"]]))
        average_time[a+'_stab_cce'].append(np.mean([t/n for t in s_time["cce"]]))
        average_time[a+'_rob_ce'].append(np.mean([t/n for t in r_time["ce"]]))
        average_time[a+'_rob_cce'].append(np.mean([t/n for t in r_time["cce"]]))
        average_time['num_features'].append(data_characteristics[d])
        print(f'{np.mean([t/n for t in s_time["ce"]]):.2f} & {np.mean([t/n for t in s_time["cce"]]):.2f} & ',end='')
        print(f'{np.mean([t/n for t in r_time["ce"]]):.2f} & {np.mean([t/n for t in r_time["cce"]]):.2f}',end='')
    print(f' & {data_characteristics[d]}', end='')
    print(' \\\\')
print('\\hline\nAverage', end='')
for a in ['xGB', 'RF']:
    print(' & ', end='')
    print(f'{np.mean(average_time[a+"_stab_ce"]):.2f} & {np.mean(average_time[a+"_stab_cce"]):.2f} & ',end='')
    print(f'{np.mean(average_time[a+"_rob_ce"]):.2f} & {np.mean(average_time[a+"_rob_cce"]):.2f}',end='')
print(f' & {np.mean(average_time["num_features"]):.1f}', end='')
print(' \\\\')

Runtime & xGB & xGB & xGB & xGB & RF & RF & RF & RF & \\
Dataset & CE Stability & CCE Stability & CE Robustness & CCE Robustness & CE Stability & CCE Stability & CE Robustness & CCE Robustness & #Features\\
\hline
colic & 0.38 & 0.41 & 0.38 & 0.41 & 1.03 & 1.12 & 1.05 & 3.40 & 60 \\
creditA & 0.37 & 0.39 & 0.37 & 0.40 & 0.94 & 1.01 & 0.94 & 1.01 & 43 \\
diabetes & 0.13 & 0.17 & 0.13 & 0.17 & 0.35 & 0.46 & 0.35 & 0.47 & 9 \\
german & 0.18 & 0.18 & 0.19 & 0.19 & 0.45 & 0.45 & 0.46 & 0.46 & 28 \\
haberman & 0.06 & 0.06 & 0.06 & 0.06 & 0.14 & 0.14 & 0.14 & 0.15 & 4 \\
heartC & 0.20 & 0.23 & 0.21 & 0.25 & 0.51 & 0.56 & 0.50 & 0.58 & 23 \\
heartH & 0.17 & 0.20 & 0.17 & 0.20 & 0.42 & 0.48 & 0.44 & 0.52 & 21 \\
heartS & 0.16 & 0.19 & 0.16 & 0.19 & 0.37 & 0.44 & 0.38 & 0.45 & 14 \\
hepati & 0.19 & 0.23 & 0.20 & 0.23 & 0.47 & 0.55 & 0.46 & 0.55 & 20 \\
iono & 0.50 & 0.73 & 0.52 & 0.75 & 1.22 & 1.76 & 1.21 & 1.77 & 34 \\
je4042 & 0.14 & 0.14 & 0.15 & 0.15 & 0.36 & 0.36 & 0.36 & 0.36 & 9 \\
je4243

As can be seen, the runtime is fairly low for both CE and CCE. As expected, there is very little difference between CE and CCE. The runtime for the factual CE is slightly lower than for the CCE. This is because the CCE method computes the counterfactuals, which occasionally require some additional calculations. The main difference in time stems from the underlying model used, with XGBoost being about twice as fast as Random Forest.

It is reassuring that the runtime is close to identical for both the Stability and Robustness experiments. This indicates that the main factors affecting runtime are the number of features and the machine learning algorithm used.
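
The relation between runtime, number of features and learning algorithm can be checked directly on the rows visible in the runtime table above. The sketch below uses only the CE Stability columns for the datasets colic through je4042; the variable names are illustrative, and the remaining datasets are left out because their rows are not shown.

```python
import numpy as np

# Values copied from the visible rows of the runtime table (colic ... je4042).
num_features = np.array([60, 43, 9, 28, 4, 23, 21, 14, 20, 34, 9])
xgb_ce_stab = np.array([0.38, 0.37, 0.13, 0.18, 0.06, 0.20, 0.17, 0.16, 0.19, 0.50, 0.14])
rf_ce_stab = np.array([1.03, 0.94, 0.35, 0.45, 0.14, 0.51, 0.42, 0.37, 0.47, 1.22, 0.36])

# Pearson correlation between number of features and seconds per instance.
corr_xgb = np.corrcoef(num_features, xgb_ce_stab)[0, 1]
corr_rf = np.corrcoef(num_features, rf_ce_stab)[0, 1]

# Average slowdown of RF relative to xGB on the same datasets.
rf_to_xgb = np.mean(rf_ce_stab / xgb_ce_stab)

print(f"corr xGB: {corr_xgb:.2f}, corr RF: {corr_rf:.2f}, RF/xGB: {rf_to_xgb:.2f}")
```

A strong positive correlation together with a fairly constant ratio between the two algorithms would be in line with the interpretation given above, although eleven datasets are too few for firm conclusions.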

Finally, let's look at the variation in runtime. The variation is measured as the variance (`np.var`) of the average runtime per instance over the 30 runs, where the per-instance runtime of a run is its recorded time divided by the 20 instances. The results are printed as a LaTeX table.
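
Since variance and standard deviation are easy to mix up when reading such a table, the following sketch computes both for the per-instance runtime. The helper name `runtime_spread` and the run totals are hypothetical. Note that the variance is expressed in squared seconds, which is why the tabulated values look much smaller than the runtimes themselves.

```python
import numpy as np

def runtime_spread(run_totals, n_instances):
    """Return (variance, standard deviation) of the per-instance runtime over runs."""
    per_instance = np.asarray(run_totals, dtype=float) / n_instances
    return np.var(per_instance), np.std(per_instance)

run_totals = [4.0, 4.2, 3.9, 4.1, 4.0]                  # hypothetical seconds per run
variance, std = runtime_spread(run_totals, n_instances=20)
print(f"variance: {variance:.2e} s^2, std: {std:.2e} s")
```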

In [77]:
s_timer = []
r_timer = []
n = results['test_size']
r = results['num_rep']
average_time = {}
for a in ['xGB', 'RF']:
    average_time[a+'_stab_ce'] = []
    average_time[a+'_stab_cce'] = []
    average_time[a+'_rob_ce'] = []
    average_time[a+'_rob_cce'] = []

print('Runtime & xGB & xGB & xGB & xGB & RF & RF & RF & RF \\\\')
print('Dataset & CE Stability & CCE Stability & CE Robustness & CCE Robustness & CE Stability & CCE Stability & CE Robustness & CCE Robustness \\\\\n\\hline')
for d in np.sort([k for k in results.keys()]):
    if d in ['test_size', 'num_rep']:
        continue
    print(d, end='')
    for a in results[d]:
        print(' & ', end='')
        s_time = results[d][a]['stab_timer']
        r_time = results[d][a]['rob_timer']
        average_time[a+'_stab_ce'].append(np.var([t/n for t in s_time["ce"]]))
        average_time[a+'_stab_cce'].append(np.var([t/n for t in s_time["cce"]]))
        average_time[a+'_rob_ce'].append(np.var([t/n for t in r_time["ce"]]))
        average_time[a+'_rob_cce'].append(np.var([t/n for t in r_time["cce"]]))
        print(f'{np.var([t/n for t in s_time["ce"]]):.2e} & {np.var([t/n for t in s_time["cce"]]):.2e} & ',end='')
        print(f'{np.var([t/n for t in r_time["ce"]]):.2e} & {np.var([t/n for t in r_time["cce"]]):.2e}',end='')
    print(' \\\\')
print('\\hline\nAverage', end='')
for a in ['xGB', 'RF']:
    print(' & ', end='')
    print(f'{np.mean(average_time[a+"_stab_ce"]):.2e} & {np.mean(average_time[a+"_stab_cce"]):.2e} & ',end='')
    print(f'{np.mean(average_time[a+"_rob_ce"]):.2e} & {np.mean(average_time[a+"_rob_cce"]):.2e}',end='')
print(' \\\\')

Runtime & xGB & xGB & xGB & xGB & RF & RF & RF & RF \\
Dataset & CE Stability & CCE Stability & CE Robustness & CCE Robustness & CE Stability & CCE Stability & CE Robustness & CCE Robustness \\
\hline
colic & 2.52e-05 & 2.37e-05 & 9.83e-06 & 3.46e-05 & 1.60e-04 & 1.52e-04 & 1.10e-03 & 1.47e+02 \\
creditA & 4.60e-05 & 2.41e-05 & 3.23e-04 & 2.40e-04 & 1.26e-04 & 2.38e-04 & 1.46e-03 & 1.47e-03 \\
diabetes & 4.45e-06 & 5.56e-06 & 4.12e-06 & 8.40e-06 & 5.78e-05 & 7.14e-05 & 3.70e-05 & 6.95e-05 \\
german & 1.56e-05 & 1.79e-05 & 1.07e-05 & 1.00e-05 & 3.73e-04 & 6.88e-04 & 3.93e-05 & 1.93e-05 \\
haberman & 6.55e-06 & 6.81e-06 & 1.72e-06 & 3.05e-06 & 1.85e-05 & 1.96e-05 & 1.49e-05 & 1.91e-05 \\
heartC & 1.10e-04 & 9.48e-05 & 4.67e-05 & 1.56e-04 & 4.45e-04 & 6.05e-04 & 3.61e-04 & 4.01e-04 \\
heartH & 1.18e-05 & 1.10e-05 & 1.08e-05 & 1.55e-05 & 2.99e-05 & 9.33e-06 & 5.30e-04 & 1.38e-03 \\
heartS & 1.98e-05 & 1.77e-05 & 3.17e-05 & 9.52e-05 & 1.72e-05 & 3.05e-05 & 2.05e-05 & 1.32e-04 \\
hepati & 1.

As can be seen, the variation in runtime per dataset is generally small; the RF CCE robustness runs on colic (1.47e+02) are a clear outlier. This indicates that the runtime is fairly stable and that the method is not overly sensitive to changes in the model or calibration set.