# Empirical Estimation of Confidence Intervals for Machine Learning
This notebook shows how to use Python functions for the empirical
estimation of confidence intervals for machine learning models. This kind of confidence interval estimation is often
called "Bootstrap Confidence Interval" or "Monte Carlo Confidence Interval" estimation.
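
As a rough illustration of the idea, the sketch below resamples a dataset, evaluates a model on each resample, and collects the resulting accuracies. It is not the implementation behind `get_accuracy_on_samples`; the helper `bootstrap_accuracies` and the toy `MajorityClassifier` are hypothetical names used only here.

```python
import random
from collections import Counter


class MajorityClassifier:
    """Toy model (hypothetical): always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def bootstrap_accuracies(X, y, n_iterations=100, sample_ratio=0.7,
                         train_ratio=0.8, seed=None):
    """Return one test accuracy per resample of (X, y)."""
    rng = random.Random(seed)
    n_sample = round(len(X) * sample_ratio)
    n_train = round(n_sample * train_ratio)
    accuracies = []
    for _ in range(n_iterations):
        # Draw indices with replacement, then split them into train and test parts.
        # In this simple sketch a repeated index may land in both parts.
        idx = [rng.randrange(len(X)) for _ in range(n_sample)]
        train_idx, test_idx = idx[:n_train], idx[n_train:]
        model = MajorityClassifier().fit([X[i] for i in train_idx],
                                         [y[i] for i in train_idx])
        predictions = model.predict([X[i] for i in test_idx])
        hits = sum(p == y[i] for p, i in zip(predictions, test_idx))
        accuracies.append(hits / len(test_idx))
    return accuracies


# Synthetic data, used only to exercise the sketch.
X_demo = [[float(i)] for i in range(150)]
y_demo = [0] * 100 + [1] * 50
accs_demo = bootstrap_accuracies(X_demo, y_demo, n_iterations=50, seed=0)
print(len(accs_demo), min(accs_demo), max(accs_demo))
```

The spread of the collected accuracies is what the confidence interval is later estimated from.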

In [1]:
import numpy as np
from tqdm.auto import tqdm

In [2]:
%%capture

from keras.models import Sequential
from keras.layers import Dense

In [3]:
from sklearn.datasets import load_iris

from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier 

In [4]:
from bootstrap_confindence_intervals import get_accuracy_on_samples, get_confidence_interval

# Basic Usage
The basic usage of the functions is presented by computing different confidence intervals for a random-forest classifier over the iris dataset.

In [5]:
# Data loading and presentation

iris_dataset = load_iris()
iris_X, iris_y = iris_dataset['data'], iris_dataset['target']

In [6]:
[(a, b) for (a, b) in zip(iris_X, iris_y)][:5]

[(array([5.1, 3.5, 1.4, 0.2]), 0),
 (array([4.9, 3. , 1.4, 0.2]), 0),
 (array([4.7, 3.2, 1.3, 0.2]), 0),
 (array([4.6, 3.1, 1.5, 0.2]), 0),
 (array([5. , 3.6, 1.4, 0.2]), 0)]

In [7]:
[(a, b) for (a, b) in zip(iris_X, iris_y)][-5:]

[(array([6.7, 3. , 5.2, 2.3]), 2),
 (array([6.3, 2.5, 5. , 1.9]), 2),
 (array([6.5, 3. , 5.2, 2. ]), 2),
 (array([6.2, 3.4, 5.4, 2.3]), 2),
 (array([5.9, 3. , 5.1, 1.8]), 2)]

In [8]:
# These parameters will be passed to the constructor of RandomForestClassifier for the creation of each new instance

rfc_params_dict = {"n_estimators":20, "max_depth":3}

In [9]:
# The accuracy of a random-forest classifier is evaluated on 100 samples of the iris dataset. Each sample is drawn
# with replacement from the original dataset.

accs=get_accuracy_on_samples(RandomForestClassifier, iris_X, iris_y,
                             model_params_dict=rfc_params_dict,
                             n_iterations=100, sample_ratio=0.7,
                             train_ratio=0.8, random_seed=None,
                             sample_with_replacement=True, verbose=True)

Evaluating model on 100 samples of (X, y).
Each sample will contain 105 elements from X, out of which 84 elements
will be used to train the model, while the remaining 21 elements to test its accuracy.
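
The sizes in this message follow from the two ratios: the iris dataset has 150 elements, so a `sample_ratio` of 0.7 gives 105 elements per sample, and a `train_ratio` of 0.8 splits them into 84 training and 21 test elements. A small sketch of that arithmetic follows; the helper name `split_sizes` is hypothetical, and rounding to the nearest integer is an assumption about how fractional sizes are handled.

```python
def split_sizes(n_elements, sample_ratio, train_ratio):
    """Return (sample size, training size, test size) for one resample."""
    n_sample = round(n_elements * sample_ratio)
    n_train = round(n_sample * train_ratio)
    return n_sample, n_train, n_sample - n_train


print(split_sizes(150, 0.7, 0.8))   # (105, 84, 21)
```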


In [10]:
[(k, accs[k]) for k in range(5)]

[(0, 1.0),
 (1, 0.9047619047619048),
 (2, 1.0),
 (3, 1.0),
 (4, 0.9523809523809523)]

In [11]:
# Confidence intervals are estimated from the performance of the models in the 100 evaluations.

get_confidence_interval(accs, 10)

From the given data, with 90% probability,
the accuracy of the model is 95.24% +/- 4.76


(0.9047619047619048, 1.0)

In [12]:
get_confidence_interval(accs, 5)

From the given data, with 95% probability,
the accuracy of the model is 92.86% +/- 7.14


(0.8571428571428571, 1.0)

In [13]:
get_confidence_interval(accs, 1)

From the given data, with 99% probability,
the accuracy of the model is 89.27% +/- 10.73


(0.7854761904761904, 1.0)
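
The printed message appears to report the midpoint of the returned interval plus or minus half its width: for the 99% case, (0.7855 + 1.0) / 2 ≈ 0.8927 and (1.0 - 0.7855) / 2 ≈ 0.1073. The sketch below shows one common way to obtain such an interval, the percentile method, where the bounds are the alpha/2 and 100 - alpha/2 percentiles of the accuracies. This is an assumption about the approach, not the actual body of `get_confidence_interval`; `percentile`, `percentile_interval` and `midpoint_and_margin` are hypothetical helpers.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def percentile_interval(accuracies, alpha):
    """Interval covering the central (100 - alpha)% of the accuracies."""
    low = max(0.0, percentile(accuracies, alpha / 2.0))
    high = min(1.0, percentile(accuracies, 100.0 - alpha / 2.0))
    return low, high


def midpoint_and_margin(interval):
    """Express an interval (low, high) as midpoint and half-width."""
    low, high = interval
    return (low + high) / 2.0, (high - low) / 2.0


# Made-up accuracies, used only to exercise the helpers.
demo_accs = [0.80, 0.85, 0.90, 0.90, 0.95, 0.95, 0.95, 1.0, 1.0, 1.0, 1.0]
demo_interval = percentile_interval(demo_accs, 10)
print(demo_interval, midpoint_and_margin(demo_interval))

# The 99% interval returned above, rewritten as midpoint and margin.
mid, margin = midpoint_and_margin((0.7854761904761904, 1.0))
print(round(mid * 100, 2), round(margin * 100, 2))   # 89.27 10.73
```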

In [14]:
rfc_params_dict = {"n_estimators":5, "max_depth":5}

In [15]:
accs=get_accuracy_on_samples(RandomForestClassifier, iris_X, iris_y,
                             model_params_dict=rfc_params_dict,
                             n_iterations=200, sample_ratio=0.8,
                             train_ratio=0.85, random_seed=None,
                             sample_with_replacement=True, verbose=True)

Evaluating model on 200 samples of (X, y).
Each sample will contain 120 elements from X, out of which 102 elements
will be used to train the model, while the remaining 18 elements to test its accuracy.


In [16]:
[(k, accs[k]) for k in range(6)]

[(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0), (5, 0.8888888888888888)]

In [17]:
get_confidence_interval(accs, 10)

From the given data, with 90% probability,
the accuracy of the model is 94.44% +/- 5.56


(0.8888888888888888, 1.0)

In [18]:
get_confidence_interval(accs, 5)

From the given data, with 95% probability,
the accuracy of the model is 94.44% +/- 5.56


(0.8888888888888888, 1.0)

In [19]:
get_confidence_interval(accs, 1)

From the given data, with 99% probability,
the accuracy of the model is 94.43% +/- 5.57


(0.888611111111111, 1.0)

In [20]:
rfc_params_dict = {"n_estimators":10, "max_depth":5}

In [21]:
# In this example the resampling is performed without replacement

accs=get_accuracy_on_samples(RandomForestClassifier, iris_X, iris_y,
                             model_params_dict=rfc_params_dict,
                             n_iterations=1000, sample_ratio=0.8,
                             train_ratio=0.8, random_seed=None,
                             sample_with_replacement=False, verbose=True)

Evaluating model on 1000 samples of (X, y).
Each sample will contain 120 elements from X, out of which 96 elements
will be used to train the model, while the remaining 24 elements to test its accuracy.
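
The difference between the two resampling modes can be illustrated with the standard library: `random.choices` draws with replacement, so an index may repeat, while `random.sample` draws without replacement, so every index is distinct. This is only a sketch of the concept and says nothing about how `sample_with_replacement` is implemented internally.

```python
import random

rng = random.Random(0)
population = range(150)          # indices of the 150 iris elements
n_sample = 120                   # 150 * 0.8

with_replacement = rng.choices(population, k=n_sample)
without_replacement = rng.sample(population, k=n_sample)

# Without replacement every drawn index is unique; with replacement
# duplicates are allowed, so the number of distinct indices can be smaller.
print(len(set(without_replacement)))   # 120
print(len(set(with_replacement)) <= n_sample)
```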


In [22]:
[(k, accs[k]) for k in range(6)]

[(0, 0.9583333333333334),
 (1, 1.0),
 (2, 0.9166666666666666),
 (3, 0.9583333333333334),
 (4, 1.0),
 (5, 1.0)]

In [23]:
get_confidence_interval(accs, 10, verbose=False)

(0.875, 1.0)

In [24]:
get_confidence_interval(accs, 5, verbose=False)

(0.8333333333333334, 1.0)

In [25]:
get_confidence_interval(accs, 1, verbose=False)

(0.7916666666666666, 1.0)

# Neural Networks and One-Hot-Encoding
The following example shows how to use the functions with simple neural networks and one-hot encoding of the labels.

In [26]:
%%capture
iris_y_onehot = OneHotEncoder().fit_transform(iris_y.reshape(-1, 1)).toarray()

In [27]:
iris_y_onehot[:5], iris_y_onehot[-5:]

(array([[1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.],
        [1., 0., 0.]]), array([[0., 0., 1.],
        [0., 0., 1.],
        [0., 0., 1.],
        [0., 0., 1.],
        [0., 0., 1.]]))
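
For reference, one-hot encoding maps each integer label to a vector with a single 1 at the label's index. A dependency-free sketch of the mapping follows; the helper `one_hot` is hypothetical and unrelated to the internals of `OneHotEncoder`.

```python
def one_hot(labels, n_classes):
    """Encode integer labels as rows with a single 1.0 at the label's index."""
    return [[1.0 if k == label else 0.0 for k in range(n_classes)]
            for label in labels]


encoded = one_hot([0, 0, 2], n_classes=3)
print(encoded)   # [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
```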

In [28]:
# The following function will return instances of the model

def get_model(hidden_units,  n_output_units, hidden_activation="relu",
              output_activation="softmax", loss="binary_crossentropy",
              optimizer="adam", metric="accuracy"):
    model = Sequential()
    
    for u in hidden_units:
        model.add(Dense(u, activation=hidden_activation))
    model.add(Dense(n_output_units, activation=output_activation))
    
    model.compile(loss=loss, optimizer=optimizer, metrics=[metric])
    return model

In [29]:
model = get_model(hidden_units=[30, 20, 10],  n_output_units=3)

In [30]:
model.fit(iris_X, iris_y_onehot, epochs = 8, verbose=1, batch_size=10)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.callbacks.History at 0x7f1360213a90>

In [31]:
model.predict(iris_X[:5])

array([[0.9513953 , 0.03880021, 0.00980448],
       [0.91564775, 0.06664806, 0.01770419],
       [0.9358865 , 0.05006092, 0.01405252],
       [0.917441  , 0.06461114, 0.01794783],
       [0.95593554, 0.03497979, 0.00908462]], dtype=float32)
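
Each row of the prediction is a probability distribution over the three classes. To compare it with a one-hot label, the predicted class can be taken as the index of the largest probability. A minimal sketch using the rows printed above follows; the helpers `argmax` and `accuracy_from_probabilities` are hypothetical, and whether `get_accuracy_on_samples` does exactly this when `is_one_hot=True` is an assumption.

```python
def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_probabilities(probabilities, one_hot_targets):
    """Fraction of rows whose most probable class matches the one-hot target."""
    predicted = [argmax(p) for p in probabilities]
    expected = [argmax(t) for t in one_hot_targets]
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


probabilities = [
    [0.9513953, 0.03880021, 0.00980448],
    [0.91564775, 0.06664806, 0.01770419],
    [0.9358865, 0.05006092, 0.01405252],
    [0.917441, 0.06461114, 0.01794783],
    [0.95593554, 0.03497979, 0.00908462],
]
targets = [[1.0, 0.0, 0.0]] * 5   # the first five iris labels are class 0

print(accuracy_from_probabilities(probabilities, targets))   # 1.0
```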

In [32]:
dnn_params_dict = {'hidden_units':[30, 20, 10], 'hidden_activation':"relu", 'n_output_units':3,
                   'output_activation':"softmax", 'loss':'binary_crossentropy', 'optimizer':'adam'}

# These parameters will be passed to the network on the calls of the fit method.
dnn_fit_params_dict={'epochs':50, 'batch_size':10, 'verbose':0}
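
Both dictionaries are plain collections of keyword arguments. A plausible way for a helper to consume them, shown here with a stand-in model so the snippet runs without Keras, is to unpack one into the model factory and the other into `fit`. `DummyModel` and `build_and_fit` are hypothetical, and the unpacking is an assumption about how `get_accuracy_on_samples` uses the dictionaries.

```python
class DummyModel:
    """Stand-in for a model; it only records the arguments it receives."""

    def __init__(self, **params):
        self.params = params
        self.fit_params = None

    def fit(self, X, y, **fit_params):
        self.fit_params = fit_params
        return self


def build_and_fit(model_factory, X, y, model_params_dict=None, fit_params_dict=None):
    """Create a fresh model from the factory and fit it with the given options."""
    model = model_factory(**(model_params_dict or {}))
    model.fit(X, y, **(fit_params_dict or {}))
    return model


demo_model = build_and_fit(
    DummyModel, [[0.0]], [0],
    model_params_dict={"hidden_units": [30, 20, 10], "n_output_units": 3},
    fit_params_dict={"epochs": 50, "batch_size": 10, "verbose": 0},
)
print(demo_model.params)
print(demo_model.fit_params)
```

Passing a factory rather than a fitted model lets every resample start from a freshly initialised instance.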

In [33]:
accs=get_accuracy_on_samples(get_model, iris_X, iris_y_onehot, model_params_dict=dnn_params_dict,
                             fit_params_dict=dnn_fit_params_dict, n_iterations=20, sample_ratio=0.7,
                             train_ratio=0.8, random_seed=None, sample_with_replacement=True,
                             verbose=True, is_one_hot=True)

Evaluating model on 20 samples of (X, y).
Each sample will contain 105 elements from X, out of which 84 elements
will be used to train the model, while the remaining 21 elements to test its accuracy.


In [34]:
get_confidence_interval(accs, 10)

From the given data, with 90% probability,
the accuracy of the model is 97.5% +/- 2.5


(0.95, 1.0)

In [35]:
get_confidence_interval(accs, 5)

From the given data, with 95% probability,
the accuracy of the model is 96.37% +/- 3.63


(0.9273809523809523, 1.0)

In [36]:
get_confidence_interval(accs, 1)

From the given data, with 99% probability,
the accuracy of the model is 95.46% +/- 4.54


(0.9092857142857144, 1.0)

# Deterministic Resampling
The following example shows the usage of the `random_seed` parameter.

In [37]:
rfc_params_dict = {"n_estimators":10, "max_depth":5}

In [38]:
# Here the random_seed param is set.

accs_run_1=get_accuracy_on_samples(RandomForestClassifier, iris_X, iris_y,
                             model_params_dict=rfc_params_dict,
                             n_iterations=100, sample_ratio=0.7,
                             train_ratio=0.8, random_seed=42,
                             sample_with_replacement=True, verbose=True)

Evaluating model on 100 samples of (X, y).
Each sample will contain 105 elements from X, out of which 84 elements
will be used to train the model, while the remaining 21 elements to test its accuracy.


In [39]:
[(k, accs_run_1[k]) for k in range(6)]

[(0, 0.9523809523809523),
 (1, 1.0),
 (2, 0.9047619047619048),
 (3, 0.9523809523809523),
 (4, 0.9523809523809523),
 (5, 1.0)]

In [40]:
get_confidence_interval(accs_run_1, 5)

From the given data, with 95% probability,
the accuracy of the model is 91.61% +/- 8.39


(0.8321428571428571, 1.0)

In [41]:
accs_run_2=get_accuracy_on_samples(RandomForestClassifier, iris_X, iris_y,
                             model_params_dict=rfc_params_dict,
                             n_iterations=100, sample_ratio=0.7,
                             train_ratio=0.8, random_seed=42,
                             sample_with_replacement=True, verbose=True)

Evaluating model on 100 samples of (X, y).
Each sample will contain 105 elements from X, out of which 84 elements
will be used to train the model, while the remaining 21 elements to test its accuracy.


In [42]:
# As you can see, the results are identical to the previous run of the function. Be aware that the results
# also depend on the model's characteristics. For example, if the random seed is set but the model has
# some sort of random initialization (e.g. neural networks), you may see different results even though the
# resampling was identical.

[(k, accs_run_2[k]) for k in range(6)]

[(0, 0.9523809523809523),
 (1, 1.0),
 (2, 0.9047619047619048),
 (3, 0.9523809523809523),
 (4, 0.9523809523809523),
 (5, 1.0)]
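
The effect of a fixed seed on the resampling step can be reproduced in isolation: two generators created with the same seed draw the same indices, while an unseeded generator is initialised from system entropy and generally draws different ones. A sketch with the standard library follows; `draw_indices` is a hypothetical helper, not part of the notebook's module.

```python
import random


def draw_indices(n_elements, n_sample, seed=None):
    """Draw n_sample indices with replacement from range(n_elements)."""
    rng = random.Random(seed)   # seed=None: initialised from an OS entropy source or the clock
    return [rng.randrange(n_elements) for _ in range(n_sample)]


run_1 = draw_indices(150, 105, seed=42)
run_2 = draw_indices(150, 105, seed=42)
run_3 = draw_indices(150, 105)   # unseeded: not reproducible

print(run_1 == run_2)   # True
```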

In [43]:
get_confidence_interval(accs_run_2, 5)

From the given data, with 95% probability,
the accuracy of the model is 91.61% +/- 8.39


(0.8321428571428571, 1.0)

In [44]:
# Here the random_seed param is set to None

accs_run_3=get_accuracy_on_samples(RandomForestClassifier, iris_X, iris_y,
                             model_params_dict=rfc_params_dict,
                             n_iterations=100, sample_ratio=0.7,
                             train_ratio=0.8, random_seed=None,
                             sample_with_replacement=True, verbose=True)

Evaluating model on 100 samples of (X, y).
Each sample will contain 105 elements from X, out of which 84 elements
will be used to train the model, while the remaining 21 elements to test its accuracy.


In [45]:
[(k, accs_run_3[k]) for k in range(6)]

[(0, 1.0),
 (1, 1.0),
 (2, 1.0),
 (3, 0.9523809523809523),
 (4, 1.0),
 (5, 0.8571428571428571)]

In [46]:
get_confidence_interval(accs_run_3, 5)

From the given data, with 95% probability,
the accuracy of the model is 92.86% +/- 7.14


(0.8571428571428571, 1.0)

In [47]:
accs_run_4=get_accuracy_on_samples(RandomForestClassifier, iris_X, iris_y,
                             model_params_dict=rfc_params_dict,
                             n_iterations=100, sample_ratio=0.7,
                             train_ratio=0.8, random_seed=None,
                             sample_with_replacement=True, verbose=True)

Evaluating model on 100 samples of (X, y).
Each sample will contain 105 elements from X, out of which 84 elements
will be used to train the model, while the remaining 21 elements to test its accuracy.


In [48]:
[(k, accs_run_4[k]) for k in range(6)]

[(0, 0.9047619047619048),
 (1, 1.0),
 (2, 0.9523809523809523),
 (3, 0.9523809523809523),
 (4, 0.9523809523809523),
 (5, 1.0)]

In [49]:
get_confidence_interval(accs_run_4, 5)

From the given data, with 95% probability,
the accuracy of the model is 95.24% +/- 4.76


(0.9047619047619048, 1.0)