In [1]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Import Data

We start by importing the Iris dataset. 

Since we model a single classifier as a Bernoulli trial, we treat this as a binary classification problem. The Iris dataset has three labels, so we drop the samples labelled $0$ and keep the remaining two classes.

In [2]:
iris = load_iris()
X = iris.data
y = iris.target

X = X[y != 0]
y = y[y != 0]

We then split it into training and testing sets of equal size.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)
data = {
    'X_train': X_train, 
    'X_test': X_test, 
    'y_train': y_train, 
    'y_test': y_test
}

# The Process

Before Equation $4.2$, we assumed that the $p_i$ are drawn from a distribution with mean $\mu_p$ and variance $\sigma_p^2$.

Thus, we need to estimate this distribution. We do so by generating $100$ random forests, each of size $1$, and measuring the accuracy of each with the `approx_learner_dist` method of the `ensembleEstimation` class.

In [4]:
from ensembleEstimation import ensembleEstimation

ensemble_1 = ensembleEstimation(1, data)
probs = ensemble_1.approx_learner_dist()

print(probs)

[0.94 0.76 0.9  0.94 0.86 0.9  0.7  0.7  0.94 0.9  0.94 0.94 0.9  0.94
 0.86 0.94 0.4  0.76 0.94 0.9  0.9  0.44 0.88 0.9  0.88 0.44 0.68 0.94
 0.88 0.86 0.88 0.88 0.88 0.94 0.66 0.68 0.9  0.68 0.88 0.86 0.88 0.88
 0.94 0.86 0.88 0.68 0.88 0.94 0.86 0.68 0.6  0.68 0.86 0.88 0.86 0.92
 0.94 0.88 0.88 0.92 0.76 0.4  0.9  0.9  0.9  0.86 0.88 0.94 0.94 0.7
 0.9  0.94 0.92 0.88 0.94 0.9  0.88 0.72 0.72 0.94 0.44 0.88 0.9  0.88
 0.94 0.9  0.9  0.94 0.9  0.94 0.94 0.9  0.94 0.86 0.4  0.7  0.9  0.86
 0.94 0.72]


In [5]:
print(f'mu_p : {ensemble_1.mu_p}')
print(f'sigma_p : {ensemble_1.sigma_p}')

mu_p : 0.8370000000000001
sigma_p : 0.13441354098453026
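
For readers without access to `ensembleEstimation`, the estimation step might look roughly like the sketch below. It is an assumption about what `approx_learner_dist` does, not its actual implementation: the helper `single_tree_accuracies` and the names `mu_p_hat` and `sigma_p_hat` are hypothetical, each forest is scored on the test split, and the population standard deviation (NumPy's default) is used, which may differ from what the class computes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Rebuild the binary Iris split used in this notebook.
X, y = load_iris(return_X_y=True)
X, y = X[y != 0], y[y != 0]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

def single_tree_accuracies(n_learners=100):
    """Hypothetical stand-in for approx_learner_dist.

    Fits n_learners random forests with a single tree each (one seed per
    forest) and returns their accuracies on the test split.
    """
    accs = []
    for seed in range(n_learners):
        forest = RandomForestClassifier(n_estimators=1, random_state=seed)
        forest.fit(X_train, y_train)
        accs.append(forest.score(X_test, y_test))
    return np.array(accs)

accs = single_tree_accuracies()
mu_p_hat = accs.mean()     # estimate of mu_p
sigma_p_hat = accs.std()   # population std; an assumption, see lead-in
print(f"mu_p estimate    : {mu_p_hat:.3f}")
print(f"sigma_p estimate : {sigma_p_hat:.3f}")
```

Because the seeds and scoring choices here are guesses, the numbers will generally not match the values printed above exactly.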


Since $\mu_p$ and $\sigma_p$ are both fairly far from $0$, we use the normal approximation. We consider ensembles of size $11$, $21$, $\dots$, $51$.

We store the actual and predicted accuracies in a Pandas DataFrame for convenience.

In [6]:
Ns = []
actual_acc = []
pred_acc = []

for N in range(11, 51 + 1, 10):
    Ns.append(N)
    actual_acc.append(ensemble_1.find_actual_accuracy(N))
    pred_acc.append(ensemble_1.approximate(N, "normal"))
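
For intuition about what a normal approximation to majority-vote accuracy looks like, here is a minimal sketch under a simplifying assumption: the votes are independent and each is correct with marginal probability $\mu_p$. It uses only $\mu_p$; how Equation $4.2$ and `approximate(N, "normal")` incorporate $\sigma_p$ is not shown in this notebook, so this should not be read as their implementation. The function names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

def majority_accuracy_normal(n, mu_p):
    """Normal approximation to P(more than n/2 of n votes are correct).

    Assumes independent votes, each correct with probability mu_p, so the
    number of correct votes has mean n*mu_p and variance n*mu_p*(1 - mu_p).
    For odd n the threshold n/2 is a half-integer, which doubles as a
    continuity correction.
    """
    mean = n * mu_p
    std = sqrt(n * mu_p * (1 - mu_p))
    return 1 - NormalDist(mean, std).cdf(n / 2)

def majority_accuracy_exact(n, mu_p):
    """Exact binomial tail under the same independence assumption (odd n)."""
    return sum(
        comb(n, k) * mu_p**k * (1 - mu_p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

mu_p = 0.837  # rounded from the mu_p printed earlier in the notebook
for n in range(11, 51 + 1, 10):
    approx = majority_accuracy_normal(n, mu_p)
    exact = majority_accuracy_exact(n, mu_p)
    print(f"N={n}: normal={approx:.6f}, exact binomial={exact:.6f}")
```

Comparing the two columns gives a rough sense of how tight the approximation is under independence; real trees trained on the same data are correlated, so the measured ensemble accuracy can differ from both.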

In [7]:
results = pd.DataFrame(
    {
        'N': Ns,
        'Actual Accuracy': actual_acc,
        'Predicted Accuracy': pred_acc
    }
)
results['Relative Error'] = (
    (results['Predicted Accuracy'] - results['Actual Accuracy'])
    / results['Actual Accuracy'] * 100
)
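
The relative error column is a signed percentage: negative when the prediction underestimates the actual accuracy, positive when it overestimates. A small standalone sketch of the same formula, with a hypothetical helper name and illustrative inputs that are not results from this notebook:

```python
from math import isclose

def relative_error_percent(predicted, actual):
    """Signed relative error of a prediction, in percent of the actual value."""
    return (predicted - actual) / actual * 100

# Illustrative values only; these are not results from this notebook.
example = relative_error_percent(predicted=0.95, actual=1.00)
print(f"{example:.2f}%")  # negative: the prediction is below the actual value
```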

In [None]:
results.to_excel('Paper/tables/Section 4.xlsx', index=False)
results