# Neural network for ensemble tuning

## Introduction

Ensemble represents one of the most powerful intuition in data science. This technique combines multiple classifier predictions in order to obtain more accurate ones (at least in most of cases). But how can we combine prediction? We have two different technique for this purpose: hard voting and soft voting.

Hard voting output as prediction the ensemble the *mode* value of the classifiers' predictions distribution, in other words the class that has received more votes. On the other hand soft voting compute the probability for an instance to belong to each class averaging the probabilty predicted by the classifiers. This lead to smoother results, giving more weight to classifier confident of what they are predicting. Probability are generally averaged with a simple equally weighted mean.

The idea of this notebook is to use a deep neural network to obtain a more efficient weights' distribution for the soft voting method, in order to optimize the overall accuracy of the model. As base estimator i am going to use logistic regression as a classifier, since it is able to output the probability for each class through the sigmoid function. Each tree will be trained on a different sample of the original dataset, in order to reduce the likelyhood of overfitting

### Setup

In [1]:
# libraries for uploading data 
import cv2
import os

# deep learning libraries
import tensorflow as tf
from tensorflow import keras

# common imports
import pandas as pd
import numpy as np

# setting random seed
np.random.seed(42)
tf.random.set_seed(42)

# Style setup
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rc('axes', labelsize = 14)
mpl.rc('xtick', labelsize = 16)
mpl.rc('ytick', labelsize = 12)
plt.style.use('fivethirtyeight')
plt.xkcd(False) 

<matplotlib.rc_context at 0x14a4ed10978>

### Importing the data

Loading the fashion mnist dataset from keras

In [2]:
fashion_mnist = keras.datasets.fashion_mnist
(images_train, labels_train), (images_valid, labels_valid) = fashion_mnist.load_data()
images_train = images_train.reshape((-1, 28 * 28)) / 255
images_valid = images_valid.reshape((-1, 28 * 28)) / 255

### Training the trees

For the purpose of the project i am going to implement from scratch an ensemble. The clfs variable is a dictionary which contains the trained estimators

In [3]:
def random_sample(X_set, y_set, length = 5000):
    index = np.random.randint(0, len(X_set), length)
    return X_set[index], y_set[index]


from sklearn.linear_model import LogisticRegression

clfs = {}
n_estimators = 25

for i in range(n_estimators):
    x, y = random_sample(images_train, labels_train)
    log_clf = LogisticRegression(solver = 'newton-cg', multi_class = 'auto', max_iter = 500)
    log_clf.fit(x, y)
    clfs['clf_' + str(i)] = log_clf
    print('clf trained')

clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained


### Hard and soft voting

The next step is to define three functions used to extract the hard and soft predictions of the ensemble to compare them with the 'tuned' one. For th soft voting predictions i am going to need also the predicted probabilities of every classifiers for each instance.

In [4]:
def return_hard_predictions(clfs, x_set):
    ensemble_hard_pred = []
    for instance in x_set:
        instance_pred = []
        for i in range(n_estimators):
            prediction = int(clfs['clf_' + str(i)].predict(instance.reshape(1, -1)))
            instance_pred.append(prediction)
        ensemble_hard_pred.append(max(set(instance_pred), key = instance_pred.count))
    ensemble_hard_pred = np.array(ensemble_hard_pred)
    return ensemble_hard_pred

def return_class(predictions):
    return np.where(predictions[0] == max(predictions[0]))

def return_soft_predictions(clfs, x_set):
    ensemble_soft_pred = []
    for instance in images_valid:
        instance_pred_proba = []
        for i in range(n_estimators):
            prediction = clfs['clf_' + str(i)].predict_proba(instance.reshape(1, -1))
            instance_pred_proba.append(prediction)
        instance_pred = np.mean(np.array(instance_pred_proba), axis = 0)
        ensemble_soft_pred.append(return_class(instance_pred))
    ensemble_soft_pred = np.array(ensemble_soft_pred).reshape(labels_valid.shape)
    return ensemble_soft_pred

ensemble_hard_pred = return_hard_predictions(clfs, images_valid)
ensemble_soft_pred = return_soft_predictions(clfs, images_valid)

I decided to use accuracy as evaluation metrics. I am going to compare it also with the null accuracy, in order to see in my ensembles are above the minimum threshold.

In [6]:
from sklearn.metrics import accuracy_score

acc_hard_voting = accuracy_score(ensemble_hard_pred, labels_valid)
acc_soft_voting = accuracy_score(ensemble_soft_pred, labels_valid)
print('Hard voting ensemble accuracy: ', acc_hard_voting)
print('Soft voting ensemble accuracy: ', acc_soft_voting)

lst_labels = list(labels_train)
mode_train = max(set(lst_labels), key = lst_labels.count)
y_null = np.zeros(labels_valid.shape) + mode_train
null_accuracy = accuracy_score(y_null, labels_valid)
print('Null accuracy: ', null_accuracy)

Hard voting ensemble accuracy:  0.84
Soft voting ensemble accuracy:  0.8407
Null accuracy:  0.1


As stated in the introduction, the soft ensemble led to better perfomances.

## Neural tuning

To tune the weights of the soft probabilities I am gonna need an additional set. The idea is to split again the original training set into a train set for the estimators and a train set for the neural network.
Since I am going to modify the training space for my estimators I will compute again the accuracy score for both of the ensemble methods 

In [8]:
from sklearn.model_selection import train_test_split

images_train_, images_nn, labels_train_, labels_nn = train_test_split(images_train, labels_train,
                                                                      test_size = 0.4, random_state = 42)
images_train_.shape

(36000, 784)

Now I am going to train the estimators again as well as the accuracy scores

In [9]:
clfs_ = {}
n_estimators_ = 25

for i in range(n_estimators):
    x, y = random_sample(images_train_, labels_train_)
    log_clf = LogisticRegression(solver = 'newton-cg', multi_class = 'auto', max_iter = 500)
    log_clf.fit(x, y)
    clfs_['clf_' + str(i)] = log_clf
    print('clf trained')

clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained
clf trained


In [10]:
ensemble_hard_pred_nn = return_hard_predictions(clfs_, images_valid)
ensemble_soft_pred_nn = return_soft_predictions(clfs_, images_valid)
acc_hard_voting = accuracy_score(ensemble_hard_pred_nn, labels_valid)
acc_soft_voting = accuracy_score(ensemble_soft_pred_nn, labels_valid)
print('Hard voting ensemble accuracy: ', acc_hard_voting)
print('Soft voting ensemble accuracy: ', acc_soft_voting)

Hard voting ensemble accuracy:  0.8398
Soft voting ensemble accuracy:  0.8412


The reduction of the training set did not affect significantly the perfomance of the ensemble. On the opposite the soft voting led to a general improve of the accuracy, but we can impute this to the randomness of the training process (the single sample on which a classifier is trained is draw randomly). Now I need to create the inputs set for the neural network. Note that the following function can be also used for returning the classic soft voting predictions 

In [None]:
def return_soft_predictions(clfs, x_set, nn_inputs = False):
    ensemble_soft_pred = []
    ensemble_soft_proba = []
    for instance in x_set:
        instance_pred_proba = []
        for i in range(n_estimators):
            prediction = clfs['clf_' + str(i)].predict_proba(instance.reshape(1, -1))
            instance_pred_proba.append(prediction)
        if nn_inputs:
            ensemble_soft_proba.append(instance_pred_proba)
            ensemble_soft_pred.append(return_class(instance_pred_proba))
        else:
            instance_pred = np.mean(np.array(instance_pred_proba), axis = 0)
            ensemble_soft_proba.append(instance_pred)
            ensemble_soft_pred.append(return_class(instance_pred))
              
    ensemble_soft_proba = np.array(ensemble_soft_proba).reshape(x_set.shape[0], n_estimators * 10)
    return ensemble_soft_pred, ensemble_soft_proba

labels_nn_one_hot = np.zeros((labels_nn.shape[0], 10))
for i in range(labels_nn.shape[0]):
    labels_nn_one_hot[i, labels_nn[i]] = 1
    
labels_valid_one_hot = np.zeros((labels_valid.shape[0], 10))
for i in range(labels_valid_one_hot.shape[0]):
    labels_valid_one_hot[i, labels_valid[i]] = 1
    
nn, nn_inputs = return_soft_predictions(clfs, images_nn, True)
nn, nn_valid = return_soft_predictions(clfs, images_valid, True)

### Model building 

The last step is to build, compile and train the neural network. In this section there are a lot of decision variable to take into account to minimize the final loss (number of hidden layers, number of neuron, activation functions, optimizer...). The parameters i decided to use are not the output of any sort of tuning: this means that potentially there is still space for accuracy improvement

In [83]:
model = keras.models.Sequential([keras.layers.Dense(200, input_shape = [25 * 10]),
                                 keras.layers.Dense(150, activation = 'elu', kernel_initializer = 'he_normal',
                                                    kernel_regularizer = keras.regularizers.l2(0.02)),
                                 keras.layers.BatchNormalization(),
                                 keras.layers.Dense(150, activation = 'elu', kernel_initializer = 'he_normal',
                                                    kernel_regularizer = keras.regularizers.l2(0.02)),
                                 keras.layers.BatchNormalization(),
                                 keras.layers.Dense(150, activation = 'elu', kernel_initializer = 'he_normal',
                                                    kernel_regularizer = keras.regularizers.l2(0.02)),
                                 keras.layers.BatchNormalization(),
                                 keras.layers.Dense(10, activation = 'softmax')])
optimizer = keras.optimizers.Adam(lr = 5e-2, beta_1 = 0.9, beta_2 = 0.999, decay = 1e-2)
model.compile(loss = 'categorical_crossentropy', optimizer = optimizer, metrics = ['accuracy'])
model.fit(n, labels_nn_one_hot, batch_size = 64, epochs = 60,
          validation_data = (nn_valid, labels_valid_one_hot))
nn_tuning_acc = model.evaluate(nn_valid, labels_valid_one_hot)[1]

Epoch 1/60
Epoch 2/60
Epoch 3/60
Epoch 4/60
Epoch 5/60
Epoch 6/60
Epoch 7/60
Epoch 8/60
Epoch 9/60
Epoch 10/60
Epoch 11/60
Epoch 12/60
Epoch 13/60
Epoch 14/60
Epoch 15/60
Epoch 16/60
Epoch 17/60
Epoch 18/60
Epoch 19/60
Epoch 20/60
Epoch 21/60
Epoch 22/60
Epoch 23/60
Epoch 24/60
Epoch 25/60
Epoch 26/60
Epoch 27/60
Epoch 28/60
Epoch 29/60
Epoch 30/60
Epoch 31/60
Epoch 32/60
Epoch 33/60
Epoch 34/60
Epoch 35/60
Epoch 36/60
Epoch 37/60
Epoch 38/60
Epoch 39/60
Epoch 40/60
Epoch 41/60
Epoch 42/60
Epoch 43/60
Epoch 44/60
Epoch 45/60
Epoch 46/60
Epoch 47/60
Epoch 48/60
Epoch 49/60
Epoch 50/60
Epoch 51/60
Epoch 52/60
Epoch 53/60
Epoch 54/60
Epoch 55/60
Epoch 56/60
Epoch 57/60
Epoch 58/60
Epoch 59/60
Epoch 60/60


<tensorflow.python.keras.callbacks.History at 0x14a439a5f98>

In [90]:
print('Hard voting ensemble accuracy: ', acc_hard_voting)
print('Soft voting ensemble accuracy: ', acc_soft_voting)
print('Neural ensemble accuracy: ', round(nn_tuning_acc, 4))

Hard voting ensemble accuracy:  0.8398
Soft voting ensemble accuracy:  0.8412
Neural ensemble accuracy:  0.8442


## Conclusion and future improvements

The neural ensemble achieved significantly higher perfomance rather than traditional voting methods. Even if at a first look the improvement does not seem so remarkable, We have to keep in mind that the different ensemble criteria are build at the top of the same classifiers: this means that It will be impossible to boost global perfomance beyond a certain threshold. The downsize of the neural ensemble is that in order to implement it we need a consistent number of observations in the training set, since it will be split in two sub-set (my initial training set had 60000 images). The general suggestion to apply the neural ensemble of smaller training set is to prioritize the training of the classifiers, because they are the main driver of overall accuracy, and of course applying regularization measures to avoid overfitting.

The next step will be to wrap everything together to build a class, in order to be able to use the neural ensemble as a traditional SciKit-learn estimator.