In [151]:
import numpy as np
import sklearn.model_selection
import numpy.random as rng
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
import tensorflow as tf
import sklearn.metrics
import tensorflow.keras.regularizers as reg 
import itertools

tf.compat.v1.disable_eager_execution()

# Module 9: Practical Matters

## Cross-Validation

- So far, we estimated the performance of a model under one random training and testing split.
- This can be problematic as we can luck into a particularly good split and vice-versa. 
- The different random initializations of the weights on the NN can also introduce extra variability.
- A better approach would be to evaluate the classifier under different random splits and average the result (e.g., to get average MSE)
- An even better approach is to use _cross-validation_, which results in less biased estimates of the error.

- CV shuffles the dataset and splits it into $N$ sets of non-overlapping _folds_ (usually $N=5$)
- The model is trained and evaluated $N$ times, in each time it is trained on $N-1$ folds (green boxes) and tested on the remaining one (blue boxes), as follows (image courtsey of [Scikit-Learn Documentation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation)):

![alt text](figures/cv.jpg)

- Scikit-learn provides us with convenient functions to do CV

### Standard Splitting

In [152]:
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [10, 11], [-1, -2], [0, 0], [3, 3]])
y = np.array([1, 2, 3, 4, 5, 6, 7, 8])

In [153]:
# try using n_splits greater than the number of data points
kf = sklearn.model_selection.KFold(n_splits=5, shuffle=True, random_state=43)
print(kf)
for train_index, test_index in kf.split(X):
    print("TRAIN index:", train_index, "TEST index:", test_index)
    
    # build training and testing splits
    Xtrain, ytrain = X[train_index,:], y[train_index]
    Xtest, ytest = X[test_index,:], y[test_index]
    
    # train and evaluate the model ...

KFold(n_splits=5, random_state=43, shuffle=True)
TRAIN index: [0 1 2 4 5 6] TEST index: [3 7]
TRAIN index: [0 1 2 3 4 7] TEST index: [5 6]
TRAIN index: [0 3 4 5 6 7] TEST index: [1 2]
TRAIN index: [1 2 3 4 5 6 7] TEST index: [0]
TRAIN index: [0 1 2 3 5 6 7] TEST index: [4]


### Stratified Splitting

- If you are dealing with a classification problem, it is possible that standard KFold generates training and testing class distributions that are too different.
- To solve this issue, scikit-learn provides the `StratifiedKFold` class which ensures that the class distributions are as close as possible in training and testing

In [154]:
# generate some dummy data, input does not matter, it is the number of instances that matters
X = np.zeros((10000, 2))

# generate classes according to a categorical distribution with p=[0.05, 0.05, 0.1, 0.7, 0.1]
# this will be one-hot-encoded
y = rng.multinomial(1, [0.05, 0.05, 0.1, 0.7, 0.1], 10000)

# go back to regular class labels [3, 1, 0, ..]
y_class = np.argmax(y, axis=1)

# standard version
kf = sklearn.model_selection.KFold(n_splits=5, shuffle=True, random_state=43)

# stratified version
kf = sklearn.model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=43)

for train_index, test_index in kf.split(X, y_class):
    
    # build training and testing splits
    Xtrain, ytrain = X[train_index,:], y[train_index]
    Xtest, ytest = X[test_index,:], y[test_index]
    
    print()
    print("Train distribution: ", np.mean(ytrain, axis=0))
    print("Test distribution:  ", np.mean(ytest, axis=0))


Train distribution:  [0.05075  0.05125  0.096875 0.699125 0.102   ]
Test distribution:   [0.051 0.051 0.097 0.699 0.102]

Train distribution:  [0.05075  0.05125  0.096875 0.699125 0.102   ]
Test distribution:   [0.051 0.051 0.097 0.699 0.102]

Train distribution:  [0.05075  0.05125  0.096875 0.699125 0.102   ]
Test distribution:   [0.051 0.051 0.097 0.699 0.102]

Train distribution:  [0.050875 0.051125 0.096875 0.699125 0.102   ]
Test distribution:   [0.0505 0.0515 0.097  0.699  0.102 ]

Train distribution:  [0.050875 0.051125 0.097    0.699    0.102   ]
Test distribution:   [0.0505 0.0515 0.0965 0.6995 0.102 ]


### Grouped Splitting

- You may encounter a scenario where observations belong to groups. For example, a dataset that characterizes student performance on different subjects will have many obervations belonging to a single student. In this scenario, the student represents the "group".
- In such cases, it can be desirable to split the dataset such that no group occurs in both training and testing sets, to ensure that the model can't do well by simply memorizing the identity of the group.
- In the students example, we'd split the dataset such that observations belonging to a student occur in the training or testing sets, but not both.

In [155]:
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])
groups = np.array([0, 0, 2, 2])
group_kfold = sklearn.model_selection.GroupKFold(n_splits=2)
for train_index, test_index in group_kfold.split(X, y, groups):
    print("TRAIN:", train_index, "TEST:", test_index)
    print("Train groups: ", groups[train_index], ", Test groups: ", groups[test_index])

TRAIN: [0 1] TEST: [2 3]
Train groups:  [0 0] , Test groups:  [2 2]
TRAIN: [2 3] TEST: [0 1]
Train groups:  [2 2] , Test groups:  [0 0]


### Example: Performance of multiclass classifier

In [156]:
def train_model(Xtrain, Ytrain, n_hidden=10, l2_penalty=0.05, epochs=100, verbose=True, activation='relu', batch_size_divisor=10):
    keras.backend.clear_session()
    model = keras.Sequential(
        [
            layers.InputLayer(input_shape=(Xtrain.shape[1],)),
            layers.Dense(n_hidden, activation=activation, kernel_regularizer=reg.l2(l2_penalty)),
            layers.Dense(Ytrain.shape[1], activation="softmax", kernel_regularizer=reg.l2(l2_penalty)),
        ]
    )
    
    model.compile(loss="categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
    
    batch_size = Xtrain.shape[0] // batch_size_divisor
    
    h = model.fit(x = Xtrain, y=Ytrain, verbose=verbose, epochs=epochs, batch_size=batch_size)
    
    return model, h


In [157]:
data = np.load('data/multiclass_classification_hard.npz')
X, y = data['X'], data['y']
Y = tf.keras.utils.to_categorical(y)

In [158]:
# initialize splitter
kf = sklearn.model_selection.KFold(n_splits=5, shuffle=True, random_state=43)

# keep track of test performance
test_loss = []
test_acc = []
test_cms = []
for train_index, test_index in kf.split(X):
    
    # build training and testing splits
    Xtrain, Ytrain = X[train_index,:], Y[train_index,:]
    Xtest, Ytest = X[test_index,:], Y[test_index,:]
    model, history = train_model(Xtrain, Ytrain, verbose=False)
    
    Yhat = model.predict(Xtest, batch_size=Xtest.shape[0])
    
    loss = -np.mean(np.sum(Ytest * np.log(Yhat), axis=1))
    test_loss.append(loss)
    
    Yhat_hard = np.argmax(Yhat, axis=1)
    acc = sklearn.metrics.accuracy_score(y[test_index], Yhat_hard) 
    test_acc.append(acc)
    
    cm = sklearn.metrics.confusion_matrix(y[test_index], Yhat_hard)
    test_cms.append(cm)
    
    print("Finished split %d" % len(test_loss))
    

Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
Finished split 1
Finished split 2
Finished split 3
Finished split 4
Finished split 5


In [159]:
n = len(test_loss)
print("Mean loss: %0.2f (stderr  %0.2f), accuracy: %0.2f (stderr %0.2f)" % (np.mean(test_loss), np.std(test_loss)/np.sqrt(n), np.mean(test_acc), np.std(test_acc) / np.sqrt(n)))
print("Mean confusion matrix:")
np.mean(test_cms, axis=0)

Mean loss: 0.45 (stderr  0.01), accuracy: 0.86 (stderr 0.02)
Mean confusion matrix:


array([[33.2,  0. ,  0. ,  0. ,  0. ,  0.2],
       [ 0. , 31.2,  0. ,  2.2,  0. ,  0. ],
       [ 0.4,  0. , 32.6,  0. ,  0.4,  0. ],
       [ 0. ,  4.6,  0. , 27.8,  0. ,  1. ],
       [ 2.8,  1.2,  8.6,  1. , 15.4,  4.2],
       [ 0. ,  0. ,  0. ,  1. ,  0. , 32.2]])

### Exercises

- Set the regularizer to 1 and re-run the example, but happens to mean accuracy?
- Switch to a stratified KFold sampler, do results change?

## Hyperparameter optimization

- As you may have noticed, designing a neural network model involves many decisions: 
    - Learning rate
    - Regularization strength
    - Mini-batch size
    - Number of hidden layers
    - Number of neurons in a layer
    - Activation functions
    - Dropout rate
    - ...
- Just like the weights in a neural network model, those "decisions" are considered to be free parameters and are explicitly labeled as "hyperparameters" -- higher-level parameters of the model
- Unfortunately, hyperparameters cannot be learned via gradient descent so some other method is required to determine them
- The simplest way to is to pick a few candidate options for each hyperparameter and try all possible combinations of them, this is known as brute force or grid-search hyperparameter optimization

In [166]:
hyperparams = {
    'l2_penalty' : [0.001, 0.01, 0.1],
    'n_hidden' : [1, 5, 10],
    'batch_size_divisor' : [10]
}

# get a sorted list of hyperparameters
keys = sorted(hyperparams.keys())
print(keys)

# get the corresponding values
vals = [hyperparams[k] for k in keys]
print(vals)

# generate all combinations of those values
all_combs = itertools.product(*vals)

# attach key names to them
hyperparam_combs = []
for comb in all_combs:
    hyperparam_combs.append({ k: v for k, v in zip(keys, comb) })
    
hyperparam_combs

['batch_size_divisor', 'l2_penalty', 'n_hidden']
[[10], [0.001, 0.01, 0.1], [1, 5, 10]]


[{'batch_size_divisor': 10, 'l2_penalty': 0.001, 'n_hidden': 1},
 {'batch_size_divisor': 10, 'l2_penalty': 0.001, 'n_hidden': 5},
 {'batch_size_divisor': 10, 'l2_penalty': 0.001, 'n_hidden': 10},
 {'batch_size_divisor': 10, 'l2_penalty': 0.01, 'n_hidden': 1},
 {'batch_size_divisor': 10, 'l2_penalty': 0.01, 'n_hidden': 5},
 {'batch_size_divisor': 10, 'l2_penalty': 0.01, 'n_hidden': 10},
 {'batch_size_divisor': 10, 'l2_penalty': 0.1, 'n_hidden': 1},
 {'batch_size_divisor': 10, 'l2_penalty': 0.1, 'n_hidden': 5},
 {'batch_size_divisor': 10, 'l2_penalty': 0.1, 'n_hidden': 10}]

- In conjunction with CV, we can do brute force hyperparameter optimization as follows: for each combination of hyperparameters, compute the average test loss via CV, and pick the combination that has the smallest loss.

In [168]:
# initialize splitter
kf = sklearn.model_selection.KFold(n_splits=5, shuffle=True, random_state=43)

min_loss = np.inf
best_hyperparam = None

for hyperparam_comb in hyperparam_combs:
    
    # keep track of test performance
    test_loss = []
    test_acc = []
    
    print("Evaluating combination: ", hyperparam_comb)
    
    for train_index, test_index in kf.split(X):

        # build training and testing splits
        Xtrain, Ytrain = X[train_index,:], Y[train_index,:]
        Xtest, Ytest = X[test_index,:], Y[test_index,:]
        model, history = train_model(Xtrain, Ytrain, verbose=False, epochs=200, **hyperparam_comb)

        Yhat = model.predict(Xtest, batch_size=Xtest.shape[0])

        loss = -np.mean(np.sum(Ytest * np.log(Yhat), axis=1))
        test_loss.append(loss)

        Yhat_hard = np.argmax(Yhat, axis=1)
        acc = sklearn.metrics.accuracy_score(y[test_index], Yhat_hard) 
        test_acc.append(acc)

        cm = sklearn.metrics.confusion_matrix(y[test_index], Yhat_hard)
        test_cms.append(cm)

        print("Finished split %d" % len(test_loss))
    
    mean_loss = np.mean(test_loss)
    mean_acc = np.mean(test_acc)
    
    if mean_loss < min_loss:
        min_loss = mean_loss 
        best_hyperparam = hyperparam_comb
        print("New best: loss=%0.2f, acc=%0.2f" % (mean_loss, mean_acc))

Evaluating combination:  {'batch_size_divisor': 10, 'l2_penalty': 0.001, 'n_hidden': 1}
Finished split 1
Finished split 2
Finished split 3
Finished split 4
Finished split 5
New best: loss=1.20, acc=0.44
Evaluating combination:  {'batch_size_divisor': 10, 'l2_penalty': 0.001, 'n_hidden': 5}
Finished split 1
Finished split 2
Finished split 3
Finished split 4
Finished split 5
New best: loss=0.33, acc=0.88
Evaluating combination:  {'batch_size_divisor': 10, 'l2_penalty': 0.001, 'n_hidden': 10}
Finished split 1
Finished split 2
Finished split 3
Finished split 4
Finished split 5
New best: loss=0.29, acc=0.89
Evaluating combination:  {'batch_size_divisor': 10, 'l2_penalty': 0.01, 'n_hidden': 1}
Finished split 1
Finished split 2
Finished split 3
Finished split 4
Finished split 5
Evaluating combination:  {'batch_size_divisor': 10, 'l2_penalty': 0.01, 'n_hidden': 5}
Finished split 1
Finished split 2
Finished split 3
Finished split 4
Finished split 5
Evaluating combination:  {'batch_size_divisor'

In [169]:
best_hyperparam

{'batch_size_divisor': 10, 'l2_penalty': 0.01, 'n_hidden': 10}

## Nested Cross-Validation

- After performing hyperparameter optimization above, you may be inclined to report the CV test performance of the best performing combination
- But that is not appropriate, because CV itself was used to select the best hyperparameter combination
- In ML parlance, we say that the test data _leaked_ into training
- The correct approach to do this is to consider the hyperparameter optimization as part of the model training
- This gives rise to _nested cross-validation_ which consists of two nested loops
- The outer loop performs K-Fold cross-validation as usual
- The innter loop performs K-Fold cross-validation on the ___training set___ given by the outer loop
- First, let's refactor our code above so that it is easy to use within nested cross validation

In [171]:
def hyperparam_optimize(X, Y, hyperparam_combs, splits=5):
    
    # initialize splitter
    kf = sklearn.model_selection.KFold(n_splits=splits, shuffle=True)

    min_loss = np.inf
    best_hyperparam = None

    for hyperparam_comb in hyperparam_combs:

        # keep track of test performance
        test_loss = []
        test_acc = []

        print("Evaluating combination: ", hyperparam_comb)

        for train_index, test_index in kf.split(X):

            # build training and testing splits
            Xtrain, Ytrain = X[train_index,:], Y[train_index,:]
            Xtest, Ytest = X[test_index,:], Y[test_index,:]
            model, history = train_model(Xtrain, Ytrain, verbose=False, epochs=200, **hyperparam_comb)

            Yhat = model.predict(Xtest, batch_size=Xtest.shape[0])

            loss = -np.mean(np.sum(Ytest * np.log(Yhat), axis=1))
            test_loss.append(loss)

            Yhat_hard = np.argmax(Yhat, axis=1)
            acc = sklearn.metrics.accuracy_score(y[test_index], Yhat_hard) 
            test_acc.append(acc)

            cm = sklearn.metrics.confusion_matrix(y[test_index], Yhat_hard)
            test_cms.append(cm)

            print("Finished split %d" % len(test_loss))

        mean_loss = np.mean(test_loss)
        mean_acc = np.mean(test_acc)

        if mean_loss < min_loss:
            min_loss = mean_loss 
            best_hyperparam = hyperparam_comb
            print("New best: loss=%0.2f, acc=%0.2f" % (mean_loss, mean_acc))
    
    # now we can train the model on whole dataset that is given to this function
    model, history = train_model(X, Y, verbose=False, epochs=200, **best_hyperparam)
    
    return model, history

Now we can implement nested CV:

In [173]:
# initialize splitter
kf = sklearn.model_selection.KFold(n_splits=5, shuffle=True, random_state=43)

# keep track of test performance
test_loss = []
test_acc = []
test_cms = []
for train_index, test_index in kf.split(X):
    
    # build training and testing splits
    Xtrain, Ytrain = X[train_index,:], Y[train_index,:]
    Xtest, Ytest = X[test_index,:], Y[test_index,:]
    
    # now we perform hyper parameter optimization on the TRAINING SET only
    model, history = hyperparam_optimize(Xtrain, Ytrain, hyperparam_combs)
    
    # predict
    Yhat = model.predict(Xtest, batch_size=Xtest.shape[0])
    
    loss = -np.mean(np.sum(Ytest * np.log(Yhat), axis=1))
    test_loss.append(loss)
    
    Yhat_hard = np.argmax(Yhat, axis=1)
    acc = sklearn.metrics.accuracy_score(y[test_index], Yhat_hard) 
    test_acc.append(acc)
    
    cm = sklearn.metrics.confusion_matrix(y[test_index], Yhat_hard)
    test_cms.append(cm)
    
    print("Finished split %d" % len(test_loss))

Evaluating combination:  {'batch_size_divisor': 10, 'l2_penalty': 0.001, 'n_hidden': 1}
Finished split 1
Finished split 2
Finished split 3
Finished split 4
Finished split 5
New best: loss=1.24, acc=0.14
Evaluating combination:  {'batch_size_divisor': 10, 'l2_penalty': 0.001, 'n_hidden': 5}
Finished split 1
Finished split 2
Finished split 3
Finished split 4
Finished split 5
New best: loss=0.37, acc=0.14
Evaluating combination:  {'batch_size_divisor': 10, 'l2_penalty': 0.001, 'n_hidden': 10}
Finished split 1
Finished split 2
Finished split 3
Finished split 4
Finished split 5
New best: loss=0.31, acc=0.14
Evaluating combination:  {'batch_size_divisor': 10, 'l2_penalty': 0.01, 'n_hidden': 1}
Finished split 1
Finished split 2
Finished split 3
Finished split 4
Finished split 5
Evaluating combination:  {'batch_size_divisor': 10, 'l2_penalty': 0.01, 'n_hidden': 5}
Finished split 1
Finished split 2
Finished split 3
Finished split 4
Finished split 5
Evaluating combination:  {'batch_size_divisor'

In [174]:
n = len(test_loss)
print("Mean loss: %0.2f (stderr  %0.2f), accuracy: %0.2f (stderr %0.2f)" % (np.mean(test_loss), np.std(test_loss)/np.sqrt(n), np.mean(test_acc), np.std(test_acc) / np.sqrt(n)))
print("Mean confusion matrix:")
np.mean(test_cms, axis=0)

Mean loss: 0.28 (stderr  0.01), accuracy: 0.91 (stderr 0.01)
Mean confusion matrix:


array([[5.77391304, 4.93043478, 3.6826087 , 4.17826087, 3.07826087,
        4.71304348],
       [5.35652174, 5.96521739, 4.19565217, 4.05652174, 2.48695652,
        4.29565217],
       [5.52608696, 5.54347826, 5.13043478, 4.40434783, 2.87391304,
        4.44347826],
       [5.03478261, 4.74782609, 3.99130435, 4.6       , 2.57391304,
        4.43043478],
       [5.46956522, 5.36956522, 4.54782609, 4.00434783, 3.52173913,
        4.80869565],
       [5.03043478, 5.06086957, 4.16521739, 4.36521739, 3.52608696,
        4.98695652]])

## Feature Selection

- When you have a large number of input features, you might be interested in identifying which ones are important for prediction
- Typically, when you are exploring the input features, you'd remove duplicated features that are highly correlated or anti-correlated with each other
- After that, you may still have a large number of features to choose from
- One possible selection strategy is to exhaustively enumerate all feature combinations and chose the subset of features that performs best on test, or has the fewest features with the least performance deficit
- Another strategy is to greedily add the best features one at a time, which is much cheaper than exhaustive search
- The optimal set of features of a model should be thought of as a hyper parameter of the model
- But doing feature selection and hyper parameter optimization at the same time, while ideal, can be **_very_** expensive so one strategy is to first identify a good set of hyperparameters, then perform feature selection
- Let's implement the forward greedy selection procedure:

In [218]:
def forward_greedy_feature_selection(X, Y, features, thres):
    
    # randomly split the data into training and testing
    # you could make this more elaborate and use CV here
    # but because this is just a demo, we're going to use a single 
    # training/testing split
    Xtrain, Xtest, Ytrain, Ytest = sklearn.model_selection.train_test_split(X, Y, test_size=0.2)
    
    selected_features = set()
    
    all_features = set(features)
    
    curr_loss = None 
    
    while len(selected_features) < len(features):
        
        best_feature = None
        best_loss = np.inf 
        
        rem_features = all_features - selected_features
        
        # for each remaining feature
        for feature in rem_features:
            
            # form the candidate feature subset (existing features + the new feature)
            candidate_set = list(selected_features) + [feature]
            
            # train the model
            model, history = train_model(Xtrain[:, candidate_set], Ytrain, verbose=False, epochs=200, l2_penalty=0.01, n_hidden=10)

            # evaluate it
            Yhat = model.predict(Xtest[:, candidate_set], batch_size=Xtest.shape[0])
            
            # check the loss
            loss = -np.mean(np.sum(Ytest * np.log(Yhat), axis=1))
            
            if loss < best_loss:
                best_loss = loss
                best_feature = feature
        
            print("Finished ", feature)
        
        if curr_loss is not None:
            rel_loss = ( (curr_loss - best_loss) / curr_loss )
            print("Relative loss of best feature: %f, Absolute loss=%f" % (rel_loss, best_loss))
        # add the best performing feature if we are just starting out
        if curr_loss is None:
            selected_features.add(best_feature)
            
            curr_loss = best_loss 
            
        # add the best performing feature if it decreases the loss by more than thres%
        elif best_loss < curr_loss and ( (curr_loss - best_loss) / curr_loss ) >= thres:
            
            selected_features.add(best_feature)
            curr_loss = best_loss
        
        # otherwise, terminate the process
        else:
            break
    
    # train the model
    selected_features = list(selected_features)
    
    model, history = train_model(X[:,selected_features], Y, verbose=False, epochs=200, l2_penalty=0.01, n_hidden=10)

    return model, history, selected_features

In [219]:
# initialize splitter
kf = sklearn.model_selection.KFold(n_splits=5, shuffle=True, random_state=43)

# keep track of test performance
test_loss = []
test_acc = []
test_cms = []
for train_index, test_index in kf.split(X):
    
    # build training and testing splits
    Xtrain, Ytrain = X[train_index,:], Y[train_index,:]
    Xtest, Ytest = X[test_index,:], Y[test_index,:]
    
    model, history, selected_features = forward_greedy_feature_selection(Xtrain, Ytrain, range(X.shape[1]), 0.05)
    print("Selected: ", selected_features)
    
    # predict
    Yhat = model.predict(Xtest[:,selected_features], batch_size=Xtest.shape[0])
    
    loss = -np.mean(np.sum(Ytest * np.log(Yhat), axis=1))
    test_loss.append(loss)
    
    Yhat_hard = np.argmax(Yhat, axis=1)
    acc = sklearn.metrics.accuracy_score(y[test_index], Yhat_hard) 
    test_acc.append(acc)
    
    cm = sklearn.metrics.confusion_matrix(y[test_index], Yhat_hard)
    test_cms.append(cm)
    
    print("Finished split %d" % len(test_loss))
    

Finished  0
Finished  1
Finished  2
Finished  3
Finished  4
Finished  5
Finished  6
Finished  7
Finished  8
Finished  9
Finished  10
Finished  11
Finished  1
Finished  2
Finished  3
Finished  4
Finished  5
Finished  6
Finished  7
Finished  8
Finished  9
Finished  10
Finished  11
Relative loss of best feature: 0.646144, Absolute loss=0.274795
Finished  2
Finished  3
Finished  4
Finished  5
Finished  6
Finished  7
Finished  8
Finished  9
Finished  10
Finished  11
Relative loss of best feature: 0.071689, Absolute loss=0.255095
Finished  2
Finished  3
Finished  4
Finished  6
Finished  7
Finished  8
Finished  9
Finished  10
Finished  11
Relative loss of best feature: -0.036151, Absolute loss=0.264317
Selected:  [0, 1, 5]
Finished split 1
Finished  0
Finished  1
Finished  2
Finished  3
Finished  4
Finished  5
Finished  6
Finished  7
Finished  8
Finished  9
Finished  10
Finished  11
Finished  1
Finished  2
Finished  3
Finished  4
Finished  5
Finished  6
Finished  7
Finished  8
Finished  9
Fin

In [221]:
n = len(test_loss)
print("Mean loss: %0.2f (stderr  %0.2f), accuracy: %0.2f (stderr %0.2f)" % (np.mean(test_loss), np.std(test_loss)/np.sqrt(n), np.mean(test_acc), np.std(test_acc) / np.sqrt(n)))
print("Mean confusion matrix:")
np.mean(test_cms, axis=0)

Mean loss: 0.26 (stderr  0.00), accuracy: 0.92 (stderr 0.01)
Mean confusion matrix:


array([[33.2,  0. ,  0. ,  0. ,  0.2,  0. ],
       [ 0. , 31.6,  0. ,  1.8,  0. ,  0. ],
       [ 0.2,  0. , 31.4,  0. ,  1.8,  0. ],
       [ 0. ,  2.6,  0. , 29. ,  0.8,  1. ],
       [ 0.4,  0. ,  3.8,  0.8, 27. ,  1.2],
       [ 0. ,  0. ,  0. ,  1.2,  1.2, 30.8]])

## Recap

- Cross-validation is used to estimate the generalization performance of ML models
- Hyperparameter optimization is concerned with finding optimal high-level parameters of a NN (parameters that are not trained with gradient descent)
- It is important not to report model performance based CV results obtained from hyperparameter optimization
- Nested cross-validation solves this problem by treating the hyperparameter optimization as part of the model itself, and performing it on each training set generated by CV
- Feature selection is concerned with finding the optimal subset of features for prediction
- Greedy forward feature selection adds one feature at a time until some criterion is met (i.e., minimum improvement percentage)
- Like hyper parameter optimization, feature selection should only be performed on the training set, ideally via nested cross-validation
- The point is: NEVER touch the test set until you are ready. That means don't use it for hyper parameter optimization or feature selection.
- To be safe when starting a data science project, split your data into development and testing. Do all your model development on the development set, and do one final test on the test set.