# Cross Validation with Keras

A brief guide to k-fold validation

Cross validation is the current gold standard for guarding against overfitting. The essence of all cross-validation methods is the same: we split our data into one or more "train" sets, where we conduct all model fitting and parameter tuning, and a "validation" set, where the model is evaluated.

Typically, cross-validation is used to select the best free parameters of a model in order to train a final model on the entire dataset. 

In the context of neural networks, the "free parameters" refer to the architecture of the network (i.e. how many layers, how many neurons in each layer).


In [1]:
%matplotlib inline
import numpy as np
import sklearn.model_selection as ms
from sklearn.datasets import make_classification
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.utils import to_categorical
from sklearn.metrics import accuracy_score


Using TensorFlow backend.


Let's generate some data and split it into a training set and a testing set. All of our fitting and optimization will be conducted on the training set and NOT the testing set. To emphasize this point, any portion of the training set that is held out to evaluate the model during this process is referred to as a "validation" set.

In [2]:
X, y = make_classification(n_samples=200, n_features=10, 
                                     n_informative=5, 
                                     scale=2.0, 
                                     shuffle=True, random_state=42)

X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.5, random_state=42)
# put X_test and y_test in a "box" for later. We won't touch these until the very end.

# We also need to vectorize y_train for neural network testing (See other notebook for an explanation)
y_train_vectorized = to_categorical(y_train)
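
As a rough illustration of what `to_categorical` does to the labels, the NumPy-only sketch below builds the same kind of one-hot matrix by hand. The array `y_demo` is a made-up example, not the notebook's data.

```python
import numpy as np

# Hypothetical integer labels for a two-class problem (illustration only)
y_demo = np.array([0, 1, 1, 0])

# One-hot encode: row i has a 1 in column y_demo[i] and 0 elsewhere
n_classes = y_demo.max() + 1
one_hot = np.eye(n_classes)[y_demo]
print(one_hot)

# np.argmax along axis 1 recovers the original integer labels
recovered = np.argmax(one_hot, axis=1)
print(recovered)
```

This is also why the cross-validation cells later apply `np.argmax(proba, axis=1)`: the network outputs one column per class, and the index of the largest column is the predicted label.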

While scikit-learn's cross-validation utilities are designed for use with its own classifiers, they are general purpose and work well with Keras.

## Set up models
Let's create our candidate models that we would like to compare using Keras (refer to the other notebook for an explanation of all of the steps)

In [3]:
sgd = SGD(lr=0.001, decay=1e-7, momentum=.9)  # Stochastic gradient descent
model1_layer_sizes = [X_train.shape[1], 10, 5, np.unique(y_train).shape[0]]  
# 2 hidden layers of size 10 and 5, respectively
model2_layer_sizes = [X_train.shape[1], 5, 5, 10, 10, 5, 5, np.unique(y_train).shape[0]]  
# 6 hidden layers 

In [4]:
# create model 1
def build_model1():
    model1 = Sequential()
    model1.add(Dense(
            input_dim=model1_layer_sizes[0], 
            units=model1_layer_sizes[1],
            activation="relu"
        ))

    # Now our second hidden layer with 10 inputs (from the first
    # hidden layer) and 5 outputs. Also with ReLU activation
    model1.add(Dense(
        input_dim=model1_layer_sizes[1], 
        units=model1_layer_sizes[2],
        activation="relu"
    ))

    # Finally, add a readout layer, mapping the 5 hidden units
    # to two output units using the softmax function
    model1.add(Dense(units=model1_layer_sizes[3], 
                    kernel_initializer='uniform',
                    activation="softmax"))

    model1.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=["accuracy"])  

    return model1

In [5]:
# create model 2
def build_model2():

    model2 = Sequential()
    model2.add(Dense(
            input_dim=model2_layer_sizes[0], 
            units=model2_layer_sizes[1],
            activation="relu"
        ))

    # Second hidden layer: 5 inputs (from the first hidden layer)
    # and 5 outputs, also with ReLU activation
    model2.add(Dense(
        input_dim=model2_layer_sizes[1], 
        units=model2_layer_sizes[2],
        activation="relu"
    ))

    model2.add(Dense(
        input_dim=model2_layer_sizes[2], 
        units=model2_layer_sizes[3],
        activation="relu"
    ))

    model2.add(Dense(
        input_dim=model2_layer_sizes[3], 
        units=model2_layer_sizes[4],
        activation="relu"
    ))

    model2.add(Dense(
        input_dim=model2_layer_sizes[4], 
        units=model2_layer_sizes[5],
        activation="relu"
    ))

    model2.add(Dense(
        input_dim=model2_layer_sizes[5], 
        units=model2_layer_sizes[6],
        activation="relu"
    ))

    # Finally, add a readout layer, mapping the 5 hidden units
    # to two output units using the softmax function
    model2.add(Dense(units=model2_layer_sizes[7], 
                    kernel_initializer='uniform',
                    activation="softmax"))

    model2.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=["accuracy"])  

    return model2

## Set up K-fold validation

Recall that in K-fold validation, the data are partitioned into k chunks so that every data point serves as a testing point exactly once. A sketch of the k-fold procedure is written below:

1) partition the data into k chunks. 

2) train your model using k-1 chunks

3) test your model on the remaining chunk

4) repeat for all k chunks and average the results
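
The four steps above can be sketched by hand with NumPy alone. This is only an illustration under stated assumptions: the toy data (`X_demo`, `y_demo`) and the nearest-centroid "model" are hypothetical stand-ins chosen to keep the example small, not the networks used in this guide.

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy two-class data (hypothetical): two Gaussian blobs in 2-D
X_demo = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
                    rng.normal(+1.0, 1.0, size=(50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

def fit_centroids(X, y):
    """Stand-in 'model': the mean of each class."""
    return np.array([X[y == c].mean(axis=0) for c in np.unique(y)])

def predict_centroids(centroids, X):
    """Assign each row to the class with the nearest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

k = 4
# 1) partition the shuffled indices into k chunks
chunks = np.array_split(rng.permutation(len(X_demo)), k)

fold_accuracies = []
for i in range(k):
    val_idx = chunks[i]
    # 2) train on the other k-1 chunks
    train_idx = np.concatenate([chunks[j] for j in range(k) if j != i])
    centroids = fit_centroids(X_demo[train_idx], y_demo[train_idx])
    # 3) test on the held-out chunk
    pred = predict_centroids(centroids, X_demo[val_idx])
    fold_accuracies.append(np.mean(pred == y_demo[val_idx]))

# 4) average the k results
mean_accuracy = float(np.mean(fold_accuracies))
print(fold_accuracies, mean_accuracy)
```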

The partitioning procedure is already implemented in scikit-learn. We demonstrate its use below:

In [6]:
k = 4  # 4 fold cross validation
kf = ms.KFold(k, shuffle=True)  # kf is an object that gets applied to the dataset


We can pass any dataset into kf.split(), and it will return the indices of the "train" and "validation" partitions.


In [7]:
for train_idx, val_idx in kf.split(X_train):
    # train_idx are the indices of the training set
    # val_idx are the indices of the validation set
    # notice we haven't used the testing set (from the beginning of this notebook!)
    print(len(train_idx), len(val_idx))
    
print(train_idx, val_idx)

75 25
75 25
75 25
75 25
[ 0  1  3  5  6  7  8  9 10 11 12 14 15 16 17 18 19 21 22 23 24 26 27 28
 29 33 34 35 36 39 41 42 43 45 46 47 48 49 50 53 54 55 56 57 59 60 61 62
 63 64 65 66 68 70 71 72 76 78 79 81 82 83 85 86 88 89 90 91 92 93 94 96
 97 98 99] [ 2  4 13 20 25 30 31 32 37 38 40 44 51 52 58 67 69 73 74 75 77 80 84 87
 95]


Let's make sense of that output above:

The first 4 lines are just the sizes of the training and validation sets. As you can see, in every iteration of the k-fold procedure the training set is of size 75 and the validation set is of size 25.

Finally, we print the indices from the last iteration just to verify that the splitting worked.
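
A slightly stronger check than printing the last fold is to count how often each index is used for validation. The sketch below assumes 100 rows (the size of the training set here) and uses a separate `kf_demo` object with a fixed `random_state`, which is an addition made only so the check is repeatable.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 100  # same size as the training set used in this guide
kf_demo = KFold(4, shuffle=True, random_state=0)  # seed fixed only for repeatability

# Count how many times each index lands in a validation fold
val_counts = np.zeros(n_samples, dtype=int)
for train_idx, val_idx in kf_demo.split(np.zeros((n_samples, 1))):
    # Within a fold, the training and validation indices never overlap
    assert len(np.intersect1d(train_idx, val_idx)) == 0
    val_counts[val_idx] += 1

# Every index should have been validated exactly once
print(np.unique(val_counts))
```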

Let's use the kf object to conduct cross validation

## Cross validation workflow:

We will use K-fold cross-validation to determine which of two multi-layer perceptron architectures is best for the data we have. The proper procedure for selecting and training a model is as follows:

1. Use cross-validation on the training set to determine the correct free parameters

2. Train the model on the entire training set using the optimized parameters

3. Test the model on the testing set (aka deploy the model)
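
Before working through these steps with Keras, here is a compact end-to-end sketch of the same workflow. To keep it fast and self-contained it swaps in scikit-learn's `LogisticRegression` with two hypothetical regularization strengths as the "candidate models". That substitution is an assumption made for illustration, and the names `X_demo`, `candidates`, and `folds` are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.5, random_state=42)

# Hypothetical candidates: two regularization strengths stand in for two architectures
candidates = {"C=0.01": 0.01, "C=1.0": 1.0}

# Build the folds once so every candidate is scored on the same partitions
folds = list(KFold(4, shuffle=True, random_state=0).split(X_tr))

# Step 1: cross-validate each candidate on the training set only
cv_scores = {}
for name, C in candidates.items():
    scores = []
    for train_idx, val_idx in folds:
        clf = LogisticRegression(C=C, max_iter=1000)
        clf.fit(X_tr[train_idx], y_tr[train_idx])
        scores.append(clf.score(X_tr[val_idx], y_tr[val_idx]))
    cv_scores[name] = float(np.mean(scores))

best_name = max(cv_scores, key=cv_scores.get)

# Step 2: refit the chosen candidate on the full training set
final_clf = LogisticRegression(C=candidates[best_name], max_iter=1000)
final_clf.fit(X_tr, y_tr)

# Step 3: evaluate once on the held-out testing set
test_accuracy = final_clf.score(X_te, y_te)
print(cv_scores, best_name, test_accuracy)
```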

### Step 1: Run cross validation and compare the average performance for each model

Cross validation for model 1

In [8]:
# we can pass any dataset into kf.split(), and it will return the indices of the "train" and "validation" partitions
accuracies = []

# STEP 1: partition the data chunks
for train_idx, val_idx in kf.split(X_train):
    # train_idx are the indices of the training set
    # val_idx are the indices of the validation set
    
    # we need to rebuild the model every time because by default it learns incrementally
    model1 = build_model1()  
    
    # STEP 2: train the model on the k-1 chunks
    model1.fit(X_train[train_idx], y_train_vectorized[train_idx], epochs=500, batch_size=50, verbose = 0)
    
    # STEP 3: predict the kth chunk and evaluate accuracy
    proba = model1.predict_proba(X_train[val_idx], batch_size=32)  # predict the classes for the validation set
    classes = np.argmax(proba, axis=1)
    
    # save the accuracy
    accuracies.append(accuracy_score(y_train[val_idx], classes))

# STEP 4: average across the k accuracies
model1_accuracy = np.array(accuracies).mean()  # the mean performance of model 1

Cross validation for model 2

In [9]:
# we can pass any dataset into kf.split(), and it will return the indices of the "train" and "validation" partitions
accuracies = []

# STEP 1: partition the data chunks
for train_idx, val_idx in kf.split(X_train):
    # train_idx are the indices of the training set
    # val_idx are the indices of the validation set
    
    # we need to rebuild the model every time because by default it learns incrementally
    model2 = build_model2()  # use the builder for model 2, not model 1
    
    # STEP 2: train the model on the k-1 chunks
    model2.fit(X_train[train_idx], y_train_vectorized[train_idx], epochs=500, batch_size=50, verbose = 0)
    
    # STEP 3: predict the kth chunk and evaluate accuracy
    proba = model2.predict_proba(X_train[val_idx], batch_size=32)  # predict the classes for the validation set
    classes = np.argmax(proba, axis=1)
    
    # save the accuracy
    accuracies.append(accuracy_score(y_train[val_idx], classes))

# STEP 4: average across the k accuracies
model2_accuracy = np.array(accuracies).mean()  # the mean performance of model 2
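
One detail worth noting: because `kf` is created with `shuffle=True` and no `random_state`, each call to `kf.split()` draws a fresh shuffle, so the two loops above score the models on different partitions. The sketch below shows this and one possible way around it, storing the folds in a list (here called `folds`, a name introduced for this example) and reusing them.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.zeros((100, 1))  # placeholder data; only the number of rows matters
kf_demo = KFold(4, shuffle=True)  # no random_state, as in this guide

# Two separate calls to split() draw two separate shuffles
first = [val for _, val in kf_demo.split(X_demo)]
second = [val for _, val in kf_demo.split(X_demo)]
same_folds = all(np.array_equal(a, b) for a, b in zip(first, second))
print("identical folds across calls:", same_folds)

# Storing the folds once lets both models be scored on identical partitions
folds = list(kf_demo.split(X_demo))
for model_name in ["model 1", "model 2"]:
    for train_idx, val_idx in folds:
        pass  # build, fit, and score the model for `model_name` here
```

Passing an integer `random_state` to `KFold` is another option, since the same seed then reproduces the same folds on every call.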

Compare the two averaged accuracies

In [10]:
print("Model 1: {} | Model 2: {}".format(model1_accuracy, model2_accuracy))

Model 1: 0.76 | Model 2: 0.7100000000000001


After 500 epochs, model 1 appears to perform better than model 2. So, let's select model1 as our final model.
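
To get a feel for how much weight a gap of this size can bear, the sketch below computes a rough binomial standard error for each accuracy. It treats the 100 validation predictions as independent, which is only an approximation in cross validation, so the numbers should be read as a ballpark rather than a formal test.

```python
import math

n_validated = 100   # every training point is validated exactly once
acc_model1 = 0.76   # from the output above
acc_model2 = 0.71   # from the output above

def binomial_se(p, n):
    """Approximate standard error of an accuracy p measured on n samples."""
    return math.sqrt(p * (1.0 - p) / n)

se1 = binomial_se(acc_model1, n_validated)
se2 = binomial_se(acc_model2, n_validated)
print("model 1: {:.2f} +/- {:.3f}".format(acc_model1, se1))
print("model 2: {:.2f} +/- {:.3f}".format(acc_model2, se2))
print("gap: {:.2f}".format(acc_model1 - acc_model2))
```

Both standard errors come out at a little over 0.04, about the same size as the 0.05 gap, so a comparison on this little data can easily flip from run to run.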

### Step 2: Fit the final model

In [11]:
final_model = build_model1()
final_model.fit(X_train, y_train_vectorized, 
                epochs=500, batch_size=50, verbose = 0)  
# fit the final model on the FULL training set

<keras.callbacks.History at 0x7fa1326703c8>

### Step 3: Evaluate the final model

Notice that we haven't touched X_test and y_test until now, and we ONLY use them once.

In [12]:
proba = final_model.predict_proba(X_test, batch_size=32)  # predict class probabilities for the test set
classes = np.argmax(proba, axis=1)
print("Test set performance: {}".format(accuracy_score(y_test, classes)))

Test set performance: 0.77


Normally, we wouldn't fit the second model because we already chose the first. But we fit the second model below just to show that cross validation did indeed guide us toward the better model.

(Note that by chance, sometimes model 2 will actually perform better than model 1)

In [13]:
final_model_alt = build_model2()
final_model_alt.fit(X_train, y_train_vectorized, epochs=500, batch_size=50, verbose = 0)
proba = final_model_alt.predict_proba(X_test, batch_size=32)  # predict class probabilities for the test set
classes = np.argmax(proba, axis=1)
print("Test set performance: {}".format(accuracy_score(y_test, classes)))

Test set performance: 0.48


## Caveats

In the context of deep learning, the number of epochs is typically set as high as is practical (keeping computation time in mind). This will be discussed later when we cover stochastic gradient descent. In this guide, the problem of "overfitting" is treated as arising from the number of parameters being fitted, which manifests as the architecture of the model, rather than from the number of iterations through the data.
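
To make "number of parameters" concrete, the sketch below counts the weights and biases of the two architectures from their layer sizes. It assumes 10 input features and 2 output classes, as in the data generated above, and that every `Dense` layer has a bias term (the Keras default). The helper `count_dense_parameters` is a name introduced for this example.

```python
def count_dense_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Layer sizes as defined earlier, assuming 10 input features and 2 classes
model1_sizes = [10, 10, 5, 2]
model2_sizes = [10, 5, 5, 10, 10, 5, 5, 2]

params1 = count_dense_parameters(model1_sizes)
params2 = count_dense_parameters(model2_sizes)
print(params1, params2)  # 177 and 352
```

Model 2 therefore has roughly twice as many parameters to fit from the same 100 training points.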