This notebook shows how to perform k-fold cross-validation and a grid search over hyperparameters.

K-fold cross-validation splits the data into k subsets (folds) and trains the neural network k times. Each time, it holds out one of the k folds as the validation set and trains on the other k-1 folds. After training the network k times, it reports the mean of the loss function over the k validation sets. This gives a good estimate of how the neural network will perform on unseen data.
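
As a minimal sketch of the procedure described above, the following pure-Python loop splits toy data into k folds and averages the validation loss. The helper names (`kfold_indices`, `cross_validate`) and the "predict the training mean" model are hypothetical stand-ins chosen for illustration; they are not the Keras model used later in this notebook.

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]

def cross_validate(ys, k, seed=0):
    """Return the per-fold MSE of a toy model that predicts the training mean."""
    folds = kfold_indices(len(ys), k, seed)
    losses = []
    for i, val_idx in enumerate(folds):
        # Train on the other k-1 folds.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        prediction = sum(ys[j] for j in train_idx) / len(train_idx)  # "training"
        # Evaluate on the held-out fold.
        mse = sum((ys[j] - prediction) ** 2 for j in val_idx) / len(val_idx)
        losses.append(mse)
    return losses

ys = [2.0 * x + 1.0 for x in range(20)]  # made-up targets
losses = cross_validate(ys, k=5)
mean_loss = sum(losses) / len(losses)
print("per-fold MSE:", losses)
print("mean MSE over folds:", mean_loss)
```

Every sample is used for validation exactly once, which is why the mean over folds is a less noisy estimate than a single train/validation split.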

The grid hyperparameter search lets you systematically test the neural network over every combination of a set of hyperparameter values. This is better than tuning the parameters one at a time, because many parameters are interdependent. It can also be more exhaustive than testing hyperparameters manually. However, it can take a long time and may need to be run on a cluster.

It is also probably a good idea to use RandomizedSearchCV instead of GridSearchCV, because a random search varies every parameter in every iteration. Thus, if the network is insensitive to one of the parameters, you won't waste runs that vary only that parameter.
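
The difference can be seen by counting how many distinct values of each parameter a fixed budget of runs explores. The sketch below uses only the standard library and the batch-size and epoch ranges from this notebook. Drawing the random values without replacement is a simplification made here so that the count is deterministic; it is not how RandomizedSearchCV samples from continuous distributions.

```python
import itertools
import random

# Grid search: 3 batch sizes x 3 epoch counts = 9 runs,
# but only 3 distinct values of each parameter are ever tried.
batch_sizes = [10, 40, 80]
epoch_counts = [50, 100, 150]
grid_candidates = [
    {"batch_size": b, "epochs": e}
    for b, e in itertools.product(batch_sizes, epoch_counts)
]

# Random search with the same budget of 9 runs: both parameters are drawn
# afresh for every run, so each run tries a new value of each parameter.
rng = random.Random(11)
n_runs = len(grid_candidates)
random_batch_sizes = rng.sample(range(10, 81), n_runs)    # 9 distinct values in [10, 80]
random_epoch_counts = rng.sample(range(50, 151), n_runs)  # 9 distinct values in [50, 150]
random_candidates = [
    {"batch_size": b, "epochs": e}
    for b, e in zip(random_batch_sizes, random_epoch_counts)
]

def n_distinct(candidates, name):
    """Count how many different values of one parameter were tried."""
    return len({c[name] for c in candidates})

print("grid:  ", n_distinct(grid_candidates, "batch_size"), "distinct batch sizes")
print("random:", n_distinct(random_candidates, "batch_size"), "distinct batch sizes")
```

With the same nine runs, the grid probes three batch sizes while the random search probes nine, so if the number of epochs turns out not to matter, the random search has still learned three times as much about the batch size.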

In [1]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

Using TensorFlow backend.


Load the data and convert it to Python 3 types

In [2]:
# Columns: Run, Energy, Zen, Time delay, Q400, MuonVEM, nMuon, Type
# Note that right now, nMuon is useless
data = np.load('./data/NN_data.npy')

# convert the python 2 bytes into python 3 format
data_ = []
for i in range(len(data)):
    data_.append([])
    for j in range(0,7):
        data_[i].append(float(data[i,j]))
    if data[i,7] == b'PPlus':
        data_[i].append(1)
    else:
        data_[i].append(2)
data = np.array(data_)

Select data with $\log_{10}(E) \in [16.0, 16.5]$ and $\cos(\mathrm{zenith}) > 0.9$

In [3]:
data_ = []
for shower in data:
    E    = shower[1]
    logE = np.log10(E)
    zen  = shower[2]
    if logE >= 16 and logE <= 16.5 and np.cos(zen) > 0.9:
        data_.append(shower)
data_trimmed = np.array(data_)
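
The same selection can be written without an explicit loop by using a NumPy boolean mask. The sketch below runs on a small made-up array (`toy`) with the same column layout rather than on the real data, and, like the loop above, it assumes the zenith angle is stored in radians.

```python
import numpy as np

# Toy stand-in for `data`: columns are Run, Energy, Zen, Time delay,
# Q400, MuonVEM, nMuon, Type. All values here are made up.
toy = np.array([
    [0, 10**16.2, 0.1, 0.0, 0.0, 0.0, 0.0, 1],  # kept
    [1, 10**15.5, 0.1, 0.0, 0.0, 0.0, 0.0, 1],  # energy too low
    [2, 10**16.4, 1.0, 0.0, 0.0, 0.0, 0.0, 2],  # cos(zen) ~ 0.54, too inclined
    [3, 10**16.1, 0.2, 0.0, 0.0, 0.0, 0.0, 2],  # kept
])

logE = np.log10(toy[:, 1])
cos_zen = np.cos(toy[:, 2])  # assumes the zenith angle is in radians
# Combine the three cuts element-wise; parentheses are required around each comparison.
mask = (logE >= 16.0) & (logE <= 16.5) & (cos_zen > 0.9)
toy_trimmed = toy[mask]
print(toy_trimmed[:, 0])  # run numbers of the showers that pass the cuts
```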

Split the data into input X (Energy, Zen, Time delay, Q400) and output Y (MuonVEM)

In [5]:
X = data_trimmed[:,1:-3]
Y = data_trimmed[:,-3]

Define neural-network model to be used in the cross-validation

In [6]:
def model_base():
    # create model
    model = Sequential()
    model.add(Dense(4,input_dim=4,kernel_initializer='normal',activation='relu'))
    model.add(Dense(1,kernel_initializer='normal'))
    # compile model
    model.compile(loss='mean_squared_error',optimizer='adam')
    return model

The next cell does several things:

1. Sets a random seed 
2. Creates a pipeline that allows the data to be rescaled within each fold
3. Performs k-fold cross-validation on the neural network model

The output is the mean of the mean squared error over the k folds, with the standard deviation in parentheses.
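
Rescaling within each fold means the scaler's mean and standard deviation are computed from the training rows of that fold only and then applied unchanged to the validation rows, so no information about the validation set leaks into training. A minimal NumPy sketch of one such fold, on made-up data (`X_toy` and the index split are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(11)
X_toy = rng.normal(loc=5.0, scale=3.0, size=(12, 2))  # made-up inputs

train_idx = np.arange(0, 8)   # one fold's training rows
val_idx = np.arange(8, 12)    # the same fold's validation rows

# Fit the scaling statistics on the training rows only ...
mean = X_toy[train_idx].mean(axis=0)
std = X_toy[train_idx].std(axis=0)

# ... and apply those same statistics to both parts of the fold.
X_train_scaled = (X_toy[train_idx] - mean) / std
X_val_scaled = (X_toy[val_idx] - mean) / std

print("train mean after scaling:", X_train_scaled.mean(axis=0))
print("validation mean after scaling:", X_val_scaled.mean(axis=0))
```

The scaled training rows have zero mean and unit variance by construction; the validation rows generally do not, which is the expected behaviour. Placing the scaler inside the Pipeline makes scikit-learn repeat this fit-on-train, apply-to-validation step for every fold.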

In [None]:
seed = 11
np.random.seed(seed)

# The Pipeline - 1. rescale data, 2. wrapper for sklearn
scaler = ('rescale',StandardScaler())
estimator = ('mlp',KerasRegressor(build_fn=model_base,epochs=50,batch_size=5,verbose=0))
pipeline = Pipeline([scaler,estimator])

# evaluate the model with kfold cross-validation
n_splits = 17 # data set has len = 1462. 17 is a divisor close to 10
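# random_state only has an effect (and is only accepted by recent scikit-learn) when shuffle=True
kfold = KFold(n_splits=n_splits,shuffle=True,random_state=seed)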

results = cross_val_score(pipeline,X,Y,cv=kfold)
print("Results: %.2f (%.2f) MSE" % (-1*results.mean(), results.std()))

Now I will grid search over different numbers of epochs and batch sizes to find the optimum. (Note: the cell below redefines the pipeline and kfold from the previous cell, this time with only 2 folds.)

NOTE: This takes too long, so I am not going to run a grid search or random search right now. In principle this should be done to find the best hyperparameters. Instead, I am going to build individual NNs in neural_network_manual_tuning.ipynb.

In [None]:
seed = 11
np.random.seed(seed)

# The Pipeline - 1. rescale data, 2. wrapper for sklearn
scaler = ('rescale',StandardScaler())
estimator = ('mlp',KerasRegressor(build_fn=model_base,verbose=0))
pipeline = Pipeline([scaler,estimator])

# evaluate the model with kfold cross-validation
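n_splits = 2 # data set has len = 1462, which 2 divides evenly; few folds keep the grid search short
# random_state only has an effect (and is only accepted by recent scikit-learn) when shuffle=True
kfold = KFold(n_splits=n_splits,shuffle=True,random_state=seed)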

# define the grid search parameters
batch_size = [10, 40, 80]
epochs = [50, 100,150]#,200,250]
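# parameters of a Pipeline step are addressed as <step name>__<parameter>
param_grid = dict(mlp__batch_size=batch_size, mlp__epochs=epochs)
# note I set n_jobs = -1. This uses all available cores. Set it equal to 1,2,3... to specify number of cores
# the scorer is the negative MSE, so the best score is the one closest to zero
grid = GridSearchCV(pipeline, param_grid=param_grid, scoring='neg_mean_squared_error',cv=kfold,n_jobs=-1)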
grid_result = grid.fit(X, Y)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
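
For comparison, here is a hedged sketch of the random-search alternative recommended at the top of this notebook. To keep it fast and self-contained it runs on made-up data and uses scikit-learn's `Ridge` as a stand-in for the Keras regressor; the step name `ridge`, the candidate `alpha` values, and the toy arrays are all assumptions for illustration. With the real network, the search would be over `mlp__batch_size` and `mlp__epochs` instead.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(11)
X_toy = rng.normal(size=(60, 4))  # made-up inputs
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=60)

# Ridge is a stand-in for the Keras regressor so the sketch runs quickly.
toy_pipeline = Pipeline([("rescale", StandardScaler()), ("ridge", Ridge())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_distributions = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = RandomizedSearchCV(
    toy_pipeline,
    param_distributions=param_distributions,
    n_iter=3,                          # try 3 of the 5 candidate values
    scoring="neg_mean_squared_error",  # higher (closer to zero) is better
    cv=KFold(n_splits=2, shuffle=True, random_state=11),
    random_state=11,
)
search.fit(X_toy, y_toy)
print("Best: %f using %s" % (search.best_score_, search.best_params_))
```

The `n_iter` argument fixes the total number of candidates regardless of how many parameters are searched, which is what makes the cost of a random search easier to control than that of a full grid.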