# MNIST with sklearn

The goals of this exercise are to
* explore some of the sklearn functionality for training an MLP classifier (see https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier)  
* select hyper-parameters by using cross-validation 
* learn how to compute the confusion matrix and its derived quantities and how to interpret them
* explore the test error as a function of the complexity (number of units, number of layers)
* explore the impact of L2 regularisation

__IMPORTANT REMARK__: Here we follow the sklearn convention that the first array index enumerates the samples, i.e. data arrays have shape (number of samples, number of features). 

In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

## Load and prepare the data 

In [None]:
datadir = 'data/'

In [None]:
def normalize(x_train, x_test):
    """
    Normalizes the pixel values of the images - mean and stdev are computed from the training set.
    
    Parameters:
    x_train -- Array of training samples of shape (m1,n) where m1,n are the number of samples and features, respectively.  
    x_test -- Array of test samples of shape (m2,n) where m2,n are the number of samples and features, respectively. 
    
    Returns:
    The arrays with the normalized train and test samples.  
    """
    # Convert to float arrays so that this also works for integer pixel data and
    # pandas DataFrames, and so that the inputs are not modified in place.
    x_train = np.asarray(x_train, dtype=np.float64)
    x_test = np.asarray(x_test, dtype=np.float64)
    mean = np.mean(x_train)
    std = np.std(x_train)
    return (x_train - mean) / std, (x_test - mean) / std
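
A quick way to see what this kind of normalization does is to run it on a small synthetic array. The sketch below is only an illustration: the helper name `standardize` and the random integer "pixels" are assumptions made for this example, not part of the exercise data. It shows the samples-first shape convention and that the statistics come from the training set only.

```python
import numpy as np

def standardize(x_train, x_test):
    """Scale both sets with the global mean and std of the training set (hypothetical helper)."""
    x_train = np.asarray(x_train, dtype=np.float64)
    x_test = np.asarray(x_test, dtype=np.float64)
    mean, std = x_train.mean(), x_train.std()
    return (x_train - mean) / std, (x_test - mean) / std

# Synthetic stand-in for image data: rows are samples, columns are features.
rng = np.random.default_rng(0)
raw_train = rng.integers(0, 256, size=(6, 4))  # 6 samples, 4 features
raw_test = rng.integers(0, 256, size=(2, 4))   # 2 samples, 4 features

train, test = standardize(raw_train, raw_test)

print(train.shape, test.shape)    # shapes are unchanged
print(train.mean(), train.std())  # training set: mean close to 0, std close to 1
print(test.mean(), test.std())    # test set: generally not exactly 0 and 1
```

The test set is scaled with the training statistics, so its mean and standard deviation will usually differ slightly from 0 and 1.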

In [None]:
# If fetch_openml fails with an SSL certificate error, run this workaround.
# Note that it disables HTTPS certificate verification for this session.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

In [None]:
x,y = fetch_openml('mnist_784', data_home=datadir, return_X_y=True)
x_train0, x_test0, y_train, y_test = train_test_split(x, y, test_size=10000, random_state=1)
x_train, x_test = normalize(x_train0, x_test0)

In [None]:
x_train.shape

## Specify Model Family and learn how to compute the metrics

#### Model
Use the functionality of scikit learn to configure a MLP and its training procedure with
* hidden layers: 0-2 layers with suitable number of units per layer
* mini-batch gradient descent with given batch_size (no advanced optimisers)
* constant learning rate (no learning rate schedules)
* number of epochs
* no regularisation such as L2 penalty or early stopping
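
The configuration described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the reference solution: it uses the small `load_digits` dataset bundled with sklearn as a stand-in for MNIST, one hidden layer with 32 units, and 20 epochs, all chosen only to keep the example fast. Note that `solver='sgd'` uses a momentum of 0.9 by default, so `momentum=0.0` is set here on the assumption that "no advanced optimisers" means plain mini-batch gradient descent.

```python
import warnings

from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Small stand-in dataset (8x8 digit images, pixel values 0..16), not MNIST itself.
X, y = load_digits(return_X_y=True)
X = X / 16.0

mlp = MLPClassifier(hidden_layer_sizes=(32,),   # one hidden layer, 32 units (assumed)
                    solver='sgd',               # mini-batch gradient descent
                    momentum=0.0,               # assumption: no momentum = plain SGD
                    batch_size=64,
                    learning_rate='constant',   # no learning rate schedule
                    learning_rate_init=0.1,
                    max_iter=20,                # number of epochs for 'sgd'
                    alpha=0.0,                  # no L2 penalty
                    early_stopping=False,
                    random_state=0)

# Stopping after max_iter epochs triggers a ConvergenceWarning; silence it here.
with warnings.catch_warnings():
    warnings.simplefilter('ignore', ConvergenceWarning)
    mlp.fit(X, y)

print('epochs run:', len(mlp.loss_curve_))
print('final training loss:', mlp.loss_curve_[-1])
print('training accuracy:', mlp.score(X, y))
```

For the `'sgd'` solver, `max_iter` counts epochs (passes over the training data), and `loss_curve_` holds one loss value per epoch.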

#### Metrics
Compute the train and test error (or, equivalently, the accuracy) as well as the per-class precision, recall and f1-score.

__See__:
* https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier
* https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

## First Training Run

Run the training and plot the training loss with a first set of values:
* no hidden layers
* mini-batchsize: 64
* learning rate: 0.1
* 100 epochs

Compute the Metrics.
Which digits are hard to predict?  

#### MODEL

In [None]:
from sklearn.neural_network import MLPClassifier

# Basic Hyperparameters
hidden_layer_sizes = ()
batch_size = 64
learning_rate = 0.1
nepochs = 100

# Regularisation:
alpha = 0.0 # L2 regularisation constant
early_stopping = False
n_iter_no_change = 10

### START YOUR CODE ###
# Model instantiation and training
mlp = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,
                    batch_size=batch_size,
                    learning_rate='constant',
                    learning_rate_init=learning_rate,
                    max_iter=nepochs,
                    alpha=alpha,
                    early_stopping=early_stopping,
                    n_iter_no_change=n_iter_no_change,
                    solver='sgd')
mlp.fit(x_train, y_train)

# Plot loss curve
plt.plot(mlp.loss_curve_)
plt.title('Loss curve')
### END YOUR CODE ###

#### METRICS

In [None]:
### START YOUR CODE ###
# train and test error, accuracy
# per class accuracy, precision, f1 score
from sklearn.metrics import accuracy_score, classification_report

y_pred_train = mlp.predict(x_train)
y_pred_test = mlp.predict(x_test)

print('Train Set:')
print('Accuracy Score: %.2f' % accuracy_score(y_train, y_pred_train))
print(classification_report(y_train, y_pred_train))

print('\n')
print('Test Set:')
print('Accuracy Score: %.2f' % accuracy_score(y_test, y_pred_test))
print(classification_report(y_test, y_pred_test))
### END YOUR CODE ###
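
To answer which digits are hard to predict, it can help to look at the confusion matrix directly. The sketch below uses a handful of made-up labels (an assumption for illustration only, standing in for `y_test` and `y_pred_test`) and derives per-class precision and recall from the matrix by hand, so the relation to the numbers in `classification_report` is visible.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration; replace with y_test / y_pred_test.
y_true = np.array(['3', '3', '3', '5', '5', '5', '5', '8', '8', '8'])
y_pred = np.array(['3', '3', '5', '5', '5', '3', '8', '8', '8', '3'])
labels = ['3', '5', '8']

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / all predicted as that class
recall = true_positives / cm.sum(axis=1)     # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

for lab, p, r, f in zip(labels, precision, recall, f1):
    print('class %s: precision=%.3f recall=%.3f f1=%.3f' % (lab, p, r, f))
print('accuracy: %.3f' % accuracy)

# The largest off-diagonal entry points at a frequently confused pair of classes.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
i, j = np.unravel_index(np.argmax(off_diag), off_diag.shape)
print('a most frequent confusion: true %s predicted as %s' % (labels[i], labels[j]))
```

A class with low recall is one the model often misses, while a class with low precision is one that other digits are often mistaken for.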

## Best Model without Hidden Layer

First vary just the parameters 
* mini-batchsize
* learning rate
* epochs

without adding any hidden layer.

Summarize what the best combination of the above hyper-parameters is.

In [None]:
### START YOUR CODE ###
from sklearn.model_selection import GridSearchCV

# Keep hidden_layer_sizes = () 
# Vary the following

batch_size = 64
learning_rate = 0.1
nepochs = 100

mlp = MLPClassifier(hidden_layer_sizes=(),
                    learning_rate='constant')

parameters = {
    'batch_size': [32, 64, 128, 256, 512],
    'learning_rate_init': [0.0001, 0.001, 0.01, 0.1,],
    'max_iter': [25, 50, 100],
}

grid_clf = GridSearchCV(mlp, parameters, 
                        n_jobs=-1, cv=5, 
                        verbose=2, return_train_score=True)
grid_clf.fit(x_train, y_train)
### END YOUR CODE ###

In [None]:
print('Best parameters found:\n %s' % grid_clf.best_params_)

print('Scores:')
means = grid_clf.cv_results_['mean_test_score']
stds = grid_clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid_clf.cv_results_['params']):
    print("validation acc: %0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

In [None]:
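# The scores are accuracies, averaged here over all grid points;
# index the arrays with grid_clf.best_index_ to get the best configuration only.
print('Train accuracy: %.3f (+/-%0.03f)' % (grid_clf.cv_results_['mean_train_score'].mean(),
                                        grid_clf.cv_results_['std_train_score'].mean()))
print('Validation accuracy: %.3f (+/-%0.03f)' % (grid_clf.cv_results_['mean_test_score'].mean(),
                                        grid_clf.cv_results_['std_test_score'].mean()))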

__BEST MODEL__ (no hidden layer)

batch_size = 128

learning_rate = 0.01

nepochs = 50

train / validation accuracy : 0.91/0.89

## Adding one Hidden layer

Explore the performance of the model by varying the parameters 
* mini-batchsize
* learning rate
* epochs
* complexity (number of units in the one hidden layer)

For a given complexity, summarize what the best combination of the other hyper-parameters is, and compute this for several complexities.

Also compute the "best" train and validation error (or accuracy) as a function of the complexity and plot the curve (for a selected number of units, e.g. 10 different values). 
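
One possible way to obtain the "best" error per complexity is to group the grid search results by the number of hidden units and keep, for each group, the configuration with the highest mean validation score. The sketch below illustrates this on the small `load_digits` dataset; the dataset, the grid values, `cv=3` and the 30 epochs are all assumptions chosen to keep the example quick, not recommended settings for MNIST.

```python
import warnings

import numpy as np
from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Small stand-in dataset, not MNIST itself.
X, y = load_digits(return_X_y=True)
X = X / 16.0

grid = GridSearchCV(
    MLPClassifier(solver='sgd', learning_rate='constant', max_iter=30, random_state=0),
    {'hidden_layer_sizes': [(4,), (16,), (64,)],
     'learning_rate_init': [0.01, 0.1]},
    cv=3, return_train_score=True)

with warnings.catch_warnings():
    warnings.simplefilter('ignore', ConvergenceWarning)
    grid.fit(X, y)

res = grid.cv_results_
units = np.array([p['hidden_layer_sizes'][0] for p in res['params']])
complexities = np.unique(units)

best_train_err, best_val_err = [], []
for u in complexities:
    idx = np.flatnonzero(units == u)                    # all grid points with u units
    best = idx[np.argmax(res['mean_test_score'][idx])]  # best of them on validation
    best_train_err.append(1.0 - res['mean_train_score'][best])
    best_val_err.append(1.0 - res['mean_test_score'][best])
    print('%3d units: best params %r' % (u, res['params'][best]))

for u, e_tr, e_val in zip(complexities, best_train_err, best_val_err):
    print('%3d units: train error %.3f, validation error %.3f' % (u, e_tr, e_val))
```

The two error lists can then be plotted against `complexities`, for example with `plt.plot(complexities, best_val_err, '-o')`.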


In [None]:
### START YOUR CODE ###

# hidden_layer_sizes is now varied through the grid below
# Vary the following

hidden_layer_sizes = (100,) # just one layer 
batch_size = 64
learning_rate = 0.1
nepochs = 100

mlp = MLPClassifier(hidden_layer_sizes=(),
                    learning_rate='constant')

parameters = {
    'batch_size': [128, 256],
    'learning_rate_init': [0.01],
    'max_iter': [50],
    'hidden_layer_sizes': [(int(x),) for x in np.linspace(10, 1000, 10)],
}

grid_clf = GridSearchCV(mlp, parameters, 
                        n_jobs=-1, cv=5, 
                        verbose=2, return_train_score=True)
grid_clf.fit(x_train, y_train)
### END YOUR CODE ###

In [None]:
print('Best parameters found:\n %s' % grid_clf.best_params_)

print('Scores:')
means = grid_clf.cv_results_['mean_test_score']
stds = grid_clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid_clf.cv_results_['params']):
    print("validation acc: %0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

In [None]:
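# The scores are accuracies, averaged here over all grid points;
# index the arrays with grid_clf.best_index_ to get the best configuration only.
print('Train accuracy: %.3f (+/-%0.03f)' % (grid_clf.cv_results_['mean_train_score'].mean(),
                                        grid_clf.cv_results_['std_train_score'].mean()))
print('Validation accuracy: %.3f (+/-%0.03f)' % (grid_clf.cv_results_['mean_test_score'].mean(),
                                        grid_clf.cv_results_['std_test_score'].mean()))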

__Error vs Complexity__:

Plot with the train and test error vs complexity (number of units in the hidden layer)

In [None]:
### START YOUR CODE ###

def plot_grid_search(cv_results, grid_param_1, grid_param_2, name_param_1, name_param_2):
    # Assumes that cv_results lists grid_param_2 as the outer and grid_param_1 as the inner loop
    # (GridSearchCV iterates the parameters in alphabetical order of their names, last name fastest).
    shape = (len(grid_param_2), len(grid_param_1))
    scores_mean = np.array(cv_results['mean_test_score']).reshape(shape)
    scores_mean_train = np.array(cv_results['mean_train_score']).reshape(shape)

    # Plot Grid search scores
    _, ax = plt.subplots(1,1)

    # Param1 is the X-axis, Param 2 is represented as a different curve (color line)
    for idx, val in enumerate(grid_param_2):
        ax.plot(grid_param_1, scores_mean[idx,:], '-o', label= name_param_2 + ': ' + str(val) + ' (validation)')
        ax.plot(grid_param_1, scores_mean_train[idx,:], '-.o', label= name_param_2 + ': ' + str(val) + ' (train)')

    ax.set_title("CV score vs. " + name_param_1)
    ax.set_xlabel(name_param_1)
    ax.set_ylabel('CV Average Score (accuracy)')
    ax.legend(loc="best")
    ax.grid(True)

# Take the axes from the grid that was actually searched, so that the reshape matches cv_results_.
hidden_units = [h[0] for h in parameters['hidden_layer_sizes']]
plot_grid_search(grid_clf.cv_results_, hidden_units, parameters['batch_size'], 'Number of hidden units', 'Batch size')

### END YOUR CODE ###

__BEST MODEL__ (one hidden layer)

hidden_layer_sizes = (200,)

batch_size = 128

learning_rate = 0.01

nepochs = 50

train / validation accuracy : 0.984 (+/-0.002) / 0.956 (+/-0.002)


## Impact of Regularisation

Explore the impact of using L2 regularisation (still with just one hidden layer), again by varying mini-batchsize, learning rate, epochs and complexity.

Can you reach a better performance of the best model (on the validation set)?
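
As a rough illustration of what the L2 penalty does, the sketch below trains the same small network with different values of `alpha` and compares the overall weight norm together with the train and validation accuracy. It is only a sketch under assumptions: `load_digits` stands in for MNIST, and the layer size, epochs, `momentum=0.0` and the `alpha` values (including the deliberately large value 1.0) are chosen for illustration.

```python
import warnings

import numpy as np
from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small stand-in dataset, not MNIST itself.
X, y = load_digits(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X / 16.0, y, test_size=0.25, random_state=0)

def fit_with_alpha(alpha):
    """Train one model and return (weight norm, train accuracy, validation accuracy)."""
    mlp = MLPClassifier(hidden_layer_sizes=(32,), solver='sgd', momentum=0.0,
                        batch_size=64, learning_rate='constant',
                        learning_rate_init=0.1, max_iter=30,
                        alpha=alpha, random_state=0)
    with warnings.catch_warnings():
        warnings.simplefilter('ignore', ConvergenceWarning)
        mlp.fit(X_tr, y_tr)
    weight_norm = np.sqrt(sum(np.sum(w ** 2) for w in mlp.coefs_))
    return weight_norm, mlp.score(X_tr, y_tr), mlp.score(X_val, y_val)

results = {alpha: fit_with_alpha(alpha) for alpha in (0.0, 0.01, 1.0)}
for alpha, (norm, acc_tr, acc_val) in results.items():
    print('alpha=%-5g weight norm=%.2f train acc=%.3f val acc=%.3f'
          % (alpha, norm, acc_tr, acc_val))
```

A larger `alpha` penalises large weights more strongly, which tends to narrow the gap between train and validation accuracy but can also reduce both if chosen too large.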

In [None]:
### START YOUR CODE ###

# Vary the following

# Basic Hyperparameters
hidden_layer_sizes = (100,)
batch_size = 64
learning_rate = 0.1
nepochs = 100

# Regularisation:
alpha = 0.0 # L2 regularisation constant

mlp = MLPClassifier(hidden_layer_sizes=(),
                    learning_rate='constant')

parameters = {
    'batch_size': [128, 256],
    'learning_rate_init': [0.01],
    'max_iter': [50],
    'hidden_layer_sizes': [(int(x),) for x in np.linspace(10, 1000, 10)],
    'alpha': [0.0, 0.1, 0.01],
}

grid_clf = GridSearchCV(mlp, parameters, 
                        n_jobs=-1, cv=5, 
                        verbose=2, return_train_score=True)
grid_clf.fit(x_train, y_train)

### END YOUR CODE ###

In [None]:
print('Best parameters found:\n %s' % grid_clf.best_params_)

print('Scores:')
means = grid_clf.cv_results_['mean_test_score']
stds = grid_clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid_clf.cv_results_['params']):
    print("validation acc: %0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

__Error vs Complexity__:

Plot with the train and test error vs complexity (number of units in the hidden layer)

In [None]:
### START YOUR CODE ###

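# The grid varies alpha, batch_size and hidden_layer_sizes; fix batch_size at its best value to get a 2D slice.
res = grid_clf.cv_results_
keep = np.array([p['batch_size'] == grid_clf.best_params_['batch_size'] for p in res['params']])
sub = {k: res[k][keep] for k in ('mean_test_score', 'mean_train_score')}
plot_grid_search(sub, [h[0] for h in parameters['hidden_layer_sizes']], parameters['alpha'], 'Number of hidden units', 'Alpha')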

### END YOUR CODE ###

__BEST MODEL__ (one hidden layer)

hidden_layer_sizes = (200,)

batch_size = 256

learning_rate = 0.01

nepochs = 50

alpha =  0.0 # L2 regularisation constant

train / validation accuracy : 0.968 (+/-0.004)

## Adding up to 3 Hidden Layers

Now consider using a model with more than one hidden layer (at most three).
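
When comparing architectures with different depths, the number of trainable parameters can serve as a rough, common measure of complexity. Using the parameter count for this purpose is an assumption of this sketch, and the helper `count_mlp_parameters` is hypothetical, not an sklearn function. It counts weights and biases of a fully connected network with 784 inputs and 10 outputs, as for MNIST.

```python
def count_mlp_parameters(n_inputs, hidden_layer_sizes, n_outputs):
    """Number of weights and biases of a fully connected MLP (hypothetical helper)."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # Each layer has (fan_in * fan_out) weights plus fan_out biases.
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))

for hidden in [(), (100,), (100, 100), (100, 100, 100)]:
    print(hidden, count_mlp_parameters(784, hidden, 10))
```

For these layer sizes most parameters sit in the first layer, so adding further layers of the same width increases the count only moderately.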


In [None]:
### START YOUR CODE ###

# Vary the following

# Basic Hyperparameters
hidden_layer_sizes = (100,0,0)
batch_size = 64
learning_rate = 0.1
nepochs = 100

# Regularisation:
alpha = 0.0 # L2 regularisation constant

mlp = MLPClassifier(hidden_layer_sizes=(),
                    learning_rate='constant')

hidden_layer_sizes =  [(int(x),) for x in np.linspace(10, 500, 5)] # single
hidden_layer_sizes += [(int(x),int(x)) for x in np.linspace(10, 500, 5)] # double
hidden_layer_sizes += [(int(x),int(x), int(x)) for x in np.linspace(10, 500, 5)] # triple

parameters = {
    'batch_size': [128, 256],
    'learning_rate_init': [0.01],
    'max_iter': [50],
    'hidden_layer_sizes': hidden_layer_sizes,
    'alpha': [0.0],
}

grid_clf = GridSearchCV(mlp, parameters, 
                        n_jobs=-1, cv=5, 
                        verbose=2, return_train_score=True)
grid_clf.fit(x_train, y_train)

### END YOUR CODE ###

In [None]:
print('Best parameters found:\n %s' % grid_clf.best_params_)

print('Scores:')
means = grid_clf.cv_results_['mean_test_score']
stds = grid_clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, grid_clf.cv_results_['params']):
    print("validation acc: %0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

__Error vs Complexity__:

Plot with the train and test error vs complexity (number of units in the hidden layer)

In [None]:
### START YOUR CODE ###

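# Use all searched architectures (as text labels) on the x-axis so that the reshape matches cv_results_.
layer_labels = [str(h) for h in hidden_layer_sizes]
plot_grid_search(grid_clf.cv_results_, layer_labels, parameters['batch_size'], 'Hidden layer sizes', 'Batch size')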

### END YOUR CODE ###

__BEST MODEL__ (1-3 hidden layers)

hidden_layer_sizes = (100, 100, 100)

batch_size = 256

learning_rate = 0.01

nepochs = 50

alpha =  0.0 # L2 regularisation constant

train / validation accuracy : 0.968 (+/-0.004)

## Test Performance of Best Model

Test accuracy: 

In [None]:
y_pred_train = grid_clf.best_estimator_.predict(x_train)
y_pred_test = grid_clf.best_estimator_.predict(x_test)

acc_train = accuracy_score(y_train, y_pred_train)
acc_test = accuracy_score(y_test, y_pred_test)

print('Train acc: %.3f' % acc_train)
print('Test acc: %.3f' % acc_test)