# MNIST with sklearn

The goal of this exercise is to
* explore some of the sklearn functionality for training an MLP classifier (see https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier), using cross validation
* learn how to compute the confusion matrix and its derived quantities and how to interpret them
* explore the test error as a function of the complexity (number of units, number of layers)
* explore the impact of L2 regularisation

__IMPORTANT REMARK__: Here we follow the sklearn convention of indexing the samples along the first axis, i.e. data arrays have shape (number of samples, number of features).

In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

from sklearn.metrics import classification_report
from sklearn.metrics import f1_score

## Load and prepare the data 

In [None]:
datadir = '/tmp/data'

In [None]:
def normalize(x_train, x_test):
    """
    Normalizes the pixel values of the images - mean and stdev are computed from the training set.
    
    Parameters:
    x_train -- Array of training samples of shape (m1,n) where m1,n are the number of samples and features, respectively.  
    x_test -- Array of test samples of shape (m2,n) where m2,n are the number of samples and features, respectively. 
    
    Returns:
    The arrays with the normalized train and test samples.  
    """
    # Convert to float arrays so that the statistics below are scalars
    # (also when the inputs are pandas DataFrames) and the inputs are not modified.
    x_train = np.asarray(x_train, dtype=float)
    x_test = np.asarray(x_test, dtype=float)
    # Global mean and stdev over all pixels of the training set
    mean = x_train.mean()
    std = x_train.std()
    return (x_train - mean) / std, (x_test - mean) / std

In [None]:
x,y = fetch_openml('mnist_784', data_home=datadir, return_X_y=True)
x_train0, x_test0, y_train, y_test = train_test_split(x, y, test_size=10000, random_state=1)
x_train, x_test = normalize(x_train0, x_test0)

In [None]:
x_train.shape

In [None]:
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=.1, random_state=1)

In [None]:
x_train.shape
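
The cells above use a single hold-out validation split. The goals also mention cross validation, so here is a minimal sketch of k-fold cross validation with `cross_val_score`. It is an illustration under assumptions, not part of the exercise solution: the small `load_digits` dataset bundled with sklearn stands in for MNIST, and the names `x_small`, `y_small` and `model` are hypothetical.

```python
import warnings

from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Short training runs may not converge; silence the warning for this sketch.
warnings.simplefilter("ignore", ConvergenceWarning)

# Assumption: the bundled 8x8 digits dataset is used as a small stand-in for MNIST.
x_small, y_small = load_digits(return_X_y=True)

# The scaler is part of the pipeline, so each fold is normalised
# with statistics computed on its own training part only.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=50, random_state=1),
)

# 5-fold (stratified) cross validation, scored with the macro-averaged F1.
scores = cross_val_score(model, x_small, y_small, cv=5, scoring="f1_macro")
print("macro F1 per fold:", scores)
print("mean / std:", scores.mean(), scores.std())
```

Compared with one hold-out split, the spread over folds gives a rough idea of how much a validation score can vary by chance, at the price of training the model once per fold.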

## Specify Model Family and learn how to compute the metrics

#### Model
Use the functionality of scikit-learn to configure an MLP and its training procedure with
* hidden layers: 0-2 layers with suitable number of units per layer
* mini-batch gradient descent with given batch_size (no advanced optimisers)
* constant learning rate (no learning rate schedules)
* number of epochs
* no regularisation such as L2 penalty or early stopping
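
One way to map the list above onto `MLPClassifier` arguments is sketched below. Note that `MLPClassifier` uses `solver='adam'` by default, so plain mini-batch gradient descent has to be requested explicitly with `solver='sgd'` and `momentum=0.0`. The dataset (`load_digits` as a small stand-in for MNIST), the layer size and the variable names are assumptions made for illustration.

```python
import warnings

from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# A fixed, small number of epochs usually triggers a ConvergenceWarning; ignore it here.
warnings.simplefilter("ignore", ConvergenceWarning)

# Assumption: bundled 8x8 digits dataset instead of MNIST, pixel values in [0, 16].
x_small, y_small = load_digits(return_X_y=True)
x_small = x_small / 16.0
x_tr, x_va, y_tr, y_va = train_test_split(
    x_small, y_small, test_size=0.2, random_state=1
)

sgd_cls = MLPClassifier(
    hidden_layer_sizes=(32,),   # one hidden layer; use () for no hidden layer
    solver="sgd",               # plain gradient descent instead of the default 'adam'
    momentum=0.0,               # no momentum (the default would be 0.9)
    learning_rate="constant",   # no learning rate schedule
    learning_rate_init=0.1,
    batch_size=64,
    alpha=0.0,                  # no L2 penalty
    early_stopping=False,
    max_iter=20,                # upper bound on the number of epochs
    random_state=1,
)
sgd_cls.fit(x_tr, y_tr)

print("epochs run:", sgd_cls.n_iter_)
print("validation accuracy:", sgd_cls.score(x_va, y_va))
```

`max_iter` is only an upper bound: training also stops when the loss has not improved by at least `tol` for `n_iter_no_change` consecutive epochs.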

#### Metrics
Compute the train and test error (or, equivalently, the accuracy) as well as the per-class precision, recall and F1-score.

__See__:
* https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier
* https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
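
The per-class quantities can all be read off the confusion matrix. The following sketch uses made-up toy labels (not MNIST results) to show how precision, recall, F1-score and accuracy follow from it, and checks the hand-computed values against `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels for three classes (hypothetical values, for illustration only).
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_hat = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])

# Rows correspond to the true class, columns to the predicted class.
cm = confusion_matrix(y_true, y_hat)
print(cm)

tp = np.diag(cm)                  # correctly classified samples per class
precision = tp / cm.sum(axis=0)   # column sums: samples predicted as each class
recall = tp / cm.sum(axis=1)      # row sums: samples that truly belong to each class
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()

print("precision:", precision)
print("recall:   ", recall)
print("f1:       ", f1)
print("accuracy: ", accuracy)

# The hand-computed values agree with sklearn's per-class metrics.
assert np.allclose(precision, precision_score(y_true, y_hat, average=None))
assert np.allclose(recall, recall_score(y_true, y_hat, average=None))
```

Large off-diagonal entries in a row show which other classes a given digit is most often confused with, which helps to answer which digits are hard to predict.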

## First Training Run

Run the training and plot the training loss with a first set of values:
* no hidden layers
* mini-batchsize: 64
* learning rate: 0.1
* 100 epochs

Compute the Metrics.
Which digits are hard to predict?  

#### MODEL

In [None]:
# Basic Hyperparameters
hidden_layer_sizes = ()
batch_size = 64
learning_rate = 0.1
nepochs = 100

# Regularisation:
alpha = 0.0 # L2 regularisation constant
early_stopping = False
n_iter_no_change = 10

### START YOUR CODE ###
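# Model instantiation and training
# Note: MLPClassifier defaults to solver='adam'; for plain mini-batch gradient
# descent as specified above, pass solver='sgd' and momentum=0.0.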

cls = MLPClassifier(
    hidden_layer_sizes=hidden_layer_sizes,
    alpha=alpha, batch_size=batch_size,
    learning_rate_init=learning_rate,
    early_stopping=early_stopping,
    n_iter_no_change=n_iter_no_change,
    max_iter=nepochs)

cls.fit(x_train, y_train)

In [None]:
# Plot loss curve (one value per epoch)
plt.plot(cls.loss_curve_)
plt.xlabel("epoch")
plt.ylabel("training loss")
plt.show()

### END YOUR CODE ###

#### METRICS

In [None]:
### START YOUR CODE ###

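# train and validation accuracy (error = 1 - accuracy)
print("train accuracy:", cls.score(x_train, y_train))
print("validation accuracy:", cls.score(x_val, y_val))

# per-class precision, recall and F1-score on the validation set
y_pred = cls.predict(x_val)
print(classification_report(y_val, y_pred))

# macro-averaged F1-score over all classes
print(f1_score(y_val, y_pred, average="macro"))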

### END YOUR CODE ###

## Best Model without Hidden Layer

First vary just the parameters 
* mini-batchsize
* learning rate
* epochs

without adding any hidden layer.

Summarize what the best combination of the above hyper-parameters is.

In [None]:
import itertools

def explore_basic_hyperparams(
        hidden_layer_size,
        early_stopping = False,
        n_iter_no_change = 10,
        alphas = [0.0], # L2 regularisation constant
        batch_sizes = [16, 32, 64, 128],
        learning_rates = [0.001, 0.01, 0.1, 1],
        nepochss = [100, 200]):
    f1s = []
    params = list(itertools.product(batch_sizes, learning_rates, nepochss, alphas))

    i = 0
    for batch_size, learning_rate, nepochs, alpha in params:
        print(batch_size, "; ", learning_rate, "; ", nepochs, "; ", alpha, "; ", f" ({i} / {len(params)})", end="\r")
        i += 1
        # Model instantiation and training

        cls = MLPClassifier(
            hidden_layer_sizes=hidden_layer_size,
            alpha=alpha, batch_size=batch_size,
            learning_rate_init=learning_rate,
            early_stopping=early_stopping,
            n_iter_no_change=n_iter_no_change,
            max_iter=nepochs)

        cls.fit(x_train, y_train)
        y_pred = cls.predict(x_val)
        f1 = f1_score(y_val, y_pred, average="macro")
        f1s.append(f1)

    max_idx = np.argmax(f1s)
    print(f"best params are {params[max_idx]} with f1 of {f1s[max_idx]}")
    return params[max_idx], f1s[max_idx], f1s
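
As an alternative to building the grid with `itertools.product`, sklearn's `ParameterGrid` enumerates the same combinations as dictionaries keyed by argument name. The sketch below only builds the grid and trains nothing; the variable names `grid` and `first` are hypothetical.

```python
from sklearn.model_selection import ParameterGrid

# Keys are chosen to match the MLPClassifier argument names,
# so each combination can be passed on with **params.
grid = ParameterGrid({
    "batch_size": [16, 32, 64, 128],
    "learning_rate_init": [0.001, 0.01, 0.1, 1],
    "max_iter": [100, 200],
    "alpha": [0.0],
})

print(len(grid))      # number of combinations: 4 * 4 * 2 * 1
first = list(grid)[0]
print(first)          # a dict with one value per hyper-parameter
```

Each `params` dictionary from the grid could then be used as `MLPClassifier(hidden_layer_sizes=(), **params)`, which avoids keeping the order of tuple entries in sync with the loop variables.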

In [None]:
### START YOUR CODE ###

# Keep hidden_layer_sizes = () 
# Vary the following
params, f1, f1s = explore_basic_hyperparams(hidden_layer_size=())
### END YOUR CODE ###

In [None]:
cls = MLPClassifier(
    hidden_layer_sizes=(),
    alpha=0.0, batch_size=64,
    learning_rate_init=0.001,
    early_stopping=False,
    n_iter_no_change=10,
    max_iter=100)

cls.fit(x_train, y_train)

y_pred = cls.predict(x_train)
print(classification_report(y_train, y_pred))

y_pred = cls.predict(x_val)
print(classification_report(y_val, y_pred))
print(f1_score(y_val, y_pred, average="macro"))

__BEST MODEL__ (no hidden layer)

batch_size = 64

learning_rate = 0.001

nepochs = 100

train / validation error : 6% / 8%

## Adding one Hidden layer

Explore the performance of the model by varying the parameters 
* mini-batchsize
* learning rate
* epochs
* complexity (number of units in the one hidden layer)

For a given complexity, summarize what the best combination of the other hyper-parameters is. Do this for several complexities.

Also compute the "best" train and validation error (or accuracy) for each complexity and plot both as a function of the complexity (for a selection of unit counts, e.g. 10 different values). 
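
A minimal sketch of how train and validation error can be collected per complexity is given below. It rests on assumptions: the small `load_digits` dataset replaces MNIST to keep the run short, the hyper-parameters are fixed rather than tuned per complexity, and the names `units`, `train_err` and `val_err` are hypothetical.

```python
import warnings

from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Short runs may stop before convergence; ignore the warning in this sketch.
warnings.simplefilter("ignore", ConvergenceWarning)

# Assumption: bundled 8x8 digits dataset as a stand-in for MNIST.
x_small, y_small = load_digits(return_X_y=True)
x_small = x_small / 16.0
x_tr, x_va, y_tr, y_va = train_test_split(
    x_small, y_small, test_size=0.2, random_state=1
)

units = [2, 8, 32, 128]          # complexities to compare
train_err, val_err = [], []
for n_units in units:
    clf = MLPClassifier(
        hidden_layer_sizes=(n_units,),
        batch_size=64,
        learning_rate_init=0.001,
        max_iter=50,
        random_state=1,
    )
    clf.fit(x_tr, y_tr)
    # score() returns the accuracy, so the error is 1 - accuracy.
    train_err.append(1.0 - clf.score(x_tr, y_tr))
    val_err.append(1.0 - clf.score(x_va, y_va))

for n_units, e_tr, e_va in zip(units, train_err, val_err):
    print(f"{n_units:4d} units: train error {e_tr:.3f}, validation error {e_va:.3f}")
```

Plotting `train_err` and `val_err` against `units` in one figure (two calls to `plt.plot` with labels) makes the gap between the two curves visible, which is the usual indicator of overfitting.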


In [None]:
### START YOUR CODE ###

# Use a single hidden layer and vary its number of units

hidden_layer_sizes = [(int(x),) for x in np.linspace(10, 1000, 10)] # just one layer 

f1s = []

i = 0
for hidden_layer_size in hidden_layer_sizes:
    print(f"working on {hidden_layer_size}, ({i}/{len(hidden_layer_sizes)})")
    i += 1
    params, f1, _ = explore_basic_hyperparams(
        hidden_layer_size=hidden_layer_size,
        batch_sizes = [64],
        learning_rates = [0.001],
        nepochss = [100]
    )
    
    f1s.append(f1)

### END YOUR CODE ###

__Error vs Complexity__:

Plot the train and test error vs. complexity (number of units in the hidden layer).

In [None]:
### START YOUR CODE ###


plt.plot(np.linspace(10, 1000, 10), f1s)
plt.xlabel("units in hidden layer")
plt.ylabel("validation macro F1")


### END YOUR CODE ###

In [None]:
cls = MLPClassifier(
    hidden_layer_sizes=(780,),
    alpha=0.0, batch_size=64,
    learning_rate_init=0.001,
    early_stopping=False,
    n_iter_no_change=10,
    max_iter=100)

cls.fit(x_train, y_train)

y_pred = cls.predict(x_train)
print(classification_report(y_train, y_pred))

y_pred = cls.predict(x_val)
print(classification_report(y_val, y_pred))

print(f1_score(y_val, y_pred, average="macro"))

__BEST MODEL__ (one hidden layer)

hidden_layer_sizes = (780,)

batch_size = 64

learning_rate = 0.001

nepochs = 100

train / validation error : 0% / 2%


## Impact of Regularisation

Explore the Impact of Using L2 Regularisation (still adding just one hidden layer) again by varying mini-batchsize, learning rate, epochs, complexity.

Can you reach a better performance for the best model (on the validation set)?
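
To see what the L2 penalty does to the model itself, the sketch below compares the summed squared weights of two otherwise identical networks, one trained with `alpha=0.0` and one with a deliberately large `alpha=10.0`. The dataset (`load_digits` instead of MNIST), the value of `alpha` and the helper `squared_weight_norm` are assumptions chosen for illustration.

```python
import warnings

import numpy as np
from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Short runs may stop before convergence; ignore the warning in this sketch.
warnings.simplefilter("ignore", ConvergenceWarning)

# Assumption: bundled 8x8 digits dataset as a stand-in for MNIST.
x_small, y_small = load_digits(return_X_y=True)
x_small = x_small / 16.0


def squared_weight_norm(alpha):
    """Train a small MLP with the given L2 constant and return the sum of squared weights."""
    clf = MLPClassifier(
        hidden_layer_sizes=(32,),
        alpha=alpha,
        batch_size=64,
        learning_rate_init=0.001,
        max_iter=50,
        random_state=1,   # same initial weights for both runs
    )
    clf.fit(x_small, y_small)
    # coefs_ holds one weight matrix per layer (biases are stored in intercepts_).
    return sum(float(np.sum(w ** 2)) for w in clf.coefs_)


norm_plain = squared_weight_norm(0.0)
norm_l2 = squared_weight_norm(10.0)
print("sum of squared weights without L2:", norm_plain)
print("sum of squared weights with L2:   ", norm_l2)
```

The penalty pulls the weights towards zero, which limits how closely the network can fit the training data. Whether this improves the validation score depends on how strongly the unregularised model overfits.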

In [None]:
### START YOUR CODE ###

# Vary the following

hidden_layer_sizes = [(int(x),) for x in np.linspace(10, 1000, 10)] # just one layer 

f1s = []

i = 0
for hidden_layer_size in hidden_layer_sizes:
    print(f"working on {hidden_layer_size}, ({i}/{len(hidden_layer_sizes)})")
    i += 1
    params, f1, _ = explore_basic_hyperparams(
        hidden_layer_size=hidden_layer_size,
        alphas=[0.1],
        batch_sizes = [64],
        learning_rates = [0.001],
        nepochss = [100]
    )
    
    f1s.append(f1)

### END YOUR CODE ###

__Error vs Complexity__:

Plot the train and test error vs. complexity (number of units in the hidden layer).

In [None]:
### START YOUR CODE ###


plt.plot(np.linspace(10, 1000, 10), f1s)
plt.xlabel("units in hidden layer")
plt.ylabel("validation macro F1")


### END YOUR CODE ###

In [None]:
cls = MLPClassifier(
    hidden_layer_sizes=(780,),
    alpha=0.1, batch_size=64,
    learning_rate_init=0.001,
    early_stopping=False,
    n_iter_no_change=10,
    max_iter=100)

cls.fit(x_train, y_train)

y_pred = cls.predict(x_train)
print(classification_report(y_train, y_pred))

y_pred = cls.predict(x_val)
print(classification_report(y_val, y_pred))

print(f1_score(y_val, y_pred, average="macro"))

__BEST MODEL__ (one hidden layer, with L2 regularisation)

hidden_layer_sizes = (780,)

batch_size = 64

learning_rate = 0.001 

nepochs = 100

alpha =  .1 # L2 regularisation constant

train / validation error : 2% / 3%

## Adding up to 3 Hidden Layers

Now consider using a model with more than one hidden layer (at max 3).


In [None]:
### START YOUR CODE ###

# Vary the following

hidden_layer_sizes =  [(int(x),) for x in np.linspace(10, 500, 5)] # single
hidden_layer_sizes += [(int(x),int(x)) for x in np.linspace(10, 500, 5)] # double
hidden_layer_sizes += [(int(x),int(x), int(x)) for x in np.linspace(10, 500, 5)] # triple

f1s = []

i = 0
for hidden_layer_size in hidden_layer_sizes:
    print(f"working on {hidden_layer_size}, ({i}/{len(hidden_layer_sizes)})")
    i += 1
    params, f1, _ = explore_basic_hyperparams(
        hidden_layer_size=hidden_layer_size,
        alphas=[0, 0.1],
        batch_sizes = [64],
        learning_rates = [0.001],
        nepochss = [100]
    )
    
    f1s.append(f1)

### END YOUR CODE ###

__Error vs Complexity__:

Plot the train and test error vs. complexity (number of units per hidden layer).

In [None]:
### START YOUR CODE ###


# x-axis: units per hidden layer (all layers of a model have the same width)
units = [s[0] for s in hidden_layer_sizes]

# hidden_layer_sizes holds 5 single-, 5 two- and 5 three-layer models, in that order
groups = [("single layer", slice(0, 5)),
          ("two layers", slice(5, 10)),
          ("three layers", slice(10, None))]

for title, sl in groups:
    plt.title(title)
    plt.plot(units[sl], f1s[sl])
    plt.xlabel("units per hidden layer")
    plt.ylabel("f1")
    plt.show()

### END YOUR CODE ###

In [None]:
cls = MLPClassifier(
    hidden_layer_sizes=(377, 377, 377),
    alpha=0.0, batch_size=64,
    learning_rate_init=0.001,
    early_stopping=False,
    n_iter_no_change=10,
    max_iter=100)

cls.fit(x_train, y_train)


y_pred = cls.predict(x_train)
print(classification_report(y_train, y_pred))

y_pred = cls.predict(x_val)
print(classification_report(y_val, y_pred))

print(f1_score(y_val, y_pred, average="macro"))

__BEST MODEL__ (1-3 hidden layers)

hidden_layer_sizes = (377, 377, 377)

batch_size = 64

learning_rate = 0.001

nepochs = 100

alpha =  0.0 # L2 regularisation constant

train / validation error : 0% / 2%

## Test Performance of Best Model

Test Error: 2% 

In [None]:
cls = MLPClassifier(
    hidden_layer_sizes=(780,),
    alpha=0.0, batch_size=64,
    learning_rate_init=0.001,
    early_stopping=False,
    n_iter_no_change=10,
    max_iter=100)

cls.fit(x_train, y_train)

y_pred = cls.predict(x_test)

print(classification_report(y_test, y_pred))
print(f1_score(y_test, y_pred, average="macro"))