# Miniproject 1: Image Classification

## Introduction

### Description

One of the deepest traditions in learning about deep learning is to first [tackle the exciting problem of MNIST classification](http://deeplearning.net/tutorial/logreg.html). [The MNIST database](https://en.wikipedia.org/wiki/MNIST_database) (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that was [recently extended](https://arxiv.org/abs/1702.05373). We break with this tradition (just a little bit) and first tackle the related problem of classifying cropped, downsampled and grayscaled images of house numbers in [The Street View House Numbers (SVHN) Dataset](http://ufldl.stanford.edu/housenumbers/).


### Prerequisites

- You should have a running installation of [tensorflow](https://www.tensorflow.org/install/) and [keras](https://keras.io/).
- You should know the concepts "multilayer perceptron", "stochastic gradient descent with minibatches", "training and validation data", "overfitting" and "early stopping".

### What you will learn

- You will learn how to define feedforward neural networks in keras and fit them to data.
- You will be guided through a prototyping procedure for the application of deep learning to a specific domain.
- You will encounter concepts discussed later in the lecture, like "regularization", "batch normalization" and "convolutional networks".
- You will gain some experience with the influence of network architecture, optimizer and regularization choices on the goodness of fit.
- You will learn to be more patient :) Some fits may take your computer quite a bit of time; run them overnight.

### Evaluation criteria

The evaluation is (mostly) based on the figures you submit and your answer sentences. 
We will only do random tests of your code and not re-run the full notebook.

### Your names

Before you start, please enter your full name(s) in the field below; they are used to load the data. The variable student2 may remain empty, if you work alone.

In [None]:
student1 = "Amaury Combes"
student2 = "Vincenzo Bazzucchi"

## Some helper functions

For your convenience we provide here some functions to preprocess the data and plot the results later. Simply run the following cells with `Shift-Enter`.

### Dependencies and constants

In [None]:
import numpy as np
import time
import matplotlib.pyplot as plt
import scipy.io

import keras
from keras.models import Sequential
from keras.layers import Dense, Conv2D, MaxPooling2D, Dropout, Flatten
from keras.optimizers import SGD, Adam

import itertools

# you may experiment with different subsets, 
# but make sure in the submission 
# it is generated with the correct random seed for all exercises.
np.random.seed(hash(student1 + student2) % 2**32)
subset_of_classes = np.random.choice(range(10), 5, replace = False)

### Plotting

In [None]:
from pylab import rcParams
rcParams['figure.figsize'] = 10, 6
def plot_some_samples(x, y = [], yhat = [], select_from = [], 
                      ncols = 6, nrows = 4, xdim = 16, ydim = 16,
                      label_mapping = range(10)):
    """plot some input vectors as grayscale images (optionally together with their assigned or predicted labels).
    
    x is an NxD - dimensional array, where D is the length of an input vector and N is the number of samples.
    Out of the N samples, ncols x nrows indices are randomly selected from the list select_from (if it is empty, select_from becomes range(N)).
    
    Keyword arguments:
    y             -- corresponding labels to plot in green below each image.
    yhat          -- corresponding predicted labels to plot in red below each image.
    select_from   -- list of indices from which to select the images.
    ncols, nrows  -- number of columns and rows to plot.
    xdim, ydim    -- number of pixels of the images in x- and y-direction.
    label_mapping -- map labels to digits.
    
    """
    fig, ax = plt.subplots(nrows, ncols)
    if len(select_from) == 0:
        select_from = range(x.shape[0])
    indices = np.random.choice(select_from, size = min(ncols * nrows, len(select_from)), replace = False)
    for i, ind in enumerate(indices):
        thisax = ax[i//ncols,i%ncols]
        thisax.matshow(x[ind].reshape(xdim, ydim), cmap='gray')
        thisax.set_axis_off()
        if len(y) != 0:
            j = y[ind] if type(y[ind]) != np.ndarray else y[ind].argmax()
            thisax.text(0, 0, (label_mapping[j]+1)%10, color='green', 
                                                       verticalalignment='top',
                                                       transform=thisax.transAxes)
        if len(yhat) != 0:
            k = yhat[ind] if type(yhat[ind]) != np.ndarray else yhat[ind].argmax()
            thisax.text(1, 0, (label_mapping[k]+1)%10, color='red',
                                             verticalalignment='top',
                                             horizontalalignment='right',
                                             transform=thisax.transAxes)
    return fig

def prepare_standardplot(title, xlabel):
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.suptitle(title)
    ax1.set_ylabel('categorical cross entropy')
    ax1.set_xlabel(xlabel)
    ax1.set_yscale('log')
    ax2.set_ylabel('accuracy [% correct]')
    ax2.set_xlabel(xlabel)
    return fig, ax1, ax2

def finalize_standardplot(fig, ax1, ax2):
    ax1handles, ax1labels = ax1.get_legend_handles_labels()
    if len(ax1labels) > 0:
        ax1.legend(ax1handles, ax1labels)
    ax2handles, ax2labels = ax2.get_legend_handles_labels()
    if len(ax2labels) > 0:
        ax2.legend(ax2handles, ax2labels)
    fig.tight_layout()
    plt.subplots_adjust(top=0.9)

def plot_history(history, title):
    fig, ax1, ax2 = prepare_standardplot(title, 'epoch')
    ax1.plot(history.history['loss'], label = "training")
    ax1.plot(history.history['val_loss'], label = "validation")
    ax2.plot(history.history['acc'], label = "training")
    ax2.plot(history.history['val_acc'], label = "validation")
    finalize_standardplot(fig, ax1, ax2)
    return fig


### Loading and preprocessing the data

The data consists of RGB color images with 32x32 pixels, loaded into an array of dimension 32x32x3x(number of images). We convert them to grayscale (using [this method](https://en.wikipedia.org/wiki/SRGB#The_reverse_transformation)) and we downsample them to images of 16x16 pixels by averaging over patches of 2x2 pixels.

With these preprocessing steps we obviously remove some information that could be helpful in classifying the images. But since the processed data is much lower dimensional, the fitting procedures converge faster. This is an advantage in situations like this one (or generally when prototyping), where we want to try many different things without having to wait too long for computations to finish. After having gained some experience, one may want to go back to work on the 32x32 RGB images.


In [None]:
# convert RGB images x to grayscale using the formula for Y_linear in https://en.wikipedia.org/wiki/Grayscale#Colorimetric_(perceptual_luminance-preserving)_conversion_to_grayscale
def grayscale(x):
    x = x.astype('float32')/255
    x = np.piecewise(x, [x <= 0.04045, x > 0.04045], 
                        [lambda x: x/12.92, lambda x: ((x + .055)/1.055)**2.4])
    # linear-luminance weights for R, G and B (they sum to 1)
    return .2126 * x[:,:,0,:] + .7152 * x[:,:,1,:]  + .0722 * x[:,:,2,:]

def downsample(x):
    return sum([x[i::2,j::2,:] for i in range(2) for j in range(2)])/4

def preprocess(data):
    gray = grayscale(data['X'])
    downsampled = downsample(gray)
    return (downsampled.reshape(16*16, gray.shape[2]).transpose(),
            data['y'].flatten() - 1)


data_train = scipy.io.loadmat('housenumbers/train_32x32.mat')
data_test = scipy.io.loadmat('housenumbers/test_32x32.mat')

x_train_all, y_train_all = preprocess(data_train)
x_test_all, y_test_all = preprocess(data_test)
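
To make the 2x2 averaging used for downsampling concrete, here is a small sketch on a toy 4x4 "image". It repeats the slicing idea of `downsample` under the hypothetical name `downsample_2x2`; the toy array is made up for illustration and is not part of the dataset.

```python
import numpy as np

def downsample_2x2(x):
    """Average non-overlapping 2x2 patches of an array of shape (H, W, N)."""
    # x[i::2, j::2, :] picks one corner of every 2x2 patch; summing the four
    # corners and dividing by 4 gives the patch mean.
    return sum(x[i::2, j::2, :] for i in range(2) for j in range(2)) / 4

# One toy image (N = 1) with pixel values 0..15, stored as (H, W, N).
toy = np.arange(16, dtype="float32").reshape(4, 4, 1)
small = downsample_2x2(toy)

print(toy[:, :, 0])
print(small[:, :, 0])  # each entry is the mean of one 2x2 patch
```

The top-left entry of the result is the mean of the pixels 0, 1, 4 and 5, i.e. 2.5; the same operation turns each 32x32 image into a 16x16 one.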

### Selecting a subset of classes

We further reduce the size of the dataset (and thus the computation time) by selecting only the 5 of the 10 digits listed in subset_of_classes.

In [None]:
def extract_classes(x, y, classes):
    indices = []
    labels = []
    count = 0
    for c in classes:
        tmp = np.where(y == c)[0]
        indices.extend(tmp)
        labels.extend(np.ones(len(tmp), dtype='uint8') * count)
        count += 1
    return x[indices], labels

x_train, y_train = extract_classes(x_train_all, y_train_all, subset_of_classes)
x_test, y_test = extract_classes(x_test_all, y_test_all, subset_of_classes)

Let us plot some examples now. The green digit at the bottom left of each image indicates the corresponding label in y_test.
For further usage of the function plot_some_samples, please have a look at its definition in the plotting section.

In [None]:
x_test.shape

In [None]:
plot_some_samples(x_test, y_test, label_mapping = subset_of_classes);

To prepare for fitting we transform the labels to one hot coding, i.e. for 5 classes, label 2 becomes the vector [0, 0, 1, 0, 0] (python uses 0-indexing).

In [None]:
y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)
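
For illustration, the one hot coding described above can be reproduced with plain NumPy. This is only a sketch of what the conversion does; `one_hot` is a hypothetical helper, not a keras function, and the three labels are made up.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return an (N, num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    # Row n gets a 1 in the column given by labels[n].
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

codes = one_hot([2, 0, 4], num_classes=5)
print(codes[0])               # the vector for label 2
print(codes.argmax(axis=1))   # argmax recovers the integer labels
```

Taking the `argmax` of each row inverts the coding, which is also how `plot_some_samples` handles one hot labels.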

## Exercise 1: No hidden layer

### Description

Define and fit a model without a hidden layer. 

1. Use the softmax activation for the output layer.
2. Use the categorical_crossentropy loss.
3. Add the accuracy metric to the metrics.
4. Choose stochastic gradient descent for the optimizer.
5. Choose a minibatch size of 128.
6. Fit for as many epochs as needed to see no further decrease in the validation loss.
7. Plot the output of the fitting procedure (a history object) using the function plot_history defined above.
8. Determine the indices of all test images that are misclassified by the fitted model and plot some of them using the function 
   `plot_some_samples(x_test, y_test, yhat_test, error_indices, label_mapping = subset_of_classes)`


Hints:
* Read the keras docs, in particular [Getting started with the Keras Sequential model](https://keras.io/getting-started/sequential-model-guide/).
* Have a look at the keras [examples](https://github.com/keras-team/keras/tree/master/examples), e.g. [mnist_mlp](https://github.com/keras-team/keras/blob/master/examples/mnist_mlp.py).

### Solution

Here we use the `EarlyStopping` callback to satisfy point 6: it stops the fit once the validation loss has not decreased by at least 0.001 for 5 consecutive epochs. As discussed in the forum, the course means a different behavior by "early stopping"; we try to provide a simple implementation of that regularization technique in Exercise 3.

In [None]:
# ALL HYPERPARAMETERS NEED TUNING
model = Sequential([
    Dense(y_train.shape[1], input_shape=(x_train.shape[1],), activation="softmax")
])

model.compile(
    loss=keras.losses.categorical_crossentropy,
    optimizer=SGD(lr=0.1), #find params
    metrics=['accuracy']
)

history = model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    epochs=9999999999, batch_size=128,
    callbacks=[keras.callbacks.EarlyStopping('val_loss', patience=5, min_delta=0.001)]
)

In [None]:
for metric, value in zip(model.metrics_names, model.evaluate(x_test, y_test)):
    print(metric, '=', value)

In [None]:
_ = plot_history(history, "No Hidden Layer")  # assign the returned figure; otherwise the notebook displays it twice
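
Point 8 of the exercise asks for the indices of the misclassified test images. A minimal sketch of that step on made-up toy arrays is shown below; `misclassified_indices` is a hypothetical helper, and the probabilities stand in for what a fitted model would predict.

```python
import numpy as np

def misclassified_indices(y_true_onehot, y_prob):
    """Return (indices of wrong predictions, predicted class per sample)."""
    y_true = np.asarray(y_true_onehot).argmax(axis=1)
    y_pred = np.asarray(y_prob).argmax(axis=1)
    return np.where(y_true != y_pred)[0], y_pred

# Toy example: 4 samples, 3 classes (values invented for illustration).
y_true_toy = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_prob_toy = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6],
                       [0.2, 0.2, 0.6],
                       [0.5, 0.4, 0.1]])

error_indices, y_pred_toy = misclassified_indices(y_true_toy, y_prob_toy)
print(error_indices)
```

In the notebook one would presumably use `yhat_test = model.predict(x_test)` in place of the toy probabilities and pass `error_indices` as the `select_from` argument of `plot_some_samples`.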

## Exercise 2: One hidden layer, different optimizers
### Description

Train a network with one hidden layer and compare different optimizers.

1. Use one hidden layer with 64 units and the 'relu' activation. Use the [summary method](https://keras.io/models/about-keras-models/) to inspect your model.
2. Fit the model for 50 epochs with different learning rates of stochastic gradient descent and answer the question below.
3. Replace the stochastic gradient descent optimizer with the [Adam optimizer](https://keras.io/optimizers/#adam).
4. Plot the learning curves of SGD with a reasonable learning rate together with the learning curves of Adam in the same figure. Take care of a reasonable labeling of the curves in the plot.

### Solution

#### Question 1

In [None]:
model = Sequential([
    Dense(64, input_shape=(x_train.shape[1],), activation="relu"),
    Dense(y_train.shape[1], activation="softmax")
])

model.compile(
    loss=keras.losses.categorical_crossentropy,
    optimizer=SGD(lr=0.01),
    metrics=['accuracy']
)

historySGD = model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    epochs=50,
    verbose=0
)

In [None]:
model.summary()

In [None]:
_ = plot_history(historySGD, "SGD")  # assign the returned figure; otherwise the notebook displays it twice

#### Question 2

In [None]:
LARGE_RATE = 0.9
SMALL_RATE = 10**(-6)

In [None]:
sgd_test_rate = Sequential([
    Dense(64, input_shape=(x_train.shape[1],), activation="relu"),
    Dense(y_train.shape[1], activation="softmax")
])

In [None]:
sgd_test_rate.compile(
    loss=keras.losses.categorical_crossentropy,
    optimizer=SGD(lr=LARGE_RATE),
    metrics=['accuracy']
)

_ = plot_history(sgd_test_rate.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    epochs=50,
    verbose=0
), "SGD with very large learning rate")

In [None]:
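# Start from a freshly initialized network: re-compiling alone would keep
# the weights already trained with the large learning rate.
sgd_test_rate = Sequential([
    Dense(64, input_shape=(x_train.shape[1],), activation="relu"),
    Dense(y_train.shape[1], activation="softmax")
])

sgd_test_rate.compile(
    loss=keras.losses.categorical_crossentropy,
    optimizer=SGD(lr=SMALL_RATE),
    metrics=['accuracy']
)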

_ = plot_history(sgd_test_rate.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    epochs=50,
    verbose=0
), "SGD with very small learning rate")

**Question**: What happens if the learning rate of SGD is A) very large B) very small? Please answer A) and B) with one full sentence (double click this markdown cell to edit).

**Answer**:

A) With a very large learning rate the validation loss and accuracy are very unstable, because many updates overshoot and move the weights in the wrong direction.

B) With a very small learning rate the improvement is steady but very slow: we would need about three times as many epochs to reach the best result obtained with the larger learning rate.

#### Question 3

In [None]:
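# Re-create the network so that Adam starts from fresh weights instead of
# continuing from the weights already trained with SGD.
model = Sequential([
    Dense(64, input_shape=(x_train.shape[1],), activation="relu"),
    Dense(y_train.shape[1], activation="softmax")
])

model.compile(
    loss=keras.losses.categorical_crossentropy,
    optimizer=Adam(lr=0.001),
    metrics=['accuracy']
)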

historyAdam = model.fit(
    x_train, y_train,
    validation_data=(x_test, y_test),
    epochs=50,
    verbose=0
)

In [None]:
_ = plot_history(historyAdam, "Adam")  # assign the returned figure; otherwise the notebook displays it twice

With the learning rate we used for SGD (0.01), Adam did not improve its accuracy, so we lowered it to 0.001 here.

#### Question 4

In [None]:
fig, ax1, ax2 = prepare_standardplot("Comparing SGD and ADAM", 'epoch')
ax1.plot(historySGD.history['loss'], label = "training with SGD", linestyle='--', c='b')
ax1.plot(historySGD.history['val_loss'], label = "validation with SGD", linestyle='-', c='b')
ax2.plot(historySGD.history['acc'], label = "training with SGD", linestyle='--', c='b')
ax2.plot(historySGD.history['val_acc'], label = "validation with SGD", linestyle='-', c='b')

ax1.plot(historyAdam.history['loss'], label = "training with Adam", linestyle='--', c='orange')
ax1.plot(historyAdam.history['val_loss'], label = "validation with Adam", linestyle='-', c='orange')
ax2.plot(historyAdam.history['acc'], label = "training with Adam", linestyle='--', c='orange')
ax2.plot(historyAdam.history['val_acc'], label = "validation with Adam", linestyle='-', c='orange')
finalize_standardplot(fig, ax1, ax2)

## Exercise 3: Overfitting and early stopping with Adam

### Description

Run the above simulation with Adam for sufficiently many epochs (be patient!) until you see clear overfitting.

1. Plot the learning curves of a fit with Adam and sufficiently many epochs and answer the questions below.

A simple but effective means to avoid overfitting is early stopping, i.e. a fit is not run until convergence but stopped as soon as the validation error starts to increase. We will use early stopping in all subsequent exercises.

### Solution

In [None]:
_ = plot_history(historyAdam, "Adam")  # assign the returned figure; otherwise the notebook displays it twice

The training we ran before was already overfitting: the training error keeps decreasing while the validation error stays roughly constant. The accuracy shows the same pattern: the training accuracy keeps increasing while the validation accuracy is mostly stable.

**Question 1**: At which epoch (approximately) does the model start to overfit? Please answer with one full sentence.

**Answer**: The model starts to overfit almost right away, but from about epoch 15 on we clearly see that the validation error stays stable or increases while the training error keeps decreasing.

**Question 2**: Explain the qualitative difference between the loss curves and the accuracy curves with respect to signs of overfitting. Please answer with at most 3 full sentences.

**Answer**: # TODO

As discussed in the forum, we should not use `keras.callbacks.EarlyStopping`. We therefore implemented early stopping as described in the Deep Learning book: each time the validation accuracy improves, we take a snapshot of the model weights, and at the end of training we restore the best snapshot.

In [None]:
class RealEarlyStopper(keras.callbacks.Callback):
    """Keep a snapshot of the weights from the epoch with the best validation accuracy."""
    def __init__(self, set_best_at_end=True):
        super().__init__()
        self._best_score = -1
        self._best_weights = None
        self._set_best_at_end = set_best_at_end

    def on_epoch_end(self, epoch, logs=None):
        valacc = logs['val_acc']
        if valacc > self._best_score:
            self._best_score = valacc
            # get_weights() returns copies of the layer's weight arrays
            self._best_weights = [layer.get_weights() for layer in self.model.layers]

    def on_train_end(self, logs=None):
        if not self._set_best_at_end or self._best_weights is None:
            return
        # restore the best snapshot
        for layer, best_weights in zip(self.model.layers, self._best_weights):
            layer.set_weights(best_weights)
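
The selection rule behind this kind of early stopping (keep the epoch with the best validation score, ties going to the earliest epoch) can be checked without keras. The sketch below uses a hypothetical helper `best_epoch` and invented score lists.

```python
def best_epoch(val_scores, higher_is_better=True):
    """Return (epoch index, score) of the best validation score.

    A later epoch only replaces the current best if it is strictly better,
    so ties keep the earliest epoch.
    """
    best_idx = 0
    for idx, score in enumerate(val_scores):
        if higher_is_better:
            better = score > val_scores[best_idx]
        else:
            better = score < val_scores[best_idx]
        if better:
            best_idx = idx
    return best_idx, val_scores[best_idx]

# Invented validation curves, for illustration only.
val_acc_toy = [0.61, 0.70, 0.74, 0.73, 0.74, 0.72]
val_loss_toy = [0.90, 0.70, 0.65, 0.66, 0.70]

print(best_epoch(val_acc_toy))                            # (2, 0.74)
print(best_epoch(val_loss_toy, higher_is_better=False))   # (2, 0.65)
```

Epoch indices are 0-based here, matching the indices of the lists in a keras `history.history` dictionary.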

## Exercise 4: Model performance as a function of number of hidden neurons

### Description

Investigate how the best validation loss and accuracy depend on the number of hidden neurons in a single layer.

1. Fit a reasonable number of models with different hidden layer size (between 10 and 1000 hidden neurons) for a fixed number of epochs well beyond the point of overfitting.
2. Collect some statistics by fitting the same models as in 1. for multiple initial conditions. Hints: 1. If you don't reset the random seed, you get different initial conditions each time you create a new model. 2. Let your computer work while you are asleep.
3. Plot summary statistics of the final validation loss and accuracy versus the number of hidden neurons. Hint: [boxplots](https://matplotlib.org/examples/pylab_examples/boxplot_demo.html) (also [here](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.boxplot.html?highlight=boxplot#matplotlib.axes.Axes.boxplot)) are useful. You may also want to use the matplotlib method set_xticklabels.
4. Plot summary statistics of the loss and accuracy for early stopping versus the number of hidden neurons.

### Solution

As our previous plots show, the Adam optimizer converges much faster than SGD, so we use it here. We saw that the model starts to overfit at around epoch 15, so we train it for 30 epochs.

In [None]:
def get_model_result(hidden_neurons):
    m = Sequential([
        Dense(hidden_neurons, input_shape=(x_train.shape[1],), activation="relu"),
        Dense(y_train.shape[1], activation="softmax")
    ])
    
    m.compile(
        loss=keras.losses.categorical_crossentropy,
        optimizer=Adam(lr=0.001),
        metrics=['accuracy']
    )

    h = m.fit(  # fit the model built in this function, not the global `model`
        x_train, y_train,
        validation_data=(x_test, y_test),
        epochs=30,
        verbose=0,
        callbacks=[keras.callbacks.EarlyStopping('val_loss', patience=3, min_delta=0.001)]
    )
    
    return h.history['val_loss'][-1], h.history['val_acc'][-1]

#### Questions 1 and 2

In [None]:
N = 20
hidden_neurons = list(map(int, np.logspace(1, 3, N)))
stats = []
for idx, hid in enumerate(hidden_neurons):
    # This seed or the seed parameter of initializer?
    # https://keras.io/initializers/
    np.random.seed(hash(student1 + student2) % 2**32)
    stats.append(get_model_result(hid))
    print("Completed train {}/{}".format(idx+1, N))

#### Question 3

In [None]:
fig, (ax_loss, ax_acc) = plt.subplots(nrows=1, ncols=2)#, sharey=True)

losses, accuracies = zip(*stats)

ax_loss.boxplot(losses)
ax_loss.set_title("Validation loss")
#ax_loss.set_yticklabels(hidden_neurons) # I am not sure it makes any sense: to build boxplot points are not kept in the same order


ax_acc.boxplot(accuracies)
ax_acc.set_title("Validation accuracy")
#ax_acc.set_yticklabels(hidden_neurons)

In [None]:
fig, (ax_loss, ax_acc) = plt.subplots(nrows=1, ncols=2, sharey=True)

ax_loss.plot(hidden_neurons, losses)
ax_loss.set_title("Validation loss")
ax_acc.plot(hidden_neurons, accuracies)
ax_acc.set_title("Validation accuracy")
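
Steps 2 and 3 ask for statistics over several runs per hidden layer size. A minimal sketch of how such runs could be grouped is given below. `collect_runs` and `fake_run` are hypothetical names; `fake_run` returns made-up numbers and only stands in for a real fit such as `get_model_result`.

```python
import itertools
import statistics

def collect_runs(sizes, run_once, repeats):
    """Call run_once(size) `repeats` times per size; return {size: [score, ...]}."""
    return {size: [run_once(size) for _ in range(repeats)] for size in sizes}

# Stand-in for a training run: a deterministic, invented "validation accuracy".
_run_id = itertools.count()

def fake_run(size):
    jitter = 0.01 * (next(_run_id) % 3)
    return 0.5 + 0.0001 * size + jitter

runs = collect_runs([10, 100, 1000], fake_run, repeats=3)
medians = {size: statistics.median(scores) for size, scores in runs.items()}

for size, scores in runs.items():
    print(size, scores, medians[size])
```

The lists in `runs.values()` have the layout that matplotlib's `boxplot` accepts (one sequence per box), and the keys can serve as tick labels via `set_xticklabels`.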

## Exercise 5: Comparison to deep models

### Description

Instead of choosing one hidden layer (with many neurons) you experiment here with multiple hidden layers (each with not so many neurons).

1. Fit models with 2, 3 and 4 hidden layers with approximately the same number of parameters as a network with one hidden layer of 100 neurons. Hint: Calculate the number of parameters in a network with input dimensionality N_in, K hidden layers with N_h units, one output layer with N_out dimensions and solve for N_h. Confirm your result with the keras method model.summary().
2. Run each model multiple times with different initial conditions and plot summary statistics of the best validation loss and accuracy versus the number of hidden layers.
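
The parameter count from the hint can be sketched in a few lines. The sketch assumes fully connected layers with biases, N_in = 256 (the 16x16 inputs) and N_out = 5 (the selected classes); `mlp_param_count` and `closest_width` are hypothetical helper names, and the width is found by a simple search rather than by solving the quadratic equation.

```python
def mlp_param_count(n_in, n_hidden, n_layers, n_out):
    """Weights plus biases of a dense network with n_layers equal-width hidden layers."""
    first = (n_in + 1) * n_hidden                        # input -> first hidden layer
    between = (n_layers - 1) * (n_hidden + 1) * n_hidden  # hidden -> hidden
    last = (n_hidden + 1) * n_out                        # last hidden -> output
    return first + between + last

def closest_width(n_in, n_out, n_layers, target, max_width=1000):
    """Hidden width whose parameter count is closest to `target`."""
    return min(range(1, max_width + 1),
               key=lambda w: abs(mlp_param_count(n_in, w, n_layers, n_out) - target))

target = mlp_param_count(256, 100, 1, 5)
print("1 hidden layer, 100 units:", target, "parameters")
for k in (2, 3, 4):
    width = closest_width(256, 5, k, target)
    print(k, "hidden layers:", width, "units,",
          mlp_param_count(256, width, k, 5), "parameters")
```

The reference network has 26205 parameters, and the search gives widths of 77, 66 and 59 units for 2, 3 and 4 hidden layers.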

### Solution

### 1.

#### Fit models with 2, 3 and 4 hidden layers with approximately the same number of parameters as a network with one hidden layer of 100 neurons

In [None]:
def dense_factory(output, weight_regularizer, bias_regularizer, activation="relu", input_dim=None):
    # Only the first layer of a Sequential model needs an explicit input shape.
    extra = {} if input_dim is None else {"input_shape": (input_dim,)}
    return Dense(output, activation=activation, kernel_initializer='random_uniform',
                 kernel_regularizer=weight_regularizer, bias_regularizer=bias_regularizer, **extra)

def model_factory(units, layers, weight_regularizer, bias_regularizer, dropout=None):
    # dropout is [rate after the first hidden layer, rate after the other hidden layers];
    # pad a copy so that the caller's list is not modified.
    if dropout is not None and len(dropout) < 2:
        dropout = list(dropout) + [0] * (2 - len(dropout))

    model_builder = Sequential()

    # First hidden layer (connected to the input)
    model_builder.add(dense_factory(units, weight_regularizer, bias_regularizer, input_dim=x_train.shape[1]))
    if dropout is not None:
        model_builder.add(Dropout(dropout[0]))

    # Remaining hidden layers
    for i in range(layers-1):
        model_builder.add(dense_factory(units, weight_regularizer, bias_regularizer))
        if dropout is not None:
            model_builder.add(Dropout(dropout[1]))

    # Output layer: softmax over the classes, no dropout on the predictions
    model_builder.add(dense_factory(y_train.shape[1], weight_regularizer, bias_regularizer, activation="softmax"))

    return model_builder

# Dummy regularizer, used as non existing regularizer
null_reg = keras.regularizers.l1(0)

# Factories for required models
size_per_layer_1_hidden = 100
size_per_layer_2_hidden = 77
size_per_layer_3_hidden = 66
size_per_layer_4_hidden = 59

def model_1_hidden_factory(weight_regularizer=None, bias_regularizer=None, dropout=None):
    return model_factory(size_per_layer_1_hidden, 1, weight_regularizer, bias_regularizer, dropout)

def model_2_hidden_factory(weight_regularizer=None, bias_regularizer=None, dropout=None):
    return model_factory(size_per_layer_2_hidden, 2, weight_regularizer, bias_regularizer, dropout)

def model_3_hidden_factory(weight_regularizer=None, bias_regularizer=None, dropout=None):
    return model_factory(size_per_layer_3_hidden, 3, weight_regularizer, bias_regularizer, dropout)

def model_4_hidden_factory(weight_regularizer=None, bias_regularizer=None, dropout=None):
    return model_factory(size_per_layer_4_hidden, 4, weight_regularizer, bias_regularizer, dropout)

def compile_model(model):
    model.compile(
        loss=keras.losses.categorical_crossentropy,
        optimizer=Adam(lr=0.01),
        metrics=['accuracy']
    )
    
def fit_model(model, x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test):
    h = model.fit(
        x_train, y_train,
        validation_data=(x_test, y_test),
        epochs=80,
        verbose=1,
        callbacks=[keras.callbacks.EarlyStopping('val_loss', patience=5, min_delta=0.001)]
    )   
    return h

#### Confirm your result with the keras method model.summary()

In [None]:
model_1_hidden_factory().summary()
model_2_hidden_factory().summary()
model_3_hidden_factory().summary()
model_4_hidden_factory().summary()

### 2.
#### Run each model multiple times with different initial conditions

In [None]:
model_factories = [model_1_hidden_factory, model_2_hidden_factory, model_3_hidden_factory, model_4_hidden_factory]

try_per_model = 40

results = []

for i, factory in enumerate(model_factories):
    print('Training model', i + 1, '/', len(model_factories))
    
    curr_model_res = []
    
    for j in range(try_per_model):  # j, so that the outer loop index i is not overwritten
        print('Inner step:', j+1, '/', try_per_model)
        
        model = factory()
        
        compile_model(model)
        
        h = fit_model(model, x_train, y_train, x_test, y_test)
        
        curr_model_res.append(h)
        
    results.append(curr_model_res)

#### Plot summary statistics of the best validation loss and accuracy versus the number of hidden layers

In [None]:
# Note: the best val_loss and the best val_acc of a run are not necessarily reached at the same epoch.

best_results = [[[min(res.history['val_loss']), max(res.history['val_acc'])] for res in model_res] for model_res in results]

In [None]:
for i, model_results in enumerate(best_results):
    val_loss, val_acc = list(zip(*model_results))
    plt.hist(val_loss, bins=25)
    plt.title("Distribution of the best validation loss for the model with " + str(i+1) + " hidden layer(s)")
    plt.show()

In [None]:
for i, model_results in enumerate(best_results):
    val_loss, val_acc = list(zip(*model_results))
    plt.hist(val_acc, bins=25)
    plt.title("Distribution of the best validation accuracy for the model with " + str(i+1) + " hidden layer(s)")
    plt.show()

In [None]:
best_trainings = []

for model_results in results:
    best_accs = [max(training_history.history['val_acc']) for training_history in model_results]
    best_training = model_results[best_accs.index(max(best_accs))]
    best_trainings.append(best_training)

## Exercise 6: Tricks (regularization, batch normalization, dropout)

### Description

Overfitting can also be counteracted with regularization and dropout. Batch normalization is supposed to mainly decrease convergence time.

1. Try to improve the best validation scores of the model with 1 layer and 100 hidden neurons and the model with 4 hidden layers. Experiment with batch_normalization layers, dropout layers and l1- and l2-regularization on weights (kernels) and biases.
2. After you have found good settings, plot for both models the learning curves of the naive model you fitted in the previous exercises together with the learning curves of the current version.
3. For proper comparison, plot also the learning curves of the two current models in a third figure.
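
As a reminder of what l1- and l2-regularization add to the loss, here is a small sketch in plain Python. It assumes the usual definitions (strength times the sum of absolute values, respectively of squares, which is also what `keras.regularizers.l1` and `keras.regularizers.l2` compute); the function names and the weights are made up for illustration.

```python
def l1_penalty(weights, strength):
    """strength * sum of |w|; pushes small weights towards exactly zero."""
    return strength * sum(abs(w) for w in weights)

def l2_penalty(weights, strength):
    """strength * sum of w^2; penalizes large weights much more than small ones."""
    return strength * sum(w * w for w in weights)

toy_weights = [0.5, -1.0, 2.0]
l1_value = l1_penalty(toy_weights, strength=0.01)   # about 0.035
l2_value = l2_penalty(toy_weights, strength=0.01)   # about 0.0525

print(l1_value, l2_value)
```

With a strength of 10 or 1.0 such a penalty can easily dominate the cross entropy, which is worth keeping in mind when choosing the range of a grid search.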

### Solution

In [None]:
weight_l1_regularizations = [10, 1.0, 0.1, 0.01, 0.001]
weight_l2_regularizations = [10, 1.0, 0.1, 0.01, 0.001]

bias_l1_regularizations = [10, 1.0, 0.1, 0.01, 0.001]
bias_l2_regularizations = [10, 1.0, 0.1, 0.01, 0.001]

dropouts = [[0.2, 0.5], [0.2, 0.9], [0.1, 0.1]]

def grid_search(model_factory, compiler, model_fitter):
    histories = []

    def inner_loop(w_regularizer, b_regularizer, *param_lists):
        # w_regularizer and b_regularizer are keras.regularizers.l1 or keras.regularizers.l2
        for w_reg, b_reg, dropout in itertools.product(*param_lists):
            print(w_regularizer.__name__, w_reg, b_regularizer.__name__, b_reg, dropout)
            model = model_factory(w_regularizer(w_reg), b_regularizer(b_reg), dropout=dropout)
            compiler(model)
            h = model_fitter(model)
            histories.append(h)

    l1, l2 = keras.regularizers.l1, keras.regularizers.l2
    inner_loop(l1, l1, weight_l1_regularizations, bias_l1_regularizations, dropouts)
    inner_loop(l2, l1, weight_l2_regularizations, bias_l1_regularizations, dropouts)
    inner_loop(l1, l2, weight_l1_regularizations, bias_l2_regularizations, dropouts)
    inner_loop(l2, l2, weight_l2_regularizations, bias_l2_regularizations, dropouts)

    return histories

In [None]:
model_family = [model_1_hidden_factory, model_4_hidden_factory]

results = []
for model in model_family:
    res = grid_search(model, compile_model, fit_model)
    results.append(res)

## Exercise 7: Convolutional networks

### Description

Convolutional neural networks have an inductive bias that is well adapted to image classification.

1. Design a convolutional neural network, play with the parameters and fit it. Hint: You may get valuable inspiration from the keras [examples](https://github.com/keras-team/keras/tree/master/examples), e.g. [mnist_cnn](https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py).
2. Plot the learning curves of the convolutional neural network together with the so far best performing model.

### Solution

In [None]:
train_tensor = np.reshape(x_train, (x_train.shape[0], 16, 16, 1))
test_tensor = np.reshape(x_test, (x_test.shape[0], 16, 16, 1))

In [None]:
cnn = Sequential([
    Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=train_tensor.shape[1:]),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(y_train.shape[1], activation='softmax')
])
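
To see how the feature maps shrink through this network, the output shapes can be computed by hand. The sketch below assumes the keras defaults used above (stride 1 and 'valid' padding for `Conv2D`, pooling stride equal to the pool size); the helper names are hypothetical.

```python
def conv2d_valid(shape, filters, kernel):
    """Output shape of a stride-1 convolution without padding."""
    height, width, _ = shape
    return (height - kernel + 1, width - kernel + 1, filters)

def max_pool(shape, pool):
    """Output shape of non-overlapping max pooling."""
    height, width, channels = shape
    return (height // pool, width // pool, channels)

shape_in = (16, 16, 1)                       # one grayscale 16x16 image
shape_c1 = conv2d_valid(shape_in, 32, 3)     # (14, 14, 32)
shape_c2 = conv2d_valid(shape_c1, 64, 3)     # (12, 12, 64)
shape_p = max_pool(shape_c2, 2)              # (6, 6, 64)
flat_units = shape_p[0] * shape_p[1] * shape_p[2]

print(shape_c1, shape_c2, shape_p, flat_units)
```

The `Flatten` layer therefore feeds 2304 values into the dense layer with 128 units, which can be confirmed with `cnn.summary()`.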

In [None]:
cnn.compile(loss=keras.losses.categorical_crossentropy,
           optimizer=keras.optimizers.Adadelta(),
           metrics=['accuracy'])

In [None]:
history_cnn = cnn.fit(  # keep the history so that the learning curves can be plotted
    train_tensor, y_train,
    validation_data=(test_tensor, y_test),
    epochs=20    
)