# Task 1

Start by reading the section “Implementing MLPs with Keras” from Chapter 10 of Géron’s textbook (pages 292-325).
Then install `TensorFlow 2.0+` and experiment with the code included in this section.
Additionally, study the official documentation (https://keras.io/) and get an idea of the numerous options offered by Keras (layers, loss functions, metrics, optimizers, activations, initializers, regularizers).
Don’t get overwhelmed with the number of options – you will frequently return to this site in the coming months.

### Imports

In [None]:
from itertools import product
from time import perf_counter

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.datasets import fashion_mnist
from tensorflow.keras.models import Sequential

---
## Part 1

Check out this official repository with many examples of Keras implementations of various sorts of deep neural networks [here](https://github.com/keras-team/keras/tree/tf-keras-2/examples).
We recommend cloning this repository and trying to get some of these examples running on your system (or Colab/DeepNote).
In particular, experiment with the `mnist_mlp.py` and `mnist_cnn.py` scripts, which show you how to build simple neural networks for the MNIST dataset (useful for the next task).

*insert findings*

---

## Part 2

Next, take the two well-known datasets: Fashion MNIST (introduced in _Ch. 10, p. 295_) and CIFAR-10.
Fashion MNIST contains 28x28 grayscale images split into 10 categories, with 60,000 images for training and 10,000 for testing; CIFAR-10 contains 32x32x3 RGB images (50,000/10,000 train/test).
Apply two reference networks to the Fashion MNIST dataset: an MLP described in detail in _Ch. 10, pp. 297-307_ and a CNN described in _Ch. 14, p. 447_.
Experiment with both networks, trying various options: initializations, activations, optimizers (and their hyperparameters), regularizations (L1, L2, Dropout, no Dropout).
You may also experiment with changing the architecture of both networks: adding/removing layers, number of convolutional filters, their sizes, etc.
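As a starting point for these experiments, the sketch below shows one way to expose the initializer, L2 penalty, and dropout rate as arguments of a model builder. The helper name `build_mlp`, its default values, and the SGD-with-momentum optimizer are illustrative assumptions, not settings taken from the book.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_mlp(activation="relu", initializer="he_normal", l2=1e-4, dropout=0.2):
    """Hypothetical helper: a 300-100-10 MLP with optional L2 and dropout."""
    # A falsy l2 value disables weight regularization entirely.
    regularizer = keras.regularizers.l2(l2) if l2 else None
    model = keras.Sequential([
        keras.Input(shape=(28, 28)),
        layers.Flatten(),
        layers.Dense(300, activation=activation,
                     kernel_initializer=initializer,
                     kernel_regularizer=regularizer),
        layers.Dropout(dropout),  # a rate of 0.0 makes this layer a no-op
        layers.Dense(100, activation=activation,
                     kernel_initializer=initializer,
                     kernel_regularizer=regularizer),
        layers.Dropout(dropout),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(
        # Assumed optimizer settings, chosen only for illustration.
        optimizer=keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


mlp = build_mlp()
n_params = mlp.count_params()
```

Dropout and the L2 penalty change how the model trains but add no trainable parameters, so `n_params` is the same as for the plain 300-100-10 network.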

After you have found the best-performing hyperparameter sets, take the 3 best ones, train new models on the CIFAR-10 dataset, and see whether your performance gains translate to a different dataset.
Provide your thoughts on these results in the report.
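Picking the three best configurations from a results table could look roughly like this. The table contents are made-up placeholder numbers, and the column names are assumptions chosen for the sketch.

```python
import pandas as pd

# Placeholder numbers purely for illustration; real values come from the experiments.
results = pd.DataFrame({
    "optimizer": ["Adam", "SGD", "RMSprop", "Adam", "SGD"],
    "lr": [1e-3, 1e-2, 1e-3, 5e-3, 1e-3],
    "activation": ["relu", "relu", "tanh", "sigmoid", "tanh"],
    "accuracy": [0.88, 0.84, 0.86, 0.85, 0.80],
})

# Keep the three rows with the highest test accuracy, best first.
top3 = results.nlargest(3, "accuracy")

# Turn them into plain dicts that can be looped over when retraining on CIFAR-10.
top_configs = top3[["optimizer", "lr", "activation"]].to_dict("records")
```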

First we create an MLP model for the Fashion MNIST dataset.
We use the same model as in the book, but we add a dropout layer after the first dense layer.
We also use the Adam optimizer with a learning rate of 0.001.
We train the model for 10 epochs and use a batch size of 32.
We use the same model for the CIFAR-10 dataset, but we change the number of epochs to 20 and the batch size to 64.
We also use a learning rate of 0.0001 for the CIFAR-10 dataset.
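Reusing the same MLP on CIFAR-10 changes only the size of the flattened input, which has a large effect on the number of weights in the first dense layer. The arithmetic below is a rough sketch that assumes the 300-100-10 layout from the book; the function names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(input_shape, hidden=(300, 100), n_classes=10):
    """Trainable parameters of Flatten -> Dense(hidden...) -> Dense(n_classes)."""
    n_in = 1
    for dim in input_shape:
        n_in *= dim  # Flatten multiplies all input dimensions together
    total = 0
    for n_out in (*hidden, n_classes):
        total += dense_params(n_in, n_out)
        n_in = n_out
    return total


fashion_params = mlp_params((28, 28))    # 784 inputs  -> 266,610 parameters
cifar_params = mlp_params((32, 32, 3))   # 3072 inputs -> 953,010 parameters
```

Dropout layers do not contribute to these counts, since they have no trainable weights.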

### Fashion MNIST

Load the dataset and scale the pixel values to the range [0, 1].

In [None]:
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
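The effect of this scaling can be checked on a small synthetic array, without downloading the dataset. The array `fake_images` is a stand-in invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a batch of raw images: uint8 values in [0, 255].
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

# Same transformation as above: cast to float32, then divide by the maximum value.
scaled = fake_images.astype("float32") / 255

value_range = (float(scaled.min()), float(scaled.max()))
```

After scaling, all values lie in [0, 1] and the dtype is `float32`, which is what Keras layers use by default.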

Test different hyperparameters for the MLP with two hidden layers defined in Chapter 10 of the book.

In [None]:
df = pd.DataFrame(columns=['optimizer', 'lr', 'activation', 'loss', 'accuracy', 'traintime'])
reps = 3
optimizers = [keras.optimizers.Adam, keras.optimizers.SGD, keras.optimizers.RMSprop]
lrs = [1e-3, 5e-3, 1e-2]
activations = ['relu', 'sigmoid', 'tanh']

configs = list(product(optimizers, lrs, activations))
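To get a feel for the cost of this grid, the following sketch counts the configurations and total training runs. It uses optimizer names as plain strings and separate variable names (`lr_values`, `n_reps`, and so on) so that it can be run independently of the notebook state.

```python
from itertools import product

optimizer_names = ["Adam", "SGD", "RMSprop"]
lr_values = [1e-3, 5e-3, 1e-2]
activation_names = ["relu", "sigmoid", "tanh"]
n_reps = 3

# product() yields every combination; the last argument varies fastest.
grid = list(product(optimizer_names, lr_values, activation_names))

n_configs = len(grid)          # 3 * 3 * 3 = 27 configurations
n_runs = n_configs * n_reps    # 81 training runs in total
```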

In [None]:
run = 1
results = []  # one row per individual training run
for rep in range(reps):
    for optimizer, lr, activation in configs:
        print(f'\r{run}/{len(configs)*reps}', end='')
        model = Sequential()
        model.add(layers.Flatten(input_shape=(28, 28)))
        model.add(layers.Dense(300, activation=activation))
        model.add(layers.Dense(100, activation=activation))
        model.add(layers.Dense(10, activation='softmax'))
        model.compile(
            optimizer = optimizer(learning_rate=lr),
            loss = 'sparse_categorical_crossentropy',
            metrics = ['accuracy']
        )
        # Time only the training itself, not model construction or evaluation.
        tic = perf_counter()
        history = model.fit(x_train, y_train, epochs=5, batch_size=64, verbose=0)
        toc = perf_counter()
        test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
        results.append({
            'optimizer': optimizer.__name__,
            'lr': lr,
            'activation': activation,
            'loss': test_loss,
            'accuracy': test_acc,
            'traintime': toc - tic,
        })
        run += 1

# Average the repetitions of each configuration.
df = (pd.DataFrame(results)
      .groupby(['optimizer', 'lr', 'activation'], as_index=False)
      .mean())

In [None]:
df

Plot the results.

In [None]:
fig, axes = plt.subplots(3, 3, figsize=(15, 15))
for i, optimizer in enumerate([opt.__name__ for opt in optimizers]):
    for j, metric in enumerate(['accuracy', 'loss', 'traintime']):
        ax = axes[i,j]
        sns.lineplot(x='lr', y=metric, hue='activation', data=df[df['optimizer'] == optimizer], ax=ax)
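        ax.set_xlabel('')
        ax.set_ylabel(optimizer if j == 0 else '')
        # Tick label size is set via tick_params; set_xticks only accepts
        # text properties when explicit labels are passed.
        ax.set_xticks(lrs)
        ax.tick_params(axis='x', labelsize=3)
        ax.get_legend().remove()
        ax.set_title(metric if i == 0 else '')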

# set global legend
handles, labels = ax.get_legend_handles_labels()
fig.legend(handles, labels, loc='upper right', bbox_to_anchor=(0.99, 0.99), ncol=3, fontsize=14)

fig.suptitle('MLP - Fashion MNIST HPO exploration', fontsize=20, weight='bold')

fig.tight_layout()
fig.savefig('../plots/MLP_hpo.png')

_to be edited!_

ReLU seems to be the best activation function, and Adam with a learning rate of 0.001 seems to be the best optimizer setting.
ReLU not only achieves the lowest loss and highest accuracy, it also trains faster than the other activation functions.

Between sigmoid and tanh, sigmoid performs better, while the training times are practically identical.

### CNN

In [None]:
from functools import partial

DefaultConv2D = partial(keras.layers.Conv2D,
                        kernel_size=3, activation='relu', padding="SAME")

model = keras.models.Sequential([
    DefaultConv2D(filters=64, kernel_size=7, input_shape=[28, 28, 1]),
    keras.layers.MaxPooling2D(pool_size=2),
    DefaultConv2D(filters=128),
    DefaultConv2D(filters=128),
    keras.layers.MaxPooling2D(pool_size=2),
    DefaultConv2D(filters=256),
    DefaultConv2D(filters=256),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Flatten(),
    keras.layers.Dense(units=128, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=64, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=10, activation='softmax'),
])
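It can help to trace how the feature map shape and the parameter count evolve through this architecture. The sketch below does the arithmetic by hand, assuming stride-1 convolutions with "same" padding and non-overlapping 2x2 max pooling with the default "valid" padding; the helper names are hypothetical.

```python
def conv_same(h, w, c_in, filters, kernel):
    """'same'-padded, stride-1 convolution: spatial size stays unchanged."""
    params = kernel * kernel * c_in * filters + filters  # weights + biases
    return h, w, filters, params


def max_pool(h, w, c, pool=2):
    """Non-overlapping pooling with 'valid' padding floors odd sizes."""
    return h // pool, w // pool, c


# Layer plan mirroring the model above: ("conv", filters, kernel) or ("pool",).
plan = [("conv", 64, 7), ("pool",),
        ("conv", 128, 3), ("conv", 128, 3), ("pool",),
        ("conv", 256, 3), ("conv", 256, 3), ("pool",)]

h, w, c = 28, 28, 1
total_params = 0
for step in plan:
    if step[0] == "conv":
        h, w, c, p = conv_same(h, w, c, step[1], step[2])
        total_params += p
    else:
        h, w, c = max_pool(h, w, c)

flat_features = h * w * c  # size of the vector produced by Flatten

n_in = flat_features
for units in (128, 64, 10):
    total_params += n_in * units + units  # dense layer weights + biases
    n_in = units
```

Under these assumptions the spatial size goes 28 → 14 → 7 → 3, so `Flatten` produces 3 * 3 * 256 = 2304 features and the network has 1,413,834 trainable parameters in total.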

In [None]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="nadam",
              metrics=["accuracy"])

history = model.fit(x_train, y_train, epochs=5, batch_size=128, validation_data=(x_test, y_test))

model.evaluate(x_test, y_test)
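The CNN declares `input_shape=[28, 28, 1]`, while arrays returned by `fashion_mnist.load_data()` have shape `(N, 28, 28)` without a channel axis. Whether Keras adds the missing axis implicitly may depend on the version, so making it explicit is a safe option. A minimal sketch on a dummy array (`gray_batch` is invented for illustration):

```python
import numpy as np

# Dummy stand-in for a batch of grayscale images without a channel axis.
gray_batch = np.zeros((5, 28, 28), dtype="float32")

# Append a trailing channel dimension: (N, 28, 28) -> (N, 28, 28, 1).
with_channel = gray_batch[..., np.newaxis]
```

Applying the same indexing to `x_train` and `x_test` before calling `fit` would give the arrays the shape the first convolutional layer declares.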
