<a href="https://colab.research.google.com/github/D4ve39/pythonProg/blob/master/MachineLearning_DL_FFNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Machine Learning Workshop 2021 - Deep Learning
# A gentle introduction to Neural Networks

Today we will attempt to solve a machine learning problem by building and training a neural network. In particular, we will implement a very general architecture: a Multi-Layer Perceptron (MLP). We will see how well it does on the task of handwritten digit recognition.

*Keras* will be our library of choice. It relies on a more powerful library (*tensorflow*) to handle the training process of neural networks, but it also exposes a very intuitive interface that will remind you of *sklearn*.

In [None]:
# first of all, we import the libraries we will use

import numpy as np  # to deal with matrix and numerical computation
import matplotlib.pyplot as plt  # our loyal plotting library
import pandas as pd  # great tool for managing datasets
from sklearn.model_selection import train_test_split  # good old splitting function
from sklearn import datasets

# finally, our newcomers! tensorflow and keras

from tensorflow import keras
from tensorflow.keras import layers

## Data preparation
As usual, we will load the data, split it and visualize it. Since neural networks are notoriously expensive to train, we will unfortunately not be able to rely on cross-validation. Instead, we will hold out a fixed part of the training data for validation, which is cheaper but arguably suboptimal.

In [None]:
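# let us fetch this dataset once again
# as_frame=False returns NumPy arrays instead of a pandas DataFrame, which the indexing below relies on

digits = datasets.fetch_openml('mnist_784', version=1, as_frame=False)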

In [None]:
# preprocessing!

data = digits.data  # get features
data = data.astype(np.float32) / 255  # scale pixel values from [0, 255] to [0, 1]
labels = digits.target.astype(np.int64)  # get labels (stored as strings) as integers
labels = keras.utils.to_categorical(labels)  # we are using one-hot labels to handle vectorized outputs

# split training and test data

training_data, test_data, training_labels, test_labels = train_test_split(data, labels, test_size=0.2, random_state=42)

In [None]:
# Let us check some simple statistics
print('data shape: {}'.format(training_data.shape[1:]))
print('# training samples: {}'.format(training_data.shape[0]))
print('# test samples: {}'.format(test_data.shape[0]))
print('# features: {}'.format(training_data.shape[1]))
print('# classes: {}'.format(training_labels.shape[1]))

We can (and should) have a look at the training data! On the other hand, we should forget about the test dataset for now, in order to avoid information leakage.

All data samples can be presented as an image (2D array of values) or flattened as a 1D vector. The first option is great for visualization, while the second is needed later as our MLP will only take 1D vectors as input.

Similarly, the labels can be represented as an integer (from 0 to 9) or as a one-hot vector of size 10, which will match the output size of the MLP. You can see some examples below.
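To make the two representations concrete, here is a minimal NumPy sketch. The helpers `to_one_hot` and `from_one_hot` are hypothetical names introduced only for illustration (in the notebook the conversion is done by *keras*), and the arrays are made-up examples rather than real MNIST samples.

```python
import numpy as np

def to_one_hot(int_labels, n_classes):
    """Turn integer labels of shape (n,) into one-hot rows of shape (n, n_classes)."""
    int_labels = np.asarray(int_labels, dtype=np.int64)
    one_hot = np.zeros((int_labels.size, n_classes), dtype=np.float32)
    one_hot[np.arange(int_labels.size), int_labels] = 1.0  # a single 1 per row
    return one_hot

def from_one_hot(one_hot):
    """Recover integer labels: the position of the 1 in each row."""
    return np.argmax(one_hot, axis=1)

example_labels = np.array([3, 0, 9])
example_one_hot = to_one_hot(example_labels, n_classes=10)
print(example_one_hot[0])             # one-hot row for the digit 3
print(from_one_hot(example_one_hot))  # back to integer labels

# a flattened 784-vector and a 28x28 image hold exactly the same values
flat_vector = np.arange(784, dtype=np.float32)
image = flat_vector.reshape((28, 28))
print(image.shape, image.reshape(-1).shape)
```

Both conversions are lossless, so we can freely switch between the form that is convenient for plotting and the one the network expects.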

Note: do not pay too much attention to the code for visualization, it is not too important.

In [None]:
# visualize sample images from the training set
images_to_plot = 4  # number of samples to show
fig, ax = plt.subplots(images_to_plot, figsize=(5,15))
rnd_ids = np.random.randint(0, len(training_data), images_to_plot)  # pick random indices in the training set
for i, idx in enumerate(rnd_ids):  # idx instead of id, to avoid shadowing the built-in
    rnd_img = training_data[idx].reshape((28, 28))  # reshape from vector to matrix
    rnd_lbl = training_labels[idx]
    ax[i].imshow(rnd_img)
    ax[i].set_title(f'True label: {rnd_lbl}   ({np.argmax(rnd_lbl)})')  # one-hot vector and its integer label
plt.show()

## Creating a model
Now, let us turn to designing our neural network. For simplicity, we will start with the bare minimum: since the output will be a 10-dimensional vector (one element for each possible label), we will construct a network with only 10 neurons. This means that each output is computed as a linear combination of the input features (plus a bias term), before the activation is applied. We really could not choose a simpler architecture.

Note: the *softmax* activation will transform the outputs into a probability distribution over different classes.
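As a minimal sketch of what the softmax does, here is a plain NumPy version. The function and the example scores are ours, for illustration only; *keras* uses its own implementation internally.

```python
import numpy as np

def softmax(logits):
    """Map a vector of raw scores to a probability distribution."""
    shifted = logits - np.max(logits)  # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])  # made-up raw outputs for 3 classes
probs = softmax(scores)
print(probs)        # all entries are positive, the largest score gets the largest probability
print(probs.sum())  # the entries sum to 1 (up to rounding)
```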

In [None]:
n_features = training_data.shape[1]
n_classes = 10

model = keras.Sequential(
    [
        keras.Input(shape=(n_features,)),
        layers.Dense(n_classes, activation='softmax')
    ]
)

The model is thus created. Let us have a look at how many parameters/weights it will need to learn.

In [None]:
model.summary()
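The number of parameters can also be checked by hand: a dense layer with bias has one weight per (input, unit) pair plus one bias per unit. The helper `dense_param_count` below is a hypothetical name used only for this check.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# our first model: 784 input features (28 x 28 pixels) connected to 10 output units
print(dense_param_count(784, 10))  # 7850
```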

We can even ask *keras* to plot our network.

In [None]:
keras.utils.plot_model(model, show_shapes=True)

We have finally defined our neural network, which represents a model class. The last obstacle we need to overcome is training the network or, in other words, learning good parameters. This will not be too different from previous methods: we will define a goodness of fit criterion and try to optimize it. While this will be handled directly by *keras*, we will now take some time to introduce the main idea:

## Gradient Descent
We start by defining a loss function. A loss function quantifies the inaccuracies of our model on the training data as a function of its parameters. Examples of loss functions are the Mean Square Error for regression, or Cross Entropy for classification, which is a generalization of the Binary Cross Entropy we have seen with logistic regression.
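As a rough illustration of how Cross Entropy behaves, here is a NumPy sketch for one-hot targets. The function name, the clipping constant `eps` and the example probabilities are our own assumptions, not the exact *keras* implementation.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean cross entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    per_sample = -np.sum(y_true * np.log(y_pred), axis=1)
    return per_sample.mean()

y_true = np.array([[0.0, 1.0, 0.0]])               # the true class is the second one
confident_right = np.array([[0.05, 0.90, 0.05]])
confident_wrong = np.array([[0.90, 0.05, 0.05]])
print(categorical_cross_entropy(y_true, confident_right))  # small loss
print(categorical_cross_entropy(y_true, confident_wrong))  # much larger loss
```

Only the probability assigned to the true class enters the loss, so a confident wrong prediction is penalized heavily.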

Gradient Descent is an optimization algorithm: it iteratively updates the parameters of the neural network to minimize the loss function. We will now see how gradient descent operates on a much simpler function, but the procedure for complex loss functions is not too dissimilar.

In [None]:
# first of all, we introduce a simple quadratic function

def f(x):
    return x**2

# x is the (only) parameter of this function
# our goal is that of finding a value of x that minimizes f(x) (a minimum of f)
# this corresponds to finding a set of parameters that minimizes the loss function in a neural network

# gradient descent relies on one hyperparameter that controls the magnitude of parameter updates
learning_rate = 0.2

# first we initialize our parameter (usually would be done randomly)
x = -1.9
print('Initialization: {}'.format(x))

# now we iterate n times
n = 5

history = [x]
for i in range(n): 
    # we evaluate the function (forward pass)
    y = f(x)
    print(f'Step {i} - value: {y}')
    
    # we compute the derivative of f wrt our parameter x (gradient)
    # we will do this by hand, but keras can automatically compute very complex gradients
    grad = 2 * x
    print(f'Step {i} - gradient: {grad}')
    
    # we now take a single optimization step
    # the gradient points in the direction in which the function grows
    # in our case, if the gradient is positive, it means that the function locally increases as x increases
    # if the gradient is negative, the function decreases as x increases
    # since we want to minimize f, we have to take a step in the opposite direction wrt the gradient
    
    step = -grad * learning_rate
    x = x + step
    history.append(x)
    print(f'Step {i} - parameter: {x}')
    # ... and we repeat!


# let us have a look at what happened

z = np.linspace(-2, 2, 100)
plt.plot(z, f(z))
plt.plot(history, [f(x) for x in history], '-o')
for i, x in enumerate(history):
    plt.annotate(str(i), (x, f(x)), fontsize='large')
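To get a feeling for the learning rate, we can rerun the same procedure with a few different values. This is a small sketch on the same quadratic function; the helper `gradient_descent` and the three learning rates are chosen by us for illustration.

```python
def gradient_descent(x0, learning_rate, n_steps):
    """Run gradient descent on f(x) = x**2 and return all visited values of x."""
    xs = [x0]
    for _ in range(n_steps):
        grad = 2 * xs[-1]  # derivative of x**2 at the current point
        xs.append(xs[-1] - learning_rate * grad)
    return xs

for lr in (0.2, 0.5, 1.1):
    final_x = gradient_descent(-1.9, lr, 5)[-1]
    print(f'learning rate {lr}: x after 5 steps = {final_x:.4f}')
```

For this particular function each step multiplies x by (1 - 2 * learning_rate): 0.2 shrinks x steadily, 0.5 happens to land exactly on the minimum, and 1.1 overshoots so much that x moves further away at every step.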
    

## Training

Let us now let *keras* handle all of this. All we need to do is call the right methods.

Note: at each gradient descent step the loss function is recomputed on part of the training set, called a batch. Once each batch that the training set was divided into has been used, we start over. Each complete pass over the dataset is called an epoch.
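The bookkeeping behind batches and epochs can be sketched as follows. The generator `iterate_minibatches` and the dataset size are hypothetical and only meant to show the idea; *keras* takes care of this for us.

```python
import math
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield index arrays that together cover the dataset exactly once (one epoch)."""
    order = rng.permutation(n_samples)  # visit the samples in random order
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

n_samples_example = 1000  # hypothetical dataset size
batch_size_example = 128
rng = np.random.default_rng(0)

batches = list(iterate_minibatches(n_samples_example, batch_size_example, rng))
print(len(batches))                                       # gradient steps in one epoch
print(math.ceil(n_samples_example / batch_size_example))  # the same number, computed directly
print(len(batches[-1]))                                   # the last batch may be smaller
```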

In [None]:
batch_size = 128  # how much data to feed in at once
epochs = 20  # how many times to go over the dataset

# define loss, optimizer and metrics
model.compile(loss="categorical_crossentropy", optimizer=keras.optimizers.SGD(1e-2), metrics=["accuracy"])

history = model.fit(training_data, training_labels, batch_size=batch_size, epochs=epochs, validation_split=0.1)
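A short note on `validation_split`: as far as the *keras* documentation describes it, the last fraction of the samples passed to `fit` is set aside (before any shuffling) and used only to compute the validation metrics. A rough NumPy sketch of such a hold-out split, with a hypothetical helper `holdout_split` and toy data:

```python
import numpy as np

def holdout_split(data, labels, validation_split):
    """Reserve the last fraction of the samples for validation."""
    n_val = int(round(len(data) * validation_split))
    n_train = len(data) - n_val
    return (data[:n_train], labels[:n_train]), (data[n_train:], labels[n_train:])

toy_data = np.arange(20).reshape(10, 2)  # 10 hypothetical samples with 2 features each
toy_targets = np.arange(10)
(fit_x, fit_y), (val_x, val_y) = holdout_split(toy_data, toy_targets, 0.2)
print(len(fit_x), len(val_x))  # samples used for fitting and for validation
```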

In [None]:
def show_history(history, show_plots=False):  # the argument is now actually used, instead of the global variable
    if show_plots:
        fig, ax = plt.subplots(1, 2, figsize=(8, 4))
        ax[0].plot(history.history['accuracy'], label='Training accuracy')
        ax[0].plot(history.history['val_accuracy'], label='Validation accuracy')
        ax[0].set_title('Model accuracy')
        ax[0].set_ylabel('Accuracy')
        ax[0].set_xlabel('Epoch')
        ax[0].legend(loc='lower right')
        ax[1].plot(history.history['loss'], label='Training loss')
        ax[1].plot(history.history['val_loss'], label='Validation loss')
        ax[1].set_title('Model loss')
        ax[1].set_ylabel('Loss')
        ax[1].set_xlabel('Epoch')
        ax[1].legend(loc='upper right')
        plt.show()
    print('Final training accuracy: {}'.format(history.history['accuracy'][-1]))
    print('Final validation accuracy: {}'.format(history.history['val_accuracy'][-1]))
    print('Final training loss: {}'.format(history.history['loss'][-1]))
    print('Final validation loss: {}'.format(history.history['val_loss'][-1]))

show_history(history)

Results are not great... as a matter of fact, they are comparable with logistic regression! This is because such a simple model has basically the same expressivity as simpler methods. We have to go deeper! Let us add more layers and (hopefully) increase the expressiveness of the network!

In [None]:
model = keras.Sequential(
    [
        keras.Input(shape=(n_features,)),
        layers.Dense(256),  # additional layer!
        layers.Dense(n_classes, activation='softmax'),
    ]
)
model.summary()

In [None]:
model.compile(loss="categorical_crossentropy",  optimizer=keras.optimizers.SGD(1e-2), metrics=["accuracy"])

history = model.fit(training_data, training_labels, batch_size=batch_size, epochs=epochs, validation_split=0.1)

In [None]:
show_history(history)

Final training accuracy: 0.9158532023429871
Final validation accuracy: 0.9200000166893005
Final training loss: 0.29516375064849854
Final validation loss: 0.28193774819374084


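It is not improving much... why?
Because, apart from the final softmax, all operations are linear: stacking two linear layers yields just another linear map, so the extra layer adds parameters but no expressiveness. We need to introduce non-linearities.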
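We can verify this claim numerically. The sketch below uses random weights with the same layer sizes as our model (all values are made up) and shows that two dense layers without activation compute exactly what a single dense layer could.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 784))  # 5 fake input vectors

# two stacked linear layers: 784 -> 256 -> 10, with no activation in between
W1, b1 = rng.normal(size=(784, 256)), rng.normal(size=256)
W2, b2 = rng.normal(size=(256, 10)), rng.normal(size=10)
two_layers = (inputs @ W1 + b1) @ W2 + b2

# one equivalent linear layer: 784 -> 10
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = inputs @ W + b

print(np.allclose(two_layers, one_layer))  # same outputs, up to rounding
```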

## Finally, an actual MLP

In [None]:
model = keras.Sequential(
    [
        keras.Input(shape=(n_features,)),
        layers.Dense(256, activation='relu'),
        layers.Dense(n_classes, activation='softmax'),
    ]
)
model.summary()

In [None]:
model.compile(loss="categorical_crossentropy",  optimizer=keras.optimizers.Adam(1e-2), metrics=["accuracy"])

history = model.fit(training_data, training_labels, batch_size=batch_size, epochs=epochs, validation_split=0.2)

In [None]:
show_history(history, show_plots=True)

Now we are talking! Two ingredients are fundamental: multiple (or very wide) layers and non-linearities.
What we have built is called an MLP.
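To demystify what happens inside, here is a sketch of the forward pass of such a network in plain NumPy. The weights are random and the helper names are ours; the trained *keras* model performs the same kind of computation with learned weights.

```python
import numpy as np

def relu(z):
    """ReLU keeps positive values and sets negative ones to zero."""
    return np.maximum(z, 0.0)

def softmax_rows(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    exps = np.exp(z - z.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)

def mlp_forward(inputs, W1, b1, W2, b2):
    """Hidden layer with ReLU, followed by an output layer with softmax."""
    hidden = relu(inputs @ W1 + b1)
    return softmax_rows(hidden @ W2 + b2)

rng = np.random.default_rng(0)
fake_images = rng.random(size=(3, 784))  # 3 fake flattened images with values in [0, 1)
W1, b1 = rng.normal(scale=0.05, size=(784, 256)), np.zeros(256)
W2, b2 = rng.normal(scale=0.05, size=(256, 10)), np.zeros(10)

class_probs = mlp_forward(fake_images, W1, b1, W2, b2)
print(class_probs.shape)        # one row of 10 class probabilities per image
print(class_probs.sum(axis=1))  # each row sums to 1 (up to rounding)
```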

## Your turn: Playtime!

Now you can play around! In general, a more complex model will be more powerful, but harder to train and prone to overfitting. Can you strike the right balance? Try to customize the architecture and improve performance.

### a) Architecture

Our MLP only has two layers. Can you add new ones? Can you increase the number of units per layer? Does that help?


### b) Optimizer

Our first two models were trained with SGD, a cousin of vanilla gradient descent, while the MLP above already used Adam. Other fancier optimizers exist as well. What happens if you swap between 'sgd', 'adam' and 'rmsprop' when compiling the model?


### c) Batch size and epochs
Try changing the batch size and the number of epochs and see how that affects training. You can do so by passing the right arguments to model.fit.


In [None]:
model = keras.Sequential(
    [
        # ...
    ]
)
model.summary()
model.compile(...)

history = model.fit(training_data, training_labels, batch_size=batch_size, epochs=epochs, validation_split=0.1)
show_history(history, show_plots=True)

If you want, you can finally evaluate your model on test data to get a proper estimate of its performance.

In [None]:
score = model.evaluate(test_data, test_labels, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])
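The accuracy reported above is simply the fraction of samples for which the most likely predicted class matches the true one. Here is a sketch with made-up numbers; in the notebook the probabilities would come from `model.predict(test_data)`, and the helper name is hypothetical.

```python
import numpy as np

def accuracy_from_one_hot(true_one_hot, predicted_probs):
    """Fraction of samples whose most likely class matches the true class."""
    true_classes = np.argmax(true_one_hot, axis=1)
    predicted_classes = np.argmax(predicted_probs, axis=1)
    return np.mean(true_classes == predicted_classes)

# three hypothetical samples with 3 classes each
toy_labels = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
toy_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.6, 0.3, 0.1]])
print(accuracy_from_one_hot(toy_labels, toy_probs))  # 2 of the 3 predictions are correct
```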