
# Deep Learning: How does the network learn?

In this course, we will go further into the specifics of a neural network. We will understand how it works and how it learns. This will give us a better sense of:
* Why these models generalize so easily.
* Why, sometimes, it is so hard to fit a model.
* Which parameters are most important to prevent overfitting.
* Why model building may take a long time in some cases.

In this course, we will use the following packages:

In [None]:
import tensorflow as tf
from tensorflow import keras

import numpy as np

Also, similar to the previous course, we will use the Fashion MNIST dataset:

In [None]:
((X_train, y_train), (X_test, y_test)) = keras.datasets.fashion_mnist.load_data( )

# Understanding a base model

Now, let's fit a basic model:

In [None]:
SEED = 42
tf.random.set_seed(SEED)

model = keras.Sequential([
                          keras.layers.Flatten(input_shape = (28, 28)),
                          keras.layers.Dense(256, activation = tf.nn.relu),
                          keras.layers.Dropout(0.2),
                          keras.layers.Dense(10,  activation = tf.nn.softmax)
])

model.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

fit_history = model.fit(X_train, y_train, epochs = 5, validation_split = 0.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


So, we have fitted a base model. Note that our training accuracy is 74.29%, and our validation accuracy is 82.52%. Let's check the test accuracy:

In [None]:
model.evaluate(X_test, y_test)



[0.6125737428665161, 0.796999990940094]

The test accuracy (about 79.7%) is close to the validation accuracy.

Ok, but how is our model working internally? Let's check a summary for our model:

In [None]:
model.summary( )

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_4 (Flatten)         (None, 784)               0         
                                                                 
 dense_8 (Dense)             (None, 256)               200960    
                                                                 
 dropout_4 (Dropout)         (None, 256)               0         
                                                                 
 dense_9 (Dense)             (None, 10)                2570      
                                                                 
Total params: 203,530
Trainable params: 203,530
Non-trainable params: 0
_________________________________________________________________


So, we have four layers: a flatten layer, a dense layer, a dropout layer, and a final dense layer. The flatten layer has 784 outputs (one per pixel), the dense and dropout layers have 256 neurons, and the final dense layer has 10 neurons. 

Note that only the dense layers have parameters. The first dense layer has 200960 parameters. This happens because every connection between two neurons has a weight (the Relu activation function itself has no parameters). Since the dense layer connects every output of the previous layer (784) to every neuron of the current layer (256), we end up with $784 \times 256 = 200704$ weights. The rest of the parameters are the biases, one for each neuron. Thus, $200704 + 256 = 200960$. In the same way, the final dense layer has $256 \times 10 + 10 = 2570$ parameters. 

When we are fitting our neural network, we are effectively fitting those weights and biases. Initially, our model sets a random value for each weight and, throughout the optimization process, the model adjusts the parameters until an optimal configuration is found. This optimal configuration represents the set of parameters that optimizes a given metric (e.g. minimizes a loss function).
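The parameter counts in the summary can be checked with a few lines of plain Python. This is only a sketch of the arithmetic; the helper `dense_param_count` is a hypothetical name introduced here, not a Keras function.

```python
def dense_param_count(n_inputs: int, n_units: int) -> int:
    """Number of trainable parameters in a fully connected layer."""
    n_weights = n_inputs * n_units  # one weight per input-to-neuron connection
    n_biases = n_units              # one bias per neuron
    return n_weights + n_biases

first_dense = dense_param_count(784, 256)   # flatten output -> hidden layer
second_dense = dense_param_count(256, 10)   # hidden layer -> output layer

print(first_dense)                 # 200960
print(second_dense)                # 2570
print(first_dense + second_dense)  # 203530
```

The three values match the `Param #` column and the `Total params` line of the summary.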

To understand the configuration of our neural network, we can use:

In [None]:
model.get_config( )

{'layers': [{'class_name': 'InputLayer',
   'config': {'batch_input_shape': (None, 28, 28),
    'dtype': 'float32',
    'name': 'flatten_4_input',
    'ragged': False,
    'sparse': False}},
  {'class_name': 'Flatten',
   'config': {'batch_input_shape': (None, 28, 28),
    'data_format': 'channels_last',
    'dtype': 'float32',
    'name': 'flatten_4',
    'trainable': True}},
  {'class_name': 'Dense',
   'config': {'activation': 'relu',
    'activity_regularizer': None,
    'bias_constraint': None,
    'bias_initializer': {'class_name': 'Zeros', 'config': {}},
    'bias_regularizer': None,
    'dtype': 'float32',
    'kernel_constraint': None,
    'kernel_initializer': {'class_name': 'GlorotUniform',
     'config': {'seed': None}},
    'kernel_regularizer': None,
    'name': 'dense_8',
    'trainable': True,
    'units': 256,
    'use_bias': True}},
  {'class_name': 'Dropout',
   'config': {'dtype': 'float32',
    'name': 'dropout_4',
    'noise_shape': None,
    'rate': 0.2,
    'see

Here, we have a dictionary (similar to a JSON) containing a set of configuration values for each layer. The weights are initialized using the Glorot Uniform initializer. The biases, however, are initialized with zeros.

## ReLU activation function

The ReLU activation is given by:
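To give an idea of what this initialization does, here is a minimal NumPy sketch. It assumes the usual definition of the Glorot uniform scheme, where the weights are drawn uniformly from $[-l, l]$ with $l = \sqrt{6 / (n_{in} + n_{out})}$. The function name `glorot_uniform` is ours, and this is not the Keras implementation itself.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Draw a (fan_in, fan_out) weight matrix from U(-limit, limit)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(42)

# Same shapes as the first dense layer of our model: 784 inputs, 256 neurons
W = glorot_uniform(784, 256, rng)
b = np.zeros(256)  # biases start at zero

limit = np.sqrt(6.0 / (784 + 256))
print(W.shape, b.shape)                      # (784, 256) (256,)
print(W.min() >= -limit, W.max() <= limit)   # True True
```

The limit shrinks as the layers get wider, which keeps the initial outputs of each neuron at a reasonable scale.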

\begin{equation}
i(x) = \max(0, x)
\end{equation}

Thus, if the value is negative, the ReLU returns 0. Otherwise, it returns the value itself. To compute its output, a neuron first multiplies each input $x_j$ by the weight $w_j$ of the corresponding connection, sums the results, and adds its bias $b$. The ReLU is then applied to this value. Thus, we can say that:

\begin{equation}
\text{Output} = i\left(\sum_j w_j x_j + b\right)
\end{equation}

Since we are dealing with multiple neurons, we can write this operation for a whole layer as a multiplication between two matrices (inputs and weights), followed by the sum with a bias vector and the element-wise application of the activation. Thus, essentially, what happens in deep learning is a sequence of matrix multiplications and sums. 
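As an illustration, the computation of a dense layer can be sketched in NumPy. This sketch assumes the convention used by the Keras `Dense` layer, where the activation is applied to the weighted sum of the inputs plus the bias. The input values, weights, and biases below are made up for the example.

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0, x)

def dense_forward(X, W, b):
    """Weighted sum of the inputs plus bias, followed by the activation."""
    return relu(X @ W + b)

X = np.array([[1.0, 2.0,  3.0],
              [0.5, 0.0, -1.0]])   # 2 samples, 3 inputs each
W = np.array([[ 1.0, -1.0],
              [ 0.5,  2.0],
              [-1.0,  0.0]])       # 3 inputs, 2 neurons
b = np.array([0.5, 1.0])           # one bias per neuron

out = dense_forward(X, W, b)
print(out)
# [[0.  4. ]
#  [2.  0.5]]
```

For the first sample, the first neuron receives $1 + 1 - 3 + 0.5 = -0.5$, which the ReLU clips to 0.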

## Neural Network optimization

Thus, our Neural Network is entirely dependent on the weights and biases. To get the optimal weights and biases, our model uses the gradient descent algorithm. 

The gradient descent algorithm is an iterative algorithm which continuously changes the values of our parameters until it reaches the optimum. The size of the change applied to the weights in each iteration is controlled by the learning rate (the step size). If we have a high learning rate, we may not be able to find the optimal point, since the algorithm keeps overshooting it and cannot stop at the exact optimum. If we have a low learning rate, we will likely find it, but it may take very long. 

Also, note that the gradient descent algorithm only guarantees that we find the optimum if we are optimizing a simple unimodal function, with only one local minimum (which is then also the global minimum). If we have a multimodal function, we may end up in a local minimum which is different from the true global minimum.

Some things can help the algorithm avoid those local minima. For instance, the noise introduced by Dropout layers can help us keep exploring new possible sets of parameters even if we have found a local minimum. Also, using a stochastic gradient descent algorithm makes it less likely that our algorithm stops upon reaching a local minimum. 

Note that we are using the Adam algorithm as the optimizer. Adam (adaptive moment estimation) is a very good general-purpose optimizer: it keeps running estimates of the first and second moments of the gradient and uses them to adapt the step size for each parameter. It is also a variant of stochastic gradient descent. To understand whether our model has converged, we can look at the decrease in loss at each epoch. If the loss has stabilized, the model has likely converged, and probably will not improve any further (unless we change the hyperparameters or the architecture).
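The effect of the step size per iteration (the learning rate, in Keras terms) can be seen in a toy example. The sketch below runs plain gradient descent on $f(x) = x^2$, whose minimum is at $x = 0$. The function and the three step sizes are chosen only for illustration.

```python
def gradient_descent(grad, x0, learning_rate, n_steps):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)
    return x

def grad(x):
    return 2 * x  # derivative of f(x) = x^2

small = gradient_descent(grad, x0=5.0, learning_rate=0.001, n_steps=50)
good = gradient_descent(grad, x0=5.0, learning_rate=0.1, n_steps=50)
large = gradient_descent(grad, x0=5.0, learning_rate=1.1, n_steps=50)

print(small)  # still far from 0: the steps are too small
print(good)   # very close to 0
print(large)  # far away from 0: every step overshoots the minimum
```

With the largest step size, each update jumps past the minimum and lands farther away than before, so the iterates diverge instead of settling.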
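A small sketch makes this concrete. The function $f(x) = x^4 - 3x^2 + x$ (an arbitrary example, not related to our network) has two minima, and plain gradient descent ends up in one or the other depending only on the starting point.

```python
def f(x):
    return x**4 - 3 * x**2 + x

def f_prime(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(grad, x0, learning_rate=0.01, n_steps=500):
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)
    return x

x_left = gradient_descent(f_prime, x0=-2.0)   # starts left of both minima
x_right = gradient_descent(f_prime, x0=2.0)   # starts right of both minima

print(x_left, f(x_left))     # near x = -1.30, the global minimum
print(x_right, f(x_right))   # near x = 1.13, a local minimum with higher f
```

Both runs converge, but only the first one finds the global minimum.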
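To make the idea of adaptively estimated moments more concrete, here is a minimal NumPy sketch of the Adam update rule as it is usually stated. The defaults are meant to mirror the Keras ones, the function `adam_minimize` is a hypothetical helper, and the toy objective $f(x) = \sum_k x_k^2$ is an assumption made for the example. Keras applies the same kind of update to the network weights, using gradients computed on mini-batches.

```python
import numpy as np

def adam_minimize(grad, x0, learning_rate=0.001, beta_1=0.9, beta_2=0.999,
                  epsilon=1e-7, n_steps=1000):
    """Minimize a function with the Adam update rule, given its gradient."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)  # running mean of the gradients (first moment)
    v = np.zeros_like(x)  # running mean of the squared gradients (second moment)
    for t in range(1, n_steps + 1):
        g = grad(x)
        m = beta_1 * m + (1 - beta_1) * g
        v = beta_2 * v + (1 - beta_2) * g**2
        m_hat = m / (1 - beta_1**t)  # bias correction for the zero initialization
        v_hat = v / (1 - beta_2**t)
        x = x - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return x

# Toy objective: f(x) = sum(x^2), with gradient 2x and minimum at the origin
x_opt = adam_minimize(lambda x: 2 * x, x0=[5.0, -3.0],
                      learning_rate=0.1, n_steps=500)
print(x_opt)  # both coordinates end up close to 0
```

Dividing by the square root of the second-moment estimate gives each parameter its own effective step size.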

The Adam optimizer has some parameters, which may assist in the optimization process. For more information about the Adam optimizer, see:

https://keras.io/api/optimizers/adam/

For instance, to change the step size used by the optimizer, we can use the learning rate parameter (```learning_rate```, formerly ```lr```). Thus, we may do:

In [None]:
SEED = 42
tf.random.set_seed(SEED)

model = keras.Sequential([
                          keras.layers.Flatten(input_shape = (28, 28)),
                          keras.layers.Dense(256, activation = tf.nn.relu),
                          keras.layers.Dropout(0.2),
                          keras.layers.Dense(10,  activation = tf.nn.softmax)
])

adam = keras.optimizers.Adam(learning_rate = 0.1)

model.compile(optimizer = adam, loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

fit_history = model.fit(X_train, y_train, epochs = 5, validation_split = 0.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Note that, by default, the learning rate for Adam is 0.001. Thus, here, we are increasing the learning rate of our optimizer by a factor of 100. When we use a very high learning rate, our model is not able to find the optimal value. That occurs because each step is too large, so the optimizer keeps overshooting the optimum. Our accuracy is very close to 10%, which is the accuracy we would expect from a classifier that guesses at random (since we have 10 possible labels). 

Also, we can make keras identify when the model has converged (or is not able to improve upon the current optimum), so that it may stop early. To that end, we can do:

In [None]:
SEED = 42
tf.random.set_seed(SEED)

model = keras.Sequential([
                          keras.layers.Flatten(input_shape = (28, 28)),
                          keras.layers.Dense(256, activation = tf.nn.relu),
                          keras.layers.Dropout(0.2),
                          keras.layers.Dense(10,  activation = tf.nn.softmax)
])

adam = keras.optimizers.Adam(learning_rate = 0.1)

callbacks = [keras.callbacks.EarlyStopping(monitor = 'val_loss', patience = 1)]

model.compile(optimizer = adam, loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

fit_history = model.fit(X_train, y_train, epochs = 5, validation_split = 0.2, callbacks = callbacks)

Epoch 1/5
Epoch 2/5


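Nice! This time, instead of performing 5 epochs, we only performed 2, since the callback detected that the model was no longer improving. To decide that, it monitored the validation loss (```monitor = 'val_loss'```): with ```patience = 1```, training stops as soon as one epoch fails to improve on the best validation loss seen so far.


Let's try a learning rate that is much lower than the previous one, but still slightly higher than the default value: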
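The stopping rule can be sketched in plain Python. This is a simplified illustration of the patience logic, not the actual Keras implementation, and the validation losses below are made-up values.

```python
def epochs_before_stop(val_losses, patience=1):
    """Return the number of epochs run before early stopping kicks in."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # stop right after this epoch
    return len(val_losses)

# Hypothetical validation losses, one per epoch
val_losses = [2.31, 2.35, 2.30, 2.29, 2.28]

print(epochs_before_stop(val_losses, patience=1))  # 2
print(epochs_before_stop(val_losses, patience=2))  # 5
```

A larger patience tolerates a temporary increase in the monitored value, at the cost of possibly running a few extra epochs.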

In [None]:
SEED = 42
tf.random.set_seed(SEED)

model = keras.Sequential([
                          keras.layers.Flatten(input_shape = (28, 28)),
                          keras.layers.Dense(256, activation = tf.nn.relu),
                          keras.layers.Dropout(0.2),
                          keras.layers.Dense(10,  activation = tf.nn.softmax)
])

adam = keras.optimizers.Adam(learning_rate = 0.002)

model.compile(optimizer = adam, loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

fit_history = model.fit(X_train, y_train, epochs = 5, validation_split = 0.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Still, our accuracy decreased (compared with our baseline). Thus, our former model, with ```learning_rate = 0.001```, seems to be the best one for now.

# Improving the efficiency of our model

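Another thing we can change here is the amount of data used in each update step of the optimizer. We can control this using the batch size parameter, whose default value in keras is 32. Here, for instance, we will use ```batch_size = 480```, where 480 corresponds to 1% of the 48,000 samples used for training (the other 20% of the 60,000 training images are held out for validation).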

In [None]:
SEED = 42
tf.random.set_seed(SEED)

model = keras.Sequential([
                          keras.layers.Flatten(input_shape = (28, 28)),
                          keras.layers.Dense(256, activation = tf.nn.relu),
                          keras.layers.Dropout(0.2),
                          keras.layers.Dense(10,  activation = tf.nn.softmax)
])

adam = keras.optimizers.Adam(learning_rate = 0.002)

model.compile(optimizer = adam, loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

fit_history = model.fit(X_train, y_train, epochs = 5, batch_size = 480, validation_split = 0.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Here, the change in accuracy was very minor (train accuracy was higher, and validation accuracy was lower). However, the time spent in each epoch dropped considerably, since each epoch now performs far fewer weight updates.
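The number of weight updates per epoch follows directly from the batch size. The sketch below assumes the 60,000 training images of Fashion MNIST, the 20% validation split used above, and the default keras batch size of 32.

```python
import math

n_samples = 60000
n_train = n_samples * 80 // 100  # 48000 samples left after the 20% validation split

steps_default = math.ceil(n_train / 32)   # default batch size
steps_large = math.ceil(n_train / 480)    # batch size used above

print(n_train)        # 48000
print(steps_default)  # 1500
print(steps_large)    # 100
```

With ```batch_size = 480``` the optimizer performs 15 times fewer updates per epoch, each one computed over a larger batch.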

# Saving the model in its best state

Note that, during our optimization process, our weights and biases keep changing, and our model may actually get worse from one epoch to the next.

We can guard against this by making a checkpoint of our model state, saving the model whenever it reaches its lowest validation loss so far (the callback below monitors ```val_loss```). To that end, we may use:

In [None]:
SEED = 42
tf.random.set_seed(SEED)

del model

model = keras.Sequential([
                          keras.layers.Flatten(input_shape = (28, 28)),
                          keras.layers.Dense(256, activation = tf.nn.relu),
                          keras.layers.Dropout(0.2),
                          keras.layers.Dense(10,  activation = tf.nn.softmax)
])

adam = keras.optimizers.Adam(learning_rate = 0.001)

callbacks = [keras.callbacks.ModelCheckpoint(filepath = 'best_model.hdf5', monitor = 'val_loss', save_best_only = True)]

model.compile(optimizer = adam, loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

fit_history = model.fit(X_train, y_train, epochs = 5, validation_split = 0.2, batch_size = 480, callbacks = callbacks)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Nice! Note that, here, the final model state was saved, as it presented the lowest validation loss.
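The behaviour of ```save_best_only = True``` can be illustrated with a short sketch in plain Python. It is a simplification of what the callback does, and the validation losses are hypothetical values, not the ones from the run above.

```python
def epochs_that_save(val_losses):
    """Return the epochs at which a 'save best only' checkpoint would be written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # new best validation loss: overwrite the checkpoint
            best = loss
            saved.append(epoch)
    return saved

# Hypothetical validation losses, one per epoch
val_losses = [0.52, 0.45, 0.47, 0.41, 0.43]

saved = epochs_that_save(val_losses)
print(saved)      # [1, 2, 4]
print(saved[-1])  # 4: the file on disk holds the model from epoch 4
```

In this made-up run, the last epoch is slightly worse than the fourth one, so the checkpoint file keeps the model from epoch 4 rather than the final state.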