<a href="https://colab.research.google.com/github/Doometnick/ConvNet-Workflow/blob/master/1_Improved_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Remark: I recommend running this in Google Colab. Click the link above.

# Improving Model Accuracy

In the first session, we built a simple regularized convolutional network that does not seem to overfit. Its accuracy, however, is too low. There are many ways to improve it; four of them are briefly explained below.

First, we need to import the data again and create the training and test sets. This is the same code as in the first session.

In [0]:
import keras
import tensorflow as tf

import functools
import matplotlib.pyplot as plt
import numpy as np
import os

In [0]:
assert tf.test.is_gpu_available(), "GPU not enabled. Enable under Runtime > Change runtime type"

In [0]:
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

In [0]:
x_train = x_train / 255.0
x_test = x_test / 255.0

In [0]:
def plot_training_history(history, title=None):
    fig = plt.figure(figsize=(9,9))
    plt.plot(history['acc'], label='acc')
    plt.plot(history['val_acc'], label='val_acc')
    plt.plot(history['loss'], label='loss')
    plt.plot(history['val_loss'], label='val_loss')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy / Loss')
    plt.legend(loc='best')
    if title is not None:
        plt.suptitle(title)

### 1. Data augmentation

One way to avoid overfitting is to increase the amount of data. We can easily increase the number of data points by copying and manipulating the images that we already have. For example, rotating, zooming in and out, or cropping individual images generates new images that differ slightly from the originals. The additional data points should help to further improve the quality of our model.
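To make the idea concrete, here is a minimal NumPy sketch of two such manipulations. It is only an illustration: the random array stands in for a real CIFAR-10 image, and the helper names `flip_horizontally` and `shift_right` are made up for this example and are not part of Keras.

```python
import numpy as np

# Hypothetical stand-in for a single 32x32 RGB image with values in [0, 1].
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))

def flip_horizontally(img):
    """Mirror the image along its width axis (layout: height x width x channels)."""
    return img[:, ::-1, :]

def shift_right(img, pixels):
    """Shift the image to the right by `pixels`, filling the vacated columns with zeros."""
    shifted = np.zeros_like(img)
    shifted[:, pixels:, :] = img[:, :img.shape[1] - pixels, :]
    return shifted

flipped = flip_horizontally(image)
shifted = shift_right(image, 2)

# One original image has become three training examples with the same label.
augmented = np.stack([image, flipped, shifted])
print(augmented.shape)  # (3, 32, 32, 3)
```

The label of each manipulated image stays the same as that of the original, which is why this only works for manipulations that do not change what the image shows.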

### 2. More complex network architecture

The current model architecture is fairly simple, with only two convolutional layers followed by max pooling and one hidden dense layer before the final output. More complicated architectures can be very powerful in detecting features. For example, convolutional networks applied to images tend to detect local detail features in shallow layers (close to the input layer) and global features in deeper layers. This effect is nicely illustrated in a [neural style transfer paper by Gatys et al.](https://arxiv.org/abs/1508.06576), where they use shallow and deep layers from the [VGG model](https://arxiv.org/abs/1409.1556) to extract local and global features to create artistic images.

As the CIFAR-10 images we are working with can also be partly distinguished by global features (for example shapes of animals, cars, etc.), we can make our network deeper to see if it increases the accuracy.

### 3. Batch normalization

Similar to normalizing input data, [batch normalization](https://arxiv.org/abs/1502.03167?context=cs) can significantly increase the training speed of a network by 'reducing covariate shift'. This means that the weights of deeper layers become more stable with respect to changes in the weights of earlier layers. To achieve this, the values of hidden layers are scaled to have zero mean and a standard deviation of one. This normalization is done on mini-batches, which can have a regularizing effect depending on the size of the mini-batches. Furthermore, it can be applied either before or after the activation function, with different effects, but we will not go deeper into those topics here.
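The normalization step itself can be sketched in a few lines of NumPy. This is a simplified illustration of the training-time forward pass only, not the Keras implementation: `batch_norm_forward` is a hypothetical helper, `gamma` and `beta` are the learnable scale and shift from the paper (fixed to 1 and 0 here), and the running averages used at inference time are left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                      # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit standard deviation
    return gamma * x_hat + beta

# Hypothetical hidden-layer values: a mini-batch of 256 samples with 4 features.
rng = np.random.default_rng(0)
activations = rng.normal(loc=5.0, scale=3.0, size=(256, 4))

normalized = batch_norm_forward(activations, gamma=np.ones(4), beta=np.zeros(4))
print(normalized.mean(axis=0))  # values very close to 0
print(normalized.std(axis=0))   # values very close to 1
```

Because the mean and variance are computed per mini-batch, each sample is normalized slightly differently depending on which other samples share its batch. This is the source of the regularizing effect mentioned above.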

### 4. Dropout

Dropout is another form of regularization that can help to avoid overfitting. If dropout is applied to a hidden layer in a network, a certain percentage of that layer's nodes are randomly switched off, meaning their outputs are set to 0. The network is trained with the dropped-out nodes for one training step and the weights are adjusted accordingly. Afterwards, the dropped nodes are restored and a new random set of nodes is removed for the next training step.

The motivation behind dropout is to train many different models. Each of these models fits certain features in the data, and each might overfit to these features. But the collection of all models, their average, leads to a regularized model that is not overfitted.
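A minimal NumPy sketch of this mechanism is shown below. It assumes the common "inverted dropout" variant, in which the surviving outputs are scaled up by 1 / (1 - rate) so that their expected sum stays the same; `dropout_forward` is a hypothetical helper written for this illustration, not a Keras function.

```python
import numpy as np

def dropout_forward(activations, rate, rng):
    """Zero out roughly a fraction `rate` of the units and rescale the survivors."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True = unit is kept
    return activations * mask / keep_prob, mask

rng = np.random.default_rng(0)
hidden = np.ones((4, 8))  # hypothetical outputs of a hidden layer

# A fresh random mask is drawn at every training step.
for step in range(2):
    dropped, mask = dropout_forward(hidden, rate=0.5, rng=rng)
    print(f"step {step}: kept {int(mask.sum())} of {mask.size} units")
```

At test time no units are dropped, so the full network acts as an approximate average of all the thinned networks seen during training.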

## New Improved Model
With all that said, let's build our improved model containing all the points from above.


First, let's augment our data set using ImageDataGenerator from Keras' preprocessing module. It provides a generator that returns an endless stream of image batches with certain manipulations applied.

The manipulations used in this example are flipping the images horizontally and rotating them by up to 5 degrees. Shifting the picture in width and height and zooming are included in the code as well, but commented out.

Such an image generator is not a data set like the ones we have been using before, but a generator that yields one batch of data at a time. Therefore, we will have to change our model.fit() function to a model.fit_generator() function in a later step.
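The difference between a data set and a generator can be illustrated with a plain Python generator. The sketch below is not how ImageDataGenerator is implemented; `batch_generator` and the demo arrays are hypothetical and only show the principle of yielding randomly manipulated batches forever.

```python
import itertools
import numpy as np

def batch_generator(x, y, batch_size, rng):
    """Yield shuffled (images, labels) batches forever, flipping about half the images."""
    n = len(x)
    while True:  # never stops: the caller decides how many batches form one epoch
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            batch_x = x[idx]  # fancy indexing returns a copy, so x stays untouched
            flip = rng.random(len(idx)) < 0.5
            batch_x[flip] = batch_x[flip][:, :, ::-1, :]  # mirror along the width axis
            yield batch_x, y[idx]

rng = np.random.default_rng(0)
x_demo = rng.random((10, 32, 32, 3))  # ten hypothetical images
y_demo = np.arange(10)                # ten hypothetical labels

gen = batch_generator(x_demo, y_demo, batch_size=4, rng=rng)
batches = list(itertools.islice(gen, 5))  # take only the first five batches
print([len(bx) for bx, _ in batches])  # [4, 4, 2, 4, 4]
```

Since such a generator never ends on its own, the training loop has to be told how many batches make up one epoch.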

In [0]:
data_generator = keras.preprocessing.image.ImageDataGenerator(
    horizontal_flip=True,
    rotation_range=5,
    # width_shift_range=0.05,
    # height_shift_range=0.05,
    # zoom_range=0.05,
)

data_generator.fit(x_train)

In [0]:
def build_improved_model():

    n_filters = 64
    wdecay = 0.001

    Activation = functools.partial(tf.keras.layers.Activation, activation="relu")
    Conv2D = functools.partial(tf.keras.layers.Conv2D, 
                               activation=None, 
                               padding="same", 
                               kernel_regularizer=tf.keras.regularizers.l2(wdecay))
    MaxPool = functools.partial(tf.keras.layers.MaxPool2D, pool_size=(2,2))
    Flatten = tf.keras.layers.Flatten
    Dense = tf.keras.layers.Dense
    BatchNormalization = tf.keras.layers.BatchNormalization
    Dropout = tf.keras.layers.Dropout

    model = tf.keras.Sequential([
        
        Conv2D(filters=n_filters*1, kernel_size=(3,3), strides=(1,1), input_shape=(32, 32, 3)),
        BatchNormalization(),
        Activation(),
        Conv2D(filters=n_filters*1, kernel_size=(3,3), strides=(1,1)),
        BatchNormalization(),
        Activation(),
        Conv2D(filters=n_filters*1, kernel_size=(3,3), strides=(1,1)),
        BatchNormalization(),
        Activation(),
        MaxPool(),
        

        Conv2D(filters=n_filters*2, kernel_size=(3,3), strides=(1,1)),
        BatchNormalization(),
        Activation(),
        Conv2D(filters=n_filters*2, kernel_size=(3,3), strides=(1,1)),
        BatchNormalization(),
        Activation(),
        Conv2D(filters=n_filters*2, kernel_size=(3,3), strides=(1,1)),
        BatchNormalization(),
        Activation(),
        MaxPool(),
        

        Conv2D(filters=n_filters*3, kernel_size=(3,3), strides=(1,1)),
        BatchNormalization(),
        Activation(),
        Conv2D(filters=n_filters*3, kernel_size=(3,3), strides=(1,1)),
        BatchNormalization(),
        Activation(),
        Conv2D(filters=n_filters*3, kernel_size=(3,3), strides=(1,1)),
        BatchNormalization(),
        Activation(),
        MaxPool(),

        Flatten(),

        Dense(128, activation=None, kernel_regularizer=tf.keras.regularizers.l2(wdecay)),
        Dropout(0.5),
        BatchNormalization(),
        Activation(),

        Dense(10, activation="softmax")
    ])

    return model

There are four major blocks in this model: three stacks of convolutional layers, each followed by max pooling, and the dense layers at the end. The dense layers transform the flattened output from the convolutions into the ten labels describing the images.
The convolutional stacks each consist of three convolutional layers without any max pooling in between. Such a stack could be replaced by a single convolutional layer that is larger (i.e. has a larger receptive field/kernel_size). But three layers with smaller receptive fields use fewer parameters in total and yield a more discriminative decision function, since we apply the activation (relu) three times instead of once. This is described in more detail in a [paper by Simonyan & Zisserman](https://arxiv.org/abs/1409.1556).
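The parameter argument can be checked with a bit of arithmetic. The sketch below assumes, as in the paper's argument, that every layer has the same number of input and output channels (64 here) and ignores biases; the two helper functions are hypothetical and written only for this illustration.

```python
def conv_weights(kernel_size, channels):
    """Weights of one conv layer with `channels` input and output channels (no biases)."""
    return kernel_size * kernel_size * channels * channels

def stacked_receptive_field(kernel_size, n_layers):
    """Receptive field of `n_layers` stacked stride-1 convolutions."""
    return 1 + n_layers * (kernel_size - 1)

channels = 64
three_small = 3 * conv_weights(3, channels)  # three stacked 3x3 layers
one_large = conv_weights(7, channels)        # a single 7x7 layer

print(stacked_receptive_field(3, 3))  # 7
print(three_small)                    # 110592
print(one_large)                      # 200704
```

Three stacked 3x3 layers therefore see the same 7x7 region as a single 7x7 layer while using roughly 45% fewer weights.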


Our new model requires a lot more training than the old one, since it has many more parameters. Therefore, we will create two callback functions that are periodically called while training.

### Reducing learning rate on plateau

Since we will be training much longer than before, it might be beneficial to have a dynamic learning rate. If a certain metric of our model (here: validation loss) does not improve for a certain number of epochs (here: 5), the learning rate is automatically decreased by a factor of 10.
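The logic behind such a schedule can be sketched in plain Python. This is a simplified illustration and not the actual Keras callback: `simulate_reduce_on_plateau` is a hypothetical function, the validation losses are made up, and details of the real callback such as a cooldown period or a minimum improvement threshold are left out.

```python
def simulate_reduce_on_plateau(val_losses, initial_lr, factor=0.1, patience=5, min_lr=1e-8):
    """Return the learning rate in effect after each epoch."""
    lr = initial_lr
    best = float("inf")
    wait = 0  # epochs since the last improvement
    lrs = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # never go below min_lr
                wait = 0
        lrs.append(lr)
    return lrs

# Made-up validation losses: improvement for three epochs, then a plateau.
losses = [1.0, 0.9, 0.8] + [0.8] * 5 + [0.7]
schedule = simulate_reduce_on_plateau(losses, initial_lr=0.01)
print(schedule)
```

In this run the learning rate stays at 0.01 for the first seven epochs and drops to about 0.001 in the eighth, after five epochs without improvement.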

In [0]:
lr_schedule = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", 
                                                   factor=0.1,
                                                   patience=5, 
                                                   min_lr=1e-08,
                                                   verbose=1)

### Saving training data
In order to not lose the model's weights after a training session, we can save them by introducing a callback function that saves the weights every couple of iterations. If we decide to continue training, we can just pick up from where we left off. 

Note, however, that all data saved in Colab will be removed after 24 hours. We can avoid this issue by saving the data to Google Drive, but this will not be covered here.

In [9]:
# Save training checkpoints, as training takes longer now.
checkpoint_path = "training_1/cp.ckpt"
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create a callback that saves the model's weights.
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                                 save_weights_only=True,
                                                 verbose=1,
                                                 period=10)



### Compile and train model
Since we work with a [generator](https://wiki.python.org/moin/Generators) now instead of a normal variable that carries all the data, we need to change the model.fit function to model.fit_generator.

In [0]:
def compile_and_train_generator(model, data_generator, epochs=20):

    # Batch normalization allows for larger learning rate.
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-06, amsgrad=True)

    model.compile(optimizer=optimizer,
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])
    
    # If we have saved weights, load them to continue training.
    latest = tf.train.latest_checkpoint(checkpoint_dir)
    if latest is not None:
        model.load_weights(latest)

    batch_size = 256
    
    history = model.fit_generator(
        data_generator.flow(x_train, y_train, batch_size=batch_size), 
        steps_per_epoch=int(np.ceil(len(x_train) / batch_size)),  # must be a whole number of batches
        epochs=epochs,
        validation_data=(x_test, y_test),
        callbacks=[lr_schedule, cp_callback])

    return history

In [0]:
# In case we do multiple runs, save histories in a list and concatenate later.
histories = []  

In [0]:
improved_model = build_improved_model()
improved_history = compile_and_train_generator(improved_model, data_generator, epochs=20)
histories.append(improved_history)

In [0]:
total_history = {}

for history in histories:
    for key in ["acc", "val_acc", "loss", "val_loss"]:
        total_history.setdefault(key, []).extend(history.history[key].copy())

plot_training_history(total_history, title="Regularized Model")
improved_model.evaluate(x_test, y_test)