<a href="https://colab.research.google.com/github/nyp-sit/sdaai-pdc2-students/blob/master/iti107/session-3/baseline_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" align="left"/></a>

# Baseline model

Welcome to this week's programming exercise. In this exercise, we will train a model to recognise whether an image depicts a positive emotion (e.g. happy, pleasant, beautiful) or a negative one (e.g. sad, angry, death). We will first train a baseline model without transfer learning. The dataset is a collection of around 1600 images from Flickr, each labelled Positive or Negative. We apply data augmentation to the training set only. In the next exercise, we will use transfer learning to train another model and compare the performance of the two.

At the end of this exercise, you will be able to:
- understand and use ImageDataGenerator to generate augmented images from a directory
- understand the typical directory structure expected by the ImageDataGenerator
- train a ConvNet using the ImageDataGenerator
- visualize the training/validation loss and accuracy over training epochs

In [None]:
from __future__ import print_function

import os
import json
import shutil
import numpy as np

from utils import prepare_data, download_trained_model_and_history

from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, Conv2D, MaxPool2D, Flatten, Dense, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from sklearn.metrics import classification_report

import matplotlib
import matplotlib.pyplot as plt
import pickle

%matplotlib inline

## Preparing Data

To avoid cluttering the notebook, we put the `prepare_data()` code in a separate Python file. This function prepares the directory structure by creating **train** and **valid** subfolders under the `data_path` directory to hold the training and validation data respectively. It also automatically unzips the image files and copies them into the 'Negative' and 'Positive' subfolders of the training and validation folders.

In [None]:
data_path = "data"
models_path = "models"
if not os.path.exists(models_path):
    os.mkdir(models_path)
valid_size = 0.2    # validation split 
FORCED_DATA_REWRITE = True  # remove old data if it exists

In [None]:
train_path, valid_path = prepare_data(data_path=data_path, 
                                      valid_size=valid_size, 
                                      FORCED_DATA_REWRITE=FORCED_DATA_REWRITE)

In [None]:
train_neg_path = os.path.join(train_path, "Negative")
train_pos_path = os.path.join(train_path, "Positive")
valid_neg_path = os.path.join(valid_path, "Negative")
valid_pos_path = os.path.join(valid_path, "Positive")

We randomly select `n_examples` and display them.

In [None]:
n_examples = 5
np.random.seed(42)
positive_expamples = np.random.choice(os.listdir(train_pos_path), size=n_examples, replace=False)
negative_expamples = np.random.choice(os.listdir(train_neg_path), size=n_examples, replace=False)

In [None]:
plt.figure(figsize=(5, n_examples * 2))
for i in range(n_examples):
    plt.subplot(n_examples, 2, i * 2 + 1)
    img = load_img(os.path.join(train_pos_path, positive_expamples[i]))
    plt.imshow(img)
    plt.axis("off")
    if i == 0:
        plt.title("Positive", fontsize=18)
    plt.subplot(n_examples, 2, i * 2 + 2)
    img = load_img(os.path.join(train_neg_path, negative_expamples[i]))
    plt.imshow(img)
    plt.axis("off")
    if i == 0:
        plt.title("Negative", fontsize=18)

## Create a Data Generator

We will use the Keras ImageDataGenerator to serve the training and validation data from the directory. For the training data, we will apply some data augmentation techniques such as rotation, shifting, shearing, etc. We will also need to normalize the image pixel values to between 0.0 and 1.0.

***Note about Python generator***

A Python generator is an object that acts as an iterator: you can use it in a `for … in` loop or pull values from it one at a time with `next()`. Generators are built with the `yield` keyword.
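
To make the note above concrete, here is a minimal sketch of a generator that serves batches endlessly, which is roughly the pattern a data generator follows. The function name `batch_generator` and the file names are hypothetical and only for illustration; this is not how Keras implements its iterators.

```python
def batch_generator(items, batch_size):
    """Yield successive batches from `items`, cycling forever (hypothetical helper)."""
    while True:
        for start in range(0, len(items), batch_size):
            # `yield` hands back one batch and pauses until the next request
            yield items[start:start + batch_size]


gen = batch_generator(["a.jpg", "b.jpg", "c.jpg"], batch_size=2)
first = next(gen)   # first batch of two items
second = next(gen)  # the remaining single item
third = next(gen)   # the generator wraps around and starts again
print(first, second, third)
```

Because the generator never stops on its own, whoever consumes it has to decide when to stop drawing batches.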

In [None]:
train_datagen = ImageDataGenerator(rescale=1./255, 
                                   rotation_range=40, 
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True,
                                   fill_mode="nearest")

**Exercise**

Create an ImageDataGenerator for validation data too. Do you need to apply data transformation for validation data?

<details><summary>Click here for answer</summary>

We only apply transformation for training data, and not validation data.

valid_datagen =  ImageDataGenerator(rescale=1./255)

</details>

In [None]:
### START YOUR CODE HERE ###

valid_datagen = None

### END YOUR CODE HERE ###

We use the `flow_from_directory()` method to generate batches of data from the specified train and validation directories. Each directory should contain one subdirectory per class. Any PNG, JPG, BMP, PPM or TIF images inside each subdirectory's directory tree will be included in the generator. Since we only have 2 classes, we specify `class_mode` as 'binary' so that the generator returns binary labels (0 and 1). The mapping from class names to labels is based on the names of the subdirectories (in our case 'Negative' and 'Positive'). The `batch_size` determines how many samples the generator returns on each iteration.
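
As a rough illustration of how subdirectory names turn into labels, the sketch below builds a temporary folder with the two class subdirectories and assigns indices in sorted (alphanumeric) order, which is the ordering the Keras documentation describes when `classes` is not given. The helper `infer_class_indices` is hypothetical and only mimics that behaviour; it is not the Keras implementation.

```python
import os
import tempfile


def infer_class_indices(directory):
    """Map each subdirectory name to an integer label in sorted order (hypothetical helper)."""
    classes = sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name))
    )
    return {name: index for index, name in enumerate(classes)}


with tempfile.TemporaryDirectory() as root:
    # Create the class folders in "wrong" order to show that sorting decides the labels
    for class_name in ("Positive", "Negative"):
        os.makedirs(os.path.join(root, class_name))
    mapping = infer_class_indices(root)

print(mapping)
```

Under this assumption 'Negative' maps to 0 and 'Positive' maps to 1, which can be confirmed against the generator's `class_indices` attribute.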

See [Tensorflow documentation](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator) for more details of the different parameters.

In [None]:
img_height, img_width = 300, 400

train_gen = train_datagen.flow_from_directory(train_path, 
                                              target_size=(img_height, img_width), 
                                              class_mode="binary", 
                                              batch_size=32, 
                                              shuffle=True, 
                                              seed=21)

valid_gen = valid_datagen.flow_from_directory(valid_path, 
                                              target_size=(img_height, img_width), 
                                              class_mode="binary", 
                                              batch_size=32, 
                                              shuffle=False, 
                                              seed=21)

In [None]:
# Print the class names to class labels mapping
train_gen.class_indices

If you are running this on a machine without a GPU, the training could take quite a while. To save time, you can set `LOAD_BASELINE_MODEL = True`, and the notebook will download the model we have previously trained, as well as its training history.

In [None]:
LOAD_BASELINE_MODEL = True

if LOAD_BASELINE_MODEL: 
    #download_trained_model_and_history(os.path.join(models_path, 'baseline.model'))
    download_trained_model_and_history(os.path.join(models_path, "baseline.model.h5"))

In [None]:
if LOAD_BASELINE_MODEL:
    try:
        model_baseline = load_model(os.path.join(models_path, "baseline.model.h5"))
        print("Model has been loaded!")
    except Exception:
        LOAD_BASELINE_MODEL = False
        print("Load has failed. Model will be built from scratch.")
        
if not LOAD_BASELINE_MODEL:
    
    inp = Input(shape=train_gen.target_size + (3,))

    conv1 = Conv2D(filters=64, kernel_size=(3, 3), strides=(2, 2))(inp)
    conv2 = Conv2D(filters=64, kernel_size=(3, 3), strides=(2, 2))(conv1)
    maxpool1 = MaxPool2D(pool_size=(2, 2))(conv2)

    conv3 = Conv2D(filters=128, kernel_size=(3, 3), strides=(2, 2))(maxpool1)
    conv4 = Conv2D(filters=128, kernel_size=(3, 3), strides=(2, 2))(conv3)
    maxpool2 = MaxPool2D(pool_size=(2, 2))(conv4)

    flattened = Flatten()(maxpool2)
    
    fc1 = Dense(units=256, activation="relu", 
                kernel_initializer="he_normal")(flattened)
    dp1 = Dropout(rate=0.5)(fc1)
    
    fc2 = Dense(units=512, activation="relu", 
                kernel_initializer="he_normal")(dp1)
    dp2 = Dropout(rate=0.5)(fc2)
    
    out = Dense(units=1, activation="sigmoid")(dp2)
    
    model_baseline = Model(inputs=[inp], outputs=[out])
    
    model_baseline.compile(optimizer="Adam", 
                           loss="binary_crossentropy", 
                           metrics=["accuracy"])
    
    print("Model has been built.")

In [None]:
model_baseline.summary()

## Train the model

Because data is drawn endlessly from the generator, you need to tell the Keras model how many batches to draw from the generator before declaring an epoch over. This is the role of `steps_per_epoch`.

Below, we set `steps_per_epoch` to the number of samples divided by the batch size, rounded up, so that one epoch corresponds to one pass over the images in the directory. For example, if we have 100 different images in the directory and our batch size is 10, our steps_per_epoch = 100/10, i.e. 10. Note, however, that the model does not see the original images as they are stored on disk: ImageDataGenerator randomly transforms each image every time it is drawn, so from one epoch to the next the model sees differently transformed versions of the same image.
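
The arithmetic can be sketched in a few lines. The 105-image case is an assumption added here to show why the division is rounded up; the function name `steps_per_epoch` is only illustrative.

```python
import math


def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to cover n_samples once, rounding up."""
    return math.ceil(n_samples / batch_size)


even_steps = steps_per_epoch(100, 10)    # divides evenly
uneven_steps = steps_per_epoch(105, 10)  # one extra, smaller batch is needed
last_batch_size = 105 - (uneven_steps - 1) * 10

print(even_steps, uneven_steps, last_batch_size)
```

Rounding up means the final batch of an epoch may hold fewer samples than `batch_size`.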

In [None]:
train_steps_per_epoch = int(np.ceil(train_gen.n * 1. / train_gen.batch_size))
valid_steps_per_epoch = int(np.ceil(valid_gen.n * 1. / valid_gen.batch_size))

In [None]:
if not LOAD_BASELINE_MODEL:
    # Model.fit accepts generators directly; fit_generator is deprecated since TF 2.1
    hist_baseline = model_baseline.fit(train_gen,
                                       steps_per_epoch=train_steps_per_epoch,
                                       epochs=20,
                                       validation_data=valid_gen,
                                       validation_steps=valid_steps_per_epoch)
    # save the trained model
    # we save the model in h5 format instead of the default SavedModel format due to an issue as highlighted here:
    # https://github.com/tensorflow/tensorflow/issues/33454
    model_baseline.save(os.path.join(models_path, "baseline.model.h5"))
    
    # save the history of training
    with open('baseline.history', 'wb') as f:
        pickle.dump(hist_baseline.history, f)
    hist_baseline = hist_baseline.history
else:
    with open('baseline.history', 'rb') as f:
        hist_baseline = pickle.load(f)
    print("Model has already been trained.")

In [None]:
plt.figure(figsize=(16, 6))
plt.suptitle("Training evolution for homegrown model", fontsize=18)

plt.subplot(121)
plt.plot(hist_baseline["loss"], label="Train")
plt.plot(hist_baseline["val_loss"], label="Validation")
plt.legend()
plt.ylabel("Crossentropy loss", fontsize=14)
plt.xlabel("Epoch", fontsize=14)

plt.subplot(122)
plt.plot(np.array(hist_baseline["accuracy"]) * 100, label="Train")
plt.plot(np.array(hist_baseline["val_accuracy"]) * 100, label="Validation")
plt.legend()
plt.ylabel("Accuracy, %", fontsize=14)
plt.xlabel("Epoch", fontsize=14);

As you can see from the plot, the validation accuracy fluctuates around the 50% mark. Our model is no better than a random guess!

### Classification Report on Test Data 

Strictly speaking, you should set aside some data as a test set for evaluating your model. Since our dataset is quite small, we did not. But to get a better idea of how our model fares on each class, let's use our validation data to get some hard numbers :)

In [None]:
# Model.predict accepts generators directly; predict_generator is deprecated since TF 2.1
y_pred = model_baseline.predict(valid_gen, steps=valid_steps_per_epoch)
y_valid = np.array(valid_gen.classes)

In [None]:
print(classification_report(y_valid, y_pred.flatten() > 0.5))


It looks like our model almost always predicts 1 (Positive)!
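
One way to check for this kind of collapse is to threshold the sigmoid outputs and count how often each class is predicted, then compare per-class recall. The sketch below uses made-up probabilities and labels purely for illustration; they are not results from this notebook.

```python
from collections import Counter

# Made-up sigmoid outputs and true labels, for illustration only
probs = [0.91, 0.73, 0.66, 0.58, 0.52, 0.81, 0.49, 0.77]
labels = [1, 0, 1, 0, 0, 1, 0, 1]

# Same 0.5 threshold as used for the classification report
preds = [int(p > 0.5) for p in probs]
pred_counts = Counter(preds)

# Recall per class: fraction of true members of a class that were predicted as that class
recall_per_class = {}
for cls in (0, 1):
    total = sum(1 for y in labels if y == cls)
    hits = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    recall_per_class[cls] = hits / total

print(pred_counts, recall_per_class)
```

A model that nearly always outputs the same class shows a lopsided prediction count, with high recall for that class and low recall for the other, even when overall accuracy sits near 50%.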