# Finetuning AlexNet for Oxford-102
**Author: Tien Dinh**

This is a demonstration of the finetuning process done on pretrained weights from AlexNet (2012).

* *Note: The `.py` version of the project will be available in the same repository.*

## The dataset

* The Oxford-102 dataset is a flower dataset with 102 classes of flowers.
* The dataset can be found [here](http://www.robots.ox.ac.uk/~vgg/data/flowers/102/).

## The network

* AlexNet is a deep convolutional network created by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton in 2012.
* AlexNet was the winner of **2012 ILSVRC** (ImageNet Large-Scale Visual Recognition Challenge).
* The network features 5 convolutional layers (three of them followed by max pooling) and 3 fully connected layers (the network architecture image is included in this project).
* Click [here](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) to read the research paper.

### The ImageNet mean
* The mean of the ImageNet dataset, defined as `[104., 117., 124.]`, was used to normalize the images. The values are in BGR channel order, which matches the order returned by `cv2.imread`.
* Subtracting the mean centers the pixel values around `0` (they originally range from `0` to `255`).

<center>`imagenet_mean = np.array([104., 117., 124.], dtype=np.float32)`</center>
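
As a minimal sketch of what this normalization does, the snippet below subtracts the mean from a synthetic image using NumPy broadcasting. The random image is only a stand-in for a real photo, and the BGR channel order is an assumption based on how `cv2.imread` loads images.

```python
import numpy as np

imagenet_mean = np.array([104., 117., 124.], dtype=np.float32)

# Synthetic 227x227 image with 3 channels, standing in for a real photo
rng = np.random.RandomState(0)
image = rng.randint(0, 256, size=(227, 227, 3)).astype(np.float32)

# Broadcasting subtracts one mean value per channel (the last axis)
centered = image - imagenet_mean

print(centered.shape)                  # same shape as the input
print(centered.min(), centered.max())  # values now lie within [-124, 151]
```

The pixel values keep their original scale; only their center moves, which is the form the pretrained weights expect.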

## The network architecture in TensorBoard
<p align="center">
  <img src="./images/the_graph.png">
</p>

In [None]:
# @hidden_cell
import os
from datetime import datetime

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

import cv2

from scipy.io import loadmat

import tensorflow as tf

imagenet_mean = np.array([104., 117., 124.], dtype=np.float32)

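# exist_ok=True lets this cell be re-run without raising FileExistsError
os.makedirs('./summary', exist_ok=True)
os.makedirs('./models', exist_ok=True)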

### Load up the train and test indexes

* We use `loadmat` from `scipy.io` to load the MATLAB files.
* The official split has more testing images (6149) than training images (1020), so I swapped the two sets to train on the larger one and improve accuracy.
* The indexes are converted to lists for easier iteration later on.

In [2]:
set_ids = loadmat('setid.mat')
test_ids = set_ids['trnid'].tolist()[0]
train_ids = set_ids['tstid'].tolist()[0]

### Preprocessing image indexes
* The provided images are numbered from `00001` to `0xxxx`, so we need a helper function that left-pads our indexes with zeros to five digits.

In [3]:
def indexes_processing(int_list):
    # Left-pad every index with zeros to five digits, e.g. 7 -> '00007'
    return [str(element).zfill(5) for element in int_list]

In [4]:
raw_train_ids = indexes_processing(train_ids)
raw_test_ids = indexes_processing(test_ids)
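
As a quick illustration of the padding, a format spec can build the full file name in one step. This is only a sketch: `image_filename` is a hypothetical helper, not part of the notebook, and the ids are arbitrary examples.

```python
def image_filename(image_id):
    """Hypothetical helper: build the file name for a 1-based image id."""
    # {:05d} left-pads the integer with zeros to five digits
    return 'image_{:05d}.jpg'.format(image_id)

for image_id in (7, 42, 512, 8189):
    print(image_filename(image_id))
```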

### Load the labels for train and test sets

In [6]:
image_labels = (loadmat('imagelabels.mat')['labels'] - 1).tolist()[0]

In [7]:
labels = ['pink primrose', 'hard-leaved pocket orchid', 'canterbury bells', 'sweet pea', 'english marigold', 'tiger lily', 'moon orchid', 'bird of paradise', 'monkshood', 'globe thistle', 'snapdragon', "colt's foot", 'king protea', 'spear thistle', 'yellow iris', 'globe-flower', 'purple coneflower', 'peruvian lily', 'balloon flower', 'giant white arum lily', 'fire lily', 'pincushion flower', 'fritillary', 'red ginger', 'grape hyacinth', 'corn poppy', 'prince of wales feathers', 'stemless gentian', 'artichoke', 'sweet william', 'carnation', 'garden phlox', 'love in the mist', 'mexican aster', 'alpine sea holly', 'ruby-lipped cattleya', 'cape flower', 'great masterwort', 'siam tulip', 'lenten rose', 'barbeton daisy', 'daffodil', 'sword lily', 'poinsettia', 'bolero deep blue', 'wallflower', 'marigold', 'buttercup', 'oxeye daisy', 'common dandelion', 'petunia', 'wild pansy', 'primula', 'sunflower', 'pelargonium', 'bishop of llandaff', 'gaura', 'geranium', 'orange dahlia', 'pink-yellow dahlia?', 'cautleya spicata', 'japanese anemone', 'black-eyed susan', 'silverbush', 'californian poppy', 'osteospermum', 'spring crocus', 'bearded iris', 'windflower', 'tree poppy', 'gazania', 'azalea', 'water lily', 'rose', 'thorn apple', 'morning glory', 'passion flower', 'lotus', 'toad lily', 'anthurium', 'frangipani', 'clematis', 'hibiscus', 'columbine', 'desert-rose', 'tree mallow', 'magnolia', 'cyclamen ', 'watercress', 'canna lily', 'hippeastrum ', 'bee balm', 'ball moss', 'foxglove', 'bougainvillea', 'camellia', 'mallow', 'mexican petunia', 'bromelia', 'blanket flower', 'trumpet creeper', 'blackberry lily']

## Image Preprocessing
### Two Different Approaches, two distinct results
#### 1. Normalize by dividing by `255`
* Dividing by `255` to scale the pixel values to between `0` and `1` is how I usually preprocess images before feeding them to a convolutional neural network.
* The top accuracy for this method falls somewhere between `80%` and `82%`. This is not bad at all for a simple network architecture.
* Below is a snapshot of the output during a run of this method. The network converged to about `80%` at step `20000` and did not improve much further, even with learning rate decay.

```
On Step 32500
At: 2018-02-21 02:46:02.002311
Accuracy: 81.96%
Saving model...
Model saved at step: 32500
```

```
On Step 33000
At: 2018-02-21 02:50:38.211141
Accuracy: 82.25%
Saving model...
Model saved at step: 33000
```

```
On Step 33500
At: 2018-02-21 02:55:13.426248
Accuracy: 82.35%
Saving model...
Model saved at step: 33500
```

#### 2. Normalize by subtracting the mean
* This is by far the best method for AlexNet, since the images used to train the original network were normalized this way.
* Simply run `image -= mean` and `image` is ready to feed to the network.
* The top accuracy for this method is around `90%`. This is absolutely amazing: I got a boost of about `8` percentage points just by using a different normalization approach.
* The network also converged incredibly fast (see the output below).

In [8]:
class ImageProcessor():
    
    def __init__(self, num_classes=102):           
        self.i = 0
        self.num_classes = num_classes
        
        self.training_images = np.zeros((6149, 227, 227, 3))
        self.training_labels = None
        
        self.testing_images = np.zeros((1020, 227, 227, 3))
        self.testing_labels = None
    
    def one_hot_encode(self, labels):
        '''
        One hot encode the output labels to be numpy arrays of 0s and 1s
        '''
        out = np.zeros((len(labels), self.num_classes))
        for index, element in enumerate(labels):
            out[index, element] = 1
        return out
    
    def set_up_images(self):
        print('Processing Training Images...')
        i = 0
        for element in raw_train_ids:
            img = cv2.imread('/input/image_{}.jpg'.format(element))
            img = cv2.resize(img, (227, 227)).astype(np.float32)
            img -= imagenet_mean
            self.training_images[i] = img
            i += 1
        print('Done!')
        
        i = 0
        print('Processing Testing Images...')
        for element in raw_test_ids:
            img = cv2.imread('/input/image_{}.jpg'.format(element))
            img = cv2.resize(img, (227, 227)).astype(np.float32)
            img -= imagenet_mean
            self.testing_images[i] = img
            i += 1
        print('Done!')
        
        print('Processing Training and Testing Labels...')
        encoded_labels = self.one_hot_encode(image_labels)
        # Image ids are 1-based, so shift by one to index the label array
        train_labels = [encoded_labels[train_id - 1] for train_id in train_ids]
        test_labels = [encoded_labels[test_id - 1] for test_id in test_ids]
        self.training_labels = np.array(train_labels)
        self.testing_labels = np.array(test_labels)
        print('Done!')
        
    def next_batch(self, batch_size):
        x = self.training_images[self.i:self.i + batch_size]
        y = self.training_labels[self.i:self.i + batch_size]
        self.i = (self.i + batch_size) % len(self.training_images)
        return x, y
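
To make the behaviour of `next_batch` concrete, here is a small simulation of its index arithmetic, assuming the 6149 training images and the batch size of 128 used in this notebook. `simulate_batches` is a hypothetical stand-in that tracks only slice bounds, not image data.

```python
def simulate_batches(num_images, batch_size, num_batches):
    """Return (start, size) for each batch, mirroring next_batch's indexing."""
    i = 0
    batches = []
    for _ in range(num_batches):
        end = min(i + batch_size, num_images)  # slicing stops at the array end
        batches.append((i, end - i))
        i = (i + batch_size) % num_images      # same wrap-around as next_batch
    return batches

batches = simulate_batches(num_images=6149, batch_size=128, num_batches=50)
print(batches[47])  # (6016, 128): last full batch of the first pass
print(batches[48])  # (6144, 5): short batch at the end of the array
print(batches[49])  # (123, 128): first batch after wrapping around
```

The batch that reaches the end of the array is shorter than the others, and the next pass starts at an offset rather than at index 0, so successive passes see slightly different batch boundaries. The data is not shuffled between passes.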

In [9]:
# Initialize ImageProcessor instance
image_processor = ImageProcessor()

In [10]:
# Call set_up_images
image_processor.set_up_images()

Processing Training Images...
Done!
Processing Testing Images...
Done!
Processing Training and Testing Labels...
Done!


## The Architecture
<p align="center">
  <img src="./images/alex_ar.png">
</p>
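
As a rough check on the architecture, the sketch below traces the spatial size of the feature maps through the layer settings used in the `AlexNet` class in this notebook (kernel size, stride, and padding per layer). `out_size` is a hypothetical helper that applies TensorFlow's `VALID` and `SAME` padding rules.

```python
import math

def out_size(size, kernel, stride, padding):
    """Output width/height of a conv or pool layer, TensorFlow-style."""
    if padding == 'VALID':
        return (size - kernel) // stride + 1
    return math.ceil(size / stride)  # 'SAME'

size = 227
size = out_size(size, 11, 4, 'VALID')  # conv1 -> 55
size = out_size(size, 3, 2, 'VALID')   # pool1 -> 27
size = out_size(size, 5, 1, 'SAME')    # conv2 -> 27
size = out_size(size, 3, 2, 'VALID')   # pool2 -> 13
size = out_size(size, 3, 1, 'SAME')    # conv3 -> 13
size = out_size(size, 3, 1, 'SAME')    # conv4 -> 13
size = out_size(size, 3, 1, 'SAME')    # conv5 -> 13
size = out_size(size, 3, 2, 'VALID')   # pool5 -> 6

print(size, size * size * 256)  # 6 and 9216, the flattened input to fc6
```

This is where the `6*6*256` in the reshape before `fc6` comes from.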

In [11]:
class AlexNet():
    
    def __init__(self, X, keep_prob, num_classes, skip_layer, weights_path='DEFAULT'):
        self.X = X
        self.KEEP_PROB = keep_prob
        self.NUM_CLASSES = num_classes
        self.SKIP_LAYER = skip_layer
        if weights_path == 'DEFAULT':
            self.WEIGHTS_PATH = '/weights/bvlc_alexnet.npy'
        else:
            self.WEIGHTS_PATH = weights_path
        
        self.initialize()
        
    def initialize(self):
        
        # 1st Layer: Conv (w ReLu) -> Lrn -> Pool
        conv_1 = self.conv_layer(self.X, 11, 11, 96, 4, 4, name='conv1', padding='VALID')
        norm_1 = self.lrn(conv_1, 2, 1e-05, 0.75, name='norm1')
        pool_1 = self.max_pool(norm_1, 3, 3, 2, 2, name='pool1', padding='VALID')
        
        # 2nd Layer: Conv (w ReLu) -> Lrn -> Pool
        conv_2 = self.conv_layer(pool_1, 5, 5, 256, 1, 1, name='conv2', groups=2)
        norm_2 = self.lrn(conv_2, 2, 1e-05, 0.75, name='norm2')
        pool_2 = self.max_pool(norm_2, 3, 3, 2, 2, name='pool2', padding='VALID')
        
        # 3rd Layer: Conv (w ReLu)
        conv_3 = self.conv_layer(pool_2, 3, 3, 384, 1, 1, name='conv3')

        # 4th Layer: Conv (w ReLu)
        conv_4 = self.conv_layer(conv_3, 3, 3, 384, 1, 1, name='conv4', groups=2)

        # 5th Layer: Conv (w ReLu) -> Pool
        conv_5 = self.conv_layer(conv_4, 3, 3, 256, 1, 1, name='conv5', groups=2)
        pool_5 = self.max_pool(conv_5, 3, 3, 2, 2, name='pool5', padding='VALID')

        # 6th Layer: Flatten -> FC (w ReLu) -> Dropout
        pool_6_flat = tf.reshape(pool_5, [-1, 6*6*256])
        full_6 = self.fully_connected(pool_6_flat, 6*6*256, 4096, name='fc6')
        full_6_dropout = self.drop_out(full_6, self.KEEP_PROB)
        
        # 7th Layer: FC (w ReLu) -> Dropout
        full_7 = self.fully_connected(full_6_dropout, 4096, 4096, name='fc7')
        full_7_dropout = self.drop_out(full_7, self.KEEP_PROB)
        
        # 8th Layer: FC and return unscaled activations
        self.y_pred = self.fully_connected(full_7_dropout, 4096, self.NUM_CLASSES, relu=False, name='fc8')
        
    def load_weights(self, session):
        
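        # Load the weights into memory (the .npy file stores a pickled dict,
        # which newer NumPy versions only load with allow_pickle=True)
        weights_dict = np.load(self.WEIGHTS_PATH, encoding='bytes', allow_pickle=True).item()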
        
        # Loop over all layer names stored in the weights dict
        for op_name in weights_dict:
            # Check if layer should be trained from scratch
            if op_name not in self.SKIP_LAYER:
                with tf.variable_scope(op_name, reuse=True):
                    for data in weights_dict[op_name]:
                        if len(data.shape) == 1:
                            var = tf.get_variable('biases')
                            session.run(var.assign(data))
                        else:
                            var = tf.get_variable('weights')
                            session.run(var.assign(data))
                            
    def conv_layer(self, x, filter_height, filter_width, num_filters, stride_y, stride_x, name, padding='SAME', groups=1):
        num_channels = int(x.get_shape()[-1])
        convolve = lambda i, k: tf.nn.conv2d(i, k, strides=[1,stride_y,stride_x,1], padding=padding)
        with tf.variable_scope(name) as scope:
            weights = tf.get_variable('weights', shape=[filter_height,
                                                        filter_width,
                                                        num_channels // groups,
                                                        num_filters])
            biases = tf.get_variable('biases', shape=[num_filters])
        if groups == 1:
            conv = convolve(x, weights)
        else:
            input_groups = tf.split(axis=3, num_or_size_splits=groups, value=x)
            weight_groups = tf.split(axis=3, num_or_size_splits=groups, value=weights)
            output_groups = [convolve(i, k) for i, k in zip(input_groups, weight_groups)]
            conv = tf.concat(axis=3, values=output_groups)
        bias = tf.reshape(tf.nn.bias_add(conv, biases), tf.shape(conv))
        return tf.nn.relu(bias, name=scope.name)

    def max_pool(self, x, filter_height, filter_width, stride_y, stride_x, name, padding='SAME'):
        return tf.nn.max_pool(x, ksize=[1,filter_height,filter_width,1], 
                              strides=[1,stride_y,stride_x,1], padding=padding,
                              name=name)

    def lrn(self, x, radius, alpha, beta, name, bias=1.0):
        return tf.nn.local_response_normalization(x, depth_radius=radius, 
                                                  alpha=alpha, beta=beta, 
                                                  bias=bias, name=name)

    def fully_connected(self, input_layer, num_in, num_out, name, relu=True):
        with tf.variable_scope(name) as scope:
            weights = tf.get_variable('weights', shape=[num_in, num_out], trainable=True)
            biases = tf.get_variable('biases', shape=[num_out], trainable=True)
            activation = tf.nn.xw_plus_b(input_layer, weights, biases, name=scope.name)
        if relu:
            return tf.nn.relu(activation)
        else:
            return activation
    
    def drop_out(self, x, keep_prob):
        return tf.nn.dropout(x, keep_prob=keep_prob)
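
The `groups` argument of `conv_layer` splits the input channels and the filters into independent halves. The NumPy sketch below illustrates the same split-and-concatenate idea with a 1x1 kernel to keep it short; it is a simplified stand-in for the TensorFlow code above, and `grouped_pointwise_conv` is a hypothetical name.

```python
import numpy as np

def grouped_pointwise_conv(x, weights, groups):
    """1x1 grouped convolution on an NHWC array, using split and concat."""
    input_groups = np.split(x, groups, axis=3)         # split input channels
    weight_groups = np.split(weights, groups, axis=1)  # split output filters
    outputs = [np.einsum('nhwc,co->nhwo', i, k)
               for i, k in zip(input_groups, weight_groups)]
    return np.concatenate(outputs, axis=3)

rng = np.random.RandomState(0)
x = rng.rand(1, 4, 4, 4)   # 4 input channels
w = rng.rand(2, 6)         # each filter sees only 4 / 2 = 2 input channels
y = grouped_pointwise_conv(x, w, groups=2)
print(y.shape)  # (1, 4, 4, 6)

# Changing the second half of the input channels leaves the first
# half of the output channels untouched
x2 = x.copy()
x2[..., 2:] += 1.0
y2 = grouped_pointwise_conv(x2, w, groups=2)
print(np.allclose(y[..., :3], y2[..., :3]))  # True
```

With `groups=2`, `conv2` in this notebook holds `5*5*48*256 = 307200` weights, half of what an ungrouped layer over all 96 input channels would need.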

### Placeholders for inputs, outputs, and keep probability

In [12]:
x = tf.placeholder(tf.float32, [None, 227, 227, 3])
y_true = tf.placeholder(tf.float32, [None, 102])
keep_prob = tf.placeholder(tf.float32)

### The Hyperparameters
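* The number of training steps is set to 50000 (the code calls this `num_epochs`, but each iteration processes one batch, not a full pass over the data).
* The dropout rate is set to 0.5 (that is, `keep_prob=0.5` during training).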

*The parameter choices are adapted from [here](https://github.com/jimgoo/caffe-oxford102).*

#### Learning rate decay
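$$\text{calculated} = \text{base} \times \text{decay rate}^{\left\lfloor \frac{\text{global step}}{\text{decay step}} \right\rfloor}$$

Where:
* $\text{calculated}$ is the calculated learning rate.
* $\text{base}$ is the base learning rate (`0.001`).
* $\text{decay rate}$ is the factor applied at each decay (`0.5`).
* $\text{global step}$ is the number of training steps taken so far.
* $\text{decay step}$ is the number of steps between decays (`20000`).

The floor appears because the code uses `staircase=True`, so the rate drops in discrete steps instead of decaying continuously.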
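
As a plain-Python sketch of this schedule, assuming the values passed to `tf.train.exponential_decay` in this notebook (base rate `0.001`, decay rate `0.5`, decay step `20000`, `staircase=True`), the learning rate at a given step can be computed as follows. `decayed_learning_rate` is a hypothetical helper, not a TensorFlow function.

```python
def decayed_learning_rate(base, global_step, decay_step=20000, decay_rate=0.5):
    """Staircase exponential decay: the rate drops once every decay_step steps."""
    return base * decay_rate ** (global_step // decay_step)

for step in (0, 19999, 20000, 40000):
    print(step, decayed_learning_rate(0.001, step))
```

The rate stays at `0.001` for the first 20000 steps, then halves to `0.0005`, and halves again at step 40000.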

In [13]:
global_step = tf.Variable(0, trainable=False)
base_lr = 0.001
base_lr = tf.train.exponential_decay(base_lr, global_step, 20000, 0.5, staircase=True)
num_epochs = 50000
drop_rate = 0.5
train_layers = ['fc8']

## Picking layers to train from scratch
### 1. Choosing the last two layers, `fc7` and `fc8`
* The network performs quite well, with a top accuracy of `77%`.
* The learning rate is the same for all variables.
* All other variables are set to `trainable=False` to prevent learning.

### 2. Choosing only the last layer, `fc8`
* The network performs well, with a top accuracy of `90%`.
* The learning rates differ between variables: those with pretrained weights learn more slowly.
* All variables are trainable.

In [14]:
model = AlexNet(x, keep_prob, 102, train_layers)

In [15]:
with tf.name_scope('network_output'):
    y_pred = model.y_pred

## Custom learning rate
### Pretrained layers
* The pretrained layers include `conv1`, `conv2`, `conv3`, `conv4`, `conv5`, `fc6`, `fc7`.
* The pretrained `weights` will have a learning rate of `1*base_lr`.
* The pretrained `biases` will have a learning rate of `2*base_lr`.

### Untrained layers
* The untrained layer includes `fc8`.
* The untrained `weights` will have a learning rate of `10*base_lr`.
* The untrained `biases` will have a learning rate of `20*base_lr`.

*`conv` means convolution layer, `fc` means fully connected layer.*

*These learning rate choices are adapted from [here](https://github.com/jimgoo/caffe-oxford102).*
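
The multipliers above can be summarized as a small lookup on the variable name. This is only a sketch: `lr_multiplier` is a hypothetical helper, and it assumes variable names of the form `layer/weights:0` or `layer/biases:0`, as produced by the variable scopes in the `AlexNet` class.

```python
def lr_multiplier(var_name, untrained_layers=('fc8',)):
    """Hypothetical helper: learning rate multiplier for a variable name.

    Expects names such as 'conv1/weights:0' or 'fc8/biases:0'.
    """
    layer, kind = var_name.split(':')[0].split('/')
    multiplier = 10 if layer in untrained_layers else 1
    if kind == 'biases':
        multiplier *= 2   # biases learn twice as fast as weights
    return multiplier

for name in ('conv1/weights:0', 'fc7/biases:0', 'fc8/weights:0', 'fc8/biases:0'):
    print(name, lr_multiplier(name))
```

Selecting variables by name like this is less fragile than selecting them by position in the variable list, since it does not depend on the order in which the variables were created.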

In [16]:
# Splitting variables into groups that share the same learning rate.
all_vars = tf.global_variables()
# Skip the first variable (global_step); the rest alternate weights and biases per layer
all_vars = all_vars[1:]
conv_vars = [all_vars[0], all_vars[2], all_vars[4], all_vars[6], all_vars[8], all_vars[10], all_vars[12]]
bias_vars = [all_vars[1], all_vars[3], all_vars[5], all_vars[7], all_vars[9], all_vars[11], all_vars[13]]
last_weights = [all_vars[14]]
last_bias = [all_vars[15]]

In [17]:
with tf.name_scope('cross_entropy'):
    cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_true,logits=y_pred))
    
tf.summary.scalar('cross_entropy', cross_entropy)

<tf.Tensor 'cross_entropy_1:0' shape=() dtype=string>
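
For intuition, the loss above can be reproduced in NumPy. This is a minimal sketch of softmax cross-entropy averaged over a batch, not the TensorFlow implementation; the logits and labels are made-up toy values.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy between one-hot labels and softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(labels * log_probs).sum(axis=1).mean())

logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
print(softmax_cross_entropy(logits, labels))
```

A row of equal logits over `k` classes contributes a loss of `ln(k)`, which is a handy reference point: with 102 classes, an untrained output layer should start near `ln(102)`, roughly `4.6`.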

In [18]:
with tf.name_scope('train'):
    gradients = tf.gradients(cross_entropy, conv_vars + bias_vars + last_weights + last_bias)
    conv_vars_gradients = gradients[:len(conv_vars)]
    bias_vars_gradients = gradients[len(conv_vars):len(conv_vars) + len(bias_vars)]
    last_weights_gradients = gradients[len(conv_vars) + len(bias_vars):len(conv_vars) + len(bias_vars) + len(last_weights)]
    last_bias_gradients = gradients[len(conv_vars) + len(bias_vars) + len(last_weights):len(conv_vars) + len(bias_vars) + len(last_weights) + len(last_bias)]
    
trained_weights_optimizer = tf.train.GradientDescentOptimizer(base_lr)
trained_biases_optimizer = tf.train.GradientDescentOptimizer(2*base_lr)
weights_optimizer = tf.train.GradientDescentOptimizer(10*base_lr)
biases_optimizer = tf.train.GradientDescentOptimizer(20*base_lr)

train_op1 = trained_weights_optimizer.apply_gradients(zip(conv_vars_gradients, conv_vars))
train_op2 = trained_biases_optimizer.apply_gradients(zip(bias_vars_gradients, bias_vars))
train_op3 = weights_optimizer.apply_gradients(zip(last_weights_gradients, last_weights))
train_op4 = biases_optimizer.apply_gradients(zip(last_bias_gradients, last_bias))

train = tf.group(train_op1, train_op2, train_op3, train_op4)

In [19]:
with tf.name_scope('accuracy'):
    matches = tf.equal(tf.argmax(y_pred, 1), tf.argmax(y_true, 1))
    acc = tf.reduce_mean(tf.cast(matches, tf.float32))
    
tf.summary.scalar('accuracy', acc)

<tf.Tensor 'accuracy_1:0' shape=() dtype=string>
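
The accuracy op compares the highest-scoring class with the labeled class for each image. A NumPy sketch of the same computation, using made-up logits and labels:

```python
import numpy as np

def accuracy(logits, one_hot_labels):
    """Fraction of rows where the top-scoring class matches the label."""
    matches = np.argmax(logits, axis=1) == np.argmax(one_hot_labels, axis=1)
    return float(matches.mean())

logits = np.array([[0.1, 2.0, 0.3],
                   [1.5, 0.2, 0.1],
                   [0.3, 0.2, 0.9],
                   [0.8, 0.1, 0.4]])
labels = np.eye(3)[[1, 0, 2, 2]]   # one-hot rows for classes 1, 0, 2, 2
print(accuracy(logits, labels))    # 0.75
```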

In [None]:
merged_summary = tf.summary.merge_all()
writer = tf.summary.FileWriter('./summary')
init = tf.global_variables_initializer()
saver = tf.train.Saver(max_to_keep=3)

In [None]:
with tf.Session() as sess:
    sess.run(init)
    writer.add_graph(sess.graph)
    model.load_weights(sess)
    
    print('Training process started at {}'.format(datetime.now()))

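    # Op that advances global_step so the learning rate decay takes effect
    increment_global_step = tf.assign_add(global_step, 1)

    for i in range(num_epochs):
        batches = image_processor.next_batch(128)
        sess.run(train, feed_dict={x:batches[0], y_true:batches[1], keep_prob:0.5})
        # `global_step += 1` only rebinds the Python name, so run the assign op instead
        sess.run(increment_global_step)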
        if (i%500==0):
            print('On Step {}'.format(i))
            print('Current base learning rate: {0:.5f}'.format(sess.run(base_lr)))
            print('At: {}'.format(datetime.now()))
            
            accuracy = sess.run(acc, feed_dict={x:image_processor.testing_images, y_true:image_processor.testing_labels, keep_prob:1.0})
            print('Accuracy: {0:.2f}%'.format(accuracy * 100))
            
            print('Saving model...')
            saver.save(sess, './models/model_iter.ckpt', global_step=i)
            print('Model saved at step: {}'.format(i))
            print('\n')
            
    print('Saving final model...')
    saver.save(sess, './models/model_final.ckpt')
    print('Saved')
    print('Training finished at {}'.format(datetime.now()))

Training process started at 2018-02-22 06:50:01.159043
On Step 0
Current base learning rate: 0.00100
At: 2018-02-22 06:50:03.418358
Accuracy: 2.35%
Saving model...
Model saved at step: 0


On Step 500
Current base learning rate: 0.00100
At: 2018-02-22 06:57:10.471321
Accuracy: 45.29%
Saving model...
Model saved at step: 500


On Step 1000
Current base learning rate: 0.00100
At: 2018-02-22 07:04:35.250569
Accuracy: 52.55%
Saving model...
Model saved at step: 1000


On Step 1500
Current base learning rate: 0.00100
At: 2018-02-22 07:12:13.779904
Accuracy: 68.43%
Saving model...
Model saved at step: 1500


On Step 2000
Current base learning rate: 0.00100
At: 2018-02-22 07:20:06.863819
Accuracy: 77.25%
Saving model...
Model saved at step: 2000


On Step 2500
Current base learning rate: 0.00100
At: 2018-02-22 07:28:20.584413
Accuracy: 81.96%
Saving model...
Model saved at step: 2500


On Step 3000
Current base learning rate: 0.00100
At: 2018-02-22 07:36:40.312540
Accuracy: 84.22%
Saving mode

## Conclusion
* The model converges incredibly fast and reaches a stable accuracy of about 90% at step 11000.
* Training took 6 hours on one Tesla K80 GPU.
* The whole process would take around 20 hours.

### Final Accuracy: 89.51%