## DICE - Notebook 3.2 - Model Training and Transfer Learning - Augmented Data

<br/>

```
*************************************************************************
**
** 2017 May 23
**
** In place of a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
*************************************************************************
```

<table style="width:100%; font-size:14px; margin: 20px 0;">
    <tr>
        <td style="text-align:center">
            <b>Contact: </b><a href="mailto:contact@jonathandekhtiar.eu" target="_blank">contact@jonathandekhtiar.eu</a>
        </td>
        <td style="text-align:center">
            <b>Twitter: </b><a href="https://twitter.com/born2data" target="_blank">@born2data</a>
        </td>
        <td style="text-align:center">
            <b>Tech. Blog: </b><a href="http://www.born2data.com/" target="_blank">born2data.com</a>
        </td>
    </tr>
    <tr>
        <td style="text-align:center">
            <b>Personal Website: </b><a href="http://www.jonathandekhtiar.eu" target="_blank">jonathandekhtiar.eu</a>
        </td>
        <td style="text-align:center">
            <b>RSS Feed: </b><a href="https://www.feedcrunch.io/@dataradar/" target="_blank">FeedCrunch.io</a>
        </td>
        <td style="text-align:center">
            <b>LinkedIn: </b><a href="https://fr.linkedin.com/in/jonathandekhtiar" target="_blank">JonathanDEKHTIAR</a>
        </td>
    </tr>
</table>

## Objectives

This notebook performs the actual transfer learning from the [ImageNet](http://www.image-net.org/) dataset to our custom dataset. To do so, we load a model pre-trained on ImageNet and retrain its last layers in order to obtain predictions on our new classes.

A wide variety of pre-trained models has been made available by the Google team: https://github.com/tensorflow/models/tree/master/slim

In this notebook we use one of the most famous deep learning models: GoogLeNet (a.k.a. Inception-V1), developed by Christian Szegedy et al. and published on arXiv: https://arxiv.org/abs/1409.4842

This notebook uses [TensorFlow-Slim](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim) to keep the code short and easier to understand.

Download Inception-V1 Model: http://download.tensorflow.org/models/inception_v1_2016_08_28.tar.gz

---

As a reminder before starting, the data have already been preprocessed (resized, augmented, etc.) in the first notebook: **[DICE - Notebook 1 - Dataset Augmentation](https://github.com/DEKHTIARJonathan/DICE-DMU_Imagery_Classification_Engine/blob/master/DICE%20-%20Notebook%201%20-%20Dataset%20Augmentation.ipynb)**

The preprocessed data have all been saved as **JPEG images**, so we will focus only on these data.

## 1. Notebook Initialisation

### 1.1. Load the necessary libraries

In [1]:
import os, sys, time, math

import numpy as np

import tensorflow as tf
from tensorflow.contrib.framework.python.ops.variables import get_or_create_global_step
from tensorflow.python.platform import tf_logging as logging

slim = tf.contrib.slim

### 1.2. Initialise global variables and application flags

In [2]:
flags = tf.app.flags

#State your dataset directory
flags.DEFINE_string('dataset_dir', 'data_prepared', 'String: Your dataset directory')

#Output directory where the model checkpoints and logs will be saved
flags.DEFINE_string('output_dir', 'output/augmented', 'String: The output directory where model checkpoints will be saved')

#Directory containing the TF-Slim Inception files (nets, preprocessing, datasets) and the pre-trained model
flags.DEFINE_string('inception_dir', 'inception_files', 'String: The directory containing the Inception code and the pre-trained checkpoint')

#Directory containing the labels file (labels.txt)
flags.DEFINE_string('labels_dir', 'data_prepared', 'String: The directory containing the labels file')

#Common prefix of the TFRecord file names
flags.DEFINE_string('tf_record_start_name', 'dmunet_augmented_dataset_', 'String: The filename prefix of your TFRecord files')

#State the number of epochs to train
flags.DEFINE_integer('training_epochs', 10, 'Int: Number of epochs to train')

#State your batch size => Choose the highest value which doesn't give you a memory error.
flags.DEFINE_integer('batch_size', 110, 'Int: Number of examples per batch')

#Learning rate information and configuration (Up to you to experiment)
flags.DEFINE_float('initial_learning_rate', 1e-4, 'Float: The initial learning rate')

flags.DEFINE_float('learning_rate_decay_factor', 0.8, 'Float: The factor by which the learning rate is multiplied at each decay')

flags.DEFINE_integer('num_epochs_before_decay', 1, 'Int: Number of epochs between two learning rate decays')

# Choose between "tf.train.SaverDef.V2" and "tf.train.SaverDef.V1". The V1 version is deprecated since Tensorflow r1.0.0
flags.DEFINE_integer('tf_saver', tf.train.SaverDef.V1, 'Int: The checkpoint format version used by the savers')

#Set the verbosity to INFO level => from most to least verbose: DEBUG > INFO > WARN > ERROR > FATAL
flags.DEFINE_integer('tf_logging_level', tf.logging.INFO, 'Int: The TensorFlow logging verbosity level')

#Base filename of the checkpoints saved during training
flags.DEFINE_string('checkpoint_basename', 'dmunet_augmented_data.ckpt', 'String: The base filename of the saved checkpoints')

FLAGS = flags.FLAGS

### 1.3. Complementary imports from the inception directory set by the flags above

In [3]:
sys.path.append(FLAGS.inception_dir)

from preprocessing      import inception_preprocessing
from nets.inception_v1  import inception_v1, inception_v1_arg_scope
from datasets           import dataset_utils

## 2. Environment Check and Model Downloading

In [4]:
# ================ Additional Derived Variable ================

checkpoint_dir  = os.path.join(FLAGS.inception_dir, "models")
checkpoint_file = os.path.join(checkpoint_dir, "inception_v1.ckpt")
labels_file     = os.path.join(FLAGS.labels_dir, "labels.txt")

image_size      = inception_v1.default_image_size # 224 (width and height in pixels)

#Create the file pattern of your TFRecord files so that it could be recognized later on
file_pattern    = FLAGS.tf_record_start_name + '%s_*.tfrecord'

tf.logging.set_verbosity(FLAGS.tf_logging_level) 

#Create a dictionary describing the items of the dataset. This is required by the Dataset class later.

items_to_descriptions = {
    'image': 'A 3-channel RGB image.',
    'label': 'An integer class label; the mapping from labels to names is read from labels.txt.'
}

# =================== Environment Checking ====================

#Create the log directory here, otherwise it would be created needlessly on import.
#os.makedirs is used because the default output_dir ('output/augmented') is a nested path.
if not os.path.exists(FLAGS.output_dir):
    os.makedirs(FLAGS.output_dir)
    
if not os.path.exists(checkpoint_dir):
    os.makedirs(checkpoint_dir)
    
if not os.path.isfile(checkpoint_file):
    # We download first the TARGZ archive, if necessary, and then extract it.
    
    targz = "inception_v1_2016_08_28.tar.gz"
    url = "http://download.tensorflow.org/models/" + targz
    
    tarfilepath = os.path.join(checkpoint_dir, targz)
    
    if os.path.isfile(tarfilepath):
        import tarfile
        tarfile.open(tarfilepath, 'r:gz').extractall(checkpoint_dir)
    else:
        dataset_utils.download_and_uncompress_tarball(url, checkpoint_dir)
        
    # Get rid of tarfile source (the checkpoint itself will remain)
    os.unlink(tarfilepath)


if not os.path.isfile(labels_file):
    raise Exception("The labels file does not exist")
else:
    #Create a dictionary mapping each integer label to its string name
    labels_to_name = dict()

    #Open the labels file and read it line by line (each line is formatted as "<label>:<name>")
    with open(labels_file, 'r') as labels:
        for line in labels:
            line = line.strip() #Remove the newline and surrounding whitespace
            if not line:
                continue #Skip empty lines
            label, string_name = line.split(':', 1)
            labels_to_name[int(label)] = string_name

    #State the number of classes to predict
    num_classes = len(labels_to_name)
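
The label parsing above can be tried in isolation. The sketch below is a minimal stand-alone version, assuming each line of `labels.txt` has the form `<label>:<name>`; the helper name `parse_labels` and the class names are hypothetical and only serve the illustration.

```python
def parse_labels(lines):
    """Map integer labels to class names from '<label>:<name>' lines."""
    mapping = {}
    for line in lines:
        line = line.strip()              # drop the newline and stray whitespace
        if not line:
            continue                     # ignore empty lines
        label, name = line.split(":", 1) # split on the first ':' only
        mapping[int(label)] = name
    return mapping


# Hypothetical file content, for illustration only (not the real DICE classes)
example_lines = ["0:class_a\n", "1:class_b\n", "2:class_c"]

labels_to_name_example = parse_labels(example_lines)
num_classes_example = len(labels_to_name_example)
print(labels_to_name_example, num_classes_example)
```

Stripping the whole line rather than dropping its last character keeps the last class name intact when the file does not end with a newline.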

In [5]:
#============== DATASET LOADING ======================
# We now create a function that creates a Dataset class which will give us many TFRecord files 
#to feed in the examples into a queue in parallel.

def get_split(split_name, dataset_dir, file_pattern=file_pattern):
    '''
    Obtains the split - training or validation - to create a Dataset class for feeding the examples into a queue later on. This function will
    set up the decoder and dataset information all into one Dataset class so that you can avoid the brute work later on.
    Your file_pattern is very important in locating the files later. 

    INPUTS:
    - split_name(str): 'train' or 'validation'. Used to get the correct data split of tfrecord files
    - dataset_dir(str): the dataset directory where the tfrecord files are located
    - file_pattern(str): the file name structure of the tfrecord files in order to get the correct data

    OUTPUTS:
    - dataset (Dataset): A Dataset class object where we can read its various components for easier batch creation later.
    '''

    #First check whether the split_name is train or validation
    if split_name not in ['train', 'validation']: 
        err = 'The split_name %s is not recognized. Please input either train or validation as the split_name' % (split_name)
        raise ValueError(err)
    
    file_pattern_for_counting = file_pattern % (split_name)
    
    #Count the total number of examples in all of these shards
    tfrecords_to_count = [
        os.path.join(dataset_dir, file) 
        for file in os.listdir(dataset_dir) 
        if file.startswith(file_pattern_for_counting[:-10]) # We remove the 10 last chars: *.tfrecord   
    ]
    
    num_samples = 0
    
    for tfrecord_file in tfrecords_to_count:
        for record in tf.python_io.tf_record_iterator(tfrecord_file):
            num_samples += 1

    #Create a reader, which must be a TFRecord reader in this case
    reader = tf.TFRecordReader

    #Create the keys_to_features dictionary for the decoder
    keys_to_features = {
      'image/encoded': tf.FixedLenFeature((), tf.string, default_value=''),
      'image/format': tf.FixedLenFeature((), tf.string, default_value='jpg'),
      'image/class/label': tf.FixedLenFeature([], tf.int64, default_value=tf.zeros([], dtype=tf.int64)),
    }

    #Create the items_to_handlers dictionary for the decoder.
    items_to_handlers = {
        'image': slim.tfexample_decoder.Image(),
        'label': slim.tfexample_decoder.Tensor('image/class/label'),
    }

    #Start to create the decoder
    decoder = slim.tfexample_decoder.TFExampleDecoder(keys_to_features, items_to_handlers)

    #Reuse the labels_to_name dictionary built from the labels file
    labels_to_name_dict = labels_to_name
    
    #Create the full path for a general file_pattern to locate the tfrecord_files
    file_pattern_path = os.path.join(dataset_dir, file_pattern_for_counting)

    #Actually create the dataset
    dataset = slim.dataset.Dataset(
        data_sources = file_pattern_path,
        decoder = decoder,
        reader = reader,
        num_readers = 4,
        num_samples = num_samples,
        num_classes = num_classes,
        labels_to_name = labels_to_name_dict,
        items_to_descriptions = items_to_descriptions)

    return dataset
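
To make the role of `file_pattern` more concrete, here is a small sketch of how the pattern expands for the `train` split and how the prefix used for counting relates to it. The shard file names are hypothetical; only the prefix `dmunet_augmented_dataset_` comes from the flags above.

```python
import fnmatch

tf_record_start_name = "dmunet_augmented_dataset_"   # default flag value
file_pattern = tf_record_start_name + "%s_*.tfrecord"

# Pattern for the training split, as built inside get_split
train_pattern = file_pattern % "train"

# Prefix used for counting: the pattern without its trailing "*.tfrecord"
prefix = train_pattern[:-len("*.tfrecord")]

# Hypothetical directory listing, for illustration only
candidates = [
    "dmunet_augmented_dataset_train_00000-of-00002.tfrecord",
    "dmunet_augmented_dataset_train_00001-of-00002.tfrecord",
    "dmunet_augmented_dataset_validation_00000-of-00001.tfrecord",
    "labels.txt",
]

by_prefix = [name for name in candidates if name.startswith(prefix)]
by_glob = fnmatch.filter(candidates, train_pattern)

print(train_pattern)
print(by_prefix == by_glob, len(by_prefix))
```

On this listing, the prefix test and the glob pattern select the same two training shards.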

In [6]:
def load_batch(dataset, batch_size, height=image_size, width=image_size, is_training=True):
    '''
    Loads a batch for training.

    INPUTS:
    - dataset(Dataset): a Dataset class object that is created from the get_split function
    - batch_size(int): determines how big of a batch to train
    - height(int): the height of the image to resize to during preprocessing
    - width(int): the width of the image to resize to during preprocessing
    - is_training(bool): to determine whether to perform a training or evaluation preprocessing

    OUTPUTS:
    - images(Tensor): a Tensor of the shape (batch_size, height, width, channels) that contain one batch of images
    - labels(Tensor): the batch's labels with the shape (batch_size,) (requires one_hot_encoding).

    '''
    #First create the data_provider object
    data_provider = slim.dataset_data_provider.DatasetDataProvider(
        dataset,
        common_queue_capacity = 24 + 3 * batch_size,
        common_queue_min = 24)

    #Obtain the raw image using the get method
    raw_image, label = data_provider.get(['image', 'label'])

    #Perform the correct preprocessing for this image depending if it is training or evaluating
    image = inception_preprocessing.preprocess_image(raw_image, height, width, is_training)

    #As for the raw image, we simply resize it to the target size so that it can be batched
    raw_image = tf.expand_dims(raw_image, 0)
    raw_image = tf.image.resize_nearest_neighbor(raw_image, [height, width])
    raw_image = tf.squeeze(raw_image)

    #Batch up the images by enqueuing the tensors internally in a FIFO queue and dequeuing many elements with tf.train.batch.
    images, raw_images, labels = tf.train.batch(
        [image, raw_image, label],
        batch_size = batch_size,
        num_threads = 4,
        capacity = 4 * batch_size,
        allow_smaller_final_batch = True)

    return images, raw_images, labels

### Loading dataset and data batches

In [7]:
dataset = get_split('train', FLAGS.dataset_dir, file_pattern=file_pattern)
images, _, labels = load_batch(dataset, batch_size=FLAGS.batch_size)

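#Cast to float before dividing and back to int afterwards, so that the result is an integer under both Python 2 and 3
num_steps_per_epoch = int(math.ceil(dataset.num_samples / float(FLAGS.batch_size)))

#Compute the number of steps to take before decaying the learning rate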
decay_steps = FLAGS.num_epochs_before_decay * num_steps_per_epoch
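
As a quick sanity check of this arithmetic, the sketch below reproduces it with plain Python. The batch size and epoch counts are the default flag values; the number of samples is a hypothetical value chosen only so that the result matches the 621 steps per epoch reported in the training log at the end of this notebook.

```python
import math

batch_size = 110             # default flag value
training_epochs = 10         # default flag value
num_epochs_before_decay = 1  # default flag value
num_samples = 68250          # hypothetical dataset size, for illustration only

# The last, possibly smaller, batch still counts as one step
num_steps_per_epoch = int(math.ceil(num_samples / float(batch_size)))

# The learning rate decays once every `decay_steps` training steps
decay_steps = num_epochs_before_decay * num_steps_per_epoch
total_steps = num_steps_per_epoch * training_epochs

print(num_steps_per_epoch, decay_steps, total_steps)
```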

In [8]:
#Create the model inference
with slim.arg_scope(inception_v1_arg_scope()):
    logits, end_points = inception_v1(images, num_classes = dataset.num_classes, is_training = True)

In [9]:
#Define the scopes that you want to exclude for restoration
exclude              = ["InceptionV1/Logits", "InceptionV1/AuxLogits"]
variables_to_restore = slim.get_variables_to_restore(exclude = exclude)
variables_to_save    = slim.get_variables_to_restore()
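
The idea behind `exclude` is that the final classification layers depend on the number of classes, so their ImageNet weights cannot be reused and must be trained from scratch. The sketch below illustrates the selection with plain strings; it approximates the scope matching with a simple prefix test, and the variable names are only illustrative examples of what such a model might contain.

```python
def filter_by_scope(variable_names, exclude):
    """Keep the names that do not start with any of the excluded scopes."""
    return [
        name for name in variable_names
        if not any(name.startswith(scope) for scope in exclude)
    ]


# Illustrative variable names (assumed, not read from a real checkpoint)
all_variables = [
    "InceptionV1/Conv2d_1a_7x7/weights",
    "InceptionV1/Mixed_3b/Branch_0/Conv2d_0a_1x1/weights",
    "InceptionV1/Logits/Conv2d_0c_1x1/weights",
    "InceptionV1/Logits/Conv2d_0c_1x1/biases",
]

exclude = ["InceptionV1/Logits", "InceptionV1/AuxLogits"]

to_restore = filter_by_scope(all_variables, exclude)  # loaded from the checkpoint
to_save = list(all_variables)                         # everything is saved later

print(to_restore)
```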

In [10]:
#Perform one-hot-encoding of the labels (Try one-hot-encoding within the load_batch function!)
one_hot_labels = slim.one_hot_encoding(labels, dataset.num_classes)
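
For readers unfamiliar with one-hot encoding, here is a minimal pure-Python sketch of the transformation applied to the labels. The helper name `one_hot` and the sample labels are assumptions made for the illustration.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1.0 at the label index."""
    encoded = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        encoded.append(row)
    return encoded


# Hypothetical batch of three labels among four classes
encoded_batch = one_hot([0, 2, 1], num_classes=4)
for row in encoded_batch:
    print(row)
```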

In [11]:
#Computes the softmax cross-entropy between the one-hot labels and the logits (the one-hot counterpart of tf.nn.sparse_softmax_cross_entropy_with_logits)
loss = tf.losses.softmax_cross_entropy(onehot_labels = one_hot_labels, logits = logits)
total_loss = tf.losses.get_total_loss()    #obtain the regularization losses as well
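
The quantity being minimised can be written out by hand. The following sketch computes the softmax cross-entropy for a small hypothetical batch in pure Python, assuming the loss is averaged over the batch; it leaves out the regularization terms that `get_total_loss` adds on top.

```python
import math


def softmax(logits):
    """Numerically stable softmax of a list of logits."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(one_hot, logits):
    """Cross-entropy between a one-hot target and the softmax of the logits."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


def batch_loss(one_hot_batch, logits_batch):
    """Mean cross-entropy over a batch (averaging is an assumption of this sketch)."""
    losses = [softmax_cross_entropy(t, z) for t, z in zip(one_hot_batch, logits_batch)]
    return sum(losses) / len(losses)


# Uniform logits over 4 classes give a loss of ln(4): the model knows nothing yet
uniform_loss = softmax_cross_entropy([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0])

# A confident, correct prediction gives a much smaller loss
confident_loss = softmax_cross_entropy([1.0, 0.0, 0.0, 0.0], [5.0, 0.0, 0.0, 0.0])

print(uniform_loss, confident_loss)
```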

In [12]:
#Create the global step for monitoring the learning_rate and training.
global_step = get_or_create_global_step()

In [13]:
#Define your exponentially decaying learning rate
lr = tf.train.exponential_decay(
    learning_rate = FLAGS.initial_learning_rate,
    global_step = global_step,
    decay_steps = decay_steps,
    decay_rate = FLAGS.learning_rate_decay_factor,
    staircase = True
)
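
With `staircase = True`, the learning rate stays constant for `decay_steps` steps and is then multiplied by the decay factor. The sketch below reproduces that schedule in plain Python with the default flag values; the value of 621 steps per epoch is taken from the training log at the end of this notebook, and the helper name is hypothetical.

```python
def staircase_decay(initial_lr, global_step, decay_steps, decay_rate):
    """Learning rate after `global_step` steps with a staircase exponential decay."""
    # Integer division: the exponent only changes every `decay_steps` steps
    return initial_lr * decay_rate ** (global_step // decay_steps)


initial_lr = 1e-4   # default flag value
decay_rate = 0.8    # default flag value
decay_steps = 621   # one epoch, as reported in the training log

for epoch in range(4):
    step = epoch * decay_steps
    print(epoch + 1, staircase_decay(initial_lr, step, decay_steps, decay_rate))
```

With these values the learning rate is multiplied by 0.8 at the start of every new epoch.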

In [14]:
#Now we can define the optimizer that takes on the learning rate
optimizer = tf.train.AdamOptimizer(learning_rate = lr)

In [15]:
#Create the train_op.
train_op = slim.learning.create_train_op(total_loss, optimizer)

In [16]:
#State the metrics that you want to track. The predictions are class indices, i.e. they are not one-hot encoded.
predictions                = tf.argmax(end_points['Predictions'], 1)
probabilities              = end_points['Predictions']
accuracy, accuracy_update  = tf.contrib.metrics.streaming_accuracy(predictions, labels)
metrics_op                 = tf.group(accuracy_update, probabilities)
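
A streaming accuracy is an accuracy accumulated over all the batches seen so far, rather than computed on the current batch only. The class below is a minimal pure-Python sketch of that idea; the class name and the sample batches are hypothetical, and returning 0.0 before any update is a choice of this sketch.

```python
class RunningAccuracy(object):
    """Accumulate the number of correct predictions over successive batches."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, predictions, labels):
        """Add one batch of predictions and return the accuracy so far."""
        self.correct += sum(1 for p, l in zip(predictions, labels) if p == l)
        self.total += len(labels)
        return self.value()

    def value(self):
        """Accuracy over everything seen so far (0.0 before any update)."""
        if self.total == 0:
            return 0.0
        return self.correct / float(self.total)


running = RunningAccuracy()
before_any_batch = running.value()
after_first = running.update([0, 1, 2, 2], [0, 1, 1, 2])   # 3 correct out of 4
after_second = running.update([1, 1, 0, 0], [1, 0, 0, 0])  # 6 correct out of 8

print(before_any_batch, after_first, after_second)
```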

In [17]:
#Now finally create all the summaries you need to monitor and group them into one summary op.
tf.summary.scalar('losses/Total_Loss', total_loss)
tf.summary.scalar('accuracy', accuracy)
tf.summary.scalar('learning_rate', lr)

my_summary_op = tf.summary.merge_all()

In [18]:
#Now we need to create a training step function that runs the train_op and the metrics_op and reads the global_step in a single session run.
def train_step(sess, train_op, global_step):
    '''
    Runs train_op, global_step and the module-level metrics_op in one session call, then logs the time elapsed for this global step
    '''
    #Check the time for each sess run
    start_time = time.time()
    total_loss, global_step_count, _ = sess.run([train_op, global_step, metrics_op])
    time_elapsed = time.time() - start_time

    #Run the logging to print some results
    logging.info('global step %s: loss: %.4f (%.2f sec/step)', global_step_count, total_loss, time_elapsed)

    return total_loss, global_step_count

In [19]:
#Now we create a saver function that actually restores the variables from a checkpoint file in a sess
restore_saver = tf.train.Saver(
    var_list      = variables_to_restore,
    write_version = FLAGS.tf_saver
)

def restore_fn(sess):
    return restore_saver.restore(sess, checkpoint_file)

In [20]:
#Define your supervisor for running a managed session. 
#Do not run the summary_op automatically or else it will consume too much memory

saving_saver = tf.train.Saver(
    var_list      = variables_to_save,
    write_version = FLAGS.tf_saver, 
    max_to_keep   = FLAGS.training_epochs
)

sv = tf.train.Supervisor(
    logdir                = FLAGS.output_dir, 
    summary_op            = None, 
    init_fn               = restore_fn,
    checkpoint_basename   = FLAGS.checkpoint_basename,
    save_model_secs       = None, # Prevent Automatic Model saving
    saver                 = saving_saver
)

In [24]:
#Run the managed session
with sv.managed_session() as sess:   
    
    print("\n###################################\n")
    
    print("Number of Epochs: %d" % FLAGS.training_epochs)
    print("Number of Steps per Epoch: %d" % num_steps_per_epoch)
    print("Summary Recorded Every %d Steps\n" % round(num_steps_per_epoch/10))
    
    print("total steps: %d" % (num_steps_per_epoch * FLAGS.training_epochs))
    
    print("\n###################################\n")
    
    for step in range(num_steps_per_epoch * FLAGS.training_epochs):
        
        #At the start of every epoch, show the vital information:
        if step % num_steps_per_epoch == 0:
            
            learning_rate_value, accuracy_value = sess.run([lr, accuracy])
            
            logging.info('Epoch %d/%d', step/num_steps_per_epoch + 1, FLAGS.training_epochs)
            logging.info('Current Learning Rate: %s', learning_rate_value)
            logging.info('Current Streaming Accuracy: %s', accuracy_value)
            
            # We save the model after each epoch
            if step != 0:
                sv.saver.save(sess, sv.save_path, global_step = sv.global_step)

        #Run the training step
        loss, _ = train_step(sess, train_op, sv.global_step)

        #Log the summaries every tenth of an epoch
        if (step % num_steps_per_epoch % round(num_steps_per_epoch/10)) == 0:
            summaries = sess.run(my_summary_op)
            sv.summary_computed(sess, summaries)

    #We log the final training loss and accuracy
    logging.info('Final Loss: %s', loss)
    logging.info('Final Accuracy: %s', sess.run(accuracy))

    #Once all the training has been done, save the log files and checkpoint model
    logging.info('Finished training! Saving model to disk now.')
    
    sv.saver.save(sess, sv.save_path, global_step = sv.global_step)

INFO:tensorflow:Restoring parameters from inception_files\models\inception_v1.ckpt
INFO:tensorflow:global_step/sec: 0

###################################

Number of Epochs: 10
Number of Steps per Epoch: 621
Summary Recorded Every 62 Steps

total steps: 6210

###################################

INFO:tensorflow:Epoch 1/10
INFO:tensorflow:Current Learning Rate: 0.0001
INFO:tensorflow:Current Streaming Accuracy: 0.0
INFO:tensorflow:global step 1: loss: 1.9852 (2.61 sec/step)
INFO:tensorflow:global step 2: loss: 1.7566 (1.17 sec/step)
INFO:tensorflow:global step 3: loss: 1.6449 (1.18 sec/step)
INFO:tensorflow:global step 4: loss: 1.6816 (1.19 sec/step)
INFO:tensorflow:global step 5: loss: 1.5104 (1.26 sec/step)
INFO:tensorflow:global step 6: loss: 1.4108 (1.34 sec/step)
INFO:tensorflow:global step 7: loss: 1.4286 (1.36 sec/step)
INFO:tensorflow:global step 8: loss: 1.4229 (1.35 sec/step)
INFO:tensorflow:global step 9: loss: 1.3287 (1.42 sec/step)
INFO:tensorflow:global step 10: loss: 1.20

KeyboardInterrupt: 
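
The summary cadence used in the training loop above can be checked with a few lines of plain Python. This sketch uses the 621 steps per epoch reported in the log and lists the steps of one epoch at which a summary would be written.

```python
num_steps_per_epoch = 621   # value reported in the log above

# Same expression as in the training loop
summary_interval = round(num_steps_per_epoch / 10)

# Steps of a single epoch at which the summaries are computed
summary_steps = [
    step for step in range(num_steps_per_epoch)
    if step % num_steps_per_epoch % summary_interval == 0
]

print(summary_interval)
print(summary_steps)
```

Because 621 is not a multiple of 62, step 620 also qualifies, so each epoch records eleven summaries rather than ten.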