# DICE - Notebook 2 - Dataset Preparation

<br/>

```
*************************************************************************
**
** 2017 May 23
**
** In place of a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
*************************************************************************
```

<table style="width:100%; font-size:14px; margin: 20px 0;">
    <tr>
        <td style="text-align:center">
            <b>Contact: </b><a href="mailto:contact@jonathandekhtiar.eu" target="_blank">contact@jonathandekhtiar.eu</a>
        </td>
        <td style="text-align:center">
            <b>Twitter: </b><a href="https://twitter.com/born2data" target="_blank">@born2data</a>
        </td>
        <td style="text-align:center">
            <b>Tech. Blog: </b><a href="http://www.born2data.com/" target="_blank">born2data.com</a>
        </td>
    </tr>
    <tr>
        <td style="text-align:center">
            <b>Personal Website: </b><a href="http://www.jonathandekhtiar.eu" target="_blank">jonathandekhtiar.eu</a>
        </td>
        <td style="text-align:center">
            <b>RSS Feed: </b><a href="https://www.feedcrunch.io/@dataradar/" target="_blank">FeedCrunch.io</a>
        </td>
        <td style="text-align:center">
            <b>LinkedIn: </b><a href="https://fr.linkedin.com/in/jonathandekhtiar" target="_blank">JonathanDEKHTIAR</a>
        </td>
    </tr>
</table>

## Objectives

This notebook preprocesses and prepares the dataset for later use during the training phase.

There are several ways to feed data into a deep learning model with [TensorFlow](https://www.tensorflow.org/), the Python library chosen for this study:

1. **From Disk**: Data can be passed to a model through the **feed_dict** argument when running a *training operation*. This is possible, but it can be slow when a lot of data has to be read at once, and the data could be too large to be held in GPU memory.
<br><br>
2. **From a CSV File**: This [type of file](https://en.wikipedia.org/wiki/Comma-separated_values) is not relevant when dealing with images.
<br><br>
3. **From a preprocessed binary file**: TensorFlow can save and read data in a binary format called [TFRecords](https://www.tensorflow.org/api_guides/python/python_io#TFRecords_Format_Details). The data can be preprocessed beforehand so that only the necessary data are saved and then read in real time during training. Of the three, this approach is the fastest and most memory-efficient when dealing with images.

This notebook focuses on generating the necessary **TFRecord** files. Generating **TFRecords** is less intuitive than working with [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format), which is used with other deep learning libraries such as [Keras](https://keras.io/). However, **TFRecords** give access to TensorFlow's native tools, such as *Queue Runners*, *Coordinators*, *Supervisors*, *etc.*, to design [data pipelines](https://www.tensorflow.org/programmers_guide/reading_data) and process the images in batches.
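
To make the binary format less opaque, the sketch below frames a payload the way the TFRecords format details linked above describe it: a little-endian 64-bit length, a masked CRC32C of that length, the payload bytes, and a masked CRC32C of the payload. It uses only the standard library. `crc32c`, `masked_crc`, `frame_record` and `read_records` are hypothetical helper names chosen for illustration; in practice TensorFlow's own writer handles this framing, and each payload is typically a serialized `tf.train.Example` protocol buffer.

```python
import struct

_CRC32C_POLY = 0x82F63B78  # reflected Castagnoli polynomial
_MASK_DELTA = 0xA282EAD8   # masking constant from the TFRecords format description


def crc32c(data):
    """Bitwise CRC32C of a bytes object (slow, but dependency-free)."""
    crc = 0xFFFFFFFF
    for byte in bytearray(data):
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (_CRC32C_POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def masked_crc(data):
    """Rotate the CRC right by 15 bits and add a constant, modulo 2**32."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + _MASK_DELTA) & 0xFFFFFFFF


def frame_record(payload):
    """Wrap one payload: length, CRC of length, payload, CRC of payload."""
    length = struct.pack('<Q', len(payload))
    return (length + struct.pack('<I', masked_crc(length))
            + payload + struct.pack('<I', masked_crc(payload)))


def read_records(buffer):
    """Yield the payloads stored in a bytes buffer, checking both CRCs."""
    offset = 0
    while offset < len(buffer):
        header = buffer[offset:offset + 8]
        (length,) = struct.unpack('<Q', header)
        (length_crc,) = struct.unpack_from('<I', buffer, offset + 8)
        if length_crc != masked_crc(header):
            raise ValueError('corrupted record length')
        payload = buffer[offset + 12:offset + 12 + length]
        (payload_crc,) = struct.unpack_from('<I', buffer, offset + 12 + length)
        if payload_crc != masked_crc(payload):
            raise ValueError('corrupted record payload')
        yield payload
        offset += 16 + length  # 8 + 4 bytes of header, 4 bytes of footer
```

Each record therefore costs 16 bytes of overhead on top of its payload, which is why the format suits large binary items such as encoded images.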

This notebook uses [TensorFlow-Slim](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim) to keep the code short and easy to follow.

The resulting files will be used later to retrain a CNN model: the [Inception-V4](https://arxiv.org/abs/1602.07261) model developed by Szegedy et al. The model has been pre-trained on the [ImageNet](http://www.image-net.org/) dataset, which yields much more accurate results thanks to the large amount of data available in that dataset. This kind of process is called "*Transfer Learning*".

---

As a reminder before starting, the data have already been preprocessed in the first notebook: **[DICE - Notebook 1 - Dataset Augmentation](https://github.com/DEKHTIARJonathan/DICE-DMU_Imagery_Classification_Engine/blob/master/DICE%20-%20Notebook%201%20-%20Dataset%20Augmentation.ipynb)**

The preprocessed data have all been saved as **JPEG images**, so we will only focus on these files.

## 1. Notebook Initialisation

### 1.1. Load the necessary libraries

In [1]:
import os, random
import tensorflow as tf
from dataset_utils import _dataset_exists, _get_filenames_and_classes, write_label_file, _convert_dataset

### 1.2. Initialise global variables and application flags

In [2]:
flags = tf.app.flags

# Directory containing the augmented dataset
flags.DEFINE_string('dataset_dir', 'data_augmented/total', 'String: Your dataset directory')

# Directory where the TFRecord files are written
flags.DEFINE_string('output_dir', 'data_prepared/', 'String: The output directory for your TFRecord files')

# Proportion of dataset to be used for evaluation: 0.3 => 70% Training & 30% Validation
flags.DEFINE_float('validation_size', 0.3, 'Float: The proportion of examples in the dataset to be used for validation')

# The number of shards to split the dataset into.
flags.DEFINE_integer('num_shards', 2, 'Int: Number of shards to split the TFRecord files into')

# Seed for repeatability.
flags.DEFINE_integer('random_seed', 666, 'Int: Random seed to use for repeatability.')

# Base name used for the TFRecord files
flags.DEFINE_string('tfrecord_filename', 'dmunet_dataset', 'String: The output filename to name your TFRecord file')

FLAGS = flags.FLAGS
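
`tf.app.flags` belongs to the TensorFlow 1.x API. Where it is not available, the same settings can be declared with `argparse` from the standard library. The sketch below is an assumed equivalent, not part of the original notebook: `build_parser` is a hypothetical name, and `parse_args([])` is called with an empty list so that the notebook kernel's own command-line arguments are ignored.

```python
import argparse


def build_parser():
    """Declare the same settings as the flags above, with the same defaults."""
    parser = argparse.ArgumentParser(description='Dataset preparation settings')
    parser.add_argument('--dataset_dir', default='data_augmented/total',
                        help='Directory containing the augmented dataset')
    parser.add_argument('--output_dir', default='data_prepared/',
                        help='Directory where the TFRecord files are written')
    parser.add_argument('--validation_size', type=float, default=0.3,
                        help='Proportion of examples used for validation')
    parser.add_argument('--num_shards', type=int, default=2,
                        help='Number of TFRecord files per split')
    parser.add_argument('--random_seed', type=int, default=666,
                        help='Random seed used for repeatability')
    parser.add_argument('--tfrecord_filename', default='dmunet_dataset',
                        help='Base name used for the TFRecord files')
    return parser


# An empty argument list keeps the defaults and ignores the kernel's argv.
FLAGS = build_parser().parse_args([])
print(FLAGS.dataset_dir, FLAGS.validation_size, FLAGS.num_shards)
```

Because the parsed namespace exposes its values as attributes, expressions such as `FLAGS.output_dir` read the same way as with the flags object.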

### 1.3. Create the output directory

In [3]:
if not os.path.exists(FLAGS.output_dir):
    os.makedirs(FLAGS.output_dir)

## 2. Getting the data

In [4]:
photo_filenames, class_names = _get_filenames_and_classes(FLAGS.dataset_dir)  
class_names_to_ids = dict(zip(class_names, range(len(class_names))))
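
The helper `_get_filenames_and_classes` lives in `dataset_utils` and is not shown in this notebook. A plausible reading, assuming the dataset directory holds one sub-directory per class, is sketched below with the standard library only. The function names `list_images_by_class` and `class_of`, as well as the directory layout, are assumptions made for illustration.

```python
import os


def list_images_by_class(dataset_dir):
    """Return (file paths, sorted class names), one sub-directory per class."""
    class_names = sorted(
        entry for entry in os.listdir(dataset_dir)
        if os.path.isdir(os.path.join(dataset_dir, entry))
    )
    filenames = []
    for class_name in class_names:
        class_dir = os.path.join(dataset_dir, class_name)
        for name in sorted(os.listdir(class_dir)):
            path = os.path.join(class_dir, name)
            if os.path.isfile(path):
                filenames.append(path)
    return filenames, class_names


def class_of(filename):
    """Recover the class name from the parent directory of a file path."""
    return os.path.basename(os.path.dirname(filename))
```

Sorting the class names keeps the name-to-id mapping stable between runs, so `class_names_to_ids[class_of(path)]` would return the same integer label every time the dataset is converted.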

## 3. Performing the train/val split

In [5]:
#Find the number of validation examples we need
num_validation = int(FLAGS.validation_size * len(photo_filenames))

# Shuffle the file list, then divide it into training and validation sets:
random.seed(FLAGS.random_seed)
random.shuffle(photo_filenames)
training_filenames = photo_filenames[num_validation:]
validation_filenames = photo_filenames[:num_validation]
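
The same split can be wrapped in a small function that leaves the input list untouched and does not reseed the global random generator. This is a minimal sketch: `split_train_validation` is a hypothetical name, and the total of 113894 files is simply the sum of the training and validation counts printed in the conversion logs of this notebook.

```python
import random


def split_train_validation(filenames, validation_size, seed):
    """Shuffle a copy of the file list and cut it into (training, validation)."""
    shuffled = list(filenames)
    random.Random(seed).shuffle(shuffled)  # local generator, repeatable
    num_validation = int(validation_size * len(shuffled))
    return shuffled[num_validation:], shuffled[:num_validation]


# Integers stand in for the file names here.
training, validation = split_train_validation(range(113894), 0.3, 666)
print(len(training), len(validation))  # 79726 34168
```

Note that `int()` truncates, so the validation set never receives more than the requested proportion of the files.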

## 4. Converting the datasets into TFRecords
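
Each split is written to `num_shards` files. The file names printed by the cells below follow a `<name>_<split>_<shard>-of-<total>.tfrecord` pattern with 1-based shard numbers. The sketch below reproduces that naming and cuts a list of files into contiguous chunks. `shard_filename` and `shard_ranges` are hypothetical names, and the ceiling-based chunk size is an assumption: the `Shard Size` values reported by `_convert_dataset` suggest that the helper rounds slightly differently.

```python
import math
import os


def shard_filename(output_dir, tfrecord_filename, split_name, shard_id, num_shards):
    """Build a shard path such as <name>_<split>_00001-of-00002.tfrecord."""
    name = '%s_%s_%05d-of-%05d.tfrecord' % (
        tfrecord_filename, split_name, shard_id, num_shards)
    return os.path.join(output_dir, name)


def shard_ranges(num_items, num_shards):
    """Yield (shard_id, start, end) with 1-based ids and half-open ranges."""
    per_shard = int(math.ceil(num_items / float(num_shards)))
    for shard_id in range(1, num_shards + 1):
        start = (shard_id - 1) * per_shard
        end = min(shard_id * per_shard, num_items)
        yield shard_id, start, end


for shard_id, start, end in shard_ranges(79726, 2):
    path = shard_filename('data_prepared/', 'dmunet_dataset', 'train', shard_id, 2)
    print(path, start, end)
```

Splitting a large dataset into several files keeps each file at a manageable size and lets a reader consume the shards independently.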

### 4.1. Training set

In [6]:
_convert_dataset(
    split_name             = 'train', 
    filenames              = training_filenames, 
    class_names_to_ids     = class_names_to_ids,
    dataset_dir            = FLAGS.dataset_dir,
    output_dir             = FLAGS.output_dir,
    tfrecord_filename      = FLAGS.tfrecord_filename,
    _NUM_SHARDS            = FLAGS.num_shards
)

Processing TFRecord File: data_prepared/dmunet_dataset_train_00001-of-00002.tfrecord
Shard Size 39864

Converting image 1000/79726 - shard 1
Converting image 2000/79726 - shard 1
Converting image 3000/79726 - shard 1
Converting image 4000/79726 - shard 1
Converting image 5000/79726 - shard 1
Converting image 6000/79726 - shard 1
Converting image 7000/79726 - shard 1
Converting image 8000/79726 - shard 1
Converting image 9000/79726 - shard 1
Converting image 10000/79726 - shard 1
Converting image 11000/79726 - shard 1
Converting image 12000/79726 - shard 1
Converting image 13000/79726 - shard 1
Converting image 14000/79726 - shard 1
Converting image 15000/79726 - shard 1
Converting image 16000/79726 - shard 1
Converting image 17000/79726 - shard 1
Converting image 18000/79726 - shard 1
Converting image 19000/79726 - shard 1
Converting image 20000/79726 - shard 1
Converting image 21000/79726 - shard 1
Converting image 22000/79726 - shard 1
Converting image 23000/79726 - shard 1
Convertin

### 4.2. Validation set

In [7]:
_convert_dataset(
    split_name             = 'validation', 
    filenames              = validation_filenames, 
    class_names_to_ids     = class_names_to_ids,
    dataset_dir            = FLAGS.dataset_dir,
    output_dir             = FLAGS.output_dir,
    tfrecord_filename      = FLAGS.tfrecord_filename,
    _NUM_SHARDS            = FLAGS.num_shards
)

Processing TFRecord File: data_prepared/dmunet_dataset_validation_00001-of-00002.tfrecord
Shard Size 17085

Converting image 1000/34168 - shard 1
Converting image 2000/34168 - shard 1
Converting image 3000/34168 - shard 1
Converting image 4000/34168 - shard 1
Converting image 5000/34168 - shard 1
Converting image 6000/34168 - shard 1
Converting image 7000/34168 - shard 1
Converting image 8000/34168 - shard 1
Converting image 9000/34168 - shard 1
Converting image 10000/34168 - shard 1
Converting image 11000/34168 - shard 1
Converting image 12000/34168 - shard 1
Converting image 13000/34168 - shard 1
Converting image 14000/34168 - shard 1
Converting image 15000/34168 - shard 1
Converting image 16000/34168 - shard 1
Converting image 17000/34168 - shard 1

#######################

Processing TFRecord File: data_prepared/dmunet_dataset_validation_00002-of-00002.tfrecord
Shard Size 17085

Converting image 18000/34168 - shard 2
Converting image 19000/34168 - shard 2
Converting image 20000/341

## 5. Writing the labels file

Finally, we write a labels file that will be useful as a reference later on.

In [10]:
labels_to_class_names = dict(zip(range(len(class_names)), class_names))
write_label_file(labels_to_class_names, FLAGS.output_dir, filename="labels.txt")
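
`write_label_file` comes from `dataset_utils`, so its output format is not visible here. TF-Slim's dataset helpers conventionally store one `id:class_name` pair per line, and the sketch below assumes that layout to show how such a file can be written and read back. `save_labels` and `load_labels` are hypothetical names, not functions of `dataset_utils`.

```python
import os


def save_labels(labels_to_class_names, output_dir, filename='labels.txt'):
    """Write one 'id:class_name' line per label and return the file path."""
    path = os.path.join(output_dir, filename)
    with open(path, 'w') as handle:
        for label in sorted(labels_to_class_names):
            handle.write('%d:%s\n' % (label, labels_to_class_names[label]))
    return path


def load_labels(path):
    """Read an 'id:class_name' file back into a {id: class_name} dictionary."""
    labels = {}
    with open(path) as handle:
        for line in handle:
            line = line.rstrip('\n')
            if not line:
                continue
            # Split on the first colon only, so class names may contain colons.
            label, _, class_name = line.partition(':')
            labels[int(label)] = class_name
    return labels
```

Reading the file back at prediction time turns the integer output of the classifier into a human-readable class name.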