# DMU-Net Dataset Generation Notebook

* **Creator:** Jonathan DEKHTIAR
* **Date:** 2017-05-21
<br/><br/>
* **Contact:** [contact@jonathandekhtiar.eu](mailto:contact@jonathandekhtiar.eu)
* **Twitter:** [@born2data](https://twitter.com/born2data)
* **LinkedIn:** [JonathanDEKHTIAR](https://fr.linkedin.com/in/jonathandekhtiar)
* **Personal Website:** [JonathanDEKHTIAR](http://www.jonathandekhtiar.eu)
* **RSS Feed:** [FeedCrunch.io](https://www.feedcrunch.io/@dataradar/)
* **Tech. Blog:** [born2data.com](http://www.born2data.com/)
* **Github:** [DEKHTIARJonathan](https://github.com/DEKHTIARJonathan)
<br/><br/>

```
*************************************************************************
**
** 2017 March 13
**
** In place of a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
*************************************************************************
```

## Objectives

This notebook generates a [TFRecords](https://www.tensorflow.org/api_guides/python/python_io#TFRecords_Format_Details) file from the images contained in the "./data/" folder.

This file will be used later to retrain a CNN model: the [Inception-V3](http://arxiv.org/abs/1512.00567) model developed by Szegedy et al. The model has been pre-trained on the [ImageNet](http://www.image-net.org/) dataset, which gives much more accurate results thanks to the large amount of data available in that dataset. This technique is called "transfer learning".

## 1. Load the necessary libraries and initialise global variables

In [1]:
import os

import tensorflow as tf
import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

####################### GLOBAL Notebook

NUM_EPOCH = 1
INPUT_DIRECTORY = "data"
TFRECORD_FILENAME = "output/dmu_net.tfrecord"
SEED = 666 # A specific number for reproducibility or None for a random value
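
The writer created in section 3.3 targets a file inside the `output` folder. It is assumed here that the writer does not create missing parent folders (worth checking against your TensorFlow version), so a minimal standard-library sketch such as the following can make sure the folder exists beforehand. The helper name `ensure_parent_dir` is hypothetical.

```python
import os

TFRECORD_FILENAME = "output/dmu_net.tfrecord"  # same value as in the cell above


def ensure_parent_dir(path):
    """Create the parent folder of `path` if it is missing and return that folder."""
    parent = os.path.dirname(path)
    if parent:  # empty string when the file sits in the current directory
        os.makedirs(parent, exist_ok=True)
    return parent
```

Calling `ensure_parent_dir(TFRECORD_FILENAME)` before creating the writer is enough; the call is harmless when the folder already exists.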

## 2. File Queue and Image Reading Process Definition

### 2.1. Define a queue of all the JPEG images in the data folder

Make a queue of file names that includes every JPEG image file (`*.jpg`) found in the sub-folders of the input directory.

In [2]:
data_directories = [ name for name in os.listdir(INPUT_DIRECTORY) if os.path.isdir(os.path.join(INPUT_DIRECTORY, name)) ]

all_files = [tf.train.match_filenames_once(INPUT_DIRECTORY + "/" + x + "/*.jpg") for x in data_directories]

filename_queue = tf.train.string_input_producer(
    tf.concat(all_files,0), # Merge the sub-tensors into one
    num_epochs=NUM_EPOCH,
    seed=SEED,
    shuffle=True
)
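
For intuition, the file list that the queue iterates over can be reproduced without TensorFlow. The sketch below is not the TensorFlow implementation: it assumes the same `data/<label>/*.jpg` layout and uses Python's `random` module, so the shuffled order will differ from the one the queue produces, even with the same seed. The function name `list_jpeg_files` is hypothetical.

```python
import glob
import os
import random


def list_jpeg_files(input_directory, seed=None):
    """Return the shuffled list of *.jpg paths found in each sub-folder."""
    sub_dirs = sorted(
        name for name in os.listdir(input_directory)
        if os.path.isdir(os.path.join(input_directory, name))
    )
    files = []
    for name in sub_dirs:
        pattern = os.path.join(input_directory, name, "*.jpg")
        files.extend(sorted(glob.glob(pattern)))
    random.Random(seed).shuffle(files)  # same seed -> same order on every run
    return files
```

Comparing `len(list_jpeg_files("data"))` with the number of images reported at the end of the notebook is a quick sanity check on the input folder.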

### 2.2. Define the image reader

Read each image file in its entirety, which is required since the files are JPEGs. If the images are too large, they could be split into smaller files in advance, or a fixed-length record reader could be used to read the file in chunks.

In [3]:
image_reader = tf.WholeFileReader()

### 2.3. Read images from the Queue One by One
Read a whole file from the queue. The first value in the returned tuple is the file name (the key), which is used in the next step to derive the image label; the second value is the raw file content.

In [4]:
img_key, image_file = image_reader.read(filename_queue)

### 2.4. Convert each Image to a Tensor
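Decode the image as a JPEG file. This turns it into a Tensor which can then be used in training. The image shape and the label, taken from the name of the folder containing the file, are extracted at the same time.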

In [5]:
image       = tf.image.decode_jpeg(image_file)
image_shape = tf.shape(image)
image_label = tf.string_split([img_key] , delimiter=os.path.sep).values[1]
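
The label extraction relies on the path layout: splitting `data/<label>/<file>.jpg` on the path separator gives three components, and index 1 is the class folder. A plain-Python equivalent is sketched below with a hypothetical helper name and a made-up example path. It assumes `INPUT_DIRECTORY` is a single relative folder, since a deeper or absolute input path would shift the index.

```python
import os


def label_from_path(path, sep=os.path.sep):
    """Return the second path component, i.e. the class folder name."""
    return path.split(sep)[1]


example_path = os.path.join("data", "cats", "img_001.jpg")  # made-up file name
print(label_from_path(example_path))  # prints "cats"
```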

## 3. Defining the TFRecord Saving Process

### 3.1. Defining the tf.train.Feature helper functions

In [6]:
def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

### 3.2. Defining the writer function

In [7]:
def writer_to_tfrecord(label, shape, image, writer):
    
    # write label, shape, and image content to the TFRecord file
    example = tf.train.Example(features=tf.train.Features(feature={
        'label': _bytes_feature(label),
        'shape': _bytes_feature(shape.tobytes()),
        'image': _bytes_feature(image.tobytes())
    }))
    
    writer.write(example.SerializeToString())
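
Because `shape` and `image` are stored as raw bytes, whoever reads the file later has to know their dtypes to rebuild the arrays. The NumPy-only sketch below shows the round trip, assuming `tf.shape` yields `int32` values and `tf.image.decode_jpeg` yields `uint8` pixels (their defaults); the small random array is only a stand-in for a decoded image.

```python
import numpy as np

# Stand-in for a decoded JPEG: height 4, width 6, 3 channels, uint8 pixels.
image = np.random.RandomState(0).randint(0, 256, size=(4, 6, 3)).astype(np.uint8)
shape = np.array(image.shape, dtype=np.int32)

# What writer_to_tfrecord stores in the 'shape' and 'image' features.
shape_bytes = shape.tobytes()
image_bytes = image.tobytes()

# Reading side: recover the shape first, then use it to reshape the flat pixels.
restored_shape = np.frombuffer(shape_bytes, dtype=np.int32)
restored_image = np.frombuffer(image_bytes, dtype=np.uint8).reshape(tuple(restored_shape))

assert np.array_equal(restored_image, image)
```

Storing the shape alongside the pixels is what makes this possible: the raw byte string alone carries no information about height, width or channels.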

### 3.3. Creating a TFRecords writer

In [8]:
writer = tf.python_io.TFRecordWriter(TFRECORD_FILENAME)

## 4. Define an Initialisation Operation

Both initialisers are needed: `tf.train.match_filenames_once` and the `num_epochs` counter of the input producer rely on local variables, which the global initialiser does not cover.

In [9]:
init_op_global = tf.global_variables_initializer()
init_op_local = tf.local_variables_initializer()

## 5. Launch the dataset generation Session

In [10]:
with tf.Session() as sess:
    sess.run([init_op_global, init_op_local])

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    
    try:
        i = 1
        while not coord.should_stop():
            lbl, shp, img = sess.run([image_label, image_shape, image])
            writer_to_tfrecord(lbl, shp, img, writer)
            
            if (i % 300 == 0):
                print ("Processing Image:", i)
                
            i += 1
            
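    except tf.errors.OutOfRangeError:
        # Raised once the queue is exhausted, i.e. after NUM_EPOCH passes over the files.
        # i starts at 1 and is incremented after each write, so the value printed
        # here is one more than the number of images actually written.
        print("\nNumber of Images Processed:", i)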
    finally:
        writer.close()
        coord.request_stop()
        coord.join(threads)

Processing Image: 300
Processing Image: 600
Processing Image: 900
Processing Image: 1200
Processing Image: 1500
Processing Image: 1800
Processing Image: 2100
Processing Image: 2400
Processing Image: 2700
Processing Image: 3000
Processing Image: 3300
Processing Image: 3600

Number of Images Processed: 3671
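
To cross-check the result without TensorFlow, the records in the generated file can be counted by walking the framing described in the TFRecords format details linked in the Objectives: each record is an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload. The sketch below is a hypothetical helper, not a TensorFlow API; it skips the CRCs instead of verifying them and does not detect a truncated final payload. Since the loop counter above starts at 1, the printed total may be one higher than the number of records found.

```python
import struct


def count_tfrecord_records(path):
    """Count records by following the length prefixes (CRCs are not verified)."""
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)  # uint64, little-endian payload length
            if not header:
                break  # clean end of file
            if len(header) < 8:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header)
            # Skip the length CRC, the payload and the payload CRC.
            f.seek(4 + length + 4, 1)
            count += 1
    return count
```

For this notebook the call would be `count_tfrecord_records("output/dmu_net.tfrecord")`.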
