## DICE - Notebook 1.1 - Dataset Cleaning

<br/>

```
*************************************************************************
**
** 2017 May 23
**
** In place of a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
*************************************************************************
```

<table style="width:100%; font-size:14px; margin: 20px 0;">
    <tr>
        <td style="text-align:center">
            <b>Contact: </b><a href="mailto:contact@jonathandekhtiar.eu" target="_blank">contact@jonathandekhtiar.eu</a>
        </td>
        <td style="text-align:center">
            <b>Twitter: </b><a href="https://twitter.com/born2data" target="_blank">@born2data</a>
        </td>
        <td style="text-align:center">
            <b>Tech. Blog: </b><a href="http://www.born2data.com/" target="_blank">born2data.com</a>
        </td>
    </tr>
    <tr>
        <td style="text-align:center">
            <b>Personal Website: </b><a href="http://www.jonathandekhtiar.eu" target="_blank">jonathandekhtiar.eu</a>
        </td>
        <td style="text-align:center">
            <b>RSS Feed: </b><a href="https://www.feedcrunch.io/@dataradar/" target="_blank">FeedCrunch.io</a>
        </td>
        <td style="text-align:center">
            <b>LinkedIn: </b><a href="https://fr.linkedin.com/in/jonathandekhtiar" target="_blank">JonathanDEKHTIAR</a>
        </td>
    </tr>
</table>

## Objectives

In order to maximise the robustness of the re-trained model, each image in the dataset will be loaded and augmented.

The augmentation process consists of varying image characteristics such as *brightness, saturation, hue, contrast, gamma, orientation, etc.* The modifications applied to each image are chosen at random.

This process tends to improve the generalisation power of the model. The number of augmented images generated directly impacts the training time and the memory requirements, which leads to a trade-off between memory, computing power and model accuracy.

For this study, we have chosen to generate 30 augmented images plus the original one, so 31 images are saved for each image in the dataset.


## 1. Load the necessary libraries and initialise global variables

In [1]:
import os, string, random

import tensorflow as tf
import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

################################## GLOBAL NOTEBOOK VARS ##################################

INPUT_DIRECTORY         = os.path.join("data_bg", "raw")
OUTPUT_DIRECTORY        = os.path.join("data_bg", "cleaned")

############################### RANDOM VALUE GENERATION SEED #############################

SEED                    = 666

######################## Model Dependent Parameters - Inception V1 #######################

IMG_HEIGHT              = 224     # This parameter is fixed due to the model used: Inception-V1
IMG_WIDTH               = 224     # This parameter is fixed due to the model used: Inception-V1
IMG_CHANNELS            = 3       # This parameter is fixed due to the model used: Inception-V1

## 2. File Queue and Image Reading Process Definition

### 2.1. Define a queue of all the JPEG and PNG images in the data folder

Build a queue of file names containing every JPEG and PNG image file found in the sub-directories of the input directory.
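
As a rough illustration of what this file scan collects, here is a minimal sketch that uses only the standard library. The helper name `list_image_files` is hypothetical and not part of the notebook. It assumes the same layout (one sub-directory per class) and compares extensions case-insensitively, which is an extra assumption rather than the behaviour of the TensorFlow glob patterns.

```python
import os


def list_image_files(input_dir, extensions=("jpg", "jpeg", "png")):
    """Return sorted paths of image files found one level below input_dir.

    Hypothetical helper: it mirrors the sub-directory scan of the notebook,
    but runs eagerly in plain Python instead of building a TensorFlow queue.
    """
    wanted = {"." + ext.lower() for ext in extensions}
    found = []
    for name in sorted(os.listdir(input_dir)):
        class_dir = os.path.join(input_dir, name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files at the top level
        for filename in sorted(os.listdir(class_dir)):
            ext = os.path.splitext(filename)[1].lower()
            if ext in wanted:
                found.append(os.path.join(class_dir, filename))
    return found
```

Calling `len(list_image_files(INPUT_DIRECTORY))` before running the pipeline gives a quick, TensorFlow-free estimate of how many files the queue should yield.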

In [2]:
# Get a list of the sub-directories in the INPUT_DIRECTORY
data_directories = [ name for name in os.listdir(INPUT_DIRECTORY) if os.path.isdir(os.path.join(INPUT_DIRECTORY, name)) ]

# This Notebook can handle the following data-types
png_ext_list  = ["png"]
jpeg_ext_list = ["jpg", "jpeg"]

ext_list = jpeg_ext_list + png_ext_list # = ['jpg', 'jpeg', 'png']

# We scan all the files in the sub-directories with the extensions given above
all_files = tf.concat(
    [tf.train.match_filenames_once(INPUT_DIRECTORY + "/" + x + "/*."+ext) for x in data_directories for ext in ext_list],
    0
)

filename_queue = tf.train.string_input_producer(
    all_files,  # single tensor holding every matched file name
    num_epochs=1,
    seed=SEED,
    shuffle=True
)

### 2.2. Define the image reader

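Each image file is read in full, which is required because a compressed image cannot be decoded from a fragment of the file. If the images are too large, they could be split into smaller files in advance, or a fixed-length reader (`tf.FixedLengthRecordReader`) could be used to read the file in chunks.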

In [3]:
image_reader = tf.WholeFileReader()

### 2.3. Read images from the Queue One by One
Read a whole file from the queue. The reader returns a tuple: the first value is the file path, which is used below to choose the decoder and to derive the label, and the second value is the raw file content.

In [4]:
image_path, image_file = image_reader.read(filename_queue)

### 2.4. Convert each Image to a Tensor

Decode the image file into a Tensor that can then be used for training. The decoder is selected from the file extension: JPEG files are passed to `tf.image.decode_jpeg` and PNG files to `tf.image.decode_png`.
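
The choice of decoder can be sketched in plain Python, outside the TensorFlow graph. The function name `decoder_for` and the two constant sets are hypothetical and only serve the illustration. The sketch looks at the whole extension rather than the last three characters of the path, since the last three characters of `.jpeg` are `peg`, which matches neither `jpg` nor `jpeg`.

```python
import os

# Hypothetical constants mirroring the extension lists used in the notebook
JPEG_EXTENSIONS = {"jpg", "jpeg"}
PNG_EXTENSIONS = {"png"}


def decoder_for(path):
    """Return "jpeg" or "png" depending on the file extension of path."""
    # splitext keeps the leading dot, so strip it before comparing
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext in JPEG_EXTENSIONS:
        return "jpeg"
    if ext in PNG_EXTENSIONS:
        return "png"
    raise ValueError("Unsupported image extension: %r" % ext)
```

Raising on unknown extensions is a design choice of this sketch: it makes an unexpected file type visible instead of silently sending it to the wrong decoder.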

In [5]:
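def string_length_tf(t):
    # len() returns a Python int; cast it explicitly so that the value matches the declared tf.int32 output
    return tf.py_func(lambda x: np.int32(len(x)), [t], tf.int32)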

In [6]:
# Keep everything after the last "." so that 4-letter extensions such as "jpeg" are recognised
file_extension = tf.string_split([image_path], delimiter=".").values[-1]

file_cond = tf.equal(file_extension, jpeg_ext_list)
file_cond = tf.count_nonzero(file_cond)
file_cond = tf.equal(file_cond, 1) ## 1 => JPEG EXTENSION, 0 => PNG EXTENSION
        
image_tmp      = tf.cond(
                    file_cond, 
                    lambda: tf.image.decode_jpeg(image_file), 
                    lambda: tf.image.decode_png(image_file)
               )

image_resized  = tf.image.resize_images(
                    image_tmp, 
                    tf.stack([IMG_HEIGHT, IMG_WIDTH]), 
                    method=tf.image.ResizeMethod.BICUBIC,
                    align_corners=True
               )

# Resizing with bilinear, bicubic or area interpolation changes the image dtype from uint8 to float32.
# Bicubic interpolation can overshoot the [0, 255] range, so saturate while converting back to uint8 for PNG encoding.
image_data     = tf.saturate_cast(image_resized, tf.uint8)

image_encoded  = tf.image.encode_png(image_data)

image_label    = tf.string_split([image_path] , delimiter=os.path.sep).values[-2:][0]  
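
For reference, the label extraction above takes the name of the directory that directly contains the image. A minimal plain-Python equivalent, assuming one sub-directory per class, could look like this (the function name `label_from_path` and the example path are hypothetical):

```python
import os


def label_from_path(path):
    """Return the name of the directory that directly contains the file."""
    # normpath collapses redundant separators before the path is split
    return os.path.basename(os.path.dirname(os.path.normpath(path)))


# Hypothetical example path: <input directory>/<class name>/<file name>
example_path = os.path.join("data_bg", "raw", "some_class", "img.jpg")
example_label = label_from_path(example_path)
```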

## 3. Define an Initialisation Operation

In [7]:
init_op_global = tf.global_variables_initializer()
init_op_local = tf.local_variables_initializer()

## 4. Define a function generating random filenames

In [8]:
def id_generator(size=20, chars=string.ascii_uppercase + string.ascii_lowercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))
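
Note that `id_generator` draws from the module-level `random` state, which `SEED` does not control in this notebook, so the generated file names differ from one run to the next. If reproducible names are wanted, one option is to pass an explicitly seeded `random.Random` instance. The sketch below is an assumption about how that could look; `seeded_id_generator` and `ALPHABET` are hypothetical names.

```python
import random
import string

# Same character classes as id_generator: letters and digits
ALPHABET = string.ascii_letters + string.digits


def seeded_id_generator(rng, size=20, chars=ALPHABET):
    """Build a random identifier from a caller-supplied random.Random."""
    return "".join(rng.choice(chars) for _ in range(size))


# Two generators created with the same seed produce the same sequence of names
rng_a = random.Random(666)
rng_b = random.Random(666)
name_a = seeded_id_generator(rng_a)
name_b = seeded_id_generator(rng_b)
```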

## 5. Launch the dataset generation Session

In [9]:
with tf.Session() as sess:
    sess.run([init_op_global, init_op_local])

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    
    try:
        
        i = 0        
        n_files = len(all_files.eval())
        
        print("Number of Images to process %d\n" % n_files)
        
        while not coord.should_stop():
            
            _lbl_txt, _image = sess.run([image_label, image_encoded])   
            
            ## Increment ops count
            i += 1 

            out_dir = os.path.join(OUTPUT_DIRECTORY, _lbl_txt.decode("utf-8"))

            if not os.path.exists(out_dir):
                os.makedirs(out_dir)

            filename = os.path.join(out_dir, id_generator() + ".png")

            # The "with" block closes the file automatically
            with open(filename, "wb") as f:
                f.write(_image)
            
            if i % 300 == 0:
                print("Processing Image: %d/%d => %.2f%%" % (i, n_files, 100.0 * i / n_files))
            
    except tf.errors.OutOfRangeError:
        pass
    
    finally:        
        print("\nNumber of Images Processed: %d" % i)
        
        coord.request_stop()
        coord.join(threads)

Number of Images to process 142


Number of Images Processed: 142
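
Once the session has finished, the result can be checked by counting the PNG files written for each label. The sketch below is a hypothetical helper (`count_images_per_label` is not part of the notebook) and assumes the output layout used above, with one sub-directory per label.

```python
import os
from collections import Counter


def count_images_per_label(output_dir, extension=".png"):
    """Count the files ending in `extension` inside each label sub-directory."""
    counts = Counter()
    for label in sorted(os.listdir(output_dir)):
        label_dir = os.path.join(output_dir, label)
        if not os.path.isdir(label_dir):
            continue  # ignore files stored directly in output_dir
        counts[label] = sum(
            1 for name in os.listdir(label_dir) if name.lower().endswith(extension)
        )
    return counts
```

If the output directory was empty before the run, `sum(count_images_per_label(OUTPUT_DIRECTORY).values())` should match the number of images reported as processed.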
