# DMU-Net Dataset Augmentation Notebook

* **Creator:** Jonathan DEKHTIAR
* **Date:** 2017-05-21
<br/><br/>
* **Contact:** [contact@jonathandekhtiar.eu](mailto:contact@jonathandekhtiar.eu)
* **Twitter:** [@born2data](https://twitter.com/born2data)
* **LinkedIn:** [JonathanDEKHTIAR](https://fr.linkedin.com/in/jonathandekhtiar)
* **Personal Website:** [JonathanDEKHTIAR](http://www.jonathandekhtiar.eu)
* **RSS Feed:** [FeedCrunch.io](https://www.feedcrunch.io/@dataradar/)
* **Tech. Blog:** [born2data.com](http://www.born2data.com/)
* **Github:** [DEKHTIARJonathan](https://github.com/DEKHTIARJonathan)
<br/><br/>

```
*************************************************************************
**
** 2017 March 13
**
** In place of a legal notice, here is a blessing:
**
**    May you do good and not evil.
**    May you find forgiveness for yourself and forgive others.
**    May you share freely, never taking more than you give.
**
*************************************************************************
```

## Objectives

To maximise the robustness of the re-trained model, each image in the dataset is loaded and augmented.

The augmentation process consists of varying image characteristics such as *brightness, saturation, hue, contrast, gamma, orientation, etc.* The amount of each modification is drawn at random.

This process tends to improve the generalisation power of the model. The number of augmented images generated directly impacts the training time and the memory requirements, leading to a trade-off between memory, computing power and model accuracy.

For this study, we have chosen to generate 30 augmented images in addition to the original, leading to 31 images per image in the dataset.

This notebook also randomly splits the available data into two sets: [Training and Validation sets](https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set). This process aims to reduce the [overfitting](http://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html) of the model and thus to improve its accuracy on previously unseen data. In this study, the selection ratio has been chosen as follows:
- *training set:* 60%
- *validation set:* 40%
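
As a rough sanity check of these figures, the expected number of output files can be estimated from the dataset size. The sketch below is illustrative only: `expected_output_counts` is a hypothetical helper, and the 3670-image figure is taken from the run output recorded at the end of this notebook.

```python
def expected_output_counts(n_images, augmentation_factor=30, train_ratio=0.6):
    """Estimate how many files an augmentation run writes to disk.

    Every source image yields the original plus `augmentation_factor` copies.
    The train/validation figures are expectations, not exact values, because
    each source image is assigned to a set at random.
    """
    files_per_image = augmentation_factor + 1
    total_files = n_images * files_per_image
    expected_train = round(total_files * train_ratio)
    return {
        "files_per_image": files_per_image,
        "total_files": total_files,
        "expected_train_files": expected_train,
        "expected_val_files": total_files - expected_train,
    }


counts = expected_output_counts(3670)
print(counts)
```

Larger augmentation factors scale the disk footprint linearly, which is the trade-off described above.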


## 1. Load the necessary libraries and initialise global variables

In [1]:
import os, string, random

import tensorflow as tf
import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

################################## GLOBAL NOTEBOOK VARS ##################################

NUM_EPOCH               = 1
INPUT_DIRECTORY         = "data"
OUTPUT_DIRECTORY        = "data_augmented"

TRAINING_DIR_NAME       = "train"
VALIDATION_DIR_NAME     = "val"

TRAIN_VAL_SPLIT         = 0.6 # 60% of the images are training data, 40% are validation data
IMG_AUGMENTATION_FACTOR = 30 # The number of augmented images generated from the raw image.

############################### RANDOM VALUE GENERATION SEED #############################

SEED                    = 666

######################## Model Dependant Parameters - Inception V3 #######################

IMG_HEIGHT              = 299     # This parameter is fixed due to the model used: Inception-V3
IMG_WIDTH               = 299     # This parameter is fixed due to the model used: Inception-V3
IMG_CHANNELS            = 3       # This parameter is fixed due to the model used: Inception-V3
IMG_COLORSPACE          = "RGB"   # This parameter is fixed due to the model used: Inception-V3
IMG_OUTFORMAT           = "JPEG"  # This parameter is fixed due to the model used: Inception-V3

## 2. File Queue and Image Reading Process Definition

### 2.1. Define a queue of all the JPEG and PNG images in the data folder

Make a queue of file names including all the JPEG and PNG image files found in the sub-directories of the input directory.

In [2]:
data_directories = [ name for name in os.listdir(INPUT_DIRECTORY) if os.path.isdir(os.path.join(INPUT_DIRECTORY, name)) ]

png_ext_list  = ["png"]
jpeg_ext_list = ["jpg", "jpeg"]

ext_list = jpeg_ext_list + png_ext_list # = ['jpg', 'jpeg', 'png']

all_files = [tf.train.match_filenames_once(INPUT_DIRECTORY + "/" + x + "/*."+ext) for x in data_directories for ext in ext_list]

filename_queue = tf.train.string_input_producer(
    tf.concat(all_files,0), # Merge the sub-tensors into one
    num_epochs=NUM_EPOCH,
    seed=SEED,
    shuffle=True
)
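
Before building the TensorFlow queue, it can help to preview which files the patterns will match. The following is a minimal plain-Python sketch and not part of the original pipeline. `list_image_files` is a hypothetical helper, and it assumes one sub-directory per class, as in the cell above.

```python
import glob
import os


def list_image_files(input_directory, extensions=("jpg", "jpeg", "png")):
    """Return {class_name: sorted list of image paths} for each sub-directory."""
    files_per_class = {}
    for name in sorted(os.listdir(input_directory)):
        class_dir = os.path.join(input_directory, name)
        if not os.path.isdir(class_dir):
            continue  # Ignore loose files at the top level
        matches = []
        for ext in extensions:
            matches.extend(glob.glob(os.path.join(class_dir, "*." + ext)))
        files_per_class[name] = sorted(matches)
    return files_per_class


# Only runs when the notebook's input directory is present.
if os.path.isdir("data"):
    for label, paths in list_image_files("data").items():
        print(label, len(paths))
```

A class that reports zero files here would never appear in the augmented dataset, so this check is cheap insurance before a long run.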

### 2.2. Define the image reader

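Each image file is read as a whole, which is required because the images are compressed (JPEG or PNG) and cannot be decoded from a partial read. If the images are too large, they could be split into smaller files in advance, or a fixed-length record reader (`tf.FixedLengthRecordReader`) could be used to split up the file.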

In [3]:
image_reader = tf.WholeFileReader()

### 2.3. Read images from the Queue One by One

Read a whole file from the queue. The first value of the returned tuple is the file path, which is used later to select the decoder and to extract the label. The second value is the raw file content.

In [4]:
image_path, image_file = image_reader.read(filename_queue)

### 2.4. Convert each Image to a Tensor

Decode the image file into a Tensor that can then be used in training. The decoder (JPEG or PNG) is selected from the file extension.

In [5]:
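def string_length_tf(t):
    # The explicit np.int32 cast keeps the returned dtype equal to the declared output type on every platform.
    return tf.py_func(lambda x: np.int32(len(x)), [t], tf.int32)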

In [6]:
path_length = string_length_tf(image_path)
file_extension = tf.substr(image_path, path_length - 3, 3)

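# file_extension holds only the last three characters of the path, so ".jpeg" files end in "peg".
file_cond = tf.equal(file_extension, ["jpg", "peg"])
file_cond = tf.count_nonzero(file_cond)
file_cond = tf.equal(file_cond, 1) ## True => JPEG EXTENSION, False => PNG EXTENSION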
        
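# Force IMG_CHANNELS channels so that grayscale or RGBA files produce a consistent shape.
image_tmp      = tf.cond(
                    file_cond, 
                    lambda: tf.image.decode_jpeg(image_file, channels=IMG_CHANNELS), 
                    lambda: tf.image.decode_png(image_file, channels=IMG_CHANNELS)
               )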

image_resized  = tf.image.resize_images(
                    image_tmp, 
                    tf.stack([IMG_HEIGHT, IMG_WIDTH]), 
                    method=tf.image.ResizeMethod.BICUBIC,
                    align_corners=True
               )

# Resizing with the bilinear, bicubic or area method changes the image data type (from uint8 to float32).
# Bicubic interpolation can overshoot, so clip to [0, 255] before converting back to uint8 for display and encoding.
image_data = tf.cast(tf.clip_by_value(image_resized, 0.0, 255.0), tf.uint8)

image_label    = tf.string_split([image_path] , delimiter=os.path.sep).values[1]  
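
The extension test above relies on files being named correctly. As an alternative, here is a hedged sketch that works outside the TensorFlow graph and checks the format from the first bytes of the file: JPEG files start with the bytes `FF D8 FF`, and PNG files start with a fixed eight-byte signature. `sniff_image_format` is a hypothetical helper and not part of the notebook.

```python
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def sniff_image_format(file_bytes):
    """Return "jpeg", "png" or None based on the leading bytes of a file."""
    if file_bytes.startswith(JPEG_MAGIC):
        return "jpeg"
    if file_bytes.startswith(PNG_MAGIC):
        return "png"
    return None  # Unknown or unsupported format


# Synthetic headers stand in for real files in this example.
print(sniff_image_format(JPEG_MAGIC + b"\xe0" + b"\x00" * 8))
print(sniff_image_format(PNG_MAGIC + b"\x00" * 8))
print(sniff_image_format(b"GIF89a"))
```

Such a check could be run once over the input directory to flag misnamed files before the graph is executed.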

## 3. Determining whether an image will be used for validation or training at random

In [7]:
is_train_val     = tf.random_uniform([], 0, 1)

is_training_data = tf.less(is_train_val, TRAIN_VAL_SPLIT, name=None)
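
Because each image is assigned independently with probability `TRAIN_VAL_SPLIT`, the realised proportion only approximates 60% (the run recorded at the end of this notebook ends at 59.40%). The plain-Python sketch below simulates the same draw with the standard `random` module. `simulate_split` is a hypothetical helper, and its random stream differs from TensorFlow's, so the exact counts will not match the notebook run.

```python
import random


def simulate_split(n_images, train_ratio=0.6, seed=666):
    """Assign each image to training with probability `train_ratio`."""
    rng = random.Random(seed)
    n_train = sum(1 for _ in range(n_images) if rng.random() < train_ratio)
    return n_train, n_images - n_train


n_train, n_val = simulate_split(3670)
proportion = 100.0 * n_train / (n_train + n_val)
print("train: %d, val: %d, proportion: %.2f%%" % (n_train, n_val, proportion))
```

If an exact 60/40 split is required, shuffling the file list once and cutting it at the desired index would be an alternative design.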

## 4. Perform Image Augmentation

### 4.1. Define an Image Augmentation Function

In [8]:
def augment_image(image):
    
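    ### GAMMA SHIFTING => It affects primarily the highlights ###
    
    random_gamma      = tf.random_uniform([], 0.5, 1.1)
    # Assumption: the gamma curve is applied on intensities normalised to [0, 1], then rescaled to [0, 255].
    image             = 255.0 * (tf.clip_by_value(image, 0.0, 255.0) / 255.0) ** random_gamma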
    
    ### BRIGHTNESS SHIFTING ###
    
    # This gives a centered random  image*(1 +/- delta)
    # It does not fit our requirements, we would like a random brightness not centered around "1".
    #image = tf.image.random_brightness(image, max_delta=0.125) 
    
    random_brightness = tf.random_uniform([], 0.5, 1.2)
    image         =  image * random_brightness
    
    ### OPS SHIFTING ###   
    
    image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
    image = tf.image.random_hue(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
    
    # randomly horizontally flip the image
    do_flip = tf.random_uniform([], 0, 1)
    image  = tf.cond(do_flip > 0.5, lambda: tf.image.flip_left_right(image), lambda: image)
    
    # randomly rotate the image
    # maxval is exclusive for integer dtypes, so 4 is needed to draw values in {0, 1, 2, 3}
    n_rot = tf.random_uniform([], 0, 4, tf.int32) # 0 => No Rotation, 1 => 90° Rot, 2 => 180° Rot, 3 => 270° Rotation
    image = tf.image.rot90(image, n_rot)
    
    # The random_* ops do not necessarily clamp.
    image = tf.clip_by_value(image, 0.0, 255.0)
    
    return tf.cast(image, tf.uint8)
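
To make the effect of the gamma step concrete, here is a minimal plain-Python sketch on a handful of pixel intensities. It assumes intensities are normalised to [0, 1] before the exponent is applied and rescaled afterwards, which is a common convention but an assumption here. `gamma_shift` is a hypothetical helper, not a TensorFlow function.

```python
def gamma_shift(pixel_values, gamma, max_value=255.0):
    """Apply a gamma curve to intensities expressed in [0, max_value]."""
    shifted = []
    for value in pixel_values:
        clamped = min(max(value, 0.0), max_value)  # Guard against out-of-range inputs
        normalised = clamped / max_value           # Map to [0, 1]
        shifted.append(max_value * normalised ** gamma)
    return shifted


row = [0.0, 64.0, 128.0, 255.0]

# The two bounds of the range sampled by the notebook: 0.5 and 1.1
print([round(v, 1) for v in gamma_shift(row, 0.5)])
print([round(v, 1) for v in gamma_shift(row, 1.1)])
```

With this convention, an exponent below 1 brightens mid-tones, an exponent above 1 darkens them, and pure black and pure white are left unchanged.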

### 4.2. Create a Tensor of Images and Populate it

In [9]:
img_arr = tf.stack([
    tf.image.encode_jpeg(image_data),
])

for _ in range(IMG_AUGMENTATION_FACTOR):
    img_arr = tf.concat([img_arr, [tf.image.encode_jpeg(augment_image(image_resized))]], 0)

## 5. Define a function generating random filenames

In [10]:
def id_generator(size=20, chars=string.ascii_uppercase + string.ascii_lowercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))
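
With 62 possible characters and 20 positions, `id_generator` draws from 62^20 (roughly 7 × 10^35) possible names, so collisions are extremely unlikely at this dataset size. As an alternative sketch, the standard library's `uuid` module produces identifiers without a hand-rolled alphabet. `unique_filename` is a hypothetical helper, not part of the notebook.

```python
import os
import uuid


def unique_filename(out_dir, extension=".jpg"):
    """Build a file path whose base name is a random 32-character hex UUID."""
    return os.path.join(out_dir, uuid.uuid4().hex + extension)


print(unique_filename("data_augmented/train/some_label"))
```

Either approach works here; the UUID variant simply delegates uniqueness to a well-tested library function.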

## 6. Define an Initialisation Operation

In [11]:
init_op_global = tf.global_variables_initializer()
init_op_local = tf.local_variables_initializer()

## 7. Launch the dataset generation Session

In [12]:
with tf.Session() as sess:
    sess.run([init_op_global, init_op_local])

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    
    try:
        
        i = 0
        i_trn = 0 # Counter of training data
        i_val = 0 # Counter of validation data
        
        while not coord.should_stop():
            
            _trn_bool, _lbl_txt, _img_arr = sess.run([is_training_data, image_label, img_arr])   
            
            ## Increment ops count
            i += 1 

            if (_trn_bool):
                out_dir = OUTPUT_DIRECTORY + "/" + TRAINING_DIR_NAME + "/" + _lbl_txt.decode("utf-8")
                i_trn += 1
                
            else:                
                out_dir = OUTPUT_DIRECTORY + "/" + VALIDATION_DIR_NAME + "/" + _lbl_txt.decode("utf-8")
                i_val += 1
            
            if not os.path.exists(out_dir):
                os.makedirs(out_dir)
                 
            for _img in _img_arr:
                filename = out_dir + "/" + id_generator() + ".jpg"

                with open(filename, "wb") as f: # The context manager closes the file automatically
                    f.write(_img)
            
            if (i % 300 == 0):
                print("Processing Image:", i)
                print("Training-Validation Proportion: %2.2f%%\n" % (i_trn/(i_trn+i_val)*100))
            
    except tf.errors.OutOfRangeError:
        pass
    
    finally:        
        print("\nNumber of Images Processed:", i)
        print("Number of Training Images:", i_trn)
        print("Number of Validation Images:", i_val)
        if i_trn + i_val > 0: # Avoid a division by zero when no image was processed
            print("Training-Validation Proportion: %2.2f%%" % (i_trn/(i_trn+i_val)*100))
        
        coord.request_stop()
        coord.join(threads)

Processing Image: 300
Training-Validation Proportion: 57.33%

Processing Image: 600
Training-Validation Proportion: 56.67%

Processing Image: 900
Training-Validation Proportion: 58.22%

Processing Image: 1200
Training-Validation Proportion: 59.42%

Processing Image: 1500
Training-Validation Proportion: 60.00%

Processing Image: 1800
Training-Validation Proportion: 59.44%

Processing Image: 2100
Training-Validation Proportion: 59.10%

Processing Image: 2400
Training-Validation Proportion: 58.54%

Processing Image: 2700
Training-Validation Proportion: 58.96%

Processing Image: 3000
Training-Validation Proportion: 59.43%

Processing Image: 3300
Training-Validation Proportion: 59.48%

Processing Image: 3600
Training-Validation Proportion: 59.28%


Number of Images Processed: 3670
Number of Training Images: 2180
Number of Validation Images: 1490
Training-Validation Proportion: 59.40%
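
After the run, the split can also be checked on disk. With 31 files written per source image, the counts above would correspond to 2180 × 31 = 67580 training files and 1490 × 31 = 46190 validation files. The sketch below counts the `.jpg` files under each split directory. `count_files_per_split` is a hypothetical helper, and the directory names are assumed to match `OUTPUT_DIRECTORY`, `TRAINING_DIR_NAME` and `VALIDATION_DIR_NAME`.

```python
import os


def count_files_per_split(output_directory, splits=("train", "val")):
    """Count the .jpg files found under each split directory (recursively)."""
    counts = {}
    for split in splits:
        split_dir = os.path.join(output_directory, split)
        total = 0
        if os.path.isdir(split_dir):
            for _root, _dirs, files in os.walk(split_dir):
                total += sum(1 for f in files if f.lower().endswith(".jpg"))
        counts[split] = total  # Stays at 0 when the directory is missing
    return counts


counts = count_files_per_split("data_augmented")
total = sum(counts.values())
print(counts)
if total > 0:
    print("Training share on disk: %.2f%%" % (100.0 * counts["train"] / total))
```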
