In this project, we build a Convolutional Neural Network (CNN) model to classify images into 5 severity levels of acne: Clear, Almost Clear, Mild, Moderate, and Severe. CNN models, like most deep learning models, are spatially sensitive. Unless the training data contains enough images to cover almost all possible locations of acne lesions, a CNN trained on a limited set of images will not generalize well to test images whose lesions appear in locations never seen during training.

To overcome this problem, we roll each image patch by some number of pixels. Rolling is applied to all image patches, whether they come from the dominating class (mild) or from the minority classes.

For image patches from the mild class, we roll each patch 2 times. For image patches from the minority classes, the number of rolls is determined by the ratio between the number of mild class images and the number of images in that minority class. Therefore, after rolling, the numbers of images in the 5 classes are almost balanced.
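
As a rough illustration of how the roll counts come out, the sketch below applies the same rule used later in this notebook (floor of the count ratio, minus one, plus the two rolls every class receives) to made-up class counts. The counts and the helper name `rolls_for_class` are hypothetical and are not part of the project code.

```python
import math

MINIMAL_ROLL_TIMES = 2  # every class, including the dominating one, is rolled at least twice


def rolls_for_class(class_count, max_count, minimal_roll_times=MINIMAL_ROLL_TIMES):
    """Number of rolled copies per patch (hypothetical helper mirroring the notebook's rule)."""
    ratio = max_count / class_count
    extra = int(math.floor(ratio)) - 1 if ratio > 1 else 0
    return extra + minimal_roll_times


# Made-up counts for illustration only; they are not the project's data.
counts = {"3-Mild": 600, "4-Moderate": 200, "5-Severe": 100}
max_count = max(counts.values())
rolls = {label: rolls_for_class(n, max_count) for label, n in counts.items()}
print(rolls)
```

The unrolled patch is saved as well, so each patch contributes one more image than its roll count.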

It is important to roll the images of the dominating class since we need to increase the coverage of acne lesion locations of this class as well in the training data. 

Forehead image patches are rolled from right to left. Cheeks and chin image patches are rolled bottom up. 

For instance, for a forehead patch 1000 pixels wide and a rolling step size of 200 pixels:

    image_after_rolling[:, 0:800, :] = image_before_rolling[:, 200:1000, :]
    image_after_rolling[:, 800:1000, :] = image_before_rolling[:, 0:200, :]

For a cheek patch 500 pixels high and a rolling step size of 100 pixels:

    image_after_rolling[0:400, :, :] = image_before_rolling[100:500, :, :]
    image_after_rolling[400:500, :, :] = image_before_rolling[0:100, :, :]
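
The two slice assignments in each example amount to a circular shift, so the same result can be obtained with NumPy's `np.roll`. The sketch below checks this on a small made-up array; the array shape and step sizes are illustrative only, not values from the project.

```python
import numpy as np

# Small stand-in for an image patch: height 4, width 6, 3 channels (made-up data).
img = np.arange(4 * 6 * 3, dtype=np.uint8).reshape(4, 6, 3)

# Right-to-left roll by 2 pixels, written with slices as in the forehead example.
step_x = 2
rolled_x = np.empty_like(img)
rolled_x[:, 0:6 - step_x, :] = img[:, step_x:6, :]
rolled_x[:, 6 - step_x:6, :] = img[:, 0:step_x, :]

# Bottom-up roll by 1 pixel, written with slices as in the cheek example.
step_y = 1
rolled_y = np.empty_like(img)
rolled_y[0:4 - step_y, :, :] = img[step_y:4, :, :]
rolled_y[4 - step_y:4, :, :] = img[0:step_y, :, :]

# A negative shift moves content toward index 0 and wraps it around the edge.
same_x = np.array_equal(rolled_x, np.roll(img, -step_x, axis=1))
same_y = np.array_equal(rolled_y, np.roll(img, -step_y, axis=0))
print(same_x, same_y)
```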

The rolling step size is determined by the size of the dimension being rolled and the number of rolls, i.e.,
***step_size = int(rolling dimension size/(num\_rolling\_times+1))***
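
To make the formula concrete, the sketch below lists the pixel offsets produced for a given dimension size and number of rolls. The helper name `roll_offsets` is hypothetical. Because the divisor is the number of rolls plus one, the largest offset stays below the full dimension, so no roll reproduces the original patch.

```python
def roll_offsets(dimension_size, num_rolling_times):
    """Pixel offsets for each roll (hypothetical helper following the formula above)."""
    step_size = int(dimension_size / (num_rolling_times + 1))
    return [step_size * (j + 1) for j in range(num_rolling_times)]


forehead_offsets = roll_offsets(1000, 4)  # patch 1000 px wide, rolled 4 times
cheek_offsets = roll_offsets(500, 2)      # patch 500 px high, rolled 2 times
print(forehead_offsets)
print(cheek_offsets)
```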

The image patches without rolling (rolling pixels = 0) are also saved to the destination directories. 

The image patches after rolling are also allocated to different subdirectories under the destination directory, based on the labels the original images received from the dermatologists. In this Jupyter notebook, we assume that the labels of the non-golden set images have been saved in a csv file. Therefore, we can be sure that images allocated based on this csv file will not have any golden-set images. 

While allocating image patches to the subdirectories, we also use a random number generator to decide whether an image patch goes to the training data or the validation data, based on whether the random number exceeds a pre-defined threshold (`training_ratio`). Two mapping files record which files are in the training data and which are in the validation data. These two mapping files will be used by CNTK to train CNN models.
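
The split can be sketched as follows. In the notebook's code the random number is drawn once per labeled image, so all patches and rolled copies of one image land on the same side of the split. The helper name `split_by_threshold` and the image IDs are hypothetical; the seed and ratio are the values used in this notebook.

```python
import random


def split_by_threshold(image_ids, training_ratio=0.7, seed=98052):
    """Assign each image ID to training or validation with one random draw per ID."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    train, valid = [], []
    for image_id in image_ids:
        rn = rng.uniform(0, 1)
        if rn <= training_ratio:
            train.append(image_id)
        else:
            valid.append(image_id)
    return train, valid


image_ids = ["%04d" % i for i in range(1, 21)]  # made-up image IDs
train, valid = split_by_threshold(image_ids)
print(len(train), len(valid))
```

With a fixed seed the split is reproducible, but the realized training fraction only approximates `training_ratio` for a small number of images.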

## Prerequisites

### Skin patches extracted from the original images
You need to run ***Step 1. Extract Forehead, cheeks, and chin skin patches from raw images using facial landmark model and One Eye model*** to extract the skin patches and save them in a single directory. It is fine to extract the skin patches of the golden set images at this step. However, this Jupyter notebook does not allocate them to the subdirectories, because golden set images do not appear in the csv file with the image labels.

### Save the non-golden set image labels in a csv file
The image labels are stored in a database. Run the following query to retrieve the latest (final) labels of all non-golden set images. In the query, shared_image = 0 stands for non-golden set images, and shared_image = 1 stands for golden set images. After getting the results, save them to a csv file and upload it to the server where this Jupyter notebook runs. The csv file has a header line and two columns:

- Col1: Image Name (e.g., 0001.jpg)
- Col2: Image label (e.g., 1-Clear, 2-Almost Clear)


    select a.image_name as image_name, b.label as label
    from
    (
        select labeler, image_name, max(label_at) as latest_time
        from image_labels_new
        where label is not null and shared_image = 0
        group by labeler, image_name
    )a
    left outer join
    (
        select labeler, image_name, label, label_at 
        from image_labels_new
        where label is not null and shared_image = 0
    )b
    on a.labeler = b.labeler and a.image_name = b.image_name and a.latest_time = b.label_at
    order by image_name
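
Once the query result has been saved, the csv file can be read and the labels counted with the standard library. The sketch below uses an in-memory stand-in for the file. The rows are made up, and the header names are assumed to match the column aliases in the query (`image_name`, `label`).

```python
import csv
import io
from collections import Counter

# Stand-in for images_labels.csv; the rows are made up for illustration.
sample_csv = io.StringIO(
    "image_name,label\n"
    "0001.jpg,1-Clear\n"
    "0002.jpg,2-Almost Clear\n"
    "0003.jpg,1-Clear\n"
)

reader = csv.DictReader(sample_csv)  # the first line is consumed as the header
label_count = Counter(row["label"] for row in reader)
max_count = max(label_count.values())  # size of the dominating class
print(dict(label_count), max_count)
```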
    
### Python and Python libraries
- Python 3.5 or later
- shutil, PIL, random, cv2, numpy, scipy, copy

## Parameters

In [8]:
training_ratio = 0.7 # this ratio will be used to generate the mapping files, which will be used for CNTK models later on
root_dir = "../data" # root_dir/source_dir will be the directory with the image patches
# The subdirectory names have to be consistent with the image label names in the database
dirs = ["0-Not Acne", "1-Clear", "2-Almost Clear", "3-Mild", "4-Moderate", "5-Severe"]
source_dir = "images_patches"
# root_dir/dest_dir/dirs[i] will be the destination directory for rolled images belonging to the ith label
dest_dir = "rolled"
image_label_file_name = "images_labels.csv" # image label csv file name, assumed to be in root_dir

## Create directories if not existing and mapping files

In [9]:
import os
import random
from shutil import copyfile
from os import listdir
from os.path import join, isfile, splitext, basename
from PIL import Image
from random import randint
import numpy as np
import cv2
from scipy import misc
import copy


for class_dir in dirs: #create directories for classes of image patches if not existing
    os.makedirs(os.path.join(root_dir, dest_dir, class_dir), exist_ok=True)

# Open the mapping files only after root_dir/dest_dir exists; otherwise open() fails
mapping_train = os.path.join(root_dir, dest_dir, "mapping_train.txt") #mapping file of the training images
mapping_valid = os.path.join(root_dir, dest_dir, "mapping_valid.txt") #mapping file of the validation images
train_fp = open(mapping_train, 'w')
valid_fp = open(mapping_valid, 'w')

## Read file names in the skin patch original directory into a list

In [10]:
imageFiles = [f for f in listdir(join(root_dir, source_dir)) if isfile(join(root_dir, source_dir, f))]
print("There are %d files in the source dir %s"%(len(imageFiles), join(root_dir,source_dir)))

There are 5564 files in the source dir ../data\images_patches


## Define a function to get the indices of image patches in _imageFiles_ that belong to the same image ID

In [11]:
def find_index_of_images(imageFiles, imagename):
    # Indices of all files whose name contains imagename (substring match)
    return [i for i, f in enumerate(imageFiles) if imagename in f]
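
Note that the lookup is a substring match, so an image ID that is contained in a longer ID also matches. The sketch below shows this with made-up file names; the function is repeated so the snippet runs on its own. The allocation cell later compares the full patch name against `<image ID>_fh`, `<image ID>_lc`, and so on, which filters out such false matches.

```python
def find_index_of_images(imageFiles, imagename):
    # Same substring-based lookup as in the cell above.
    return [i for i, f in enumerate(imageFiles) if imagename in f]


# Made-up patch file names for illustration.
patch_files = ["0001_fh.jpg", "0001_lc.jpg", "10001_fh.jpg", "0002_chin.jpg"]

matches = find_index_of_images(patch_files, "0001")
print(matches)  # includes index 2, because "0001" is a substring of "10001_fh.jpg"

# Keeping only names that start with the exact ID removes the false match.
exact = [i for i in matches if patch_files[i].split("_")[0] == "0001"]
print(exact)
```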

## Allocate image patches into destination subdirectories based on labels given by dermatologists

In [12]:
label_result_file = join(root_dir, image_label_file_name) # Assuming that the image label file is in the root_dir
fp = open(label_result_file, 'r')
fp.readline() # skip the headerline
label_count = {}
max_count = 0

# Known issue: the class counts are taken from the image label file, which includes all non-golden set images.
# However, the image patches we roll and allocate come only from the selected images, and their class
# distribution differs from that of the label file.
# That is the reason why the classes are still not balanced after rolling. 
# Correcting this should have a positive impact on the model performance. 
# For now, we keep it as it is, but it needs to be fixed when we retrain the model. 
for row in fp: #Read the count of images in the image label file, and get the number of images of the dominating class
    row = row.strip().split(",")
    label = row[1]
    label_count[label] = label_count.get(label, 0) + 1 #DN: basically counting the number of observations for each label
    if max_count < label_count[label]: # Get the count of images in the dominating class
        max_count = label_count[label]
fp.close()
print(label_count) 
#DN: We got the max_count after this
fp = open(label_result_file, 'r') # Read the image label file again for allocating purpose
fp.readline()
random.seed(98052) # Set the random seed in order to reproduce the result. 

# This is the function that rolls an image patch and saves the rolled image patch as a jpg file in the destination directory
# img: image array before rolling (as returned by cv2.imread)
# dest_path: destination directory to save the rolled image patch
# file_name_wo_ext: file name without extension, e.g., the patch name 0001_fh
# image_names: a list of image paths and names. The new image path and name is appended to it
# x_or_y: 'x' or 'y'. It specifies whether to roll the image right to left ('x') or bottom to top ('y').
# pixels: number of pixels to roll in the direction specified by x_or_y.
# returns: image_names, with the new entry appended
def roll_and_save(img, dest_path, file_name_wo_ext, image_names, x_or_y, pixels):
    img_height, img_width = img.shape[0:2]
    img2 = copy.copy(img)
    if x_or_y == 'x':
        img2[:, 0:(img_width-pixels),:] = img[:,pixels:img_width,:]
        img2[:,(img_width-pixels):img_width,:] = img[:,0:pixels,:]
    else:
        img2[0:(img_height-pixels), :, :] = img[pixels:img_height, :, :]
        img2[(img_height-pixels):img_height, :,:] = img[0:pixels,:, :]
    img2 = cv2.cvtColor(img2, cv2.COLOR_BGR2RGB)        
    dest = join(dest_path, file_name_wo_ext+"_roll_"+x_or_y+"_"+str(pixels)+".jpg") #rolled image file name e.g., 0001_roll_x_112.jpg
    misc.imsave(dest, img2) 
    image_names.append(dest)
    return image_names

minimal_roll_times = 2 #Even the dominating class images need to roll twice

for row in fp: # go over for each row in the image label file
    rn = random.uniform(0, 1) # a random number determining whether this file goes to training or validation
    row = row.strip().split(",")
    file_name = row[0]
    label = row[1]
    file_name_wo_ext = splitext(file_name)[0] #get the image ID
    #DN: basically splitext will remove the .jpg or .png
    index = find_index_of_images(imageFiles, file_name_wo_ext) #find the image patches that belong to this image ID
    num_files_found = len(index) #number of image patches found. If the image is not in the selected image set, num_files_found=0
    image_names = []
    for i in range(num_files_found):
        source = join(root_dir, source_dir, imageFiles[index[i]])
        image_name_no_ext = splitext(imageFiles[index[i]])[0] #get the image patch name, e.g., 0001_fh, 0003_lc, etc.
        if file_name_wo_ext+"_fh" == image_name_no_ext: # forehead image patches, rolling right to left
            x_or_y = 'x'
        elif file_name_wo_ext+"_rc" == image_name_no_ext or file_name_wo_ext+"_lc" == image_name_no_ext or file_name_wo_ext+"_chin" == image_name_no_ext: #if cheeks, or chins, rolling bottom to top
            x_or_y = 'y'
        else:
            continue
        
        img = cv2.imread(source)
        img_height, img_width = img.shape[0:2]
        
        roll_ratio = float(max_count)/float(label_count[label]) # determining how many times to roll, in order to balance
        dest_path = join(root_dir, dest_dir, label) #destination path at the image class level
        
        image_names = roll_and_save(img, dest_path, image_name_no_ext, image_names, x_or_y, 0) #save the image without rolling 
        if roll_ratio > 1: # if this is not the dominating class, we need to roll in order to balance
            num_times = int(np.floor(roll_ratio) - 1)
        else:
            num_times = 0
        num_times += minimal_roll_times # adding the number of times that the dominating class is also rolling. 
        if num_times > 0: # determining the step size based on number of times to roll. We want a constant step size for each image
            # Integer floor division, equivalent to int(floor(size / (num_times + 1)))
            if x_or_y == 'x':
                step_size = img_width // (num_times + 1)
            else:
                step_size = img_height // (num_times + 1)
            for j in range(num_times):
                image_names = roll_and_save(img, dest_path, image_name_no_ext, image_names, x_or_y, step_size*(j+1))
    # Write the mapping entries once per image ID, after all of its patches have been rolled.
    # image_names accumulates over the image patches of the same image ID, so writing it inside the
    # patch loop would put duplicate lines in the mapping files.
    # (Neither the tensorflow models nor the CNTK models built so far read these mapping files.)
    if image_names: # skip images that have no patches in the source directory
        label_index = dirs.index(label) # dirs has 0-Not Acne in the list, so for 1-Clear images the label index in dirs is 1
        if label_index >= 1: # We do not model 0-Not Acne, where label_index = 0
            label_index -= 1 # shift so that 1-Clear maps to class 0
            # the entire rolled image set of this image goes to the same mapping file
            mapping_fp = train_fp if rn <= training_ratio else valid_fp
            for image_name in image_names:
                mapping_fp.write("%s\t%d\n"%(image_name, label_index))
fp.close()
train_fp.close()
valid_fp.close()

{'1-Clear': 984, '2-Almost Clear': 243, '3-Mild': 127, '4-Moderate': 69, '5-Severe': 58}


`imsave` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0.
Use ``imageio.imwrite`` instead.
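
A quick way to sanity-check a mapping file is to parse it back and look for repeated paths. The sketch below does this on made-up content in the `<path>\t<label index>` format written by the cell above; the helper name `parse_mapping` and the sample paths are hypothetical.

```python
import io
from collections import Counter

# Made-up mapping content; the third line deliberately repeats the first.
mapping_text = (
    "../data/rolled/1-Clear/0001_fh_roll_x_0.jpg\t0\n"
    "../data/rolled/1-Clear/0001_fh_roll_x_200.jpg\t0\n"
    "../data/rolled/1-Clear/0001_fh_roll_x_0.jpg\t0\n"
)


def parse_mapping(fp):
    """Return (path, label_index) tuples from a tab-separated mapping file."""
    entries = []
    for line in fp:
        line = line.rstrip("\n")
        if not line:
            continue
        path, label_index = line.rsplit("\t", 1)
        entries.append((path, int(label_index)))
    return entries


entries = parse_mapping(io.StringIO(mapping_text))
path_counts = Counter(path for path, _ in entries)
duplicates = [path for path, n in path_counts.items() if n > 1]
print(len(entries), duplicates)
```

To check a real file, pass an open file object for `mapping_train.txt` or `mapping_valid.txt` instead of the in-memory sample.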
