## Build transfer learning model based on the rolled skin patches

This is Step 3: training the deep learning model. In this step, we use a pre-trained deep learning model in CNTK, [ResNet-152 model](https://github.com/Microsoft/CNTK/tree/master/Examples/Image/Classification/ResNet), to extract features from the training images, and then train a fully connected neural network on these features so that the overall model is specific to the acne severity classification domain. The features are extracted from the last pooling layer (`pool5`) of the pretrained model. The trained fully connected neural network is stored for use in the scoring pipeline.

There are two steps here:
* Extract features from the CNTK pretrained model
 -   Specify the model name, the name of the layer to extract from, and its dimension (model_name, node_name, num_nodes)

* Build fully connected layers on top of the extracted features:
 -   Build a neural network regression/classification model.

***Note***: This step may take around 1 hour to complete: around 20 minutes to extract features from the skin patch images with the pretrained ResNet-152 model, and around 30 minutes to train the fully connected neural network on those features. Actual times depend on hardware; the run recorded in this notebook took about 11,317 seconds for feature extraction.




## Prerequisites

Here are the prerequisites to run this Jupyter notebook:

### Skin patch images
- The skin patch images need to be stored in subdirectories corresponding to the labels of their original selfie images. Completing [Step 2. Roll the skin patches and balance classes of images, and move the rolled skin patches to directories based on image labels](../01_DataPre/Step 2. Roll the skin patches and balance classes of images, and move the rolled skin patches to directories based on image labels.ipynb) satisfies this prerequisite. The subdirectories also contain rolled skin patches so that the training data covers as many of the possible locations of acne lesions as possible.
    
### Python and Python libraries
- Python 3.5 or later
- CNTK, PIL, NumPy, pandas, scikit-learn

### CNTK pre-trained model
You need to download the pretrained model to your machine.
   - [ResNet152\_ImageNet\_Caffe.model](https://www.cntk.ai/Models/Caffe_Converted/ResNet152_ImageNet_Caffe.model)
   - You can also choose other pretrained models, such as an Inception model or ResNet-50.
   - To list the name and output shape of each layer of the pretrained model (from which num_nodes can be computed), run the following code, where base_model_file is the path to the downloaded model file:

           import cntk as C
           node_outputs = C.logging.get_node_outputs(C.load_model(base_model_file))
           for l in node_outputs:
               print("  {0} {1}".format(l.name, l.shape))


## Parameters

Change the following parameters to tell the Jupyter notebook where the pretrained model is located, which layer of the pretrained model to extract features from, and where the skin patch images are stored in their label subdirectories.

Later on, the notebook will load the model from ***pretrained\_model\_path\\\\pretrained\_model\_name*** and read the Clear skin patch images from ***data\_path\\\\1-Clear***, and so on for the other labels.

The image\_height and image\_width should match the input size required by the pretrained model. In this work, the pretrained ResNet-152 model requires 224-by-224 input images. Some other pretrained models may be flexible about the input size. However, pretrained models usually require square input images, i.e., image\_width = image\_height.

We also specify that 80% of the images are used for training and the remaining 20% for validation.
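
The split is applied per label directory, so each class contributes roughly 80% of its images to training. The following is a minimal sketch of how the counts work out, using the same rounding as `merge_datasets` later in this notebook. The per-class image counts are hypothetical and only serve as an illustration.

```python
# Hypothetical per-class image counts; the real counts depend on your data.
class_counts = {'1-Clear': 1000, '2-Almost Clear': 1003, '3-Mild': 998}
train_ratio = 0.8

def split_counts(n_images, ratio):
    """Return (n_train, n_valid) using the same rounding as merge_datasets."""
    n_train = int(round(float(n_images) * ratio))
    return n_train, n_images - n_train

splits = {name: split_counts(n, train_ratio) for name, n in class_counts.items()}
for name, (n_train, n_valid) in splits.items():
    print(name, n_train, n_valid)
```

Because the training count is rounded per class, the overall ratio can deviate slightly from 0.8 when class sizes are not multiples of 5.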

The parameter ***random\_seed*** controls the randomness of the data split and of the initialization of the weight matrices of the fully connected neural network. Setting ***random\_seed=5*** can result in ***RMSE=0.4819*** on the golden set images. Other random seeds might result in a higher RMSE on the golden set images.
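
To illustrate why a fixed seed makes the data split reproducible, here is a small sketch that seeds NumPy's global generator before shuffling, in the same way `merge_datasets` does. The tiny `features` array and the helper `shuffled_copy` are hypothetical stand-ins, not part of the notebook.

```python
import numpy as np

random_seed = 5
# Stand-in for a (num_images, num_nodes) feature matrix
features = np.arange(10).reshape(5, 2)

def shuffled_copy(data, seed):
    """Shuffle the rows of a copy of `data` after seeding NumPy's global generator."""
    np.random.seed(seed)
    out = data.copy()
    np.random.shuffle(out)  # shuffles along the first axis (rows), in place
    return out

first = shuffled_copy(features, random_seed)
second = shuffled_copy(features, random_seed)
print(np.array_equal(first, second))
```

The same seed yields the same row order on every run, so the same images end up in the training and validation sets.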

In [14]:
pretrained_model_name = 'ResNet152_ImageNet_Caffe.model'
pretrained_model_path = '../models'
pretrained_node_name = 'pool5' 

img_dirs = ['1-Clear', '2-Almost Clear', '3-Mild', '4-Moderate', '5-Severe'] # image labels
data_path = '../data/rolled' # image data source

image_height = 224 # the height of the resized image
image_width  = 224 # the width of the resized image
num_channels = 3 # an RGB image has three channels
random_seed = 5
train_ratio = 0.8 # this ratio is used for training and validation in the following models


## Load pretrained model

In [15]:
from __future__ import print_function
import os
import numpy as np
import pandas as pd
import cntk as C
from PIL import Image
import pickle
import time
from cntk import load_model, combine
import cntk.io.transforms as xforms
from cntk.logging import graph
from cntk.logging.graph import get_node_outputs

# Create a directory 'pickle' to store the extracted features. The features of all
# image patches in each label directory are dumped into a single pickle file.
picklefolder_path = os.path.join(data_path, 'pickle')
if not os.path.exists(picklefolder_path):
    os.mkdir(picklefolder_path)

output_path = '../models'
if not os.path.exists(output_path):
    os.mkdir(output_path)
    
regression_model_path = os.path.join(output_path, 'cntk_regression.dat')

In [16]:
# define pretrained model location, node name
model_file  = os.path.join(pretrained_model_path, pretrained_model_name)
loaded_model  = load_model(model_file) # load the pretrained ResNet-152 model.
node_in_graph = loaded_model.find_by_name(pretrained_node_name) #find the node name in the pretrained ResNet-152 model
output_nodes  = combine([node_in_graph.owner])

node_outputs = C.logging.get_node_outputs(loaded_model)
for l in node_outputs: 
    if l.name == pretrained_node_name:
        num_nodes = np.prod(np.array(l.shape))
        
print ('the pretrained model is %s' % pretrained_model_name)
print ('the selected layer name is %s and the number of flatten nodes is %d' % (pretrained_node_name, num_nodes))


the pretrained model is ResNet152_ImageNet_Caffe.model
the selected layer name is pool5 and the number of flatten nodes is 2048
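
The value of `num_nodes` is the product of the dimensions of the selected node's output shape. As a sketch, assuming the `pool5` node outputs 2048 channels over a 1x1 spatial map (an assumption consistent with the count of 2048 printed above), the computation looks like this:

```python
import numpy as np

# Assumed output shape of the `pool5` node: 2048 channels, 1x1 spatial map
pool5_shape = (2048, 1, 1)

# Same computation as in the cell above: product of all dimensions
num_nodes = int(np.prod(np.array(pool5_shape)))
print(num_nodes)

# Flattening an output of this shape gives one feature vector per image
feature_vector = np.zeros(pool5_shape, dtype=np.float32).flatten()
print(feature_vector.shape)
```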


## Extract features for each class of images and save them into pickle files

In [17]:
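def extract_features(image_path):
    # Load as 3-channel RGB so that grayscale or RGBA files also work, then resize to the model input size
    img = Image.open(image_path).convert('RGB')
    resized = img.resize((image_width, image_height), Image.LANCZOS)  # LANCZOS is the filter formerly named ANTIALIAS

    # Reorder the channels from RGB to BGR, then move the channel axis first (HWC -> CHW)
    bgr_image = np.asarray(resized, dtype=np.float32)[..., [2, 1, 0]]
    chw_format = np.ascontiguousarray(np.rollaxis(bgr_image, 2))

    arguments = {loaded_model.arguments[0]: [chw_format]}
    output = output_nodes.eval(arguments)  # evaluate the pretrained model up to the selected node
    return output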

def maybe_pickle(folder_path):
    # List the image files directly under folder_path once and reuse the list
    file_names = next(os.walk(folder_path))[2]
    dataset = np.ndarray(shape=(len(file_names), num_nodes),
                         dtype=np.float16)
    for num_image, file in enumerate(file_names):
        image_path = os.path.join(folder_path, file)
        dataset[num_image, :] = extract_features(image_path)[0].flatten()

    # Name the pickle file after the label directory; basename works with any path separator
    pickle_filename = os.path.basename(folder_path) + '.pickle'
    pickle_filepath = os.path.join(picklefolder_path, pickle_filename)
    if os.path.isfile(pickle_filepath):
        os.remove(pickle_filepath)
    with open(pickle_filepath, 'wb') as f:
        pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)

    return pickle_filename
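
The array manipulation inside `extract_features` can be checked on its own. The sketch below applies the same two NumPy steps to a small synthetic image (a hypothetical 2x3 array instead of a real 224x224 skin patch) to show the resulting channel order and axis layout.

```python
import numpy as np

# Synthetic 2x3 RGB image in height-width-channel (HWC) order
height, width = 2, 3
rgb_hwc = np.zeros((height, width, 3), dtype=np.float32)
rgb_hwc[..., 0] = 10.0  # red
rgb_hwc[..., 1] = 20.0  # green
rgb_hwc[..., 2] = 30.0  # blue

# Swap the channel order from RGB to BGR
bgr_hwc = rgb_hwc[..., [2, 1, 0]]
# Move the channel axis to the front: HWC -> CHW, stored contiguously
bgr_chw = np.ascontiguousarray(np.rollaxis(bgr_hwc, 2))

print(bgr_chw.shape)     # channels first
print(bgr_chw[:, 0, 0])  # blue, green, red values of the top-left pixel
```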

In [18]:
# Here, we go over each subdirectory corresponding to each label, and dump the data of all images in each 
# subdirectory into a single pickle file
start_time = time.time()

pickle_names = []
    
for f in img_dirs:
    folder_path = os.path.join(data_path, f)
    pickle_names.append(os.path.join(picklefolder_path, maybe_pickle(folder_path)))  # store the pickle file name in pickle_names

print("It takes %s seconds to extract features from skin patch images and dump to pickle files." % (time.time() - start_time))

It takes 11317.181208610535 seconds to extract features from skin patch images and dump to pickle files.
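
Note that `maybe_pickle` stores the features as `float16`, which halves the size of the pickle files at the cost of some precision, and `merge_datasets` later copies them into `float32` arrays. The following sketch measures that round trip on random stand-in values in [0, 1); the real `pool5` activations may have a different range, so the error shown here is only indicative.

```python
import pickle
import numpy as np

rng = np.random.RandomState(0)
# Stand-in for extracted features of 4 images
features32 = rng.rand(4, 2048).astype(np.float32)

# Store with the same dtype as the array in maybe_pickle
stored = features32.astype(np.float16)
payload = pickle.dumps(stored, pickle.HIGHEST_PROTOCOL)

# Load and convert back to float32, as happens when merging the datasets
restored = pickle.loads(payload).astype(np.float32)

max_abs_error = float(np.max(np.abs(restored - features32)))
print(stored.nbytes, features32.nbytes)
print(max_abs_error)
```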


In [19]:
# This function merges the per-label pickle files into one training set and one validation set.
def merge_datasets(pickle_files, train_ratio):
    num_classes = len(pickle_files)
    num_datasets = [0]*num_classes
    for i in range(num_classes):
        with open(pickle_files[i], 'rb') as f:
            load_data = pickle.load(f)
            num_datasets[i] = load_data.shape[0]
            
    total_datasets = np.sum(num_datasets)
    
    num_train = [int(round(float(x)*train_ratio)) for x in num_datasets]
    num_valid = np.array(num_datasets) - np.array(num_train)
   
    total_train = np.sum(num_train)
    train_dataset = np.ndarray((total_train, num_nodes), dtype=np.float32)
    train_labels = np.ndarray(total_train, dtype=np.int32)  
    
    total_valid = np.sum(num_valid)
    valid_dataset = np.ndarray((total_valid, num_nodes), dtype=np.float32)
    valid_labels = np.ndarray(total_valid, dtype=np.int32)  
    
    start_trn, start_val = 0, 0
    # the first pickle file in the list is labeled 1, the second 2, etc.
    np.random.seed(seed=random_seed)
    for label, pickle_file in enumerate(pickle_files):  
        print (label+1)
        print (pickle_file)
        try:
            with open(pickle_file, 'rb') as f:
                data_set = pickle.load(f)
                np.random.shuffle(data_set) #shuffle the data in each pickle file
                
                train_data = data_set[0:num_train[label], :] # the first batch goes to training data
                train_dataset[start_trn:(start_trn+num_train[label]), :] = train_data
                train_labels[start_trn:(start_trn+num_train[label])] = label+1
                start_trn += num_train[label]
                
                valid_data = data_set[num_train[label]:num_datasets[label], :]
                valid_dataset[start_val:(start_val+num_valid[label]), :] = valid_data
                valid_labels[start_val:(start_val+num_valid[label])] = label+1
                start_val += num_valid[label]

        except Exception as e:
            print('Unable to process data from', pickle_file, ':', e)
            raise   
            
    return train_dataset, train_labels, valid_dataset, valid_labels

In [20]:
# merge all dataset together and divide it into training and validation
train_dataset, train_labels, valid_dataset, valid_labels = merge_datasets(pickle_names, train_ratio)
print('Training:', train_dataset.shape, train_labels.shape)
print('Validation:', valid_dataset.shape, valid_labels.shape)

1
../data/rolled\pickle\1-Clear.pickle
2
../data/rolled\pickle\2-Almost Clear.pickle
3
../data/rolled\pickle\3-Mild.pickle
4
../data/rolled\pickle\4-Moderate.pickle
5
../data/rolled\pickle\5-Severe.pickle
Training: (22656, 2048) (22656,)
Validation: (5664, 2048) (5664,)
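
As a quick sanity check, the shapes printed above can be compared against `train_ratio`. This is plain arithmetic on the reported numbers:

```python
# Row counts taken from the shapes printed above
n_train, n_valid = 22656, 5664

total = n_train + n_valid
observed_ratio = n_train / float(total)
print(total, observed_ratio)
```

The observed training fraction matches the configured value of 0.8.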


## Add additional layers and train the regression model

In [21]:
# add regression model which has three hidden layers (1024, 512, 256).
# It may take around 30 minutes to train the model. 
# Default hyperparameters are used here:
# L2 penalty: 0.0001
# Solver: adam
# batch_size: 'auto', = min(200, n_samples) = 200 since n_samples > 200
# learning_rate: 'constant'
# learning_rate_init: 0.001
# max_iter: 200 (for the adam solver, the maximum number of epochs)
# verbose: False. Turn it to True if you want to see the training progress.
from sklearn.neural_network import MLPRegressor
clf_regr = MLPRegressor(hidden_layer_sizes=(1024, 512, 256), activation='relu', random_state=random_seed)
clf_regr.fit(train_dataset, train_labels) #Start training the regression model using the training data

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(1024, 512, 256), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=5, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)
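
The effect of `random_state` on the regressor can be seen on a much smaller problem. The sketch below is not the notebook's model: it uses synthetic 16-dimensional features, synthetic labels in 1 to 5, smaller hidden layers, and a short `max_iter` so that it runs in seconds. The helper `fit_predict` is hypothetical.

```python
import warnings
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 16)          # stand-in for (num_images, 2048) features
y = 1 + np.floor(X[:, 0] * 5)  # synthetic severity labels in {1, ..., 5}

def fit_predict(seed):
    """Train a small MLP regressor with the given seed and predict on X."""
    model = MLPRegressor(hidden_layer_sizes=(32, 16), activation='relu',
                         random_state=seed, max_iter=50)
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')  # the short run may not converge
        model.fit(X, y)
    return model.predict(X)

pred_a = fit_predict(5)
pred_b = fit_predict(5)
print(np.allclose(pred_a, pred_b))
```

With the same seed, weight initialization and batch shuffling are identical, so both runs give the same predictions on the same machine.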

## Predict the validation dataset, and calculate the RMSE on the validation dataset

In [22]:
# Predict the labels of images in the validation dataset
pred_labels_regr = clf_regr.predict(valid_dataset)

In [23]:
# Calculate RMSE on the validation dataset
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse_regr = sqrt(mean_squared_error(valid_labels, pred_labels_regr))  # arguments are (y_true, y_pred)
print ('the RMSE of regression NN is %f' % rmse_regr)

the RMSE of regression NN is 0.335607
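
RMSE is the square root of the mean squared difference between predicted and true labels. As an illustration, here is the same metric computed directly with NumPy on a handful of hypothetical labels and predictions (not taken from the validation set):

```python
from math import sqrt
import numpy as np

# Hypothetical true severity labels and regression outputs
true_labels = np.array([1, 2, 3, 4, 5], dtype=np.float64)
pred_labels = np.array([1.5, 2.0, 2.5, 4.0, 6.0])

squared_errors = (pred_labels - true_labels) ** 2
rmse = sqrt(np.mean(squared_errors))
print(rmse)
```

Since the labels are the integers 1 to 5, an RMSE below 0.5 means the predictions are, on average, less than half a severity level away from the true label.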


## Save the trained regression model to be used in the scoring pipeline.

In [24]:
# Store regression model
regr_model = pickle.dumps(clf_regr)
regression_store = pd.DataFrame({"model": [regr_model]})
regression_store.to_pickle(regression_model_path)
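
A scoring pipeline would need to reverse these two steps: read the DataFrame back and unpickle the stored bytes. The sketch below shows one way this might look, using a temporary directory and a plain dictionary as a hypothetical stand-in for `clf_regr` so that it runs without the trained model.

```python
import os
import pickle
import tempfile
import pandas as pd

# Placeholder for the trained regressor (clf_regr in this notebook)
stand_in_model = {'hidden_layer_sizes': (1024, 512, 256)}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'cntk_regression.dat')

    # Save: same pattern as the cell above
    pd.DataFrame({'model': [pickle.dumps(stand_in_model)]}).to_pickle(model_path)

    # Load: read the DataFrame back and unpickle the stored bytes
    stored = pd.read_pickle(model_path)
    restored_model = pickle.loads(stored['model'][0])

print(restored_model == stand_in_model)
```

Pickle files can execute arbitrary code when loaded, so only load model files from a trusted source.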