In [32]:
# Imports

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import sklearn.metrics as met

import time

import keras
from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.callbacks import EarlyStopping
from keras.utils import to_categorical


BASE_PROCESSED_DATA_DIR = '../data/processed'
"""
str: Base processed data directory
"""

PROCESSED_CSV_FILE = BASE_PROCESSED_DATA_DIR + '/processed.csv'
"""
str: HAM1000_metadata.csv metadata file location 
"""
        
# Read dataset in
skin_df = pd.read_csv(PROCESSED_CSV_FILE, index_col=0)
"""
pandas.core.frame.DataFrame: final dataset
"""

def printMetrics(prediction, y_test):
    """
    Prints accuracy, confusion and F1 metrics
    
    returns list of accuracy, confusion and F1 metrics
    """
    accuracy = met.accuracy_score(y_test, prediction)
    confusion = met.confusion_matrix(y_test, prediction)
    f1_score_avg = met.f1_score(y_test, prediction, average='weighted')
    f1_score = met.f1_score(y_test, prediction, average= None)

    print('accuracy', accuracy)
    print()
    print(confusion)
    print()
    print('f1 average: ', f1_score_avg)
    print('f1: ', f1_score)

    return([accuracy, confusion, f1_score_avg])

lesion_type_label = skin_df[
    ['lesion_type_idx', 'lesion_type']].sort_values(
    'lesion_type_idx').drop_duplicates()['lesion_type']
"""
pandas.core.series.Series: Lesion types (text) series sorted by idx for labels
"""

'\npandas.core.series.Series: Lesion types (text) series sorted by idx for labels\n'

## Report Introduction

This report documents the process of creating neural networks to model skin pigment diagnosis and the evaluation of said models (especially compared to other methods used in the previous analysis). A basic Sequential neural network will be created.


#### One Hot Encoding and Minor Manipulation

In [33]:
# encode categorical cols using one hot encoding

one_hot_localization = pd.get_dummies(skin_df['localization'])
one_hot_localization.drop('unknown', axis=1, inplace = True)

one_hot_sex = pd.get_dummies(skin_df['sex'])
one_hot_sex.drop('unknown', axis=1, inplace = True)

# Drop old categorical cols and replace with new ones
# drop dx type (not needed beyond data understanding)

skin_df.drop(['dx_type', 'localization', 'sex'], axis = 1, inplace = True)

# Join the encoded dfs

skin_df = skin_df.join(one_hot_localization)
skin_df = skin_df.join(one_hot_sex)

Using pandas dummies for categorical variables, localization values are one hot coded using new columns for every value (0 false / 1 true), however one of the columns is dropped since a negation of all the other columns represents it. Lastly, the now redundant sex and localization fields are dropped alongside dx_type (no need for analysing diagnosis type beyond Data Understanding).

#### Test Split and Scaling 

In [34]:
# Split the dataset into training and test data in a 50-50 split
# Don't include lesion_types (used for response) and image path (not used yet)

X_train, X_test, y_train, y_test = train_test_split(
    skin_df.drop(['lesion_type_idx', 'lesion_type'], axis=1),
    skin_df['lesion_type_idx'], test_size=0.5, random_state=0)

# scale using a partial fit for speed

scaling = StandardScaler()

scaling.partial_fit(X_test)
X_test = scaling.transform(X_test)

scaling.partial_fit(X_train)
X_train = scaling.transform(X_train)

  if sys.path[0] == '':
  del sys.path[0]
  from ipykernel import kernelapp as app
  app.launch_new_instance()


The training data and the testing data are separated using a 50-50 split respectively, both sets consist of a set of predictors (X) and a response (y). The predictor data has the lesion_type_idx and lesion_type fields removed since they can leak the ground truth. For the response data only the lesion_type_idx field is used since it is sufficient at representing the category of skin lesion (the response / what is being predicted).

To ensure that the impact of predictors is not effected by the measurement scale - which could occur in this dataset due to the variety of predictors - the predictors are scaled using a scaling transform (i.e. with default mean and standard deviation transform).

### Measurements

The computer used to carry the measurements has the following specifications:
* CPU: i7-7700HQ
* RAM: 8GB
* OS: Windows 10
* GPU: GTX 1060 (notebook)

To evaluate the model the following measurements are taken:
* Fit time: Using the time python library, a timer is started and stopped to measure tuning and fit .
* Prediction time: Using the time python library, a timer is started and stopped to measure the prediction.
* Confusion matrix: Using the sklearn metrics library a confusion matrix is printed.
* F1 Score: Using the sklearn metrics library a F1 score is calculated using weighted averages and for every class.

## Models Description and Assessments

### Neural Networks

Neural networks are machine learning tools that are inspired by human brain biology, relying on multiple connected neurons that interact with one another through multiple layers. Neural networks emphasize the importance of interaction between features more so than traditional models reliant on basic sums of coefficients attached to predictors.
				
#### Sequential Neural Network

##### Introduction 		

Sequential Neural Networks layers are sequentially attached meaning each layer's neurons only attach to the next layer's. 

##### Construction and Tuning 

The networks will use relu activation - but softmax at final layer - and Adam optimisation using categorical cross entropy loss. The model training will early stop if no improvement occurs after 3 iterations via early stopping callback.

To help reduce overfitting a dropout layer of proportion 0.1 will be added between layers (chooses a random subset of units in the layer to ignore in propagation steps).

To tune the networks multiple models of increasing complexity/capacity (various widths and/or layer counts) are created and the ones with the best performance are picked. 

In [41]:
# Need to separate targets
target = to_categorical(y_train)

In [42]:
def create_seq_model(hidden_layers, nodes_per_layer) :
    """
    creates a sequential neural networks with a set of hidden 
    layers (hidden_layers) followed by 0.1 drop out
    with a given number of nodes (nodes_per_layer)
    which is complied in a 0.3 validation split in 20 epochs (early stop 3)
    
    returns keras model
    """
    
    early_stopping_monitor = EarlyStopping(patience = 3)
    
    model = Sequential()
    
    for i in range(hidden_layers - 1) :
        if i == 0 :
            model.add(Dense(nodes_per_layer, activation = 'relu', input_shape = (X_train.shape[1],)))
        else :
            model.add(Dense(nodes_per_layer, activation = 'relu'))
        model.add(Dropout(0.10))

    
    model.add(Dense(7, activation = 'softmax'))

    model.compile(optimizer = 'adam', loss = 'categorical_crossentropy',
                   metrics = ['accuracy'])
    model.fit(X_train, target, validation_split = 0.3, epochs = 20,
               callbacks = [early_stopping_monitor])
    
    return model

In [43]:
nn_fit_start = time.time()
nn_model_1 = create_seq_model(1, 30)
nn_model_2 = create_seq_model(3, 85)
nn_model_3 = create_seq_model(5, 70)
nn_model_4 = create_seq_model(5, 150)
nn_fit_end = time.time()

Train on 3504 samples, validate on 1503 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Train on 3504 samples, validate on 1503 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Train on 3504 samples, validate on 1503 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Train on 3504 samples, validate on 1503 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20


In [44]:
nn_fit_time = nn_fit_end - nn_fit_start
print('Sequential NN fit time(seconds): ', nn_fit_time)

Sequential NN fit time(seconds):  47.84800863265991


A simple neuron network of 5 layers and 150 nodes per layer seems to performing in a manner satisfactory manner and seems to be very fast to tune with a dedicated GPU. But generally all networks performed quite similarly.

#### Assessment

In [45]:
# carry prediction with time measurements 
# while recording prediction

nn_pred_start = time.time()
prediction = nn_model_4.predict_classes(X_test)
nn_pred_end = time.time()

nn_met = printMetrics(prediction, y_test)
nn_pred_time = nn_pred_end - nn_pred_start
print('Sequential NN prediction time(seconds): ', nn_pred_time)

accuracy 0.7182507987220448

[[  37   36   53    0   23    7    1]
 [  32   92   64    1   66    1    3]
 [  26   25  270    0  190   23    3]
 [   4   19   20    0   18    2    1]
 [   7   31  145    1 3138   28    6]
 [   5   13  165    0  316   54    8]
 [   6   12    4    0   44    2    6]]

f1 average:  0.6817478659478878
f1:  [0.27007299 0.37782341 0.42925278 0.         0.87763949 0.15929204
 0.11764706]
Sequential NN prediction time(seconds):  0.7041149139404297


It seems the a sequential neural network doesn't actually perform better than logistic regression while being slightly slower to fit. Perhaps an approach were we do not look at the picture pixel by pixel would help.

### Summary of Model Assessment



It seems that a simple sequential neural network doesn't cut it for this dataset still, this is probably because of the way the data is handled. The neural network still has to work in a pixel by pixel manner due to the dataset and its nature. Perhaps a convolutional neural network can resolve this by using kernels to identify features instead of thinking a pixel by pixel manner, however this will need the pipeline to be reorganised to load the images instead of pixel columns.