Team Members:

* Deepan Chakravarthi Padmanabhan
* Jaswanth Bandlamudi
* Muhammad Umer Ahmed Khan

## Data:

1. Files used for compiling the positive samples (living skin): 'Archiv/Referenz-Haut_6-Klassen.csv' and 'Archiv2016/2016skin.csv'.
2. Files used for compiling the negative samples (non-living skin): 'Archiv2016/2016material.csv', 'Archiv2016/2016material-fake.csv', 'Archiv/Fleisch.csv', 'Archiv/Stoff.csv', 'Archiv/Holz.csv', and 'Archiv/Leder.csv'.

## Process flow:

$$\text{Data: Raw data provided in Archiv and Archiv 2016} \\ \downarrow \\ \text{Data cleaning: Features with NaN values are removed in each file} \\ \downarrow \\ \text{Visualization: Plot spectral data of each material provided} \\ \downarrow \\ \text{Combined files using common features only. Subsampled Archiv 2016 and dropped 400-660 from Archiv} \\ \downarrow \\ \text{Feature extraction: PCA, 5 features extracted and normalized} \\ \downarrow \\ \text{Compiled positive and negative samples with labels} \\ \downarrow \\ \text{Shuffled data} \\ \downarrow \\ \text{Split train-test data (67\%-33\%)} \\ \downarrow \\ \text{Classification (Training): } \textbf{SVM-Linear Kernel, SVM-RBF Kernel, Multi-Layer Perceptron, Random Forest} \\ \downarrow \\ \text{Classifier testing and validation} \\ \downarrow \\ \text{Visualize classification metrics for comparison: } \textbf{Accuracy, Precision, Recall, Model memory size, Training time}$$

##### The cross-validated mean roc_auc score is reported for each classifier. #####
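
For orientation, ROC AUC can be read as the probability that a randomly chosen sample with label 1 receives a higher score than a randomly chosen sample with label 0, with ties counted as one half. The plain-Python sketch below illustrates that definition on made-up scores. The helper `pairwise_auc` is hypothetical and is not how scikit-learn computes the metric internally.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (label 1, label 0) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # pair ranked correctly
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels and decision scores, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
print(auc)  # 3 of the 4 pairs are ranked correctly -> 0.75
```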

## Results and discussion:

##### Provided at the end of the notebook #####

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import sklearn
from sklearn.model_selection import cross_val_score, train_test_split, learning_curve, ShuffleSplit
from sklearn.metrics import accuracy_score, classification_report 
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn import decomposition, model_selection
from sklearn import preprocessing
from sklearn.utils import shuffle
from sklearn import svm
from sklearn.neural_network import MLPClassifier

import warnings
warnings.filterwarnings('ignore')

import time
import sys

import keras
from keras.layers import Input, Dense
from keras.models import Model, Sequential
from keras.optimizers import Adam
from sklearn.utils.class_weight import compute_class_weight
import matplotlib.patches as patches

from memory_profiler import profile
%load_ext memory_profiler

# Memory profiler


In [None]:
# Read data
def read_data(fileName, decimal_):
    data = pd.read_csv(fileName, decimal = decimal_, delimiter=';', encoding='utf8')
    return data

In [None]:
# Reading all data files

data_material = read_data('Archiv2016/2016material.csv', decimal_='.')
data_fake_material = read_data('Archiv2016/2016material-fake.csv', decimal_='.')
data_skin = read_data('Archiv2016/2016skin.csv', decimal_='.')
data_reference = read_data('Archiv2016/Referenz-Haut_6-Klassen.csv', decimal_=',')

data_flesh = read_data('Archiv/Fleisch.csv', decimal_=',')
data_stoff = read_data('Archiv/Stoff.csv', decimal_=',')
data_holz = read_data('Archiv/Holz.csv', decimal_=',')
data_leder = read_data('Archiv/Leder.csv', decimal_=',')
data_2reference = read_data('Archiv/Referenz-Haut_6-Klassen.csv', decimal_=',')


In [None]:
# Drop rows (wavelengths) that contain NaN values

data_material = data_material.dropna()
data_fake_material = data_fake_material.dropna()
data_skin = data_skin.dropna()
data_reference = data_reference.dropna()
data_flesh = data_flesh.dropna()
data_stoff = data_stoff.dropna()
data_holz = data_holz.dropna()
data_leder = data_leder.dropna()
data_2reference = data_2reference.dropna()


In [None]:
# Plotting spectra
def plot_spectra(data, title):  
    x = list(data.iloc[:,0])
    columns = data.columns
    plt.figure(figsize=(12,8))
    for i in columns[1:]:
        plt.plot(x,data[i])
    plt.xlabel('Wavelength (nm)')
    plt.ylabel('Intensity')
    plt.title(title)
    plt.show()

In [None]:
plot_spectra(data_material,'2016_material')
plot_spectra(data_fake_material,'2016_fake_material')
plot_spectra(data_skin,'2016_skin')
plot_spectra(data_reference,'2016_reference')

plot_spectra(data_flesh,'flesh')
plot_spectra(data_stoff,'stoff')
plot_spectra(data_holz,'holz')
plot_spectra(data_leder,'leder')
plot_spectra(data_2reference, 'archiv skin')

Notes:

1. Referenz-Haut_6-Klassen.csv is identical in Archiv and Archiv 2016.

2. As the plots above show, the materials were measured over different wavelength ranges, so the number of features (along the x-axis) differs between files. A common feature representation can be obtained with methods such as imputation, feature splitting, or dimensionality reduction [1].
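
One simple way to obtain a common feature representation is to keep only the wavelengths that were measured in every file. The sketch below shows this idea in plain Python. It assumes each file is represented as a mapping from wavelength to a list of intensities; the function `restrict_to_common_wavelengths` and the toy values are hypothetical and not part of this notebook's pipeline.

```python
def restrict_to_common_wavelengths(*spectra):
    """Keep only the wavelengths that occur in every spectrum table.

    Each argument is a dict mapping a wavelength (nm) to a list of
    intensities, one entry per measured sample.
    """
    common = set(spectra[0])
    for table in spectra[1:]:
        common &= set(table)          # intersect the wavelength grids
    ordered = sorted(common)
    return [{w: table[w] for w in ordered} for table in spectra]


# Toy data: two "files" measured on different wavelength grids.
archiv = {400: [0.1, 0.2], 700: [0.3, 0.4], 710: [0.5, 0.6]}
archiv_2016 = {700: [0.7], 705: [0.8], 710: [0.9]}

archiv_c, archiv_2016_c = restrict_to_common_wavelengths(archiv, archiv_2016)
print(list(archiv_c))  # wavelengths shared by both tables
```

Intersecting the grids discards information outside the shared range, which is why imputation or dimensionality reduction may be preferable when the overlap is small.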

In [None]:
def normalize(df):
    x = df.values #returns a numpy array
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    df = pd.DataFrame(x_scaled)
    return df
    

In [None]:
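def PCA_(X, k):
    # Samples are stored as columns, so transpose to (n_samples, n_wavelengths)
    # before projecting each sample onto the first k principal components
    pca = decomposition.PCA(n_components=k)
    pca_data  = pca.fit_transform(X.T)
    return pca_data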
    

In [None]:
data_material = normalize(data_material.iloc[:,1:])
data_fake_material = normalize(data_fake_material.iloc[:,1:])
data_flesh = normalize(data_flesh.iloc[:,1:])
data_stoff = normalize(data_stoff.iloc[:,1:])
data_holz = normalize(data_holz.iloc[:,1:])
data_leder = normalize(data_leder.iloc[:,1:])

data_skin = normalize(data_skin.iloc[:,1:])
data_reference = normalize(data_reference.iloc[:,1:])
data_2reference = normalize(data_2reference.iloc[:,1:])

In [None]:
data_material.shape

In [None]:
# Obtaining PCA data. 

data_material_pca = PCA_(data_material, 5)
data_fake_material_pca = PCA_(data_fake_material, 5)
data_flesh_pca = PCA_(data_flesh, 5)
data_stoff_pca = PCA_(data_stoff, 5)
data_holz_pca = PCA_(data_holz, 5)
data_leder_pca = PCA_(data_leder, 5)

data_skin_pca = PCA_(data_skin, 5)
data_reference_pca = PCA_(data_reference, 5)
data_2reference_pca = PCA_(data_2reference, 5)

In [None]:
data_material_pca.shape

In [None]:
def normalize_array(x):
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    return x_scaled

In [None]:

data_material_pca = normalize_array(data_material_pca)
data_fake_material_pca = normalize_array(data_fake_material_pca)
data_flesh_pca = normalize_array(data_flesh_pca)
data_stoff_pca = normalize_array(data_stoff_pca)
data_holz_pca = normalize_array(data_holz_pca)
data_leder_pca = normalize_array(data_leder_pca)

data_skin_pca = normalize_array(data_skin_pca)
data_reference_pca = normalize_array(data_reference_pca)
data_2reference_pca = normalize_array(data_2reference_pca)

In [None]:
# Labelling data 

def get_data_labels_pca(data, label_type):
    # Append a label column holding label_type to every sample (row)
    n_samples = data.shape[0]
    label_value = np.full((n_samples, 1), label_type, dtype=int)
    data = np.append(data, label_value, axis=1)
    return data
    

In [None]:
# label 1 - non living materials
# label 0 - living skin

# Compiling dataset
data_material_train = get_data_labels_pca(data_material_pca, 1)
data_fake_material_train = get_data_labels_pca(data_fake_material_pca, 1)
data_flesh_train = get_data_labels_pca(data_flesh_pca, 1)
data_stoff_train = get_data_labels_pca(data_stoff_pca, 1)
data_holz_train = get_data_labels_pca(data_holz_pca, 1)
data_leder_train = get_data_labels_pca(data_leder_pca, 1)

data_skin_train = get_data_labels_pca(data_skin_pca, 0)
data_reference_train = get_data_labels_pca(data_reference_pca, 0)
data_2reference_train = get_data_labels_pca(data_2reference_pca,0)

In [None]:
pos_data = np.append(data_skin_train, data_reference_train, axis=0) # Skin
# Append all negative data together
neg_data = np.append(data_material_train, data_fake_material_train, axis=0) # Non skin

In [None]:
pos_data_X = pos_data[:,0:-1]
pos_data_y = pos_data[:,-1]

In [None]:
neg_data_X = neg_data[:,0:-1]
neg_data_y = neg_data[:,-1]

In [None]:
X = np.append(pos_data_X, neg_data_X, axis=0)
y = np.append(pos_data_y, neg_data_y, axis=0)


The features X and labels y are now ready for training the classifiers.

## SVM

In [None]:
# Shuffling the dataset
X, y = shuffle(X, y)

# Splitting train-test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=66)

In [None]:
# SVM Linear
svm_linear = svm.SVC(C=10, kernel='linear', gamma='auto')
start = time.time()
%memit svm_linear.fit(X_train,y_train)
end = time.time()
svm_linear_time = end - start
print("Training time taken for SVM-Linear kernel (in seconds): ",end - start)

In [None]:
def svm_linear_fun(X_train,y_train):
    svm_linear = svm.SVC(C=10, kernel='linear', gamma='auto')
    start = time.time()
    svm_linear.fit(X_train,y_train)
    end = time.time()
    svm_linear_time = end - start
    print("Training time taken for SVM-Linear kernel (in seconds): ",end - start)
    
%memit svm_linear_fun(X_train,y_train)

In [None]:
# prediction
%memit svm_predict = svm_linear.predict(X_test)

print("=== Confusion Matrix SVM-Linear Kernel Classifier===")
print(confusion_matrix(y_test, svm_predict))
print('\n')

print("=== Classification Report SVM-Linear Kernel Classifier===")
print(classification_report(y_test, svm_predict))
print('\n')

print("=== Accuracy SVM-Linear Kernel Classifier===")
svm_linear_accuracy = accuracy_score(y_test, svm_predict)
print(svm_linear_accuracy)
print('\n')

print("=== Precision SVM-Linear Kernel Classifier===")
svm_linear_precision = precision_score(y_test, svm_predict)
print(svm_linear_precision)
print('\n')


print("=== Recall SVM-Linear Kernel Classifier===")
svm_linear_recall = recall_score(y_test, svm_predict)
print(svm_linear_recall)
print('\n')

In [None]:
svm_linear_cv = svm.SVC(C=10, kernel='linear', gamma='auto')
start = time.time()
svm_linear_cv_score = cross_val_score(svm_linear_cv, X, y, cv=5, scoring='roc_auc')
end = time.time()
print("=== All AUC Scores SVM-Linear Classifier===")

print(svm_linear_cv_score)
print('\n')

print("=== Mean AUC Score SVM-Linear Kernel Classifier: ===")
print(svm_linear_cv_score.mean())
print('\n')

print("Time taken for cross validation SVM-Linear kernel (in seconds): ",end - start)

# Using Archiv and Archiv 2016 folder data

The entire dataset is used: the Archiv negatives (flesh, fabric, wood, leather) are added to the Archiv 2016 data.

In [None]:
neg_data = np.append(neg_data, data_flesh_train, axis=0)
neg_data = np.append(neg_data, data_stoff_train, axis=0)
neg_data = np.append(neg_data, data_holz_train, axis=0)
neg_data = np.append(neg_data, data_leder_train, axis=0)

neg_data_X = neg_data[:,0:-1]
neg_data_y = neg_data[:,-1]

X = np.append(pos_data_X, neg_data_X, axis=0)
y = np.append(pos_data_y, neg_data_y, axis=0)

In [None]:
min_max_scaler = preprocessing.MinMaxScaler()
X = min_max_scaler.fit_transform(X)

# Shuffling the dataset
X, y = shuffle(X, y)

# Splitting train-test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=66)

In [None]:
svm_linear = svm.SVC(C=10, kernel='linear')
start = time.time()
%memit svm_linear.fit(X_train,y_train)
end = time.time()
svm_linear_time = end - start
print("Training time taken for SVM-Linear Kernel (in seconds): ",svm_linear_time)

In [None]:
def svm_linear_fun(X_train,y_train):
    svm_linear = svm.SVC(C=10, kernel='linear', gamma='auto')
    start = time.time()
    svm_linear.fit(X_train,y_train)
    end = time.time()
    svm_linear_time = end - start
    print("Training time taken for SVM-Linear kernel (in seconds): ",end - start)
    
%memit svm_linear_fun(X_train,y_train)

In [None]:
# predictions
%memit svm_predict = svm_linear.predict(X_test)

print("=== Confusion Matrix SVM-Linear Kernel Classifier===")
print(confusion_matrix(y_test, svm_predict))
print('\n')

print("=== Classification Report SVM-Linear Kernel Classifier===")
print(classification_report(y_test, svm_predict))
print('\n')

print("=== Accuracy SVM-Linear Kernel Classifier===")
svm_linear_accuracy = accuracy_score(y_test, svm_predict)
print(svm_linear_accuracy)
print('\n')

print("=== Precision SVM-Linear Kernel Classifier===")
svm_linear_precision = precision_score(y_test, svm_predict)
print(svm_linear_precision)
print('\n')


print("=== Recall SVM-Linear Kernel Classifier===")
svm_linear_recall = recall_score(y_test, svm_predict)
print(svm_linear_recall)
print('\n')

In [None]:
svm_linear_cv = svm.SVC(C=10, kernel='linear')
svm_linear_cv_score = cross_val_score(svm_linear_cv, X, y, cv=5, scoring='roc_auc')

print("=== All AUC Scores SVM-Linear Kernel Classifier===")

print(svm_linear_cv_score)
print('\n')

print("=== Mean AUC Score SVM-Linear Kernel Classifier ===")
print(svm_linear_cv_score.mean())

In [None]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    print("CV score, test_scores", test_scores_mean)
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [None]:
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
f = plt.figure(figsize=(16,5))
plt.subplots_adjust(left=None,bottom=0.1,top=0.9,wspace=0.2,hspace=0.5)

estimator = svm_linear
title = r"Learning Curves"
plot_learning_curve(estimator, title, X, y, ylim=(0, 1.01), cv=cv, n_jobs=4)

fig1 = plt.gcf()
plt.draw()

## Radial Basis Function Kernel SVM

In [None]:
X, y = shuffle(X, y)

# Splitting train-test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=66)

In [None]:
svm_rbf = svm.SVC(C=1, kernel='rbf', gamma=10)
start = time.time()
%memit svm_rbf.fit(X_train,y_train)
end = time.time()
svm_rbf_time = end - start
print("Training time taken for SVM-RBF Kernel (in seconds): ",svm_rbf_time)

In [None]:
def svm_rbf_fun(X_train,y_train):
    # Same hyperparameters as svm_rbf above, wrapped in a function for %memit
    svm_rbf_model = svm.SVC(C=1, kernel='rbf', gamma=10)
    start = time.time()
    svm_rbf_model.fit(X_train,y_train)
    end = time.time()
    print("Training time taken for SVM-RBF kernel (in seconds): ",end - start)
    
%memit svm_rbf_fun(X_train,y_train)

In [None]:
# predictions
%memit svm_predict = svm_rbf.predict(X_test)

print("=== Confusion Matrix SVM-RBF Kernel Classifier===")
print(confusion_matrix(y_test, svm_predict))
print('\n')

print("=== Classification Report SVM-RBF Kernel Classifier===")
print(classification_report(y_test, svm_predict))
print('\n')

print("=== Accuracy SVM-RBF Kernel Classifier===")
svm_rbf_accuracy = accuracy_score(y_test, svm_predict)
print(svm_rbf_accuracy)
print('\n')

print("=== Precision SVM-RBF Kernel Classifier===")
svm_rbf_precision = precision_score(y_test, svm_predict)
print(svm_rbf_precision)
print('\n')


print("=== Recall SVM-RBF Kernel Classifier===")
svm_rbf_recall = recall_score(y_test, svm_predict)
print(svm_rbf_recall)
print('\n')

In [None]:
svm_rbf_cv = svm.SVC(C=1, kernel='rbf', gamma=10)
svm_rbf_cv_score = cross_val_score(svm_rbf_cv, X, y, cv=5, scoring='roc_auc')

print("=== All AUC Scores SVM-RBF Kernel Classifier===")

print(svm_rbf_cv_score)
print('\n')

print("=== Mean AUC Score SVM-RBF Kernel Classifier ===")
print(svm_rbf_cv_score.mean())

In [None]:
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
f = plt.figure(figsize=(16,5))
plt.subplots_adjust(left=None,bottom=0.1,top=0.9,wspace=0.2,hspace=0.5)

estimator = svm_rbf
title = r"Learning Curves"
plot_learning_curve(estimator, title, X, y, ylim=(0, 1.01), cv=cv, n_jobs=4)

fig1 = plt.gcf()
plt.draw()

## Multi Layer Perceptron

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=66)

In [None]:
# Using sklearn

# Hyper parameter optimization - Finding best parameters using Gridsearch
parameter_space = {
    'hidden_layer_sizes': [(10,10,10), (5,10), (10,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}

mlp_sklearn = MLPClassifier(max_iter=100, random_state=1)
mlp_sklearn = model_selection.GridSearchCV(mlp_sklearn, parameter_space, n_jobs=-1, cv=3)
%memit mlp_sklearn.fit(X_train, y_train)
# Best model found by HPO
print("Best parameters:", mlp_sklearn.best_params_)
print("Best model:", mlp_sklearn.best_estimator_)
predicted = mlp_sklearn.predict(X_test)

print("=== Accuracy MLP Classifier===")
mlp_accuracy = accuracy_score(y_test, predicted)
print(mlp_accuracy)
print('\n')

print("=== Precision MLP Classifier===")
mlp_precision = precision_score(y_test, predicted)
print(mlp_precision)
print('\n')


print("=== Recall MLP Classifier===")
mlp_recall = recall_score(y_test, predicted)
print(mlp_recall)
print('\n')

In [None]:
# MLP with 3 hidden layers and a output layer
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=66)
input_ = Input(shape=(X_train.shape[1],)) # input
hidden_1 = Dense(12, activation='relu')(input_) # inputs * weights 
hidden_2 = Dense(12, activation='relu')(hidden_1) # hidden * weights
hidden_3 = Dense(12, activation='relu')(hidden_2)
output_ = Dense(1, activation='sigmoid')(hidden_3)
classifier = Model(input_, output_)

# Optimizer and loss function definition
classifier.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.01), metrics = ['accuracy'] )
# For weight balancing on unbalanced dataset.
weights = compute_class_weight(class_weight='balanced', classes=np.array([0,1]), y=y_train)

start_ = time.time()
%memit history = classifier.fit(X_train, y_train, batch_size=10, epochs=100, class_weight={0:weights[0],1:weights[1]})
stop_ = time.time()
# Map predicted probabilities to class labels (threshold 0.5)
predicted_labels = (classifier.predict(X_test) > 0.5).astype(int).ravel()

mlp_time = stop_ - start_
predictions = predicted_labels

# summarize history for accuracy
plt.plot(history.history['accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()

# summarize history for loss
plt.plot(history.history['loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train'], loc='upper left')
plt.show()

mlp_accuracy = accuracy_score(y_test, predictions)
mlp_precision = precision_score(y_test, predictions, pos_label=1)
mlp_recall = recall_score(y_test, predictions, pos_label=1)

print ("=== Confusion matrix of MLP Classifier===")
print ("=== Depiction ===\n",np.array([["TP", "FN"], ["FP", "TN"]]))
print ("=== Estimated values ===\n",confusion_matrix(y_test, predictions, labels=[1,0]))
print ("=== Precision of MLP Classifier===\n",mlp_precision)
print ("=== Recall of MLP Classifier===\n",mlp_recall)
print("=== Accuracy MLP Classifier===\n", mlp_accuracy)
print("Training time taken for MLP with 3 hidden layers (in seconds): ",mlp_time)        

## Random forest

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=66)

In [None]:
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
# random forest model creation
rfc = RandomForestClassifier()
start = time.time()
%memit rfc.fit(X_train,y_train)
end = time.time()
random_forest_time = end - start
print("Training time taken for Random Forest with", rfc.n_estimators, "estimators (in seconds): ",random_forest_time)

In [None]:
%memit rfc_predict = rfc.predict(X_test)

print("=== Confusion Matrix Random forest Classifier===")
print(confusion_matrix(y_test, rfc_predict))
print('\n')

print("=== Classification Report Random forest Classifier===")
print(classification_report(y_test, rfc_predict))
print('\n')

print("=== Accuracy Random forest Classifier===")
random_forest_accuracy = accuracy_score(y_test, rfc_predict)
print(random_forest_accuracy)
print('\n')

print("=== Precision Random forest Classifier===")
random_forest_precision = precision_score(y_test, rfc_predict)
print(random_forest_precision)
print('\n')


print("=== Recall Random forest Classifier===")
random_forest_recall = recall_score(y_test, rfc_predict)
print(random_forest_recall)
print('\n')

In [None]:
rfc_cv = RandomForestClassifier()
rfc_cv_score = cross_val_score(rfc_cv, X, y, cv=5, scoring='roc_auc')

print("=== All AUC Scores Random Forest Classifier===")

print(rfc_cv_score)
print('\n')

print("=== Mean AUC Score ===")
print("Mean AUC Score - Random Forest Classifier: ", rfc_cv_score.mean())

In [None]:
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
f = plt.figure(figsize=(16,5))
plt.subplots_adjust(left=None,bottom=0.1,top=0.9,wspace=0.2,hspace=0.5)

estimator = rfc
title = r"Learning Curves"
plot_learning_curve(estimator, title, X, y, ylim=(0, 1.01), cv=cv, n_jobs=4)

fig1 = plt.gcf()
plt.draw()

## Results:

In [None]:
timer = [svm_linear_time, svm_rbf_time, mlp_time, random_forest_time]
LABELS = ['svm_linear', 'svm_rbf', 'mlp', 'random_forest']

fig, ax = plt.subplots(figsize=(12,6), facecolor='white', dpi= 80)
ax.vlines(x=LABELS, ymin=0, ymax=timer, color='firebrick', alpha=1, linewidth=20)

# Annotate Text
for i, value in enumerate(timer):
    ax.text(i, value+0.05, round(value,3), horizontalalignment='center')
plt.ylabel('Training time (in seconds)')
plt.xlabel('Classifiers')
plt.title('Classifiers training time taken comparison')
plt.show()


In [None]:
accuracy_ = [svm_linear_accuracy, svm_rbf_accuracy, mlp_accuracy, random_forest_accuracy]
LABELS = ['svm_linear', 'svm_rbf', 'mlp', 'rf']

fig, ax = plt.subplots(figsize=(12,6), facecolor='white', dpi= 80)
ax.vlines(x=LABELS, ymin=0, ymax=accuracy_, color='firebrick', alpha=1, linewidth=20)

# Annotate Text
for i, value in enumerate(accuracy_):
    ax.text(i, value+0.01, round(value,3), horizontalalignment='center')
plt.ylabel('Accuracy')
plt.xlabel('Classifiers')
plt.title('Classifiers accuracy comparison')
plt.show()

In [None]:
recall = [svm_linear_recall, svm_rbf_recall, mlp_recall, random_forest_recall]
x_range = range(len(recall))
LABELS = ['svm_linear', 'svm_rbf', 'mlp', 'rf']

fig, ax = plt.subplots(figsize=(12,6), facecolor='white', dpi= 80)
ax.vlines(x=LABELS, ymin=0, ymax=recall, color='firebrick', alpha=1, linewidth=20)

# Annotate Text
for i, value in enumerate(recall):
    ax.text(i, value+0.01, round(value,3), horizontalalignment='center')
plt.ylabel('Recall')
plt.xlabel('Classifiers')
plt.title('Classifiers recall comparison')
plt.show()

In [None]:
precision = [svm_linear_precision, svm_rbf_precision, mlp_precision, random_forest_precision]
LABELS = ['svm_linear', 'svm_rbf', 'mlp', 'rf']

fig, ax = plt.subplots(figsize=(12,6), facecolor='white', dpi= 80)
ax.vlines(x=LABELS, ymin=0, ymax=precision, color='firebrick', alpha=1, linewidth=20)

# Annotate Text
for i, value in enumerate(precision):
    ax.text(i, value+0.01, round(value,3), horizontalalignment='center')
plt.ylabel('Precision')
plt.xlabel('Classifiers')
plt.title('Classifiers precision comparison')
plt.show()

In [None]:
import os, pickle

# Serialize each model to its own file so that the sizes can be compared
svm_linear_filename = 'svm_linear_model.pkl'
with open(svm_linear_filename, 'wb') as fh:
    pickle.dump(svm_linear, fh)

svm_rbf_filename = 'svm_rbf_model.pkl'
with open(svm_rbf_filename, 'wb') as fh:
    pickle.dump(svm_rbf, fh)

mlp_filename = 'mlp_model.pkl'
with open(mlp_filename, 'wb') as fh:
    pickle.dump(classifier, fh)

rfc_filename = 'rf_model.pkl'
with open(rfc_filename, 'wb') as fh:
    pickle.dump(rfc, fh)

statinfo_svm_linear = os.path.getsize(svm_linear_filename)
print("Memory of the SVM-Linear Kernel model (in bytes):",statinfo_svm_linear)

statinfo_svm_rbf = os.path.getsize(svm_rbf_filename)
print("Memory of the SVM-RBF Kernel model (in bytes):",statinfo_svm_rbf)

statinfo_mlp = os.path.getsize(mlp_filename)
print("Memory of the MLP model (in bytes):",statinfo_mlp)

statinfo_rfc = os.path.getsize(rfc_filename)
print("Memory of the Random forest model (in bytes):",statinfo_rfc)

In [None]:
memory = [statinfo_svm_linear, statinfo_svm_rbf, statinfo_mlp, statinfo_rfc]
LABELS = ['svm_linear', 'svm_rbf', 'mlp', 'rf']

fig, ax = plt.subplots(figsize=(12,6), facecolor='white', dpi= 80)
ax.vlines(x=LABELS, ymin=0, ymax=memory, color='firebrick', alpha=1, linewidth=20)

# Annotate Text
for i, value in enumerate(memory):
    ax.text(i, value+1000, round(value,3), horizontalalignment='center')
plt.ylabel('Model size (in bytes)')
plt.xlabel('Classifiers')
plt.title('Classifiers model memory size comparison')
plt.show()

## Discussion:
The classifiers are compared on six metrics:

* Accuracy
* Recall
* Precision
* Training time
* Model memory size
* Memory footprint


The individual plots in the Results section compare these metrics; overall, the best classifier is the SVM with a radial basis function (RBF) kernel.

SVM-RBF, trained on 5 features extracted by Principal Component Analysis, achieves an accuracy of 97.8%.

Since accuracy alone cannot be used to decide the best classifier for the given task, recall and precision are also analyzed. SVM-RBF also scores best among all classifiers on recall and precision.
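
To illustrate why accuracy alone can mislead on an unbalanced dataset, the sketch below computes accuracy, precision and recall from confusion-matrix counts. The counts are hypothetical (they are not results from this notebook) and describe a classifier that always predicts label 0 on 95 samples of label 0 and 5 samples of label 1. The helper `precision_recall` is likewise only illustrative.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive label, guarding against 0/0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: the classifier never predicts label 1.
tp, fp, fn, tn = 0, 0, 5, 95

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall = precision_recall(tp, fp, fn)
print(accuracy, precision, recall)  # high accuracy, zero precision and recall
```

The accuracy looks high even though not a single sample of label 1 is detected, which is the situation precision and recall are meant to expose.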

MLP training time is significantly higher than that of SVM-RBF.

The RF model size is significantly larger than that of SVM-RBF and MLP.
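
Model size can also be estimated without writing files, by measuring the length of the pickled byte string. The sketch below assumes the object is picklable and uses toy dictionaries as stand-ins for fitted models; `serialized_size` is a hypothetical helper, not part of this notebook.

```python
import pickle


def serialized_size(obj):
    """Return the number of bytes pickle needs to serialize obj."""
    return len(pickle.dumps(obj))


# Toy stand-ins for a small and a large fitted model.
small_model = {"weights": [0.0] * 10}
large_model = {"weights": [0.0] * 10_000}

small_bytes = serialized_size(small_model)
large_bytes = serialized_size(large_model)
print(small_bytes, large_bytes)
```

The result depends on the pickle protocol, so sizes are only comparable when every model is serialized the same way.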

Comparing all the metrics, SVM-RBF stands out, as shown in the graphs above.


## References

1. Emre Rencberoglu, "Fundamental Techniques of Feature Engineering for Machine Learning", Towards Data Science. Available at: https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114, accessed on 16.11.2019 [Online].

2. Muriz Serifovic, "Profiling and Optimizing in Jupyter Notebooks - A Comprehensive Guide", Towards Data Science. Available at: https://towardsdatascience.com/speed-up-jupyter-notebooks-20716cbe2025, accessed on 23.11.2019 [Online].