# <span style="color:blue"> Jupyter notebook solution</span> : classifying blood cells

## <span style="color:blue">  Business Understanding :  </span>

An important problem in blood diagnostics is classifying different types of blood cells.


## <span style="color:blue">  Import libraries :  </span>

In [None]:
#import libraries for data manipulation and visualization
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

## <span style="color:blue">  Load data : </span>

In [None]:
image_test = np.load("image_test.npy")
image_train = np.load("image_train.npy")
target_train = np.load("target_train.npy")
target_test = np.load("target_test.npy")

## <span style="color:blue">   Perform a brief exploratory analysis : </span>

#####  Data Understanding & Data requirements

<p style="color:blue">  How many training and testing examples do we have? What shape are the images and the targets? </p>

<p style="color:blue">What is the proportion of each observed target? Are the datasets balanced? </p>

In [None]:
print(image_test[0][1][1])
print(image_test.shape)
print(image_train.shape)
print(target_train.shape)
print(target_test.shape)

In [None]:
print(image_train[0][1][1])

In [None]:
type(image_train)

In [None]:
type(target_train)

In [None]:
print(target_train[0])

In [None]:
#print(target_test)

In [None]:
#numpy.ndarray to pandas.core.frame.DataFrame
df_target_train = pd.DataFrame(target_train, 
             columns=['target_train'])

In [None]:
type(df_target_train)

In [None]:
df_target_train['target_train'].unique()

In [None]:
# check the target variable 
df_target_train['target_train'].value_counts()

In [None]:
# visualize the target variable
import seaborn as sns
# recent seaborn versions expect the plotted variable as a keyword argument
g = sns.countplot(x=df_target_train['target_train'])
g.set_xticklabels(['lymphocytes', 'neutrophils', 'monocytes', 'eosinophils',
       'unknown'])
plt.show()

<p style="color:blue"> The classes are clearly imbalanced:
296 neutrophils, 279 lymphocytes, 28 monocytes, 12 eosinophils and 2 unknown.  </p>

In [None]:
train_counts = np.unique(target_train, return_counts = True)
test_counts = np.unique(target_test, return_counts = True)

In [None]:

class_names = ['lymphocytes', 'neutrophils', 'monocytes', 'eosinophils','unknown']
pd.DataFrame({ "train": train_counts[1], "test": test_counts[1]}, index = class_names).plot.bar()

plt.show()

In [None]:
plt.pie(train_counts[1],
        explode=(0, 0, 0, 0,0) , 
        labels=class_names,
        autopct='%1.1f%%')
plt.axis('equal')
plt.title('Proportion of each observed category')
plt.show()

<p style="color:blue"> Imbalance Ratio (IR): the ratio of the number of instances in the negative class to the number of instances in the positive class.</p>

<p style="color:blue">
IR = negative_class / positive_class</p>
<p style="color:blue">
where positive_class is the number of minority class samples and negative_class is the number of majority class samples.</p>
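
The definition above can be sketched in a few lines. This is only an illustration on hypothetical toy labels: the helper `imbalance_ratio` is not part of the notebook, and for more than two classes it compares the largest class with the smallest one, whereas the cells below pool the two largest classes against the three smallest.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical toy labels: 6 samples of class "a" and 2 samples of class "b".
toy_labels = ["a"] * 6 + ["b"] * 2
ir = imbalance_ratio(toy_labels)
print(ir)  # 3.0
```

An IR of 1 means the classes are perfectly balanced; the larger the value, the stronger the imbalance.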

In [None]:
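# class counts; value_counts() sorts them from the most to the least frequent class
class_count_0, class_count_1, class_count_2, class_count_3, class_count_4 = df_target_train['target_train'].value_counts()


In [None]:
# the two largest classes are pooled as the majority, the three smallest as the minority
IR = (class_count_0 + class_count_1) / (class_count_2 + class_count_3 + class_count_4)
IR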

## Reshape

The input data currently has 4 dimensions (batch size, channels, height, width). The classical models below expect one row per image, so each image has to be flattened, which gives 2 dimensions (number of images, channels × height × width).
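
As a minimal sketch of this flattening step, the snippet below uses a hypothetical batch of 4 blank images rather than the real data. The axis order of the toy array is an assumption; the flattened width is the same for any axis order.

```python
import numpy as np

# Hypothetical batch: 4 images of 24 x 24 pixels with 3 channels.
batch = np.zeros((4, 24, 24, 3), dtype=np.uint8)

# Keep the first axis (one row per image) and let NumPy infer the rest.
flat = batch.reshape(batch.shape[0], -1)
print(flat.shape)  # (4, 1728)

# The flattening is reversible as long as the original shape is known.
restored = flat.reshape(batch.shape)
```

Passing `-1` avoids hard-coding the number of features, which is 24 * 24 * 3 = 1728 here.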


In [None]:
image_train_reshape = image_train.reshape(1269,24*24*3)
image_test_reshape = image_test.reshape(617,24*24*3)


In [None]:
#pip install xgboost

In [None]:
# import library
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

xgb_model = XGBClassifier().fit(image_train_reshape, target_train)

# predict
xgb_y_predict = xgb_model.predict(image_test_reshape)

# accuracy score (true labels first, predictions second)
xgb_score = accuracy_score(target_test, xgb_y_predict)

print('Accuracy score is:', xgb_score)

<p style="color:blue" >The model reaches 67% accuracy. This figure looks high mainly because the model predicts mostly the majority classes, so accuracy alone is misleading on such an imbalanced dataset. </p>
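
To illustrate why accuracy can mislead here, the sketch below compares plain accuracy with the mean per-class recall for a classifier that always predicts the majority class. The labels are hypothetical and both helper functions are illustrative names, not part of the notebook.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def mean_per_class_recall(y_true, y_pred):
    """Average of the recall obtained on each class (balanced accuracy)."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# Hypothetical labels: 8 samples of class 0 and 2 samples of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10  # always predict the majority class

print(majority_baseline_accuracy(y_true))     # 0.8
print(mean_per_class_recall(y_true, y_pred))  # 0.5
```

The baseline scores 80% accuracy without ever recognising the minority class, which the per-class average makes visible.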

# Resampling Technique

<p style="color:blue" > A widely adopted technique for dealing with highly imbalanced datasets is resampling. It consists of removing samples from the majority class (under-sampling) and/or adding more examples from the minority class (over-sampling). </p>
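
Both directions can be sketched with plain NumPy. This is an illustration of the idea on hypothetical toy data, not the implementation used by imbalanced-learn; `random_resample` and its `n_per_class` parameter are made-up names.

```python
import numpy as np

def random_resample(X, y, n_per_class, seed=0):
    """Draw n_per_class rows for every class, duplicating rows only when needed."""
    rng = np.random.default_rng(seed)
    chosen = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        # Over-sample (with replacement) if the class is too small, else under-sample.
        replace = len(idx) < n_per_class
        chosen.append(rng.choice(idx, size=n_per_class, replace=replace))
    chosen = np.concatenate(chosen)
    return X[chosen], y[chosen]

# Hypothetical data: 6 rows of class 0 and 2 rows of class 1.
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])

X_bal, y_bal = random_resample(X, y, n_per_class=4)
print(np.bincount(y_bal))  # [4 4]
```

Class 0 is under-sampled from 6 to 4 rows, while class 1 is over-sampled from 2 to 4 rows by duplication.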

In [None]:
#pip install imblearn

### Over Sampling Minority class by duplication

In [None]:

from imblearn.over_sampling import RandomOverSampler
oversam = RandomOverSampler(sampling_strategy='minority')
X_over,Y_over=oversam.fit_resample(image_train_reshape,target_train)


In [None]:
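xgb_model = XGBClassifier().fit(X_over, Y_over)

# predict on the held-out test set (predicting on X_over would only give the training accuracy)
xgb_y_predict = xgb_model.predict(image_test_reshape)

# accuracy score
xgb_score = accuracy_score(target_test, xgb_y_predict)

print('Accuracy score is:', xgb_score)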

In [None]:
#numpy.ndarray to pandas.core.frame.DataFrame
df_Y_over = pd.DataFrame(Y_over, 
             columns=['Y_over'])

In [None]:
# visualize the target variable
import seaborn as sns
# recent seaborn versions expect the plotted variable as a keyword argument
g = sns.countplot(x=df_Y_over['Y_over'])
g.set_xticklabels(['lymphocytes', 'neutrophils', 'monocytes', 'eosinophils',
       'unknown'])
plt.show()

### Under-sampling majority class

In [None]:
from imblearn.under_sampling import RandomUnderSampler
sam = RandomUnderSampler(random_state=0)
image_train_under,target_train_under=sam.fit_resample(image_train_reshape,target_train)
image_test_under,target_test_under=sam.fit_resample(image_test_reshape,target_test)

In [None]:
#numpy.ndarray to pandas.core.frame.DataFrame
df_target_train_under = pd.DataFrame(target_train_under, 
             columns=['target_train_under'])

In [None]:
# visualize the target variable
import seaborn as sns
# recent seaborn versions expect the plotted variable as a keyword argument
g = sns.countplot(x=df_target_train_under['target_train_under'])
g.set_xticklabels(['lymphocytes', 'neutrophils', 'monocytes', 'eosinophils',
       'unknown'])
plt.show()

In [None]:
xgb_model = XGBClassifier().fit(image_train_under,target_train_under)

# predict
xgb_y_predict = xgb_model.predict(image_test_under)

# accuracy score
xgb_score = accuracy_score(xgb_y_predict, target_test_under)

print('Accuracy score is:', xgb_score)

In [None]:
class_count_0, class_count_1, class_count_2, class_count_3, class_count_4 = df_target_train_under['target_train_under'].value_counts()
IR = (class_count_0 + class_count_1) / (class_count_2 + class_count_3 + class_count_4)
IR

# Plot images

#### Examples of images in the dataset

In [None]:
class_names = ['lymphocytes', 'neutrophils', 'monocytes', 'eosinophils','unknown']
def display_examples(class_names, images, labels):
    # show the first 25 images of the given set, each with its label underneath
    fig = plt.figure(figsize = (10,10))
    fig.suptitle("Examples of images in the dataset", fontsize=16)
    for i in range(25):
        plt.subplot(5,5,i+1)
        plt.xticks([])
        plt.yticks([])
        plt.grid(False)
        plt.imshow(images[i], cmap=plt.cm.binary)
        plt.xlabel(labels[i])
    plt.show()

display_examples(class_names, image_train, target_train)

#### Plot images from 3 samples of each class from the dataset

In [None]:
class_names = ['lymphocytes', 'neutrophils', 'monocytes', 'eosinophils','unknown']
def display_examples(class_names, images, labels, n_examples=3):
    # one row per class, with up to n_examples images in each row
    fig = plt.figure(figsize = (10,10))
    for j, name in enumerate(class_names):
        # positions of the samples whose label equals this class name
        pos_class = np.where(labels == name)[0]
        # slicing avoids an IndexError when a class has fewer than n_examples samples
        for i, pos in enumerate(pos_class[:n_examples]):
            plt.subplot(len(class_names), n_examples, j*n_examples + i + 1)
            plt.xticks([])
            plt.yticks([])
            plt.grid(False)
            plt.imshow(images[pos], cmap=plt.cm.binary)
            plt.xlabel(labels[pos])
    plt.show()

display_examples(class_names, image_train, target_train)

# <span style="color:blue">  Train machine learning models : </span>

In [None]:
# Method #1: pixel values as features
# flatten every pixel of the test images into one long vector
# (617 images * 24 * 24 * 3 = 1066176 values)

features = image_test.reshape(-1)

features.shape, features

<p style="color:blue"> We chose Random Forest and Support Vector Machine classifiers because they can deal with a large number of features. </p>

## RandomForestClassifier

In [None]:
from sklearn.ensemble import RandomForestClassifier


clf = RandomForestClassifier(max_depth=30, random_state=0)
clf.fit(image_train_under,target_train_under)
print(clf.score(image_test_under,target_test_under))

In [None]:
pred = clf.predict(image_test_under)
pred2 = clf.predict(image_train_under)

In [None]:
#Getting Accuracy Score
from sklearn.metrics import accuracy_score
print("Test accuracy",accuracy_score(target_test_under, pred))
print("Train accuracy",accuracy_score(target_train_under,pred2))


### Analyze accuracy using 5-fold cross validation


k-Fold Cross-Validation
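Cross-validation is a resampling procedure used to evaluate machine learning models: the data is split into k folds, and each fold is used once as the validation set while the model is trained on the remaining folds.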
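
The splitting itself can be sketched without any library. The helper `k_fold_indices` is an illustrative name and, unlike the shuffled `KFold` used in the next cell, it keeps the samples in their original order.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for each of n_splits folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so that sizes differ by at most one.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(k_fold_indices(n_samples=10, n_splits=5))
for train, val in folds:
    print("validation fold:", val)
```

Every sample appears in exactly one validation fold, so the reported score is an average over five different held-out subsets.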

In [None]:

# evaluate a RandomForestClassifier model using k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# prepare the cross-validation procedure
cv = KFold(n_splits=5, random_state=1, shuffle=True)
# create model
model = RandomForestClassifier(max_depth=30, random_state=0)
# evaluate model
scores = cross_val_score(model, image_train_reshape,target_train, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))


In [None]:
target_train_under.shape

In [None]:
target_train.shape

In [None]:
# Extract single tree
estimator = clf.estimators_[5]

from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='tree.dot', 
                feature_names = None,
                class_names = class_names,
                rounded = True, proportion = False, 
                precision = 3, filled = True)

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')

In [None]:
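from sklearn import tree
from subprocess import call

# modelTree is not defined elsewhere in this notebook; one tree of the forest is assumed here
modelTree = clf.estimators_[5]
with open("test.dot", 'w') as dotfile:
    tree.export_graphviz(modelTree, out_file=dotfile,
                feature_names = None,
                class_names = class_names,
                filled=True, rounded=True,
                special_characters=True)

# Convert to png using system command (requires Graphviz)
call(['dot', '-Tpng', 'test.dot', '-o', 'dtree7.png'])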

In [None]:
Image(filename = 'dtree7.png')

## Support Vector Machine

In [None]:
#Import svm model
from sklearn import svm

#Create a svm Classifier
cl = svm.SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets
cl.fit(image_train_under,target_train_under)

#Predict the response for test dataset
y_pred = cl.predict(image_test_under)

In [None]:
print(cl.score(image_test_under,target_test_under))

In [None]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(target_test_under, y_pred))

The SVM reaches a classification rate of 80% on the under-sampled test set, which we consider a very good accuracy.

# <span style="color:blue">  Train a deep learning model : </span>



This model contains a sequence of five Conv blocks containing combinations of SeparableConv2D, BatchNormalization, MaxPooling and Dropout layers. The output of the final Conv block is flattened and followed by three Fully Connected (FC) layers each with its own Dropout layer. A final FC layer is added with four units and a softmax activation for multiclass classification.

In [None]:
from sklearn.model_selection import train_test_split
train_image,valid_image,train_target,valid_target = train_test_split(image_train, target_train, test_size=0.2, random_state=13)

In [None]:

from sklearn.utils import shuffle
from sklearn import decomposition
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import tensorflow as tf
import keras
from keras.applications.vgg16 import VGG16 
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
from keras.models import Sequential, Model 
from keras.applications import DenseNet201
from keras.initializers import he_normal
from keras.layers import Lambda, SeparableConv2D, BatchNormalization, Dropout, MaxPooling2D, Input, Dense, Conv2D, Activation, Flatten 
from keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
import imutils

In [None]:
model1 = Sequential()

# First Conv block
model1.add(Conv2D(16 , (3,3) , padding = 'same' , activation = 'relu' , input_shape = (120,120,3)))
model1.add(Conv2D(16 , (3,3), padding = 'same' , activation = 'relu'))
model1.add(MaxPooling2D(pool_size = (2,2)))

# Second Conv block
model1.add(SeparableConv2D(32, (3,3), activation = 'relu', padding = 'same'))
model1.add(SeparableConv2D(32, (3,3), activation = 'relu', padding = 'same'))
model1.add(BatchNormalization())
model1.add(MaxPooling2D(pool_size = (2,2)))

# Third Conv block
model1.add(SeparableConv2D(64, (3,3), activation = 'relu', padding = 'same'))
model1.add(SeparableConv2D(64, (3,3), activation = 'relu', padding = 'same'))
model1.add(BatchNormalization())
model1.add(MaxPooling2D(pool_size = (2,2)))

# Fourth Conv block
model1.add(SeparableConv2D(128, (3,3), activation = 'relu', padding = 'same'))
model1.add(SeparableConv2D(128, (3,3), activation = 'relu', padding = 'same'))
model1.add(BatchNormalization())
model1.add(MaxPooling2D(pool_size = (2,2)))
model1.add(Dropout(0.2))

# Fifth Conv block 
model1.add(SeparableConv2D(256, (3,3), activation = 'relu', padding = 'same'))
model1.add(SeparableConv2D(256, (3,3), activation = 'relu', padding = 'same'))
model1.add(BatchNormalization())
model1.add(MaxPooling2D(pool_size = (2,2)))
model1.add(Dropout(0.2))


# FC layer 
model1.add(Flatten())
model1.add(Dense(units = 512 , activation = 'tanh'))
model1.add(Dropout(0.7))
model1.add(Dense(units = 128 , activation = 'tanh'))
model1.add(Dropout(0.5))
model1.add(Dense(units = 64 , activation = 'tanh'))
model1.add(Dropout(0.3))

# Output layer
model1.add(Dense(units = 4 , activation = 'softmax'))

# Compile
model1.compile(optimizer = "adam" , loss = 'sparse_categorical_crossentropy' , metrics = ['accuracy'])
model1.summary()

# Implement callbacks 
checkpoint = ModelCheckpoint(filepath='best_model.hdf5', save_best_only=True, save_weights_only=False)
early_stop = EarlyStopping(monitor='val_loss', min_delta=0.1, patience=3, verbose = 1, mode='min', restore_best_weights = True)
learning_rate_reduction = ReduceLROnPlateau(
    monitor = 'val_accuracy', 
    patience = 2, 
    verbose = 1, 
    factor = 0.3, 
    min_lr = 0.000001)

# Train
history1 = model1.fit(
    train_image, train_target,
    batch_size = 32, 
    epochs = 30, 
    # validation split created above with train_test_split
    validation_data=(valid_image, valid_target), 
    callbacks=[learning_rate_reduction])
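
The spatial size that reaches the `Flatten` layer can be worked out by hand. The sketch below assumes what the model definition above states: 120 x 120 inputs, `'same'`-padded convolutions that keep the size, five 2 x 2 max-poolings that halve it (rounding down), and 256 filters in the last block. The helper `pooled_size` is an illustrative name.

```python
def pooled_size(size, n_pools, pool=2):
    """Spatial size after n_pools max-pooling layers without padding."""
    for _ in range(n_pools):
        size = size // pool
    return size

# 120 -> 60 -> 30 -> 15 -> 7 -> 3
side = pooled_size(120, n_pools=5)
flattened_units = side * side * 256
print(side, flattened_units)  # 3 2304

# The same rule applied to a 24 x 24 input: 24 -> 12 -> 6 -> 3 -> 1 -> 0
print(pooled_size(24, n_pools=5))  # 0
```

Under this rule a 24 x 24 image, the size used in the reshape step earlier, shrinks to nothing before the fifth pooling, so such images would need to be resized to 120 x 120 or the network would need fewer pooling layers.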

### Analyze accuracy using 5-fold cross validation

In [None]:
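cv = KFold(n_splits=5, random_state=1, shuffle=True)
# cross_val_score needs a scikit-learn estimator, so the Keras model is evaluated fold by fold
X_cv = image_train_under.reshape((-1,) + image_train.shape[1:])  # restore the image shape
scores = []
for train_idx, val_idx in cv.split(X_cv):
    fold_model = keras.models.clone_model(model1)  # same architecture, fresh weights
    fold_model.compile(optimizer = "adam", loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])
    fold_model.fit(X_cv[train_idx], target_train_under[train_idx], batch_size = 32, epochs = 30, verbose = 0)
    scores.append(fold_model.evaluate(X_cv[val_idx], target_train_under[val_idx], verbose = 0)[1])
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))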


Let's evaluate the model on the test data to find the loss and accuracy:

In [None]:
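# the under-sampled arrays are flattened, so restore the image shape expected by the network
test_images_under = image_test_under.reshape((-1,) + image_test.shape[1:])
train_images_under = image_train_under.reshape((-1,) + image_train.shape[1:])

results = model1.evaluate(test_images_under, target_test_under)

print("Loss of the model  is - test ", results[0])
print("Accuracy of the model is - test", results[1]*100, "%")


results = model1.evaluate(train_images_under, target_train_under)

print("Loss of the model  is - train ", results[0])
print("Accuracy of the model is - train", results[1]*100, "%")

In [None]:
# predict returns one probability per class, so keep the most probable class for each image
predictions = np.argmax(model1.predict(test_images_under), axis=1)

# Check the model’s result metrics on the test dataset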



## Confusion matrix


In [None]:
#function plot matrix
def plot_confusion_matrix (cm):
    plt.figure(figsize = (10,10))
    sns.heatmap(
        cm, 
        cmap = 'Blues', 
        linecolor = 'black', 
        linewidth = 1, 
        annot = True, 
        fmt = '', 
        xticklabels = class_names, 
        yticklabels = class_names)

In [None]:
#CNN model
cmCNN = confusion_matrix(target_test_under, predictions)
cmCNN = pd.DataFrame(cmCNN, index = ['0', '1', '2', '3','4'], columns = ['0', '1', '2', '3','4'])
cmCNN

In [None]:
  
plot_confusion_matrix(cmCNN)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

In [None]:
#RandomForestClassifierModel
cmRan = confusion_matrix(target_test_under, pred)
cmRan = pd.DataFrame(cmRan, index = ['0', '1', '2', '3','4'], columns = ['0', '1', '2', '3','4'])
cmRan


In [None]:
plot_confusion_matrix(cmRan)

In [None]:
#SVM model 

cmSVM = confusion_matrix(target_test_under, y_pred)
cmSVM = pd.DataFrame(cmSVM, index = ['0', '1', '2', '3','4'], columns = ['0', '1', '2', '3','4'])
cmSVM

In [None]:
plot_confusion_matrix(cmSVM)

## TPR, TNR, PPV, NPV 


#### Confusion matrix terminology for the RandomForestClassifier model

In [None]:


# work on the underlying array: DataFrame.sum() returns column sums, not the grand total
cm = cmRan.values
FP = cm.sum(axis=0) - np.diag(cm)
FN = cm.sum(axis=1) - np.diag(cm)
TP = np.diag(cm)
TN = cm.sum() - (FP + FN + TP)

FP = FP.astype(float)
FN = FN.astype(float)
TP = TP.astype(float)
TN = TN.astype(float)


# Sensitivity, hit rate, recall, or true positive rate
TPR = TP/(TP+FN)
# Specificity or true negative rate
TNR = TN/(TN+FP) 
# Precision or positive predictive value
PPV = TP/(TP+FP)
# Negative predictive value
NPV = TN/(TN+FN)
# Fall out or false positive rate
FPR = FP/(FP+TN)
# False negative rate
FNR = FN/(TP+FN)
# False discovery rate
FDR = FP/(TP+FP)

# Overall accuracy
ACC = (TP+TN)/(TP+FP+FN+TN)

#### Confusion matrix terminology for the CNN

In [None]:


# work on the underlying array: DataFrame.sum() returns column sums, not the grand total
cm = cmCNN.values
FP = cm.sum(axis=0) - np.diag(cm)
FN = cm.sum(axis=1) - np.diag(cm)
TP = np.diag(cm)
TN = cm.sum() - (FP + FN + TP)

FP = FP.astype(float)
FN = FN.astype(float)
TP = TP.astype(float)
TN = TN.astype(float)


# Sensitivity, hit rate, recall, or true positive rate
TPR = TP/(TP+FN)
# Specificity or true negative rate
TNR = TN/(TN+FP) 
# Precision or positive predictive value
PPV = TP/(TP+FP)
# Negative predictive value
NPV = TN/(TN+FN)
# Fall out or false positive rate
FPR = FP/(FP+TN)
# False negative rate
FNR = FN/(TP+FN)
# False discovery rate
FDR = FP/(TP+FP)

# Overall accuracy
ACC = (TP+TN)/(TP+FP+FN+TN)
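
Since the same computation is repeated for each model, it could be wrapped in a small helper. The sketch below is one possible way to do it, checked on a hypothetical 3-class confusion matrix; `per_class_rates` is an illustrative name, and rows are assumed to hold the true classes and columns the predicted ones, as returned by scikit-learn's `confusion_matrix`.

```python
import numpy as np

def per_class_rates(cm):
    """Return TPR, TNR, PPV and NPV for each class of a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    TP = np.diag(cm)
    FP = cm.sum(axis=0) - TP  # predicted as the class, but belonging to another
    FN = cm.sum(axis=1) - TP  # belonging to the class, but predicted as another
    TN = cm.sum() - (TP + FP + FN)
    return {"TPR": TP / (TP + FN), "TNR": TN / (TN + FP),
            "PPV": TP / (TP + FP), "NPV": TN / (TN + FN)}

# Hypothetical confusion matrix (rows: true class, columns: predicted class).
toy_cm = [[5, 1, 0],
          [2, 6, 0],
          [0, 1, 5]]
rates = per_class_rates(toy_cm)
for name, values in rates.items():
    print(name, np.round(values, 3))
```

For the first class of the toy matrix, TP = 5, FN = 1 and FP = 2, so the recall is 5/6 and the precision is 5/7. A class that is never predicted has TP + FP = 0, in which case its PPV is undefined and NumPy returns `nan` with a warning.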