# Introduction

Currently, one of the most common (and accurate) ways to analyze a blood smear is manual examination. The goal of this project is to develop a neural network that can classify white blood cells (WBCs) from images, as part of an eventual effort to automate the procedure without a significant loss in accuracy. Automating this process would not only speed it up but also reduce the amount of human labor required to conduct a test, thus lowering the overall cost.

The dataset for this project is a collection of ~12,500 images, each 240 x 320 pixels. Each image contains several red blood cells (RBCs) and a single, highlighted WBC. Each WBC falls into one of four categories: Eosinophil, Lymphocyte, Monocyte, or Neutrophil. The dataset can be found on Kaggle [here](https://www.kaggle.com/paultimothymooney/blood-cells). With accurate classification, the proportion of each WBC type in a sample could be calculated and checked against the expected range. Additionally, cell images could be further inspected for abnormalities.
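
As a rough illustration of the proportion check described above, the sketch below counts predicted labels and flags any type whose share falls outside a given range. The function names, the example predictions, and the ranges are all hypothetical; in particular, the ranges are placeholders and not clinical reference values.

```python
from collections import Counter

WBC_TYPES = ["EOSINOPHIL", "LYMPHOCYTE", "MONOCYTE", "NEUTROPHIL"]


def wbc_proportions(predicted_labels):
    """Return the fraction of each WBC type among the predicted labels."""
    counts = Counter(predicted_labels)
    total = sum(counts[t] for t in WBC_TYPES)
    if total == 0:
        raise ValueError("no recognised WBC labels supplied")
    return {t: counts[t] / total for t in WBC_TYPES}


def out_of_range(proportions, reference_ranges):
    """List the types whose proportion falls outside its (low, high) range."""
    return [t for t, p in proportions.items()
            if not (reference_ranges[t][0] <= p <= reference_ranges[t][1])]


# Hypothetical predictions for ten cells from one smear.
example_predictions = ["NEUTROPHIL"] * 6 + ["LYMPHOCYTE"] * 3 + ["MONOCYTE"]
example_proportions = wbc_proportions(example_predictions)

# Placeholder ranges for illustration only; NOT clinical reference values.
placeholder_ranges = {
    "EOSINOPHIL": (0.0, 0.1),
    "LYMPHOCYTE": (0.2, 0.4),
    "MONOCYTE": (0.0, 0.1),
    "NEUTROPHIL": (0.4, 0.7),
}
flagged = out_of_range(example_proportions, placeholder_ranges)
```

In practice the labels would come from the classifier's predictions over all cells imaged from one sample, and the ranges from a vetted clinical source.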

# Module Imports

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from PIL import Image
from tqdm import tqdm
import os
import matplotlib.pyplot as plt
%matplotlib inline


import warnings  
with warnings.catch_warnings():  
    warnings.filterwarnings("ignore",category=FutureWarning)
    import keras

Using TensorFlow backend.


In [2]:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D, Activation, Dropout, Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from keras.utils import to_categorical
from keras.optimizers import Adadelta

# Data Exploration

Let's begin by taking a look at an example of each of the four types of WBC we'll be attempting to classify.

In [None]:
wbc_types = ['EOSINOPHIL', 'NEUTROPHIL', 'LYMPHOCYTE', 'MONOCYTE']
wbc_df = pd.DataFrame(columns=['file_name', 'type', 'group'])
X_train = []
X_test = []
y_train = []
y_test = []

plt.figure(figsize=(10,8))

for i, wbc_type in enumerate(wbc_types):
    path = 'dataset2-master/images/TRAIN/' + wbc_type + '/'
    files = os.listdir(path)
    print("Loading {} Training Image Files".format(wbc_type))
    for file in tqdm(files):
        image = load_img(path + file)
        image = image.resize((150,150))
        new_row = {'file_name': file,
                   'type': wbc_type, 
                   'group': 'train'}
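        # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
        wbc_df = pd.concat([wbc_df, pd.DataFrame([new_row])], ignore_index=True)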
        im_arr = np.asarray(image)
        im_arr = im_arr / 255.0
        X_train.append(im_arr)
        y_train.append(i)
    path = 'dataset2-master/images/TEST/' + wbc_type + '/'
    files = os.listdir(path)
    print("Loading {} Testing Image Files".format(wbc_type))
    for file in tqdm(files):
        image = load_img(path + file)
        image = image.resize((150,150))
        new_row = {'file_name': file,
                   'type': wbc_type,
                   'group': 'test'}
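        # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
        wbc_df = pd.concat([wbc_df, pd.DataFrame([new_row])], ignore_index=True)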
        im_arr = np.asarray(image)
        im_arr = im_arr / 255.0
        X_test.append(im_arr)  # test images belong in X_test, not X_train
        y_test.append(i)
    
#     df = pd.DataFrame(columns=['file_name', 'type'])
#     df['file_name'] = files
#     df.fillna(value=wbc_type, inplace=True)
#     wbc_df = wbc_df.append(df)
    
#     image = load_img(path + files[0])
#     plt.subplot(2,2,i+1)
#     plt.title(wbc_type)
#     plt.axis('off')
#     plt.imshow(image)
    
# plt.tight_layout()
# plt.show()

In [None]:
sns.countplot(x='group', hue='type', data=wbc_df)
plt.show()

In [None]:
X_train = np.asarray(X_train)
#X_train = X_train/255.0

y_train = np.asarray(y_train)
y_train = to_categorical(y_train, num_classes=4)

X_test = np.asarray(X_test)
#X_test = X_test/255.0

y_test = np.asarray(y_test)
y_test = to_categorical(y_test, num_classes=4)

In [None]:
y_train.shape

The images show that the four cell types are fairly easy to tell apart visually. Additionally, the countplot shows that our training data is well balanced, with ~2,500 images of each WBC type.

# Model Construction

Since the inputs to be classified are images, we will use a Convolutional Neural Network (CNN).

In [11]:
trdata = ImageDataGenerator(featurewise_center=False,  # set input mean to 0 over the dataset
                            samplewise_center=False,  # set each sample mean to 0
                            featurewise_std_normalization=False,  # divide inputs by std of the dataset
                            samplewise_std_normalization=False,  # divide each input by its std
                            zca_whitening=False,  # apply ZCA whitening
                            rotation_range=10,  # randomly rotate images in the range (degrees, 0 to 180)
                            width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
                            height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
                            horizontal_flip=True,  # randomly flip images horizontally
                            vertical_flip=False)  # do not flip images vertically
traindata = trdata.flow_from_directory(directory="dataset2-master/images/TRAIN",target_size=(60,80))
tsdata = ImageDataGenerator()
testdata = tsdata.flow_from_directory(directory="dataset2-master/images/TEST", target_size=(60,80))

Found 9957 images belonging to 4 classes.
Found 2487 images belonging to 4 classes.
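
The image counts reported above determine how many batches make up one full pass over each generator. The sketch below works this out, assuming the generators use a batch size of 32 (which I believe is the `flow_from_directory` default, since no `batch_size` is passed); the helper name is hypothetical.

```python
import math


def steps_for_full_pass(num_images, batch_size):
    """Number of batches needed to see every image once (last batch may be partial)."""
    return math.ceil(num_images / batch_size)


BATCH_SIZE = 32  # assumption: default batch size, not set explicitly in the notebook

train_steps = steps_for_full_pass(9957, BATCH_SIZE)
test_steps = steps_for_full_pass(2487, BATCH_SIZE)

# Images covered per epoch by the step counts used in the training call
# later in this notebook (steps_per_epoch=100, validation_steps=10).
train_images_per_epoch = 100 * BATCH_SIZE
val_images_per_epoch = 10 * BATCH_SIZE

print(train_steps, test_steps, train_images_per_epoch, val_images_per_epoch)
```

Under this assumption, the training call later in the notebook draws about a third of the training images per epoch and validates on 320 of the 2,487 test images each time, so the per-epoch validation accuracy is an estimate from a subset.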


In [12]:
model = Sequential()
# model.add(Conv2D(input_shape=(224,224,3),filters=64,kernel_size=(3,3),padding="same", activation="relu"))
# model.add(Conv2D(filters=64,kernel_size=(3,3),padding="same", activation="relu"))
# model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))

# model.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
# model.add(Conv2D(filters=128, kernel_size=(3,3), padding="same", activation="relu"))
# model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))

# model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
# model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
# model.add(Conv2D(filters=256, kernel_size=(3,3), padding="same", activation="relu"))
# model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))

# model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
# model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
# model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
# model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))

# model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
# model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
# model.add(Conv2D(filters=512, kernel_size=(3,3), padding="same", activation="relu"))
# model.add(MaxPool2D(pool_size=(2,2),strides=(2,2)))

# model.add(Flatten())
# model.add(Dense(units=4096,activation="relu"))
# model.add(Dense(units=4096,activation="relu"))
# model.add(Dense(units=4, activation="softmax"))

# from keras.optimizers import Adam

img_rows,img_cols=60,80
input_shape = (img_rows, img_cols, 3)

model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape, strides=1))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(4, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

# opt = Adadelta()
# model.compile(optimizer=opt, loss=keras.losses.categorical_crossentropy, metrics=['accuracy'])

In [5]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_3 (Conv2D)            (None, 58, 78, 32)        896       
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 56, 76, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 28, 38, 64)        0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 28, 38, 64)        0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 68096)             0         
_________________________________________________________________
dense_2 (Dense)              (None, 128)               8716416   
_________________________________________________________________
dropout_4 (Dropout)          (None, 128)              
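
The output shapes and parameter counts in the summary above can be reproduced by hand. The sketch below does the arithmetic in plain Python, assuming unpadded ("valid") 3x3 convolutions with stride 1 and a 2x2 max pool, as in the model definition; the helper names are hypothetical and are not Keras functions.

```python
def conv2d_valid(height, width, in_channels, filters, kernel=3):
    """Output shape and parameter count of a stride-1, unpadded Conv2D layer."""
    out_shape = (height - kernel + 1, width - kernel + 1, filters)
    # One kernel x kernel x in_channels weight block per filter, plus one bias each.
    params = kernel * kernel * in_channels * filters + filters
    return out_shape, params


def max_pool(height, width, channels, pool=2):
    """Output shape of a non-overlapping max pool (no parameters)."""
    return (height // pool, width // pool, channels)


def dense_params(in_units, out_units):
    """Weights plus biases of a fully connected layer."""
    return in_units * out_units + out_units


shape1, params1 = conv2d_valid(60, 80, 3, 32)      # first Conv2D
shape2, params2 = conv2d_valid(*shape1, 64)        # second Conv2D
pooled = max_pool(*shape2)                         # MaxPool2D
flat_units = pooled[0] * pooled[1] * pooled[2]     # Flatten
params_dense1 = dense_params(flat_units, 128)      # Dense(128)
params_dense2 = dense_params(128, 4)               # Dense(4) output layer

print(shape1, params1)
print(shape2, params2)
print(pooled, flat_units, params_dense1, params_dense2)
```

Almost all of the parameters sit in the first dense layer, because the flattened feature map is still large after a single pooling step.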

In [None]:
# this is the augmentation configuration we will use for training
train_datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=10,  # randomly rotate images in the range (degrees, 0 to 180)
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,  # randomly flip images horizontally
        vertical_flip=False)  # do not flip images vertically

In [13]:
from keras.callbacks import ModelCheckpoint, EarlyStopping

checkpoint = ModelCheckpoint("vgg16_1.h5",
                             monitor='val_accuracy',
                             verbose=1,
                             save_best_only=True,
                             save_weights_only=False,
                             mode='auto',
                             period=1)

early = EarlyStopping(monitor='val_accuracy',
                      min_delta=0,
                      patience=20,
                      verbose=1,
                      mode='auto')

hist = model.fit_generator(steps_per_epoch=100,
                           generator=traindata,
                           validation_data=testdata,
                           validation_steps=10,
                           epochs=100,
                           callbacks=[checkpoint,early])

Epoch 1/100

Epoch 00001: val_accuracy improved from -inf to 0.27812, saving model to vgg16_1.h5
Epoch 2/100

Epoch 00002: val_accuracy did not improve from 0.27812
Epoch 3/100

Epoch 00003: val_accuracy did not improve from 0.27812
Epoch 4/100

Epoch 00004: val_accuracy did not improve from 0.27812
Epoch 5/100

Epoch 00005: val_accuracy did not improve from 0.27812
Epoch 6/100

Epoch 00006: val_accuracy did not improve from 0.27812
Epoch 7/100

Epoch 00007: val_accuracy did not improve from 0.27812
Epoch 8/100

Epoch 00008: val_accuracy did not improve from 0.27812
Epoch 9/100

Epoch 00009: val_accuracy did not improve from 0.27812
Epoch 10/100

Epoch 00010: val_accuracy improved from 0.27812 to 0.28125, saving model to vgg16_1.h5
Epoch 11/100

Epoch 00011: val_accuracy did not improve from 0.28125
Epoch 12/100

Epoch 00012: val_accuracy improved from 0.28125 to 0.28438, saving model to vgg16_1.h5
Epoch 13/100

Epoch 00013: val_accuracy did not improve from 0.28438
Epoch 14/100

Epoch


Epoch 00040: val_accuracy improved from 0.29063 to 0.31250, saving model to vgg16_1.h5
Epoch 41/100

Epoch 00041: val_accuracy did not improve from 0.31250
Epoch 42/100

Epoch 00042: val_accuracy did not improve from 0.31250
Epoch 43/100

Epoch 00043: val_accuracy did not improve from 0.31250
Epoch 44/100

Epoch 00044: val_accuracy did not improve from 0.31250
Epoch 45/100

Epoch 00045: val_accuracy did not improve from 0.31250
Epoch 46/100

Epoch 00046: val_accuracy did not improve from 0.31250
Epoch 47/100

Epoch 00047: val_accuracy did not improve from 0.31250
Epoch 48/100

Epoch 00048: val_accuracy did not improve from 0.31250
Epoch 49/100

Epoch 00049: val_accuracy did not improve from 0.31250
Epoch 50/100

Epoch 00050: val_accuracy did not improve from 0.31250
Epoch 51/100

Epoch 00051: val_accuracy did not improve from 0.31250
Epoch 52/100

Epoch 00052: val_accuracy did not improve from 0.31250
Epoch 53/100

Epoch 00053: val_accuracy did not improve from 0.31250
Epoch 54/100

E