# <center>Malaria Detection Model</center>

The data was obtained from Kaggle: https://www.kaggle.com/iarunava/cell-images-for-detecting-malaria. The original source is the NIH website: https://ceb.nlm.nih.gov/repositories/malaria-datasets/.

The data is provided in a single cell image folder, which in turn contains two folders: Parasitized (images of cells that are infected) and Uninfected (images of cells that are not infected). Each folder contains 13779 images. For this model I am using only 500 images from each folder, giving a data set of 1000 images. The reasons are resource constraints while loading the files into Watson Studio and a slow internet connection at my end.
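The subsampling step itself is not shown in this notebook. A minimal sketch of how 500 images per class could be drawn and zipped before upload is given below. The function name `sample_to_zip`, the seed, and the `.png` filter are assumptions for illustration, not the procedure actually used.

```python
import random
import zipfile
from pathlib import Path


def sample_to_zip(image_dir, zip_path, n_images=500, seed=42):
    """Pick n_images PNG files at random from image_dir and write them to zip_path.

    Hypothetical helper: the notebook does not show how its subsets were built.
    """
    # Sort first so the random draw is reproducible for a given seed.
    files = sorted(p for p in Path(image_dir).iterdir() if p.suffix.lower() == ".png")
    rng = random.Random(seed)
    chosen = rng.sample(files, min(n_images, len(files)))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in chosen:
            # Store only the file name so the archive has a flat layout.
            archive.write(path, arcname=path.name)
    return len(chosen)


# Example call (paths are placeholders):
# sample_to_zip("cell_images/Uninfected", "Uninfected_500.zip")
```

Filtering on the extension keeps stray non-image files in the source folder out of the archive, so the later image-loading loop only sees files it can open.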


The first step of the process is to import all the necessary libraries. A few more are imported later, as and when the need arises.

In [1]:
import numpy as np # linear algebra
import pandas as pd 
import os, sys
from IPython.display import display
from IPython.display import Image as _Imgdis
from PIL import Image
from sklearn.model_selection  import train_test_split
from scipy import ndimage
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.utils import to_categorical
from keras.optimizers import Adadelta
from keras import backend as K

Using TensorFlow backend.


The files were loaded as two streaming body objects, each containing a zip folder of 500 images, one for uninfected cells and one for parasitized cells.

Below is the streaming body object creation code for the zip folder containing the parasitized images.

In [2]:
# The code was removed by Watson Studio for sharing.

Below is the streaming body object creation code for the zip folder containing the uninfected images.

In [None]:
# Your data file was loaded into a botocore.response.StreamingBody object.
# Please read the documentation of ibm_boto3 and pandas to learn more about your possibilities to load the data.
# ibm_boto3 documentation: https://ibm.github.io/ibm-cos-sdk-python/
# pandas documentation: http://pandas.pydata.org/
streaming_body_6 = client_acbfbc181f5a4f93b6155e38c3fdbb33.get_object(Bucket='adsfinalproject-donotdelete-pr-tzflpkevbjjbwo', Key='Uninfected_500.zip')['Body']
# add missing __iter__ method so pandas accepts body as file-like object
if not hasattr(streaming_body_6, "__iter__"): streaming_body_6.__iter__ = types.MethodType( __iter__, streaming_body_6 )


Reading each streaming body object into memory, opening it as a zip file, and assigning it to a variable.

In [3]:
from io import BytesIO
import zipfile

zip_parasitized = zipfile.ZipFile(BytesIO(streaming_body_5.read()), 'r')

In [5]:
zip_uninfected = zipfile.ZipFile(BytesIO(streaming_body_6.read()), 'r')
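The `BytesIO` wrapper is needed because `zipfile.ZipFile` expects a seekable file-like object, while a streaming body only supports sequential reads. The sketch below reproduces the pattern with a small archive built in memory. The member names and contents are made up for illustration.

```python
from io import BytesIO
import zipfile

# Build a tiny zip archive in memory to stand in for the downloaded bytes.
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("cell_1.png", b"fake image bytes 1")
    archive.writestr("cell_2.png", b"fake image bytes 2")
raw_bytes = buffer.getvalue()

# Same pattern as above: wrap the raw bytes in BytesIO and open read-only.
zip_demo = zipfile.ZipFile(BytesIO(raw_bytes), "r")
names = zip_demo.namelist()
first_member = zip_demo.read(names[0])
print(names)
```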

The images are all color images, so the channel value is set to 3. After looking at several images, I set the base width to 140 and the base height to 120. The exercise could also be done with smaller dimensions or with the images converted to grayscale.

In [6]:
channels = 3
basewidth= 140
baseheight = 120
input_shape = (channels,baseheight,basewidth)
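To give a feel for what these dimensions cost, the rough calculation below estimates the memory needed to hold all 1000 images as arrays, for color and for grayscale. It assumes each value is stored as a 32-bit float, which is the usual Keras default but is not confirmed in this notebook.

```python
import numpy as np

n_images = 1000
channels, baseheight, basewidth = 3, 120, 140
bytes_per_value = np.dtype("float32").itemsize  # assumed dtype: 4 bytes per value

# Total bytes for the full data set with 3 color channels vs. 1 gray channel.
color_bytes = n_images * channels * baseheight * basewidth * bytes_per_value
gray_bytes = n_images * 1 * baseheight * basewidth * bytes_per_value

print(f"color:     {color_bytes / 1e6:.1f} MB")
print(f"grayscale: {gray_bytes / 1e6:.1f} MB")
```

Under that assumption the color data set takes about 201.6 MB and the grayscale version one third of that.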

### Converting the images into Numeric Data.

Initializing empty lists to hold the numeric data. A for loop iterates over the zip folder that holds the 500 images of uninfected cells. Each image is resized to the base width and base height, converted to an array in the channels-first data format, and scaled to the range 0 to 1. The label list is built at the same time, with a value of zero for every image.

In [7]:
x_unin_data = []
y_unin_label = []
for name in zip_uninfected.namelist():
    y_unin_label.append(0)  # label 0 = uninfected
    img = Image.open(zip_uninfected.open(name))
    # PIL's resize takes (width, height); LANCZOS is the filter ANTIALIAS was an alias for
    img = img.resize((basewidth, baseheight), Image.LANCZOS)
    image_dt = img_to_array(img, data_format='channels_first')
    image_dt = image_dt / 255  # scale pixel values to the range [0, 1]
    x_unin_data.append(image_dt)
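The channels-first conversion can be pictured with plain NumPy. The sketch below uses a random array as a stand-in for a resized image. It is not how `img_to_array` is implemented, only an illustration of the resulting layout. Note that PIL's `resize` takes `(width, height)`, while the pixel array is indexed as `(height, width, channels)`.

```python
import numpy as np

basewidth, baseheight = 140, 120

# Stand-in for a resized RGB image: rows = height, columns = width, last axis = channels.
pixels_hwc = np.random.randint(0, 256, size=(baseheight, basewidth, 3), dtype=np.uint8)

# Move the channel axis to the front and scale to [0, 1].
pixels_chw = np.transpose(pixels_hwc, (2, 0, 1)).astype("float32") / 255

print(pixels_chw.shape)  # (3, 120, 140), matching input_shape
```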

Ensuring that 500 entries are present in both the image array list and the label list.

In [8]:
print(len(x_unin_data))
print(len(y_unin_label))

500
500


Repeating the exercise for the parasitized images zip folder. This time the label is set to 1.

In [9]:
x_para_data = []
y_para_label = []
for name in zip_parasitized.namelist():
    y_para_label.append(1)  # label 1 = parasitized
    img = Image.open(zip_parasitized.open(name))
    # PIL's resize takes (width, height); LANCZOS is the filter ANTIALIAS was an alias for
    img = img.resize((basewidth, baseheight), Image.LANCZOS)
    image_dt = img_to_array(img, data_format='channels_first')
    image_dt = image_dt / 255  # scale pixel values to the range [0, 1]
    x_para_data.append(image_dt)

In [10]:
print(len(x_para_data))
print(len(y_para_label))

500
500


Concatenating the two individual data sets into one combined data set.

In [11]:
X = x_unin_data + x_para_data
Y = y_unin_label + y_para_label

Ensuring that the data count is correct.

In [12]:
print(len(X))
print(len(Y))

1000
1000


Converting the data into numpy arrays and ensuring the data count is correct again.

In [13]:
X = np.array(X)
Y = np.array(Y)
print(X.shape[0])
print(Y.shape[0])

1000
1000


Splitting the data set into train and test data with a ratio of 70 to 30.

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42, test_size=0.3)
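Because the split is random, the two classes are not guaranteed to be equally represented in the training and test sets. The sketch below shows one way to check the class balance after a 70/30 shuffle split. It uses a hand-rolled NumPy permutation for illustration and will not reproduce the exact rows chosen by `train_test_split`. If exact balance matters, `train_test_split` also accepts a `stratify` argument.

```python
import numpy as np

# 500 uninfected (0) followed by 500 parasitized (1), as in the combined label list.
labels = np.array([0] * 500 + [1] * 500)

rng = np.random.RandomState(42)
shuffled = rng.permutation(len(labels))

n_test = int(round(0.3 * len(labels)))  # 300 test samples
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

# Count how many samples of each class landed in each split.
train_counts = np.bincount(labels[train_idx], minlength=2)
test_counts = np.bincount(labels[test_idx], minlength=2)
print("train:", train_counts, "test:", test_counts)
```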

Converting the label datasets into categorical values.

In [15]:
y_train = to_categorical(y_train,2)
y_test = to_categorical(y_test,2)
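The categorical conversion turns each integer label into a one-hot row, so that 0 becomes `[1, 0]` and 1 becomes `[0, 1]`. A NumPy-only sketch of the same idea is shown below. It is an illustration, not the Keras implementation.

```python
import numpy as np

labels = np.array([0, 1, 1, 0])

# Row i of the 2x2 identity matrix is the one-hot vector for class i.
one_hot = np.eye(2, dtype="float32")[labels]
print(one_hot)

# argmax along each row recovers the original integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```

The `argmax` step is the same operation used later to read the expected class back out of `y_test`.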

The approach I have adopted here is to create a convolutional neural network using Keras. Keeping all other parameters constant, I have studied the effect of various optimizers by training on the training data set and evaluating on the test data set. Accuracy and loss are used as the measures of model fit.

The names of the optimizers used for the analysis are stored in a list. A second list stores the loss and accuracy values. A for loop iterates over the optimizer names and, for each one, defines the model, trains it, and tests it. The model definition is placed inside the loop so that every optimizer starts from a fresh model and layers do not accumulate across iterations.
The first layer is a 2D convolution layer with 16 filters, a 1x1 kernel, and ReLU activation. The second convolution layer has 32 filters with the same kernel size and activation. Max pooling follows, with a pool size of (2,2). A dropout layer then drops 25 percent of the units to reduce overfitting. After flattening, a dense layer with 64 neurons and ReLU activation is added, followed by one more dropout layer that drops 50 percent of the units. The final layer is a dense layer with two neurons, one per class, and a softmax activation.

The number of epochs is set to 10 and the batch size to 25 for all iterations.
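The size of this network can be estimated by hand. The sketch below counts trainable parameters layer by layer, assuming every layer treats the data as channels-first. That is an assumption: only the first convolution layer sets `data_format` explicitly, and the remaining layers follow the global Keras image data format, which is not shown in this notebook.

```python
def conv2d_params(in_channels, filters, kernel_h, kernel_w):
    """Weights plus one bias per filter."""
    return in_channels * filters * kernel_h * kernel_w + filters


def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units


channels, height, width = 3, 120, 140

conv1 = conv2d_params(channels, 16, 1, 1)     # 1x1 kernel keeps height and width
conv2 = conv2d_params(16, 32, 1, 1)
pooled_h, pooled_w = height // 2, width // 2  # (2, 2) max pooling halves both
flat_units = 32 * pooled_h * pooled_w         # Flatten output size
dense1 = dense_params(flat_units, 64)
dense2 = dense_params(64, 2)

total = conv1 + conv2 + dense1 + dense2
print(flat_units, total)
```

Under these assumptions almost all of the roughly 8.6 million parameters sit in the first dense layer, because the 1x1 convolutions and a single pooling step leave 134400 values to flatten.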


In [16]:
Optimizer_List = ['Adam','Adadelta','Adagrad','RMSprop','Adamax','Nadam','SGD']
Score = []
for j in range(0,len(Optimizer_List)):
    model = Sequential()
    model.add(Conv2D(16, kernel_size=(1,1),activation='relu',input_shape=input_shape, data_format= "channels_first"))
    model.add(Conv2D(32, (1,1), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='binary_crossentropy',optimizer=Optimizer_List[j] ,metrics=['accuracy'])
    model.fit(X_train, y_train,batch_size=25, epochs=10, validation_data=(X_test, y_test))
    Score.append(model.evaluate(X_test,y_test))
    model.reset_states()    

Train on 700 samples, validate on 300 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Train on 700 samples, validate on 300 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Train on 700 samples, validate on 300 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Train on 700 samples, validate on 300 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Train on 700 samples, validate on 300 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Train on 700 samples, validate on 300 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Train on 700 samples, validate on 300 samples
Epoch 1/10
[output truncated]

Printing the score values.

In [17]:
Score

[[0.53622841914494834, 0.78333333333333333],
 [0.62272109667460118, 0.66999999920527142],
 [0.43585414807001749, 0.83666666587193805],
 [0.28718192815780641, 0.90666666587193812],
 [0.59146838823954262, 0.77666666746139523],
 [0.57654201269149785, 0.71999999920527136],
 [0.62621420780817671, 0.69333333253860474]]

Printing the score values along with the respective Optimizer names.

### The Loss and Accuracy Value for the Various Optimizers

| Optimizer | Test loss | Test accuracy |
|-----------|-----------|---------------|
| Adam      | 0.53622841914494834 | 0.78333333333333333 |
| Adadelta  | 0.62272109667460118 | 0.66999999920527142 |
| Adagrad   | 0.43585414807001749 | 0.83666666587193805 |
| RMSprop   | 0.28718192815780641 | 0.90666666587193812 |
| Adamax    | 0.59146838823954262 | 0.77666666746139523 |
| Nadam     | 0.57654201269149785 | 0.71999999920527136 |
| SGD       | 0.62621420780817671 | 0.69333333253860474 |
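Pairing the names with the scores can also be done in code rather than by hand. The sketch below uses the values printed above and selects the optimizer with the highest test accuracy. The variable names are illustrative.

```python
optimizer_names = ['Adam', 'Adadelta', 'Adagrad', 'RMSprop', 'Adamax', 'Nadam', 'SGD']
scores = [[0.53622841914494834, 0.78333333333333333],
          [0.62272109667460118, 0.66999999920527142],
          [0.43585414807001749, 0.83666666587193805],
          [0.28718192815780641, 0.90666666587193812],
          [0.59146838823954262, 0.77666666746139523],
          [0.57654201269149785, 0.71999999920527136],
          [0.62621420780817671, 0.69333333253860474]]

# Each score is [loss, accuracy], in the same order as the optimizer names.
results = dict(zip(optimizer_names, scores))
for name, (loss, accuracy) in results.items():
    print(f"{name:<9} loss={loss:.4f} accuracy={accuracy:.4f}")

best_by_accuracy = max(results, key=lambda name: results[name][1])
best_by_loss = min(results, key=lambda name: results[name][0])
print("best accuracy:", best_by_accuracy, "| lowest loss:", best_by_loss)
```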

#### Since RMSprop had the best training and test accuracy (0.934 and 0.9066, respectively), we choose it for our model.

Retraining the model using the RMSprop optimizer.

In [21]:
model_final= Sequential()
model_final.add(Conv2D(16, kernel_size=(1,1),activation='relu',input_shape=input_shape, data_format= "channels_first"))
model_final.add(Conv2D(32, (1,1), activation='relu'))
model_final.add(MaxPooling2D(pool_size=(2, 2)))
model_final.add(Dropout(0.25))
model_final.add(Flatten())
model_final.add(Dense(64, activation='relu'))
model_final.add(Dropout(0.5))
model_final.add(Dense(2, activation='softmax'))
model_final.compile(loss='binary_crossentropy',optimizer='RMSprop' ,metrics=['accuracy'])
model_final.fit(X_train, y_train,batch_size=25, epochs=10, validation_data=(X_test, y_test))

Train on 700 samples, validate on 300 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f0180116e10>

Re-evaluating the model and printing the Loss and Accuracy Values.

In [22]:
Score_final = model_final.evaluate(X_test,y_test)



In [23]:
print('Test_Loss :', Score_final[0])
print('Test_Accuracy :', Score_final[1])

Test_Loss : 0.335929806232
Test_Accuracy : 0.873333334128


Saving the model to a .h5 file and packing it into a gzipped tar archive for model deployment.

In [24]:
save_path = "Malaria_Detection_Model.h5"
model_final.save(save_path)

In [25]:
!tar -zcvf Malaria_Detection_Model.tgz Malaria_Detection_Model.h5

Malaria_Detection_Model.h5
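Where shell commands are not available, the same archive can be produced with Python's standard `tarfile` module. The helper below is a sketch with a hypothetical name, `make_tgz`. It also lists the archive contents afterwards as a quick check.

```python
import tarfile
from pathlib import Path


def make_tgz(source_file, archive_path):
    """Write source_file into a gzip-compressed tar archive and return its member names."""
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname drops any directory prefix so the file sits at the archive root.
        archive.add(source_file, arcname=Path(source_file).name)
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


# Example call, mirroring the tar command above:
# make_tgz("Malaria_Detection_Model.h5", "Malaria_Detection_Model.tgz")
```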


Importing the WatsonMachineLearningAPIClient for model deployment.

In [26]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

Status code: 400, body: {"trace":"809d60a9e33db85880210001c20abeff8558","errors":[{"code":"invalid_framework_input","message":"The framework libraries values specified: [{\"name\":\"keras\",\"version\":\"2.1.4\"}] are not supported."}]}


Initializing the WML credentials.

In [92]:
# The code was removed by Watson Studio for sharing.

Creating an instance of wml using the credentials and setting the parameters for model storage.

In [28]:
client = WatsonMachineLearningAPIClient(wml_credentials)

In [41]:
model_props = {client.repository.ModelMetaNames.AUTHOR_NAME: "IBM", 
               client.repository.ModelMetaNames.AUTHOR_EMAIL: "ibm@ibm.com", 
               client.repository.ModelMetaNames.NAME: "Malaria_Detection_Model",
               client.repository.ModelMetaNames.FRAMEWORK_NAME: "tensorflow",
               client.repository.ModelMetaNames.FRAMEWORK_VERSION: "1.5" ,
               client.repository.ModelMetaNames.FRAMEWORK_LIBRARIES: [{"name": "keras", "version": "2.1.3"}]
              }

Storing the model in the repository and deploying it.

In [42]:
published_model = client.repository.store_model(model="Malaria_Detection_Model.tgz", meta_props=model_props)

In [43]:
published_model_uid = client.repository.get_model_uid(published_model)
model_details = client.repository.get_details(published_model_uid)

In [44]:
client.deployments.list()

----  ----  ----  -----  -------  ---------  -------------
GUID  NAME  TYPE  STATE  CREATED  FRAMEWORK  ARTIFACT TYPE
----  ----  ----  -----  -------  ---------  -------------


In [45]:
created_deployment = client.deployments.create(published_model_uid, name="Malaria Detection")



#######################################################################################

Synchronous deployment creation for uid: 'cbab4d9a-c834-400f-b1e2-e1bb4de7a737' started

#######################################################################################


INITIALIZING
DEPLOY_IN_PROGRESS
DEPLOY_SUCCESS


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='ac44e366-15d8-42a0-a0b8-ed49082695a0'
------------------------------------------------------------------------------------------------




#### Testing the model with a random sample from the test data set

In [84]:
scoring_endpoint = created_deployment['entity']['scoring_url']
print(scoring_endpoint)

https://us-south.ml.cloud.ibm.com/v3/wml_instances/9b73dc07-bbe7-4ae2-a9aa-1f8e0bc7c5a0/deployments/ac44e366-15d8-42a0-a0b8-ed49082695a0/online


In [85]:
x_score_1 = X_test[168].tolist()
print('The answer should be: ',np.argmax(y_test[168]))
scoring_payload = {'values': [x_score_1]}

The answer should be:  0


In [86]:
predictions = client.deployments.score(scoring_endpoint, scoring_payload)
print('And the answer is!... ',predictions['values'][0][1])

And the answer is!...  0
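The scoring payload has to be JSON-serializable, which is why the sample is converted with `tolist()` before being sent. The sketch below checks that round trip on a random stand-in array and shows how a class could be read from a pair of class probabilities with `argmax`. The probability values are hypothetical, and the exact layout of the service response is not assumed here.

```python
import json
import numpy as np

# Stand-in for one test sample in channels-first layout.
sample = np.random.rand(3, 120, 140).astype("float32")

# Nested Python lists serialize to JSON; NumPy arrays do not.
scoring_payload = {"values": [sample.tolist()]}
body = json.dumps(scoring_payload)

# Decoding gives back a batch of one sample with the original shape.
restored = np.array(json.loads(body)["values"])
print(restored.shape)  # (1, 3, 120, 140)

# Hypothetical softmax output for one sample: [P(uninfected), P(parasitized)].
hypothetical_probs = np.array([0.91, 0.09])
predicted_class = int(hypothetical_probs.argmax())
print(predicted_class)  # 0 = uninfected
```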
