# Malaria Cell Classification with Keras #

This notebook logs the work of training a Keras model, saved in HDF5 (`.h5`) format, on the publicly available Malaria dataset: https://ceb.nlm.nih.gov/repositories/malaria-datasets.

## Setup Environment ##

In [None]:
# create new anaconda environment
conda create -n env2 python=3.6

# enter new environment
source activate env2

# install necessary packages
conda install numpy scipy seaborn matplotlib pandas scikit-learn tensorflow theano keras opencv pillow
pip install tensorflowjs

## Setup scripts and dataset ##

__Create the following directory structure:__  

<img src="i1.png" width=250 align=left  style="padding-right: 15px;">

The parasitized and uninfected JPEG images are placed in the folders with these respective names. The folder names are used as the class labels.  

The test folder is empty at first.  
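
As a minimal sketch of how folder names can serve as class labels, the helper below lists the sub-folders of a training directory and pairs every image with the name of the folder it sits in. The function names and the `dataset/train` path in the usage note are assumptions made for illustration; they are not taken from the project scripts.

```python
from pathlib import Path


def discover_classes(train_dir):
    """Return a sorted list of class names, one per sub-folder of train_dir."""
    return sorted(p.name for p in Path(train_dir).iterdir() if p.is_dir())


def list_labeled_images(train_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Return (image_path, class_name) pairs for every image under train_dir."""
    pairs = []
    for class_name in discover_classes(train_dir):
        for image_path in sorted((Path(train_dir) / class_name).iterdir()):
            # Skip anything that is not an image, e.g. stray text files.
            if image_path.suffix.lower() in extensions:
                pairs.append((image_path, class_name))
    return pairs
```

Calling `list_labeled_images("dataset/train")` (a hypothetical path) on a layout with `parasitized` and `uninfected` folders would yield pairs labeled with those two names.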

### Explanation of files: ###
__init-conda.sh__ (optional) runs the environment setup shown above.  

__organize.py__ (optional) rearranges the dataset files.  

__extract_features.py__ loads a pre-trained model from the applications module of the Keras library and builds the model according to the user-specified configuration in __conf.json__. It then extracts features from the specified layer of the model, which was pre-trained on the ImageNet dataset. The features and their labels are stored locally in HDF5 format, and the model and its weights are saved.  

__train.py__ loads the extracted features and labels, then trains a logistic regression model on them. It also uses scikit-learn and seaborn to generate a confusion matrix (normalized and unnormalized) of the trained model on held-out test data.

__test.py__ uses the trained model to predict the class of unseen images.


## Perform feature extraction ##

In [None]:
$ python extract_features.py
Using TensorFlow backend.
Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg19_weights_tf_dim_ordering_tf_kernels.h5
574717952/574710816 [==============================] 100%
[INFO] successfully loaded base model and model...
[INFO] encoding labels...
[INFO] processed - 1
[INFO] processed - 2
[INFO] processed - 3
...
[INFO] processed - 9998
[INFO] processed - 9999
[INFO] processed - 10000
[INFO] completed label - parasitized
[INFO] processed - 1
[INFO] processed - 2
[INFO] processed - 3
...
[INFO] processed - 9998
[INFO] processed - 9999
[INFO] processed - 10000
[INFO] completed label - uninfected
[STATUS] training labels shape: (20000,)
[STATUS] saved model and weights to disk..
[STATUS] features and labels saved..

👆 The operation above takes about 13-14 hours on my desktop machine (CPU only).
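
The extraction step can be sketched roughly as follows. This is not the contents of `extract_features.py`: it assumes the `tensorflow.keras` API rather than standalone Keras, and it assumes the `fc2` layer of VGG19, chosen only because its 4096 outputs match the feature width reported by the training log. The function names are hypothetical.

```python
import numpy as np
from tensorflow.keras.applications.vgg19 import VGG19, preprocess_input
from tensorflow.keras.models import Model


def build_extractor(weights="imagenet", layer_name="fc2"):
    """Return a model mapping 224x224 RGB images to the activations of layer_name."""
    # include_top=True keeps the fully connected layers (fc1, fc2, predictions).
    base = VGG19(weights=weights, include_top=True)
    return Model(inputs=base.input, outputs=base.get_layer(layer_name).output)


def extract_features(extractor, images):
    """images: array of shape (n, 224, 224, 3), RGB values in 0-255.

    Returns an array of shape (n, 4096) when layer_name is fc1 or fc2.
    """
    # Copy to float32 first because preprocess_input may modify its argument.
    batch = preprocess_input(np.array(images, dtype="float32"))
    return extractor.predict(batch, verbose=0)
```

With the default `weights="imagenet"`, the first call downloads the pre-trained weights, as in the log above. Passing several images per call, rather than one at a time, may reduce per-call overhead; whether that helps on a CPU-only machine would need measuring.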

## Perform training ##

In [None]:
$ python train.py 
[INFO] features shape: (20000, 4096)
[INFO] labels shape: (20000,)
[INFO] training started...
[INFO] splitted train and test data...
[INFO] train data  : (20000, 4096)
[INFO] test data   : (7558, 4096)
[INFO] train labels: (20000,)
[INFO] test labels : (7558,)
[INFO] creating model...
[INFO] evaluating model...
[INFO] saving model...

Training is much faster, on the order of a few minutes.
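
The training step can be sketched along these lines. It is a sketch under assumptions, not the contents of `train.py`: the split ratio, the seed, `max_iter`, and the function name are all made up for illustration, and the features are synthetic stand-ins with 16 dimensions instead of 4096.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split


def train_and_evaluate(features, labels, test_size=0.2, seed=9):
    """Split the data, fit logistic regression, return (model, confusion matrix)."""
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(x_train, y_train)
    # Rows of the matrix are true classes, columns are predicted classes.
    matrix = confusion_matrix(y_test, model.predict(x_test))
    return model, matrix


# Synthetic stand-in for the extracted features (assumption, not real data):
# two Gaussian blobs, 100 samples each.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0.0, 1.0, (100, 16)), rng.normal(3.0, 1.0, (100, 16))])
labels = np.array([0] * 100 + [1] * 100)
model, matrix = train_and_evaluate(features, labels)
print(matrix)
```

With a split like this, the train and test row counts add up to the number of feature rows, which is a quick check to run against the shapes a training script logs.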

## Perform testing ##

In [None]:
$ python test.py 
Using TensorFlow backend.
[INFO] loading the classifier...
dataset/test/P__C33P1thinF_IMG_20150619_120804a_cell_224.jpg is parasitized
dataset/test/P__C33P1thinF_IMG_20150619_120645a_cell_216.jpg is parasitized
dataset/test/P__C33P1thinF_IMG_20150619_120645a_cell_217.jpg is parasitized
dataset/test/U__C1_thinF_IMG_20150604_104722_cell_216.jpg is uninfected
dataset/test/P__C33P1thinF_IMG_20150619_120742a_cell_210.jpg is parasitized
...
dataset/test/U__C1_thinF_IMG_20150604_104722_cell_211.jpg is uninfected
dataset/test/U__C1_thinF_IMG_20150604_104722_cell_164.jpg is uninfected
dataset/test/U__C1_thinF_IMG_20150604_104722_cell_231.jpg is uninfected
dataset/test/U__C1_thinF_IMG_20150604_104722_cell_191.jpg is uninfected
dataset/test/P__C33P1thinF_IMG_20150619_120838a_cell_222.jpg is parasitized

A small portion of the images is set aside from the dataset for testing purposes.
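
One way to set images aside is sketched below: a random sample is moved (not copied) out of a class folder into the test folder, so the held-out images can no longer end up in the training data. The `P__` and `U__` prefixes mirror the file names in the log above, on the assumption that they mark the true class; the function name and its parameters are hypothetical.

```python
import random
import shutil
from pathlib import Path


def hold_out(class_dir, test_dir, prefix, count, seed=0):
    """Move `count` randomly chosen files from class_dir into test_dir.

    Each moved file gets `prefix` prepended so its true class stays visible.
    Returns the list of new paths.
    """
    images = sorted(p for p in Path(class_dir).iterdir() if p.is_file())
    # A seeded generator makes the selection reproducible.
    chosen = random.Random(seed).sample(images, count)
    Path(test_dir).mkdir(parents=True, exist_ok=True)
    moved = []
    for src in chosen:
        dst = Path(test_dir) / f"{prefix}{src.name}"
        shutil.move(str(src), str(dst))
        moved.append(dst)
    return moved
```

For example, `hold_out("dataset/train/parasitized", "dataset/test", "P__", 50)` would move 50 images; the training path and the count here are placeholders.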

## Confusion Matrices ##

We plot confusion matrices (non-normalized and normalized) to look for misclassifications.  

<img src="i3.png" align=left><img src="i4.png" align=left>
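
To make the difference between the two plots concrete, here is a small NumPy sketch that builds a count matrix and a row-normalized version of it. Row normalization (each true class sums to 1) is an assumption about how the plots were produced, and the labels are toy values, not results from this notebook.

```python
import numpy as np


def confusion_counts(y_true, y_pred, n_classes):
    """Return an (n_classes, n_classes) array; rows are true classes, columns predictions."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total so every non-empty row sums to 1."""
    totals = matrix.sum(axis=1, keepdims=True)
    # Rows with no samples are left as zeros instead of dividing by zero.
    return np.divide(matrix, totals, out=np.zeros(matrix.shape, dtype=float), where=totals != 0)


# Toy labels for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]
counts = confusion_counts(y_true, y_pred, n_classes=2)
print(counts)                  # [[2 1], [1 4]]
print(normalize_rows(counts))  # rows sum to 1
```

The count matrix shows how many images fall in each cell, while the normalized one shows per-class rates, which is easier to compare when the classes differ in size.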

## Convert Keras model to Tensorflowjs ##

In [None]:
$ tensorflowjs_converter --input_format=keras /x/output/vgg19/model_0.2.h5 /x/output/vgg19-js
Using TensorFlow backend.

Successfully created a Tensorflowjs model from Keras.

## Deductions ##

The trained models appear to be very accurate on the validation data. The accuracy is high enough to be almost suspicious, so further investigation may be needed.  
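
One possible check, sketched here as an assumption about where to look rather than a diagnosis, is whether any image appears in both the training and test folders. Comparing content hashes catches duplicates even when files were renamed. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digests(folder):
    """Map the SHA-256 digest of each file under folder to its path."""
    digests = {}
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            digests[hashlib.sha256(path.read_bytes()).hexdigest()] = path
    return digests


def find_overlap(train_dir, test_dir):
    """Return (train_path, test_path) pairs whose file contents are identical."""
    train = file_digests(train_dir)
    test = file_digests(test_dir)
    return [(train[d], test[d]) for d in sorted(train.keys() & test.keys())]
```

An empty result from `find_overlap` rules out byte-identical duplicates only; it says nothing about other causes of inflated accuracy.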

90 images from the uninfected dataset appear to be missing from the generated confusion matrices for both the VGG19 and VGG16 models. This may also warrant investigation.  
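
A simple way to quantify such a gap is to compare each row total of the confusion matrix with the number of test images expected for that class. The sketch below does this with plain lists; the function name is hypothetical and the numbers are made up for illustration, not taken from this notebook.

```python
def missing_per_class(matrix, expected_counts):
    """Compare each row total of a confusion matrix with the expected sample count.

    matrix: list of rows, indexed by true class.
    expected_counts: number of test samples expected for each true class.
    Returns {class_index: expected - observed} for every class that differs.
    """
    gaps = {}
    for index, (row, expected) in enumerate(zip(matrix, expected_counts)):
        observed = sum(row)
        if observed != expected:
            gaps[index] = expected - observed
    return gaps


# Made-up numbers for illustration only, not results from this notebook.
example_matrix = [[50, 0], [1, 39]]
print(missing_per_class(example_matrix, expected_counts=[50, 50]))  # {1: 10}
```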

All ops (operations) used by a Keras model created in this fashion can be exported to a Tensorflowjs model.  

Training could be sped up by running the job on a SLURM cluster or in the cloud.  

## Achievements ##

VGG16 and VGG19 Keras and Tensorflowjs models were created on the Malaria dataset.  

## Future Work ##

Create models for inceptionv3, resnet50, xception, inceptionresnetv2 and mobilenet.  

Create a deployable HTML and JavaScript app that demonstrates the inference capabilities of Tensorflowjs. Observe analytics such as model/shard load time, warm-up time, and prediction speed.