# &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; &emsp; Term Project: 
# &emsp; &emsp; &emsp; &emsp; Pneumonia vs. Normal X-Ray images
# &emsp; &emsp; &emsp; &emsp; &emsp; using SqueezeNet & SVM


This notebook demonstrates the usage of ``image_featurizer`` using the Kaggle Pneumonia-infected vs. Normal dataset.

We will look at the usage of the ``ImageFeaturizer()`` class, which provides a convenient pipeline to quickly tackle image problems with DataRobot's platform. 

It allows users to load image data into the featurizer, and then featurizes the images into a maximum of 2048 features. It appends these features to the CSV as extra columns in line with the image rows. If no CSV was passed in with an image directory, the featurizer generates a new CSV automatically and performs the same function.

## DONE BY:
## &emsp;Sayed Al-Qasim Dheya &nbsp; 20147349&emsp; && &emsp;Mahmoud Mohammad Saleh &nbsp; 20150884
##  &emsp; 

In [2]:
# Setting up stdout logging
import logging
import sys
import pandas as pd
root = logging.getLogger()
root.setLevel(logging.INFO)

ch = logging.StreamHandler(sys.stdout)
ch.setFormatter(logging.Formatter('%(levelname)s - %(message)s'))
root.addHandler(ch)

# Setting pandas display options
pd.options.display.max_rows = 10


In [19]:
# Importing the dependencies for this example
import numpy as np
from sklearn import svm
from pic2vec import ImageFeaturizer
import csv

## Formatting the Data

`ImageFeaturizer` accepts as input either:
1. An image directory
2. A CSV with URL pointers to image downloads, or 
3. A combined image directory + CSV with pointers to the included images. 

For this example, we will load in the Kaggle ill vs. healthy X-ray dataset of 5,216 images, along with a CSV that includes each image's class label.

In [4]:
#WORKING_DIRECTORY = os.path.expanduser('~') + '/workspace/'

csv_path = 'training_set.csv'
image_path = 'chest_xray/train'

Let's take a look at the csv before featurizing the images:

In [5]:
pd.read_csv(csv_path)

Unnamed: 0,images,label
0,person1000_bacteria_2931.jpeg,1
1,person1000_virus_1681.jpeg,1
2,person1001_bacteria_2932.jpeg,1
3,person1002_bacteria_2933.jpeg,1
4,person1003_bacteria_2934.jpeg,1
...,...,...
5211,NORMAL2-IM-1406-0001.jpeg,0
5212,NORMAL2-IM-1412-0001.jpeg,0
5213,NORMAL2-IM-1419-0001.jpeg,0
5214,NORMAL2-IM-1422-0001.jpeg,0


The image directory contains 1341 healthy images and 3875 ill images. The CSV contains pointers to each image in the directory, along with a class label (0 for norm, 1 for ill).
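Given the counts above, roughly 74% of the rows belong to the pneumonia class, so it can help to check the class balance before training. The snippet below is a minimal sketch on a hypothetical four-row stand-in for `training_set.csv` (the frame `labels_df` is made up for illustration; only the column names come from the CSV shown above).

```python
import pandas as pd

# Hypothetical miniature stand-in for training_set.csv (same column names).
labels_df = pd.DataFrame({
    "images": [
        "person1000_bacteria_2931.jpeg",
        "person1000_virus_1681.jpeg",
        "person1001_bacteria_2932.jpeg",
        "NORMAL2-IM-1406-0001.jpeg",
    ],
    "label": [1, 1, 1, 0],
})

# Count rows per class label (0 = normal, 1 = pneumonia).
class_counts = labels_df["label"].value_counts().sort_index()

# Fraction of rows in each class; a strong skew means accuracy alone can mislead.
class_fractions = class_counts / class_counts.sum()

print(class_counts.to_dict())
print(class_fractions.round(2).to_dict())
```

On the real CSV, the same two lines can be applied to `pd.read_csv(csv_path)`.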

## Initializing the Featurizer

We will now initialize the `ImageFeaturizer()` class with a few parameters that define the model. If in doubt, we can always call the featurizer with no parameters, and it will initialize itself to a cookie-cutter build. Here, we will call the parameters explicitly to demonstrate functionality. However, these are generally the default values, so for this build we could just call ```featurizer = ImageFeaturizer()```.

We pass `model='squeezenet'` explicitly, but this is also the featurizer's default: the built-in SqueezeNet model, with loaded weights prepackaged. If you initialize another model, pic2vec will automatically download the model weights through the Keras backend.

The depth indicates how far down we should cut the model to draw abstract features. The further down we cut, the less complex the representations will be, but they may also be less specialized to the specific classes in the ImageNet dataset that the model was trained on, and so they may perform better on data that is further from the classes within the dataset.

Automatic downsampling (`autosample=True`) would downsample the final layer from 512 features to 256 features, which is a more compact representation. With large datasets and bigger models (such as InceptionV3), more features may run into memory problems or difficulty optimizing, so it may be worth downsampling to a smaller feature space. Here we set `autosample=False` and keep all 512 features.
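To make the idea of going from 512 to 256 features concrete, here is a minimal NumPy sketch that halves a feature matrix by averaging adjacent pairs of columns. This is only one plausible way to downsample and is an assumption for illustration; it is not necessarily how pic2vec does it internally. The helper `average_adjacent_features` is hypothetical.

```python
import numpy as np


def average_adjacent_features(features, factor=2):
    """Downsample columns by averaging each group of `factor` adjacent features."""
    n_rows, n_features = features.shape
    if n_features % factor != 0:
        raise ValueError("feature count must be divisible by factor")
    # Reshape to (rows, groups, factor) and average within each group.
    return features.reshape(n_rows, n_features // factor, factor).mean(axis=2)


rng = np.random.default_rng(0)
full = rng.random((4, 512))  # stand-in for 4 images x 512 features
compact = average_adjacent_features(full)

print(full.shape, "->", compact.shape)
```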

In [6]:
featurizer = ImageFeaturizer(depth=1, autosample=False, model='squeezenet')

INFO - Building the featurizer.
INFO - Loading/downloading SqueezeNet model weights. This may take a minute first time.
Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
INFO - Model successfully initialized.
INFO - Model decapitated.
INFO - Model downsampled.
INFO - Full featurizer is built.
INFO - No downsampling. Final layer feature space has size 512


This featurizer was 'decapitated' to the first layer below the prediction layer, which will produce complex representations. Because it is so close to the final prediction layer, it will create more specialized feature representations, and is therefore best suited to image datasets that resemble classes within the original ImageNet dataset. Chest X-rays are not among the ImageNet classes, so a greater depth might transfer better; we keep a depth of 1 here as a simple baseline.

## Loading and Featurizing Images Simultaneously

Now that the featurizer is built, we can actually load our data into the network and featurize the images all at the same time, using a single method:  

In [7]:
featurized_df = featurizer.featurize(image_columns='images',
                                     image_path=image_path,
                                     csv_path=csv_path)

INFO - Found image paths that overlap between both the directory and the csv.

INFO - Loading image batch.
INFO - Converting images.
INFO - Converted 0 images in batch. Only 1000 images left to go.
INFO - Converted 500 images in batch. Only 500 images left to go.
INFO - 
Featurizing image batch.
INFO - Trying to featurize data.
INFO - Creating feature array.
INFO - Feature array created successfully.
INFO - Combining image features with original dataframe.
INFO - Number of missing photos: 1000
...

The images have now been featurized. The featurized dataframe contains the original csv, along with the generated features appended to the appropriate row, corresponding to each image.

There is also an `images_missing` column, to track which images were missing. Features for missing images are filled in as vectors of zeros.

If there are images in the directory that aren't contained in the CSV, or image names in the CSV that aren't in the directory, or even files that aren't valid image files in the directory, have no fear: the featurizer will only try to vectorize valid images that are present in both the CSV and the directory. Any images present in the CSV but not the directory will be given zero vectors, and the order of the image column from the CSV is considered the canonical order for the images.
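The alignment rule described above (CSV order is canonical, missing images get zero vectors, extra files are ignored) can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not pic2vec's own code: the helper `align_features_to_csv` and the toy file names are hypothetical, and only the `images_missing` and `images_feat_N` column names are taken from the output in this notebook.

```python
import numpy as np
import pandas as pd


def align_features_to_csv(csv_names, found_features, n_features):
    """Return one feature row per CSV entry, in CSV order; zeros if missing."""
    rows, missing = [], []
    for name in csv_names:
        vector = found_features.get(name)
        missing.append(vector is None)
        rows.append(np.zeros(n_features) if vector is None else np.asarray(vector))
    columns = [f"images_feat_{i}" for i in range(n_features)]
    frame = pd.DataFrame(np.vstack(rows), columns=columns)
    frame.insert(0, "images_missing", missing)
    return frame


# "b.jpeg" is listed in the CSV but absent from the directory;
# "extra.jpeg" is in the directory but not in the CSV, so it is ignored.
csv_names = ["a.jpeg", "b.jpeg", "c.jpeg"]
found = {"a.jpeg": [0.5, 1.0], "c.jpeg": [2.0, 0.0], "extra.jpeg": [9.0, 9.0]}

aligned = align_features_to_csv(csv_names, found, n_features=2)
print(aligned)
```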

In [8]:
featurized_df

Unnamed: 0,images,label,images_missing,images_feat_0,images_feat_1,images_feat_2,images_feat_3,images_feat_4,images_feat_5,images_feat_6,...,images_feat_502,images_feat_503,images_feat_504,images_feat_505,images_feat_506,images_feat_507,images_feat_508,images_feat_509,images_feat_510,images_feat_511
0,person1000_bacteria_2931.jpeg,1,False,0.000000,0.000000,0.555697,13.000273,0.162005,0.416726,1.096843,...,0.956577,0.046512,0.000000,4.215538,0.714503,0.027517,0.388637,0.406943,2.545398,2.772897
1,person1000_virus_1681.jpeg,1,False,0.000000,0.000000,0.000000,6.548187,0.000000,0.116692,5.377332,...,3.442926,0.041953,0.000000,1.113098,0.754876,0.652534,2.536307,1.857577,0.835985,0.971688
2,person1001_bacteria_2932.jpeg,1,False,0.000000,0.540526,0.419684,17.467466,0.000000,0.275836,6.099510,...,2.261778,0.186999,0.000000,5.545873,0.503430,0.802102,4.470579,2.481482,2.229733,5.326783
3,person1002_bacteria_2933.jpeg,1,False,0.000000,0.765431,0.000000,12.930234,0.000000,0.415402,6.282257,...,3.404135,0.000000,0.000000,3.268504,0.383124,1.780778,2.781443,2.767260,1.293516,5.426302
4,person1003_bacteria_2934.jpeg,1,False,0.007463,0.000000,0.011414,6.923587,0.214813,0.585576,8.611836,...,2.554204,0.000000,0.034383,0.076425,0.023470,0.200237,2.419353,5.483137,2.080965,2.520668
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5211,NORMAL2-IM-1406-0001.jpeg,0,False,0.000000,0.000000,0.657627,6.251391,0.000000,0.293582,3.950681,...,0.380209,0.013220,0.000000,0.586768,0.364738,0.105559,2.901210,2.060967,0.437555,1.334750
5212,NORMAL2-IM-1412-0001.jpeg,0,False,0.061667,0.000000,0.413972,7.239841,0.131390,0.128750,6.652479,...,1.274938,0.364856,0.016582,0.153024,0.899824,0.998709,1.822760,1.926719,0.046185,1.476737
5213,NORMAL2-IM-1419-0001.jpeg,0,False,0.000000,0.000000,0.651131,9.722810,0.044089,0.079274,1.717659,...,3.250457,0.889037,0.000000,0.129854,2.800372,0.126349,0.111221,0.410358,0.136061,1.632934
5214,NORMAL2-IM-1422-0001.jpeg,0,False,0.173470,0.166740,0.045830,13.122295,0.000000,1.101092,2.474325,...,0.826877,0.562402,0.000000,0.324879,1.598484,0.000000,2.080808,1.227228,2.353605,3.700428


As you can see, the `featurize()` function loads the images as tensors, featurizes them using deep learning, and then appends these features to the dataframe in the same row as the corresponding image.

This can be used with both an image directory and a csv with a column containing the image filepaths (as it is in this case). However, it can also be used with just an image directory, in which case it will construct a brand new DataFrame with the image column header specified. Finally, it can be used with just a csv, as long as the image column header contains URLs of each image.

This is the simplest way to use pic2vec, but it is also possible to perform the function in multiple steps. There are actually two processes happening behind the scenes in the above code block: 
1. The images are loaded into the network, and then 
2. The images are featurized and these features are appended to the csv.



<!--## Loading the Data -->

<!--In the next sections, I will demonstrate loading and featurizing the images in separate steps, and explain in more depth what happens during each process.-->

<!--First, we have to load the images into the network. This will parse through the images in the order given by the csv, rescale them to a target size depending on the network (e.g. SqueezeNet is (227, 227))– and build a 5D tensor containing the vectorized representations of the images. This tensor will later be fed into the network in order to be featurized.-->

<!--The tensor has the following dimensions: `[number_of_image_columns, number_of_images_per_image_column, height, width, color_channels]`. In this case, the image tensor will have size `[1, 5216, 227, 227, 3]`. -->

<!--If one were to add a second photo of each animal taken from a new angle, the new tensor might have the dimensions `[2, 5216, 227, 227, 3]`, as there would be a second image column being featurized in each row.-->


<!--To load the images, we have to pass in the name of the column(s) in the CSV containing the image paths, as well as the path to the image directory and the path to the CSV. -->


<!--**Be aware**: -->

<!--When both steps are performed at once, the ImageFeaturizer can use batch processing to prevent any memory errors. By default, it will featurize batches of 1000 images at once, but this number can be changed to whatever batch size your machine can handle when loading the images into memory. -->

<!--If you intend to load and featurize your data in separate steps, make sure your machine is capable of storing every image in memory. -->
Below, the loading and featurizing operations are shown as separate, independent steps.

In [29]:
# Optional: load the data independently of featurizing it.
# Uncomment to run; it is not needed for the rest of this notebook.
#featurizer.load_data('images', image_path=image_path, csv_path=csv_path)

<!-- ## Featurizing the Data -->

<!-- Now that the data is loaded, we're ready to featurize the preloaded data. Like in the `featurize()` method, this will push the vectorized images through the network and save the 2D matrix output– each row representing a single image, and each column storing a different feature.-->

<!--This requires pushing images through the deep network, and so if you choose to use a slower, more powerful model like InceptionV3, large datasets will require a GPU to perform in a reasonable amount of time. Using a low-range GPU, it can take about 30 minutes to process the full 25,000 photos in the Dogs vs. Cats through InceptionV3. On the other hand, if you would like a fast, lightweight model without top-of-the-line accuracy, SqueezeNet works well enough and can perform inference on CPUs quickly. -->

In [27]:
# Optional: featurize the preloaded data independently. This does not affect the result.
#featurize_preloaded_df = featurizer.featurize_preloaded_data(save_features=True)[0]

In [28]:
# Optional: inspect the independently featurized data. This does not affect the result.
#featurize_preloaded_df

In [None]:
# End of the optional section.

If the optional loading cell above is run, the image data is loaded into the featurizer in one single batch, as a tensor with the following dimensions: `[number_of_image_columns, number_of_images_per_image_column, height, width, color_channels]`.

## Results

The dataset has now been fully featurized! The features are saved under the `features` attribute if the `save_features` argument was set to True in either the `featurize()` or `featurize_preloaded_data()` functions:

In [12]:
featurizer.features

Unnamed: 0,images_missing,images_feat_0,images_feat_1,images_feat_2,images_feat_3,images_feat_4,images_feat_5,images_feat_6,images_feat_7,images_feat_8,...,images_feat_502,images_feat_503,images_feat_504,images_feat_505,images_feat_506,images_feat_507,images_feat_508,images_feat_509,images_feat_510,images_feat_511
0,False,0.000000,0.000000,0.555697,13.000273,0.162005,0.416726,1.096843,2.254879,0.002237,...,0.956577,0.046512,0.000000,4.215538,0.714503,0.027517,0.388637,0.406943,2.545398,2.772897
1,False,0.000000,0.000000,0.000000,6.548187,0.000000,0.116692,5.377332,0.076068,0.085824,...,3.442926,0.041953,0.000000,1.113098,0.754876,0.652534,2.536307,1.857577,0.835985,0.971688
2,False,0.000000,0.540526,0.419684,17.467466,0.000000,0.275836,6.099510,1.186228,0.031814,...,2.261778,0.186999,0.000000,5.545873,0.503430,0.802102,4.470579,2.481482,2.229733,5.326783
3,False,0.000000,0.765431,0.000000,12.930234,0.000000,0.415402,6.282257,0.983106,0.577992,...,3.404135,0.000000,0.000000,3.268504,0.383124,1.780778,2.781443,2.767260,1.293516,5.426302
4,False,0.007463,0.000000,0.011414,6.923587,0.214813,0.585576,8.611836,1.738711,0.209424,...,2.554204,0.000000,0.034383,0.076425,0.023470,0.200237,2.419353,5.483137,2.080965,2.520668
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5211,False,0.000000,0.000000,0.657627,6.251391,0.000000,0.293582,3.950681,0.083680,0.206381,...,0.380209,0.013220,0.000000,0.586768,0.364738,0.105559,2.901210,2.060967,0.437555,1.334750
5212,False,0.061667,0.000000,0.413972,7.239841,0.131390,0.128750,6.652479,1.135447,0.683810,...,1.274938,0.364856,0.016582,0.153024,0.899824,0.998709,1.822760,1.926719,0.046185,1.476737
5213,False,0.000000,0.000000,0.651131,9.722810,0.044089,0.079274,1.717659,0.969600,0.000000,...,3.250457,0.889037,0.000000,0.129854,2.800372,0.126349,0.111221,0.410358,0.136061,1.632934
5214,False,0.173470,0.166740,0.045830,13.122295,0.000000,1.101092,2.474325,1.641243,0.060482,...,0.826877,0.562402,0.000000,0.324879,1.598484,0.000000,2.080808,1.227228,2.353605,3.700428


The dataframe can be saved in CSV form either by calling the pandas `DataFrame.to_csv()` method, or by using the `ImageFeaturizer.save_csv()` method on the featurizer itself. This will allow the features to be used directly in the DataRobot app:

In [13]:
featurizer.save_csv()



In [14]:
pd.read_csv('training_set_featurized_squeezenet_depth-1_output-512_(25-May-2019-22.56.29).csv')

Unnamed: 0,images,label,images_missing,images_feat_0,images_feat_1,images_feat_2,images_feat_3,images_feat_4,images_feat_5,images_feat_6,...,images_feat_502,images_feat_503,images_feat_504,images_feat_505,images_feat_506,images_feat_507,images_feat_508,images_feat_509,images_feat_510,images_feat_511
0,person1000_bacteria_2931.jpeg,1,False,0.000000,0.000000,0.555697,13.000273,0.162005,0.416726,1.096843,...,0.956577,0.046512,0.000000,4.215538,0.714503,0.027517,0.388637,0.406943,2.545399,2.772896
1,person1000_virus_1681.jpeg,1,False,0.000000,0.000000,0.000000,6.548187,0.000000,0.116692,5.377332,...,3.442926,0.041953,0.000000,1.113098,0.754876,0.652534,2.536307,1.857577,0.835985,0.971688
2,person1001_bacteria_2932.jpeg,1,False,0.000000,0.540526,0.419684,17.467466,0.000000,0.275836,6.099510,...,2.261778,0.186999,0.000000,5.545873,0.503430,0.802102,4.470579,2.481482,2.229733,5.326783
3,person1002_bacteria_2933.jpeg,1,False,0.000000,0.765431,0.000000,12.930234,0.000000,0.415402,6.282257,...,3.404136,0.000000,0.000000,3.268504,0.383124,1.780778,2.781443,2.767260,1.293516,5.426302
4,person1003_bacteria_2934.jpeg,1,False,0.007463,0.000000,0.011414,6.923587,0.214813,0.585576,8.611836,...,2.554204,0.000000,0.034383,0.076425,0.023470,0.200237,2.419354,5.483137,2.080965,2.520668
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5211,NORMAL2-IM-1406-0001.jpeg,0,False,0.000000,0.000000,0.657627,6.251391,0.000000,0.293582,3.950681,...,0.380209,0.013220,0.000000,0.586768,0.364738,0.105559,2.901210,2.060967,0.437555,1.334750
5212,NORMAL2-IM-1412-0001.jpeg,0,False,0.061667,0.000000,0.413972,7.239841,0.131390,0.128750,6.652479,...,1.274938,0.364856,0.016582,0.153024,0.899824,0.998709,1.822760,1.926719,0.046185,1.476737
5213,NORMAL2-IM-1419-0001.jpeg,0,False,0.000000,0.000000,0.651131,9.722810,0.044089,0.079274,1.717659,...,3.250457,0.889037,0.000000,0.129854,2.800372,0.126349,0.111221,0.410358,0.136061,1.632934
5214,NORMAL2-IM-1422-0001.jpeg,0,False,0.173470,0.166740,0.045830,13.122295,0.000000,1.101092,2.474325,...,0.826877,0.562402,0.000000,0.324879,1.598484,0.000000,2.080808,1.227228,2.353605,3.700427
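For the pandas route mentioned earlier, a minimal round-trip sketch is shown below on a hypothetical two-row frame (the data is made up; an in-memory buffer stands in for a file path). Passing `index=False` keeps the row index out of the file, so reading it back does not add an extra `Unnamed: 0` column.

```python
import io

import pandas as pd

# Hypothetical stand-in for a featurized dataframe.
frame = pd.DataFrame({
    "images": ["a.jpeg", "b.jpeg"],
    "label": [1, 0],
    "images_feat_0": [0.25, 0.5],
})

# Write without the row index, then read the same text back.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(list(restored.columns))
```

With a real file, the buffer would be replaced by a path, for example `featurized_df.to_csv('my_features.csv', index=False)`, where the file name is only a placeholder.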


<!--The `save_csv()` function can be called with no arguments in order to create an automatic csv name, like above. It can also be called with the `new_csv_path='{insert_new_csv_path_here}'` argument. -->

<!-- Alternatively, you can omit certain parts of the automatic name generation with `omit_model=True`, `omit_depth=True`, `omit_output=True`, or `omit_time=True` arguments. -->

We can simply test the performance of a linear SVM classifier over the featurized data. First, we'll build the training and test sets. 

In [39]:
# Creating a training set: the first 500 pneumonia rows and 50 normal rows
train_ill = featurized_df.iloc[:500, :]
train_norm = featurized_df.iloc[3878:3928, :]

# Separating the features from the class labels for each class
train_ill, labels_ill = train_ill.drop(['label', 'images'], axis=1), train_ill['label']
train_norm, labels_norm = train_norm.drop(['label', 'images'], axis=1), train_norm['label']

# Combining the train data and the class labels to train on
train_combined = pd.concat((train_ill, train_norm), axis=0)
labels_train = pd.concat((labels_ill, labels_norm), axis=0)

# Creating a test set of 100 rows per class, disjoint from the training rows
test_ill = featurized_df.iloc[2000:2100, :]
test_norm = featurized_df.iloc[4000:4100, :]

test_ill, test_labels_ill = test_ill.drop(['label', 'images'], axis=1), test_ill['label']
test_norm, test_labels_norm = test_norm.drop(['label', 'images'], axis=1), test_norm['label']

# Combining the test data and the class labels to check predictions
labels_test = pd.concat((test_labels_ill, test_labels_norm), axis=0)
test_combined = pd.concat((test_ill, test_norm), axis=0)
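As an alternative to slicing by row position, a stratified random split keeps the class ratio the same in the training and test sets. The sketch below uses scikit-learn's `train_test_split` on a small synthetic frame (the data, sizes, and the `demo` name are assumptions for illustration); it is not the split used to produce the result in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for featurized_df: 40 rows, 8 features, 3:1 class ratio.
demo = pd.DataFrame(
    rng.random((40, 8)),
    columns=[f"images_feat_{i}" for i in range(8)],
)
demo["label"] = [1] * 30 + [0] * 10

features = demo.drop(columns=["label"])
labels = demo["label"]

# stratify=labels preserves the class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=42
)

print(len(X_train), len(X_test))
```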


Then, we'll train the linear SVM:

In [40]:
# Initialize the linear SVC (named clf to avoid shadowing the sklearn svm module)
from sklearn.svm import SVC
clf = SVC(kernel='linear', probability=True, random_state=42)
# Fit it on the training data
clf.fit(train_combined, labels_train)

# Check the accuracy of the linear classifier on the 200 held-out test rows
clf.score(test_combined, labels_test)

0.925

After running the Pneumonia vs. Normal dataset through the lightest-weight pic2vec model, we find that a simple linear SVM trained over the featurized data achieves 92.5% accuracy on the 200 held-out test images when distinguishing pneumonia from normal X-rays out of the box.
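Accuracy is a single number; for a medical task it is often useful to also look at per-class recall. The sketch below shows how a confusion matrix, sensitivity, and specificity could be computed with scikit-learn. The labels and predictions are hypothetical and are not results from this notebook.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true labels and predictions (1 = pneumonia, 0 = normal).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

# Rows are true classes, columns are predicted classes, ordered [0, 1].
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])

sensitivity = recall_score(y_true, y_pred, pos_label=1)  # recall on pneumonia
specificity = recall_score(y_true, y_pred, pos_label=0)  # recall on normal

print(matrix)
print(sensitivity, specificity)
```

In the notebook itself, `y_true` would be `labels_test` and `y_pred` would come from calling `predict(test_combined)` on the fitted classifier.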

## Summary

That's it! We've looked at the following:

1. What data formats can be passed into the featurizer
2. How to initialize a simple featurizer
3. How to load and featurize the data simultaneously (preferred method)
<!--3. How to load data into the featurizer independently -->
<!--4. How to featurize the loaded data independently -->
4. How to save the featurized dataframe as a csv

<!--And as a bonus, we looked at how we might use the featurized data to perform predictions without dropping the CSV into the DataRobot app. --->

<!--Unless you would like to examine the loaded data before featurizing it, it is recommended to use the `ImageFeaturizer.featurize()` method to perform both functions at once and allow batch processing. -->