<a href="https://colab.research.google.com/github/RichFree/CS3244PnumoniaCNN/blob/master/project_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

ORIGINAL DATA SOURCE:

The dataset contains 2 folders, test and train:
- chest-xray
    - test - 624
        - NORMAL - 234
        - PNEUMONIA - 390
    - train - 5216
        - NORMAL - 1341
        - PNEUMONIA - 3875

That makes a total of 5840 images (5216 train + 624 test).
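The per-folder counts listed above can be re-checked once the data is downloaded. Below is a minimal sketch, assuming the `root/split/class` layout shown in the tree; `count_images` and `total_images` are hypothetical helper names, not part of the dataset or of any library.

```python
import os

def count_images(root):
    """Return {split: {class_name: file_count}} for a root/split/class layout."""
    counts = {}
    for split in sorted(os.listdir(root)):
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            continue  # skip stray files next to the split folders
        counts[split] = {}
        for cls in sorted(os.listdir(split_dir)):
            cls_dir = os.path.join(split_dir, cls)
            if os.path.isdir(cls_dir):
                counts[split][cls] = len(os.listdir(cls_dir))
    return counts

def total_images(counts):
    """Sum the file counts over every split and class."""
    return sum(n for per_class in counts.values() for n in per_class.values())

# Usage (the path is whatever folder holds 'test' and 'train'):
# counts = count_images(my_data_dir)
# print(counts, total_images(counts))
```

Note that this counts every file in a class folder, so hidden or non-image files would inflate the numbers.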

Acknowledgements: this dataset is taken from https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia

# Runtime selection

- run the cell below in your first run
- click runtime, change runtime type, select Python 3 and GPU
- click runtime, restart runtime

Why use TF 1.x? 
- we hit bugs in the fit function when training the model under TensorFlow 2.2 (cause not identified)
- TensorFlow 1.15.2 runs our functions closely enough to TensorFlow 2.1.0




In [0]:
# Restart runtime using 'Runtime' -> 'Restart runtime...'
%tensorflow_version 1.x
import tensorflow as tf
print(tf.__version__)

TensorFlow 1.x selected.
1.15.2


# Import

In [0]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.image import imread 
# %matplotlib inline # Technically not necessary in newest versions of jupyter

# this portion is necessary because of large data sizes
# https://stackoverflow.com/questions/48610132/tensorflow-crash-with-cudnn-status-alloc-failed
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(gpus[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048)])
  except RuntimeError as e:
    print(e)



In [0]:
tf.__version__

'1.15.2'

# Mount Google Drive

Below is how we read files from Google Drive. 
Guide: https://stackoverflow.com/questions/48376580/google-colab-how-to-read-data-from-my-google-drive

1. Mount your drive using the cell below - sign in with your Google Drive account
2. set the directory to the folder according to your folder path

## Ideas: 

- There are 2 suggestions: 
  - mount a folder in your gdrive, which you have to upload manually
  - import a tar file to the colab drive, then extract it there

Instructions for both methods are in the Stack Overflow guide above
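For the second suggestion (upload an archive, then unpack it on the Colab disk), a minimal sketch using only the standard library could look like this. `extract_archive` is a hypothetical helper, and the paths in the usage comment are placeholders rather than paths from this project.

```python
import os
import tarfile

def extract_archive(archive_path, dest_dir):
    """Extract a .tar or .tar.gz archive into dest_dir and list its top-level entries."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path) as tar:  # default mode auto-detects compression
        tar.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))

# Usage (placeholder paths, adjust to your own upload location):
# extract_archive('/content/chest_xray.tar.gz', '/content/data')
```

Only extract archives you created or trust, since `extractall` writes whatever paths the archive contains.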




In [108]:
from google.colab import drive
drive.mount('/gdrive')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


## Data Folder download and directory change

- chest_xray folder link: https://drive.google.com/drive/folders/1wPG9qwrz3YyQUUbwMq7ccVCAPmY-_9AN?usp=sharing
- get the folder chest_xray from the above link, then paste in a directory you remember in your google drive
- Replace the directory below with where you put your data
- Please include and stop at the folder "chest_xray"


## Useful commands
- !ls '/gdrive' 
  - list the directories in your drive
- %cd '/gdrive/My Drive/...'
  - change directory (use %cd, not !cd: !cd runs in a temporary subshell, so the change does not persist to the next command)

How to get the path of your chest_xray folder? 
- !ls to find which directories you can access from /gdrive
- %cd to the parent directory of chest_xray
- repeat until you reach "chest_xray"
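Instead of stepping through the folders by hand, the search can also be scripted. This is a sketch under the assumption that the folder is named exactly `chest_xray`; `find_folder` is a hypothetical helper, and walking a large Drive can be slow.

```python
import os

def find_folder(start_dir, target="chest_xray"):
    """Walk start_dir and return the first path whose folder name equals target, else None."""
    for current, subdirs, _files in os.walk(start_dir):
        if target in subdirs:
            return os.path.join(current, target)
    return None

# Usage:
# my_data_dir = find_folder('/gdrive/My Drive')
# print(my_data_dir)
```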


In [110]:
# your google drive overview
!ls '/gdrive/My Drive'

'1. Life'	'4. Computing'	   '6. Modules'        Notability
'2. Education'	'4. Spiritual'	   '7. Projects'       Sharing
'3. Soul'	'5. Organisation'  'Colab Notebooks'


In [0]:
my_data_dir = '/gdrive/My Drive/7. Projects/CS3244/chest_xray'

In [0]:
# CONFIRM THAT THIS REPORTS BACK 'test', and 'train'
# should return: ['test', 'train']
os.listdir(my_data_dir) 

In [0]:
test_path = my_data_dir+'/test/'
train_path = my_data_dir+'/train/'

In [0]:
os.listdir(test_path) # test path

In [0]:
os.listdir(train_path) # train path

# Data Visualisation

you can read in image data and preview images  

**Below is a preview of a pneumonia example**

In [0]:
os.listdir(train_path+'/PNEUMONIA')[0]

try to see what a picture looks like

In [0]:
pnu_patient = train_path+'/PNEUMONIA'+'/person1000_bacteria_2931.jpeg'
pnu_img= imread(pnu_patient)
plt.imshow(pnu_img)

In [0]:
pnu_img.shape # resolution of image

dimensions: 760 vertical, 1152 horizontal, 1 colour channel

**Below is an example of a Normal patient**

In [0]:
norm_patient = train_path+'/NORMAL/'+os.listdir(train_path+'/NORMAL')[53]
norm_img = imread(norm_patient)
plt.imshow(norm_img)

how many images are there?

In [0]:
len(os.listdir(train_path+'/PNEUMONIA'))

In [0]:
len(os.listdir(train_path+'/NORMAL'))

**what are the average dimensions of the images?**

In [0]:
# Other options: https://stackoverflow.com/questions/1507084/how-to-check-dimensions-of-all-images-in-a-directory-using-python
# note: this samples only the NORMAL images in the test set
dim1 = [] # heights
dim2 = [] # widths
for image_filename in os.listdir(test_path+'/NORMAL'):
    
    img = imread(test_path+'/NORMAL'+'/'+image_filename)
    d1,d2 = img.shape[:2] # [:2] also works if an image loads with a colour channel axis
    dim1.append(d1)
    dim2.append(d2)

In [0]:
sns.jointplot(x=dim1, y=dim2) # keyword arguments; newer seaborn versions reject positional x, y

In [0]:
np.mean(dim1)

In [0]:
np.mean(dim2)

In [0]:
image_shape = (64,64,1)

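Note: 
- I used 64x64 resolution because the average resolution is too high to train on directly
- isn't the data black and white? shouldn't it be 1 channel? yes: with color_mode = 'grayscale' the images load as 1 channel (see Thoughts at the end). Whether we can then raise the resolution is still open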
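On the channel question: `imread` returns a 2-D array for a single-channel file and a 3-D array when a file carries colour channels. The sketch below shows one way to force either case to a single channel with NumPy. `to_single_channel` is a hypothetical helper, and averaging the channels is an assumption made for illustration, not what Keras does internally.

```python
import numpy as np

def to_single_channel(img):
    """Return a 2-D (height, width) array from a 2-D or 3-D image array."""
    img = np.asarray(img)
    if img.ndim == 2:
        return img  # already single channel
    # Average the first three channels (ignores an alpha channel if present).
    return img[..., :3].mean(axis=-1)

gray = np.zeros((760, 1152))       # same shape as the pneumonia preview above
colour = np.zeros((760, 1152, 3))  # the same picture if it loaded as RGB
print(to_single_channel(gray).shape, to_single_channel(colour).shape)
```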

# Preparing the Data for the model
There is too much data to read into memory all at once. We can use some built-in functions in Keras to automatically process the data, generate a flow of batches from a directory, and also manipulate the images.

## Image Manipulation
It's usually a good idea to manipulate the images with rotation, resizing, and scaling so the model becomes more robust to images that differ from those in our data set. We can use the ImageDataGenerator to do this automatically for us. Check the ImageDataGenerator documentation for the full list of parameters you can use.
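To make a few of these manipulations concrete, here is a NumPy-only illustration of rescaling, horizontal flipping, and what a fractional shift range means in pixels. It is a sketch for intuition, not the ImageDataGenerator implementation, and the helper names are hypothetical.

```python
import numpy as np

def rescale(img, factor=1 / 255):
    """Map 8-bit pixel values from [0, 255] to [0, 1]."""
    return img.astype("float32") * factor

def horizontal_flip(img):
    """Mirror the image left to right (reverse the column order)."""
    return img[:, ::-1]

def max_shift_pixels(size, fraction):
    """Largest shift, in pixels, implied by a fractional shift range."""
    return size * fraction

toy = np.array([[0, 128, 255],
                [255, 128, 0]], dtype="uint8")  # made-up 2x3 image
scaled = rescale(toy)
flipped = horizontal_flip(toy)
print(scaled.max(), flipped[0], max_shift_pixels(64, 0.10))
```

For a 64-pixel-wide input, a shift range of 0.10 therefore moves the picture by at most about 6 pixels.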

In [0]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [0]:
# using settings from 03
train_datagen = ImageDataGenerator(rotation_range=5, # rotate the image by up to 5 degrees
                               width_shift_range=0.10, # shift the pic horizontally by up to 10% of its width
                               height_shift_range=0.10, # shift the pic vertically by up to 10% of its height
                               rescale=1/255, # rescale pixel values from [0,255] to [0,1]
                               shear_range=0.1, # shear (slant) the image; the value is the shear angle in degrees
                               zoom_range=0.1, # zoom in or out by up to 10%
                               horizontal_flip=True, # allow horizontal flipping
                               fill_mode='nearest' # fill in missing pixels with the nearest filled value
                              )
test_datagen = ImageDataGenerator(rescale = 1./255)  # rescaling only, no augmentation on the test set

In [0]:
imread(pnu_patient).max() #why we need to rescale to [0,1]

In [0]:
batch_sz = 32

In [0]:
# we create the training and test set data with the transformations specified above
training_set = train_datagen.flow_from_directory(train_path,
                                                 target_size = image_shape[:2],
                                                 batch_size = batch_sz,
                                                 color_mode = 'grayscale',
                                                 class_mode = 'binary')
test_set = test_datagen.flow_from_directory(test_path,
                                            target_size = image_shape[:2],
                                            batch_size = batch_sz,
                                            color_mode = 'grayscale',
                                            class_mode = 'binary',
                                            shuffle=False)

# Training

In [0]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dropout, Flatten, Dense, Conv2D, MaxPooling2D

In [0]:
# CNN model from: 
# https://www.kaggle.com/sanwal092/intro-to-cnn-using-keras-to-predict-pneumonia

cnn = Sequential() # from now, cnn is the name of the model

#Convolution
cnn.add(Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 1)))

#Pooling
cnn.add(MaxPooling2D(pool_size = (2, 2)))

# 2nd Convolution
cnn.add(Conv2D(32, (3, 3), activation="relu"))

# 2nd Pooling layer
cnn.add(MaxPooling2D(pool_size = (2, 2)))

# Flatten the layer
cnn.add(Flatten())

# Fully Connected Layers
cnn.add(Dense(activation = 'relu', units = 128))
cnn.add(Dense(activation = 'sigmoid', units = 1))

# Compile the Neural network
cnn.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
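The layer sizes of this model can be checked by hand. The sketch below traces the feature-map side length and the parameter count, assuming the Keras defaults used above (3x3 kernels with 'valid' padding and stride 1, 2x2 pooling without overlap). The helper names are hypothetical, and `cnn.summary()` remains the authoritative check.

```python
def conv_out(size, kernel=3):
    """Output side length of a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output side length of non-overlapping max pooling (floor division)."""
    return size // pool

def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units

side = 64
side = conv_out(side)    # 62
side = pool_out(side)    # 31
side = conv_out(side)    # 29
side = pool_out(side)    # 14
flat = side * side * 32  # 6272 values after Flatten

total = (conv_params(3, 1, 32) + conv_params(3, 32, 32)
         + dense_params(flat, 128) + dense_params(128, 1))
print(flat, total)  # 6272 812641
```

Almost all of the parameters sit in the first Dense layer (6272 x 128 weights), which is why raising the input resolution gets expensive quickly.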

In [0]:
# configuring early stopping
from tensorflow.keras.callbacks import EarlyStopping 
early_stop = EarlyStopping(monitor='val_loss',patience=2)
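What `patience=2` means can be shown with a simplified pure-Python version of the rule: stop once the validation loss has failed to improve for two epochs in a row. This is a sketch of the idea only. It ignores options such as `min_delta` and `restore_best_weights`, `early_stop_epoch` is a hypothetical helper, and the loss values are made up.

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the 0-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

made_up_losses = [0.50, 0.40, 0.45, 0.42, 0.30]
print(early_stop_epoch(made_up_losses))  # 3
```

In this made-up run training stops at epoch index 3, so the better loss at index 4 is never reached. That is the trade-off of a small patience value.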

In [0]:
# actual training
# fit_generator is supported in TF 2.0.x, deprecated from TF 2.1.x (where fit accepts generators directly)
cnn_model = cnn.fit_generator(training_set,
                    epochs = 20,
                    verbose = 1,
                    validation_data = test_set, 
                    # steps_per_epoch = 326, # steps_per_epoch should ideally be no. total samples / batch size, 
                    # e.g. 5216 = 16*326 or 5216 = 32 * 163
                    callbacks=[early_stop])
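The steps_per_epoch arithmetic from the comments in the cell above, as a short sketch. Rounding up is an assumption made here so that a final partial batch still counts as a step.

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

print(steps_per_epoch(5216, 32))  # 163 steps for the training set
print(steps_per_epoch(5216, 16))  # 326 steps with a batch size of 16
print(steps_per_epoch(624, 32))   # 20 steps for the test set (last batch is partial)
```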

In [0]:
# saving the model
# note: only uncomment if training on your computer for the first time!

#from tensorflow.keras.models import load_model
#cnn.save('pneumonia.h5') 


# Evaluation

In [0]:
losses = pd.DataFrame(cnn.history.history)

In [0]:
losses

- loss is the training loss on training_set
- val_loss is the loss on test_set, which doubles as the validation data here

In [0]:
losses[['loss','val_loss']].plot()

In [0]:
cnn.metrics_names

In [0]:
cnn.evaluate_generator(test_set)

# Prediction matrix

In [0]:
pred_probabilities = cnn.predict_generator(test_set) # model assigns a probability for each image in test set
predictions = pred_probabilities > 0.5 # remember that 1 is pneu and 0 is norm

# Evaluation Statistics


## Classification Report
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

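Precision is the share of images predicted as a class that really belong to it; recall is the share of images of a class that the model finds. Higher is better for both, so read precision together with recall rather than on its own.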

In [0]:
from sklearn.metrics import classification_report,confusion_matrix

In [0]:
print(classification_report(test_set.classes,predictions))

## Confusion Matrix
https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/ 
- the larger the values on the main diagonal (correct predictions) relative to the off-diagonal values (mistakes), the better the classification

In [0]:
confusion_matrix(test_set.classes,predictions)
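To see where the numbers in the confusion matrix and the classification report come from, here is a NumPy sketch that builds a 2x2 matrix with rows as true classes and columns as predicted classes, the same layout scikit-learn uses. The helper names are hypothetical, and the labels and probabilities are made up for illustration.

```python
import numpy as np

def confusion_counts(y_true, probabilities, threshold=0.5):
    """2x2 matrix: rows are true classes, columns are predicted classes."""
    y_true = np.asarray(y_true).astype(int).ravel()
    y_pred = (np.asarray(probabilities).ravel() > threshold).astype(int)
    matrix = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix

def precision_recall(matrix):
    """Precision and recall for class 1 (PNEUMONIA in this notebook)."""
    tn, fp = matrix[0]
    fn, tp = matrix[1]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

made_up_labels = [0, 0, 1, 1, 1]
made_up_probs = [[0.2], [0.7], [0.9], [0.4], [0.8]]  # shaped like predict output, (N, 1)
matrix = confusion_counts(made_up_labels, made_up_probs)
print(matrix)
print(precision_recall(matrix))
```

In this toy case one normal image is flagged as pneumonia and one pneumonia image is missed, so both precision and recall for class 1 come out at 2/3.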

# Prediction

In [0]:
from random import randrange # we are going to randomly select from the pool of test images
from tensorflow.keras.preprocessing import image # to deal with images

**Predicting pneumonia patient**

In [0]:
pnu_files = os.listdir(test_path+'/PNEUMONIA')
pnu_patient2 = test_path+'/PNEUMONIA/'+pnu_files[randrange(len(pnu_files))] # randrange(n) picks from 0..n-1, so every file can be chosen

In [0]:
# convert image file to an array the model can read
my_image = image.load_img(pnu_patient2, color_mode = 'grayscale', target_size=image_shape[:2]) # pillow module; target_size is (height, width)
type(my_image) # PIL.Image.Image type

# change PIL.Image.Image type to numpy array of size (64, 64, 1)
my_image=image.img_to_array(my_image) # numpy.ndarray type
my_image = my_image/255 # apply the same rescale=1/255 that the generators use
my_image.shape # (64, 64, 1)

# change dim (64, 64, 1) to (1, 64, 64, 1)
my_image = np.expand_dims(my_image, axis=0) # convert to (1, 64, 64, 1)
my_image.shape # (1, 64, 64, 1)

# the shape comments above show how the array shape changes with each transform we apply
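The shape changes in the cell above can be reproduced with plain NumPy, which is handy when Keras is not at hand. The random array below is a stand-in for a loaded X-ray, not real data. Dividing by 255 mirrors the rescale=1/255 setting of the generators, so that a single image reaches the model on the same scale as the training batches.

```python
import numpy as np

# Stand-in for a 64x64 grayscale image with 8-bit pixel values.
fake_image = np.random.randint(0, 256, size=(64, 64)).astype("float32")

with_channel = fake_image[..., np.newaxis]    # (64, 64, 1): add the channel axis
batch = np.expand_dims(with_channel, axis=0)  # (1, 64, 64, 1): add the batch axis
batch = batch / 255.0                         # pixel values now lie in [0, 1]

print(fake_image.shape, with_channel.shape, batch.shape)
```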

In [0]:
cnn.predict(my_image)

**Predicting normal patient**

In [0]:
norm_files = os.listdir(test_path+'/NORMAL')
norm_patient2 = test_path+'/NORMAL/'+norm_files[randrange(len(norm_files))] # randrange(n) picks from 0..n-1, so every file can be chosen

In [0]:
my_image2 = image.load_img(norm_patient2, color_mode = 'grayscale', target_size=image_shape[:2]) # pillow module; target_size is (height, width)
my_image2=image.img_to_array(my_image2)/255 # numpy.ndarray type, rescaled like the generator output
my_image2 = np.expand_dims(my_image2, axis=0) # convert to (1, 64, 64, 1)

In [0]:
cnn.predict(my_image2)

In [0]:
training_set.class_indices

In [0]:
test_set.class_indices
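The class_indices mapping is what turns a predicted probability back into a label. The sketch below assumes the mapping is `{'NORMAL': 0, 'PNEUMONIA': 1}`, which follows from flow_from_directory ordering the class folders alphabetically and matches the "1 is pneu and 0 is norm" comment earlier. `label_for` is a hypothetical helper.

```python
# Assumed mapping; confirm it against the output of training_set.class_indices.
class_indices = {'NORMAL': 0, 'PNEUMONIA': 1}
index_to_class = {index: name for name, index in class_indices.items()}

def label_for(probability, threshold=0.5):
    """Turn a sigmoid output into a class name using the same 0.5 cut-off as above."""
    return index_to_class[int(probability > threshold)]

print(label_for(0.93), label_for(0.12))
```

With the model, this would be called as `label_for(float(cnn.predict(my_image)[0][0]))`, since predict returns an array of shape (1, 1) for a single image.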

# Thoughts and Feedback

## Thoughts
- the loss curves look bad
- I didn't understand why the images load with 3 colour channels when they are 8-bit black and white images
    - fixed: we must specify color_mode = 'grayscale' in these functions: 1. flow_from_directory (on train_datagen and test_datagen), 2. image.load_img

## Future actionables:
- Data related: 
    - include more X-ray data (combining data sets from the RSNA pneumonia detection challenge: https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data) 
    - split pneumonia data into bacterial and viral infection (more applicable to the current coronavirus situation); also, can we find coronavirus X-ray images?
    - can we reduce to 1 colour channel if it is indeed a black and white image?
        - yes, see the color_mode = 'grayscale' fix under Thoughts
    - how much can we increase the resolution in data processing (image_shape) before training becomes too intensive?
    - can we split into 3 targets - bacteria, virus, normal?

- Model related: 
    - study what others have done, especially the notebooks found via the link at the intro of this notebook
    - what are the key parameters to experiment on? can we find a systematic way to test most of the parameters, split the workload between those with desktop GPUs, and compare scores?
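For the bacteria / virus / normal split suggested under the data-related items, the file names may already carry the label: the pneumonia example used earlier is named `person1000_bacteria_2931.jpeg`. The sketch below assumes that viral cases contain `virus` in the file name in the same way, which should be checked against the actual folder before relying on it. `subtype_from_filename` is a hypothetical helper.

```python
import os

def subtype_from_filename(filename, parent_folder):
    """Guess one of 'normal', 'bacteria', 'virus', or 'unknown' from folder and file name."""
    if parent_folder == "NORMAL":
        return "normal"
    name = os.path.basename(filename).lower()
    if "bacteria" in name:
        return "bacteria"
    if "virus" in name:  # assumption: viral files follow the same naming pattern
        return "virus"
    return "unknown"

print(subtype_from_filename("person1000_bacteria_2931.jpeg", "PNEUMONIA"))
```

Files that end up as 'unknown' would need a manual look before a 3-class model is trained on them.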
    
    
## Suggestions and Ideas
- Autoencoders
    - Mark has an idea to use autoencoders to check the "cloudiness" of the lungs, since a visual comparison of the images shows this feature. Even if we can't, maybe we can leverage this in our feature engineering and tease it out during our convolution/pooling steps
    
- application
    - can we build a website that takes an X-ray image and runs it through our model, giving a prediction the healthcare industry could use? 