<a href="https://colab.research.google.com/github/google/applied-machine-learning-intensive/blob/master/content/05_deep_learning/05_image_classification_project/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2020 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Image Classification Project

In this project we will build an image classification model and use the model to identify if the lungs pictured indicate that the patient has pneumonia. The outcome of the model will be true or false for each image.

The [data is hosted on Kaggle](https://www.kaggle.com/rob717/pneumonia-dataset) and consists of 5,863 x-ray images. Each image is classified as 'pneumonia' or 'normal'.

## Ethical Considerations

We will frame the problem as:

> *A hospital is having issues correctly diagnosing patients with pneumonia. Their current solution is to have two trained technicians examine every patient scan. Unfortunately, there are many times when two technicians are not available, and the scans have to wait for multiple days to be interpreted.*
>
> *They hope to fix this issue by creating a model that can identify if a patient has pneumonia. They will have one technician and the model both examine the scans and make a prediction. If the two agree, then the diagnosis is accepted. If the two disagree, then a second technician is brought in to provide their analysis and break the tie.*
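
The review process described above can be written down as a small decision rule. This is only a sketch of the scenario as stated: the function name `final_diagnosis` and the idea of passing the second technician as a callable are illustrative assumptions, not part of any hospital system.

```python
from typing import Callable


def final_diagnosis(technician: bool,
                    model: bool,
                    second_technician: Callable[[], bool]) -> bool:
    """Return the accepted diagnosis (True means pneumonia).

    The second technician is passed as a callable so that they are
    only consulted when the first technician and the model disagree.
    """
    if technician == model:
        # Both reviewers agree, so the diagnosis is accepted as-is.
        return technician
    # Disagreement: the second technician breaks the tie.
    return second_technician()


# Hypothetical scans: the lambda stands in for the second technician.
print(final_diagnosis(True, True, lambda: False))   # agreement
print(final_diagnosis(True, False, lambda: False))  # tie broken
```

Note that in this scheme the model's vote carries the same weight as a technician's, which is one of the points worth discussing below.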

Discuss some of the ethical considerations of building and using this model. 

* Consider potential bias in the data that we have been provided. 
* Should this model err toward precision or accuracy?
* What are the implications of massively over-classifying patients as having pneumonia?
* What are the implications of massively under-classifying patients as having pneumonia?
* Are there any concerns with having only one technician make the initial call?

The questions above are prompts. Feel free to bring in other considerations you might have.

### **Student Solution**

> **Question 1:** One potential bias in the data we have been provided is that some of the images may not be clear enough for our model to determine whether the patient's lungs are infected. During the exploratory data analysis, we compared a normal x-ray from the training folder with an infected x-ray from the same folder. Without the titles, we could not tell the two images apart. We then changed the code to plot an image at random, and even then we still could not tell the difference. If human eyes cannot spot the difference, the model may also struggle when the image is not clear enough. Of course, nobody on the team has a medical degree or training in distinguishing an infected lung from a normal lung. With that said, one solution would be to make sure that each x-ray image is clear, and to use the model alongside a doctor to give patients the correct diagnosis.

> **Question 2:** Accuracy measures how often the model is correct overall, whereas precision focuses on the true and false positives. Since this model will be used to determine whether or not a patient has pneumonia, I would prefer the model to err toward precision over accuracy. I find it less important that the model is right overall and more important that each patient who has pneumonia is accounted for. Strictly speaking, catching every patient who has pneumonia is what recall measures (true positives divided by all actual positives), so recall should be tracked alongside precision.
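
To make the difference between these scores concrete, here is a minimal sketch that computes accuracy, precision, and recall from confusion-matrix counts. The counts are made up for illustration and are not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: of the scans flagged as pneumonia, how many really were?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the patients with pneumonia, how many were flagged?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts for 100 scans, 40 of which show pneumonia.
acc, prec, rec = classification_metrics(tp=30, fp=5, fn=10, tn=55)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

In this made-up case the model looks good on accuracy and precision while still missing a quarter of the pneumonia patients, which only recall reveals.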

> **Question 3:** The implications of massively over-classifying patients as having pneumonia are that people who don't actually have pneumonia will have to undergo unnecessary treatment and hospital visits. A more extreme implication is that the model and doctor diagnose a patient with pneumonia when the patient actually has a more serious and potentially fatal condition, such as tuberculosis or lung cancer, which also affects the lungs.

> **Question 4:** The implication of massively under-classifying patients as having pneumonia is that patients will have pneumonia without knowing it. This could lead to patients not getting the necessary treatment, which could lead to more serious health concerns.

> **Question 5:** Having only one technician make the initial call is very concerning. There should be more than one opinion, because leaning on the model or on that one technician alone could lead to the misdiagnoses discussed above.

> Another consideration would be to use a multiclass model. The more information the model has the better, and a model that could also predict other lung diseases, like lung cancer or tuberculosis, might end up saving lives. Imagine going to the hospital and being told, based on the model, that you don't have pneumonia, and later finding out that you actually have lung cancer. A multiclass model could help avoid that mistake.








---

## Modeling

In this section of the lab, you will build, train, test, and validate a model or models. The data is the ["Detecting Pneumonia" dataset](https://www.kaggle.com/rob717/pneumonia-dataset). You will build a binary classifier that determines if an x-ray image has pneumonia or not.

You'll need to:

* Download the dataset
* Perform EDA on the dataset
* Build a model that can classify the data
* Train the model using the training portion of the dataset. (It is already split out.)
* Test at least three different models or model configurations using the testing portion of the dataset. This step can include changing model types, adding and removing layers or nodes from a neural network, or any other parameter tuning that you find potentially useful. Score the model (using accuracy, precision, recall, F1, or some other relevant score(s)) for each configuration.
* After finding the "best" model and parameters, use the validation portion of the dataset to perform one final sanity check by scoring the model once more with the hold-out data.
* If you train a neural network (or other model that you can get epoch-per-epoch performance), graph that performance over each epoch.

Explain your work!

> *Note: You'll likely want to [enable GPU in this lab](https://colab.research.google.com/notebooks/gpu.ipynb) if it is not already enabled.*

If you get to a working solution you're happy with and want another challenge, you'll find pre-trained models on the [landing page of the dataset](https://www.kaggle.com/paultimothymooney/detecting-pneumonia-in-x-ray-images). Try to load one of those and see how it compares to your best model.

Use as many text and code cells as you need to for your solution.

### **Student Solution**

#### Imports

In [None]:
# Imports 
from google.colab import files
import zipfile


import pandas as pd
import os
import tensorflow as tf
import matplotlib.pyplot as plt
import cv2
import seaborn as sns
import numpy as np
from PIL import Image


In [None]:
# Keras Libraries
import keras
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense
from keras.preprocessing.image import ImageDataGenerator, load_img
from sklearn.metrics import classification_report, confusion_matrix

#### Downloading Data Using Kaggle

In [None]:

!pip install -U -q kaggle
!mkdir  /root/.kaggle

# Upload the kaggle.json API token
files.upload()
!cp kaggle.json /root/.kaggle

# Downloading the Chest X-Ray Pneumonia file from Kaggle
!kaggle datasets download -d paultimothymooney/chest-xray-pneumonia

In [None]:
# This code unzips the file and then removes the zip from this notebook's files
!apt install pv
!unzip -o /content/chest-xray-pneumonia.zip  | pv -l >/dev/null
os.remove('chest-xray-pneumonia.zip')

In [None]:
# Setting Chest X-Ray file to Data 
data_dir  = '/content/chest_xray/'

# Setting our training data
train_dir = data_dir+'train/'

# Setting our testing data
test_dir  = data_dir+'test/'

# Setting our validation data
val_dir   = data_dir + 'val/'



# Setting the Normal files in the training data to Normal cases
normal_cases_dir = train_dir + 'NORMAL/'

# Setting the Pneumonia files in the training data to Pneumonia cases
pneumonia_cases_dir = train_dir + 'PNEUMONIA/'

# This prints the list in each directory 
print("Datasets:\t",os.listdir(data_dir))
print("Train:\t", os.listdir(train_dir))
print("Test:\t", os.listdir(test_dir))
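
Listing the directories only shows the class folder names. A minimal sketch for counting how many images each class folder holds is shown here. The helper name `count_images` is an assumption for illustration, and the path in the usage lines is the `train_dir` value defined above.

```python
import os


def count_images(split_dir):
    """Return {class_name: number_of_files} for one split directory."""
    counts = {}
    for class_name in sorted(os.listdir(split_dir)):
        class_path = os.path.join(split_dir, class_name)
        # Skip stray files; only class subdirectories are counted.
        if os.path.isdir(class_path):
            counts[class_name] = len(os.listdir(class_path))
    return counts


# Only runs where the dataset has been downloaded and unzipped.
example_dir = '/content/chest_xray/train/'
if os.path.isdir(example_dir):
    print(count_images(example_dir))
```

Running the same helper on the test and validation directories gives a quick check of how balanced each split is.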


#### Exploratory Data Analysis

In [None]:
# Side length (in pixels) that images could be resized to.
# Note: the data generators further down resize images to 64x64 instead.
image_size = 150 

# Number of files in training set
nb_train_samples = 5216 

# Number of files in test set
num_of_test_samples = 624 

# The model will take batches of 16 files at a time during training
batch_size = 16 

# Number of epochs (1 epoch = one full pass over the training set)
EPOCHS = 6 

# Steps per epoch: 5216 // 16 = 326 batches to cover the training set once
STEPS = nb_train_samples // batch_size 
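
The step counts follow directly from the sample counts and the batch size. The sketch below works through the arithmetic for the values used in this notebook; the helper `steps_per_epoch` and its `drop_last` flag are illustrative assumptions, not a Keras API.

```python
import math


def steps_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of batches needed to pass over num_samples once."""
    if drop_last:
        # Ignore the final, partially filled batch.
        return num_samples // batch_size
    # Count the final partial batch as one more step.
    return math.ceil(num_samples / batch_size)


print(steps_per_epoch(5216, 16))                  # 326
print(steps_per_epoch(5216, 32))                  # 163
print(steps_per_epoch(624, 32))                   # 20
print(steps_per_epoch(624, 32, drop_last=True))   # 19
```

Floor division and rounding up only differ when the batch size does not divide the sample count evenly, as with the 624 test images at a batch size of 32.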

In [None]:
# Plots pie chart of the number of images in the normal and pneumonia cases
# (count the files in each folder, not the characters in the folder path)
x = (len(os.listdir(normal_cases_dir)), len(os.listdir(pneumonia_cases_dir)))
labels = ['NORMAL', 'PNEUMONIA']
color = ['green', 'yellow']
plt.pie(x, labels = labels, colors = color, autopct = '%.0f%%', radius = 1.5, textprops = {'fontsize':16})
plt.show()


In [None]:
# Normal picture: pick a random file from the NORMAL folder
normal_files = os.listdir(normal_cases_dir)
print(len(normal_files))
rand_norm = np.random.randint(0, len(normal_files))
norm_pic = normal_files[rand_norm]
print('Normal Picture Title: ', norm_pic)

norm_pic_address = normal_cases_dir + norm_pic

# Pneumonia picture: pick a random file from the PNEUMONIA folder
pneumonia_files = os.listdir(pneumonia_cases_dir)
rand_p = np.random.randint(0, len(pneumonia_files))

sic_pic = pneumonia_files[rand_p]
sic_address = pneumonia_cases_dir + sic_pic
print('Pneumonia Picture Title:', sic_pic)


In [None]:
# Load the images
norm_load = Image.open(norm_pic_address)
sic_load = Image.open(sic_address)

# Plots the two images side by side (gray colormap for single-channel x-rays)
f = plt.figure(figsize= (10,6))
a1 = f.add_subplot(1,2,1)
a1.imshow(norm_load, cmap='gray')
a1.set_title('Normal')

a2 = f.add_subplot(1, 2, 2)
a2.imshow(sic_load, cmap='gray')
a2.set_title('Pneumonia')
plt.show()



> Looking at the two images used for the exploratory data analysis, it is hard to note the difference between the two x-rays, and if the titles weren't shown we could potentially misdiagnose a patient.



#### CNN Model

In [None]:
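# We decided to use a convolutional neural network (CNN) model
cnn = Sequential()

# First convolution block: 32 filters of size 3x3 on 64x64 RGB inputs
cnn.add(Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)))
cnn.add(MaxPooling2D(pool_size = (2, 2)))

# Second convolution block
cnn.add(Conv2D(32, (3, 3), activation="relu"))
cnn.add(MaxPooling2D(pool_size = (2, 2)))

# Flatten the feature maps into a vector for the dense layers
cnn.add(Flatten())

# Fully connected layer, then a single sigmoid unit for the binary output
cnn.add(Dense(activation = 'relu', units = 128))
cnn.add(Dense(activation = 'sigmoid', units = 1))

cnn.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])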

In [None]:
num_of_test_samples = 600
batch_size = 32

In [None]:
train_datagen = ImageDataGenerator(rescale = 1./255,
                                   shear_range = 0.2,
                                   zoom_range = 0.2,
                                   horizontal_flip = True)

test_datagen = ImageDataGenerator(rescale = 1./255)  

training_set = train_datagen.flow_from_directory(train_dir,
                                                 target_size = (64, 64),
                                                 batch_size = 32,
                                                 class_mode = 'binary')

validation_generator = test_datagen.flow_from_directory(val_dir,
    target_size=(64, 64),
    batch_size=32,
    class_mode='binary')

test_set = test_datagen.flow_from_directory(test_dir,
                                            target_size = (64, 64),
                                            batch_size = 32,
                                            class_mode = 'binary')
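
With `class_mode = 'binary'`, `flow_from_directory` infers the labels from the subdirectory names in alphanumeric order. The sketch below mimics that ordering in plain Python to show which class ends up as the positive label; the helper `class_indices` is a stand-in written for illustration. The mapping Keras actually used can be read from the generator's `class_indices` attribute, for example `training_set.class_indices`.

```python
def class_indices(class_names):
    """Map class folder names to integer labels in sorted order."""
    return {name: index for index, name in enumerate(sorted(class_names))}


# Folder names as they appear in the train, test, and val directories.
mapping = class_indices(['PNEUMONIA', 'NORMAL'])
print(mapping)
```

Under this ordering NORMAL is label 0 and PNEUMONIA is label 1, so the sigmoid output can be read as the predicted probability of pneumonia.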

In [None]:
#Summary of model
cnn.summary()
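
The shapes and parameter counts reported by the summary can be checked by hand. This is a sketch of that arithmetic, assuming the Keras defaults used by the model above: 'valid' padding and stride 1 for `Conv2D`, and non-overlapping 2x2 windows for `MaxPooling2D`. The helper names are illustrative.

```python
def conv_out(size, kernel=3):
    """Output side length of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Output side length of a non-overlapping max-pooling layer."""
    return size // pool


size, channels, filters = 64, 3, 32
params = {}

# Conv2D: (kernel_h * kernel_w * in_channels + 1 bias) per filter.
params['conv1'] = (3 * 3 * channels + 1) * filters
size = pool_out(conv_out(size))       # 64 -> 62 -> 31
params['conv2'] = (3 * 3 * filters + 1) * filters
size = pool_out(conv_out(size))       # 31 -> 29 -> 14

flat = size * size * filters          # length of the flattened vector
params['dense1'] = (flat + 1) * 128   # weights plus one bias per unit
params['dense2'] = (128 + 1) * 1

print(flat, params, sum(params.values()))
```

Almost all of the parameters sit in the first dense layer, which is typical when a large feature map is flattened directly.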


In [None]:
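# Steps are counted in batches, so use the number of batches in each generator.
# (In recent TensorFlow/Keras versions, fit() accepts generators directly
# and fit_generator is deprecated.)
cnn_model = cnn.fit_generator(training_set,
                         steps_per_epoch = len(training_set),
                         epochs = 5,
                         validation_data = validation_generator,
                         validation_steps = len(validation_generator))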

In [None]:
# It seems like we got an accuracy of about 93%, but let's test further.
# steps counts batches, not images, so use the number of batches in test_set.
test_accu = cnn.evaluate_generator(test_set, steps=len(test_set))

In [None]:
print('The testing accuracy is :',test_accu[1]*100, '%')


In [None]:
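# Predict one pass over the test set (steps = number of batches)
Y_pred = cnn.predict_generator(test_set, len(test_set))
# The model has a single sigmoid output, so argmax over axis 1 would always be 0.
# Threshold the probability at 0.5 instead.
y_pred = (Y_pred > 0.5).astype(int).ravel()
# Note: test_set shuffles by default, so recreate it with shuffle=False
# before comparing y_pred against test_set.classes.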
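
For a model with a single sigmoid output, class labels come from thresholding the probability rather than from an argmax. The sketch below uses made-up probabilities and labels (not outputs of this model) and plain Python lists to show the difference and to tally a confusion matrix by hand; all helper names are illustrative.

```python
def argmax_labels(probabilities):
    """Index of the largest value in each row (what np.argmax(axis=1) does)."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


def threshold_labels(probabilities, threshold=0.5):
    """Label 1 when the single sigmoid output exceeds the threshold."""
    return [int(row[0] > threshold) for row in probabilities]


def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tn = pairs.count((0, 0))
    fp = pairs.count((0, 1))
    fn = pairs.count((1, 0))
    tp = pairs.count((1, 1))
    return tn, fp, fn, tp


# Hypothetical sigmoid outputs with shape (5, 1) and their true labels.
probs = [[0.10], [0.92], [0.35], [0.78], [0.51]]
y_true = [0, 1, 1, 1, 0]

print(argmax_labels(probs))      # every row has one column, so always 0
print(threshold_labels(probs))
print(confusion_counts(y_true, threshold_labels(probs)))
```

With real predictions, the same counts can be obtained from `confusion_matrix` in `sklearn.metrics`, which is already imported at the top of this notebook.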

In [None]:
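# The history key is 'accuracy' in recent Keras versions and 'acc' in older ones
acc_key = 'accuracy' if 'accuracy' in cnn_model.history else 'acc'
plt.plot(cnn_model.history[acc_key])
plt.plot(cnn_model.history['val_' + acc_key])
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Training set', 'Validation set'], loc='upper left')
plt.show()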

In [None]:
# Plot training loss first so the curves match the legend order
plt.plot(cnn_model.history['loss'])
plt.plot(cnn_model.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Training set', 'Validation set'], loc='upper left')
plt.show()
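
Besides plotting, the history dictionary can be inspected directly to find the epoch where validation loss was lowest. The numbers below are invented for illustration and are not results from this notebook; `best_epoch` is a hypothetical helper.

```python
def best_epoch(history, key='val_loss'):
    """Return (1-based epoch number, value) where history[key] is lowest."""
    values = history[key]
    index = min(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


# Made-up history in the same shape as cnn_model.history.
example_history = {
    'loss':     [0.45, 0.30, 0.24, 0.20, 0.17],
    'val_loss': [0.60, 0.48, 0.52, 0.55, 0.63],
}

epoch, value = best_epoch(example_history)
print(f"lowest val_loss {value} at epoch {epoch}")
```

A training loss that keeps falling while validation loss rises after an early minimum, as in this made-up example, is a common sign of overfitting.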

---