# Building a CNN from Scratch - Lab

## Introduction

Now that you have background knowledge of how CNNs work and how to implement them in Keras, it's time to practice those skills more independently by building a CNN on your own to solve an image recognition problem. In this lab, you'll practice building an image classifier from start to finish using a CNN.

## Objectives

You will be able to:
* Transform images into tensors
* Build a CNN model for image recognition

## Loading the Images

The data for this lab concerns classifying lung X-ray images for pneumonia. The original dataset is from Kaggle. We have downsampled this dataset in order to reduce training time when you design and fit your model to the data. We anticipate that this process will take approximately 1 hour to run on a standard machine, although times will vary depending on your particular computer and setup. At the end of this lab, you are welcome to try training on the complete dataset and observe the impact on the model's overall accuracy.

You can find the initial downsampled dataset in a subdirectory of this repository, **chest_xray_downsampled** (the path used in the code below).

In [4]:
#Your code here; load the images; be sure to also preprocess these into tensors.
import os, shutil

new_dir = 'chest_xray_downsampled/'

train_folder = os.path.join(new_dir, 'train')
train_yes = os.path.join(train_folder, 'PNEUMONIA')
train_no = os.path.join(train_folder,'NORMAL' )

test_folder = os.path.join(new_dir, 'test')
test_yes = os.path.join(test_folder, 'PNEUMONIA')
test_no= os.path.join(test_folder, 'NORMAL')

val_folder = os.path.join(new_dir, 'val')
val_yes = os.path.join(val_folder, 'PNEUMONIA')
val_no = os.path.join(val_folder, 'NORMAL')

In [5]:
train_yes

'chest_xray_downsampled/train/PNEUMONIA'

In [16]:
print('There are', len(os.listdir(train_yes)), 'positive images in the training set')

There are 1291 positive images in the training set


In [18]:
print('There are', len(os.listdir(train_no)), 'negative images in the training set')

There are 447 negative images in the training set


In [20]:
print('There are', len(os.listdir(val_yes)), 'positive images in the validation set')

There are 2 positive images in the validation set


In [21]:
print('There are', len(os.listdir(val_no)), 'negative images in the validation set')

There are 2 negative images in the validation set


In [22]:
print('There are', len(os.listdir(test_yes)), 'positive images in the test set')

There are 130 positive images in the test set


In [23]:
print('There are', len(os.listdir(test_no)), 'negative images in the test set')

There are 78 negative images in the test set
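
The six counts above can also be gathered in one pass, which makes the class imbalance in the training set (1291 positive versus 447 negative images) easier to see at a glance. This is a minimal sketch: the helper name `count_images` is hypothetical, and it assumes the `train`/`val`/`test` layout with `NORMAL` and `PNEUMONIA` subfolders used earlier.

```python
import os

def count_images(base_dir, splits=('train', 'val', 'test'),
                 classes=('NORMAL', 'PNEUMONIA')):
    """Return {split: {class_name: number_of_entries}} for the folder layout used above."""
    counts = {}
    for split in splits:
        counts[split] = {}
        for class_name in classes:
            folder = os.path.join(base_dir, split, class_name)
            # A missing folder is reported as zero instead of raising an error.
            n_entries = len(os.listdir(folder)) if os.path.isdir(folder) else 0
            counts[split][class_name] = n_entries
    return counts

# Assumes the dataset folder sits next to the notebook.
for split, per_class in count_images('chest_xray_downsampled').items():
    print(split, per_class)
```

Like the cells above, this counts every directory entry, so hidden files (if any) would be included.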


In [24]:
import time
import matplotlib.pyplot as plt
import scipy
import numpy as np
from PIL import Image
from scipy import ndimage
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img

np.random.seed(123)

In [25]:
# get all the data in the directory test, and reshape them
test_generator = ImageDataGenerator(rescale=1./255).flow_from_directory(
        test_folder, 
        target_size=(64, 64), batch_size = 208) 

# get all the data in the directory validation, and reshape them
val_generator = ImageDataGenerator(rescale=1./255).flow_from_directory(
        val_folder, 
        target_size=(64, 64), batch_size = 4)

# get all the data in the directory train (1738 images), and reshape them
train_generator = ImageDataGenerator(rescale=1./255).flow_from_directory(
        train_folder, 
        target_size=(64, 64), batch_size=1738)

Found 208 images belonging to 2 classes.
Found 4 images belonging to 2 classes.
Found 1738 images belonging to 2 classes.


In [26]:
train_images, train_labels = next(train_generator)
test_images, test_labels = next(test_generator)
val_images, val_labels = next(val_generator)

In [27]:
# Explore your dataset again
m_train = train_images.shape[0]
num_px = train_images.shape[1]
m_test = test_images.shape[0]
m_val = val_images.shape[0]

print ("Number of training samples: " + str(m_train))
print ("Number of testing samples: " + str(m_test))
print ("Number of validation samples: " + str(m_val))
print ("train_images shape: " + str(train_images.shape))
print ("train_labels shape: " + str(train_labels.shape))
print ("test_images shape: " + str(test_images.shape))
print ("test_labels shape: " + str(test_labels.shape))
print ("val_images shape: " + str(val_images.shape))
print ("val_labels shape: " + str(val_labels.shape))

Number of training samples: 1738
Number of testing samples: 208
Number of validation samples: 4
train_images shape: (1738, 64, 64, 3)
train_labels shape: (1738, 2)
test_images shape: (208, 64, 64, 3)
test_labels shape: (208, 2)
val_images shape: (4, 64, 64, 3)
val_labels shape: (4, 2)
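
To make the tensor shapes above concrete, the following sketch converts a single image into a `(64, 64, 3)` array by hand, roughly mirroring what `target_size=(64, 64)` and `rescale=1./255` do inside the generators. It uses a randomly generated stand-in image rather than a real X-ray, the helper name `image_to_tensor` is hypothetical, and the resizing interpolation may differ from the one the generator uses.

```python
import numpy as np
from PIL import Image

def image_to_tensor(img, target_size=(64, 64)):
    """Resize a PIL image, force 3 color channels, and scale pixels to [0, 1]."""
    img = img.convert('RGB').resize(target_size)
    return np.asarray(img, dtype='float32') / 255.0

# Stand-in for an X-ray: a random 8-bit grayscale image (assumption, not lab data).
rng = np.random.default_rng(123)
fake_xray = Image.fromarray(rng.integers(0, 256, size=(300, 400), dtype=np.uint8))

tensor = image_to_tensor(fake_xray)
batch = np.expand_dims(tensor, axis=0)  # add the leading batch dimension

print(tensor.shape)  # (64, 64, 3)
print(batch.shape)   # (1, 64, 64, 3)
```

Stacking 1738 such tensors along the first axis gives the `(1738, 64, 64, 3)` shape reported for `train_images`.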


In [28]:
train_img = train_images.reshape(train_images.shape[0], -1)
test_img = test_images.reshape(test_images.shape[0], -1)
val_img = val_images.reshape(val_images.shape[0], -1)

print(train_img.shape)
print(test_img.shape)
print(val_img.shape)

(1738, 12288)
(208, 12288)
(4, 12288)


In [None]:
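# Keep column 0 of the one-hot labels as a column vector; -1 infers the row count.
# Check train_generator.class_indices to see which class column 0 flags.
train_y = np.reshape(train_labels[:,0], (-1,1))
test_y = np.reshape(test_labels[:,0], (-1,1))
val_y = np.reshape(val_labels[:,0], (-1,1))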

## Designing the Model

Now it's time to design your CNN! Remember a few things when doing this: 
* You should alternate convolutional and pooling layers
* Later layers should have a larger number of parameters (for example, more filters) in order to detect more abstract patterns
* Add some final dense layers to add a classifier to the convolutional base
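
The three guidelines above can be sketched as follows. This is one possible architecture, not the reference solution: the filter counts (32, 64, 128), the 64-unit dense layer, the `adam` optimizer, and the function name `build_cnn` are all assumptions chosen for illustration. The single sigmoid output is meant to match the column-vector labels (`train_y`) created earlier.

```python
from keras import Input, layers, models

def build_cnn(input_shape=(64, 64, 3)):
    """Small CNN: alternating conv/pooling blocks, then a dense classifier."""
    model = models.Sequential([
        Input(shape=input_shape),
        # Convolutional base: filter counts grow with depth.
        layers.Conv2D(32, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        # Classifier on top of the convolutional base.
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_cnn()
model.summary()
```

With 3x3 unpadded convolutions and 2x2 pooling, the spatial size shrinks from 64 to 62, 31, 29, 14, 12, and finally 6, so the flattened vector has 6 * 6 * 128 = 4608 entries before the dense layers.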

In [None]:
#Your code here; design and compile the model

## Training and Evaluating the Model

Remember that training deep networks is resource intensive: depending on the size of the data, even a CNN with 3-4 successive convolutional and pooling layers is apt to take hours to train on a high-end laptop. Using 30 epochs and 8 layers (alternating between convolutional and pooling), our model took about 40 minutes to run on a year-old MacBook Pro.


If you are concerned with runtime, you may want to set your model to run the training epochs overnight.  

**If you are going to run this process overnight, be sure to also script code for the following questions concerning data augmentation. Check your code twice (or more), then run all cells (or do something equivalent) so that everything trains overnight.**

In [None]:
#Set the model to train; see warnings above
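
A training call could look like the sketch below. To keep it quick and self-contained, it fits a tiny stand-in model on random arrays; in the lab you would instead pass your own compiled CNN together with `train_images`, `train_y`, `val_images`, and `val_y`. The helper name `train_model` and the batch size are assumptions.

```python
import numpy as np
from keras import Input, layers, models

def train_model(model, train_x, train_y, val_x, val_y, epochs=30, batch_size=32):
    """Fit a compiled Keras model and return the History object."""
    return model.fit(train_x, train_y,
                     epochs=epochs,
                     batch_size=batch_size,
                     validation_data=(val_x, val_y))

# Tiny stand-in model and random data (assumption, used only so this runs fast).
demo_model = models.Sequential([
    Input(shape=(64, 64, 3)),
    layers.Conv2D(4, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(1, activation='sigmoid'),
])
demo_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

rng = np.random.default_rng(123)
demo_x = rng.random((16, 64, 64, 3)).astype('float32')
demo_y = rng.integers(0, 2, size=(16, 1)).astype('float32')

history = train_model(demo_model, demo_x[:12], demo_y[:12],
                      demo_x[12:], demo_y[12:], epochs=2, batch_size=4)
print(sorted(history.history.keys()))
```

Keep in mind that the validation folder in this lab holds only 4 images, so validation metrics computed on it will be very noisy.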

In [None]:
# Plot history
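
One way to plot the training history is sketched below. It assumes a dictionary shaped like `history.history` from Keras, where each validation metric shares the name of its training metric with a `val_` prefix. The function name `plot_history` and the example numbers are made up for illustration.

```python
import matplotlib.pyplot as plt

def plot_history(history_dict):
    """Plot training and validation curves from a dict like History.history."""
    metrics = [name for name in history_dict if not name.startswith('val_')]
    fig, axes = plt.subplots(1, len(metrics), figsize=(6 * len(metrics), 4),
                             squeeze=False)
    for ax, name in zip(axes[0], metrics):
        epochs = range(1, len(history_dict[name]) + 1)
        ax.plot(epochs, history_dict[name], label='training ' + name)
        if 'val_' + name in history_dict:
            ax.plot(epochs, history_dict['val_' + name], label='validation ' + name)
        ax.set_xlabel('epoch')
        ax.set_ylabel(name)
        ax.legend()
    return fig

# Made-up numbers purely to show the call; in the lab, pass history.history.
example = {'loss': [0.7, 0.5, 0.4], 'val_loss': [0.8, 0.6, 0.55],
           'accuracy': [0.6, 0.75, 0.8], 'val_accuracy': [0.5, 0.75, 0.75]}
fig = plot_history(example)
```

A widening gap between the training and validation curves is a common sign of overfitting.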

## Save the Model

In [None]:
#Your code here; save the model for future reference.
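
Saving and reloading can be sketched as below. The model here is a small stand-in and the filename `chest_xray_cnn.keras` is hypothetical; in the lab you would save your trained CNN. Which file extension is appropriate depends on your Keras version (older releases commonly used `.h5`), so treat the extension as an assumption.

```python
import os
import tempfile
from keras import Input, layers, models

# Small stand-in model (assumption); in the lab, save your trained CNN instead.
model = models.Sequential([
    Input(shape=(64, 64, 3)),
    layers.Flatten(),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# Written to a temporary folder here; pick a permanent location for real work.
save_path = os.path.join(tempfile.mkdtemp(), 'chest_xray_cnn.keras')
model.save(save_path)

# Reload later without retraining.
restored = models.load_model(save_path)
print(restored.output_shape)
```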

## Data Augmentation

Recall that data augmentation is almost always a necessary step when using a small dataset such as the one you have been provided. As such, if you haven't already, implement a data augmentation setup.

**Warning: This process took nearly 4 hours to run on a relatively new MacBook Pro. As such, it is recommended that you simply code the setup and compare it to the solution branch, or set the process to run overnight if you do choose to actually run the code.**

In [None]:
#Add data augmentation to the model setup and set the model to train; 
#See warnings above if you intend to run this block of code
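
To illustrate what augmentation does to the image tensors, here is a NumPy-only sketch that applies a random shift and an optional horizontal flip to each image in a batch. It is not the lab's solution: the `ImageDataGenerator` imported earlier accepts arguments such as `rotation_range`, `width_shift_range`, `height_shift_range`, `zoom_range`, and `horizontal_flip` for this purpose. The function name `augment_batch` and its defaults are assumptions, and whether mirroring is appropriate for chest X-rays is a judgment call.

```python
import numpy as np

def augment_batch(images, max_shift=4, flip_prob=0.5, rng=None):
    """Return randomly shifted (and sometimes mirrored) copies of a batch.

    images: array of shape (n, height, width, channels).
    """
    rng = np.random.default_rng() if rng is None else rng
    augmented = np.empty_like(images)
    for i, img in enumerate(images):
        # Random translation: zero-pad the borders, then crop a shifted window.
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        padded = np.pad(img, ((max_shift, max_shift), (max_shift, max_shift), (0, 0)))
        top, left = max_shift + dy, max_shift + dx
        shifted = padded[top:top + img.shape[0], left:left + img.shape[1]]
        # Random horizontal flip (illustrative only).
        if rng.random() < flip_prob:
            shifted = shifted[:, ::-1]
        augmented[i] = shifted
    return augmented

rng = np.random.default_rng(123)
batch = rng.random((8, 64, 64, 3)).astype('float32')
new_batch = augment_batch(batch, rng=rng)
print(new_batch.shape)  # (8, 64, 64, 3)
```

Because each epoch would see slightly different versions of the same images, the model has a harder time memorizing individual training examples.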

## Final Evaluation

Now use the test set to perform a final evaluation on your model of choice.

In [None]:
# Your code here; perform a final evaluation using the test set.
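
In Keras, `model.evaluate(test_images, test_y)` returns the loss and any compiled metrics. Because the classes are imbalanced, it can also help to look beyond accuracy. The sketch below computes accuracy, precision, and recall from predicted probabilities with NumPy alone. The function name `binary_report`, the 0.5 threshold, and the toy numbers are assumptions; in the lab, `y_prob` would come from `model.predict(test_images)` and `y_true` from `test_y`.

```python
import numpy as np

def binary_report(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, and recall for binary labels and predicted probabilities."""
    y_true = np.asarray(y_true).reshape(-1).astype(int)
    y_pred = (np.asarray(y_prob).reshape(-1) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return {
        'accuracy': (tp + tn) / len(y_true),
        # Guard against division by zero when a class is never predicted or present.
        'precision': tp / (tp + fp) if (tp + fp) else 0.0,
        'recall': tp / (tp + fn) if (tp + fn) else 0.0,
        'confusion': {'tp': tp, 'tn': tn, 'fp': fp, 'fn': fn},
    }

# Toy values for illustration only.
y_true = [1, 1, 1, 0, 0, 1]
y_prob = [0.9, 0.8, 0.3, 0.2, 0.6, 0.7]
report = binary_report(y_true, y_prob)
print(report)
```

Which class counts as "positive" here depends on which label column you kept when building `train_y`, `test_y`, and `val_y`.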

## Extension: Adding Data to the Model

As discussed, the dataset we worked with is a subset of a dataset hosted on Kaggle. Increasing the amount of data used to train the model will likely yield additional performance gains, but it will also lengthen training and be more resource intensive.

It is estimated that training on the full dataset will take approximately 4 hours (and potentially significantly longer) depending on your computer's specifications.

In order to test the impact of training on the full dataset, start by downloading the data from Kaggle here: https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia.

In [None]:
#Optional extension; Your code here

## Summary

Well done! In this lab, you practiced building your own CNN for image recognition, which drastically outperformed our previous attempts using a standard deep learning model alone. In the upcoming sections, we'll continue to investigate further techniques associated with CNNs, including visualizing the representations they learn and techniques to further bolster their performance when we have limited training data, such as here.