# Introduction to Deep Learning Week 3
This notebook was forked from Mattison Hineline and Edited by Joseph Hardin
###  Brief description of the problem and data (5 pts) 

In week 3 of Introduction to Deep Learning, we were asked to use a Convolutional Neural Network on the Cancer Detection challenge provided by Kaggle.  The goal of this challenge is to *"identify metastatic cancer in small image patches taken from larger digital pathology scans"*.  The benefits are clear.  Pathologists are highly trained individuals and, like all people, are prone to mistakes.  An algorithm that can be used in tandem with a pathologist, or even instead of one in extremely specific circumstances (e.g. low-impact undergraduate research projects), could increase throughput and decrease the cost of diagnosing metastatic cancer.




In [None]:
#import libraries

#general libraries 
import numpy as np 
import pandas as pd 
import os
import random
from sklearn.utils import shuffle
import shutil

#visualizations
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import matplotlib.patches as patches

# work with images
from skimage.transform import rotate
from skimage import io
import cv2 as cv

# model development
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import RandomFlip, RandomZoom, RandomRotation
from tensorflow.keras.layers import Conv2D, MaxPooling2D, AveragePooling2D
from tensorflow.keras.layers import Dense, Flatten, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.optimizers import Adam

import warnings
warnings.simplefilter("ignore", category=DeprecationWarning)

In [None]:
tf.__version__

In [None]:
#get files
test_path = '../input/histopathologic-cancer-detection/test/'
train_path = '../input/histopathologic-cancer-detection/train/'
sample_submission = pd.read_csv('../input/histopathologic-cancer-detection/sample_submission.csv')
train_data = pd.read_csv('../input/histopathologic-cancer-detection/train_labels.csv')

In [None]:
# declare constants for reproducibility
RANDOM_STATE = 369

# 1. Brief Description of the Problem and Data

Each file in the data is a .tif image.  The images are 96x96x3 (height x width x channels), with each pixel value in the range 0-255.  This presents a challenge, as reading .tif files is less standard in the Keras framework than reading .png or .jpeg files.  Our training dataset consists of 220,025 images labeled either 1 or 0, where 1 means the image contains cancerous tissue.
The data is already split into a training set and a test set; the training set contains 220,025 unique images and the test set contains about 57,500. To use these images in a machine learning model, we are also given a dataframe with two columns: 'id', the unique image ID corresponding to a file in the training directory, and 'label', the classification category. Each label is either 0 (non-cancerous) or 1 (cancerous). According to the competition description, an image is labeled 1 if the center 32x32 pixel region contains at least one pixel of tumor tissue; otherwise it is labeled 0. It is important to note that we do not have any missing values in this data, which makes preprocessing simpler.

In [None]:
# have a look at the format of the data
train_data.head()

In [None]:
# take a look at the data further
train_data.describe()

In [None]:
# check information, data types, and for missing data
train_data.info()

# 2. Exploratory Data Analysis (EDA): Inspect, Visualize and Clean the Data (15 pts)



As can be seen below, the data is roughly balanced and there do not appear to be any missing labels.  

In [None]:
#create histogram
print(pd.DataFrame(data={'Label Counts': train_data['label'].value_counts()}))
sns.countplot(x=train_data['label'], palette='colorblind').set(title='Label Counts Histogram');

In [None]:
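#create pie chart
# take values and names from the same value_counts() result so each slice gets the right label
label_counts = train_data['label'].value_counts()
fig = px.pie(values = label_counts.values, 
             names = label_counts.index)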
fig.update_layout(
    title={
        'text': "Label Percentage Pie Chart",
        'y':.99,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()

Below we can see some example images from the training data. As someone who is not familiar with looking at images of cancer cells, it would be very challenging for me to classify these, so the correct label for each image is shown below it. Additionally, we were told that the label depends on the center 32x32 pixel region of each 96x96 image, so we have drawn a box around this area as a focal point. 

In [None]:
#visualize a few images
fig, ax = plt.subplots(5, 5, figsize=(15, 15))
for i, axis in enumerate(ax.flat):
    file = str(train_path + train_data.id[i] + '.tif')
    image = io.imread(file)
    axis.imshow(image)
    box = patches.Rectangle((32,32),32,32, linewidth=2, edgecolor='r',facecolor='none', linestyle='-')
    axis.add_patch(box)
    axis.set(xticks=[], yticks=[], xlabel = train_data.label[i]);
    #cv2.waitKey(0)
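The red box drawn above starts at pixel (32, 32) because a 32x32 region centered in a 96x96 image leaves a 32-pixel margin on every side. The short sketch below works through that arithmetic and shows how the same region could be cropped. The helper names `center_box` and `crop_center` are hypothetical and are not used elsewhere in this notebook.

```python
def center_box(image_size=96, region_size=32):
    """Return (left, top, right, bottom) of a square region centered in a square image."""
    offset = (image_size - region_size) // 2
    return offset, offset, offset + region_size, offset + region_size


def crop_center(image, region_size=32):
    """Crop the centered square region from an image stored as a list of rows.

    With a NumPy array the equivalent slice would be image[top:bottom, left:right].
    """
    left, top, right, bottom = center_box(len(image), region_size)
    return [row[left:right] for row in image[top:bottom]]


print(center_box())  # (32, 32, 64, 64)

# toy 96x96 single-channel "image" where each pixel stores its own (row, col)
toy_image = [[(r, c) for c in range(96)] for r in range(96)]
patch = crop_center(toy_image)
print(len(patch), len(patch[0]))  # 32 32
```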

**Preprocessing Techniques**


When working with training data, first we want to handle any missing data. From the source description and our findings above, we can see that we have no missing data in the training set. Next, we are told that the part of the image which is going to be of interest is the center pixels (red box above). Right now the images are 96x96x3 and we know that the "3" is the RGB channel. For now, we will leave this as well. Lastly, there are many augmentations you can perform on images to produce better results and facilitate model learning, but to start we will not touch these either for simplicity. 

What we will do with the image data is **shuffle** the data so that the model doesn't learn based on the image ordering/pattern of input, which could potentially have consequences in the model training. We will also **split** the data into training and validation set to improve model development. During training we will also **normalize** the pixels by dividing by 255.0, which should help data processing and model training.
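To make the three steps concrete, here is a minimal pure-Python sketch of shuffling, splitting, and normalizing. It only illustrates the idea; the notebook itself relies on `sklearn.utils.shuffle` and `ImageDataGenerator` for this work. The function names and the toy inputs are assumptions made for the example.

```python
import random


def shuffle_and_split(ids, validation_fraction=0.15, seed=369):
    """Shuffle ids reproducibly, then hold out a fraction for validation."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * validation_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


def normalize_pixels(pixels):
    """Scale 0-255 pixel values into the range 0-1."""
    return [p / 255.0 for p in pixels]


# toy example: 100 hypothetical image ids
train_ids, valid_ids = shuffle_and_split(range(100))
print(len(train_ids), len(valid_ids))  # 85 15
print(normalize_pixels([0, 51, 255]))  # [0.0, 0.2, 1.0]
```

Fixing the seed means the same images land in the validation set on every run, which keeps results comparable between models.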

# 3. Model Architecture (25 pts)

For this model we will be using Keras' library to run a convolutional neural network (CNN). The first model we will run without tuning any hyperparameters within the model and use that as our baseline. Then, we will run a second model, tuning hyperparameters such as learning rate, batch normalization, regularization, filter size, stride, activation layers, etc. 

Our CNN model will have a network such that there are two convolutional layers then a MaxPool layer, and we repeat this *n* number of times. Specifically, we will create a fairly simple model with two (n=2) of these clusters. In other words, our model will be input --> Conv2D --> Conv2D --> MaxPool --> Conv2D --> Conv2D --> MaxPool --> Flatten --> Output with sigmoid activation.
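It can help to trace how the spatial size of the image shrinks through such a stack. The sketch below assumes 64x64 inputs (the `target_size` used by the generators later in this notebook), 3x3 kernels, stride 1, and no padding, which are the Keras `Conv2D` defaults apart from the kernel size. The helper names are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Output width/height of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, pool=2):
    """Output width/height of non-overlapping pooling with no padding."""
    return size // pool


def trace_sizes(input_size=64, clusters=2):
    """Sizes after each layer of (Conv2D -> Conv2D -> Pool) repeated `clusters` times."""
    sizes = [input_size]
    size = input_size
    for _ in range(clusters):
        for _ in range(2):          # two convolutional layers per cluster
            size = conv_out(size)
            sizes.append(size)
        size = pool_out(size)       # one pooling layer per cluster
        sizes.append(size)
    return sizes


print(trace_sizes())  # [64, 62, 60, 30, 28, 26, 13]
```

Under these assumptions the Flatten layer receives a 13x13 feature map per filter, so the number of inputs to the output layer is 13 * 13 times the number of filters in the last convolution.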

**First model:**
1. Normalize images pre-training (image/255)
2. Output layer activation (sigmoid)

**Second model contains all the first model parameters, but we also add:**
1. Dropout (0.1)
2. Batch Normalization
3. Optimization (Adam)
4. Learning rate (0.0001) 
5. Hidden layer activations (ReLU)

Before we start with the models, let's describe what each of these parameters does. In both models we will use 1 and 2; numbers 3 to 6 will be used in the second and final models. 
1. **Normalize images**: each pixel value is divided by 255 so that the inputs lie between 0 and 1.
2. **Output layer activation**: we will use a sigmoid activation function on the output layer since we are working with binary labels.
3. **Dropout**: we will set dropout at 0.1, which randomly sets 10% of a layer's activations to 0 during each training step. This regularizes the model because it cannot rely on any single unit. 
4. **Optimization**: we will use adaptive moment estimation (Adam), which keeps running averages of the gradient (momentum) and of the squared gradient to scale each parameter's update. 
5. **Learning rate**: we will set our learning rate to 0.0001. A small learning rate means smaller update steps, so the optimizer is less likely to overstep the (hopefully) global minimum, at the cost of slower training. 
6. **Hidden layer activations**: we will use the rectified linear unit (ReLU) as our hidden layer activation function, which helps the model converge, avoids saturation for positive inputs, and is cheap to compute. 
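The items above can be illustrated with a few lines of plain Python. This is a simplified sketch of the underlying math on scalars and lists, not how Keras implements these operations internally. The Adam defaults shown (`beta1=0.9`, `beta2=0.999`, `eps=1e-7`) follow the Keras defaults, and the function names are hypothetical.

```python
import math
import random


def relu(x):
    """Rectified linear unit: pass positive values, clamp negatives to 0."""
    return max(0.0, x)


def sigmoid(x):
    """Squash any real number into (0, 1) so it can be read as a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def dropout(activations, rate=0.1, rng=None):
    """Inverted dropout: zero each activation with probability `rate`
    and rescale the survivors by 1 / (1 - rate)."""
    rng = rng if rng is not None else random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


def adam_step(w, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a single weight; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of the gradient
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


print(relu(-2.0), relu(3.0))   # 0.0 3.0
print(sigmoid(0.0))            # 0.5
w, m, v = adam_step(w=1.0, grad=2.0, m=0.0, v=0.0, t=1)
print(w)                       # roughly 1.0 - 0.0001
```

On the very first step the Adam update has a magnitude close to the learning rate regardless of the gradient's size, which is one reason a small value such as 0.0001 keeps early updates gentle.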

In addition, we will use a fairly large **batch size** of 256 to help reduce the variance of the gradient estimates. Finally, we will train our two models for 10 **epochs**. We will use accuracy and the area under the ROC curve (ROC-AUC) to measure model performance, with binary cross-entropy as our loss function.
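For reference, the loss and the two metrics can be written out in a few lines. The sketch below is a pure-Python illustration on a made-up toy example. The pairwise AUC computation is exact but quadratic in the number of samples, so it is only meant for building intuition; `tf.keras.metrics.AUC` instead approximates the same quantity using a fixed set of thresholds.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)


def roc_auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative
    (ties count as half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# made-up labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(binary_cross_entropy(y_true, y_prob))
print(accuracy(y_true, y_prob))  # 0.75
print(roc_auc(y_true, y_prob))   # 0.75
```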

In [None]:
# set model constants
BATCH_SIZE = 256

In [None]:
# prepare data for training
def append_tif(string):
    return string+".tif"

train_data["id"] = train_data["id"].apply(append_tif)
train_data['label'] = train_data['label'].astype(str)

# randomly shuffle training data
train_data = shuffle(train_data, random_state=RANDOM_STATE)

In [None]:
# modify training data by normalizing it 
# and split data into training and validation sets
datagen = ImageDataGenerator(rescale=1./255.,
                            validation_split=0.15)

In [None]:
# generate training data
train_generator = datagen.flow_from_dataframe(
    dataframe=train_data,
    directory=train_path,
    x_col="id",
    y_col="label",
    subset="training",
    batch_size=BATCH_SIZE,
    seed=RANDOM_STATE,
    class_mode="binary",
    target_size=(64,64))        # original image = (96, 96) 

In [None]:
# generate validation data
valid_generator = datagen.flow_from_dataframe(
    dataframe=train_data,
    directory=train_path,
    x_col="id",
    y_col="label",
    subset="validation",
    batch_size=BATCH_SIZE,
    seed=RANDOM_STATE,
    class_mode="binary",
    target_size=(64,64))       # original image = (96, 96) 

In [None]:
# Setup GPU accelerator - configure Strategy. Assume TPU...if not set default for GPU/CPU
tpu = None
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
except ValueError:
    strategy = tf.distribute.get_strategy()

**Start First model building and training**

Here we build our first model. This model is fairly simple, with a handful of layers and no activation functions except on the last layer. In the last layer we use a sigmoid activation function since our labels are binary (0 or 1). We can see below how the model is built and how many parameters we will need to train for this model. In the next cell below we can see the model training per epoch. 

Now that we have trained the model, we can take a look at how it did graphically with the training data. Below we can see the accuracy and loss for the validation and training sets. 

**Start second model building and training**

Our second model incorporates hyperparameter tuning. We can see we are using ReLU activation functions for hidden layers, dropout between clusters, and batch normalization. In addition, we have added another hidden layer before the flatten and output layers. Lastly, we are using Adam optimizer with a low learning rate. 

In [None]:
# build second model like first but with hyperparameters and optimizer(s)
with strategy.scope():
    
    # create the AUC metric inside the strategy scope, alongside the model
    ROC_2 = tf.keras.metrics.AUC()
    
    #create model
    model_f = Sequential()
    
    model_f.add(Conv2D(filters=16, kernel_size=(3,3), activation='relu'))
    model_f.add(Conv2D(filters=16, kernel_size=(3,3), activation='relu'))
    model_f.add(MaxPooling2D(pool_size=(2,2)))
    model_f.add(Dropout(0.1))
    
    model_f.add(BatchNormalization())
    model_f.add(Conv2D(filters=32, kernel_size=(3,3), activation='relu'))
    model_f.add(Conv2D(filters=32, kernel_size=(3,3), activation='relu'))
    model_f.add(AveragePooling2D(pool_size=(2,2)))
    model_f.add(Dropout(0.1))
    
    model_f.add(BatchNormalization())
    model_f.add(Conv2D(filters=32, kernel_size=(3,3), activation='relu'))
    model_f.add(Flatten())
    model_f.add(Dense(1, activation='sigmoid'))
    
    #build model by input size; None leaves the batch dimension flexible
    model_f.build(input_shape=(None, 64, 64, 3))       # original image = (96, 96, 3) 
    
    #compile
    adam_optimizer = Adam(learning_rate=0.0001)
    model_f.compile(loss='binary_crossentropy', metrics=['accuracy', ROC_2], optimizer=adam_optimizer)

#quick look at model
model_f.summary()

In [None]:
EPOCHS = 10

# train model (Model.fit accepts generators; fit_generator is deprecated)
Final_Model = model_f.fit(
                        train_generator,
                        epochs = EPOCHS,
                        validation_data = valid_generator)

In [None]:
# graph accuracy, then loss
plt.plot(np.arange(1,len(Final_Model.history['accuracy']) + 1), Final_Model.history['accuracy'])
plt.plot(np.arange(1,len(Final_Model.history['val_accuracy']) + 1),Final_Model.history['val_accuracy'])
plt.title('Final Model Accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validate'], loc='upper left')
plt.show();

plt.plot(np.arange(1,len(Final_Model.history['loss']) + 1),Final_Model.history['loss'])
plt.plot(np.arange(1,len(Final_Model.history['val_loss']) + 1),Final_Model.history['val_loss'])
plt.title('Final Model Loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validate'], loc='upper left')
plt.show();



In [None]:
#double check what you're aiming the submission data set to look like
sample_submission.head()

In [None]:
#create a dataframe to run the predictions
test_df = pd.DataFrame({'id':os.listdir(test_path)})
test_df.head()

In [None]:
# prepare test data (in same way as train data)
datagen_test = ImageDataGenerator(rescale=1./255.)

test_generator = datagen_test.flow_from_dataframe(
    dataframe=test_df,
    directory=test_path,
    x_col='id', 
    y_col=None,
    target_size=(64,64),         # original image = (96, 96) 
    batch_size=1,
    shuffle=False,
    class_mode=None)

In [None]:
#run model to find predictions

#save the model
model_f.save('../output/kaggle/working/Classif_Model')

# reload the saved model (keras is accessed through the tf namespace imported above)
model_f = tf.keras.models.load_model('../output/kaggle/working/Classif_Model')

# Check model
predictions = model_f.predict(test_generator, verbose=1)

#predictions = model_two.predict(test_generator, verbose=1)


In [None]:
#create submission dataframe
predictions = np.transpose(predictions)[0]
submission_df = pd.DataFrame()
submission_df['id'] = test_df['id'].apply(lambda x: x.split('.')[0])
submission_df['label'] = list(map(lambda x: 0 if x < 0.5 else 1, predictions))
submission_df.head()
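The mapping from file names and predicted probabilities to submission rows can be sketched without pandas as follows. The file names and probabilities here are made up for illustration, and `to_submission_rows` is a hypothetical helper that mirrors the 0.5 threshold used above.

```python
import os


def to_submission_rows(filenames, probabilities, threshold=0.5):
    """Strip the file extension to get the id and threshold the probability into a 0/1 label."""
    rows = []
    for name, prob in zip(filenames, probabilities):
        image_id = os.path.splitext(name)[0]
        rows.append({"id": image_id, "label": int(prob >= threshold)})
    return rows


# hypothetical file names and model outputs
rows = to_submission_rows(["abc123.tif", "def456.tif"], [0.12, 0.93])
print(rows)  # [{'id': 'abc123', 'label': 0}, {'id': 'def456', 'label': 1}]
```

If the competition is scored on ROC-AUC, as I believe it is, submitting the raw probabilities instead of thresholded labels may preserve more ranking information; that would be worth checking against the competition's evaluation page.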

In [None]:
#view test prediction counts
submission_df['label'].value_counts()

In [None]:
#plot test predictions
sns.countplot(data=submission_df, x='label').set(title='Predicted Labels for Test Set');

In [None]:
#convert to csv to submit to competition
submission_df.to_csv('submission.csv', index=False)

# 4. Results and Analysis

We can see the benefits of increasing epochs and parameter tuning when comparing model 1 to model 2.  Each epoch took approximately 5 minutes in model 1 except for the first epoch.  The first epoch took about 4x longer, likely because Keras executes lazily and builds the model used in all epochs during the first epoch rather than when the model strategy is laid out. One thing that was unclear to me is what happened with the final model.  The intention was to cut model 2 down to only 8 epochs, because that epoch had the highest validation accuracy and it was close to the training accuracy.  The separation of the validation and training metrics seen in epochs 9 and 10 made me slightly worried about overfitting.  However, by reusing the same model I apparently just continued training from the existing weights, increasing the number of epochs for model 2 from 10 to 18.  This link goes into more detail: https://github.com/keras-team/keras/issues/4446  
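The behavior described above, where a second call to fit continues from the current weights instead of starting over, can be reproduced with a toy stand-in. `TinyModel` below is a hypothetical one-weight model, not Keras code; it only mimics the fact that fitting does not reinitialize weights.

```python
class TinyModel:
    """One-weight 'model' trained by gradient descent on the loss (w - 3)^2."""

    def __init__(self):
        self.w = 0.0            # fresh weights only exist right after construction
        self.epochs_seen = 0

    def fit(self, epochs, lr=0.1):
        for _ in range(epochs):
            self.w -= lr * 2 * (self.w - 3.0)   # gradient step
            self.epochs_seen += 1
        return self


# reusing the same object: the second fit resumes where the first stopped
model = TinyModel()
model.fit(epochs=10)
model.fit(epochs=8)
print(model.epochs_seen)  # 18

# rebuilding the model first gives a genuine 8-epoch run
fresh = TinyModel().fit(epochs=8)
print(fresh.epochs_seen)  # 8
```

The practical takeaway is that to train for exactly 8 epochs, the model would need to be rebuilt (and recompiled) before calling fit again.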

# 5. Conclusion

Our first model was simple with no hyperparameter tuning. The second model incorporated much more tuning, had a lower learning rate, and ran for 2.5x more epochs. I would like the number of epochs to be a dependent variable based on the change in loss between epochs.  However, due to time constraints, simply plotting and eyeballing had to suffice.  
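One common way to let the change in loss decide the number of epochs is early stopping. The sketch below shows the logic in plain Python on made-up validation losses; the function name and the numbers are assumptions for illustration and were not produced by the models in this notebook.

```python
def early_stopping_epoch(val_losses, min_delta=0.0, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# made-up validation losses, for illustration only
losses = [0.50, 0.42, 0.40, 0.41, 0.43, 0.39]
print(early_stopping_epoch(losses))  # (5, 3)
```

In Keras, similar behavior is available through the `tf.keras.callbacks.EarlyStopping` callback, which takes `monitor`, `min_delta`, `patience`, and `restore_best_weights` arguments and can be passed to `fit` via `callbacks`.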