# Introduction 

This project follows the approach presented in the deeplearning.ai Coursera specialization. It takes an iterative approach, starting with a simple model. That first model is not necessarily meant to be the final model for this problem. Instead, it creates a starting point from which the model can be iteratively altered and (hopefully) improved.


## The Goal 

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
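
To make the metric concrete, here is a minimal sketch of what the area under the ROC curve measures: the probability that a randomly chosen tumor image receives a higher predicted probability than a randomly chosen non-tumor image, with ties counted as half. The function name `pairwise_auc` and the toy labels and scores are assumptions made for illustration. In practice, `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently.

```python
def pairwise_auc(y_true, y_score):
    """Area under the ROC curve, computed as the fraction of (positive, negative)
    pairs that the scores rank correctly. Ties count as half a correct pair."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical toy data: 1 = tumor, 0 = no tumor.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, y_score)
print(auc)
```

Because only the ranking of the scores matters, the predicted probabilities should be submitted as they are and not thresholded to 0/1.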



## The Data
"In this dataset, you are provided with a large number of small pathology images to classify. Files are named with an image id. The train_labels.csv file provides the ground truth for the images in the train folder. You are predicting the labels for the images in the test folder. A positive label indicates that the center 32x32px region of a patch contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the patch does not influence the label. This outer region is provided to enable fully-convolutional models that do not use zero-padding, to ensure consistent behavior when applied to a whole-slide image." 

The data used in this project can be found here: https://www.kaggle.com/c/histopathologic-cancer-detection/notebooks 
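
As a rough illustration of the labelling rule quoted above, the sketch below extracts the central 32x32 region from a 96x96 patch (the patch size used later in this notebook). The helper name `center_region` and the zero-filled stand-in image are assumptions. The models in this notebook are trained on the full patch, not on this crop.

```python
import numpy as np

PATCH_SIZE = 96   # full patch edge length in pixels
CENTER_SIZE = 32  # edge length of the labelled center region


def center_region(patch, size=CENTER_SIZE):
    """Return the central size x size window of an (H, W, C) image array."""
    height, width = patch.shape[:2]
    top = (height - size) // 2
    left = (width - size) // 2
    return patch[top:top + size, left:left + size]


# Stand-in for a real image: a zero-filled 96x96 RGB array.
patch = np.zeros((PATCH_SIZE, PATCH_SIZE, 3), dtype=np.uint8)
center = center_region(patch)
print(center.shape)  # (32, 32, 3)
```

For a 96x96 patch the window covers rows and columns 32 to 63, so tumor pixels anywhere outside that window do not affect the label.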



## Acknowledgements
This project was done after finishing the (excellent) specialization course by deeplearning.ai on Coursera. The project therefore follows an approach similar to the one Andrew Ng takes in that course.


https://www.kaggle.com/vbookshelf/cnn-how-to-use-160-000-images-without-crashing








# Importing Libraries and Files

In [None]:
from numpy.random import seed
seed(101)
from tensorflow import set_random_seed
set_random_seed(101)

import pandas as pd
import numpy as np

from glob import glob

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import Dense, Dropout, Flatten, Activation
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.optimizers import Adam

import os
import cv2

from sklearn.utils import shuffle
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import itertools
import shutil
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

### What files are available?

In [None]:
os.listdir('../input')

### Labels as per csv file

0 = no tumor tissue<br>
1 = has tumor tissue<br>


### How many images are in each folder?

In [None]:

print(len(os.listdir('../input/train')))
print(len(os.listdir('../input/test')))

### Create a Dataframe containing all images

In [None]:
df_data = pd.read_csv('../input/train_labels.csv')
print(df_data.shape)

In [None]:
df_data.head()

# EDA

## Check the class distribution


Remember that the label '0' denotes non-tumor and the label '1' denotes tumor.

In [None]:
df_data['label'].value_counts()

In [None]:
sns.set(style="whitegrid")
sns.countplot(x='label', data=df_data)


## Display a random sample of train images  by class

In [None]:
# source: https://www.kaggle.com/gpreda/honey-bee-subspecies-classification

def draw_category_images(col_name,figure_cols, df, IMAGE_PATH):
    
    """
    Given a column in a dataframe,
    this function takes a sample of each class and displays that
    sample on one row. The sample size is the same as figure_cols which
    is the number of columns in the figure.
    Because this function takes a random sample, each time the function is run it
    displays different images.
    """
    

    categories = (df.groupby([col_name])[col_name].nunique()).index
    f, ax = plt.subplots(nrows=len(categories),ncols=figure_cols, 
                         figsize=(4*figure_cols,4*len(categories))) # adjust size here
    # draw a random sample of images for each category
    for i, cat in enumerate(categories):
        sample = df[df[col_name]==cat].sample(figure_cols) # figure_cols is also the sample size
        for j in range(0,figure_cols):
            file=IMAGE_PATH + sample.iloc[j]['id'] + '.tif'
            # cv2.imread returns BGR; convert to RGB so matplotlib shows the true colors
            im=cv2.cvtColor(cv2.imread(file), cv2.COLOR_BGR2RGB)
            ax[i, j].imshow(im, resample=True, cmap='gray')
            ax[i, j].set_title(cat, fontsize=16)  
    plt.tight_layout()
    plt.show()
    

In [None]:
IMAGE_PATH = '../input/train/' 

num_pictures = 8

draw_category_images('label',num_pictures, df_data, IMAGE_PATH)

As a non-cancer researcher, I cannot tell which structures in the images are tumor cells. Larger cells appear to be one factor that distinguishes tumor from non-tumor images, and perhaps color is another. Some images with a lot of white area also seem to be labeled as tumor.

# Data Split 

We will downsample (see below) so that we have approximately 175k images. A validation set of 10 % then contains about 17.5k images, which should be enough, at least for now, to give a decent spread of images for detecting overfitting on the training data. It also leaves more data for training (~160k images).

https://cs230.stanford.edu/blog/split/
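
The split arithmetic can be checked with a few lines. This is only a sketch: the total of 175,000 is the approximate figure quoted above, not the exact dataset size, and the batch size of 32 is the value used for the generators later in the notebook.

```python
import math

total_images = 175_000  # approximate size after downsampling, not the exact count
val_fraction = 0.10
batch_size = 32         # batch size used for the generators later on

num_val = round(total_images * val_fraction)
num_train = total_images - num_val
# One epoch must cover every training image, so the last partial batch counts too.
steps_per_epoch = math.ceil(num_train / batch_size)

print(num_train, num_val, steps_per_epoch)
```

With these approximate numbers, about 157.5k images remain for training, which is where the rounded figure of ~160k comes from.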

## Create the Train and Validation/development Sets

We will downsample the number of non-tumor images to equal the number of tumor images. This is done for two reasons. First, more images require more computation and therefore more computational power. Second, an unbalanced dataset can give the impression that a model is good when it is not, for example when accuracy is high only because the model favors the majority class.

The downside is, of course, that we lose data.
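
The second reason can be illustrated with a small sketch. The 60/40 class ratio below is a hypothetical example, not the actual ratio of this dataset, and `accuracy` is a helper written only for this illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced label set: 60 % non-tumor (0), 40 % tumor (1).
y_true = [0] * 600 + [1] * 400
always_negative = [0] * len(y_true)  # a "model" that never predicts tumor

acc = accuracy(y_true, always_negative)
tumors_found = sum(t == 1 and p == 1 for t, p in zip(y_true, always_negative))

# After balancing the classes, the same trivial model drops to chance level.
y_balanced = [0] * 400 + [1] * 400
acc_balanced = accuracy(y_balanced, [0] * len(y_balanced))

print(acc, tumors_found, acc_balanced)
```

On the imbalanced labels the trivial model looks better than chance although it detects no tumors at all. On the balanced labels its accuracy is exactly 0.5, so any accuracy above that reflects something the model has actually learned.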

In [None]:

IMAGE_SIZE = 96 # the images are 96 x 96.
IMAGE_CHANNELS = 3 # RGB 

SAMPLE_SIZE = df_data['label'].value_counts().min()  # the number of images we use from each of the two classes. In this case we downsample the number of non tumors. 


### Downsample

In [None]:
# take a random sample of class 0 with size equal to the number of samples in class 1
df_0 = df_data[df_data['label'] == 0].sample(SAMPLE_SIZE, random_state = 101)
# sampling SAMPLE_SIZE rows of class 1 keeps every tumor image (it is the minority class)
df_1 = df_data[df_data['label'] == 1].sample(SAMPLE_SIZE, random_state = 101)

# concat the dataframes
df_data = pd.concat([df_0, df_1], axis=0).reset_index(drop=True)
# shuffle
df_data = shuffle(df_data, random_state=101)  # fixed seed keeps the shuffle reproducible

df_data['label'].value_counts()

### Train / validation split 

We split so that the validation data contains equal numbers of positive and negative samples.

In [None]:
# train_test_split

# stratify=y creates a balanced validation set.
y = df_data['label']

df_train, df_val = train_test_split(df_data, test_size=0.10, random_state=101, stratify=y)

print(df_train.shape)
print(df_val.shape)

In [None]:

print('Percentage of images labeled as "0" in the training set:', round(df_train['label'].value_counts()[0] / df_train.shape[0],2))
print('Percentage of images labeled as "1" in the training set:', round(df_train['label'].value_counts()[1] / df_train.shape[0],2))
print('--'*30)
print('Percentage of images labeled as "0" in the validation set:',round(df_val['label'].value_counts()[0] / df_val.shape[0],2))
print('Percentage of images labeled as "1" in the validation set:',round(df_val['label'].value_counts()[1] / df_val.shape[0],2))


# Create directories

In [None]:

# Create directories
train_path = 'base_dir/train'
valid_path = 'base_dir/valid'
test_path = '../input/test'
for fold in [train_path, valid_path]:
    for subf in ["0", "1"]:
        # exist_ok=True lets this cell be re-run without raising FileExistsError
        os.makedirs(os.path.join(fold, subf), exist_ok=True)

In [None]:
# Set the id as the index in df_data
df_data.set_index('id', inplace=True)
df_data.head()

In [None]:
for image in df_train['id'].values:
    # the id in the csv file does not have the .tif extension therefore we add it here
    fname = image + '.tif' 
    label = str(df_data.loc[image,'label']) # get the label for a certain image
    src = os.path.join('../input/train', fname)
    dst = os.path.join(train_path, label, fname)
    shutil.copyfile(src, dst)

for image in df_val['id'].values:
    fname = image + '.tif'
    label = str(df_data.loc[image,'label']) # get the label for a certain image
    src = os.path.join('../input/train', fname)
    dst = os.path.join(valid_path, label, fname)
    shutil.copyfile(src, dst)

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

IMAGE_SIZE = 96
num_train_samples = len(df_train)
num_val_samples = len(df_val)
train_batch_size = 32
val_batch_size = 32

train_steps = np.ceil(num_train_samples / train_batch_size)
val_steps = np.ceil(num_val_samples / val_batch_size)

datagen = ImageDataGenerator(preprocessing_function=lambda x:(x - x.mean()) / x.std() if x.std() > 0 else x,
                            horizontal_flip=True,
                            vertical_flip=True)

train_gen = datagen.flow_from_directory(train_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=train_batch_size,
                                        class_mode='binary')

val_gen = datagen.flow_from_directory(valid_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=val_batch_size,
                                        class_mode='binary')

# Unshuffled generator over the validation images: shuffle=False keeps the file order,
# so predictions can be matched to their labels
test_gen = datagen.flow_from_directory(valid_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=1,
                                        class_mode='binary',
                                        shuffle=False)
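
The `preprocessing_function` passed to `ImageDataGenerator` above standardizes each image on its own. The sketch below spells out the same logic as a named function on a NumPy array. The name `standardize` and the random stand-in image are assumptions made for illustration.

```python
import numpy as np


def standardize(image):
    """Shift an image to zero mean and scale it to unit standard deviation.
    Constant images (std == 0) are returned unchanged to avoid dividing by zero."""
    std = image.std()
    if std > 0:
        return (image - image.mean()) / std
    return image


rng = np.random.default_rng(101)
image = rng.uniform(0, 255, size=(96, 96, 3))  # random stand-in for a real patch
scaled = standardize(image)
print(abs(float(scaled.mean())) < 1e-9, abs(float(scaled.std()) - 1.0) < 1e-9)

flat = np.full((96, 96, 3), 128.0)  # constant image: no contrast to normalize
print(np.array_equal(standardize(flat), flat))
```

Note that the statistics are computed per image across all three channels, so differences in overall brightness between patches are removed before the image reaches the network.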

# Model


This is an iterative process. As it would be too demanding to write out every iteration, the text below describes the process I followed. Human-level error is used here as an estimate of the Bayes error, which in turn is estimated by the best results on Kaggle. Preferably, the iterative process would have been done, at least in part, in code. However, due to the lack of computational resources, I will try to improve the model manually.


**Improving your model performance**
* The two fundamental assumptions of supervised learning:
    - You can fit the training set pretty well. This is roughly saying that you can achieve low avoidable bias.
    - The training set performance generalizes pretty well to the dev/test set. This is roughly saying that variance is not too bad.
* To improve your deep learning supervised system follow these guidelines:
    - Look at the difference between human level error and the training error - avoidable bias.
    - Look at the difference between the dev/test set and training set error - Variance.
    - If avoidable bias is large you have these options:
        1. Train bigger model.
        2. Train longer/better optimization algorithm (like Momentum, RMSprop, Adam).
        3. Find better NN architecture/hyperparameters search.
    - If variance is large you have these options:
        1. Get more training data.
        2. Regularization (L2, Dropout, data augmentation).
        3. Find better NN architecture/hyperparameters search.
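
The guidelines above can be sketched as a small helper. The training and validation accuracies are taken from the AlexNet training log reported in this notebook. The human-level error of 2 % is a hypothetical placeholder, and `diagnose` is a name made up for this illustration.

```python
def diagnose(human_error, train_error, dev_error):
    """Split the dev error into avoidable bias and variance and name the larger one."""
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus


# Accuracies from the AlexNet log in this notebook (acc: 0.9554, val_acc: 0.9245).
train_error = 1 - 0.9554
dev_error = 1 - 0.9245
# Hypothetical human-level error, used as a stand-in for the Bayes error.
bias, variance, focus = diagnose(human_error=0.02,
                                 train_error=train_error,
                                 dev_error=dev_error)
print(f"avoidable bias: {bias:.4f}, variance: {variance:.4f}, focus on: {focus}")
```

Under that assumed human-level error the variance gap is the larger of the two, which would point towards more data, regularization, or augmentation before a bigger model.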
        
        
        
        
        
        
Furthermore, we are going to use transfer learning. Transfer learning lets us use pre-trained models that have been shown to be successful in similar tasks. This is especially beneficial given the time and computational constraints.

In [None]:
saved_models = {}

# Model 1 - AlexNet

Result after 10 epochs with batch_size = 10:

16041/16041 [==============================] - 730s 45ms/step - loss: 0.1169 - acc: 0.9554 - val_loss: 0.2073 - val_acc: 0.9245




 

In [None]:
num_train_samples = len(df_train)
num_val_samples = len(df_val)
train_batch_size = 10
val_batch_size = 10


train_steps = np.ceil(num_train_samples / train_batch_size)
val_steps = np.ceil(num_val_samples / val_batch_size)

In [None]:
AlexNet = keras.models.Sequential([
    keras.layers.Conv2D(filters=96, kernel_size=(11,11), strides=(4,4), activation='relu', input_shape=(96,96,3)),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPool2D(pool_size=(3,3), strides=(2,2)),
    keras.layers.Conv2D(filters=256, kernel_size=(5,5), strides=(1,1), activation='relu', padding="same"),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPool2D(pool_size=(3,3), strides=(2,2)),
    keras.layers.Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), activation='relu', padding="same"),
    keras.layers.BatchNormalization(),
    keras.layers.Conv2D(filters=384, kernel_size=(1,1), strides=(1,1), activation='relu', padding="same"),
    keras.layers.BatchNormalization(),
    keras.layers.Conv2D(filters=256, kernel_size=(1,1), strides=(1,1), activation='relu', padding="same"),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPool2D(pool_size=(3,3), strides=(2,2)),
    keras.layers.Flatten(),
    keras.layers.Dense(4096, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(4096, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1, activation='sigmoid')
])

In [None]:
AlexNet.compile(Adam(lr=0.0001), loss='binary_crossentropy', 
              metrics=['accuracy'])

In [None]:
num_epoch = 15
filepath = "AlexNet.h5"
# val_acc should be maximized, so both callbacks use mode='max'
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, 
                             save_best_only=True, mode='max')

reduce_lr = ReduceLROnPlateau(monitor='val_acc', factor=0.5, patience=2, 
                                   verbose=1, mode='max', min_lr=0.00001)

early_stop = EarlyStopping(monitor='val_acc', min_delta=0, patience=10, mode='auto')
                              
                              
callbacks_list = [checkpoint, reduce_lr,early_stop]

history = AlexNet.fit_generator(train_gen, steps_per_epoch=train_steps, 
                    validation_data=val_gen,
                    validation_steps=val_steps,
                    epochs=num_epoch,
                    callbacks=callbacks_list,
                     verbose=1)

In [None]:
AlexNet.load_weights('AlexNet.h5')
saved_models['AlexNet'] = AlexNet 

# Model 2 - ResNet

In [None]:
num_train_samples = len(df_train)
num_val_samples = len(df_val)
train_batch_size = 10
val_batch_size = 10


train_steps = np.ceil(num_train_samples / train_batch_size)
val_steps = np.ceil(num_val_samples / val_batch_size)

In [None]:
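# use tensorflow.keras throughout so the layers match the callbacks imported earlier
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, BatchNormalization, Activation
from tensorflow.keras.layers import Conv2D, MaxPool2D
from tensorflow.keras.applications.resnet50 import ResNet50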

dropout_fc = 0.5

conv_base = ResNet50(weights = 'imagenet', include_top = False, input_shape = (96,96,3))

resnet = Sequential()
resnet.add(conv_base)
resnet.add(Flatten())
resnet.add(Dense(512, use_bias=False))
resnet.add(BatchNormalization())
resnet.add(Activation("relu"))
resnet.add(Dropout(dropout_fc))
resnet.add(Dense(1, activation = "sigmoid"))

resnet.summary()


In [None]:
from tensorflow.keras import optimizers
resnet.compile(optimizers.Adam(0.001), loss = "binary_crossentropy", metrics = ["accuracy"])

In [None]:
num_epoch = 50
filepath = "ResNet50.h5"
# val_acc should be maximized, so both callbacks use mode='max'
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, 
                             save_best_only=True, mode='max')

reduce_lr = ReduceLROnPlateau(monitor='val_acc', factor=0.5, patience=2, 
                                   verbose=1, mode='max', min_lr=0.00001)

early_stop = EarlyStopping(monitor='val_acc', min_delta=0, patience=10, mode='auto')
                              
                              
callbacks_list = [checkpoint, reduce_lr,early_stop]

history = resnet.fit_generator(train_gen, steps_per_epoch=train_steps, 
                    validation_data=val_gen,
                    validation_steps=val_steps,
                    epochs=num_epoch,
                    callbacks=callbacks_list,
                     verbose=1)

In [None]:
resnet.load_weights('ResNet50.h5')
saved_models['ResNet50'] = resnet 

# Model 3 - DenseNet


In [None]:
num_train_samples = len(df_train)
num_val_samples = len(df_val)
train_batch_size = 10
val_batch_size = 10


train_steps = np.ceil(num_train_samples / train_batch_size)
val_steps = np.ceil(num_val_samples / val_batch_size)

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import Dense, Dropout, Flatten, Activation,BatchNormalization
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.optimizers import Adam

dropout_fc = 0.5
conv_base = tf.keras.applications.DenseNet121(weights = 'imagenet', include_top = False, input_shape = (96,96,3))

densenet = keras.Sequential()
densenet.add(conv_base)
densenet.add(Flatten())
densenet.add(Dense(256, use_bias=False))
densenet.add(BatchNormalization())
densenet.add(Activation("relu"))
densenet.add(Dropout(dropout_fc))
densenet.add(Dense(1, activation = "sigmoid"))
densenet.summary()


In [None]:
densenet.compile(Adam(lr=0.0001), loss='binary_crossentropy', 
              metrics=['accuracy'])

In [None]:

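num_epoch = 50  # same name as the epochs argument passed to fit_generator below
filepath = "DenseNet.h5"
# val_acc should be maximized, so both callbacks use mode='max'
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, 
                             save_best_only=True, mode='max')

reduce_lr = ReduceLROnPlateau(monitor='val_acc', factor=0.5, patience=2, 
                                   verbose=1, mode='max', min_lr=0.00001)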

early_stop = EarlyStopping(monitor='val_acc', min_delta=0, patience=10, mode='auto')
                              
                              
callbacks_list = [checkpoint, reduce_lr,early_stop]

history = densenet.fit_generator(train_gen, steps_per_epoch=train_steps, 
                    validation_data=val_gen,
                    validation_steps=val_steps,
                    epochs=num_epoch,
                    callbacks=callbacks_list,
                     verbose=1)

In [None]:
densenet.load_weights('DenseNet.h5')
saved_models['DenseNet'] = densenet

# Model 4 - InceptionV3

In [None]:
num_train_samples = len(df_train)
num_val_samples = len(df_val)
train_batch_size = 10
val_batch_size = 10


train_steps = np.ceil(num_train_samples / train_batch_size)
val_steps = np.ceil(num_val_samples / val_batch_size)

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras.layers import Dense, Dropout, Flatten, Activation,BatchNormalization
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.optimizers import Adam

dropout_fc = 0.5
conv_base = tf.keras.applications.InceptionV3(weights = 'imagenet', include_top = False, input_shape = (96,96,3))

inceptionnet = keras.Sequential()
inceptionnet.add(conv_base)
inceptionnet.add(Flatten())
inceptionnet.add(Dense(256, use_bias=False))
inceptionnet.add(BatchNormalization())
inceptionnet.add(Activation("relu"))
inceptionnet.add(Dropout(dropout_fc))
inceptionnet.add(Dense(1, activation = "sigmoid"))
inceptionnet.summary()

In [None]:
inceptionnet.compile(Adam(lr=0.0001), loss='binary_crossentropy', 
              metrics=['accuracy'])

In [None]:

num_epoch = 50  # same name as the epochs argument passed to fit_generator below
filepath = "IncepionNet.h5"
# val_acc should be maximized, so both callbacks use mode='max'
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, 
                             save_best_only=True, mode='max')

reduce_lr = ReduceLROnPlateau(monitor='val_acc', factor=0.5, patience=2, 
                                   verbose=1, mode='max', min_lr=0.00001)

early_stop = EarlyStopping(monitor='val_acc', min_delta=0, patience=10, mode='auto')
                              
                              
callbacks_list = [checkpoint, reduce_lr,early_stop]

history = inceptionnet.fit_generator(train_gen, steps_per_epoch=train_steps, 
                    validation_data=val_gen,
                    validation_steps=val_steps,
                    epochs=num_epoch,
                    callbacks=callbacks_list,
                     verbose=1)

In [None]:
inceptionnet.load_weights('IncepionNet.h5')
saved_models['IncepionNet'] = inceptionnet

# Make a test prediction

In [None]:
shutil.rmtree('base_dir') # free up space
 

In [None]:
#[CREATE A TEST FOLDER DIRECTORY STRUCTURE]

# We will be feeding test images from a folder into predict_generator().
# Keras requires that the path should point to a folder containing images and not
# to the images themselves. That is why we are creating a folder (test_images) 
# inside another folder (test_dir).

# test_dir
    # test_images

# create test_dir
test_dir = 'test_dir'
os.mkdir(test_dir)
    
# create test_images inside test_dir
test_images = os.path.join(test_dir, 'test_images')
os.mkdir(test_images)

In [None]:

test_list = os.listdir('../input/test')

for image in test_list:
    
    fname = image
    
    # source path to image
    src = os.path.join('../input/test', fname)
    # destination path to image
    dst = os.path.join(test_images, fname)
    # copy the image from the source to the destination
    shutil.copyfile(src, dst)

In [None]:
test_path ='test_dir'


# Point the generator at test_dir, the folder that contains test_images.

test_gen = datagen.flow_from_directory(test_path,
                                        target_size=(IMAGE_SIZE,IMAGE_SIZE),
                                        batch_size=1,
                                        class_mode='categorical',
                                        shuffle=False)

In [None]:
saved_models

In [None]:
num_test_images = 57458

# make sure we are using the best epoch
#model.load_weights('model.h5')

# the ResNet model was stored under the key 'ResNet50'
predictions = saved_models['ResNet50'].predict_generator(test_gen, steps=num_test_images, verbose=1)

In [None]:
df_preds = pd.DataFrame(predictions)
df_preds.head()

In [None]:
# This outputs the file names in the sequence in which 
# the generator processed the test images.
test_filenames = test_gen.filenames

# add the filenames to the dataframe
df_preds['file_names'] = test_filenames

df_preds.head()

In [None]:
def extract_id(x):
    
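    # separate the folder name from the file name, e.g. 'test_images/abc.tif' -> ['test_images', 'abc.tif']
    a = x.split('/')
    # separate the file name from its extension, e.g. 'abc.tif' -> ['abc', 'tif']
    b = a[1].split('.')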
    extracted_id = b[0]
    
    return extracted_id

df_preds['id'] = df_preds['file_names'].apply(extract_id)

df_preds.head()

In [None]:
y_pred = df_preds[0]

# get the id column
image_id = df_preds['id']

In [None]:
submission = pd.DataFrame({'id':image_id, 
                           'label':y_pred, 
                          }).set_index('id')

submission.to_csv('resnet_pred.csv', columns=['label']) 

In [None]:
submission.head()

# Simple Ensemble 

In [None]:
InceptionNet = pd.read_csv('desktop/inceptionnet_pred.csv')
DenseNet = pd.read_csv('desktop/DenseNet_preds.csv')
ResNet = pd.read_csv('desktop/resnet_pred.csv')

In [None]:
df_preds = pd.DataFrame()
df_preds['id'] = DenseNet['id']
df_preds['label'] = ( InceptionNet['label']+ResNet['label'] + DenseNet['label']) / 3 

In [None]:
y_pred = df_preds['label']
image_id = df_preds['id']

In [None]:
submission = pd.DataFrame({'id':image_id, 
                           'label':y_pred, 
                          }).set_index('id')

submission.to_csv('Ensemble_pred.csv', columns=['label']) 
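
The averaging above adds the `label` columns row by row, which is only correct if all three CSV files list the ids in the same order. A more defensive variant, sketched below, matches rows by `id` before averaging. The helper name `average_predictions` and the toy frames are assumptions and are not taken from the actual prediction files.

```python
import pandas as pd


def average_predictions(frames):
    """Average the 'label' column of several prediction frames, matching rows by 'id'."""
    stacked = pd.concat(frames, ignore_index=True)
    return stacked.groupby("id", sort=True)["label"].mean().reset_index()


# Hypothetical predictions from three models, deliberately listed in different orders.
model_a = pd.DataFrame({"id": ["a", "b", "c"], "label": [0.9, 0.2, 0.6]})
model_b = pd.DataFrame({"id": ["c", "a", "b"], "label": [0.3, 0.6, 0.4]})
model_c = pd.DataFrame({"id": ["b", "c", "a"], "label": [0.0, 0.3, 0.3]})

ensemble = average_predictions([model_a, model_b, model_c])
print(ensemble)
```

One caveat of this sketch is that an id missing from one of the files is averaged over the remaining models without warning, so it is worth checking that every frame contains the same set of ids first.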