## Problem Description
The goal of this project is to develop a machine learning model capable of identifying metastatic cancer in small image patches extracted from larger digital pathology scans. Metastatic cancer detection is crucial as it helps in early diagnosis and treatment planning, potentially improving patient outcomes. This task involves binary image classification, where each image patch is classified as either containing metastatic cancer (positive) or not (negative).

## Dataset Description
The dataset is a slightly modified version of the PatchCamelyon (PCam) benchmark, a binary image classification dataset of small patches extracted from larger digital pathology scans.

The dataset provided for this project includes the following components:

1. Train Folder: Contains a large number of .tif image files used for training the model.
2. Test Folder: Contains a large number of .tif image files used for testing the model.
3. sample_submission.csv: A sample submission file in CSV format that provides the structure required for submitting predictions on the test set to Kaggle.
4. train_labels.csv: A CSV file containing the labels for the training images. It has two columns:
   - id: The identifier for each image (filename without extension).
   - label: The binary label indicating whether the image patch contains metastatic cancer (1) or not (0).

In [None]:
import os
import shutil
import warnings
warnings.filterwarnings('ignore')

import cv2
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np
from numpy.random import seed
seed(123)

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Dropout, Flatten, Activation
from tensorflow.keras.models import Sequential
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.optimizers import Adam
tf.random.set_seed(123)

In [None]:
os.listdir('../input/histopathologic-cancer-detection')

In [None]:
print(len(os.listdir('../input/histopathologic-cancer-detection/train')))
print(len(os.listdir('../input/histopathologic-cancer-detection/test')))

### Exploratory Data Analysis (EDA) — Inspect, Visualize and Clean the Data

In [None]:
df_train = pd.read_csv('../input/histopathologic-cancer-detection/train_labels.csv')
df_sample_submission = pd.read_csv('../input/histopathologic-cancer-detection/sample_submission.csv')
print(df_train.shape)

In [None]:
df_train.head()

In [None]:
import seaborn as sns

# Visualize the distribution of labels
sns.countplot(x='label', data=df_train)
plt.title('Distribution of Labels in the Training Set')
plt.xlabel('Label')
plt.ylabel('Count')
plt.show()

In [None]:
# Display a few sample images from each class
fig, axes = plt.subplots(2, 5, figsize=(20, 8))

# Sample 5 images from class 0 (no cancer)
for i, img_id in enumerate(df_train[df_train['label'] == 0].sample(5)['id']):
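    img = cv2.imread(f'../input/histopathologic-cancer-detection/train/{img_id}.tif')
    # OpenCV loads images as BGR; convert to RGB so matplotlib shows the true stain colors
    axes[0, i].imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))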
    axes[0, i].set_title('Label: 0')
    axes[0, i].axis('off')

# Sample 5 images from class 1 (cancer)
for i, img_id in enumerate(df_train[df_train['label'] == 1].sample(5)['id']):
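    img = cv2.imread(f'../input/histopathologic-cancer-detection/train/{img_id}.tif')
    # OpenCV loads images as BGR; convert to RGB so matplotlib shows the true stain colors
    axes[1, i].imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))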
    axes[1, i].set_title('Label: 1')
    axes[1, i].axis('off')

plt.show()

### Data Cleaning Procedures

In [None]:
# Check for missing values
missing_values = df_train.isnull().sum()
print(missing_values)

### Shuffling, Balancing, Splitting the Data

In [None]:
df_train['label'].value_counts()

In [None]:
# Check the number of samples available for each class
label_counts = df_train['label'].value_counts()
num_samples = min(label_counts[0], label_counts[1], 500)

# Sample the dataframes with the number of samples that can be safely drawn
df0 = df_train[df_train['label'] == 0].sample(num_samples)
df1 = df_train[df_train['label'] == 1].sample(num_samples)

# Combine and shuffle the data
df_data = pd.concat([df0, df1], axis=0).reset_index(drop=True)
df_data = shuffle(df_data)

# Verify the distribution of labels
print(df_data['label'].value_counts())

In [None]:
y = df_data['label']

df_train, df_val = train_test_split(df_data, test_size=0.20, stratify=y)

print(df_train.shape)
print(df_val.shape)

In [None]:
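# Create the directory tree expected by flow_from_directory (one subfolder per class).
# exist_ok=True lets this cell be re-run without raising FileExistsError.
for split in ('train', 'val'):
    for label in ('0', '1'):
        os.makedirs(os.path.join('base', split, label), exist_ok=True)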

In [None]:
for image in list(df_train[df_train['label']==0]['id']):
    shutil.copyfile('../input/histopathologic-cancer-detection/train/'+image+'.tif', 'base/train/0/'+image+'.tif')

for image in list(df_train[df_train['label']==1]['id']):
    shutil.copyfile('../input/histopathologic-cancer-detection/train/'+image+'.tif', 'base/train/1/'+image+'.tif')
    
for image in list(df_val[df_val['label']==0]['id']):
    shutil.copyfile('../input/histopathologic-cancer-detection/train/'+image+'.tif', 'base/val/0/'+image+'.tif')
    
for image in list(df_val[df_val['label']==1]['id']):
    shutil.copyfile('../input/histopathologic-cancer-detection/train/'+image+'.tif', 'base/val/1/'+image+'.tif')

In [None]:
print(len(os.listdir('base/train/0')))
print(len(os.listdir('base/train/1')))
print(len(os.listdir('base/val/0')))
print(len(os.listdir('base/val/1')))

## Plan of Analysis

### Data Preprocessing:
- Normalize the images to have pixel values between 0 and 1.
- Augment the data to enhance model generalization.
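
The augmentation step is not implemented in the cells that follow. As a minimal sketch of what it could look like, the snippet below normalizes a patch and applies random flips and 90-degree rotations with NumPy, which are common choices for pathology patches because tissue has no canonical orientation. The function name `augment_patch` and the random dummy patch are assumptions made for illustration and are not part of the original pipeline.

```python
import numpy as np

def augment_patch(patch, rng):
    """Return a randomly flipped and rotated copy of an (H, W, C) patch."""
    if rng.random() < 0.5:
        patch = np.flip(patch, axis=0)   # vertical flip
    if rng.random() < 0.5:
        patch = np.flip(patch, axis=1)   # horizontal flip
    k = int(rng.integers(0, 4))          # number of 90-degree rotations (0 to 3)
    patch = np.rot90(patch, k=k, axes=(0, 1))
    return np.ascontiguousarray(patch)

rng = np.random.default_rng(123)
# Dummy 96x96 RGB patch with 8-bit pixel values (stand-in for a real .tif file)
raw = rng.integers(0, 256, size=(96, 96, 3), dtype=np.uint8)
normalized = raw.astype(np.float32) / 255.0   # scale pixels to [0, 1]
augmented = augment_patch(normalized, rng)
print(augmented.shape, augmented.min() >= 0.0, augmented.max() <= 1.0)
```

With Keras, a similar effect can be requested through the `horizontal_flip`, `vertical_flip`, and `rotation_range` arguments of `ImageDataGenerator`.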

### Model Development:
- Use a Convolutional Neural Network (CNN) model, which is well-suited for image classification tasks.
- Experiment with different architectures and hyperparameters to find the best-performing model.
- Use techniques such as early stopping, learning rate reduction, and model checkpointing to optimize the training process.
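
The training cells in this notebook call `model.fit` without callbacks. A hedged sketch of how the three techniques named above could be configured with the Keras callbacks already imported at the top of the notebook is shown here. The monitored quantity, the patience values, the reduction factor, and the checkpoint filename are all illustrative assumptions.

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint

# All thresholds and the checkpoint filename below are illustrative assumptions.
callbacks = [
    # Stop when validation loss has not improved for 3 epochs and restore the best weights
    EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
    # Halve the learning rate when validation loss stalls for 2 epochs
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2, min_lr=1e-6),
    # Keep only the model with the lowest validation loss seen so far
    ModelCheckpoint('best_model.keras', monitor='val_loss', save_best_only=True),
]
print([type(cb).__name__ for cb in callbacks])
```

The list would then be passed to training as `model.fit(..., callbacks=callbacks)`.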

### Model Evaluation:
- Evaluate the model using the validation set to monitor its performance and avoid overfitting.
- Calculate metrics such as accuracy, precision, recall, and F1-score to assess the model's performance comprehensively.
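
The notebook itself only tracks AUC during training. As a small sketch of how the other metrics listed above follow from the confusion counts, the snippet below computes them with NumPy on a toy example. The helper name `binary_metrics`, the 0.5 threshold, and the six labels and probabilities are made up for illustration.

```python
import numpy as np

def binary_metrics(y_true, y_prob, threshold=0.5):
    """Compute accuracy, precision, recall, and F1 from predicted probabilities."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))   # true positives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))   # false positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))   # false negatives
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))   # true negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Toy labels and predicted probabilities of the positive class (illustrative only)
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.7, 0.1]
metrics = binary_metrics(y_true, y_prob)
print(metrics)
```

The same quantities are available in `sklearn.metrics` as `accuracy_score`, `precision_score`, `recall_score`, and `f1_score`.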

### Submission:
- Generate predictions on the test set.
- Format the predictions according to the sample submission file and submit them to Kaggle for evaluation.

In [None]:
# Set up the generators
train_path = 'base/train'
valid_path = 'base/val'
test_path = '../input/histopathologic-cancer-detection/test'

num_train_samples = len(df_train)
num_val_samples = len(df_val)
train_batch_size = 10
val_batch_size = 10


# Use true division so that np.ceil can round a partial final batch up
train_steps = int(np.ceil(num_train_samples / train_batch_size))
val_steps = int(np.ceil(num_val_samples / val_batch_size))

In [None]:
datagen = ImageDataGenerator(rescale=1.0/255)

train_gen = datagen.flow_from_directory(train_path,
                                        target_size=(96,96),
                                        batch_size=train_batch_size,
                                        class_mode='categorical')

val_gen = datagen.flow_from_directory(valid_path,
                                        target_size=(96,96),
                                        batch_size=val_batch_size,
                                        class_mode='categorical')

# shuffle=False keeps predictions in the same order as test_gen.filenames;
# classes=['test'] makes the generator treat the test folder as a single class
test_gen = datagen.flow_from_directory('../input/histopathologic-cancer-detection',
                                        target_size=(96,96),
                                        batch_size=1,
                                        classes=['test'],
                                        shuffle=False)

In [None]:
kernel_size = (3,3)
pool_size= (2,2)
first_filters = 64
second_filters = 128
third_filters = 256
fourth_filters = 512

dropout_conv = 0.5
dropout_dense = 0.5

model = Sequential()
model.add(Conv2D(first_filters, kernel_size, activation = 'relu',padding='same', input_shape = (96, 96, 3)))
model.add(Conv2D(first_filters, kernel_size, activation = 'relu',padding='same'))
model.add(MaxPooling2D(pool_size = pool_size)) 

model.add(Conv2D(second_filters, kernel_size, activation ='relu',padding='same'))
model.add(Conv2D(second_filters, kernel_size, activation ='relu',padding='same'))
model.add(MaxPooling2D(pool_size = pool_size))

model.add(Conv2D(third_filters, kernel_size, activation ='relu',padding='same'))
model.add(Conv2D(third_filters, kernel_size, activation ='relu',padding='same'))
model.add(Conv2D(third_filters, kernel_size, activation ='relu',padding='same'))
model.add(MaxPooling2D(pool_size = pool_size))

model.add(Conv2D(fourth_filters, kernel_size, activation ='relu',padding='same'))
model.add(Conv2D(fourth_filters, kernel_size, activation ='relu',padding='same'))
model.add(Conv2D(fourth_filters, kernel_size, activation ='relu',padding='same'))

model.add(Flatten())
model.add(Dense(4096, activation = "relu"))
model.add(Dropout(dropout_dense))
model.add(Dense(4096, activation = "relu"))
model.add(Dropout(dropout_dense))
model.add(Dense(2, activation = "softmax"))

model.summary()

In [None]:
model.compile(Adam(learning_rate=0.0001), loss='binary_crossentropy', 
              metrics=['AUC'])

In [None]:
print(val_gen.class_indices)

In [None]:
history = model.fit(train_gen, 
                    validation_data=val_gen,
                    epochs=10, verbose=1)

## Model Architecture and Rationale
The model architecture proposed for the histopathologic cancer detection problem is a deep Convolutional Neural Network (CNN). CNNs are well-suited for image classification tasks because they can automatically learn and extract features from images through a series of convolutional layers.

### Detailed architecture:

1. Input Layer:
   - Input Shape: (96, 96, 3), the native size of the RGB image patches.
   - The input images are normalized to pixel values between 0 and 1 using ImageDataGenerator with rescale=1.0/255.

2. Convolutional Layers (all with 'same' padding):
   - First Convolution Block:
     - 2 Conv2D layers with 64 filters, kernel size of (3,3), and ReLU activation.
     - MaxPooling2D layer with a pool size of (2,2).
   - Second Convolution Block:
     - 2 Conv2D layers with 128 filters, kernel size of (3,3), and ReLU activation.
     - MaxPooling2D layer with a pool size of (2,2).
   - Third Convolution Block:
     - 3 Conv2D layers with 256 filters, kernel size of (3,3), and ReLU activation.
     - MaxPooling2D layer with a pool size of (2,2).
   - Fourth Convolution Block:
     - 3 Conv2D layers with 512 filters, kernel size of (3,3), and ReLU activation.

3. Flatten Layer:
   - Converts the 3D output of the convolutional layers to a 1D vector.

4. Fully Connected Layers:
   - Two Dense layers with 4096 units each and ReLU activation.
   - Each Dense layer is followed by a Dropout layer with a rate of 0.5 to reduce overfitting.

5. Output Layer:
   - Dense layer with 2 units and softmax activation, giving one probability per class.
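
To make the layer list above concrete, the following sketch traces the feature-map shape through the four convolution blocks and counts the inputs to the first Dense layer. It is plain arithmetic under the assumptions stated in the architecture ('same' padding and 2x2 pooling after the first three blocks only); the helper `trace_feature_maps` is a hypothetical name introduced here.

```python
def trace_feature_maps(input_size, blocks):
    """Return the (height, width, channels) shape after each convolution block.

    blocks is a list of (filters, has_pool) pairs.
    """
    size = input_size
    shapes = []
    for filters, has_pool in blocks:
        # 'same' padding keeps height and width unchanged through each Conv2D
        if has_pool:
            size //= 2   # 2x2 max pooling halves each spatial dimension
        shapes.append((size, size, filters))
    return shapes

# Four blocks as described above; the last block has no pooling layer
blocks = [(64, True), (128, True), (256, True), (512, False)]
shapes = trace_feature_maps(96, blocks)

h, w, c = shapes[-1]
flattened = h * w * c                     # length of the vector produced by Flatten
dense_params = flattened * 4096 + 4096    # weights plus biases of the first Dense layer
print(shapes)
print(flattened, dense_params)
```

The flattened vector has 73,728 entries, so the first Dense layer alone holds roughly 302 million parameters, which dominates the size of the model.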


### Reasoning for the Architecture
1. Depth of the Network:
The depth allows the model to learn complex patterns and features from the images. The multiple layers of convolutions help capture different levels of abstraction, which is crucial for identifying subtle features indicative of metastatic cancer.

2. Use of Dropout:
Dropout is used to reduce overfitting, which matters here because at most 500 patches per class are sampled for this experiment. During training, dropout randomly sets a fraction of input units to 0 at each update, which keeps the network from becoming too reliant on any particular neurons.

3. Pooling Layers:
MaxPooling layers reduce the spatial dimensions of the feature maps, which decreases computational complexity and helps the network become invariant to small translations in the input images.

4. Large Dense Layers:
The large fully connected layers towards the end help in learning high-level representations. The model can combine the features extracted by the convolutional layers to make the final classification.
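
The pooling and dropout operations discussed above can be illustrated with a few lines of NumPy. This is a simplified sketch on a single-channel array, not the Keras implementation; the function names `max_pool_2x2` and `dropout` are introduced here for illustration.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) array with even H and W."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def dropout(x, rate, rng):
    """Inverted dropout: zero a random fraction of units and rescale the survivors."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

feature_map = np.arange(16, dtype=np.float32).reshape(4, 4)
pooled = max_pool_2x2(feature_map)
print(pooled)    # each entry is the maximum of one 2x2 window

rng = np.random.default_rng(0)
dropped = dropout(np.ones(10), rate=0.5, rng=rng)
print(dropped)   # surviving units are scaled by 1 / (1 - rate)
```

Pooling a 4x4 map yields a 2x2 map, which is the halving of spatial dimensions described above. Rescaling the surviving units keeps the expected activation unchanged, so no adjustment is needed at inference time, when dropout is switched off.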


## Hyperparameter Tuning and Alternative Architectures
We will experiment with several hyperparameters and architectures to find the best configuration:

1. Learning Rate
2. Batch Size
3. Number of Filters in Convolutional Layers
4. Dropout Rates

We'll compare three different architectures: the initial model, a simpler model, and a more complex model.
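
One way to organize a search over the four hyperparameters listed above is to enumerate a small grid of configurations and train one model per entry. The sketch below only builds the grid; the candidate values and the dictionary keys are illustrative assumptions, not results or settings taken from this notebook's experiments.

```python
from itertools import product

# Candidate values are illustrative assumptions, not tuned results.
search_space = {
    'learning_rate': [1e-3, 1e-4, 1e-5],
    'batch_size': [10, 32],
    'first_filters': [32, 64],
    'dropout_dense': [0.3, 0.5],
}

# Build one configuration dictionary per combination of values
names = list(search_space)
configs = [dict(zip(names, values)) for values in product(*search_space.values())]

print(len(configs))
print(configs[0])
```

Because the grid grows multiplicatively with each added hyperparameter, a random subset of configurations is often trained instead of the full grid when compute is limited.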

### Simpler Model

In [None]:
simpler_model = Sequential()
simpler_model.add(Conv2D(32, kernel_size, activation='relu', padding='same', input_shape=(96, 96, 3)))
simpler_model.add(MaxPooling2D(pool_size=pool_size))
simpler_model.add(Conv2D(64, kernel_size, activation='relu', padding='same'))
simpler_model.add(MaxPooling2D(pool_size=pool_size))
simpler_model.add(Flatten())
simpler_model.add(Dense(512, activation='relu'))
simpler_model.add(Dropout(dropout_dense))
simpler_model.add(Dense(2, activation='softmax'))
simpler_model.compile(Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['AUC'])

### Complex Model

In [None]:
complex_model = Sequential()
complex_model.add(Conv2D(64, kernel_size, activation='relu', padding='same', input_shape=(96, 96, 3)))
complex_model.add(Conv2D(64, kernel_size, activation='relu', padding='same'))
complex_model.add(MaxPooling2D(pool_size=pool_size)) 

complex_model.add(Conv2D(128, kernel_size, activation='relu', padding='same'))
complex_model.add(Conv2D(128, kernel_size, activation='relu', padding='same'))
complex_model.add(MaxPooling2D(pool_size=pool_size))

complex_model.add(Conv2D(256, kernel_size, activation='relu', padding='same'))
complex_model.add(Conv2D(256, kernel_size, activation='relu', padding='same'))
complex_model.add(Conv2D(256, kernel_size, activation='relu', padding='same'))
complex_model.add(MaxPooling2D(pool_size=pool_size))

complex_model.add(Conv2D(512, kernel_size, activation='relu', padding='same'))
complex_model.add(Conv2D(512, kernel_size, activation='relu', padding='same'))
complex_model.add(Conv2D(512, kernel_size, activation='relu', padding='same'))
complex_model.add(MaxPooling2D(pool_size=pool_size))

complex_model.add(Conv2D(1024, kernel_size, activation='relu', padding='same'))
complex_model.add(Conv2D(1024, kernel_size, activation='relu', padding='same'))
complex_model.add(Conv2D(1024, kernel_size, activation='relu', padding='same'))

complex_model.add(Flatten())
complex_model.add(Dense(4096, activation='relu'))
complex_model.add(Dropout(dropout_dense))
complex_model.add(Dense(4096, activation='relu'))
complex_model.add(Dropout(dropout_dense))
complex_model.add(Dense(2, activation='softmax'))

complex_model.compile(Adam(learning_rate=0.00001), loss='binary_crossentropy', metrics=['AUC'])

In [None]:
history_simpler = simpler_model.fit(train_gen, validation_data=val_gen, epochs=10, verbose=1)

In [None]:
history_complex = complex_model.fit(train_gen, validation_data=val_gen, epochs=10, verbose=1)

In [None]:
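# Keras may record the metric as 'auc' or 'AUC' depending on the version, so look the key up
auc_key = next(k for k in history.history if k.lower().startswith('auc'))
tr_auc = history.history[auc_key]
val_auc = history.history['val_' + auc_key]

epochs = range(1, len(tr_auc) + 1)

# The tracked metric is AUC, not accuracy, so label the plot accordingly
plt.plot(epochs, tr_auc, label='Training AUC')
plt.plot(epochs, val_auc, label='Validation AUC')
plt.title('AUC')
plt.xlabel('Epoch')
plt.legend()
plt.show()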

In [None]:
predictions = model.predict(test_gen, verbose=1)

In [None]:
predictions

In [None]:
df_preds = pd.DataFrame(predictions, columns=['0', '1'])

df_preds.head()

In [None]:
df_preds[df_preds['1']>0.5]

In [None]:
df_preds['file_names'] = test_gen.filenames

In [None]:
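# Strip the 'test/' prefix and the '.tif' extension to recover the image id
df_preds['id'] = df_preds['file_names'].str[5:-4]
# Column '1' holds the predicted probability of the positive (cancer) class
df_preds[['id','1']].rename(columns={'1':'label'}).to_csv('submission.csv', columns=['id','label'], index=False)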

In [None]:
pd.read_csv('submission.csv')