# Week 3 MiniProject

# Description

In this project we are going to build a Convolutional Neural Network in an attempt to identify metastatic cancer in small image patches taken from digital pathology scans.

The training set contains 220,000 images, each 96x96 pixels with shape (96, 96, 3).

The 3 refers to the number of color channels (R, G, B).
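
As a quick illustration of that shape, here is a minimal sketch that uses a blank NumPy array as a stand-in for one patch (the array is a placeholder, not a real image from the dataset):

```python
import numpy as np

# Placeholder patch: height x width x channels, 8-bit like a typical RGB image.
patch = np.zeros((96, 96, 3), dtype=np.uint8)

height, width, channels = patch.shape
values_per_image = patch.size  # 96 * 96 * 3

print(height, width, channels)  # 96 96 3
print(values_per_image)         # 27648
```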

# Exploratory Data Analysis

Given the size of the dataset and how long a neural network takes to train, my first step was to randomly select a subset of the images to save time, since this is just a weekly mini-project.

Below is my code for drawing a uniform random sample of 50,000 training images and saving them as JPEGs.
This took about 5 minutes to run. Tallyho!

In [48]:
import os
import random
from PIL import Image

image_folder = 'C:/Users/diego/GradSchool/DeepLearningIntro/WeekThree/train'
output_folder = 'C:/Users/diego/GradSchool/DeepLearningIntro/WeekThree/project'
sample_size = 50000

os.makedirs(output_folder, exist_ok=True)

all_images = os.listdir(image_folder)
sampled_images = random.sample(all_images, sample_size)

for img_name in sampled_images:
    src_path = os.path.join(image_folder, img_name)
    dst_name = os.path.splitext(img_name)[0] + '.jpg'
    dst_path = os.path.join(output_folder, dst_name)
    
    try:
        with Image.open(src_path) as img:
            rgb_img = img.convert('RGB')  # Convert in case it's grayscale or RGBA
            rgb_img.save(dst_path, 'JPEG')
    except Exception as e:
        print(f"Error processing {img_name}: {e}")

print(f"Copied and converted {len(sampled_images)} images to {output_folder} as JPGs")


Copied and converted 50000 images to C:/Users/diego/GradSchool/DeepLearningIntro/WeekThree/project as JPGs
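
One thing worth noting: `random.sample` is unseeded here and `os.listdir` returns names in arbitrary order, so rerunning the cell would pick a different subset. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable (the `pick_sample` helper, the seed value, and the file names are hypothetical, for illustration only):

```python
import random

def pick_sample(names, k, seed=42):
    """Return k names chosen uniformly at random, reproducibly."""
    # Sort first so the result does not depend on directory listing order.
    rng = random.Random(seed)
    return rng.sample(sorted(names), k)

# Hypothetical file names standing in for os.listdir(image_folder).
names = [f"img_{i:04d}.tif" for i in range(1000)]
first = pick_sample(names, 50)
second = pick_sample(list(reversed(names)), 50)
print(first == second)  # True
```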


Now I have fewer images to work with. Cool. However, my training labels CSV now has about 170k records that I do not need.
Let's trim it down to the sampled images. Watch out for the file extension: the ids in the CSV have no extension, while the sampled files end in .jpg.

In [50]:
import os
import pandas as pd
import shutil
from PIL import Image
output_folder = 'C:/Users/diego/GradSchool/DeepLearningIntro/WeekThree/project'
sampled_filenames = {filename.replace('.jpg', '') for filename in os.listdir(output_folder)}

labels_df = pd.read_csv('train_labels.csv')
filter_labels_df = labels_df[labels_df['id'].isin(sampled_filenames)]

filter_labels_df.to_csv('labels_50k.csv',index = False)


labels_df = pd.read_csv('labels_50k.csv')

source_folder = 'C:/Users/diego/GradSchool/DeepLearningIntro/WeekThree/project'
output_base = 'C:/Users/diego/GradSchool/DeepLearningIntro/WeekThree/project_filter'

# Make class folders if they don't exist
os.makedirs(os.path.join(output_base, '0'), exist_ok=True)
os.makedirs(os.path.join(output_base, '1'), exist_ok=True)

for _, row in labels_df.iterrows():
    filename = row['id'] + '.jpg'
    label = str(row['label'])
    src_path = os.path.join(source_folder, filename)
    dest_path = os.path.join(output_base, label, filename)
    if os.path.exists(src_path):
        shutil.move(src_path, dest_path)
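
Since the sample was drawn uniformly at random rather than stratified by label, it may be worth checking that the class balance of the subset is close to that of the full label file. A minimal sketch, using a tiny made-up DataFrame in place of `train_labels.csv` (the ids and labels below are invented for illustration, and `class_balance` is a hypothetical helper):

```python
import pandas as pd

def class_balance(df, label_col="label"):
    """Fraction of rows per class, as a {label: fraction} dict."""
    shares = df[label_col].value_counts(normalize=True).sort_index()
    return {int(k): float(v) for k, v in shares.items()}

# Made-up labels standing in for the full CSV and the filtered subset.
full_df = pd.DataFrame({"id": list("abcdefghij"),
                        "label": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]})
subset_df = full_df[full_df["id"].isin({"a", "b", "c", "g", "h"})]

print(class_balance(full_df))
print(class_balance(subset_df))
```

With the real files, the same helper could be applied to `labels_df` and `filter_labels_df`; a large difference between the two would suggest resampling.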


# Model Architectures to Consider

The choice depends heavily on the number of images I plan to use to train the network.

1. Simple Custom CNN 

It would essentially work as follows:

2-3 convolution layers, with the number of filters increasing as the layers go on.

Convolution -> ReLU -> MaxPool -> Convolution -> ReLU -> MaxPool -> Flatten -> Dense -> Sigmoid

2. Pretrained Architectures

I think the best pretrained architecture would be ResNet 34.

My first attempt will be using a simple custom CNN.




In [38]:
import sys
print(sys.executable)


import numpy as np
import matplotlib.pyplot as plt
import keras
# Import only the layers and model class used below instead of wildcards
from keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from keras.models import Sequential



c:\Users\diego\anaconda3\python.exe


Now it's time to load the filtered image folders as datasets and build the model.

In [51]:
import tensorflow as tf

dataset = tf.keras.utils.image_dataset_from_directory(
    'C:/Users/diego/GradSchool/DeepLearningIntro/WeekThree/project_filter',
    labels='inferred',
    label_mode='binary',
    image_size=(96, 96),
    batch_size=32,
    validation_split=0.2,
    subset='training',
    seed=42
)

val_dataset = tf.keras.utils.image_dataset_from_directory(
    'C:/Users/diego/GradSchool/DeepLearningIntro/WeekThree/project_filter',
    labels='inferred',
    label_mode='binary',
    image_size=(96, 96),
    batch_size=32,
    validation_split=0.2,
    subset='validation',
    seed=42
)

Found 50000 files belonging to 2 classes.
Using 40000 files for training.
Found 50000 files belonging to 2 classes.
Using 10000 files for validation.
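
The counts reported above follow directly from `validation_split=0.2` and `batch_size=32`. The arithmetic, as a small sketch in plain Python (no TensorFlow needed):

```python
import math

total_files = 50_000
validation_split = 0.2
batch_size = 32

val_files = int(validation_split * total_files)   # 10000
train_files = total_files - val_files             # 40000

# Batches per epoch; the last batch may be smaller than batch_size.
train_steps = math.ceil(train_files / batch_size)  # 1250
val_steps = math.ceil(val_files / batch_size)      # 313

print(train_files, val_files, train_steps, val_steps)
```

The 1250 training steps match the `1250/1250` progress counter in the training logs.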


In [54]:
model = Sequential()
model.add(Conv2D(32,kernel_size=(3,3),activation ='relu',input_shape = (96,96,3)))
model.add(Conv2D(64,kernel_size=(3,3),activation ='relu'))
model.add(MaxPooling2D(pool_size = (2,2)))
model.add(Dropout(0.25))

model.add(Conv2D(64,(3,3),activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Conv2D(128,(3,3),activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(64,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1,activation = 'sigmoid'))

model.compile(loss = keras.losses.binary_crossentropy, optimizer = 'adam',metrics =['accuracy'])

model.summary()
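
Since the `model.summary()` output is not shown here, the following sketch traces the expected feature-map sizes and parameter counts by hand. It assumes the Keras defaults used above (`padding='valid'` and stride 1 for convolutions, 2x2 pooling with stride 2); the helper names are hypothetical:

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1

def conv_params(in_ch, out_ch, kernel=3):
    """Kernel weights plus one bias per output filter."""
    return kernel * kernel * in_ch * out_ch + out_ch

size, channels, total = 96, 3, 0
# (filters, pool_after) for each Conv2D layer in the model above.
for filters, pool_after in [(32, False), (64, True), (64, True), (128, True)]:
    total += conv_params(channels, filters)
    size, channels = conv_out(size), filters
    if pool_after:
        size //= 2  # 2x2 max pooling halves each spatial dimension
    print(f"after Conv2D({filters}) block: {size}x{size}x{channels}")

flat = size * size * channels   # input width of the first Dense layer
total += flat * 64 + 64         # Dense(64)
total += 64 * 1 + 1             # Dense(1)
print(flat, total)  # 12800 949505
```

Dropout and pooling layers add no parameters. Under these assumptions, roughly 86 percent of the weights sit in the first Dense layer, so that is where a smaller model would save the most.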

In [55]:
history = model.fit(
    dataset,
    validation_data=val_dataset,
    epochs=10
)

Epoch 1/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 231s 183ms/step - accuracy: 0.7018 - loss: 2.3245 - val_accuracy: 0.8010 - val_loss: 0.4413
Epoch 2/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 215s 172ms/step - accuracy: 0.7887 - loss: 0.4798 - val_accuracy: 0.7856 - val_loss: 0.4634
Epoch 3/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 223s 178ms/step - accuracy: 0.7990 - loss: 0.4640 - val_accuracy: 0.8132 - val_loss: 0.4276
Epoch 4/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 217s 173ms/step - accuracy: 0.7917 - loss: 0.4631 - val_accuracy: 0.8072 - val_loss: 0.4358
Epoch 5/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 217s 174ms/step - accuracy: 0.8056 - loss: 0.4495 - val_accuracy: 0.8263 - val_loss: 0.4024
Epoch 6/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 229s 183ms/step - accuracy: 0.8179 - loss: 0.4219 - val_accuracy: 0.8410 - val_loss:

# First Model Evaluation

Training accuracy increased from 70 to 84 percent. Validation accuracy reached 84 percent at its best. Training loss steadily decreased. 

I noticed no major overfitting, but validation accuracy plateaus after epoch 6. There is also a slight gap between training accuracy and validation accuracy.
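
To make these observations concrete, here is a small sketch that summarizes the per-epoch metrics. The lists are copied from the six epochs visible in the log above (the remaining epochs are cut off); in the notebook the same values would come from `history.history`. The `summarize` helper is hypothetical:

```python
def summarize(train_acc, val_acc):
    """Return (best_epoch, best_val_acc, val_minus_train_at_best)."""
    best_idx = max(range(len(val_acc)), key=lambda i: val_acc[i])
    gap = val_acc[best_idx] - train_acc[best_idx]
    return best_idx + 1, val_acc[best_idx], round(gap, 4)

# Values from epochs 1-6 of the first training run.
train_acc = [0.7018, 0.7887, 0.7990, 0.7917, 0.8056, 0.8179]
val_acc = [0.8010, 0.7856, 0.8132, 0.8072, 0.8263, 0.8410]

print(summarize(train_acc, val_acc))  # (6, 0.841, 0.0231)
```

In these six epochs validation accuracy is slightly higher than training accuracy. One likely reason is that Dropout is active while the training metric is computed but disabled during validation.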

Let's try reducing our learning rate after a few epochs!

In [56]:
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=2,
    verbose=1
)
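
To see what this callback does, here is a simplified pure-Python sketch of the reduce-on-plateau rule: if `val_loss` has not improved for `patience` consecutive epochs, the learning rate is multiplied by `factor`. It approximates the Keras behavior rather than reproducing the library's code: it ignores cooldown and `min_lr`, and it assumes a starting learning rate of 0.001 (Adam's default). The `min_delta` value mirrors the callback's default, and the `val_loss` values are the first four epochs of the run that follows:

```python
def simulate_reduce_on_plateau(val_losses, lr=0.001, factor=0.5,
                               patience=2, min_delta=1e-4):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    schedule = []
    for loss in val_losses:
        if loss < best - min_delta:   # counted as an improvement
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:      # plateau: shrink the learning rate
                lr *= factor
                wait = 0
        schedule.append(lr)
    return schedule

val_losses = [0.3553, 0.3368, 0.3508, 0.3571]
print(simulate_reduce_on_plateau(val_losses))
# [0.001, 0.001, 0.001, 0.0005]
```

The best loss occurs at epoch 2, epochs 3 and 4 fail to beat it, and the rate is halved at the end of epoch 4, which is consistent with the `ReduceLROnPlateau` message in the log.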

In [57]:
history = model.fit(
    dataset,
    validation_data=val_dataset,
    epochs=10,
    callbacks=[lr_scheduler]  # Keras documents callbacks as a list
)

Epoch 1/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 221s 176ms/step - accuracy: 0.8355 - loss: 0.3808 - val_accuracy: 0.8610 - val_loss: 0.3553 - learning_rate: 0.0010
Epoch 2/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 209s 167ms/step - accuracy: 0.8434 - loss: 0.3636 - val_accuracy: 0.8529 - val_loss: 0.3368 - learning_rate: 0.0010
Epoch 3/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 213s 170ms/step - accuracy: 0.8447 - loss: 0.3619 - val_accuracy: 0.8451 - val_loss: 0.3508 - learning_rate: 0.0010
Epoch 4/10
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 0s 162ms/step - accuracy: 0.8464 - loss: 0.3609
Epoch 4: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
1250/1250 ━━━━━━━━━━━━━━━━━━━━ 214s 172ms/step - accuracy: 0.8464 - loss: 0.3609 - val_accuracy: 0.8590 - val_loss: 0.3571 - learning_rate: 0.0010
Epoch 5/10
1250/1250 ━━━━━━━━━

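Accuracy improved from 84 to 88 percent, a large improvement over my first run! The learning rate was reduced twice, once at epoch 4 and again at epoch 10.
Validation accuracy stayed more or less the same throughout, and the loss shows better convergence than in the previous run. Note that this second call to `model.fit` continues from the weights learned in the first run, so some of the gain may come from the extra epochs rather than from the schedule alone.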

# Conclusion
My model performed pretty well in the first run, and I got some good improvements by reducing the learning rate when validation loss stopped improving.

Future improvements could include early stopping, to avoid unnecessary epochs once validation loss starts to rise.
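
A minimal sketch of what that could look like, assuming the same `tf.keras` setup used earlier in this notebook. The `patience=3` value and the epoch count in the comment are arbitrary choices for illustration, not values tuned for this dataset:

```python
import tensorflow as tf

# Stop when val_loss has not improved for 3 epochs and roll back to the
# best weights seen so far (the patience value is illustrative).
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True,
)

# It would be passed alongside the learning-rate scheduler, e.g.:
# model.fit(dataset, validation_data=val_dataset, epochs=30,
#           callbacks=[lr_scheduler, early_stop])
```

Keeping the early-stopping patience larger than the scheduler's patience gives the reduced learning rate a chance to help before training is cut off.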

I could also consider fine-tuning a pretrained model, as mentioned earlier in this notebook, specifically ResNet 34, given the number of images I am training on.
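
One caveat: `tf.keras.applications` does not ship a ResNet 34, so the sketch below swaps in ResNet50 as a stand-in; a ResNet 34 would need a third-party implementation. This is only an outline of the transfer-learning setup under that assumption. The `build_transfer_model` helper is hypothetical, and input preprocessing is left out:

```python
import tensorflow as tf

def build_transfer_model(weights="imagenet"):
    """Frozen ResNet50 backbone (stand-in for ResNet 34) with a sigmoid head."""
    base = tf.keras.applications.ResNet50(
        include_top=False,        # drop the 1000-class ImageNet classifier
        weights=weights,          # "imagenet" downloads pretrained weights
        input_shape=(96, 96, 3),
        pooling="avg",            # global average pool: one vector per image
    )
    base.trainable = False        # train only the new head at first

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(96, 96, 3)),
        base,
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Images should first go through
# tf.keras.applications.resnet50.preprocess_input, which is not shown here.
# model = build_transfer_model()
```

After the new head converges, some of the top backbone layers could be unfrozen and trained with a much smaller learning rate.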
