# Welcome to my Histopathologic Cancer Detection Neural Network (Notebook)

### Kaggle code


In [45]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session



### Checking the library versions:

In [46]:
import tensorflow as tf
print("tensorflow:", tf.__version__)

# import keras
# print("keras:", keras.__version__)

# import kerastuner as kt
# print("kerastuner:", kt.__version__)

# import keras_tuner as kt2
# print("keras_tuner:", kt2.__version__)

import platform
print("Python:", platform.python_version())

import numpy as np
print("numpy:", np.__version__)

import pandas as pd
print("pandas:", pd.__version__)

import sklearn
print("sklearn version:", sklearn.__version__)

print("sklearn path:", sklearn.__path__)

import matplotlib
print("matplotlib:", matplotlib.__version__)

import seaborn as sns
print("seaborn:", sns.__version__)

# Tensorflow: 2.15.0
# kerastuner: 1.0.5
# keras_tuner: 1.3.5
# Python: 3.10.11
# numpy: 1.24.3
# pandas: 2.1.4
# sklearn version: 1.2.2
# sklearn path: ['c:\\Users\\Micha\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\sklearn']
# matplotlib: 3.8.2
# seaborn: 0.13.0


tensorflow: 2.15.0
Python: 3.10.11
numpy: 1.24.3
pandas: 2.1.4
sklearn version: 1.2.2
sklearn path: ['c:\\Users\\Micha\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\sklearn']
matplotlib: 3.8.2
seaborn: 0.13.0


### Set global random seeds so that any model we create can be reproduced

In [47]:
import random
import tensorflow as tf
import numpy as np
import os



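# Seed every random number generator the notebook relies on
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

# Ask TensorFlow to prefer deterministic op implementations where available
os.environ['TF_DETERMINISTIC_OPS'] = '1'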
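
As a quick sanity check of what seeding buys you, here is a minimal sketch covering only Python's and NumPy's generators. The helper name `seeded_draws` is hypothetical and not part of the notebook. It re-seeds both generators and confirms that the same seed reproduces the same draws.

```python
import random

import numpy as np


def seeded_draws(seed, n=3):
    """Re-seed Python's and NumPy's global generators, then draw n numbers from each."""
    random.seed(seed)
    np.random.seed(seed)
    python_draws = [random.random() for _ in range(n)]
    numpy_draws = np.random.rand(n).tolist()
    return python_draws, numpy_draws


first = seeded_draws(42)
second = seeded_draws(42)
other = seeded_draws(7)

print(first == second)  # same seed, same sequence
print(first == other)   # different seed, different sequence
```

Seeding only fixes the random number streams. GPU kernels and data-loading order can still introduce run-to-run differences, which is why the cell above also sets `TF_DETERMINISTIC_OPS`.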

## Image recognition pre-processing practices:

1. **Rescaling**: Image pixel values usually range from 0 to 255. Rescaling these values to a range of 0 to 1 by dividing each pixel by 255 is a common practice. This helps to stabilize and speed up the learning process.

2. **Resizing**: Deep learning models require the input dimensions to be uniform. Resizing all images to a predetermined size is essential, especially if the original dataset contains images of various dimensions.

3. **Normalization**: Beyond just rescaling, you might want to normalize the image data. This can include subtracting the mean and dividing by the standard deviation across each channel. If you have a pre-trained model, you would use the normalization statistics (mean and standard deviation) from the dataset on which the model was trained.

4. **Data Augmentation**: To increase the diversity of your dataset and prevent overfitting, you can apply random transformations like rotation, shifting, flipping, zooming, and shearing. These transformations generate new training samples from the original ones by altering them slightly.

5. **Color Space Conversions**: Sometimes, converting images to different color spaces (e.g., from RGB to grayscale, HSV, LAB, etc.) can help the model learn more robust features, depending on the task.

6. **Image Denoising**: If the images are noisy, applying denoising algorithms can help to remove noise and improve model accuracy.

7. **Edge Detection**: In certain applications, particularly those involving shape analysis, edge detection filters may be applied to highlight the edges within images.

8. **Masking and Cropping**: If there are regions in the images that are not relevant to the analysis, you might want to mask or crop these regions to focus the model on the important parts of the image.

9. **Histogram Equalization**: This can enhance the contrast in images, which can be beneficial if you have a dataset with varying lighting conditions.

10. **Centering and Standardization**: Similar to normalization. Center the data by subtracting the mean image (computed over the training set), then standardize by dividing by the standard deviation so that the pixel values have unit variance.

11. **Handling Class Imbalance**: If your dataset has a class imbalance, techniques such as class weighting, oversampling the minority class, or undersampling the majority class can be considered.

In practice, preprocessing steps are often determined experimentally. You might start with a simple preprocessing pipeline (like just rescaling and resizing) and then iteratively add steps that improve your model performance. It's also important to note that if you're using a pre-trained model, you should preprocess your data in the same way the original model was trained.
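
To make items 1, 3, and 10 concrete, here is a minimal NumPy sketch of rescaling followed by per-channel standardization. The function names and the synthetic batch are assumptions made for illustration; the notebook itself only uses `rescale=1./255`.

```python
import numpy as np


def rescale(images):
    """Map 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return images.astype(np.float32) / 255.0


def channel_stats(train_images):
    """Per-channel mean and standard deviation over a (N, H, W, C) training array."""
    mean = train_images.mean(axis=(0, 1, 2))
    std = train_images.std(axis=(0, 1, 2))
    return mean, std


def standardize(images, mean, std, eps=1e-7):
    """Subtract the training mean and divide by the training std, per channel."""
    return (images - mean) / (std + eps)


# Synthetic stand-in for a batch of 96x96 RGB patches (not real data)
rng = np.random.default_rng(42)
fake_train = rng.integers(0, 256, size=(8, 96, 96, 3), dtype=np.uint8)

scaled = rescale(fake_train)
mean, std = channel_stats(scaled)
standardized = standardize(scaled, mean, std)

print(scaled.min() >= 0.0 and scaled.max() <= 1.0)
print(standardized.mean(axis=(0, 1, 2)))  # close to 0 for each channel
print(standardized.std(axis=(0, 1, 2)))   # close to 1 for each channel
```

The statistics should be computed on the training set only and then reused unchanged for the validation and test sets, so that no information leaks from the held-out data.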

### Check to see if each image has the same dimensions since that's important for data preprocessing

In [48]:
# from PIL import Image
# import os

# def check_image_dimensions(directory):
#     image_sizes = set()
#     for img_name in os.listdir(directory):
#         img_path = os.path.join(directory, img_name)
#         with Image.open(img_path) as img:
#             # Get image size
#             size = img.size
#             image_sizes.add(size)
            
#             # # If more than one size is found, we can stop checking
#             # if len(image_sizes) > 1:
#             #     break
    
#     if len(image_sizes) == 1:
#         print(f"For the {directory} directory, all images are of the same dimension: {image_sizes.pop()}")
#     else:
#         print(f"For the {directory} directory, different dimensions found: {image_sizes}")

# # Run this on the train and test data only if this code segment has not already been run in this session:
# if 'checked_image_dimensions' not in globals():
#     # Use it on the train and test data:
#     check_image_dimensions('train')
#     check_image_dimensions('test')
#     checked_image_dimensions = True

# # For the train directory, all images are of the same dimension: (96, 96)
# # For the test directory, all images are of the same dimension: (96, 96)


### If needed, call resize_images() so that every image has the same dimensions. To avoid distorting the images, first check that all the original images share the same aspect ratio

In [49]:
from PIL import Image
import os


def resize_images(directory, size=(128, 128)):
    """Resize every image in `directory` to `size`, overwriting the original files."""
    for img_name in os.listdir(directory):
        img_path = os.path.join(directory, img_name)
        with Image.open(img_path) as img:
            new_img = img.resize(size)
        # Save after the source file is closed, since it is overwritten in place
        new_img.save(img_path)

# Use it on the train and test data if needed, and change the size argument as required:

# resize_images('train', size=(128, 128))
# resize_images('test', size=(128, 128))
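
The aspect-ratio condition mentioned above can be checked before resizing. This is a minimal sketch with hypothetical helper names. It takes sizes as `(width, height)` tuples, the order PIL's `Image.size` uses.

```python
def aspect_ratio(size):
    """Width divided by height for a (width, height) tuple."""
    width, height = size
    return width / height


def resize_would_distort(original_size, target_size, tolerance=1e-3):
    """True if resizing original_size to target_size changes the aspect ratio."""
    return abs(aspect_ratio(original_size) - aspect_ratio(target_size)) > tolerance


print(resize_would_distort((96, 96), (128, 128)))  # False: square to square
print(resize_would_distort((96, 64), (128, 128)))  # True: 3:2 squeezed into a square
```

For this dataset the check is trivial, because every image was found to be 96x96, but it matters as soon as images of mixed shapes are involved.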

### Split the data into training, validation, and test sets
Do this before applying data augmentation with ImageDataGenerator(). Splitting the data into train/validation/test sets after the generators have been created is much harder.

In [50]:
from sklearn.model_selection import train_test_split

# Set the base directory to the current directory
base_dir = ''

# Directory for train
train_dir = os.path.join(base_dir, 'train')

# Load the labels
labels = pd.read_csv(os.path.join(base_dir, 'train_labels.csv'))

# Convert the 'label' column to strings
labels['label'] = labels['label'].astype(str)

# Add the full path to the image files, and create a new column called "path" inside the label dataframe to store these paths to images
labels['path'] = labels['id'].apply(lambda x: os.path.join(train_dir, f"{x}.tif"))

# Split the labels dataframe into train, validation, and test sets in a 70/15/15 ratio:
# first hold out 30% of the rows, then split that 30% evenly into validation and test
train_df, test_df = train_test_split(labels, test_size=0.3, random_state=42)
val_df, test_df = train_test_split(test_df, test_size=0.5, random_state=42)

# Save the DataFrames to CSV files
train_df.to_csv('training_df_labels.csv', index=False)
val_df.to_csv('valid_df_labels.csv', index=False)
test_df.to_csv('testing_df_labels.csv', index=False)

# Now we have a dataframe for train, val, and test which contains the data of their path, label, and id
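
The two-step split can be checked with a little arithmetic. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows, which is my understanding of how `train_test_split` behaves. The helper `split_sizes` is hypothetical, and the total is taken from the counts the generators report later in this notebook.

```python
import math


def split_sizes(n_samples, test_size):
    """Train/test sizes for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# Total taken from the counts the generators report later in this notebook
n_total = 154017 + 33004 + 33004

n_train, n_holdout = split_sizes(n_total, 0.3)
n_val, n_test = split_sizes(n_holdout, 0.5)

print(n_train, n_val, n_test)
print(round(n_train / n_total, 3), round(n_val / n_total, 3), round(n_test / n_total, 3))
```

Holding out 30% first and then halving it is what produces the 70/15/15 ratio: 0.3 x 0.5 = 0.15 of the original rows ends up in each of the validation and test sets.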

### Use Keras ImageDataGenerator() on the train/validation/test split and crop the 32x32px center
ImageDataGenerator not only loads images from disk but also supports **data augmentation**, a technique that increases the diversity of the training set by applying random transformations (rotation, zoom, flips, etc.) to the images. This helps prevent overfitting and helps the model generalize better.

Make sure to change the `target_size` argument of `flow_from_dataframe()` as needed.

In [51]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator


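# Crop an image array to its central 32x32px region. This function is passed to ImageDataGenerator() as preprocessing_function.
# Caveat: Keras runs preprocessing_function after the image has been resized to target_size, so with
# target_size=(32, 32) the input is already 32x32 and this crop returns it unchanged (the image is downscaled, not cropped).
def crop_center(img):
    y, x, _ = img.shape
    startx = x//2 - (32//2)
    starty = y//2 - (32//2)
    return img[starty:starty+32, startx:startx+32, :]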



# Creating an instance of the ImageDataGenerator for data augmentation and preprocessing
train_datagen = ImageDataGenerator(
    rescale=1./255,  # Rescale the image pixel values to [0,1]
    preprocessing_function = crop_center  # call the crop function on each image

    # Potential data augmentation techniques that won't affect the 32x32px center
    #brightness_range=[0.8, 1.2], 
    #channel_shift_range=20, 

    # The transformations below are disabled because this project is about detecting tumor tissue in the center 32x32px region, so zooming, shifting, and similar transformations could move tissue into or out of that region

    # rotation_range=40,  # Random rotations
    # width_shift_range=0.2,  # Random horizontal shifts
    # height_shift_range=0.2,  # Random vertical shifts
    # shear_range=0.2,  # Shear transformations
    # zoom_range=0.2,  # Random zoom
    # horizontal_flip=True,  # Random horizontal flips
    # fill_mode='nearest'  # Strategy for filling in new pixels
)

val_datagen = ImageDataGenerator(rescale = 1./255, preprocessing_function = crop_center)  # call the crop function on each image
test_datagen = ImageDataGenerator(rescale = 1./255, preprocessing_function = crop_center) # call the crop function on each image



# Flow from dataframe method to load images using the dataframe
train_generator = train_datagen.flow_from_dataframe(
    dataframe=train_df, # Use the training dataframe (with labels, id, and paths)
    x_col='path',
    y_col='label',
    target_size=(32, 32),  # The dimensions to which all images found will be resized. Change this as needed
    color_mode='rgb',
    class_mode='binary', # means that the labels are binary labels
    batch_size=32,
    shuffle=True, # Shuffle the training data each epoch. Without shuffling, the fixed batch order can hurt training; with it, runs can differ, so save the trained model to keep an exact copy
    seed=42
)

val_generator = val_datagen.flow_from_dataframe(
    dataframe=val_df, # Use the validation dataframe (with labels, id, and paths)
    x_col='path',
    y_col='label',
    target_size=(32, 32),
    color_mode='rgb',
    class_mode='binary',
    batch_size=32,
    shuffle=False,
    seed=42
)

test_generator = test_datagen.flow_from_dataframe(
    dataframe=test_df, # Use the testing dataframe (with labels, id, and paths)
    x_col='path',
    y_col='label',
    target_size=(32, 32),
    color_mode='rgb',
    class_mode='binary',
    batch_size=32,
    shuffle=False,
    seed=42
)



# After setting this up, you can pass train_generator to the fit() method of your Keras model,
# which will load images in batches and train your model on them.

Found 154017 validated image filenames belonging to 2 classes.
Found 33004 validated image filenames belonging to 2 classes.
Found 33004 validated image filenames belonging to 2 classes.
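
Cropping and resizing are not interchangeable, and the order in which they happen matters. According to the Keras documentation, as I understand it, `preprocessing_function` runs after an image has been resized to `target_size`. The NumPy sketch below uses a synthetic patch (not real data) to show that center-cropping a full 96x96 image keeps the central rows at full resolution, whereas an image that is already 32x32 passes through the same crop unchanged. The strided slice is only a crude stand-in for a real resize.

```python
import numpy as np


def crop_center(img, crop=32):
    """Return the central crop x crop region of an (H, W, C) array."""
    y, x, _ = img.shape
    startx = x // 2 - crop // 2
    starty = y // 2 - crop // 2
    return img[starty:starty + crop, startx:startx + crop, :]


# Synthetic 96x96 RGB patch whose pixel value encodes its row index (not real data)
patch = np.zeros((96, 96, 3), dtype=np.uint8)
patch[:, :, :] = np.arange(96, dtype=np.uint8)[:, None, None]

# Cropping the full-size patch keeps rows 32..63 at full resolution
cropped = crop_center(patch)
print(cropped.shape, cropped[0, 0, 0], cropped[-1, 0, 0])

# An image that is already 32x32 passes through unchanged
small = patch[::3, ::3, :]  # crude stand-in for a 96 -> 32 resize
print(np.array_equal(crop_center(small), small))
```

If the goal is a true center crop, one option is to crop the arrays before they reach the model (for example by loading at 96x96 and cropping each batch), rather than relying on `target_size` to do it.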


### If you want, you can check the image shape and see a visualization of the pictures below:

In [52]:
# import matplotlib.pyplot as plt

# # Get a batch of images
# images, labels = next(train_generator)

# # The images should now be a numpy array. Check its shape:
# print(images.shape)  # Should be (batch_size, target_size[0], target_size[1], 3)

# # Plot the first few images
# for i in range(5):  # Change this value to see more images
#     plt.figure(figsize=(5, 5))
#     plt.imshow(images[i])
#     plt.title(f'Label: {labels[i]}') 
#     plt.show()

### Here is another function that crops images using PIL. It has to be called manually on each image, so I used the ImageDataGenerator() approach instead

In [53]:
from PIL import Image
import matplotlib.pyplot as plt

# Named crop_center_pil so that it does not overwrite the NumPy-based crop_center() used by the generators
def crop_center_pil(img):
    width, height = img.size
    new_width, new_height = 32, 32

    # Integer pixel coordinates of the centered crop box
    left = (width - new_width) // 2
    top = (height - new_height) // 2
    right = left + new_width
    bottom = top + new_height

    return img.crop((left, top, right, bottom))


# The code below lets you see the cropped images

# # Get a batch of images
# images, labels = next(train_generator)

# # The images should now be a numpy array. Check its shape:
# print(images.shape)  # Should be (batch_size, target_size[0], target_size[1], 3)

# # Crop and plot the first few images
# for i in range(5):  # Change this value to see more images
#     img = Image.fromarray((images[i] * 255).astype(np.uint8))  # Convert to PIL Image
#     cropped_img = crop_center_pil(img)
#     plt.figure(figsize=(5, 5))
#     plt.imshow(cropped_img)
#     plt.title(f'Label: {labels[i]}')
#     plt.show()


### Use ImageDataGenerator on the actual test data (from the test directory, not the testing data from the train/valid/test split) 

In [56]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Set the base directory to the current directory
base_dir = ''

# Directory for test
test_dir = os.path.join(base_dir, 'test')

# Get the list of test image filenames
test_filenames = os.listdir(test_dir)

# Create a DataFrame with 'id' and 'path' columns
df_test = pd.DataFrame({
    'id': [filename.split('.')[0] for filename in test_filenames],
    'path': [os.path.join(test_dir, filename) for filename in test_filenames]
})



# Center-crop helper for ImageDataGenerator(), repeated here so that this cell can run on its own.
# It crops an image array to its central 32x32px region.
def crop_center(img):
    y, x, _ = img.shape
    startx = x//2 - (32//2)
    starty = y//2 - (32//2)
    return img[starty:starty+32, startx:startx+32, :]

# Create a data generator for the test data
real_test_datagen = ImageDataGenerator(rescale=1./255, preprocessing_function = crop_center)

real_test_generator = real_test_datagen.flow_from_dataframe(
        dataframe = df_test,
        x_col="path",
        y_col=None,  # We don't have labels for the test data
        target_size=(32, 32),
        batch_size=32, # Change the batch size as needed
        class_mode=None,  # We don't have labels for the test data
        color_mode = "rgb",
        shuffle=False)

Found 57458 validated image filenames.


## Now, it's time to create my first model. This is Model 1

In [54]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# 1. Define the model architecture
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])

#2. Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

#3. Fit the model (fit() accepts generators directly; fit_generator() is deprecated)
model.fit(train_generator, validation_data=val_generator, epochs=10)

#4. Evaluate the model
loss, accuracy = model.evaluate(test_generator)
print('Test accuracy:', accuracy)

#Took almost 23 min

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test accuracy: 0.8527148365974426


1. `Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3))`: This line creates a 2D convolution layer. Convolution layers are the major building blocks used in convolutional neural networks. A convolution layer transforms an input volume into an output volume of different size, as specified by the parameters of the layer. In this case, the layer will output 32 different feature maps, each one representing a different learned feature. The `(3, 3)` parameter specifies the size of the filters that will be learned, and `relu` is the activation function that will be applied element-wise to the output. The `input_shape=(32, 32, 3)` parameter specifies the shape of the input data: images of size 32x32 pixels with 3 color channels (red, green, blue).

2. `MaxPooling2D((2, 2))`: This line creates a max pooling layer, which is used to reduce the spatial dimensions of the output volume from the previous layer. It does this by taking the maximum value over a 2x2 window. This helps to make the model more translation invariant and to reduce computation.

3. `Conv2D(64, (3, 3), activation='relu')`: This is another convolution layer, similar to the first one. This layer will learn 64 filters. The size of the filters is again 3x3 pixels, and the activation function is ReLU.

4. `MaxPooling2D((2, 2))`: This is another max pooling layer, similar to the first one. It again reduces the spatial dimensions of the output volume from the previous layer.

5. `Flatten()`: This layer flattens the output from the previous layer into a one-dimensional vector. This is necessary because the next layer (a dense layer) expects its input to be a vector, not a multi-dimensional array.

6. `Dense(64, activation='relu')`: This is a fully connected layer, also known as a dense layer. Each neuron in a dense layer receives input from all the neurons in the previous layer, hence they are "fully connected". This layer has 64 neurons and uses the ReLU activation function.

7. `Dense(1, activation='sigmoid')`: This is the output layer of the model. It's another dense layer, and it has just one neuron because this is a binary classification problem (assuming your labels are 0 and 1). The sigmoid activation function is used to squash the output of the neuron to a value between 0 and 1, representing the probability that the image belongs to class 1.
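
The layer descriptions above can be turned into a worked calculation of the feature-map sizes and the number of trainable parameters. This sketch assumes the Keras defaults the model relies on: `'valid'` padding and stride 1 for `Conv2D`, and a pooling stride equal to the pool size for `MaxPooling2D`. The helper names are hypothetical.

```python
def conv2d_out(size, kernel=3):
    """Output side length of a 'valid' convolution with stride 1."""
    return size - kernel + 1


def conv2d_params(in_channels, filters, kernel=3):
    """Weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units


side = 32
side = conv2d_out(side)   # 30
side = side // 2          # 15 after 2x2 max pooling
side = conv2d_out(side)   # 13
side = side // 2          # 6 after 2x2 max pooling
flat = side * side * 64   # 2304 values after Flatten()

total = (
    conv2d_params(3, 32)      # 896
    + conv2d_params(32, 64)   # 18496
    + dense_params(flat, 64)  # 147520
    + dense_params(64, 1)     # 65
)
print(flat, total)  # 2304 166977
```

Almost 90% of the parameters sit in the first dense layer, which is typical for small convolutional networks that flatten their feature maps. The figures can be compared with the output of `model.summary()`.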

### Submitting Model 1

In [57]:
import pandas as pd

# Make predictions
predictions = model.predict(real_test_generator, steps=len(real_test_generator), verbose=1)

# Get filenames (ordered list of image file names)
filenames = real_test_generator.filenames
print(filenames[:5])

# Convert the predicted probabilities into class labels using a 0.5 threshold
# predictions has shape (n_samples, 1), so flatten it before comparing
predicted_classes = (predictions.ravel() > 0.5).astype(int)

# Create a DataFrame with filenames and predicted classes
submission_df = pd.DataFrame({
    'id': [os.path.basename(filename).split('.')[0] for filename in filenames],  # Extract the id from the filename
    'label': predicted_classes
})

# Save DataFrame to csv
submission_df.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

# Took 51 seconds

['test\\00006537328c33e284c973d7b39d340809f7271b.tif', 'test\\0000ec92553fda4ce39889f9226ace43cae3364e.tif', 'test\\00024a6dee61f12f7856b0fc6be20bc7a48ba3d2.tif', 'test\\000253dfaa0be9d0d100283b22284ab2f6b643f6.tif', 'test\\000270442cc15af719583a8172c87cd2bd9c7746.tif']
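
The filenames printed above use Windows path separators, so `os.path.basename` only splits them correctly when the notebook runs on Windows. A platform-independent way to recover the ids, sketched here with hypothetical helper names and made-up probabilities, is `pathlib.PureWindowsPath`, which understands both `\` and `/`.

```python
from pathlib import PureWindowsPath


def image_id(path):
    """Extract the image id (file name without extension) from a Windows-style path."""
    return PureWindowsPath(path).stem


def to_labels(probabilities, threshold=0.5):
    """Turn sigmoid outputs into 0/1 class labels."""
    return [1 if p > threshold else 0 for p in probabilities]


filenames = [
    'test\\00006537328c33e284c973d7b39d340809f7271b.tif',
    'test\\0000ec92553fda4ce39889f9226ace43cae3364e.tif',
]
probabilities = [0.91, 0.12]  # made-up values for illustration

print([image_id(f) for f in filenames])
print(to_labels(probabilities))  # [1, 0]
```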


## To do: use fit() instead of the deprecated fit_generator() in all future models
## To do: learn how to save the model and its hyperparameters, and how to print the model's most important information, as I did for the Titanic

### 1. Maybe try grayscale conversion
### 2. Try cropping to 32x32px or 33x33px, or no cropping at all
### 3. Try histogram equalization (part of data preprocessing)
### 4. Find a way to make the images less blurry, or to avoid losing any pixels, since each pixel is important
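
As a starting point for ideas 1 and 3, here is a minimal NumPy sketch of grayscale conversion and histogram equalization. It is an illustration under stated assumptions, not something the notebook has tried yet: the function names are hypothetical, the grayscale weights are the common BT.601 luma coefficients, and the patch is synthetic.

```python
import numpy as np


def to_grayscale(rgb):
    """Convert an (H, W, 3) uint8 RGB array to (H, W) uint8 using BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return np.round(rgb.astype(np.float64) @ weights).astype(np.uint8)


def equalize_histogram(gray):
    """Spread the intensity values of an (H, W) uint8 image over the full 0..255 range."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    denom = gray.size - cdf_min
    if denom == 0:  # constant image: nothing to equalize
        return gray.copy()
    lut = np.clip(np.round((cdf - cdf_min) / denom * 255), 0, 255).astype(np.uint8)
    return lut[gray]


# Synthetic low-contrast RGB patch (not real data)
rng = np.random.default_rng(42)
patch = rng.integers(100, 140, size=(96, 96, 3), dtype=np.uint8)

gray = to_grayscale(patch)
equalized = equalize_histogram(gray)
print(gray.min(), gray.max())            # narrow range inside 100..139
print(equalized.min(), equalized.max())  # stretched to 0 and 255
```

Whether either step helps is an empirical question. Stain color may carry useful signal in histopathology images, so a grayscale model should be compared against the RGB baseline rather than assumed to be better.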