# Welcome to my Histopathologic Cancer Detection Neural Network (Notebook)

### Kaggle code


In [42]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session



### Checking the library versions:

In [43]:
import tensorflow as tf
print("tensorflow:", tf.__version__)

# import keras
# print("keras:", keras.__version__)

# import kerastuner as kt
# print("kerastuner:", kt.__version__)

# import keras_tuner as kt2
# print("keras_tuner:", kt2.__version__)

import platform
print("Python:", platform.python_version())

import numpy as np
print("numpy:", np.__version__)

import pandas as pd
print("pandas:", pd.__version__)

import sklearn
print("sklearn version:", sklearn.__version__)
print("sklearn path:", sklearn.__path__)

import matplotlib
print("matplotlib:", matplotlib.__version__)

import seaborn as sns
print("seaborn:", sns.__version__)

# Tensorflow: 2.15.0
# kerastuner: 1.0.5
# keras_tuner: 1.3.5
# Python: 3.10.11
# numpy: 1.24.3
# pandas: 2.1.4
# sklearn version: 1.2.2
# sklearn path: ['c:\\Users\\Micha\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\sklearn']
# matplotlib: 3.8.2
# seaborn: 0.13.0


tensorflow: 2.15.0
Python: 3.10.11
numpy: 1.24.3
pandas: 2.1.4
sklearn version: 1.2.2
sklearn path: ['c:\\Users\\Micha\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\sklearn']
matplotlib: 3.8.2
seaborn: 0.13.0


### Set a global random seed so that any model we create can be replicated (no run-to-run randomness)

In [44]:
import random
import tensorflow as tf
import numpy as np
import os



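# Seed every random number generator the notebook relies on
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

# Ask TensorFlow to use deterministic op implementations where available.
# Newer TensorFlow releases (2.8+) also provide
# tf.config.experimental.enable_op_determinism() for this purpose.
os.environ['TF_DETERMINISTIC_OPS'] = '1'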

## Image recognition pre-processing practices:

1. **Rescaling**: Image pixel values usually range from 0 to 255. Rescaling these values to a range of 0 to 1 by dividing each pixel by 255 is a common practice. This helps to stabilize and speed up the learning process.

2. **Resizing**: Deep learning models require the input dimensions to be uniform. Resizing all images to a predetermined size is essential, especially if the original dataset contains images of various dimensions.

3. **Normalization**: Beyond just rescaling, you might want to normalize the image data. This can include subtracting the mean and dividing by the standard deviation across each channel. If you have a pre-trained model, you would use the normalization statistics (mean and standard deviation) from the dataset on which the model was trained.

4. **Data Augmentation**: To increase the diversity of your dataset and prevent overfitting, you can apply random transformations like rotation, shifting, flipping, zooming, and shearing. These transformations generate new training samples from the original ones by altering them slightly.

5. **Color Space Conversions**: Sometimes, converting images to different color spaces (e.g., from RGB to grayscale, HSV, LAB, etc.) can help the model learn more robust features, depending on the task.

6. **Image Denoising**: If the images are noisy, applying denoising algorithms can help to remove noise and improve model accuracy.

7. **Edge Detection**: In certain applications, particularly those involving shape analysis, edge detection filters may be applied to highlight the edges within images.

8. **Masking and Cropping**: If there are regions in the images that are not relevant to the analysis, you might want to mask or crop these regions to focus the model on the important parts of the image.

9. **Histogram Equalization**: This can enhance the contrast in images, which can be beneficial if you have a dataset with varying lighting conditions.

10. **Centering and Standardization**: Similar to normalization, centering the data by subtracting the mean image (computed over the training set) and standardizing it by dividing by the standard deviation, so that pixel values have unit variance, can be beneficial.

11. **Handling Class Imbalance**: If your dataset has a class imbalance, techniques such as class weighting, oversampling the minority class, or undersampling the majority class can be considered.
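
For item 11, class weighting can be sketched in a few lines. This is a minimal illustration, not the weighting used in this notebook: `balanced_class_weights` and `toy_labels` are hypothetical names, and the formula (total samples divided by the number of classes times the class count) is the common "balanced" heuristic.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical toy labels: 6 negatives and 4 positives.
toy_labels = [0] * 6 + [1] * 4
weights = balanced_class_weights(toy_labels)

# The rarer class receives the larger weight.
print(weights)
```

With Keras, a dictionary like this could be passed as the `class_weight` argument of `model.fit`, provided its keys match the integer class indices that the data generator assigns.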

In practice, preprocessing steps are often determined experimentally. You might start with a simple preprocessing pipeline (like just rescaling and resizing) and then iteratively add steps that improve your model performance. It's also important to note that if you're using a pre-trained model, you should preprocess your data in the same way the original model was trained.
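
As a rough sketch of such an incremental pipeline, the snippet below starts with plain rescaling and then adds per-channel standardization on top. The function names (`rescale`, `channel_stats`, `standardize`) and the random `train_batch` are assumptions made for illustration; in a real run the statistics would be computed on the training images only and reused for validation and test data.

```python
import numpy as np


def rescale(images):
    """Map uint8 pixel values in [0, 255] to floats in [0, 1]."""
    return images.astype(np.float32) / 255.0


def channel_stats(images):
    """Per-channel mean and standard deviation of a batch shaped (N, H, W, C)."""
    mean = images.mean(axis=(0, 1, 2))
    std = images.std(axis=(0, 1, 2))
    return mean, std


def standardize(images, mean, std, eps=1e-7):
    """Subtract the per-channel mean and divide by the per-channel std."""
    return (images - mean) / (std + eps)


# Hypothetical stand-in for a batch of 96x96 RGB patches.
rng = np.random.default_rng(42)
train_batch = rng.integers(0, 256, size=(8, 96, 96, 3), dtype=np.uint8)

scaled = rescale(train_batch)                 # step 1: rescaling only
mean, std = channel_stats(scaled)             # statistics from training data
normalized = standardize(scaled, mean, std)   # step 2: added standardization

print("scaled range:", scaled.min(), scaled.max())
print("per-channel mean after standardizing:", normalized.mean(axis=(0, 1, 2)))
print("per-channel std after standardizing:", normalized.std(axis=(0, 1, 2)))
```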

### Check to see if each image has the same dimensions since that's important for data preprocessing

In [45]:
# from PIL import Image
# import os

# def check_image_dimensions(directory):
#     image_sizes = set()
#     for img_name in os.listdir(directory):
#         img_path = os.path.join(directory, img_name)
#         with Image.open(img_path) as img:
#             # Get image size
#             size = img.size
#             image_sizes.add(size)
            
#             # # If more than one size is found, we can stop checking
#             # if len(image_sizes) > 1:
#             #     break
    
#     if len(image_sizes) == 1:
#         print(f"For the {directory} directory, all images are of the same dimension: {image_sizes.pop()}")
#     else:
#         print(f"For the {directory} directory, different dimensions found: {image_sizes}")

# # Use it on the train and test data only if this code segment has not already been run in this coding session:
# if 'checked_image_dimensions' not in globals():
#     # Use it on the train and test data:
#     check_image_dimensions('train')
#     check_image_dimensions('test')
#     checked_image_dimensions = True

# # For the train directory, all images are of the same dimension: (96, 96)
# # For the test directory, all images are of the same dimension: (96, 96)


### If needed, call the resize_images function so that every image has the same dimensions. To avoid distorting the images, first make sure all the original images share the same aspect ratio

In [46]:
from PIL import Image
import os
 

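def resize_images(directory, size=(128, 128)):
    """Resize every image in `directory` to `size` (width, height), overwriting the originals."""
    for img_name in os.listdir(directory):
        img_path = os.path.join(directory, img_name)
        with Image.open(img_path) as img:
            # resize() returns a new image and does not preserve the aspect ratio
            new_img = img.resize(size)
        # Save after the source file is closed, since we write to the same path
        new_img.save(img_path)

# Use it on the train and test data if needed, and change the size argument as you need.
# The files are overwritten in place, so keep a backup of the originals.

# resize_images('train', size=(128, 128))
# resize_images('test', size=(128, 128))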
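
To make the aspect-ratio concern concrete, here is a small sketch using plain tuples in the `(width, height)` order that PIL's `Image.size` reports. Both helpers, `has_uniform_aspect_ratio` and `aspect_preserving_size`, are hypothetical names introduced for illustration and are not part of PIL.

```python
def has_uniform_aspect_ratio(sizes, tol=1e-6):
    """True if every (width, height) pair has the same width/height ratio."""
    ratios = [width / height for width, height in sizes]
    return max(ratios) - min(ratios) <= tol


def aspect_preserving_size(original_size, max_size):
    """Largest (width, height) that fits inside max_size without distortion."""
    orig_w, orig_h = original_size
    max_w, max_h = max_size
    scale = min(max_w / orig_w, max_h / orig_h)
    return max(1, round(orig_w * scale)), max(1, round(orig_h * scale))


# Square 96x96 patches scale to 128x128 without distortion.
print(aspect_preserving_size((96, 96), (128, 128)))

# A 200x100 image only fills half the height of a 128x128 box.
print(aspect_preserving_size((200, 100), (128, 128)))

# Mixed ratios signal that a plain resize would distort some images.
print(has_uniform_aspect_ratio([(96, 96), (96, 48)]))
```

If the ratios are not uniform, the size returned by `aspect_preserving_size` could be passed to `img.resize` and the result padded to the target shape, instead of stretching the image.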

### Use Keras ImageDataGenerator() on Train Data
The ImageDataGenerator not only helps you load images from the disk but also allows you to perform **data augmentation**, which is a technique to increase the diversity of your training set by applying random transformations (like rotation, zoom, flips, etc.) to the images. This is very useful to prevent overfitting and helps the model generalize better.

Make sure to change the "target_size" argument of the train_datagen.flow_from_dataframe() function as needed.

In [47]:
import os
import pandas as pd
from tensorflow.keras.preprocessing.image import ImageDataGenerator


# Set the base directory to the current directory
base_dir = ''

# Directories for training and test images
train_dir = os.path.join(base_dir, 'train')
test_dir = os.path.join(base_dir, 'test')

# Load the labels
labels = pd.read_csv(os.path.join(base_dir, 'train_labels.csv'))

# Convert the 'label' column to strings
labels['label'] = labels['label'].astype(str)

# Add the full path to the image files, and create a new column called "path" inside the label dataframe to store these paths to images
labels['path'] = labels['id'].apply(lambda x: os.path.join(train_dir, f"{x}.tif"))

# Save the labels dataframe, including the new "path" column, to labels.csv
labels.to_csv('labels.csv', index=False)

# Creating an instance of the ImageDataGenerator for data augmentation and preprocessing
train_datagen = ImageDataGenerator(
    rescale=1./255,  # Rescale the image pixel values to [0,1]

    # Potential data augmentation techniques that won't affect the 32x32px center
    #brightness_range=[0.8, 1.2], 
    #channel_shift_range=20, 

    # The geometric transformations below are disabled: this project detects tumor tissue
    # in the center 32x32px region, so zooming and similar transformations are not used here.

    # rotation_range=40,  # Random rotations
    # width_shift_range=0.2,  # Random horizontal shifts
    # height_shift_range=0.2,  # Random vertical shifts
    # shear_range=0.2,  # Shear transformations
    # zoom_range=0.2,  # Random zoom
    # horizontal_flip=True,  # Random horizontal flips
    # fill_mode='nearest'  # Strategy for filling in new pixels
)

# Flow from dataframe method to load images using the dataframe
train_generator = train_datagen.flow_from_dataframe(
    dataframe=labels,
    x_col='path',
    y_col='label',
    target_size=(96, 96),  # The dimensions to which all images found will be resized. Change this as needed
    color_mode='rgb',
    class_mode='binary', # means that the labels are binary labels
    batch_size=32,
    shuffle=True, # Shuffling adds randomness (made repeatable by the seed below); turning it off feeds images in file order, which might lead to overfitting. Saving the trained network is the surest way to reproduce results
    seed=42
)
#print(labels['path'].head())
# After setting this up, you can pass train_generator to the fit method of your Keras model
# (fit_generator is deprecated in TensorFlow 2.x), which will load images in batches and train your model on them.





Found 220025 validated image filenames belonging to 2 classes.
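
As a purely geometric check on the augmentation choices in the cell above, the sketch below tests which transformations map the center region onto itself. It assumes the labelled region is the centered 32x32 square of a 96x96 patch (rows and columns 32 to 63). `center_mask` and `keeps_center` are illustrative names, and `np.roll` only stands in for a shift, since it wraps pixels around rather than filling them.

```python
import numpy as np

PATCH = 96    # full patch size used in this notebook
CENTER = 32   # assumed size of the labelled center region


def center_mask(patch=PATCH, center=CENTER):
    """Boolean mask that is True inside the centered square region."""
    start = (patch - center) // 2
    mask = np.zeros((patch, patch), dtype=bool)
    mask[start:start + center, start:start + center] = True
    return mask


mask = center_mask()

# A shift of 19 px is roughly what width_shift_range=0.2 allows on 96 px.
keeps_center = {
    "horizontal flip": np.array_equal(np.fliplr(mask), mask),
    "vertical flip": np.array_equal(np.flipud(mask), mask),
    "90 degree rotation": np.array_equal(np.rot90(mask), mask),
    "19 px horizontal shift": np.array_equal(np.roll(mask, 19, axis=1), mask),
}

for name, unchanged in keeps_center.items():
    print(f"{name}: center region unchanged = {unchanged}")
```

Under these assumptions, flips and 90 degree rotations leave the center region in place while shifts move it. This says nothing about whether any of these augmentations would help the model; that remains an experimental question.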


### If you want, you can check the image shape and see a visualization of the pictures below:

In [48]:
# import matplotlib.pyplot as plt

# # Get a batch of images
# images, labels = next(train_generator)

# # The images should now be a numpy array. Check its shape:
# print(images.shape)  # Should be (batch_size, target_size[0], target_size[1], 3)

# # Plot the first few images
# for i in range(5):  # Change this value to see more images
#     plt.figure(figsize=(5, 5))
#     plt.imshow(images[i])
#     plt.title(f'Label: {labels[i]}') 
#     plt.show()

### Split training data (train_generator) into train, validation, and test sets
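
One possible approach, sketched here under assumptions, is to split the labels table before building the generators, so that each subset gets its own generator. The helper `stratified_split`, the 80/10/10 fractions, and the toy `ids` and `lbls` lists are all hypothetical; the real rows would come from train_labels.csv.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split items into train/validation/test lists, keeping class proportions."""
    assert abs(sum(fractions) - 1.0) < 1e-9, "fractions must sum to 1"
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)

    rng = random.Random(seed)  # local generator, so the split is repeatable
    train, val, test = [], [], []
    for label in sorted(by_class):
        group = by_class[label]
        rng.shuffle(group)
        n_train = int(len(group) * fractions[0])
        n_val = int(len(group) * fractions[1])
        train += [(item, label) for item in group[:n_train]]
        val += [(item, label) for item in group[n_train:n_train + n_val]]
        test += [(item, label) for item in group[n_train + n_val:]]
    return train, val, test


# Hypothetical ids and labels: 40 positives ("1") and 60 negatives ("0").
ids = [f"img_{i}" for i in range(100)]
lbls = ["1" if i % 5 < 2 else "0" for i in range(100)]

train, val, test = stratified_split(ids, lbls)
print(len(train), len(val), len(test))
```

Each resulting list could be turned back into a dataframe and passed to its own `flow_from_dataframe` call. For a two-way split only, `ImageDataGenerator` also accepts a `validation_split` argument that is used together with `subset='training'` or `subset='validation'`.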

### 1. Maybe try grayscale conversion
### 2. Try image cropping to 32x32px or 33x33px, or no image cropping at all
### 3. Try histogram equalization (part of data preprocessing)
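
The three ideas listed above can be sketched with NumPy alone. This is an exploratory illustration, not the preprocessing used in this notebook: the function names are hypothetical, the grayscale weights are the common BT.601 luma coefficients, and the random `patch` merely stands in for a low-contrast 96x96 RGB image.

```python
import numpy as np


def to_grayscale(rgb):
    """Luma-weighted grayscale of an (H, W, 3) uint8 image."""
    weights = np.array([0.299, 0.587, 0.114])
    return np.clip(np.rint(rgb @ weights), 0, 255).astype(np.uint8)


def center_crop(image, size):
    """Cut a size x size window from the middle of an (H, W, ...) image."""
    height, width = image.shape[:2]
    top, left = (height - size) // 2, (width - size) // 2
    return image[top:top + size, left:left + size]


def equalize_histogram(gray):
    """Histogram equalization of a 2-D uint8 image via its cumulative histogram."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    total = cdf[-1]
    if total == cdf_min:  # constant image: nothing to equalize
        return gray.copy()
    lut = np.rint((cdf - cdf_min) / (total - cdf_min) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]  # look up the new value of every pixel


# Hypothetical low-contrast patch with values between 60 and 179.
rng = np.random.default_rng(42)
patch = rng.integers(60, 180, size=(96, 96, 3), dtype=np.uint8)

gray = to_grayscale(patch)
crop = center_crop(gray, 32)          # use 33 to try the 33x33 variant
equalized = equalize_histogram(gray)

print(gray.shape, crop.shape)
print("range before:", gray.min(), gray.max())
print("range after:", equalized.min(), equalized.max())
```

A function of this kind could be applied to each image before training, for example through the `preprocessing_function` argument of `ImageDataGenerator`, as long as it returns an array of the same shape as its input.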