# CMPE351 - Group 6 - Brain Tumor MRI Scan Data Augmentation
##### Authors: Madison Boem, Caroline Kim, Eric Venditti, Jade Watson

Given the original dataset (can be found [here](https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection)), we decided to generate more images using data augmentation techniques. This will help improve the accuracy of our CNN and increase its robustness.

This will also help to solve the slight imbalance of yes/no tumorous scans. Currently, 61% of the data is non-tumorous.

## Importing Libraries

In [2]:
from keras.preprocessing.image import ImageDataGenerator
import cv2
from os import listdir
import time
%matplotlib inline

In [3]:
'''
Function to format time
Inputs: seconds elapsed
Outputs: none (just prints formatted timestring)
'''
def hms_string(secElapsed):
    hour = int(secElapsed / (60 * 60))
    minute = int((secElapsed % (60 * 60)) / 60)
    second = secElapsed % 60
    return f"{hour}:{minute}:{round(second,1)}"

Create a function for augmenting data. Thankfully, tensorflow has a built in function and makes this process painless and incredibly modular.

Expressed Terminology:
- rotation_range: range for random rotations (in degrees)
- width_shift_range: shift, fraction of total width
- height_shift_range: shift, fraction of total height
- shear_range: shear angle in counter-clockwise direction in degrees
- brightness_range: range for randomly choosing a brightness shift
- horizonal_flip: randomly flips an image horizontally
- vertical_flip: randomly flips an image vertically
- fill_mode: how to fill points outside the input, but within the image boundary


In [4]:
'''
Function to apply data augmentation
Inputs: original image directory, number of samples to generate for a given image, directory where the augmented images will be saved
Outputs: none (just prints formatted timestring)
'''
def augment_data(file_dir, n_generated_samples, save_to_dir):

    # Specify how the images will be augmented
    data_gen = ImageDataGenerator(rotation_range=10, 
                                  width_shift_range=0.1, 
                                  height_shift_range=0.1, 
                                  shear_range=0.1, 
                                  brightness_range=(0.3, 1.0),
                                  horizontal_flip=True, 
                                  vertical_flip=True, 
                                  fill_mode='nearest'
                                 )

    # Loop through all files and apply image transformation
    for filename in listdir(file_dir):
        # load
        image = cv2.imread(file_dir + '\\' + filename)
        # reshape
        image = image.reshape((1,)+image.shape)
        # add a prefix for the augmented images
        save_prefix = 'aug_' + filename[:-4]
        # generate 'n_generated_samples' sample images
        i = 0
        for batch in data_gen.flow(x=image, batch_size=1, save_to_dir=save_to_dir, save_prefix=save_prefix, save_format='jpg'):
            i += 1
            if(i > n_generated_samples):
                break

Now that the augmentation technique is defined, we must decide how many images to generate.

Given that 61% of the scans are tumorous (and 39% are healthy):
- For each healthy scan, we will create 9 new augmented images
- For each tumerous scan, we will create 6 new augmented images

This will balance the two classes and create lots of meaningful data.

In [5]:
# Time how long the augmentation process takes
start_time = time.time()

# Path to where augmented images will be stored
augmented_data_path = r'C:\Users\jadew\Documents\Winter2022\CMPE351\FinalProject\MohamedAliHabib\augmented'

# Path to original yes/no folders
yes_path = r'C:\Users\jadew\Documents\Winter2022\CMPE351\FinalProject\MohamedAliHabib\yes'
no_path = r'C:\Users\jadew\Documents\Winter2022\CMPE351\FinalProject\MohamedAliHabib\no'

# Augment data tumorous scans
augment_data(file_dir=yes_path, n_generated_samples=6, save_to_dir=augmented_data_path+r'\yes')
# Augment data healthy scans
augment_data(file_dir=no_path, n_generated_samples=9, save_to_dir=augmented_data_path+r'\no')

# Stop the timer and display execution time
end_time = time.time()
execution_time = (end_time - start_time)
print(f"Elapsed time: {hms_string(execution_time)}")

Elapsed time: 0:1:30.2


Confirm that the images have been generated.

In [8]:
# Number of tumorous scans (original + augmented)
m_pos = len(listdir(augmented_data_path + r'\yes'))
# Number of healthy scans (original + augmented)
m_neg = len(listdir(augmented_data_path + r'\no'))
# All scans
m = (m_pos+m_neg)

# Calculate the percentage of each
pos_prec = (m_pos* 100.0)/ m
neg_prec = (m_neg* 100.0)/ m

print(f"Number of examples: {m}")
print(f"Percentage of tumorous scans: {pos_prec}%")
print(f"Number of tumorous scans: {m_pos}")
print(f"Percentage of healthy scans: {neg_prec}%")
print(f"Number of healthy scans: {m_neg}")

Number of examples: 4379
Percentage of tumorous scans: 53.02580497830555%
Number of tumorous scans: 2322
Percentage of healthy scans: 46.97419502169445%
Number of healthy scans: 2057


Now, we can use these images to train the CNN!