# Data Augmentation

The dataset consists of 253 brain MRI images split across two folders, 'yes' and 'no'. The 'yes' folder contains 155 images that are tumorous, and the 'no' folder contains 98 images that are non-tumorous.

Since this is a small dataset, I used data augmentation to create more images.

Data augmentation can also reduce the class imbalance (61% of the images belong to the tumorous class) by generating more new images for the smaller class than for the larger one.

## Import Necessary Modules

In [1]:
import time
from os import listdir

import cv2
import tensorflow as tf
from keras.preprocessing.image import ImageDataGenerator

In [2]:
def hms_string(sec_elapsed):
    """Format a duration in seconds as 'h:m:s', with seconds rounded to one decimal."""
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return f"{h}:{m}:{round(s,1)}"
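
The helper above prints minutes and seconds without padding, so 2 minutes 35.4 seconds comes out as `0:2:35.4`. If a fixed-width format is preferred, a minimal sketch using `divmod` could look like the following. The name `hms_padded` is hypothetical and not part of the original notebook.

```python
def hms_padded(sec_elapsed):
    """Format seconds as H:MM:SS.s (hypothetical zero-padded variant)."""
    # Split into whole minutes and the remaining (fractional) seconds
    minutes, seconds = divmod(sec_elapsed, 60)
    # Split the whole minutes into hours and minutes
    hours, minutes = divmod(int(minutes), 60)
    return f"{hours}:{minutes:02d}:{seconds:04.1f}"


print(hms_padded(155.4))  # 0:02:35.4
```

The rest of the notebook keeps using `hms_string`, so the recorded output format is unchanged.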

In [3]:
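def augment_data(file_dir, n_generated_samples, save_to_dir):
    """Write randomly augmented copies of every image in file_dir to save_to_dir."""
    data_gen = ImageDataGenerator(rotation_range=10,
                                  width_shift_range=0.1,
                                  height_shift_range=0.1,
                                  shear_range=0.1,
                                  brightness_range=(0.3, 1.0),
                                  horizontal_flip=True,
                                  vertical_flip=True,
                                  fill_mode='nearest'
                                 )

    for filename in listdir(file_dir):
        # '/' works as a path separator on Windows as well as Linux and macOS
        image = cv2.imread(file_dir + '/' + filename)
        if image is None:
            # cv2.imread returns None for files it cannot decode; skip them
            continue

        # flow() expects a batch, so add a leading axis: (1, height, width, channels)
        image = image.reshape((1,) + image.shape)

        # drop the extension, whatever its length ('.jpg', '.jpeg', ...)
        save_prefix = 'aug_' + filename.rsplit('.', 1)[0]

        # flow() yields forever and saves one image per batch, so stop manually.
        # The check runs after the increment with '>', so this loop saves
        # n_generated_samples + 1 images per source image.
        i = 0
        for batch in data_gen.flow(x=image, batch_size=1, save_to_dir=save_to_dir,
                                   save_prefix=save_prefix, save_format='jpg'):
            i += 1
            if i > n_generated_samples:
                break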
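
The stopping condition in the inner loop is easy to misread, so here is a small pure-Python sketch of it. It assumes, as `save_to_dir` does in Keras, that one file is written each time the generator yields a batch. `fake_flow` and `images_written` are hypothetical stand-ins, not part of the notebook.

```python
from itertools import count


def fake_flow():
    """Stand-in for data_gen.flow(...): an endless generator, one 'saved image' per step."""
    for k in count():
        yield k


def images_written(n_generated_samples):
    """Replay the notebook's loop and count how many batches are produced."""
    written = 0
    i = 0
    for _ in fake_flow():
        written += 1  # a file is saved as soon as the batch is yielded
        i += 1
        if i > n_generated_samples:
            break
    return written


print(images_written(6), images_written(9))  # 7 10
```

With `n_generated_samples=6` the loop produces 7 images per source image, and with 9 it produces 10. Changing the test to `i >= n_generated_samples` would make the count match the argument exactly.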

61% of the data (155 images) is tumorous, and 39% of the data (98 images) is non-tumorous.
So, to balance the data, we can generate 9 new images for every image that belongs to the 'no' class and 6 new images for every image that belongs to the 'yes' class.
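
The arithmetic behind these two numbers can be checked with a short sketch. It assumes each 'yes' image ends up contributing 7 images in total and each 'no' image 10 (the 6 or 9 new ones plus one more); `class_balance` is a hypothetical helper, not part of the notebook.

```python
def class_balance(n_yes, n_no, per_yes, per_no):
    """Return class totals and the tumorous share (in percent) after augmentation."""
    total_yes = n_yes * per_yes
    total_no = n_no * per_no
    total = total_yes + total_no
    return total_yes, total_no, 100.0 * total_yes / total


# 155 tumorous and 98 non-tumorous source images, as in the dataset description
yes_total, no_total, yes_pct = class_balance(155, 98, per_yes=7, per_no=10)
print(yes_total, no_total, round(yes_pct, 1))  # 1085 980 52.5
```

Under that assumption the tumorous share drops from 61% to roughly 52.5%, which is close to balanced.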

In [4]:
start_time = time.time()

augmented_data_path = 'augmented data/'

yes_path = 'yes'
no_path = 'no'

augment_data(file_dir=yes_path, n_generated_samples=6, save_to_dir=augmented_data_path+'yes')

augment_data(file_dir=no_path, n_generated_samples=9, save_to_dir=augmented_data_path+'no')

end_time = time.time()
execution_time = (end_time - start_time)
print(f"Elapsed time: {hms_string(execution_time)}")

Elapsed time: 0:2:35.4


Number of tumorous and non-tumorous examples after data augmentation:

In [5]:
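def data_summary(main_path):
    """Print the number and percentage of examples in each class folder.

    main_path must end with a path separator, e.g. 'augmented data/'.
    """
    yes_path = main_path + 'yes'
    no_path = main_path + 'no'

    # number of positive (tumorous) and negative (non-tumorous) examples
    m_pos = len(listdir(yes_path))
    m_neg = len(listdir(no_path))
    m = m_pos + m_neg

    pos_perc = (m_pos * 100.0) / m
    neg_perc = (m_neg * 100.0) / m

    print(f"Number of examples: {m}")
    print(f"Percentage of positive examples: {pos_perc}%, number of pos examples: {m_pos}")
    print(f"Percentage of negative examples: {neg_perc}%, number of neg examples: {m_neg}")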
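
`listdir` counts every entry in a folder, including hidden or non-image files. If that is a concern, one possible variant counts only files with an image extension. This is a sketch: `count_images` and the `IMAGE_EXTS` set are assumptions, and the demo runs on a temporary folder with made-up file names rather than the real data.

```python
import tempfile
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions


def count_images(folder):
    """Count regular files in folder whose extension looks like an image."""
    return sum(
        1
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTS
    )


# Demo on a throwaway directory: two images plus one hidden non-image file
with tempfile.TemporaryDirectory() as tmp:
    yes_dir = Path(tmp) / "yes"
    yes_dir.mkdir()
    for name in ("aug_Y1_0_1.jpg", "aug_Y1_0_2.jpg", ".DS_Store"):
        (yes_dir / name).touch()
    demo_count = count_images(yes_dir)

print(demo_count)  # 2
```

For folders that contain only the generated `.jpg` files, this gives the same counts as `len(listdir(...))`.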

In [6]:
data_summary(augmented_data_path)

Number of examples: 2064
Percentage of positive examples: 52.51937984496124%, number of pos examples: 1084
Percentage of negative examples: 47.48062015503876%, number of neg examples: 980
