# Data Augmentation

The dataset we’re using is relatively small, containing only 253 images divided into two categories:
- **yes:** Images with a brain tumor
- **no:** Images without a brain tumor

To improve the performance of our Convolutional Neural Network (CNN) model and prevent overfitting, we will employ **data augmentation** techniques. Data augmentation artificially increases the size of our training dataset by applying various transformations to the existing images. This approach helps the model generalize better by learning from a more diverse set of training examples.

The following augmentation techniques are used:

- **Rotation:** ±10 degrees
- **Width Shift:** ±10% horizontally
- **Height Shift:** ±10% vertically
- **Shear Transformation:** Shear intensity of up to 0.1 (Keras interprets `shear_range` as an angle in degrees, not a percentage)
- **Brightness Adjustment:** Scaling brightness between 30% and 100%
- **Horizontal Flips and Vertical Flips:** Random flipping of images
- **Fill Mode:** Filling empty areas created by transformations with nearest pixel values
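
To make a few of these transformations concrete, here is a minimal NumPy sketch of flipping and brightness scaling on a tiny array. It is only an illustration: the function names are hypothetical, and Keras applies brightness through its own image utilities, so the multiply-and-clip below is an approximation rather than the library's implementation.

```python
import numpy as np

def flip_horizontal(image):
    """Mirror an (H, W, C) image left to right."""
    return image[:, ::-1, :]

def flip_vertical(image):
    """Mirror an (H, W, C) image top to bottom."""
    return image[::-1, :, :]

def scale_brightness(image, factor):
    """Multiply pixel values by `factor` and clip to the valid 0-255 range."""
    scaled = image.astype(np.float32) * factor
    return np.clip(scaled, 0, 255).astype(np.uint8)

# Tiny 2x2 single-channel "image" used only for illustration
demo = np.array([[[10], [200]],
                 [[50], [100]]], dtype=np.uint8)

flipped_h = flip_horizontal(demo)       # columns swapped
flipped_v = flip_vertical(demo)         # rows swapped
darker = scale_brightness(demo, 0.3)    # lower end of the 0.3-1.0 range
```

In the actual pipeline, `ImageDataGenerator` samples a random value for each transformation on every generated image, so no two augmented copies need be identical.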



In [4]:
import os
import time
import numpy as np
import cv2
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import callbacks, models, preprocessing
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint


In [None]:
def augment_data(file_dir, n_generated_samples, save_to_dir):
    """
    Augments images in the specified directory and saves them to a target directory.

    Parameters:
        file_dir (str): Directory containing the original images.
        n_generated_samples (int): Number of augmented images to generate per original image.
        save_to_dir (str): Directory to save the augmented images.
    """
    # Define the data augmentation parameters
    data_gen = ImageDataGenerator(
        rotation_range=10,          # Random rotation between -10 and 10 degrees
        width_shift_range=0.1,      # Random horizontal shift
        height_shift_range=0.1,     # Random vertical shift
        shear_range=0.1,            # Shear transformations
        brightness_range=(0.3, 1.0),# Random brightness adjustments
        horizontal_flip=True,       # Randomly flip images horizontally
        vertical_flip=True,         # Randomly flip images vertically
        fill_mode='nearest'         # Fill mode for newly created pixels
    )

    # Ensure the save directory exists
    os.makedirs(save_to_dir, exist_ok=True)

    for filename in os.listdir(file_dir):
        # Construct the full file path
        file_path = os.path.join(file_dir, filename)
        
        # Load the image
        image = cv2.imread(file_path)
        if image is None:
            print(f"Warning: Couldn't load image {file_path}. Skipping.")
            continue
        
        # Reshape the image to add an extra dimension (batch size)
        image = image.reshape((1,) + image.shape)
        
        # Create a prefix for the augmented images
        save_prefix = 'aug_' + os.path.splitext(filename)[0]
        
        # Generate augmented images
        total = 0
        for batch in data_gen.flow(
            x=image,
            batch_size=1,
            save_to_dir=save_to_dir,
            save_prefix=save_prefix,
            save_format='jpg'
        ):
            total += 1
            if total >= n_generated_samples:
                break

Our dataset is imbalanced, containing approximately **61%** tumorous images and 39% non-tumorous images. An imbalanced dataset can lead to a biased model that performs poorly on the minority class (in this case, non-tumorous images). To address this issue and improve the model’s reliability on both classes, we need to balance the dataset.

To achieve a balanced dataset, we augment the two classes at different rates. Specifically, we generate **6 augmented images for every tumorous image** and **9 augmented images for every non-tumorous image**. Because the smaller class receives more augmented copies per original, both classes end up with a comparable number of samples, which helps the model learn equally from both categories and improves overall performance.

By augmenting the minority class more heavily, we enhance its diversity and representation without excessively oversampling it. This balanced approach contributes to building a more robust and less biased Convolutional Neural Network (CNN) model for brain tumor detection.

***Note:** The data augmentation techniques applied to both classes include rotations, shifts, shearing, brightness adjustments, and flips. These transformations create varied versions of the original images, enriching the dataset and helping the model generalize better to unseen data.*
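
The effect of the two per-image rates (6 and 9) can be checked with simple arithmetic. The sketch below assumes the 253 originals split into roughly 155 and 98 images (about 61% and 39%); these counts and the helper name are assumptions for illustration, so substitute the real directory counts.

```python
def augmented_counts(n_majority, n_minority, per_majority, per_minority):
    """Return the number of generated images per class and each class's share."""
    gen_majority = n_majority * per_majority
    gen_minority = n_minority * per_minority
    total = gen_majority + gen_minority
    return gen_majority, gen_minority, gen_majority / total, gen_minority / total

# Assumed original class sizes (about 61% and 39% of 253 images)
gen_major, gen_minor, share_major, share_minor = augmented_counts(
    n_majority=155, n_minority=98, per_majority=6, per_minority=9
)
print(f"Larger class:  {gen_major} images ({share_major:.1%})")
print(f"Smaller class: {gen_minor} images ({share_minor:.1%})")
```

Under these assumed counts the generated images split close to evenly, which is the point of using a higher rate for the smaller class.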

In [None]:
# Define base directory
base_dir = '/content/drive/MyDrive'

# Define paths
augmented_data_dir = os.path.join(base_dir, 'augmented_data')
yes_dir = os.path.join(base_dir, 'yes')
no_dir = os.path.join(base_dir, 'no')

augmented_yes_dir = os.path.join(augmented_data_dir, 'yes')
augmented_no_dir = os.path.join(augmented_data_dir, 'no')

# Ensure augmented data directories exist
os.makedirs(augmented_yes_dir, exist_ok=True)
os.makedirs(augmented_no_dir, exist_ok=True)

# Augment data for tumorous examples ('yes' label)
augment_data(
    file_dir=yes_dir,
    n_generated_samples=6,
    save_to_dir=augmented_yes_dir
)

# Augment data for non-tumorous examples ('no' label)
augment_data(
    file_dir=no_dir,
    n_generated_samples=9,
    save_to_dir=augmented_no_dir
)

# Examining the Dataset After Augmentation

After performing data augmentation, let’s examine the number of tumorous and non-tumorous examples now present in our dataset:
- **Tumorous Examples:** The total number of images labeled as “yes” (tumorous).
- **Non-Tumorous Examples:** The total number of images labeled as “no” (non-tumorous).

By comparing these counts, we can verify that our dataset is balanced and ready for training our Convolutional Neural Network (CNN) model.

***Note:** It’s important to ensure that the dataset is balanced after augmentation to prevent the model from becoming biased toward one class. A balanced dataset helps the model learn equally from both classes, improving its ability to accurately classify new, unseen images.*

In [None]:
# Define the augmented data directory
augmented_data_dir = '/content/drive/MyDrive/augmented_data/'

# Define paths for tumorous ('yes') and non-tumorous ('no') images
yes_dir = os.path.join(augmented_data_dir, 'yes')
no_dir = os.path.join(augmented_data_dir, 'no')

# Count the number of images in each category
num_tumorous = len(os.listdir(yes_dir))
num_non_tumorous = len(os.listdir(no_dir))

# Calculate total number of images
total_images = num_tumorous + num_non_tumorous

# Calculate percentages
percent_tumorous = (num_tumorous / total_images) * 100
percent_non_tumorous = (num_non_tumorous / total_images) * 100

# Display the results
print("Percentage of tumorous examples: {:.2f}%".format(percent_tumorous))
print("Percentage of non-tumorous examples: {:.2f}%".format(percent_non_tumorous))
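
`os.listdir` counts every entry in a folder, including any hidden or non-image files that happen to be present. A minimal sketch of a stricter count follows; the helper name and the extension set are assumptions, not part of the original notebook.

```python
import os
import tempfile

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed set of image formats

def count_images(directory):
    """Count regular files in `directory` whose extension looks like an image."""
    return sum(
        1
        for name in os.listdir(directory)
        if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
        and os.path.isfile(os.path.join(directory, name))
    )

# Self-contained demonstration on a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ["aug_a_0_1.jpg", "aug_b_0_2.JPG", ".DS_Store", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    demo_count = count_images(tmp)  # only the two JPEG files are counted
```

Replacing `len(os.listdir(...))` with a helper like this keeps stray files from skewing the reported percentages.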

# Data Preprocessing and Preparation


Before training our neural network, it’s crucial to preprocess the data to enhance the model’s performance. One effective preprocessing step is to crop the images to focus exclusively on the brain region, eliminating any irrelevant background and noise.

We will accomplish this using the *crop_image()* function. This function detects the extreme points in the contours of the brain within each image and crops the image accordingly to isolate the area of interest.


**Reference:** [Finding Extreme Points in Contours using OpenCV](https://pyimagesearch.com/2016/04/11/finding-extreme-points-in-contours-with-opencv/)


By cropping the images to include only the brain, we:


- **Reduce Computational Load:** Smaller images mean less data for the network to process, speeding up training.
- **Improve Accuracy:** Eliminating irrelevant features helps the model focus on the essential patterns associated with brain tumors.
- **Enhance Generalization:** Reducing noise and variability in the input data leads to better performance on unseen data.


***Note:** Ensure that the crop_image() function is correctly implemented and tested. It should handle variations in image brightness, contrast, and noise to reliably detect the brain contours in all images.*

In [1]:
def crop_image(image):
    """
    Crops the input image to focus on the brain region by finding the extreme points of the largest contour.

    Parameters:
        image (numpy.ndarray): The input image in BGR format.

    Returns:
        new_image (numpy.ndarray): The cropped image containing only the brain region.
    """
    # Convert the image to grayscale and apply Gaussian blur
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)

    # Apply thresholding to create a binary image
    _, thresh = cv2.threshold(gray, 45, 255, cv2.THRESH_BINARY)

    # Perform erosions and dilations to remove small noise
    thresh = cv2.erode(thresh, None, iterations=2)
    thresh = cv2.dilate(thresh, None, iterations=2)

    # Find contours in the thresholded image
    cnts = cv2.findContours(
        thresh.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )

    # Handle compatibility between OpenCV versions
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]

    if not cnts:
        print("Warning: No contours found. Returning the original image.")
        return image

    # Find the largest contour
    c = max(cnts, key=cv2.contourArea)

    # Determine the extreme points
    ext_left = tuple(c[c[:, :, 0].argmin()][0])
    ext_right = tuple(c[c[:, :, 0].argmax()][0])
    ext_top = tuple(c[c[:, :, 1].argmin()][0])
    ext_bottom = tuple(c[c[:, :, 1].argmax()][0])

    # Crop the image using the extreme points
    new_image = image[ext_top[1]:ext_bottom[1], ext_left[0]:ext_right[0]]

    return new_image
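
The idea behind the crop can be illustrated without OpenCV. The sketch below is a simplified stand-in, not the contour-based method above: it thresholds a grayscale array and crops to the bounding box of all bright pixels, skipping the blur, erosion, dilation, and largest-contour selection. The function name and the synthetic test image are assumptions for illustration.

```python
import numpy as np

def crop_to_bright_region(gray, threshold=45):
    """Crop a 2-D grayscale array to the bounding box of pixels above `threshold`."""
    mask = gray > threshold
    if not mask.any():
        return gray  # nothing above threshold: return the input unchanged
    rows = np.where(mask.any(axis=1))[0]   # row indices containing bright pixels
    cols = np.where(mask.any(axis=0))[0]   # column indices containing bright pixels
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return gray[top:bottom + 1, left:right + 1]

# Synthetic 100x100 black image with a bright 40x30 rectangle standing in for the brain
synthetic = np.zeros((100, 100), dtype=np.uint8)
synthetic[20:60, 35:65] = 200
cropped = crop_to_bright_region(synthetic)
print(cropped.shape)
```

A synthetic image like this is also a convenient way to test `crop_image()` itself, since the expected bounding box is known in advance.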

We are now ready to load our data and prepare it for training our neural network. We will process the images from the augmented 'yes' and 'no' folders using the following steps:

1.	**Read the Image:** Load each image file into memory.
2.  **Crop the Image:** Apply the crop_image() function to extract only the brain region, removing unnecessary background.
3.	**Resize the Image:** Resize the cropped image to a uniform size of **(240, 240, 3)** to ensure consistency across all input data.
4.	**Normalize Pixel Values:** Scale the pixel values to lie within the range **0 to 1** by dividing by 255. This normalization helps the neural network train more effectively.
5.	**Append to Data Arrays:**
	-   **Features (X):** Add the processed image to the features array X.
	-   **Labels (y):** Add the corresponding label to the labels array y:
		-   1 for tumorous images ('yes' folder).
		-   0 for non-tumorous images ('no' folder).

By following these preprocessing steps, we ensure that our dataset is properly formatted and ready for input into our Convolutional Neural Network (CNN) model. This preparation is crucial for the model to learn effectively from the data and make accurate predictions.

**Next Steps:**

-	**Split the Data:** After preprocessing, we’ll split the dataset into training and testing sets to evaluate the model’s performance.
-	**Model Training:** Use the prepared data to train the CNN, adjusting hyperparameters as needed.
-	**Evaluation:** Assess the model’s accuracy, precision, recall, and F1-score to determine its effectiveness in detecting brain tumors.

By systematically loading, preprocessing, and preparing our data, we set a strong foundation for building an effective brain tumor detection model.

In [None]:
# Initialize empty lists for features and labels
X = []
y = []

# Set image dimensions
IMAGE_HEIGHT, IMAGE_WIDTH = 240, 240

# Paths to the augmented data directories
augmented_data_dir = '/content/drive/MyDrive/augmented_data'
yes_dir = os.path.join(augmented_data_dir, 'yes')
no_dir = os.path.join(augmented_data_dir, 'no')

# List of directories with their corresponding labels
directories = [
    (yes_dir, 1),  # Tumorous images labeled as 1
    (no_dir, 0)    # Non-tumorous images labeled as 0
]

# Process each image in the directories
for directory, label in directories:
    for filename in os.listdir(directory):
        # Construct the full file path
        file_path = os.path.join(directory, filename)
        
        # Read the image
        image = cv2.imread(file_path)
        if image is None:
            print(f"Warning: Unable to read image {file_path}. Skipping.")
            continue
        
        # Crop the image to include only the brain region
        image = crop_image(image)
        if image is None or image.size == 0:
            print(f"Warning: Cropped image is empty for {file_path}. Skipping.")
            continue
        
        # Resize the image to a uniform size
        image = cv2.resize(image, (IMAGE_WIDTH, IMAGE_HEIGHT), interpolation=cv2.INTER_CUBIC)
        
        # Normalize the pixel values
        image = image / 255.0
        
        # Append the image and its label to the lists
        X.append(image)
        y.append(label)

# Convert lists to NumPy arrays
X = np.array(X)
y = np.array(y)

# Shuffle the dataset
X, y = shuffle(X, y, random_state=42)
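
Dividing a `uint8` image by `255.0` produces a `float64` array, so the stacked dataset can take a noticeable amount of memory. The arithmetic below is a rough sketch with a hypothetical image count and helper name; casting with `X.astype(np.float32)` would halve the footprint at a precision that is typically sufficient for training.

```python
import numpy as np

def dataset_megabytes(n_images, height=240, width=240, channels=3, dtype=np.float64):
    """Approximate in-memory size of an image array, in megabytes (10**6 bytes)."""
    n_values = n_images * height * width * channels
    return n_values * np.dtype(dtype).itemsize / 1e6

n_images = 1000  # hypothetical dataset size; use len(X) in practice
size_f64 = dataset_megabytes(n_images, dtype=np.float64)
size_f32 = dataset_megabytes(n_images, dtype=np.float32)
print(f"float64: {size_f64:.1f} MB, float32: {size_f32:.1f} MB")
```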

## Plotting Sample Images

After preprocessing and preparing our dataset, it’s important to visualize some sample images to verify that the preprocessing steps have been applied correctly. Plotting sample images from both classes (tumorous and non-tumorous) helps us ensure that:

-	The images are correctly cropped to focus on the brain region.
-	The images are resized to the expected dimensions.
-	The normalization is applied correctly.
-	The labels correspond to the correct images.
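
Visual inspection can be complemented by a few programmatic checks. This is a minimal sketch under the assumptions listed above (shape `(240, 240, 3)`, values in `[0, 1]`, labels 0 or 1); the helper name is hypothetical, and the demonstration uses random stand-in data.

```python
import numpy as np

def check_dataset(X, y, expected_shape=(240, 240, 3)):
    """Raise AssertionError if the arrays violate the preprocessing assumptions."""
    assert X.ndim == 4 and X.shape[1:] == expected_shape, f"unexpected shape {X.shape}"
    assert len(X) == len(y), "features and labels differ in length"
    assert X.min() >= 0.0 and X.max() <= 1.0, "pixel values are not in [0, 1]"
    assert set(np.unique(y)) <= {0, 1}, "labels must be 0 or 1"
    return True

# Demonstration on random stand-in data (not real MRI images)
rng = np.random.default_rng(0)
X_demo = rng.random((4, 240, 240, 3))
y_demo = np.array([0, 1, 1, 0])
ok = check_dataset(X_demo, y_demo)
```

Checks like these cannot confirm that the crop is anatomically sensible or that labels match the images, which is why the plots remain necessary.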


In [None]:
def plot_sample_images(X, y, n=30):
    """
    Plots sample images from the dataset for each class label.

    Parameters:
        X (numpy.ndarray): Array of images.
        y (numpy.ndarray): Array of labels corresponding to images in X.
        n (int): Number of images to display per class (default is 30).
    """
    label_names = {0: "No", 1: "Yes"}

    for label in [0, 1]:
        # Get indices of images with the current label
        indices = np.where(y == label)[0]

        # If there are fewer images than n, adjust n
        n_images = min(n, len(indices))
        if n_images == 0:
            print(f"No images found for label {label}.")
            continue

        # Randomly select n_images indices
        selected_indices = np.random.choice(indices, size=n_images, replace=False)

        # Calculate the number of rows and columns for the subplot grid
        columns = 10
        rows = int(np.ceil(n_images / columns))

        plt.figure(figsize=(20, 2 * rows))

        for i, idx in enumerate(selected_indices):
            plt.subplot(rows, columns, i + 1)
            plt.imshow(X[idx])
            plt.axis('off')

        plt.suptitle(f"Brain Tumor: {label_names[label]}", fontsize=16)
        plt.show()

# Example usage:
# plot_sample_images(X, y)

# Splitting the Dataset

In [None]:
# Split the dataset into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Display the sizes of the training and validation sets
print(f'Training set size: {X_train.shape}')
print(f'Validation set size: {X_valid.shape}')
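
Because the split passes `stratify=y`, the training and validation sets should keep nearly the same class proportions as the full dataset. One way to confirm this is sketched below on stand-in labels; the helper name is an assumption.

```python
import numpy as np

def class_fractions(labels):
    """Return the fraction of samples with label 0 and label 1."""
    counts = np.bincount(np.asarray(labels), minlength=2)
    return counts / counts.sum()

# Stand-in label arrays; in the notebook these would be y_train and y_valid
y_train_demo = np.array([0] * 35 + [1] * 35)
y_valid_demo = np.array([0] * 15 + [1] * 15)
print("train:", class_fractions(y_train_demo))
print("valid:", class_fractions(y_valid_demo))
```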

## Building the Model

We are constructing a Convolutional Neural Network (CNN) to perform the classification task of detecting brain tumors. The architecture is designed to be straightforward yet effective, consisting of the following components:

1.	**Convolutional Blocks:** Two consecutive blocks (the first preceded by zero padding), each consisting of:
	-	**Convolutional Layer:** Applies a set of filters to extract relevant features from the input images, followed by a ReLU activation to introduce nonlinearity.
	-	**Batch Normalization:** Normalizes the activations of the convolutional layer to accelerate training and improve stability.
	-	**Max Pooling:** Reduces the spatial dimensions of the feature maps, which helps in reducing the computational load and controlling overfitting.
2.	**Flattening Layer:**
	-	Transforms the pooled feature maps into a one-dimensional vector to serve as input for the fully connected output layer.
3.	**Output Layer:**
	-	**Single Fully Connected Neuron with Sigmoid Activation:** Produces a probability between 0 and 1, suitable for binary classification tasks.

This architecture strikes a balance between simplicity and performance, making it well-suited for our goal of classifying images as tumorous or non-tumorous. By stacking convolutional layers with batch normalization and max pooling, we enable the model to learn hierarchical feature representations, improving its ability to detect subtle differences in brain images.

**Next Steps:**

-	**Compile the Model:** Define the optimizer, loss function, and evaluation metrics.
-	**Train the Model:** Fit the model on the training data and validate it using the validation set.
-	**Evaluate Performance:** Analyze the model’s performance using appropriate metrics like accuracy, precision, recall, and F1-score.
-	**Fine-tuning:** Adjust hyperparameters or modify the architecture as needed based on evaluation results.

By carefully designing the CNN architecture, we aim to achieve high accuracy in detecting brain tumors, contributing valuable insights to medical diagnostics.

In [None]:
# Define constants
IMAGE_HEIGHT, IMAGE_WIDTH = 240, 240
INPUT_SHAPE = (IMAGE_HEIGHT, IMAGE_WIDTH, 3)

# Build the model
model = keras.models.Sequential([
    keras.layers.Input(shape=INPUT_SHAPE),
    
    keras.layers.ZeroPadding2D(padding=(2, 2)),
    keras.layers.Conv2D(filters=32, kernel_size=(7, 7), strides=(1, 1), activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPooling2D(pool_size=(4, 4)),
    
    keras.layers.Conv2D(filters=64, kernel_size=(3, 3), strides=(1, 1), activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.MaxPooling2D(pool_size=(2, 2)),
    
    keras.layers.Flatten(),
    keras.layers.Dense(units=1, activation='sigmoid')
])

# Display the model summary
model.summary()
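
The shapes and parameter counts reported by `model.summary()` can be cross-checked by hand. The sketch below assumes Keras defaults (`padding='valid'` for convolutions and pooling, pooling stride equal to the pool size) and uses hypothetical helper names.

```python
def conv_out(size, kernel, stride=1):
    """Output length of a 'valid' convolution along one spatial axis."""
    return (size - kernel) // stride + 1

def pool_out(size, pool):
    """Output length of non-overlapping 'valid' max pooling along one axis."""
    return size // pool

size = 240 + 2 * 2                 # ZeroPadding2D(2, 2): 244
size = conv_out(size, 7)           # Conv2D 7x7: 238
conv1_params = 7 * 7 * 3 * 32 + 32
size = pool_out(size, 4)           # MaxPooling2D 4x4: 59
size = conv_out(size, 3)           # Conv2D 3x3: 57
conv2_params = 3 * 3 * 32 * 64 + 64
size = pool_out(size, 2)           # MaxPooling2D 2x2: 28

flat_features = size * size * 64   # Flatten
dense_params = flat_features + 1   # one weight per feature plus a bias
bn_params = 4 * 32 + 4 * 64        # gamma, beta, moving mean, moving variance

total_params = conv1_params + conv2_params + dense_params + bn_params
print(flat_features, total_params)
```

Most of the parameters sit in the final dense layer, because every one of the flattened features gets its own weight.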

In [None]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
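
For intuition, the `binary_crossentropy` loss is the mean over samples of -[y·log(p) + (1-y)·log(1-p)], where p is the sigmoid output and y is the 0/1 label. A minimal NumPy sketch follows; the clipping constant `eps` is an assumption chosen to avoid `log(0)`, and this is not Keras's own implementation.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.clip(np.asarray(y_pred, dtype=np.float64), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return losses.mean()

# Same labels, predictions that are confidently right vs. confidently wrong
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

The loss penalizes confident mistakes far more heavily than it rewards confident correct predictions, which is why it pairs naturally with a sigmoid output.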

In [None]:
# TensorBoard
log_file_name = f'brain_tumor_detection_cnn_{int(time.time())}'
tensorboard = TensorBoard(log_dir=f'logs/{log_file_name}')

In [None]:
model.fit(x=X_train, 
          y=y_train, 
          batch_size=32, 
          epochs=20, 
          validation_data=(X_valid, y_valid), 
          callbacks=[tensorboard])


In [None]:
# Keep the History callback object; its .history attribute is the dict of per-epoch metrics
history = model.history


## Visualizing Training and Validation Metrics

After training our Convolutional Neural Network (CNN), it’s essential to visualize the model’s performance over the training epochs. Plotting the training and validation metrics, such as accuracy and loss, helps us understand how well the model is learning and whether it’s overfitting or underfitting.

In [None]:
# Retrieve metrics from the history object
train_loss = history.history['loss']
train_acc = history.history['accuracy']
val_loss = history.history['val_loss']
val_acc = history.history['val_accuracy']
epochs = range(1, len(train_loss) + 1)

# Plotting Loss and Accuracy
plt.figure(figsize=(12, 5))

# Plot Loss
plt.subplot(1, 2, 1)
plt.plot(epochs, train_loss, 'b-o', label='Training Loss')
plt.plot(epochs, val_loss, 'r-o', label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)

# Plot Accuracy
plt.subplot(1, 2, 2)
plt.plot(epochs, train_acc, 'b-o', label='Training Accuracy')
plt.plot(epochs, val_acc, 'r-o', label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()
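
The curves can also be summarized numerically. A validation loss that starts rising while the training loss keeps falling is a common sign of overfitting, so the sketch below reports the epoch with the lowest validation loss and the accuracy gap at that point. The helper name and the example numbers are made up for illustration; pass the dictionary of per-epoch metrics recorded by Keras instead.

```python
def summarize_history(history_dict):
    """Return the 1-based epoch with the lowest validation loss and the
    train/validation accuracy gap at that epoch."""
    val_loss = history_dict["val_loss"]
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    gap = history_dict["accuracy"][best_index] - history_dict["val_accuracy"][best_index]
    return best_index + 1, gap

# Made-up numbers for illustration only, not results from this model
example_history = {
    "loss": [0.70, 0.50, 0.35, 0.20],
    "accuracy": [0.60, 0.75, 0.85, 0.93],
    "val_loss": [0.68, 0.55, 0.52, 0.60],
    "val_accuracy": [0.62, 0.72, 0.76, 0.74],
}
best_epoch, accuracy_gap = summarize_history(example_history)
print(f"Lowest validation loss at epoch {best_epoch}; accuracy gap {accuracy_gap:.2f}")
```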