# Brain Tumor Classification with Optimized CNN

This notebook demonstrates an end-to-end workflow for brain tumor classification using an optimized CNN model. It includes data preprocessing, model training, and evaluation with proper progress tracking.

## Overview

1. Mount Google Drive and set up the environment
2. Update the repository
3. Install dependencies
4. Set up paths to the dataset and results directories
5. Preprocess the data
6. Train the optimized CNN model
7. Evaluate performance
8. Display results
9. Make predictions on sample images

## 1. Mount Google Drive and Set Up Environment

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Verify TensorFlow and GPU availability
import tensorflow as tf
import platform

print('TensorFlow version:', tf.__version__)
print('Python version:', platform.python_version())
print('GPUs available:', tf.config.list_physical_devices('GPU'))

# Set seed for reproducibility
import numpy as np
import random
import os

SEED = 42
os.environ['PYTHONHASHSEED'] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

## 2. Update Repository

Pull the latest changes from the repository to get the optimized CNN model and improved training process.

In [None]:
# Navigate to repository directory and pull updates
repo_name = "SE4050-Deep-Learning-Assignment"

# Check if we're already in the repo directory
current_path = os.getcwd()
if os.path.basename(current_path) == repo_name:
    print(f"Already in repository directory: {current_path}")
elif os.path.exists(repo_name):
    print(f"Changing to repository directory: {repo_name}")
    %cd {repo_name}
else:
    print("Repository not found, cloning it...")
    !git clone https://github.com/IT22052124/SE4050-Deep-Learning-Assignment.git
    %cd {repo_name}

# Pull latest changes
!git pull
print("\nLatest commit:")
!git log -1 --pretty=format:"Updated to: %h - %s (%an, %ar)"

## 3. Install Dependencies

In [None]:
# Install required packages
!pip install -q -U pip
!pip install -q -r requirements.txt

## 4. Set Up Paths

Define paths to your dataset in Google Drive and where to save results.

In [None]:
# Customize these paths according to your Google Drive structure
DRIVE_ROOT = '/content/drive/MyDrive'
BRAIN_TUMOR_DIR = DRIVE_ROOT + '/BrainTumor'
DATA_DIR = BRAIN_TUMOR_DIR + '/data'
RAW_DATA_DIR = DATA_DIR + '/archive'  # Contains yes/no folders with raw images
PROCESSED_DATA_DIR = DATA_DIR + '/processed'  # Where preprocessed data will be stored
RESULTS_DIR = BRAIN_TUMOR_DIR + '/Result/cnn'  # Where model and results will be saved

# Create directories if they don't exist
!mkdir -p {PROCESSED_DATA_DIR}
!mkdir -p {RESULTS_DIR}

print("Folder structure:")
print(f"- Brain Tumor Directory: {BRAIN_TUMOR_DIR}")
print(f"- Raw Data Directory: {RAW_DATA_DIR}")
print(f"- Processed Data Directory: {PROCESSED_DATA_DIR}")
print(f"- Results Directory: {RESULTS_DIR}")

## 5. Preprocess the Data

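The cell below checks that the raw `yes`/`no` folders exist, then resizes every image to 224×224 and writes a 70/15/15 train/val/test split to the processed data directory. If processed data already exists, preprocessing is skipped and only the per-split image counts are printed.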

In [None]:
# Let's verify the existence of raw data directories first
import os
from pathlib import Path
import cv2
import shutil
from tqdm.notebook import tqdm

print("Checking raw data directories...")
print(f"Looking for raw data in: {RAW_DATA_DIR}")

# First verify the raw data exists and list what's available
if os.path.exists(RAW_DATA_DIR):
    dirs = [d for d in os.listdir(RAW_DATA_DIR) if os.path.isdir(os.path.join(RAW_DATA_DIR, d))]
    print(f"Found directories: {dirs}")
    
    # Check for yes/no directories specifically
    yes_dir = os.path.join(RAW_DATA_DIR, "yes")
    no_dir = os.path.join(RAW_DATA_DIR, "no")
    
    yes_exists = os.path.exists(yes_dir)
    no_exists = os.path.exists(no_dir)
    
    print(f"'yes' directory exists: {yes_exists}")
    print(f"'no' directory exists: {no_exists}")
    
    # Count files in each directory
    if yes_exists:
        yes_files = [f for f in os.listdir(yes_dir) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
        print(f"Found {len(yes_files)} image files in 'yes' directory")
    
    if no_exists:
        no_files = [f for f in os.listdir(no_dir) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
        print(f"Found {len(no_files)} image files in 'no' directory")
else:
    print(f"❌ Error: Raw data directory {RAW_DATA_DIR} does not exist!")
    print("Please make sure your Google Drive contains the correct folder structure.")
    
# Now let's define our preprocessing functions
# Constants
RANDOM_SEED = 42
IMG_SIZE = (224, 224)
VAL_SPLIT = 0.15
TEST_SPLIT = 0.15

def create_dirs(base_path, classes):
    """Create directory structure for processed data"""
    for split in ["train", "val", "test"]:
        for cls in classes:
            os.makedirs(os.path.join(base_path, split, cls), exist_ok=True)
            
def split_and_copy(source_root, dest_root, classes):
    """Split and copy files into train/val/test directories"""
    random.seed(RANDOM_SEED)
    create_dirs(dest_root, classes)
    
    # Track total files processed
    total_processed = 0
    
    for cls in classes:
        cls_dir = os.path.join(source_root, cls)
        if not os.path.exists(cls_dir):
            print(f"Warning: Class directory {cls_dir} not found.")
            continue
            
        # Get all image files, sorted so the seeded shuffle below is reproducible
        # (os.listdir returns entries in arbitrary order)
        imgs = sorted(os.path.join(cls_dir, f) for f in os.listdir(cls_dir)
                      if f.lower().endswith(('.jpg', '.jpeg', '.png')))
            
        if not imgs:
            print(f"No images found in {cls_dir}")
            continue
            
        print(f"Found {len(imgs)} images in {cls_dir}")
        random.shuffle(imgs)
        n = len(imgs)
        n_val, n_test = int(n * VAL_SPLIT), int(n * TEST_SPLIT)
        n_train = n - n_val - n_test
        
        splits = {
            "train": imgs[:n_train],
            "val": imgs[n_train:n_train+n_val],
            "test": imgs[n_train+n_val:]
        }
        
        for split, files in splits.items():
            print(f"Processing {cls} -> {split}: {len(files)} images")
            for img_path in tqdm(files):
                try:
                    # Read and resize image
                    img = cv2.imread(img_path)
                    if img is None:
                        print(f"Warning: Could not read {img_path}")
                        continue
                    
                    # Resize image
                    resized = cv2.resize(img, IMG_SIZE)
                    
                    # Save to destination
                    out_path = os.path.join(dest_root, split, cls, os.path.basename(img_path))
                    cv2.imwrite(out_path, resized)
                    total_processed += 1
                except Exception as e:
                    print(f"Error processing {img_path}: {e}")
    
    return total_processed

# Check if preprocessing is needed (if processed data doesn't already exist)
train_dir = os.path.join(PROCESSED_DATA_DIR, "train")
val_dir = os.path.join(PROCESSED_DATA_DIR, "val")
test_dir = os.path.join(PROCESSED_DATA_DIR, "test")

processed_exists = (os.path.exists(train_dir) and 
                   os.path.exists(val_dir) and 
                   os.path.exists(test_dir))

if processed_exists:
    print("\nProcessed data already exists. Skipping preprocessing.")
    # Check count of images in each split
    def count_files(path):
        """Return the number of entries in path, or 0 if it does not exist."""
        return len(os.listdir(path)) if os.path.exists(path) else 0

    train_yes = count_files(os.path.join(train_dir, "yes"))
    train_no = count_files(os.path.join(train_dir, "no"))
    val_yes = count_files(os.path.join(val_dir, "yes"))
    val_no = count_files(os.path.join(val_dir, "no"))
    test_yes = count_files(os.path.join(test_dir, "yes"))
    test_no = count_files(os.path.join(test_dir, "no"))
    
    print(f"Train: {train_yes} yes, {train_no} no")
    print(f"Validation: {val_yes} yes, {val_no} no")
    print(f"Test: {test_yes} yes, {test_no} no")
else:
    print("\nStarting preprocessing...")
    print(f"Reading raw images from: {RAW_DATA_DIR}")
    print(f"Saving processed images to: {PROCESSED_DATA_DIR}")

    # Process the data
    if os.path.exists(RAW_DATA_DIR) and (os.path.exists(yes_dir) and os.path.exists(no_dir)):
        total_files = split_and_copy(RAW_DATA_DIR, PROCESSED_DATA_DIR, ["yes", "no"])
        print(f"✅ Preprocessing completed successfully! Processed {total_files} images.")
    else:
        print("⚠️ Could not find expected yes/no folders in the raw data directory.")
        print("Please make sure your Google Drive contains the correct folder structure.")
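
The split arithmetic used in this step can be sketched on its own with the standard library. This is a minimal illustration, not the notebook's exact code: the function name `split_paths` and the example file names are hypothetical, and it assumes the same 15%/15% validation/test fractions and seed 42 as the constants above.

```python
import random


def split_paths(paths, val_frac=0.15, test_frac=0.15, seed=42):
    """Deterministically split a list of file paths into train/val/test."""
    # Sort first so the result does not depend on the input order
    # (os.listdir, for example, returns entries in arbitrary order).
    ordered = sorted(paths)
    # A private Random instance avoids touching the global random state.
    random.Random(seed).shuffle(ordered)

    n = len(ordered)
    n_val, n_test = int(n * val_frac), int(n * test_frac)
    n_train = n - n_val - n_test  # train absorbs the rounding remainder

    return {
        "train": ordered[:n_train],
        "val": ordered[n_train:n_train + n_val],
        "test": ordered[n_train + n_val:],
    }


# Hypothetical file names, used only to illustrate the arithmetic.
example = [f"img_{i:03d}.jpg" for i in range(100)]
splits = split_paths(example)
sizes = {name: len(files) for name, files in splits.items()}
print(sizes)
```

Because the validation and test counts are truncated with `int()`, any leftover images from rounding end up in the training set, so the three splits always cover every image exactly once.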

## 6. Train the Optimized CNN Model

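Run the training script to train the CNN model on the preprocessed data. The optimized CNN has about 1.2 million parameters (versus 25.7 million in the original model), and the script computes `steps_per_epoch` so the progress bar shows exact progress instead of "Unknown".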

In [None]:
# First, let's verify that we have processed data to train on
import os

# Check if processed data exists with the right structure
train_dir = os.path.join(PROCESSED_DATA_DIR, "train")
val_dir = os.path.join(PROCESSED_DATA_DIR, "val")
test_dir = os.path.join(PROCESSED_DATA_DIR, "test")

processed_data_exists = (os.path.exists(train_dir) and 
                         os.path.exists(val_dir) and 
                         os.path.exists(test_dir))

# Check if we have class folders in the train directory
class_folders_exist = False
if processed_data_exists:
    train_yes = os.path.join(train_dir, "yes")
    train_no = os.path.join(train_dir, "no")
    
    class_folders_exist = (os.path.exists(train_yes) and 
                           os.path.exists(train_no))
    
    if class_folders_exist:
        yes_count = len(os.listdir(train_yes))
        no_count = len(os.listdir(train_no))
        
        print(f"Train directory has {yes_count} 'yes' images and {no_count} 'no' images")
        
        if yes_count == 0 or no_count == 0:
            class_folders_exist = False
            print("⚠️ One of the class folders is empty!")

# Choose the right directory for training
if processed_data_exists and class_folders_exist:
    # If processed data with correct structure exists, use it
    TRAIN_DATA_DIR = PROCESSED_DATA_DIR
    print(f"Using processed data for training: {TRAIN_DATA_DIR}")
    print("The script will automatically detect the train/val/test structure")
elif os.path.exists(os.path.join(RAW_DATA_DIR, "yes")) and os.path.exists(os.path.join(RAW_DATA_DIR, "no")):
    # If raw data exists with yes/no folders, use the original data structure
    TRAIN_DATA_DIR = RAW_DATA_DIR
    print(f"Using raw data for training: {TRAIN_DATA_DIR}")
    print("The script will use the original data structure with class folders")
else:
    # Fall back to the original DATA_DIR (parent of archive)
    TRAIN_DATA_DIR = DATA_DIR
    print(f"⚠️ Could not find properly structured data. Trying: {TRAIN_DATA_DIR}")
    print("Training may fail if the correct data structure is not found.")

# Calculate training parameters
batch_size = 32
if processed_data_exists and class_folders_exist:
    train_yes = os.path.join(train_dir, "yes")
    train_no = os.path.join(train_dir, "no")
    total_train_images = len(os.listdir(train_yes)) + len(os.listdir(train_no))
    steps_per_epoch = total_train_images // batch_size
    print(f"Total training images: {total_train_images}")
    print(f"With batch size {batch_size}, steps_per_epoch should be: {steps_per_epoch}")

# Train the model
print("\nTraining the CNN model...")
print(f"Using data from: {TRAIN_DATA_DIR}")
print(f"Saving results to: {RESULTS_DIR}")

# Use the enhanced train_cnn.py script which now handles both data structures
# and properly calculates steps_per_epoch
!python -m src.models.cnn.train_cnn \
    --data_dir {TRAIN_DATA_DIR} \
    --results_dir {RESULTS_DIR} \
    --epochs 20 \
    --batch_size {batch_size} \
    --img_size 224 224 \
    --use_processed {1 if processed_data_exists and class_folders_exist else 0}
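
The step count printed in the cell above uses floor division, which leaves out a final partial batch. Whether `train_cnn.py` drops or keeps that batch is not shown here, so the sketch below simply contrasts the two conventions. The helper name `steps_per_epoch` and the image count are hypothetical.

```python
import math


def steps_per_epoch(num_images, batch_size, drop_remainder=False):
    """Number of batches in one pass over the data."""
    if drop_remainder:
        # The final partial batch is skipped.
        return num_images // batch_size
    # The final partial batch counts as one more step.
    return math.ceil(num_images / batch_size)


# Hypothetical image count, chosen so the two conventions differ.
print(steps_per_epoch(1000, 32))                       # 32
print(steps_per_epoch(1000, 32, drop_remainder=True))  # 31
```

If the progress bar reports one more step than the value printed above, the pipeline is most likely keeping the partial batch.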

## 7. Evaluate the Model

Run the evaluation script to assess model performance on the test set.

In [None]:
# Evaluate the model on the test set
print("Evaluating the CNN model...")

# Choose the right directory for evaluation
if processed_data_exists and class_folders_exist:
    # If processed data with correct structure exists, use it
    EVAL_DATA_DIR = PROCESSED_DATA_DIR
    print(f"Using processed data for evaluation: {EVAL_DATA_DIR}")
else:
    # Fall back to the training data directory
    EVAL_DATA_DIR = TRAIN_DATA_DIR
    print(f"Using training data directory for evaluation: {EVAL_DATA_DIR}")

print(f"Results will be saved to: {RESULTS_DIR}")

# Use the enhanced evaluate_cnn script
# The script now automatically detects whether to use the train/val/test structure
!python -m src.models.cnn.evaluate_cnn \
    --data_dir {EVAL_DATA_DIR} \
    --results_dir {RESULTS_DIR} \
    --batch_size {batch_size} \
    --img_size 224 224 \
    --use_processed {1 if processed_data_exists and class_folders_exist else 0}
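
For reference, the usual binary classification metrics can be derived from the four confusion matrix counts. The exact set of metrics that `evaluate_cnn` writes is not shown in this notebook, so the following is only a sketch of the standard definitions; the function name and the counts are made up for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard metrics from binary confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up counts: 40 tumors found, 10 missed, 5 false alarms, 45 true negatives.
metrics_example = binary_metrics(tp=40, fp=5, fn=10, tn=45)
for name, value in metrics_example.items():
    print(f"{name}: {value:.4f}")
```

For a tumor detector, recall on the `yes` class is often the number to watch, since a missed tumor (false negative) is usually costlier than a false alarm.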

## 8. Display Results

Show evaluation metrics, plots, and visualizations stored in the results directory.

In [None]:
# Display training history plot
import matplotlib.pyplot as plt
from PIL import Image

try:
    history_img = Image.open(f"{RESULTS_DIR}/history.png")
    plt.figure(figsize=(10, 6))
    plt.imshow(history_img)
    plt.axis('off')
    plt.title('Training History')
    plt.show()
except Exception as e:
    print(f"Error displaying training history: {e}")

In [None]:
# Display confusion matrix
try:
    cm_img = Image.open(f"{RESULTS_DIR}/confusion_matrix.png")
    plt.figure(figsize=(8, 8))
    plt.imshow(cm_img)
    plt.axis('off')
    plt.title('Confusion Matrix')
    plt.show()
except Exception as e:
    print(f"Error displaying confusion matrix: {e}")

In [None]:
# Display ROC curve and Precision-Recall curve if available
try:
    roc_img = Image.open(f"{RESULTS_DIR}/roc_curve.png")
    plt.figure(figsize=(8, 8))
    plt.imshow(roc_img)
    plt.axis('off')
    plt.title('ROC Curve')
    plt.show()
except Exception as e:
    print(f"Could not display ROC curve: {e}")

try:
    pr_img = Image.open(f"{RESULTS_DIR}/precision_recall_curve.png")
    plt.figure(figsize=(8, 8))
    plt.imshow(pr_img)
    plt.axis('off')
    plt.title('Precision-Recall Curve')
    plt.show()
except Exception as e:
    print(f"Could not display Precision-Recall curve: {e}")

In [None]:
# Display classification report
try:
    with open(f"{RESULTS_DIR}/classification_report.txt", 'r') as f:
        report = f.read()
    print("Classification Report:")
    print(report)
except Exception as e:
    print(f"Error reading classification report: {e}")

In [None]:
# Display metrics
import json

try:
    with open(f"{RESULTS_DIR}/metrics.json", 'r') as f:
        metrics = json.load(f)
    print("Model Metrics:")
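    for metric, value in metrics.items():
        # Format numeric values to 4 decimals; print anything else as-is
        if isinstance(value, (int, float)):
            print(f"{metric}: {value:.4f}")
        else:
            print(f"{metric}: {value}")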
except Exception as e:
    print(f"Error reading metrics: {e}")

## 9. Make Predictions with the Model

Load the best model and make predictions on sample images.

In [None]:
import json
import os

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

# Load the best model
try:
    # Try to load the Keras model first
    model_path = f"{RESULTS_DIR}/best_model.keras"
    if not os.path.exists(model_path):
        # Fallback to H5 format
        model_path = f"{RESULTS_DIR}/best_model.h5"
    
    model = tf.keras.models.load_model(model_path, compile=False)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    print(f"Successfully loaded model from {model_path}")
    
    # Load class names
    try:
        with open(f"{RESULTS_DIR}/class_names.json", 'r') as f:
            class_names = json.load(f)
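    except (OSError, json.JSONDecodeError):
        # Fall back to the default label order if the file is missing or invalid
        class_names = ["no", "yes"]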
    
    print(f"Classes: {class_names}")
except Exception as e:
    print(f"Error loading model: {e}")

In [None]:
# Function to make predictions on sample images
def predict_and_display(image_path):
    # Load and preprocess image
    img = tf.io.read_file(image_path)
    img = tf.io.decode_image(img, channels=3, expand_animations=False)
    img = tf.image.resize(img, (224, 224))
    img_display = img.numpy().astype(np.uint8)
    img = tf.cast(img, tf.float32) / 255.0
    img = tf.expand_dims(img, axis=0)
    
    # Make prediction
    pred = model.predict(img, verbose=0)[0][0]
    predicted_class = class_names[1] if pred > 0.5 else class_names[0]
    confidence = pred if pred > 0.5 else 1 - pred
    
    # Display results
    plt.figure(figsize=(6, 6))
    plt.imshow(img_display)
    plt.title(f"Prediction: {predicted_class} ({confidence:.2f})")
    plt.axis('off')
    plt.show()
    
    return predicted_class, confidence

# Find some sample images to predict - use the processed test images
test_yes_dir = os.path.join(PROCESSED_DATA_DIR, "test", "yes") 
test_no_dir = os.path.join(PROCESSED_DATA_DIR, "test", "no")

yes_samples = []
no_samples = []

# Try to get processed test images first
if os.path.exists(test_yes_dir):
    yes_files = os.listdir(test_yes_dir)
    yes_samples = [os.path.join(test_yes_dir, f) for f in yes_files[:2]] if yes_files else []
    
if os.path.exists(test_no_dir):
    no_files = os.listdir(test_no_dir)
    no_samples = [os.path.join(test_no_dir, f) for f in no_files[:2]] if no_files else []

# If we couldn't find processed test images, try the raw images
if not yes_samples and os.path.exists(os.path.join(RAW_DATA_DIR, "yes")):
    yes_files = os.listdir(os.path.join(RAW_DATA_DIR, "yes"))
    yes_samples = [os.path.join(RAW_DATA_DIR, "yes", f) for f in yes_files[:2]] if yes_files else []
    
if not no_samples and os.path.exists(os.path.join(RAW_DATA_DIR, "no")):
    no_files = os.listdir(os.path.join(RAW_DATA_DIR, "no"))
    no_samples = [os.path.join(RAW_DATA_DIR, "no", f) for f in no_files[:2]] if no_files else []

sample_images = yes_samples + no_samples

if sample_images:
    print(f"Making predictions on {len(sample_images)} sample images")
    for img_path in sample_images:
        print(f"\nImage: {os.path.basename(img_path)}")
        # The true label is the parent folder name ("yes" or "no")
        true_class = os.path.basename(os.path.dirname(img_path))
        pred_class, conf = predict_and_display(img_path)
        print(f"True class: {true_class}")
        print(f"Predicted: {pred_class} with {conf:.2f} confidence")
else:
    print("No sample images found for prediction.")
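
The thresholding step inside `predict_and_display` can be isolated as a small pure-Python function. This sketch assumes, as the cell above does, that the model ends in a single sigmoid unit whose output is the probability of the second class; `decode_prediction` is a hypothetical helper name.

```python
def decode_prediction(prob, class_names=("no", "yes")):
    """Map a sigmoid output in [0, 1] to a (label, confidence) pair."""
    if prob > 0.5:
        # Output above 0.5 means the second class; confidence is the output itself.
        return class_names[1], prob
    # Otherwise the first class; confidence is the complementary probability.
    return class_names[0], 1.0 - prob


for p in (0.9, 0.2, 0.5):
    label, confidence = decode_prediction(p)
    print(f"p={p:.2f} -> {label} ({confidence:.2f})")
```

Note that an output of exactly 0.5 falls to the first class, matching the strict `>` comparison used above.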

## 10. Conclusion

This notebook has demonstrated an end-to-end workflow for brain tumor classification using an optimized CNN model with the following advantages:

1. **Optimized Architecture**:
   - Reduced parameters: ~1.2 million vs 25.7 million in the original model
   - Smaller memory footprint: 4.55 MB vs 98.36 MB
   - Faster training time

2. **Proper Progress Tracking**:
   - Calculates steps_per_epoch properly for accurate progress bars
   - Shows exact training progress instead of "Unknown"

3. **Better Resource Utilization**:
   - More efficient memory usage
   - Faster inference time
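
The memory figures in item 1 follow from the parameter counts. A rough check, assuming 32-bit float weights (4 bytes each) and that "MB" means 2^20 bytes as in Keras model summaries; both are assumptions, and `float32_size_mib` is a hypothetical helper:

```python
def float32_size_mib(num_params, bytes_per_param=4):
    """Approximate weight storage in MiB for a given parameter count."""
    return num_params * bytes_per_param / 2**20


# Rounded parameter counts quoted in the conclusion.
for name, n in [("optimized", 1_200_000), ("original", 25_700_000)]:
    print(f"{name}: {float32_size_mib(n):.2f} MiB")
```

With the rounded counts this gives roughly 4.6 MiB and 98.0 MiB, close to the reported 4.55 MB and 98.36 MB; the small gaps are consistent with the parameter counts having been rounded.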

For further improvements, you could:
- Try different CNN architectures like ResNet50 or EfficientNet
- Experiment with different preprocessing techniques
- Apply more advanced data augmentation
- Adjust hyperparameters like learning rate, batch size, etc.