## **Custom CNN**

This notebook documents how the custom Convolutional Neural Network (CNN) is built and trained. It serves as the baseline against which the transfer learning models are compared.

Goal: Train a custom CNN using an 80/10/10 stratified train/validation/test split, 0-1 pixel normalization, data augmentation, and class weights to compensate for class imbalance.

## Project Setup and Initialization

Imports and Paths

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
import os

In [2]:
# Import utility functions from the uploaded helper files
from data_utils import perform_stratified_split, DataGeneratorUtils, TARGET_SIZE, SEED
from train_utils import compile_model, get_callbacks

ModuleNotFoundError: No module named 'data_utils'

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Path to your metadata.csv file
METADATA_PATH = "/content/drive/MyDrive/NOVA_IMS/Deep_Learning_Project/rare_species/metadata.csv"
# Root directory containing all your original, un-split image files
IMAGE_ROOT_DIR = "/content/drive/MyDrive/NOVA_IMS/Deep_Learning_Project/rare_species"
# The target folder where the stratified data structure (train/val/test) will be created
DATA_TARGET_DIR = "/content/rive/MyDrive/NOVA_IMS/Deep_Learning_Project/data"

# Create the results directory if it doesn't exist to save model weights
os.makedirs("/content/drive/MyDrive/NOVA_IMS/Deep_Learning_Project/outputs", exist_ok=True)

Data Splitting and Generator Creation

*We use the functions from data_utils.py to handle the reproducible split and the data pipeline creation.*
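
The implementation of `perform_stratified_split` lives in `data_utils.py` and is not shown in this notebook. As a rough illustration of the idea only, the sketch below splits indices per class so that each class keeps roughly the same 80/10/10 proportions. The function name `stratified_split`, the toy labels, and the seed value `42` are assumptions made for this example, not the project's actual code.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.8, val_frac=0.1, seed=42):
    """Return (train, val, test) index lists, splitting each class separately."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * train_frac)
        n_val = round(len(indices) * val_frac)
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # the remainder goes to test
    return train, val, test


# Toy example (hypothetical sizes): one large family and one small family.
labels = ["cercopithecidae"] * 300 + ["formicidae"] * 30
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting class by class, rather than shuffling the whole dataset once, is what keeps rare families represented in all three subsets.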

In [None]:
# Load the metadata file
try:
    metadata_df = pd.read_csv(METADATA_PATH)
except FileNotFoundError:
    print(f"ERROR: Metadata file not found at {METADATA_PATH}.")
    raise  # stop here: later cells need metadata_df

In [None]:
# IMPORTANT: RUN perform_stratified_split ONLY ONCE!
#data_base_path = perform_stratified_split(metadata_df, IMAGE_ROOT_DIR, DATA_TARGET_DIR)
#print(f"Data structure created/verified at: {data_base_path}")

Data Split: Train=9585, Validation=1199, Test=1199
Organizing train set...
Organizing validation set...
Organizing test set...
Data directory structure successfully created/updated.
Data structure created/verified at: /content/rive/MyDrive/NOVA_IMS/Deep_Learning_Project/data


In [None]:
#!ls -R /content/rive


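Note: `DATA_TARGET_DIR` contains a typo (`/content/rive` instead of `/content/drive`). The split was therefore written outside the mounted Drive folder, on the Colab runtime's local disk, and the later cells read it from that same path. Data stored there will likely need to be recreated in a new session.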

Verification: counting images per class in split directories

In [None]:
def count_images_per_class(base_directory, set_name):
    """Counts the number of images in each class (family folder) for a given set."""
    directory = os.path.join(base_directory, set_name)
    class_counts = {}

    if not os.path.exists(directory):
        return {f"ERROR: Directory not found at {directory}": 0}

    for folder in os.listdir(directory):
        folder_path = os.path.join(directory, folder)
        if os.path.isdir(folder_path):
            # Count files in the class folder
            image_count = len([f for f in os.listdir(folder_path) if f.lower().endswith(('.png', '.jpg', '.jpeg'))])
            class_counts[folder] = image_count
    return class_counts

In [None]:
# Count images in train, validation, and test directories
train_class_counts = count_images_per_class(DATA_TARGET_DIR, 'train')
val_class_counts = count_images_per_class(DATA_TARGET_DIR, 'validation')
test_class_counts = count_images_per_class(DATA_TARGET_DIR, 'test')

# Display results
print("\n--- Class Counts Verification ---")
print("Number of images per class in the TRAIN directory (Top 5):")
# Sort and print for readability
print(pd.Series(train_class_counts).sort_values(ascending=False).head(5))

print("\nNumber of images per class in the VALIDATION directory (Top 5):")
print(pd.Series(val_class_counts).sort_values(ascending=False).head(5))

print("\nNumber of images per class in the TEST directory (Top 5):")
print(pd.Series(test_class_counts).sort_values(ascending=False).head(5))

# Check for overall size consistency
total_counted = sum(train_class_counts.values()) + sum(val_class_counts.values()) + sum(test_class_counts.values())
print(f"\nTotal images successfully counted across all splits: {total_counted}")


--- Class Counts Verification ---
Number of images per class in the TRAIN directory (Top 5):
cercopithecidae    240
dactyloidae        240
formicidae         233
salamandridae      216
plethodontidae     216
dtype: int64

Number of images per class in the VALIDATION directory (Top 5):
cercopithecidae    30
dactyloidae        30
formicidae         29
salamandridae      27
plethodontidae     27
dtype: int64

Number of images per class in the TEST directory (Top 5):
cercopithecidae    30
dactyloidae        30
formicidae         29
salamandridae      27
plethodontidae     27
dtype: int64

Total images successfully counted across all splits: 11983
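
As a quick sanity check on the output above, the per-class counts can be turned into fractions to confirm they are close to the intended 80/10/10 split. The snippet below is a small sketch that hard-codes the five families and totals printed above; the helper name `split_fractions` is introduced here for illustration only.

```python
# Counts copied from the verification output above: (train, validation, test).
top5_counts = {
    "cercopithecidae": (240, 30, 30),
    "dactyloidae": (240, 30, 30),
    "formicidae": (233, 29, 29),
    "salamandridae": (216, 27, 27),
    "plethodontidae": (216, 27, 27),
}


def split_fractions(train, val, test):
    """Return each subset's share of the combined total."""
    total = train + val + test
    return train / total, val / total, test / total


for family, (tr, va, te) in top5_counts.items():
    f_train, f_val, f_test = split_fractions(tr, va, te)
    print(f"{family:16s} train={f_train:.3f} val={f_val:.3f} test={f_test:.3f}")

# Overall split sizes reported by the generators and the split log.
overall = split_fractions(9585, 1199, 1199)
print("overall", tuple(round(f, 3) for f in overall))
```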


In [None]:
data_base_path = DATA_TARGET_DIR


In [None]:
# Initialize data generators
data_util = DataGeneratorUtils(data_base_path)

train_generator = data_util.create_generators('train')
val_generator = data_util.create_generators('validation')


Found 9585 images belonging to 202 classes.
Found 1199 images belonging to 202 classes.


In [None]:
NUM_CLASSES = train_generator.num_classes

In [None]:
# Calculate Class Weights (Crucial for rare species imbalance)
class_weights = data_util.calculate_class_weights(train_generator)

print(f"\nSetup complete. Total classes: {NUM_CLASSES}")
print(f"Train samples: {train_generator.samples}, Validation samples: {val_generator.samples}")

Class weights calculated for 202 classes.

Setup complete. Total classes: 202
Train samples: 9585, Validation samples: 1199
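
The body of `calculate_class_weights` is defined in `data_utils.py` and is not shown here. One common heuristic, which may or may not match what the helper does, is the "balanced" weighting `n_samples / (n_classes * count_c)`, so that rare classes contribute more to the loss. The sketch below assumes that heuristic; the function name `balanced_class_weights` and the toy label list are hypothetical.

```python
from collections import Counter


def balanced_class_weights(class_indices):
    """Map each class index to n_samples / (n_classes * class_count)."""
    counts = Counter(class_indices)
    n_samples = len(class_indices)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Toy labels (hypothetical): class 0 is common, class 2 is rare.
toy_labels = [0] * 240 + [1] * 60 + [2] * 20
weights = balanced_class_weights(toy_labels)
for cls, weight in weights.items():
    print(f"class {cls}: weight {weight:.3f}")
```

With a Keras directory iterator, the per-image integer labels are available as `train_generator.classes`, and a dictionary of this shape is what `class_weight=` expects in `fit` or `tuner.search`.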


## Hyperparameter Tuner and Callbacks

In [1]:
import keras_tuner as kt  # KerasTuner (pip package: keras-tuner), provides kt.Hyperband
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

In [None]:
PROJECT_NAME = 'custom_cnn_hyperband'

In [None]:
# Early Stopping: Stops training if val_loss doesn't improve for 5 epochs
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True,
    verbose=1
)

In [None]:
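# Reduce LR on Plateau: multiplies the LR by 0.2 if val_loss doesn't improve for 5 epochs.
# Note: EarlyStopping uses the same patience, so training will usually stop on the epoch
# this would first trigger; a smaller patience here (e.g. 2-3) would let the lower LR act.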
reduce_lr = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.2,
    patience=5,
    min_lr=1e-6,
    verbose=1
)

In [None]:
callbacks = [early_stopping, reduce_lr]
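
Both callbacks above rely on the same idea: count consecutive epochs without improvement in `val_loss` and act once the count reaches `patience`. The sketch below is a simplified model of that counter, not the Keras implementation (the real callbacks differ in details such as their `min_delta` defaults and cooldown handling). The function name and the loss values are made up for illustration.

```python
def patience_trigger_epoch(val_losses, patience, min_delta=0.0):
    """Return the 0-based epoch at which `patience` consecutive epochs
    have passed without improvement, or None if that never happens."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss  # new best value resets the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: best value at epoch 1, then a plateau.
val_losses = [1.00, 0.90, 0.95, 0.96, 0.97, 0.98, 0.99]

stop_epoch = patience_trigger_epoch(val_losses, patience=5)
lr_drop_same = patience_trigger_epoch(val_losses, patience=5)
lr_drop_short = patience_trigger_epoch(val_losses, patience=2)
print(stop_epoch, lr_drop_same, lr_drop_short)
```

In this simplified model, equal patience values make the learning-rate drop and the stop fall on the same epoch, while a shorter patience for the learning-rate drop leaves a few epochs for the reduced rate to have an effect.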

In [None]:
# Instantiate the Hyperband tuner
tuner = kt.Hyperband(
    hypermodel=build_hypermodel,
    objective='val_accuracy', # Maximize validation accuracy
    max_epochs=30,           # Max epochs for a full training run
    factor=3,                # Halving factor for Hyperband
    directory='/content/drive/MyDrive/NOVA_IMS/Deep_Learning_Project/outputs', # Path to save results
    project_name=PROJECT_NAME,
    overwrite=True           # Overwrite previous search results
)

print("Tuner search space summary:")
tuner.search_space_summary()
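
To give a feel for what `factor=3` and `max_epochs=30` control, the sketch below prints one successive-halving schedule, the building block of Hyperband: many configurations are trained briefly, roughly the best third survive, and the survivors get three times the epoch budget. This illustrates the general idea only. KerasTuner runs several such brackets and sizes them with its own rules, and the starting values `n_configs=27` and `min_epochs=1` are assumptions chosen for the example.

```python
def successive_halving_schedule(n_configs, min_epochs, max_epochs, factor=3):
    """Return a list of (configs_kept, epochs_per_config) rounds."""
    rounds = []
    epochs = min_epochs
    while True:
        rounds.append((n_configs, min(epochs, max_epochs)))
        if n_configs == 1 or epochs >= max_epochs:
            break
        n_configs = max(1, n_configs // factor)  # keep roughly the top 1/factor
        epochs *= factor                         # give survivors more budget
    return rounds


schedule = successive_halving_schedule(n_configs=27, min_epochs=1, max_epochs=30)
for configs, epochs in schedule:
    print(f"{configs:2d} configurations x {epochs:2d} epochs")
```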

In [None]:
# Run the search using the training and validation generators and class weights
tuner.search(
    train_generator,
    epochs=30,  # Hyperband assigns each trial's epoch budget (up to max_epochs), so this is likely overridden
    validation_data=val_generator,
    callbacks=callbacks,
    class_weight=class_weights,
    verbose=1
)