# Custom Image Classification Project with CNNs

I wanted this notebook to feel like a real end-to-end experiment instead of just plugging in a model and reporting one number. I tried three approaches on the `tf_flowers` dataset: two transfer-learning models and one custom CNN. The goal is to compare speed, performance, and failure modes in a practical way.

## 1) Setup and imports

In [None]:
import os
import time
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import tensorflow_datasets as tfds

from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

print('TensorFlow version:', tf.__version__)
AUTOTUNE = tf.data.AUTOTUNE
SEED = 42
np.random.seed(SEED)
tf.random.set_seed(SEED)

## 2) Load dataset (`tf_flowers`) and inspect basic info

In TFDS, `tf_flowers` ships with a single `train` split, so I split it manually into train/validation/test (70/15/15). This keeps the workflow closer to a real project.

In [None]:
(ds_full, ds_info) = tfds.load(
    'tf_flowers',
    split='train',
    shuffle_files=True,
    with_info=True,
    as_supervised=True
)

num_examples = ds_info.splits['train'].num_examples
class_names = ds_info.features['label'].names
num_classes = ds_info.features['label'].num_classes

print(f'Total images: {num_examples}')
print(f'Number of classes: {num_classes}')
print('Class names:', class_names)
print('Task: multi-class image classification')

In [None]:
# Manual split: 70% train, 15% val, 15% test
train_pct, val_pct = 70, 15
train_split = f'train[:{train_pct}%]'
val_split = f'train[{train_pct}%:{train_pct+val_pct}%]'
test_split = f'train[{train_pct+val_pct}%:]'

ds_train_raw, ds_val_raw, ds_test_raw = tfds.load(
    'tf_flowers',
    split=[train_split, val_split, test_split],
    as_supervised=True,
    shuffle_files=True
)

print('Split sizes (approx):')
print('Train:', tf.data.experimental.cardinality(ds_train_raw).numpy())
print('Val  :', tf.data.experimental.cardinality(ds_val_raw).numpy())
print('Test :', tf.data.experimental.cardinality(ds_test_raw).numpy())

`tf_flowers` has about 3,670 images across 5 flower categories, and image sizes vary quite a bit. So resizing is necessary before batching.
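
To make the 70/15/15 split concrete, here is a small sketch of the arithmetic behind the percent slices. It assumes the dataset size of 3,670 quoted above and uses plain integer truncation for the boundaries; TFDS applies its own rounding rule, so the real counts can differ by an image. `split_sizes` is a helper name made up for this illustration, not part of TFDS.

```python
def split_sizes(total, train_pct=70, val_pct=15):
    """Approximate train/val/test sizes for percent-based slice boundaries."""
    # Boundaries are computed on the cumulative percentages, like the slice strings.
    train_end = total * train_pct // 100
    val_end = total * (train_pct + val_pct) // 100
    return train_end, val_end - train_end, total - val_end


sizes = split_sizes(3670)  # 3670 is the approximate dataset size (assumption)
print("train/val/test:", sizes)
```

The point is mostly that validation and test end up with only about 550 images each, so small differences in accuracy between models should be read with some caution.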

In [None]:
# Quick class distribution from the full dataset
label_counts = np.zeros(num_classes, dtype=int)

for _, label in tfds.as_numpy(ds_full):
    label_counts[label] += 1

plt.figure(figsize=(8,4))
plt.bar(class_names, label_counts)
plt.title('Class distribution in tf_flowers')
plt.ylabel('Count')
plt.xticks(rotation=20)
plt.show()

for name, count in zip(class_names, label_counts):
    print(f'{name:10s}: {count}')

In [None]:
# Show a grid of sample images
plt.figure(figsize=(10, 8))
for i, (img, label) in enumerate(ds_full.take(12)):
    ax = plt.subplot(3, 4, i + 1)
    plt.imshow(img)
    plt.title(class_names[int(label)])
    plt.axis('off')
plt.suptitle('Sample images from tf_flowers', y=1.02)
plt.tight_layout()
plt.show()

## 3) Preprocessing and augmentation

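I resize everything to `224x224` and cast to `float32`, but leave pixel values in the `0..255` range: each model does its own scaling (EfficientNetB0 rescales internally, MobileNetV2 uses its `preprocess_input`, and the custom CNN starts with a `Rescaling` layer). For augmentation, I used random horizontal flips, rotations, and zoom. This seemed to help generalization in quick experiments because flowers can appear from different angles and distances.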
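
One detail worth making explicit is the pixel range each model ends up seeing. The sketch below is a plain-Python stand-in, not the Keras code itself. It mirrors the scaling as I understand it: EfficientNet's `preprocess_input` passes values through because rescaling lives inside the model, MobileNetV2's maps `0..255` to `-1..1`, and the custom CNN uses `Rescaling(1./255)`. The three function names are made up for illustration.

```python
def efficientnet_input(pixel):
    # Pass-through: the Keras EfficientNet model rescales internally.
    return float(pixel)


def mobilenet_v2_input(pixel):
    # MobileNetV2 preprocessing maps 0..255 to -1..1.
    return pixel / 127.5 - 1.0


def custom_cnn_input(pixel):
    # Rescaling(1./255) maps 0..255 to 0..1.
    return pixel / 255.0


for name, fn in [("EfficientNetB0", efficientnet_input),
                 ("MobileNetV2", mobilenet_v2_input),
                 ("CustomCNN", custom_cnn_input)]:
    print(f"{name:15s} min={fn(0):.1f} max={fn(255):.1f}")
```

Because the three models expect different ranges, the shared pipeline has to hand over raw `0..255` values; dividing by 255 in the pipeline would scale the inputs twice for the models that rescale on their own.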

In [None]:
IMG_SIZE = (224, 224)
BATCH_SIZE = 32

augmentation = tf.keras.Sequential([
    layers.RandomFlip('horizontal'),
    layers.RandomRotation(0.15),
    layers.RandomZoom(0.15),
], name='augmentation')


def preprocess(image, label):
    # Resize and cast only: pixel values stay in 0..255 because each model rescales on its own.
    image = tf.image.resize(image, IMG_SIZE)
    image = tf.cast(image, tf.float32)
    return image, label


def make_dataset(ds, training=False):
    ds = ds.map(preprocess, num_parallel_calls=AUTOTUNE)
    if training:
        ds = ds.shuffle(1000, seed=SEED)
        ds = ds.map(lambda x, y: (augmentation(x, training=True), y), num_parallel_calls=AUTOTUNE)
    ds = ds.batch(BATCH_SIZE).prefetch(AUTOTUNE)
    return ds

train_ds = make_dataset(ds_train_raw, training=True)
val_ds = make_dataset(ds_val_raw, training=False)
test_ds = make_dataset(ds_test_raw, training=False)

In [None]:
# Visualize augmentation examples
sample_images, sample_labels = next(iter(ds_train_raw.map(preprocess).batch(6)))

plt.figure(figsize=(12, 6))
for i in range(6):
    aug_img = augmentation(tf.expand_dims(sample_images[i], axis=0), training=True)[0]
    ax = plt.subplot(2, 3, i + 1)
    plt.imshow(tf.clip_by_value(aug_img / 255.0, 0.0, 1.0))
    plt.title(class_names[int(sample_labels[i])])
    plt.axis('off')
plt.suptitle('Augmented image examples', y=1.02)
plt.tight_layout()
plt.show()

## 4) Training utilities

I’m using early stopping so training can stop when validation performance stalls. It saves time and usually prevents overfitting from dragging on.
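
To show what `patience=3` means in practice, here is a simplified plain-Python sketch of the stopping rule. It is an approximation of the idea rather than the Keras implementation (no `min_delta`, no weight handling), and `early_stopping_epochs` plus the loss values are made up for illustration.

```python
def early_stopping_epochs(val_losses, patience=3):
    """Return (stopped_epoch, best_epoch), both 0-indexed.

    stopped_epoch is None if the loss list ends before patience runs out.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Hypothetical validation losses, not taken from any run in this notebook.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.67, 0.50]
stopped, best = early_stopping_epochs(val_losses)
print(f"stopped after epoch {stopped}, best epoch was {best}")
```

In this toy run, training stops at epoch 5 and never sees the later drop to 0.50. That is the tradeoff with a short patience: it saves time, but it can cut off a run that would have recovered.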

Quick choices I made here: I kept **224x224** because it is the default input size for both EfficientNetB0 and MobileNetV2 in Keras, and it is a good speed/quality tradeoff on this dataset. I used **Adam** because it converges fast in transfer-learning setups. During fine-tuning I dropped the learning rate from `1e-3` to `1e-5` to avoid destroying pretrained features too quickly. I didn’t use class weights: the classes are imbalanced, but not severely, and I wanted to keep the baseline training setup simple first.

In [None]:
def get_callbacks(model_name):
    os.makedirs('checkpoints', exist_ok=True)
    ckpt_path = f'checkpoints/{model_name}.keras'
    return [
        EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
        ModelCheckpoint(ckpt_path, monitor='val_loss', save_best_only=True)
    ]


def plot_history(history, title='Training history'):
    hist = history.history
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    axes[0].plot(hist.get('accuracy', []), label='train_acc')
    axes[0].plot(hist.get('val_accuracy', []), label='val_acc')
    axes[0].set_title(f'{title} - Accuracy')
    axes[0].legend()

    axes[1].plot(hist.get('loss', []), label='train_loss')
    axes[1].plot(hist.get('val_loss', []), label='val_loss')
    axes[1].set_title(f'{title} - Loss')
    axes[1].legend()

    plt.tight_layout()
    plt.show()


def evaluate_model(model, ds, model_name):
    y_true, y_pred = [], []
    for x_batch, y_batch in ds:
        # predict_on_batch avoids the per-call overhead of predict() inside a Python loop
        probs = model.predict_on_batch(x_batch)
        preds = np.argmax(probs, axis=1)
        y_true.extend(y_batch.numpy())
        y_pred.extend(preds)

    acc = accuracy_score(y_true, y_pred)
    macro_f1 = f1_score(y_true, y_pred, average='macro')

    print(f'\n{model_name} Test Accuracy: {acc:.4f}')
    print(f'{model_name} Macro F1: {macro_f1:.4f}')

    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(6,5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=class_names, yticklabels=class_names)
    plt.title(f'{model_name} Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.xticks(rotation=20)
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()

    print('Classification Report:')
    print(classification_report(y_true, y_pred, target_names=class_names, digits=4, zero_division=0))

    return {
        'accuracy': acc,
        'macro_f1': macro_f1,
        'y_true': np.array(y_true),
        'y_pred': np.array(y_pred)
    }

## 5) Model 1 — EfficientNetB0 (transfer learning)

I started with EfficientNetB0 because it usually gives a nice balance of performance and efficiency.

### Stage 1
Freeze backbone, train classification head.

### Stage 2
Unfreeze part of the backbone and fine-tune with a smaller learning rate.
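
The Stage 2 rule ("freeze everything except the last N layers") can be sketched without TensorFlow. `StandInLayer` below is a hypothetical stand-in that only carries a `name` and a `trainable` flag, and `unfreeze_last` is a made-up helper; the real cells do the same thing on the Keras backbone's `layers` list.

```python
from dataclasses import dataclass


@dataclass
class StandInLayer:
    """Hypothetical stand-in for a Keras layer; only the fields used here."""
    name: str
    trainable: bool = False


def unfreeze_last(layers, n):
    """Mark the last n layers trainable and freeze everything before them."""
    cutoff = max(len(layers) - n, 0)  # if n exceeds the depth, everything is unfrozen
    for i, layer in enumerate(layers):
        layer.trainable = i >= cutoff
    return [layer.name for layer in layers if layer.trainable]


backbone = [StandInLayer(f"block_{i}") for i in range(8)]
print(unfreeze_last(backbone, 3))
```

One thing the stand-in cannot show: in the real models the backbone is called with `training=False`, so batch-norm layers stay in inference mode even after they are marked trainable.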

In [None]:
def build_efficientnet_model(num_classes):
    base = tf.keras.applications.EfficientNetB0(
        include_top=False,
        weights='imagenet',
        input_shape=(224, 224, 3)
    )
    base.trainable = False

    inputs = layers.Input(shape=(224, 224, 3))
    # Keras EfficientNet expects 0..255 inputs and rescales inside the model, so this
    # preprocess_input is a pass-through; it is kept for symmetry with MobileNetV2.
    x = layers.Lambda(tf.keras.applications.efficientnet.preprocess_input, name='efficientnet_preprocess')(inputs)
    x = base(x, training=False)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)

    model = models.Model(inputs, outputs)
    return model, base


eff_model, eff_base = build_efficientnet_model(num_classes)
eff_model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

start = time.time()
hist_eff_stage1 = eff_model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=5,
    callbacks=get_callbacks('efficientnet_stage1'),
    verbose=1
)

# Fine-tune: unfreeze last ~30 layers
eff_base.trainable = True
for layer in eff_base.layers[:-30]:
    layer.trainable = False

eff_model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

hist_eff_stage2 = eff_model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=4,
    callbacks=get_callbacks('efficientnet_stage2'),
    verbose=1
)

eff_time = time.time() - start
print(f'EfficientNet total training time: {eff_time:.1f} sec')

In [None]:
plot_history(hist_eff_stage1, 'EfficientNet Stage 1')
plot_history(hist_eff_stage2, 'EfficientNet Stage 2')

## 6) Model 2 — MobileNetV2 (transfer learning)

MobileNetV2 is lighter, so I expected faster training. I used the same two-stage strategy to keep the comparison fair.

In [None]:
def build_mobilenet_model(num_classes):
    base = tf.keras.applications.MobileNetV2(
        include_top=False,
        weights='imagenet',
        input_shape=(224, 224, 3)
    )
    base.trainable = False

    inputs = layers.Input(shape=(224, 224, 3))
    x = layers.Lambda(tf.keras.applications.mobilenet_v2.preprocess_input, name='mobilenet_preprocess')(inputs)
    x = base(x, training=False)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)

    model = models.Model(inputs, outputs)
    return model, base


mob_model, mob_base = build_mobilenet_model(num_classes)
mob_model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

start = time.time()
hist_mob_stage1 = mob_model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=5,
    callbacks=get_callbacks('mobilenet_stage1'),
    verbose=1
)

mob_base.trainable = True
for layer in mob_base.layers[:-30]:
    layer.trainable = False

mob_model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

hist_mob_stage2 = mob_model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=4,
    callbacks=get_callbacks('mobilenet_stage2'),
    verbose=1
)

mob_time = time.time() - start
print(f'MobileNet total training time: {mob_time:.1f} sec')

In [None]:
plot_history(hist_mob_stage1, 'MobileNet Stage 1')
plot_history(hist_mob_stage2, 'MobileNet Stage 2')

## 7) Model 3 — Custom CNN from scratch

I decided not to make this network huge because I wanted training to stay quick. I still added batch norm + dropout to stabilize training and reduce overfitting.
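
To put a number on "not huge", here is a back-of-the-envelope parameter count for a network with three 3x3 conv blocks (32, 64, 128 filters), batch norm after each conv, and a dense head on the pooled 128 features. It assumes RGB input and the 5 classes of `tf_flowers`; the helper names are made up for this sketch, and `model.summary()` is the authoritative check.

```python
def conv2d_params(in_channels, filters, kernel=3):
    # kernel*kernel weights per (input channel, filter) pair, plus one bias per filter
    return kernel * kernel * in_channels * filters + filters


def batchnorm_params(channels):
    # gamma and beta (trainable) plus moving mean and variance (non-trainable)
    return 4 * channels


def dense_params(in_features, units):
    return in_features * units + units


channels = [3, 32, 64, 128]  # RGB input followed by the three conv widths
total = 0
for c_in, c_out in zip(channels, channels[1:]):
    total += conv2d_params(c_in, c_out) + batchnorm_params(c_out)
total += dense_params(128, 5)  # global average pooling leaves 128 features
print(f"approximate parameter count: {total:,}")
```

That comes to roughly 95k parameters, far fewer than the pretrained backbones, which have millions.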

In [None]:
def build_custom_cnn(num_classes):
    inputs = layers.Input(shape=(224, 224, 3))

    x = layers.Rescaling(1./255, name='custom_rescale')(inputs)
    x = layers.Conv2D(32, 3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.MaxPooling2D()(x)

    x = layers.Conv2D(64, 3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.MaxPooling2D()(x)

    x = layers.Conv2D(128, 3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)

    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.4)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)

    return models.Model(inputs, outputs)


custom_model = build_custom_cnn(num_classes)
custom_model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                     loss='sparse_categorical_crossentropy',
                     metrics=['accuracy'])

start = time.time()
hist_custom = custom_model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=12,
    callbacks=get_callbacks('custom_cnn'),
    verbose=1
)
custom_time = time.time() - start
print(f'Custom CNN total training time: {custom_time:.1f} sec')

In [None]:
plot_history(hist_custom, 'Custom CNN')

## 8) Evaluation results

Now I evaluate each model on the held-out test split using accuracy, macro F1, confusion matrix, and full classification report.
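
Since macro F1 is the metric I lean on most, here is a from-scratch sketch of how it is computed: one F1 score per class, then an unweighted mean. It should agree with `f1_score(..., average='macro')` when every class appears in the labels; the toy labels below are made up and are not results from this notebook.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); define it as 0 when the class never appears
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes


# Toy labels for three classes (illustrative only).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred, num_classes=3)
print(f"macro F1: {score:.4f}")
```

Here accuracy is 0.75 while macro F1 is about 0.74, because the smallest class (class 2) has the weakest F1 and counts just as much as the largest one. That is why I report both.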

In [None]:
results = {}
results['EfficientNetB0'] = evaluate_model(eff_model, test_ds, 'EfficientNetB0')
results['MobileNetV2'] = evaluate_model(mob_model, test_ds, 'MobileNetV2')
results['CustomCNN'] = evaluate_model(custom_model, test_ds, 'CustomCNN')

In [None]:
comparison_df = {
    'Model': ['EfficientNetB0', 'MobileNetV2', 'CustomCNN'],
    'Val_Accuracy': [
        max(hist_eff_stage2.history['val_accuracy']) if 'val_accuracy' in hist_eff_stage2.history else np.nan,
        max(hist_mob_stage2.history['val_accuracy']) if 'val_accuracy' in hist_mob_stage2.history else np.nan,
        max(hist_custom.history['val_accuracy']) if 'val_accuracy' in hist_custom.history else np.nan,
    ],
    'Test_Accuracy': [results['EfficientNetB0']['accuracy'], results['MobileNetV2']['accuracy'], results['CustomCNN']['accuracy']],
    'Macro_F1': [results['EfficientNetB0']['macro_f1'], results['MobileNetV2']['macro_f1'], results['CustomCNN']['macro_f1']],
    'Train_Time_sec': [eff_time, mob_time, custom_time],
}

def overfit_note(history_obj):
    h = history_obj.history
    if 'accuracy' not in h or 'val_accuracy' not in h:
        return 'Not enough logs'
    gap = max(h['accuracy']) - max(h['val_accuracy'])
    if gap < 0.03:
        return 'Low overfitting'
    if gap < 0.08:
        return 'Moderate overfitting'
    return 'Higher overfitting'

import pandas as pd
comp = pd.DataFrame(comparison_df)
comp['Overfitting_Behavior'] = [
    overfit_note(hist_eff_stage2),
    overfit_note(hist_mob_stage2),
    overfit_note(hist_custom)
]
comp = comp.sort_values(by='Test_Accuracy', ascending=False)
comp


## 9) Error analysis

Interestingly, even when overall accuracy is good, some classes still get mixed up because flower shapes/colors overlap. I’ll visualize a few correct and wrong predictions from the best model.

In [None]:
best_model_name = comp.iloc[0]['Model']
name_to_model = {
    'EfficientNetB0': eff_model,
    'MobileNetV2': mob_model,
    'CustomCNN': custom_model
}
best_model = name_to_model[best_model_name]
print('Best model based on test accuracy:', best_model_name)

# Gather test images/preds for display
all_images, all_true, all_pred = [], [], []
for x_batch, y_batch in test_ds:
    # predict_on_batch avoids the per-call overhead of predict() inside a Python loop
    probs = best_model.predict_on_batch(x_batch)
    preds = np.argmax(probs, axis=1)
    all_images.append(x_batch.numpy())
    all_true.append(y_batch.numpy())
    all_pred.append(preds)

all_images = np.concatenate(all_images, axis=0)
all_true = np.concatenate(all_true, axis=0)
all_pred = np.concatenate(all_pred, axis=0)

correct_idx = np.where(all_true == all_pred)[0]
wrong_idx = np.where(all_true != all_pred)[0]

plt.figure(figsize=(12, 5))
for i, idx in enumerate(correct_idx[:6]):
    ax = plt.subplot(2, 3, i+1)
    plt.imshow(all_images[idx] / 255.0)
    plt.title(f'T:{class_names[all_true[idx]]} | P:{class_names[all_pred[idx]]}')
    plt.axis('off')
plt.suptitle('Correct predictions', y=1.03)
plt.tight_layout()
plt.show()

plt.figure(figsize=(12, 5))
for i, idx in enumerate(wrong_idx[:6]):
    ax = plt.subplot(2, 3, i+1)
    plt.imshow(all_images[idx] / 255.0)
    plt.title(f'T:{class_names[all_true[idx]]} | P:{class_names[all_pred[idx]]}')
    plt.axis('off')
plt.suptitle('Wrong predictions', y=1.03)
plt.tight_layout()
plt.show()

### Short observations from errors

From the wrong examples, I noticed visually similar classes are the main source of confusion (especially when the flower occupies only part of the image or lighting is odd). The model also struggles on samples where background colors dominate and petals are less clear.
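
To back up the visual impression with counts, the most frequent (true, predicted) error pairs can be tallied directly from the label arrays. This is a minimal sketch: `top_confusions` is a made-up helper, and the class names and labels are toy values rather than results from this notebook. With the real arrays, `all_true`, `all_pred`, and `class_names` would take their place.

```python
from collections import Counter


def top_confusions(y_true, y_pred, class_names, k=3):
    """Return the k most frequent (true class, predicted class, count) errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return [(class_names[t], class_names[p], n)
            for (t, p), n in errors.most_common(k)]


# Toy data for illustration only.
toy_names = ["roses", "tulips", "daisy"]
toy_true = [0, 0, 0, 1, 1, 2, 2, 2]
toy_pred = [1, 1, 0, 1, 0, 2, 2, 0]

for true_name, pred_name, count in top_confusions(toy_true, toy_pred, toy_names):
    print(f"{true_name} predicted as {pred_name}: {count}")
```

Unlike the confusion-matrix heatmap, this list is directional, so it shows whether class A is mistaken for class B more often than the other way around.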

## 10) Model comparison discussion

- **Fastest training:** usually MobileNetV2 in my runs, which matches its lightweight design.
- **Best generalization:** depends on the run, but one of the transfer-learning models usually wins on macro F1.
- **Most overfitting risk:** the custom CNN can overfit faster if I push epochs too high.
- **Overall pick:** I look at validation accuracy + test accuracy + macro F1 together, then check overfitting behavior before deciding.

Due to time constraints I did not implement Grad-CAM, but it would help visualize which image regions drive each prediction.

## 11) Conclusion

This was a useful comparison. Transfer learning gave strong results quickly, while the custom CNN was good as a baseline but needed more tuning to match pretrained backbones. If I had more time, I’d do stronger hyperparameter tuning and maybe class-balanced augmentation.

## 12) LLM usage note

“I used an LLM mainly to help structure the notebook and debug some TensorFlow issues. The modelling decisions and interpretation were my own.”