# Rare Species Classification using ConvNeXt-Small

This notebook presents the training pipeline for a rare species image classification task.
The goal is to classify images into biological families using a deep convolutional neural
network based on ConvNeXt-Small, pretrained on ImageNet.

## Imports and Reproducibility

This section loads all required libraries for data manipulation (Pandas, NumPy), visualization (Matplotlib), and deep learning (TensorFlow/Keras). We also set a global random seed to ensure reproducibility.

In [1]:
import os
import random
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ConvNeXtSmall
from tensorflow.keras.applications.convnext import preprocess_input as convnext_preprocess
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping, TensorBoard
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix, classification_report
from PIL import Image
import matplotlib.pyplot as plt
from datetime import datetime

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

IMG_SIZE = (224, 224)
BATCH_SIZE = 16

## Hardware Configuration

GPU availability is checked and memory growth is enabled so that TensorFlow allocates
GPU memory on demand instead of reserving all of it at start-up.

In [2]:
print("\n" + "="*80)
print("GPU CHECK")
print("="*80)
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    print(f"✓ {len(gpus)} GPU(s) available")
    for gpu in gpus:
        print(f"  - {gpu}")
        tf.config.experimental.set_memory_growth(gpu, True)
else:
    print("⚠ No GPU detected - training will be slow!")


GPU CHECK
⚠ No GPU detected - training will be slow!


## Dataset Loading and Cleaning

The dataset metadata is loaded from a CSV file and the full file paths are constructed. We then drop entries whose label is missing or whose image file does not exist, remove duplicate rows and duplicate file paths, and make a stratified split into Train (80%), Validation (10%), and Test (10%) sets so that each split preserves the class distribution.

In [3]:
print("\n" + "="*80)
print("DATA PREPROCESSING")
print("="*80)

df = pd.read_csv(r"C:\Users\inesb\Downloads\Deep-Learning-project\metadata.csv")
data_root_path = r"C:\Users\inesb\Downloads\rare_species"
df['full_path'] = df['file_path'].apply(lambda x: os.path.join(data_root_path, x))

# Remove rows with missing file paths or family labels
df = df.dropna(subset=['file_path', 'family']).reset_index(drop=True)

# Check for missing files
df['exists'] = df['full_path'].apply(os.path.exists)
missing = df[~df['exists']]

print("Missing images:", len(missing))
if len(missing) > 0:
    print(missing[['file_path']].head())

# Drop rows with missing images
df = df[df['exists']].reset_index(drop=True)

# Duplicate rows in metadata
duplicate_rows = df[df.duplicated()]
print("Duplicate metadata rows:")
print(duplicate_rows)
df = df.drop_duplicates().reset_index(drop=True)

# Duplicate image paths
duplicate_paths = df[df.duplicated(subset='full_path')]
print("Duplicate file paths:")
print(duplicate_paths)
df = df.drop_duplicates(subset='full_path').reset_index(drop=True)

# Encode each category in the target variable
df['family_encoded'] = pd.factorize(df['family'])[0]
unique_families = df['family'].unique()
print(df['family'].nunique()) # 202

# Stratified split: 80% train, 10% val, 10% test
train_val_df, test_df = train_test_split(df, test_size=0.10, stratify=df["family"], random_state=SEED)
train_df, val_df = train_test_split(train_val_df, test_size=0.1111, stratify=train_val_df["family"], random_state=SEED)
print(f"Train: {len(train_df)}")
print(f"Val: {len(val_df)}")
print(f"Test: {len(test_df)}")


DATA PREPROCESSING
Missing images: 0
Duplicate metadata rows:
Empty DataFrame
Columns: [rare_species_id, eol_content_id, eol_page_id, kingdom, phylum, family, file_path, full_path, exists]
Index: []
Duplicate file paths:
Empty DataFrame
Columns: [rare_species_id, eol_content_id, eol_page_id, kingdom, phylum, family, file_path, full_path, exists]
Index: []
202
Train: 9585
Val: 1199
Test: 1199
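
The split sizes above follow from the two-step procedure: holding out 10% for testing leaves 90%, and 0.1111 of that remainder is roughly another 10% of the full dataset. Here is a minimal sketch of the arithmetic, assuming the held-out size is rounded up as scikit-learn does. The helper name `split_sizes` is illustrative, and the total of 11,983 images is simply the sum of the three reported sizes.

```python
import math

def split_sizes(n_samples, test_frac, val_frac_of_rest):
    """Return (train, val, test) sizes for a two-step hold-out split."""
    # Step 1: hold out the test set (held-out size rounded up)
    n_test = math.ceil(test_frac * n_samples)
    n_rest = n_samples - n_test
    # Step 2: hold out the validation set from what remains
    n_val = math.ceil(val_frac_of_rest * n_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test

sizes = split_sizes(11983, 0.10, 0.1111)
print(sizes)  # (9585, 1199, 1199)

# 0.9 * 0.1111 is close to 0.10, which is why the second fraction is not 0.10
print(round(0.9 * 0.1111, 4))  # 0.1
```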


## Image Preprocessing and Data Augmentation

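Images are resized to 224×224 pixels and passed through `convnext_preprocess`. In Keras
this function is a pass-through placeholder: ConvNeXt models from `keras.applications`
normalize their inputs inside the network, so the generators feed raw pixel values in the
[0, 255] range. Data augmentation is applied only to the training set to improve
generalization.

The validation and test sets are left unaugmented.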

In [4]:
print("\n" + "="*80)
print("BUILDING DATA GENERATORS")
print("="*80)

train_datagen = ImageDataGenerator(
    preprocessing_function=convnext_preprocess,
    rotation_range=25,
    width_shift_range=0.2,
    height_shift_range=0.2,
    zoom_range=0.25,
    horizontal_flip=True,
    shear_range=0.15,
    brightness_range=[0.8, 1.2],
    fill_mode='nearest'
)

val_test_datagen = ImageDataGenerator(
    preprocessing_function=convnext_preprocess
)

train_ds = train_datagen.flow_from_dataframe(
    dataframe=train_df,
    x_col='full_path',
    y_col='family',
    target_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    class_mode='categorical',
    shuffle=True
)

val_ds = val_test_datagen.flow_from_dataframe(
    dataframe=val_df,
    x_col='full_path',
    y_col='family',
    target_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    class_mode='categorical',
    shuffle=False
)

test_ds = val_test_datagen.flow_from_dataframe(
    dataframe=test_df,
    x_col='full_path',
    y_col='family',
    target_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    class_mode='categorical',
    shuffle=False
)

print(f"Train batches: {len(train_ds)}")
print(f"Val batches: {len(val_ds)}")
print(f"Test batches: {len(test_ds)}")
print(f"Number of classes: {len(train_ds.class_indices)}")


BUILDING DATA GENERATORS
Found 9585 validated image filenames belonging to 202 classes.
Found 1199 validated image filenames belonging to 202 classes.
Found 1199 validated image filenames belonging to 202 classes.
Train batches: 600
Val batches: 75
Test batches: 75
Number of classes: 202
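
The batch counts reported above are the dataset sizes divided by the batch size, rounded up, because the generators emit a final partial batch. A small sketch of that arithmetic (the helper `num_batches` is illustrative; the batch size of 4 is the one used later in Stage 3):

```python
import math

def num_batches(n_images, batch_size):
    """Number of batches per epoch, counting a final partial batch."""
    return math.ceil(n_images / batch_size)

train_batches = num_batches(9585, 16)
val_batches = num_batches(1199, 16)
stage3_train_batches = num_batches(9585, 4)

print(train_batches)         # 600
print(val_batches)           # 75
print(stage3_train_batches)  # 2397

# Size of the last (partial) training batch at batch size 16
last_batch = 9585 - (train_batches - 1) * 16
print(last_batch)  # 1
```

Note that the last training batch contains a single image at batch size 16, which makes the statistics of that one step noisier than the rest.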


## Class Imbalance Handling

The dataset is highly imbalanced across species families. To mitigate this issue,
class weights are computed from the training labels and passed to `model.fit`, which
scales each sample's contribution to the loss by the weight of its class.

In [5]:
print("\n" + "="*80)
print("COMPUTING CLASS WEIGHTS")
print("="*80)

labels = train_ds.classes
weights = class_weight.compute_class_weight(
    class_weight="balanced",
    classes=np.unique(labels),
    y=labels
)
class_weights = dict(enumerate(weights))
print(f"Class weights computed for {len(class_weights)} classes")
print(f"Weight range: {min(weights):.3f} to {max(weights):.3f}")


COMPUTING CLASS WEIGHTS
Class weights computed for 202 classes
Weight range: 0.198 to 2.063
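
For reference, the "balanced" heuristic assigns each class the weight `n_samples / (n_classes * count_c)`, so rare classes get weights above 1 and frequent classes get weights below 1. The following is a pure-Python sketch of that formula on toy labels, not the scikit-learn implementation itself; `balanced_class_weights` and `toy_labels` are illustrative names.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight for class c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in sorted(counts.items())}

# Toy example: three classes with 6, 3 and 1 samples
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(toy_labels)
for cls, w in weights.items():
    print(cls, round(w, 3))

# With the reported 9585 training images and 202 classes, class sizes of
# 23 and 240 images reproduce the reported weight range
w_max = round(9585 / (202 * 23), 3)
w_min = round(9585 / (202 * 240), 3)
print(w_min, w_max)  # 0.198 2.063
```

Read backwards, the reported range suggests that the rarest family has about 23 training images and the most common about 240; this is an inference from the formula, not a figure printed by the notebook.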


## Building a Denoising Autoencoder

In [None]:
import tensorflow as tf

IMG_SIZE = (224, 224)
BATCH_SIZE = 32

train_i = tf.keras.utils.image_dataset_from_directory(r"C:\Users\inesb\Downloads\Deep_Learning\data\train",
    image_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    label_mode=None,
    shuffle=True)

val_i = tf.keras.utils.image_dataset_from_directory(r"C:\Users\inesb\Downloads\Deep_Learning\data\validation",
    image_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    label_mode=None,
    shuffle=False)

Found 9585 files.
Found 1199 files.


In [14]:
import tensorflow as tf

IMG_SIZE = (224, 224)
BATCH_SIZE = 32

paths = train_df["full_path"].values

train_i = tf.data.Dataset.from_tensor_slices(paths)

paths_val = val_df["full_path"].values

val_i = tf.data.Dataset.from_tensor_slices(paths_val)

In [15]:
noise_factor = 0.2

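def normalize_and_add_noise(path):
    # The datasets yield file paths (strings), so read and decode each image first;
    # casting the path itself to float raises "Cast string to float is not supported"
    img = tf.io.read_file(path)
    img = tf.image.decode_image(img, channels=3, expand_animations=False)
    img = tf.image.resize(img, IMG_SIZE)
    x = tf.cast(img, tf.float32) / 255.0
    noise = tf.random.normal(tf.shape(x)) * noise_factor
    x_noisy = tf.clip_by_value(x + noise, 0.0, 1.0)
    return x_noisy, x

In [16]:
# Batch the (noisy, clean) pairs so Keras receives 4-D image tensors
noisy_train_i = train_i.map(normalize_and_add_noise).batch(BATCH_SIZE)

noisy_val_i = val_i.map(normalize_and_add_noise).batch(BATCH_SIZE)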

In [17]:
from tensorflow.keras.models import Model
from tensorflow.keras import layers, losses

class Denoise(Model):
  def __init__(self):
    super().__init__()
    self.encoder = tf.keras.Sequential([
      # Match the 224x224 RGB images used throughout the notebook
      layers.Input(shape=(IMG_SIZE[0], IMG_SIZE[1], 3)),
      layers.Conv2D(16, (3, 3), activation='relu', padding='same', strides=2),
      layers.Conv2D(8, (3, 3), activation='relu', padding='same', strides=2)])

    self.decoder = tf.keras.Sequential([
      layers.Conv2DTranspose(8, kernel_size=3, strides=2, activation='relu', padding='same'),
      layers.Conv2DTranspose(16, kernel_size=3, strides=2, activation='relu', padding='same'),
      # Three output channels so the reconstruction matches the RGB target
      layers.Conv2D(3, kernel_size=(3, 3), activation='sigmoid', padding='same')])

  def call(self, x):
    encoded = self.encoder(x)
    decoded = self.decoder(encoded)
    return decoded

autoencoder = Denoise()

In [18]:
autoencoder.compile(optimizer='adam', loss=losses.MeanSquaredError())

autoencoder.fit(noisy_train_i, validation_data = noisy_val_i, epochs = 15)

Epoch 1/15


UnimplementedError: {{function_node __wrapped__MakeIterator_device_/job:localhost/replica:0/task:0/device:CPU:0}} Cast string to float is not supported
	 [[{{node Cast}}]] [Op:MakeIterator] name: 

In [13]:
autoencoder.summary()
autoencoder.trainable = False

## Model Architecture: ConvNeXt-Small

ConvNeXt-Small pretrained on ImageNet is used as the backbone feature extractor.
The original classification head is removed and replaced with a custom head consisting of:

- Global Average Pooling
- Dropout (rate 0.3)
- Fully connected layer (512 units) with ReLU activation
- Batch Normalization
- Dropout (rate 0.2)
- Final Softmax layer for family classification

In [14]:
def build_model(num_classes, trainable_backbone=False):
    """
    Create ConvNeXt-Small model with custom head

    Args:
        num_classes: Number of output classes
        trainable_backbone: Whether to make backbone trainable
    """
    print(f"\nBuilding model (backbone trainable: {trainable_backbone})")

    # Load pretrained ConvNeXt-Small
    base_model = ConvNeXtSmall(
        include_top=False,
        weights='imagenet',
        input_shape=(IMG_SIZE[0], IMG_SIZE[1], 3),
        pooling=None
    )

    base_model.trainable = trainable_backbone

    # Build custom head
    inputs = keras.Input(shape=(IMG_SIZE[0], IMG_SIZE[1], 3))
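    # The autoencoder is not chained in here: its output was previously discarded,
    # so the backbone has always received the raw inputs
    x = base_model(inputs, training=False)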
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(512, activation='relu')(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.2)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)

    model = keras.Model(inputs, outputs)

    print(f"Total parameters: {model.count_params():,}")
    trainable_params = sum([tf.size(w).numpy() for w in model.trainable_weights])
    print(f"Trainable parameters: {trainable_params:,}")

    return model

def unfreeze_top_layers(model, percentage=0.3):
    """
    Unfreeze top percentage of backbone layers

    Args:
        model: Keras model
        percentage: Percentage of layers to unfreeze (from the end)
    """
    base_model = model.layers[1]  # The ConvNeXt backbone
    total_layers = len(base_model.layers)
    unfreeze_from = int(total_layers * (1 - percentage))

    print(f"\nUnfreezing top {percentage*100}% of backbone layers")
    print(f"Total backbone layers: {total_layers}")
    print(f"Unfreezing from layer {unfreeze_from} onwards")

    base_model.trainable = True
    for layer in base_model.layers[:unfreeze_from]:
        layer.trainable = False

    trainable_params = sum([tf.size(w).numpy() for w in model.trainable_weights])
    print(f"Trainable parameters after unfreezing: {trainable_params:,}")


In [15]:
def get_callbacks(stage_name, patience=4):
    """Get training callbacks for a specific stage"""

    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")

    callbacks = [
        # Save best model
        ModelCheckpoint(
            filepath=f'model_{stage_name}_best.keras',
            monitor='val_accuracy',
            mode='max',
            save_best_only=True,
            verbose=1
        ),

        # Reduce learning rate when stuck
        ReduceLROnPlateau(
            monitor='val_accuracy',
            factor=0.5,
            patience=3,
            min_lr=1e-7,
            verbose=1,
            mode='max'
        ),

        # Early stopping
        EarlyStopping(
            monitor='val_accuracy',
            patience=patience,
            restore_best_weights=True,
            verbose=1,
            mode='max'
        ),

        # TensorBoard logging
        TensorBoard(
            log_dir=f'logs/{stage_name}_{timestamp}',
            histogram_freq=0
        )
    ]

    return callbacks

In [16]:
# Build model
print("\n" + "="*80)
print("MODEL ARCHITECTURE")
print("="*80)
num_classes = len(train_ds.class_indices)
model = build_model(num_classes=num_classes, trainable_backbone=False)
model.summary()


MODEL ARCHITECTURE

Building model (backbone trainable: False)
Total parameters: 49,954,090
Trainable parameters: 498,378
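
The trainable count reported above can be checked by hand, since only the custom head is trainable while the backbone is frozen. The sketch below assumes the pooled ConvNeXt-Small feature vector has 768 channels; the helper `dense_params` is illustrative. Pooling and dropout layers contribute no parameters.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

feature_dim = 768   # assumed width of the pooled ConvNeXt-Small features
hidden_units = 512
num_classes = 202

dense_hidden = dense_params(feature_dim, hidden_units)   # 393,728
bn_trainable = 2 * hidden_units                          # gamma and beta
bn_non_trainable = 2 * hidden_units                      # moving mean and variance
dense_output = dense_params(hidden_units, num_classes)   # 103,626

head_trainable = dense_hidden + bn_trainable + dense_output
head_total = head_trainable + bn_non_trainable
backbone_params = 49_954_090 - head_total

print(f"Trainable head parameters: {head_trainable:,}")  # 498,378
print(f"Total head parameters:     {head_total:,}")      # 499,402
print(f"Implied backbone size:     {backbone_params:,}") # 49,454,688
```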


## Training Strategy

Training is performed in three stages:

1. Training only the custom classification head with the backbone frozen
2. Fine-tuning the top 30% of the ConvNeXt backbone
3. Full fine-tuning of the backbone with Batch Normalization layers frozen

This gradual unfreezing strategy stabilizes training and improves performance.
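
The three stages can be summarized as a small configuration table. This is only a sketch that collects the hyperparameters used in the training cells of this notebook; the `Stage` dataclass and `SCHEDULE` list are illustrative names, not objects the notebook defines.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    backbone: str            # which part of the backbone is trainable
    learning_rate: float
    weight_decay: float
    max_epochs: int
    batch_size: int
    early_stop_patience: int

# Values copied from the Stage 1-3 training cells
SCHEDULE = [
    Stage("stage1", "frozen", 3e-4, 1e-4, 8, 16, 4),
    Stage("stage2", "top 30% of layers", 1e-4, 1e-4, 15, 16, 5),
    Stage("stage3", "all layers", 1e-5, 1e-5, 20, 4, 5),
]

for s in SCHEDULE:
    print(f"{s.name}: backbone={s.backbone}, lr={s.learning_rate:g}, "
          f"epochs<={s.max_epochs}, batch={s.batch_size}")

# The learning rate shrinks as more of the backbone becomes trainable
learning_rates = [s.learning_rate for s in SCHEDULE]
total_epoch_budget = sum(s.max_epochs for s in SCHEDULE)
print(total_epoch_budget)  # 43
```

Lowering the learning rate at each stage limits how far the pretrained weights can drift once they become trainable.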

## Stage 1: Training with Frozen Backbone

In the first stage, the ConvNeXt-Small backbone is kept frozen and only the newly
added classification head is trained. This allows the model to adapt high-level
features to the new task without destroying pretrained representations.

In [None]:
print("\n" + "="*80)
print("STAGE 1: Training with FROZEN backbone")
print("="*80)

# Compile model
model.compile(
    optimizer=tf.keras.optimizers.AdamW(
        learning_rate=3e-4,
        weight_decay=1e-4
    ),
    loss='categorical_crossentropy',
    metrics=['accuracy', tf.keras.metrics.TopKCategoricalAccuracy(k=5, name='top5_acc')]
)

# Train
history1 = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=8,
    class_weight=class_weights,
    callbacks=get_callbacks('stage1', patience=4),
    verbose=1
)

# Save
model.save('model_after_stage1.keras')
print(f"\nStage 1 COMPLETE! Model saved.")


STAGE 1: Training with FROZEN backbone


  self._warn_if_super_not_called()


Epoch 1/8
 79/600 ━━━━━━━━━━━━━━━━━━━━ 25:15 3s/step - accuracy: 0.0076 - loss: 5.6656 - top5_acc: 0.0306

## Stage 2: Partial Fine-Tuning (Top 30%)

In the second stage, the top 30% of ConvNeXt-Small layers are unfrozen.
A lower learning rate is used to refine pretrained features while
reducing the risk of overfitting.

In [4]:
print("\n" + "="*80)
print("STAGE 2: Training with PARTIAL backbone unfreezing (top 30%)")
print("="*80)

model = tf.keras.models.load_model('model_after_stage1.keras')  # Resume from the Stage 1 checkpoint

# Unfreeze top layers
unfreeze_top_layers(model, percentage=0.3)


# Compile with lower learning rate
model.compile(
    optimizer=tf.keras.optimizers.AdamW(
        learning_rate=1e-4,  # Raised from 1e-5, which was too low for the model to learn
        weight_decay=1e-4
    ),
    loss='categorical_crossentropy',
    metrics=['accuracy', tf.keras.metrics.TopKCategoricalAccuracy(k=5, name='top5_acc')]
)

# Train
history2 = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=15,
    class_weight=class_weights,
    callbacks=get_callbacks('stage2', patience=5),
    verbose=1
)

# Save
model.save('model_after_stage2.keras')
print(f"\nStage 2 COMPLETE! Model saved.")


STAGE 2: Training with PARTIAL backbone unfreezing (top 30%)



NameError: name 'unfreeze_top_layers' is not defined

## Stage 3: Full Fine-Tuning

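In the final stage, the whole backbone is unfrozen. The code keeps any Batch Normalization
layers frozen for training stability; since ConvNeXt normalizes with Layer Normalization,
this filter does not exclude any backbone layers in practice. Mixed-precision training and
a smaller batch size (4) are used to fit GPU memory constraints.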

In [36]:
# Stage 3: Fine-tune all layers
print("\n" + "="*80)
print("STAGE 3: Training with FULL backbone unfreezing")
print("="*80)

model = tf.keras.models.load_model('model_after_stage2.keras')

# --- Mixed precision ---
from tensorflow.keras import mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)

# --- Recreate generators with smaller batch size if needed ---
NEW_BATCH_SIZE = 4

train_ds_stage3 = train_datagen.flow_from_dataframe(
    dataframe=train_df,
    x_col='full_path',
    y_col='family',
    target_size=IMG_SIZE,
    batch_size=NEW_BATCH_SIZE,
    class_mode='categorical',
    shuffle=True
)

val_ds_stage3 = val_test_datagen.flow_from_dataframe(
    dataframe=val_df,
    x_col='full_path',
    y_col='family',
    target_size=IMG_SIZE,
    batch_size=NEW_BATCH_SIZE,
    class_mode='categorical',
    shuffle=False
)
# Alternative (disabled): unfreeze every backbone layer without filtering
# base_model = model.layers[1]
# base_model.trainable = True

# Unfreeze all layers but freeze BatchNorm layers
base_model = model.layers[1]
for layer in base_model.layers:
    if isinstance(layer, tf.keras.layers.BatchNormalization):
        layer.trainable = False
    else:
        layer.trainable = True

trainable_params = sum([tf.size(w).numpy() for w in model.trainable_weights])
print(f"Trainable parameters: {trainable_params:,}")

# Compile with even lower learning rate
opt = tf.keras.optimizers.AdamW(
    learning_rate=1e-5,
    weight_decay=1e-5
)
opt = mixed_precision.LossScaleOptimizer(opt)

model.compile(
    optimizer=opt,
    loss='categorical_crossentropy',
    metrics=['accuracy', tf.keras.metrics.TopKCategoricalAccuracy(k=5, name='top5_acc')]
)

# Train
history3 = model.fit(
    train_ds_stage3,
    validation_data=val_ds_stage3,
    epochs=20,
    class_weight=class_weights,
    callbacks=get_callbacks('stage3', patience=5),
    verbose=1
)

# Save
model.save('model_after_stage3.keras')
print(f"\nStage 3 COMPLETE! Model saved.")



STAGE 3: Training with FULL backbone unfreezing


ValueError: File not found: filepath=model_after_stage2.keras. Please ensure the file is an accessible `.keras` zip file.

## Final Evaluation

The final model is evaluated on the independent test set using accuracy
and macro-averaged precision, recall, and F1-score.
Macro metrics are reported to account for class imbalance.
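
To illustrate why macro averaging matters under imbalance, the toy example below scores a classifier that always predicts the majority class. It is a pure-Python sketch with made-up labels, not the scikit-learn code used in this notebook; undefined ratios are treated as 0.

```python
def per_class_prf(y_true, y_pred, cls):
    """Precision, recall and F1 for one class (0.0 when undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_scores(y_true, y_pred):
    """Unweighted mean of the per-class precision, recall and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    rows = [per_class_prf(y_true, y_pred, c) for c in classes]
    return tuple(sum(r[i] for r in rows) / len(rows) for i in range(3))

# Made-up labels: 8 samples of class 0, 2 of class 1, everything predicted as 0
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_precision, macro_recall, macro_f1 = macro_scores(y_true, y_pred)

print(f"accuracy        = {accuracy:.2f}")         # 0.80
print(f"macro precision = {macro_precision:.2f}")  # 0.40
print(f"macro recall    = {macro_recall:.2f}")     # 0.50
print(f"macro F1        = {macro_f1:.2f}")         # 0.44
```

Accuracy looks acceptable at 0.80, while the macro scores expose that the minority class is never predicted.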

In [11]:
def evaluate_model(model, test_ds):
    print("\n" + "="*80)
    print("FINAL EVALUATION ON TEST SET")
    print("="*80)

    results = model.evaluate(test_ds, verbose=1)
    test_ds.reset()
    predictions = model.predict(test_ds, verbose=1)

    # Convert probabilities to class labels
    y_pred = np.argmax(predictions, axis=1)
    y_true = test_ds.classes

    # Calculate metrics
    acc = accuracy_score(y_true, y_pred)
    f1_macro = f1_score(y_true, y_pred, average='macro')
    prec_macro = precision_score(y_true, y_pred, average='macro')
    rec_macro = recall_score(y_true, y_pred, average='macro')

    print("\nFinal test metrics:")
    print(f"Accuracy:          {acc:.2%}")
    print(f"Macro F1-Score:    {f1_macro:.2%}")
    print(f"Macro Precision:   {prec_macro:.2%}")
    print(f"Macro Recall:      {rec_macro:.2%}")

    return results, {'accuracy': acc, 'f1': f1_macro, 'precision': prec_macro, 'recall': rec_macro}

def plot_training_history(histories, stage_names):
    """Plot training curves for all stages with phase separation lines"""

    # Combine all histories
    acc = []
    val_acc = []
    loss = []
    val_loss = []

    for history in histories:
        if history is not None:
            acc.extend(history.history['accuracy'])
            val_acc.extend(history.history['val_accuracy'])
            loss.extend(history.history['loss'])
            val_loss.extend(history.history['val_loss'])

    epochs = range(1, len(acc) + 1)

    # Create figure
    plt.figure(figsize=(15, 5))

    # Plot Accuracy
    plt.subplot(1, 2, 1)
    plt.plot(epochs, acc, label='Training Accuracy')
    plt.plot(epochs, val_acc, label='Validation Accuracy')

    # Add vertical lines to show where each phase started
    phase1_end = len(histories[0].history['accuracy'])
    phase2_end = phase1_end + len(histories[1].history['accuracy'])

    plt.axvline(x=phase1_end, color='black', linestyle='--', label='Start Stage 2', linewidth=2)
    plt.axvline(x=phase2_end, color='red', linestyle='--', label='Start Stage 3', linewidth=2)

    plt.title('Training and Validation Accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.grid(True)

    # Plot Loss
    plt.subplot(1, 2, 2)
    plt.plot(epochs, loss, label='Training Loss')
    plt.plot(epochs, val_loss, label='Validation Loss')

    plt.axvline(x=phase1_end, color='black', linestyle='--', label='Start Stage 2', linewidth=2)
    plt.axvline(x=phase2_end, color='red', linestyle='--', label='Start Stage 3', linewidth=2)

    plt.title('Training and Validation Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.grid(True)

    plt.tight_layout()
    plt.savefig('training_history.png', dpi=300)
    print(f"\n Training history plot saved to training_history.png")
    plt.show()


In [12]:
# Final evaluation
model = tf.keras.models.load_model('model_stage3_best.keras')
test_results, detailed_metrics = evaluate_model(model, test_ds)

# Save final model
model.save('final_model.keras')
print(f"\n Final model saved to final_model.keras")

# Plot training history
plot_training_history(
    [history1, history2, history3],
    ['Stage 1', 'Stage 2', 'Stage 3']
)

print("\n" + "="*80)
print("TRAINING COMPLETE!")
print("="*80)
print(f"\nFinal Test Accuracy: {test_results[1]*100:.2f}%")
print(f"Final Test Top-5 Accuracy: {test_results[2]*100:.2f}%")
print(f"Macro F1-Score: {detailed_metrics['f1']*100:.2f}%")


FINAL EVALUATION ON TEST SET









2025-12-11 01:05:04.713891: E external/local_xla/xla/stream_executor/cuda/cuda_timer.cc:86] Delay kernel timed out: measured time has sub-optimal accuracy. There may be a missing warmup execution, please investigate in Nsight Systems.

21/75 ━━━━━━━━━━━━━━━━━━━━ 21s 399ms/step - accuracy: 0.7591 - loss: 1.0526 - top5_acc: 0.9431

75/75 ━━━━━━━━━━━━━━━━━━━━ 96s 508ms/step - accuracy: 0.7791 - loss: 0.9333 - top5_acc: 0.9368
75/75 ━━━━━━━━━━━━━━━━━━━━ 54s 560ms/step

Final test metrics:
Accuracy:          77.91%
Macro F1-Score:    76.64%
Macro Precision:   78.38%
Macro Recall:      78.71%


  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])



 Final model saved to final_model.keras


NameError: name 'history1' is not defined

## Error Analysis

To better understand the behavior of the ConvNeXt-Small model, an error analysis was
performed on the test set. This analysis focuses on identifying the most frequent
misclassifications and understanding which species families are most commonly confused.

A confusion matrix is computed and visualized in a row-normalized form to account for
class imbalance. In addition, a detailed classification report is generated, including
precision, recall, and F1-score for each class. Finally, the most frequent confusion
pairs (true label → predicted label) are identified to highlight systematic errors.
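
The two core steps, row normalization and ranking of off-diagonal counts, can be sketched on a toy 3×3 confusion matrix. The class names and counts below are made up for illustration, and the helpers `row_normalize` and `top_confusions` are not part of the notebook.

```python
def row_normalize(cm):
    """Divide each row by its total so the diagonal holds per-class recall."""
    normalized = []
    for row in cm:
        total = sum(row)
        denom = total if total > 0 else 1  # avoid division by zero for empty rows
        normalized.append([value / denom for value in row])
    return normalized

def top_confusions(cm, class_names, top_k=3):
    """Return the most frequent (count, true, predicted) off-diagonal pairs."""
    pairs = [
        (cm[i][j], class_names[i], class_names[j])
        for i in range(len(cm))
        for j in range(len(cm))
        if i != j and cm[i][j] > 0
    ]
    pairs.sort(key=lambda item: item[0], reverse=True)
    return pairs[:top_k]

# Hypothetical families and counts (rows = true label, columns = predicted)
class_names = ["family_a", "family_b", "family_c"]
cm = [
    [50, 5, 0],
    [2, 6, 2],
    [0, 1, 4],
]

cm_norm = row_normalize(cm)
print([round(cm_norm[i][i], 2) for i in range(3)])  # [0.91, 0.6, 0.8]

top = top_confusions(cm, class_names)
for count, true_name, pred_name in top:
    print(f"{true_name} -> {pred_name}: {count}")
```

In raw counts the largest error is family_a predicted as family_b, yet the normalized diagonal shows that family_b has the lowest recall, which is why both views are useful.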

In [None]:
def error_analysis(model, test_ds_eval, class_indices, normalize=True, top_k=15, save_prefix="test"):
    print("\n" + "="*80)
    print("ERROR ANALYSIS")
    print("="*80)

    test_ds_eval.reset()
    y_pred_proba = model.predict(test_ds_eval, verbose=1)
    y_pred = np.argmax(y_pred_proba, axis=1)
    y_true = test_ds_eval.classes

    idx_to_class = {v: k for k, v in class_indices.items()}
    class_names = [idx_to_class[i] for i in range(len(idx_to_class))]

    cm = confusion_matrix(y_true, y_pred, labels=list(range(len(class_names))))

    if normalize:
        cm_plot = cm.astype("float") / np.maximum(cm.sum(axis=1, keepdims=True), 1)
        title = "Confusion Matrix (row-normalized)"
    else:
        cm_plot = cm
        title = "Confusion Matrix (counts)"

    # Plot 
    plt.figure(figsize=(12, 10))
    plt.imshow(cm_plot)
    plt.title(title)
    plt.xlabel("Predicted")
    plt.ylabel("True")
    plt.colorbar()
    plt.xticks([]); plt.yticks([])
    plt.tight_layout()
    plt.savefig(f"{save_prefix}_confusion_matrix.png", dpi=200)
    plt.show()

    # Report
    report = classification_report(
        y_true, y_pred,
        target_names=class_names,
        digits=4,
        zero_division=0
    )
    print("\nClassification report:")
    print(report)

    # Top confused pairs
    cm_off = cm.copy()
    np.fill_diagonal(cm_off, 0)
    pairs = []
    for i in range(cm_off.shape[0]):
        for j in range(cm_off.shape[1]):
            if cm_off[i, j] > 0:
                pairs.append((cm_off[i, j], class_names[i], class_names[j]))
    pairs.sort(reverse=True, key=lambda x: x[0])

    print(f"\nTop {top_k} confusions (True -> Pred, count):")
    for cnt, t, p in pairs[:top_k]:
        print(f"  {t} -> {p}: {cnt}")

    return cm


# Use the test generator defined earlier (test_ds)
cm = error_analysis(
    model,
    test_ds,
    class_indices=test_ds.class_indices,
    normalize=True,
    top_k=15,
    save_prefix="test"
)

## Conclusion

The ConvNeXt-Small model combined with a staged fine-tuning strategy reaches 77.91%
accuracy and a macro F1-score of 76.64% on the held-out test set across 202 families.
The use of transfer learning, data augmentation, and class weighting allows the model
to generalize well despite class imbalance and limited samples for rare families.