# CNN Models Training Methodology & Detailed Architecture Explanation

## Overview

This notebook provides comprehensive documentation of training methodologies, architectural details, and technical explanations for eight deep learning models used in Philippine Medicinal Plants classification. The models include both standalone CNN architectures and hybrid CNN-Vision Transformer models.

---

## 📊 Quick Comparison Table: Training Configuration & Performance Summary (Updated with K-Fold Cross-Validation)

### Main Performance Metrics

| Model | Type | Base Model | Pre-trained | Parameters | Model Size (MB) | Batch Size | Learning Rate | Epochs | Best Epoch | Train Acc | Val Acc | Test Acc | Overfitting Gap | Gen Gap |
|-------|------|------------|-------------|------------|-----------------|------------|---------------|--------|------------|-----------|---------|----------|-----------------|---------|
| **MobileNet & ViT** | Hybrid | MobileNet | ✅ ImageNet | 37.4M | 142.7 | 32 | 0.0001 | 20 | 10 | 100.00% | 100.00% | **99.62%** 🥇 | 0.00% ✅ | 0.38% ✅ |
| **MobileNet** | Standalone | MobileNet | ✅ ImageNet | 3.8M | 14.4 | 32 | 0.0001 | 20 | 20 | 98.15% | 99.22% | **99.24%** 🥈 | -1.07% ✅ | -0.02% ✅ |
| **VGG16 & ViT** | Hybrid | VGG16 | ✅ ImageNet | 23.4M | 89.3 | 32 | 0.0001 | 20 | 14 | 99.71% | 99.61% | **99.24%** 🥉 | 0.10% ✅ | 0.37% ✅ |
| **VGG16** | Standalone | VGG16 | ✅ ImageNet | 15.0M | 57.2 | 32 | 0.0001 | 20 | 20 | 72.71% | 88.76% | 85.17% | -16.05% ✅ | 3.59% ✅ |
| **ResNet** | Standalone | ResNet | ✅ ImageNet | 24.6M | 93.8 | 32 | 0.001 | 20 | 12 | 67.80% | 82.17% | 80.99% | -14.37% ✅ | 1.18% ✅ |
| **ResNet50 & ViT** | Hybrid | ResNet50 | ✅ ImageNet | 75.1M | 286.5 | 16* | 0.0001 | 20 | 20 | 51.85% | 60.47% | 65.78% | -8.62% ✅ | -5.31% ⚠️ |
| **ZFNet & ViT** | Hybrid | Custom ZFNet | ❌ Scratch | 6.0M | 22.9 | 32 | 0.0001 | 20 | 17 | 61.62% | 76.74% | 80.61% | -15.12% ✅ | -3.87% ⚠️ |
| **ZFNet** | Standalone | Custom ZFNet | ❌ Scratch | 72.0M | 274.5 | 32 | 0.001 | 20 | 19 | 77.97% | 91.09% | 89.73% | -13.12% ✅ | 1.36% ✅ |

### K-Fold Cross-Validation Results (5-Fold StratifiedKFold)

| Model | K-Fold Avg Accuracy | K-Fold Std Dev | K-Fold Min | K-Fold Max | Consistency | K-Fold vs Test Diff |
|-------|---------------------|----------------|------------|------------|-------------|---------------------|
| **MobileNet & ViT** | **99.39%** 🥇 | **0.16%** | 99.13% | 99.57% | ✅ Excellent | +0.23% |
| **MobileNet** | **98.18%** 🥈 | 0.73% | 97.41% | 99.35% | ✅ Good | +1.06% |
| **VGG16 & ViT** | **95.38%** 🥉 | 0.72% | 94.38% | 96.54% | ✅ Good | +3.86% |
| **ResNet** | 71.18% | 1.32% | 69.55% | 73.00% | ✅ Good | +9.81% |
| **VGG16** | 67.46% | 3.21% | 63.50% | 71.06% | ⚠️ Moderate | +17.71% |
| **ResNet50 & ViT** | 13.35% | 1.52% | 11.90% | 16.20% | ⚠️ Poor | +52.43% |
| **ZFNet & ViT** | 4.93% | 0.64% | 4.10% | 5.83% | ⚠️ Poor | +75.68% |
| **ZFNet** | 14.14% | 7.44% | 7.34% | 25.32% | ❌ Very Poor | +75.59% |

**Legend:**
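- **Overfitting Gap** = Training Accuracy - Validation Accuracy (a negative value means validation accuracy exceeds training accuracy, which here reflects regularization such as augmentation and dropout being active only during training)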
- **Gen Gap** = Generalization Gap = Validation Accuracy - Test Accuracy
- **K-Fold Cross-Validation** = 5-fold StratifiedKFold on train+val data (90% of total), test set (10%) reserved
- ✅ Excellent | ⚠️ Moderate | ❌ Needs Improvement
- *Batch size reduced to 16 for ResNet50 & ViT due to memory constraints
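
The gap columns are plain differences, so any row can be re-derived by hand. The sketch below does this for two rows of the tables above. The `gaps` helper and its argument names are illustrative only, and the third value assumes that "K-Fold vs Test Diff" means test accuracy minus the K-Fold average, which is consistent with the tabulated numbers.

```python
def gaps(train_acc, val_acc, test_acc, kfold_avg):
    """Return (overfitting gap, generalization gap, K-Fold vs test diff)
    in percentage points, rounded to two decimals."""
    overfitting_gap = round(train_acc - val_acc, 2)
    generalization_gap = round(val_acc - test_acc, 2)
    kfold_vs_test = round(test_acc - kfold_avg, 2)  # assumed sign convention
    return overfitting_gap, generalization_gap, kfold_vs_test

print(gaps(100.00, 100.00, 99.62, 99.39))  # MobileNet & ViT -> (0.0, 0.38, 0.23)
print(gaps(98.15, 99.22, 99.24, 98.18))    # MobileNet       -> (-1.07, -0.02, 1.06)
```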

### Key Insights from Comparison Table:

1. **Best Performance**: MobileNet & ViT hybrid achieves highest test accuracy (99.62%) and K-Fold average (99.39%)
2. **Most Efficient**: MobileNet standalone (14.4 MB, 3.8M params) with excellent performance (Test: 99.24%, K-Fold: 98.18%)
3. **Fastest Convergence**: MobileNet & ViT (epoch 10), VGG16 & ViT (epoch 14)
4. **K-Fold Consistency**: Top 3 models show excellent K-Fold consistency (std dev < 1%)
5. **Memory Challenges**: ResNet50 & ViT requires batch size reduction and mixed precision
6. **Transfer Learning Impact**: Pre-trained models significantly outperform scratch training (especially in K-Fold)
7. **K-Fold vs Test Alignment**: Top models show good alignment between K-Fold and test accuracy
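8. **Training Instability**: ZFNet models show K-Fold vs test gaps above 75 points; the ZFNet & ViT K-Fold average (4.93%) sits at the 5% chance level for 20 classes, indicating unstable training under the K-Fold setup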

---

### 1. MobileNet & Vision Transformer Hybrid Model 🏆

#### Architecture Overview

**Hybrid Architecture Design:**
This model combines the efficiency of MobileNet's depthwise separable convolutions with the attention mechanism of Vision Transformers. The architecture follows a two-stage approach:

1. **Feature Extraction Stage (MobileNet Backbone)**:
   - **Input**: 224×224×3 RGB images
   - **Pre-processing**: Rescaling to [-1, 1] range
   - **MobileNet v1**: Pre-trained on ImageNet with 1.0 depth multiplier
   - **Output Feature Map**: 7×7×1024 spatial feature maps
   - **Total Backbone Parameters**: 3.23M (frozen, non-trainable)
   - **Why MobileNet**: Efficient depthwise separable convolutions reduce parameters while maintaining feature quality

2. **Feature Processing Stage (Vision Transformer)**:
   - **Patch Creation**: CNN feature maps (7×7×1024) are reshaped into 49 patches of 1024 dimensions
   - **Position Embedding**: Learnable positional encodings added to each patch (50,176 parameters)
   - **Class Token**: Prepend learnable classification token (1,024 parameters)
   - **Transformer Blocks**: 4 sequential transformer encoder blocks
     - **Multi-Head Self-Attention**: 8 attention heads, key_dim = 128 (1024/8)
     - **MLP**: 2-layer feedforward network with dimension 2048 (patch_dim × 2)
     - **Layer Normalization**: Applied before attention and MLP (pre-norm architecture)
     - **Residual Connections**: Skip connections around attention and MLP
   - **Output**: Class token extracted after final transformer block

3. **Classification Head**:
   - **Dense Layer**: 20 units (one per medicinal plant class)
   - **Activation**: Softmax for probability distribution
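
To make the self-attention step concrete, here is a minimal single-head sketch in plain Python. It is a simplification under stated assumptions, not the notebook's implementation: the query, key, and value projections are taken to be the identity (no learned weights), there is one head instead of eight, and the toy tokens have 4 dimensions rather than 1024.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity Q/K/V."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    weights = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    # Each output token is a weighted average of all tokens.
    outputs = [
        [sum(w * v[d] for w, v in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return weights, outputs

# Toy sequence: one class token followed by three patch tokens.
tokens = [
    [0.0, 0.0, 0.0, 0.0],  # class token (placeholder values)
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
weights, outputs = self_attention(tokens)
```

Every row of `weights` sums to 1, so the class token's output is a convex combination of all tokens, which is how it can aggregate information from the whole feature map.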

**Total Architecture Parameters:**
- **Trainable**: 34.19M parameters
- **Non-trainable**: 3.23M parameters (MobileNet backbone)
- **Total**: 37.42M parameters
- **Model Size**: 142.7 MB (4 bytes per parameter)
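
Several of the figures quoted for this model follow from simple arithmetic on the feature-map shape. The sketch below reproduces them; the constant names are illustrative, and the size calculation assumes float32 weights and 1 MB = 1024² bytes.

```python
GRID = 7             # spatial size of the backbone feature map
PATCH_DIM = 1024     # channels per spatial position
NUM_HEADS = 8
BYTES_PER_PARAM = 4  # float32 (assumed)

num_patches = GRID * GRID                            # 49 patches
position_embedding_params = num_patches * PATCH_DIM  # one vector per patch
class_token_params = PATCH_DIM                       # a single learnable vector
key_dim = PATCH_DIM // NUM_HEADS                     # per-head dimension
mlp_dim = PATCH_DIM * 2                              # hidden width of the MLP

total_params = 37.42e6
model_size_mb = total_params * BYTES_PER_PARAM / 1024**2

print(position_embedding_params, class_token_params, key_dim, mlp_dim)
print(round(model_size_mb, 1))
```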

#### Detailed Architecture Flow

```
Input Image (224×224×3)
    ↓
Rescaling Layer (normalize to [-1, 1])
    ↓
MobileNet Backbone (FROZEN)
    ├─ Depthwise Separable Convolutions
    ├─ Batch Normalization
    └─ ReLU6 Activations
    ↓
Feature Map (7×7×1024)
    ↓
Reshape to Patches (49 patches × 1024 dims)
    ↓
Add Position Embeddings (learnable)
    ↓
Prepend Class Token
    ↓
Transformer Block 1
    ├─ Layer Norm
    ├─ Multi-Head Self-Attention (8 heads) + Residual
    ├─ Layer Norm
    └─ MLP (1024 → 2048 → 1024) + Residual
    ↓
Transformer Block 2 (same structure)
    ↓
Transformer Block 3 (same structure)
    ↓
Transformer Block 4 (same structure)
    ↓
Extract Class Token
    ↓
Dense Classification Layer (20 classes)
    ↓
Softmax Output
```
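
The flow above can also be read as a sequence of tensor shapes. The following sketch traces them for a single image with the batch dimension omitted; `hybrid_shape_trace` is a hypothetical helper written for this illustration, and the stage order follows the diagram (position embeddings are added before the class token is prepended).

```python
def hybrid_shape_trace(grid=7, channels=1024, num_classes=20):
    """Return (stage, shape) pairs for one image passing through the hybrid."""
    num_patches = grid * grid
    return [
        ("backbone feature map", (grid, grid, channels)),
        ("reshaped patches", (num_patches, channels)),
        ("with position embeddings", (num_patches, channels)),
        ("with class token", (num_patches + 1, channels)),
        ("after transformer blocks", (num_patches + 1, channels)),
        ("extracted class token", (channels,)),
        ("softmax output", (num_classes,)),
    ]

for stage, shape in hybrid_shape_trace():
    print(f"{stage:28s} {shape}")
```

Calling `hybrid_shape_trace(channels=512)` gives the corresponding shapes for a backbone with a 7×7×512 feature map.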

#### Training Configuration & Hyperparameters

**Optimizer Details:**
- **Type**: Adam (Adaptive Moment Estimation)
- **Initial Learning Rate**: 0.0001 (1e-4)
- **Rationale**: Lower LR for transfer learning prevents overwriting pre-trained features
- **Beta1**: 0.9 (default)
- **Beta2**: 0.999 (default)
- **Epsilon**: 1e-7 (default)

**Loss Function:**
- **Type**: Sparse Categorical Crossentropy
- **Why Sparse**: Labels are integers (0-19) rather than one-hot encoded
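- **Mathematical Form**: L = -log(P(y_true)) per sample, where P(y_true) is the softmax probability assigned to the true class; the batch loss is the mean over samples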
- **Benefits**: Memory efficient, directly handles integer labels
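
As a worked illustration of the loss, the sketch below evaluates it by hand on a toy batch. It uses 3 classes instead of 20 to stay readable, and the probabilities are made-up values, not model outputs.

```python
import math

def sparse_categorical_crossentropy(probabilities, labels):
    """Mean of -log(p[true_class]) over a batch; labels are integer class ids."""
    losses = [-math.log(p[y]) for p, y in zip(probabilities, labels)]
    return sum(losses) / len(losses)

batch_probs = [
    [0.7, 0.2, 0.1],  # true class 0 receives probability 0.7
    [0.1, 0.1, 0.8],  # true class 2 receives probability 0.8
]
batch_labels = [0, 2]  # integer labels, no one-hot encoding needed
loss = sparse_categorical_crossentropy(batch_probs, batch_labels)
print(round(loss, 4))
```

Because the label is used directly as an index, no 20-wide one-hot vector has to be stored per sample, which is the memory benefit noted above.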

**Training Setup:**
- **Batch Size**: 32 samples per batch
- **Total Epochs**: 20 (best validation accuracy reached at epoch 10; the weights from that epoch were restored)
- **Steps per Epoch**: 65 steps (consistent across all models)
- **Total Training Samples**: ~2,080 samples (65 steps × 32 batch size)
- **Device**: NVIDIA GPU with CUDA support
- **Mixed Precision**: Not used (not needed for this model size)

**Data Augmentation Pipeline:**
All augmentation applied during training (on-the-fly):
1. **Random Horizontal & Vertical Flips**: 50% probability each
   - Increases dataset diversity
   - Helps model learn flip-invariant (mirror-symmetric) features
2. **Random Rotation**: ±15 degrees
   - Simulates natural image variations
   - Prevents overfitting to specific orientations
3. **Random Zoom**: ±10% scale variation
   - Handles different camera distances
   - Improves scale invariance
4. **Random Brightness**: ±10% adjustment
   - Handles lighting variations
   - Improves robustness to illumination changes
5. **Random Contrast**: ±10% adjustment
   - Handles different image qualities
   - Improves generalization
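
Two of these transforms are easy to sketch without any image library. The example below applies random flips and a brightness change to a tiny grayscale image stored as nested lists. It is an illustration only: the helper name is hypothetical, and treating "±10% brightness" as a multiplicative factor in [0.9, 1.1] is an assumption, since the notebook's exact augmentation layers are not shown here.

```python
import random

def random_flip_and_brightness(image, rng, flip_prob=0.5, max_delta=0.10):
    """image: list of rows, each a list of grayscale values in [0, 1]."""
    out = [row[:] for row in image]
    if rng.random() < flip_prob:   # horizontal flip: reverse each row
        out = [row[::-1] for row in out]
    if rng.random() < flip_prob:   # vertical flip: reverse row order
        out = out[::-1]
    # Brightness: scale all pixels by one random factor, then clip to [0, 1].
    factor = 1.0 + rng.uniform(-max_delta, max_delta)
    return [[min(1.0, max(0.0, v * factor)) for v in row] for row in out]

rng = random.Random(0)  # seeded so the run is repeatable
image = [[0.1, 0.2], [0.3, 0.4]]
augmented = random_flip_and_brightness(image, rng)
```

Each call draws new random decisions, so the model sees a slightly different version of the same image in every epoch.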

**Training Callbacks:**

1. **EarlyStopping Callback**:
   - **Monitor**: Validation accuracy
   - **Patience**: 10 epochs
   - **Min Delta**: 0.0001 (minimum improvement threshold)
   - **Mode**: 'max' (maximize validation accuracy)
   - **Restore Best Weights**: True (restores weights from best epoch)
   - **Rationale**: Prevents overfitting and saves training time

2. **ReduceLROnPlateau Callback**:
   - **Monitor**: Validation loss
   - **Factor**: 0.5 (reduce LR by half)
   - **Patience**: 7 epochs
   - **Min Learning Rate**: 1e-8 (minimum LR threshold)
   - **Mode**: 'min' (minimize validation loss)
   - **Rationale**: Fine-tunes model when loss plateaus
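
The interaction of the two callbacks can be followed with a small simulation. The sketch below replays a list of per-epoch validation metrics through simplified versions of both rules, using the settings listed above. It is not the framework's implementation: details such as cooldown periods and the plateau callback's own improvement threshold are left out, and the metric values are hypothetical.

```python
def simulate_callbacks(val_acc, val_loss, lr=1e-4,
                       stop_patience=10, stop_min_delta=1e-4,
                       lr_patience=7, lr_factor=0.5, min_lr=1e-8):
    """Return (best_epoch, stopped_epoch, lr_history) for the given metrics."""
    best_acc, best_epoch, stop_wait = float("-inf"), 0, 0
    best_loss, lr_wait = float("inf"), 0
    lr_history, stopped_epoch = [], None
    for epoch, (acc, loss) in enumerate(zip(val_acc, val_loss), start=1):
        # Early stopping watches validation accuracy (mode "max").
        if acc - stop_min_delta > best_acc:
            best_acc, best_epoch, stop_wait = acc, epoch, 0
        else:
            stop_wait += 1
        # Learning-rate reduction watches validation loss (mode "min").
        if loss < best_loss:
            best_loss, lr_wait = loss, 0
        else:
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr = max(lr * lr_factor, min_lr)
                lr_wait = 0
        lr_history.append(lr)
        if stop_wait >= stop_patience:
            stopped_epoch = epoch
            break
    return best_epoch, stopped_epoch, lr_history

# Hypothetical run: metrics improve for 3 epochs, then stay flat.
val_acc = [0.90, 0.95, 0.99] + [0.99] * 12
val_loss = [0.50, 0.20, 0.05] + [0.05] * 12
best_epoch, stopped_epoch, lr_history = simulate_callbacks(val_acc, val_loss)
```

In this toy run the learning rate is halved after 7 flat epochs and training stops after 10, at which point the weights from `best_epoch` would be the ones restored.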

#### Training Process & Observations

**Epoch-by-Epoch Progression:**
- **Epoch 1**: Training accuracy 65.47%, Validation accuracy 97.67%
  - Model quickly learns from pre-trained features
  - Large gap indicates strong regularization from augmentation
- **Epoch 2**: Training accuracy 96.01%, Validation accuracy 98.84%
  - Rapid improvement as transformer learns attention patterns
- **Epoch 3**: Training accuracy 98.88%, Validation accuracy 99.61%
  - Approaching convergence
- **Epoch 4-5**: Training accuracy reaches 100.00%
  - Perfect training accuracy achieved
- **Epoch 10**: Validation accuracy reaches 100.00%
  - Best model checkpoint saved
  - Perfect validation performance
- **Epoch 15**: Learning rate reduced (plateau detected)
  - LR reduced from 0.0001 to 0.00005
  - Fine-tuning phase begins

**Key Training Insights:**
1. **Fast Convergence**: Achieved perfect validation accuracy in just 10 epochs
   - Indicates excellent feature extraction from MobileNet
   - Vision Transformer quickly learns relevant attention patterns
2. **No Overfitting**: Perfect training and validation accuracy with minimal gap
   - Data augmentation provides strong regularization
   - Transfer learning prevents overfitting to training set
3. **Attention Mechanism**: Vision Transformer learns to focus on discriminative plant features
   - Self-attention allows model to relate different parts of the image
   - Class token aggregates global information for classification

**Transfer Learning Strategy:**
- **Frozen Backbone**: MobileNet weights remain frozen throughout training
  - Preserves ImageNet-learned features (edges, textures, shapes)
  - Only Vision Transformer and classifier are trainable
  - Reduces risk of catastrophic forgetting
- **Fine-tuning Approach**: Only top layers trained
  - Lower learning rate (0.0001) prevents large weight updates
  - Gradual adaptation to medicinal plant domain

#### Performance Results & Analysis

**Training Metrics:**
- **Best Epoch**: 10 (best weights restored from this epoch)
- **Training Accuracy**: 100.00%
- **Training Loss**: 0.0011 (very low, indicating high confidence)
- **Training Time**: ~7-8 seconds per epoch (GPU accelerated)

**Validation Metrics:**
- **Validation Accuracy**: 100.00%
- **Validation Loss**: 0.0058
- **Precision**: 100.00% (weighted average)
- **Recall**: 100.00% (weighted average)
- **F1-Score**: 100.00% (weighted average)

**Test Metrics (Unseen Data):**
- **Test Accuracy**: 99.62%
- **Test Precision**: 99.62%
- **Test Recall**: 99.62%
- **Test F1-Score**: 99.62%

**Gap Analysis:**
- **Overfitting Gap**: 0.00% (Training - Validation)
  - Perfect balance, no overfitting detected
  - Model generalizes perfectly to validation set
- **Generalization Gap**: 0.38% (Validation - Test)
  - Excellent generalization to completely unseen test data
  - Only 0.38% drop indicates robust model
  - Slight drop expected due to test set distribution differences

**Model Efficiency:**
- **Inference Speed**: Fast (MobileNet backbone is efficient)
- **Memory Usage**: Moderate (142.7 MB model size)
- **Computational Cost**: Reasonable (37.4M parameters)
- **Production Ready**: Yes (excellent accuracy + good efficiency)

#### K-Fold Cross-Validation Results & Analysis

**What is K-Fold Cross-Validation?**
K-Fold Cross-Validation is a robust evaluation technique that splits the dataset into k subsets (folds), trains the model k times (each time using k-1 folds for training and 1 fold for validation), and averages the results. This provides a more reliable estimate of model performance across different data splits.

**K-Fold Methodology Used:**
- **Method**: StratifiedKFold (5 folds) - ensures each fold has the same class distribution
- **Data Split**: Combined train (80%) + validation (10%) = 90% of total data used for K-Fold
- **Test Set**: 10% of total data reserved and NOT used in K-Fold (reserved for final evaluation)
- **Each Fold**: Uses 80% train / 20% validation of the combined 90% data
- **Training**: A new model instance is trained for each fold (5 epochs per fold), so no trained weights carry over between folds
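
The splitting scheme can be illustrated with a simplified stand-in for the StratifiedKFold splitter named above. The sketch below assigns samples of each class to folds in round-robin order, without shuffling; the label pool (20 classes with 10 samples each) is hypothetical and only mimics the structure of the train+val portion.

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Assign sample indices to k folds so each class is spread evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# Hypothetical pool: 20 classes, 10 samples per class.
labels = [class_id for class_id in range(20) for _ in range(10)]
folds = stratified_folds(labels, k=5)

# Each fold serves once as validation; the other four folds form the training set.
splits = []
for val_indices in folds:
    held_out = set(val_indices)
    train_indices = [i for i in range(len(labels)) if i not in held_out]
    splits.append((train_indices, val_indices))
```

With 5 folds, every split trains on 80% of the pool and validates on the remaining 20%, and every sample is used for validation exactly once.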

**K-Fold Cross-Validation Results:**

| Fold | Validation Accuracy | Notes |
|------|---------------------|-------|
| Fold 1 | 99.57% | Highest accuracy achieved |
| Fold 2 | 99.35% | Consistent performance |
| Fold 3 | 99.57% | Matches Fold 1 (highest) |
| Fold 4 | 99.35% | Consistent with Fold 2 |
| Fold 5 | 99.13% | Lowest, but still excellent |

**K-Fold Statistics:**
- **Average Accuracy**: **99.39%** 🥇 (Highest among all models)
- **Standard Deviation**: **0.16%** (Lowest among all models - Excellent consistency)
- **Minimum Accuracy**: 99.13%
- **Maximum Accuracy**: 99.57%
- **Range**: 0.44% (Very narrow - indicates excellent stability)
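
These statistics can be recomputed directly from the five fold accuracies. The sketch below uses the standard library. Note that the reported 0.16% corresponds to the population standard deviation (`pstdev`); the sample standard deviation of the same values would be slightly larger.

```python
import statistics

fold_accuracies = [99.57, 99.35, 99.57, 99.35, 99.13]

mean_acc = statistics.fmean(fold_accuracies)
std_acc = statistics.pstdev(fold_accuracies)  # population standard deviation
acc_range = max(fold_accuracies) - min(fold_accuracies)

print(f"mean={mean_acc:.2f}% std={std_acc:.2f}% range={acc_range:.2f}%")
# mean=99.39% std=0.16% range=0.44%
```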

**Key Insights from K-Fold Results:**
1. **Excellent Consistency**: Standard deviation of 0.16% is the lowest among all models, indicating very consistent performance across different data splits
2. **High Performance**: All 5 folds achieved >99% accuracy, demonstrating robust model capability
3. **Stable Learning**: Narrow range (0.44%) shows model learns consistently regardless of data split
4. **Alignment with Test Results**: K-Fold average (99.39%) closely matches test accuracy (99.62%), difference of only +0.23%, indicating excellent generalization
5. **Production Reliability**: Low variance across folds suggests model will perform consistently in production

**Comparison with Other Models:**
- K-Fold average (99.39%) is highest among all 8 models
- Standard deviation (0.16%) is lowest, showing best consistency
- Validates that this model is the most reliable and robust for deployment

#### Why This Model Performs Best

1. **Optimal Architecture Combination**:
   - MobileNet provides efficient, high-quality features
   - Vision Transformer adds attention mechanism for better feature relationships
   - Hybrid approach leverages strengths of both architectures

2. **Transfer Learning Benefits**:
   - Pre-trained MobileNet has learned general visual features
   - Reduces need for large training dataset
   - Faster convergence (10 epochs vs 20+)

3. **Attention Mechanism**:
   - Self-attention allows model to focus on discriminative plant parts
   - Better feature relationships compared to simple pooling
   - Handles complex spatial relationships

4. **Regularization**:
   - Data augmentation prevents overfitting
   - Frozen backbone provides implicit regularization
   - Early stopping prevents overtraining

5. **Training Strategy**:
   - Appropriate learning rate for transfer learning
   - Learning rate scheduling fine-tunes model
   - Best weights restoration ensures optimal performance

---

### 2. MobileNet Standalone Model 🥈

#### Architecture Overview

**Standalone CNN Architecture:**
This model uses MobileNet as a feature extractor followed by a simple classification head. It's the most efficient model with excellent performance.

**Architecture Components:**

1. **MobileNet Backbone (Frozen)**:
   - **Pre-trained**: ImageNet weights (1.0 depth multiplier)
   - **Input**: 224×224×3 RGB images
   - **Architecture**: Depthwise separable convolutions
     - **Depthwise Convolution**: Applies single filter per input channel (reduces parameters)
     - **Pointwise Convolution**: 1×1 convolution to combine channels
   - **Output**: 7×7×1024 feature maps
   - **Parameters**: 3.23M (frozen, non-trainable)
   - **Efficiency**: About 4.6× fewer backbone parameters than VGG16 (3.23M vs 14.72M)
   - **Key Features**:
     - ReLU6 activation (clamped at 6 for better quantization)
     - Batch normalization after each convolution
     - Width multiplier = 1.0 (full width)

2. **Global Average Pooling (GAP)**:
   - **Operation**: Average pooling over spatial dimensions (7×7 → 1×1)
   - **Output**: 1024-dimensional feature vector
   - **Benefits**: 
     - Reduces parameters compared to flattening
     - Provides spatial invariance
     - Prevents overfitting
     - More interpretable (spatial average)

3. **Classification Head**:
   - **Dense Layer 1**: 1024 → 512 units (with ReLU activation)
   - **Dropout**: 0.5 (50% dropout rate for regularization)
   - **Dense Layer 2**: 512 → 20 units (output layer)
   - **Activation**: Softmax for probability distribution
   - **Trainable Parameters**: ~0.54M
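
The trainable parameter count of this head can be checked by hand, since a dense layer has one weight per input-output pair plus one bias per output. The `dense_params` helper below is illustrative; pooling and dropout contribute no parameters.

```python
def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units

head_params = dense_params(1024, 512) + dense_params(512, 20)
print(head_params)  # 535060
```

That is roughly half a million trainable parameters sitting on top of the frozen 3.23M-parameter backbone.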

**Total Parameters:**
- **Trainable**: ~0.54M parameters (classification head)
- **Non-trainable**: 3.23M parameters (MobileNet backbone)
- **Total**: ~3.8M parameters
- **Model Size**: 14.4 MB (most efficient model)

#### Detailed Architecture Flow

```
Input Image (224×224×3)
    ↓
Rescaling Layer (normalize to [-1, 1])
    ↓
MobileNet Backbone (FROZEN)
    ├─ Depthwise Separable Convolutions
    ├─ Batch Normalization
    └─ ReLU6 Activations
    ↓
Feature Map (7×7×1024)
    ↓
Global Average Pooling
    ↓
Feature Vector (1024)
    ↓
Dense Layer (1024 → 512) + ReLU
    ↓
Dropout (0.5)
    ↓
Dense Layer (512 → 20) + Softmax
    ↓
Class Probabilities (20 classes)
```
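
The Global Average Pooling step in this flow reduces each channel to the mean of its spatial positions. A minimal sketch on a toy 2×2 map with 3 channels follows (the model itself pools a 7×7 map with 1024 channels); the helper name and values are illustrative.

```python
def global_average_pool(feature_map):
    """feature_map[h][w][c] -> list with one mean value per channel."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    return [
        sum(feature_map[h][w][c] for h in range(height) for w in range(width))
        / (height * width)
        for c in range(channels)
    ]

feature_map = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[5.0, 0.0, 2.0], [7.0, 4.0, 2.0]],
]
pooled = global_average_pool(feature_map)
print(pooled)  # [4.0, 1.0, 2.0]
```

The output length equals the number of channels regardless of the spatial size, which is why the pooled vector has 1024 entries in this model.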

#### Training Configuration & Hyperparameters

**Optimizer**: Adam with learning rate 0.0001
- **Rationale**: Lower LR for transfer learning prevents overwriting pre-trained features
- **Beta1**: 0.9, **Beta2**: 0.999

**Loss Function**: Sparse Categorical Crossentropy
- Memory efficient for integer labels (0-19)

**Training Setup:**
- **Batch Size**: 32 samples per batch
- **Total Epochs**: 20 (full training completed)
- **Steps per Epoch**: 65 steps
- **Device**: NVIDIA GPU with CUDA support

**Data Augmentation**: Same pipeline as MobileNet & ViT (flips, rotation, zoom, brightness, contrast)

**Callbacks:**
- **EarlyStopping**: Monitor `val_accuracy`, patience=10, min_delta=0.001
- **ReduceLROnPlateau**: Monitor `val_loss`, factor=0.5, patience=5, min_lr=1e-8

#### Training Process & Observations

**Key Characteristics:**
- **Validation > Training Accuracy**: This unusual pattern indicates strong regularization from data augmentation
- **Steady Improvement**: Consistent accuracy increase throughout 20 epochs
- **No Overfitting**: Negative overfitting gap (-1.07%) shows excellent regularization
- **Perfect Generalization**: -0.02% gap between validation and test (essentially perfect)

**Why Validation > Training:**
1. **Data Augmentation**: Training images are augmented (harder), validation images are clean (easier)
2. **Dropout**: Applied during training but not validation
3. **Batch Normalization**: Different behavior in train vs eval mode
4. **Regularization Effects**: Strong regularization makes training harder but improves generalization
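
The dropout point can be seen in a few lines. The sketch below implements inverted dropout, a common formulation in which surviving activations are rescaled during training so that no rescaling is needed at inference; whether the framework used here follows exactly this formulation is an assumption of the example.

```python
import random

def dropout(values, rate, training, rng):
    """Zero each value with probability `rate` during training and rescale
    the survivors by 1 / (1 - rate); return the input unchanged otherwise."""
    if not training:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(42)
activations = [1.0] * 8
train_out = dropout(activations, rate=0.5, training=True, rng=rng)
eval_out = dropout(activations, rate=0.5, training=False, rng=rng)
```

During training the classifier sees a noisy, partially zeroed feature vector, while validation uses the full vector, so training accuracy is measured under harder conditions.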

**Training Progression:**
- Model showed steady improvement from epoch 1 to 20
- Validation accuracy consistently higher than training throughout
- Best performance achieved at final epoch (20)
- No early stopping triggered (model kept improving)

#### Performance Results & Analysis

**Training Metrics:**
- **Best Epoch**: 20
- **Training Accuracy**: 98.15%
- **Training Loss**: Low (indicating good fit)

**Validation Metrics:**
- **Validation Accuracy**: 99.22%
- **Validation Loss**: Very low
- **Precision**: 99.26%
- **Recall**: 99.22%
- **F1-Score**: 99.23%

**Test Metrics (Unseen Data):**
- **Test Accuracy**: 99.24%
- **Test Precision**: 99.24%
- **Test Recall**: 99.24%
- **Test F1-Score**: 99.24%

**Gap Analysis:**
- **Overfitting Gap**: -1.07% (Training - Validation)
  - Negative gap indicates excellent regularization
  - Validation performs better than training (unusual but beneficial)
- **Generalization Gap**: -0.02% (Validation - Test)
  - Essentially perfect generalization
  - Test accuracy slightly higher than validation (within measurement error)

**Model Efficiency:**
- **Model Size**: 14.4 MB (smallest among all models)
- **Parameters**: 3.8M total, of which about 0.54M are trainable (most efficient)
- **Inference Speed**: Very fast (MobileNet is optimized for mobile devices)
- **Memory Usage**: Low (suitable for edge devices)
- **Production Ready**: Yes (best efficiency-to-accuracy ratio)

**Use Cases:**
- Mobile/edge device deployment
- Real-time inference applications
- Resource-constrained environments
- When model size is critical
- When inference speed is important

#### K-Fold Cross-Validation Results & Analysis

**K-Fold Cross-Validation Results:**

| Fold | Validation Accuracy | Notes |
|------|---------------------|-------|
| Fold 1 | 97.41% | Lowest accuracy |
| Fold 2 | 98.70% | Good performance |
| Fold 3 | 99.35% | Highest accuracy achieved |
| Fold 4 | 97.62% | Consistent with Fold 1 |
| Fold 5 | 97.84% | Moderate performance |

**K-Fold Statistics:**
- **Average Accuracy**: **98.18%** 🥈 (Second highest among all models)
- **Standard Deviation**: 0.73% (Good consistency)
- **Minimum Accuracy**: 97.41%
- **Maximum Accuracy**: 99.35%
- **Range**: 1.94% (Good stability)

**Key Insights from K-Fold Results:**
1. **Good Consistency**: Standard deviation of 0.73% shows consistent performance across different data splits
2. **High Performance**: All 5 folds achieved >97% accuracy, demonstrating robust standalone model capability
3. **Stable Learning**: Moderate range (1.94%) shows model learns consistently across different data splits
4. **Alignment with Test Results**: K-Fold average (98.18%) closely matches test accuracy (99.24%), difference of only +1.06%, indicating excellent generalization
5. **Excellent for Standalone**: Second-best K-Fold performance shows MobileNet standalone is highly reliable without Vision Transformer

**Comparison with Other Models:**
- K-Fold average (98.18%) is second highest, demonstrating excellent standalone performance
- Validates that MobileNet standalone is a reliable, efficient alternative to hybrid models
- Standard deviation (0.73%) is low, showing good consistency

#### Why This Model is Highly Efficient

1. **MobileNet Architecture**:
   - Depthwise separable convolutions reduce parameters significantly
   - Optimized for mobile/edge deployment
   - Maintains accuracy despite fewer parameters

2. **Simple Classification Head**:
   - Global Average Pooling reduces spatial dimensions efficiently
   - Dropout provides regularization without adding parameters
   - Two-layer dense network is sufficient for classification

3. **Transfer Learning**:
   - Pre-trained MobileNet provides high-quality features
   - Only classifier needs training (fewer trainable parameters)
   - Faster training and inference

4. **Regularization Strategy**:
   - Data augmentation provides strong regularization
   - Dropout prevents overfitting
   - Results in excellent generalization despite small model size

---

### 3. VGG16 & Vision Transformer Hybrid Model 🥉

#### Architecture Overview

**Hybrid Architecture Design:**
This model combines VGG16's deep convolutional features with Vision Transformer attention mechanism. VGG16 provides rich spatial features that are enhanced by transformer attention.

**Architecture Components:**

1. **Feature Extraction Stage (VGG16 Backbone)**:
   - **Input**: 224×224×3 RGB images
   - **Pre-processing**: Rescaling to [-1, 1] range
   - **VGG16**: Pre-trained on ImageNet
   - **Architecture**: 13 convolutional layers (the 3 fully connected layers of the original VGG16 are not part of the feature-extraction backbone)
     - **Convolutional Layers**: 3×3 filters with ReLU activation
     - **Pooling**: Max pooling after conv blocks
     - **Output Feature Map**: 7×7×512 spatial feature maps
   - **Total Backbone Parameters**: 14.72M (frozen, non-trainable)
   - **Why VGG16**: Deep architecture captures hierarchical features effectively

2. **Feature Processing Stage (Vision Transformer)**:
   - **Patch Creation**: CNN feature maps (7×7×512) reshaped into 49 patches of 512 dimensions
   - **Position Embedding**: Learnable positional encodings (25,088 parameters)
   - **Class Token**: Prepend learnable classification token (512 parameters)
   - **Transformer Blocks**: 4 sequential transformer encoder blocks
     - **Multi-Head Self-Attention**: 8 attention heads, key_dim = 64 (512/8)
     - **MLP**: 2-layer feedforward network with dimension 1024 (patch_dim × 2)
     - **Layer Normalization**: Pre-norm architecture
     - **Residual Connections**: Skip connections for gradient flow
   - **Output**: Class token extracted after final transformer block

3. **Classification Head**:
   - **Dense Layer**: 20 units (one per medicinal plant class)
   - **Activation**: Softmax for probability distribution

**Total Architecture Parameters:**
- **Trainable**: 8.71M parameters
- **Non-trainable**: 14.72M parameters (VGG16 backbone)
- **Total**: 23.43M parameters
- **Model Size**: 89.3 MB

#### Detailed Architecture Flow

```
Input Image (224×224×3)
    ↓
Rescaling Layer (normalize to [-1, 1])
    ↓
VGG16 Backbone (FROZEN)
    ├─ Conv Block 1 (64 filters)
    ├─ Conv Block 2 (128 filters)
    ├─ Conv Block 3 (256 filters)
    ├─ Conv Block 4 (512 filters)
    ├─ Conv Block 5 (512 filters)
    └─ Max Pooling layers
    ↓
Feature Map (7×7×512)
    ↓
Reshape to Patches (49 patches × 512 dims)
    ↓
Add Position Embeddings (learnable)
    ↓
Prepend Class Token
    ↓
Transformer Block 1-4
    ├─ Layer Norm
    ├─ Multi-Head Self-Attention (8 heads) + Residual
    ├─ Layer Norm
    └─ MLP (512 → 1024 → 512) + Residual
    ↓
Extract Class Token
    ↓
Dense Classification Layer (20 classes)
    ↓
Softmax Output
```
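
The pre-norm wiring mentioned for the transformer blocks (normalize first, apply the sublayer, then add the residual) can be sketched for a single token vector. This is a schematic under simplifying assumptions: layer normalization has no learned scale or shift, and the attention and MLP sublayers are replaced by placeholder functions, so the example shows only the order of operations.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm_block(x, attention, mlp):
    """y = x + attention(LN(x)); output = y + mlp(LN(y))."""
    y = [a + b for a, b in zip(x, attention(layer_norm(x)))]
    return [a + b for a, b in zip(y, mlp(layer_norm(y)))]

# Placeholder sublayers, used only to show the wiring.
zero = lambda v: [0.0] * len(v)
double = lambda v: [2.0 * e for e in v]

token = [1.0, 2.0, 3.0, 4.0]
unchanged = pre_norm_block(token, zero, zero)  # residual path passes input through
updated = pre_norm_block(token, double, zero)
```

When both sublayers output zeros the block returns its input unchanged, which illustrates why residual connections help gradients flow through a stack of blocks.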

#### Training Configuration & Hyperparameters

**Optimizer**: Adam with learning rate 0.0001
- Lower LR for transfer learning
- Same configuration as MobileNet & ViT

**Loss Function**: Sparse Categorical Crossentropy

**Training Setup:**
- **Batch Size**: 32 samples per batch
- **Total Epochs**: 20 (converged at epoch 14)
- **Steps per Epoch**: 65 steps
- **Device**: NVIDIA GPU with CUDA support

**Data Augmentation**: Same pipeline as other models

**Callbacks:**
- **EarlyStopping**: Monitor `val_accuracy`, patience=10, min_delta=0.0001, restore_best_weights=True
- **ReduceLROnPlateau**: Monitor `val_loss`, factor=0.5, patience=7, min_lr=1e-8

#### Training Process & Observations

**Epoch-by-Epoch Progression:**
- Model showed steady improvement from epoch 1
- Training accuracy reached 99.71% by end of training
- Validation accuracy peaked at 99.61% at epoch 14
- Best model checkpoint saved at epoch 14
- Excellent convergence pattern

**Key Training Insights:**
1. **Fast Convergence**: Achieved best validation accuracy at epoch 14
   - VGG16 provides rich features for transformer to process
   - Attention mechanism learns discriminative patterns quickly
2. **Minimal Overfitting**: Only 0.10% gap between training and validation
   - Data augmentation provides strong regularization
   - Frozen backbone prevents overfitting
3. **Excellent Generalization**: 0.37% gap to test set
   - Model generalizes well to unseen data
   - Attention mechanism captures robust features

**Transfer Learning Strategy:**
- **Frozen Backbone**: VGG16 weights remain frozen
  - Preserves ImageNet-learned hierarchical features
  - Only Vision Transformer and classifier are trainable
- **Fine-tuning**: Lower learning rate adapts transformer to domain

#### Performance Results & Analysis

**Training Metrics:**
- **Best Epoch**: 14
- **Training Accuracy**: 99.71%
- **Training Loss**: Very low

**Validation Metrics:**
- **Validation Accuracy**: 99.61%
- **Validation Loss**: Low
- **Precision**: 99.64%
- **Recall**: 99.61%
- **F1-Score**: 99.61%

**Test Metrics (Unseen Data):**
- **Test Accuracy**: 99.24%
- **Test Precision**: 99.24%
- **Test Recall**: 99.24%
- **Test F1-Score**: 99.24%

**Gap Analysis:**
- **Overfitting Gap**: 0.10% (Training - Validation)
  - Minimal gap indicates excellent regularization
  - Model balances training and validation performance
- **Generalization Gap**: 0.37% (Validation - Test)
  - Excellent generalization to test set
  - Small drop expected for unseen data

**Model Efficiency:**
- **Model Size**: 89.3 MB (moderate size)
- **Parameters**: 23.4M total
- **Inference Speed**: Moderate (VGG16 is deeper than MobileNet)
- **Production Ready**: Yes (excellent accuracy)

#### K-Fold Cross-Validation Results & Analysis

**K-Fold Cross-Validation Results:**

| Fold | Validation Accuracy | Notes |
|------|---------------------|-------|
| Fold 1 | 95.25% | Good performance |
| Fold 2 | 94.38% | Lowest accuracy |
| Fold 3 | 95.68% | Good performance |
| Fold 4 | 95.03% | Consistent performance |
| Fold 5 | 96.54% | Highest accuracy achieved |

**K-Fold Statistics:**
- **Average Accuracy**: **95.38%** 🥉 (Third highest among all models)
- **Standard Deviation**: 0.72% (Good consistency)
- **Minimum Accuracy**: 94.38%
- **Maximum Accuracy**: 96.54%
- **Range**: 2.16% (Good stability)

**Key Insights from K-Fold Results:**
1. **Good Consistency**: Standard deviation of 0.72% shows consistent performance across different data splits
2. **High Performance**: All 5 folds achieved >94% accuracy, demonstrating robust hybrid model capability
3. **Stable Learning**: Moderate range (2.16%) shows model learns consistently across different data splits
4. **Alignment with Test Results**: K-Fold average (95.38%) is 3.86 points below test accuracy (99.24%); the held-out test score is higher, possibly because each fold is trained for only 5 epochs
5. **Strong Hybrid Performance**: Third-best K-Fold performance validates VGG16 & ViT hybrid architecture effectiveness

**Comparison with Other Models:**
- K-Fold average (95.38%) is third highest, demonstrating excellent hybrid model performance
- Standard deviation (0.72%) is low, showing good consistency
- Validates VGG16 features work well with Vision Transformer

#### Why This Model Performs Well

1. **VGG16 Features**:
   - Deep architecture captures hierarchical visual patterns
   - Rich feature representations from 13 convolutional layers
   - Well-suited for fine-grained classification

2. **Vision Transformer Enhancement**:
   - Attention mechanism processes VGG16 features effectively
   - Self-attention captures spatial relationships
   - Better feature aggregation than simple pooling

3. **Hybrid Approach**:
   - Combines CNN spatial features with transformer attention
   - Leverages strengths of both architectures
   - Good balance between accuracy and model size

---

### 4. VGG16 Standalone Model

#### Architecture Overview

**Standalone CNN Architecture:**
This model uses VGG16 as a feature extractor with a simple classification head. Shows strong regularization effects from data augmentation.

**Architecture Components:**

1. **VGG16 Backbone (Frozen)**:
   - **Pre-trained**: ImageNet weights
   - **Input**: 224×224×3 RGB images
   - **Architecture**: 13 convolutional layers organized in 5 blocks
     - **Block 1**: 2× Conv(64) + MaxPool
     - **Block 2**: 2× Conv(128) + MaxPool
     - **Block 3**: 3× Conv(256) + MaxPool
     - **Block 4**: 3× Conv(512) + MaxPool
     - **Block 5**: 3× Conv(512) + MaxPool
   - **Output**: 7×7×512 feature maps
   - **Parameters**: 14.72M (frozen, non-trainable)
   - **Key Features**: Deep architecture with small 3×3 filters, ReLU activation

2. **Global Average Pooling (GAP)**:
   - **Operation**: Average pooling over spatial dimensions (7×7 → 1×1)
   - **Output**: 512-dimensional feature vector

3. **Classification Head**:
   - **Dense Layer 1**: 512 → 256 units (with ReLU)
   - **Dropout**: 0.5
   - **Dense Layer 2**: 256 → 20 units (output)
   - **Activation**: Softmax

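**Total Parameters:**
- **Trainable**: ~0.14M parameters (classification head only: 512×256 + 256 for Dense Layer 1, 256×20 + 20 for Dense Layer 2)
- **Non-trainable**: 14.72M parameters (frozen VGG16 backbone)
- **Total**: ~15.0M parameters (as listed in the summary table)
- **Model Size**: 57.2 MB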
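
The parameter figures above can be sanity-checked by hand. The following is a minimal sketch, assuming standard 3×3 convolutions with biases for the five VGG16 blocks and plain dense layers for the head, as described in this section. The helper names are illustrative and are not taken from the notebook.

```python
# Filters per conv layer in the five VGG16 blocks described above.
VGG16_BLOCKS = [[64, 64], [128, 128], [256, 256, 256],
                [512, 512, 512], [512, 512, 512]]


def conv_params(in_ch, out_ch, kernel=3):
    """Weights plus biases of one Conv2D layer."""
    return kernel * kernel * in_ch * out_ch + out_ch


def dense_params(in_units, out_units):
    """Weights plus biases of one Dense layer."""
    return in_units * out_units + out_units


def vgg16_backbone_params(input_channels=3):
    """Sum conv parameters over all 13 layers (pooling has none)."""
    total, in_ch = 0, input_channels
    for block in VGG16_BLOCKS:
        for out_ch in block:
            total += conv_params(in_ch, out_ch)
            in_ch = out_ch
    return total


def size_mb(num_params, bytes_per_param=4):
    """Approximate float32 storage in MiB."""
    return num_params * bytes_per_param / 2**20


backbone = vgg16_backbone_params()
head = dense_params(512, 256) + dense_params(256, 20)
print(f"backbone: {backbone:,}")
print(f"head:     {head:,}")
print(f"total:    {backbone + head:,} (~{size_mb(backbone + head):.1f} MB)")
```

Under these assumptions the frozen backbone accounts for roughly 14.7M parameters and the head for roughly 0.14M, so nearly all of the reported total sits in non-trainable weights.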

#### Detailed Architecture Flow

```
Input Image (224×224×3)
    ↓
Rescaling Layer
    ↓
VGG16 Backbone (FROZEN)
    ├─ Conv Block 1 (64 filters)
    ├─ Conv Block 2 (128 filters)
    ├─ Conv Block 3 (256 filters)
    ├─ Conv Block 4-5 (512 filters)
    └─ Max Pooling after each block
    ↓
Feature Map (7×7×512)
    ↓
Global Average Pooling
    ↓
Feature Vector (512)
    ↓
Dense Layer (512 → 256) + ReLU
    ↓
Dropout (0.5)
    ↓
Dense Layer (256 → 20) + Softmax
    ↓
Class Probabilities (20 classes)
```

#### Training Configuration & Hyperparameters

- **Optimizer**: Adam with learning rate 0.0001
- **Loss Function**: Sparse Categorical Crossentropy
- **Batch Size**: 32
- **Epochs**: 20 (full training)
- **Callbacks**: EarlyStopping (patience=10), ReduceLROnPlateau (patience=5)

#### Training Process & Observations

**Key Characteristics:**
- **Large Training-Validation Gap**: Training accuracy (72.71%) is much lower than validation accuracy (88.76%)
- **Strong Regularization**: The -16.05% gap indicates strong regularization from augmentation and dropout
- **Validation > Training**: Unusual pattern showing that augmentation (and dropout, which is active only during training) makes the training task harder
- **Good Generalization**: 3.59% gap between validation and test accuracy

**Why Large Gap:**
1. **Data Augmentation**: Heavy augmentation makes training images harder
2. **Deep Architecture**: VGG16's depth requires more training
3. **Frozen Backbone**: Only classifier adapts, limiting learning capacity
4. **Regularization Effects**: Strong regularization prevents memorization

#### Performance Results & Analysis

**Training Metrics:**
- **Best Epoch**: 20
- **Training Accuracy**: 72.71%
- **Validation Accuracy**: 88.76% (Updated)
- **Test Accuracy**: 85.17% (Updated)

**Gap Analysis:**
- **Overfitting Gap**: -16.05% (Training - Validation; strong regularization)
- **Generalization Gap**: 3.59% (Validation - Test; small)
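
Both gaps are plain differences of the reported accuracies. As a minimal sketch (the function name is illustrative), the updated VGG16 figures listed under Training Metrics (72.71% train, 88.76% validation, 85.17% test) give:

```python
def gaps(train_acc, val_acc, test_acc):
    """Return (overfitting gap, generalization gap) in percentage points."""
    overfitting = round(train_acc - val_acc, 2)    # Training - Validation
    generalization = round(val_acc - test_acc, 2)  # Validation - Test
    return overfitting, generalization


# Updated accuracies for the standalone VGG16 model.
print(gaps(72.71, 88.76, 85.17))  # (-16.05, 3.59)
```

A negative overfitting gap means validation accuracy exceeds training accuracy; a positive generalization gap means test accuracy is below validation accuracy.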

**Model Efficiency:**
- **Model Size**: 57.2 MB
- **Parameters**: 15.0M total (backbone frozen; only the classification head is trainable)
- **Inference Speed**: Moderate
- **Production Ready**: Yes (good accuracy)

#### Why This Model Shows Strong Regularization

1. **VGG16 Depth**: Deep architecture benefits from strong regularization
2. **Data Augmentation**: Heavy augmentation creates training difficulty
3. **Frozen Backbone**: Limits overfitting by keeping base features fixed
4. **Simple Classifier**: Prevents overfitting to training set

#### K-Fold Cross-Validation Results & Analysis

**K-Fold Cross-Validation Results:**

| Fold | Validation Accuracy | Notes |
|------|---------------------|-------|
| Fold 1 | 71.06% | Highest accuracy |
| Fold 2 | 63.71% | Moderate performance |
| Fold 3 | 69.98% | Good performance |
| Fold 4 | 63.50% | Lowest accuracy |
| Fold 5 | 69.05% | Moderate performance |

**K-Fold Statistics:**
- **Average Accuracy**: 67.46%
- **Standard Deviation**: 3.21% (Moderate consistency)
- **Minimum Accuracy**: 63.50%
- **Maximum Accuracy**: 71.06%
- **Range**: 7.56% (Higher variability)
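
The statistics above can be reproduced from the fold table with the standard library. This is a small sketch rather than the notebook's own code; it assumes the reported standard deviation is the population form (dividing by n), which matches the 3.21% figure, whereas the sample form (dividing by n - 1) would give about 3.59%.

```python
from statistics import mean, pstdev

# Fold accuracies (%) copied from the table above.
fold_acc = [71.06, 63.71, 69.98, 63.50, 69.05]

summary = {
    "mean": round(mean(fold_acc), 2),
    "std": round(pstdev(fold_acc), 2),  # population std (divides by n)
    "min": min(fold_acc),
    "max": max(fold_acc),
    "range": round(max(fold_acc) - min(fold_acc), 2),
}
print(summary)
```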

**Key Insights from K-Fold Results:**
1. **Moderate Consistency**: Standard deviation of 3.21% shows moderate variability across different data splits
2. **Moderate Performance**: Folds achieved 63-71% accuracy range
3. **Higher Variability**: Range of 7.56% indicates model performance varies more across different data splits
4. **Alignment with Test Results**: K-Fold average (67.46%) is 17.71 percentage points below test accuracy (85.17%); a gap this large suggests the single test split is optimistic or that the K-Fold runs trained under different conditions, so it should not be read as evidence of good generalization on its own
5. **Training Variability**: Larger gap between K-Fold and test suggests model may benefit from more stable training or hyperparameter tuning

**Comparison with Other Models:**
- K-Fold average (67.46%) is moderate among all models
- Standard deviation (3.21%) is higher than top models, indicating more variability
- Test accuracy (85.17%) is significantly higher than K-Fold average, suggesting good generalization potential

---

### 5. ResNet Standalone Model

#### Architecture Overview

**Standalone CNN Architecture:**
This model uses ResNet as a feature extractor with residual connections. It uses a higher learning rate than most other models and shows a small validation-to-test gap (1.18%).

**Architecture Components:**

1. **ResNet Backbone (Frozen)**:
   - **Pre-trained**: ImageNet weights
   - **Input**: 224×224×3 RGB images
   - **Architecture**: Residual network with skip connections
     - **Residual Blocks**: Identity mappings with skip connections
     - **Batch Normalization**: After each convolution
     - **ReLU Activation**: After batch norm
   - **Output**: Feature maps (varies by ResNet variant)
   - **Parameters**: Frozen, non-trainable
   - **Key Features**: Residual connections enable deeper networks, prevent vanishing gradients

2. **Global Average Pooling (GAP)**:
   - **Operation**: Average pooling over spatial dimensions
   - **Output**: Feature vector

3. **Classification Head**:
   - **Dense Layers**: Multiple dense layers for classification
   - **Activation**: Softmax

**Total Parameters:**
- **Total**: 24.6M parameters (the backbone is frozen, so only the classification head is trainable)
- **Model Size**: 93.8 MB

#### Detailed Architecture Flow

```
Input Image (224×224×3)
    ↓
Rescaling Layer
    ↓
ResNet Backbone (FROZEN)
    ├─ Initial Conv + BatchNorm + ReLU
    ├─ Residual Block 1
    ├─ Residual Block 2
    ├─ Residual Block 3
    ├─ Residual Block 4
    └─ Average Pooling
    ↓
Feature Map
    ↓
Global Average Pooling
    ↓
Feature Vector
    ↓
Dense Layers
    ↓
Softmax Output
```

#### Training Configuration & Hyperparameters

**Optimizer**: Adam with learning rate 0.001 (higher than other models)
- **Rationale**: ResNet architecture can handle higher learning rates due to residual connections
- **Beta1**: 0.9, **Beta2**: 0.999

**Loss Function**: Sparse Categorical Crossentropy

**Training Setup:**
- **Batch Size**: 32 samples per batch
- **Total Epochs**: 20 (early stopping at epoch 13, best at epoch 12)
- **Steps per Epoch**: 65 steps
- **Device**: NVIDIA GPU with CUDA support

**Data Augmentation**: Same pipeline as other models

**Callbacks:**
- **EarlyStopping**: Monitor `val_accuracy`, patience=5, restore_best_weights=True
- **ReduceLROnPlateau**: Monitor `val_loss`, factor=0.5, patience=5, min_lr=1e-8
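
To make the interaction between `patience` and the stopping epoch concrete, here is a minimal pure-Python sketch of patience-based early stopping on a maximized metric. It mirrors the usual wait-counter logic but is not the Keras implementation, and the validation curve is hypothetical (it is not taken from the training logs).

```python
def simulate_early_stopping(val_acc, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is None when training runs through every epoch.
    """
    best, best_epoch, wait = float("-inf"), None, 0
    for epoch, acc in enumerate(val_acc, start=1):
        if acc > best + min_delta:
            # Improvement: remember it and reset the wait counter.
            best, best_epoch, wait = acc, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Hypothetical curve: improves until epoch 12, then plateaus for 8 epochs.
history = [0.40 + 0.035 * e for e in range(1, 13)] + [0.80] * 8
print(simulate_early_stopping(history, patience=5))
print(simulate_early_stopping(history, patience=1))
```

With this illustrative curve, `patience=5` returns `(17, 12)` while `patience=1` returns `(13, 12)`, which is a useful cross-check when comparing a configured patience against the epoch at which a run actually stopped.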

#### Training Process & Observations

**Key Characteristics:**
- **Higher Learning Rate**: 0.001 (10× higher than the 0.0001 used by most other models)
- **Early Stopping**: Triggered at epoch 13 (best weights from epoch 12)
- **Strong Regularization**: Training accuracy (67.80%) lower than validation (82.17%)
- **Small Generalization Gap**: 1.18% gap between validation (82.17%) and test (80.99%)

**Training Progression:**
- Model showed steady improvement from epoch 1
- Best validation accuracy achieved at epoch 12
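- Early stopping is reported at epoch 13, one epoch after the best validation accuracy; with the configured patience=5, stopping would be expected at epoch 17, so either the patience value or the stop epoch should be re-checked against the logs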
- Training accuracy consistently lower than validation (strong regularization)

**Why Higher Learning Rate:**
1. **Residual Connections**: Enable stable training with higher learning rates
2. **Batch Normalization**: Provides additional stability
3. **Skip Connections**: Help gradient flow, allowing larger updates

#### Performance Results & Analysis

**Training Metrics:**
- **Best Epoch**: 12
- **Training Accuracy**: 67.80%
- **Training Loss**: Moderate

**Validation Metrics:**
- **Validation Accuracy**: 82.17% (Updated)
- **Validation Loss**: Low
- **Precision**: 79.81%
- **Recall**: 82.17% (Updated)
- **F1-Score**: 79.42%

**Test Metrics (Unseen Data):**
- **Test Accuracy**: 80.99% (Updated)
- **Test Precision**: 79.81%
- **Test Recall**: 81.01%
- **Test F1-Score**: 79.42%

**Gap Analysis:**
- **Overfitting Gap**: -14.37% (Training - Validation)
  - Negative gap indicates strong regularization
  - Validation performs significantly better than training
- **Generalization Gap**: 1.18% (Validation - Test)
  - Small gap between validation and test accuracy
  - Model generalizes well to unseen test data

**Model Efficiency:**
- **Model Size**: 93.8 MB
- **Parameters**: 24.6M total (frozen backbone plus trainable head)
- **Inference Speed**: Moderate
- **Production Ready**: Yes (good accuracy, small generalization gap)

#### Why This Model Generalizes Well

1. **Residual Architecture**: Enables stable training and good generalization
2. **Higher Learning Rate**: Allows model to explore solution space effectively
3. **Early Stopping**: Prevents overfitting by stopping at optimal point
4. **Strong Regularization**: Data augmentation provides excellent regularization
5. **Transfer Learning**: Pre-trained ResNet provides robust features

#### K-Fold Cross-Validation Results & Analysis

**K-Fold Cross-Validation Results:**

| Fold | Validation Accuracy | Notes |
|------|---------------------|-------|
| Fold 1 | 73.00% | Highest accuracy |
| Fold 2 | 71.92% | Good performance |
| Fold 3 | 69.55% | Lowest accuracy |
| Fold 4 | 69.76% | Moderate performance |
| Fold 5 | 71.65% | Good performance |

**K-Fold Statistics:**
- **Average Accuracy**: 71.18%
- **Standard Deviation**: 1.32% (Good consistency)
- **Minimum Accuracy**: 69.55%
- **Maximum Accuracy**: 73.00%
- **Range**: 3.45% (Moderate stability)

**Key Insights from K-Fold Results:**
1. **Good Consistency**: Standard deviation of 1.32% shows consistent performance across different data splits
2. **Moderate Performance**: Folds achieved 69-73% accuracy range
3. **Stable Learning**: Moderate range (3.45%) shows model learns consistently across different data splits
4. **Alignment with Test Results**: K-Fold average (71.18%) is 9.81 percentage points below test accuracy (80.99%); the single held-out test split scored higher than the cross-validated average, so the K-Fold figure is the more conservative estimate of performance
5. **Consistent Performance**: Low standard deviation indicates model is stable across different data splits

**Comparison with Other Models:**
- K-Fold average (71.18%) is moderate but consistent
- Standard deviation (1.32%) is good, showing stable performance
- Test accuracy (80.99%) is higher than K-Fold average, indicating good generalization potential

---

### 6. ResNet50 & Vision Transformer Hybrid Model ⚠️

#### Architecture Overview

**Hybrid Architecture with Memory Constraints:**
This model combines ResNet50's deep features (2048 channels) with Vision Transformer. Requires extensive memory optimizations due to large feature dimensions.

**Architecture Components:**

1. **Feature Extraction Stage (ResNet50 Backbone)**:
   - **Pre-trained**: ImageNet weights
   - **Input**: 224×224×3 RGB images
   - **Architecture**: ResNet50 with 50 layers
     - **Residual Blocks**: Multiple residual blocks with skip connections
     - **Output Feature Map**: 7×7×2048 spatial feature maps
   - **Total Backbone Parameters**: Frozen, non-trainable
   - **Why ResNet50**: Very deep architecture with 2048 output channels (largest among all models)

2. **Feature Processing Stage (Vision Transformer - Reduced)**:
   - **Patch Creation**: CNN feature maps (7×7×2048) reshaped into 49 patches of 2048 dimensions
   - **Position Embedding**: Learnable positional encodings
   - **Class Token**: Prepend learnable classification token
   - **Transformer Blocks**: 2 blocks (REDUCED from 4 to prevent OOM)
     - **Multi-Head Self-Attention**: 4 heads (REDUCED from 8), key_dim = 512
     - **MLP**: 2-layer feedforward network with dimension 2048 (REDUCED from 4096)
     - **Layer Normalization**: Pre-norm architecture
     - **Residual Connections**: Skip connections
   - **Output**: Class token extracted after final transformer block

3. **Classification Head**:
   - **Dense Layer**: 20 units (one per medicinal plant class)
   - **Activation**: Softmax

**Total Architecture Parameters:**
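- **Trainable**: Vision Transformer blocks and classification head
- **Non-trainable**: ResNet50 backbone (frozen)
- **Total**: ~75.1M parameters (as listed in the summary table, consistent with the 286.5 MB model size at 4 bytes per parameter)
- **Model Size**: 286.5 MB (largest model)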

#### Memory Optimizations Applied

**Why Optimizations Needed:**
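- ResNet50 outputs 2048 channels (4× as many as VGG16's 512)
- Transformer projection and MLP weight matrices scale quadratically with the embedding dimension, so 2048-dimensional tokens need roughly 16× the weights of 512-dimensional tokens (the attention matrix itself covers only 50 tokens)
- Original architecture caused Out-Of-Memory (OOM) errors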
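
The effect of the embedding dimension on transformer size can be estimated with a short calculation. This is a rough sketch under stated assumptions: query, key, value, and output projections with biases, a two-layer MLP, and two LayerNorms per block. The 512-dimensional configuration is hypothetical and included only for comparison; the notebook's actual layers may differ in detail.

```python
def mha_params(embed_dim, num_heads, key_dim):
    """Q, K, V and output projections, each with biases."""
    inner = num_heads * key_dim
    qkv = 3 * (embed_dim * inner + inner)
    out = inner * embed_dim + embed_dim
    return qkv + out


def mlp_params(embed_dim, mlp_dim):
    """Two dense layers: embed_dim -> mlp_dim -> embed_dim."""
    return (embed_dim * mlp_dim + mlp_dim) + (mlp_dim * embed_dim + embed_dim)


def block_params(embed_dim, num_heads, key_dim, mlp_dim):
    """Approximate parameter count of one transformer encoder block."""
    layer_norms = 2 * (2 * embed_dim)  # scale and offset for two LayerNorms
    return (mha_params(embed_dim, num_heads, key_dim)
            + mlp_params(embed_dim, mlp_dim)
            + layer_norms)


# Reduced configuration described in this section.
resnet50_block = block_params(embed_dim=2048, num_heads=4, key_dim=512, mlp_dim=2048)
# Hypothetical 512-dimensional configuration, for comparison only.
small_block = block_params(embed_dim=512, num_heads=4, key_dim=128, mlp_dim=512)

print(f"2048-dim block: {resnet50_block:,}")
print(f"512-dim block:  {small_block:,}")
print(f"ratio: {resnet50_block / small_block:.1f}x")
```

Under these assumptions a single 2048-dimensional block holds about 25M parameters, roughly 16 times the 512-dimensional one, which is why the token width rather than the token count dominates memory here.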

**Optimizations Implemented:**

1. **Reduced Transformer Blocks**: 4 → 2 blocks
   - **Impact**: ~50% reduction in transformer parameters
   - **Trade-off**: Less capacity for attention processing

2. **Reduced Attention Heads**: 8 → 4 heads
   - **Impact**: ~50% reduction in attention computation
   - **Trade-off**: Less diverse attention patterns

3. **Reduced MLP Dimension**: 4096 → 2048 (patch_dim instead of patch_dim × 2)
   - **Impact**: ~50% reduction in MLP parameters
   - **Trade-off**: Less expressive feedforward network

4. **Mixed Precision Training**: Enabled float16
   - **Impact**: Up to ~50% reduction in activation memory (under the `mixed_float16` policy, weights are still stored in float32)
   - **Trade-off**: Slight numerical precision loss (usually negligible)

5. **Reduced Batch Size**: 32 → 16
   - **Impact**: ~50% reduction in memory per batch
   - **Trade-off**: Less stable gradients, slower training

6. **Dynamic Batch Size Calculation**:
   - Ensures exactly 65 steps per epoch
   - Optimizes memory usage while maintaining training consistency

#### Detailed Architecture Flow

```
Input Image (224×224×3)
    ↓
Rescaling Layer
    ↓
ResNet50 Backbone (FROZEN)
    ├─ Multiple Residual Blocks
    └─ Output: 7×7×2048
    ↓
Feature Map (7×7×2048) [LARGE]
    ↓
Reshape to Patches (49 patches × 2048 dims)
    ↓
Add Position Embeddings
    ↓
Prepend Class Token
    ↓
Transformer Block 1 (REDUCED: 4 heads, MLP=2048)
    ↓
Transformer Block 2 (REDUCED: 4 heads, MLP=2048)
    ↓
Extract Class Token
    ↓
Dense Classification Layer (20 classes)
    ↓
Softmax Output
```

#### Training Configuration & Hyperparameters

**Optimizer**: Adam with learning rate 0.0001
- Lower LR for transfer learning

**Loss Function**: Sparse Categorical Crossentropy

**Training Setup:**
- **Batch Size**: 16 (reduced from 32)
- **Total Epochs**: 20 (full training)
- **Steps per Epoch**: 65 steps (dynamically calculated)
- **Device**: NVIDIA GPU with CUDA support
- **Mixed Precision**: Enabled (mixed_float16)

**Data Augmentation**: Same pipeline as other models

**Callbacks:**
- **EarlyStopping**: Monitor `val_accuracy`, patience=10, min_delta=0.0001, restore_best_weights=True
- **ReduceLROnPlateau**: Monitor `val_loss`, factor=0.5, patience=7, min_lr=1e-8

#### Training Process & Observations

**Key Characteristics:**
- **Low Training Accuracy**: 51.85% (struggles with convergence)
- **Low Validation Accuracy**: 60.47% (Updated; 8.62 points higher than training)
- **Unusual Pattern**: Test accuracy (65.78% - Updated) higher than validation
- **Memory Constraints**: Required extensive optimizations
- **Largest Model**: 286.5 MB model size

**Training Challenges:**
1. **Large Feature Dimensions**: 2048-dimensional tokens create very large projection and MLP weight matrices (the attention matrix itself is small, at 50×50 per head)
2. **Memory Limitations**: Required reducing architecture components
3. **Convergence Issues**: Model struggles to learn effectively
4. **Underfitting**: Low training accuracy suggests model capacity may be insufficient

**Why Low Performance:**
1. **Reduced Architecture**: Memory optimizations reduced model capacity
2. **Large Feature Space**: 2048 dimensions may be too large for transformer to process effectively
3. **Limited Transformer Blocks**: Only 2 blocks may be insufficient
4. **Batch Size**: Smaller batch size (16) may affect training stability

#### Performance Results & Analysis

**Training Metrics:**
- **Best Epoch**: 20
- **Training Accuracy**: 51.85% (low)
- **Training Loss**: High (indicating poor fit)

**Validation Metrics:**
- **Validation Accuracy**: 60.47% (Updated)
- **Validation Loss**: High
- **Precision**: 53.90%
- **Recall**: 60.47% (Updated)
- **F1-Score**: 49.20%

**Test Metrics (Unseen Data):**
- **Test Accuracy**: 65.78% (Updated - unusually higher than validation)
- **Test Precision**: 60.08%
- **Test Recall**: 60.08%
- **Test F1-Score**: 60.08%

**Gap Analysis:**
- **Overfitting Gap**: -8.62% (Training - Validation)
  - Negative gap indicates regularization
  - Validation is 8.62 points better than training
- **Generalization Gap**: -5.31% (Validation - Test)
  - **Unusual Pattern**: Test performs better than validation
  - May indicate validation set is harder or model benefits from more diverse test data

**Model Efficiency:**
- **Model Size**: 286.5 MB (largest)
- **Parameters**: 75.1M total (ResNet50 backbone frozen)
- **Inference Speed**: Slow (large model)
- **Memory Usage**: Very high
- **Production Ready**: No (low accuracy, large size)

#### K-Fold Cross-Validation Results & Analysis

**K-Fold Cross-Validation Results:**

| Fold | Validation Accuracy | Notes |
|------|---------------------|-------|
| Fold 1 | 12.53% | Low accuracy |
| Fold 2 | 16.20% | Highest accuracy (still very low) |
| Fold 3 | 13.61% | Low accuracy |
| Fold 4 | 12.53% | Low accuracy |
| Fold 5 | 11.90% | Lowest accuracy |

**K-Fold Statistics:**
- **Average Accuracy**: 13.35% (Very low)
- **Standard Deviation**: 1.52% (Low but consistent at low performance)
- **Minimum Accuracy**: 11.90%
- **Maximum Accuracy**: 16.20%
- **Range**: 4.30% (Low accuracy range)

**Key Insights from K-Fold Results:**
1. **Very Low Performance**: K-Fold average of 13.35% indicates severe training issues
2. **Consistent Low Performance**: Standard deviation of 1.52% is low, but this indicates consistently poor performance across all folds
3. **Training Problems**: All folds achieved only 11-16% accuracy, suggesting fundamental training or architectural issues
4. **Large Gap with Test**: K-Fold average (13.35%) is much lower than test accuracy (65.78%), difference of +52.43%, indicating extreme training instability or configuration differences
5. **Architecture Challenges**: Very low K-Fold performance suggests the hybrid architecture may not be suitable or requires significant hyperparameter tuning

**Comparison with Other Models:**
- K-Fold average (13.35%) is second lowest, indicating severe training issues
- Standard deviation (1.52%) is low but this reflects consistently poor performance
- Test accuracy (65.78%) is dramatically higher than K-Fold, suggesting training instability or different configurations

#### Why This Model Struggles

1. **Memory Constraints**: Required reducing architecture capacity
2. **Large Feature Dimensions**: 2048 channels create computational challenges
3. **Insufficient Capacity**: Reduced transformer blocks may be too limiting
4. **Training Instability**: Smaller batch size affects gradient estimates
5. **Architecture Mismatch**: ResNet50's 2048 channels may be too large for this hybrid approach

#### Recommendations for Improvement

1. **Feature Reduction**: Add bottleneck layer to reduce 2048 → 512 before transformer
2. **More Transformer Blocks**: Increase to 4 blocks if memory allows
3. **Larger Batch Size**: Use gradient accumulation to simulate larger batches
4. **Different Architecture**: Consider using ResNet50 features differently (e.g., multi-scale)
5. **Progressive Training**: Train transformer components progressively
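
The gradient accumulation idea in recommendation 3 can be illustrated on a toy problem. The sketch below is not the notebook's training loop; it uses a one-parameter linear model with made-up data to show that averaging gradients over two micro-batches of 16 reproduces the gradient of a single batch of 32, provided the micro-batches are equally sized.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Average gradients over equally sized micro-batches before one update."""
    chunks = [data[i:i + micro_batch_size]
              for i in range(0, len(data), micro_batch_size)]
    return sum(grad_mse(w, chunk) for chunk in chunks) / len(chunks)


# Toy data (illustrative only): 32 samples of y = 3x.
data = [(x / 10, 3 * x / 10) for x in range(1, 33)]
w = 0.5

full = grad_mse(w, data)               # one batch of 32
accum = accumulated_grad(w, data, 16)  # two micro-batches of 16
print(full, accum)
```

The two values agree up to floating-point error. In a real network, layers that depend on batch statistics (such as BatchNorm) still see only the micro-batch, so accumulation matches the larger batch for the gradient average but not for those statistics.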

---

### 7. ZFNet & Vision Transformer Hybrid Model ⚠️

#### Architecture Overview

**Hybrid Architecture (Trained from Scratch):**
This model combines a custom ZFNet backbone (trained from scratch) with Vision Transformer. Unlike other models, this doesn't use pre-trained weights, making training more challenging.

**Architecture Components:**

1. **Feature Extraction Stage (Custom ZFNet Backbone)**:
   - **Pre-trained**: ❌ No (trained from scratch)
   - **Input**: 224×224×3 RGB images
   - **Architecture**: Custom ZFNet with 5 convolutional layers
     - **Layer 1**: Conv2D + MaxPooling + BatchNorm
     - **Layer 2**: Conv2D + MaxPooling + BatchNorm
     - **Layer 3**: Conv2D + MaxPooling + BatchNorm
     - **Layer 4**: Conv2D + MaxPooling + BatchNorm
     - **Layer 5**: Conv2D + MaxPooling + BatchNorm
   - **Output Feature Map**: 7×7×256 spatial feature maps
   - **Parameters**: All trainable (no frozen weights)
   - **Why Custom ZFNet**: Simpler architecture, smaller feature dimensions (256 channels)

2. **Feature Processing Stage (Vision Transformer)**:
   - **Patch Creation**: CNN feature maps (7×7×256) reshaped into 49 patches of 256 dimensions
   - **Position Embedding**: Learnable positional encodings
   - **Class Token**: Prepend learnable classification token
   - **Transformer Blocks**: 4 sequential transformer encoder blocks
     - **Multi-Head Self-Attention**: 8 attention heads, key_dim = 32 (256/8)
     - **MLP**: 2-layer feedforward network with dimension 512 (patch_dim × 2)
     - **Layer Normalization**: Pre-norm architecture
     - **Residual Connections**: Skip connections
   - **Output**: Class token extracted after final transformer block

3. **Classification Head**:
   - **Dense Layer**: 20 units (one per medicinal plant class)
   - **Activation**: Softmax

**Total Architecture Parameters:**
- **Trainable**: 6.0M parameters (all trainable)
- **Non-trainable**: 0 parameters
- **Total**: 6.0M parameters
- **Model Size**: 22.9 MB (smallest hybrid model)

#### Detailed Architecture Flow

```
Input Image (224×224×3)
    ↓
Rescaling Layer (included in ZFNet)
    ↓
Custom ZFNet Backbone (TRAINABLE - from scratch)
    ├─ Conv2D Layer 1 + MaxPool + BatchNorm
    ├─ Conv2D Layer 2 + MaxPool + BatchNorm
    ├─ Conv2D Layer 3 + MaxPool + BatchNorm
    ├─ Conv2D Layer 4 + MaxPool + BatchNorm
    └─ Conv2D Layer 5 + MaxPool + BatchNorm
    ↓
Feature Map (7×7×256)
    ↓
Reshape to Patches (49 patches × 256 dims)
    ↓
Add Position Embeddings (learnable)
    ↓
Prepend Class Token
    ↓
Transformer Block 1-4
    ├─ Multi-Head Self-Attention (8 heads)
    ├─ Layer Norm + Residual
    ├─ MLP (256 → 512 → 256)
    └─ Layer Norm + Residual
    ↓
Extract Class Token
    ↓
Dense Classification Layer (20 classes)
    ↓
Softmax Output
```
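
The reshape and class-token steps in the flow above determine the sequence the transformer sees. The following sketch uses plain nested lists with placeholder values; the row-major flattening order and the function name are assumptions for illustration, not details confirmed by the notebook.

```python
def to_tokens(feature_map, class_token):
    """Flatten an H x W x C feature map into H*W tokens and prepend a class token."""
    tokens = [pixel for row in feature_map for pixel in row]  # row-major order
    return [class_token] + tokens


H, W, C = 7, 7, 256
feature_map = [[[0.0] * C for _ in range(W)] for _ in range(H)]  # placeholder values
class_token = [0.0] * C  # a learnable vector in the real model

sequence = to_tokens(feature_map, class_token)
num_heads = 8
print(len(sequence), len(sequence[0]), C // num_heads)  # 50 256 32
```

The sequence has 50 tokens (49 patches plus the class token) of width 256, and splitting 256 dimensions across 8 heads gives the key_dim of 32 listed above.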

#### Training Configuration & Hyperparameters

**Optimizer**: Adam with learning rate 0.0001
- Same learning rate as the other hybrid models
- Lower than the 0.001 used for the standalone ZFNet, which is also trained from scratch

**Loss Function**: Sparse Categorical Crossentropy

**Training Setup:**
- **Batch Size**: 32 samples per batch
- **Total Epochs**: 20 (converged at epoch 17)
- **Steps per Epoch**: 65 steps
- **Device**: NVIDIA GPU with CUDA support

**Data Augmentation**: Same pipeline as other models

**Callbacks:**
- **EarlyStopping**: Monitor `val_accuracy`, patience=10, min_delta=0.0001, restore_best_weights=True
- **ReduceLROnPlateau**: Monitor `val_loss`, factor=0.5, patience=7, min_lr=1e-8

#### Training Process & Observations

**Key Characteristics:**
- **Trained from Scratch**: No pre-trained weights (unlike other models)
- **Moderate Convergence**: Training accuracy 61.62%, validation 76.74%
- **No Overfitting Signal**: -15.12% gap (validation > training)
- **Test Above Validation**: -3.87% gap to test (test > validation)
- **Smallest Hybrid**: 22.9 MB, 6.0M parameters
- **Unexpected Performance**: Performs worse than standalone ZFNet

**Training Challenges:**
1. **No Transfer Learning**: Must learn features from scratch
2. **Limited Data**: Small dataset makes training from scratch difficult
3. **Architecture Complexity**: Hybrid architecture may be too complex for scratch training
4. **Feature Learning**: ZFNet must learn good features while transformer processes them

**Why Lower Performance:**
1. **No Pre-training**: Missing ImageNet-learned features
2. **Small Dataset**: Insufficient data for training complex hybrid from scratch
3. **Architecture Mismatch**: Transformer may need better features than ZFNet can learn
4. **Training Difficulty**: Learning both CNN and transformer simultaneously is challenging

#### Performance Results & Analysis

**Training Metrics:**
- **Best Epoch**: 17
- **Training Accuracy**: 61.62%
- **Training Loss**: Moderate

**Validation Metrics:**
- **Validation Accuracy**: 76.74% (Updated)
- **Validation Loss**: Moderate
- **Precision**: 62.12%
- **Recall**: 76.74% (Updated)
- **F1-Score**: 54.64%

**Test Metrics (Unseen Data):**
- **Test Accuracy**: 80.61% (Updated)
- **Test Precision**: 60.08%
- **Test Recall**: 60.08%
- **Test F1-Score**: 60.08%

**Gap Analysis:**
- **Overfitting Gap**: -15.12% (Training - Validation)
  - Negative gap indicates regularization rather than overfitting
  - Validation performs better than training
- **Generalization Gap**: -3.87% (Validation - Test)
  - Negative gap means test accuracy exceeds validation accuracy
  - Test performs better than validation (unusual; may reflect differences between the splits)

**Model Efficiency:**
- **Model Size**: 22.9 MB (smallest hybrid)
- **Parameters**: 6.0M trainable
- **Inference Speed**: Fast (small model)
- **Memory Usage**: Low
- **Production Ready**: No (the 80.61% test accuracy is not reproduced under K-Fold cross-validation)

#### K-Fold Cross-Validation Results & Analysis

**K-Fold Cross-Validation Results:**

| Fold | Validation Accuracy | Notes |
|------|---------------------|-------|
| Fold 1 | 5.40% | Very low accuracy |
| Fold 2 | 4.97% | Very low accuracy |
| Fold 3 | 4.10% | Lowest accuracy |
| Fold 4 | 5.83% | Highest accuracy (still very low) |
| Fold 5 | 4.33% | Very low accuracy |

**K-Fold Statistics:**
- **Average Accuracy**: 4.93% (Extremely low)
- **Standard Deviation**: 0.64% (Low but consistently poor)
- **Minimum Accuracy**: 4.10%
- **Maximum Accuracy**: 5.83%
- **Range**: 1.73% (Extremely low accuracy range)

**Key Insights from K-Fold Results:**
1. **Extremely Low Performance**: K-Fold average of 4.93% indicates severe training failure
2. **Consistently Poor Performance**: Standard deviation of 0.64% is low, but this reflects consistently extremely poor performance across all folds
3. **Training Failure**: All folds achieved only 4-6% accuracy (essentially random guessing for 20 classes = 5%), suggesting fundamental training or architectural issues
4. **Extreme Gap with Test**: K-Fold average (4.93%) is dramatically lower than test accuracy (80.61%), difference of +75.68%, indicating severe training instability or completely different configurations
5. **Architecture Not Suitable**: Extremely low K-Fold performance suggests the ZFNet & ViT hybrid architecture trained from scratch is not suitable for this dataset
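
The comparison with random guessing in insight 3 can be made explicit. This small sketch assumes balanced classes, so that uniform guessing over 20 classes yields 5% expected accuracy; the fold values are copied from the table above.

```python
from statistics import mean

NUM_CLASSES = 20
# Expected accuracy (%) of uniform random guessing, assuming balanced classes.
chance = 100 / NUM_CLASSES

fold_acc = [5.40, 4.97, 4.10, 5.83, 4.33]  # ZFNet & ViT fold accuracies (%)
avg = mean(fold_acc)

print(f"chance level: {chance:.2f}%")
print(f"K-Fold mean:  {avg:.2f}% ({avg - chance:+.2f} points vs chance)")
```

A mean within a fraction of a point of the chance level indicates that the K-Fold runs did not learn a usable decision rule, independent of how the single-split test result turned out.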

**Comparison with Other Models:**
- K-Fold average (4.93%) is lowest among all models, indicating worst training performance
- Standard deviation (0.64%) is low but reflects consistently extremely poor performance
- Test accuracy (80.61%) is dramatically higher than K-Fold, suggesting severe training instability or different training configurations

#### Why This Model Performs Poorly

1. **No Transfer Learning**: Missing pre-trained features significantly hurts performance
2. **Small Dataset**: Training hybrid architecture from scratch requires more data
3. **Architecture Complexity**: Hybrid may be too complex without pre-training
4. **Feature Quality**: ZFNet features may not be rich enough for transformer
5. **Training Difficulty**: Learning both components simultaneously is challenging

#### Comparison with Standalone ZFNet

**Standalone ZFNet performs better (89.73% vs 80.61% - Updated):**
- Standalone has 72.0M parameters vs hybrid's 6.0M
- Standalone benefits from simpler architecture
- Standalone can learn features more effectively without transformer overhead
- Hybrid adds complexity without sufficient benefit when trained from scratch

---

### 8. ZFNet Standalone Model

#### Architecture Overview

**Standalone CNN Architecture (Trained from Scratch):**
This model uses a custom ZFNet architecture trained entirely from scratch. Shows strong regularization effects and performs better than its hybrid counterpart.

**Architecture Components:**

1. **Custom ZFNet Backbone (All Trainable)**:
   - **Pre-trained**: ❌ No (trained from scratch)
   - **Input**: 224×224×3 RGB images
   - **Architecture**: Custom ZFNet with 5 convolutional layers
     - **Layer 1**: Conv2D + MaxPooling + BatchNorm + ReLU
     - **Layer 2**: Conv2D + MaxPooling + BatchNorm + ReLU
     - **Layer 3**: Conv2D + MaxPooling + BatchNorm + ReLU
     - **Layer 4**: Conv2D + MaxPooling + BatchNorm + ReLU
     - **Layer 5**: Conv2D + MaxPooling + BatchNorm + ReLU
   - **Output**: Feature maps (varies by layer)
   - **Parameters**: All trainable (72.0M parameters)
   - **Key Features**: 
     - Batch normalization for stable training
     - Max pooling for spatial reduction
     - ReLU activation for non-linearity

2. **Global Average Pooling (GAP)**:
   - **Operation**: Average pooling over spatial dimensions
   - **Output**: Feature vector

3. **Classification Head**:
   - **Dense Layers**: Multiple dense layers for classification
   - **Dropout**: Applied for regularization
   - **Activation**: Softmax

**Total Parameters:**
- **Trainable**: 72.0M parameters (all trainable)
- **Non-trainable**: 0 parameters
- **Total**: 72.0M parameters
- **Model Size**: 274.5 MB (very large)

#### Detailed Architecture Flow

```
Input Image (224×224×3)
    ↓
Rescaling Layer (included in ZFNet)
    ↓
Custom ZFNet Backbone (ALL TRAINABLE - from scratch)
    ├─ Conv2D Layer 1 + MaxPool + BatchNorm + ReLU
    ├─ Conv2D Layer 2 + MaxPool + BatchNorm + ReLU
    ├─ Conv2D Layer 3 + MaxPool + BatchNorm + ReLU
    ├─ Conv2D Layer 4 + MaxPool + BatchNorm + ReLU
    └─ Conv2D Layer 5 + MaxPool + BatchNorm + ReLU
    ↓
Feature Map
    ↓
Global Average Pooling
    ↓
Feature Vector
    ↓
Dense Layers + Dropout
    ↓
Softmax Output
```

#### Training Configuration & Hyperparameters

**Optimizer**: Adam with learning rate 0.001 (higher than transfer learning models)
- **Rationale**: Training from scratch requires higher learning rate for effective learning
- **Beta1**: 0.9, **Beta2**: 0.999

**Loss Function**: Sparse Categorical Crossentropy

**Training Setup:**
- **Batch Size**: 32 samples per batch
- **Total Epochs**: 20 (converged at epoch 19)
- **Steps per Epoch**: 65 steps
- **Device**: NVIDIA GPU with CUDA support

**Data Augmentation**: Same pipeline as other models

**Callbacks:**
- **EarlyStopping**: Monitor `val_accuracy`, patience=5
- **ReduceLROnPlateau**: Monitor `val_loss`, factor=0.5, patience=5, min_lr=5e-5
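
The learning-rate floor implied by these settings can be worked out directly. The sketch below only models the multiply-and-clamp step applied each time a plateau is detected; it does not model when plateaus occur, and the function name is illustrative.

```python
def lr_after_reductions(initial_lr, factor, min_lr, num_reductions):
    """Learning rate after each plateau-triggered reduction, clamped at min_lr."""
    lrs, lr = [], initial_lr
    for _ in range(num_reductions):
        lr = max(lr * factor, min_lr)
        lrs.append(lr)
    return lrs


schedule = lr_after_reductions(initial_lr=1e-3, factor=0.5, min_lr=5e-5,
                               num_reductions=6)
print(schedule)
```

Starting from 0.001 with factor 0.5, the rate halves four times to 6.25e-5 and is clamped to the 5e-5 floor from the fifth reduction onward.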

#### Training Process & Observations

**Key Characteristics:**
- **Trained from Scratch**: No pre-trained weights
- **Higher Learning Rate**: 0.001 (10× higher than the 0.0001 used by most transfer learning models)
- **Strong Regularization**: Training accuracy (77.97%) much lower than validation (91.09%)
- **Large Model**: 72.0M parameters, 274.5 MB
- **Better than Hybrid**: Performs better than ZFNet & ViT hybrid (89.73% vs 80.61% test accuracy)

**Training Progression:**
- Model showed steady improvement throughout training
- Validation accuracy consistently higher than training
- Best performance achieved at epoch 19
- Strong regularization effects from data augmentation

**Why Higher Learning Rate:**
1. **Training from Scratch**: Needs larger updates to learn features effectively
2. **No Pre-trained Features**: Must learn everything, requiring more aggressive learning
3. **Batch Normalization**: Provides stability, allowing higher learning rates
4. **Large Model**: More parameters can handle larger learning rates

**Why Better than Hybrid:**
1. **Simpler Architecture**: Standalone is simpler, easier to train from scratch
2. **More Parameters**: 72.0M vs 6.0M allows more capacity
3. **Direct Learning**: No transformer overhead, direct feature-to-classification mapping
4. **Better Feature Learning**: Can focus on learning good features without transformer constraints

#### Performance Results & Analysis

**Training Metrics:**
- **Best Epoch**: 19
- **Training Accuracy**: 77.97%
- **Training Loss**: Moderate

**Validation Metrics:**
- **Validation Accuracy**: 91.09%
- **Validation Loss**: Low
- **Precision**: 92.25%
- **Recall**: 91.09%
- **F1-Score**: 90.94%

**Test Metrics (Unseen Data):**
- **Test Accuracy**: 89.73% (Updated)
- **Test Precision**: 89.73% (Updated)
- **Test Recall**: 89.73% (Updated)
- **Test F1-Score**: 88.21%

**Gap Analysis:**
- **Overfitting Gap**: -13.12% (Training - Validation)
  - Large negative gap indicates very strong regularization
  - Validation performs significantly better than training
  - Data augmentation makes training harder but improves generalization
- **Generalization Gap**: 1.36% (Validation - Test)
  - Good generalization (small gap)
  - Test accuracy slightly lower than validation (expected)
  - Model generalizes well to unseen data
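
Both gaps follow directly from the accuracies reported in this section. A minimal check, using only those three numbers (the variable names are illustrative):

```python
# Accuracies reported above for standalone ZFNet (percent).
train_acc, val_acc, test_acc = 77.97, 91.09, 89.73

overfitting_gap = round(train_acc - val_acc, 2)     # Training - Validation
generalization_gap = round(val_acc - test_acc, 2)   # Validation - Test

print(f"Overfitting gap:    {overfitting_gap:+.2f} points")
print(f"Generalization gap: {generalization_gap:+.2f} points")
```

Both values are differences between percentages, so they are best read as percentage points rather than relative percentages.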

**Model Efficiency:**
- **Model Size**: 274.5 MB (very large)
- **Parameters**: 72.0M trainable
- **Inference Speed**: Moderate (large model)
- **Memory Usage**: High
- **Production Ready**: Moderate (good accuracy but large size)
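
The reported size is close to what the parameter count implies if each weight is stored as a 32-bit float. That storage format is an assumption here, not something this section states, and the helper name `estimated_size_mib` is illustrative.

```python
def estimated_size_mib(num_params, bytes_per_param=4):
    """Approximate weight storage in MiB, assuming a fixed width per parameter."""
    return num_params * bytes_per_param / (1024 ** 2)


zfnet_params = 72.0e6  # rounded parameter count reported above
print(f"Estimated size: {estimated_size_mib(zfnet_params):.1f} MiB")
```

The estimate lands within a fraction of a megabyte of the reported 274.5 MB. The small difference is consistent with the parameter count being rounded to 72.0M.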

#### K-Fold Cross-Validation Results & Analysis

**K-Fold Cross-Validation Results:**

| Fold | Validation Accuracy | Notes |
|------|---------------------|-------|
| Fold 1 | 9.50% | Very low accuracy |
| Fold 2 | 20.73% | Second-highest accuracy (still low) |
| Fold 3 | 7.78% | Very low accuracy |
| Fold 4 | 7.34% | Lowest accuracy |
| Fold 5 | 25.32% | Highest accuracy (still low) |

**K-Fold Statistics:**
- **Average Accuracy**: 14.14% (Very low)
- **Standard Deviation**: 7.44% (Very high - Poor consistency)
- **Minimum Accuracy**: 7.34%
- **Maximum Accuracy**: 25.32%
- **Range**: 17.98% (Extremely high variability)
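
These statistics can be recomputed from the per-fold table with the standard library. The sketch below suggests that the reported standard deviation uses the population formula (dividing by n) rather than the sample formula (dividing by n - 1). The mean of the rounded fold values comes out at about 14.13%, marginally below the reported 14.14%, presumably because the original figure was computed from unrounded accuracies. The chance-level line assumes uniform guessing over the 20 classes in the dataset.

```python
import statistics

fold_acc = [9.50, 20.73, 7.78, 7.34, 25.32]  # per-fold validation accuracy (%)

mean_acc = statistics.mean(fold_acc)
pop_std = statistics.pstdev(fold_acc)     # population standard deviation (n)
sample_std = statistics.stdev(fold_acc)   # sample standard deviation (n - 1)
acc_range = max(fold_acc) - min(fold_acc)
chance = 100 / 20                         # uniform guessing over 20 classes (%)

print(f"mean={mean_acc:.2f}  pstdev={pop_std:.2f}  stdev={sample_std:.2f}")
print(f"range={acc_range:.2f}  chance={chance:.1f}")
```

Seen against the 5% chance level, folds 1, 3, and 4 are only a few points above random guessing.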

**Key Insights from K-Fold Results:**
1. **Very Low Performance**: K-Fold average of 14.14% indicates severe training issues
2. **Poor Consistency**: Standard deviation of 7.44% is the highest among all models, indicating extreme variability across different data splits
3. **High Variability**: Range of 17.98% is extremely high, showing model performance is highly inconsistent across different data splits
4. **Extreme Gap with Test**: K-Fold average (14.14%) is dramatically lower than test accuracy (89.73%), a difference of 75.59 percentage points, which points to severe training instability or to different configurations between the K-Fold runs and the final training run
5. **Training Instability**: Very high standard deviation and extreme range suggest model training is highly unstable, with performance varying dramatically based on data split

**Comparison with Other Models:**
- K-Fold average (14.14%) is very low, indicating poor training performance in K-Fold setting
- Standard deviation (7.44%) is highest among all models, showing worst consistency
- Test accuracy (89.73%) is dramatically higher than K-Fold, suggesting severe training instability or different training configurations between K-Fold and final training

#### Why This Model Shows Strong Regularization

1. **Data Augmentation**: Heavy augmentation creates training difficulty
2. **Training from Scratch**: Model must learn robust features
3. **Large Model**: More parameters benefit from strong regularization
4. **Batch Normalization**: Provides additional regularization
5. **Dropout**: Applied in classification head

#### Comparison with Hybrid Version

**Standalone ZFNet (89.73% - Updated) vs ZFNet & ViT Hybrid (80.61% - Updated):**

**Advantages of Standalone:**
- **Better Accuracy**: 9.12 percentage points higher test accuracy (89.73% vs 80.61%)
- **Simpler Architecture**: Easier to train from scratch
- **More Parameters**: 72.0M vs 6.0M provides more capacity
- **Direct Learning**: No transformer overhead

**Disadvantages of Standalone:**
- **Larger Model**: 274.5 MB vs 22.9 MB (12× larger)
- **More Parameters**: 72.0M vs 6.0M (12× more)
- **Slower Inference**: Larger model is slower

**Why Standalone Performs Better:**
1. **Architecture Simplicity**: Simpler architecture is easier to train from scratch
2. **Parameter Count**: More parameters allow better feature learning
3. **No Transformer Overhead**: Direct CNN-to-classification is more efficient
4. **Better Feature Learning**: Can focus solely on learning good CNN features
5. **Training Efficiency**: Simpler architecture trains more effectively from scratch
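
The trade-off between the two variants can be quantified from the figures quoted in this comparison. The sketch below uses only those reported numbers; the dictionary layout and variable names are illustrative.

```python
# Figures quoted in this comparison (test accuracy in %, parameters in millions).
standalone = {"test_acc": 89.73, "params_m": 72.0, "size_mb": 274.5}
hybrid = {"test_acc": 80.61, "params_m": 6.0, "size_mb": 22.9}

acc_diff_points = standalone["test_acc"] - hybrid["test_acc"]
relative_gain = acc_diff_points / hybrid["test_acc"] * 100
param_ratio = standalone["params_m"] / hybrid["params_m"]
size_ratio = standalone["size_mb"] / hybrid["size_mb"]

print(f"Accuracy difference: {acc_diff_points:.2f} percentage points "
      f"({relative_gain:.1f}% relative)")
print(f"Parameter ratio: {param_ratio:.1f}x, size ratio: {size_ratio:.1f}x")
```

By these numbers, the standalone model buys roughly nine percentage points of test accuracy at about twelve times the parameter count and storage.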

---

### Common Training Characteristics

**Dataset:**
- **Total Classes**: 20 Philippine Medicinal Plants
- **Image Size**: 224×224×3 (RGB)
- **Train/Val/Test Split**: Stratified split ensuring balanced distribution
- **Training Steps per Epoch**: 65 steps (consistent across all models)

**Data Augmentation (Applied to All Models):**
- Random horizontal and vertical flips
- Random rotation (±15 degrees)
- Random zoom (±10%)
- Random brightness adjustment (±10%)
- Random contrast adjustment (±10%)
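
One way such a pipeline could be expressed is with Keras preprocessing layers, as sketched below. This assumes TensorFlow 2.9 or later and is not taken from the notebook: the names `AUGMENTATION`, `rotation_factor`, and `build_augmentation`, and the mapping of each bullet to a specific layer, are illustrative guesses. TensorFlow is imported inside the function so the configuration can be inspected without it installed.

```python
# Hypothetical configuration mirroring the augmentation list above.
AUGMENTATION = {
    "flip_mode": "horizontal_and_vertical",
    "rotation_degrees": 15,
    "zoom": 0.10,
    "brightness": 0.10,
    "contrast": 0.10,
}


def rotation_factor(degrees):
    """Convert degrees to the fraction of a full turn that RandomRotation expects."""
    return degrees / 360.0


def build_augmentation(cfg=AUGMENTATION):
    """Build a Sequential augmentation model (assumes TensorFlow >= 2.9)."""
    import tensorflow as tf  # imported lazily; assumed to be installed

    layers = tf.keras.layers
    return tf.keras.Sequential([
        layers.RandomFlip(cfg["flip_mode"]),
        layers.RandomRotation(rotation_factor(cfg["rotation_degrees"])),
        layers.RandomZoom(cfg["zoom"]),
        layers.RandomBrightness(cfg["brightness"]),  # default range is [0, 255]
        layers.RandomContrast(cfg["contrast"]),
    ])


print(f"Rotation factor for 15 degrees: {rotation_factor(15):.4f}")
```

These layers are active only when called with `training=True`. If the real pipeline works this way, validation and test images pass through unchanged, which would be consistent with validation accuracy exceeding training accuracy.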

**Optimization Strategy:**
- **Transfer Learning**: Pre-trained ImageNet weights used for MobileNet, VGG16, ResNet, ResNet50
- **Frozen Base**: Base model weights frozen, only classifier/top layers trainable
- **Learning Rate**: 0.0001 for most models (0.001 for ResNet and ZFNet standalone)
- **Learning Rate Scheduling**: ReduceLROnPlateau callback reduces LR by factor of 0.5 when validation loss plateaus

**Regularization:**
- Data augmentation provides strong regularization (evident from validation > training accuracy in many models)
- Early stopping prevents overfitting
- Learning rate reduction helps fine-tuning

**Hardware:**
- **Device**: NVIDIA GPU (CUDA)
- **Mixed Precision**: Enabled for ResNet50 & ViT (mixed_float16) to handle large model size

---
