## Architecture-Specific Head Designs (v4) - Regularized Dual-Pooling

### Design Principles

1. **Dual Pooling: GAP + GMP Concatenated**
   - Captures both average patterns (GAP) and salient features (GMP)
   - Richer feature representation than single pooling
   - Doubles feature dimension after concatenation

2. **Smaller classification heads**
   - Reduces overfitting risk compared to v3
   - Forces model to learn better representations in backbone

3. **Label smoothing (0.05)**
   - Softens target labels: with K classes, the correct class gets 0.95 + 0.05/K and every other class gets 0.05/K (Keras `label_smoothing` convention)
   - Improves generalization and calibration
   - Prevents overconfident predictions

4. **AdamW optimizer with weight decay**
   - Weight decay (1e-4) applied directly to the weights instead of as an L2 penalty in the loss
   - Often generalizes better than Adam with an L2 penalty
   - Decouples weight decay from the adaptive gradient update

5. **Differential learning rates**
   - Classification head: 1e-4 (higher LR for new layers)
   - Backbone: 1e-5 (lower LR to preserve pretrained features)
   - Prevents catastrophic forgetting during fine-tuning
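
The label-smoothing arithmetic from principle 3 can be checked by hand. The sketch below assumes the Keras convention, in which the smoothing mass is spread uniformly over all classes (`y * (1 - eps) + eps / K`), so the correct class ends up slightly above 0.95. The helper name `smooth_labels` and the 5-class example are illustrative and not taken from the project.

```python
def smooth_labels(one_hot, eps=0.05):
    """Apply uniform label smoothing to a one-hot target vector."""
    num_classes = len(one_hot)
    # Every class receives eps / K; the true class additionally keeps 1 - eps.
    return [y * (1.0 - eps) + eps / num_classes for y in one_hot]


# Hypothetical 5-class problem with class index 2 as the true label.
target = smooth_labels([0, 0, 1, 0, 0], eps=0.05)
print([round(p, 4) for p in target])
print("sum:", round(sum(target), 6))
```

With five classes the true class receives 0.96 and each other class 0.01, and the smoothed vector still sums to 1.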

---

### Head Architectures (v4)

**EfficientNet Family (B0, B2, B3):**
```
Dual Pooling (GAP + GMP concatenated)
    ↓
BatchNormalization
    ↓
Dense(256, activation='swish')
    ↓
Dropout(0.3)
    ↓
Dense(128, activation='swish')
    ↓
Dropout(0.2)
    ↓
Dense(num_classes, activation='softmax')
```
- **Regularization:** BatchNorm + progressive dropout (0.3 → 0.2)
- **Size:** 256→128 (smaller than v3's 512→256)

**ResNet50:**
```
Dual Pooling (GAP + GMP concatenated)
    ↓
BatchNormalization
    ↓
Dense(256, activation='relu')
    ↓
Dropout(0.4)
    ↓
Dense(num_classes, activation='softmax')
```
- **Regularization:** Single hidden layer with higher dropout (0.4)
- **Size:** 256 (reduced from v3's 1024→512)

**VGG16:**
```
Dual Pooling (GAP + GMP concatenated)
    ↓
BatchNormalization
    ↓
Dense(256, activation='relu')
    ↓
Dropout(0.4)
    ↓
Dense(num_classes, activation='softmax')
```
- **Regularization:** Same as ResNet50 (single layer + high dropout)
- **Size:** 256 (reduced from v3's 512→256)

**MobileNet Family (V2, V3-Large):**
```
Dual Pooling (GAP + GMP concatenated)
    ↓
Dense(128, activation='relu')
    ↓
Dropout(0.25)
    ↓
Dense(num_classes, activation='softmax')
```
- **Regularization:** Single hidden layer, moderate dropout
- **Size:** 128 (compact to match lightweight backbone)
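
Every head above starts with the same dual-pooling step. As a rough illustration of what that step computes, the sketch below pools a tiny feature map in plain Python so it runs without TensorFlow. The function name `dual_pool` and the toy values are assumptions for illustration; in Keras the equivalent would be concatenating the outputs of `GlobalAveragePooling2D` and `GlobalMaxPooling2D`.

```python
def dual_pool(feature_map):
    """Concatenate global average pooling and global max pooling.

    feature_map: nested lists shaped (height, width, channels).
    Returns a flat list of length 2 * channels: GAP values, then GMP values.
    """
    # Flatten the two spatial dimensions into one list of per-pixel channel vectors.
    pixels = [pixel for row in feature_map for pixel in row]
    # Transpose so that each entry holds all spatial values of one channel.
    per_channel = list(zip(*pixels))
    gap = [sum(values) / len(values) for values in per_channel]
    gmp = [max(values) for values in per_channel]
    return gap + gmp


# Toy 2x2 feature map with 2 channels.
fmap = [
    [[1.0, 0.0], [3.0, 4.0]],
    [[5.0, 0.0], [7.0, 8.0]],
]
pooled = dual_pool(fmap)
print(pooled)  # 2 channels in, 4 features out
```

The output has twice as many entries as the feature map has channels, which is why the first dense layer in each head sees a doubled input dimension.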

---

### Key Differences from Experiment 3 (v3)

| Aspect | Experiment 3 (v3) | Experiment 4 (v4) |
|--------|------------------|------------------|
| **Pooling** | Single GAP | Dual GAP + GMP |
| **Head Size** | Larger (512→256, 1024→512) | Smaller (256→128, 256 only) |
| **Label Smoothing** | None (0.0) | 0.05 |
| **Optimizer** | Adam | AdamW (weight decay 1e-4) |
| **Learning Rates** | Single LR (1e-3, 1e-4) | Differential (head: 1e-4, backbone: 1e-5) |
| **Unfreezing** | Architecture-specific % | Uniform 50% for all |

**Hypothesis:** Stronger regularization (smaller heads, label smoothing, weight decay) will:
- ✅ Reduce overfitting
- ✅ Improve generalization (higher test accuracy)
- ✅ Increase consistency (lower std deviation)
- ✅ Better calibrated predictions

---

## Training Strategy (3-Way Split with Validation Monitoring)

### Data Split (Subject-Independent)
- **Training set (~55% of subjects)**: Used for model learning
- **Validation set (~15% of subjects)**: Used for early stopping and learning rate scheduling
- **Test set (~30% of subjects)**: Used ONLY for final evaluation (never seen during training)

✅ **No Data Leakage:** All samples from a subject stay in the same group (train/val/test)
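
A subject-independent split can be sketched as follows. The notebook delegates splitting to `train_experiment_4`, whose internals are not shown, so the helper names `split_subjects` and `assign_samples` and the rounding behaviour here are assumptions. Only the `(label, image, subject)` sample layout comes from the notebook.

```python
import random


def split_subjects(subjects, train_frac=0.55, val_frac=0.15, seed=42):
    """Split subject IDs into disjoint train/val/test sets."""
    ids = sorted(set(subjects))        # deterministic starting order
    random.Random(seed).shuffle(ids)   # reproducible shuffle
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    train = set(ids[:n_train])
    val = set(ids[n_train:n_train + n_val])
    test = set(ids[n_train + n_val:])  # remainder, roughly 30%
    return train, val, test


def assign_samples(dataset, train, val, test):
    """Route each (label, image, subject) sample to the split owning its subject."""
    splits = {"train": [], "val": [], "test": []}
    for sample in dataset:
        subject = sample[2]
        if subject in train:
            splits["train"].append(sample)
        elif subject in val:
            splits["val"].append(sample)
        elif subject in test:
            splits["test"].append(sample)
    return splits


# Hypothetical subject IDs; images are replaced by None placeholders.
subject_ids = [f"V{i}" for i in range(1, 71)]
toy_dataset = [("squat", None, s) for s in subject_ids for _ in range(2)]
train_ids, val_ids, test_ids = split_subjects(subject_ids)
toy_splits = assign_samples(toy_dataset, train_ids, val_ids, test_ids)
print(len(train_ids), len(val_ids), len(test_ids))
```

Because the split is drawn over subject IDs and not over individual samples, no subject can contribute images to more than one group.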

### Phase 1: Frozen Backbone (up to 30 epochs)
- All backbone layers frozen
- Train only dual-pooling classification head
- Learning rate: 0.0001 (1e-4)
- Optimizer: **AdamW** with weight decay 1e-4
- **Callbacks:** EarlyStopping (patience=10), ReduceLROnPlateau, ModelCheckpoint
- **Monitoring:** Validation loss (stops when validation stops improving)
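
The patience rule behind early stopping can be sketched as a small loop. This is a simplified model of patience-based stopping on validation loss, not the Keras callback itself; `stopping_epoch` and the loss values are made up for illustration.

```python
def stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once `patience` consecutive epochs fail to improve
    on the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # The epoch budget ran out before patience was exhausted.
    return len(val_losses), best_epoch


# Hypothetical curve: improves for 3 epochs, then plateaus.
losses = [1.0, 0.8, 0.7] + [0.75] * 20
print(stopping_epoch(losses, patience=10))
```

This is why the phase lengths are stated as "up to" 30 and 20 epochs: a run that plateaus early ends well before the budget is used.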

### Phase 2: Progressive Unfreezing (up to 20 epochs)
- Unfreeze top 50% of backbone layers (uniform for all architectures)
- Learning rate: 0.00001 (1e-5), a 10× reduction from Phase 1
- Optimizer: **AdamW** with weight decay 1e-4
- **Same callbacks** with validation monitoring
- Differential LRs are approximated by lowering the global LR for this phase, not by per-layer rates
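
Unfreezing the top 50% of a backbone amounts to flipping the `trainable` flag on the second half of its layer list. The sketch below uses a stand-in `Layer` class so it runs without TensorFlow; with a real Keras model the same loop would run over the backbone's `layers` list. How the project counts layers, and whether it treats BatchNormalization layers specially, is not stated, so those details are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """Stand-in for a Keras layer; only the fields this sketch needs."""
    name: str
    trainable: bool = False


def unfreeze_top_fraction(layers, fraction=0.5):
    """Mark the last `fraction` of layers trainable and freeze the rest."""
    n_unfreeze = int(len(layers) * fraction)
    cutoff = len(layers) - n_unfreeze
    for index, layer in enumerate(layers):
        # "Top" means closest to the output, i.e. the end of the list.
        layer.trainable = index >= cutoff
    return n_unfreeze


backbone_layers = [Layer(f"block_{i}") for i in range(8)]
count = unfreeze_top_fraction(backbone_layers, fraction=0.5)
print(count, [layer.trainable for layer in backbone_layers])
```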

### Data Augmentation (Basic - Same as Exp 3)
- Horizontal flip: `True`
- Translation: ±15% (height + width)
- **No rotation/zoom/brightness** (keep it simple)

### Regularization Techniques Applied
1. **Dual pooling** (GAP + GMP) - Richer features
2. **Smaller heads** - Less overfitting
3. **Label smoothing (0.05)** - Better generalization
4. **AdamW optimizer** - Weight decay regularization
5. **Differential LRs** - Preserve pretrained features
6. **Progressive dropout** - Higher → lower through layers
7. **BatchNormalization** - Stable training
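
To make the AdamW item concrete, the sketch below runs one scalar update two ways: Adam with an L2 penalty folded into the gradient, and an AdamW-style step where the decay is applied outside the adaptive update. The function `adam_step` is illustrative only, and scaling the decay by the learning rate is one common convention; libraries differ on that detail.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999,
              eps=1e-8, l2=0.0, weight_decay=0.0):
    """One Adam update for a single scalar parameter.

    l2 > 0           -> Adam + L2: the penalty is folded into the gradient.
    weight_decay > 0 -> AdamW-style: the weight shrinks outside the adaptive step.
    """
    g = grad + l2 * theta
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    adaptive_step = lr * m_hat / (math.sqrt(v_hat) + eps)
    decay_step = lr * weight_decay * theta  # scaled by lr (one common convention)
    return theta - adaptive_step - decay_step, m, v


# With a zero task gradient, only the regularizer moves the weight.
theta_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, l2=1e-4)
theta_wd, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=1e-4)
print(f"Adam + L2 : {theta_l2:.10f}")
print(f"AdamW     : {theta_wd:.10f}")
```

In the L2 variant, Adam's normalization rescales the penalty gradient, so the first step is close to `lr` no matter how small the coefficient is. The decoupled variant shrinks the weight by `lr * weight_decay * theta`, which keeps the regularization strength independent of the gradient statistics.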

### Evaluation Protocol
- **5 runs per backbone** (quick validation phase)
- Subject-independent 3-way split (55%/15%/30%)
- Test set evaluated ONLY at the end (after all training completes)
- Metrics: Training accuracy, validation accuracy, test accuracy
- **Success criteria compared to Experiment 3:**
  - Improved: Higher mean test accuracy
  - More stable: Lower std deviation
  - Better calibration: Reduced validation-test gap
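
With only 5 runs per backbone, the choice of standard-deviation estimator matters when judging stability. The sketch below uses made-up accuracies (not results from this experiment) to contrast the population standard deviation, which is what `np.std` computes by default, with the sample standard deviation (`ddof=1`).

```python
import statistics


def summarize_runs(test_accs):
    """Return mean, population std (ddof=0) and sample std (ddof=1)."""
    return {
        "mean": statistics.mean(test_accs),
        "pop_std": statistics.pstdev(test_accs),    # same as np.std default
        "sample_std": statistics.stdev(test_accs),  # same as np.std(ddof=1)
    }


# Hypothetical test accuracies from 5 runs of one backbone.
summary = summarize_runs([0.84, 0.86, 0.85, 0.87, 0.83])
for key, value in summary.items():
    print(f"{key:<11}: {value:.4f}")
```

For five runs the sample estimate is about 12% larger than the population estimate, so both experiments should be compared using the same estimator.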

---

## Setup: TensorFlow Configuration

This cell configures TensorFlow to suppress warnings and enable GPU memory growth.

**Key configurations:**
- Suppress TensorFlow/CUDA warnings for cleaner output
- Enable GPU memory growth (prevents out-of-memory errors)
- Set logging levels to ERROR only

In [1]:
# CRITICAL: Run this cell FIRST before any other imports
# Suppress TensorFlow warnings at the OS level before TensorFlow loads
import os
import sys
import warnings
import io

# Set environment variables BEFORE TensorFlow is imported anywhere
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # 0=all, 1=filter INFO, 2=also filter WARNING, 3=also filter ERROR
os.environ['AUTOGRAPH_VERBOSITY'] = '0'   # Disable AutoGraph conversion warnings

# Filter Python warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

# Suppress absl logging (used by TensorFlow internally)
try:
    from absl import logging as absl_logging
    absl_logging.set_verbosity(absl_logging.ERROR)
except ImportError:
    pass

# Redirect stderr temporarily to suppress remaining Python-level warnings during TF import
stderr_backup = sys.stderr
sys.stderr = io.StringIO()
try:
    # Import TensorFlow only now, after the environment variables above are set
    import tensorflow as tf
finally:
    # Restore stderr even if the import fails
    sys.stderr = stderr_backup

# Final TensorFlow logging configuration
try:
    tf.get_logger().setLevel('ERROR')
    tf.autograph.set_verbosity(0)
except Exception:
    pass

# Enable GPU memory growth
try:
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
except Exception:
    pass

print("✅ TensorFlow configured:")
print("   TF_CPP_MIN_LOG_LEVEL:", os.environ.get('TF_CPP_MIN_LOG_LEVEL'))
print("   AUTOGRAPH_VERBOSITY:", os.environ.get('AUTOGRAPH_VERBOSITY'))
print("   TensorFlow version:", tf.__version__)
print("   GPUs detected:", len(tf.config.list_physical_devices('GPU')))

✅ TensorFlow configured:
   TF_CPP_MIN_LOG_LEVEL: 3
   AUTOGRAPH_VERBOSITY: 0
   TensorFlow version: 2.10.0
   GPUs detected: 1


In [9]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import logging
import yaml
from pathlib import Path
from tqdm import tqdm

# Add project root to path
project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))

# Import project modules
from src.data import load_front_side_geis, get_subjects_identities
from src.scripts.experiment_4 import train_experiment_4
from src.utils.io_utils import load_config
from src.utils.metrics import load_backbone_results_with_config

# Configure logging
root_logger = logging.getLogger()
if root_logger.hasHandlers():
    for handler in root_logger.handlers[:]:
        root_logger.removeHandler(handler)

logging.basicConfig(level=logging.WARNING, format='%(message)s')
logger = logging.getLogger(__name__)

# Visualization settings
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print(f"✅ Project root: {project_root}")
print("✅ All modules imported successfully")

✅ Project root: d:\Graduation_Project\ai-virtual-coach
✅ All modules imported successfully


## Load Configuration

Load the shared YAML configuration for Experiment 4 and print the key training parameters.

In [7]:
# Load YAML configuration shared across Experiment 4 pipelines
CONFIG_PATH = project_root / 'config' / 'experiment_4.yaml'
config_path = CONFIG_PATH  # Preserve legacy variable name for downstream cells
config = load_config(str(CONFIG_PATH))

print(f"✅ Configuration loaded from: {CONFIG_PATH}")
print("\nKey parameters:")
print(f"  Strategy         : {config['training']['strategy']}")
print(f"  Frozen epochs    : {config['training']['frozen_epochs']}")
print(f"  Fine-tune epochs : {config['training']['fine_tune_epochs']}")
print(f"  Batch size       : {config['training']['batch_size']}")
print(f"  Label smoothing  : {config['model']['label_smoothing']}")
print(f"  Random seed      : {config['random_seed']}")

print("\nUnfreezing strategy (Phase 2):")
#for backbone, params in config['unfreezing'].items():
    #pct = params['phase2_unfreeze_percent'] * 100
    #print(f"  {backbone:<18} → {pct:>4.0f}% of layers")

✅ Configuration loaded from: d:\Graduation_Project\ai-virtual-coach\config\experiment_4.yaml

Key parameters:
  Strategy         : 2-phase-validation-monitored-regularized-dual-pooling
  Frozen epochs    : 30
  Fine-tune epochs : 20
  Batch size       : 32
  Label smoothing  : 0.05
  Random seed      : 42

Unfreezing strategy (Phase 2):


## Data Loading

Use the shared `load_front_side_geis` helper to merge both camera views with a reproducible shuffle driven by the configuration seed.

ℹ️ **New:** Experiment 4 now consumes these samples through the streaming tf.data helpers (Option B), so the heavy preprocessing happens on-the-fly with minimal RAM pressure.

In [11]:
# Load datasets using shared helper (front + side views)
front_base_folder = str(project_root / 'datasets' / 'GEIs_of_rgb_front' / 'GEIs')
side_base_folder = str(project_root / 'datasets' / 'GEIs_of_rgb_side' / 'GEIs')

print(f"Loading GEIs from:\n  Front: {front_base_folder}\n  Side : {side_base_folder}")

dataset, dataset_summary = load_front_side_geis(
    front_base_folder=front_base_folder,
    side_base_folder=side_base_folder,
    seed=config['random_seed'],
    shuffle=True,
)

print(
    f"✅ Dataset loaded: {dataset_summary['total_count']} samples "
    f"(front: {dataset_summary['front_count']}, side: {dataset_summary['side_count']})"
)
subjects = get_subjects_identities(dataset)
subject_count = len(subjects)


print(f'Total unique subjects: {subject_count}')
print(f'Subject preview: {subjects[:10]}')

if dataset:
    sample_label, sample_img, sample_subject = dataset[0]
    print("   Sample structure: (label:str, image:np.ndarray, subject:str)")
    print(f"   Sample types: {type(sample_label).__name__}, {sample_img.shape}, {type(sample_subject).__name__}")

Loading GEIs from:
  Front: d:\Graduation_Project\ai-virtual-coach\datasets\GEIs_of_rgb_front\GEIs
  Side : d:\Graduation_Project\ai-virtual-coach\datasets\GEIs_of_rgb_side\GEIs
✅ Dataset loaded: 3142 samples (front: 1574, side: 1568)
Total unique subjects: 70
Subject preview: ['V3', 'V31', 'V39', 'V4', 'V46', 'V47', 'V48', 'V5', 'V50', 'Volunteer #1']
   Sample structure: (label:str, image:np.ndarray, subject:str)
   Sample types: str, (1280, 720), str


## Single Run Training (Quick Test)

Train one run with EfficientNet-B0 to verify the pipeline works correctly.

**What happens:**
- Dataset split by subjects (55%/15%/30% train/val/test)
- Phase 1: Train frozen backbone (up to 30 epochs with early stopping)
- Phase 2: Fine-tune unfrozen top 50% (up to 20 epochs with early stopping)
- Save best model and results

**Streaming loader reminder:**
- The training call now uses the new streaming Option B helpers to keep memory usage in check.
- Re-run this single-run smoke test after pulling the changes to confirm the kernel stays stable.

**Expected duration:** ~5-10 minutes (depends on early stopping)

In [5]:
# Example: Train one run with EfficientNet-B0
backbone = 'efficientnet_b0'

print("\n" + "="*80)
print(f"Starting single-run smoke test for {backbone}")
print("="*80)

single_run_results = train_experiment_4(
    dataset=dataset,
    backbones=[backbone],
    config_path=str(config_path),
    num_runs=1
)

run_metrics = single_run_results[backbone][0]

print("\n" + "="*80)
print("Training completed!")
print("="*80)
print(f"Train accuracy: {run_metrics['train_acc']:.4f}")
print(f"Val accuracy (frozen): {run_metrics['val_acc_frozen']:.4f}")
print(f"Val accuracy (unfrozen): {run_metrics['val_acc_unfrozen']:.4f}")
print(f"Test accuracy: {run_metrics['test_acc']:.4f}")
print(f"Test loss: {run_metrics['test_loss']:.4f}")


Starting single-run smoke test for efficientnet_b0
Epoch 1/30

## Full Experiment Execution (Multiple Backbones × 5 Runs)

**What's different from Experiment 3:**
- Uses v4 architecture with dual pooling (GAP + GMP)
- Smaller classification heads (less overfitting)
- Label smoothing (0.05) for better generalization
- AdamW optimizer with weight decay (1e-4)
- Uniform 50% unfreezing for all backbones

**Expected duration:** ~1-2 hours for 5 backbones × 5 runs = 25 training runs

**Backbones tested:**
1. EfficientNet B0 - Lightweight, efficient
2. EfficientNet B2 - Balanced performance
3. ResNet50 - Deep residual architecture
4. VGG16 - Classic deep CNN
5. MobileNet V2 - Mobile-optimized

⚠️ **Note:** This cell will take significant time. Use GPU if available!

In [None]:
# Configure full benchmark sweep
BACKBONES_TO_TEST = [
    'efficientnet_b0',
    'efficientnet_b2',
    'resnet50',
    'vgg16',
    'mobilenet_v2'
]
N_RUNS = 5

print("="*80)
print("EXPERIMENT 4: Regularized Dual-Pooling Heads")
print("="*80)
print(f"Backbones: {BACKBONES_TO_TEST}")
print(f"Runs per backbone: {N_RUNS}")
print(f"Total training runs: {len(BACKBONES_TO_TEST) * N_RUNS}")
print("="*80)

exp4_run_results = train_experiment_4(
    dataset=dataset,
    backbones=BACKBONES_TO_TEST,
    config_path=str(config_path),
    num_runs=N_RUNS
)

print("\n" + "="*80)
print("✅ EXPERIMENT 4 COMPLETE")
print("="*80)
print("Results saved to: experiments/exer_recog/results/exp_04_regularized/")

## Results Analysis

Load and analyze results from all training runs.

In [None]:
# Load Experiment 4 results
results_base_dir = project_root / 'experiments' / 'exer_recog' / 'results' / 'exp_04_regularized'
exp4_results = load_backbone_results_with_config(results_base_dir=str(results_base_dir))

# Display summary statistics
print("="*80)
print("EXPERIMENT 4 RESULTS SUMMARY")
print("="*80)

for backbone in BACKBONES_TO_TEST:
    if backbone in exp4_results:
        runs = exp4_results[backbone]
        test_accs = [r['test_acc'] for r in runs]
        val_accs = [r.get('val_acc_unfrozen', r.get('val_acc', 0)) for r in runs]
        
        print(f"\n{backbone}:")
        print(f"  Mean test accuracy: {np.mean(test_accs):.4f} ± {np.std(test_accs):.4f}")
        print(f"  Best test accuracy: {np.max(test_accs):.4f}")
        print(f"  Worst test accuracy: {np.min(test_accs):.4f}")
        print(f"  Mean val accuracy: {np.mean(val_accs):.4f}")
        print(f"  Number of runs: {len(runs)}")
    else:
        print(f"\n{backbone}: No results found")

print("\n" + "="*80)

## Visualization: Experiment 4 Performance

Bar plot showing mean test accuracy with error bars for each backbone.

In [None]:
# Prepare data for visualization
backbones_list = []
mean_accs = []
std_accs = []

for backbone in BACKBONES_TO_TEST:
    if backbone in exp4_results:
        runs = exp4_results[backbone]
        test_accs = [r['test_acc'] for r in runs]
        
        backbones_list.append(backbone)
        mean_accs.append(np.mean(test_accs))
        std_accs.append(np.std(test_accs))

# Bar plot with error bars
plt.figure(figsize=(12, 6))
bars = plt.bar(range(len(backbones_list)), mean_accs, yerr=std_accs, 
               capsize=5, alpha=0.7, color='coral', edgecolor='darkred', linewidth=1.5)
plt.xticks(range(len(backbones_list)), backbones_list, rotation=45, ha='right')
plt.ylabel('Test Accuracy', fontsize=12)
plt.title('Experiment 4: Regularized Dual-Pooling Heads (Mean ± Std)', fontsize=14, fontweight='bold')
plt.ylim([0, 1])
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, (mean, std) in enumerate(zip(mean_accs, std_accs)):
    plt.text(i, mean + std + 0.02, f'{mean:.3f}', ha='center', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.show()

# Print best performer
if backbones_list:
    best_idx = np.argmax(mean_accs)
    print(f"\n🏆 Best performing backbone: {backbones_list[best_idx]}")
    print(f"   Mean test accuracy: {mean_accs[best_idx]:.4f} ± {std_accs[best_idx]:.4f}")

## Compare with Experiment 3 (Smart Heads)

Load Experiment 3 results and compare with Experiment 4 to evaluate the impact of regularization techniques.

In [None]:
# Load Experiment 3 results for comparison
exp3_base_dir = project_root / 'experiments' / 'exer_recog' / 'results' / 'exp_03_smart_heads'

try:
    exp3_results = load_backbone_results_with_config(results_base_dir=str(exp3_base_dir))
    print("✅ Experiment 3 results loaded successfully")
except Exception as e:
    print(f"⚠️ Could not load Experiment 3 results: {e}")
    exp3_results = {}

# Comparison bar plot
if exp3_results and exp4_results:
    fig, ax = plt.subplots(figsize=(14, 7))
    
    x = np.arange(len(backbones_list))
    width = 0.35
    
    # Calculate means for both experiments
    exp3_means = []
    exp4_means = []
    
    for backbone in backbones_list:
        if backbone in exp3_results:
            exp3_accs = [r['test_acc'] for r in exp3_results[backbone]]
            exp3_means.append(np.mean(exp3_accs))
        else:
            exp3_means.append(0)
        
        if backbone in exp4_results:
            exp4_accs = [r['test_acc'] for r in exp4_results[backbone]]
            exp4_means.append(np.mean(exp4_accs))
        else:
            exp4_means.append(0)
    
    bars1 = ax.bar(x - width/2, exp3_means, width, label='Exp 3: Smart Heads', 
                   alpha=0.8, color='steelblue', edgecolor='darkblue', linewidth=1.5)
    bars2 = ax.bar(x + width/2, exp4_means, width, label='Exp 4: Regularized', 
                   alpha=0.8, color='coral', edgecolor='darkred', linewidth=1.5)
    
    ax.set_xlabel('Backbone', fontsize=12)
    ax.set_ylabel('Mean Test Accuracy', fontsize=12)
    ax.set_title('Experiment 3 vs Experiment 4: Mean Test Accuracy Comparison', 
                 fontsize=14, fontweight='bold')
    ax.set_xticks(x)
    ax.set_xticklabels(backbones_list, rotation=45, ha='right')
    ax.legend(fontsize=11)
    ax.grid(axis='y', alpha=0.3)
    ax.set_ylim([0, 1])
    
    plt.tight_layout()
    plt.show()
    
    # Summary statistics
    print("\n" + "="*80)
    print("COMPARISON SUMMARY")
    print("="*80)
    
    total_improvement = 0
    num_improved = 0
    num_compared = 0  # backbones with results in both experiments

    for i, backbone in enumerate(backbones_list):
        if exp3_means[i] > 0 and exp4_means[i] > 0:
            improvement = exp4_means[i] - exp3_means[i]
            total_improvement += improvement
            num_compared += 1
            if improvement > 0:
                num_improved += 1
            print(f"{backbone}: {exp3_means[i]:.4f} → {exp4_means[i]:.4f} ({improvement:+.4f})")

    # Average only over backbones that were actually compared
    if num_compared:
        print(f"\nBackbones improved: {num_improved}/{num_compared}")
        print(f"Average improvement: {total_improvement/num_compared:+.4f}")
    else:
        print("\nNo backbone has results in both experiments")
else:
    print("⚠️ Cannot compare - missing results from one or both experiments")

## Statistical Comparison: Variance Reduction

In [None]:
# Compare standard deviations (variance reduction check)
print("\nVariance Comparison (Exp 3 vs Exp 4):")
print("="*80)
print(f"{'Backbone':<20} {'Exp 3 Std':<15} {'Exp 4 Std':<15} {'Change'}")
print("-"*80)

exp3_stds = []
exp4_stds = []

for backbone in backbones_list:
    # Each results entry is a list of per-run dicts (as in the cells above),
    # so the std is computed from the runs
    if exp3_results.get(backbone) and exp4_results.get(backbone):
        exp3_std = np.std([r['test_acc'] for r in exp3_results[backbone]])
        exp4_std = np.std([r['test_acc'] for r in exp4_results[backbone]])
        exp3_stds.append(exp3_std)
        exp4_stds.append(exp4_std)
        change = ((exp4_std - exp3_std) / exp3_std) * 100

        print(f"{backbone:<20} {exp3_std:<15.4f} {exp4_std:<15.4f} {change:+.1f}%")

# Overall statistics (only over backbones present in both experiments)
if exp3_stds:
    avg_exp3_std = np.mean(exp3_stds)
    avg_exp4_std = np.mean(exp4_stds)

    print("-"*80)
    print(f"{'Average':<20} {avg_exp3_std:<15.4f} {avg_exp4_std:<15.4f} "
          f"{((avg_exp4_std - avg_exp3_std) / avg_exp3_std) * 100:+.1f}%")

## Box Plot: Distribution Comparison

In [None]:
# Box plot comparing distributions
# squeeze=False keeps `axes` two-dimensional even when only one backbone is plotted
fig, axes = plt.subplots(1, len(backbones_list), figsize=(16, 5), sharey=True, squeeze=False)

for i, backbone in enumerate(backbones_list):
    ax = axes[0, i]

    data_to_plot = []
    labels = []

    # Results are lists of per-run dicts, so collect the test accuracies from the runs
    if exp3_results.get(backbone):
        data_to_plot.append([r['test_acc'] for r in exp3_results[backbone]])
        labels.append('Exp 3')

    data_to_plot.append([r['test_acc'] for r in exp4_results[backbone]])
    labels.append('Exp 4')

    bp = ax.boxplot(data_to_plot, labels=labels, patch_artist=True)
    
    # Color boxes
    colors = ['steelblue', 'coral']
    for patch, color in zip(bp['boxes'], colors[-len(data_to_plot):]):
        patch.set_facecolor(color)
        patch.set_alpha(0.7)
    
    ax.set_title(backbone, fontsize=10)
    ax.grid(axis='y', alpha=0.3)
    
    if i == 0:
        ax.set_ylabel('Test Accuracy')

fig.suptitle('Distribution Comparison: Experiment 3 vs Experiment 4', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

## Key Findings Summary

In [None]:
# Generate comprehensive summary report
print("\n" + "="*80)
print("EXPERIMENT 4 FINAL SUMMARY: Regularized Dual-Pooling Heads")
print("="*80)

if exp4_results:
    # Best performing backbone
    best_backbone = None
    best_mean = 0
    best_std = 0
    
    for backbone in BACKBONES_TO_TEST:
        if backbone in exp4_results:
            runs = exp4_results[backbone]
            test_accs = [r['test_acc'] for r in runs]
            mean_acc = np.mean(test_accs)
            
            if mean_acc > best_mean:
                best_mean = mean_acc
                best_std = np.std(test_accs)
                best_backbone = backbone
    
    print(f"\n1. Best Performing Backbone: {best_backbone}")
    print(f"   Mean test accuracy: {best_mean:.4f} ± {best_std:.4f}")
    
    # Comparison with Experiment 3
    if exp3_results:
        print(f"\n2. Comparison with Experiment 3:")
        
        improvements = []
        for backbone in backbones_list:
            if backbone in exp3_results and backbone in exp4_results:
                exp3_accs = [r['test_acc'] for r in exp3_results[backbone]]
                exp4_accs = [r['test_acc'] for r in exp4_results[backbone]]
                improvements.append(np.mean(exp4_accs) - np.mean(exp3_accs))
        
        if improvements:
            avg_improvement = np.mean(improvements)
            num_improved = sum(1 for imp in improvements if imp > 0)
            
            print(f"   Average accuracy change: {avg_improvement:+.4f}")
            print(f"   Backbones improved: {num_improved}/{len(improvements)}")
            
            # Variance comparison
            exp3_stds = []
            exp4_stds = []
            
            for backbone in backbones_list:
                if backbone in exp3_results and backbone in exp4_results:
                    exp3_accs = [r['test_acc'] for r in exp3_results[backbone]]
                    exp4_accs = [r['test_acc'] for r in exp4_results[backbone]]
                    exp3_stds.append(np.std(exp3_accs))
                    exp4_stds.append(np.std(exp4_accs))
            
            if exp3_stds and exp4_stds:
                avg_exp3_std = np.mean(exp3_stds)
                avg_exp4_std = np.mean(exp4_stds)
                std_reduction = ((avg_exp4_std - avg_exp3_std) / avg_exp3_std) * 100
                
                print(f"   Average std change: {std_reduction:+.1f}%")
    
    print(f"\n3. Regularization Techniques Applied:")
    print(f"   ✓ Dual pooling (GAP + GMP concatenated)")
    print(f"   ✓ Smaller classification heads (reduced capacity)")
    print(f"   ✓ Label smoothing ({config['model']['label_smoothing']})")
    print(f"   ✓ AdamW optimizer (weight decay: {config['training']['weight_decay']})")
    print(f"   ✓ Differential learning rates (head: {config['training']['initial_lr']}, backbone: {config['training']['fine_tune_lr']})")
    print(f"   ✓ Progressive dropout (higher → lower through layers)")
    print(f"   ✓ BatchNormalization for stable training")
    
    # Success criteria evaluation
    print(f"\n4. Success Criteria Evaluation:")
    
    if best_mean > 0.88 and best_std < 0.02:
        print("   ✅ EXCELLENT SUCCESS: Mean > 88% and std < 2%")
        print("      → Strong regularization achieved both high accuracy and stability")
        print("      → Ready for production deployment")
    elif best_mean > 0.86 and best_std < 0.025:
        print("   ✅ GOOD SUCCESS: Mean > 86% and std < 2.5%")
        print("      → Regularization techniques are effective")
        print("      → Consider scaling to more runs for best performers")
    elif best_mean > 0.84:
        print("   ✅ MINIMUM SUCCESS: Mean > 84%")
        print("      → Regularization shows promise but may need refinement")
    else:
        print("   ⚠️ BELOW TARGET: Mean ≤ 84%")
        print("      → Regularization may be too strong (underfitting)")
        print("      → Consider relaxing some constraints")

print("\n" + "="*80)
print("Results saved to: experiments/exer_recog/results/exp_04_regularized/")
print("="*80)

---

## Experiment 4 Complete! 🎉

### Results Summary
All results saved to: `experiments/exer_recog/results/exp_04_regularized/`

**Generated files:**
- Individual backbone folders with run-specific results (`results.yaml`)
- Model checkpoints (`best_model.keras`)
- Training history and metrics

### Key Questions to Answer

**1. Did regularization improve performance over Experiment 3?**
- Compare mean test accuracy: Exp4 vs Exp3
- Check if dual pooling + smaller heads + label smoothing helped

**2. Is the model more stable/consistent?**
- Compare std deviation: Exp4 vs Exp3
- Lower std = more reliable performance across runs

**3. Which regularization technique contributed most?**
- Dual pooling for richer features?
- Label smoothing for calibration?
- Weight decay for parameter control?
- Smaller heads for reduced overfitting?

### Regularization Impact Analysis

**If Mean Accuracy Improved AND Std Decreased:**
- ✅ Regularization is effective
- ✅ Model generalizes better
- ✅ More reliable for deployment
- → **Success!** Consider this architecture for production

**If Mean Accuracy Same BUT Std Decreased:**
- ✅ More consistent predictions
- ✅ Reduced overfitting
- → **Partial success** - stability improved without accuracy loss

**If Mean Accuracy Decreased:**
- ❌ Over-regularized (underfitting)
- → Consider relaxing constraints:
  - Increase head size
  - Reduce dropout rates
  - Lower weight decay
  - Reduce label smoothing

### Next Steps

**If Excellent Success (mean > 88%, std < 2%):**
- 🚀 Deploy best backbone for production
- 📊 Analyze confusion matrix for failure modes
- 🎯 Consider ensemble of top performers

**If Good Success (mean > 86%, std < 2.5%):**
- 📈 Scale to 30 runs for statistical confidence
- 🔬 Ablation study: test each regularization technique individually
- 🎨 Try additional augmentation

**If Below Expectations:**
- 🔄 Revert to Experiment 3 architecture
- 🧪 Ablation study: which regularization hurt performance?
- 🎛️ Hyperparameter tuning for weight decay, label smoothing

### Design Philosophy Validated?

This experiment tests whether **strong regularization improves both accuracy and stability**:
- ✅ Dual pooling (GAP + GMP)
- ✅ Smaller heads (256→128 vs 512→256)
- ✅ Label smoothing (0.05)
- ✅ AdamW with weight decay (1e-4)
- ✅ Differential learning rates
- ✅ Progressive dropout
- ✅ BatchNormalization

**Results will confirm:**
1. Does dual pooling capture more informative features?
2. Do smaller heads prevent overfitting?
3. Does label smoothing improve calibration?
4. Is AdamW superior to Adam for this task?

---

**Comparison with Previous Experiments:**
- **Exp 1 (Baseline):** Universal heads, simple training
- **Exp 3 (Smart Heads):** Architecture-specific heads, optimized sizes
- **Exp 4 (Regularized):** Dual pooling, aggressive regularization, AdamW

The progression shows increasingly sophisticated architectures and training strategies! 🚀