# Experiment 2: Progressive Training with Optimized Architectures

**Goal:** Improve upon baseline results using 3-stage progressive unfreezing with architecture-specific custom heads.

## Experiment Design

### Key Differences from Baseline
1. **Architecture-specific heads:** Custom classification heads tailored to each backbone family
2. **Progressive unfreezing:** 3 stages instead of 2 phases
3. **More granular fine-tuning:** Gradual unfreezing of backbone layers

### Training Strategy (3-Stage Progressive Unfreezing)

**Stage 1: Frozen Backbone (10 epochs)**
- All backbone layers frozen
- Train only custom classification head
- Learning rate: 0.001
- Optimizer: Adam

**Stage 2: Partial Unfreezing (10 epochs)**
- Unfreeze last N layers (architecture-dependent)
  - EfficientNet: Last 20 layers
  - ResNet50: Last 10 layers  
  - VGG16: Last 4 layers
  - MobileNet: Last 15 layers
- Learning rate: 0.0001 (10× reduction)
- Fine-tune high-level features

**Stage 3: Full Fine-tuning (10 epochs)**
- Unfreeze entire backbone
- Learning rate: 0.00001 (100× reduction from Stage 1)
- Full end-to-end training
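
The three stages above can be written down as a small schedule. The sketch below is illustrative only and is not the code inside `train_experiment_4`; the names `Stage`, `build_schedule`, and `apply_stage` are hypothetical. It assumes only that the backbone exposes a `layers` list whose items have a `trainable` attribute (as Keras models do), so it is shown here with stand-in objects and runs without TensorFlow.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    epochs: int
    learning_rate: float
    unfreeze_last: Optional[int]  # 0 = backbone frozen, None = unfreeze everything


# Stage 2 "last N layers" values, copied from the list above.
STAGE2_LAYERS = {"efficientnet": 20, "resnet50": 10, "vgg16": 4, "mobilenet": 15}


def build_schedule(family: str) -> List[Stage]:
    """Return the 3-stage schedule for one backbone family."""
    return [
        Stage("frozen", 10, 1e-3, 0),
        Stage("partial", 10, 1e-4, STAGE2_LAYERS[family]),
        Stage("full", 10, 1e-5, None),
    ]


def apply_stage(backbone, stage: Stage) -> int:
    """Set `trainable` on every backbone layer; return how many are trainable."""
    layers = backbone.layers
    n = len(layers) if stage.unfreeze_last is None else min(stage.unfreeze_last, len(layers))
    cutoff = len(layers) - n
    for i, layer in enumerate(layers):
        layer.trainable = i >= cutoff  # only the last n layers are trainable
    return n


# Stand-in for a 50-layer backbone (hypothetical size, for illustration only).
fake_backbone = SimpleNamespace(layers=[SimpleNamespace(trainable=True) for _ in range(50)])
for stage in build_schedule("resnet50"):
    n_trainable = apply_stage(fake_backbone, stage)
    print(f"{stage.name}: lr={stage.learning_rate}, trainable backbone layers={n_trainable}")
```

With a real Keras model, the model has to be recompiled after changing `trainable` flags for the change to take effect, which is also the natural point to set the new learning rate.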

### Architecture-Specific Custom Heads

**EfficientNet Family (B0, B2, B3):**
```
GlobalAveragePooling2D
    ↓
Dense(256, activation='relu')
    ↓
Dropout(0.5)
    ↓
Dense(num_classes, activation='softmax')
```

**ResNet50:**
```
GlobalAveragePooling2D
    ↓
Dense(512, activation='relu')
    ↓
Dropout(0.4)
    ↓
Dense(num_classes, activation='softmax')
```

**VGG16:**
```
Flatten
    ↓
Dense(1024, activation='relu')
    ↓
Dropout(0.5)
    ↓
Dense(512, activation='relu')
    ↓
Dropout(0.5)
    ↓
Dense(num_classes, activation='softmax')
```

**MobileNet Family (V2, V3-Large):**
```
GlobalAveragePooling2D
    ↓
Dense(128, activation='relu')
    ↓
Dropout(0.3)
    ↓
Dense(num_classes, activation='softmax')
```
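
The four head diagrams above differ only in pooling type, layer widths, and dropout rates, so they can be expressed as one table plus one builder. This is a minimal sketch under that assumption, not the project's actual implementation; `HEAD_SPECS`, `head_spec_for`, and `build_head` are hypothetical names. TensorFlow is imported lazily inside `build_head`, so the lookup can be used without it.

```python
# family prefix -> (pooling, [(dense_units, dropout_rate), ...]), taken from the diagrams above
HEAD_SPECS = {
    "efficientnet": ("gap", [(256, 0.5)]),
    "resnet": ("gap", [(512, 0.4)]),
    "vgg": ("flatten", [(1024, 0.5), (512, 0.5)]),
    "mobilenet": ("gap", [(128, 0.3)]),
}


def head_spec_for(backbone_name: str):
    """Look up the head spec by backbone family prefix, e.g. 'efficientnet_b2'."""
    for family, spec in HEAD_SPECS.items():
        if backbone_name.startswith(family):
            return spec
    raise KeyError(f"No head spec for backbone {backbone_name!r}")


def build_head(features, backbone_name: str, num_classes: int):
    """Attach the matching head to a Keras feature tensor (requires TensorFlow)."""
    from tensorflow.keras import layers  # lazy import: only needed when building

    pooling, blocks = head_spec_for(backbone_name)
    if pooling == "gap":
        x = layers.GlobalAveragePooling2D()(features)
    else:
        x = layers.Flatten()(features)
    for units, rate in blocks:
        x = layers.Dense(units, activation="relu")(x)
        x = layers.Dropout(rate)(x)
    return layers.Dense(num_classes, activation="softmax")(x)


print(head_spec_for("mobilenet_v3_large"))
```

Keeping the specs as data makes it easy to check that every backbone in a sweep has a head defined before any training starts.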

### Hypothesis
Progressive unfreezing with architecture-specific heads should:
- ✅ Prevent catastrophic forgetting
- ✅ Allow better feature adaptation
- ✅ Achieve higher test accuracy than baseline
- ✅ Show more stable training curves

---

In [None]:
# CRITICAL: Run this cell FIRST before any other imports
# Suppress TensorFlow warnings at the OS level before TensorFlow loads
import os
import sys
import warnings
import io

# Set environment variables BEFORE TensorFlow is imported anywhere
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # 0=all, 1=filter INFO, 2=filter WARNING, 3=errors only
os.environ['AUTOGRAPH_VERBOSITY'] = '0'   # Disable AutoGraph conversion warnings

# Filter Python warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

# Suppress absl logging (used by TensorFlow internally)
try:
    from absl import logging as absl_logging
    absl_logging.set_verbosity(absl_logging.ERROR)
except ImportError:
    pass

# Redirect stderr temporarily to suppress any remaining warnings during TF import
stderr_backup = sys.stderr
sys.stderr = io.StringIO()
try:
    # Import TensorFlow only now, after the environment variables are set
    import tensorflow as tf
finally:
    # Restore stderr even if the import fails
    sys.stderr = stderr_backup

# Final TensorFlow logging configuration
try:
    tf.get_logger().setLevel('ERROR')
    tf.autograph.set_verbosity(0)
except Exception:
    pass

# Enable GPU memory growth
try:
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
except Exception:
    pass

print("✅ TensorFlow imported with all warnings suppressed")
print("   TF_CPP_MIN_LOG_LEVEL:", os.environ.get('TF_CPP_MIN_LOG_LEVEL'))
print("   AUTOGRAPH_VERBOSITY:", os.environ.get('AUTOGRAPH_VERBOSITY'))
print("   TensorFlow version:", tf.__version__)
print("   GPUs detected:", len(tf.config.list_physical_devices('GPU')))

## Setup: TensorFlow Configuration

Configure TensorFlow environment before any heavy imports.

**Critical settings:**
- GPU memory growth enabled (prevents OOM errors during long training)
- All TensorFlow warnings suppressed
- Logging level set to ERROR only
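
`TF_CPP_MIN_LOG_LEVEL` only affects TensorFlow's native logging if it is set before TensorFlow is first imported in the process. The following guard is a small sketch of how a notebook could detect that it is too late; `set_tf_log_level` is a hypothetical helper, not part of TensorFlow or of this project.

```python
import os
import sys


def set_tf_log_level(level: str = "3") -> bool:
    """Set TF_CPP_MIN_LOG_LEVEL; return False if TensorFlow was already imported."""
    too_late = "tensorflow" in sys.modules
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = level
    return not too_late


if not set_tf_log_level("3"):
    print("TensorFlow was already imported; restart the kernel for the setting to apply.")
```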

In [None]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random
import logging
from tqdm import tqdm
from pathlib import Path
import sys

# Add project root to path
project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))

# Project modules
from src.data import load_front_side_geis
from src.scripts.experiment_4 import train_experiment_4
from src.utils.io_utils import load_config

logger = logging.getLogger(__name__)
print("✅ All modules imported successfully from refactored structure")

## Import Refactored Modules

Import modular components from the refactored codebase.

**Key changes vs the original implementation:**
- Shared dataset loading through `load_front_side_geis`
- Centralized training via `train_experiment_4` (no custom loops here)
- Minimal notebook-side state (all logging, results, and summaries handled by pipelines)

**Memory management is still critical** for this experiment due to:
- Longer training (3 stages vs 2 phases)
- Multiple runs back-to-back
- Large models (some >100M parameters)

## Load Datasets (Front + Side Views)

In [None]:
# Load configuration from YAML file
CONFIG_PATH = project_root / 'config' / 'experiment_4.yaml'
config = load_config(str(CONFIG_PATH))

print(f"✅ Configuration loaded from: {CONFIG_PATH}")
print(f"   Strategy: {config['training']['strategy']}")
print(f"   Batch size: {config['training']['batch_size']}")
print(f"   Test ratio: {config['dataset']['test_ratio']}")
print(f"   Number of runs: 10 (hardcoded in notebook)")

# Folder paths - using the datasets folder in the project
front_base_folder = str(project_root / 'datasets/GEIs_of_rgb_front/GEIs')
side_base_folder = str(project_root / 'datasets/GEIs_of_rgb_side/GEIs')

# Load both datasets using shared helper
dataset, dataset_summary = load_front_side_geis(
    front_base_folder=front_base_folder,
    side_base_folder=side_base_folder,
    seed=config['random_seed'],
    shuffle=True
)

# Summary
print(f"\nMerged dataset size: {dataset_summary['total_count']} (front: {dataset_summary['front_count']}, side: {dataset_summary['side_count']})")

if dataset:
    sample_label, sample_img, sample_subject = dataset[0]
    print(f"Sample tuple structure: (label:str, image:np.ndarray[H,W], subject:str) -> {type(sample_label).__name__} {sample_img.shape} {type(sample_subject).__name__}")

# Get unique labels
all_labels = [item[0] for item in dataset]
unique_labels = sorted(set(all_labels))
print(f"Number of classes: {len(unique_labels)}")

## Data Loading

Load and merge GEI datasets from both camera views.

**Same data as Experiment 1** to ensure fair comparison:
- Front-view GEIs: Multiple angles of exercises
- Side-view GEIs: Complementary perspective
- Total: ~1000+ samples across 15 exercise classes
- Subject-independent splits (no volunteer appears in both the train and test sets)
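
A subject-independent split can be sketched as follows, using the `(label, image, subject)` tuple layout printed by the loading cell. This is an illustration of the idea, not the project's split code: `split_by_subject` is a hypothetical helper, its default `test_ratio` and `seed` are placeholders (the real values come from the YAML config), and the ratio is applied to subjects rather than to samples.

```python
import random
from collections import defaultdict


def split_by_subject(dataset, test_ratio=0.2, seed=42):
    """Split (label, image, subject) tuples so that no subject is in both sets."""
    by_subject = defaultdict(list)
    for item in dataset:
        by_subject[item[2]].append(item)

    subjects = sorted(by_subject)          # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_ratio))

    test = [it for s in subjects[:n_test] for it in by_subject[s]]
    train = [it for s in subjects[n_test:] for it in by_subject[s]]
    return train, test


# Toy data: 10 subjects with 3 samples each (images replaced by None).
toy = [(f"exercise_{i % 3}", None, f"subject_{s}") for s in range(10) for i in range(3)]
train, test = split_by_subject(toy)
print(len(train), len(test))
```

Splitting on subjects instead of samples is what keeps the test accuracy from being inflated by a model that has already seen the same person during training.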

## Train Progressive Models (7 Backbones × 10 Runs)

In [None]:
logger.info("\n" + "#" * 80)
logger.info("EXPERIMENT 2: MULTI-BACKBONE PROGRESSIVE TRAINING (delegates to Experiment 4 pipeline)")
logger.info("#" * 80)

BACKBONES_TO_TEST = [
    'efficientnet_b0',
    'efficientnet_b2',
    'efficientnet_b3',
    'resnet50',
    'vgg16',
    'mobilenet_v2',
    'mobilenet_v3_large',
]

N_RUNS = 10  # Number of runs per backbone

experiment_results = train_experiment_4(
    dataset=dataset,
    backbones=BACKBONES_TO_TEST,
    config_path=str(CONFIG_PATH),
    num_runs=N_RUNS,
)

comparison_rows = []
for backbone, runs in experiment_results.items():
    if not runs:
        logger.warning(f"No successful runs recorded for {backbone}")
        continue
    test_accs = [run['test_acc'] for run in runs]
    comparison_rows.append({
        'backbone': backbone,
        'mean_test_acc': np.mean(test_accs),
        'std_test_acc': np.std(test_accs),
        'successful_runs': len(runs),
    })

if comparison_rows:
    comparison_df = pd.DataFrame(comparison_rows).set_index('backbone')
    comparison_df = comparison_df.sort_values('mean_test_acc', ascending=False)
    display(comparison_df)

    os.makedirs(config['results']['base_dir'], exist_ok=True)
    csv_path = Path(config['results']['base_dir']) / 'backbone_comparison_exp2.csv'
    comparison_df.to_csv(csv_path)
    logger.info(f"\n✓ Results saved to: {csv_path}")
else:
    logger.error("No results to summarize; check earlier logs for failures.")

## Progressive Training Execution

**3-Stage training pipeline (implemented inside `src/scripts/experiment_4.py`, the module imported above):**

1. **Stage 1 (Frozen):** Train custom head only
   - Fast convergence
   - Adapts head to dataset
   
2. **Stage 2 (Partial Unfreeze):** Fine-tune top layers
   - Adjusts high-level features
   - Maintains pretrained low-level features
   
3. **Stage 3 (Full Unfreeze):** End-to-end fine-tuning
   - Final refinement
   - All layers adapt to exercise recognition

**What this cell does now:**
- Delegates the full sweep to `train_experiment_4`
- Reuses the shared dataset already loaded in memory
- Saves summaries/CSV outputs through the central pipeline

**Expected duration:** ~15-20 minutes per backbone (3 stages × 10 epochs each)

⚠️ **Note:** Full experiment takes several hours (7 backbones × 10 runs = 70 complete 3-stage training pipelines)

**Progress tracking:**
- Logging comes from the training pipeline (per-run + per-backbone)
- Final DataFrame summarises mean/stdev accuracy per backbone
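
One detail worth keeping in mind when reading the summary: `np.std` computes the population standard deviation by default (`ddof=0`), which is slightly smaller than the sample standard deviation (`ddof=1`) for a small number of runs. The sketch below shows the difference with the standard library; the accuracy values are made up purely for illustration and are not results from this experiment.

```python
import statistics

test_accs = [0.90, 0.92, 0.88, 0.91, 0.89]  # made-up values, for illustration only

population_std = statistics.pstdev(test_accs)  # same definition as np.std(test_accs)
sample_std = statistics.stdev(test_accs)       # same definition as np.std(test_accs, ddof=1)

print(f"mean={statistics.mean(test_accs):.3f}")
print(f"population std={population_std:.4f}, sample std={sample_std:.4f}")
```

Whichever definition is used, it should match the one used for the baseline so that the stability comparison is like for like.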

---

## Experiment 2 Complete! 🎉

### Results Summary
All artifacts saved to: `experiments/exer_recog/results/exp_04_regularized/` (as configured in `experiment_4.yaml`).

**Each backbone folder contains:**
- `summary.csv` - Statistics across runs
- `all_results.json` - Complete metrics for all runs
- `plots/` - Confusion matrices and learning curves
- `models/` - Saved model weights (optional)

### Next Steps
1. **Compare with Baseline:** Open `99_comparison.ipynb` to see side-by-side accuracy comparisons and statistical tests.
2. **Analyze Individual Runs:** Inspect saved plots/confusion matrices per backbone.
3. **Model Deployment:** Use best performing backbone for production.

### Expected Improvements Over Baseline
- ✅ Higher mean test accuracy
- ✅ Lower standard deviation (more stable)
- ✅ Better handling of difficult exercise classes
- ✅ Smoother learning curves

**Hypothesis validation:** Check if progressive unfreezing + custom heads outperform standard transfer learning!