# Model 5beta: Gradient Boosting Models

This notebook demonstrates the Gradient Boosting models (XGBoost, LightGBM, CatBoost) for deepfake video detection.

## Model Overview

The three models are trained on the handcrafted features from Stage 2/4. All of them are tree-based ensemble methods that can capture non-linear patterns.
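
To make "tree-based ensemble" concrete, the following is a minimal from-scratch sketch of gradient boosting on log loss with depth-1 trees (stumps). It is illustrative only: the data is a synthetic one-feature toy set, and `fit_stump`, `boost`, and `predict_proba` are hypothetical helper names, not part of this project or of XGBoost, LightGBM, or CatBoost, which add regularization, deeper trees, and many optimizations on top of this idea.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals, learning_rate):
    """Fit a depth-1 tree: one threshold, one constant value per side."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must leave samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    # Shrink the leaf values so each stump makes only a small correction
    return threshold, learning_rate * left_mean, learning_rate * right_mean


def stump_value(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=50, learning_rate=0.5):
    """Each round fits a stump to the negative gradient of log loss (y - p)."""
    stumps = []
    scores = [0.0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        stump = fit_stump(xs, residuals, learning_rate)
        stumps.append(stump)
        scores = [s + stump_value(stump, x) for s, x in zip(scores, xs)]
    return stumps


def predict_proba(stumps, x):
    return sigmoid(sum(stump_value(stump, x) for stump in stumps))


# Synthetic toy data: the positive class is a band, which no single
# threshold (and no linear model on x) can separate.
xs = [i / 10 for i in range(-30, 31)]
ys = [1 if abs(x) < 1.0 else 0 for x in xs]

stumps = boost(xs, ys)
accuracy = sum(
    (predict_proba(stumps, x) > 0.5) == bool(y) for x, y in zip(xs, ys)
) / len(xs)
print(f"training accuracy on the toy data: {accuracy:.2f}")
```

No single stump can isolate the band, but the sum of many small corrections can, which is the sense in which boosted trees capture non-linear patterns.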

## Training Instructions

To train these models, run:

```bash
sbatch scripts/slurm_jobs/slurm_stage5beta.sh
```

Or use Python:

```python
from src.scripts.train_gradient_boosting import train_gradient_boosting

results = train_gradient_boosting(
    project_root=".",
    scaled_metadata_path="data/stage3/scaled_metadata.parquet",
    features_stage2_path="data/stage2/features_metadata.parquet",
    features_stage4_path=None,
    output_dir="data/stage5",
    n_splits=5,
    models=["xgboost", "lightgbm", "catboost"],
    delete_existing=False
)
```

## Architecture Deep-Dive

Architecture details for the **gradient_boosting** model family are in the model implementation under `lib/training/`.


## Hyperparameter Configuration

Hyperparameters are configured in `lib/training/grid_search.py`.


## MLOps Integration

### Experiment Tracking with MLflow

This model integrates with MLflow for comprehensive experiment tracking:

```python
from lib.mlops.mlflow_tracker import create_mlflow_tracker

# MLflow automatically tracks:
# - Hyperparameters (learning_rate, batch_size, etc.)
# - Metrics (train_loss, val_acc, test_f1, etc.)
# - Model artifacts (checkpoints, configs)
# - Run metadata (tags, timestamps, fold numbers)
```

**Access MLflow UI**:
```bash
mlflow ui --port 5000
# Open http://localhost:5000
```

### DuckDB Analytics

Query training results with SQL for fast analytics:

```python
from lib.utils.duckdb_analytics import DuckDBAnalytics

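analytics = DuckDBAnalytics()
model_type = "xgboost"  # or "lightgbm", "catboost"
# NOTE: the path points at a JSON file although the method is named
# register_parquet. Confirm that DuckDBAnalytics accepts JSON input, or
# export the metrics to Parquet first.
analytics.register_parquet('results', f'data/stage5/{model_type}/metrics.json')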
result = analytics.query("""
    SELECT 
        fold,
        AVG(test_f1) as avg_f1,
        STDDEV(test_f1) as std_f1
    FROM results
    GROUP BY fold
""")
```
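
A cross-fold summary can also be computed without DuckDB. The sketch below assumes a hypothetical layout of one record per fold with `fold` and `test_f1` keys; the real structure of `metrics.json` is not shown in this notebook, and the numbers are placeholders, not results. `summarize_metric` is an illustrative helper, not a project function.

```python
import json
from statistics import mean, stdev


def summarize_metric(records, metric="test_f1"):
    """Mean and sample standard deviation of one metric across folds."""
    values = [record[metric] for record in records]
    return {
        "n_folds": len(values),
        "mean": mean(values),
        # stdev needs at least two values
        "std": stdev(values) if len(values) > 1 else 0.0,
    }


# Placeholder per-fold records (hypothetical layout and values)
placeholder_json = """
[
  {"fold": 0, "test_f1": 0.5},
  {"fold": 1, "test_f1": 0.7},
  {"fold": 2, "test_f1": 0.6}
]
"""
records = json.loads(placeholder_json)
summary = summarize_metric(records)
print(f"test_f1: {summary['mean']:.3f} +/- {summary['std']:.3f} "
      f"over {summary['n_folds']} folds")
```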

### Airflow Orchestration

The pipeline is orchestrated via an Apache Airflow DAG (`airflow/dags/fvc_pipeline_dag.py`):
- **Dependency Management**: Automatic task ordering
- **Retry Logic**: Automatic retries on failure
- **Monitoring**: Web UI for pipeline status
- **Scheduling**: Cron-based scheduling support


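## Training Methodology

Cross-validation applies to the gradient boosting models directly. The regularization, optimization, and data-pipeline settings in this section describe the PyTorch models elsewhere in the pipeline and are listed for context; optimizer- and epoch-based settings such as AdamW, dropout, and gradient clipping have no counterpart in tree-based training.
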
### 5-Fold Stratified Cross-Validation

- **Purpose**: Robust performance estimates that expose overfitting to a single split
- **Stratification**: Ensures class balance in each fold
- **Evaluation**: Metrics averaged across folds with standard deviation
- **Rationale**: More reliable than single train/test split
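
The stratification step can be sketched from scratch as follows. This is an illustration of the idea rather than the project's splitter (which may well be scikit-learn's `StratifiedKFold`; that is an assumption). The function name `stratified_kfold`, the seed, and the `"real"`/`"fake"` label values are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_kfold(labels, n_splits=5, seed=42):
    """Yield (train_idx, test_idx) pairs with per-class balance in each fold."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's shuffled indices round-robin into the folds
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != k for idx in fold
        )
        yield train_idx, test_idx


# Toy labels with a 60/40 class ratio
labels = ["real"] * 30 + ["fake"] * 20
splits = list(stratified_kfold(labels, n_splits=5))

for fold, (train_idx, test_idx) in enumerate(splits):
    counts = Counter(labels[i] for i in test_idx)
    print(f"fold {fold}: test class counts = {dict(counts)}")
```

Every test fold keeps the 60/40 ratio of the full toy set, so per-fold metrics are computed on comparable class mixes.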

### Regularization Strategy

- **Weight Decay (L2)**: 1e-4 (PyTorch models)
- **Dropout**: 0.5 in classification heads (PyTorch models)
- **Early Stopping**: Patience=5 epochs (prevents overfitting)
- **Gradient Clipping**: max_norm=1.0 (prevents exploding gradients)
- **Class Weights**: Balanced sampling for imbalanced datasets
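
The early-stopping rule above can be sketched as a small counter. This is an illustrative version, assuming validation loss is the monitored quantity (the notebook does not say which metric is used); the `EarlyStopping` class and the loss values are hypothetical.

```python
class EarlyStopping:
    """Stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: the best value is at epoch 2 (0-indexed)
val_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.6]

stopper = EarlyStopping(patience=5)
stopped_epoch = None
for epoch, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stopped_epoch = epoch
        break

print(f"stopped at epoch {stopped_epoch}, best val loss {stopper.best}")
```

Note that training halts before the late improvement at the end of the list is ever seen, which is the trade-off a finite patience makes.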

### Optimization

- **Optimizer**: AdamW with betas=(0.9, 0.999)
- **Mixed Precision**: AMP (Automatic Mixed Precision) for memory efficiency
- **Gradient Accumulation**: Dynamic based on batch size (maintains effective batch size)
- **Learning Rate Schedule**: Cosine annealing with warmup (2 epochs)
- **Differential Learning Rates**: Lower LR for pretrained backbones (5e-6) vs heads (5e-4)
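
The schedule and the two learning-rate groups can be sketched as a pure function of the epoch. This is a minimal sketch under stated assumptions: the warmup is taken to be linear, the floor `min_lr` is 0, and `total_epochs=10` is a hypothetical value; only the 2 warmup epochs and the 5e-6 / 5e-4 base rates come from the list above. `lr_at_epoch` is an illustrative name, not a project function.

```python
import math


def lr_at_epoch(epoch, total_epochs, base_lr, warmup_epochs=2, min_lr=0.0):
    """Linear warmup to base_lr, then cosine annealing down to min_lr."""
    if epoch < warmup_epochs:
        # Ramp up linearly; the last warmup epoch reaches base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_EPOCHS = 10   # hypothetical
BACKBONE_LR = 5e-6  # pretrained backbone
HEAD_LR = 5e-4      # classification head

head_schedule = [lr_at_epoch(e, TOTAL_EPOCHS, HEAD_LR) for e in range(TOTAL_EPOCHS)]
backbone_schedule = [lr_at_epoch(e, TOTAL_EPOCHS, BACKBONE_LR) for e in range(TOTAL_EPOCHS)]

for epoch, (head_lr, backbone_lr) in enumerate(zip(head_schedule, backbone_schedule)):
    print(f"epoch {epoch}: head={head_lr:.2e} backbone={backbone_lr:.2e}")
```

Both groups follow the same curve shape; the backbone simply stays a factor of 100 below the head at every epoch.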

### Data Pipeline

- **Video Loading**: Frame-by-frame decoding (50x memory reduction)
- **Augmentation**: Pre-generated augmentations (reproducible, fast)
- **Scaling**: Fixed 256x256 max dimension with letterboxing
- **Frame Sampling**: Uniform sampling across video duration
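
Two of the steps above reduce to simple arithmetic and can be sketched directly. The sketch assumes that "uniform sampling" means taking the centre frame of equal-length segments and that letterboxing scales the longer side to 256 and pads the shorter side up to a 256x256 canvas; both readings, and the helper names `uniform_frame_indices` and `letterbox_size`, are assumptions rather than the project's actual code.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Pick the centre frame of `num_samples` equal-length segments."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def letterbox_size(width, height, max_dim=256):
    """Return the resized (w, h) and the total padding needed per axis."""
    longest = max(width, height)
    new_w = max(1, round(width * max_dim / longest))
    new_h = max(1, round(height * max_dim / longest))
    pad_x = max_dim - new_w  # split between left and right
    pad_y = max_dim - new_h  # split between top and bottom
    return new_w, new_h, pad_x, pad_y


indices = uniform_frame_indices(total_frames=100, num_samples=4)
print("sampled frame indices:", indices)

new_w, new_h, pad_x, pad_y = letterbox_size(1920, 1080)
print(f"1920x1080 -> {new_w}x{new_h}, padding x={pad_x}, y={pad_y}")
```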


## Design Rationale

See master pipeline notebook (`00_MASTER_PIPELINE_JOURNEY.ipynb`) for comprehensive design rationale.


In [None]:
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Video, display, HTML
import json
import joblib

# Add project root to path
project_root = Path().absolute().parent.parent
sys.path.insert(0, str(project_root))

from lib.utils.paths import load_metadata_flexible
from lib.training.metrics_utils import compute_classification_metrics

# Configuration
BASE_MODEL_DIR = project_root / "data" / "stage5"
SCALED_METADATA_PATH = project_root / "data" / "stage3" / "scaled_metadata.parquet"
FEATURES_STAGE2_PATH = project_root / "data" / "stage2" / "features_metadata.parquet"

# Available models
MODELS = ["xgboost", "lightgbm", "catboost"]

print(f"Project root: {project_root}")
print(f"Base model directory: {BASE_MODEL_DIR}")
print(f"Base model directory exists: {BASE_MODEL_DIR.exists()}")

## Check for Saved Models

In [None]:
def check_saved_models(base_dir: Path, model_names: list):
    """Check for saved gradient boosting model files."""
    available_models = {}
    
    for model_name in model_names:
        model_dir = base_dir / model_name
        
        if not model_dir.exists():
            print(f"[X] {model_name}: Model directory does not exist: {model_dir}")
            continue
        
        # Different models save in different formats
        if model_name == "xgboost":
            model_file = model_dir / "model.json"
        elif model_name == "lightgbm":
            model_file = model_dir / "model.joblib"
        elif model_name == "catboost":
            model_file = model_dir / "model.cbm"
        else:
            model_file = model_dir / "model.joblib"  # Default
        
        metrics_file = model_dir / "metrics.json"
        
        if model_file.exists():
            print(f"[OK] {model_name}: Found {model_file.name}")
            if metrics_file.exists():
                print(f"  [OK] {model_name}: Found metrics.json")
            available_models[model_name] = model_file
        else:
            print(f"[X] {model_name}: No model file found in {model_dir}")
    
    return len(available_models) > 0, available_models

models_available, model_files = check_saved_models(BASE_MODEL_DIR, MODELS)

if not models_available:
    print("\n[WARN]  No trained models found. Please train the models first using the instructions above.")
    print("Expected locations:")
    for model_name in MODELS:
        print(f"  - {BASE_MODEL_DIR / model_name}")

## Load Models

In [None]:
loaded_models = {}

if models_available:
    # Try to import gradient boosting libraries
    try:
        import xgboost as xgb
        XGBOOST_AVAILABLE = True
    except ImportError:
        XGBOOST_AVAILABLE = False
        print("[WARN]  XGBoost not available")
    
    try:
        import lightgbm  # must be importable so joblib can unpickle the model
        LIGHTGBM_AVAILABLE = True
    except ImportError:
        LIGHTGBM_AVAILABLE = False
        print("[WARN]  LightGBM not available")
    
    try:
        import catboost as cb
        CATBOOST_AVAILABLE = True
    except ImportError:
        CATBOOST_AVAILABLE = False
        print("[WARN]  CatBoost not available")
    
    # Load each available model
    for model_name, model_file in model_files.items():
        try:
            if model_name == "xgboost" and XGBOOST_AVAILABLE:
                model = xgb.Booster()
                model.load_model(str(model_file))
                loaded_models[model_name] = model
                print(f"[OK] Loaded {model_name} model")
            elif model_name == "lightgbm" and LIGHTGBM_AVAILABLE:
                model = joblib.load(model_file)
                loaded_models[model_name] = model
                print(f"[OK] Loaded {model_name} model")
            elif model_name == "catboost" and CATBOOST_AVAILABLE:
                model = cb.CatBoostClassifier()
                model.load_model(str(model_file))
                loaded_models[model_name] = model
                print(f"[OK] Loaded {model_name} model")
        except Exception as e:
            print(f"[WARN] Error loading {model_name}: {e}")
    
    # Load metadata
    scaled_df = load_metadata_flexible(str(SCALED_METADATA_PATH))
    features_df = load_metadata_flexible(str(FEATURES_STAGE2_PATH))
    
    if scaled_df is not None and features_df is not None:
        print(f"\n[OK] Loaded {scaled_df.height} videos from scaled metadata")
        print(f"[OK] Loaded {features_df.height} feature rows from Stage 2")
        
        # Get sample videos
        sample_videos = scaled_df.head(5).to_pandas()
        print("\nSample videos for demonstration:")
        print(sample_videos[["video_path", "label"]].to_string())
    else:
        print("[WARN]  Could not load metadata files")
else:
    print("[WARN]  Skipping model loading - no trained models found")

## Display Sample Videos

In [None]:
import cv2  # OpenCV; decodes the first frame of each sample video

if models_available and 'sample_videos' in locals():
    fig, axes = plt.subplots(1, min(3, len(sample_videos)), figsize=(15, 5))
    if len(sample_videos) == 1:
        axes = [axes]
    
    for idx, (ax, row) in enumerate(zip(axes, sample_videos.iterrows())):
        video_path = project_root / row[1]["video_path"]
        label = row[1]["label"]
        
        try:
            cap = cv2.VideoCapture(str(video_path))
            if cap.isOpened():
                ret, frame = cap.read()
                if ret:
                    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                    ax.imshow(frame_rgb)
                    ax.set_title(f"{Path(video_path).name}\nLabel: {label}", fontsize=10)
                cap.release()
        except Exception as e:
            ax.text(0.5, 0.5, f"Video: {Path(video_path).name}\nLabel: {label}", 
                    ha='center', va='center', fontsize=12, transform=ax.transAxes)
        ax.axis('off')
    
    plt.tight_layout()
    plt.show()
    
    print("\nNote: To play videos in the notebook, use:")
    print("display(Video('path/to/video.mp4', embed=True, width=640, height=480))")

## Model Performance Summary

In [None]:
if models_available:
    all_metrics = {}
    
    # Load metrics for each model
    for model_name in model_files.keys():
        model_dir = BASE_MODEL_DIR / model_name
        metrics_file = model_dir / "metrics.json"
        
        if metrics_file.exists():
            with open(metrics_file, 'r') as f:
                metrics = json.load(f)
            all_metrics[model_name] = metrics
            
            print(f"\n{model_name.upper()} Performance Metrics:")
            print("=" * 50)
            for key, value in metrics.items():
                if isinstance(value, (int, float)):
                    print(f"  {key}: {value:.4f}")
    
    # Create comparison visualization if multiple models available
    if len(all_metrics) > 1:
        fig, ax = plt.subplots(figsize=(12, 6))
        
        metric_names = ['test_accuracy', 'test_precision', 'test_recall', 'test_f1']
        x = np.arange(len(metric_names))
        width = 0.25
        
        for idx, (model_name, metrics) in enumerate(all_metrics.items()):
            values = [metrics.get(m, 0) for m in metric_names]
            offset = (idx - len(all_metrics) / 2) * width + width / 2
            ax.bar(x + offset, values, width, label=model_name.upper())
        
        ax.set_ylabel('Score')
        ax.set_title('Gradient Boosting Models Performance Comparison')
        ax.set_xticks(x)
        ax.set_xticklabels(metric_names)
        ax.legend()
        ax.set_ylim(0, 1)
        
        plt.tight_layout()
        plt.show()
    elif len(all_metrics) == 1:
        # Single model visualization
        model_name = list(all_metrics.keys())[0]
        metrics = all_metrics[model_name]
        
        fig, ax = plt.subplots(figsize=(8, 6))
        metric_names = ['test_accuracy', 'test_precision', 'test_recall', 'test_f1']
        metric_values = [metrics.get(m, 0) for m in metric_names]
        
        bars = ax.bar(metric_names, metric_values, color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'])
        ax.set_ylabel('Score')
        ax.set_title(f'{model_name.upper()} Model Performance')
        ax.set_ylim(0, 1)
        
        for bar, val in zip(bars, metric_values):
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2., height,
                   f'{val:.3f}', ha='center', va='bottom')
        
        plt.tight_layout()
        plt.show()

## Training Plots

The following plots were generated during model training and provide insights into model performance across cross-validation folds and hyperparameter search.

In [None]:
# Display training plots if available
from IPython.display import Image, display

# Expected plot files and their display titles
plot_files = {
    "cv_fold_comparison.png": "Cross-Validation Fold Comparison",
    "hyperparameter_search.png": "Hyperparameter Search Results",
    "learning_curves.png": "Learning Curves (if available)",
    "roc_curve.png": "ROC Curve (if available)",
    "precision_recall_curve.png": "Precision-Recall Curve (if available)",
    "confusion_matrix.png": "Confusion Matrix (if available)"
}

# Assumption: each model writes its plots to data/stage5/<model_name>/plots
for model_name in MODELS:
    plots_dir = BASE_MODEL_DIR / model_name / "plots"

    if not plots_dir.exists():
        print(f"[WARN]  Plots directory not found: {plots_dir}")
        print("Plots are generated during training. Please ensure training has completed successfully.")
        continue

    print(f"[OK] Found plots directory: {plots_dir}")

    plots_found = []
    for plot_file, plot_name in plot_files.items():
        plot_path = plots_dir / plot_file
        if plot_path.exists():
            plots_found.append((plot_path, plot_name))
            print(f"  [OK] Found: {plot_file}")

    if plots_found:
        print(f"\n[PLOT] Displaying {len(plots_found)} training plot(s) for {model_name}:\n")
        for plot_path, plot_name in plots_found:
            print(f"\n### {plot_name}")
            display(Image(str(plot_path), width=800))
    else:
        print("[WARN]  No plot files found in plots directory.")
        print(f"Expected plots directory: {plots_dir}")

## Model Architecture Summary

**Gradient Boosting Models** are tree-based ensemble methods:

### XGBoost
- Extreme Gradient Boosting with regularization
- Uses handcrafted features from Stage 2/4
- Handles missing values, supports parallel processing
- Saved in JSON format (`model.json`)

### LightGBM
- Microsoft's gradient boosting framework
- Uses leaf-wise tree growth for efficiency
- Often trains faster than XGBoost on large datasets
- Saved with joblib (`model.joblib`)

### CatBoost
- Yandex's gradient boosting framework
- Handles categorical features automatically
- Uses ordered boosting to reduce overfitting
- Saved in CatBoost's `.cbm` format (`model.cbm`)

**Advantages:**
- Can capture non-linear patterns and feature interactions
- Good performance on tabular data (handcrafted features)
- Fast inference time
- Interpretable feature importance

**Limitations:**
- Require feature engineering (handcrafted features)
- No temporal modeling (each video is reduced to a fixed-length feature vector, so frame order is lost)
- May overfit on small datasets

**Difference from 5f-5j:** These models use handcrafted features directly, while 5f-5j use deep learning features extracted from pretrained video models.