# Model 5b: Support Vector Machine (SVM)

This notebook demonstrates the SVM model for deepfake video detection.

## Model Overview

The Support Vector Machine (SVM) is a baseline feature-based model that operates on handcrafted features extracted from videos (Stage 2 features). It uses kernel methods to find a maximum-margin decision boundary between real and fake videos.

## Training Instructions

To train this model, run:

```bash
sbatch src/scripts/slurm_stage5b.sh
```

Or use Python:

```python
from lib.training.pipeline import stage5_train_models

results = stage5_train_models(
    project_root=".",
    scaled_metadata_path="data/stage3/scaled_metadata.parquet",
    features_stage2_path="data/stage2/features_metadata.parquet",
    features_stage4_path=None,
    model_types=["svm"],
    n_splits=5,
    num_frames=1000,
    output_dir="data/stage5",
    use_tracking=True,
    use_mlflow=True
)
```

## Architecture Deep-Dive

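The **svm** model is a kernel-based classifier trained on the Stage 2 handcrafted features. It has no layered architecture: the model is defined by its kernel (linear or RBF) and the regularization parameter `C`.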

See model implementation in `lib/training/` for specific architecture code.
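
As a rough illustration of what "kernel-based classifier on handcrafted features" means in practice, here is a minimal sketch using scikit-learn. It is not the pipeline's implementation: the synthetic feature matrix, the `StandardScaler` step, and the 0 = real / 1 = fake label encoding are all assumptions made for the example.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Stage 2 feature table (hypothetical shape and values).
rng = np.random.default_rng(0)
n_videos, n_features = 200, 16
X = rng.normal(size=(n_videos, n_features))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # assumed encoding: 0 = real, 1 = fake

# Standardize the features, then fit an RBF-kernel SVM (scaling is an assumption here).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, y)

# decision_function returns a signed margin score; its sign determines the predicted class.
scores = clf.decision_function(X[:5])
preds = clf.predict(X[:5])
print(scores.round(3), preds)
```

A positive score maps to class 1 and a negative score to class 0, which is why no separate classification head is needed.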


## Hyperparameter Configuration

**Training Hyperparameters** (from `lib/training/grid_search.py`):

- **C**: [0.1, 1.0, 10.0] (grid search)
- **kernel**: ['linear', 'rbf'] (grid search)

**Rationale**:
- **Reduced Search Space**: The grid is kept small (3 values of C × 2 kernels) for efficiency
- **Grid Search**: Performed on a 20% sample; the best parameters are then used for full training
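
The search-on-a-sample, train-on-everything strategy described above can be sketched as follows. This is an illustration only; the actual logic lives in `lib/training/grid_search.py`. The synthetic data, the stratified sampling, the 3-fold inner CV, and the F1 scoring are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the feature table (hypothetical).
X, y = make_classification(n_samples=500, n_features=16, random_state=0)

# Draw a stratified 20% sample to keep the search cheap.
X_sample, _, y_sample, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0
)

# Same grid as listed above: 3 values of C x 2 kernels = 6 candidates.
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=3, scoring="f1")
search.fit(X_sample, y_sample)

# Retrain on the full data using the best parameters found on the sample.
final_model = SVC(**search.best_params_).fit(X, y)
print(search.best_params_)
```

Searching on a sample trades some selection accuracy for speed, which matters because kernel SVM training time grows quickly with the number of samples.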


## MLOps Integration

### Experiment Tracking with MLflow

This model integrates with MLflow for comprehensive experiment tracking:

```python
from lib.mlops.mlflow_tracker import create_mlflow_tracker

# MLflow automatically tracks:
# - Hyperparameters (learning_rate, batch_size, etc.)
# - Metrics (train_loss, val_acc, test_f1, etc.)
# - Model artifacts (checkpoints, configs)
# - Run metadata (tags, timestamps, fold numbers)
```

**Access MLflow UI**:
```bash
mlflow ui --port 5000
# Open http://localhost:5000
```

### DuckDB Analytics

Query training results with SQL for fast analytics:

```python
from lib.utils.duckdb_analytics import DuckDBAnalytics

analytics = DuckDBAnalytics()
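model_type = "svm"  # fill in the path template for this notebook's model
# NOTE: the path ends in .json although the method name suggests Parquet input;
# adjust it to wherever your run stores its metrics.
analytics.register_parquet('results', f'data/stage5/{model_type}/metrics.json')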
result = analytics.query("""
    -- Aggregate across folds; grouping by fold would return one row per fold
    SELECT
        AVG(test_f1) AS avg_f1,
        STDDEV(test_f1) AS std_f1
    FROM results
""")
```

### Airflow Orchestration

The pipeline is orchestrated via an Apache Airflow DAG (`airflow/dags/fvc_pipeline_dag.py`):
- **Dependency Management**: Automatic task ordering
- **Retry Logic**: Automatic retries on failure
- **Monitoring**: Web UI for pipeline status
- **Scheduling**: Cron-based scheduling support


## Training Methodology

### 5-Fold Stratified Cross-Validation

- **Purpose**: More robust performance estimates; helps detect overfitting to a single split
- **Stratification**: Ensures class balance in each fold
- **Evaluation**: Metrics averaged across folds with standard deviation
- **Rationale**: More reliable than single train/test split
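
A minimal sketch of 5-fold stratified cross-validation with scikit-learn, reporting the mean and standard deviation across folds. The synthetic, imbalanced dataset and the fixed `C` and kernel are assumptions for illustration, not the pipeline's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Imbalanced synthetic data (roughly 70% / 30%) standing in for real vs fake videos.
X, y = make_classification(
    n_samples=300, n_features=16, weights=[0.7, 0.3], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One F1 score per fold.
fold_f1 = cross_val_score(SVC(kernel="rbf", C=1.0), X, y, cv=cv, scoring="f1")

# Stratification keeps the positive-class rate nearly identical in every test fold.
fold_pos_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]

print(f"F1: {fold_f1.mean():.3f} ± {fold_f1.std():.3f}")
print("Positive rate per fold:", [round(r, 3) for r in fold_pos_rates])
```

The spread of `fold_f1` is what makes the estimate more informative than a single train/test split: a large standard deviation signals that performance depends heavily on which videos land in the test set.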

### Regularization Strategy

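The settings below apply to the PyTorch models in the pipeline. For the SVM, regularization strength is controlled by `C` (smaller values mean stronger regularization).

- **Weight Decay (L2)**: 1e-4 (PyTorch models)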
- **Dropout**: 0.5 in classification heads (PyTorch models)
- **Early Stopping**: Patience=5 epochs (prevents overfitting)
- **Gradient Clipping**: max_norm=1.0 (prevents exploding gradients)
- **Class Weights**: Balanced sampling for imbalanced datasets

### Optimization

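These settings apply to the gradient-trained PyTorch models in the pipeline; the SVM is fit by its own solver and does not use them.

- **Optimizer**: AdamW with betas=(0.9, 0.999)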
- **Mixed Precision**: AMP (Automatic Mixed Precision) for memory efficiency
- **Gradient Accumulation**: Dynamic based on batch size (maintains effective batch size)
- **Learning Rate Schedule**: Cosine annealing with warmup (2 epochs)
- **Differential Learning Rates**: Lower LR for pretrained backbones (5e-6) vs heads (5e-4)

### Data Pipeline

- **Video Loading**: Frame-by-frame decoding (50x memory reduction)
- **Augmentation**: Pre-generated augmentations (reproducible, fast)
- **Scaling**: Fixed 256x256 max dimension with letterboxing
- **Frame Sampling**: Uniform sampling across video duration
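
The frame sampling and scaling steps can be sketched as below. Both helpers (`uniform_frame_indices`, `letterbox_size`) are hypothetical names written for this example, and the letterbox interpretation (scale the longer side to 256, pad the shorter side) is an assumption about what "256x256 max dimension with letterboxing" means in the pipeline.

```python
import numpy as np


def uniform_frame_indices(total_frames: int, num_frames: int) -> np.ndarray:
    """Pick up to num_frames indices spread evenly over [0, total_frames - 1]."""
    if total_frames <= 0 or num_frames <= 0:
        return np.array([], dtype=int)
    n = min(num_frames, total_frames)  # never request more frames than exist
    return np.linspace(0, total_frames - 1, num=n).round().astype(int)


def letterbox_size(width: int, height: int, max_dim: int = 256):
    """Scale so the longer side equals max_dim; return (resized size, padding)."""
    scale = max_dim / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Padding needed to fill the max_dim x max_dim canvas.
    return (new_w, new_h), (max_dim - new_w, max_dim - new_h)


indices = uniform_frame_indices(total_frames=100, num_frames=5)
resized, padding = letterbox_size(1920, 1080)
print(indices, resized, padding)
```

For a 1920x1080 video this yields a 256x144 resized frame with 112 pixels of vertical padding, so the aspect ratio is preserved rather than stretched.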


## Design Rationale

See master pipeline notebook (`00_MASTER_PIPELINE_JOURNEY.ipynb`) for comprehensive design rationale.


In [None]:
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Video, display, HTML
import json

# Add project root to path
project_root = Path().absolute().parent.parent
sys.path.insert(0, str(project_root))

from lib.training.model_factory import create_model
from lib.mlops.config import RunConfig
from lib.utils.paths import load_metadata_flexible
from lib.training.metrics_utils import compute_classification_metrics

# Configuration
MODEL_TYPE = "svm"
MODEL_DIR = project_root / "data" / "stage5" / MODEL_TYPE
SCALED_METADATA_PATH = project_root / "data" / "stage3" / "scaled_metadata.parquet"
FEATURES_STAGE2_PATH = project_root / "data" / "stage2" / "features_metadata.parquet"

print(f"Project root: {project_root}")
print(f"Model directory: {MODEL_DIR}")
print(f"Model directory exists: {MODEL_DIR.exists()}")

## Check for Saved Models

In [None]:
def check_saved_models(model_dir: Path):
    """Check for saved model files in the model directory."""
    if not model_dir.exists():
        print(f"❌ Model directory does not exist: {model_dir}")
        return False, []
    
    # Check for fold directories
    fold_dirs = sorted([d for d in model_dir.iterdir() if d.is_dir() and d.name.startswith("fold_")])
    
    if not fold_dirs:
        print(f"❌ No fold directories found in {model_dir}")
        return False, []
    
    print(f"✓ Found {len(fold_dirs)} fold(s)")
    
    models_found = []
    for fold_dir in fold_dirs:
        # Check for joblib model file (sklearn models)
        model_file = fold_dir / "model.joblib"
        if model_file.exists():
            models_found.append((fold_dir.name, model_file))
            print(f"  ✓ {fold_dir.name}: Found model.joblib")
        else:
            print(f"  ❌ {fold_dir.name}: No model.joblib found")
    
    return len(models_found) > 0, models_found

models_available, model_files = check_saved_models(MODEL_DIR)

if not models_available:
    print("\n⚠️  No trained models found. Please train the model first using the instructions above.")
    print(f"Expected location: {MODEL_DIR}")

## Load Model and Make Predictions

In [None]:
if models_available:
    import joblib
    
    # Load the first available model
    fold_name, model_path = model_files[0]
    print(f"Loading model from: {model_path}")
    
    model = joblib.load(model_path)
    print(f"✓ Model loaded successfully from {fold_name}")
    print(f"Model type: {type(model)}")
    
    # Load metadata
    scaled_df = load_metadata_flexible(str(SCALED_METADATA_PATH))
    features_df = load_metadata_flexible(str(FEATURES_STAGE2_PATH))
    
    if scaled_df is not None and features_df is not None:
        print(f"\n✓ Loaded {scaled_df.height} videos from scaled metadata")
        print(f"✓ Loaded {features_df.height} feature rows from Stage 2")
        
        # Get sample videos
        sample_videos = scaled_df.head(5).to_pandas()
        print(f"\nSample videos for demonstration:")
        print(sample_videos[["video_path", "label"]].to_string())
    else:
        print("⚠️  Could not load metadata files")
else:
    print("⚠️  Skipping model loading - no trained models found")

## Display Sample Videos and Predictions

In [None]:
if models_available and 'model' in locals() and 'sample_videos' in locals():
    # Create a simple visualization
    fig, axes = plt.subplots(1, min(3, len(sample_videos)), figsize=(15, 5))
    if len(sample_videos) == 1:
        axes = [axes]
    
    for ax, (_, row) in zip(axes, sample_videos.iterrows()):
        video_path = project_root / row["video_path"]
        label = row["label"]
        
        # Try to read the first frame as a thumbnail
        frame_rgb = None
        try:
            import cv2
            cap = cv2.VideoCapture(str(video_path))
            try:
                ret, frame = cap.read()
                if ret:
                    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            finally:
                cap.release()
        except Exception as e:
            print(f"⚠️  Could not read thumbnail for {video_path.name}: {e}")
        
        if frame_rgb is not None:
            ax.imshow(frame_rgb)
            ax.set_title(f"{video_path.name}\nLabel: {label}", fontsize=10)
        else:
            # Fall back to a text placeholder when the video is missing or unreadable
            ax.text(0.5, 0.5, f"Video: {video_path.name}\nLabel: {label}",
                    ha='center', va='center', fontsize=12, transform=ax.transAxes)
        ax.axis('off')
    
    plt.tight_layout()
    plt.show()
    
    print("\nNote: To play videos in the notebook, use:")
    print("display(Video('path/to/video.mp4', embed=True, width=640, height=480))")

## Model Performance Summary

In [None]:
if models_available:
    # Try to load metrics from fold directory
    fold_dir = model_files[0][0]
    metrics_file = MODEL_DIR / fold_dir / "metrics.json"
    
    if metrics_file.exists():
        with open(metrics_file, 'r') as f:
            metrics = json.load(f)
        
        print("Model Performance Metrics:")
        print("=" * 50)
        for key, value in metrics.items():
            if isinstance(value, (int, float)):
                print(f"{key}: {value:.4f}")
            else:
                print(f"{key}: {value}")
        
        # Create visualization if metrics available
        if 'accuracy' in metrics or 'f1_score' in metrics:
            fig, ax = plt.subplots(figsize=(8, 6))
            metric_names = ['accuracy', 'precision', 'recall', 'f1_score']
            metric_values = [metrics.get(m, 0) for m in metric_names]
            
            bars = ax.bar(metric_names, metric_values, color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'])
            ax.set_ylabel('Score')
            ax.set_title('SVM Model Performance')
            ax.set_ylim(0, 1)
            
            # Add value labels on bars
            for bar, val in zip(bars, metric_values):
                height = bar.get_height()
                ax.text(bar.get_x() + bar.get_width()/2., height,
                       f'{val:.3f}', ha='center', va='bottom')
            
            plt.tight_layout()
            plt.show()
    else:
        print("⚠️  Metrics file not found. Model may not have been fully trained.")
        print(f"Expected: {metrics_file}")

## Model Architecture Summary

**Support Vector Machine (SVM)** is a kernel-based classifier that:
- Uses handcrafted features from Stage 2 (noise residual, DCT statistics, blur/sharpness, codec cues)
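- Supports a linear kernel and an RBF kernel, the latter for non-linear decision boundaries
- Finds the maximum-margin hyperplane separating real from fake videos
- Can output probability scores for binary classification (with scikit-learn's `SVC`, this requires `probability=True`)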

**Advantages:**
- Can capture non-linear patterns with RBF kernel
- Effective for high-dimensional feature spaces
- Good generalization with proper regularization

**Limitations:**
- Slower than logistic regression for large datasets
- Requires careful kernel and hyperparameter selection
- No temporal modeling
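
To make the linear-versus-RBF trade-off concrete, here is a small sketch on a toy dataset that no straight line can separate (two concentric circles). The dataset and the fixed `C=1.0` are assumptions chosen for illustration and say nothing about performance on the actual Stage 2 features.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: a classic case where a linear boundary fails.
X, y = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the same model with each kernel and compare held-out accuracy.
accuracy = {}
for kernel in ("linear", "rbf"):
    model = SVC(kernel=kernel, C=1.0).fit(X_train, y_train)
    accuracy[kernel] = model.score(X_test, y_test)

print(accuracy)
```

The RBF kernel separates the rings while the linear kernel cannot do much better than chance, which is why the kernel is included in the grid search rather than fixed in advance.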