# Model 5alpha: sklearn LogisticRegression

This notebook demonstrates the sklearn LogisticRegression model for deepfake video detection.

## Model Overview

This model uses sklearn's `LogisticRegression` with L1, L2, or ElasticNet regularization on handcrafted features from Stage 2/4. It is a standalone implementation, separate from the pipeline's logistic regression.
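
To make the three regularization options concrete, here is a minimal sketch of how each one can be configured in scikit-learn. The synthetic data, the `C` value, and the `l1_ratio` value are illustrative assumptions and are not taken from the project's training script.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the handcrafted feature matrix (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# The saga solver supports all three penalties; l1_ratio only applies to elasticnet.
configs = {
    "l1": dict(penalty="l1", solver="saga"),
    "l2": dict(penalty="l2", solver="saga"),
    "elasticnet": dict(penalty="elasticnet", solver="saga", l1_ratio=0.5),
}

scores = {}
for name, kwargs in configs.items():
    # Scale features first, since regularized models are sensitive to feature scale.
    clf = make_pipeline(
        StandardScaler(),
        LogisticRegression(C=0.1, max_iter=5000, **kwargs),
    )
    clf.fit(X, y)
    scores[name] = clf.score(X, y)
    n_zero = int((clf[-1].coef_ == 0).sum())
    print(f"{name}: train accuracy={scores[name]:.3f}, zeroed coefficients={n_zero}")
```

L1 and ElasticNet can drive some coefficients exactly to zero, which is why the sketch reports the zero count alongside accuracy.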

## Training Instructions

To train this model, run:

```bash
sbatch scripts/slurm_jobs/slurm_stage5alpha.sh
```

Or use Python:

```python
from src.scripts.train_sklearn_logreg import train_sklearn_logreg

results = train_sklearn_logreg(
    project_root=".",
    scaled_metadata_path="data/stage3/scaled_metadata.parquet",
    features_stage2_path="data/stage2/features_metadata.parquet",
    features_stage4_path=None,
    output_dir="data/stage5/sklearn_logreg",
    n_splits=5,
    delete_existing=False
)
```

## Architecture Deep-Dive

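**sklearn_logreg** is a linear classifier: handcrafted features are normalized with StandardScaler and passed to sklearn's `LogisticRegression`, which outputs a probability score for binary classification (real vs fake).

See the model implementation in `lib/training/` for the specific architecture code.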


## Hyperparameter Configuration

Hyperparameters (C, penalty, solver) are configured in `lib/training/grid_search.py`.
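
The contents of `grid_search.py` are not shown in this notebook, so the following is only a sketch of what a search over C, penalty, and solver might look like with scikit-learn's `GridSearchCV`. The grid values, the F1 scoring choice, and the synthetic data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the handcrafted features (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each training fold only.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000)),
])

# Hypothetical grid: a list of dicts pairs each penalty with compatible solvers.
c_values = [0.01, 0.1, 1.0]
param_grid = [
    {"clf__penalty": ["l2"], "clf__solver": ["lbfgs", "saga"], "clf__C": c_values},
    {"clf__penalty": ["l1"], "clf__solver": ["liblinear", "saga"], "clf__C": c_values},
    {
        "clf__penalty": ["elasticnet"],
        "clf__solver": ["saga"],
        "clf__l1_ratio": [0.5],
        "clf__C": c_values,
    },
]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best mean CV F1: {search.best_score_:.3f}")
```

Splitting the grid by penalty avoids invalid combinations such as L1 with `lbfgs`, which scikit-learn rejects.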


## MLOps Integration

### Experiment Tracking with MLflow

This model integrates with MLflow for comprehensive experiment tracking:

```python
from lib.mlops.mlflow_tracker import create_mlflow_tracker

# MLflow automatically tracks:
# - Hyperparameters (for this model: C, penalty, solver, etc.)
# - Metrics (train_loss, val_acc, test_f1, etc.)
# - Model artifacts (checkpoints, configs)
# - Run metadata (tags, timestamps, fold numbers)
```

**Access MLflow UI**:
```bash
mlflow ui --port 5000
# Open http://localhost:5000
```

### DuckDB Analytics

Query training results with SQL for fast analytics:

```python
from lib.utils.duckdb_analytics import DuckDBAnalytics

analytics = DuckDBAnalytics()
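# NOTE: the path below is a JSON file passed to register_parquet; verify that
# DuckDBAnalytics handles JSON, or point it at a Parquet export of the metrics.
analytics.register_parquet('results', 'data/stage5/sklearn_logreg/metrics.json')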
result = analytics.query("""
    SELECT 
        fold,
        AVG(test_f1) as avg_f1,
        STDDEV(test_f1) as std_f1
    FROM results
    GROUP BY fold
""")
```
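
As a dependency-free cross-check of the SQL aggregate, the mean and standard deviation of `test_f1` across folds can also be computed with the standard library. The record layout (`fold`, `test_f1`) and the values below are placeholders for illustration; the real layout of `metrics.json` is not shown in this notebook and may differ.

```python
import json
import statistics

# Placeholder per-fold records (not real results); the actual file layout may differ.
raw = json.dumps([
    {"fold": 0, "test_f1": 0.80},
    {"fold": 1, "test_f1": 0.82},
    {"fold": 2, "test_f1": 0.78},
    {"fold": 3, "test_f1": 0.81},
    {"fold": 4, "test_f1": 0.79},
])

records = json.loads(raw)
f1_scores = [record["test_f1"] for record in records]

mean_f1 = statistics.mean(f1_scores)
# Sample standard deviation (n - 1 denominator), like DuckDB's STDDEV.
std_f1 = statistics.stdev(f1_scores)

print(f"test_f1 over {len(f1_scores)} folds: {mean_f1:.3f} +/- {std_f1:.3f}")
```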

### Airflow Orchestration

The pipeline is orchestrated via an Apache Airflow DAG (`airflow/dags/fvc_pipeline_dag.py`):
- **Dependency Management**: Automatic task ordering
- **Retry Logic**: Automatic retries on failure
- **Monitoring**: Web UI for pipeline status
- **Scheduling**: Cron-based scheduling support


## Training Methodology

### 5-Fold Stratified Cross-Validation

- **Purpose**: Robust performance estimates that do not depend on a single lucky or unlucky split
- **Stratification**: Preserves the overall class proportions in each fold
- **Evaluation**: Metrics averaged across folds with standard deviation
- **Rationale**: More reliable than single train/test split
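
The effect of stratification can be checked directly. This sketch uses toy labels with an 80/20 class split (an assumption, as is the 0 = real, 1 = fake encoding) and shows that every test fold keeps the same positive rate.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 80 real (0) and 20 fake (1). The encoding is an assumption.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # feature values do not influence the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_pos_rates = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    pos_rate = float(y[test_idx].mean())
    fold_pos_rates.append(pos_rate)
    print(f"fold {fold}: test size={len(test_idx)}, fake fraction={pos_rate:.2f}")
```

A plain `KFold` on the same labels gives no such guarantee, and a fold could end up with few or no positive samples.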

### Regularization Strategy

- **Weight Decay (L2)**: 1e-4 (PyTorch models)
- **Dropout**: 0.5 in classification heads (PyTorch models)
- **Early Stopping**: Patience=5 epochs (prevents overfitting)
- **Gradient Clipping**: max_norm=1.0 (prevents exploding gradients)
- **Class Weights**: Balanced sampling for imbalanced datasets

### Optimization

- **Optimizer**: AdamW with betas=(0.9, 0.999)
- **Mixed Precision**: AMP (Automatic Mixed Precision) for memory efficiency
- **Gradient Accumulation**: Dynamic based on batch size (maintains effective batch size)
- **Learning Rate Schedule**: Cosine annealing with warmup (2 epochs)
- **Differential Learning Rates**: Lower LR for pretrained backbones (5e-6) vs heads (5e-4)

### Data Pipeline

- **Video Loading**: Frame-by-frame decoding (50x memory reduction)
- **Augmentation**: Pre-generated augmentations (reproducible, fast)
- **Scaling**: Fixed 256x256 max dimension with letterboxing
- **Frame Sampling**: Uniform sampling across video duration


## Design Rationale

See master pipeline notebook (`00_MASTER_PIPELINE_JOURNEY.ipynb`) for comprehensive design rationale.


In [None]:
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Video, display, HTML
import json
import joblib

# Add project root to path
# Find project root by looking for lib/ directory
import os
current_dir = Path(os.getcwd()).resolve()
project_root = current_dir

# Walk up the directory tree to find project root (look for lib/ directory)
for _ in range(10):  # Max 10 levels up
    if (project_root / "lib").exists() and (project_root / "lib" / "__init__.py").exists():
        break
    parent = project_root.parent
    if parent == project_root:  # Reached filesystem root
        # Fallback: use current directory
        project_root = current_dir
        break
    project_root = parent

sys.path.insert(0, str(project_root))

from lib.utils.paths import load_metadata_flexible
from lib.training.metrics_utils import compute_classification_metrics

# Configuration
MODEL_TYPE = "sklearn_logreg"
MODEL_DIR = project_root / "data" / "stage5" / "sklearn_logreg"
SCALED_METADATA_PATH = project_root / "data" / "stage3" / "scaled_metadata.parquet"
FEATURES_STAGE2_PATH = project_root / "data" / "stage2" / "features_metadata.parquet"

print(f"Project root: {project_root}")
print(f"Model directory: {MODEL_DIR}")
print(f"Model directory exists: {MODEL_DIR.exists()}")

## Check for Saved Models

In [None]:
def check_saved_models(model_dir: Path):
    """Check for saved sklearn model files."""
    if not model_dir.exists():
        print(f"[X] Model directory does not exist: {model_dir}")
        return False, None
    
    # sklearn_logreg saves model directly in output_dir, not in fold subdirectories
    model_file = model_dir / "model.joblib"
    scaler_file = model_dir / "scaler.joblib"
    metrics_file = model_dir / "metrics.json"
    
    if model_file.exists():
        print(f"[OK] Found model.joblib")
        if scaler_file.exists():
            print(f"[OK] Found scaler.joblib")
        if metrics_file.exists():
            print(f"[OK] Found metrics.json")
        return True, model_file
    else:
        print(f"[X] No model.joblib found in {model_dir}")
        return False, None

models_available, model_file = check_saved_models(MODEL_DIR)

if not models_available:
    print("\n[WARN]  No trained models found. Please train the model first using the instructions above.")
    print(f"Expected location: {MODEL_DIR / 'model.joblib'}")

## Load Model and Make Predictions

In [None]:
if models_available:
    print(f"Loading model from: {model_file}")
    
    model = joblib.load(model_file)
    print(f"[OK] Model loaded successfully")
    print(f"Model type: {type(model)}")
    
    # Load scaler if available
    scaler = None
    scaler_file = MODEL_DIR / "scaler.joblib"
    if scaler_file.exists():
        scaler = joblib.load(scaler_file)
        print(f"[OK] Scaler loaded")
    
    # Load metadata
    scaled_df = load_metadata_flexible(str(SCALED_METADATA_PATH))
    features_df = load_metadata_flexible(str(FEATURES_STAGE2_PATH))
    
    if scaled_df is not None and features_df is not None:
        print(f"\n[OK] Loaded {scaled_df.height} videos from scaled metadata")
        print(f"[OK] Loaded {features_df.height} feature rows from Stage 2")
        
        # Get sample videos
        sample_videos = scaled_df.head(5).to_pandas()
        print(f"\nSample videos for demonstration:")
        print(sample_videos[["video_path", "label"]].to_string())
    else:
        print("[WARN]  Could not load metadata files")
else:
    print("[WARN]  Skipping model loading - no trained models found")
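
The cell above loads the model and scaler but stops before scoring. A minimal sketch of the scoring step follows, assuming `model.joblib` holds a fitted `LogisticRegression`, `scaler.joblib` holds a `StandardScaler` fitted on the same feature columns, and class 1 means fake. It trains and saves a toy model in a temporary directory so that it runs on its own.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy training data standing in for the handcrafted features (assumption).
X_train = rng.normal(size=(200, 8))
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)

    # Save artifacts under the same file names the notebook looks for.
    scaler = StandardScaler().fit(X_train)
    model = LogisticRegression().fit(scaler.transform(X_train), y_train)
    joblib.dump(scaler, model_dir / "scaler.joblib")
    joblib.dump(model, model_dir / "model.joblib")

    # Reload them the way the notebook cell does.
    loaded_scaler = joblib.load(model_dir / "scaler.joblib")
    loaded_model = joblib.load(model_dir / "model.joblib")

# Score new samples: apply the saved scaler first, then predict probabilities.
X_new = rng.normal(size=(5, 8))
X_new_scaled = loaded_scaler.transform(X_new)
proba_fake = loaded_model.predict_proba(X_new_scaled)[:, 1]
pred = (proba_fake >= 0.5).astype(int)

for p, label in zip(proba_fake, pred):
    print(f"P(fake)={p:.3f} -> predicted label {label}")
```

The features passed at inference time must have the same columns, in the same order, as those used to fit the scaler and model.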

## Display Sample Videos and Predictions

In [None]:
import cv2  # OpenCV is needed for thumbnail extraction; it is not imported earlier

if models_available and 'model' in locals() and 'sample_videos' in locals():
    # Create a simple visualization
    fig, axes = plt.subplots(1, min(3, len(sample_videos)), figsize=(15, 5))
    if len(sample_videos) == 1:
        axes = [axes]
    
    for idx, (ax, row) in enumerate(zip(axes, sample_videos.iterrows())):
        video_path = project_root / row[1]["video_path"]
        label = row[1]["label"]
        
        # Try to read the first frame of the video to use as a thumbnail
        frame = None
        try:
            cap = cv2.VideoCapture(str(video_path))
            if cap.isOpened():
                ret, frame = cap.read()
                if not ret:
                    frame = None
            cap.release()
        except Exception:
            frame = None

        if frame is not None:
            ax.imshow(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            ax.set_title(f"{Path(video_path).name}\nLabel: {label}", fontsize=10)
        else:
            # Fall back to a text placeholder when the video cannot be opened or read
            ax.text(0.5, 0.5, f"Video: {Path(video_path).name}\nLabel: {label}",
                    ha='center', va='center', fontsize=12, transform=ax.transAxes)
        ax.axis('off')
    
    plt.tight_layout()
    plt.show()
    
    print("\nNote: To play videos in the notebook, use:")
    print("display(Video('path/to/video.mp4', embed=True, width=640, height=480))")

## Model Performance Summary

In [None]:
if models_available:
    # Note: Performance metrics are not saved to JSON files
    # Metrics are aggregated in training results returned from stage5_train_models()
    # To view metrics, check:
    # 1. Training logs in logs/stage5/
    # 2. MLflow UI (if enabled): mlflow ui
    # 3. Aggregated results returned from stage5_train_models()
    
    print("Model files available:")
    print("=" * 50)
    if MODEL_DIR.exists():
        for item in sorted(MODEL_DIR.iterdir()):
            if item.is_dir():
                print(f"  Directory: {item.name}")
                for file in sorted(item.iterdir()):
                    print(f"    - {file.name}")
            else:
                print(f"  File: {item.name}")
    else:
        print(f"[WARN]  Model directory not found: {MODEL_DIR}")

## Training Plots

The following plots were generated during model training and provide insights into model performance across cross-validation folds and hyperparameter search.

In [None]:

# Display training plots if available
from pathlib import Path
from IPython.display import Image, display, HTML

# Define MODEL_DIR if not already defined
if 'MODEL_DIR' not in globals():
    import os
    current_dir = Path(os.getcwd()).resolve()
    project_root = current_dir
    
    # Walk up the directory tree to find project root (look for lib/ directory)
    for _ in range(10):  # Max 10 levels up
        if (project_root / "lib").exists() and (project_root / "lib" / "__init__.py").exists():
            break
        parent = project_root.parent
        if parent == project_root:  # Reached filesystem root
            # Fallback: use current directory
            project_root = current_dir
            break
        project_root = parent
    MODEL_TYPE = "sklearn_logreg"
    MODEL_DIR = project_root / "data" / "stage5" / "sklearn_logreg"

# sklearn_logreg saves plots in root directory, not plots/ subdirectory
plots_dir = MODEL_DIR  # Plots are in root, not plots/ subdirectory

# Check for roc_pr_curves.png in root
roc_pr_file = MODEL_DIR / "roc_pr_curves.png"

if roc_pr_file.exists():
    print(f"[OK] Found roc_pr_curves.png: {roc_pr_file}")

# List of expected plot files
plot_files = {
    "roc_pr_curves.png": "ROC and Precision-Recall Curves"
}

plots_found = []
for plot_file, plot_name in plot_files.items():
    plot_path = plots_dir / plot_file  # resolve each expected plot by its own file name
    if plot_path.exists():
        plots_found.append((plot_path, plot_name))
        print(f"  [OK] Found: {plot_file}")

if plots_found:
    print(f"\n[PLOT] Displaying {len(plots_found)} training plot(s):\n")
    for plot_path, plot_name in plots_found:
        print(f"\n### {plot_name}")
        display(Image(str(plot_path), width=800))
else:
    print("[WARN]  No plot files found in plots directory.")
    print(f"Expected plots directory: {plots_dir}")
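
If `roc_pr_curves.png` is missing, the underlying curves can be recomputed from labels and predicted probabilities. This sketch uses synthetic labels and scores (assumptions, not project outputs) to show the scikit-learn calls involved.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)

rng = np.random.default_rng(0)

# Synthetic labels and scores; in practice y_score would come from predict_proba.
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(0.3 * y_true + rng.normal(0.4, 0.2, size=200), 0.0, 1.0)

# Points on the ROC curve (false positive rate vs true positive rate).
fpr, tpr, _ = roc_curve(y_true, y_score)
# Points on the precision-recall curve.
precision, recall, _ = precision_recall_curve(y_true, y_score)

roc_auc = roc_auc_score(y_true, y_score)
avg_precision = average_precision_score(y_true, y_score)

print(f"ROC AUC: {roc_auc:.3f}")
print(f"Average precision: {avg_precision:.3f}")
```

The `fpr`/`tpr` and `recall`/`precision` arrays can be passed straight to `plt.plot` to redraw the two panels.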

## Model Architecture Summary

**sklearn LogisticRegression** is a linear classifier that:
- Uses handcrafted features from Stage 2 (noise residual, DCT statistics, blur/sharpness, codec cues)
- Supports L1, L2, and ElasticNet regularization
- Uses StandardScaler for feature normalization
- Outputs probability scores for binary classification (real vs fake)
- Trained with grid search on hyperparameters (C, penalty, solver)

**Advantages:**
- Simple and interpretable
- Fast training and inference
- Multiple regularization options (L1/L2/ElasticNet)
- Good baseline for comparison
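
Interpretability here means the fitted coefficients can be read directly: on standardized features, a larger absolute coefficient indicates a stronger influence on the predicted log-odds. The sketch below ranks features this way; the feature names and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names, loosely modeled on the Stage 2 feature families.
feature_names = ["noise_residual_std", "dct_energy", "blur_variance", "codec_bitrate"]

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
# Only the first two features drive the synthetic label.
logits = 2.0 * X[:, 0] - 1.0 * X[:, 1]
y = (logits + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Standardize so coefficient magnitudes are comparable across features.
X_std = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_std, y)

# Rank features by absolute coefficient, largest first.
ranking = sorted(
    zip(feature_names, model.coef_[0]),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, weight in ranking:
    print(f"{name:>20s}: {weight:+.3f}")
```

The sign shows the direction of the effect: a positive coefficient pushes the prediction toward class 1.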

**Limitations:**
- Linear decision boundary (may not capture complex patterns)
- Relies on quality of handcrafted features
- No temporal modeling

**Difference from 5a:** This is a standalone sklearn implementation with more regularization options, while 5a uses the pipeline's LogisticRegressionBaseline class.