# Model 5alpha: sklearn LogisticRegression

This notebook demonstrates the sklearn LogisticRegression model for deepfake video detection.

## Model Overview

This model is a scikit-learn `LogisticRegression` classifier with L1, L2, or ElasticNet regularization. It is trained on handcrafted features from Stage 2 and, optionally, Stage 4. It is a standalone implementation, separate from the pipeline's logistic regression.
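
As a rough illustration of the three regularization options, the sketch below fits one estimator per penalty on synthetic data and counts the non-zero coefficients. The data, the `C` value, and `l1_ratio=0.5` are illustrative assumptions, not the settings used by `train_sklearn_logreg`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the handcrafted feature matrix (not project data)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
X = StandardScaler().fit_transform(X)

# One estimator per penalty; "saga" is a solver that supports all three
configs = {
    "l1": LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000),
    "l2": LogisticRegression(penalty="l2", solver="saga", C=0.1, max_iter=5000),
    "elasticnet": LogisticRegression(
        penalty="elasticnet", solver="saga", C=0.1, l1_ratio=0.5, max_iter=5000
    ),
}

nonzero_counts = {}
for name, clf in configs.items():
    clf.fit(X, y)
    nonzero_counts[name] = int(np.count_nonzero(clf.coef_))
    print(f"{name}: {nonzero_counts[name]} non-zero coefficients of {X.shape[1]}")
```

L1 and ElasticNet can drive some coefficients exactly to zero, which acts as a form of feature selection; L2 only shrinks them.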

## Training Instructions

To train this model, run:

```bash
sbatch src/scripts/slurm_stage5alpha.sh
```

Or use Python:

```python
from src.scripts.train_sklearn_logreg import train_sklearn_logreg

results = train_sklearn_logreg(
    project_root=".",
    scaled_metadata_path="data/stage3/scaled_metadata.parquet",
    features_stage2_path="data/stage2/features_metadata.parquet",
    features_stage4_path=None,
    output_dir="data/stage5/sklearn_logreg",
    n_splits=5,
    delete_existing=False
)
```

In [None]:
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Video, display, HTML
import json
import joblib

# Add project root to path (assumes this notebook sits two directories below the project root)
project_root = Path().absolute().parent.parent
sys.path.insert(0, str(project_root))

from lib.utils.paths import load_metadata_flexible
from lib.training.metrics_utils import compute_classification_metrics

# Configuration
MODEL_TYPE = "sklearn_logreg"
MODEL_DIR = project_root / "data" / "stage5" / "sklearn_logreg"
SCALED_METADATA_PATH = project_root / "data" / "stage3" / "scaled_metadata.parquet"
FEATURES_STAGE2_PATH = project_root / "data" / "stage2" / "features_metadata.parquet"

print(f"Project root: {project_root}")
print(f"Model directory: {MODEL_DIR}")
print(f"Model directory exists: {MODEL_DIR.exists()}")

## Check for Saved Models

In [None]:
def check_saved_models(model_dir: Path):
    """Check for saved sklearn model files.

    Returns (found, model_path); model_path is None when no model is found.
    """
    if not model_dir.exists():
        print(f"❌ Model directory does not exist: {model_dir}")
        return False, None
    
    # sklearn_logreg saves model directly in output_dir, not in fold subdirectories
    model_file = model_dir / "model.joblib"
    scaler_file = model_dir / "scaler.joblib"
    metrics_file = model_dir / "metrics.json"
    
    if model_file.exists():
        print("✓ Found model.joblib")
        if scaler_file.exists():
            print("✓ Found scaler.joblib")
        if metrics_file.exists():
            print("✓ Found metrics.json")
        return True, model_file
    else:
        print(f"❌ No model.joblib found in {model_dir}")
        return False, None

models_available, model_file = check_saved_models(MODEL_DIR)

if not models_available:
    print("\n⚠️  No trained models found. Please train the model first using the instructions above.")
    print(f"Expected location: {MODEL_DIR / 'model.joblib'}")

## Load Model and Make Predictions

In [None]:
if models_available:
    print(f"Loading model from: {model_file}")
    
    model = joblib.load(model_file)
    print(f"✓ Model loaded successfully")
    print(f"Model type: {type(model)}")
    
    # Load scaler if available
    scaler = None
    scaler_file = MODEL_DIR / "scaler.joblib"
    if scaler_file.exists():
        scaler = joblib.load(scaler_file)
        print(f"✓ Scaler loaded")
    
    # Load metadata
    scaled_df = load_metadata_flexible(str(SCALED_METADATA_PATH))
    features_df = load_metadata_flexible(str(FEATURES_STAGE2_PATH))
    
    if scaled_df is not None and features_df is not None:
        print(f"\n✓ Loaded {scaled_df.height} videos from scaled metadata")
        print(f"✓ Loaded {features_df.height} feature rows from Stage 2")
        
        # Get sample videos
        sample_videos = scaled_df.head(5).to_pandas()
        print(f"\nSample videos for demonstration:")
        print(sample_videos[["video_path", "label"]].to_string())
    else:
        print("⚠️  Could not load metadata files")
else:
    print("⚠️  Skipping model loading - no trained models found")
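
The cell above loads the model and scaler but stops short of scoring anything. The following is a minimal, self-contained sketch of the load, scale, and predict flow. It uses synthetic features and a temporary directory rather than the project's files; the `demo_*` names, the feature count, and the reading of class 1 as "fake" are all assumptions.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))  # synthetic features, not project data
y_train = (X_train[:, 0] + 0.5 * X_train[:, 1] > 0).astype(int)

with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)

    # Save a fitted scaler and model under the file names the notebook looks for
    demo_scaler = StandardScaler().fit(X_train)
    demo_model = LogisticRegression().fit(demo_scaler.transform(X_train), y_train)
    joblib.dump(demo_scaler, demo_dir / "scaler.joblib")
    joblib.dump(demo_model, demo_dir / "model.joblib")

    # Reload both artifacts, as the cell above does
    loaded_scaler = joblib.load(demo_dir / "scaler.joblib")
    loaded_model = joblib.load(demo_dir / "model.joblib")

# Apply the same scaling used at training time before predicting
X_new = rng.normal(size=(5, 8))
proba = loaded_model.predict_proba(loaded_scaler.transform(X_new))
fake_score = proba[:, 1]  # probability of class 1 (assumed here to mean "fake")
pred = (fake_score >= 0.5).astype(int)

for score, label in zip(fake_score, pred):
    print(f"score={score:.3f} -> predicted label {label}")
```

For the real model, the feature matrix would need the same columns, in the same order, as the one used during training.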

## Display Sample Videos and Predictions

In [None]:
if models_available and 'model' in locals() and 'sample_videos' in locals() and len(sample_videos) > 0:
    # Show the first frame of up to three sample videos
    n_show = min(3, len(sample_videos))
    fig, axes = plt.subplots(1, n_show, figsize=(15, 5), squeeze=False)
    
    for ax, (_, row) in zip(axes[0], sample_videos.iterrows()):
        video_path = project_root / row["video_path"]
        label = row["label"]
        
        # Try to read the first frame as a thumbnail
        frame_rgb = None
        try:
            import cv2
            cap = cv2.VideoCapture(str(video_path))
            try:
                if cap.isOpened():
                    ret, frame = cap.read()
                    if ret:
                        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            finally:
                cap.release()  # release the handle even if reading fails
        except Exception as e:
            print(f"⚠️  Could not read {video_path.name}: {e}")
        
        if frame_rgb is not None:
            ax.imshow(frame_rgb)
            ax.set_title(f"{video_path.name}\nLabel: {label}", fontsize=10)
        else:
            # Fall back to a text placeholder when no frame is available
            ax.text(0.5, 0.5, f"Video: {video_path.name}\nLabel: {label}",
                    ha='center', va='center', fontsize=12, transform=ax.transAxes)
        ax.axis('off')
    
    plt.tight_layout()
    plt.show()
    
    print("\nNote: To play videos in the notebook, use:")
    print("display(Video('path/to/video.mp4', embed=True, width=640, height=480))")

## Model Performance Summary

In [None]:
if models_available:
    # Try to load metrics from model directory
    metrics_file = MODEL_DIR / "metrics.json"
    
    if metrics_file.exists():
        with open(metrics_file, 'r') as f:
            metrics = json.load(f)
        
        print("Model Performance Metrics:")
        print("=" * 50)
        for key, value in metrics.items():
            if isinstance(value, (int, float)):
                print(f"{key}: {value:.4f}")
            else:
                print(f"{key}: {value}")
        
        # Plot whichever test metrics are present; missing ones are skipped rather than drawn as 0
        candidate_metrics = ['test_accuracy', 'test_precision', 'test_recall', 'test_f1']
        metric_names = [m for m in candidate_metrics if isinstance(metrics.get(m), (int, float))]
        if metric_names:
            fig, ax = plt.subplots(figsize=(8, 6))
            metric_values = [metrics[m] for m in metric_names]
            
            colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'][:len(metric_names)]
            bars = ax.bar(metric_names, metric_values, color=colors)
            ax.set_ylabel('Score')
            ax.set_title('sklearn LogisticRegression Model Performance')
            ax.set_ylim(0, 1)
            
            # Add value labels on bars
            for bar, val in zip(bars, metric_values):
                height = bar.get_height()
                ax.text(bar.get_x() + bar.get_width()/2., height,
                       f'{val:.3f}', ha='center', va='bottom')
            
            plt.tight_layout()
            plt.show()
    else:
        print("⚠️  Metrics file not found. Model may not have been fully trained.")
        print(f"Expected: {metrics_file}")
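
The plotting code above looks for the keys `test_accuracy`, `test_precision`, `test_recall`, and `test_f1`. As a hedged sketch of how a file with those keys could be produced, the example below computes the four scores with scikit-learn on a tiny hand-made label set and round-trips them through JSON. Whether the training script writes exactly this structure is an assumption; the labels are illustrative, not results.

```python
import json
import tempfile
from pathlib import Path

from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made labels for illustration only (1 = fake, 0 = real is an assumption)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

demo_metrics = {
    "test_accuracy": float(accuracy_score(y_true, y_pred)),
    "test_precision": float(precision_score(y_true, y_pred)),
    "test_recall": float(recall_score(y_true, y_pred)),
    "test_f1": float(f1_score(y_true, y_pred)),
}

# Write and re-read the file to mirror how the notebook consumes metrics.json
with tempfile.TemporaryDirectory() as tmp:
    demo_metrics_file = Path(tmp) / "metrics.json"
    demo_metrics_file.write_text(json.dumps(demo_metrics, indent=2))
    reloaded_metrics = json.loads(demo_metrics_file.read_text())

for key, value in reloaded_metrics.items():
    print(f"{key}: {value:.4f}")
```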

## Model Architecture Summary

**sklearn LogisticRegression** is a linear classifier that:
- Uses handcrafted features from Stage 2 (noise residual, DCT statistics, blur/sharpness, codec cues)
- Supports L1, L2, and ElasticNet regularization
- Uses StandardScaler for feature normalization
- Outputs probability scores for binary classification (real vs fake)
- Tunes its hyperparameters (C, penalty, solver) with grid search
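
The scaling and grid search described above can be sketched with a scikit-learn `Pipeline` and `GridSearchCV`. This is a minimal example on synthetic data; the grid values, the F1 scoring choice, and the five-fold split are assumptions and may differ from what `train_sklearn_logreg` actually searches.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the handcrafted features (not project data)
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Scaling lives inside the pipeline so it is refit on each training fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000)),
])

# A list of grids pairs each penalty only with solvers that support it
c_values = [0.01, 0.1, 1.0]
param_grid = [
    {"clf__penalty": ["l2"], "clf__solver": ["lbfgs", "saga"], "clf__C": c_values},
    {"clf__penalty": ["l1"], "clf__solver": ["liblinear", "saga"], "clf__C": c_values},
    {"clf__penalty": ["elasticnet"], "clf__solver": ["saga"],
     "clf__l1_ratio": [0.5], "clf__C": c_values},
]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated F1: {search.best_score_:.3f}")
```

Keeping the scaler inside the pipeline means validation folds never influence the scaling statistics, which avoids a subtle form of leakage during the search.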

**Advantages:**
- Simple and interpretable
- Fast training and inference
- Multiple regularization options (L1/L2/ElasticNet)
- Good baseline for comparison

**Limitations:**
- Linear decision boundary (may not capture complex patterns)
- Relies on quality of handcrafted features
- No temporal modeling
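
To make the linear decision boundary concrete: for a binary problem the predicted probability is the logistic function applied to a weighted sum of the features. The sketch below checks this on synthetic data by recomputing the probabilities from `coef_` and `intercept_`; the dataset is an illustrative assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not project data)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Linear score z = X·w + b, followed by the logistic (sigmoid) function
z = X @ clf.coef_.ravel() + clf.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# The boundary z = 0 is a hyperplane in feature space, hence "linear"
matches = np.allclose(manual_proba, clf.predict_proba(X)[:, 1])
print(f"manual probabilities match predict_proba: {matches}")
```

Because each coefficient weights a single feature, its sign and magnitude (on scaled features) give a direct reading of how that feature pushes the prediction.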

**Difference from 5a:** This is a standalone sklearn implementation with more regularization options, while 5a uses the pipeline's LogisticRegressionBaseline class.