# Model 5c: Naive CNN

This notebook demonstrates the Naive CNN model for deepfake video detection.

## Model Overview

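Naive CNN is a simple 2D convolutional neural network that processes video frames directly. It applies the same 2D convolutions to each frame independently and averages the frame-level predictions, so it captures spatial patterns but does not model temporal relationships.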

## Training Instructions

To train this model, run:

```bash
sbatch src/scripts/slurm_stage5c.sh
```

Or use Python:

```python
from lib.training.pipeline import stage5_train_models

results = stage5_train_models(
    project_root=".",
    scaled_metadata_path="data/stage3/scaled_metadata.parquet",
    features_stage2_path=None,
    features_stage4_path=None,
    model_types=["naive_cnn"],
    n_splits=5,
    num_frames=500,  # Reduced for memory efficiency
    output_dir="data/stage5",
    use_tracking=True,
    use_mlflow=True
)
```

## Architecture Deep-Dive

**Naive CNN** processes video frames independently using 2D convolutions, then aggregates frame-level predictions.

### Architecture Details

**Input**: (N, C, T, H, W) or (N, T, C, H, W) video tensors
- N: batch size
- C: channels (3 for RGB)
- T: temporal frames (up to 1000)
- H, W: spatial dimensions (256x256 after scaling)

**Processing Pipeline**:
1. **Frame Reshaping**: (N, T, C, H, W) → (N×T, C, H, W)
2. **Chunked Processing**: Process 10 frames at a time to avoid OOM
3. **2D CNN Layers**:
   - Conv2d(3→32) + BatchNorm + ReLU + MaxPool(2)
   - Conv2d(32→64) + BatchNorm + ReLU + MaxPool(2)
   - Conv2d(64→128) + BatchNorm + ReLU + AdaptiveAvgPool(1,1)
4. **Classification Head**:
   - Linear(128→64) + ReLU + Dropout(0.5)
   - Linear(64→2) for binary classification
5. **Temporal Aggregation**: Average frame predictions for video-level output
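
The five steps above can be sketched as follows. This is a minimal illustration, not the code in `lib/training/_cnn.py`: the class name `ChunkedFrameCNN`, the `chunk_size` argument, and the exact placement of the max-pooling layers are assumptions, and the sketch accepts only the (N, T, C, H, W) layout.

```python
import torch
import torch.nn as nn


class ChunkedFrameCNN(nn.Module):
    """Hypothetical sketch: frame-independent 2D CNN with chunked processing."""

    def __init__(self, num_classes: int = 2, chunk_size: int = 10):
        super().__init__()
        self.chunk_size = chunk_size
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, num_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, t, c, h, w = x.shape                     # expects (N, T, C, H, W)
        frames = x.reshape(n * t, c, h, w)          # step 1: fold time into the batch axis
        logits = []
        for start in range(0, n * t, self.chunk_size):  # step 2: a few frames at a time
            chunk = frames[start:start + self.chunk_size]
            logits.append(self.head(self.features(chunk)))  # steps 3 and 4
        logits = torch.cat(logits, dim=0).reshape(n, t, -1)
        return logits.mean(dim=1)                   # step 5: average over frames


# Small random input so the example runs quickly on CPU.
model = ChunkedFrameCNN().eval()
with torch.no_grad():
    video_logits = model(torch.randn(2, 12, 3, 32, 32))
print(video_logits.shape)  # torch.Size([2, 2])
```

One thing this sketch makes visible: in training mode, BatchNorm would compute its statistics per chunk rather than per video, so the chunk size can influence training behaviour as well as memory use.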

### Implementation Code

**Location**: `lib/training/_cnn.py`

```python
class NaiveCNNBaseline(nn.Module):
    def __init__(self, num_frames: int = 1000, num_classes: int = 2):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc1 = nn.Linear(128, 64)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(64, num_classes)
        
    def forward(self, x):
        # Process frames in chunks, average predictions
        ...
```

### Memory Optimization

- **Chunked Processing**: 10 frames per chunk (prevents OOM)
- **Batch Size**: 1 (each video contributes up to 1000 frames)
- **Gradient Accumulation**: 16 steps (effective batch size = 16)
- **Initialization**: He initialization for ReLU activations
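
A rough back-of-envelope estimate shows why chunking matters. The sketch below counts only the float32 output of the first convolution (32 channels at 256x256) and ignores every other buffer, so the figures are an assumed lower bound rather than measured usage; `activation_mib` is a hypothetical helper.

```python
def activation_mib(frames: int, channels: int = 32, height: int = 256, width: int = 256,
                   bytes_per_value: int = 4) -> float:
    """Size in MiB of one float32 tensor of shape (frames, channels, height, width)."""
    return frames * channels * height * width * bytes_per_value / 2**20


chunk_mib = activation_mib(frames=10)     # one 10-frame chunk
video_mib = activation_mib(frames=1000)   # all 1000 frames at once

print(f"10-frame chunk: {chunk_mib:.0f} MiB")          # 10-frame chunk: 80 MiB
print(f"1000 frames:    {video_mib / 1024:.1f} GiB")   # 1000 frames:    7.8 GiB

# Gradient accumulation: 16 steps at batch_size=1 give an effective batch size of 16.
effective_batch_size = 1 * 16
```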


## Hyperparameter Configuration

**Training Hyperparameters** (from `lib/training/grid_search.py`):

- **learning_rate**: 0.0005
- **weight_decay**: 0.0001
- **batch_size**: 1
- **num_epochs**: 25

**Rationale**:
- **Batch Size 1**: Memory-constrained (processes up to 1000 frames per video)
- **Single Hyperparameter Combination**: Reduced from 5+ combinations for training efficiency
- **Gradient Accumulation**: Maintains effective batch size despite small batch_size
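
Gradient accumulation can be illustrated without a deep-learning framework. The sketch below uses a hypothetical one-parameter least-squares model (`grad` and the toy data are assumptions made for illustration) and checks that 16 per-sample gradients, each scaled by 1/16, add up to the gradient of the mean loss over a batch of 16.

```python
import random

random.seed(0)
ACCUM_STEPS = 16
w = 0.5  # single model parameter
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(ACCUM_STEPS)]


def grad(w: float, x: float, y: float) -> float:
    """Derivative with respect to w of the squared error (w*x - y)**2 for one sample."""
    return 2.0 * (w * x - y) * x


# Accumulate: one sample per step (batch_size=1), each gradient scaled by 1/ACCUM_STEPS.
accumulated = 0.0
for x, y in data:
    accumulated += grad(w, x, y) / ACCUM_STEPS

# Reference: gradient of the mean loss over the full batch of 16.
full_batch = sum(grad(w, x, y) for x, y in data) / ACCUM_STEPS

print(abs(accumulated - full_batch) < 1e-12)
```

The equivalence holds for the gradient itself. Layers that depend on batch statistics, such as BatchNorm, still see only the samples of each individual step.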


## MLOps Integration

### Experiment Tracking with MLflow

This model integrates with MLflow for comprehensive experiment tracking:

```python
from lib.mlops.mlflow_tracker import create_mlflow_tracker

# MLflow automatically tracks:
# - Hyperparameters (learning_rate, batch_size, etc.)
# - Metrics (train_loss, val_acc, test_f1, etc.)
# - Model artifacts (checkpoints, configs)
# - Run metadata (tags, timestamps, fold numbers)
```

**Access MLflow UI**:
```bash
mlflow ui --port 5000
# Open http://localhost:5000
```

### DuckDB Analytics

Query training results with SQL for fast analytics:

```python
from lib.utils.duckdb_analytics import DuckDBAnalytics

analytics = DuckDBAnalytics()
# The path is a plain string, so the {model_type} placeholder is spelled out here.
analytics.register_parquet('results', 'data/stage5/naive_cnn/metrics.json')
result = analytics.query("""
    SELECT 
        fold,
        AVG(test_f1) as avg_f1,
        STDDEV(test_f1) as std_f1
    FROM results
    GROUP BY fold
""")
```
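
A comparable cross-fold summary can be computed with the standard library alone. This sketch assumes each `fold_*` directory contains a `metrics.json` with a numeric `test_f1` field; that key name is taken from the MLflow comment above and is not confirmed by the training code, and `summarize_folds` is a hypothetical helper.

```python
import json
import statistics
import tempfile
from pathlib import Path


def summarize_folds(model_dir: Path, key: str = "test_f1") -> dict:
    """Mean and sample standard deviation of one metric across fold_*/metrics.json files."""
    values = []
    for metrics_file in sorted(model_dir.glob("fold_*/metrics.json")):
        metrics = json.loads(metrics_file.read_text())
        if key in metrics:
            values.append(float(metrics[key]))
    if not values:
        return {"n_folds": 0, "mean": None, "std": None}
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return {"n_folds": len(values), "mean": statistics.mean(values), "std": std}


# Demo on made-up numbers written to a temporary directory (not real results).
with tempfile.TemporaryDirectory() as tmp:
    for i, f1 in enumerate([0.70, 0.80, 0.90], start=1):
        fold_dir = Path(tmp) / f"fold_{i}"
        fold_dir.mkdir()
        (fold_dir / "metrics.json").write_text(json.dumps({"test_f1": f1}))
    summary = summarize_folds(Path(tmp))

print(summary)
```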

### Airflow Orchestration

Pipeline orchestrated via Apache Airflow DAG (`airflow/dags/fvc_pipeline_dag.py`):
- **Dependency Management**: Automatic task ordering
- **Retry Logic**: Automatic retries on failure
- **Monitoring**: Web UI for pipeline status
- **Scheduling**: Cron-based scheduling support


## Training Methodology

### 5-Fold Stratified Cross-Validation

- **Purpose**: Robust performance estimates; helps detect overfitting to a single split
- **Stratification**: Ensures class balance in each fold
- **Evaluation**: Metrics averaged across folds with standard deviation
- **Rationale**: More reliable than single train/test split
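
A minimal sketch of stratified fold assignment, written in plain Python for illustration. The pipeline's actual splitter is not shown in this notebook; `stratified_folds` and its round-robin scheme are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels: list, n_splits: int = 5, seed: int = 42) -> list:
    """Return n_splits lists of sample indices with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


labels = ["real"] * 20 + ["fake"] * 80   # toy 20/80 class imbalance
folds = stratified_folds(labels)
for k, val_idx in enumerate(folds):
    n_fake = sum(labels[i] == "fake" for i in val_idx)
    print(f"fold {k}: {len(val_idx)} validation samples, {n_fake} fake")
```

Each fold serves once as the validation set while the remaining four are used for training, so every video is evaluated exactly once.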

### Regularization Strategy

- **Weight Decay**: 1e-4, applied by AdamW as decoupled weight decay (PyTorch models)
- **Dropout**: 0.5 in classification heads (PyTorch models)
- **Early Stopping**: Patience=5 epochs (prevents overfitting)
- **Gradient Clipping**: max_norm=1.0 (prevents exploding gradients)
- **Class Weights**: Balanced sampling for imbalanced datasets
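
Early stopping with a patience of 5 epochs can be sketched as below. The class name `EarlyStopping`, the `min_delta` parameter, and the choice of validation loss as the monitored quantity are assumptions; the loss values are made up for the demonstration.

```python
class EarlyStopping:
    """Stop when the monitored loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improvement stalls after the third epoch.
val_losses = [0.70, 0.62, 0.58, 0.59, 0.60, 0.58, 0.61, 0.63, 0.64]
stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)  # 8 0.58
```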

### Optimization

- **Optimizer**: AdamW with betas=(0.9, 0.999)
- **Mixed Precision**: AMP (Automatic Mixed Precision) for memory efficiency
- **Gradient Accumulation**: Dynamic based on batch size (maintains effective batch size)
- **Learning Rate Schedule**: Cosine annealing with warmup (2 epochs)
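- **Differential Learning Rates**: Lower LR for pretrained backbones (5e-6) vs heads (5e-4); this applies to the pretrained models in the pipeline, not to Naive CNN, which has no pretrained backbone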
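
The learning-rate schedule can be written as a small function. This sketch assumes linear warmup, one update per epoch, a floor of zero, and the 25 epochs and 5e-4 base rate listed in the hyperparameter section; `lr_at_epoch` is a hypothetical helper, not the pipeline's scheduler.

```python
import math


def lr_at_epoch(epoch: int, base_lr: float = 5e-4, warmup_epochs: int = 2,
                total_epochs: int = 25, min_lr: float = 0.0) -> float:
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at_epoch(e) for e in range(25)]
print([f"{lr:.2e}" for lr in schedule[:4]])
```

The rate climbs to the base value over the two warmup epochs, then decays smoothly toward the floor by the final epoch.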

### Data Pipeline

- **Video Loading**: Frame-by-frame decoding (50x memory reduction)
- **Augmentation**: Pre-generated augmentations (reproducible, fast)
- **Scaling**: Fixed 256x256 max dimension with letterboxing
- **Frame Sampling**: Uniform sampling across video duration
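
Uniform frame sampling and letterbox scaling both reduce to a little index arithmetic. The helpers below are hypothetical (`uniform_frame_indices`, `letterbox_geometry`), and the sketch assumes that smaller videos are scaled up to the 256-pixel target as well, which the notebook does not state.

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list:
    """Pick up to num_frames indices spread evenly over [0, total_frames)."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if total_frames <= num_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


def letterbox_geometry(width: int, height: int, target: int = 256) -> tuple:
    """Resized (w, h) that fits in target x target, plus (left, top) padding to center it."""
    scale = target / max(width, height)
    new_w, new_h = max(1, round(width * scale)), max(1, round(height * scale))
    return (new_w, new_h), ((target - new_w) // 2, (target - new_h) // 2)


print(uniform_frame_indices(total_frames=3000, num_frames=5))  # [0, 600, 1200, 1800, 2400]
print(letterbox_geometry(1920, 1080))                          # ((256, 144), (0, 56))
```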


## Design Rationale

### Why "Naive" CNN?

- **Baseline Purpose**: Simple 2D CNN establishes baseline for video models
- **Frame-Independent**: Processes each frame independently (no temporal modeling)
- **Memory Efficient**: Chunked processing handles long videos (1000 frames)
- **Comparison Point**: Demonstrates benefit of temporal models (3D CNNs, Transformers)

### Trade-offs

- **No Temporal Modeling**: Loses temporal relationships between frames
- **Simple Architecture**: May underfit complex patterns
- **Chunked Processing**: Adds complexity but necessary for memory constraints



In [None]:
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Video, display, HTML
import json
import torch
import torch.nn as nn

# Add project root to path
project_root = Path().absolute().parent.parent
sys.path.insert(0, str(project_root))

from lib.training.model_factory import create_model
from lib.mlops.config import RunConfig
from lib.utils.paths import load_metadata_flexible
from lib.training.metrics_utils import compute_classification_metrics

# Configuration
MODEL_TYPE = "naive_cnn"
MODEL_DIR = project_root / "data" / "stage5" / MODEL_TYPE
SCALED_METADATA_PATH = project_root / "data" / "stage3" / "scaled_metadata.parquet"

print(f"Project root: {project_root}")
print(f"Model directory: {MODEL_DIR}")
print(f"Model directory exists: {MODEL_DIR.exists()}")

## Check for Saved Models

In [None]:
def check_saved_models(model_dir: Path):
    """Check for saved PyTorch model files."""
    if not model_dir.exists():
        print(f"❌ Model directory does not exist: {model_dir}")
        return False, []
    
    fold_dirs = sorted([d for d in model_dir.iterdir() if d.is_dir() and d.name.startswith("fold_")])
    
    if not fold_dirs:
        print(f"❌ No fold directories found in {model_dir}")
        return False, []
    
    print(f"✓ Found {len(fold_dirs)} fold(s)")
    
    models_found = []
    for fold_dir in fold_dirs:
        model_file = fold_dir / "model.pt"
        if model_file.exists():
            models_found.append((fold_dir.name, model_file))
            print(f"  ✓ {fold_dir.name}: Found model.pt")
        else:
            print(f"  ❌ {fold_dir.name}: No model.pt found")
    
    return len(models_found) > 0, models_found

models_available, model_files = check_saved_models(MODEL_DIR)

if not models_available:
    print("\n⚠️  No trained models found. Please train the model first using the instructions above.")
    print(f"Expected location: {MODEL_DIR}")

## Load Model

In [None]:
if models_available:
    # Create model instance
    config = RunConfig(
        run_id="demo",
        experiment_name="demo",
        model_type=MODEL_TYPE,
        num_frames=500
    )
    
    model = create_model(MODEL_TYPE, config)
    
    # Load weights
    fold_name, model_path = model_files[0]
    print(f"Loading model weights from: {model_path}")
    
    checkpoint = torch.load(model_path, map_location='cpu')
    if isinstance(checkpoint, dict) and 'model_state_dict' in checkpoint:
        model.load_state_dict(checkpoint['model_state_dict'])
    else:
        model.load_state_dict(checkpoint)
    
    model.eval()
    print(f"✓ Model loaded successfully from {fold_name}")
    print(f"Model architecture: {type(model).__name__}")
    
    # Load metadata
    scaled_df = load_metadata_flexible(str(SCALED_METADATA_PATH))
    
    if scaled_df is not None:
        print(f"\n✓ Loaded {scaled_df.height} videos from scaled metadata")
        sample_videos = scaled_df.head(5).to_pandas()
        print(f"\nSample videos for demonstration:")
        print(sample_videos[["video_path", "label"]].to_string())
    else:
        print("⚠️  Could not load metadata files")
else:
    print("⚠️  Skipping model loading - no trained models found")

## Display Sample Videos

In [None]:
if models_available and 'sample_videos' in locals():
    fig, axes = plt.subplots(1, min(3, len(sample_videos)), figsize=(15, 5))
    if len(sample_videos) == 1:
        axes = [axes]
    
    for ax, (_, row) in zip(axes, sample_videos.iterrows()):
        video_path = project_root / row["video_path"]
        label = row["label"]
        frame_rgb = None
        
        try:
            import cv2
            cap = cv2.VideoCapture(str(video_path))
            try:
                # VideoCapture does not raise on a missing or unreadable file,
                # so check isOpened() and the read() status explicitly.
                ret, frame = cap.read() if cap.isOpened() else (False, None)
            finally:
                cap.release()
            if ret:
                frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        except Exception as e:
            print(f"Could not read {video_path}: {e}")
        
        if frame_rgb is not None:
            ax.imshow(frame_rgb)
            ax.set_title(f"{Path(video_path).name}\nLabel: {label}", fontsize=10)
        else:
            # Fall back to a text placeholder when no frame is available
            ax.text(0.5, 0.5, f"Video: {Path(video_path).name}\nLabel: {label}", 
                    ha='center', va='center', fontsize=12, transform=ax.transAxes)
        ax.axis('off')
    
    plt.tight_layout()
    plt.show()
    
    print("\nNote: To play videos in the notebook, use:")
    print("display(Video('path/to/video.mp4', embed=True, width=640, height=480))")

## Model Performance Summary

In [None]:
if models_available:
    fold_dir = model_files[0][0]
    metrics_file = MODEL_DIR / fold_dir / "metrics.json"
    
    if metrics_file.exists():
        with open(metrics_file, 'r') as f:
            metrics = json.load(f)
        
        print("Model Performance Metrics:")
        print("=" * 50)
        for key, value in metrics.items():
            if isinstance(value, (int, float)):
                print(f"{key}: {value:.4f}")
            else:
                print(f"{key}: {value}")
        
        if 'accuracy' in metrics or 'f1_score' in metrics:
            fig, ax = plt.subplots(figsize=(8, 6))
            metric_names = ['accuracy', 'precision', 'recall', 'f1_score']
            metric_values = [metrics.get(m, 0) for m in metric_names]
            
            bars = ax.bar(metric_names, metric_values, color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'])
            ax.set_ylabel('Score')
            ax.set_title('Naive CNN Model Performance')
            ax.set_ylim(0, 1)
            
            for bar, val in zip(bars, metric_values):
                height = bar.get_height()
                ax.text(bar.get_x() + bar.get_width()/2., height,
                       f'{val:.3f}', ha='center', va='bottom')
            
            plt.tight_layout()
            plt.show()
    else:
        print("⚠️  Metrics file not found.")
        print(f"Expected: {metrics_file}")

## Model Architecture Summary

**Naive CNN** is a simple 2D convolutional neural network that:
- Processes raw video frames directly (no feature extraction)
- Applies 2D convolutions to each frame independently, then averages the frame-level predictions
- Processes 500 frames at 256x256 resolution
- End-to-end trainable architecture

**Advantages:**
- Learns features automatically from raw video
- Chunked processing keeps per-step memory bounded on long videos
- No manual feature engineering required

**Limitations:**
- No temporal modeling: relationships between frames are lost
- Very memory intensive (requires batch_size=1)
- Simple architecture may not capture complex patterns
- Long training time due to processing many frames