# 🎯 DEEPFAKE DETECTION - INCEPTIONRESNETV2 + ATTENTION (LOCAL GPU)

**Mission:** Train a single InceptionResNetV2 + Attention Pooling model for deepfake detection

**Architecture:**
- CNN: InceptionResNetV2 (pretrained)
- Temporal: Attention Pooling (learns to focus on suspicious frames)
- Classifier: FC layers with dropout

**Training Strategy:**
- 70:30 Train/Validation Split (420/180 videos)
- Gradient Accumulation (simulates batch_size=4)
- Mixed precision training (FP16)
- Hyperparameter tuning with Optuna (LR, attention_dim, fc_dropout)
- **Hardware:** RTX 3050 Laptop (4GB VRAM)
- **GPU Safety:** Thermal monitoring, pause at 78°C
- **Data Leak Prevention:** Clean split before Optuna

**Expected Runtime:** ~6-8 hours

---

## 📋 SESSION TRACKING
Mark your progress:
- [ ] Environment setup complete
- [ ] Optuna hyperparameter tuning complete
- [ ] Training started
- [ ] Model training complete

## 🔧 STEP 1: Environment Setup & GPU Configuration

In [None]:
# Check GPU availability and system info
import torch
import platform
print(f"System: {platform.system()} {platform.release()}")
print(f"Python Version: {platform.python_version()}")
print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA Version: {torch.version.cuda}")
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
else:
    print("❌ ERROR: GPU not detected! Check CUDA installation.")
    raise RuntimeError("GPU required for training")

In [None]:
# Verify required packages are installed
# (Packages should already be installed from requirements-local.txt)
try:
    import cv2
    import albumentations
    import timm
    import pynvml
    print("✅ All required packages verified!")
except ImportError as e:
    print(f"❌ Missing package: {e}")
    print("Please run: pip install -r requirements-local.txt")

In [None]:
# Import all required libraries
import os
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# PyTorch imports
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.cuda.amp import autocast, GradScaler
import torchvision.models as models
from torchvision import transforms

# Additional libraries
import albumentations as A
from albumentations.pytorch import ToTensorV2
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix
import optuna
from optuna.pruners import MedianPruner
import timm
import random
import time
import json
from datetime import datetime
import signal
import sys
import logging

# GPU Monitoring & Hardware Safety
import pynvml
import psutil

# Initialize NVML for GPU monitoring
try:
    pynvml.nvmlInit()
    print("✅ NVML initialized for GPU monitoring")
except Exception as e:
    print(f"⚠️ NVML initialization failed: {e}")
    print("Thermal monitoring may be limited")

# Set random seeds for reproducibility
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
print("✅ Libraries imported and seed set!")

## 💾 STEP 2: Setup Local Paths & Verify Dataset

In [None]:
# Setup local paths (Windows)
import os
from pathlib import Path

# Base directory - adjust if needed
BASE_DIR = Path(r"D:\Data\Github\SheldonC2005\ModelArena")
BASE_PATH = BASE_DIR / "archive"
TRAIN_FAKE_PATH = BASE_PATH / "train" / "fake"
TRAIN_REAL_PATH = BASE_PATH / "train" / "real"
TEST_PATH = BASE_PATH / "test"
LABELS_PATH = BASE_PATH / "train_labels.csv"
TEST_CSV_PATH = BASE_PATH / "test_public.csv"

# Output directories
SAVE_PATH = BASE_DIR / "models"
LOG_PATH = BASE_DIR / "logs"
SAVE_PATH.mkdir(exist_ok=True, parents=True)
LOG_PATH.mkdir(exist_ok=True, parents=True)

print(f"✅ Working directory: {BASE_DIR}")
print(f"✅ Archive path: {BASE_PATH}")
print(f"✅ Models will be saved to: {SAVE_PATH}")
print(f"✅ Logs will be saved to: {LOG_PATH}")

In [None]:
# Verify dataset structure
print("📁 Dataset Structure Verification:")
print(f"Train Fake videos: {len(list(TRAIN_FAKE_PATH.glob('*.mp4')))} files")
print(f"Train Real videos: {len(list(TRAIN_REAL_PATH.glob('*.mp4')))} files")
print(f"Test videos: {len(list(TEST_PATH.glob('*.mp4')))} files")
print(f"Labels CSV exists: {LABELS_PATH.exists()}")
print(f"Test CSV exists: {TEST_CSV_PATH.exists()}")

# Check if counts match expected
assert len(list(TRAIN_FAKE_PATH.glob('*.mp4'))) == 300, "❌ Expected 300 fake videos!"
assert len(list(TRAIN_REAL_PATH.glob('*.mp4'))) == 300, "❌ Expected 300 real videos!"
assert len(list(TEST_PATH.glob('*.mp4'))) == 200, "❌ Expected 200 test videos!"
print("\n✅ Dataset verified successfully!")

In [None]:
# GPU Monitoring & Hardware Safety Functions
class GPUMonitor:
    """Monitor GPU temperature and VRAM usage for hardware safety"""
    
    def __init__(self, temp_threshold=78, temp_resume=72, vram_threshold=0.90):
        self.temp_threshold = temp_threshold  # Pause training at this temp
        self.temp_resume = temp_resume  # Resume when temp drops to this
        self.vram_threshold = vram_threshold  # Warn at 90% VRAM usage
        self.gpu_handle = None
        
        try:
            self.gpu_handle = pynvml.nvmlDeviceGetHandleByIndex(0)
            print(f"✅ GPU Monitor initialized")
            print(f"   Thermal protection: Pause at {temp_threshold}°C, resume at {temp_resume}°C")
            print(f"   VRAM protection: Warning at {vram_threshold*100}%")
        except Exception as e:
            print(f"⚠️ GPU monitoring limited: {e}")
    
    def get_gpu_temp(self):
        """Get current GPU temperature"""
        try:
            if self.gpu_handle is None:
                return 0
            temp = pynvml.nvmlDeviceGetTemperature(self.gpu_handle, pynvml.NVML_TEMPERATURE_GPU)
            return temp
        except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
            return 0
    
    def get_vram_usage(self):
        """Get current VRAM usage percentage"""
        try:
            if self.gpu_handle is None:
                return 0.0
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(self.gpu_handle)
            return mem_info.used / mem_info.total
        except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
            return 0.0
    
    def check_thermal_safety(self):
        """Check if GPU temperature is safe"""
        temp = self.get_gpu_temp()
        
        # If monitoring is unavailable (temp=0), assume safe
        if temp == 0 and self.gpu_handle is None:
            return True, 0
        
        if temp >= self.temp_threshold:
            return False, temp
        return True, temp
    
    def wait_for_cooling(self):
        """Wait for GPU to cool down"""
        print(f"\n🔥 GPU overheating! Pausing training...")
        while True:
            temp = self.get_gpu_temp()
            print(f"   Current temp: {temp}°C. Waiting for {self.temp_resume}°C...", end='\r')
            if temp <= self.temp_resume:
                print(f"\n✅ GPU cooled to {temp}°C. Resuming training...")
                break
            time.sleep(30)  # Check every 30 seconds
    
    def check_vram_safety(self):
        """Check VRAM usage"""
        usage = self.get_vram_usage()
        if usage > self.vram_threshold:
            print(f"⚠️ VRAM usage high: {usage*100:.1f}%")
            return False
        return True

# Initialize GPU monitor
gpu_monitor = GPUMonitor(temp_threshold=78, temp_resume=72, vram_threshold=0.90)

# Setup logging
log_file = LOG_PATH / f"training_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler(log_file),
        logging.StreamHandler(sys.stdout)
    ]
)
logger = logging.getLogger(__name__)
logger.info("="*80)
logger.info("TRAINING SESSION STARTED")
logger.info(f"GPU: {torch.cuda.get_device_name(0)}")
logger.info(f"Log file: {log_file}")
logger.info("="*80)
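
**Sketch: pause/resume hysteresis.** The two thresholds passed to `GPUMonitor` (pause at 78°C, resume at 72°C) form a hysteresis band, so training does not flip between paused and running around a single temperature. The block below is a minimal, GPU-free illustration of that logic on a made-up temperature trace. `simulate_hysteresis` and the readings are hypothetical and are not part of the notebook's pipeline. It evaluates every reading, whereas the training loop only checks every `CHECK_GPU_TEMP_EVERY_N_BATCHES` batches and polls every 30 seconds while paused.

```python
def simulate_hysteresis(readings, pause_at=78, resume_at=72):
    """Label each temperature reading as 'run' or 'pause' using two thresholds."""
    paused = False
    states = []
    for temp in readings:
        if paused:
            # While paused, only a drop to the resume threshold restarts training
            if temp <= resume_at:
                paused = False
        elif temp >= pause_at:
            # While running, reaching the pause threshold stops training
            paused = True
        states.append("pause" if paused else "run")
    return states


# Hypothetical readings in degrees Celsius
trace = [70, 76, 78, 77, 74, 72, 75]
states = simulate_hysteresis(trace)
for temp, state in zip(trace, states):
    print(f"{temp}°C -> {state}")
```

Readings of 77°C and 74°C stay paused even though they are below the pause threshold, which is the point of using a separate resume temperature.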

## ⚙️ STEP 3: Configuration & Hyperparameters

In [None]:
# Configuration class - Optimized for RTX 3050 (4GB VRAM)
class Config:
    # Model configuration - InceptionResNetV2 only
    MODEL_NAME = 'inception_resnet_v2_best.pt'
    IMG_SIZE = 299
    FEATURE_DIM = 1536
    
    # Training hyperparameters - OPTIMIZED FOR 4GB VRAM
    FRAMES_PER_VIDEO = 24
    BATCH_SIZE = 1  # Safe for 4GB VRAM with InceptionResNetV2
    ACCUMULATION_STEPS = 4  # Simulates batch_size=4
    NUM_EPOCHS = 30
    EARLY_STOP_PATIENCE = 3
    TRAIN_VAL_SPLIT = 0.7  # 70% train, 30% validation
    
    # Attention parameters (will be tuned by Optuna)
    ATTENTION_DIM = 256  # Default, will be tuned
    
    # Optimizer parameters (will be tuned by Optuna)
    LEARNING_RATE = 1e-4  # Default, will be tuned
    WEIGHT_DECAY = 1e-5
    FC_DROPOUT = 0.5  # Default, will be tuned
    
    # Other settings - OPTIMIZED FOR LOCAL GPU
    NUM_WORKERS = 0  # Safe for Windows
    DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    MIXED_PRECISION = True  # Critical for 4GB VRAM
    
    # Optuna hyperparameter search
    OPTUNA_N_TRIALS = 10
    OPTUNA_TIMEOUT = 1800  # 30 minutes
    
    # Hardware safety settings
    CHECK_GPU_TEMP_EVERY_N_BATCHES = 50
    ENABLE_THERMAL_MONITORING = True
    ENABLE_VRAM_MONITORING = True

config = Config()
logger.info("="*80)
logger.info("CONFIGURATION LOADED - INCEPTIONRESNETV2 + ATTENTION")
logger.info(f"Device: {config.DEVICE}")
logger.info(f"Model: {config.MODEL_NAME}")
logger.info(f"Batch Size: {config.BATCH_SIZE} (with accumulation_steps={config.ACCUMULATION_STEPS})")
logger.info(f"Frames per video: {config.FRAMES_PER_VIDEO}")
logger.info(f"Train/Val Split: {int(config.TRAIN_VAL_SPLIT*100)}/{int((1-config.TRAIN_VAL_SPLIT)*100)}")
logger.info(f"Mixed Precision: {config.MIXED_PRECISION}")
logger.info("="*80)

print("✅ Configuration loaded!")
print(f"Device: {config.DEVICE}")
print(f"Model: InceptionResNetV2 + Attention Pooling")
print(f"Effective Batch Size: {config.BATCH_SIZE * config.ACCUMULATION_STEPS} (via gradient accumulation)")
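
**Sketch: what the batch settings imply per epoch.** With `BATCH_SIZE = 1` and `ACCUMULATION_STEPS = 4`, the optimizer steps far less often than the number of videos processed. The arithmetic below uses the values from `Config` and the 420-video training split stated in this notebook. The variable names are illustrative only.

```python
import math

num_train_videos = 420    # 70% of the 600 labelled videos
batch_size = 1            # Config.BATCH_SIZE
accumulation_steps = 4    # Config.ACCUMULATION_STEPS
frames_per_video = 24     # Config.FRAMES_PER_VIDEO

# One forward/backward pass per micro-batch
micro_batches = math.ceil(num_train_videos / batch_size)
# One optimizer step per group of accumulated micro-batches (a leftover group still steps)
optimizer_steps = math.ceil(micro_batches / accumulation_steps)
# Every video contributes frames_per_video images to the CNN
frames_per_epoch = num_train_videos * frames_per_video
effective_batch = batch_size * accumulation_steps

print(f"Micro-batches per epoch:  {micro_batches}")
print(f"Optimizer steps per epoch: {optimizer_steps}")
print(f"Frames through CNN/epoch:  {frames_per_epoch}")
print(f"Effective batch size:      {effective_batch}")
```

Each forward pass still pushes 24 frames of 299x299 through InceptionResNetV2, which is why the per-step batch size is kept at one video on a 4GB card.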

## 📊 STEP 4: Data Preprocessing & Dataset Class

In [None]:
# Load labels
labels_df = pd.read_csv(LABELS_PATH)
test_df = pd.read_csv(TEST_CSV_PATH)

print(f"Total training samples: {len(labels_df)}")
print(f"Total test samples: {len(test_df)}")

# Validate columns
assert 'filename' in labels_df.columns, "❌ 'filename' column missing in train_labels.csv!"
assert 'label' in labels_df.columns, "❌ 'label' column missing in train_labels.csv!"
assert 'filename' in test_df.columns, "❌ 'filename' column missing in test_public.csv!"

print("📊 Label Distribution:")
print(labels_df['label'].value_counts())
print(f"\nLabel 0 (Real): {(labels_df['label']==0).sum()}")
print(f"Label 1 (Fake): {(labels_df['label']==1).sum()}")

# Create full paths
def get_video_path(filename, label):
    if label == 1:  # Fake
        return str(TRAIN_FAKE_PATH / filename)
    else:  # Real
        return str(TRAIN_REAL_PATH / filename)

labels_df['video_path'] = labels_df.apply(
    lambda row: get_video_path(row['filename'], row['label']), axis=1
)

test_df['video_path'] = test_df['filename'].apply(
    lambda x: str(TEST_PATH / x)
)

print(f"\n✅ Labels prepared: {len(labels_df)} training videos")
print(f"✅ Test data prepared: {len(test_df)} test videos")

In [None]:
# Define augmentations for InceptionResNetV2
def get_augmentations(img_size=299, is_train=True):
    """
    Augmentations optimized for InceptionResNetV2 deepfake detection
    """
    if is_train:
        return A.Compose([
            A.Resize(img_size, img_size),
            A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.15, rotate_limit=0, p=0.5),
            A.HorizontalFlip(p=0.5),
            A.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # Inception normalization
            ToTensorV2()
        ])
    else:
        return A.Compose([
            A.Resize(img_size, img_size),
            A.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
            ToTensorV2()
        ])

print("✅ Augmentation functions defined for InceptionResNetV2!")
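
**Sketch: what the Inception normalization does to pixel values.** With `mean=[0.5, 0.5, 0.5]` and `std=[0.5, 0.5, 0.5]`, each 8-bit pixel is mapped from `[0, 255]` to `[-1, 1]`. The helper below reproduces that arithmetic for a single channel value, assuming Albumentations' default `max_pixel_value=255.0`. `normalize_pixel` is a hypothetical name used only for this illustration.

```python
def normalize_pixel(value, mean=0.5, std=0.5, max_pixel_value=255.0):
    """Scale an 8-bit value to [0, 1], then standardise with mean and std."""
    return (value / max_pixel_value - mean) / std


black = normalize_pixel(0)
mid_gray = normalize_pixel(127.5)
white = normalize_pixel(255)

print(f"0     -> {black:+.2f}")
print(f"127.5 -> {mid_gray:+.2f}")
print(f"255   -> {white:+.2f}")
```

The same statistics are applied in both the training and validation pipelines, so the only difference between them is the random shift/scale and horizontal flip.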

In [None]:
# Video Dataset with on-the-fly frame extraction
class VideoDataset(Dataset):
    def __init__(self, dataframe, img_size=299, num_frames=24, is_train=True, has_labels=True):
        self.df = dataframe.reset_index(drop=True)
        self.img_size = img_size
        self.num_frames = num_frames
        self.is_train = is_train
        self.has_labels = has_labels
        self.transform = get_augmentations(img_size, is_train)
    
    def __len__(self):
        return len(self.df)
    
    def extract_frames(self, video_path):
        """
        Extract uniformly spaced frames from video
        """
        cap = cv2.VideoCapture(video_path)
        frames = []
        
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        
        # Handle corrupted or empty videos
        if total_frames <= 0:
            cap.release()
            # Return black frames as fallback
            return [np.zeros((self.img_size, self.img_size, 3), dtype=np.uint8) for _ in range(self.num_frames)]
        
        # Uniformly spaced indices; when the video has fewer frames than
        # num_frames, linspace repeats indices so some frames are reused
        indices = np.linspace(0, total_frames - 1, self.num_frames, dtype=int)
        
        for idx in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ret, frame = cap.read()
            
            if ret:
                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                frames.append(frame)
            else:
                # If frame read fails, use last valid frame or black frame
                if len(frames) > 0:
                    frames.append(frames[-1])
                else:
                    frames.append(np.zeros((self.img_size, self.img_size, 3), dtype=np.uint8))
        
        cap.release()
        return frames
    
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        video_path = row['video_path']
        
        # Extract frames
        frames = self.extract_frames(video_path)
        
        # Apply augmentations to each frame
        transformed_frames = []
        for frame in frames:
            augmented = self.transform(image=frame)
            transformed_frames.append(augmented['image'])
        
        # Stack frames: [num_frames, C, H, W]
        frames_tensor = torch.stack(transformed_frames)
        
        if self.has_labels:
            label = torch.tensor(row['label'], dtype=torch.long)
            return frames_tensor, label
        else:
            return frames_tensor

print("✅ VideoDataset class defined with on-the-fly frame extraction!")
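
**Sketch: how uniform frame sampling picks indices.** `extract_frames` spreads `num_frames` indices evenly between the first and last frame, and truncates them to integers. The pure-Python helper below mirrors that idea without NumPy or OpenCV so the behaviour on short videos is easy to inspect. `uniform_frame_indices` is a hypothetical name, and its results may differ from `np.linspace(..., dtype=int)` in rare floating-point edge cases.

```python
def uniform_frame_indices(total_frames, num_frames):
    """Evenly spaced frame indices from 0 to total_frames - 1, truncated to int."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames == 1:
        return [0]
    last = total_frames - 1
    # Multiply first, divide once, so the final index lands exactly on `last`
    return [int(i * last / (num_frames - 1)) for i in range(num_frames)]


# A long clip: indices are distinct and span the whole video
long_clip = uniform_frame_indices(total_frames=240, num_frames=24)
# A short clip: fewer frames than requested, so some indices repeat
short_clip = uniform_frame_indices(total_frames=6, num_frames=8)

print("240-frame video:", long_clip[:4], "...", long_clip[-1])
print("6-frame video:  ", short_clip)
```

Repeated indices in the short clip mean the same frame is decoded more than once, which keeps the tensor shape fixed at `[num_frames, C, H, W]`.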

## 🏗️ STEP 5: Model Architecture (CNN + Attention Pooling)

In [None]:
class CNN_Attention_Model(nn.Module):
    """
    InceptionResNetV2 (Feature Extractor) + Attention Pooling (Temporal Modeling) for video classification
    Attention learns to focus on "suspicious" frames for deepfake detection
    """
    def __init__(self, feature_dim=1536, attention_dim=256, fc_dropout=0.5, num_classes=2):
        super(CNN_Attention_Model, self).__init__()
        
        # Load pretrained InceptionResNetV2
        self.cnn = timm.create_model('inception_resnet_v2', pretrained=True, num_classes=0)
        
        # Attention mechanism for temporal modeling
        self.attention = nn.Sequential(
            nn.Linear(feature_dim, attention_dim),
            nn.Tanh(),
            nn.Linear(attention_dim, 1)
        )
        
        # Classifier
        self.fc = nn.Sequential(
            nn.Linear(feature_dim, 256),
            nn.ReLU(),
            nn.Dropout(fc_dropout),
            nn.Linear(256, num_classes)
        )
    
    def forward(self, x):
        # x shape: [batch_size, num_frames, C, H, W]
        batch_size, num_frames, C, H, W = x.shape
        
        # Reshape to process all frames: [batch_size * num_frames, C, H, W]
        x = x.view(batch_size * num_frames, C, H, W)
        
        # Extract features from CNN
        features = self.cnn(x)  # [batch_size * num_frames, feature_dim]
        
        # Reshape back to sequence: [batch_size, num_frames, feature_dim]
        features = features.view(batch_size, num_frames, -1)
        
        # Attention mechanism
        attn_scores = self.attention(features)  # [batch_size, num_frames, 1]
        attn_weights = F.softmax(attn_scores, dim=1)  # [batch_size, num_frames, 1]
        
        # Weighted sum of features
        context = torch.sum(features * attn_weights, dim=1)  # [batch_size, feature_dim]
        
        # Classification
        output = self.fc(context)  # [batch_size, num_classes]
        
        return output

print("✅ InceptionResNetV2 + Attention Model architecture defined!")
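
**Sketch: attention pooling on toy numbers.** In `forward`, each frame gets one score, the scores are softmaxed across frames, and the frame features are summed with those weights. The block below repeats those two steps in plain Python on three 2-dimensional "frame features". In the model the scores come from the `Linear -> Tanh -> Linear` block, whereas here they are set by hand. `attention_pool` and the toy values are illustrative assumptions.

```python
import math


def attention_pool(frame_features, scores):
    """Softmax the per-frame scores, then return (weights, weighted-sum feature)."""
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frame_features[0])
    pooled = [
        sum(w * feat[d] for w, feat in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, pooled


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores: attention pooling reduces to a plain average over frames
uniform_w, uniform_pooled = attention_pool(features, [0.0, 0.0, 0.0])
# One dominant score: the pooled vector is almost exactly that frame's feature
peaked_w, peaked_pooled = attention_pool(features, [0.0, 0.0, 10.0])

print("uniform weights:", [round(w, 3) for w in uniform_w])
print("uniform pooled: ", [round(v, 3) for v in uniform_pooled])
print("peaked weights: ", [round(w, 3) for w in peaked_w])
print("peaked pooled:  ", [round(v, 3) for v in peaked_pooled])
```

This is the sense in which the model can "focus" on a few frames: a frame with a much larger score dominates the video-level feature that reaches the classifier.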

In [None]:
# Test model instantiation
print("🧪 Testing InceptionResNetV2 + Attention model...\n")

model = CNN_Attention_Model(
    feature_dim=config.FEATURE_DIM,
    attention_dim=config.ATTENTION_DIM,
    fc_dropout=config.FC_DROPOUT
)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

# Test forward pass (eval mode and no_grad keep this smoke test light on memory)
dummy_input = torch.randn(1, config.FRAMES_PER_VIDEO, 3, config.IMG_SIZE, config.IMG_SIZE)
model.eval()
with torch.no_grad():
    output = model(dummy_input)
print(f"Output shape: {output.shape}")
print(f"\n✅ Model working correctly!\n")

del model
torch.cuda.empty_cache()

## 🔍 STEP 6: Hyperparameter Tuning with Optuna

## 📊 STEP 6A: Create 70:30 Train/Validation Split (BEFORE Optuna)

**CRITICAL: Split the data FIRST so validation videos cannot leak into hyperparameter tuning**
- Creates train_df (420 videos, 70%)
- Creates val_df (180 videos, 30%)
- Optuna will ONLY use train_df for hyperparameter tuning
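
**Sketch: the split sizes, worked out.** The counts above follow from simple arithmetic on the 300 real and 300 fake videos, assuming the stratified split allocates each class proportionally. The second half reproduces the tuning subset used by the Optuna objective in this notebook (20% of `train_df`, then an 80/20 cut). The variable names are illustrative.

```python
videos_per_class = {"real": 300, "fake": 300}
train_percent = 70

# Stratified 70:30 split, computed per class with integer arithmetic
train_counts = {label: n * train_percent // 100 for label, n in videos_per_class.items()}
val_counts = {label: n - train_counts[label] for label, n in videos_per_class.items()}
n_train = sum(train_counts.values())
n_val = sum(val_counts.values())

# Optuna tuning subset: 20% of the training videos, then an 80/20 cut
n_tune = n_train * 20 // 100
n_tune_train = int(n_tune * 0.8)
n_tune_val = n_tune - n_tune_train

print(f"Train: {n_train} videos {train_counts}")
print(f"Val:   {n_val} videos {val_counts}")
print(f"Optuna subset: {n_tune} videos -> {n_tune_train} train / {n_tune_val} val")
```

Each Optuna trial is therefore scored on only 17 videos, so a single misclassified video moves the tuning accuracy by about 6 percentage points. The 180-video validation set is held back for the full training run.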

In [None]:
# Create 70:30 train/validation split BEFORE Optuna
# This ensures Optuna ONLY sees training data, preventing any data leak
print("📊 Creating 70:30 Train/Validation Split...")
print(f"Total videos: {len(labels_df)}")
print(f"Train split: {config.TRAIN_VAL_SPLIT*100:.0f}%")
print(f"Val split: {(1-config.TRAIN_VAL_SPLIT)*100:.0f}%\n")

train_df, val_df = train_test_split(
    labels_df,
    train_size=config.TRAIN_VAL_SPLIT,
    stratify=labels_df['label'],
    random_state=42
)

# Display split statistics
print(f"✅ Split created successfully!")
print(f"\nTrain set: {len(train_df)} videos")
print(f"  - Real: {len(train_df[train_df['label']==0])}")
print(f"  - Fake: {len(train_df[train_df['label']==1])}")

print(f"\nValidation set: {len(val_df)} videos")
print(f"  - Real: {len(val_df[val_df['label']==0])}")
print(f"  - Fake: {len(val_df[val_df['label']==1])}")

print(f"\nClass balance preserved: {len(train_df[train_df['label']==1])/len(train_df)*100:.1f}% fake in train, {len(val_df[val_df['label']==1])/len(val_df)*100:.1f}% fake in val")

print("\n" + "="*70)
print("⚠️ IMPORTANT: Optuna will ONLY use train_df (420 videos)")
print("             Validation data (180 videos) will NOT be touched by Optuna")
print("="*70 + "\n")

## 🔍 STEP 6B: Hyperparameter Tuning with Optuna (Using train_df ONLY)

In [None]:
# Training function with gradient accumulation and hardware safety monitoring
def train_one_epoch(model, dataloader, criterion, optimizer, scaler, device, epoch_num=0):
    model.train()
    running_loss = 0.0
    all_preds = []
    all_labels = []
    batch_count = 0
    accumulation_step = 0  # Track gradient accumulation
    
    # Initialize gradients
    optimizer.zero_grad()
    
    pbar = tqdm(dataloader, desc=f'Training Epoch {epoch_num+1}')
    
    for batch_idx, (frames, labels) in enumerate(pbar):
        # Thermal safety check every N batches
        if batch_idx % config.CHECK_GPU_TEMP_EVERY_N_BATCHES == 0 and config.ENABLE_THERMAL_MONITORING:
            is_safe, temp = gpu_monitor.check_thermal_safety()
            if not is_safe:
                logger.warning(f"GPU temperature {temp}°C exceeds threshold. Pausing...")
                gpu_monitor.wait_for_cooling()
        
        # VRAM safety check
        if config.ENABLE_VRAM_MONITORING and batch_idx % 20 == 0:
            if not gpu_monitor.check_vram_safety():
                torch.cuda.empty_cache()
        
        try:
            frames, labels = frames.to(device), labels.to(device)
            
            # Mixed precision training
            with autocast(enabled=config.MIXED_PRECISION):
                outputs = model(frames)
                loss = criterion(outputs, labels)
                # Normalize loss for gradient accumulation
                loss = loss / config.ACCUMULATION_STEPS
            
            # Backward pass - accumulate gradients
            scaler.scale(loss).backward()
            
            accumulation_step += 1
            
            # Update weights only every ACCUMULATION_STEPS batches
            if accumulation_step % config.ACCUMULATION_STEPS == 0:
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad()
            
            # Track metrics (use original loss scale for logging)
            running_loss += loss.item() * config.ACCUMULATION_STEPS
            preds = torch.argmax(outputs, dim=1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
            
            batch_count += 1
            pbar.set_postfix({
                'loss': loss.item() * config.ACCUMULATION_STEPS, 
                'accum': f"{accumulation_step % config.ACCUMULATION_STEPS}/{config.ACCUMULATION_STEPS}",
                'temp': f"{gpu_monitor.get_gpu_temp()}°C"
            })
            
        except RuntimeError as e:
            if "out of memory" in str(e):
                logger.error(f"OOM Error in batch {batch_idx}. Clearing cache and skipping batch.")
                torch.cuda.empty_cache()
                # Reset gradients on OOM to avoid corrupted state
                optimizer.zero_grad()
                accumulation_step = 0
                continue
            else:
                raise e
    
    # Final optimizer step if there are remaining accumulated gradients
    if accumulation_step % config.ACCUMULATION_STEPS != 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
    
    epoch_loss = running_loss / max(batch_count, 1)
    epoch_acc = accuracy_score(all_labels, all_preds)
    
    return epoch_loss, epoch_acc

def validate(model, dataloader, criterion, device):
    model.eval()
    running_loss = 0.0
    all_preds = []
    all_labels = []
    all_probs = []
    batch_count = 0
    
    with torch.no_grad():
        pbar = tqdm(dataloader, desc='Validation')
        for batch_idx, (frames, labels) in enumerate(pbar):
            try:
                frames, labels = frames.to(device), labels.to(device)
                
                with autocast(enabled=config.MIXED_PRECISION):
                    outputs = model(frames)
                    loss = criterion(outputs, labels)
                
                running_loss += loss.item()
                batch_count += 1
                probs = F.softmax(outputs, dim=1)
                preds = torch.argmax(outputs, dim=1)
                
                all_preds.extend(preds.cpu().numpy())
                all_labels.extend(labels.cpu().numpy())
                all_probs.extend(probs[:, 1].cpu().numpy())
                
                pbar.set_postfix({'loss': loss.item()})
                
            except RuntimeError as e:
                if "out of memory" in str(e):
                    logger.error(f"OOM Error in validation batch {batch_idx}. Clearing cache and skipping.")
                    torch.cuda.empty_cache()
                    continue
                else:
                    raise e
    
    # Handle edge case of all batches failing
    if batch_count == 0 or len(all_labels) == 0:
        logger.error("No valid validation batches processed!")
        return 0.0, 0.0, 0.0, 0.0
    
    epoch_loss = running_loss / max(batch_count, 1)
    epoch_acc = accuracy_score(all_labels, all_preds)
    epoch_f1 = f1_score(all_labels, all_preds, average='binary')
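    # roc_auc_score raises ValueError when only one class is present, so guard small validation sets
    epoch_auc = roc_auc_score(all_labels, all_probs) if len(set(all_labels)) > 1 else float('nan')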
    
    return epoch_loss, epoch_acc, epoch_f1, epoch_auc

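# Graceful shutdown handler
# Note: this handler does not write a checkpoint itself; the best checkpoint
# written during training (if any) stays on disk in SAVE_PATH.
def signal_handler(sig, frame):
    logger.info("\n🛑 Ctrl+C detected. Exiting gracefully...")
    print("\n🛑 Training interrupted. The last best checkpoint saved during training (if any) remains on disk.")
    sys.exit(0)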

# Register signal handler
signal.signal(signal.SIGINT, signal_handler)

print("✅ Training and validation functions defined with hardware safety features!")
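
**Sketch: why the loss is divided by `ACCUMULATION_STEPS`.** `train_one_epoch` scales each micro-batch loss by `1 / ACCUMULATION_STEPS` before calling `backward()`, so the summed gradients match the gradient of the mean loss over a batch of four. The block below checks this on a one-parameter model `y = w * x` with squared error, using hand-written gradients rather than PyTorch. The data points and function name are made up for the illustration.

```python
def grad_squared_error(w, x, y):
    """Derivative of (w * x - y) ** 2 with respect to w, for one sample."""
    return 2.0 * (w * x - y) * x


w = 0.5
samples = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0)]  # hypothetical (x, y) pairs
accumulation_steps = len(samples)

# Full batch: gradient of the mean loss over all four samples
full_batch_grad = sum(grad_squared_error(w, x, y) for x, y in samples) / len(samples)

# Accumulation: one sample at a time, each loss scaled by 1 / accumulation_steps
accumulated_grad = 0.0
for x, y in samples:
    accumulated_grad += grad_squared_error(w, x, y) / accumulation_steps

print(f"Full-batch gradient:  {full_batch_grad}")
print(f"Accumulated gradient: {accumulated_grad}")
```

The equivalence covers the gradient only. Layers that compute statistics per forward pass, such as batch normalization, still see one video (24 frames) at a time.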

In [None]:
# Optuna objective function for hyperparameter tuning
def optuna_objective(trial):
    """
    Optimize hyperparameters using InceptionResNetV2 + Attention
    DATA LEAK PREVENTION: Uses ONLY train_df (already split), validation data never touched
    """
    # Suggest hyperparameters
    lr = trial.suggest_float('learning_rate', 1e-5, 1e-3, log=True)
    attention_dim = trial.suggest_categorical('attention_dim', [128, 256, 512])
    fc_dropout = trial.suggest_float('fc_dropout', 0.3, 0.7)
    
    # Use 20% of train_df for Optuna tuning (~84 videos from 420 training videos)
    # val_df (180 videos) is NEVER touched by Optuna
    tune_df = train_df.sample(frac=0.2, random_state=42)
    
    # Split Optuna data into train/val (80/20 split = 67/17 videos)
    tune_train_df = tune_df.iloc[:int(len(tune_df)*0.8)]
    tune_val_df = tune_df.iloc[int(len(tune_df)*0.8):]
    
    # Create datasets
    train_dataset = VideoDataset(
        tune_train_df, config.IMG_SIZE, 
        config.FRAMES_PER_VIDEO, is_train=True
    )
    val_dataset = VideoDataset(
        tune_val_df, config.IMG_SIZE, 
        config.FRAMES_PER_VIDEO, is_train=False
    )
    
    # Use num_workers=0 to avoid multiprocessing issues on Windows
    train_loader = DataLoader(train_dataset, batch_size=config.BATCH_SIZE, 
                             shuffle=True, num_workers=0)
    val_loader = DataLoader(val_dataset, batch_size=config.BATCH_SIZE, 
                           shuffle=False, num_workers=0)
    
    # Create model with tuned parameters
    model = CNN_Attention_Model(
        feature_dim=config.FEATURE_DIM,
        attention_dim=attention_dim,
        fc_dropout=fc_dropout
    ).to(config.DEVICE)
    
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=config.WEIGHT_DECAY)
    scaler = GradScaler(enabled=config.MIXED_PRECISION)
    
    # Train for 5 epochs (quick tuning)
    best_val_acc = 0.0
    
    for epoch in range(5):
        train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, scaler, config.DEVICE, epoch_num=epoch)
        val_loss, val_acc, val_f1, val_auc = validate(model, val_loader, criterion, config.DEVICE)
        
        # Report intermediate value
        trial.report(val_acc, epoch)
        
        # Prune trial if not promising
        if trial.should_prune():
            raise optuna.TrialPruned()
        
        best_val_acc = max(best_val_acc, val_acc)
    
    return best_val_acc

print("✅ Optuna objective function defined with data leak prevention!")
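
**Sketch: the median pruning rule, simplified.** The objective reports `val_acc` after every epoch and lets the pruner stop unpromising trials. The function below is a simplified reading of the median rule for a maximized metric, not Optuna's implementation. It never prunes before `n_startup_trials` trials have completed or before `n_warmup_steps` steps have passed. After that, it prunes when the trial's best value so far is below the median of completed trials at the same step. `should_prune` and the accuracy histories are hypothetical.

```python
from statistics import median


def should_prune(current_values, completed_histories, n_startup_trials=3, n_warmup_steps=2):
    """Decide whether to stop a trial given its per-epoch values so far (maximize)."""
    step = len(current_values) - 1  # zero-based index of the latest report
    if len(completed_histories) < n_startup_trials:
        return False  # not enough finished trials to compare against
    if step < n_warmup_steps:
        return False  # too early in this trial to judge it
    peers = [history[step] for history in completed_histories if len(history) > step]
    if not peers:
        return False
    return max(current_values) < median(peers)


# Hypothetical per-epoch validation accuracies of three finished trials
completed = [[0.50, 0.55, 0.60], [0.52, 0.58, 0.65], [0.48, 0.50, 0.70]]

weak_trial = should_prune([0.50, 0.50, 0.55], completed)
strong_trial = should_prune([0.50, 0.60, 0.68], completed)
too_early = should_prune([0.40, 0.40], completed)
too_few_peers = should_prune([0.50, 0.50, 0.55], completed[:2])

print(f"weak trial pruned:      {weak_trial}")
print(f"strong trial pruned:    {strong_trial}")
print(f"pruned during warm-up:  {too_early}")
print(f"pruned with 2 finished: {too_few_peers}")
```

With `n_startup_trials=3`, five epochs per trial and a 30-minute timeout, only the later trials in a ten-trial study can be pruned at all.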

In [None]:
# Run hyperparameter tuning
print("🔍 Starting hyperparameter tuning with Optuna...")
print(f"Number of trials: {config.OPTUNA_N_TRIALS}")
print(f"Timeout: {config.OPTUNA_TIMEOUT} seconds ({config.OPTUNA_TIMEOUT/60:.1f} minutes)\n")

study = optuna.create_study(
    direction='maximize',
    pruner=MedianPruner(n_startup_trials=3, n_warmup_steps=2)
)

study.optimize(optuna_objective, n_trials=config.OPTUNA_N_TRIALS, timeout=config.OPTUNA_TIMEOUT)

# Get best hyperparameters
best_params = study.best_params
print("\n" + "="*50)
print("🏆 BEST HYPERPARAMETERS FOUND:")
print("="*50)
for key, value in best_params.items():
    print(f"{key}: {value}")
print(f"Best validation accuracy: {study.best_value:.4f}")
print("="*50)

# Update config with best parameters
config.LEARNING_RATE = best_params['learning_rate']
config.ATTENTION_DIM = best_params['attention_dim']
config.FC_DROPOUT = best_params['fc_dropout']

print("\n✅ Hyperparameters optimized and updated!")

In [None]:
# Full training function with early stopping
def train_model_full(train_df, val_df):
    """
    Train InceptionResNetV2 + Attention model on 70:30 split
    """
    print(f"\n{'='*70}")
    print(f"Training InceptionResNetV2 + Attention Pooling")
    print(f"{'='*70}\n")
    
    # Create datasets
    train_dataset = VideoDataset(
        train_df, config.IMG_SIZE, 
        config.FRAMES_PER_VIDEO, is_train=True
    )
    val_dataset = VideoDataset(
        val_df, config.IMG_SIZE, 
        config.FRAMES_PER_VIDEO, is_train=False
    )
    
    # Use num_workers=0 to avoid Windows multiprocessing issues
    train_loader = DataLoader(
        train_dataset, batch_size=config.BATCH_SIZE, 
        shuffle=True, num_workers=0, pin_memory=True
    )
    val_loader = DataLoader(
        val_dataset, batch_size=config.BATCH_SIZE, 
        shuffle=False, num_workers=0, pin_memory=True
    )
    
    # Create model with optimized hyperparameters from Optuna
    model = CNN_Attention_Model(
        feature_dim=config.FEATURE_DIM,
        attention_dim=config.ATTENTION_DIM,
        fc_dropout=config.FC_DROPOUT
    ).to(config.DEVICE)
    
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=config.LEARNING_RATE, 
                                  weight_decay=config.WEIGHT_DECAY)
    # `verbose` is omitted here because it is deprecated in recent PyTorch releases
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='max', factor=0.5, patience=2
    )
    scaler = GradScaler(enabled=config.MIXED_PRECISION)
    
    # Training tracking
    best_val_acc = 0.0
    best_model_wts = None
    patience_counter = 0
    history = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': [], 'val_f1': [], 'val_auc': []}
    
    # Training loop
    for epoch in range(config.NUM_EPOCHS):
        print(f"\nEpoch {epoch + 1}/{config.NUM_EPOCHS}")
        print("-" * 50)
        
        # Train
        train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, scaler, config.DEVICE, epoch_num=epoch)
        
        # Validate
        val_loss, val_acc, val_f1, val_auc = validate(model, val_loader, criterion, config.DEVICE)
        
        # Update scheduler
        scheduler.step(val_acc)
        
        # Save history
        history['train_loss'].append(train_loss)
        history['train_acc'].append(train_acc)
        history['val_loss'].append(val_loss)
        history['val_acc'].append(val_acc)
        history['val_f1'].append(val_f1)
        history['val_auc'].append(val_auc)
        
        print(f"\nTrain Loss: {train_loss:.4f} | Train Acc: {train_acc:.4f}")
        print(f"Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.4f} | Val F1: {val_f1:.4f} | Val AUC: {val_auc:.4f}")
        
        # Early stopping and model saving
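        if val_acc > best_val_acc:
            best_val_acc = val_acc
            # Clone to CPU: state_dict().copy() is shallow, so its tensors would keep changing as training continues
            best_model_wts = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
            patience_counter = 0
            print(f"✅ New best model! Val Acc: {val_acc:.4f}")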
            
            # Save checkpoint
            checkpoint_path = SAVE_PATH / config.MODEL_NAME
            torch.save({
                'epoch': epoch,
                'model_state_dict': best_model_wts,
                'optimizer_state_dict': optimizer.state_dict(),
                'val_acc': best_val_acc,
                'val_f1': val_f1,
                'val_auc': val_auc,
                'history': history,
                'config': {
                    'model_name': config.MODEL_NAME,
                    'feature_dim': config.FEATURE_DIM,
                    'attention_dim': config.ATTENTION_DIM,
                    'fc_dropout': config.FC_DROPOUT
                }
            }, checkpoint_path)
            print(f"💾 Model saved to {checkpoint_path}")
        else:
            patience_counter += 1
            print(f"Patience: {patience_counter}/{config.EARLY_STOP_PATIENCE}")
        
        # Early stopping
        if patience_counter >= config.EARLY_STOP_PATIENCE:
            print(f"\n⏹️ Early stopping triggered at epoch {epoch + 1}")
            break
    
    # Load best model
    if best_model_wts is not None:
        model.load_state_dict(best_model_wts)
    
    print(f"\n{'='*70}")
    print(f"✅ InceptionResNetV2 + Attention Training Complete!")
    print(f"Best Val Accuracy: {best_val_acc:.4f}")
    print(f"{'='*70}\n")
    
    return model, best_val_acc, history

print("✅ Full training function defined!")
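
The patience logic inside `train_model_full` can be easier to follow in isolation. The block below is a minimal sketch of it, not part of the notebook's pipeline: `EarlyStopper` and `epochs_run` are hypothetical names, and the accuracy trace is made up for illustration. It mirrors the strict `val_acc > best_val_acc` comparison, so an epoch that only ties the best score still counts against patience.

```python
from dataclasses import dataclass


@dataclass
class EarlyStopper:
    """Hypothetical helper mirroring the patience logic in the training loop."""
    patience: int
    best: float = 0.0
    bad_epochs: int = 0

    def update(self, val_acc: float) -> bool:
        """Record one epoch and return True when training should stop."""
        if val_acc > self.best:  # strict improvement, ties do not reset patience
            self.best = val_acc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_accs, patience):
    """Return (epochs executed, best accuracy) for a given accuracy trace."""
    stopper = EarlyStopper(patience=patience)
    for epoch, acc in enumerate(val_accs, start=1):
        if stopper.update(acc):
            return epoch, stopper.best
    return len(val_accs), stopper.best


# Made-up validation accuracies, for illustration only
trace = [0.60, 0.72, 0.71, 0.72, 0.70, 0.69, 0.80]
print(epochs_run(trace, patience=3))
```

With this trace and a patience of 3, the loop stops before reaching the final, higher value. That is the usual trade-off of early stopping: a larger `config.EARLY_STOP_PATIENCE` costs more GPU time but is less likely to cut off a late improvement.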

## 🚀 STEP 7: Train Model

**Training Configuration:**
- Model: InceptionResNetV2 + Attention Pooling
- Batch Size: 1 (with gradient accumulation × 4)
- Mixed Precision: FP16
- Hardware: RTX 3050 Laptop (4GB VRAM)
- **Data:** Using train_df (420 videos) and val_df (180 videos) from STEP 6A

**Estimated Time:** ~6-8 hours on RTX 3050 Laptop GPU
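
To see why a batch size of 1 with 4 accumulation steps behaves like a batch of 4, the sketch below works through the arithmetic on a toy one-parameter model in plain Python. It is an illustration only, under the assumption that each micro-batch loss is divided by the number of accumulation steps before its gradient is added; it does not reproduce `train_one_epoch`, and the function names and data are hypothetical.

```python
# Toy model: prediction = w * x, per-sample loss = (w * x - y) ** 2


def grad(w, x, y):
    """Derivative of the squared error with respect to w for one sample."""
    return 2.0 * (w * x - y) * x


def full_batch_grad(w, batch):
    """Gradient of the mean loss over the whole batch at once."""
    return sum(grad(w, x, y) for x, y in batch) / len(batch)


def accumulated_grad(w, batch, accumulation_steps):
    """Process one sample at a time, scaling each gradient by 1 / accumulation_steps."""
    total = 0.0
    for x, y in batch:
        total += grad(w, x, y) / accumulation_steps
    return total


# Made-up (x, y) pairs standing in for four micro-batches of size 1
batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]
w = 0.5

print(full_batch_grad(w, batch))
print(accumulated_grad(w, batch, accumulation_steps=4))
```

Both calls produce the same gradient, so the optimizer step taken after 4 accumulated micro-batches matches a single step on a batch of 4, while only one sample's activations need to fit in the 4GB of VRAM at a time.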

In [None]:
# Train InceptionResNetV2 + Attention model
print("🚀 Starting training...")
print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")
print(f"Model: InceptionResNetV2 + Attention Pooling")
print(f"Hyperparameters: LR={config.LEARNING_RATE:.6f}, Attention_Dim={config.ATTENTION_DIM}, FC_Dropout={config.FC_DROPOUT}\n")

# Record training start time
training_start_time = time.time()

# Train model
model, best_val_acc, history = train_model_full(train_df, val_df)

# Calculate training time
training_duration = time.time() - training_start_time
hours = int(training_duration // 3600)
minutes = int((training_duration % 3600) // 60)

print(f"\n{'='*70}")
print(f"🎉 TRAINING COMPLETE!")
print(f"{'='*70}")
print(f"Training time: {hours}h {minutes}m")
print(f"Best validation accuracy: {best_val_acc:.4f}")
print(f"Model saved to: {SAVE_PATH / config.MODEL_NAME}")
print(f"{'='*70}\n")

## 📈 STEP 8: Visualize Training Results

In [None]:
# Visualize training curves
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('InceptionResNetV2 + Attention - Training Results', fontsize=16, fontweight='bold')

# Plot 1: Training and Validation Accuracy
ax = axes[0, 0]
ax.plot(history['train_acc'], label='Training Accuracy', linewidth=2, color='#2E86AB')
ax.plot(history['val_acc'], label='Validation Accuracy', linewidth=2, color='#A23B72', linestyle='--')
ax.set_xlabel('Epoch', fontsize=12)
ax.set_ylabel('Accuracy', fontsize=12)
ax.set_title('Accuracy Curves', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Plot 2: Training and Validation Loss
ax = axes[0, 1]
ax.plot(history['train_loss'], label='Training Loss', linewidth=2, color='#2E86AB')
ax.plot(history['val_loss'], label='Validation Loss', linewidth=2, color='#A23B72', linestyle='--')
ax.set_xlabel('Epoch', fontsize=12)
ax.set_ylabel('Loss', fontsize=12)
ax.set_title('Loss Curves', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Plot 3: Validation F1 Score
ax = axes[1, 0]
ax.plot(history['val_f1'], label='Validation F1', linewidth=2, color='#F18F01')
ax.set_xlabel('Epoch', fontsize=12)
ax.set_ylabel('F1 Score', fontsize=12)
ax.set_title('F1 Score Over Time', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Plot 4: Validation AUC
ax = axes[1, 1]
ax.plot(history['val_auc'], label='Validation AUC', linewidth=2, color='#C73E1D')
ax.set_xlabel('Epoch', fontsize=12)
ax.set_ylabel('AUC', fontsize=12)
ax.set_title('AUC Over Time', fontsize=13, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(SAVE_PATH / 'training_curves.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"✅ Training curves saved to: {SAVE_PATH / 'training_curves.png'}")

In [None]:
# Create final summary
print("\n" + "="*70)
print("📊 FINAL TRAINING SUMMARY")
print("="*70)

# Find best epoch
best_epoch = history['val_acc'].index(max(history['val_acc'])) + 1
best_metrics = {
    'Best Epoch': best_epoch,
    'Best Val Accuracy': max(history['val_acc']),
    'Val F1 Score': history['val_f1'][best_epoch-1],
    'Val AUC': history['val_auc'][best_epoch-1],
    'Train Accuracy': history['train_acc'][best_epoch-1],
    'Train Loss': history['train_loss'][best_epoch-1],
    'Val Loss': history['val_loss'][best_epoch-1]
}

# Print summary
print(f"\nModel: InceptionResNetV2 + Attention Pooling")
print(f"Training Duration: {hours}h {minutes}m")
print(f"\nBest Performance (Epoch {best_epoch}):")
print(f"  {'Metric':<20} {'Value':>10}")
print(f"  {'-'*30}")
for metric, value in best_metrics.items():
    if metric == 'Best Epoch':
        print(f"  {metric:<20} {value:>10}")
    else:
        print(f"  {metric:<20} {value:>10.4f}")

print(f"\nModel saved to: {SAVE_PATH / config.MODEL_NAME}")
print("="*70)

# Save summary to JSON
import json
summary_dict = {
    'model': 'InceptionResNetV2 + Attention',
    'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'training_duration_hours': round(training_duration / 3600, 2),
    'hyperparameters': {
        'learning_rate': config.LEARNING_RATE,
        'attention_dim': config.ATTENTION_DIM,
        'fc_dropout': config.FC_DROPOUT,
        'batch_size': config.BATCH_SIZE,
        'accumulation_steps': config.ACCUMULATION_STEPS,
        'num_epochs': config.NUM_EPOCHS,
        'frames_per_video': config.FRAMES_PER_VIDEO,
        'img_size': config.IMG_SIZE
    },
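    # Cast metric values to plain floats so json.dump does not fail on NumPy/Torch scalars
    'best_metrics': {k: (v if k == 'Best Epoch' else float(v)) for k, v in best_metrics.items()},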
    'history': {
        'train_loss': [float(x) for x in history['train_loss']],
        'train_acc': [float(x) for x in history['train_acc']],
        'val_loss': [float(x) for x in history['val_loss']],
        'val_acc': [float(x) for x in history['val_acc']],
        'val_f1': [float(x) for x in history['val_f1']],
        'val_auc': [float(x) for x in history['val_auc']]
    }
}

with open(SAVE_PATH / 'training_summary.json', 'w') as f:
    json.dump(summary_dict, f, indent=2)

print(f"\n✅ Summary saved to: {SAVE_PATH / 'training_summary.json'}")
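
As a quick sanity check, the summary file can be read back with the standard library. The sketch below assumes the `best_metrics` and `history` keys written by the cell above; `load_summary` is a hypothetical helper, and the demo file holds made-up numbers so the block runs on its own without the real `training_summary.json`.

```python
import json
import tempfile
from pathlib import Path


def load_summary(path):
    """Read a training summary and return (best_metrics, number of epochs recorded)."""
    with open(path) as f:
        summary = json.load(f)
    return summary["best_metrics"], len(summary["history"]["val_acc"])


# Stand-in file with made-up values, for illustration only
demo = {
    "best_metrics": {"Best Epoch": 2, "Best Val Accuracy": 0.75},
    "history": {"val_acc": [0.5, 0.75, 0.7]},
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "training_summary.json"
    demo_path.write_text(json.dumps(demo))
    metrics, n_epochs = load_summary(demo_path)

print(metrics["Best Epoch"], n_epochs)
```

In the notebook itself, the same helper would be pointed at `SAVE_PATH / 'training_summary.json'` instead of the temporary file.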

## ✅ Training Complete!

### 📦 Saved Files:
- `inception_resnet_v2_best.pt` - Best model checkpoint (saved at `models/inception_resnet_v2_best.pt`)
- `training_curves.png` - Training visualization (4 panels: Accuracy, Loss, F1, AUC)
- `training_summary.json` - Detailed metrics and history

### 🔄 Next Steps:
1. Review training curves and metrics above
2. Open `INFERENCE_PIPELINE.ipynb`
3. Load the trained model
4. Generate predictions on test set
5. Create `PREDICTIONS.CSV` for submission