# Phase 2: Feature Extraction with TimeSformer (MIL-Ready)
## Weakly Supervised Video Anomaly Detection using TimeSformer and MIL

This notebook implements **Phase 2**: Extracting 768-dimensional features from video **clips** using a pretrained **TimeSformer** model.

### 🔄 Critical Update: "Bag of Instances" Architecture
This version is aligned with the **Sliding Window + Dilated Sampling** approach from Phase 1:
- **Input**: Each video typically has **50-200 clip subfolders** (clip_0000, clip_0001, ...); very long videos can have thousands and are capped at `MAX_CLIPS_PER_VIDEO` in Cell 4
- **Each Clip**: Contains **16 frames** spanning ~2.5 seconds of video
- **Output**: Feature matrix of shape `(Num_Clips, 768)` per video - the **"Bag"** for MIL

### Pipeline Overview:
1. **GPU Setup & Verification** - Ensure CUDA is available
2. **Load Phase 1 Metadata** - Read `dataset_metadata.json`
3. **TimeSformer Model** - Load pretrained model (frozen weights)
4. **Batch Feature Extraction** - Process clips in batches (GPU-efficient)
5. **Save Feature Bags** - Store `(N_clips, 768)` arrays for Phase 3 (MIL Training)
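
Steps 4 and 5 amount to stacking per-batch feature arrays into one bag per video. A minimal sketch with random stand-in features (the `assemble_bag` helper and the batch sizes are illustrative assumptions, not part of the notebook):

```python
import numpy as np

FEATURE_DIM = 768  # matches the [CLS] feature size used in this notebook


def assemble_bag(batch_features):
    """Concatenate per-batch arrays of shape (B_i, 768) into one (N_clips, 768) bag."""
    return np.concatenate(batch_features, axis=0)


# Stand-in for model outputs: 20 clips processed in batches of 8, 8 and 4.
rng = np.random.default_rng(0)
batches = [rng.standard_normal((b, FEATURE_DIM)).astype(np.float32) for b in (8, 8, 4)]

bag = assemble_bag(batches)
print(bag.shape)  # (20, 768)
```

The last batch is smaller than the others, which is why the bag length follows the clip count rather than a multiple of the batch size.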

### Expected Input (from Phase 1):
```
Processed_Clips/
├── Explosion/
│   ├── Explosion001/
│   │   ├── clip_0000/  (16 frames: img_000.jpg ... img_015.jpg)
│   │   ├── clip_0001/
│   │   └── ... (50-200 clips)
│   └── Explosion002/
├── Fighting/
└── Normal/
```

### Expected Output:
```
TimeSformer_Features/
├── Explosion/
│   ├── Explosion001.npy  → Shape: (num_clips, 768)
│   └── Explosion002.npy
├── Fighting/
└── Normal/
```
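
Each output file can be read back with `np.load`. The sketch below writes a stand-in bag to a temporary directory so it runs without the real dataset; the clip count of 87 is made up for illustration.

```python
import os
import tempfile

import numpy as np

# Write a stand-in bag so the example runs without the real dataset.
tmp_dir = tempfile.mkdtemp()
bag_path = os.path.join(tmp_dir, "Explosion001.npy")
np.save(bag_path, np.zeros((87, 768), dtype=np.float32))

# Load the bag: one row per clip, one column per feature dimension.
bag = np.load(bag_path)
num_clips, feature_dim = bag.shape
print(num_clips, feature_dim)  # 87 768
```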

## Cell 1: Imports & Configuration
Sets up paths, model parameters, and verifies GPU availability.
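
When tuning `BATCH_SIZE`, it can help to know how large one input batch is. The sketch below only counts the float32 input tensor of shape `(B, 16, 3, 224, 224)`; activations inside the transformer take far more memory, so treat the result as a lower bound. The helper name `batch_input_mib` is hypothetical.

```python
def batch_input_mib(batch_size, num_frames=16, channels=3, height=224, width=224,
                    bytes_per_value=4):
    """Size in MiB of one float32 input batch of shape (B, T, C, H, W)."""
    n_values = batch_size * num_frames * channels * height * width
    return n_values * bytes_per_value / (1024 ** 2)


for bs in (8, 16):
    print(bs, batch_input_mib(bs))  # 8 -> 73.5, 16 -> 147.0
```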

In [1]:
"""
Phase 2: Feature Extraction with TimeSformer (MIL-Ready)
Cell 1: Imports & Configuration
"""

import os
import json
import torch
import numpy as np
from PIL import Image
from tqdm.notebook import tqdm
from transformers import AutoImageProcessor, TimesformerModel
from torch.utils.data import Dataset, DataLoader
import time
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# ================= CONFIGURATION =================
# Input: Metadata file from Phase 1 (CRITICAL: Updated path)
METADATA_PATH = r"C:\UCF_video_dataset\Processed_Clips\dataset_metadata.json"

# Output: Where to save the feature BAGS (.npy files)
FEATURE_OUTPUT_DIR = r"C:\UCF_video_dataset\TimeSformer_Features"

# Model Settings
MODEL_CKPT = "facebook/timesformer-base-finetuned-k400"
FEATURE_DIM = 768  # TimeSformer [CLS] token dimension

# Frame parameters (MUST match Phase 1)
NUM_FRAMES_PER_CLIP = 16  # Changed from 32 to 16 (aligned with Phase 1)

# Processing Settings (Adjust BATCH_SIZE based on your GPU VRAM)
BATCH_SIZE = 8  # Safe for RTX 3080 Ti (12GB). Try 16 if stable.
NUM_WORKERS = 0  # Use 0 for Windows stability, 4 for Linux

# ================= SYSTEM CHECK =================
def check_gpu_status():
    print("\n" + "="*70)
    print("GPU STATUS CHECK")
    print("="*70)
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    print(f"\n📦 PyTorch Version: {torch.__version__}")
    print(f"🔧 CUDA Available: {torch.cuda.is_available()}")
    
    if torch.cuda.is_available():
        gpu_name = torch.cuda.get_device_name(0)
        total_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)
        
        print(f"\n🖥️  GPU Device: {gpu_name}")
        print(f"💾 Total Memory: {total_memory:.2f} GB")
        print(f"\n✅ GPU IS READY FOR FEATURE EXTRACTION!")
    else:
        print("\n❌ NO GPU AVAILABLE - Will use CPU (MUCH SLOWER!)")
    
    print("="*70)
    return device

# Run GPU check
DEVICE = check_gpu_status()

# Print configuration summary
print(f"\n🎯 Using device: {DEVICE}")
print(f"\n📁 Configuration:")
print(f"   Metadata: {METADATA_PATH}")
print(f"   Output: {FEATURE_OUTPUT_DIR}")
print(f"   Model: {MODEL_CKPT}")
print(f"   Frames/Clip: {NUM_FRAMES_PER_CLIP}")
print(f"   Batch Size: {BATCH_SIZE}")
print("="*70)


GPU STATUS CHECK

📦 PyTorch Version: 2.7.1+cu118
🔧 CUDA Available: True

🖥️  GPU Device: NVIDIA GeForce RTX 3080 Ti
💾 Total Memory: 12.00 GB

✅ GPU IS READY FOR FEATURE EXTRACTION!

🎯 Using device: cuda

📁 Configuration:
   Metadata: C:\UCF_video_dataset\Processed_Clips\dataset_metadata.json
   Output: C:\UCF_video_dataset\TimeSformer_Features
   Model: facebook/timesformer-base-finetuned-k400
   Frames/Clip: 16
   Batch Size: 8


## Cell 2: The Dataset Class (Handling "Bag of Clips" Logic)

This class does the per-video data loading: it opens one video folder, finds all of its clip sub-folders, and preprocesses each clip for the model.
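
Clip discovery relies on `sorted()` giving temporal order, which holds only because the folder names are zero-padded. A small standalone sketch (the `list_clip_folders` helper and the throwaway folder are illustrative, not the notebook's class):

```python
import os
import tempfile


def list_clip_folders(video_dir):
    """Return clip sub-folder names in temporal order (relies on zero-padded names)."""
    return sorted(
        d for d in os.listdir(video_dir)
        if d.startswith("clip_") and os.path.isdir(os.path.join(video_dir, d))
    )


# Build a throwaway video folder with clips created out of order.
video_dir = tempfile.mkdtemp()
for name in ("clip_0010", "clip_0002", "clip_0000", "notes"):
    os.makedirs(os.path.join(video_dir, name))

print(list_clip_folders(video_dir))  # ['clip_0000', 'clip_0002', 'clip_0010']

# Without zero padding, lexicographic order no longer matches temporal order.
print(sorted(["clip_2", "clip_10"]))  # ['clip_10', 'clip_2']
```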

In [2]:
"""
Cell 2: VideoClipsDataset - Handles the "Bag of Clips" Structure
"""

class VideoClipsDataset(Dataset):
    """
    Indexes the clips of a single video folder; each item is one preprocessed clip.
    Each video folder contains: clip_0000/, clip_0001/, ... clip_NNNN/
    Each clip folder contains: img_000.jpg, img_001.jpg, ... img_015.jpg (16 frames)
    """
    
    def __init__(self, clips_root_path, processor):
        """
        Args:
            clips_root_path: Path to video folder (e.g., .../Explosion/Explosion001/)
            processor: HuggingFace AutoImageProcessor for TimeSformer
        """
        self.clips_root = clips_root_path
        self.processor = processor
        
        # Find all valid clip folders (clip_0000, clip_0001, ...)
        self.clip_folders = sorted([
            d for d in os.listdir(clips_root_path) 
            if os.path.isdir(os.path.join(clips_root_path, d)) and d.startswith("clip_")
        ])
        
    def __len__(self):
        return len(self.clip_folders)
    
    def __getitem__(self, idx):
        """
        Load a single clip (16 frames) and preprocess for TimeSformer.
        
        Returns:
            Tensor of shape (16, 3, 224, 224), i.e. (frames, channels, height, width),
            which is the per-clip layout TimesformerModel expects
        """
        clip_name = self.clip_folders[idx]
        clip_path = os.path.join(self.clips_root, clip_name)
        
        # Load the 16 images for this clip
        images = []
        # Sort ensures temporal order (img_000.jpg, img_001.jpg, ...)
        filenames = sorted([f for f in os.listdir(clip_path) if f.endswith(".jpg")])
        
        for fname in filenames:
            img_path = os.path.join(clip_path, fname)
            img = Image.open(img_path).convert("RGB")
            images.append(img)
        
        # Validation: Must have exactly 16 frames
        if len(images) != NUM_FRAMES_PER_CLIP:
            print(f"⚠️ Warning: {clip_name} has {len(images)} frames. Expected {NUM_FRAMES_PER_CLIP}.")
            # Return zeros with the same layout as a real clip so batching still works.
            # NOTE: nothing downstream filters these out; the saved feature row for
            # this clip comes from a blank input.
            return torch.zeros((NUM_FRAMES_PER_CLIP, 3, 224, 224))
        
        # Preprocess using HuggingFace Processor
        # TimeSformer expects list of PIL images
        inputs = self.processor(images=images, return_tensors="pt")
        
        # Shape: [1, 16, 3, 224, 224] -> Remove batch dim -> [16, 3, 224, 224]
        return inputs['pixel_values'].squeeze(0)


print("✅ VideoClipsDataset class defined.")
print("   This class handles the 'Bag of Clips' structure from Phase 1.")

✅ VideoClipsDataset class defined.
   This class handles the 'Bag of Clips' structure from Phase 1.


## Cell 3: Load TimeSformer Model

Loads the pretrained TimeSformer from Facebook and **freezes weights** (we're only extracting features, not training).
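
The freezing pattern can be tried on a tiny stand-in network before loading the full model. This sketch uses a small `nn.Sequential` as a placeholder for `TimesformerModel`; the layer sizes are arbitrary.

```python
import torch
from torch import nn

# Tiny stand-in network; the notebook applies the same steps to TimesformerModel.
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))

model.eval()  # switch layers such as dropout to inference behaviour
for param in model.parameters():
    param.requires_grad = False  # exclude weights from autograd

with torch.no_grad():  # skip building the autograd graph during the forward pass
    features = model(torch.randn(3, 8))

print(tuple(features.shape), features.requires_grad)  # (3, 2) False
```

Setting `requires_grad = False` and using `torch.no_grad()` overlap for pure inference; keeping both makes the intent explicit and avoids storing activations for backpropagation.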

In [3]:
"""
Cell 3: Load TimeSformer Model
"""

print("="*70)
print("LOADING TIMESFORMER MODEL")
print("="*70)

print(f"\n⏳ Downloading/Loading: {MODEL_CKPT}")
print("   This may take a few minutes on first run...")

start_time = time.time()

# Load image processor (handles normalization, resizing)
processor = AutoImageProcessor.from_pretrained(MODEL_CKPT)
print("   ✓ Image processor loaded")

# Load model
model = TimesformerModel.from_pretrained(MODEL_CKPT)
print("   ✓ Model loaded")

# Move to GPU and set to evaluation mode
model.to(DEVICE)
model.eval()
print(f"   ✓ Model moved to {DEVICE}")

# CRITICAL: Freeze all weights (no gradient computation)
for param in model.parameters():
    param.requires_grad = False
print("   ✓ Weights frozen (no training, only feature extraction)")

elapsed = time.time() - start_time

# Print model info
total_params = sum(p.numel() for p in model.parameters())
print(f"\n📊 Model Statistics:")
print(f"   Total Parameters: {total_params:,}")
print(f"   Model Size: ~{total_params * 4 / (1024**3):.2f} GB (FP32)")
print(f"   Load Time: {elapsed:.2f} seconds")

# GPU memory check
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated(0) / (1024**3)
    print(f"\n💾 GPU Memory Used: {allocated:.2f} GB")

print(f"\n{'='*70}")
print("✅ TIMESFORMER READY FOR FEATURE EXTRACTION!")
print(f"{'='*70}")

LOADING TIMESFORMER MODEL

⏳ Downloading/Loading: facebook/timesformer-base-finetuned-k400
   This may take a few minutes on first run...


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


   ✓ Image processor loaded
   ✓ Model loaded
   ✓ Model moved to cuda
   ✓ Weights frozen (no training, only feature extraction)

📊 Model Statistics:
   Total Parameters: 121,258,752
   Model Size: ~0.45 GB (FP32)
   Load Time: 2.34 seconds

💾 GPU Memory Used: 0.45 GB

✅ TIMESFORMER READY FOR FEATURE EXTRACTION!


## Cell 4: Main Feature Extraction Loop

Reads the `dataset_metadata.json` from Phase 1 and processes every video.  
For each video, it creates a **feature bag** of shape `(num_clips, 768)`.
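
The extraction loop in this cell also caps very long videos at `MAX_CLIPS_PER_VIDEO` by picking evenly spaced clips with `np.linspace`. A minimal standalone sketch of that sampling (the helper name `sample_clip_indices` is illustrative):

```python
import numpy as np


def sample_clip_indices(num_clips, max_clips):
    """Return evenly spaced clip indices, keeping every clip if under the cap."""
    if num_clips <= max_clips:
        return np.arange(num_clips)
    return np.linspace(0, num_clips - 1, max_clips, dtype=int)


idx = sample_clip_indices(num_clips=15000, max_clips=500)
print(len(idx), idx[0], idx[-1])  # 500 0 14999
```

Uniform sampling keeps the first and last clip and preserves temporal order, so the bag still covers the whole video at a coarser stride.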

In [4]:
"""
Cell 4: Main Feature Extraction Loop - OPTIMIZED with Clip Limit
Key fixes:
1. MAX_CLIPS_PER_VIDEO - limits huge videos (some have 15000+ clips!)
2. Progress bar within video - see progress on large videos
3. Per-video error handling - a failing video is logged and counted instead of stopping the run
"""

# ===== CRITICAL: Clip limit for huge videos =====
MAX_CLIPS_PER_VIDEO = 500  # Limit clips per video (500 is plenty for MIL)
SHOW_PROGRESS_THRESHOLD = 50  # Show inner progress bar if video has > 50 clips

def extract_features_optimized():
    """
    OPTIMIZED extraction with clip limiting for huge videos.
    """
    
    # ========== 1. LOAD METADATA ==========
    print("\n" + "="*70)
    print("PHASE 2: FEATURE EXTRACTION (OPTIMIZED)")
    print("="*70)
    
    if not os.path.exists(METADATA_PATH):
        print(f"❌ Error: Metadata file not found!")
        print(f"   Expected: {METADATA_PATH}")
        return

    with open(METADATA_PATH, 'r') as f:
        video_list = json.load(f)
        
    print(f"\n📂 Found {len(video_list)} videos to process.")
    print(f"⚡ Max clips per video: {MAX_CLIPS_PER_VIDEO} (limits huge videos)")
    
    # Count already processed
    already_done = 0
    for vm in video_list:
        save_path = os.path.join(FEATURE_OUTPUT_DIR, vm['class_name'], f"{vm['video_name']}.npy")
        if os.path.exists(save_path):
            already_done += 1
    print(f"✅ Already processed: {already_done} videos (will be skipped)")
    print(f"📋 Remaining: {len(video_list) - already_done} videos")
    
    os.makedirs(FEATURE_OUTPUT_DIR, exist_ok=True)
    
    # ========== 2. TRACKING ==========
    results = {
        'successful': 0,
        'failed': 0,
        'skipped': 0,
        'clipped': 0,  # Videos that were clip-limited
        'total_clips_processed': 0,
        'videos': []
    }
    
    start_time = time.time()
    
    # ========== 3. PROCESS VIDEOS ==========
    print("\n🚀 Starting extraction...\n")
    
    pbar = tqdm(video_list, desc="Processing Videos", unit="vid")
    
    for video_meta in pbar:
        video_name = video_meta['video_name']
        class_name = video_meta['class_name']
        clips_path = video_meta['clips_path']
        num_clips_expected = video_meta['num_clips']
        
        # Update progress bar description
        pbar.set_postfix({
            'video': video_name[:20],
            'clips': num_clips_expected,
            'done': results['successful']
        })
        
        # Setup Output Path
        save_dir = os.path.join(FEATURE_OUTPUT_DIR, class_name)
        os.makedirs(save_dir, exist_ok=True)
        save_path = os.path.join(save_dir, f"{video_name}.npy")
        
        # Skip if already processed
        if os.path.exists(save_path):
            results['skipped'] += 1
            continue
        
        # Check if clips directory exists
        if not os.path.exists(clips_path):
            results['failed'] += 1
            continue
            
        try:
            # ========== 4. CREATE DATASET ==========
            dataset = VideoClipsDataset(clips_path, processor)
            original_len = len(dataset)
            
            if original_len == 0:
                results['failed'] += 1
                continue
            
            # ===== CLIP LIMITING =====
            was_clipped = False
            if original_len > MAX_CLIPS_PER_VIDEO:
                # Uniformly sample MAX_CLIPS_PER_VIDEO clips
                indices = np.linspace(0, original_len - 1, MAX_CLIPS_PER_VIDEO, dtype=int)
                dataset.clip_folders = [dataset.clip_folders[i] for i in indices]
                # Note: Only clip_folders needs updating (clip_files doesn't exist in this version)
                was_clipped = True
                results['clipped'] += 1
                tqdm.write(f"   ⚡ {video_name}: {original_len} clips → {MAX_CLIPS_PER_VIDEO} (sampled)")

            # DataLoader
            loader = DataLoader(
                dataset, 
                batch_size=BATCH_SIZE, 
                num_workers=NUM_WORKERS,
                pin_memory=(DEVICE.type == 'cuda')
            )
            
            video_features = []
            
            # ========== 5. EXTRACT FEATURES ==========
            with torch.no_grad():
                # Show inner progress for large videos
                if len(dataset) > SHOW_PROGRESS_THRESHOLD:
                    batch_iter = tqdm(loader, desc=f"  {video_name[:25]}", leave=False, unit="batch")
                else:
                    batch_iter = loader
                    
                for batch in batch_iter:
                    batch = batch.to(DEVICE)
                    outputs = model(pixel_values=batch)
                    cls_features = outputs.last_hidden_state[:, 0, :].cpu().numpy()
                    video_features.append(cls_features)
            
            # ========== 6. SAVE ==========
            full_video_features = np.concatenate(video_features, axis=0)
            np.save(save_path, full_video_features)
            
            results['successful'] += 1
            results['total_clips_processed'] += len(dataset)
            results['videos'].append({
                'video_name': video_name,
                'class_name': class_name,
                'original_clips': original_len,
                'processed_clips': len(dataset),
                'was_clipped': was_clipped,
                'feature_shape': list(full_video_features.shape),
                'feature_path': save_path
            })
            
        except Exception as e:
            tqdm.write(f"❌ Error {video_name}: {e}")
            results['failed'] += 1
            
        # Periodic cleanup
        if results['successful'] % 50 == 0 and torch.cuda.is_available():
            torch.cuda.empty_cache()

    pbar.close()
    
    # ========== 7. FINAL REPORT ==========
    total_time = time.time() - start_time
    
    print("\n" + "="*70)
    print("✅ PHASE 2 COMPLETE")
    print("="*70)
    print(f"\n📊 Results:")
    print(f"   Successful: {results['successful']}")
    print(f"   Failed: {results['failed']}")
    print(f"   Skipped (already done): {results['skipped']}")
    print(f"   Clip-limited (huge videos): {results['clipped']}")
    print(f"   Total Clips Processed: {results['total_clips_processed']}")
    print(f"\n⏱️  Time: {total_time/60:.2f} minutes ({total_time:.0f} seconds)")
    if results['successful'] > 0:
        print(f"   Avg per video: {total_time/results['successful']:.1f} seconds")
    print(f"\n💾 Features saved to: {FEATURE_OUTPUT_DIR}")
    print("="*70)
    
    # Save metadata
    extraction_meta = {
        'timestamp': datetime.now().isoformat(),
        'total_videos': len(video_list),
        'successful': results['successful'],
        'failed': results['failed'],
        'skipped': results['skipped'],
        'clipped': results['clipped'],
        'total_clips': results['total_clips_processed'],
        'max_clips_per_video': MAX_CLIPS_PER_VIDEO,
        'model': MODEL_CKPT,
        'feature_dim': FEATURE_DIM,
        'processing_time_seconds': total_time,
        'videos': results['videos']
    }
    
    meta_path = os.path.join(FEATURE_OUTPUT_DIR, 'extraction_metadata.json')
    with open(meta_path, 'w') as f:
        json.dump(extraction_meta, f, indent=2, default=str)
    print(f"\n📄 Metadata saved to: {meta_path}")
    
    return results


print("✅ extract_features_optimized() function defined.")
print(f"\n⚡ Key optimization: Videos with >{MAX_CLIPS_PER_VIDEO} clips will be uniformly sampled.")
print("   This prevents huge videos (15000+ clips) from hanging the process.")
print("\n⚠️  Run the next cell to START (will skip already-processed videos).")

✅ extract_features_optimized() function defined.

⚡ Key optimization: Videos with >500 clips will be uniformly sampled.
   This prevents huge videos (15000+ clips) from hanging the process.

⚠️  Run the next cell to START (will skip already-processed videos).


## Cell 5: Run Feature Extraction

**⚠️ This will process all videos. Estimated time: 1-3 hours depending on dataset size.**
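
Reruns are cheap because any video whose `.npy` file already exists is skipped. The sketch below shows that existence check on a throwaway directory; the `split_done_pending` helper is hypothetical, and the metadata keys mirror the ones the extraction loop reads.

```python
import os
import tempfile


def split_done_pending(video_list, output_dir):
    """Partition metadata entries by whether their .npy feature file already exists."""
    done, pending = [], []
    for meta in video_list:
        path = os.path.join(output_dir, meta["class_name"], f"{meta['video_name']}.npy")
        (done if os.path.exists(path) else pending).append(meta)
    return done, pending


# Throwaway output directory with one finished video.
out_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(out_dir, "Explosion"))
open(os.path.join(out_dir, "Explosion", "Explosion001.npy"), "wb").close()

videos = [
    {"class_name": "Explosion", "video_name": "Explosion001"},
    {"class_name": "Explosion", "video_name": "Explosion002"},
]
done, pending = split_done_pending(videos, out_dir)
print(len(done), len(pending))  # 1 1
```

Because the check only looks at whether the file exists, a file left behind by an interrupted write would also be treated as done; deleting suspicious files before a rerun avoids that.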

In [5]:
"""
Cell 5: RUN OPTIMIZED FEATURE EXTRACTION
Will skip already-processed videos automatically!
"""

# Run the OPTIMIZED extraction (with clip limiting)
extraction_results = extract_features_optimized()


PHASE 2: FEATURE EXTRACTION (OPTIMIZED)

📂 Found 1900 videos to process.
⚡ Max clips per video: 500 (limits huge videos)
✅ Already processed: 1900 videos (will be skipped)
📋 Remaining: 0 videos

🚀 Starting extraction...



Processing Videos:   0%|          | 0/1900 [00:00<?, ?vid/s]


✅ PHASE 2 COMPLETE

📊 Results:
   Successful: 0
   Failed: 0
   Skipped (already done): 1900
   Clip-limited (huge videos): 0
   Total Clips Processed: 0

⏱️  Time: 0.02 minutes (1 seconds)

💾 Features saved to: C:\UCF_video_dataset\TimeSformer_Features

📄 Metadata saved to: C:\UCF_video_dataset\TimeSformer_Features\extraction_metadata.json


## Cell 6: Verify Extracted Features

Check that the feature bags are correctly shaped for MIL.
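
Beyond counting shapes, a per-bag check can flag files that would break MIL training. This is a sketch under the assumption that a valid bag is a non-empty 2-D array with 768 columns and only finite values; `check_bag` is a hypothetical helper, not part of the verification function in this cell.

```python
import numpy as np


def check_bag(bag, feature_dim=768):
    """Return a list of problems found in one feature bag (empty list means OK)."""
    problems = []
    if bag.ndim != 2:
        problems.append(f"expected 2-D array, got {bag.ndim}-D")
    elif bag.shape[1] != feature_dim:
        problems.append(f"expected feature dim {feature_dim}, got {bag.shape[1]}")
    if bag.size == 0:
        problems.append("bag is empty")
    elif not np.isfinite(bag).all():
        problems.append("bag contains NaN or Inf")
    return problems


good = np.zeros((87, 768), dtype=np.float32)
bad = np.full((5, 512), np.nan, dtype=np.float32)
print(check_bag(good))  # []
print(check_bag(bad))   # two problems: wrong feature dim, non-finite values
```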

In [6]:
"""
Cell 6: Verify Extracted Features
Check that Feature Bags are correctly shaped for MIL
"""

def verify_feature_bags(features_path=FEATURE_OUTPUT_DIR):
    """
    Verify that all feature bags are correctly shaped.
    Expected: Each .npy file should have shape (num_clips, 768)
    """
    print("\n" + "="*70)
    print("FEATURE BAG VERIFICATION REPORT")
    print("="*70)
    
    if not os.path.exists(features_path):
        print(f"❌ Features directory not found: {features_path}")
        return
    
    # Get all class folders
    class_folders = [d for d in os.listdir(features_path) 
                     if os.path.isdir(os.path.join(features_path, d))]
    
    total_files = 0
    total_clips = 0
    shape_distribution = {}
    
    print(f"\n📁 Checking features in: {features_path}\n")
    
    for class_name in sorted(class_folders):
        class_path = os.path.join(features_path, class_name)
        npy_files = [f for f in os.listdir(class_path) if f.endswith('.npy')]
        
        class_clips = 0
        for npy_file in npy_files:
            file_path = os.path.join(class_path, npy_file)
            features = np.load(file_path)
            
            # Track shape distribution
            shape_key = f"({features.shape[0]}, {features.shape[1]})"
            shape_distribution[shape_key] = shape_distribution.get(shape_key, 0) + 1
            
            class_clips += features.shape[0]
            total_clips += features.shape[0]
        
        total_files += len(npy_files)
        print(f"   {class_name:20s}: {len(npy_files):4d} videos, {class_clips:6d} total clips")
    
    print(f"\n📊 Summary:")
    print(f"   Total Feature Files: {total_files}")
    print(f"   Total Clips (Instances): {total_clips}")
    print(f"   Feature Dimension: {FEATURE_DIM}")
    
    print(f"\n📐 Shape Distribution (num_clips, 768):")
    for shape, count in sorted(shape_distribution.items(), key=lambda x: -x[1])[:10]:
        print(f"   {shape}: {count} videos")
    
    # Sample one file to show detailed stats
    print(f"\n📄 Sample Feature Bag Analysis:")
    sample_class = class_folders[0] if class_folders else None
    if sample_class:
        sample_path = os.path.join(features_path, sample_class)
        sample_files = [f for f in os.listdir(sample_path) if f.endswith('.npy')]
        if sample_files:
            sample_file = os.path.join(sample_path, sample_files[0])
            sample_features = np.load(sample_file)
            
            print(f"   File: {sample_files[0]}")
            print(f"   Shape: {sample_features.shape}")
            print(f"   Dtype: {sample_features.dtype}")
            print(f"   Min: {sample_features.min():.4f}")
            print(f"   Max: {sample_features.max():.4f}")
            print(f"   Mean: {sample_features.mean():.4f}")
            print(f"   Std: {sample_features.std():.4f}")
    
    print("\n" + "="*70)
    print("✅ VERIFICATION COMPLETE")
    print("   Each .npy file is a 'Bag of Instances' ready for MIL!")
    print("="*70)


# Run verification
verify_feature_bags()


FEATURE BAG VERIFICATION REPORT

📁 Checking features in: C:\UCF_video_dataset\TimeSformer_Features

   Abuse               :   50 videos,   2989 total clips
   Arrest              :   50 videos,   4613 total clips
   Arson               :   50 videos,   4214 total clips
   Assault             :   50 videos,   1995 total clips
   Burglary            :  100 videos,   7290 total clips
   Explosion           :   50 videos,   3910 total clips
   Fighting            :   50 videos,   4013 total clips
   Normal              :  950 videos, 111429 total clips
   RoadAccidents       :  150 videos,   3973 total clips
   Robbery             :  150 videos,   6505 total clips
   Shooting            :   50 videos,   2270 total clips
   Shoplifting         :   50 videos,   4491 total clips
   Stealing            :  100 videos,   7236 total clips
   Vandalism           :   50 videos,   2264 total clips

📊 Summary:
   Total Feature Files: 1900
   Total Clips (Instances): 167192
   Feature Dimensio

## Cell 7: Summary and Next Steps

Final summary and preparation for Phase 3 (MIL Training).
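
Phase 3 is planned to train on these bags with a ranking objective that pushes the top-scoring clip of an anomalous bag above the top-scoring clip of a normal bag. The sketch below uses a common hinge form with a margin of 1; that exact form, the margin, and the per-clip scores are assumptions for illustration, since the notebook only names the objective.

```python
import numpy as np


def mil_ranking_loss(anomaly_scores, normal_scores, margin=1.0):
    """Hinge loss pushing the top anomalous-bag score above the top normal-bag score."""
    top_anomalous = float(np.max(anomaly_scores))
    top_normal = float(np.max(normal_scores))
    return max(0.0, margin - top_anomalous + top_normal)


anomalous_bag = np.array([0.1, 0.2, 0.9, 0.3])  # made-up per-clip scores
normal_bag = np.array([0.1, 0.2, 0.15])

print(mil_ranking_loss(anomalous_bag, normal_bag))  # roughly 0.3
```

Only the maximum score of each bag enters the loss, which is what lets video-level labels supervise clip-level scores.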

In [7]:
"""
Cell 7: Phase 2 Summary
"""

print("\n" + "="*70)
print("PHASE 2 COMPLETE: FEATURE EXTRACTION (MIL-READY)")
print("="*70)
print("""
✅ What was accomplished:
   1. Loaded pretrained TimeSformer model (facebook/timesformer-base-finetuned-k400)
   2. Processed each video's "Bag of Clips" from Phase 1
   3. Extracted 768-dimensional [CLS] token features for EACH CLIP
   4. Saved Feature Bags as .npy files

📁 Output Structure:
   TimeSformer_Features/
   ├── Abuse/
   │   ├── Abuse001.npy      → Shape: (num_clips, 768) e.g. (87, 768)
   │   └── Abuse002.npy      → Shape: (num_clips, 768) e.g. (124, 768)
   ├── Explosion/
   │   └── Explosion001.npy  → Shape: (num_clips, 768) e.g. (156, 768)
   ├── Normal/
   │   └── Normal001.npy     → Shape: (num_clips, 768) e.g. (203, 768)
   └── extraction_metadata.json

🎯 Why This Shape Matters for MIL:
   • Each video is now a "BAG" of multiple instances (clips)
   • MIL can compare instances within and across bags
   • Example: "Clips 1-40 look normal, but Clip 45 is anomalous"
   • This enables FRAME-LEVEL LOCALIZATION (key thesis requirement!)

🚀 Next Steps (Phase 3 - MIL Training):
   1. Load Feature Bags
   2. Implement MIL Network with Attention Mechanism
   3. Train with:
      - Ranking Loss: max(anomaly_scores) > max(normal_scores)
      - Focal Loss: Handle class imbalance
      - Temporal Smoothness: Consistent adjacent predictions
   4. Evaluate: AUC-ROC, Per-Frame Localization
""")
print("="*70)

# Clear GPU memory
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    allocated = torch.cuda.memory_allocated(0) / (1024**3)
    print(f"\n🧹 GPU memory cleared. Current usage: {allocated:.2f} GB")


PHASE 2 COMPLETE: FEATURE EXTRACTION (MIL-READY)

✅ What was accomplished:
   1. Loaded pretrained TimeSformer model (facebook/timesformer-base-finetuned-k400)
   2. Processed each video's "Bag of Clips" from Phase 1
   3. Extracted 768-dimensional [CLS] token features for EACH CLIP
   4. Saved Feature Bags as .npy files

📁 Output Structure:
   TimeSformer_Features/
   ├── Abuse/
   │   ├── Abuse001.npy      → Shape: (num_clips, 768) e.g. (87, 768)
   │   └── Abuse002.npy      → Shape: (num_clips, 768) e.g. (124, 768)
   ├── Explosion/
   │   └── Explosion001.npy  → Shape: (num_clips, 768) e.g. (156, 768)
   ├── Normal/
   │   └── Normal001.npy     → Shape: (num_clips, 768) e.g. (203, 768)
   └── extraction_metadata.json

🎯 Why This Shape Matters for MIL:
   • Each video is now a "BAG" of multiple instances (clips)
   • MIL can compare instances within and across bags
   • Example: "Clips 1-40 look normal, b