# YouTube Frame Extractor - Quickstart Guide

This notebook demonstrates how to use the YouTube Frame Extractor package to extract and analyze frames from YouTube videos using different methods.

## Overview

The YouTube Frame Extractor offers two main approaches:

1. **Browser-based extraction**: Uses Selenium to capture frames directly from the YouTube player.
2. **Download-based extraction**: Downloads videos using yt-dlp and extracts frames.

Both methods can be enhanced with Vision Language Models (VLMs) for intelligent frame selection based on natural language descriptions.

## 1. Setup and Installation

First, let's make the package importable from the repository checkout and set up the environment:

In [None]:
# Add the project root to sys.path so the package can be imported
import sys
import os
from pathlib import Path

# Move up two directories from the current notebook location
project_root = Path().absolute().parent.parent
sys.path.insert(0, str(project_root))

# Verify we can import the package
try:
    from src.youtube_frame_extractor.extractors.browser import BrowserExtractor
    from src.youtube_frame_extractor.extractors.download import DownloadExtractor
    from src.youtube_frame_extractor.analysis.vlm import VLMAnalyzer
    print("✅ Successfully imported YouTube Frame Extractor package")
except ImportError as e:
    print(f"❌ Error importing package: {str(e)}")
    print("Please make sure you're running this notebook from the examples/notebooks directory")
    raise

In [None]:
# Configure logging
import logging
import warnings

# Set up logging to display in the notebook
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Suppress unnecessary warnings
warnings.filterwarnings('ignore', category=UserWarning)

# Create output directory for extracted frames
output_dir = Path("./notebook_output")
output_dir.mkdir(exist_ok=True)

print(f"Output will be saved to: {output_dir.absolute()}")
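
The extraction cells later in this notebook note that browser-based extraction needs Chrome/Chromium and download-based extraction needs ffmpeg. The sketch below is one way to check for them up front with `shutil.which`. The helper name `check_external_tools` and the list of executable names are assumptions for illustration; executable names vary by platform, and a browser installed outside `PATH` (common on macOS and Windows) will be reported as not found even though it may work.

```python
import shutil

def check_external_tools(candidates=None):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    if candidates is None:
        # Assumed executable names; adjust for your platform
        candidates = ["ffmpeg", "google-chrome", "chromium", "chromium-browser"]
    return {name: shutil.which(name) for name in candidates}

tool_paths = check_external_tools()
for name, path in tool_paths.items():
    print(f"{name}: {path or 'not found on PATH'}")
```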

## 2. Helper Functions for Display

Let's define some helper functions to display extracted frames in the notebook:

In [None]:
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np
from IPython.display import display, HTML

def display_frames(frames, max_frames=6, figsize=(15, 10), title="Extracted Frames"):
    """Display a grid of extracted frames."""
    num_frames = min(max_frames, len(frames))
    if num_frames == 0:
        print("No frames to display")
        return
    
    # Calculate grid dimensions
    cols = min(3, num_frames)
    rows = (num_frames + cols - 1) // cols
    
    plt.figure(figsize=figsize)
    plt.suptitle(title, fontsize=16)
    
    for i in range(num_frames):
        plt.subplot(rows, cols, i + 1)
        
        # Get the frame image
        if 'frame' in frames[i] and frames[i]['frame'] is not None:
            img = frames[i]['frame']
        elif 'path' in frames[i] and os.path.exists(frames[i]['path']):
            img = Image.open(frames[i]['path'])
        else:
            plt.text(0.5, 0.5, "Image not available", ha='center', va='center')
            plt.axis('off')
            continue
        
        if isinstance(img, Image.Image):
            img = np.array(img)
        
        plt.imshow(img)
        subtitle = f"Frame {i+1}"
        if 'time' in frames[i]:
            subtitle += f" | Time: {frames[i]['time']:.2f}s"
        if 'similarity' in frames[i]:
            subtitle += f" | Score: {frames[i]['similarity']:.2f}"
        plt.title(subtitle)
        plt.axis('off')
    
    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.show()

def display_video_info(video_id):
    """Display YouTube video embed and basic info."""
    embed_html = f"""
    <div style='width:560px;'>
        <h3>YouTube Video: {video_id}</h3>
        <iframe width='560' height='315' src='https://www.youtube.com/embed/{video_id}' 
                frameborder='0' allow='accelerometer; autoplay; clipboard-write; encrypted-media; 
                gyroscope; picture-in-picture' allowfullscreen>
        </iframe>
    </div>
    """
    display(HTML(embed_html))
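
The grid layout in `display_frames` relies on integer ceiling division: at most three columns, and as many rows as needed to hold every frame. The standalone helper below (a hypothetical `grid_shape`, not part of the package) isolates that arithmetic so it can be checked without plotting.

```python
def grid_shape(num_frames, max_cols=3):
    """Return (rows, cols) for a grid that fits num_frames with at most max_cols columns."""
    if num_frames <= 0:
        return (0, 0)
    cols = min(max_cols, num_frames)
    # Ceiling division: add (cols - 1) before floor-dividing by cols
    rows = (num_frames + cols - 1) // cols
    return (rows, cols)

for n in (1, 3, 4, 6):
    print(n, grid_shape(n))
```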

## 3. Browser-Based Frame Extraction

Let's start with browser-based extraction, which captures frames directly from the YouTube player without downloading the full video.

In [None]:
# Define a YouTube video to extract frames from
video_id = "dQw4w9WgXcQ"

# Display the video for reference
display_video_info(video_id)

In [None]:
# Create a browser extractor
browser_extractor = BrowserExtractor(
    output_dir=str(output_dir / "browser"),
    headless=True
)

# Extract frames (e.g., up to 5 frames, one every 3 seconds)
try:
    frames = browser_extractor.extract_frames(
        video_id=video_id,
        interval=3.0,
        max_frames=5
    )
    print(f"Successfully extracted {len(frames)} frames")
    display_frames(frames, title="Browser-Extracted Frames")
except Exception as e:
    print(f"Error extracting frames: {str(e)}")
    print("Note: Browser-based extraction requires Chrome/Chromium to be installed")
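
To reason about which moments of the video an `interval` / `max_frames` pair covers, it can help to list the timestamps up front. This is a minimal sketch that assumes sampling starts at 0 seconds and advances by a fixed interval; the actual `BrowserExtractor` may start elsewhere or drift with playback, and `expected_timestamps` is a hypothetical helper, not a package function.

```python
def expected_timestamps(interval, max_frames, duration=None, start=0.0):
    """List the sample times (in seconds) for fixed-interval extraction."""
    times = []
    for i in range(max_frames):
        # Multiply rather than accumulate to avoid floating-point drift
        t = start + i * interval
        if duration is not None and t >= duration:
            break
        times.append(t)
    return times

print(expected_timestamps(interval=3.0, max_frames=5))
print(expected_timestamps(interval=3.0, max_frames=5, duration=10.0))
```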

## 4. Download-Based Frame Extraction

Now let's try the download-based approach, which downloads the video and extracts frames locally.

In [None]:
# Create a download extractor
download_extractor = DownloadExtractor(
    output_dir=str(output_dir / "download")
)

# Extract frames (e.g., up to 5 frames at 0.25 fps, i.e. one frame every 4 seconds)
try:
    frames = download_extractor.extract_frames(
        video_id=video_id,
        frame_rate=0.25,
        max_frames=5
    )
    print(f"Successfully extracted {len(frames)} frames")
    display_frames(frames, title="Download-Extracted Frames")
except Exception as e:
    print(f"Error extracting frames: {str(e)}")
    print("Note: Download-based extraction requires ffmpeg to be installed")
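
For intuition about what `frame_rate=0.25` and `max_frames=5` mean at the ffmpeg level, the sketch below builds (but does not run) an equivalent command line using ffmpeg's `fps` filter and `-frames:v` option. This is not necessarily how `DownloadExtractor` invokes ffmpeg internally; the helper name `build_ffmpeg_command` and the file names are placeholders.

```python
import shlex

def build_ffmpeg_command(video_path, output_pattern, frame_rate, max_frames):
    """Build an ffmpeg argument list that samples frames at a fixed rate."""
    return [
        "ffmpeg",
        "-i", str(video_path),          # input video file
        "-vf", f"fps={frame_rate}",     # resample to the requested frames per second
        "-frames:v", str(max_frames),   # stop after this many output frames
        str(output_pattern),            # e.g. frame_001.jpg, frame_002.jpg, ...
    ]

cmd = build_ffmpeg_command("video.mp4", "frame_%03d.jpg", frame_rate=0.25, max_frames=5)
print(shlex.join(cmd))
print(f"Seconds between frames: {1 / 0.25}")
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` once a real video file is available.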

## 5. VLM-Based Intelligent Frame Analysis

Now let's use a Vision Language Model (VLM) to find frames that match a specific description.

In [None]:
# Initialize VLM analyzer
try:
    vlm_analyzer = VLMAnalyzer(model_name="openai/clip-vit-base-patch16")
    print("✅ VLM analyzer initialized successfully")
except Exception as e:
    print(f"❌ Error initializing VLM analyzer: {str(e)}")
    print("Skipping VLM-based analysis")
    vlm_analyzer = None

In [None]:
# Proceed with VLM analysis if available
if vlm_analyzer is not None:
    search_query = "person singing into microphone"
    try:
        matched_frames = browser_extractor.scan_video_for_frames(
            video_id=video_id,
            search_query=search_query,
            vlm_analyzer=vlm_analyzer,
            interval=2.0,
            threshold=0.25,
            max_frames=10
        )
        print(f"Found {len(matched_frames)} frames matching the query: '{search_query}'")
        matched_frames.sort(key=lambda x: x.get('similarity', 0), reverse=True)
        display_frames(matched_frames, title=f"Frames Matching: '{search_query}'")
    except Exception as e:
        print(f"Error in VLM analysis: {str(e)}")
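
CLIP-style models embed images and text into a shared vector space, and a frame is commonly treated as a match when the similarity between its embedding and the query embedding clears a threshold. The sketch below illustrates that selection step with toy three-dimensional vectors and plain Python. It is an assumption about the general technique, not the package's actual scoring code, and `select_matching_frames` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def select_matching_frames(frame_embeddings, text_embedding, threshold=0.25, max_frames=10):
    """Keep frames scoring at or above threshold, best match first."""
    scored = [
        {"index": i, "similarity": cosine_similarity(emb, text_embedding)}
        for i, emb in enumerate(frame_embeddings)
    ]
    matches = [s for s in scored if s["similarity"] >= threshold]
    matches.sort(key=lambda s: s["similarity"], reverse=True)
    return matches[:max_frames]

# Toy embeddings standing in for real model output
toy_text = [1.0, 0.0, 0.0]
toy_frames = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
matches = select_matching_frames(toy_frames, toy_text, threshold=0.25)
for m in matches:
    print(f"frame {m['index']}: {m['similarity']:.2f}")
```

Raising `threshold` trades recall for precision: fewer frames are returned, but they match the query more closely.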

## 6. Custom Frame Processing and Analysis

Let's demonstrate how to process the extracted frames with custom analysis functions (e.g., face detection).

In [None]:
import cv2
from PIL import Image, ImageDraw
import numpy as np

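def detect_faces(image):
    """Detect faces in an image using an OpenCV Haar cascade.

    Accepts a PIL image (any mode) or a NumPy array assumed to be in BGR order.
    Returns the detected (x, y, w, h) boxes and the BGR image that was searched.
    """
    if isinstance(image, Image.Image):
        # Normalize to 3-channel RGB first (screenshots may be RGBA), then switch to OpenCV's BGR order
        image = cv2.cvtColor(np.array(image.convert("RGB")), cv2.COLOR_RGB2BGR)
    face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))
    return faces, image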

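def draw_faces_on_image(image, faces):
    """Draw a red box and label on each detected face and return a PIL image."""
    if not isinstance(image, Image.Image):
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        image = Image.fromarray(image)
    draw = ImageDraw.Draw(image)
    for (x, y, w, h) in faces:
        # Cast NumPy integers to plain ints before handing them to PIL
        x, y, w, h = int(x), int(y), int(w), int(h)
        draw.rectangle([(x, y), (x + w, y + h)], outline="red", width=3)
        # Keep the label inside the image when the face touches the top edge
        draw.text((x, max(0, y - 10)), "Face", fill="red")
    return image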

def process_frames_with_face_detection(frames):
    processed_frames = []
    for frame in frames:
        if 'frame' in frame and frame['frame'] is not None:
            image = frame['frame']
        elif 'path' in frame and os.path.exists(frame['path']):
            image = Image.open(frame['path'])
        else:
            continue
        faces, cv_image = detect_faces(image)
        processed_image = draw_faces_on_image(cv_image, faces)
        processed_frame = frame.copy()
        processed_frame['frame'] = processed_image
        processed_frame['faces_detected'] = len(faces)
        processed_frames.append(processed_frame)
    return processed_frames

In [None]:
# If frames are available, process them with face detection
try:
    if 'frames' in locals() and frames:
        processed_frames = process_frames_with_face_detection(frames)
        display_frames(processed_frames, title="Frames with Face Detection")
        face_counts = [frame.get('faces_detected', 0) for frame in processed_frames]
        total_faces = sum(face_counts)
        frames_with_faces = sum(1 for count in face_counts if count > 0)
        print(f"Detected {total_faces} faces in {frames_with_faces} frames")
    else:
        print("No frames available for processing")
except Exception as e:
    print(f"Error processing frames: {str(e)}")
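
Once each frame carries a `faces_detected` count, a natural follow-up is to keep only the frames that contain faces and rank them. The helper below is a small sketch of that step; `frames_with_faces` is a hypothetical name, and the sample dictionaries only mimic the keys used in this notebook.

```python
def frames_with_faces(processed_frames, min_faces=1):
    """Return frames with at least min_faces detections, most faces first."""
    kept = [f for f in processed_frames if f.get("faces_detected", 0) >= min_faces]
    return sorted(kept, key=lambda f: f["faces_detected"], reverse=True)

# Stand-in data shaped like the processed frames above (no image payloads)
sample = [
    {"time": 0.0, "faces_detected": 0},
    {"time": 3.0, "faces_detected": 2},
    {"time": 6.0, "faces_detected": 1},
    {"time": 9.0},
]
selected = frames_with_faces(sample)
print([f["time"] for f in selected])
```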

## 7. Saving Processed Results

Next, let's save the processed frames and their metadata for later use.

In [None]:
import json
import time

def save_processed_results(frames, video_id, output_path):
    """Save processed frames and metadata."""
    results_dir = Path(output_path) / "results"
    results_dir.mkdir(exist_ok=True, parents=True)
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    metadata = {
        "video_id": video_id,
        "extraction_time": timestamp,
        "frame_count": len(frames),
        "frames": []
    }
    for i, frame in enumerate(frames):
        frame_filename = f"{video_id}_{timestamp}_{i:03d}.jpg"
        frame_path = results_dir / frame_filename
        if 'frame' in frame and frame['frame'] is not None:
            image = frame['frame']
        elif 'path' in frame and os.path.exists(frame['path']):
            image = Image.open(frame['path'])
        else:
            continue
        if not isinstance(image, Image.Image):
            image = Image.fromarray(image)
        # JPEG has no alpha channel, so normalize the mode before saving
        image.convert("RGB").save(frame_path)
        frame_meta = {
            "filename": frame_filename,
            "path": str(frame_path),
            "index": i
        }
        for key, value in frame.items():
            if key not in ['frame', 'path'] and not callable(value):
                # tolist() turns NumPy scalars into Python values and arrays into lists
                if hasattr(value, 'tolist'):
                    value = value.tolist()
                frame_meta[key] = value
        metadata["frames"].append(frame_meta)
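    # Record how many frames were actually written (frames without image data are skipped above)
    metadata["frame_count"] = len(metadata["frames"])
    metadata_path = results_dir / f"{video_id}_{timestamp}_metadata.json"
    with open(metadata_path, 'w') as f:
        # default=str keeps the dump from failing on values json cannot encode natively
        json.dump(metadata, f, indent=2, default=str)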
    return str(metadata_path)

if 'processed_frames' in locals() and processed_frames:
    metadata_path = save_processed_results(
        frames=processed_frames,
        video_id=video_id,
        output_path=output_dir
    )
    print(f"Saved processed results to: {metadata_path}")
    try:
        with open(metadata_path, 'r') as f:
            metadata = json.load(f)
        print("\nMetadata summary:")
        print(f"- Video ID: {metadata['video_id']}")
        print(f"- Extraction time: {metadata['extraction_time']}")
        print(f"- Frame count: {metadata['frame_count']}")
        print(f"- First frame: {metadata['frames'][0]['filename']}")
    except Exception as e:
        print(f"Error displaying metadata: {str(e)}")
else:
    print("No processed frames available to save")
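
To pick the results up again in a later session, the metadata file can be read back and each entry checked against the files on disk. The sketch below assumes the JSON layout written by `save_processed_results` above (a top-level `frames` list whose entries have a `path` key); `load_saved_frames` is a hypothetical helper, and the demo uses a temporary directory with a placeholder file instead of real JPEGs.

```python
import json
import tempfile
from pathlib import Path

def load_saved_frames(metadata_path):
    """Load a metadata file and split its frame entries by whether the image file exists."""
    with open(metadata_path, "r") as f:
        metadata = json.load(f)
    existing, missing = [], []
    for frame_meta in metadata.get("frames", []):
        target = existing if Path(frame_meta["path"]).exists() else missing
        target.append(frame_meta)
    return metadata, existing, missing

# Demo with a throwaway directory: one file present, one absent
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    saved = tmp / "demo_000.jpg"
    saved.write_bytes(b"placeholder")  # stands in for a real JPEG
    demo_metadata = {
        "video_id": "demo",
        "frame_count": 2,
        "frames": [
            {"filename": "demo_000.jpg", "path": str(saved), "index": 0},
            {"filename": "demo_001.jpg", "path": str(tmp / "demo_001.jpg"), "index": 1},
        ],
    }
    demo_path = tmp / "demo_metadata.json"
    demo_path.write_text(json.dumps(demo_metadata))
    loaded, existing, missing = load_saved_frames(demo_path)
    print(f"{len(existing)} frame(s) on disk, {len(missing)} missing")
```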

## 8. Cleanup

Finally, let's release the browser and drop references to the extracted frames before summarizing what we've covered.

In [None]:
# Clean up resources
try:
    # _driver is a private attribute, so guard the lookup in case it is absent
    driver = getattr(globals().get('browser_extractor'), '_driver', None)
    if driver is not None:
        driver.quit()
        print("Browser extractor cleaned up")
    # Clear references to frame lists so their image data can be garbage-collected
    for var in ['frames', 'matched_frames', 'processed_frames']:
        if var in globals():
            globals()[var] = None
    print("Cleanup complete")
except Exception as e:
    print(f"Error during cleanup: {e}")
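
A cleanup cell only helps if it gets run. An alternative is to wrap the extractor in a context manager so the browser is closed even when extraction raises. The sketch below assumes, as the cleanup cell does, that the extractor keeps its Selenium driver on a `_driver` attribute; `managed_extractor` is a hypothetical helper, and the fake classes exist only to make the example runnable without a browser.

```python
from contextlib import contextmanager

@contextmanager
def managed_extractor(extractor):
    """Yield the extractor and quit its driver afterwards, even on error."""
    try:
        yield extractor
    finally:
        driver = getattr(extractor, "_driver", None)  # assumed attribute name
        if driver is not None:
            driver.quit()

class FakeDriver:
    """Stand-in for a Selenium driver that records whether quit() was called."""
    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True

class FakeExtractor:
    """Stand-in for BrowserExtractor with the assumed _driver attribute."""
    def __init__(self):
        self._driver = FakeDriver()

fake = FakeExtractor()
try:
    with managed_extractor(fake) as extractor:
        raise RuntimeError("simulated extraction failure")
except RuntimeError:
    pass
print(f"Driver closed after failure: {fake._driver.quit_called}")
```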

## Summary

In this quickstart guide, you've learned how to:
1. **Set up** the YouTube Frame Extractor package
2. **Extract frames** using browser-based and download-based methods
3. **Analyze frames** with a Vision Language Model (VLM) to find content matching specific descriptions
4. **Process frames** with custom analysis (face detection)
5. **Save results** for later use

### Next Steps
- Try extracting frames from different videos
- Experiment with different search queries for VLM analysis
- Implement custom frame processing for your specific needs
- Check out the Advanced Analysis notebook for more complex examples

For more details on API usage and advanced features, refer to the documentation.