# Notebook 05: Naive Sliding Windows

**Course:** Deep Neural Network Architectures (21CSE558T)  
**Module 5:** Object Detection and Localization  
**Week 13:** Object Localization Fundamentals  
**Duration:** ~10 minutes

## Learning Objectives
By the end of this notebook, you will be able to:
- Understand the classical sliding window approach to object detection
- Implement a basic sliding window detector
- Analyze computational limitations of this approach
- Appreciate why modern methods (YOLO, R-CNN) are necessary
- Understand the motivation for deep learning-based detection

## Introduction

Before deep learning revolutionized computer vision (circa 2012), object detection relied on:

### Classical Approach (2000-2012):
1. **Sliding Window**: Move a fixed-size window across the image
2. **Feature Extraction**: HOG (Histogram of Oriented Gradients), SIFT, etc.
3. **Classification**: SVM, Random Forest on hand-crafted features
4. **Multi-Scale**: Repeat at different image scales

### Problems:
- **Slow**: 1000s of windows per image
- **Redundant computation**: Same features computed many times
- **Fixed aspect ratio**: One window size per scale
- **Not end-to-end**: Separate feature extraction and classification

### Why Study This?
- Historical context: Appreciate modern methods
- Understand the evolution of object detection
- Some concepts (multi-scale, window-based) still relevant

## Setup and Imports

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle  # only Rectangle is used below
import cv2
import time

# Set random seed for reproducibility
np.random.seed(42)

# Configure matplotlib
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"OpenCV version: {cv2.__version__}")

## The 2010 Approach: Before Deep Learning

### How Object Detection Worked (Pre-2012)

#### 1. Hand-Crafted Features
Instead of learning features from data, researchers manually designed feature extractors:

- **HOG (Histogram of Oriented Gradients)**:
  - Count edge orientations in local regions
  - Good for detecting pedestrians, vehicles
  - Introduced in Dalal & Triggs (2005)

- **SIFT (Scale-Invariant Feature Transform)**:
  - Detect and describe local keypoints
  - Robust to scale and rotation
  - Lowe (2004)

- **Haar Cascades**:
  - Rectangle features for face detection
  - Viola-Jones (2001)
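
To make "count edge orientations in local regions" concrete, here is a minimal sketch of a magnitude-weighted orientation histogram for a single cell. It is a simplification for illustration only, not the full Dalal & Triggs descriptor: it leaves out block normalization and vote interpolation, and the function name `cell_orientation_histogram` and the 9-bin default are choices made for this example.

```python
import numpy as np

def cell_orientation_histogram(cell, num_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    cell = cell.astype(np.float64)
    # np.gradient returns derivatives along axis 0 (rows, y) then axis 1 (columns, x)
    gy, gx = np.gradient(cell)
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180): opposite gradient directions share a bin
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(angle, bins=num_bins, range=(0.0, 180.0), weights=magnitude)
    return hist

# An 8x8 cell whose intensity increases from left to right
ramp = np.tile(np.arange(8, dtype=np.float64), (8, 1))
hist = cell_orientation_histogram(ramp)
print(hist)
```

For this ramp every gradient points along the x axis, so all of the weight lands in the first bin. A real detector would concatenate such histograms over many cells before passing them to the classifier.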

#### 2. Sliding Window
- Move a fixed-size window across the entire image
- Extract features from each window
- Classify: "Does this window contain an object?"

#### 3. Multi-Scale Pyramid
- Create multiple resized versions of the image
- Apply sliding window at each scale
- Detect objects of different sizes

#### 4. Classification
- Train SVM (Support Vector Machine) or similar
- Binary: object vs background
- Or multi-class: car vs person vs dog vs background

### Landmark Systems:
- **Viola-Jones Face Detector (2001)**: First real-time detector
- **Dalal-Triggs Pedestrian Detector (2005)**: HOG + SVM
- **Deformable Part Models (2008)**: Won PASCAL VOC challenges

### The Deep Learning Revolution (2012)
- **AlexNet (2012)**: CNN beats hand-crafted features on ImageNet
- **R-CNN (2014)**: Early, influential deep learning object detector
- **YOLO (2016)**: Real-time detection with single forward pass
- **Current**: YOLO-family models (e.g., YOLOv8) and Faster R-CNN remain widely used

## Create Simple Classifier

For demonstration, we'll use a mock classifier instead of training a real one:

In [None]:
class MockClassifier:
    """
    Mock classifier to simulate object detection.
    In reality, this would be a trained SVM or neural network.
    """
    def __init__(self, target_regions=None):
        """
        Args:
            target_regions: List of (x, y, width, height) where objects are located
        """
        self.target_regions = target_regions or []
    
    def predict(self, window_x, window_y, window_size, image):
        """
        Simulate classification of a window.
        
        Returns:
            confidence: 0-1 score (higher if window overlaps target region)
        """
        max_overlap = 0.0
        
        for target_x, target_y, target_w, target_h in self.target_regions:
            # Calculate overlap with target region
            x_overlap = max(0, min(window_x + window_size, target_x + target_w) - 
                          max(window_x, target_x))
            y_overlap = max(0, min(window_y + window_size, target_y + target_h) - 
                          max(window_y, target_y))
            
            overlap_area = x_overlap * y_overlap
            window_area = window_size * window_size
            
            if window_area > 0:
                overlap_ratio = overlap_area / window_area
                max_overlap = max(max_overlap, overlap_ratio)
        
        # Simulate confidence score with some noise
        confidence = max_overlap * 0.9 + np.random.rand() * 0.1
        return min(1.0, confidence)

# Create a test image with objects
image = np.ones((480, 640, 3), dtype=np.uint8) * 240  # Gray background

# Add some "objects": bounding boxes as (x, y, w, h)
objects = [
    (100, 80, 120, 140),   # Red rectangle
    (400, 205, 150, 150),  # Blue circle (bounding box of center (475, 280), radius 75)
]

# Draw objects. OpenCV colors are BGR; the image is converted to RGB for display.
cv2.rectangle(image, (100, 80), (220, 220), (50, 50, 220), -1)  # Red in BGR
cv2.circle(image, (475, 280), 75, (220, 50, 50), -1)            # Blue in BGR

# Create classifier that knows where objects are
classifier = MockClassifier(target_regions=objects)

# Display image
plt.figure(figsize=(10, 7))
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.title("Test Image with Objects", fontsize=14, fontweight='bold')
plt.axis('off')
plt.tight_layout()
plt.show()

print(f"Image size: {image.shape[1]} × {image.shape[0]} pixels")
print(f"Number of objects: {len(objects)}")
print("\nClassifier created (mock for demonstration)")

## Sliding Window Function

The core of the classical detection approach:

In [None]:
def sliding_window(image, window_size, stride):
    """
    Generate sliding windows across an image.
    
    Args:
        image: Input image (H, W, C)
        window_size: Size of the square window
        stride: Step size for sliding (pixels)
    
    Yields:
        (x, y, window): Position and extracted window
    """
    height, width = image.shape[:2]
    
    for y in range(0, height - window_size + 1, stride):
        for x in range(0, width - window_size + 1, stride):
            # Extract window
            window = image[y:y+window_size, x:x+window_size]
            yield (x, y, window)

def count_windows(image_width, image_height, window_size, stride):
    """
    Count total number of windows for given parameters.
    """
    num_x = (image_width - window_size) // stride + 1
    num_y = (image_height - window_size) // stride + 1
    return num_x * num_y

# Test with different window sizes and strides
window_size = 100
stride = 50

height, width = image.shape[:2]
total_windows = count_windows(width, height, window_size, stride)

print("Sliding Window Configuration:")
print("=" * 60)
print(f"Image size:    {width} × {height} pixels")
print(f"Window size:   {window_size} × {window_size} pixels")
print(f"Stride:        {stride} pixels")
print(f"\nTotal windows: {total_windows}")

# Show impact of different strides
print("\nEffect of Stride on Number of Windows:")
print("=" * 60)
print(f"{'Stride':<10} {'Windows':<15} {'Coverage':<15}")
print("=" * 60)
for s in [10, 20, 30, 50, 100]:
    windows = count_windows(width, height, window_size, s)
    coverage = "Dense" if s <= window_size/3 else "Moderate" if s <= window_size/2 else "Sparse"
    print(f"{s:<10} {windows:<15} {coverage:<15}")

print("\nTrade-off:")
print("  - Smaller stride → More windows → Better coverage, slower")
print("  - Larger stride → Fewer windows → Faster, might miss objects")
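
As an alternative to the Python generator, the same grid of windows can be obtained as a strided view with NumPy's `sliding_window_view` (available in NumPy 1.20 and later). This is a sketch on a stand-in array, not a replacement for `sliding_window`; the variable names are chosen here for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Stand-in for the 480x640 test image used in this notebook
demo_image = np.zeros((480, 640, 3), dtype=np.uint8)
demo_window, demo_stride = 100, 50

# Every window position along height and width, then keep every `demo_stride`-th one
all_views = sliding_window_view(demo_image, (demo_window, demo_window), axis=(0, 1))
windows = all_views[::demo_stride, ::demo_stride]

# Shape is (num_y, num_x, channels, window_height, window_width)
num_y, num_x = windows.shape[:2]
print(f"{num_y} x {num_x} = {num_y * num_x} windows")
```

The result is a view, so no pixel data is copied. Note that the channel axis comes before the two window axes; `windows[i, j].transpose(1, 2, 0)` gives a single window in the usual (H, W, C) layout.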

## Visualize Windows

Let's see where all the windows are positioned:

In [None]:
def visualize_sliding_windows(image, window_size, stride, max_windows=50):
    """
    Visualize sliding window positions on the image.
    
    Args:
        image: Input image
        window_size: Window size
        stride: Stride size
        max_windows: Maximum windows to display (for clarity)
    """
    fig, ax = plt.subplots(figsize=(12, 8))
    ax.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    
    count = 0
    total = 0
    
    for x, y, window in sliding_window(image, window_size, stride):
        total += 1
        if count < max_windows:
            # Draw window
            rect = Rectangle((x, y), window_size, window_size, 
                           linewidth=1, edgecolor='lime', facecolor='none', alpha=0.6)
            ax.add_patch(rect)
            count += 1
    
    ax.set_title(f"Sliding Windows (showing {count} of {total} windows)\n" + 
                f"Window: {window_size}×{window_size}, Stride: {stride}",
                fontsize=14, fontweight='bold')
    ax.axis('off')
    plt.tight_layout()
    plt.show()
    
    return total

# Visualize with moderate stride
total = visualize_sliding_windows(image, window_size=100, stride=80, max_windows=50)
print(f"\nTotal windows at this scale: {total}")
print("Note: Only first 50 windows shown for clarity")

# Compare different window sizes
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

configs = [(64, 60), (100, 80), (150, 120)]

for ax, (ws, st) in zip(axes, configs):
    ax.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    
    count = 0
    total = 0
    for x, y, window in sliding_window(image, ws, st):
        total += 1
        if count < 30:  # Show fewer for clarity
            rect = Rectangle((x, y), ws, ws, 
                           linewidth=1.5, edgecolor='lime', facecolor='none', alpha=0.6)
            ax.add_patch(rect)
            count += 1
    
    ax.set_title(f"Window: {ws}×{ws}\nStride: {st}\nTotal: {total} windows",
                fontweight='bold')
    ax.axis('off')

plt.suptitle("Different Window Sizes", fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\nObservation: Larger windows give fewer positions per scale, but each fixed size only fits objects of similar size")

## Computational Cost Analysis

Let's calculate the computational cost of this approach:

In [None]:
def analyze_computational_cost(image_width, image_height, window_sizes, strides, num_scales=1):
    """
    Analyze computational cost of sliding window detection.
    
    Args:
        image_width, image_height: Image dimensions
        window_sizes: List of window sizes
        strides: List of strides (same length as window_sizes)
        num_scales: Number of image scales (pyramid levels)
    
    Returns:
        Total number of windows to process
    """
    total_windows = 0
    
    print("Computational Cost Analysis:")
    print("=" * 70)
    print(f"{'Scale':<8} {'Size':<15} {'Window':<12} {'Stride':<10} {'Windows':<10}")
    print("=" * 70)
    
    for scale_idx in range(num_scales):
        scale_factor = 0.8 ** scale_idx  # Each scale is 80% of previous
        scaled_width = int(image_width * scale_factor)
        scaled_height = int(image_height * scale_factor)
        
        for window_size, stride in zip(window_sizes, strides):
            if scaled_width < window_size or scaled_height < window_size:
                continue
            
            windows = count_windows(scaled_width, scaled_height, window_size, stride)
            total_windows += windows
            
            print(f"{scale_idx+1:<8} {scaled_width}×{scaled_height:<9} "
                  f"{window_size}×{window_size:<6} {stride:<10} {windows:<10}")
    
    print("=" * 70)
    print(f"\nTotal windows to process: {total_windows:,}")
    
    # Estimate time (assuming 10ms per window for feature extraction + classification)
    time_per_window_ms = 10
    total_time_ms = total_windows * time_per_window_ms
    total_time_sec = total_time_ms / 1000
    
    print(f"\nEstimated processing time:")
    print(f"  Assuming {time_per_window_ms}ms per window")
    print(f"  Total: {total_time_sec:.2f} seconds ({total_time_ms:,} ms)")
    print(f"  FPS: {1/total_time_sec:.2f}")
    
    return total_windows

# Analyze for typical configuration
window_sizes = [64, 100, 150]
strides = [32, 50, 75]

total = analyze_computational_cost(
    image_width=640,
    image_height=480,
    window_sizes=window_sizes,
    strides=strides,
    num_scales=5  # 5 different image scales
)

print("\n" + "=" * 70)
print("\nConclusion:")
print("  - Processing 1000s of windows per image is SLOW")
print("  - Not suitable for real-time applications")
print("  - Most windows are background (wasted computation)")
print("  - This motivated the development of faster methods")

## Multi-Scale Pyramid

To detect objects of different sizes, we need to process the image at multiple scales:

In [None]:
def create_image_pyramid(image, num_scales=5, scale_factor=0.8):
    """
    Create an image pyramid for multi-scale detection.
    
    Args:
        image: Original image
        num_scales: Number of pyramid levels
        scale_factor: Resize factor between levels (< 1.0)
    
    Returns:
        List of scaled images
    """
    pyramid = [image]
    
    for i in range(1, num_scales):
        scale = scale_factor ** i
        width = int(image.shape[1] * scale)
        height = int(image.shape[0] * scale)
        
        if width < 50 or height < 50:  # Minimum size
            break
        
        scaled = cv2.resize(image, (width, height))
        pyramid.append(scaled)
    
    return pyramid

# Create pyramid
pyramid = create_image_pyramid(image, num_scales=5, scale_factor=0.75)

# Visualize pyramid
fig, axes = plt.subplots(1, len(pyramid), figsize=(18, 4))

for idx, (ax, scaled_img) in enumerate(zip(axes, pyramid)):
    ax.imshow(cv2.cvtColor(scaled_img, cv2.COLOR_BGR2RGB))
    h, w = scaled_img.shape[:2]
    scale = w / image.shape[1]
    ax.set_title(f"Scale {idx}\n{w}×{h}\n({scale:.2f}×)", fontweight='bold')
    ax.axis('off')

plt.suptitle("Image Pyramid for Multi-Scale Detection", fontsize=16, fontweight='bold', y=1.05)
plt.tight_layout()
plt.show()

print("\nPyramid Statistics:")
print("=" * 60)
print(f"{'Level':<8} {'Size':<15} {'Scale Factor':<15} {'Pixels':<15}")
print("=" * 60)
for idx, scaled_img in enumerate(pyramid):
    h, w = scaled_img.shape[:2]
    scale = w / image.shape[1]
    pixels = h * w
    print(f"{idx:<8} {w}×{h:<9} {scale:<15.3f} {pixels:,}")

total_pixels = sum(img.shape[0] * img.shape[1] for img in pyramid)
original_pixels = image.shape[0] * image.shape[1]
print("=" * 60)
print(f"Total pixels to process: {total_pixels:,} ({total_pixels/original_pixels:.1f}× original)")

print("\nWhy Multiple Scales?")
print("  - Fixed window size (e.g., 100×100 pixels)")
print("  - Small objects: Need large image scale")
print("  - Large objects: Need small image scale")
print("  - Solution: Process image at multiple scales")
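
The pixel overhead of a pyramid follows a geometric series: each level has `scale_factor ** 2` times the pixels of the previous one. A quick check of that relationship, ignoring the integer rounding of level sizes (the helper name `pyramid_pixel_overhead` is hypothetical):

```python
def pyramid_pixel_overhead(num_levels, scale_factor):
    """Total pyramid pixels relative to the original image, ignoring rounding."""
    r = scale_factor ** 2  # area shrinks with the square of the side scale
    return (1 - r ** num_levels) / (1 - r)

overhead_5 = pyramid_pixel_overhead(5, 0.75)
limit = 1 / (1 - 0.75 ** 2)  # bound for an arbitrarily deep pyramid
print(f"5 levels: {overhead_5:.2f}x original, upper bound: {limit:.2f}x")
```

For `scale_factor=0.75` the five-level pyramid costs a little over twice the pixels of the original image, and adding more levels can never push that past the bound. The expensive part of multi-scale detection is therefore the number of windows per level, not the resizing itself.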

## Run Detection

Let's simulate the full detection pipeline:

In [None]:
def run_sliding_window_detection(image, classifier, window_size=100, stride=50, 
                                 threshold=0.5, num_scales=3):
    """
    Run complete sliding window detection pipeline.
    
    Args:
        image: Input image
        classifier: Classifier object with predict() method
        window_size: Window size
        stride: Stride size
        threshold: Confidence threshold
        num_scales: Number of pyramid scales
    
    Returns:
        List of detections (x, y, size, confidence)
    """
    detections = []
    windows_processed = 0
    
    # Create image pyramid
    pyramid = create_image_pyramid(image, num_scales=num_scales)
    
    start_time = time.time()
    
    for scale_idx, scaled_img in enumerate(pyramid):
        scale_factor = scaled_img.shape[1] / image.shape[1]
        
        # Skip if image too small for window
        if scaled_img.shape[0] < window_size or scaled_img.shape[1] < window_size:
            continue
        
        # Slide window
        for x, y, window in sliding_window(scaled_img, window_size, stride):
            windows_processed += 1
            
            # Map the window back to original-image coordinates first: the mock
            # classifier's target regions are defined in the original frame
            orig_x = int(x / scale_factor)
            orig_y = int(y / scale_factor)
            orig_size = int(window_size / scale_factor)
            
            # Classify window
            confidence = classifier.predict(orig_x, orig_y, orig_size, image)
            
            if confidence >= threshold:
                detections.append({
                    'x': orig_x,
                    'y': orig_y,
                    'size': orig_size,
                    'confidence': confidence,
                    'scale': scale_idx
                })
    
    elapsed_time = time.time() - start_time
    
    print(f"Detection Complete:")
    print(f"  Windows processed: {windows_processed:,}")
    print(f"  Detections found: {len(detections)}")
    print(f"  Time elapsed: {elapsed_time:.3f} seconds")
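    # Guard against a zero reading when the run is faster than the timer resolution
    print(f"  FPS: {1 / elapsed_time:.2f}" if elapsed_time > 0 else "  FPS: n/a")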
    
    return detections

# Run detection
print("Running Sliding Window Detection...")
print("=" * 60)
detections = run_sliding_window_detection(
    image, classifier, 
    window_size=100, 
    stride=30,  # Small stride for better coverage
    threshold=0.6,
    num_scales=3
)

# Visualize detections
result_img = image.copy()

for det in detections:
    x, y, size, conf = det['x'], det['y'], det['size'], det['confidence']
    
    # Color based on confidence
    color = (0, int(255 * conf), int(255 * (1 - conf)))
    
    cv2.rectangle(result_img, (x, y), (x + size, y + size), color, 2)
    cv2.putText(result_img, f"{conf:.2f}", (x, y - 5),
               cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1)

plt.figure(figsize=(12, 8))
plt.imshow(cv2.cvtColor(result_img, cv2.COLOR_BGR2RGB))
plt.title(f"Detections: {len(detections)} boxes found\n(Green = high confidence, Red = low confidence)",
         fontsize=14, fontweight='bold')
plt.axis('off')
plt.tight_layout()
plt.show()

print("\nTop Detections:")
print("=" * 60)
sorted_detections = sorted(detections, key=lambda x: x['confidence'], reverse=True)
for i, det in enumerate(sorted_detections[:5], 1):
    print(f"{i}. Position: ({det['x']}, {det['y']})  "
          f"Size: {det['size']}×{det['size']}  "
          f"Confidence: {det['confidence']:.3f}  "
          f"Scale: {det['scale']}")

## Problems Summary

### Why Sliding Windows Don't Scale:

#### 1. Computational Inefficiency
```
Typical configuration:
  - Image: 640×480
  - Window sizes: 3 (64, 100, 150)
  - Scales: 5
  - Total windows: ~900 with half-window strides (as analyzed above), ~2000-3000 with denser strides
  - Processing time: 10-20 seconds per image at 5-10 ms per window
  - FPS: 0.05-0.1 (NOT real-time!)
```

#### 2. Redundant Computation
- **Problem**: Extract features from overlapping windows independently
- **Waste**: Same pixels processed many times
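- **Example**: Stride of half the window size (50% overlap along each axis) → each interior pixel lies in up to 4 windows, so its features are recomputed up to 4×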
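
The amount of repeated work can be measured directly by counting how many windows touch each pixel. The sketch below assumes the 640×480 image, 100-pixel window, and stride of 50 used earlier in this notebook; the variable names are illustrative.

```python
import numpy as np

img_h, img_w = 480, 640
win, step = 100, 50

# coverage[y, x] = number of windows that include pixel (x, y)
coverage = np.zeros((img_h, img_w), dtype=np.int32)
for y in range(0, img_h - win + 1, step):
    for x in range(0, img_w - win + 1, step):
        coverage[y:y + win, x:x + win] += 1

uncovered = int((coverage == 0).sum())
print(f"Max windows per pixel:  {coverage.max()}")
print(f"Mean windows per pixel: {coverage.mean():.2f}")
print(f"Uncovered pixels:       {uncovered}")
```

With the stride at half the window size in both directions, interior pixels fall inside four windows, so a per-window feature extractor repeats most of its work. The count of uncovered pixels also shows a second cost of a coarse grid: the last window does not reach the right and bottom borders.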

#### 3. Fixed Aspect Ratios
- **Problem**: Square windows (100×100)
- **Reality**: Objects have various aspect ratios
  - Pedestrian: 1:3 (tall and narrow)
  - Car: 3:1 (wide and short)
- **Solution needed**: Multiple aspect ratios → even more windows!
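
A rough count shows how quickly extra aspect ratios add up. The helper below is a hypothetical rectangular variant of `count_windows`, and the three window shapes are illustrative choices, not values from any particular detector.

```python
def count_rect_windows(image_width, image_height, win_w, win_h, stride):
    """Count positions of a win_w x win_h window on a regular stride grid."""
    if image_width < win_w or image_height < win_h:
        return 0
    num_x = (image_width - win_w) // stride + 1
    num_y = (image_height - win_h) // stride + 1
    return num_x * num_y

# Illustrative shapes: square, tall (1:3), wide (3:1)
shapes = {"square 100x100": (100, 100), "tall 50x150": (50, 150), "wide 150x50": (150, 50)}
counts = {name: count_rect_windows(640, 480, w, h, stride=50)
          for name, (w, h) in shapes.items()}

for name, n in counts.items():
    print(f"{name:<16} {n} windows")
print(f"Total: {sum(counts.values())} windows per scale")
```

Three shapes roughly triple the windows at every pyramid level, and the total multiplies again with the number of scales.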

#### 4. Not End-to-End
- **Classical**: Hand-crafted features → Separate classifier
- **Deep Learning**: Learn features and classification together
- **Result**: Better accuracy with learned features

#### 5. Poor Localization
- **Problem**: Discrete grid of windows
- **Limitation**: Bounding box positions quantized to stride
- **Example**: Stride=50 → window positions are multiples of 50, so an object starting at x=25 is boxed with at least a 25-pixel offset
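
The effect of a coarse grid on localization can be quantified with intersection over union (IoU). This sketch places a hypothetical 100×100 object halfway between grid positions and finds the best IoU that any window on a stride-50 grid can reach; the object position and helper name are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Hypothetical 100x100 object sitting halfway between grid positions
obj = (25, 25, 100, 100)
win, step = 100, 50

# Best IoU over every window position in a 640x480 image
best_iou = max(
    iou((x, y, win, win), obj)
    for y in range(0, 480 - win + 1, step)
    for x in range(0, 640 - win + 1, step)
)
print(f"Best achievable IoU: {best_iou:.3f}")
```

Even with a perfect classifier and a window of exactly the right size, the best box here overlaps the object with an IoU of about 0.39, below the 0.5 threshold that PASCAL VOC uses to count a detection as correct.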

### Performance Comparison

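| Method | Speed (FPS) | mAP | Year |
|--------|-------------|-----|------|
| Sliding Window + HOG | 0.1 | 0.30 | 2005 |
| Deformable Parts | 0.07 | 0.33 | 2008 |
| R-CNN | 0.05 | 0.54 | 2014 |
| Fast R-CNN | 0.5 | 0.70 | 2015 |
| Faster R-CNN | 7 | 0.73 | 2015 |
| YOLO v1 | 45 | 0.63 | 2016 |
| YOLO v8 | **60+** | 0.53 (COCO) | 2023 |

Figures are approximate and come from different hardware and benchmarks. The YOLOv8 score is consistent with COCO mAP@[0.5:0.95], a stricter metric than the PASCAL VOC mAP@0.5 typically reported for the earlier detectors, so the mAP column is not directly comparable across all rows.

### Key Insight:
Going by this table, Faster R-CNN and YOLO run tens to hundreds of times faster than the classical detectors while reaching roughly **2× the mAP**. The original R-CNN improved accuracy but was no faster than sliding windows.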

## The Solution: Modern Deep Learning Methods

### How Modern Detectors Solve These Problems:

#### 1. Region Proposal Networks (Faster R-CNN)
**Idea**: Instead of exhaustive search, predict where objects might be
- **Input**: Entire image
- **Output**: ~300 candidate regions (not 3000 windows!)
- **How**: Convolutional network predicts "objectness" at each location
- **Result**: 10× fewer regions to classify

#### 2. Single Shot Detection (YOLO, SSD)
**Idea**: Predict bounding boxes + classes in one forward pass
- **Input**: Image → Single CNN → Output: All detections
- **No sliding window**: Grid-based prediction
- **No pyramid**: Multi-scale feature maps
- **Result**: Real-time (30-60 FPS)

#### 3. Feature Pyramid Networks (FPN)
**Idea**: Build multi-scale features inside the network
- **Classical**: Resize image multiple times → Process separately
- **Modern**: Single image → Multi-scale feature maps in CNN
- **Benefit**: Share computation across scales

#### 4. End-to-End Learning
**Idea**: Learn features, proposals, and classification together
- **Classical**: HOG (fixed) → SVM (trained)
- **Modern**: All weights learned from data
- **Result**: Features optimized for detection task

### Architecture Comparison:

```
Classical Sliding Window:
  Image → Resize (5×) → Slide Window (2000×) → HOG → SVM → Detections
  Time: 10-20 seconds

YOLO:
  Image → CNN → Predictions (grid) → NMS → Detections
  Time: 0.02 seconds (50 FPS)

Faster R-CNN:
  Image → CNN → Region Proposals (300) → Classify → Detections
  Time: ~0.14 seconds (~7 FPS)
```
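
The NMS (non-maximum suppression) step in the YOLO pipeline above removes duplicate boxes that cover the same object. Sliding window detectors need it as well, since neighboring windows often fire on the same object. Below is a minimal greedy sketch in NumPy, assuming boxes in `(x1, y1, x2, y2)` format with positive area; the example boxes and scores are made up.

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS over (x1, y1, x2, y2) boxes; returns indices of kept boxes."""
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    scores = np.asarray(scores, dtype=np.float64)
    order = np.argsort(scores)[::-1]  # highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Intersection of the best box with every remaining box
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        overlap = inter / (areas[best] + areas[rest] - inter)
        # Keep only boxes that do not overlap the best box too much
        order = rest[overlap <= iou_threshold]
    return keep

# Three overlapping boxes around one object plus one separate box (made-up values)
boxes = [(100, 80, 200, 180), (110, 90, 210, 190), (90, 70, 190, 170), (400, 200, 500, 300)]
scores = [0.90, 0.75, 0.60, 0.80]
kept = non_max_suppression(boxes, scores)
print(kept)
```

The two lower-scoring boxes around the first object are suppressed, leaving one box per object. Production code would normally use a library implementation rather than this loop.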

### Key Innovations:

1. **YOLO (You Only Look Once)**:
   - Divide image into grid (e.g., 7×7)
   - Each cell predicts bounding boxes + class probabilities
   - Single CNN forward pass → All detections

2. **R-CNN Family**:
   - R-CNN: Selective Search + CNN features
   - Fast R-CNN: ROI pooling for speed
   - Faster R-CNN: Replace Selective Search with RPN

3. **Feature Reuse**:
   - Compute features once for entire image
   - Extract region features from shared feature map
   - No redundant computation!
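
The grid idea can be sketched in a few lines: the cell that contains an object's center is the one responsible for predicting it. The example below uses the two objects from this notebook's 640×480 test image. The helper name is hypothetical, and `S=7`, `B=2`, `C=20` are assumed here as the grid size, boxes per cell, and class count of the original YOLO configuration for PASCAL VOC.

```python
def responsible_cell(center_x, center_y, image_width, image_height, grid_size=7):
    """Return (row, col) of the grid cell that contains an object's center."""
    col = min(int(center_x / image_width * grid_size), grid_size - 1)
    row = min(int(center_y / image_height * grid_size), grid_size - 1)
    return row, col

# Object centers in the 640x480 test image
rect_cell = responsible_cell(160, 150, 640, 480)    # rectangle (100, 80, 120, 140)
circle_cell = responsible_cell(475, 280, 640, 480)  # circle centered at (475, 280)

# Output size: S x S cells, B boxes of 5 values (x, y, w, h, confidence), C class scores
S, B, C = 7, 2, 20
values_per_image = S * S * (B * 5 + C)
print(rect_cell, circle_cell, values_per_image)
```

Instead of classifying hundreds of windows one by one, the network emits this fixed-size grid of predictions in a single forward pass.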

### What We'll Learn Next:

**Week 14: YOLO**
- Grid-based prediction
- Anchor boxes
- Loss function design
- Real-time detection

**Week 15: R-CNN Family**
- Region Proposal Networks
- ROI pooling
- Two-stage detection
- Higher accuracy applications

## Exercise: Calculate Windows Needed

Test your understanding of computational costs:

In [None]:
print("EXERCISE: Sliding Window Calculations")
print("=" * 60)

print("\n1. Calculate number of windows for:")
print("   - Image: 800×600")
print("   - Window: 100×100")
print("   - Stride: 25")
print("   - Single scale")

# Solution:
# windows = count_windows(800, 600, 100, 25)
# print(f"\n   Answer: {windows} windows")

print("\n" + "=" * 60)
print("\n2. How many windows for multi-scale?")
print("   - Same configuration as above")
print("   - 4 scales: 1.0×, 0.8×, 0.6×, 0.4×")
print("   - Assume stride scales proportionally")

# Solution:
# scales = [1.0, 0.8, 0.6, 0.4]
# total = 0
# for scale in scales:
#     w = int(800 * scale)
#     h = int(600 * scale)
#     s = int(25 * scale)
#     total += count_windows(w, h, 100, s)
# print(f"\n   Answer: {total} windows")

print("\n" + "=" * 60)
print("\n3. Processing time estimate:")
print("   - Use your answer from question 2")
print("   - Assume 5ms per window (feature extraction + classification)")
print("   - Calculate: Total time in seconds, FPS")

# Solution:
# time_ms = total * 5
# time_sec = time_ms / 1000
# fps = 1 / time_sec
# print(f"\n   Answer: {time_sec:.2f} seconds, {fps:.3f} FPS")

print("\n" + "=" * 60)
print("\n4. Compare with YOLO:")
print("   - YOLO processes same image in 20ms")
print("   - How much faster is YOLO?")

# Solution:
# speedup = time_ms / 20
# print(f"\n   Answer: YOLO is {speedup:.1f}× faster")

print("\n" + "=" * 60)
print("\nUncomment the solution code to see answers!")

## Summary + Motivation for Week 14

### What We Learned

1. **Classical Sliding Window Approach (2000-2012)**:
   - Exhaustive search: Slide window across image
   - Hand-crafted features: HOG, SIFT, Haar
   - Multi-scale pyramid: Process image at different sizes
   - Separate classification: SVM on extracted features

2. **Computational Cost**:
   - Hundreds to thousands of windows per image (2000-3000 with dense strides)
   - 10-20 seconds processing time
   - 0.05-0.1 FPS (NOT real-time)
   - Most computation wasted on background

3. **Fundamental Problems**:
   - Too slow for real-time applications
   - Redundant computation (overlapping windows)
   - Fixed aspect ratios (square windows)
   - Not end-to-end (separate feature extraction)
   - Poor localization (discrete grid)

4. **The Deep Learning Revolution**:
   - R-CNN (2014): Region proposals + CNN features
   - YOLO (2016): Single-shot detection, real-time
   - Modern: tens to hundreds of times faster, roughly 2× the mAP

### Why This Matters

Understanding the classical approach helps you appreciate modern methods:

- **YOLO's Innovation**: Grid prediction instead of sliding windows
- **Faster R-CNN**: Learned region proposals instead of exhaustive search
- **FPN**: Multi-scale features without image pyramids
- **End-to-End**: Learn features optimized for detection

### Historical Timeline

```
2001: Viola-Jones face detector (first real-time)
2005: HOG pedestrian detector
2008: Deformable Part Models (won PASCAL VOC)
2012: AlexNet (deep learning revolution)
2014: R-CNN (region proposals + CNN features)
2015: Fast R-CNN, Faster R-CNN
2016: YOLO, SSD (real-time detection)
2017: RetinaNet, Feature Pyramid Networks
2020: EfficientDet, DETR
2023: YOLOv8, RT-DETR
```

### Next Steps: Week 14-15

**Week 14: YOLO Architecture**
- How YOLO achieves real-time detection
- Grid-based prediction mechanism
- Anchor boxes and multi-scale detection
- Loss function design
- Hands-on: Implement YOLO detector

**Week 15: R-CNN Family**
- Region Proposal Networks (RPN)
- Two-stage detection pipeline
- ROI pooling and alignment
- When to use R-CNN vs YOLO
- Hands-on: Fine-tune Faster R-CNN

### Key Takeaway

**The Problem**: Sliding windows are too slow and wasteful

**The Solution**: 
- **YOLO**: Predict all boxes in single pass (speed)
- **R-CNN**: Smart region proposals (accuracy)
- **Both**: End-to-end learning, feature reuse, multi-scale

**Result**: Real-time object detection with high accuracy!

---

**Completion Time:** ~10 minutes  
**Next Notebook:** Week 14 - YOLO Architecture and Implementation