# Notebook 05: SSD Architecture Demo

**Week 14 - Module 5: Object Detection Models**  
**Understanding Single Shot MultiBox Detector (SSD)**

## Learning Objectives
- Understand SSD (Single Shot Detector) architecture
- Compare SSD with YOLO conceptually
- Use pre-trained SSD for detection
- Analyze multi-scale feature detection
- Understand default boxes and their role

**Estimated Time:** 15 minutes  
**Prerequisites:** Basic understanding of CNNs and object detection

## What is SSD?

### SSD: Single Shot MultiBox Detector

**Published:** arXiv preprint December 2015; presented at ECCV 2016 (the same year YOLOv1 appeared at CVPR)  
**Authors:** Liu et al., University of North Carolina  
**Key Innovation:** Multi-scale feature maps for detection

### Core Principles:
1. **Single Shot**: Detect objects in one forward pass (like YOLO)
2. **MultiBox**: Multiple default boxes (anchors) at each location
3. **Multi-Scale**: Use feature maps from different network layers

### Why SSD?
- ✅ **Real-time performance**: 59 FPS with SSD300, 22 FPS with SSD512 (Titan X, as reported in the paper)
- ✅ **Good accuracy**: 74.3% mAP (SSD300), 76.8% mAP (SSD512) on PASCAL VOC2007 test
- ✅ **Multi-scale detection**: Better for objects of varying sizes
- ✅ **Flexibility**: Works with different backbones (VGG, ResNet, MobileNet)

### SSD vs YOLO (Quick Overview):
- **YOLO** (v3 and later): 3 detection scales, custom backbone, simpler architecture
- **SSD**: 6 detection scales, VGG/ResNet backbone, more complex

## SSD vs YOLO: Conceptual Comparison

| Feature | YOLO | SSD |
|---------|------|-----|
| **Architecture** | Custom (Darknet) | VGG16, ResNet, MobileNet |
| **Feature Scales** | 3 scales | 6 scales |
| **Default Boxes** | 3 anchors per cell (YOLOv3); YOLOv8 is anchor-free | 4 or 6 boxes per cell |
| **Total Anchors** | ~10,647 (YOLOv3 at 416×416) | 8,732 (SSD300) |
| **Input Size** | 640×640 (YOLOv8) | 300×300 or 512×512 |
| **Speed (FPS)** | ~45 (YOLOv8n; hardware-dependent) | 59 (SSD300), 22 (SSD512) on a Titan X |
| **mAP** | 37.3% (YOLOv8n, COCO mAP@[.5:.95]) | 74.3% (SSD300), 76.8% (SSD512) on VOC2007 at IoU 0.5; not directly comparable |
| **First Release** | 2016 (YOLOv1) | 2016 |
| **Latest Version** | 2023 (YOLOv8) | SSD with various backbones |
| **Popularity** | High (active development) | Moderate (established) |
| **Ease of Use** | Very easy (Ultralytics) | Moderate (TensorFlow/PyTorch) |
| **Best For** | General object detection | Multi-scale detection |

### Key Differences:
1. **Multi-Scale Strategy**: SSD uses more feature maps (6 vs 3)
2. **Backbone**: SSD flexible (VGG, ResNet), YOLO custom
3. **Default Boxes**: SSD has more aspect ratios per location
4. **Training**: YOLO easier to train, SSD requires more tuning
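
The anchor totals quoted in the table can be reproduced with a little arithmetic. The sketch below assumes the ~10,647 YOLO figure refers to YOLOv3 at a 416×416 input (13×13, 26×26 and 52×52 grids with 3 anchors per cell); the table does not state this explicitly, so treat it as an assumption. The helper name `count_boxes` is made up for illustration.

```python
# Assumption: the ~10,647 YOLO figure corresponds to YOLOv3 at 416x416,
# i.e. 13x13, 26x26 and 52x52 grids with 3 anchors per cell.
def count_boxes(grid_sizes, boxes_per_cell):
    """Sum grid * grid * boxes over all detection scales."""
    return sum(g * g * b for g, b in zip(grid_sizes, boxes_per_cell))

yolo_v3_total = count_boxes([13, 26, 52], [3, 3, 3])
ssd300_total = count_boxes([38, 19, 10, 5, 3, 1], [4, 6, 6, 6, 4, 4])

print(f"YOLOv3 (416x416): {yolo_v3_total}")
print(f"SSD300:           {ssd300_total}")
```

Although SSD uses twice as many feature maps, most of its boxes come from the single 38×38 layer, so the totals end up in the same range.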

## SSD Architecture Diagram

```
Input Image (300×300×3)
        |
        v
   VGG-16 Base Network
        |
        |--- Conv4_3 (38×38×512) --> Detections (4 boxes/cell = 5,776 boxes)
        |
        v
        |--- FC7 (19×19×1024) -----> Detections (6 boxes/cell = 2,166 boxes)
        |
        v
   Extra Feature Layers
        |
        |--- Conv8_2 (10×10×512) --> Detections (6 boxes/cell = 600 boxes)
        |
        |--- Conv9_2 (5×5×256) ----> Detections (6 boxes/cell = 150 boxes)
        |
        |--- Conv10_2 (3×3×256) ---> Detections (4 boxes/cell = 36 boxes)
        |
        |--- Conv11_2 (1×1×256) ---> Detections (4 boxes/cell = 4 boxes)
        |
        v
   Total: 8,732 default boxes
        |
        v
   Non-Maximum Suppression (NMS)
        |
        v
   Final Detections
```

### Layer-by-Layer Breakdown:
1. **Conv4_3**: Detects small objects (early features, high resolution)
2. **FC7**: Detects medium objects
3. **Conv8_2 - Conv11_2**: Detect progressively larger objects
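
To make "high resolution" versus "low resolution" concrete, the following sketch estimates how much of a 300×300 input each feature-map cell covers. It uses `input_size / grid_size` as a rough stride, which is a simplification (it ignores padding and the true receptive field), and the helper name `cell_footprint` is hypothetical.

```python
# Illustrative only: approximate the image region covered by one grid cell,
# using input_size / grid_size as a rough stride.
INPUT_SIZE = 300
LAYERS = [("Conv4_3", 38), ("FC7", 19), ("Conv8_2", 10),
          ("Conv9_2", 5), ("Conv10_2", 3), ("Conv11_2", 1)]

def cell_footprint(input_size, grid_size):
    """Approximate side length (in pixels) of the image region per grid cell."""
    return input_size / grid_size

for name, grid in LAYERS:
    side = cell_footprint(INPUT_SIZE, grid)
    print(f"{name:<9} {grid:>2}x{grid:<2} grid -> ~{side:.1f} px per cell")
```

A Conv4_3 cell spans roughly 8 pixels, while the single Conv11_2 cell spans the whole image, which is why the later layers are responsible for the largest objects.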

### Why Multi-Scale?
- **Small objects** (e.g., person far away): Detected in high-resolution layers (38×38)
- **Large objects** (e.g., car close-up): Detected in low-resolution layers (1×1)
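
The architecture diagram above ends with Non-Maximum Suppression, which collapses the many overlapping candidates into final detections. Below is a minimal NumPy sketch of greedy, single-class NMS. The function names are hypothetical, the 0.45 overlap threshold is only an illustrative default, and real pipelines usually run NMS per class with an optimized library routine.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all [ymin, xmin, ymax, xmax]."""
    ymin = np.maximum(box[0], boxes[:, 0])
    xmin = np.maximum(box[1], boxes[:, 1])
    ymax = np.minimum(box[2], boxes[:, 2])
    xmax = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(ymax - ymin, 0, None) * np.clip(xmax - xmin, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, iou_threshold=0.45):
    """Return indices of the boxes kept by greedy NMS, highest score first."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        # Drop every remaining box that overlaps the kept box too much
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

boxes = np.array([[0.10, 0.10, 0.50, 0.50],
                  [0.12, 0.12, 0.52, 0.52],   # near-duplicate of the first box
                  [0.60, 0.60, 0.90, 0.90]])
scores = np.array([0.9, 0.8, 0.7])
kept = greedy_nms(boxes, scores)
print(kept)
```

In this toy input the second box is suppressed because it almost coincides with the higher-scoring first box, while the third box survives because it does not overlap either.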

## Setup and Dependencies

In [None]:
# Install dependencies
!pip install -q tensorflow tensorflow-hub opencv-python matplotlib numpy pillow

import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import matplotlib.pyplot as plt
import cv2
from PIL import Image
import urllib.request
from pathlib import Path

print(f"TensorFlow version: {tf.__version__}")
# tf.test.is_gpu_available() is deprecated; list the physical GPU devices instead
print(f"GPU available: {bool(tf.config.list_physical_devices('GPU'))}")
print("\n✅ Setup complete!")

## Load Pre-trained SSD Model

We'll use SSD MobileNet V2 from TensorFlow Hub (trained on COCO dataset).

In [None]:
# Load pre-trained SSD MobileNet V2 from TensorFlow Hub
print("üì• Loading SSD MobileNet V2 model...")

model_url = "https://tfhub.dev/tensorflow/ssd_mobilenet_v2/2"
detector = hub.load(model_url)

print("✅ Model loaded successfully!")
print("\n📊 Model Details:")
print("  Architecture: SSD MobileNet V2")
print("  Trained on: COCO dataset (80 classes, label IDs 1-90)")
print("  Input size: 320×320 (flexible)")
print("  Speed: ~25 FPS on CPU (varies with hardware)")

# COCO class names indexed by (label ID - 1). The model returns the original
# COCO label IDs (1-90), which skip ten unused IDs; 'N/A' keeps the list aligned.
COCO_CLASSES = [
    'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat',
    'traffic light', 'fire hydrant', 'N/A', 'stop sign', 'parking meter', 'bench', 'bird',
    'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'N/A',
    'backpack', 'umbrella', 'N/A', 'N/A', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis',
    'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard',
    'surfboard', 'tennis racket', 'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife',
    'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
    'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A',
    'dining table', 'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote',
    'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A',
    'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]

# Count only the real classes, not the placeholders
num_real_classes = sum(name != 'N/A' for name in COCO_CLASSES)
print(f"\n  Classes: {num_real_classes} (person, car, dog, etc.)")

## Run SSD Detection on Sample Images

In [None]:
# Download sample test images
test_images = [
    ('https://ultralytics.com/images/bus.jpg', 'bus.jpg'),
    ('https://ultralytics.com/images/zidane.jpg', 'zidane.jpg'),
]

for url, filename in test_images:
    if not Path(filename).exists():
        urllib.request.urlretrieve(url, filename)
        print(f"✅ Downloaded: {filename}")

print("\n📸 Test images ready!")

In [None]:
def run_ssd_detection(image_path, confidence_threshold=0.5):
    """
    Run SSD object detection on an image.
    
    Args:
        image_path: Path to input image
        confidence_threshold: Minimum confidence score (0-1)
    
    Returns:
        detections: Dictionary with boxes, scores, classes
    """
    # Load image as 3-channel RGB (guards against grayscale or RGBA inputs)
    image = Image.open(image_path).convert('RGB')
    image_np = np.array(image)
    
    # Convert to tensor
    input_tensor = tf.convert_to_tensor(image_np)
    input_tensor = input_tensor[tf.newaxis, ...]
    
    # Run detection
    detections = detector(input_tensor)
    
    # Extract results
    num_detections = int(detections.pop('num_detections'))
    detections = {key: value[0, :num_detections].numpy()
                  for key, value in detections.items()}
    detections['num_detections'] = num_detections
    
    # Filter by confidence
    indices = detections['detection_scores'] >= confidence_threshold
    
    return {
        'boxes': detections['detection_boxes'][indices],
        'scores': detections['detection_scores'][indices],
        'classes': detections['detection_classes'][indices].astype(int),
        'image': image_np
    }

def visualize_detections(detections, class_names=COCO_CLASSES):
    """
    Visualize SSD detection results.
    """
    image = detections['image'].copy()
    h, w = image.shape[:2]
    
    # Draw bounding boxes
    for box, score, class_id in zip(detections['boxes'], detections['scores'], detections['classes']):
        ymin, xmin, ymax, xmax = box
        left, right, top, bottom = int(xmin * w), int(xmax * w), int(ymin * h), int(ymax * h)
        
        # Draw box
        cv2.rectangle(image, (left, top), (right, bottom), (0, 255, 0), 2)
        
        # Draw label
        class_name = class_names[class_id - 1] if class_id <= len(class_names) else f'Class {class_id}'
        label = f'{class_name}: {score:.2f}'
        cv2.putText(image, label, (left, top - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    
    return image

# Run detection on test images
fig, axes = plt.subplots(1, 2, figsize=(15, 7))

for idx, (_, filename) in enumerate(test_images):
    print(f"\nüîç Detecting objects in {filename}...")
    
    detections = run_ssd_detection(filename, confidence_threshold=0.5)
    vis_image = visualize_detections(detections)
    
    axes[idx].imshow(cv2.cvtColor(vis_image, cv2.COLOR_BGR2RGB))
    axes[idx].set_title(f'{filename} - {len(detections["boxes"])} detections', fontsize=12, fontweight='bold')
    axes[idx].axis('off')
    
    # Print detection details
    print(f"  Found {len(detections['boxes'])} objects:")
    for box, score, class_id in zip(detections['boxes'], detections['scores'], detections['classes']):
        class_name = COCO_CLASSES[class_id - 1] if class_id <= len(COCO_CLASSES) else f'Class {class_id}'
        print(f"    - {class_name}: {score:.2f}")

plt.tight_layout()
plt.show()

## Multi-Scale Detection Visualization

Let's understand how SSD detects objects at different scales.

In [None]:
# Visualize multi-scale detection concept
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('SSD Multi-Scale Feature Maps', fontsize=16, fontweight='bold')

# Define feature map sizes for SSD300
feature_maps = [
    ('Conv4_3', 38, 'Small objects'),
    ('FC7', 19, 'Medium objects'),
    ('Conv8_2', 10, 'Medium-large objects'),
    ('Conv9_2', 5, 'Large objects'),
    ('Conv10_2', 3, 'Very large objects'),
    ('Conv11_2', 1, 'Huge objects'),
]

# Create visualization for each feature map
for idx, (name, size, description) in enumerate(feature_maps):
    ax = axes[idx // 3, idx % 3]
    
    # Create grid
    grid = np.zeros((size, size))
    
    # Highlight some cells
    if size > 1:
        grid[size//4:3*size//4, size//4:3*size//4] = 0.5
    else:
        grid[0, 0] = 0.5
    
    # Plot
    ax.imshow(grid, cmap='viridis', interpolation='nearest')
    ax.set_title(f'{name}\n{size}×{size}\n({description})', fontsize=11, fontweight='bold')
    ax.set_xticks([])
    ax.set_yticks([])
    
    # Add grid lines
    for i in range(size + 1):
        ax.axhline(i - 0.5, color='white', linewidth=0.5)
        ax.axvline(i - 0.5, color='white', linewidth=0.5)
    
    # Add text
    boxes_per_cell = 4 if size in [38, 3, 1] else 6
    total_boxes = size * size * boxes_per_cell
    ax.text(0.5, -0.15, f'{total_boxes} default boxes', 
            ha='center', transform=ax.transAxes, fontsize=10)

plt.tight_layout()
plt.show()

print("\nüìä Feature Map Summary:")
print("  Early layers (38√ó38): High resolution ‚Üí Detect small objects")
print("  Middle layers (19√ó19, 10√ó10): Medium resolution ‚Üí Detect medium objects")
print("  Late layers (5√ó5, 3√ó3, 1√ó1): Low resolution ‚Üí Detect large objects")
print("\n  Total default boxes: 8,732 (across all scales)")

## Default Boxes Explanation

### What are Default Boxes?
Default boxes (also called **anchor boxes** or **priors**) are predefined bounding boxes:
- Fixed sizes and aspect ratios
- Placed at each cell in feature maps
- Network predicts **offsets** from these boxes

### Default Box Configuration:

| Feature Map | Size | Boxes/Cell | Aspect Ratios | Total Boxes |
|-------------|------|------------|---------------|-------------|
| Conv4_3 | 38×38 | 4 | 1:1, 1:2, 2:1, 1:1 (extra) | 5,776 |
| FC7 | 19×19 | 6 | 1:1, 1:2, 2:1, 1:3, 3:1, 1:1 (extra) | 2,166 |
| Conv8_2 | 10×10 | 6 | 1:1, 1:2, 2:1, 1:3, 3:1, 1:1 (extra) | 600 |
| Conv9_2 | 5×5 | 6 | 1:1, 1:2, 2:1, 1:3, 3:1, 1:1 (extra) | 150 |
| Conv10_2 | 3×3 | 4 | 1:1, 1:2, 2:1, 1:1 (extra) | 36 |
| Conv11_2 | 1×1 | 4 | 1:1, 1:2, 2:1, 1:1 (extra) | 4 |
| **Total** | | | | **8,732** |

### Why So Many Boxes?
- **Coverage**: Ensure at least one box overlaps with every object
- **Aspect Ratios**: Match different object shapes (person=tall, car=wide)
- **Scales**: Match different object sizes
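
The scales and aspect ratios above translate into concrete box shapes. The sketch below follows the formulas given in the SSD paper: scales are spaced linearly between `s_min = 0.2` and `s_max = 0.9`, each box has width `s * sqrt(ar)` and height `s / sqrt(ar)`, and an extra square box uses the geometric mean of two consecutive scales. The function names are hypothetical, and released implementations may use somewhat different scale values, so treat the numbers as illustrative.

```python
import math

def layer_scales(num_layers=6, s_min=0.2, s_max=0.9):
    """Linearly spaced box scales (fraction of the image side), one per feature map."""
    step = (s_max - s_min) / (num_layers - 1)
    return [s_min + step * k for k in range(num_layers)]

def default_box_shapes(scale, next_scale, aspect_ratios):
    """(width, height) pairs for one cell; aspect ratio means width / height."""
    shapes = [(scale * math.sqrt(ar), scale / math.sqrt(ar)) for ar in aspect_ratios]
    extra = math.sqrt(scale * next_scale)   # the extra, slightly larger square box
    shapes.append((extra, extra))
    return shapes

scales = layer_scales()
# Second feature map (FC7): five aspect ratios plus the extra square = 6 boxes
shapes = default_box_shapes(scales[1], scales[2], [1.0, 2.0, 0.5, 3.0, 1 / 3])
for w, h in shapes:
    print(f"w={w:.3f}, h={h:.3f}")
```

Every box built from the same scale has the same area; only its shape changes, which is what lets one cell cover tall, wide and square objects of similar size.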

### Training Process:
1. Match default boxes to ground truth objects (IoU > 0.5)
2. Predict **class scores** for each box
3. Predict **offset adjustments** (Δx, Δy, Δw, Δh)
4. Balance the loss with hard negative mining (at most about 3 negatives per positive); Non-Maximum Suppression (NMS) is applied at inference time to remove duplicates
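
The matching and offset steps above can be sketched in NumPy. This is a simplified illustration, not the reference implementation: boxes are assumed to be `[xmin, ymin, xmax, ymax]` in normalized coordinates, the function names are hypothetical, and the variance scaling that many implementations apply to the offsets is omitted.

```python
import numpy as np

def iou_matrix(defaults, truths):
    """Pairwise IoU; boxes are [xmin, ymin, xmax, ymax] in normalized coordinates."""
    lt = np.maximum(defaults[:, None, :2], truths[None, :, :2])
    rb = np.minimum(defaults[:, None, 2:], truths[None, :, 2:])
    wh = np.clip(rb - lt, 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_d = np.prod(defaults[:, 2:] - defaults[:, :2], axis=1)
    area_t = np.prod(truths[:, 2:] - truths[:, :2], axis=1)
    return inter / (area_d[:, None] + area_t[None, :] - inter)

def match_and_encode(defaults, truths, iou_threshold=0.5):
    """Return (matched truth index or -1, regression targets) per default box."""
    iou = iou_matrix(defaults, truths)
    matched = np.where(iou.max(axis=1) > iou_threshold, iou.argmax(axis=1), -1)
    # Each ground truth also claims its single best default box
    matched[iou.argmax(axis=0)] = np.arange(len(truths))

    d_wh = defaults[:, 2:] - defaults[:, :2]
    d_c = defaults[:, :2] + d_wh / 2
    g = truths[np.maximum(matched, 0)]
    g_wh = g[:, 2:] - g[:, :2]
    g_c = g[:, :2] + g_wh / 2
    # Offsets: center shift relative to box size, log ratio of width and height
    offsets = np.hstack([(g_c - d_c) / d_wh, np.log(g_wh / d_wh)])
    offsets[matched < 0] = 0.0   # background boxes carry no regression target
    return matched, offsets

defaults = np.array([[0.1, 0.1, 0.5, 0.5],
                     [0.5, 0.5, 0.9, 0.9],
                     [0.0, 0.0, 0.2, 0.2]])
truths = np.array([[0.12, 0.1, 0.52, 0.5]])
matched, offsets = match_and_encode(defaults, truths)
print(matched)
print(offsets[0])
```

Only the first default box overlaps the ground truth strongly enough to become a positive; its target is a small horizontal shift and no change in size, while the other two boxes are treated as background.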

In [None]:
# Visualize default boxes at different aspect ratios
fig, axes = plt.subplots(1, 5, figsize=(20, 4))
fig.suptitle('SSD Default Box Aspect Ratios', fontsize=16, fontweight='bold')

aspect_ratios = [
    (1, 1, '1:1 (Square)'),
    (1, 2, '1:2 (Tall)'),
    (2, 1, '2:1 (Wide)'),
    (1, 3, '1:3 (Very Tall)'),
    (3, 1, '3:1 (Very Wide)'),
]

for idx, (w, h, label) in enumerate(aspect_ratios):
    ax = axes[idx]
    
    # Draw box
    box_w, box_h = w * 0.3, h * 0.3
    rect = plt.Rectangle((0.5 - box_w/2, 0.5 - box_h/2), box_w, box_h, 
                          fill=False, edgecolor='red', linewidth=3)
    ax.add_patch(rect)
    
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_aspect('equal')
    ax.set_title(label, fontsize=12, fontweight='bold')
    ax.axis('off')

plt.tight_layout()
plt.show()

print("\nüìè Aspect Ratio Examples:")
print("  1:1 - General objects, cars")
print("  1:2 - People, bottles")
print("  2:1 - Buses, trains")
print("  1:3 - Tall buildings, lampposts")
print("  3:1 - Horizontal signs, benches")

## SSD Performance Metrics

### Official SSD Results (2016 Paper):

| Model | Input Size | mAP (PASCAL VOC2007 test) | FPS (Titan X) |
|-------|-----------|---------------------------|---------------|
| SSD300 | 300×300 | 74.3% | 59 |
| SSD512 | 512×512 | 76.8% | 22 |

### Comparison with Other Detectors (2016):

| Model | mAP (VOC2007) | FPS | Notes |
|-------|-----|-----|-------|
| Faster R-CNN | 73.2% | 7 | Two-stage (slow) |
| YOLOv1 | 63.4% | 45 | Single-stage (fast) |
| SSD300 | 74.3% | 59 | **Best speed/accuracy trade-off** |
| SSD512 | 76.8% | 22 | Higher accuracy, slower |

### Modern SSD Variants:
- **SSD MobileNet**: Lightweight for mobile devices (~25 FPS on CPU)
- **SSD ResNet**: Higher accuracy with deeper backbone
- **SSDLite**: Optimized for mobile deployment

## SSD Strengths and Weaknesses

### Strengths:
✅ **Real-time performance**: 59 FPS (SSD300), faster than Faster R-CNN  
✅ **Good accuracy**: Competitive with two-stage detectors  
✅ **Multi-scale detection**: 6 feature maps → better for varying object sizes  
✅ **Flexible backbone**: Works with VGG, ResNet, MobileNet  
✅ **End-to-end training**: Single network, no region proposals  
✅ **Well-established**: Production-ready, TensorFlow/PyTorch support  

### Weaknesses:
❌ **Struggles with small objects**: Despite multi-scale, small objects (<5% image) are challenging  
❌ **More complex than YOLO**: 8,732 default boxes, harder to tune  
❌ **Imbalanced training**: Many negative (background) boxes, requires hard negative mining  
❌ **Fixed input size**: Requires resizing images (300×300 or 512×512)  
❌ **Less active development**: Surpassed by YOLO, EfficientDet in recent years  
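
Hard negative mining, mentioned in the weaknesses above, keeps only the background boxes the classifier finds hardest. The sketch below ranks negatives by their confidence loss and keeps at most three per positive, the ratio reported in the SSD paper. The function name and the toy loss values are made up for illustration.

```python
import numpy as np

def hard_negative_mask(conf_loss, is_positive, neg_pos_ratio=3):
    """Boolean mask selecting all positives plus the highest-loss negatives."""
    num_pos = int(is_positive.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~is_positive).sum()))
    # Exclude positives from the ranking by giving them the lowest possible loss
    neg_loss = np.where(is_positive, -np.inf, conf_loss)
    hardest = np.argsort(neg_loss)[::-1][:num_neg]
    mask = is_positive.copy()
    mask[hardest] = True
    return mask

# Toy example: one positive box and seven background boxes
conf_loss = np.array([0.2, 2.5, 0.1, 1.7, 0.05, 0.9, 0.3, 0.01])
is_positive = np.array([True, False, False, False, False, False, False, False])
mask = hard_negative_mask(conf_loss, is_positive)
print(mask.nonzero()[0])
```

Only the boxes selected by the mask would contribute to the classification loss, so the thousands of easy background boxes no longer dominate training.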

### When to Use SSD:
- ✅ Need multi-scale detection (objects of varying sizes)
- ✅ Using TensorFlow ecosystem
- ✅ Production deployment (stable, well-tested)
- ✅ Mobile deployment (SSD MobileNet)

### When to Use YOLO Instead:
- ✅ Need latest state-of-the-art accuracy
- ✅ Easier training and fine-tuning
- ✅ Active community and updates
- ✅ Simpler architecture

## Exercise: Compare SSD and YOLO on Same Image

Let's run both SSD and YOLOv8 on the same image and compare results.

In [None]:
# Install YOLO for comparison
!pip install -q ultralytics

from ultralytics import YOLO

# Load YOLOv8
yolo_model = YOLO('yolov8n.pt')

# Compare on test image
test_image = 'bus.jpg'

print("üîç Running SSD detection...")
ssd_detections = run_ssd_detection(test_image, confidence_threshold=0.5)
ssd_vis = visualize_detections(ssd_detections)

print("üîç Running YOLO detection...")
yolo_results = yolo_model(test_image)
yolo_vis = yolo_results[0].plot()

# Visualize side-by-side
fig, axes = plt.subplots(1, 2, figsize=(15, 7))

# ssd_vis comes from a PIL-loaded image and is already RGB
axes[0].imshow(ssd_vis)
axes[0].set_title(f'SSD MobileNet V2\n{len(ssd_detections["boxes"])} detections', 
                  fontsize=14, fontweight='bold')
axes[0].axis('off')

axes[1].imshow(cv2.cvtColor(yolo_vis, cv2.COLOR_BGR2RGB))
axes[1].set_title(f'YOLOv8n\n{len(yolo_results[0].boxes)} detections', 
                  fontsize=14, fontweight='bold')
axes[1].axis('off')

plt.tight_layout()
plt.show()

print("\nüìä Comparison:")
print(f"  SSD detections: {len(ssd_detections['boxes'])}")
print(f"  YOLO detections: {len(yolo_results[0].boxes)}")
print("\n  Both models detect similar objects, but may differ in:")
print("    - Confidence scores")
print("    - Bounding box precision")
print("    - Small object detection")

## Summary

### What We Learned:
1. ✅ **SSD Architecture**: Multi-scale detection with 6 feature maps
2. ✅ **Default Boxes**: 8,732 boxes with multiple aspect ratios
3. ✅ **SSD vs YOLO**: Conceptual and architectural differences
4. ✅ **Multi-Scale Detection**: How SSD handles objects of varying sizes
5. ✅ **Performance**: Real-time speed (59 FPS) with good accuracy (74.3% mAP on PASCAL VOC2007)

### Key Takeaways:
- **SSD = Multi-scale + Single-shot**: Combines speed and accuracy
- **6 feature maps**: 38×38 (small objects) → 1×1 (large objects)
- **8,732 default boxes**: Ensure comprehensive coverage
- **Trade-off**: SSD512 more accurate, SSD300 faster
- **Production-ready**: TensorFlow/PyTorch support, mobile deployment

### SSD vs YOLO Decision:
- **Choose SSD**: TensorFlow ecosystem, multi-scale focus, mobile deployment
- **Choose YOLO**: Latest accuracy, easier training, active development

### Next Steps:
- **Notebook 06**: Comprehensive YOLO vs SSD benchmark comparison
- **Practice**: Train SSD on custom dataset (similar to YOLOv8 training)

---

**Great job! You now understand SSD architecture and multi-scale detection! 🎉**