# Notebook 04: Evaluation Metrics (mAP)

**Course:** Deep Neural Network Architectures (21CSE558T)  
**Module 5:** Object Detection and Localization  
**Week 13:** Object Localization Fundamentals  
**Duration:** ~15 minutes

## Learning Objectives
By the end of this notebook, you will be able to:
- Calculate Precision and Recall for object detection
- Understand and compute Average Precision (AP)
- Calculate mean Average Precision (mAP)
- Interpret mAP values for model comparison
- Understand IoU-based detection evaluation

## Introduction
Unlike classification, object detection requires evaluating both:
1. **Localization accuracy**: How well the box aligns with ground truth (IoU)
2. **Classification accuracy**: Correct class prediction

**mAP (mean Average Precision)** is the standard metric for object detection, used by:
- PASCAL VOC Challenge
- MS COCO Competition
- Most modern detection papers (YOLO, Faster R-CNN, SSD, etc.)

## Setup and Imports

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from typing import List, Tuple, Dict

# Set random seed for reproducibility
np.random.seed(42)

# Configure matplotlib
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")

## What Makes a Good Detection?

### IoU (Intersection over Union) Threshold

A detection is considered **correct** (True Positive) if:
1. **Class matches** ground truth
2. **IoU ≥ threshold** (0.5 for PASCAL VOC; COCO averages results over thresholds from 0.5 to 0.95)

```
IoU = Area of Overlap / Area of Union
```
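
As a quick worked example of the formula, the sketch below computes IoU for two axis-aligned boxes in `[x_min, y_min, x_max, y_max]` format. The helper name `iou` and the box values are illustrative assumptions, not part of any library.

```python
def iou(box_a, box_b):
    """IoU of two [x_min, y_min, x_max, y_max] boxes."""
    # Width and height of the overlap region (zero if the boxes do not overlap)
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

gt_box = [100, 50, 250, 200]     # 150 x 150 ground-truth box
pred_box = [105, 55, 245, 195]   # 140 x 140 prediction, fully inside it

example_iou = iou(gt_box, pred_box)
print(f"IoU = {example_iou:.3f}")  # overlap 19600 / union 22500
```

Because the prediction lies entirely inside the ground-truth box, the union equals the ground-truth area, so the IoU is simply the ratio of the two areas.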

### Classification of Detections:

| Term | Definition | Example |
|------|------------|--------|
| **True Positive (TP)** | Correct detection with IoU ≥ threshold | Detected car with IoU=0.75 |
| **False Positive (FP)** | Wrong detection (wrong class or IoU < threshold) | Background detected as car, or IoU=0.3 |
| **False Negative (FN)** | Missed ground truth object | Car in image but not detected |

**Note:** True Negative (TN) is not defined in object detection, because the number of background regions a detector correctly ignores is effectively unbounded.
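
One consequence of these definitions is that each ground-truth object can be matched only once, so a second detection of the same object counts as a false positive. The sketch below illustrates this with a simple greedy matcher for a single class; the names (`box_iou`, `demo_gt`, `demo_dets`) and the box values are hypothetical.

```python
def box_iou(a, b):
    """IoU of two [x_min, y_min, x_max, y_max] boxes."""
    w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

demo_gt = [[100, 50, 250, 200]]            # one ground-truth car
demo_dets = [                              # (confidence, box), all class 'car'
    (0.90, [105, 55, 245, 195]),           # tight box on the car
    (0.80, [102, 52, 248, 198]),           # second box on the same car
    (0.60, [400, 300, 450, 350]),          # box on background
]

claimed = set()    # indices of ground-truth boxes already matched
demo_labels = []
for conf, box in sorted(demo_dets, key=lambda d: d[0], reverse=True):
    # Best IoU against ground-truth boxes that are still unclaimed
    candidates = [(box_iou(box, gt), i) for i, gt in enumerate(demo_gt) if i not in claimed]
    best_iou, best_idx = max(candidates, default=(0.0, -1))
    if best_iou >= 0.5:
        claimed.add(best_idx)
        demo_labels.append('TP')
    else:
        demo_labels.append('FP')

demo_fn = len(demo_gt) - len(claimed)      # ground-truth objects never matched
print(demo_labels, "FN =", demo_fn)
```

The second detection overlaps the car well, but the car is already claimed by the higher-confidence detection, so it is labelled FP.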

### Confidence Threshold
- Detectors output confidence scores (0-1)
- By varying threshold, we get different precision-recall trade-offs
- Higher threshold → fewer detections → higher precision, lower recall
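
To make the effect of the threshold concrete, the sketch below counts how many detections survive at several thresholds. The scores in `demo_scores` are hypothetical values chosen for illustration.

```python
# Hypothetical confidence scores for eight detections (illustrative values)
demo_scores = [0.95, 0.92, 0.88, 0.85, 0.78, 0.65, 0.45, 0.35]

# Count how many detections survive each confidence threshold
kept_counts = {t: sum(s >= t for s in demo_scores) for t in (0.3, 0.5, 0.7, 0.9)}

for t, n in kept_counts.items():
    print(f"threshold {t:.1f}: {n} detections kept")
```

Raising the threshold from 0.3 to 0.9 shrinks the set of reported detections from eight to two, which is what drives the precision-recall trade-off.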

## Sample Detections

Let's create synthetic predictions and ground truth for demonstration:

In [None]:
# Ground truth objects (what's actually in the image)
ground_truth = [
    {'class': 'car', 'bbox': [100, 50, 250, 200]},
    {'class': 'car', 'bbox': [300, 100, 450, 250]},
    {'class': 'person', 'bbox': [50, 150, 120, 350]},
    {'class': 'dog', 'bbox': [400, 300, 500, 400]},
    {'class': 'person', 'bbox': [200, 250, 280, 400]}
]

# Model predictions (class, confidence, bbox)
predictions = [
    {'class': 'car', 'confidence': 0.95, 'bbox': [105, 55, 245, 195]},      # Good car detection
    {'class': 'car', 'confidence': 0.88, 'bbox': [295, 105, 455, 245]},     # Good car detection
    {'class': 'person', 'confidence': 0.92, 'bbox': [48, 155, 118, 345]},   # Good person detection
    {'class': 'dog', 'confidence': 0.78, 'bbox': [405, 295, 495, 405]},     # Good dog detection
    {'class': 'person', 'confidence': 0.85, 'bbox': [198, 255, 282, 395]},  # Good person detection
    {'class': 'car', 'confidence': 0.65, 'bbox': [500, 50, 600, 150]},      # False positive (no GT)
    {'class': 'person', 'confidence': 0.45, 'bbox': [150, 100, 200, 180]},  # False positive (no overlap with any person GT)
    {'class': 'dog', 'confidence': 0.35, 'bbox': [250, 350, 320, 420]},     # False positive (wrong class/loc)
]

print("Ground Truth Objects:")
print("=" * 60)
for i, gt in enumerate(ground_truth, 1):
    print(f"{i}. {gt['class']:<8} at {gt['bbox']}")

print("\nModel Predictions:")
print("=" * 60)
for i, pred in enumerate(predictions, 1):
    print(f"{i}. {pred['class']:<8} Confidence: {pred['confidence']:.2f} at {pred['bbox']}")

print(f"\nTotal GT objects: {len(ground_truth)}")
print(f"Total predictions: {len(predictions)}")

## Classification with IoU

First, we need the IoU function from Notebook 02:

In [None]:
def calculate_iou(box1, box2):
    """
    Calculate Intersection over Union (IoU) for two boxes.
    
    Args:
        box1, box2: [x_min, y_min, x_max, y_max]
    
    Returns:
        IoU value (0 to 1)
    """
    # Calculate intersection
    x_min_inter = max(box1[0], box2[0])
    y_min_inter = max(box1[1], box2[1])
    x_max_inter = min(box1[2], box2[2])
    y_max_inter = min(box1[3], box2[3])
    
    # Check if boxes intersect
    if x_max_inter < x_min_inter or y_max_inter < y_min_inter:
        return 0.0
    
    intersection = (x_max_inter - x_min_inter) * (y_max_inter - y_min_inter)
    
    # Calculate union
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection
    
    return intersection / union if union > 0 else 0.0

def match_predictions_to_gt(predictions, ground_truth, iou_threshold=0.5):
    """
    Match predictions to ground truth and classify as TP or FP.
    
    Args:
        predictions: List of prediction dicts
        ground_truth: List of ground truth dicts
        iou_threshold: IoU threshold for TP
    
    Returns:
        (results, fn_count): per-prediction classification results and
        the number of unmatched ground-truth objects (false negatives)
    """
    results = []
    matched_gt = set()  # Track which GT objects have been matched
    
    # Sort predictions by confidence (highest first)
    sorted_preds = sorted(predictions, key=lambda x: x['confidence'], reverse=True)
    
    for pred in sorted_preds:
        best_iou = 0
        best_gt_idx = -1
        
        # Find best matching ground truth
        for gt_idx, gt in enumerate(ground_truth):
            # Skip if already matched or different class
            if gt_idx in matched_gt or gt['class'] != pred['class']:
                continue
            
            iou = calculate_iou(pred['bbox'], gt['bbox'])
            if iou > best_iou:
                best_iou = iou
                best_gt_idx = gt_idx
        
        # Classify as TP or FP
        if best_iou >= iou_threshold:
            classification = 'TP'
            matched_gt.add(best_gt_idx)
        else:
            classification = 'FP'
        
        results.append({
            'class': pred['class'],
            'confidence': pred['confidence'],
            'classification': classification,
            'iou': best_iou
        })
    
    # Count false negatives (unmatched GT objects)
    fn_count = len(ground_truth) - len(matched_gt)
    
    return results, fn_count

# Match predictions to ground truth
results, fn_count = match_predictions_to_gt(predictions, ground_truth, iou_threshold=0.5)

# Display results
print("Detection Classification (IoU threshold = 0.5):")
print("=" * 80)
print(f"{'Class':<10} {'Confidence':<12} {'Classification':<15} {'IoU':<10}")
print("=" * 80)

for result in results:
    symbol = "✓" if result['classification'] == 'TP' else "✗"
    print(f"{symbol} {result['class']:<8} {result['confidence']:<12.2f} "
          f"{result['classification']:<15} {result['iou']:<10.3f}")

tp_count = sum(1 for r in results if r['classification'] == 'TP')
fp_count = sum(1 for r in results if r['classification'] == 'FP')

print("\n" + "=" * 80)
print(f"True Positives (TP): {tp_count}")
print(f"False Positives (FP): {fp_count}")
print(f"False Negatives (FN): {fn_count}")
print(f"Total GT objects: {len(ground_truth)}")

## Precision and Recall

### Definitions:

**Precision**: Of all detections made, how many were correct?
```
Precision = TP / (TP + FP)
```
- High precision → Few false alarms
- Example: 0.80 means 80% of detections are correct

**Recall**: Of all ground truth objects, how many were detected?
```
Recall = TP / (TP + FN)
```
- High recall → Few missed objects
- Example: 0.90 means 90% of objects were found

### Trade-off:
- Lower confidence threshold → More detections → Higher recall, lower precision
- Higher confidence threshold → Fewer detections → Lower recall, higher precision
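
The trade-off can be checked by hand with a small sketch. It assumes eight hypothetical detections that have already been labelled TP or FP, and five ground-truth objects; `labelled`, `num_gt`, and `pr_at` are illustrative names.

```python
# (confidence, is_true_positive) for eight hypothetical, already-matched detections
labelled = [(0.95, True), (0.92, True), (0.88, True), (0.85, True),
            (0.78, True), (0.65, False), (0.45, False), (0.35, False)]
num_gt = 5  # assumed number of ground-truth objects

def pr_at(threshold):
    """Precision and recall using only detections with confidence >= threshold."""
    kept = [is_tp for conf, is_tp in labelled if conf >= threshold]
    tp = sum(kept)
    fp = len(kept) - tp
    fn = num_gt - tp
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / (tp + fn)
    return precision, recall

pr_table = {t: pr_at(t) for t in (0.3, 0.7, 0.9)}
for t, (p, r) in pr_table.items():
    print(f"threshold {t:.1f}: precision = {p:.3f}, recall = {r:.3f}")
```

At threshold 0.3 all eight detections are kept (precision 5/8, recall 5/5); at 0.9 only two remain, both correct (precision 2/2, recall 2/5).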

In [None]:
def calculate_precision_recall(results, fn_count):
    """
    Calculate precision and recall from classification results.
    
    Args:
        results: List of classification results
        fn_count: Number of false negatives
    
    Returns:
        precision, recall
    """
    tp = sum(1 for r in results if r['classification'] == 'TP')
    fp = sum(1 for r in results if r['classification'] == 'FP')
    
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn_count) if (tp + fn_count) > 0 else 0
    
    return precision, recall

precision, recall = calculate_precision_recall(results, fn_count)

print("Overall Metrics (All Detections):")
print("=" * 60)
print(f"Precision: {precision:.3f} ({precision*100:.1f}%)")
print(f"Recall:    {recall:.3f} ({recall*100:.1f}%)")

print("\nInterpretation:")
print(f"  - {precision*100:.1f}% of our detections are correct")
print(f"  - We found {recall*100:.1f}% of all objects in the image")

# F1 Score (harmonic mean of precision and recall)
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
print(f"\nF1 Score: {f1:.3f} (balanced metric)")

## Varying Confidence Threshold

Let's see how precision and recall change with different confidence thresholds:

In [None]:
def calculate_pr_at_threshold(predictions, ground_truth, conf_threshold, iou_threshold=0.5):
    """
    Calculate precision and recall at a specific confidence threshold.
    
    Args:
        predictions: List of predictions
        ground_truth: List of ground truth
        conf_threshold: Confidence threshold
        iou_threshold: IoU threshold for TP
    
    Returns:
        precision, recall, tp, fp, fn
    """
    # Filter predictions by confidence threshold
    filtered_preds = [p for p in predictions if p['confidence'] >= conf_threshold]
    
    if len(filtered_preds) == 0:
        return 0.0, 0.0, 0, 0, len(ground_truth)
    
    # Match to ground truth
    results, fn = match_predictions_to_gt(filtered_preds, ground_truth, iou_threshold)
    
    # Calculate metrics
    tp = sum(1 for r in results if r['classification'] == 'TP')
    fp = sum(1 for r in results if r['classification'] == 'FP')
    
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    
    return precision, recall, tp, fp, fn

# Test different thresholds
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
pr_results = []

print("Precision-Recall at Different Confidence Thresholds:")
print("=" * 80)
print(f"{'Threshold':<12} {'Precision':<12} {'Recall':<12} {'TP':<6} {'FP':<6} {'FN':<6}")
print("=" * 80)

for threshold in thresholds:
    precision, recall, tp, fp, fn = calculate_pr_at_threshold(
        predictions, ground_truth, threshold
    )
    pr_results.append((threshold, precision, recall))
    print(f"{threshold:<12.2f} {precision:<12.3f} {recall:<12.3f} {tp:<6} {fp:<6} {fn:<6}")

print("\nObservations:")
print("  - Lower threshold → More detections → Higher recall, lower precision")
print("  - Higher threshold → Fewer detections → Lower recall, higher precision")

## Precision-Recall Curve

Plotting precision vs recall shows the trade-off visually:

In [None]:
# Generate PR curve with finer granularity
conf_thresholds = np.linspace(0.0, 1.0, 50)
precisions = []
recalls = []

for threshold in conf_thresholds:
    precision, recall, _, _, _ = calculate_pr_at_threshold(
        predictions, ground_truth, threshold
    )
    precisions.append(precision)
    recalls.append(recall)

# Plot PR curve
plt.figure(figsize=(10, 6))
plt.plot(recalls, precisions, 'b-', linewidth=2, label='PR Curve')
plt.scatter(recalls, precisions, c=conf_thresholds, cmap='viridis', 
           s=30, alpha=0.6, edgecolors='black', linewidth=0.5)

# Add colorbar for confidence thresholds
cbar = plt.colorbar(label='Confidence Threshold')

plt.xlabel('Recall', fontsize=12, fontweight='bold')
plt.ylabel('Precision', fontsize=12, fontweight='bold')
plt.title('Precision-Recall Curve', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.xlim([0, 1.05])
plt.ylim([0, 1.05])

# Add annotations
plt.text(0.5, 0.95, 'High Precision\nLow Recall', 
         ha='center', va='top', fontsize=10, 
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
plt.text(0.95, 0.5, 'High Recall\nLow Precision', 
         ha='right', va='center', fontsize=10,
         bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.5))

plt.tight_layout()
plt.show()

print("\nPR Curve Interpretation:")
print("  - Ideal curve: Top-right corner (high precision AND high recall)")
print("  - Points move right-to-left as confidence threshold increases")
print("  - Area under curve = Average Precision (AP)")

## Average Precision (AP) Calculation

**Average Precision (AP)** = Area under the Precision-Recall curve

### Two Methods:

1. **11-point interpolation** (PASCAL VOC 2007):
   - Sample at 11 recall levels: 0.0, 0.1, 0.2, ..., 1.0
   - Take max precision at each level

2. **All-point interpolation** (PASCAL VOC 2010+):
   - Use all unique recall values
   - More accurate than 11-point sampling; COCO uses a related 101-point interpolation
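
The two methods can be compared on a toy curve. The sketch below assumes one class with two ground-truth objects and three ranked detections (TP, FP, TP); the names `toy_precisions`, `toy_recalls`, and `interp_precision` are hypothetical.

```python
# Precision and recall after each of three ranked detections: TP, FP, TP
toy_precisions = [1.0, 0.5, 2 / 3]
toy_recalls = [0.5, 0.5, 1.0]

def interp_precision(level):
    """Max precision among points whose recall is at least `level`."""
    candidates = [p for p, r in zip(toy_precisions, toy_recalls) if r >= level]
    return max(candidates) if candidates else 0.0

# 11-point: average the interpolated precision at recall 0.0, 0.1, ..., 1.0
ap_11_toy = sum(interp_precision(i / 10) for i in range(11)) / 11

# All-point: interpolated precision times the width of each recall step
ap_all_toy, prev_recall = 0.0, 0.0
for r in sorted(set(toy_recalls)):
    ap_all_toy += (r - prev_recall) * interp_precision(r)
    prev_recall = r

print(f"11-point AP:  {ap_11_toy:.4f}")
print(f"All-point AP: {ap_all_toy:.4f}")
```

Here the 11-point value is (6 × 1.0 + 5 × 2/3) / 11 and the all-point value is 0.5 × 1.0 + 0.5 × 2/3, so the two methods give slightly different numbers for the same curve.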

In [None]:
def calculate_ap_11point(precisions, recalls):
    """
    Calculate AP using 11-point interpolation (PASCAL VOC 2007 style).
    
    Args:
        precisions: List of precision values
        recalls: List of recall values
    
    Returns:
        Average Precision
    """
    ap = 0.0
    recall_levels = np.linspace(0, 1, 11)  # 0.0, 0.1, 0.2, ..., 1.0
    
    for r_level in recall_levels:
        # Find precisions where recall >= r_level (the small tolerance guards
        # against float rounding in the recall levels)
        precisions_above = [p for p, r in zip(precisions, recalls) if r >= r_level - 1e-9]
        
        if len(precisions_above) > 0:
            ap += max(precisions_above)
    
    return ap / 11.0

def calculate_ap_allpoint(precisions, recalls):
    """
    Calculate AP using all-point interpolation (PASCAL VOC 2010+ style).
    
    Args:
        precisions: List of precision values (any order; sorted internally by recall)
        recalls: List of recall values (same order as precisions)
    
    Returns:
        Average Precision
    """
    # Sort by recall
    sorted_indices = np.argsort(recalls)
    recalls = np.array(recalls)[sorted_indices]
    precisions = np.array(precisions)[sorted_indices]
    
    # Add sentinel values
    recalls = np.concatenate(([0.], recalls, [1.]))
    precisions = np.concatenate(([0.], precisions, [0.]))
    
    # Compute the precision envelope (monotonic decreasing)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    
    # Calculate area under curve
    indices = np.where(recalls[1:] != recalls[:-1])[0] + 1
    ap = np.sum((recalls[indices] - recalls[indices - 1]) * precisions[indices])
    
    return ap

# Calculate both AP methods
ap_11pt = calculate_ap_11point(precisions, recalls)
ap_all = calculate_ap_allpoint(precisions, recalls)

print("Average Precision (AP):")
print("=" * 60)
print(f"11-point interpolation:  {ap_11pt:.4f} ({ap_11pt*100:.2f}%)")
print(f"All-point interpolation: {ap_all:.4f} ({ap_all*100:.2f}%)")
print("\nNote: All-point method is more accurate and widely used today.")

# Visualize interpolation
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# 11-point interpolation
ax = axes[0]
ax.plot(recalls, precisions, 'b-', alpha=0.3, label='Original')
recall_levels = np.linspace(0, 1, 11)
interp_precisions = []
for r_level in recall_levels:
    precisions_above = [p for p, r in zip(precisions, recalls) if r >= r_level]
    interp_precisions.append(max(precisions_above) if len(precisions_above) > 0 else 0)
ax.step(recall_levels, interp_precisions, 'r-', where='post', linewidth=2, label='11-point')
ax.scatter(recall_levels, interp_precisions, c='red', s=50, zorder=5)
ax.set_xlabel('Recall', fontweight='bold')
ax.set_ylabel('Precision', fontweight='bold')
ax.set_title(f'11-Point Interpolation\nAP = {ap_11pt:.4f}', fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend()
ax.set_xlim([0, 1])
ax.set_ylim([0, 1.05])

# All-point interpolation
ax = axes[1]
ax.plot(recalls, precisions, 'b-', alpha=0.3, label='Original')
# Show monotonic envelope
sorted_indices = np.argsort(recalls)
sorted_recalls = np.array(recalls)[sorted_indices]
sorted_precisions = np.array(precisions)[sorted_indices]
envelope = sorted_precisions.copy()
for i in range(len(envelope) - 2, -1, -1):
    envelope[i] = max(envelope[i], envelope[i + 1])
ax.step(sorted_recalls, envelope, 'g-', where='post', linewidth=2, label='All-point')
ax.set_xlabel('Recall', fontweight='bold')
ax.set_ylabel('Precision', fontweight='bold')
ax.set_title(f'All-Point Interpolation\nAP = {ap_all:.4f}', fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend()
ax.set_xlim([0, 1])
ax.set_ylim([0, 1.05])

plt.tight_layout()
plt.show()

## Multiple Classes

Real object detection involves multiple classes. Calculate AP for each class separately:

In [None]:
def calculate_ap_per_class(predictions, ground_truth, class_name, iou_threshold=0.5):
    """
    Calculate AP for a specific class.
    
    Args:
        predictions: All predictions
        ground_truth: All ground truth
        class_name: Class to evaluate
        iou_threshold: IoU threshold
    
    Returns:
        Average Precision for the class
    """
    # Filter by class
    class_preds = [p for p in predictions if p['class'] == class_name]
    class_gt = [g for g in ground_truth if g['class'] == class_name]
    
    if len(class_gt) == 0:
        return 0.0  # No ground truth for this class
    
    # Calculate PR curve
    conf_thresholds = sorted(set([p['confidence'] for p in class_preds] + [0.0]), reverse=True)
    precisions = []
    recalls = []
    
    for threshold in conf_thresholds:
        precision, recall, _, _, _ = calculate_pr_at_threshold(
            class_preds, class_gt, threshold, iou_threshold
        )
        precisions.append(precision)
        recalls.append(recall)
    
    # Calculate AP
    if len(precisions) > 0:
        return calculate_ap_allpoint(precisions, recalls)
    return 0.0

# Calculate AP for each class
classes = ['car', 'person', 'dog']
class_aps = {}

print("Average Precision per Class:")
print("=" * 60)

for class_name in classes:
    ap = calculate_ap_per_class(predictions, ground_truth, class_name)
    class_aps[class_name] = ap
    
    # Count objects
    gt_count = sum(1 for g in ground_truth if g['class'] == class_name)
    pred_count = sum(1 for p in predictions if p['class'] == class_name)
    
    print(f"{class_name:<10} AP: {ap:.4f} ({ap*100:.2f}%)  "
          f"[GT: {gt_count}, Predictions: {pred_count}]")

# Visualize per-class AP
plt.figure(figsize=(10, 6))
bars = plt.bar(class_aps.keys(), class_aps.values(), 
               color=['steelblue', 'coral', 'mediumseagreen'],
               edgecolor='black', linewidth=1.5)

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
            f'{height:.3f}',
            ha='center', va='bottom', fontweight='bold')

plt.xlabel('Class', fontsize=12, fontweight='bold')
plt.ylabel('Average Precision (AP)', fontsize=12, fontweight='bold')
plt.title('Average Precision per Class (IoU = 0.5)', fontsize=14, fontweight='bold')
plt.ylim([0, 1.1])
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

## Mean Average Precision (mAP)

**mAP** = Average of AP across all classes

```
mAP = (AP_class1 + AP_class2 + ... + AP_classN) / N
```

This is the **single-number metric** most commonly used to compare object detection models.
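
As a minimal sketch of the formula, the per-class AP values below are hypothetical; `toy_aps` and `toy_map` are illustrative names.

```python
from statistics import mean

# Hypothetical per-class AP values (illustrative, not measured results)
toy_aps = {'car': 0.80, 'person': 0.60, 'dog': 0.70}

toy_map = mean(toy_aps.values())                 # unweighted mean over classes
weakest_class = min(toy_aps, key=toy_aps.get)    # class the average can hide

print(f"mAP = {toy_map:.2f}")
print(f"Weakest class: {weakest_class} (AP = {toy_aps[weakest_class]:.2f})")
```

Each class contributes equally to the mean regardless of how many objects it has, which is why per-class AP is worth inspecting alongside mAP.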

In [None]:
def calculate_map(predictions, ground_truth, classes, iou_threshold=0.5):
    """
    Calculate mean Average Precision (mAP).
    
    Args:
        predictions: All predictions
        ground_truth: All ground truth
        classes: List of class names
        iou_threshold: IoU threshold
    
    Returns:
        mAP value and per-class APs
    """
    aps = {}
    
    for class_name in classes:
        ap = calculate_ap_per_class(predictions, ground_truth, class_name, iou_threshold)
        aps[class_name] = ap
    
    map_value = np.mean(list(aps.values()))
    return map_value, aps

# Calculate mAP
map_50, aps = calculate_map(predictions, ground_truth, classes, iou_threshold=0.5)

print("Mean Average Precision (mAP):")
print("=" * 60)
print(f"\nmAP@0.5 = {map_50:.4f} ({map_50*100:.2f}%)")
print("\nBreakdown by class:")
for class_name, ap in aps.items():
    print(f"  {class_name:<10} AP: {ap:.4f}")

print("\n" + "=" * 60)
print("\nNotation:")
print("  mAP@0.5  = mAP at IoU threshold 0.5 (PASCAL VOC)")
print("  mAP@0.75 = mAP at IoU threshold 0.75 (stricter)")
print("  mAP@0.5:0.95 = Average mAP from IoU 0.5 to 0.95 (COCO standard)")

## mAP@0.5 vs mAP@0.5:0.95 (COCO Standard)

The **COCO dataset** uses a more rigorous metric:

- **mAP@0.5:0.95**: Average mAP calculated at IoU thresholds [0.5, 0.55, 0.6, ..., 0.95]
- Total: 10 different thresholds
- Rewards models with better localization accuracy
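
To see why this rewards tighter boxes, the sketch below checks a single detection against all ten thresholds. The IoU value 0.72 is a hypothetical example, and `coco_thresholds` is an illustrative name.

```python
# The ten IoU thresholds 0.50, 0.55, ..., 0.95 (rounded to avoid float drift)
coco_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

detection_iou = 0.72  # hypothetical IoU of one detection with its ground truth

# The detection counts as a TP only at thresholds it meets or exceeds
tp_flags = [detection_iou >= t for t in coco_thresholds]
num_tp_thresholds = sum(tp_flags)

for t, is_tp in zip(coco_thresholds, tp_flags):
    print(f"IoU threshold {t:.2f}: {'TP' if is_tp else 'FP'}")
print(f"Counted as TP at {num_tp_thresholds} of {len(coco_thresholds)} thresholds")
```

A detection with IoU 0.72 is a true positive at the five thresholds up to 0.70 and a false positive at the five stricter ones, so loosely fitting boxes pull the averaged score down.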

In [None]:
# Calculate mAP at different IoU thresholds
iou_thresholds = np.linspace(0.5, 0.95, 10)  # 0.5, 0.55, 0.6, ..., 0.95 (linspace avoids float-step drift)
map_values = []

print("mAP at Different IoU Thresholds:")
print("=" * 60)
print(f"{'IoU Threshold':<15} {'mAP':<10}")
print("=" * 60)

for iou_threshold in iou_thresholds:
    map_value, _ = calculate_map(predictions, ground_truth, classes, iou_threshold)
    map_values.append(map_value)
    print(f"{iou_threshold:<15.2f} {map_value:<10.4f}")

# Calculate COCO-style mAP
coco_map = np.mean(map_values)
print("\n" + "=" * 60)
print(f"\nmAP@0.5:0.95 (COCO): {coco_map:.4f} ({coco_map*100:.2f}%)")
print(f"mAP@0.5 (PASCAL):    {map_50:.4f} ({map_50*100:.2f}%)")

print("\nObservation:")
print("  COCO mAP is typically lower because it requires better localization")

# Visualize mAP vs IoU threshold
plt.figure(figsize=(10, 6))
plt.plot(iou_thresholds, map_values, 'o-', linewidth=2, markersize=8, color='steelblue')
plt.axhline(y=coco_map, color='red', linestyle='--', linewidth=2, 
           label=f'COCO mAP = {coco_map:.4f}')
plt.xlabel('IoU Threshold', fontsize=12, fontweight='bold')
plt.ylabel('mAP', fontsize=12, fontweight='bold')
plt.title('mAP vs IoU Threshold', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.legend(fontsize=11)
plt.xlim([0.45, 1.0])
plt.ylim([0, max(map_values) * 1.1])
plt.tight_layout()
plt.show()

print("\nAs IoU threshold increases:")
print("  - Fewer detections count as TP")
print("  - mAP decreases (or stays the same)")
print("  - Model must localize objects more precisely")

## Interpretation

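### What does mAP mean?

The bands below are rough rules of thumb, not fixed standards: what counts as a good score depends on the dataset and the IoU threshold. On COCO's stricter mAP@0.5:0.95, for example, scores around 0.5 are already competitive.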

| mAP Value | Interpretation | Example Use Case |
|-----------|----------------|------------------|
| 0.90+ | Excellent | Research benchmarks |
| 0.70-0.90 | Good | Production systems |
| 0.50-0.70 | Moderate | Proof of concept |
| 0.30-0.50 | Poor | Needs improvement |
| <0.30 | Very Poor | Baseline/broken |

### Comparing Models

**Example:**
- **YOLO v8 mAP@0.5:0.95 = 0.53** on COCO
- **Faster R-CNN mAP@0.5:0.95 = 0.42** on COCO

→ On COCO, YOLO v8 detects and localizes objects more accurately by this metric

### Important Notes:

1. **Dataset matters**: 
   - mAP on COCO ≠ mAP on custom dataset
   - Compare models on **same** dataset

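2. **Speed-accuracy trade-off**:
   - Two-stage detectors are typically slower: Faster R-CNN runs at roughly 5-10 FPS, YOLO variants at 30-80+ FPS
   - Within a model family, higher mAP usually costs speed (YOLOv8-large: 0.53 at 30 FPS; YOLOv8-nano: 0.37 at 80+ FPS)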

3. **Class imbalance**:
   - mAP averages across classes
   - Check per-class AP for important classes

4. **IoU threshold**:
   - Always specify: mAP@0.5 or mAP@0.5:0.95
   - COCO standard: mAP@0.5:0.95

## Exercise: Calculate mAP for Given Dataset

In [None]:
print("EXERCISE: Calculate mAP")
print("=" * 60)

# New dataset
exercise_gt = [
    {'class': 'cat', 'bbox': [50, 50, 150, 150]},
    {'class': 'cat', 'bbox': [200, 100, 300, 200]},
    {'class': 'dog', 'bbox': [100, 200, 200, 300]},
    {'class': 'bird', 'bbox': [300, 50, 350, 100]},
]

exercise_preds = [
    {'class': 'cat', 'confidence': 0.95, 'bbox': [52, 48, 148, 152]},
    {'class': 'cat', 'confidence': 0.88, 'bbox': [205, 95, 295, 205]},
    {'class': 'dog', 'confidence': 0.92, 'bbox': [98, 205, 202, 295]},
    {'class': 'bird', 'confidence': 0.65, 'bbox': [305, 55, 345, 95]},
    {'class': 'cat', 'confidence': 0.45, 'bbox': [150, 150, 200, 200]},  # FP
]

print("\nTasks:")
print("1. Calculate AP for each class (cat, dog, bird)")
print("2. Calculate mAP@0.5")
print("3. Which class has the highest AP?")
print("4. How many false positives are there?")

print("\n" + "=" * 60)
print("Uncomment the code below to see the solution:\n")

# Solution (uncomment to run)
# exercise_classes = ['cat', 'dog', 'bird']
# exercise_map, exercise_aps = calculate_map(exercise_preds, exercise_gt, exercise_classes)
# 
# print("\nSolution:")
# print("=" * 60)
# for class_name, ap in exercise_aps.items():
#     print(f"{class_name:<10} AP: {ap:.4f}")
# print(f"\nmAP@0.5: {exercise_map:.4f}")
# 
# best_class = max(exercise_aps, key=exercise_aps.get)
# print(f"\nBest class: {best_class} (AP = {exercise_aps[best_class]:.4f})")
# 
# # Count FPs
# results, _ = match_predictions_to_gt(exercise_preds, exercise_gt)
# fp_count = sum(1 for r in results if r['classification'] == 'FP')
# print(f"False positives: {fp_count}")

## Summary

### Key Concepts

1. **Detection Classification:**
   - **TP (True Positive)**: Correct detection with IoU ≥ threshold
   - **FP (False Positive)**: Wrong detection or IoU < threshold
   - **FN (False Negative)**: Missed ground truth object

2. **Metrics:**
   - **Precision** = TP / (TP + FP) — Accuracy of detections
   - **Recall** = TP / (TP + FN) — Completeness of detections
   - **AP (Average Precision)** = Area under PR curve
   - **mAP (mean Average Precision)** = Average of AP across all classes

3. **Evaluation Standards:**
   - **PASCAL VOC**: mAP@0.5 (IoU threshold = 0.5)
   - **MS COCO**: mAP@0.5:0.95 (average across 10 IoU thresholds)
   - COCO is stricter and widely used in research

4. **Interpretation:**
   - Higher mAP = Better detector
   - Compare models on same dataset
   - Always specify IoU threshold
   - Check per-class AP for imbalanced datasets

5. **Trade-offs:**
   - Confidence threshold: Precision vs Recall
   - Model choice: Speed vs Accuracy
   - IoU threshold: Localization strictness

### Typical mAP Values (COCO Dataset)

| Model | mAP@0.5:0.95 | Speed (FPS, approximate; hardware-dependent) |
|-------|--------------|-------------|
| YOLOv8-nano | 0.37 | 80+ |
| YOLOv8-small | 0.45 | 60 |
| YOLOv8-medium | 0.50 | 40 |
| YOLOv8-large | 0.53 | 30 |
| Faster R-CNN | 0.42 | 5-10 |
| RetinaNet | 0.40 | 15 |

### Next Steps
- **Notebook 05**: Classical sliding window detection (pre-deep learning)
- **Week 14**: Modern architectures (YOLO, R-CNN family)
- **Week 15**: Advanced topics (NMS, anchor boxes)

### Further Reading
- PASCAL VOC Challenge: http://host.robots.ox.ac.uk/pascal/VOC/
- COCO Evaluation Metrics: https://cocodataset.org/#detection-eval
- Paper: Everingham et al., "The PASCAL Visual Object Classes (VOC) Challenge", International Journal of Computer Vision (2010)

---

**Completion Time:** ~15 minutes  
**Mastery:** Practice calculating mAP on real datasets (COCO, VOC)