# 08 — mAP@IoU=0.5 Evaluation (v7.3)

This notebook evaluates the **detection quality** of our model predictions against ground truth boxes on **scene_8** (validation set).

**mAP@IoU=0.5** is Airbus evaluation criterion #1 — the single most important metric.

- **GT boxes**: `outputs/gt_boxes_all.csv` (generated by `build_gt_boxes.py`, internal class_id 1-4)
- **Predicted boxes**: `outputs/pred_v7/scene_8.csv` (Airbus format, class_ID 0-3)
- **IoU**: Axis-aligned 3D IoU (yaw is ignored, a common simplification that can over- or under-estimate the oriented IoU)
- **AP**: PASCAL VOC all-points interpolation per class, then mean across 4 classes
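
For reference, the IoU and AP definitions used in this notebook can be written out as below. The notation is introduced here for illustration only: $c^A, d^A$ are the center and dimensions of box $A$, $o_k$ is the overlap along axis $k$, and $p(r)$ is precision as a function of recall, with $r_i$ the distinct recall values and $r_0 = 0$.

$$
o_k = \max\Bigl(0,\ \min\bigl(c^A_k + \tfrac{d^A_k}{2},\ c^B_k + \tfrac{d^B_k}{2}\bigr) - \max\bigl(c^A_k - \tfrac{d^A_k}{2},\ c^B_k - \tfrac{d^B_k}{2}\bigr)\Bigr), \quad k \in \{x, y, z\}
$$

$$
\mathrm{IoU}(A, B) = \frac{o_x\, o_y\, o_z}{V_A + V_B - o_x\, o_y\, o_z}, \qquad V = d_x\, d_y\, d_z
$$

$$
\mathrm{AP} = \sum_i (r_i - r_{i-1}) \max_{r' \ge r_i} p(r'), \qquad \mathrm{mAP} = \frac{1}{4} \sum_{c=1}^{4} \mathrm{AP}_c
$$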

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

import numpy as np
import pandas as pd

print("Imports OK")

In [None]:
# ── Paths & config ──────────────────────────────────────────────────────────
DRIVE_BASE = "/content/drive/MyDrive/airbus_hackathon"

GT_CSV   = f"{DRIVE_BASE}/outputs/gt_boxes_all.csv"
PRED_CSV = f"{DRIVE_BASE}/outputs/pred_v7/scene_8.csv"

IOU_THRESHOLD = 0.5

# Internal class IDs (1-based, matching GT CSV)
CLASS_NAMES = {
    1: "Antenna",
    2: "Cable",
    3: "Electric Pole",
    4: "Wind Turbine",
}

VALIDATION_SCENE = "scene_8"

print(f"GT CSV:   {GT_CSV}")
print(f"Pred CSV: {PRED_CSV}")
print(f"IoU threshold: {IOU_THRESHOLD}")
print(f"Validation scene: {VALIDATION_SCENE}")

# Load and inspect data

In [None]:
# ── Load GT boxes ────────────────────────────────────────────────────────────
# GT CSV columns: scene,frame_idx,ego_x,ego_y,ego_z,ego_yaw,class_id,class_label,
#                  num_points,center_x,center_y,center_z,width,length,height,yaw
# class_id: 1=Antenna, 2=Cable, 3=Electric Pole, 4=Wind Turbine

df_gt = pd.read_csv(GT_CSV)
print(f"GT boxes total: {len(df_gt)} rows, columns: {list(df_gt.columns)}")
print(f"\nScenes in GT: {sorted(df_gt['scene'].unique())}")
print(f"\nGT class distribution (all scenes):")
print(df_gt['class_label'].value_counts())

# Filter to validation scene only
df_gt_scene8 = df_gt[df_gt['scene'] == VALIDATION_SCENE].copy()
print(f"\n{'='*50}")
print(f"GT boxes in {VALIDATION_SCENE}: {len(df_gt_scene8)}")
print(f"Frames: {df_gt_scene8['frame_idx'].nunique()}")
print(f"\nGT class distribution ({VALIDATION_SCENE}):")
print(df_gt_scene8['class_label'].value_counts())
print(f"\nGT head:")
df_gt_scene8.head()

In [None]:
# ── Load prediction boxes ────────────────────────────────────────────────────
# Pred CSV columns (Airbus format): ego_x,ego_y,ego_z,ego_yaw,
#   bbox_center_x,bbox_center_y,bbox_center_z,bbox_width,bbox_length,bbox_height,
#   bbox_yaw,class_ID,class_label
# class_ID: 0=Antenna, 1=Cable, 2=Electric Pole, 3=Wind Turbine

df_pred = pd.read_csv(PRED_CSV)
print(f"Pred boxes: {len(df_pred)} rows, columns: {list(df_pred.columns)}")

# Convert Airbus class_ID (0-3) to internal class_id (1-4)
df_pred['class_id_internal'] = df_pred['class_ID'] + 1

print(f"\nPred class distribution (Airbus ID → internal):")
for cid_airbus in sorted(df_pred['class_ID'].unique()):
    cid_internal = cid_airbus + 1
    count = (df_pred['class_ID'] == cid_airbus).sum()
    label = CLASS_NAMES.get(cid_internal, '???')
    print(f"  Airbus {cid_airbus} → internal {cid_internal} ({label}): {count}")

print(f"\nPred head:")
df_pred.head()

# IoU 3D computation

In [None]:
def iou_3d_axis_aligned(box_a, box_b):
    """Compute 3D IoU between two boxes (axis-aligned approximation).
    
    Each box is a dict with:
      - 'center': np.array([x, y, z])  in meters
      - 'dims':   np.array([w, l, h])  in meters
    
    Ignores yaw rotation. This is a common simplification, not a bound:
    it can over- or under-estimate the true oriented IoU.
    """
    ca, da = box_a['center'], box_a['dims']
    cb, db = box_b['center'], box_b['dims']
    
    # Half-extents
    ha = da / 2.0
    hb = db / 2.0
    
    # Overlap per axis: min of upper bounds - max of lower bounds
    overlap = np.maximum(0.0, np.minimum(ca + ha, cb + hb) - np.maximum(ca - ha, cb - hb))
    intersection = overlap[0] * overlap[1] * overlap[2]
    
    vol_a = da[0] * da[1] * da[2]
    vol_b = db[0] * db[1] * db[2]
    union = vol_a + vol_b - intersection
    
    if union <= 0:
        return 0.0
    return intersection / union


# ── Quick sanity check ────────────────────────────────────────────────────────
# Identical boxes → IoU = 1.0
b = {'center': np.array([1, 2, 3]), 'dims': np.array([2, 2, 2])}
assert abs(iou_3d_axis_aligned(b, b) - 1.0) < 1e-9, "Sanity check failed"

# Non-overlapping boxes → IoU = 0.0
b1 = {'center': np.array([0, 0, 0]), 'dims': np.array([1, 1, 1])}
b2 = {'center': np.array([10, 10, 10]), 'dims': np.array([1, 1, 1])}
assert iou_3d_axis_aligned(b1, b2) == 0.0, "Sanity check failed"

# Partial overlap → known IoU
b1 = {'center': np.array([0, 0, 0]), 'dims': np.array([2, 2, 2])}
b2 = {'center': np.array([1, 0, 0]), 'dims': np.array([2, 2, 2])}
# Overlap: 1x2x2=4, union: 8+8-4=12, IoU=4/12=1/3
assert abs(iou_3d_axis_aligned(b1, b2) - 1/3) < 1e-9, "Sanity check failed"

print("IoU sanity checks passed.")

# mAP computation

In [None]:
def compute_ap(gt_boxes, pred_boxes, iou_threshold=0.5):
    """Compute Average Precision for a single class (across all frames).
    
    Uses PASCAL VOC all-points interpolation.
    Matching is constrained to boxes within the same frame (same frame_key).
    
    Args:
        gt_boxes:   list of dicts with 'center', 'dims', 'frame_key'
        pred_boxes: list of dicts with 'center', 'dims', 'confidence', 'frame_key'
        iou_threshold: minimum IoU for a true positive (default 0.5)
    
    Returns:
        ap:        Average Precision (float)
        precision: array of precision values at each prediction
        recall:    array of recall values at each prediction
        tp_count:  total true positives
        fp_count:  total false positives
        fn_count:  total false negatives (unmatched GT boxes)
    """
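    # Edge cases
    # Note: a class with no GT and no predictions scores AP=1.0 here, which
    # raises the 4-class mean if that class is absent from the scene.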
    if len(gt_boxes) == 0 and len(pred_boxes) == 0:
        return 1.0, np.array([]), np.array([]), 0, 0, 0
    if len(gt_boxes) == 0:
        return 0.0, np.zeros(len(pred_boxes)), np.zeros(len(pred_boxes)), 0, len(pred_boxes), 0
    if len(pred_boxes) == 0:
        return 0.0, np.array([]), np.array([]), 0, 0, len(gt_boxes)
    
    # Sort predictions by confidence (descending).
    # sorted() is stable, so tied confidences keep their input order.
    pred_sorted = sorted(pred_boxes, key=lambda x: x.get('confidence', 0.0), reverse=True)
    
    # Index GT boxes by frame for O(1) lookup
    from collections import defaultdict
    gt_by_frame = defaultdict(list)
    for j, gt in enumerate(gt_boxes):
        gt_by_frame[gt['frame_key']].append((j, gt))
    
    # Track which GT boxes have been matched (by global index)
    gt_matched = set()
    
    tp = np.zeros(len(pred_sorted))
    fp = np.zeros(len(pred_sorted))
    
    for i, pred in enumerate(pred_sorted):
        best_iou = 0.0
        best_gt_idx = -1
        
        # Only check GT boxes in the same frame
        for j, gt in gt_by_frame.get(pred['frame_key'], []):
            if j in gt_matched:
                continue
            iou = iou_3d_axis_aligned(pred, gt)
            if iou > best_iou:
                best_iou = iou
                best_gt_idx = j
        
        if best_iou >= iou_threshold and best_gt_idx >= 0:
            tp[i] = 1
            gt_matched.add(best_gt_idx)
        else:
            fp[i] = 1
    
    # Cumulative sums
    tp_cumsum = np.cumsum(tp)
    fp_cumsum = np.cumsum(fp)
    
    precision = tp_cumsum / (tp_cumsum + fp_cumsum)
    recall = tp_cumsum / len(gt_boxes)
    
    # AP using all-points interpolation (PASCAL VOC style)
    # Add sentinel values at beginning and end
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([1.0], precision, [0.0]))
    
    # Precision envelope: sweep right to left with a running maximum,
    # so precision becomes non-increasing as recall grows
    for k in range(len(mpre) - 2, -1, -1):
        mpre[k] = max(mpre[k], mpre[k + 1])
    
    # Find points where recall changes
    change_points = np.where(mrec[1:] != mrec[:-1])[0] + 1
    
    # Sum rectangular areas under the PR curve
    ap = np.sum((mrec[change_points] - mrec[change_points - 1]) * mpre[change_points])
    
    tp_count = int(tp.sum())
    fp_count = int(fp.sum())
    fn_count = len(gt_boxes) - tp_count
    
    return float(ap), precision, recall, tp_count, fp_count, fn_count


print("compute_ap() defined.")

# Run evaluation
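
The parsing cell in this section matches GT and predicted boxes through a frame key built by casting the four ego pose values to `int`. Because `int()` truncates toward zero, two readings of the same pose that sit on either side of an integer boundary produce different keys. The sketch below illustrates this with made-up pose values; `frame_key_rounded` is a hypothetical alternative, not something the notebook uses, and rounding only moves the sensitive boundary to the half-way points rather than removing it.

```python
def frame_key_int(ego_x, ego_y, ego_z, ego_yaw):
    """Key as built in this notebook: truncate each value toward zero."""
    return (int(ego_x), int(ego_y), int(ego_z), int(ego_yaw))


def frame_key_rounded(ego_x, ego_y, ego_z, ego_yaw, decimals=0):
    """Hypothetical alternative: round each value to the nearest grid point."""
    return tuple(round(v, decimals) for v in (ego_x, ego_y, ego_z, ego_yaw))


# Made-up example: the same pose, with tiny float drift on ego_x
pose_gt = (120.0, 45.0, 300.0, 90.0)
pose_pred = (119.9999999, 45.0, 300.0, 90.0)

same_with_int = frame_key_int(*pose_gt) == frame_key_int(*pose_pred)
same_with_round = frame_key_rounded(*pose_gt) == frame_key_rounded(*pose_pred)

print(same_with_int)    # False: 120 vs 119 after truncation
print(same_with_round)  # True: both round to 120.0
```

If the `GT-only` and `Pred-only` frame counts printed by the evaluation are non-zero, a mismatch of this kind is one possible cause worth ruling out.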

In [None]:
# ── Parse GT boxes into list of dicts ─────────────────────────────────────────
def parse_gt_boxes(df):
    """Parse GT CSV rows into box dicts.
    GT columns: ego_x, ego_y, ego_z, ego_yaw, class_id (1-4),
                center_x, center_y, center_z, width, length, height, yaw
    """
    boxes = []
    for _, row in df.iterrows():
        frame_key = (int(row['ego_x']), int(row['ego_y']),
                     int(row['ego_z']), int(row['ego_yaw']))
        boxes.append({
            'center': np.array([row['center_x'], row['center_y'], row['center_z']], dtype=np.float64),
            'dims':   np.array([row['width'], row['length'], row['height']], dtype=np.float64),
            'class_id': int(row['class_id']),
            'frame_key': frame_key,
        })
    return boxes


# ── Parse predicted boxes into list of dicts ──────────────────────────────────
def parse_pred_boxes(df):
    """Parse prediction CSV rows into box dicts.
    Pred columns (Airbus format): ego_x, ego_y, ego_z, ego_yaw,
        bbox_center_x, bbox_center_y, bbox_center_z,
        bbox_width, bbox_length, bbox_height, bbox_yaw,
        class_ID (0-3), class_label
    
    Converts Airbus class_ID (0-3) to internal class_id (1-4) by adding 1.
    """
    boxes = []
    for _, row in df.iterrows():
        frame_key = (int(row['ego_x']), int(row['ego_y']),
                     int(row['ego_z']), int(row['ego_yaw']))
        boxes.append({
            'center': np.array([row['bbox_center_x'], row['bbox_center_y'], row['bbox_center_z']], dtype=np.float64),
            'dims':   np.array([row['bbox_width'], row['bbox_length'], row['bbox_height']], dtype=np.float64),
            'class_id': int(row['class_ID']) + 1,   # Airbus 0-3 → internal 1-4
            'frame_key': frame_key,
            'confidence': 1.0,  # No confidence score in CSV → uniform
        })
    return boxes


# ── Parse boxes ───────────────────────────────────────────────────────────────
gt_boxes_all  = parse_gt_boxes(df_gt_scene8)
pred_boxes_all = parse_pred_boxes(df_pred)

print(f"GT boxes  (scene_8): {len(gt_boxes_all)}")
print(f"Pred boxes (scene_8): {len(pred_boxes_all)}")

# ── Verify frame key overlap ──────────────────────────────────────────────────
gt_frames  = set(b['frame_key'] for b in gt_boxes_all)
pred_frames = set(b['frame_key'] for b in pred_boxes_all)
common_frames = gt_frames & pred_frames
gt_only = gt_frames - pred_frames
pred_only = pred_frames - gt_frames

print(f"\nFrame overlap:")
print(f"  GT frames:     {len(gt_frames)}")
print(f"  Pred frames:   {len(pred_frames)}")
print(f"  Common frames: {len(common_frames)}")
print(f"  GT-only frames (missed entirely): {len(gt_only)}")
print(f"  Pred-only frames (hallucinated):  {len(pred_only)}")

In [None]:
# ── Compute AP per class and mAP ─────────────────────────────────────────────
results = {}

for cid in [1, 2, 3, 4]:
    gt_cls   = [b for b in gt_boxes_all if b['class_id'] == cid]
    pred_cls = [b for b in pred_boxes_all if b['class_id'] == cid]
    
    ap, prec, rec, tp, fp, fn = compute_ap(gt_cls, pred_cls, IOU_THRESHOLD)
    
    results[cid] = {
        'ap': ap,
        'tp': tp,
        'fp': fp,
        'fn': fn,
        'precision': tp / (tp + fp) if (tp + fp) > 0 else 0.0,
        'recall':    tp / (tp + fn) if (tp + fn) > 0 else 0.0,
        'n_gt':   len(gt_cls),
        'n_pred': len(pred_cls),
        'prec_curve': prec,
        'rec_curve':  rec,
    }
    
    print(f"\n{CLASS_NAMES[cid]}:")
    print(f"  GT={len(gt_cls)}, Pred={len(pred_cls)}")
    print(f"  TP={tp}, FP={fp}, FN={fn}")
    print(f"  Precision={results[cid]['precision']:.3f}, Recall={results[cid]['recall']:.3f}")
    print(f"  AP@0.5 = {ap:.4f}")

# ── mAP ───────────────────────────────────────────────────────────────────────
map_score = np.mean([results[c]['ap'] for c in [1, 2, 3, 4]])

print(f"\n{'='*60}")
print(f"  mAP@IoU=0.5 = {map_score:.4f}")
print(f"{'='*60}")

# Summary

In [None]:
# ── Pretty summary table ──────────────────────────────────────────────────────
rows = []
for cid in [1, 2, 3, 4]:
    r = results[cid]
    rows.append({
        'Class': CLASS_NAMES[cid],
        'GT': r['n_gt'],
        'Pred': r['n_pred'],
        'TP': r['tp'],
        'FP': r['fp'],
        'FN': r['fn'],
        'Precision': f"{r['precision']:.3f}",
        'Recall': f"{r['recall']:.3f}",
        'AP@0.5': f"{r['ap']:.4f}",
    })

# Add total row
total_gt   = sum(results[c]['n_gt'] for c in [1,2,3,4])
total_pred = sum(results[c]['n_pred'] for c in [1,2,3,4])
total_tp   = sum(results[c]['tp'] for c in [1,2,3,4])
total_fp   = sum(results[c]['fp'] for c in [1,2,3,4])
total_fn   = sum(results[c]['fn'] for c in [1,2,3,4])
rows.append({
    'Class': '--- TOTAL ---',
    'GT': total_gt,
    'Pred': total_pred,
    'TP': total_tp,
    'FP': total_fp,
    'FN': total_fn,
    'Precision': f"{total_tp/(total_tp+total_fp):.3f}" if (total_tp+total_fp) > 0 else "N/A",
    'Recall': f"{total_tp/(total_tp+total_fn):.3f}" if (total_tp+total_fn) > 0 else "N/A",
    'AP@0.5': f"{map_score:.4f} (mAP)",
})

df_results = pd.DataFrame(rows)
print(df_results.to_string(index=False))

print(f"\n{'='*60}")
print(f"  mAP@IoU=0.5 = {map_score:.4f}")
print(f"{'='*60}")

print("\nNote: IoU is axis-aligned (ignoring yaw rotation).")
print("This is an approximation: it can over- or under-estimate the oriented IoU.")
print("Predictions have uniform confidence=1.0 (no ranking), so AP depends on CSV row order.")

# Analysis

## Interpreting the results

- **AP@0.5 per class**: how well the model detects each obstacle type. Values range from 0 (no correct detections) to 1 (perfect detection at all recall levels).
- **mAP@0.5**: the mean across all 4 classes — this is the **primary Airbus metric**.
- **Precision**: fraction of predictions that are correct (low = too many false positives / hallucinations).
- **Recall**: fraction of GT boxes that are detected (low = too many misses).
- **FN (false negatives)**: obstacles we completely missed — dangerous in a collision avoidance system.

## Key factors affecting scores

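1. **No confidence scores**: predictions all have confidence=1.0, so the sort in `compute_ap()` leaves them in CSV row order and AP depends on that arbitrary order. With real confidence scores, AP would reflect how well the model ranks correct detections above false positives.
2. **Axis-aligned IoU**: ignoring yaw distorts IoU for elongated objects (cables, poles). The score is too high when GT and prediction share a center and size but differ in yaw, and it can be too low when both boxes are rotated away from the axes. Oriented IoU would remove this source of error.
3. **Cable difficulty**: cables are thin (median height 0.16 m) with few LiDAR points, which makes them the hardest class to detect.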
4. **Frame key matching**: ego pose values are cast to int for matching. If there is floating-point drift between GT and pred CSVs, some frames may not match (check `GT-only` and `Pred-only` counts above).
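
To make the first point concrete, here is a minimal sketch of how all-points interpolated AP reacts to prediction order when every confidence is tied. `ap_all_points` is a hypothetical helper that follows the same interpolation steps as `compute_ap()` but takes TP/FP flags directly; the flags and GT count are made up.

```python
import numpy as np


def ap_all_points(tp_flags, n_gt):
    """All-points interpolated AP from TP(1)/FP(0) flags in ranked order."""
    tp = np.asarray(tp_flags, dtype=float)
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    precision = tp_cum / (tp_cum + fp_cum)
    recall = tp_cum / n_gt

    # Sentinels at both ends, as in compute_ap()
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([1.0], precision, [0.0]))
    # Precision envelope: running maximum taken from right to left
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    # Sum rectangle areas at the points where recall changes
    idx = np.where(mrec[1:] != mrec[:-1])[0] + 1
    return float(np.sum((mrec[idx] - mrec[idx - 1]) * mpre[idx]))


# Made-up class: 2 GT boxes, 4 predictions, 2 TP and 2 FP in both orderings
ap_tp_first = ap_all_points([1, 1, 0, 0], n_gt=2)
ap_fp_first = ap_all_points([0, 0, 1, 1], n_gt=2)

print(ap_tp_first, ap_fp_first)  # 1.0 0.5
```

Both orderings have precision 0.5 and recall 1.0 overall, yet AP ranges from 0.5 to 1.0 depending only on where the true positives happen to sit in the CSV. Until predictions carry real confidence scores, the per-class precision and recall are the more trustworthy numbers.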

## What to improve

- Add **confidence scores** to predictions (softmax probability, cluster density, etc.)
- Implement **oriented 3D IoU** for more accurate evaluation
- Focus training on **recall for cables** (most safety-critical misses)
- Tune the **NMS threshold** to reduce duplicate detections (if FP is high)
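
As a starting point for the oriented IoU item, the sketch below computes a yaw-aware 3D IoU by clipping the two bird's-eye-view rectangles against each other (Sutherland-Hodgman clipping, valid for convex polygons) and multiplying the overlap area by the vertical overlap. Several things here are assumptions rather than facts about the dataset: yaw is taken to be in radians about the z axis, `length` is taken to lie along the box's local x axis, and the tuple layout `(cx, cy, cz, length, width, height, yaw)` is hypothetical.

```python
import math


def bev_corners(cx, cy, length, width, yaw):
    """Counter-clockwise corners of a rectangle rotated by yaw (radians)."""
    c, s = math.cos(yaw), math.sin(yaw)
    local = [(length / 2, width / 2), (-length / 2, width / 2),
             (-length / 2, -width / 2), (length / 2, -width / 2)]
    return [(cx + c * x - s * y, cy + s * x + c * y) for x, y in local]


def polygon_area(poly):
    """Shoelace formula; positive for counter-clockwise vertex order."""
    return 0.5 * sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]))


def clip_convex(subject, clipper):
    """Sutherland-Hodgman: clip `subject` by a convex, CCW `clipper`."""
    output = subject
    for (ax, ay), (bx, by) in zip(clipper, clipper[1:] + clipper[:1]):
        if not output:
            break
        current, output = output, []

        def side(p):
            # >= 0 when p is on or to the left of the directed edge a -> b
            return (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)

        for p, q in zip(current, current[1:] + current[:1]):
            sp, sq = side(p), side(q)
            if sp >= 0:
                output.append(p)
            if (sp > 0 > sq) or (sp < 0 < sq):
                # Edge p -> q crosses the clip line: keep the crossing point
                t = sp / (sp - sq)
                output.append((p[0] + t * (q[0] - p[0]),
                               p[1] + t * (q[1] - p[1])))
    return output


def iou_3d_oriented(a, b):
    """Oriented 3D IoU for (cx, cy, cz, length, width, height, yaw) boxes."""
    inter_bev = polygon_area(clip_convex(
        bev_corners(a[0], a[1], a[3], a[4], a[6]),
        bev_corners(b[0], b[1], b[3], b[4], b[6])))
    z_overlap = max(0.0, min(a[2] + a[5] / 2, b[2] + b[5] / 2)
                    - max(a[2] - a[5] / 2, b[2] - b[5] / 2))
    inter = max(0.0, inter_bev) * z_overlap
    union = a[3] * a[4] * a[5] + b[3] * b[4] * b[5] - inter
    return inter / union if union > 0 else 0.0


# Made-up boxes: same center and size (4 x 1 x 1), yaw differs by 90 degrees
box_a = (0.0, 0.0, 0.0, 4.0, 1.0, 1.0, 0.0)
box_b = (0.0, 0.0, 0.0, 4.0, 1.0, 1.0, math.pi / 2)
iou_rotated = iou_3d_oriented(box_a, box_b)
print(round(iou_rotated, 4))  # 0.1429
```

The axis-aligned IoU used in this notebook would score this pair as 1.0, since centers and dimensions are identical, while the oriented overlap is only 1/7. That is why the axis-aligned figure should be read as an approximation rather than as a bound on the oriented IoU.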