# Pipeline Evaluation Notebook

This notebook evaluates the **full two-stage detection + classification pipeline** using ONNX-exported models for MegaDetector v6 and a ConvNeXt animal species classifier.  
It tests performance across **cis** (in-domain) and **trans** (out-of-domain) validation and test sets.

## Main Steps

1. **Load ONNX Models**
   - MegaDetector v6 (`megadetectorv6.onnx`) for detecting animals/vehicles.
   - ConvNeXt classifier (`convnext_classifier.onnx`) for identifying 13 animal species.

2. **Run MegaDetector on All Splits**
   - Processes all images in `cis_val`, `cis_test`, `trans_val`, and `trans_test`.
   - Saves raw detection predictions (`*_md_preds.csv`) with bounding boxes, confidence scores, and coarse class labels (animal/vehicle).

3. **Classify Animal Detections**
   - For each detection with class `"animal"`, crops the bounding box and runs the ConvNeXt classifier.
   - Assigns the most probable species along with its softmax confidence score.
   - Vehicles are kept as `"car"` without classification.
   - Applies a **confidence threshold (0.55)** to the classifier output: animal detections scoring below it are dropped, while vehicle detections are kept regardless of score.

4. **Save Final Predictions**
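   - Outputs per-split final results as JSON files (`*_final_predictions.json`) with COCO-style category IDs. Each record holds `filename`, `bbox` as `[x1, y1, x2, y2]` (not COCO's `[x, y, w, h]`), `category`, `category_id`, and `conf`.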
   - Logs confidence score statistics before/after thresholding for debugging.

5. **Match Predictions to Ground Truth**
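   - Matches each prediction to the first ground-truth box in the same image with IoU ≥ 0.3, or with at least 60% of the ground-truth box area inside the predicted box.
   - Records:
     - Matched predictions
     - Unmatched predictions
     - Per-class and overall accuracies (computed over matched predictions only)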

6. **Generate and Save Evaluation Reports**
   - **Classification reports** (precision, recall, F1-score) per split.
   - **Comprehensive metrics CSV** (overall accuracy, weighted F1, per-class accuracies).
   - **Confusion matrices** (CSV + normalized PNG heatmaps).
   - CSVs of unmatched predictions for error analysis.
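
To make the matching rule in step 5 concrete, here is a minimal, self-contained sketch with made-up boxes. The helper names `containment` and `is_match` are illustrative and are not the notebook's own functions; the thresholds are the ones stated above.

```python
def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def containment(gt_box, pred_box):
    """Fraction of the ground-truth box area that lies inside the prediction."""
    ix1, iy1 = max(gt_box[0], pred_box[0]), max(gt_box[1], pred_box[1])
    ix2, iy2 = min(gt_box[2], pred_box[2]), min(gt_box[3], pred_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    gt_area = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    return inter / gt_area if gt_area > 0 else 0.0


def is_match(pred_box, gt_box, iou_thresh=0.3, containment_thresh=0.6):
    """A prediction matches if EITHER criterion is satisfied."""
    return (iou(pred_box, gt_box) >= iou_thresh
            or containment(gt_box, pred_box) >= containment_thresh)


# Made-up boxes: a small ground-truth box inside a much larger prediction.
gt_box = [100, 100, 150, 150]   # 50 x 50 = 2500 px
pred_box = [50, 50, 250, 250]   # 200 x 200 = 40000 px

print(iou(pred_box, gt_box))          # 0.0625, below the 0.3 IoU threshold
print(containment(gt_box, pred_box))  # 1.0, the ground truth is fully covered
print(is_match(pred_box, gt_box))     # True, via the containment rule
```

The containment rule accepts loose boxes that IoU alone would reject, which suits a detector that tends to draw generous boxes around small animals. The trade-off is that a very large prediction can match any ground-truth box it happens to cover.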

## Purpose
This notebook provides an **end-to-end evaluation** of the combined detection + classification pipeline across in-domain and out-of-domain conditions, showing both:
- Detection quality from MegaDetector (how many predictions match a ground-truth box)
- Species classification performance from ConvNeXt on the matched detections
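
Because the accuracy figures are computed over matched predictions only, it helps to see what they do and do not capture. The sketch below uses hypothetical label pairs and a hypothetical helper, `per_class_accuracy`; it mirrors the definitions used in the evaluation cell rather than reproducing that cell.

```python
from collections import Counter

# Hypothetical (true_label, predicted_label) pairs for matched predictions.
matched_pairs = [
    ("cat", "cat"), ("cat", "cat"), ("cat", "cat"), ("cat", "bobcat"),
    ("bobcat", "bobcat"), ("bobcat", "bobcat"),
    ("rodent", "rodent"), ("rodent", "squirrel"),
]
total_predictions = 10  # hypothetical: 8 matched + 2 unmatched


def per_class_accuracy(pairs):
    """Share of each true class that was predicted correctly (per-class recall)."""
    totals = Counter(true for true, _ in pairs)
    correct = Counter(true for true, pred in pairs if true == pred)
    return {name: correct[name] / totals[name] for name in sorted(totals)}


overall_accuracy = sum(t == p for t, p in matched_pairs) / len(matched_pairs)
matching_rate = len(matched_pairs) / total_predictions

print(per_class_accuracy(matched_pairs))  # {'bobcat': 1.0, 'cat': 0.75, 'rodent': 0.5}
print(overall_accuracy)                   # 0.75
print(matching_rate)                      # 0.8
```

In this toy data `bobcat` scores 1.0 even though one cat was predicted as a bobcat: per-class accuracy defined this way is recall, so false positives only show up in the precision column of the classification report. Ground-truth animals that received no prediction at all do not enter any of these numbers.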


In [None]:
import json
import os
from pathlib import Path

import cv2
import numpy as np
import onnxruntime as ort
import pandas as pd
import torch
from tqdm import tqdm

# --- Paths ---
ROOT = Path("../")
SPLITS = ["cis_val", "cis_test", "trans_val", "trans_test"]
IMG_DIRS = {split: ROOT / "data" / "megadetector_images" / split / "images" for split in SPLITS}
JSON_PATHS = {split: ROOT / "annotations" / f"{split}.json" for split in SPLITS}

# ONNX models
MD_ONNX_PATH = ROOT / "models" / "megadetectorv6.onnx"
CLS_ONNX_PATH = ROOT / "models" / "convnext_classifier.onnx"

# Output directory for detections and metrics
OUTPUT_DIR = ROOT / "eval" / "pipeline_results"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Load ONNX models
md_sess = ort.InferenceSession(str(MD_ONNX_PATH), providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
cls_sess = ort.InferenceSession(str(CLS_ONNX_PATH), providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

print("Models loaded successfully.")

# Category mapping
CATEGORY_NAME_TO_ID = {
    "bobcat": 6,
    "opossum": 1,
    "coyote": 9,
    "raccoon": 3,
    "bird": 11,
    "dog": 8,
    "cat": 16,
    "squirrel": 5,
    "rabbit": 10,
    "skunk": 7,
    "rodent": 99,
    "badger": 21,
    "deer": 34,
    "car": 33  # for detector only
}
CATEGORY_ID_TO_NAME = {v: k for k, v in CATEGORY_NAME_TO_ID.items()}

# Classifier class list (excluding car)
CLASSIFIER_CLASSES = [
    "badger", "bird", "bobcat", "cat", "coyote", "deer", "dog", "opossum",
    "rabbit", "raccoon", "rodent", "skunk", "squirrel"
]


Models loaded successfully.
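
The classifier returns an index into `CLASSIFIER_CLASSES`, which is then translated to a dataset category ID through `CATEGORY_NAME_TO_ID`, so the two structures have to agree. A small sanity check along the following lines can catch a mismatch early. This is a sketch: `check_category_mapping` and `class_index_to_category_id` are hypothetical helpers, and the mappings are copied from the cell above.

```python
CATEGORY_NAME_TO_ID = {
    "bobcat": 6, "opossum": 1, "coyote": 9, "raccoon": 3, "bird": 11,
    "dog": 8, "cat": 16, "squirrel": 5, "rabbit": 10, "skunk": 7,
    "rodent": 99, "badger": 21, "deer": 34, "car": 33,
}
CLASSIFIER_CLASSES = [
    "badger", "bird", "bobcat", "cat", "coyote", "deer", "dog", "opossum",
    "rabbit", "raccoon", "rodent", "skunk", "squirrel",
]


def check_category_mapping(name_to_id, classifier_classes):
    """Return a list of problems; an empty list means the mapping is consistent."""
    problems = []
    missing = [n for n in classifier_classes if n not in name_to_id]
    if missing:
        problems.append(f"classes without a category ID: {missing}")
    if len(set(name_to_id.values())) != len(name_to_id):
        problems.append("duplicate category IDs: the inverse mapping would drop entries")
    if len(set(classifier_classes)) != len(classifier_classes):
        problems.append("duplicate class names in the classifier list")
    return problems


def class_index_to_category_id(idx):
    """Translate a classifier output index to the dataset category ID."""
    return CATEGORY_NAME_TO_ID[CLASSIFIER_CLASSES[idx]]


print(check_category_mapping(CATEGORY_NAME_TO_ID, CLASSIFIER_CLASSES))  # []
print(len(CLASSIFIER_CLASSES))                                          # 13
print(class_index_to_category_id(3))                                    # 16 (cat)
```

The check cannot confirm that the list order matches the label order the classifier was trained with. The list is alphabetical, which is a common convention, but that correspondence is an assumption that has to be verified against the training code.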


In [None]:


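def load_megadetector_session(model_path):
    # Pass providers explicitly (CUDA if available, else CPU), matching md_sess above
    return ort.InferenceSession(
        str(model_path),
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )

def preprocess_image(img, input_size=(640, 640)):
    """Convert a BGR image to a float32 tensor of shape (1, 3, H, W) scaled to [0, 1]."""
    img_resized = cv2.resize(img, input_size)
    img_rgb = cv2.cvtColor(img_resized, cv2.COLOR_BGR2RGB)
    img_norm = img_rgb.astype(np.float32) / 255.0
    img_transposed = np.transpose(img_norm, (2, 0, 1))  # HWC -> CHW
    return img_transposed[np.newaxis, :, :, :]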

def run_megadetector_on_split(split, session, conf_th=0.35, out_dir=Path("pipeline_results")):
    img_dir = IMG_DIRS[split]
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / f"{split}_md_preds.csv"
    records = []

    for img_path in tqdm(list(img_dir.glob("*.jpg")), desc=f"Running MD on {split}"):
        img = cv2.imread(str(img_path))
        if img is None:
            print(f" Failed to read {img_path}")
            continue

        orig_h, orig_w = img.shape[:2]
        # Direct resize to 640x640 (no letterboxing), BGR -> RGB, scale to [0, 1], NCHW
        img_input = preprocess_image(img, input_size=(640, 640))

        outputs = session.run(None, {"images": img_input})
        raw = outputs[0][0]  # shape: (N, 6)

        scale_x = orig_w / 640
        scale_y = orig_h / 640

        for det in raw:
            x1, y1, x2, y2, conf, cls_id = det
            if conf < conf_th:
                continue

            cls_id = int(cls_id)
            if cls_id == 0:
                cls_name = "animal"
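            elif cls_id in (1, 2):
                # NOTE: IDs 1 and 2 are both mapped to "vehicle". If this export keeps
                # the usual MegaDetector order (0 animal, 1 person, 2 vehicle), person
                # detections are also labeled "vehicle" and become "car" downstream.
                cls_name = "vehicle"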
            else:
                continue  # skip unknowns

            # Rescale to original size
            x1 *= scale_x
            y1 *= scale_y
            x2 *= scale_x
            y2 *= scale_y

            records.append({
                "filename": img_path.name,
                "x1": float(x1), "y1": float(y1), "x2": float(x2), "y2": float(y2),
                "conf": float(conf),
                "class": cls_name,
                "class_id": cls_id
            })

    pd.DataFrame(records).to_csv(csv_path, index=False)
    print(f"✅ Saved predictions for {split} → {csv_path}")



In [None]:
# Reuse the model path defined in the setup cell
session = load_megadetector_session(str(MD_ONNX_PATH))

for split in SPLITS:
    run_megadetector_on_split(split, session)

Running MD on cis_val: 100%|██████████| 1764/1764 [07:41<00:00,  3.82it/s]


✅ Saved predictions for cis_val → pipeline_results\cis_val_md_preds.csv


Running MD on cis_test: 100%|██████████| 12141/12141 [55:41<00:00,  3.63it/s]


✅ Saved predictions for cis_test → pipeline_results\cis_test_md_preds.csv


Running MD on trans_val: 100%|██████████| 1972/1972 [09:08<00:00,  3.60it/s]


✅ Saved predictions for trans_val → pipeline_results\trans_val_md_preds.csv


Running MD on trans_test: 100%|██████████| 18553/18553 [1:32:00<00:00,  3.36it/s]

✅ Saved predictions for trans_test → pipeline_results\trans_test_md_preds.csv
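
The detector sees every image resized directly to 640 × 640, so the boxes it returns have to be scaled back to the original resolution independently along each axis. The sketch below isolates that step with a made-up image size. `rescale_box` is a hypothetical helper, and it assumes the exported model reports boxes in pixel coordinates of the 640 × 640 input, as the cell above does.

```python
def rescale_box(box, orig_w, orig_h, model_size=640):
    """Map an (x1, y1, x2, y2) box from model-input pixels to original-image pixels."""
    x1, y1, x2, y2 = box
    scale_x = orig_w / model_size  # horizontal stretch undone by the resize
    scale_y = orig_h / model_size  # vertical stretch, generally different
    return (x1 * scale_x, y1 * scale_y, x2 * scale_x, y2 * scale_y)


# Hypothetical 1280 x 960 camera-trap image: scale_x = 2.0, scale_y = 1.5
box_in_model_space = (160.0, 320.0, 480.0, 480.0)
print(rescale_box(box_in_model_space, orig_w=1280, orig_h=960))
# (320.0, 480.0, 960.0, 720.0)
```

Separate x and y factors are only correct because the preprocessing stretches the image without preserving aspect ratio. A letterboxed input would need the padding offsets subtracted before a single shared scale factor is applied.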





In [1]:
def clip_box(x1, y1, x2, y2, img_w, img_h):
    x1 = max(0, min(int(x1), img_w - 1))
    y1 = max(0, min(int(y1), img_h - 1))
    x2 = max(0, min(int(x2), img_w - 1))
    y2 = max(0, min(int(y2), img_h - 1))
    return x1, y1, x2, y2
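
A quick worked example shows what `clip_box` does to boxes that extend past the image and why a degenerate-box check is needed afterwards. The image size and boxes are made up, and `is_degenerate` is a hypothetical helper that restates the `x2 <= x1 or y2 <= y1` test used in the classification loop.

```python
def clip_box(x1, y1, x2, y2, img_w, img_h):
    """Truncate coordinates to integers and clamp them to valid pixel indices."""
    x1 = max(0, min(int(x1), img_w - 1))
    y1 = max(0, min(int(y1), img_h - 1))
    x2 = max(0, min(int(x2), img_w - 1))
    y2 = max(0, min(int(y2), img_h - 1))
    return x1, y1, x2, y2


def is_degenerate(x1, y1, x2, y2):
    """True if the clipped box has no width or no height."""
    return x2 <= x1 or y2 <= y1


# Hypothetical 100 x 80 image.
partly_outside = clip_box(-12.4, 10.9, 130.0, 95.2, img_w=100, img_h=80)
fully_outside = clip_box(150.0, 20.0, 180.0, 60.0, img_w=100, img_h=80)

print(partly_outside, is_degenerate(*partly_outside))  # (0, 10, 99, 79) False
print(fully_outside, is_degenerate(*fully_outside))    # (99, 20, 99, 60) True
```

Note that `int()` truncates toward zero rather than rounding. Since the right and bottom edges are clamped to `img_w - 1` and `img_h - 1`, a crop taken with the slice `img[y1:y2, x1:x2]` leaves out the last row and column of a box that reaches the image border.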

In [None]:


# Config
CONF_THRESH = 0.55
IMG_SIZE = 224
input_name = cls_sess.get_inputs()[0].name

def preprocess_crop(img):
    """Convert a BGR crop to an ImageNet-normalized float32 tensor of shape (1, 3, IMG_SIZE, IMG_SIZE)."""
    # Convert BGR to RGB
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # Direct resize to IMG_SIZE x IMG_SIZE (aspect ratio is not preserved)
    resized = cv2.resize(img_rgb, (IMG_SIZE, IMG_SIZE))

    # Scale to [0, 1], then normalize with the ImageNet mean and std
    norm = resized.astype(np.float32) / 255.0
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    norm = (norm - mean) / std

    # Convert to CHW format and add batch dimension
    transposed = np.transpose(norm, (2, 0, 1))
    return transposed[np.newaxis, :, :, :]

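def debug_confidence_scores(split, results_before_threshold, results_after_threshold):
    """
    Print confidence score statistics before and after thresholding.

    Vehicle records carry the detector confidence and bypass the threshold, so the
    "after threshold" count can exceed the number of scores >= CONF_THRESH.
    """
    # Extract confidence scores (a score of exactly 0.0 is still counted)
    confs_before = [r["conf"] for r in results_before_threshold if r.get("conf") is not None]
    confs_after = [r["conf"] for r in results_after_threshold if r.get("conf") is not None]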
    
    print(f"\n {split} - Confidence Analysis:")
    print(f"Total detections before threshold: {len(confs_before)}")
    print(f"Total detections after threshold (≥{CONF_THRESH}): {len(confs_after)}")
    print(f"Rejected: {len(confs_before) - len(confs_after)}")
    
    if confs_before:
        print(f"Confidence stats (before threshold):")
        print(f"  Mean: {np.mean(confs_before):.3f}")
        print(f"  Median: {np.median(confs_before):.3f}")
        print(f"  Min: {np.min(confs_before):.3f}")
        print(f"  Max: {np.max(confs_before):.3f}")
        print(f"  <0.3: {sum(c < 0.3 for c in confs_before)}")
        print(f"  0.3-0.5: {sum(0.3 <= c < 0.5 for c in confs_before)}")
        print(f"  0.5-0.55: {sum(0.5 <= c < 0.55 for c in confs_before)}")
        print(f"  ≥0.55: {sum(c >= 0.55 for c in confs_before)}")


def softmax(x):
    e_x = np.exp(x - np.max(x))  # numerical stability
    return e_x / e_x.sum()

# Process each split
for split in SPLITS:
    print(f"\nProcessing split: {split}")
    md_csv = OUTPUT_DIR / f"{split}_md_preds.csv"
    df = pd.read_csv(md_csv)

    results_all = []  # Before thresholding
    results = []      # After thresholding
    img_dir = IMG_DIRS[split]
    grouped = df.groupby("filename")

    for filename, rows in tqdm(grouped, desc=f"Processing {split}"):
        img_path = img_dir / filename
        img = cv2.imread(str(img_path))
        if img is None:
            continue
        H, W = img.shape[:2]

        for _, row in rows.iterrows():
            x1, y1, x2, y2 = clip_box(row.x1, row.y1, row.x2, row.y2, W, H)

            if x2 <= x1 or y2 <= y1:
                continue

            if row["class"] == "vehicle":
                pred = {
                    "filename": filename,
                    "bbox": [int(x1), int(y1), int(x2), int(y2)],
                    "category": "car",
                    "category_id": CATEGORY_NAME_TO_ID["car"],
                    "conf": float(row.conf)  # detector confidence (no classifier is run)
                }
                results_all.append(pred)
                results.append(pred)  # No threshold for vehicles
                
            elif row["class"] == "animal":
                crop = img[y1:y2, x1:x2]
                if crop.size == 0:
                    continue

                inp = preprocess_crop(crop)  # Use the fixed preprocessing
                logits = cls_sess.run(None, {input_name: inp})[0][0]
                
                probs = softmax(logits.astype(np.float32))
                cls_idx = np.argmax(probs)
                conf = probs[cls_idx]
                
                name = CLASSIFIER_CLASSES[cls_idx]
                coco_id = CATEGORY_NAME_TO_ID[name]

                pred = {
                    "filename": filename,
                    "bbox": [int(x1), int(y1), int(x2), int(y2)],
                    "category": str(name),
                    "category_id": int(coco_id),
                    "conf": float(conf)
                }
                results_all.append(pred)
                
                # Only add to final results if above threshold
                if conf >= CONF_THRESH:
                    results.append(pred)
                
    debug_confidence_scores(split, results_all, results)

    # Save final predictions
    out_json = OUTPUT_DIR / f"{split}_final_predictions.json"
    with open(out_json, "w") as f:
        json.dump(results, f, indent=2)
    print(f"Saved {len(results)} predictions to {out_json}")


Processing split: cis_val


Processing cis_val: 100%|██████████| 1645/1645 [00:19<00:00, 85.54it/s]



 cis_val - Confidence Analysis:
Total detections before threshold: 1779
Total detections after threshold (≥0.55): 1748
Rejected: 31
Confidence stats (before threshold):
  Mean: 0.888
  Median: 0.892
  Min: 0.238
  Max: 0.996
  <0.3: 2
  0.3-0.5: 18
  0.5-0.55: 11
  ≥0.55: 1748
Saved 1748 predictions to ..\eval\pipeline_results\cis_val_final_predictions.json

Processing split: cis_test


Processing cis_test: 100%|██████████| 11685/11685 [02:28<00:00, 78.87it/s]



 cis_test - Confidence Analysis:
Total detections before threshold: 12558
Total detections after threshold (≥0.55): 12286
Rejected: 272
Confidence stats (before threshold):
  Mean: 0.889
  Median: 0.897
  Min: 0.121
  Max: 0.996
  <0.3: 24
  0.3-0.5: 156
  0.5-0.55: 95
  ≥0.55: 12283
Saved 12286 predictions to ..\eval\pipeline_results\cis_test_final_predictions.json

Processing split: trans_val


Processing trans_val: 100%|██████████| 1780/1780 [00:19<00:00, 92.91it/s] 



 trans_val - Confidence Analysis:
Total detections before threshold: 1935
Total detections after threshold (≥0.55): 1863
Rejected: 72
Confidence stats (before threshold):
  Mean: 0.866
  Median: 0.881
  Min: 0.183
  Max: 0.995
  <0.3: 13
  0.3-0.5: 38
  0.5-0.55: 21
  ≥0.55: 1863
Saved 1863 predictions to ..\eval\pipeline_results\trans_val_final_predictions.json

Processing split: trans_test


Processing trans_test: 100%|██████████| 16145/16145 [03:13<00:00, 83.38it/s] 



 trans_test - Confidence Analysis:
Total detections before threshold: 17542
Total detections after threshold (≥0.55): 16765
Rejected: 777
Confidence stats (before threshold):
  Mean: 0.859
  Median: 0.879
  Min: 0.119
  Max: 0.996
  <0.3: 144
  0.3-0.5: 509
  0.5-0.55: 141
  ≥0.55: 16748
Saved 16765 predictions to ..\eval\pipeline_results\trans_test_final_predictions.json
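
In the logs above, the number of predictions kept can be slightly larger than the `≥0.55` bucket (for example 12286 versus 12283 on `cis_test`). That is consistent with the loop's logic: vehicle records carry the detector confidence and are kept without a threshold. The sketch below reproduces the effect on made-up data. It uses a pure-Python softmax instead of NumPy, and `keep` is a hypothetical helper summarizing the filtering rule.

```python
import math

CONF_THRESH = 0.55


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def keep(pred):
    """Vehicles always pass; animal predictions need conf >= CONF_THRESH."""
    return pred["category"] == "car" or pred["conf"] >= CONF_THRESH


# Made-up logits for a toy 3-class classifier; the top probability is about 0.659.
probs = softmax([2.0, 1.0, 0.1])
top_conf = max(probs)

preds = [
    {"category": "cat", "conf": top_conf},  # classifier score, above threshold
    {"category": "bird", "conf": 0.48},     # classifier score, rejected
    {"category": "car", "conf": 0.40},      # detector score, kept anyway
]

kept = [p for p in preds if keep(p)]
above_threshold = [p for p in preds if p["conf"] >= CONF_THRESH]

print(len(kept), len(above_threshold))  # 2 1
```

Because the two kinds of score come from different models, the pooled confidence statistics mix classifier probabilities with detector confidences. Reporting them separately would make the histogram buckets easier to interpret.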


In [None]:
import json
import numpy as np
import pandas as pd
from pathlib import Path
from collections import defaultdict
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns


GT_FILES = {
    "cis_val": "cis_val_fixed.json",
    "trans_val": "trans_val_fixed.json",
    "cis_test": "cis_test_noverify.json",
    "trans_test": "trans_test_noverify.json"
}

GT_DIR = Path("../data/preprocessed/annotations/cleaned")
IOU_THRESHOLD = 0.3
OUTPUT_CSV_DIR = Path("pipeline_results/eval_reports")
OUTPUT_CSV_DIR.mkdir(parents=True, exist_ok=True)


def gt_in_pred(gt_box, pred_box, containment_thresh=0.9):
    """
    Returns True if at least `containment_thresh` (default 0.9, i.e. 90%) of the
    ground-truth box area lies inside the predicted box.
    """
    gx1, gy1, gx2, gy2 = gt_box
    px1, py1, px2, py2 = pred_box

    # Compute intersection rectangle
    ix1 = max(gx1, px1)
    iy1 = max(gy1, py1)
    ix2 = min(gx2, px2)
    iy2 = min(gy2, py2)

    inter_w = max(0, ix2 - ix1)
    inter_h = max(0, iy2 - iy1)
    inter_area = inter_w * inter_h

    gt_area = max(1, (gx2 - gx1) * (gy2 - gy1))

    return (inter_area / gt_area) >= containment_thresh

def iou(boxA, boxB):
    """
    Compute Intersection over Union (IoU) between two bounding boxes.
    Boxes are in [x1, y1, x2, y2] format.
    """
    xA = max(boxA[0], boxB[0])
    yA = max(boxA[1], boxB[1])
    xB = min(boxA[2], boxB[2])
    yB = min(boxA[3], boxB[3])

    interW = max(0, xB - xA)
    interH = max(0, yB - yA)
    interArea = interW * interH

    boxAArea = max(1, (boxA[2] - boxA[0]) * (boxA[3] - boxA[1]))
    boxBArea = max(1, (boxB[2] - boxB[0]) * (boxB[3] - boxB[1]))

    # Return directly instead of assigning to a local that shadows the function name
    return interArea / float(boxAArea + boxBArea - interArea)



SPLITS = ["cis_val", "trans_val", "cis_test", "trans_test"]

for split in SPLITS:
    print(f"\nEvaluating split: {split}")
    
    pred_file = f"pipeline_results/{split}_final_predictions.json"
    gt_file = GT_DIR / GT_FILES[split]

    # Use context managers so the file handles are closed promptly
    with open(pred_file) as f:
        preds = json.load(f)
    with open(gt_file) as f:
        gt_data = json.load(f)

    image_id_to_name = {img["id"]: img["file_name"] for img in gt_data["images"]}
    gt_by_filename = defaultdict(list)
    for ann in gt_data["annotations"]:
        fn = image_id_to_name[ann["image_id"]]
        x, y, w, h = ann["bbox"]
        gt_box = [x, y, x + w, y + h]
        gt_by_filename[fn].append({"bbox": gt_box, "category_id": ann["category_id"]})
    id_to_name = {cat["id"]: cat["name"] for cat in gt_data["categories"]}

    y_true, y_pred, unmatched_preds = [], [], []
    
    for pred in preds:
        fn = pred["filename"]
        pred_box = pred["bbox"]
        pred_cat = pred["category_id"]

        best_gt = None
        for gt in gt_by_filename.get(fn, []):
            gt_box = gt["bbox"]
            iou_score = iou(pred_box, gt_box)
            containment = gt_in_pred(gt_box, pred_box, containment_thresh=0.6)

            # Simple OR condition - accept if EITHER condition is met
            if iou_score >= IOU_THRESHOLD or containment:
                best_gt = gt
                break  # First ground-truth box that qualifies, not necessarily the highest-IoU one

        if best_gt:
            y_true.append(best_gt["category_id"])
            y_pred.append(pred_cat)
        else:
            unmatched_preds.append(pred)

    print(f"  Total predictions: {len(preds)}")
    print(f"  Matched (IoU ≥ {IOU_THRESHOLD}): {len(y_true)}")
    print(f"  Unmatched: {len(unmatched_preds)}")

    # Calculate accuracy metrics
    y_true_names = [id_to_name.get(i, str(i)) for i in y_true]
    y_pred_names = [id_to_name.get(i, str(i)) for i in y_pred]
    
    # Overall accuracy
    overall_accuracy = accuracy_score(y_true_names, y_pred_names)
    print(f"  Overall Accuracy: {overall_accuracy:.3f}")

    # Classification report for detailed metrics
    report_dict = classification_report(
        y_true_names,
        y_pred_names,
        output_dict=True,
        zero_division=0
    )
    
    # Weighted-average F1 (the variable name is historical; this is not an accuracy)
    weighted_accuracy = report_dict['weighted avg']['f1-score']
    print(f"  Weighted F1-Score: {weighted_accuracy:.3f}")

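    # Per-class accuracy: share of matched ground-truth instances of each class
    # that were predicted correctly (i.e. per-class recall over matched pairs)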
    print("  Per-class Accuracy:")
    class_accuracies = {}
    for class_name in sorted(set(y_true_names)):
        # Get indices for this class in true labels
        class_indices = [i for i, true_label in enumerate(y_true_names) if true_label == class_name]
        if class_indices:
            class_true = [y_true_names[i] for i in class_indices]
            class_pred = [y_pred_names[i] for i in class_indices]
            class_acc = accuracy_score(class_true, class_pred)
            class_accuracies[class_name] = class_acc
            print(f"    {class_name}: {class_acc:.3f}")

    # Create comprehensive metrics summary
    metrics_summary = {
        "split": split,
        "total_predictions": len(preds),
        "matched_predictions": len(y_true),
        "unmatched_predictions": len(unmatched_preds),
        # Guard against an empty prediction file
        "matching_rate": len(y_true) / len(preds) if preds else 0.0,
        "overall_accuracy": overall_accuracy,
        "weighted_f1": weighted_accuracy,
        **{f"{class_name}_accuracy": acc for class_name, acc in class_accuracies.items()}
    }
    
    # Save comprehensive metrics
    metrics_df = pd.DataFrame([metrics_summary])
    metrics_df.to_csv(OUTPUT_CSV_DIR / f"{split}_comprehensive_metrics.csv", index=False)

    # Original outputs
    report_df = pd.DataFrame(report_dict).transpose()
    report_df.to_csv(OUTPUT_CSV_DIR / f"{split}_classification_report.csv")

    labels_sorted = sorted(set(id_to_name.values()))
    cm = confusion_matrix(y_true_names, y_pred_names, labels=labels_sorted)
    cm_df = pd.DataFrame(cm, index=labels_sorted, columns=labels_sorted)
    cm_df.to_csv(OUTPUT_CSV_DIR / f"{split}_confusion_matrix.csv")

    # Normalize by true label (row-wise) - fix division by zero
    cm_normalized = cm.astype(np.float64)
    row_sums = cm_normalized.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1  # Avoid division by zero
    cm_normalized = cm_normalized / row_sums
    cm_normalized_df = pd.DataFrame(cm_normalized, index=labels_sorted, columns=labels_sorted)

    # Save normalized confusion matrix as PNG
    plt.figure(figsize=(12, 10))
    sns.heatmap(cm_normalized_df, annot=True, fmt=".2f", cmap="Blues", vmin=0, vmax=1,
                xticklabels=True, yticklabels=True)
    plt.title(f"{split} Normalized Confusion Matrix (Row-wise)")
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.xticks(rotation=45, ha='right')
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.savefig(OUTPUT_CSV_DIR / f"{split}_confusion_matrix.png", dpi=300)
    plt.close()

    pd.DataFrame(unmatched_preds).to_csv(OUTPUT_CSV_DIR / f"{split}_unmatched_preds.csv", index=False)

print("\nEvaluation complete!")


Evaluating split: cis_val
  Total predictions: 1748
  Matched (IoU ≥ 0.3): 1702
  Unmatched: 46
  Overall Accuracy: 0.981
  Weighted F1-Score: 0.981
  Per-class Accuracy:
    bird: 0.957
    bobcat: 0.952
    car: 1.000
    cat: 0.975
    coyote: 0.978
    deer: 1.000
    dog: 0.980
    opossum: 0.989
    rabbit: 0.992
    raccoon: 0.976
    rodent: 1.000
    skunk: 1.000
    squirrel: 0.966

Evaluating split: trans_val
  Total predictions: 1863
  Matched (IoU ≥ 0.3): 1783
  Unmatched: 80
  Overall Accuracy: 0.942
  Weighted F1-Score: 0.946
  Per-class Accuracy:
    bird: 0.842
    bobcat: 0.934
    car: 1.000
    cat: 0.787
    coyote: 0.967
    dog: 0.938
    opossum: 0.968
    rabbit: 1.000
    raccoon: 0.907
    rodent: 0.400
    skunk: 0.950
    squirrel: 0.935

Evaluating split: cis_test
  Total predictions: 12286
  Matched (IoU ≥ 0.3): 11948
  Unmatched: 338
  Overall Accuracy: 0.977
  Weighted F1-Score: 0.977
  Per-class Accuracy:
    badger: 0.000
    bird: 0.987
    bobcat: 