# Object Detection with Depth Estimation

This notebook implements a combined object detection and depth estimation pipeline for peatland navigation. It integrates:

1. Object Detection:
   - YOLO-based bench detection
   - Confidence thresholding
   - Frame-by-frame video processing

2. Depth Estimation:
   - MiDaS depth model
   - Relative distance calculation
   - Object-specific depth analysis

3. Visualization:
   - Annotated video output
   - Distance measurements
   - Visual markers

The pipeline attaches a relative distance estimate to each detected bench, so the output combines semantic information (what the object is) with spatial information (how far away it appears).

## 1. Required Libraries

In [6]:
# Deep learning and object detection
import torch
from ultralytics import YOLO

# File and path handling
from pathlib import Path

# Image and video processing
import cv2
from PIL import Image
import numpy as np

# Progress tracking
from tqdm import tqdm

## 2. Configuration

Set up the inference pipeline:

1. Model Selection:
   - Detection model path
   - Confidence thresholds
   - Device configuration

2. Input/Output:
   - Source video selection
   - Output directory structure
   - File path resolution

3. Processing Parameters:
   - Detection confidence
   - Hardware acceleration
   - Directory management

In [7]:
# Model configuration
RUN_NAME = "finetuned"                 # Name of the fine-tuned model run
PROJECT_DIR = "./training/metrics/detection"       # Base directory for model artifacts

# Input video selection
VIDEO_FILENAME = "Clip_3_35s.mp4"      # Source video file name

# Detection parameters
CONFIDENCE_THRESHOLD = 0.4              # Minimum confidence for valid detections

# Path configuration
MODEL_PATH = Path(PROJECT_DIR) / RUN_NAME / "weights/best.pt"     # Model weights
VIDEO_PATH = Path("./data/video/splits") / VIDEO_FILENAME        # Source video
OUTPUT_DIR = Path(PROJECT_DIR) / RUN_NAME / "depth_predictions"   # Output directory
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)                     # Ensure output exists (create missing parents too)
OUTPUT_VIDEO_PATH = OUTPUT_DIR / f"depth_{VIDEO_FILENAME}"        # Output video path

# Device configuration
if torch.cuda.is_available():
    DEVICE = "cuda"      # Use GPU if available
elif torch.backends.mps.is_available():
    DEVICE = "mps"       # Use Apple Silicon if available
else:
    DEVICE = "cpu"       # Fall back to CPU

# Display configuration
print(f"Loading model from run: {RUN_NAME}")
print(f"Processing video: {VIDEO_PATH}")
print(f"Using device: {DEVICE}")

Loading model from run: finetuned
Processing video: data/video/splits/Clip_3_35s.mp4
Using device: mps
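
The path resolution and directory management steps above can be checked in isolation before any model is loaded. The following is a minimal sketch, not part of the original notebook: `resolve_paths` and `missing_inputs` are hypothetical helper names, and the example runs against a temporary directory rather than the real project tree.

```python
from pathlib import Path
import tempfile


def resolve_paths(project_dir, run_name, video_filename, video_dir):
    """Build the weight, input, and output paths used by the pipeline (hypothetical helper)."""
    run_dir = Path(project_dir) / run_name
    model_path = run_dir / "weights" / "best.pt"
    video_path = Path(video_dir) / video_filename
    output_dir = run_dir / "depth_predictions"
    # parents=True also creates missing intermediate directories
    output_dir.mkdir(parents=True, exist_ok=True)
    output_video_path = output_dir / f"depth_{video_filename}"
    return model_path, video_path, output_video_path


def missing_inputs(*paths):
    """Return the paths that do not exist, so the caller can fail early."""
    return [p for p in paths if not Path(p).exists()]


# Usage against a throwaway directory (no real weights or video needed)
with tempfile.TemporaryDirectory() as tmp:
    model_path, video_path, out_path = resolve_paths(
        tmp, "finetuned", "Clip_3_35s.mp4", Path(tmp) / "video"
    )
    print(out_path.name)                                # depth_Clip_3_35s.mp4
    print(len(missing_inputs(model_path, video_path)))  # 2, neither input was created
```

Checking for missing inputs up front gives a clearer error than a failure inside `YOLO(...)` or `cv2.VideoCapture(...)` later in the notebook.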


## 3. Model Initialization

Load and configure inference models:

1. YOLO Detection Model:
   - Load fine-tuned weights
   - Configure class mapping
   - Set inference parameters

2. MiDaS Depth Model:
   - Small model variant
   - Optimized transforms
   - GPU acceleration

3. Model Configuration:
   - Device placement
   - Evaluation mode
   - Transform pipeline

In [8]:
# Initialize YOLO detection model
detection_model = YOLO(MODEL_PATH)

# Get bench class ID from model configuration
BENCH_CLASS_ID = [k for k, v in detection_model.names.items() if v == 'bench'][0]
print(f"YOLO model loaded. 'bench' class ID is: {BENCH_CLASS_ID}")

# Initialize MiDaS depth estimation model
midas_model_type = "MiDaS_small"  # Smallest, fastest MiDaS variant (trades accuracy for speed)
midas = torch.hub.load("intel-isl/MiDaS", midas_model_type)

# Configure MiDaS model
midas.to(DEVICE)        # Move to appropriate device
midas.eval()           # Set to evaluation mode

# Load appropriate MiDaS transforms
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = (midas_transforms.small_transform 
            if midas_model_type == "MiDaS_small" 
            else midas_transforms.dpt_transform)

print("MiDaS depth model loaded successfully.")

YOLO model loaded. 'bench' class ID is: 0


Using cache found in /Users/stahlma/.cache/torch/hub/intel-isl_MiDaS_master


Loading weights:  None


Using cache found in /Users/stahlma/.cache/torch/hub/rwightman_gen-efficientnet-pytorch_master


MiDaS depth model loaded successfully.


Using cache found in /Users/stahlma/.cache/torch/hub/intel-isl_MiDaS_master
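
The class ID lookup in the cell above takes element `[0]` of a list comprehension, which raises a bare `IndexError` if the loaded weights have no class named `'bench'`. A small sketch of a more explicit lookup follows. `find_class_id` is a hypothetical helper, and `example_names` is an illustrative `{id: name}` mapping standing in for `detection_model.names`.

```python
def find_class_id(names, target):
    """Return the ID of `target` in an {id: name} mapping, or raise a descriptive error."""
    matches = [class_id for class_id, name in names.items() if name == target]
    if not matches:
        available = sorted(names.values())
        raise KeyError(f"Class {target!r} not found; available classes: {available}")
    if len(matches) > 1:
        raise ValueError(f"Class {target!r} is ambiguous; matching IDs: {matches}")
    return matches[0]


# Illustrative mapping only; the real one comes from the loaded model
example_names = {0: "bench", 1: "person"}
print(find_class_id(example_names, "bench"))  # 0

try:
    find_class_id(example_names, "boardwalk")
except KeyError as err:
    print(f"Lookup failed: {err}")
```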


## 4. Combined Inference Pipeline

Execute multi-model inference:

1. Video Processing:
   - Frame extraction
   - Resolution handling
   - Output video setup

2. Detection Pipeline:
   - YOLO inference
   - Confidence filtering
   - Bounding box extraction

3. Depth Analysis:
   - MiDaS inference
   - Depth map generation
   - Distance calculation

4. Visualization:
   - Bounding box drawing
   - Distance annotation
   - Video composition
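
The distance calculation step can be sketched on its own with NumPy. MiDaS predicts relative inverse depth (larger values are closer), so the pipeline takes the median value inside the bounding box and inverts it. The helper below is an illustrative sketch, not the notebook's code: `relative_distance` is a hypothetical name, the clipping and `None` returns are added safeguards, and the depth map is synthetic.

```python
import numpy as np


def relative_distance(depth_map, box, scale=100.0):
    """Relative distance for a box from an inverse-depth map, or None if undefined."""
    x1, y1, x2, y2 = box
    height, width = depth_map.shape[:2]
    # Clip the box to the image so slicing never wraps around
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    region = depth_map[y1:y2, x1:x2]
    if region.size == 0:
        return None  # Degenerate box: no pixels to sample
    median_inv_depth = float(np.median(region))  # Median is robust to outlier pixels
    if median_inv_depth <= 0:
        return None  # Inversion would be undefined or negative
    return scale / median_inv_depth


# Synthetic inverse-depth map: background at 50, a closer object at 200
depth = np.full((100, 100), 50.0)
depth[40:60, 40:60] = 200.0

print(relative_distance(depth, (40, 40, 60, 60)))  # 0.5 (close object)
print(relative_distance(depth, (0, 0, 20, 20)))    # 2.0 (background)
print(relative_distance(depth, (30, 30, 30, 50)))  # None (zero-width box)
```

The result is unitless: it orders objects by apparent distance but is not a measurement in metres unless it is calibrated against known distances.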

In [None]:
# Initialize video capture
cap = cv2.VideoCapture(str(VIDEO_PATH))
if not cap.isOpened():
    print(f"Error: Could not open video file {VIDEO_PATH}")
else:
    # Configure video parameters
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))     # Video width
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))   # Video height
    fps = cap.get(cv2.CAP_PROP_FPS)                          # Frame rate (kept as float so e.g. 29.97 is not truncated)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))    # Total frames
    
    # Initialize video writer
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')                 # Video codec
    out = cv2.VideoWriter(str(OUTPUT_VIDEO_PATH), fourcc, fps, 
                         (frame_width, frame_height))

    print(f"Processing {total_frames} frames...")

    dist_series = []
    area_series = []
    frame_idx = []
    
    # Process each frame
    for _ in tqdm(range(total_frames)):
        # Read the next frame
        ret, frame = cap.read()
        if not ret:
            break
            
        # 1. Object Detection
        results = detection_model(frame, verbose=False)
        annotated_frame = frame.copy()  # Create fresh copy for annotations
        
        # 2. Depth Estimation
        img_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # Convert to RGB
        input_batch = transform(img_rgb).to(DEVICE)       # Prepare for MiDaS
        
        # Generate depth map
        with torch.no_grad():
            prediction = midas(input_batch)
            # Resize depth map to match frame
            depth_map = torch.nn.functional.interpolate(
                prediction.unsqueeze(1),
                size=img_rgb.shape[:2],
                mode="bicubic",
                align_corners=False,
            ).squeeze().cpu().numpy()

        # 3. Combine Detection and Depth
        for r in results:
            for box in r.boxes:
                # Filter detections by confidence and class
                if box.cls == BENCH_CLASS_ID and box.conf >= CONFIDENCE_THRESHOLD:
                    # Extract bounding box coordinates
                    x1, y1, x2, y2 = map(int, box.xyxy[0])

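                    w = max(1, x2 - x1)
                    h = max(1, y2 - y1)
                    area = float(w * h)

                    # Calculate relative distance using depth
                    box_depth = depth_map[y1:y2, x1:x2]           # Get depth in box
                    if box_depth.size == 0:
                        continue                                   # Skip degenerate boxes
                    median_depth_value = np.median(box_depth)      # Robust depth estimate
                    if median_depth_value <= 0:
                        continue                                   # Avoid division by zero
                    distance = 1 / median_depth_value * 100        # Invert MiDaS inverse depth; scale for readability

                    # Record the series only after distance is computed for this box
                    dist_series.append(float(distance))
                    area_series.append(area)
                    frame_idx.append(len(frame_idx))               # Running detection index, not the frame number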
                    
                    # Draw annotations
                    cv2.rectangle(annotated_frame, (x1, y1), (x2, y2), 
                                (0, 255, 0), 2)                    # Box
                    label = f"Bench | Dist: {distance*2:.1f}"        # Distance label (shows 2x the logged relative value)
                    cv2.putText(annotated_frame, label, (x1, y1 - 10), 
                              cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)

        # Write annotated frame
        out.write(annotated_frame)

    d = np.array(dist_series, dtype=np.float64)
    a = np.array(area_series, dtype=np.float64)
    t = np.arange(len(d), dtype=np.float64)

    def spearman_rho(x, y):
        """Spearman correlation as Pearson correlation of ranks (no tie correction)."""
        rx = np.argsort(np.argsort(x))
        ry = np.argsort(np.argsort(y))
        rx = (rx - rx.mean()) / (rx.std() + 1e-8)
        ry = (ry - ry.mean()) / (ry.std() + 1e-8)
        return float((rx * ry).mean())

    # A) Trend
    dd = np.diff(d)
    pct_decreasing = 100.0 * (dd < 0).mean() if dd.size else np.nan
    rho_time = spearman_rho(t, -d)  # expect positive if d decreases over time

    # B) Pinhole scale consistency
    inv_sqrt_area = 1.0 / np.sqrt(np.clip(a, 1.0, None))
    rho_geom = spearman_rho(d, inv_sqrt_area)  # expect high positive
    K = d * np.sqrt(np.clip(a, 1.0, None))
    K_cv = float(np.std(K) / (np.mean(K) + 1e-8)) if K.size else np.nan  # lower is better

    # C) Smoothness (scale-invariant)
    smooth_idx = float(np.median(np.abs(dd)) / (np.median(d) + 1e-8)) if dd.size else np.nan  # lower is better

    # D) Robustness (dispersion)
    iqr_d = float(np.percentile(d, 75) - np.percentile(d, 25)) if d.size else np.nan
    iqr_K = float(np.percentile(K, 75) - np.percentile(K, 25)) if K.size else np.nan

    print("\n--- Distance metric (no-GT) sanity checks ---")
    print(f"Trend: Spearman(time, -distance) = {rho_time:.3f}  (↑ better)")
    print(f"Trend: % steps decreasing        = {pct_decreasing:.1f}%")
    print(f"Geom consistency: Spearman(d, 1/sqrt(area)) = {rho_geom:.3f}  (↑ better)")
    print(f"Geom consistency: CV of K=d*sqrt(area)      = {K_cv:.3f}      (↓ better)")
    print(f"Smoothness: median|Δd| / median(d)          = {smooth_idx:.4f} (↓ better)")
    print(f"Robustness: IQR(d) = {iqr_d:.3f},  IQR(K) = {iqr_K:.3f}        (↓ better)")   
    
    # Clean up resources
    cap.release()
    out.release()
    print("\nProcessing complete.")
    print(f"Output video with depth estimation saved to: {OUTPUT_VIDEO_PATH}")

Processing 1065 frames...


100%|██████████| 1065/1065 [02:47<00:00,  6.35it/s]


--- Distance metric (no-GT) sanity checks ---
Trend: Spearman(time, -distance) = 0.779  (↑ better)
Trend: % steps decreasing        = 44.4%
Geom consistency: Spearman(d, 1/sqrt(area)) = 0.089  (↑ better)
Geom consistency: CV of K=d*sqrt(area)      = 0.265      (↓ better)
Smoothness: median|Δd| / median(d)          = 0.0599 (↓ better)
Robustness: IQR(d) = 0.124,  IQR(K) = 39.145        (↓ better)

Processing complete.
Output video with depth estimation saved to: training/metrics/detection/finetuned/depth_predictions/depth_Clip_3_35s.mp4
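
The geometric consistency check rests on the pinhole camera model: an object's apparent width and height scale with 1/distance, so sqrt(area) is proportional to 1/d and K = d * sqrt(area) should stay roughly constant. The logged run reports Spearman(d, 1/sqrt(area)) = 0.089, which is far from the ideal value of 1. The sketch below uses synthetic data only to show what the two geometry metrics look like in the ideal case and how multiplicative noise in the distance estimate degrades them. `geometry_checks`, the distance range, the constant `focal_times_size`, and the noise level are all assumptions made for illustration.

```python
import numpy as np


def spearman_rho(x, y):
    """Spearman correlation as Pearson correlation of ranks (no tie correction)."""
    rx = np.argsort(np.argsort(x)).astype(np.float64)
    ry = np.argsort(np.argsort(y)).astype(np.float64)
    rx = (rx - rx.mean()) / (rx.std() + 1e-8)
    ry = (ry - ry.mean()) / (ry.std() + 1e-8)
    return float((rx * ry).mean())


def geometry_checks(d, a):
    """Return (Spearman(d, 1/sqrt(area)), coefficient of variation of K = d*sqrt(area))."""
    sqrt_area = np.sqrt(np.clip(a, 1.0, None))
    rho = spearman_rho(d, 1.0 / sqrt_area)
    K = d * sqrt_area
    cv = float(np.std(K) / (np.mean(K) + 1e-8))
    return rho, cv


# Synthetic approach: the camera moves from 20 to 2 units away from a fixed-size object
true_d = np.linspace(20.0, 2.0, 200)
focal_times_size = 400.0               # Assumed constant (focal length x object size)
side_px = focal_times_size / true_d    # Pinhole model: apparent size ~ 1 / distance
area = side_px ** 2

rho_ideal, cv_ideal = geometry_checks(true_d, area)

# Same box areas, but distance estimates corrupted by multiplicative noise
rng = np.random.default_rng(0)
noisy_d = true_d * rng.lognormal(mean=0.0, sigma=0.5, size=true_d.size)
rho_noisy, cv_noisy = geometry_checks(noisy_d, area)

print(f"ideal: rho={rho_ideal:.3f}, CV(K)={cv_ideal:.3f}")
print(f"noisy: rho={rho_noisy:.3f}, CV(K)={cv_noisy:.3f}")
```

In the ideal case the rank correlation is essentially 1 and CV(K) is essentially 0. A low correlation on real footage can come from noisy depth, unstable bounding boxes, or partial occlusion, so these checks indicate consistency rather than accuracy.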



