<a href="https://colab.research.google.com/github/A00785001/TC5035/blob/main/004_Loop_Closure_Dataset_Generation_v7_0_FINAL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loop Closure Dataset Generation: Time Alignment, Pairing & Labeling
## Phase 1.5: Supervised Learning Dataset Preparation for Jetson Nano Training Pipeline

**Version:** 6.10  
**Pipeline Phase:** Feature Extraction → **[THIS NOTEBOOK]** → Fusion MLP Training → Deployment  
**Target Hardware:** Waveshare Jetbot AI Pro Kit (Jetson Nano)  
**SLAM System:** Google Cartographer (2D)  
**Training Platform:** Vertex AI

---

## NOTEBOOK DOCUMENTATION

### Purpose

This notebook transforms independently extracted multi-modal features (camera + LiDAR) into a supervised learning dataset for training a loop closure detection classifier. It bridges the gap between feature extraction and neural network training by performing temporal alignment, intelligent pairing, and ground truth labeling.

---

### Workflow Overview

**Phase 1A:** ROS Bag Validation (optional validation)  
**Phase 1B:** Feature File Validation (HDF5 features)  
**Phase 1C:** Pbstream Extraction
- Extract trajectory nodes with timestamps and poses
- Extract INTER_SUBMAP constraints (loop closures)
- Generate pairwise distance/time table for all nodes
- Temporal alignment: Match features to trajectory nodes

**Phase 2:** Dataset Generation
- Data profiling and automatic threshold suggestion
- Generate positive pairs (loop closures from distance + time filters)
- Generate easy negative pairs (spatially distant)
- Generate hard negative pairs (perceptually similar but distant)

**Phase 3:** Dataset Splitting & Validation  
**Phase 4:** Save Dataset & Generate Report

---

### Folder Structure

**Expected Directory Layout:**

```
session_YYYYMMDD_HHMMSS/          (base working folder - your session)
├── map.pbstream                  ← INPUT: Cartographer pbstream file (REQUIRED)
├── session_data.bag              ← INPUT: ROS bag file (optional validation)
├── features/                     ← INPUT: Directory from feature extraction
│   └── features.h5               ← INPUT: Extracted features (HDF5)
└── dataset/                      ← OUTPUT: Created by this notebook
    ├── loop_closure_dataset.pkl  ← OUTPUT: Training dataset
    ├── dataset_diagnostics.png   ← OUTPUT: Visualization plots
    └── dataset_generation_report.txt ← OUTPUT: Summary report
```

**Note:** The `.pbstream` file is REQUIRED. Replace `session_YYYYMMDD_HHMMSS` with your actual session folder name.

---

### Required Inputs

#### 1. Pbstream File (REQUIRED)

**File:** `map.pbstream` (Cartographer SLAM state)

**Contents:**
- Trajectory nodes with full 6-DOF poses and timestamps
- INTER_SUBMAP constraints (loop closures detected by SLAM)
- Complete SLAM graph state

#### 2. Extracted Features (HDF5 Format)

**File:** `features/features.h5` (generated from feature extraction pipeline)

**Structure:**
```
features.h5
├── camera/
│   ├── features [N_cam, 1280]       # MobileNetV2 embeddings (L2 normalized, float32)
│   ├── timestamps_sec [N_cam]       # ROS timestamp seconds (int64)
│   ├── timestamps_nsec [N_cam]      # ROS timestamp nanoseconds (int32)
│   └── filenames [N_cam]            # Source image filenames (strings)
└── lidar/
    ├── features [N_lid, 256]        # 1D CNN descriptors (L2 normalized, float32)
    ├── timestamps_sec [N_lid]       # ROS timestamp seconds (int64)
    ├── timestamps_nsec [N_lid]      # ROS timestamp nanoseconds (int32)
    └── filenames [N_lid]            # Source scan filenames (strings)
```

---


## 🧠 HOW THIS NOTEBOOK WORKS: COMPREHENSIVE EXPLANATION

### The Problem We're Solving

**Loop closure detection** is critical for mobile robot SLAM. When a robot revisits a location, the system must recognize this "loop closure" to correct accumulated drift. This notebook creates a **supervised learning dataset** to train a neural network for loop closure detection using multi-modal sensor data.

**The Challenge:** Three independent data streams must be combined:
- Camera features (1280D MobileNetV2) at ~0.22 Hz
- LiDAR features (256D 1D-CNN) at ~0.7 Hz  
- SLAM ground truth (Cartographer trajectory + constraints)

These have **different rates**, **independent timestamps**, and **no pre-sync**.

---

### Why This Approach?

#### **1. Why INTER_SUBMAP Constraints?**

Cartographer's pose graph has two constraint types:
- **INTRA_SUBMAP**: Sequential (node j inserted into submap i)
- **INTER_SUBMAP**: Loop closures (node j NOT in submap i = robot returned)

**INTER_SUBMAP = gold standard** because:
- Cartographer solved loop closure via scan matching
- Passed rigorous geometric validation
- Verified spatial alignment (<2m, similar orientation)
- Residual provides confidence metric

#### **2. Why KD-Trees for Temporal Alignment?**

**Sensors operate independently:**
- Camera: ~4.5s intervals
- LiDAR: ~1.4s intervals
- Trajectory nodes: ~0.9 Hz

**KD-Tree solution:** O(log N) nearest-neighbor in time domain
```
For node at time T:
  → Find camera where |t_cam - T| < 0.5s
  → Find LiDAR where |t_lid - T| < 0.5s
  → Combine into multi-modal feature
```

**±500ms threshold** balances:
- Robot motion (~0.5m in 500ms)
- Sensor rate limits
- Pose uncertainty
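
The nearest-in-time lookup described above can be sketched as follows. This is a minimal illustration, not the notebook's own implementation: the helper name `align_to_nodes` and the toy timestamps are hypothetical, and only `scipy.spatial.KDTree` (already imported by this notebook) is assumed.

```python
import numpy as np
from scipy.spatial import KDTree

MAX_TIME_OFFSET = 0.5  # seconds, the ±500 ms window described above


def align_to_nodes(node_times, sensor_times, max_offset=MAX_TIME_OFFSET):
    """For each node time, return the index of the nearest sensor sample,
    or -1 when no sample lies within max_offset seconds."""
    sensor = np.asarray(sensor_times, dtype=np.float64).reshape(-1, 1)
    nodes = np.asarray(node_times, dtype=np.float64).reshape(-1, 1)
    tree = KDTree(sensor)  # 1-D tree over time
    dist, idx = tree.query(nodes, distance_upper_bound=max_offset)
    # query() reports a miss as dist == inf (and idx == len(sensor_times))
    return np.where(np.isfinite(dist), idx, -1)


# Hypothetical timestamps, in seconds relative to the start of a session
node_times = [0.0, 1.1, 2.2, 3.3]
camera_times = [0.2, 4.7]
lidar_times = [0.1, 1.5, 2.9]

cam_idx = align_to_nodes(node_times, camera_times)
lid_idx = align_to_nodes(node_times, lidar_times)
valid = (cam_idx >= 0) & (lid_idx >= 0)  # nodes that have both modalities
```

Only nodes where `valid` is true would receive a combined camera + LiDAR feature; the rest are dropped from pairing.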

#### **3. Why Three Pair Types?**

**Positive (30%)** - True loop closures from INTER_SUBMAP
- Teaches "same place" recognition

**Easy Negative (35%)** - Distance >5m + temporal gap >5s
- Teaches basic discrimination
- Temporal gap prevents consecutive-but-far frames

**Hard Negative (35%)** - High similarity (>0.7) BUT distant (>3m)
- Handles perceptual aliasing (corridors, symmetry)
- Most challenging cases

**Balanced 30/35/35** prevents:
- Trivial solutions ("always no")
- Easy case overfitting
- Perceptual aliasing failure

---

### Step-by-Step Workflow

#### **PHASE 0: Data Validation**

**HDF5 Analysis (1.1.5):** Feature stats, temporal coherence, quality checks  
**Bag Analysis (1.2.5):** Trajectory distribution, INTER constraints, overlap

#### **PHASE 1: Feature-Trajectory Alignment**

**Step 1.1: Load Features**
```
camera_features: [N, 1280] L2-normalized
lidar_features: [N, 256] L2-normalized
timestamps: sec (int64) + nsec (int32)
```

**Step 1.2: Parse Trajectory Nodes**
```
From /trajectory_node_list (MarkerArray):
  node_id, timestamp, pose (x,y,z, qx,qy,qz,qw)
```

**Step 1.2.6: Parse INTER_SUBMAP Constraints**
```
From /constraint_list (MarkerArray):
  Filter: "Inter constraints" namespace only
  Match: constraint endpoints → trajectory nodes
  Validate: distance <2m, angle <90°, residual <0.5m
```
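
The validation rule above (distance, angle, residual) could look roughly like this. It is a sketch under assumptions: the pose layout `((x, y), (qx, qy, qz, qw))`, the function names, and the way the residual is passed in are all hypothetical; only the three thresholds come from the text.

```python
import numpy as np
from scipy.spatial.transform import Rotation


def relative_angle(q_i, q_j):
    """Smallest rotation angle in radians between two (x, y, z, w) quaternions."""
    r_i, r_j = Rotation.from_quat(q_i), Rotation.from_quat(q_j)
    return float((r_i.inv() * r_j).magnitude())


def is_valid_constraint(pose_i, pose_j, residual,
                        max_dist=2.0, max_angle=np.pi / 2, max_residual=0.5):
    """Apply the three checks: distance <2 m, angle <90 degrees, residual <0.5 m."""
    (xy_i, q_i), (xy_j, q_j) = pose_i, pose_j
    dist = float(np.linalg.norm(np.subtract(xy_i, xy_j)))
    return (dist < max_dist
            and relative_angle(q_i, q_j) < max_angle
            and residual < max_residual)


# Hypothetical poses: same area, yaw differs by 30 degrees or by 180 degrees
pose_a = ((0.0, 0.0), Rotation.from_euler('z', 0, degrees=True).as_quat())
pose_b = ((1.0, 0.5), Rotation.from_euler('z', 30, degrees=True).as_quat())
pose_c = ((0.0, 0.0), Rotation.from_euler('z', 180, degrees=True).as_quat())

ok_ab = is_valid_constraint(pose_a, pose_b, residual=0.1)  # passes all three
ok_ac = is_valid_constraint(pose_a, pose_c, residual=0.1)  # fails the angle check
```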

**Step 1.3: Temporal Alignment**
```
For each trajectory node at time T:
  cam_idx = nearest(camera_times, T) if |diff| < 0.5s
  lid_idx = nearest(lidar_times, T) if |diff| < 0.5s
  
  If both found:
    combined_feature = [cam_feat, lid_feat]  # [1536D]
    valid_nodes.append(node)
```

#### **PHASE 2: Intelligent Pairing**

**Step 2.1: Positive Pairs**
```
For (node_i, node_j) in INTER_SUBMAP_constraints:
  If both have combined_feature:
    pair = {features: [feat_i, feat_j], label: 1}
```

**Step 2.2: Easy Negatives**
```
Random sample nodes:
  Accept if: distance >5m AND time_gap >5s
```
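
A rejection-sampling sketch of the easy-negative rule above. The function name, the `max_tries` cap, and the toy trajectory are hypothetical; the default thresholds mirror the pseudocode (distance >5 m and time gap >5 s) and are meant to be overridden per environment.

```python
import numpy as np


def sample_easy_negatives(xy, t, n_pairs, min_dist=5.0, min_gap=5.0,
                          seed=42, max_tries=100_000):
    """Draw random node pairs and keep those that are far apart in space and time."""
    rng = np.random.default_rng(seed)
    xy = np.asarray(xy, dtype=np.float64)
    t = np.asarray(t, dtype=np.float64)
    pairs = set()
    for _ in range(max_tries):
        if len(pairs) >= n_pairs:
            break
        i, j = sorted(int(k) for k in rng.choice(len(t), size=2, replace=False))
        far_in_space = np.linalg.norm(xy[i] - xy[j]) > min_dist
        far_in_time = abs(t[i] - t[j]) > min_gap
        if far_in_space and far_in_time:
            pairs.add((i, j))  # set membership removes duplicate pairs
    return sorted(pairs)


# Toy trajectory: 20 nodes on a straight line, 1 m and 1 s apart
xy = [(float(k), 0.0) for k in range(20)]
t = [float(k) for k in range(20)]
easy_pairs = sample_easy_negatives(xy, t, n_pairs=10)
```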

**Step 2.3: Hard Negatives**
```
KD-Tree in feature space:
  Find similar features (cosine >0.7)
  Accept if: spatially distant (>3m)
```
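
For L2-normalized vectors, Euclidean distance and cosine similarity are tied by $\lVert a-b\rVert^2 = 2 - 2\cos(a,b)$, so "cosine > 0.7" becomes a radius search with $r=\sqrt{2(1-0.7)}\approx 0.775$. The sketch below uses that identity; the function name and the 2-D toy features are hypothetical stand-ins for the 1536-D vectors.

```python
import numpy as np
from scipy.spatial import KDTree


def mine_hard_negatives(features, xy, min_similarity=0.7, min_dist=3.0):
    """Return index pairs that look alike (high cosine) but are spatially distant."""
    features = np.asarray(features, dtype=np.float64)  # rows assumed L2-normalized
    xy = np.asarray(xy, dtype=np.float64)
    radius = np.sqrt(2.0 * (1.0 - min_similarity))     # cosine threshold -> distance
    tree = KDTree(features)
    hard = []
    for i, j in sorted(tree.query_pairs(r=radius)):    # candidate similar pairs, i < j
        if np.linalg.norm(xy[i] - xy[j]) > min_dist:
            hard.append((i, j))
    return hard


# Toy unit vectors: 0, 1 and 3 point in nearly the same direction, 2 is orthogonal
features = np.array([[1.0, 0.0],
                     [np.cos(0.2), np.sin(0.2)],
                     [0.0, 1.0],
                     [np.cos(0.1), np.sin(0.1)]])
xy = np.array([[0.0, 0.0], [4.0, 0.0], [0.5, 0.0], [0.2, 0.0]])
hard_pairs = mine_hard_negatives(features, xy)
```

In this toy case nodes 0 and 3 are similar but only 0.2 m apart, so that pair is rejected; the two surviving pairs are similar and more than 3 m apart.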

#### **PHASE 3: Finalization**

- Combine + shuffle all pairs
- Stratified train/val/test split (60/20/20)
- Validation checks + statistics
- Save PKL + diagnostics + report
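
One way to get a stratified 60/20/20 split is two successive calls to scikit-learn's `train_test_split`. This is a sketch: stratifying on pair type (rather than on the binary label) and the helper name `stratified_split` are assumptions, not something the notebook states.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_split(strata, train=0.60, val=0.20, seed=42):
    """Split indices into train/val/test while preserving the mix of strata."""
    strata = np.asarray(strata)
    n = len(strata)
    n_train, n_val = round(train * n), round(val * n)
    idx = np.arange(n)
    train_idx, rest_idx = train_test_split(
        idx, train_size=n_train, stratify=strata, random_state=seed)
    val_idx, test_idx = train_test_split(
        rest_idx, train_size=n_val, stratify=strata[rest_idx], random_state=seed)
    return train_idx, val_idx, test_idx


# Hypothetical pair types: 0 = positive, 1 = easy negative, 2 = hard negative
pair_types = np.array([0] * 30 + [1] * 35 + [2] * 35)
train_idx, val_idx, test_idx = stratified_split(pair_types)
```

Passing integer sizes instead of fractions avoids off-by-one splits caused by floating-point rounding of the ratios.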

---

### 🛡️ Key Design Decisions

**±500ms threshold:**
- Too tight (<100ms): 60-80% data loss
- Too loose (>1s): Robot moves >1m, high uncertainty
- Sweet spot (500ms): <0.5m motion, 70-80% alignment

**Float64 for timestamps:**
- Precision loss: ~250ns
- Threshold: 500,000,000ns
- Ratio: 5e-7 (negligible!)
- Store both: split (exact) + float (algorithms)
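
The precision figure can be checked directly. The snippet below takes the first trajectory node timestamp of this session as the example value and measures the spacing between adjacent float64 values at that magnitude; the variable names are illustrative only.

```python
import numpy as np

secs, nsecs = 1761175578, 341865800           # example timestamp from this session
t_float = np.float64(secs) + nsecs * 1e-9     # single float64 value in seconds

ulp_ns = float(np.spacing(t_float)) * 1e9     # gap between adjacent float64 values, in ns
threshold_ns = 0.5 * 1e9                      # the 500 ms alignment window
ratio = ulp_ns / threshold_ns

print(f"float64 spacing: {ulp_ns:.1f} ns, ratio to threshold: {ratio:.1e}")
```

At current Unix times the spacing is 2^-22 s, about 238 ns, so the worst-case rounding error is roughly half of that, well below anything the 500 ms window can notice.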

**Residual <0.5m filter:**
- High residual = weak constraint
- Filters false positives
- Keeps only high-quality ground truth

**35/35 easy/hard split:**
- Only easy: fails on aliasing
- Only hard: won't converge
- Balanced: stable training + generalization

**L2 normalization required:**
- Enables cosine similarity = dot product
- Network learns directions not magnitudes
- KD-tree works in normalized space

---

### ⚠️ Common Issues

**"No INTER_SUBMAP constraints"**
- Robot never looped back
- Solution: Re-record with intentional loops

**"Low alignment <50%"**
- Clock skew or wrong timestamps
- Solution: Verify bag recording, check timestamps

**"Features not normalized"**
- Extraction pipeline error
- Solution: Re-extract with normalization

**"Insufficient positives"**
- Too few validated constraints
- Solution: Relax thresholds or re-record

---

### Next Steps

**Output:** `loop_closure_dataset.pkl`
```
{'train': {features: [N,2,1536], labels: [N]},
 'val': {...},
 'test': {...},
 'config': {...},
 'statistics': {...}}
```
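
A quick sanity check of the saved file might look like this. It is a sketch that assumes the split dictionaries use the string keys `'features'` and `'labels'`; the helper `summarize_dataset` and the toy data written to a temporary file are hypothetical. As with any pickle, only load files you created yourself.

```python
import os
import pickle
import tempfile

import numpy as np


def summarize_dataset(path):
    """Load the pickle and report (pair count, positive fraction) per split."""
    with open(path, 'rb') as f:
        dataset = pickle.load(f)
    summary = {}
    for split in ('train', 'val', 'test'):
        feats = np.asarray(dataset[split]['features'])
        labels = np.asarray(dataset[split]['labels'])
        if feats.ndim != 3 or feats.shape[1:] != (2, 1536):
            raise ValueError(f"{split}: unexpected feature shape {feats.shape}")
        if len(feats) != len(labels):
            raise ValueError(f"{split}: features and labels differ in length")
        summary[split] = (len(labels), float(labels.mean()))
    return summary


# Toy stand-in for loop_closure_dataset.pkl, written to a temporary folder
rng = np.random.default_rng(0)
toy = {split: {'features': rng.normal(size=(n, 2, 1536)).astype(np.float32),
               'labels': rng.integers(0, 2, size=n)}
       for split, n in (('train', 6), ('val', 2), ('test', 2))}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, 'loop_closure_dataset.pkl')
    with open(toy_path, 'wb') as f:
        pickle.dump(toy, f)
    summary = summarize_dataset(toy_path)
```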

**1. Train Fusion MLP (Vertex AI)**
- Input: [2, 1536] paired features
- Architecture: MLP with dropout
- Output: Binary classifier

**2. Evaluate**
- Precision/Recall/F1
- ROC curves
- Per-type accuracy

**3. Deploy to Jetson Nano**
- TensorRT optimization
- Real-time inference (<50ms)
- Integrate with Cartographer

---

In [None]:
# Install required packages
!pip install -q rosbags h5py scikit-learn matplotlib

print("✅ All packages installed successfully")

---

## SECTION 1: SETUP

Imports, configuration, thresholds.

---



## 1.1 IMPORTS AND DEPENDENCIES

In [None]:
# System and file I/O
import os
import sys
import pickle
from pathlib import Path
# Numerical computing
import numpy as np
from scipy.spatial import KDTree
from scipy.spatial.transform import Rotation
# Data handling
import h5py
# ROS bag processing
from rosbags.rosbag1 import Reader
from rosbags.typesys import get_typestore, Stores
# Visualization
import matplotlib.pyplot as plt
from matplotlib.patches import Circle
import seaborn as sns
# Progress tracking
from tqdm.auto import tqdm
# Random state for reproducibility
import random
# Binary parsing for the pbstream file
import struct
import zlib
from google.protobuf.internal.decoder import _DecodeVarint32, _DecodeVarint
try:
    from google.colab import drive  # only available inside Colab
except ImportError:
    drive = None
print("✅ All imports successful")
print(f"   NumPy version: {np.__version__}")
print(f"   Python version: {sys.version.split()[0]}")


✅ All imports successful
   NumPy version: 2.0.2
   Python version: 3.12.12


## 1.2 WORKING DIRECTORY CONFIGURATION

In [None]:
# Configure session folder
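SESSION_ID = 'session_20251022_155137'  # ← CHANGE THIS to your session folder name

# IN_COLAB is normally set by the environment-detection cell in Section 2;
# detect it here as well so this cell also works when it is run first.
if 'IN_COLAB' not in globals():
    IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB: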
    # Google Colab: Assume data is in Drive
    BASE_PATH = f'/content/drive/MyDrive/DATA/Artificial_Intelligence/MNA-V/Subjects/TC5035-Proyecto_Integrador/TC5035.data/jetbot/{SESSION_ID}'
else:
    # Local: Use current directory or specify path
    BASE_PATH = f'./{SESSION_ID}'  # Or specify full path: '/path/to/data/{SESSION_ID}'

# Change to working directory
os.chdir(BASE_PATH)
print(f"\n Working directory: {os.getcwd()}")
print(f"   Session ID: {SESSION_ID}")


📂 Working directory: /content/drive/MyDrive/DATA/Artificial_Intelligence/MNA-V/Subjects/TC5035-Proyecto_Integrador/TC5035.data/jetbot/session_20251022_155137
   Session ID: session_20251022_155137


In [None]:
# File Paths Configuration
print("=" * 70)
print("FILE PATHS CONFIGURATION")
print("=" * 70)

# Input directories and files (relative to working folder)
FEATURES_DIR = 'features'
FEATURES_FILE = os.path.join(FEATURES_DIR, 'features.h5')
BAG_FILE = 'session_data.bag'  # In base working folder
PBSTREAM_FILE = 'session_data.pbstream'  # Optional, in base working folder

# Output directory for dataset
OUTPUT_DIR = 'dataset'
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Output files
DATASET_FILE = os.path.join(OUTPUT_DIR, 'loop_closure_dataset.pkl')
DIAGNOSTICS_FILE = os.path.join(OUTPUT_DIR, 'dataset_diagnostics.png')
REPORT_FILE = os.path.join(OUTPUT_DIR, 'dataset_generation_report.txt')

print("\n Input Configuration:")
print(f"  Features directory: {FEATURES_DIR}")
print(f"  Features file: {FEATURES_FILE}")
print(f"  ROS bag file: {BAG_FILE}")
print(f"  Pbstream file (optional): {PBSTREAM_FILE}")

print("\n Output Configuration:")
print(f"  Output directory: {OUTPUT_DIR}")
print(f"  Dataset file: {DATASET_FILE}")
print(f"  Diagnostics plot: {DIAGNOSTICS_FILE}")
print(f"  Report file: {REPORT_FILE}")

# Verify input files exist
print("\n Verifying input files...")
if os.path.exists(FEATURES_FILE):
    print(f"  ✓ Features file found: {FEATURES_FILE}")
else:
    print(f"  ❌ Features file NOT found: {FEATURES_FILE}")
    raise FileNotFoundError(f"Required file not found: {FEATURES_FILE}")

if os.path.exists(BAG_FILE):
    print(f"  ✓ ROS bag file found: {BAG_FILE}")
else:
    print(f"  ❌ ROS bag file NOT found: {BAG_FILE}")
    raise FileNotFoundError(f"Required file not found: {BAG_FILE}")

if os.path.exists(PBSTREAM_FILE):
    print(f"  ✓ Pbstream file found: {PBSTREAM_FILE}")
else:
    print(f"  ℹ Pbstream file not found (optional): {PBSTREAM_FILE}")

print("=" * 70)

# Time alignment parameters
MAX_TIME_OFFSET = 0.5  # seconds - maximum time difference for feature-node matching
MIN_TEMPORAL_GAP = 5.0  # seconds - minimum time between frames for negative pairs

# Spatial thresholds
POSITIVE_DISTANCE_THRESHOLD = 0.3  # meters - UPDATED for 3m x 2m environment
EASY_NEGATIVE_MIN_DISTANCE = 1.0  # meters - UPDATED for small indoor space
HARD_NEGATIVE_MIN_DISTANCE = 3.0  # meters - minimum distance for hard negatives

POSITIVE_TIME_GAP = 10.0  # seconds - minimum time between loop closure pairs
# Cartographer validation thresholds
MAX_CONSTRAINT_RESIDUAL = 0.5  # meters - maximum constraint error for validation
MAX_ANGULAR_DISTANCE = np.pi / 2  # radians - maximum angular difference for loop closure

# Pairing strategy targets
POSITIVE_RATIO = 0.30  # 30% positive pairs
EASY_NEGATIVE_RATIO = 0.35  # 35% easy negatives
HARD_NEGATIVE_RATIO = 0.35  # 35% hard negatives

# Hard negative mining
HARD_NEGATIVE_SIMILARITY_THRESHOLD = 0.7  # cosine similarity threshold for perceptual aliasing

# Split ratios (stratified random)
TRAIN_RATIO = 0.60
VAL_RATIO = 0.20
TEST_RATIO = 0.20

print("\n⚙️ Algorithm Parameters:")
print(f"  Time alignment: ±{MAX_TIME_OFFSET}s")
print(f"  Positive threshold: <{POSITIVE_DISTANCE_THRESHOLD}m")
print(f"  Easy negative: >{EASY_NEGATIVE_MIN_DISTANCE}m")
print(f"  Hard negative: >{HARD_NEGATIVE_MIN_DISTANCE}m, similarity >{HARD_NEGATIVE_SIMILARITY_THRESHOLD}")
print(f"  Constraint validation: residual <{MAX_CONSTRAINT_RESIDUAL}m, angle <{MAX_ANGULAR_DISTANCE:.2f} rad")
print(f"  Dataset ratios: {POSITIVE_RATIO:.0%} pos / {EASY_NEGATIVE_RATIO:.0%} easy neg / {HARD_NEGATIVE_RATIO:.0%} hard neg")
print(f"  Split ratios: {TRAIN_RATIO:.0%} train / {VAL_RATIO:.0%} val / {TEST_RATIO:.0%} test")

FILE PATHS CONFIGURATION

📁 Input Configuration:
  Features directory: features
  Features file: features/features.h5
  ROS bag file: session_data.bag
  Pbstream file (optional): session_data.pbstream

📁 Output Configuration:
  Output directory: dataset
  Dataset file: dataset/loop_closure_dataset.pkl
  Diagnostics plot: dataset/dataset_diagnostics.png
  Report file: dataset/dataset_generation_report.txt

🔍 Verifying input files...
  ✓ Features file found: features/features.h5
  ✓ ROS bag file found: session_data.bag
  ℹ Pbstream file not found (optional): session_data.pbstream

⚙️ Algorithm Parameters:
  Time alignment: ±0.5s
  Positive threshold: <2.0m
  Easy negative: >5.0m
  Hard negative: >3.0m, similarity >0.7
  Constraint validation: residual <0.5m, angle <1.57 rad
  Dataset ratios: 30% pos / 35% easy neg / 35% hard neg
  Split ratios: 60% train / 20% val / 20% test


---

## SECTION 2: STORAGE MOUNTING

Google Drive mount.

---



In [None]:
# Mount Google Drive for Colab environment
try:
    from google.colab import drive
    drive.mount('/content/drive')
    print("✓ Google Drive mounted")
except ImportError:
    print("ℹ️  Not in Colab environment, skipping Drive mount")


In [None]:
# Detect environment
try:
    import google.colab
    IN_COLAB = True
    print(" Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print(" Running in local environment")

# Mount Google Drive if in Colab
if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    print("✅ Google Drive mounted")

# Set random seeds for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
print(f"\n Random seed set to: {RANDOM_SEED}")

🔵 Running in Google Colab
Mounted at /content/drive
✅ Google Drive mounted

🎲 Random seed set to: 42


---

## SECTION 3: PBSTREAM LOADING & EXTRACTION

Trajectory, constraints, pairwise data.

---



## 3.1 SECTION DOCUMENTATION

### 1C.1 Pbstream Extraction Functions

These functions parse the Cartographer `.pbstream` file (protobuf format) to extract:
- Trajectory nodes with timestamps and poses
- INTER_SUBMAP constraints (loop closures)
- Pairwise node distances and time differences


---

# 📚 Cartographer PBSTREAM Format Documentation

## 🏗️ File Structure

```
PBSTREAM FILE
├─ [8-byte header]
└─ [Messages]* (repeated)
   ├─ 8 bytes: message length (uint64, little-endian)
   └─ N bytes: gzip-compressed protobuf message
```
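
Reading this framing takes only the standard library. The generator below is a minimal sketch of the layout described above (8-byte header, then length-prefixed gzip chunks); the name `iter_pbstream_messages` is hypothetical, and the header is skipped without validating its contents.

```python
import gzip
import io
import struct
import zlib


def iter_pbstream_messages(stream):
    """Yield each decompressed protobuf message from a pbstream-style byte stream."""
    header = stream.read(8)
    if len(header) != 8:
        raise ValueError("file too short for the 8-byte header")
    while True:
        size_bytes = stream.read(8)
        if not size_bytes:
            return                      # clean end of file
        if len(size_bytes) != 8:
            raise ValueError("truncated length prefix")
        (size,) = struct.unpack('<Q', size_bytes)   # uint64, little-endian
        payload = stream.read(size)
        if len(payload) != size:
            raise ValueError("truncated message payload")
        # wbits = 16 + MAX_WBITS tells zlib to expect a gzip wrapper
        yield zlib.decompress(payload, 16 + zlib.MAX_WBITS)


# Synthetic stream with a placeholder header and two gzip-compressed chunks
chunks = [gzip.compress(b'first'), gzip.compress(b'second')]
blob = b'\x00' * 8 + b''.join(struct.pack('<Q', len(c)) + c for c in chunks)
messages = list(iter_pbstream_messages(io.BytesIO(blob)))
```

For a real file, the same generator would be fed `open(PBSTREAM_FILE, 'rb')` instead of the in-memory stream.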

---

## 📋 Message Types

### 1. POSE_GRAPH (Field 1)
Complete optimized graph structure containing all trajectories, nodes, submaps, and loop closure constraints after SLAM optimization.

**Typical Count:** 1-2 messages (usually at end of file)

### 2. ALL_TRAJECTORY_BUILDER_OPTIONS (Field 2)
SLAM algorithm configuration parameters and tuning settings.

**Typical Count:** 1 message (at start of file)

### 3. SUBMAP (Field 3)
2D/3D probability grids representing mapped areas. Each submap is a tile of the complete map.

### 4. NODE (Field 4) ⭐ PRIMARY DATA
Trajectory nodes containing robot pose, timestamp, and sensor data at each mapping step.

### 5. TRAJECTORY_DATA (Field 5)
Metadata about trajectories (IDs, relationships between multiple trajectories).

### 6. IMU_DATA (Field 6)
IMU sensor readings (accelerometer, gyroscope) if IMU was used during mapping.

---

## 🎯 NODE Structure (Field 4) - DETAILED

This is the **primary data structure** containing pose and sensor information.

```
NODE (Field 4)
├─ Field 1: Node metadata (usually empty, 0-2 bytes)
└─ Field 5: POSE DATA
   ├─ Field 1: TIMESTAMP (int64 varint)
   │   └─ Format: 100-nanosecond ticks since 0001-01-01 00:00 UTC (Cartographer "Universal Time")
   │   └─ Conversion: (timestamp - 621355968000000000) / 10000000 = Unix seconds
   │
   ├─ Field 2: Metadata (9-18 bytes, varies by node)
   ├─ Field 3: POINT CLOUD DATA (600-950 bytes, raw LIDAR scan)
   ├─ Field 4: (empty)
   ├─ Field 5: (empty)
   │
   └─ Field 7: POSE (TRANSFORM)
      ├─ Field 1: TRANSLATION
      │  ├─ Field 1: x (double) - meters
      │  ├─ Field 2: y (double) - meters
      │  └─ Field 3: z (double) - meters
      │
      └─ Field 2: ROTATION (quaternion)
         ├─ Field 1: x (double)
         ├─ Field 2: y (double)
         ├─ Field 3: z (double)
         └─ Field 4: w (double) - normalized
```
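
The timestamp conversion can be done in pure integer arithmetic, which keeps the nanosecond part exact. The sketch below applies the conversion constant from the diagram to the raw tick value reported by the file-validation output of this session; the function and constant names are illustrative.

```python
TICKS_PER_SECOND = 10_000_000                  # one tick = 100 ns
UNIX_EPOCH_TICKS = 621_355_968_000_000_000     # ticks between the tick epoch and 1970-01-01


def ticks_to_unix(ticks):
    """Convert Cartographer ticks to (unix_seconds, nanoseconds) without floats."""
    secs, rem_ticks = divmod(ticks - UNIX_EPOCH_TICKS, TICKS_PER_SECOND)
    return secs, rem_ticks * 100


secs, nsecs = ticks_to_unix(638967723732811315)
print(secs, nsecs)  # 1761175573 281131500
```

Dividing in floating point instead would round the sub-second part to the nearest representable double, which is why the split `(secs, nsecs)` form is kept alongside any float representation.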

---

## 🔧 Protobuf Wire Types

| Wire Type | Name | Description |
|-----------|------|-------------|
| 0 | VARINT | Variable-length integer (int32, int64, uint32, uint64, bool, enum) |
| 1 | 64BIT | Fixed 8 bytes (double, fixed64, sfixed64) |
| 2 | LENGTH_DELIM | Length-prefixed (string, bytes, nested messages, packed repeated) |
| 5 | 32BIT | Fixed 4 bytes (float, fixed32, sfixed32) |

---

## 📚 Important Notes

- **Timestamp Format:** Cartographer uses "Universal Time" (100-nanosecond ticks since 0001-01-01 00:00 UTC, not the Unix or Windows epoch)
- **Coordinate System:** Right-handed with Z-up (typical for 2D: z=0)
- **Quaternions:** Stored as (x, y, z, w) and are normalized (magnitude = 1.0)
- **Compression:** All messages are gzip-compressed with window bits = 16 + MAX_WBITS
- **Endianness:** Little-endian for all multi-byte values
- **Field Numbers:** In protobuf, field numbers 1-15 use 1 byte, 16+ use 2+ bytes

---

## 3.2 Helper Functions (ORIGINAL - DO NOT MODIFY)

In [None]:
def extract_xy(field5_data):
    pos = 0
    x = y = None
    while pos < len(field5_data):
        try:
            tag, pos = _DecodeVarint32(field5_data, pos)
            fn = tag >> 3
            wt = tag & 0x7
            if fn == 7 and wt == 2:
                length, pos = _DecodeVarint32(field5_data, pos)
                f7_data = field5_data[pos:pos + length]
                f7_pos = 0
                while f7_pos < len(f7_data):
                    try:
                        f7_tag, f7_pos = _DecodeVarint32(f7_data, f7_pos)
                        f7_fn = f7_tag >> 3
                        f7_wt = f7_tag & 0x7
                        if f7_fn == 1 and f7_wt == 2:
                            tl, f7_pos = _DecodeVarint32(f7_data, f7_pos)
                            td = f7_data[f7_pos:f7_pos + tl]
                            tp = 0
                            while tp < len(td):
                                try:
                                    tt, tp = _DecodeVarint32(td, tp)
                                    tf = tt >> 3
                                    tw = tt & 0x7
                                    if tw == 1 and tp + 8 <= len(td):
                                        val = struct.unpack('<d', td[tp:tp+8])[0]
                                        if tf == 1:
                                            x = val
                                        elif tf == 2:
                                            y = val
                                        tp += 8
                                    else:
                                        break
                                except:
                                    break
                            f7_pos += tl
                            break
                        elif f7_wt == 2:
                            l2, f7_pos = _DecodeVarint32(f7_data, f7_pos)
                            f7_pos += l2
                        else:
                            break
                    except:
                        break
                pos += length
                break
            elif wt == 0:
                _, pos = _DecodeVarint32(field5_data, pos)
            elif wt == 1:
                pos += 8
            elif wt == 2:
                l, pos = _DecodeVarint32(field5_data, pos)
                pos += l
            elif wt == 5:
                pos += 4
            else:
                break
        except:
            break
    return x, y


def extract_timestamp(node_data):
    """Extract the raw timestamp (Field 1 inside the Field 5 pose message) from a NODE message."""
    pos = 0
    while pos < len(node_data):
        try:
            tag, pos = _DecodeVarint32(node_data, pos)
            fn = tag >> 3
            wt = tag & 0x7

            # Timestamp is inside Field 5 (pose), not Field 1!
            if fn == 5 and wt == 2:
                length, pos = _DecodeVarint32(node_data, pos)
                field1_data = node_data[pos:pos + length]
                f1_pos = 0
                while f1_pos < len(field1_data):
                    try:
                        f1_tag, f1_pos = _DecodeVarint32(field1_data, f1_pos)
                        f1_fn = f1_tag >> 3
                        f1_wt = f1_tag & 0x7
                        if f1_fn == 1 and f1_wt == 0:  # Timestamp
                            timestamp, f1_pos = _DecodeVarint(field1_data, f1_pos)
                            return timestamp
                        elif f1_wt == 0:
                            _, f1_pos = _DecodeVarint32(field1_data, f1_pos)
                        elif f1_wt == 1:
                            f1_pos += 8
                        elif f1_wt == 2:
                            l, f1_pos = _DecodeVarint32(field1_data, f1_pos)
                            f1_pos += l
                        elif f1_wt == 5:
                            f1_pos += 4
                        else:
                            break
                    except:
                        break
                return None
            elif wt == 0:
                _, pos = _DecodeVarint32(node_data, pos)
            elif wt == 1:
                pos += 8
            elif wt == 2:
                l, pos = _DecodeVarint32(node_data, pos)
                pos += l
            elif wt == 5:
                pos += 4
            else:
                break
        except:
            break
    return None



# 3.3 FILE VALIDATION
---

In [None]:
PBSTREAM_FILE = os.path.join(BASE_PATH, "map.pbstream")
print(f"Pbstream file: {PBSTREAM_FILE}")


🔍 VALIDATING PBSTREAM FILE...

File size: 2,800,733 bytes (2.67 MB)
Header: db 01 f5 5b 7b 1f 1d 7b

✅ Message counts:
   POSE_GRAPH           (Field 1): 2
   BUILDER_OPTIONS      (Field 2): 1
   SUBMAP               (Field 3): 36
   NODE                 (Field 4): 1,256

✅ Timestamp example (first node):
   Raw ticks (Year 1 epoch): 638967723732811315
   Unix timestamp: 1761175573 seconds
   Nanoseconds in second: 281131500 ns
   Date/time: 2025-10-22 23:26:13.281

✅ File structure: Valid


In [None]:
def extract_full_pose(field5_data):
    """Extract full pose: x, y, z, quat_x, quat_y, quat_z, quat_w, timestamp"""
    pos = 0
    x = y = z = qx = qy = qz = qw = timestamp = None

    while pos < len(field5_data):
        try:
            tag, pos = _DecodeVarint32(field5_data, pos)
            fn = tag >> 3
            wt = tag & 0x7

            if fn == 1 and wt == 0:  # Timestamp
                timestamp, pos = _DecodeVarint(field5_data, pos)

            elif fn == 7 and wt == 2:  # Pose
                length, pos = _DecodeVarint32(field5_data, pos)
                f7_data = field5_data[pos:pos + length]
                f7_pos = 0

                while f7_pos < len(f7_data):
                    try:
                        f7_tag, f7_pos = _DecodeVarint32(f7_data, f7_pos)
                        f7_fn = f7_tag >> 3
                        f7_wt = f7_tag & 0x7

                        if f7_fn == 1 and f7_wt == 2:  # Translation
                            tl, f7_pos = _DecodeVarint32(f7_data, f7_pos)
                            td = f7_data[f7_pos:f7_pos + tl]
                            tp = 0
                            while tp < len(td):
                                try:
                                    tt, tp = _DecodeVarint32(td, tp)
                                    tf = tt >> 3
                                    tw = tt & 0x7
                                    if tw == 1 and tp + 8 <= len(td):
                                        val = struct.unpack('<d', td[tp:tp+8])[0]
                                        if tf == 1: x = val
                                        elif tf == 2: y = val
                                        elif tf == 3: z = val
                                        tp += 8
                                    else:
                                        break
                                except:
                                    break
                            f7_pos += tl

                        elif f7_fn == 2 and f7_wt == 2:  # Rotation
                            rl, f7_pos = _DecodeVarint32(f7_data, f7_pos)
                            rd = f7_data[f7_pos:f7_pos + rl]
                            rp = 0
                            while rp < len(rd):
                                try:
                                    rt, rp = _DecodeVarint32(rd, rp)
                                    rf = rt >> 3
                                    rw = rt & 0x7
                                    if rw == 1 and rp + 8 <= len(rd):
                                        val = struct.unpack('<d', rd[rp:rp+8])[0]
                                        if rf == 1: qx = val
                                        elif rf == 2: qy = val
                                        elif rf == 3: qz = val
                                        elif rf == 4: qw = val
                                        rp += 8
                                    else:
                                        break
                                except:
                                    break
                            f7_pos += rl

                        elif f7_wt == 2:
                            l, f7_pos = _DecodeVarint32(f7_data, f7_pos)
                            f7_pos += l
                        else:
                            break
                    except:
                        break
                pos += length

            elif wt == 0:
                _, pos = _DecodeVarint32(field5_data, pos)
            elif wt == 1:
                pos += 8
            elif wt == 2:
                l, pos = _DecodeVarint32(field5_data, pos)
                pos += l
            elif wt == 5:
                pos += 4
            else:
                break
        except:
            break

    return x, y, z, qx, qy, qz, qw, timestamp

---
# 3.4 TRAJECTORY EXTRACTION AND ANALYSIS
---

In [None]:
PBSTREAM_FILE = os.path.join(BASE_PATH, "map.pbstream")
print(f"Pbstream file: {PBSTREAM_FILE}")


🛤️ EXTRACTING TRAJECTORY DATA...

✅ Extracted: 1255 nodes

📝 Timestamp format:
   timestamp_secs: Unix timestamp (seconds since 1970-01-01)
   timestamp_nsecs: Nanoseconds within that second (0-999,999,999)

📋 First 5 rows:

|   | node_id | timestamp_secs | timestamp_nsecs | x_m | y_m | z_m | quat_x | quat_y | quat_z | quat_w |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1761175578 | 341865800 | -0.001466 | 0.004393 | 0.0 | 0.0 | 0.0 | 0.001625 | 0.999999 |
| 1 | 1 | 1761175578 | 476540500 | -0.003856 | 0.002816 | 0.0 | 0.0 | 0.0 | 0.017387 | 0.999849 |
| 2 | 2 | 1761175578 | 611894500 | 0.001183 | 0.002219 | 0.0 | 0.0 | 0.0 | 0.030296 | 0.999541 |
| 3 | 3 | 1761175578 | 867448600 | 0.041088 | 0.004711 | 0.0 | 0.0 | 0.0 | 0.00601 | 0.999982 |
| 4 | 4 | 1761175579 | 2461300 | 0.076415 | 0.010777 | 0.0 | 0.0 | 0.0 | -0.005306 | 0.999986 |

📋 Last 5 rows:

|   | node_id | timestamp_secs | timestamp_nsecs | x_m | y_m | z_m | quat_x | quat_y | quat_z | quat_w |
|---|---|---|---|---|---|---|---|---|---|---|
| 1250 | 1250 | 1761175962 | 839079300 | 2.84403 | 6.390513 | 0.0 | 0.0 | 0.0 | 0.876729 | -0.480985 |
| 1251 | 1251 | 1761175963 | 632571600 | 2.768131 | 6.164998 | 0.0 | 0.0 | 0.0 | 0.879331 | -0.476211 |
| 1252 | 1252 | 1761175963 | 902584000 | 2.71285 | 6.092377 | 0.0 | 0.0 | 0.0 | 0.885523 | -0.464595 |
| 1253 | 1253 | 1761175964 | 296024600 | 2.607575 | 6.05464 | 0.0 | 0.0 | 0.0 | 0.891796 | -0.452439 |
| 1254 | 1254 | 1761175964 | 946802100 | 2.464423 | 6.084261 | 0.0 | 0.0 | 0.0 | 0.897377 | -0.441265 |


In [None]:
# Verify timestamp format
print("🔍 TIMESTAMP VERIFICATION\n")
sample = df_trajectory.iloc[0]
print(f"Sample node (first):")
print(f"   timestamp_secs: {sample['timestamp_secs']} (type: {type(sample['timestamp_secs']).__name__})")
print(f"   timestamp_nsecs: {sample['timestamp_nsecs']} (type: {type(sample['timestamp_nsecs']).__name__})")

if sample['timestamp_secs'] is not None:
    import datetime
    dt = datetime.datetime.fromtimestamp(int(sample['timestamp_secs']))
    nsecs = int(sample['timestamp_nsecs']) if sample['timestamp_nsecs'] is not None else 0
    print(f"   Readable: {dt.strftime('%Y-%m-%d %H:%M:%S')}.{nsecs//1000000:03d}")
    print(f"\n‚úÖ Timestamps correctly formatted as (seconds, nanoseconds)")


🔍 TIMESTAMP VERIFICATION

Sample node (first):
   timestamp_secs: 1761175578.0 (type: float64)
   timestamp_nsecs: 341865800.0 (type: float64)
   Readable: 2025-10-22 23:26:18.341

✅ Timestamps correctly formatted as (seconds, nanoseconds)


In [None]:
# Trajectory profiling - consecutive nodes
print("\n📊 TRAJECTORY PROFILING (CONSECUTIVE NODES)\n")


# Calculate distances and time differences between consecutive nodes
consecutive_distances = []
consecutive_times = []

for i in range(len(df_trajectory) - 1):
    node1 = df_trajectory.iloc[i]
    node2 = df_trajectory.iloc[i + 1]

    # Distance
    dist = np.sqrt((node2['x_m'] - node1['x_m'])**2 + (node2['y_m'] - node1['y_m'])**2)
    consecutive_distances.append(dist)

    # Time difference
    if node1['timestamp_secs'] is not None and node2['timestamp_secs'] is not None:
        time_diff_ns = (
            (node2['timestamp_secs'] - node1['timestamp_secs']) * 1_000_000_000 +
            (node2['timestamp_nsecs'] - node1['timestamp_nsecs'])
        )
        time_diff_secs = time_diff_ns / 1_000_000_000
        consecutive_times.append(time_diff_secs)

consecutive_distances = np.array(consecutive_distances)
consecutive_times = np.array(consecutive_times)


# Statistics table
print("\n📈 DESCRIPTIVE STATISTICS\n")
print("Distance Between Consecutive Nodes (meters):")
print(f"   Count:  {len(consecutive_distances)}")
print(f"   Min:    {consecutive_distances.min():.6f} m")
print(f"   Max:    {consecutive_distances.max():.6f} m")
print(f"   Mean:   {consecutive_distances.mean():.6f} m")
print(f"   Median: {np.median(consecutive_distances):.6f} m")
print(f"   Std:    {consecutive_distances.std():.6f} m")

print("\nTime Between Consecutive Nodes (seconds):")
print(f"   Count:  {len(consecutive_times)}")
print(f"   Min:    {consecutive_times.min():.6f} s")
print(f"   Max:    {consecutive_times.max():.6f} s")
print(f"   Mean:   {consecutive_times.mean():.6f} s")
print(f"   Median: {np.median(consecutive_times):.6f} s")
print(f"   Std:    {consecutive_times.std():.6f} s")

print("\nTrajectory Velocity Profile:")
velocities = consecutive_distances / consecutive_times
print(f"   Mean velocity:   {velocities.mean():.6f} m/s")
print(f"   Median velocity: {np.median(velocities):.6f} m/s")
print(f"   Max velocity:    {velocities.max():.6f} m/s")


📊 TRAJECTORY PROFILING (CONSECUTIVE NODES)


📈 DESCRIPTIVE STATISTICS

Distance Between Consecutive Nodes (meters):
   Count:  1254
   Min:    0.000845 m
   Max:    6.828017 m
   Mean:   0.058617 m
   Median: 0.034281 m
   Std:    0.199323 m

Time Between Consecutive Nodes (seconds):
   Count:  1254
   Min:    0.114630 s
   Max:    48.649355 s
   Mean:   0.308297 s
   Median: 0.135182 s
   Std:    1.420958 s

Trajectory Velocity Profile:
   Mean velocity:   0.214999 m/s
   Median velocity: 0.239200 m/s
   Max velocity:    0.460624 m/s
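
The row-by-row loop in the profiling cell can also be written with NumPy array operations. The following is a minimal sketch, assuming a DataFrame with the column names shown in the trajectory table (`x_m`, `y_m`, `timestamp_secs`, `timestamp_nsecs`). The function name `consecutive_profile` and the toy data are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

def consecutive_profile(df):
    """Return (distances_m, dt_s) between consecutive trajectory nodes."""
    xy = df[["x_m", "y_m"]].to_numpy(dtype=np.float64)
    distances = np.linalg.norm(np.diff(xy, axis=0), axis=1)
    # Combine the split timestamp into integer nanoseconds before differencing,
    # so the subtraction is exact and only the final division is floating point.
    t_ns = (df["timestamp_secs"].to_numpy().astype(np.int64) * 1_000_000_000
            + df["timestamp_nsecs"].to_numpy().astype(np.int64))
    dt = np.diff(t_ns) / 1e9
    return distances, dt

# Toy trajectory (hypothetical values)
toy = pd.DataFrame({
    "x_m": [0.0, 3.0, 3.0],
    "y_m": [0.0, 4.0, 4.0],
    "timestamp_secs": [10, 10, 12],
    "timestamp_nsecs": [900_000_000, 950_000_000, 0],
})
dist, dt = consecutive_profile(toy)
print(dist, dt)
```

Differencing integer nanoseconds first avoids subtracting two large float64 values that differ only in their last few bits.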


---
# 3.5 NODE PAIRING
---

In [None]:
print("\n🔗 CREATING ALL NODE PAIRS...\n")

node_pairs_data = []
n_nodes = len(df_trajectory)
n_pairs = n_nodes * (n_nodes - 1) // 2  # number of unordered pairs (i < j)
print(f"⚠️  Creating ~{n_pairs:,} pairs. Please wait...\n")

for i in range(n_nodes):
    if i % 100 == 0:
        print(f"   Processing node {i}/{n_nodes}...")

    node1 = df_trajectory.iloc[i]

    for j in range(i + 1, n_nodes):
        node2 = df_trajectory.iloc[j]

        distance = np.sqrt((node2['x_m'] - node1['x_m'])**2 + (node2['y_m'] - node1['y_m'])**2)

        time_diff_secs = 0.0
        if node1['timestamp_secs'] is not None and node2['timestamp_secs'] is not None:
            time_diff_ns = (
                (node2['timestamp_secs'] - node1['timestamp_secs']) * 1_000_000_000 +
                (node2['timestamp_nsecs'] - node1['timestamp_nsecs'])
            )
            time_diff_secs = round(time_diff_ns / 1_000_000_000, 5)

        node_pairs_data.append({
            'node1_id': i,
            'node2_id': j,
            'distance_between_nodes_m': distance,
            'time_diff_secs': time_diff_secs,
            'loop_closure': 0  # Initialize to 0, will be marked during loop closure extraction
        })

df_all_pairs = pd.DataFrame(node_pairs_data)

print(f"\n✅ Created: {len(df_all_pairs):,} pairs\n")
print(f"📊 Statistics:")
print(f"   Distance - Min: {df_all_pairs['distance_between_nodes_m'].min():.6f} m")
print(f"   Distance - Max: {df_all_pairs['distance_between_nodes_m'].max():.6f} m")
print(f"   Distance - Mean: {df_all_pairs['distance_between_nodes_m'].mean():.6f} m")
print(f"   Time - Min: {df_all_pairs['time_diff_secs'].min():.5f} s")
print(f"   Time - Max: {df_all_pairs['time_diff_secs'].max():.5f} s")
print(f"   Loop closures: 0 (will be marked after extraction)")


🔗 CREATING ALL NODE PAIRS...

⚠️  Creating ~786,885 pairs. Please wait...

   Processing node 0/1255...
   Processing node 100/1255...
   Processing node 200/1255...
   Processing node 300/1255...
   Processing node 400/1255...
   Processing node 500/1255...
   Processing node 600/1255...
   Processing node 700/1255...
   Processing node 800/1255...
   Processing node 900/1255...
   Processing node 1000/1255...
   Processing node 1100/1255...
   Processing node 1200/1255...

✅ Created: 786,885 pairs

📊 Statistics:
   Distance - Min: 0.000406 m
   Distance - Max: 8.700595 m
   Distance - Mean: 3.006543 m
   Time - Min: 0.11463 s
   Time - Max: 386.60494 s
   Loop closures: 0 (will be marked after extraction)
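
The nested loop in the pairing cell visits every pair `(i, j)` with `i < j` through `DataFrame.iloc`, which is slow for roughly 787k pairs. The following is a hedged sketch of the same table built with `np.triu_indices`, assuming the trajectory column names shown earlier and that node IDs equal row positions (as in the loop). `build_all_pairs` is a hypothetical helper, not the notebook's implementation.

```python
import numpy as np
import pandas as pd

def build_all_pairs(df):
    """Build the pairwise distance/time table for all node pairs with i < j."""
    n = len(df)
    i, j = np.triu_indices(n, k=1)  # same order as the nested loop: i outer, j inner
    xy = df[["x_m", "y_m"]].to_numpy(dtype=np.float64)
    t_ns = (df["timestamp_secs"].to_numpy().astype(np.int64) * 1_000_000_000
            + df["timestamp_nsecs"].to_numpy().astype(np.int64))
    return pd.DataFrame({
        "node1_id": i,
        "node2_id": j,
        "distance_between_nodes_m": np.linalg.norm(xy[j] - xy[i], axis=1),
        "time_diff_secs": np.round((t_ns[j] - t_ns[i]) / 1e9, 5),
        "loop_closure": 0,  # marked later, after loop closure extraction
    })

# Toy trajectory (hypothetical values)
toy = pd.DataFrame({
    "x_m": [0.0, 3.0, 3.0],
    "y_m": [0.0, 4.0, 4.0],
    "timestamp_secs": [10, 10, 12],
    "timestamp_nsecs": [900_000_000, 950_000_000, 0],
})
pairs = build_all_pairs(toy)
print(pairs)
```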


---
# 3.6 LOOP CLOSURE EXTRACTION AND ANALYSIS
---

In [None]:
PBSTREAM_FILE = os.path.join(BASE_PATH, "map.pbstream")
print(f"Pbstream file: {PBSTREAM_FILE}")


🔗 EXTRACTING LOOP CLOSURES (ORIGINAL ALGORITHM)...

✅ Loop closures extracted: 406 pairs
   Total INTER_SUBMAP constraints: 805
   Submaps tracked: 35
   Trajectory nodes: 1255


In [None]:
# Mark loop closures in df_all_pairs
print("\n🔗 MARKING LOOP CLOSURES IN NODE PAIRS...\n")

marked_count = 0

for ref_idx, node_idx in loop_closure_pairs:
    if ref_idx < len(df_trajectory) and node_idx < len(df_trajectory):
        # Find the pair in df_all_pairs and mark it
        mask = (df_all_pairs['node1_id'] == ref_idx) & (df_all_pairs['node2_id'] == node_idx)
        if mask.any():
            df_all_pairs.loc[mask, 'loop_closure'] = 1
            marked_count += 1

print(f"✅ Marked: {marked_count} loop closure pairs in df_all_pairs")
print(f"   Total pairs: {len(df_all_pairs):,}")
print(f"   Loop closures: {df_all_pairs['loop_closure'].sum()} ({100*df_all_pairs['loop_closure'].sum()/len(df_all_pairs):.3f}%)")

# Create filtered view for display
df_loop_closures = df_all_pairs[df_all_pairs['loop_closure'] == 1][['node1_id', 'node2_id', 'distance_between_nodes_m', 'time_diff_secs']].copy()

print(f"\n📋 First 50 loop closure pairs:")
display(df_loop_closures.head(50))
print(f"\n📋 Last 50 loop closure pairs:")
display(df_loop_closures.tail(50))


🔗 MARKING LOOP CLOSURES IN NODE PAIRS...

✅ Marked: 406 loop closure pairs in df_all_pairs
   Total pairs: 786,885
   Loop closures: 406 (0.052%)

📋 First 50 loop closure pairs:


Unnamed: 0,node1_id,node2_id,distance_between_nodes_m,time_diff_secs
45429,36,952,0.025163,237.9177
45434,36,957,0.013182,244.0949
50363,40,1024,0.078557,262.64145
51537,41,985,0.067171,248.17507
51547,41,995,0.063022,249.76031
51581,41,1029,0.029172,262.91324
60649,49,429,0.094426,98.35462
63671,51,1044,0.360064,263.16983
65657,53,627,0.078664,146.90694
67864,55,435,0.094407,98.50267



📋 Last 50 loop closure pairs:


Unnamed: 0,node1_id,node2_id,distance_between_nodes_m,time_diff_secs
498333,494,1123,0.648006,162.22162
500166,497,682,0.021054,48.61333
506952,506,700,0.079983,48.90823
515861,518,711,0.084424,48.42704
516297,518,1147,0.742233,158.84519
521716,526,714,0.008323,48.83842
526070,532,721,0.055232,48.95663
528233,535,724,0.0483,49.09719
528952,536,725,0.053394,49.10431
531103,539,728,0.098109,48.72318
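
The marking loop scans the full pair table once per loop closure. A possible alternative is a single lookup through a pandas `MultiIndex`, sketched below. It assumes the pair table stores each pair with `node1_id < node2_id`, so every extracted pair is reordered to `(smaller, larger)` before matching. That reordering is an assumption added here, not something the notebook does, and `mark_loop_closures` is a hypothetical helper.

```python
import pandas as pd

def mark_loop_closures(df_pairs, closure_pairs):
    """Return a copy of df_pairs with loop_closure set to 1 for the listed pairs."""
    # Reorder each pair to (smaller, larger) to match the i < j convention
    ordered = {(min(a, b), max(a, b)) for a, b in closure_pairs}
    keys = pd.MultiIndex.from_frame(df_pairs[["node1_id", "node2_id"]])
    out = df_pairs.copy()
    out["loop_closure"] = keys.isin(list(ordered)).astype(int)
    return out

# Toy pair table (hypothetical values)
pairs = pd.DataFrame({
    "node1_id": [0, 0, 1],
    "node2_id": [1, 2, 2],
    "loop_closure": 0,
})
marked = mark_loop_closures(pairs, [(2, 0)])
print(marked)
```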


---

## SECTION 4: FEATURES LOADING & ANALYSIS

HDF5 features validation.

---



### 4.1 FILE LOADING AND VALIDATION




In [None]:
print("Loading extracted features from HDF5...")

with h5py.File(FEATURES_FILE, 'r') as f:
    # Load camera features and split timestamps
    camera_features = f['camera/features'][:]
    camera_timestamps_sec = f['camera/timestamps_sec'][:].astype(np.int64)
    camera_timestamps_nsec = f['camera/timestamps_nsec'][:].astype(np.int32)
    camera_filenames = [fn.decode('utf-8') if isinstance(fn, bytes) else fn
                       for fn in f['camera/filenames'][:]]

    # Load LiDAR features and split timestamps
    lidar_features = f['lidar/features'][:]
    lidar_timestamps_sec = f['lidar/timestamps_sec'][:].astype(np.int64)
    lidar_timestamps_nsec = f['lidar/timestamps_nsec'][:].astype(np.int32)
    lidar_filenames = [fn.decode('utf-8') if isinstance(fn, bytes) else fn
                      for fn in f['lidar/filenames'][:]]

# Create float64 timestamps for convenience (temporal operations)
# Note: float64 loses ~250ns precision, but this is negligible vs ±0.5s alignment threshold
camera_timestamps = camera_timestamps_sec + camera_timestamps_nsec * 1e-9
lidar_timestamps = lidar_timestamps_sec + lidar_timestamps_nsec * 1e-9

print(f"\n✅ Features loaded successfully:")
print(f"   Camera: {len(camera_features)} frames, {camera_features.shape[1]}D features")
print(f"   LiDAR: {len(lidar_features)} scans, {lidar_features.shape[1]}D features")
print(f"   Time range: {min(min(camera_timestamps), min(lidar_timestamps)):.2f}s to {max(max(camera_timestamps), max(lidar_timestamps)):.2f}s")

# Verify L2 normalization
camera_norms = np.linalg.norm(camera_features, axis=1)
lidar_norms = np.linalg.norm(lidar_features, axis=1)
print(f"\n Feature normalization check:")
print(f"   Camera L2 norms: mean={camera_norms.mean():.4f}, std={camera_norms.std():.4f}")
print(f"   LiDAR L2 norms: mean={lidar_norms.mean():.4f}, std={lidar_norms.std():.4f}")

if not (np.allclose(camera_norms, 1.0, atol=1e-5) and np.allclose(lidar_norms, 1.0, atol=1e-5)):
    print("   ⚠️ WARNING: Features are not properly L2 normalized!")
else:
    print("   ✅ Features are properly L2 normalized")
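
The float64 precision note in the loading cell can be checked directly. This short sketch uses `np.spacing` to get the gap between adjacent float64 values at a Unix timestamp of this session's magnitude. The sample value is the first trajectory node's `timestamp_secs` from the output earlier in this notebook, and using it for the feature timestamps is an assumption.

```python
import numpy as np

t = 1761175578.0                       # sample Unix timestamp in seconds
spacing_ns = np.spacing(t) * 1e9       # gap between adjacent float64 values, in ns
max_rounding_error_ns = spacing_ns / 2 # worst-case error when rounding to nearest

print(f"float64 spacing at t: {spacing_ns:.1f} ns")
print(f"max rounding error:   {max_rounding_error_ns:.1f} ns")
```

The spacing here is 2^-22 s (about 238 ns), more than six orders of magnitude below a 0.5 s alignment tolerance.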

### 4.2 H5 DATA ANALYSIS

**Purpose:** Deep analysis and validation of extracted features before proceeding with dataset generation.

In [None]:
print("\n" + "=" * 80)
print("COMPREHENSIVE HDF5 DATA ANALYSIS")
print("=" * 80)

# =======================================================================
#  Basic Statistics
# =======================================================================
print("\n BASIC FEATURE STATISTICS")
print("-" * 80)

print(f"\n Camera Features:")
print(f"   Count: {len(camera_features)} frames")
print(f"   Dimension: {camera_features.shape[1]}D")
print(f"   Data type: {camera_features.dtype}")
print(f"   Memory: {camera_features.nbytes / 1024 / 1024:.2f} MB")
print(f"   Value range: [{camera_features.min():.4f}, {camera_features.max():.4f}]")
print(f"   Mean: {camera_features.mean():.4f}, Std: {camera_features.std():.4f}")

print(f"\n LiDAR Features:")
print(f"   Count: {len(lidar_features)} scans")
print(f"   Dimension: {lidar_features.shape[1]}D")
print(f"   Data type: {lidar_features.dtype}")
print(f"   Memory: {lidar_features.nbytes / 1024 / 1024:.2f} MB")
print(f"   Value range: [{lidar_features.min():.4f}, {lidar_features.max():.4f}]")
print(f"   Mean: {lidar_features.mean():.4f}, Std: {lidar_features.std():.4f}")

# =======================================================================
#  Timestamp Analysis
# =======================================================================
print(f"\n\n⏱️  TIMESTAMP ANALYSIS")
print("-" * 80)

print(f"\n Camera Timestamps:")
cam_duration = camera_timestamps[-1] - camera_timestamps[0]
cam_intervals = np.diff(camera_timestamps)
print(f"   Time range: {camera_timestamps[0]:.3f}s to {camera_timestamps[-1]:.3f}s")
print(f"   Duration: {cam_duration:.2f}s ({cam_duration/60:.2f} minutes)")
print(f"   Intervals - Mean: {cam_intervals.mean():.3f}s, Std: {cam_intervals.std():.3f}s")
print(f"   Intervals - Min: {cam_intervals.min():.3f}s, Max: {cam_intervals.max():.3f}s")
print(f"   Effective rate: {len(camera_features) / cam_duration:.2f} Hz")

print(f"\n LiDAR Timestamps:")
lid_duration = lidar_timestamps[-1] - lidar_timestamps[0]
lid_intervals = np.diff(lidar_timestamps)
print(f"   Time range: {lidar_timestamps[0]:.3f}s to {lidar_timestamps[-1]:.3f}s")
print(f"   Duration: {lid_duration:.2f}s ({lid_duration/60:.2f} minutes)")
print(f"   Intervals - Mean: {lid_intervals.mean():.3f}s, Std: {lid_intervals.std():.3f}s")
print(f"   Intervals - Min: {lid_intervals.min():.3f}s, Max: {lid_intervals.max():.3f}s")
print(f"   Effective rate: {len(lidar_features) / lid_duration:.2f} Hz")

# Temporal overlap
cam_start, cam_end = camera_timestamps[0], camera_timestamps[-1]
lid_start, lid_end = lidar_timestamps[0], lidar_timestamps[-1]
overlap_start = max(cam_start, lid_start)
overlap_end = min(cam_end, lid_end)
overlap_duration = max(0, overlap_end - overlap_start)

print(f"\n Cross-Modal Temporal Analysis:")
print(f"   Camera starts: {cam_start:.3f}s, ends: {cam_end:.3f}s")
print(f"   LiDAR starts: {lid_start:.3f}s, ends: {lid_end:.3f}s")
print(f"   Temporal overlap: {overlap_duration:.2f}s ({overlap_duration/60:.2f} minutes)")
print(f"   Overlap ratio: {overlap_duration/max(cam_duration, lid_duration):.1%}")

if overlap_duration < 10:
    print(f"   ⚠️  WARNING: Very short temporal overlap (<10s)!")
elif overlap_duration < 30:
    print(f"   ⚠️  WARNING: Short temporal overlap (<30s), dataset may be limited")
else:
    print(f"   ✅ Good temporal overlap for dataset generation")

# Time offset detection
time_offset = abs(cam_start - lid_start)
if time_offset > 5.0:
    print(f"   ⚠️  WARNING: Large time offset between sensors: {time_offset:.2f}s")
    print(f"       This may indicate sensor desynchronization")

# Gap detection
print(f"\n Temporal Gap Detection:")
cam_gaps = np.where(cam_intervals > 10.0)[0]
lid_gaps = np.where(lid_intervals > 10.0)[0]

if len(cam_gaps) > 0:
    print(f"   Camera: Found {len(cam_gaps)} gaps >10s")
    for gap_idx in cam_gaps[:3]:  # Show first 3
        print(f"     • Gap at frame {gap_idx}: {cam_intervals[gap_idx]:.2f}s")
    if len(cam_gaps) > 3:
        print(f"     • ... and {len(cam_gaps)-3} more gaps")
else:
    print(f"   Camera: No significant gaps detected")

if len(lid_gaps) > 0:
    print(f"   LiDAR: Found {len(lid_gaps)} gaps >10s")
    for gap_idx in lid_gaps[:3]:
        print(f"     • Gap at scan {gap_idx}: {lid_intervals[gap_idx]:.2f}s")
    if len(lid_gaps) > 3:
        print(f"     • ... and {len(lid_gaps)-3} more gaps")
else:
    print(f"   LiDAR: No significant gaps detected")

# =======================================================================
#  Feature Quality Analysis
# =======================================================================
print(f"\n\n FEATURE QUALITY ANALYSIS")
print("-" * 80)

# Check for NaN/Inf
cam_nan = np.isnan(camera_features).sum()
cam_inf = np.isinf(camera_features).sum()
lid_nan = np.isnan(lidar_features).sum()
lid_inf = np.isinf(lidar_features).sum()

print(f"\n Camera Feature Quality:")
print(f"   NaN values: {cam_nan} ({cam_nan/camera_features.size:.2%})")
print(f"   Inf values: {cam_inf} ({cam_inf/camera_features.size:.2%})")
if cam_nan > 0 or cam_inf > 0:
    print(f"   ❌ CRITICAL: Invalid values detected in camera features!")
else:
    print(f"   ✅ No invalid values detected")

# Check for zero features
cam_zero_frames = np.sum(np.all(camera_features == 0, axis=1))
print(f"   All-zero frames: {cam_zero_frames} ({cam_zero_frames/len(camera_features):.1%})")
if cam_zero_frames > 0:
    print(f"   ⚠️  WARNING: {cam_zero_frames} frames have all-zero features")

print(f"\n LiDAR Feature Quality:")
print(f"   NaN values: {lid_nan} ({lid_nan/lidar_features.size:.2%})")
print(f"   Inf values: {lid_inf} ({lid_inf/lidar_features.size:.2%})")
if lid_nan > 0 or lid_inf > 0:
    print(f"   ❌ CRITICAL: Invalid values detected in LiDAR features!")
else:
    print(f"   ✅ No invalid values detected")

lid_zero_scans = np.sum(np.all(lidar_features == 0, axis=1))
print(f"   All-zero scans: {lid_zero_scans} ({lid_zero_scans/len(lidar_features):.1%})")
if lid_zero_scans > 0:
    print(f"   ⚠️  WARNING: {lid_zero_scans} scans have all-zero features")

# Normalization verification (already done, but detailed here)
print(f"\n✅ L2 Normalization Verification:")
cam_norm_errors = np.abs(camera_norms - 1.0)
lid_norm_errors = np.abs(lidar_norms - 1.0)
print(f"   Camera - Max error: {cam_norm_errors.max():.2e}, Mean error: {cam_norm_errors.mean():.2e}")
print(f"   LiDAR - Max error: {lid_norm_errors.max():.2e}, Mean error: {lid_norm_errors.mean():.2e}")

if cam_norm_errors.max() > 1e-4 or lid_norm_errors.max() > 1e-4:
    print(f"   ⚠️  WARNING: Some features deviate significantly from unit norm")
else:
    print(f"   ✅ All features properly normalized (error < 1e-4)")

# =======================================================================
#  Feature Distribution Analysis
# =======================================================================
print(f"\n\n FEATURE DISTRIBUTION ANALYSIS")
print("-" * 80)

# Sparsity analysis
cam_sparsity = (camera_features == 0).sum() / camera_features.size
lid_sparsity = (lidar_features == 0).sum() / lidar_features.size

print(f"\n Camera Feature Distribution:")
print(f"   Sparsity: {cam_sparsity:.1%} (fraction of zero values)")
print(f"   Non-zero values per frame - Mean: {(camera_features != 0).sum(axis=1).mean():.0f}, Std: {(camera_features != 0).sum(axis=1).std():.0f}")

# Check for duplicate frames
cam_unique = len(np.unique(camera_features, axis=0))
cam_duplicates = len(camera_features) - cam_unique
print(f"   Unique frames: {cam_unique}/{len(camera_features)}")
if cam_duplicates > 0:
    print(f"   ⚠️  WARNING: {cam_duplicates} duplicate frames detected ({cam_duplicates/len(camera_features):.1%})")
else:
    print(f"   ✅ No duplicate frames detected")

print(f"\n LiDAR Feature Distribution:")
print(f"   Sparsity: {lid_sparsity:.1%} (fraction of zero values)")
print(f"   Non-zero values per scan - Mean: {(lidar_features != 0).sum(axis=1).mean():.0f}, Std: {(lidar_features != 0).sum(axis=1).std():.0f}")

lid_unique = len(np.unique(lidar_features, axis=0))
lid_duplicates = len(lidar_features) - lid_unique
print(f"   Unique scans: {lid_unique}/{len(lidar_features)}")
if lid_duplicates > 0:
    print(f"   ⚠️  WARNING: {lid_duplicates} duplicate scans detected ({lid_duplicates/len(lidar_features):.1%})")
else:
    print(f"   ✅ No duplicate scans detected")

# =======================================================================
#  Timestamp Precision Analysis
# =======================================================================
print(f"\n\n TIMESTAMP PRECISION ANALYSIS")
print("-" * 80)

print(f"\n Split Timestamp Format:")
print(f"   Storage format: timestamps_sec (int64) + timestamps_nsec (int32)")
print(f"   Precision: Full nanosecond precision maintained")
print(f"   Camera timestamps_sec range: [{camera_timestamps_sec.min()}, {camera_timestamps_sec.max()}]")
print(f"   Camera timestamps_nsec range: [{camera_timestamps_nsec.min()}, {camera_timestamps_nsec.max()}]")
print(f"   LiDAR timestamps_sec range: [{lidar_timestamps_sec.min()}, {lidar_timestamps_sec.max()}]")
print(f"   LiDAR timestamps_nsec range: [{lidar_timestamps_nsec.min()}, {lidar_timestamps_nsec.max()}]")

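# Measure the float64 round-trip error against the exact integer timestamps.
# (Recomputing sec + nsec * 1e-9 reproduces camera_timestamps bit for bit and would always report 0.)
# Subtracting the integer seconds is exact in float64, leaving the stored fractional second.
cam_frac = camera_timestamps - camera_timestamps_sec
lid_frac = lidar_timestamps - lidar_timestamps_sec
cam_precision_loss = np.abs(cam_frac - camera_timestamps_nsec * 1e-9).max()
lid_precision_loss = np.abs(lid_frac - lidar_timestamps_nsec * 1e-9).max()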

print(f"\n Float64 Conversion Analysis:")
print(f"   Camera max precision loss: {cam_precision_loss*1e9:.2f} ns")
print(f"   LiDAR max precision loss: {lid_precision_loss*1e9:.2f} ns")
print(f"   Temporal alignment threshold: {MAX_TIME_OFFSET*1e9:.0f} ns ({MAX_TIME_OFFSET}s)")
print(f"   Precision loss vs threshold: {cam_precision_loss/MAX_TIME_OFFSET:.2e}x")

if cam_precision_loss < 1e-6 and lid_precision_loss < 1e-6:
    print(f"   ✅ Precision loss negligible (<1 microsecond)")
    print(f"   ✅ Float64 safe for all temporal operations in this notebook")
else:
    print(f"   ⚠️  WARNING: Precision loss > 1 microsecond detected")

# =======================================================================
#  Cross-Modal Synchronization Preview
# =======================================================================
print(f"\n\n CROSS-MODAL SYNCHRONIZATION PREVIEW")
print("-" * 80)

print(f"\nEstimating potential aligned pairs (using ±{MAX_TIME_OFFSET}s threshold):")

# Count how many camera frames can be aligned
aligned_count = 0
for cam_t in camera_timestamps:
    # Find closest LiDAR timestamp
    time_diffs = np.abs(lidar_timestamps - cam_t)
    min_diff = time_diffs.min()
    if min_diff < MAX_TIME_OFFSET:
        aligned_count += 1

alignment_rate = aligned_count / len(camera_timestamps)
print(f"   Camera frames that can align with LiDAR: {aligned_count}/{len(camera_timestamps)} ({alignment_rate:.1%})")

if alignment_rate < 0.5:
    print(f"   ❌ CRITICAL: Low alignment rate (<50%)!")
    print(f"      This suggests poor temporal synchronization between sensors.")
elif alignment_rate < 0.7:
    print(f"   ⚠️  WARNING: Moderate alignment rate (<70%)")
    print(f"      Some features may not be usable in multi-modal pairs.")
else:
    print(f"   ✅ Good alignment rate (>70%)")

# =======================================================================
#  Overall Quality Assessment
# =======================================================================
print(f"\n\n✅ OVERALL DATA QUALITY ASSESSMENT")
print("=" * 80)

# Quality checks
checks = []

# 1. Sufficient data
min_frames = 50
checks.append((
    "Sufficient frames",
    len(camera_features) >= min_frames and len(lidar_features) >= min_frames,
    f"Camera: {len(camera_features)}, LiDAR: {len(lidar_features)} (need ≥{min_frames})"
))

# 2. No invalid values
checks.append((
    "No NaN/Inf values",
    cam_nan == 0 and cam_inf == 0 and lid_nan == 0 and lid_inf == 0,
    f"Camera NaN: {cam_nan}, Inf: {cam_inf}; LiDAR NaN: {lid_nan}, Inf: {lid_inf}"
))

# 3. Proper normalization
checks.append((
    "Proper L2 normalization",
    cam_norm_errors.max() < 1e-4 and lid_norm_errors.max() < 1e-4,
    f"Max error: Camera {cam_norm_errors.max():.2e}, LiDAR {lid_norm_errors.max():.2e}"
))

# 4. Temporal overlap
checks.append((
    "Sufficient temporal overlap",
    overlap_duration > 30,
    f"{overlap_duration:.1f}s (need >30s)"
))

# 5. Good alignment rate
checks.append((
    "Good cross-modal alignment",
    alignment_rate > 0.7,
    f"{alignment_rate:.1%} (need >70%)"
))

# 6. Not too many zeros
checks.append((
    "Low all-zero frame rate",
    cam_zero_frames < len(camera_features) * 0.05 and lid_zero_scans < len(lidar_features) * 0.05,
    f"Camera: {cam_zero_frames}, LiDAR: {lid_zero_scans} (<5% threshold)"
))

# 7. No excessive gaps
checks.append((
    "Temporal continuity",
    len(cam_gaps) < 5 and len(lid_gaps) < 5,
    f"Camera gaps: {len(cam_gaps)}, LiDAR gaps: {len(lid_gaps)} (<5 threshold)"
))

# 8. Reasonable durations
checks.append((
    "Sufficient recording duration",
    cam_duration > 60 or lid_duration > 60,
    f"Camera: {cam_duration:.1f}s, LiDAR: {lid_duration:.1f}s (need >60s)"
))

# Print results
print("\nQuality Checks:")
passed = 0
for check_name, check_passed, check_details in checks:
    status = "✅ PASS" if check_passed else "❌ FAIL"
    print(f"  {status} - {check_name}")
    print(f"         {check_details}")
    if check_passed:
        passed += 1

# Overall assessment
print(f"\n" + "=" * 80)
print(f"OVERALL ASSESSMENT: {passed}/{len(checks)} checks passed")

if passed == len(checks):
    print(" EXCELLENT: All quality checks passed! Data ready for dataset generation.")
elif passed >= len(checks) * 0.75:
    print("✅ GOOD: Most checks passed. Proceed with dataset generation.")
elif passed >= len(checks) * 0.5:
    print("⚠️  FAIR: Several issues detected. Dataset generation may proceed with limitations.")
else:
    print("❌ POOR: Multiple critical issues detected. Review data quality before proceeding.")

print("=" * 80 + "\n")

---

## SECTION 6: TIME ALIGNMENT

Feature-to-node matching.

---



### Temporal Alignment: Features to Trajectory Nodes

**Strategy:** Iterate through trajectory nodes and find the closest camera and LiDAR features within tolerance.

**Process:**
1. For each trajectory node timestamp
2. Find nearest camera feature (within MAX_TIME_OFFSET)
3. Find nearest LiDAR feature (within MAX_TIME_OFFSET)
4. Create valid_nodes dataframe: [node_id, camera_feat_id, lidar_feat_id]
5. Skip nodes where either modality is missing
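
The steps above can be sketched with `np.searchsorted`, which finds the nearest timestamp without scanning the whole array for every node. This is an illustrative sketch under stated assumptions: timestamps are float seconds, `nearest_within` is a hypothetical helper, and the toy arrays and the 0.5 s tolerance stand in for the notebook's data and `MAX_TIME_OFFSET`.

```python
import numpy as np

def nearest_within(query_t, ref_t, tol):
    """For each query time, return the index of the nearest ref time, or -1 if farther than tol."""
    ref_t = np.asarray(ref_t, dtype=np.float64)
    query_t = np.asarray(query_t, dtype=np.float64)
    order = np.argsort(ref_t)               # allow unsorted reference timestamps
    sorted_ref = ref_t[order]
    right = np.searchsorted(sorted_ref, query_t)
    left = np.clip(right - 1, 0, len(sorted_ref) - 1)
    right = np.clip(right, 0, len(sorted_ref) - 1)
    # Choose whichever neighbour (left or right) is closer in time
    pick_right = np.abs(sorted_ref[right] - query_t) < np.abs(sorted_ref[left] - query_t)
    nearest = np.where(pick_right, right, left)
    ok = np.abs(sorted_ref[nearest] - query_t) <= tol
    return np.where(ok, order[nearest], -1)

# Toy timestamps in seconds (hypothetical values)
node_t = np.array([0.0, 1.0, 2.0, 10.0])
cam_t = np.array([0.1, 1.4, 2.2])
lid_t = np.array([0.05, 0.9, 2.6])

cam_idx = nearest_within(node_t, cam_t, tol=0.5)
lid_idx = nearest_within(node_t, lid_t, tol=0.5)
valid = (cam_idx >= 0) & (lid_idx >= 0)   # keep nodes that have both modalities
print(cam_idx, lid_idx, valid)
```

Iterating over nodes rather than features guarantees each node receives at most one feature per modality, at the cost of allowing one feature to be reused by neighbouring nodes.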


### 6.1 Align Features to Trajectory Nodes

In [None]:
print("Aligning features to trajectory nodes...")

# Build KD-trees for temporal matching
node_timestamps = np.array([node['timestamp'] for node in trajectory_nodes.values()])
node_ids = list(trajectory_nodes.keys())
node_kdtree = KDTree(node_timestamps.reshape(-1, 1))

# Align camera features
camera_aligned = 0
for i, cam_t in enumerate(camera_timestamps):
    dist, idx = node_kdtree.query([[cam_t]], k=1)
    if dist[0][0] < MAX_TIME_OFFSET:
        node_id = node_ids[idx[0][0]]
        trajectory_nodes[node_id]['camera_feature'] = camera_features[i]
        trajectory_nodes[node_id]['camera_idx'] = i
        camera_aligned += 1

# Align LiDAR features
lidar_aligned = 0
for i, lid_t in enumerate(lidar_timestamps):
    dist, idx = node_kdtree.query([[lid_t]], k=1)
    if dist[0][0] < MAX_TIME_OFFSET:
        node_id = node_ids[idx[0][0]]
        trajectory_nodes[node_id]['lidar_feature'] = lidar_features[i]
        trajectory_nodes[node_id]['lidar_idx'] = i
        lidar_aligned += 1

# Filter to nodes with both modalities
valid_nodes = {node_id: data for node_id, data in trajectory_nodes.items()
               if data.get('camera_feature') is not None and data.get('lidar_feature') is not None}

camera_alignment_rate = camera_aligned / len(camera_features)
lidar_alignment_rate = lidar_aligned / len(lidar_features)

print(f"✅ Alignment complete:")
print(f"   Camera aligned: {camera_aligned}/{len(camera_features)} ({camera_alignment_rate:.1%})")
print(f"   LiDAR aligned: {lidar_aligned}/{len(lidar_features)} ({lidar_alignment_rate:.1%})")
print(f"   Valid nodes (both modalities): {len(valid_nodes)}")

# Concatenate features for each valid node
for node_id in valid_nodes:
    cam_feat = valid_nodes[node_id]['camera_feature']
    lid_feat = valid_nodes[node_id]['lidar_feature']
    valid_nodes[node_id]['combined_feature'] = np.concatenate([cam_feat, lid_feat])

print(f"   Combined feature dimension: {valid_nodes[list(valid_nodes.keys())[0]]['combined_feature'].shape[0]}D")

In [None]:
print("\n" + "="*70)
print("TEMPORAL ALIGNMENT: FEATURES TO TRAJECTORY NODES")
print("="*70)

# Combine the split trajectory timestamps (seconds + nanoseconds) into float seconds
trajectory_times = (df_trajectory['timestamp_secs'] + df_trajectory['timestamp_nsecs'] * 1e-9).values

# Combine feature timestamps (from split sec/nsec format)
camera_times = camera_timestamps_sec + camera_timestamps_nsec * 1e-9
lidar_times = lidar_timestamps_sec + lidar_timestamps_nsec * 1e-9

print(f"\nInput data:")
print(f"  Trajectory nodes: {len(df_trajectory)}")
print(f"  Camera features: {len(camera_features)}")
print(f"  LiDAR features: {len(lidar_features)}")
print(f"  Tolerance: ±{MAX_TIME_OFFSET}s")

# Build valid nodes dataframe
valid_nodes_data = []
alignment_stats = {'camera_aligned': 0, 'lidar_aligned': 0, 'both_aligned': 0}

for idx, row in df_trajectory.iterrows():
    node_id = int(row['node_id'])  # iterrows() upcasts mixed rows to float
    node_time = row['timestamp_secs'] + row['timestamp_nsecs'] * 1e-9  # float seconds

    # Find closest camera feature
    cam_time_diffs = np.abs(camera_times - node_time)
    cam_min_diff = np.min(cam_time_diffs)
    cam_feat_id = None

    if cam_min_diff <= MAX_TIME_OFFSET:
        cam_feat_id = np.argmin(cam_time_diffs)
        alignment_stats['camera_aligned'] += 1

    # Find closest LiDAR feature
    lid_time_diffs = np.abs(lidar_times - node_time)
    lid_min_diff = np.min(lid_time_diffs)
    lid_feat_id = None

    if lid_min_diff <= MAX_TIME_OFFSET:
        lid_feat_id = np.argmin(lid_time_diffs)
        alignment_stats['lidar_aligned'] += 1

    # Only keep nodes with BOTH modalities
    if cam_feat_id is not None and lid_feat_id is not None:
        valid_nodes_data.append({
            'node_id': node_id,
            'camera_feat_id': cam_feat_id,
            'lidar_feat_id': lid_feat_id,
            'x': row['x_m'],
            'y': row['y_m'],
            'timestamp': int(row['timestamp_secs']) * 1_000_000_000 + int(row['timestamp_nsecs'])  # nanoseconds
        })
        alignment_stats['both_aligned'] += 1

# Create valid nodes dataframe
valid_nodes = pd.DataFrame(valid_nodes_data)

print(f"\n✓ Alignment Results:")
print(f"  Camera aligned: {alignment_stats['camera_aligned']} / {len(df_trajectory)} ({100*alignment_stats['camera_aligned']/len(df_trajectory):.1f}%)")
print(f"  LiDAR aligned: {alignment_stats['lidar_aligned']} / {len(df_trajectory)} ({100*alignment_stats['lidar_aligned']/len(df_trajectory):.1f}%)")
print(f"  Valid nodes (both): {len(valid_nodes)} / {len(df_trajectory)} ({100*len(valid_nodes)/len(df_trajectory):.1f}%)")

if len(valid_nodes) == 0:
    raise ValueError("❌ No valid nodes with both modalities! Check timestamp alignment.")

print(f"\n✓ Valid nodes dataframe created: {len(valid_nodes)} nodes")
print(valid_nodes.head())


---

## SECTION 7: PAIRING

Positive and negative pair generation.

---



## INTELLIGENT PAIRING STRATEGY

### 7.1 Data Profiling & Automatic Threshold Suggestion

Before pairing, analyze the pairwise distance and time distributions to suggest optimal thresholds.
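
As a rough illustration of the idea, the sketch below derives the three suggested thresholds from percentiles of the pairwise distances and time differences. The percentile choices (10th, 75th) follow the labels printed by the profiling cell. `suggest_thresholds`, the dictionary keys, and the toy data are hypothetical.

```python
import numpy as np

def suggest_thresholds(distances, time_diffs, default_time_gap=10.0):
    """Suggest pairing thresholds from pairwise distance and time-difference distributions."""
    distances = np.asarray(distances, dtype=np.float64)
    time_diffs = np.asarray(time_diffs, dtype=np.float64)
    # Key the values by percentile so a label can never drift from its value
    pct = {p: float(np.percentile(distances, p)) for p in (10, 75)}
    close = distances < pct[10]
    # Time gap: 75th percentile of time differences among spatially close pairs
    time_gap = float(np.percentile(time_diffs[close], 75)) if close.any() else default_time_gap
    return {
        "positive_dist": pct[10],
        "easy_negative_dist": pct[75],
        "positive_time_gap": time_gap,
    }

# Toy distributions (hypothetical values)
toy_dist = np.arange(101, dtype=np.float64)   # 0, 1, ..., 100 metres
toy_time = toy_dist * 2.0                     # seconds
suggested = suggest_thresholds(toy_dist, toy_time)
print(suggested)
```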


In [None]:
print("\n" + "="*70)
print("DATA PROFILING & THRESHOLD SUGGESTION")
print("="*70)

# Filter df_all_pairs to only include valid nodes
valid_node_ids = set(valid_nodes['node_id'].values)
df_pairs_valid = df_all_pairs[
    df_all_pairs['node1_id'].isin(valid_node_ids) &
    df_all_pairs['node2_id'].isin(valid_node_ids)
].copy()

print(f"\nPairwise data:")
print(f"  Total pairs: {len(df_all_pairs):,}")
print(f"  Valid pairs (both nodes aligned): {len(df_pairs_valid):,}")

# Calculate statistics
distances = df_pairs_valid['distance_between_nodes_m'].values
time_diffs = df_pairs_valid['time_diff_secs'].values

# Distance percentiles
dist_percentiles = [5, 10, 25, 50, 75, 90, 95]
dist_values = np.percentile(distances, dist_percentiles)

print(f"\n📊 Distance Distribution (meters):")
for p, v in zip(dist_percentiles, dist_values):
    print(f"  {p:2d}th percentile: {v:.3f}m")

# Time percentiles
time_percentiles = [5, 10, 25, 50, 75, 90, 95]
time_values = np.percentile(time_diffs, time_percentiles)

print(f"\n⏱️  Time Difference Distribution (seconds):")
for p, v in zip(time_percentiles, time_values):
    print(f"  {p:2d}th percentile: {v:.1f}s")

# Automatic threshold suggestions
suggested_positive_dist = dist_values[dist_percentiles.index(10)]  # 10th percentile
suggested_easy_neg_dist = dist_values[dist_percentiles.index(75)]  # 75th percentile (index 5 would be the 90th)

# For time gap: analyze pairs with small distance
close_pairs = df_pairs_valid[df_pairs_valid['distance_between_nodes_m'] < suggested_positive_dist]
if len(close_pairs) > 0:
    suggested_time_gap = np.percentile(close_pairs['time_diff_secs'].values, 75)
else:
    suggested_time_gap = 10.0  # Default

print(f"\n💡 Suggested Thresholds:")
print(f"  Positive distance: {suggested_positive_dist:.3f}m (10th percentile)")
print(f"  Easy negative distance: {suggested_easy_neg_dist:.3f}m (75th percentile)")
print(f"  Positive time gap: {suggested_time_gap:.1f}s (75th %ile of close pairs)")

print(f"\n📝 Current Configuration:")
print(f"  Positive distance: {POSITIVE_DISTANCE_THRESHOLD:.3f}m")
print(f"  Easy negative distance: {EASY_NEGATIVE_MIN_DISTANCE:.3f}m")
print(f"  Positive time gap: {POSITIVE_TIME_GAP:.1f}s")

# Visualization
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Distance histogram
axes[0].hist(distances, bins=50, alpha=0.7, edgecolor='black')
axes[0].axvline(POSITIVE_DISTANCE_THRESHOLD, color='green', linestyle='--', linewidth=2, label='Positive threshold')
axes[0].axvline(EASY_NEGATIVE_MIN_DISTANCE, color='red', linestyle='--', linewidth=2, label='Easy neg threshold')
axes[0].set_xlabel('Distance (m)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Pairwise Distance Distribution')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Time vs Distance scatter
axes[1].scatter(time_diffs, distances, alpha=0.3, s=1)
axes[1].axhline(POSITIVE_DISTANCE_THRESHOLD, color='green', linestyle='--', linewidth=2, label='Positive dist')
axes[1].axvline(POSITIVE_TIME_GAP, color='blue', linestyle='--', linewidth=2, label='Positive time gap')
axes[1].set_xlabel('Time Difference (s)')
axes[1].set_ylabel('Distance (m)')
axes[1].set_title('Distance vs Time Difference')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n✓ Profiling complete. Proceeding with configured thresholds.")


### 7.2 Strategy Overview

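We create three types of training pairs. The percentages are target proportions; the counts actually produced depend on how many candidates pass each filter.

1. **Positive Pairs (30%)** - Node pairs closer than `POSITIVE_DISTANCE_THRESHOLD` whose timestamps differ by more than `POSITIVE_TIME_GAP`, i.e. revisits rather than consecutive frames
2. **Easy Negative Pairs (35%)** - Spatially distant locations (farther apart than `EASY_NEGATIVE_MIN_DISTANCE`)
3. **Hard Negative Pairs (35%)** - Spatially distant locations whose features are perceptually similar (cosine similarity above `HARD_NEGATIVE_SIMILARITY_THRESHOLD`)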
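
The three categories can be summarized as a single decision rule. The sketch below is illustrative only: `classify_pair` is a hypothetical helper, the distance and time defaults follow the small-indoor calibration listed under Version Control (0.3m, 1.0m, 10.0s), and the similarity default of 0.8 is an assumption. Unlike the cells that follow, which draw easy negatives from all distant pairs, this sketch keeps the easy and hard pools disjoint.

```python
def classify_pair(distance_m, time_diff_s, similarity,
                  pos_dist=0.3, pos_time_gap=10.0,
                  neg_min_dist=1.0, sim_threshold=0.8):
    """Hypothetical helper: return the pair type, or None if the pair is unused."""
    # Positive: same place, revisited after a meaningful time gap.
    if distance_m < pos_dist and time_diff_s > pos_time_gap:
        return "positive"
    # Negative: clearly different places; "hard" if the features still look alike.
    if distance_m > neg_min_dist:
        return "hard_negative" if similarity > sim_threshold else "easy_negative"
    # Everything else (sequential frames, mid-range distances) is left out.
    return None

print(classify_pair(0.1, 30.0, 0.9))  # positive
print(classify_pair(2.0, 30.0, 0.9))  # hard_negative
print(classify_pair(2.0, 30.0, 0.1))  # easy_negative
print(classify_pair(0.1, 2.0, 0.9))   # None (close in space but also close in time)
```

Pairs in the gap between the positive and negative distance thresholds are deliberately unlabeled, since their ground truth is ambiguous.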

### 7.3 Generate Positive Pairs from Loop Closures
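
The cell below finds each node with a boolean mask, which rescans `valid_nodes` once per lookup. One possible alternative, sketched here on toy data, is to index the table by `node_id` once and reuse it. The names `build_pair_vector`, `valid_nodes_demo`, `camera_demo`, and `lidar_demo` are hypothetical, and the tiny feature sizes are for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the notebook's tables (shapes are illustrative only).
valid_nodes_demo = pd.DataFrame({
    "node_id": [10, 11, 12],
    "camera_feat_id": [0, 1, 2],
    "lidar_feat_id": [2, 1, 0],
})
camera_demo = np.arange(12, dtype=np.float32).reshape(3, 4)  # 3 frames x 4-D
lidar_demo = np.arange(6, dtype=np.float32).reshape(3, 2)    # 3 scans x 2-D

# Index once by node_id so each lookup avoids a full table scan.
nodes_by_id = valid_nodes_demo.set_index("node_id")

def build_pair_vector(node1_id, node2_id):
    """Hypothetical helper: [cam1, lidar1, cam2, lidar2] for one node pair."""
    parts = []
    for nid in (node1_id, node2_id):
        row = nodes_by_id.loc[int(nid)]
        # Cast to int: values read back from a row may arrive as floats.
        parts.append(camera_demo[int(row["camera_feat_id"])])
        parts.append(lidar_demo[int(row["lidar_feat_id"])])
    return np.concatenate(parts)

vec = build_pair_vector(10, 12)
print(vec.shape)  # (12,) = 2 nodes x (4 camera + 2 LiDAR)
```

The same helper could serve the positive, easy negative, and hard negative loops, which currently repeat the lookup and concatenation code.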

In [None]:
print("\n" + "="*70)
print("GENERATING POSITIVE PAIRS (LOOP CLOSURES)")
print("="*70)

# Use df_pairs_valid filtered by distance and time thresholds
positive_candidates = df_pairs_valid[
    (df_pairs_valid['distance_between_nodes_m'] < POSITIVE_DISTANCE_THRESHOLD) &
    (df_pairs_valid['time_diff_secs'] > POSITIVE_TIME_GAP)
].copy()

print(f"\nCriteria:")
print(f"  Distance < {POSITIVE_DISTANCE_THRESHOLD}m")
print(f"  Time difference > {POSITIVE_TIME_GAP}s")
print(f"\nCandidates: {len(positive_candidates)}")

# Generate positive pairs
positive_pairs = []
for _, row in positive_candidates.iterrows():
    node1_id = row['node1_id']
    node2_id = row['node2_id']

    # Get valid node data
    node1_data = valid_nodes[valid_nodes['node_id'] == node1_id].iloc[0]
    node2_data = valid_nodes[valid_nodes['node_id'] == node2_id].iloc[0]

    # Get features
    cam1_idx = node1_data['camera_feat_id']
    lid1_idx = node1_data['lidar_feat_id']
    cam2_idx = node2_data['camera_feat_id']
    lid2_idx = node2_data['lidar_feat_id']

    # Concatenate features
    feat1 = np.concatenate([camera_features[cam1_idx], lidar_features[lid1_idx]])
    feat2 = np.concatenate([camera_features[cam2_idx], lidar_features[lid2_idx]])

    # Create pairwise feature vector
    pairwise_feat = np.concatenate([feat1, feat2])

    positive_pairs.append({
        'features': pairwise_feat,
        'label': 1,
        'node1_id': node1_id,
        'node2_id': node2_id,
        'distance': row['distance_between_nodes_m'],
        'time_diff': row['time_diff_secs']
    })

print(f"\n✓ Generated {len(positive_pairs)} positive pairs")


### 7.4 Generate Easy Negative Pairs

In [None]:
print("\n" + "="*70)
print("GENERATING EASY NEGATIVE PAIRS")
print("="*70)

# Use df_pairs_valid filtered by distance threshold
easy_neg_candidates = df_pairs_valid[
    df_pairs_valid['distance_between_nodes_m'] > EASY_NEGATIVE_MIN_DISTANCE
].copy()

print(f"\nCriteria:")
print(f"  Distance > {EASY_NEGATIVE_MIN_DISTANCE}m")
print(f"\nCandidates: {len(easy_neg_candidates)}")

# Sample to match number of positive pairs
n_easy_neg = min(len(positive_pairs), len(easy_neg_candidates))
easy_neg_sample = easy_neg_candidates.sample(n=n_easy_neg, random_state=RANDOM_SEED)

# Generate easy negative pairs
easy_negative_pairs = []
for _, row in easy_neg_sample.iterrows():
    node1_id = row['node1_id']
    node2_id = row['node2_id']

    # Get valid node data
    node1_data = valid_nodes[valid_nodes['node_id'] == node1_id].iloc[0]
    node2_data = valid_nodes[valid_nodes['node_id'] == node2_id].iloc[0]

    # Get features
    cam1_idx = node1_data['camera_feat_id']
    lid1_idx = node1_data['lidar_feat_id']
    cam2_idx = node2_data['camera_feat_id']
    lid2_idx = node2_data['lidar_feat_id']

    # Concatenate features
    feat1 = np.concatenate([camera_features[cam1_idx], lidar_features[lid1_idx]])
    feat2 = np.concatenate([camera_features[cam2_idx], lidar_features[lid2_idx]])

    # Create pairwise feature vector
    pairwise_feat = np.concatenate([feat1, feat2])

    easy_negative_pairs.append({
        'features': pairwise_feat,
        'label': 0,
        'node1_id': node1_id,
        'node2_id': node2_id,
        'distance': row['distance_between_nodes_m'],
        'time_diff': row['time_diff_secs']
    })

print(f"\n✓ Generated {len(easy_negative_pairs)} easy negative pairs")


### 7.5 Generate Hard Negative Pairs (Perceptual Aliasing)
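
A pair counts as a hard negative when its nodes are far apart yet their fused feature vectors point in nearly the same direction, measured by cosine similarity $\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. The sketch below shows that computation with a guard for zero-length vectors. The guard and the helper name `cosine_similarity` are additions for illustration; the cell below computes the ratio directly.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity; returns 0.0 when either vector is (near) zero."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom < eps:
        return 0.0  # avoid dividing by zero for empty or all-zero features
    return float(np.dot(a, b) / denom)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine_similarity([0.0, 0.0], [1.0, 1.0]))  # zero vector, guarded
```

Because camera and LiDAR features are concatenated before the comparison, the modality with the larger norm dominates the similarity score.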

In [None]:
print("\n" + "="*70)
print("GENERATING HARD NEGATIVE PAIRS (PERCEPTUAL ALIASING)")
print("="*70)

# Use df_pairs_valid filtered by distance (must be far apart)
hard_neg_candidates = df_pairs_valid[
    df_pairs_valid['distance_between_nodes_m'] > EASY_NEGATIVE_MIN_DISTANCE
].copy()

print(f"\nCriteria:")
print(f"  Distance > {EASY_NEGATIVE_MIN_DISTANCE}m")
print(f"  Cosine similarity > {HARD_NEGATIVE_SIMILARITY_THRESHOLD}")
print(f"\nCandidates: {len(hard_neg_candidates)}")

# Calculate cosine similarity for candidates
hard_negative_pairs = []
hard_negative_pairs_type_a = []

print("\nComputing feature similarities...")
for idx, row in hard_neg_candidates.iterrows():
    node1_id = row['node1_id']
    node2_id = row['node2_id']

    # Get valid node data
    node1_data = valid_nodes[valid_nodes['node_id'] == node1_id].iloc[0]
    node2_data = valid_nodes[valid_nodes['node_id'] == node2_id].iloc[0]

    # Get features
    cam1_idx = node1_data['camera_feat_id']
    lid1_idx = node1_data['lidar_feat_id']
    cam2_idx = node2_data['camera_feat_id']
    lid2_idx = node2_data['lidar_feat_id']

    # Concatenate features
    feat1 = np.concatenate([camera_features[cam1_idx], lidar_features[lid1_idx]])
    feat2 = np.concatenate([camera_features[cam2_idx], lidar_features[lid2_idx]])

    # Compute cosine similarity
    similarity = np.dot(feat1, feat2) / (np.linalg.norm(feat1) * np.linalg.norm(feat2))

    # Check if perceptually similar
    if similarity > HARD_NEGATIVE_SIMILARITY_THRESHOLD:
        # Create pairwise feature vector
        pairwise_feat = np.concatenate([feat1, feat2])

        pair_data = {
            'features': pairwise_feat,
            'label': 0,
            'node1_id': node1_id,
            'node2_id': node2_id,
            'distance': row['distance_between_nodes_m'],
            'time_diff': row['time_diff_secs'],
            'similarity': similarity
        }

        hard_negative_pairs.append(pair_data)
        hard_negative_pairs_type_a.append(pair_data)

print(f"\n✓ Generated {len(hard_negative_pairs)} hard negative pairs")
print(f"  Type A (perceptual): {len(hard_negative_pairs_type_a)}")


### 7.6 Combine and Shuffle Dataset

In [None]:
print("Combining all pairs into dataset...")

# Combine all pairs
dataset = positive_pairs + easy_negative_pairs + hard_negative_pairs

# Shuffle dataset
random.shuffle(dataset)

print(f"\n✅ Dataset created:")
print(f"   Total pairs: {len(dataset)}")
print(f"   Positive: {len(positive_pairs)} ({100*len(positive_pairs)/len(dataset):.1f}%)")
print(f"   Easy negative: {len(easy_negative_pairs)} ({100*len(easy_negative_pairs)/len(dataset):.1f}%)")
print(f"   Hard negative: {len(hard_negative_pairs)} ({100*len(hard_negative_pairs)/len(dataset):.1f}%)")

---

## SECTION 8: VALIDATIONS

Split, validate, visualize, report.

---



### 8.1 Stratified Train/Val/Test Split

In [None]:
print("Creating stratified train/val/test splits...")

# Separate by label
positive_samples = [d for d in dataset if d['label'] == 1]
negative_samples = [d for d in dataset if d['label'] == 0]

# Shuffle each class
random.shuffle(positive_samples)
random.shuffle(negative_samples)

# Split each class
def split_class(samples, train_ratio, val_ratio, test_ratio):
    n = len(samples)
    train_end = int(n * train_ratio)
    val_end = train_end + int(n * val_ratio)
    return samples[:train_end], samples[train_end:val_end], samples[val_end:]

pos_train, pos_val, pos_test = split_class(positive_samples, TRAIN_RATIO, VAL_RATIO, TEST_RATIO)
neg_train, neg_val, neg_test = split_class(negative_samples, TRAIN_RATIO, VAL_RATIO, TEST_RATIO)

# Combine and shuffle within each split
train_dataset = pos_train + neg_train
val_dataset = pos_val + neg_val
test_dataset = pos_test + neg_test

random.shuffle(train_dataset)
random.shuffle(val_dataset)
random.shuffle(test_dataset)

# Compute class balance
train_pos_ratio = sum(d['label'] for d in train_dataset) / len(train_dataset)
val_pos_ratio = sum(d['label'] for d in val_dataset) / len(val_dataset)
test_pos_ratio = sum(d['label'] for d in test_dataset) / len(test_dataset)

print(f"\n✅ Stratified splits created:")
print(f"   Train: {len(train_dataset)} pairs (Pos: {train_pos_ratio:.1%})")
print(f"   Val: {len(val_dataset)} pairs (Pos: {val_pos_ratio:.1%})")
print(f"   Test: {len(test_dataset)} pairs (Pos: {test_pos_ratio:.1%})")

# Check stratification quality
target_ratio = len(positive_samples) / len(dataset)
max_deviation = max(abs(train_pos_ratio - target_ratio),
                   abs(val_pos_ratio - target_ratio),
                   abs(test_pos_ratio - target_ratio))

if max_deviation < 0.05:
    print(f"\n  ✅ Stratification quality: excellent (max deviation: {max_deviation:.3f})")
else:
    print(f"\n  ⚠️  Stratification quality: acceptable (max deviation: {max_deviation:.3f})")

### 8.2 Dataset Validation

In [None]:
print("Validating dataset quality...")

# Extract features and labels
X = np.array([d['features'] for d in dataset])
y = np.array([d['label'] for d in dataset])

validation_checks = []

# Check 1: Dataset size
min_size = 100
check_1 = len(dataset) >= min_size
validation_checks.append((f"Dataset size >= {min_size}", check_1))

# Check 2: Class balance
pos_ratio = np.sum(y) / len(y)
check_2 = 0.2 <= pos_ratio <= 0.4
validation_checks.append((f"Class balance (20-40% positive): {pos_ratio:.1%}", check_2))

# Check 3: Feature dimension
expected_dim = 2 * (1280 + 256)  # two nodes x (1280 camera + 256 LiDAR) = 3072
check_3 = X.shape[1] == expected_dim
validation_checks.append((f"Feature dimension == {expected_dim}D", check_3))

# Check 4: No NaN values
check_4 = not np.any(np.isnan(X))
validation_checks.append(("No NaN values in features", check_4))

# Check 5: No infinite values
check_5 = not np.any(np.isinf(X))
validation_checks.append(("No infinite values in features", check_5))

# Check 6: Feature range reasonable
check_6 = np.abs(X).max() < 10.0
validation_checks.append(("Feature values in reasonable range", check_6))

# Check 7: Positive pairs from loop closures
check_7 = len(positive_pairs) > 0
validation_checks.append(("Positive pairs from Cartographer loop closures", check_7))

# Check 8: Sufficient hard negatives
check_8 = len(hard_negative_pairs) >= len(positive_pairs) * 0.5
validation_checks.append(("Sufficient hard negative pairs", check_8))

# Critical checks (must pass)
critical_checks = [check_1, check_3, check_4, check_5, check_7]
critical_passed = all(critical_checks)
all_passed = all(c[1] for c in validation_checks)

print(f"\n{'='*70}")
print("VALIDATION RESULTS")
print(f"{'='*70}\n")

for check_name, result in validation_checks:
    status = "✅" if result else "❌"
    print(f"{status} {check_name}")

print(f"\n{'='*70}")
if all_passed:
    print("✅ ALL CHECKS PASSED - Dataset is ready for training")
elif critical_passed:
    print("⚠️  CRITICAL CHECKS PASSED - Dataset is usable but review warnings")
else:
    print("❌ CRITICAL CHECKS FAILED - Dataset quality issues detected")
print(f"{'='*70}")

### 8.3 Diagnostic Visualizations

In [None]:
print("Generating diagnostic visualizations...")

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle(f'Loop Closure Dataset Diagnostics - {SESSION_ID}', fontsize=16, fontweight='bold')

# Plot 1: Trajectory with loop closures
ax = axes[0, 0]
# valid_nodes is a DataFrame, so index it by node_id rather than treating it as a dict.
# ASSUMPTION: the pose is stored in columns named 'x' and 'y'; adjust these to the
# column names produced by the pbstream extraction step.
node_xy = valid_nodes.sort_values('node_id').set_index('node_id')[['x', 'y']]
valid_positions = node_xy.to_numpy()
ax.plot(valid_positions[:, 0], valid_positions[:, 1], 'b-', alpha=0.3, linewidth=1, label='Trajectory')
ax.scatter(valid_positions[:, 0], valid_positions[:, 1], c='blue', s=10, alpha=0.5, label='Nodes')

# Plot loop closures
for pair in positive_pairs[:50]:  # Plot first 50
    n1 = node_xy.loc[pair['node1_id']]
    n2 = node_xy.loc[pair['node2_id']]
    ax.plot([n1['x'], n2['x']],
           [n1['y'], n2['y']],
           'r-', alpha=0.2, linewidth=0.5)

ax.set_xlabel('X (meters)')
ax.set_ylabel('Y (meters)')
ax.set_title('Trajectory with Loop Closures')
ax.legend()
ax.grid(True, alpha=0.3)
ax.axis('equal')

# Plot 2: Class distribution
ax = axes[0, 1]
pair_types = ['Positive', 'Easy Neg', 'Hard Neg']
pair_counts = [len(positive_pairs), len(easy_negative_pairs), len(hard_negative_pairs)]
colors = ['#2ecc71', '#3498db', '#e74c3c']
bars = ax.bar(pair_types, pair_counts, color=colors, alpha=0.7)
ax.set_ylabel('Count')
ax.set_title('Pair Type Distribution')
ax.grid(True, alpha=0.3, axis='y')
for bar, count in zip(bars, pair_counts):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
           f'{count}\n({100*count/len(dataset):.1f}%)',
           ha='center', va='bottom')

# Plot 3: Spatial distance distribution
ax = axes[0, 2]
# Pair dicts store the spatial distance under the 'distance' key
pos_dists = [p['distance'] for p in positive_pairs]
easy_neg_dists = [p['distance'] for p in easy_negative_pairs]
hard_neg_dists = [p['distance'] for p in hard_negative_pairs]

ax.hist(pos_dists, bins=20, alpha=0.6, label='Positive', color='#2ecc71')
ax.hist(easy_neg_dists, bins=20, alpha=0.6, label='Easy Neg', color='#3498db')
ax.hist(hard_neg_dists, bins=20, alpha=0.6, label='Hard Neg', color='#e74c3c')
ax.set_xlabel('Spatial Distance (m)')
ax.set_ylabel('Count')
ax.set_title('Spatial Distance Distribution')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# Plot 4: Split distribution
ax = axes[1, 0]
splits = ['Train', 'Val', 'Test']
split_sizes = [len(train_dataset), len(val_dataset), len(test_dataset)]
split_colors = ['#2ecc71', '#f39c12', '#e74c3c']
bars = ax.bar(splits, split_sizes, color=split_colors, alpha=0.7)
ax.set_ylabel('Pairs')
ax.set_title('Train/Val/Test Splits')
ax.grid(True, alpha=0.3, axis='y')
for bar, size, ratio in zip(bars, split_sizes, [train_pos_ratio, val_pos_ratio, test_pos_ratio]):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
           f'{size}\nPos: {ratio:.1%}',
           ha='center', va='bottom')

# Plot 5: Feature statistics
ax = axes[1, 1]
feature_means = X.mean(axis=0)
ax.plot(feature_means, alpha=0.7, linewidth=0.5)
ax.axvline(x=1280, color='r', linestyle='--', alpha=0.5, label='Camera|LiDAR (1st node)')
ax.axvline(x=1536, color='g', linestyle='--', alpha=0.5, label='1st node|2nd node')
ax.set_xlabel('Feature Index')
ax.set_ylabel('Mean Value')
ax.set_title('Feature Statistics (Mean across dataset)')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 6: Validation summary
ax = axes[1, 2]
ax.axis('off')
summary_text = f"""VALIDATION SUMMARY

Dataset Size: {len(dataset)} pairs
  • Positive: {len(positive_pairs)} ({100*len(positive_pairs)/len(dataset):.1f}%)
  • Negative: {len(easy_negative_pairs) + len(hard_negative_pairs)} ({100*(len(easy_negative_pairs)+len(hard_negative_pairs))/len(dataset):.1f}%)

Splits:
  • Train: {len(train_dataset)} ({100*len(train_dataset)/len(dataset):.1f}%)
  • Val: {len(val_dataset)} ({100*len(val_dataset)/len(dataset):.1f}%)
  • Test: {len(test_dataset)} ({100*len(test_dataset)/len(dataset):.1f}%)

Quality Checks:
  • Validation: {'✅ PASSED' if all_passed else '⚠️  WARNINGS' if critical_passed else '❌ FAILED'}
  • Feature Dim: {X.shape[1]}D
  • Loop Closures: {len(inter_submap_constraints)}
  • Class Balance: {pos_ratio:.1%}
"""
ax.text(0.1, 0.5, summary_text, transform=ax.transAxes,
       fontsize=11, verticalalignment='center', family='monospace',
       bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.3))

plt.tight_layout()
plt.savefig(DIAGNOSTICS_FILE, dpi=150, bbox_inches='tight')
print(f"\n✅ Diagnostics saved: {DIAGNOSTICS_FILE}")
plt.show()

---

## SECTION 9: OUTPUT

---



### 9.1 Dataset saving

In [None]:
print("Saving dataset...")

# Package dataset
dataset_package = {
    'train': train_dataset,
    'val': val_dataset,
    'test': test_dataset,
    'metadata': {
        'session_id': SESSION_ID,
        'creation_date': str(np.datetime64('today')),
        'num_trajectory_nodes': len(trajectory_nodes),
        'num_valid_nodes': len(valid_nodes),
        'num_constraints': len(inter_submap_constraints),
        'feature_dim': X.shape[1],
        'random_seed': RANDOM_SEED,
        'config': {
            'max_time_offset': MAX_TIME_OFFSET,
            'positive_distance_threshold': POSITIVE_DISTANCE_THRESHOLD,
            'easy_negative_min_distance': EASY_NEGATIVE_MIN_DISTANCE,
            'hard_negative_min_distance': HARD_NEGATIVE_MIN_DISTANCE,
            'hard_negative_similarity_threshold': HARD_NEGATIVE_SIMILARITY_THRESHOLD,
            'max_constraint_residual': MAX_CONSTRAINT_RESIDUAL,
            'max_angular_distance': MAX_ANGULAR_DISTANCE
        }
    }
}

# Save to pickle
with open(DATASET_FILE, 'wb') as f:
    pickle.dump(dataset_package, f)

file_size_mb = os.path.getsize(DATASET_FILE) / (1024 * 1024)
print(f"\n✅ Dataset saved: {DATASET_FILE}")
print(f"   File size: {file_size_mb:.2f} MB")

### 9.2 Text Report

In [None]:
print("Generating final report...")

output_filename = os.path.basename(DATASET_FILE)

report = f"""
{'='*70}
LOOP CLOSURE DATASET GENERATION REPORT
{'='*70}

SESSION INFORMATION:
  • Session ID: {SESSION_ID}
  • Generation date: {np.datetime64('today')}
  • Pipeline version: 7.0
  • Random seed: {RANDOM_SEED}

INPUT DATA:
  • Trajectory nodes: {len(trajectory_nodes)}
  • Valid nodes (both modalities): {len(valid_nodes)}
  • Camera features: {len(camera_features)} (aligned: {camera_aligned}, {100*camera_alignment_rate:.1f}%)
  • LiDAR features: {len(lidar_features)} (aligned: {lidar_aligned}, {100*lidar_alignment_rate:.1f}%)
  • INTER_SUBMAP constraints: {len(constraint_metadata)} (validated: {len(inter_submap_constraints)})

DATASET COMPOSITION:
  • Total pairs: {len(dataset)}
  • Positive pairs: {len(positive_pairs)} ({100*len(positive_pairs)/len(dataset):.1f}%)
  • Easy negative pairs: {len(easy_negative_pairs)} ({100*len(easy_negative_pairs)/len(dataset):.1f}%)
  • Hard negative pairs: {len(hard_negative_pairs)} ({100*len(hard_negative_pairs)/len(dataset):.1f}%)
      → Type A (perceptual): {len(hard_negative_pairs_type_a)}

TRAIN/VAL/TEST SPLITS (STRATIFIED RANDOM):
  • Train: {len(train_dataset)} pairs ({100*len(train_dataset)/len(dataset):.1f}%)
      → Positive: {sum(d['label'] for d in train_dataset)} ({100*train_pos_ratio:.1f}%)
  • Validation: {len(val_dataset)} pairs ({len(val_dataset)/len(dataset):.1%})
      → Positive: {sum(d['label'] for d in val_dataset)} ({100*val_pos_ratio:.1f}%)
  • Test: {len(test_dataset)} pairs ({100*len(test_dataset)/len(dataset):.1f}%)
      → Positive: {sum(d['label'] for d in test_dataset)} ({100*test_pos_ratio:.1f}%)
  • Stratification quality: {max_deviation:.3f} max deviation (< 0.05 is good)

FEATURE STATISTICS:
  • Pairwise feature dimension: {X.shape[1]}D
  • Mean: {np.mean(X):.4f}
  • Std: {np.std(X):.4f}
  • Range: [{np.min(X):.4f}, {np.max(X):.4f}]

VALIDATION STATUS:
  {'✅' if all_passed else '⚠️ ' if critical_passed else '❌'} Overall: {'PASSED' if all_passed else 'PASSED WITH WARNINGS' if critical_passed else 'FAILED'}
"""

for check_name, check_result in validation_checks:
    report += f"  {'✅' if check_result else '❌'} {check_name}\n"

report += f"""
OUTPUT FILES:
  • Dataset: {output_filename} ({file_size_mb:.2f} MB)
  • Diagnostics: dataset_diagnostics.png

NEXT STEPS:
  1. Load dataset with: pickle.load(open('{output_filename}', 'rb'))
  2. Train Fusion MLP (Phase 2): 1536→512→128→1 architecture
  3. Use BCE loss + hard negative mining
  4. Monitor validation performance
  5. Export to ONNX/TensorRT for Jetson Nano deployment


{'='*70}
"""

print(report)

# Save report
with open(REPORT_FILE, 'w') as f:
    f.write(report)

print(f"\n✅ Final report saved to: {REPORT_FILE}")

print("\n" + "=" * 70)
if all_passed:
    print(" DATASET GENERATION COMPLETE - READY FOR TRAINING!")
elif critical_passed:
    print("✅ DATASET GENERATION COMPLETE - USABLE WITH WARNINGS")
else:
    print("⚠️  DATASET GENERATION COMPLETE - REVIEW VALIDATION ISSUES")
print("=" * 70)

---

## VERSION CONTROL

**Version:** 7.0 (Major Refactor)
**Date:** 2025-11-12
**Changes:**

### MAJOR ARCHITECTURAL CHANGES

**1. Pbstream-First Approach:**
- Made `.pbstream` file REQUIRED (no longer optional)
- Removed bag-based trajectory extraction fallback
- Integrated full pbstream extraction pipeline from standalone notebook
- Direct access to SLAM graph with full fidelity

**2. New Temporal Alignment Strategy:**
- Changed from feature-first to node-first iteration
- For each trajectory node → find closest camera + LiDAR features
- Creates explicit mapping: [node_id, camera_feat_id, lidar_feat_id]
- Clearer data flow and validation
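
A node-first alignment of this kind can be sketched with a sorted-timestamp search. This is a minimal illustration, not the notebook's implementation: the helper name `nearest_feature_index` is hypothetical, and it assumes the feature timestamps are sorted ascending with at least two entries. The `max_offset` argument plays the role of the notebook's `MAX_TIME_OFFSET`.

```python
import numpy as np

def nearest_feature_index(node_times, feat_times, max_offset):
    """For each node timestamp, return the index of the closest feature
    timestamp, or -1 if the closest one is more than max_offset seconds away.
    Assumes feat_times is sorted ascending and has at least two entries."""
    node_times = np.asarray(node_times, dtype=float)
    feat_times = np.asarray(feat_times, dtype=float)
    # Candidate neighbours: the feature at the insertion point and the one before it.
    right = np.clip(np.searchsorted(feat_times, node_times), 1, len(feat_times) - 1)
    left = right - 1
    pick_right = np.abs(feat_times[right] - node_times) < np.abs(node_times - feat_times[left])
    idx = np.where(pick_right, right, left)
    # Reject matches that are too far away in time.
    idx[np.abs(feat_times[idx] - node_times) > max_offset] = -1
    return idx

node_t = [0.1, 1.6, 2.5, 10.0]
camera_t = [0.0, 1.0, 2.0, 3.0]
print(nearest_feature_index(node_t, camera_t, max_offset=0.5))  # [ 0  2  2 -1]
```

Running the same function against the LiDAR timestamps and keeping only nodes where both indices are non-negative yields the `[node_id, camera_feat_id, lidar_feat_id]` mapping.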

**3. Pairwise Table-Based Pairing:**
- Generate complete pairwise distance/time table from pbstream
- Use table filtering for all pair types (positive, easy neg, hard neg)
- Eliminated manual constraint parsing from bag
- More efficient and consistent approach
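
For illustration, a pairwise table with the column names used in Section 7 could be built as follows. This is a sketch under assumptions: `build_pairwise_table` is a hypothetical helper, and the real table is generated from poses and timestamps extracted from the pbstream. For N nodes it produces N(N-1)/2 rows, one per unordered pair.

```python
import numpy as np
import pandas as pd

def build_pairwise_table(node_ids, xy, timestamps):
    """Hypothetical helper: one row per unordered node pair (i < j)."""
    node_ids = np.asarray(node_ids)
    xy = np.asarray(xy, dtype=float)
    timestamps = np.asarray(timestamps, dtype=float)
    # Upper-triangle indices enumerate each pair exactly once.
    i, j = np.triu_indices(len(node_ids), k=1)
    return pd.DataFrame({
        "node1_id": node_ids[i],
        "node2_id": node_ids[j],
        "distance_between_nodes_m": np.linalg.norm(xy[i] - xy[j], axis=1),
        "time_diff_secs": np.abs(timestamps[i] - timestamps[j]),
    })

table = build_pairwise_table([0, 1, 2], [[0, 0], [3, 4], [0, 1]], [0.0, 5.0, 12.0])
print(table)
```

Positive and negative candidates then reduce to boolean filters on the two numeric columns.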

**4. Data-Driven Threshold Selection:**
- Added automatic profiling section
- Analyzes distance and time distributions
- Suggests optimal thresholds based on data percentiles
- Visualization of distributions with threshold overlays

**5. Environment-Specific Calibration:**
- Updated thresholds for small indoor environment (3m × 2m)
- Positive distance: 2.0m → 0.3m
- Easy negative distance: 5.0m → 1.0m
- Added positive time gap: 10.0s (prevents sequential pairing)

**6. Code Integration:**
- Merged pbstream extraction functions (extract_xy, extract_timestamp, etc.)
- Added Google Drive mount for Colab compatibility
- Unified path structure using SESSION_ID
- Preserved all validation and diagnostic sections

---

**Previous changes (v6.10):**
- Fixed critical parsing bug in trajectory extraction
- Implemented two-tier pbstream/bag approach
- Reorganized notebook structure

**Previous changes (v6.6):**
- Added trajectory topic structure verification
- Identified correct data source

---
