
# Fix: deterministic multithreaded lidar odometry #379

Merged
JanuszBedkowski merged 2 commits into MapsHD:main from bloom256:fix/deterministic-lidar-odometry-3
Feb 28, 2026

bloom256 commented Feb 28, 2026

Problem

optimize_lidar_odometry gives different results between multithreaded runs.
The multithreaded path used tbb::combinable<MatrixPair> + combine_each to
sum per-thread Hessian copies. combine_each iterates thread-local storage in
unspecified order, and since floating-point addition is not associative
((a+b)+c != a+(b+c)), different summation orders produce different Hessian
matrices. Over many optimiser iterations these small differences accumulate and
lead to different convergence paths.
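
The order sensitivity described above can be reproduced with plain doubles. The following is a minimal sketch, not code from this PR; `sum_in_order` is a hypothetical helper that adds values strictly left to right, which is what one particular combine order amounts to.

```cpp
#include <vector>

// Hypothetical helper (not part of the PR): adds the values strictly
// left to right, i.e. in exactly the order they are given.
double sum_in_order(const std::vector<double>& values) {
    double total = 0.0;
    for (double v : values) {
        total += v;  // each addition rounds to the nearest double
    }
    return total;
}
```

With the same three values, `sum_in_order({1e20, -1e20, 1.0})` returns 1.0 while `sum_in_order({1e20, 1.0, -1e20})` returns 0.0, because 1.0 is smaller than the spacing between doubles near 1e20 and is lost when it is added to 1e20 first. The Hessian differences are far smaller than this, but they arise from the same mechanism.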

Why not per-point storage?

The straightforward fix is to store each point's 6×6 + 6×1 contribution
separately and sum them in fixed order. This was tried but makes the whole
optimization ~60% slower due to two effects:

  • Sequential sum adds ~36%: reducing millions of per-point matrices cannot
    be parallelized (order must be fixed for determinism)
  • Parallel compute adds ~24%: writing to scattered per-point matrices
    instead of local accumulators hurts cache during the compute phase

Fix: fixed-chunk per-pose accumulators

Split points into 128 fixed-size chunks. Each chunk has its own per-pose 6×6 and
6×1 accumulator matrices (chunk_AtPA[chunk][pose], chunk_AtPB[chunk][pose]).

  1. Compute phase (tbb::parallel_for over chunks, or sequential for):
    each chunk zeros its own accumulators, then iterates its point range and
    accumulates into chunk_AtPA[chunk][pose] / chunk_AtPB[chunk][pose]
  2. Reduce phase (sequential): sum chunk accumulators into the global Hessian
    in fixed chunk×pose order

This is deterministic because:

  • Chunk boundaries are fixed (determined by point count, not thread scheduling)
  • Within each chunk, points are processed in index order
  • The reduce sums chunks in fixed order (0..127, poses 0..N-1)
  • ST and MT use the same process_chunk lambda — only the loop differs

Because both paths execute the same floating-point operations in the same
order, ST and MT results are bit-identical.
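
The two-phase scheme can be sketched in a simplified form. This is an illustration under stated assumptions, not the PR's implementation: a single `double` per chunk stands in for the per-pose 6×6 and 6×1 matrices, and instead of `tbb::parallel_for` the hypothetical `visit_order` argument simulates the arbitrary order in which a scheduler might pick up chunks. `chunked_sum` and the boundary formula are likewise assumptions made for the example.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t NUM_CHUNKS = 128;  // same value as named in the PR text

// Hypothetical sketch: sums `points` through fixed chunks. `visit_order`
// lists the chunk indices in the order they get processed, standing in
// for whatever order a thread scheduler would choose.
double chunked_sum(const std::vector<double>& points,
                   const std::vector<std::size_t>& visit_order) {
    std::vector<double> chunk_acc(NUM_CHUNKS, 0.0);
    const std::size_t n = points.size();

    // Compute phase: chunk boundaries depend only on the point count,
    // and each chunk walks its own range in index order.
    auto process_chunk = [&](std::size_t chunk) {
        const std::size_t begin = chunk * n / NUM_CHUNKS;
        const std::size_t end = (chunk + 1) * n / NUM_CHUNKS;
        double acc = 0.0;
        for (std::size_t i = begin; i < end; ++i) {
            acc += points[i];
        }
        chunk_acc[chunk] = acc;  // each chunk writes only its own slot
    };

    for (std::size_t chunk : visit_order) {
        process_chunk(chunk);
    }

    // Reduce phase: always chunk 0, 1, ..., NUM_CHUNKS - 1, regardless of
    // the order in which the chunks were computed.
    double total = 0.0;
    for (std::size_t c = 0; c < NUM_CHUNKS; ++c) {
        total += chunk_acc[c];
    }
    return total;
}
```

Calling `chunked_sum` with the chunk indices in ascending order and again in reversed order returns the same bits, since the visiting order only changes when each slot is filled, never what is added to what.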

Performance: ~same as the original non-deterministic code.

NUM_CHUNKS = 128 must be at least the maximum number of CPU cores so that every
core gets work. 128 is sufficient for current hardware while keeping the chunk
overhead negligible.

Changes in lidar_odometry_utils_optimizers.cpp

  • Removed #include <tbb/blocked_range.h> (no longer used)
  • add_indoor_hessian_contribution / add_outdoor_hessian_contribution: write
    to fixed-size 6×6 and 6×1 output refs instead of indexing into global Hessian
    via block<6,6>(offset, offset). Sign convention changed: helpers accumulate
    positive AtPB, reduce subtracts once (AtPBndt -= chunk_AtPB)
  • compute_hessian: takes Mat6x6& out_AtPA and Vec6x1& out_AtPB instead
    of Eigen::MatrixXd& + matrix_offset
  • Replaced tbb::combinable<MatrixPair> with 128 fixed-chunk approach (static
    std::vector<std::vector<Mat6x6/Vec6x1>> to avoid reallocation across ~1354
    calls per dataset)
  • tbb::combinable<LookupStats> kept for integer lookup counters (accumulation
    order doesn't matter for integers)
  • Fixed orphaned UTL_PROFILER_END(before_iter) — was inside
    process_worker_step_lidar_odometry_core but its BEGIN was in the caller;
    moved back to caller
  • Added UTL_PROFILER_SCOPE to process_worker_step_1, process_worker_step_2,
    and process_worker_step_lidar_odometry_core
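
The sign-convention change in the helper bullet above (helpers accumulate positive AtPB, the reduce subtracts once) can be illustrated as follows. This is a rough sketch under assumptions, not the PR's code: `Vec6` is a `std::array` stand-in for the PR's `Vec6x1`, a flat `std::vector<double>` stands in for the global right-hand-side vector, and `add_contribution` and `reduce_pose` are hypothetical names.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Vec6 = std::array<double, 6>;  // assumed stand-in for the PR's Vec6x1

// Helper side: accumulate a positive contribution into a caller-owned
// fixed-size output, with no knowledge of any global offset.
void add_contribution(const Vec6& point_term, Vec6& out_AtPB) {
    for (std::size_t k = 0; k < 6; ++k) {
        out_AtPB[k] += point_term[k];
    }
}

// Reduce side: for one pose, subtract every chunk's accumulator from the
// global vector at that pose's 6-element block, in chunk order.
void reduce_pose(const std::vector<Vec6>& chunk_AtPB_for_pose,
                 std::size_t pose,
                 std::vector<double>& global_AtPB) {
    for (const Vec6& chunk_vec : chunk_AtPB_for_pose) {
        for (std::size_t k = 0; k < 6; ++k) {
            global_AtPB[pose * 6 + k] -= chunk_vec[k];
        }
    }
}
```

Keeping the negation in a single place means the helpers never need to know where their block sits in the global system.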

Note: This was not caught in the previous nondeterminism PR because the
floating-point differences are very small and only showed up on certain datasets
during extended testing.

Testing

Tested on 7 datasets (4 MT + 4 ST runs each), lengths from 500 m to 5000 m.
All runs produce identical results between ST and MT and between runs.


Probably related to #338 — the non-deterministic results reported there could
be caused by this tbb::combinable summation order issue.

…er-pose accumulators, guaranteeing ST=MT bit-identical results regardless of thread count
@JanuszBedkowski JanuszBedkowski merged commit 92b4c71 into MapsHD:main Feb 28, 2026
7 checks passed