# Fix: deterministic multithreaded lidar odometry by bloom256 · Pull Request #379 · MapsHD/HDMapping

bloom256 · 2026-02-28T07:38:08Z

Problem

`optimize_lidar_odometry` produces different results from one multithreaded run to the next.
The multithreaded path used `tbb::combinable<MatrixPair>` + `combine_each` to
sum per-thread Hessian copies. `combine_each` iterates thread-local storage in
unspecified order, and since floating-point addition is not associative
(`(a+b)+c != a+(b+c)`), different summation orders produce different Hessian
matrices. Over many optimiser iterations these small differences accumulate and
lead to different convergence paths.
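A minimal illustration of the non-associativity, independent of the odometry code. The helper names `sum_in_order`, `order_a` and `order_b` are made up for this sketch; the values are chosen so that the effect is visible after only three additions in IEEE 754 double precision.

```cpp
#include <vector>

// Sum the values strictly left to right, in the order given.
inline double sum_in_order(const std::vector<double>& values)
{
    double total = 0.0;
    for (double v : values)
        total += v;
    return total;
}

// The same three terms in two different orders.
// (1e16 + -1e16) + 1.0: the large terms cancel first, so the 1.0 survives.
inline double order_a() { return sum_in_order({ 1e16, -1e16, 1.0 }); }

// (1e16 + 1.0) + -1e16: the 1.0 is below the spacing of doubles near 1e16
// and is rounded away before the large terms cancel.
inline double order_b() { return sum_in_order({ 1e16, 1.0, -1e16 }); }
```

`order_a()` returns `1.0` and `order_b()` returns `0.0`. In the Hessian sums the per-term error is far smaller, but the mechanism is the same.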

Why not per-point storage?

The straightforward fix is to store each point's 6×6 + 6×1 contribution
separately and sum them in a fixed order. This was tried, but it makes the whole
optimization ~60% slower due to two effects:

- Sequential sum adds ~36%: reducing millions of per-point matrices cannot
  be parallelized (the order must be fixed for determinism)
- Parallel compute adds ~24%: writing to scattered per-point matrices
  instead of local accumulators hurts cache locality during the compute phase

Fix: fixed-chunk per-pose accumulators

Split the points into 128 chunks with fixed boundaries. Each chunk has its own
per-pose 6×6 and 6×1 accumulator matrices (`chunk_AtPA[chunk][pose]`,
`chunk_AtPB[chunk][pose]`).

- Compute phase (`tbb::parallel_for` over chunks, or a sequential `for`):
  each chunk zeros its own accumulators, then iterates over its point range and
  accumulates into `chunk_AtPA[chunk][pose]` / `chunk_AtPB[chunk][pose]`
- Reduce phase (sequential): sum the chunk accumulators into the global Hessian
  in fixed chunk×pose order

This is deterministic because:

- Chunk boundaries are fixed (determined by point count, not thread scheduling)
- Within each chunk, points are processed in index order
- The reduce sums chunks in fixed order (chunks 0..127, poses 0..N-1)

Both the ST and MT paths iterate over the same 128 chunks using the same
`process_chunk` lambda. Only the loop type differs (`tbb::parallel_for` vs
plain `for`), which guarantees bit-identical results.
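The scheme can be sketched without TBB or Eigen. This is a simplified stand-in, not the code from the PR: a scalar replaces the per-pose 6×6 / 6×1 blocks, `contribution` is a hypothetical per-point term, and the `schedule` argument mimics thread scheduling by changing the order in which chunks are executed (no real threads are started). Only `NUM_CHUNKS` and the name `process_chunk` come from the description above.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t NUM_CHUNKS = 128; // value taken from the PR

// Hypothetical per-point contribution; a scalar stands in for the 6x6 block.
inline double contribution(std::size_t i)
{
    return 1.0 / static_cast<double>(i + 1);
}

// Sum the points of chunk c in index order. Boundaries depend only on n.
inline double process_chunk(std::size_t c, std::size_t n)
{
    const std::size_t begin = c * n / NUM_CHUNKS;
    const std::size_t end = (c + 1) * n / NUM_CHUNKS;
    double acc = 0.0;
    for (std::size_t i = begin; i < end; ++i)
        acc += contribution(i);
    return acc;
}

// `schedule` lists the order in which chunks happen to be executed, which is
// what a thread scheduler would vary. Each chunk writes only its own slot.
inline double chunked_sum(std::size_t n, const std::vector<std::size_t>& schedule)
{
    std::vector<double> chunk_acc(NUM_CHUNKS, 0.0);
    for (std::size_t c : schedule)
        chunk_acc[c] = process_chunk(c, n);

    // Reduce phase: always chunk 0, 1, ..., NUM_CHUNKS - 1.
    double total = 0.0;
    for (double a : chunk_acc)
        total += a;
    return total;
}

// Two different execution orders over the same set of chunks.
inline std::vector<std::size_t> forward_schedule()
{
    std::vector<std::size_t> s;
    for (std::size_t c = 0; c < NUM_CHUNKS; ++c)
        s.push_back(c);
    return s;
}

inline std::vector<std::size_t> reverse_schedule()
{
    std::vector<std::size_t> s;
    for (std::size_t c = NUM_CHUNKS; c > 0; --c)
        s.push_back(c - 1);
    return s;
}
```

`chunked_sum(n, forward_schedule())` and `chunked_sum(n, reverse_schedule())` return the same value bit for bit, because the execution order never influences which additions are performed or in which order their results are combined.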

Performance: ~same as the original non-deterministic code.

`NUM_CHUNKS = 128` must be at least the maximum number of CPU cores so that
every core gets work. 128 is sufficient for current hardware while keeping the
per-chunk overhead negligible.

Changes in `lidar_odometry_utils_optimizers.cpp`

- Removed `#include <tbb/blocked_range.h>` (no longer used)
- `add_indoor_hessian_contribution` / `add_outdoor_hessian_contribution`: write
  to fixed-size 6×6 and 6×1 output refs instead of indexing into the global Hessian
  via `block<6,6>(offset, offset)`. Sign convention changed: helpers accumulate
  positive AtPB, reduce subtracts once (`AtPBndt -= chunk_AtPB`)
- `compute_hessian`: takes `Mat6x6& out_AtPA` and `Vec6x1& out_AtPB` instead
  of `Eigen::MatrixXd&` + `matrix_offset`
- Replaced `tbb::combinable<MatrixPair>` with the 128 fixed-chunk approach (static
  `std::vector<std::vector<Mat6x6/Vec6x1>>` to avoid reallocation across ~1354
  calls per dataset)
- `tbb::combinable<LookupStats>` kept for integer lookup counters (accumulation
  order doesn't matter for integers)
- Fixed orphaned `UTL_PROFILER_END(before_iter)`: it was inside
  `process_worker_step_lidar_odometry_core` but its BEGIN was in the caller;
  moved back to the caller
- Added `UTL_PROFILER_SCOPE` to `process_worker_step_1`, `process_worker_step_2`,
  and `process_worker_step_lidar_odometry_core`
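The sign convention change can be pictured as follows. This is only a sketch under assumptions: `Vec6` is a `std::array` stand-in for the `Vec6x1` type named above, and `accumulate_positive` and `reduce_negated` are hypothetical names, not functions from the PR.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Vec6 = std::array<double, 6>; // stand-in for the PR's Vec6x1

// Helper side: add one point's contribution with a positive sign.
inline void accumulate_positive(Vec6& out_AtPB, const Vec6& contribution)
{
    for (std::size_t k = 0; k < 6; ++k)
        out_AtPB[k] += contribution[k];
}

// Reduce side: subtract each chunk accumulator from the global vector once,
// walking the chunks in fixed order.
inline Vec6 reduce_negated(const std::vector<Vec6>& chunk_AtPB)
{
    Vec6 global{}; // zero-initialised
    for (const Vec6& chunk : chunk_AtPB)
        for (std::size_t k = 0; k < 6; ++k)
            global[k] -= chunk[k];
    return global;
}
```

Keeping the negation in one place means the helpers never need to know which sign the global system expects.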

Note: This was not caught in the previous nondeterminism PR because the
floating-point differences are very small and only showed up on certain datasets
during extended testing.

Testing

Tested on 7 datasets (4 MT + 4 ST runs each), with lengths from 500 m to 5000 m.
All runs produce identical results, both between ST and MT and from run to run.
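The description does not say how the outputs were compared. One possible way to check for bit-identical results is to compare the raw bytes of the values rather than using `operator==`; the helper `bit_identical` below is a hypothetical name for such a check.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// True only if both sequences have the same length and every double has the
// same bit pattern. Unlike operator==, this distinguishes +0.0 from -0.0 and
// treats NaNs with identical bit patterns as equal.
inline bool bit_identical(const std::vector<double>& a, const std::vector<double>& b)
{
    if (a.size() != b.size())
        return false;
    if (a.empty())
        return true; // nothing to compare, and avoids passing null pointers
    return std::memcmp(a.data(), b.data(), a.size() * sizeof(double)) == 0;
}
```

For example, `bit_identical({0.0}, {-0.0})` is false even though `0.0 == -0.0` is true, so this check is stricter than a value comparison.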

Probably related to #338: the non-deterministic results reported there could
be caused by this `tbb::combinable` summation-order issue.


bloom256 added 2 commits February 27, 2026 22:40

Deterministic Hessian: replace tbb::combinable with 128 fixed-chunk p…

cbccdea

…er-pose accumulators, guaranteeing ST=MT bit-identical results regardless of thread count

clang formatting

6a0f9b7

JanuszBedkowski merged commit 92b4c71 into MapsHD:main Feb 28, 2026
7 checks passed

JanuszBedkowski merged 2 commits into MapsHD:main from bloom256:fix/deterministic-lidar-odometry-3
