Skip to content

Use relative tolerance for cycle-finder cost mismatch assertion#1240

Open
mlubin wants to merge 1 commit into
NVIDIA:mainfrom
mlubin:fix/cycle-cost-mismatch-tolerance
Open

Use relative tolerance for cycle-finder cost mismatch assertion#1240
mlubin wants to merge 1 commit into
NVIDIA:mainfrom
mlubin:fix/cycle-cost-mismatch-tolerance

Conversation

@mlubin
Copy link
Copy Markdown
Contributor

@mlubin mlubin commented May 19, 2026

(Fix to the flaky test discovered by Claude. Please validate correctness.)

The line "Cost mismatch after a move" assertion compares the realized cost change after perform_moves against the cycle finder's predicted total_cycle_cost. That predicted value is the sum of per-edge cost deltas read from a pre-computed candidate matrix; each delta is an O(1) approximation evaluated against the pre-move state, so chaining them around a cycle does not exactly equal the post-move recomputed cost. Empirically the drift is bounded by a small fraction of |cost_before|.

The previous fixed < 1.0 tolerance was fine at small problem cost magnitudes but under-budgeted that natural approximation slop once cost_before rose above ~100. This caused intermittent CI aborts in
level0_retail/retail_float_test_t.CVRPTW_Retail/18.

Switch to a relative tolerance with a 1.0 absolute floor, matching the pattern used elsewhere for similar cost-mismatch checks (perform_cross, perform_two_opt).

Verified locally by running CVRPTW_Retail/18 in a loop 300 times under ASSERT_MODE. Before: ~1.7% per-iter abort rate. After: zero aborts.

Closes #845

The line "Cost mismatch after a move" assertion compares the realized cost
change after `perform_moves` against the cycle finder's predicted
`total_cycle_cost`. That predicted value is the sum of per-edge cost deltas
read from a pre-computed candidate matrix; each delta is an O(1) approximation
evaluated against the pre-move state, so chaining them around a cycle does
not exactly equal the post-move recomputed cost. Empirically the drift is
bounded by a small fraction of |cost_before|.

The previous fixed `< 1.0` tolerance was fine at small problem cost magnitudes
but under-budgeted that natural approximation slop once `cost_before` rose
above ~100. This caused intermittent CI aborts in
`level0_retail/retail_float_test_t.CVRPTW_Retail/18`.

Switch to a relative tolerance with a 1.0 absolute floor, matching the
pattern used elsewhere for similar cost-mismatch checks (`perform_cross`,
`perform_two_opt`).

Verified locally by running CVRPTW_Retail/18 in a loop 300 times under
ASSERT_MODE. Before: ~1.7% per-iter abort rate. After: zero aborts.

Closes NVIDIA#845

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Miles Lubin <mlubin@nvidia.com>
@mlubin mlubin requested review from akifcorduk and rg20 May 19, 2026 11:12
@mlubin mlubin requested a review from a team as a code owner May 19, 2026 11:12
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mlubin mlubin added non-breaking Introduces a non-breaking change improvement Improves an existing functionality labels May 19, 2026
@mlubin
Copy link
Copy Markdown
Contributor Author

mlubin commented May 19, 2026

/ok to test 34fb29b

@mlubin mlubin mentioned this pull request May 19, 2026
8 tasks
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

In run_best_local_search, the post-move cost verification assertion is updated to use a dynamic tolerance based on the cost magnitude rather than a fixed threshold, improving numerical stability for varying cost scales and resolving intermittent test failures.

Changes

Post-move cost verification

Layer / File(s) Summary
Post-move cost assertion with dynamic tolerance
cpp/src/routing/local_search/local_search.cu
After an improving cycle move, the cost difference assertion changes from a fixed absolute tolerance to a relative tolerance (as a percentage of cost_before with a minimum absolute floor), with comments explaining the approximation basis for total_cycle_cost.

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: switching from fixed tolerance to relative tolerance for the cycle-finder cost mismatch assertion.
Description check ✅ Passed The description provides clear context for the fix, explains the root cause, details the solution, and includes verification results demonstrating the fix resolves the flaky test.
Linked Issues check ✅ Passed The PR directly addresses issue #845 by replacing the fixed tolerance assertion with a relative tolerance check, eliminating the intermittent test failure reported in the linked issue.
Out of Scope Changes check ✅ Passed All changes are focused on fixing the flaky assertion in local_search.cu; no unrelated modifications were introduced outside the scope of addressing issue #845.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/routing/local_search/local_search.cu`:
- Around line 320-321: The current cuopt_assert uses a one-sided check:
cuopt_assert((cost_after - cost_before) -
move_candidates.cycles.total_cycle_cost < ...), which allows large negative
residuals; change it to a symmetric magnitude check by taking the absolute value
of the residual: use abs((cost_after - cost_before) -
move_candidates.cycles.total_cycle_cost) < std::max(1., 0.01 *
std::abs(cost_before)) so the drift bound is enforced both directions for the
assertion in the cuopt_assert call.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8e6a84e9-f23a-4fe8-a0a3-5cad88776a58

📥 Commits

Reviewing files that changed from the base of the PR and between ef4aae7 and 34fb29b.

📒 Files selected for processing (1)
  • cpp/src/routing/local_search/local_search.cu

Comment on lines +320 to +321
cuopt_assert((cost_after - cost_before) - move_candidates.cycles.total_cycle_cost <
std::max(1., 0.01 * std::abs(cost_before)),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a symmetric mismatch check for cycle-cost drift.

At Line 320, the check is one-sided; large negative residuals will always pass. If this is a drift bound, it should be magnitude-based.

Suggested fix
-      cuopt_assert((cost_after - cost_before) - move_candidates.cycles.total_cycle_cost <
-                     std::max(1., 0.01 * std::abs(cost_before)),
+      const double cycle_cost_residual =
+        (cost_after - cost_before) - move_candidates.cycles.total_cycle_cost;
+      const double cycle_cost_tolerance = std::max(1.0, 0.01 * std::abs(cost_before));
+      cuopt_assert(std::abs(cycle_cost_residual) < cycle_cost_tolerance,
                    "Cost mismatch after a move");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/routing/local_search/local_search.cu` around lines 320 - 321, The
current cuopt_assert uses a one-sided check: cuopt_assert((cost_after -
cost_before) - move_candidates.cycles.total_cycle_cost < ...), which allows
large negative residuals; change it to a symmetric magnitude check by taking the
absolute value of the residual: use abs((cost_after - cost_before) -
move_candidates.cycles.total_cycle_cost) < std::max(1., 0.01 *
std::abs(cost_before)) so the drift bound is enforced both directions for the
assertion in the cuopt_assert call.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Routing test failure

1 participant