Fix concurrent LP exception cleanup #1206

Merged
rapids-bot[bot] merged 1 commit into NVIDIA:main from mlubin:concurrent-exception on May 13, 2026

@mlubin mlubin commented May 12, 2026

Join concurrent solver worker threads before rethrowing exceptions so std::thread destructors do not terminate the process. Add a deterministic regression test that exercises a PDLP validation error after concurrent workers start.

This prevents some core dumps and gives a more useful error message when the flaky test_incumbent_callbacks test fails.
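The abort in the old failure follows from a C++ rule: destroying a std::thread that is still joinable calls std::terminate. The sketch below illustrates joining before rethrowing under that rule. It is not the cuOpt code; `run_with_worker`, `outcome`, and the error text are hypothetical names chosen for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical result type: the error message seen by the caller (empty on
// success) and whether the worker ran to completion.
struct outcome {
  std::string message;
  bool worker_finished;
};

// Hypothetical stand-in for a solve call that starts a worker thread and may
// then fail on the calling thread (for example, on a validation error).
inline outcome run_with_worker(bool main_fails)
{
  std::atomic<bool> halt{false};
  std::atomic<bool> worker_finished{false};
  std::string message;

  try {
    std::thread worker([&] {
      // The worker spins until it is asked to stop.
      while (!halt.load()) { std::this_thread::yield(); }
      worker_finished.store(true);
    });

    try {
      if (main_fails) { throw std::runtime_error("validation failed"); }
      halt.store(true);
      worker.join();
    } catch (...) {
      // Without the halt and join below, stack unwinding would destroy a
      // joinable std::thread and its destructor would call std::terminate.
      halt.store(true);
      if (worker.joinable()) { worker.join(); }
      throw;  // rethrow only after the worker has been joined
    }
  } catch (const std::exception& e) {
    message = e.what();
  }

  return {message, worker_finished.load()};
}
```

Calling `run_with_worker(true)` reports the error to the caller as an ordinary exception message instead of aborting the process, because the worker is no longer joinable when its std::thread object is destroyed.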

Old failure:

 | Explored | Unexplored |    Objective    |     Bound     | IntInf | Depth | Iter/Node |   Gap    |  Time  |
terminate called without an active exception
Fatal Python error: Aborted

Current thread 0x0000e727cfdde020 (most recent call first):
  File "/pyenv/versions/3.11.15/lib/python3.11/site-packages/cuopt/linear_programming/solver/solver.py", line 98 in Solve
  File "/pyenv/versions/3.11.15/lib/python3.11/site-packages/cuopt/utilities/exception_handler.py", line 24 in func
  File "/__w/cuopt/cuopt/python/cuopt/cuopt/tests/linear_programming/test_incumbent_callbacks.py", line 87 in _run_incumbent_solver_callback
  File "/__w/cuopt/cuopt/python/cuopt/cuopt/tests/linear_programming/test_incumbent_callbacks.py", line 112 in test_incumbent_get_callback

New failure:

=================================== FAILURES ===================================
_________________ test_incumbent_get_callback[/mip/swath1.mps] _________________

file_name = '/mip/swath1.mps'

    @pytest.mark.parametrize(
        "file_name",
        [
            ("/mip/swath1.mps"),
            ("/mip/neos5-free-bound.mps"),
        ],
    )
    def test_incumbent_get_callback(file_name):
>       _run_incumbent_solver_callback(file_name, include_set_callback=False)

tests/linear_programming/test_incumbent_callbacks.py:112: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/linear_programming/test_incumbent_callbacks.py:87: in _run_incumbent_solver_callback
    solution = solver.Solve(data_model_obj, settings)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/opt/conda/envs/test/lib/python3.12/site-packages/cuopt/utilities/exception_handler.py:48: in func
    raise e
/opt/conda/envs/test/lib/python3.12/site-packages/cuopt/utilities/exception_handler.py:24: in func
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
/opt/conda/envs/test/lib/python3.12/site-packages/cuopt/linear_programming/solver/solver.py:98: in Solve
    s = solver_wrapper.Solve(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   RuntimeError: CUDA error encountered at: file=/tmp/conda-bld-output/bld/rattler-build_libmps-parser/work/cpp/src/pdlp/utilities/ping_pong_graph.cu line=57: call='cudaStreamEndCapture(stream_view_.value(), &even_graph)', Reason=cudaErrorStreamCaptureInvalidated:operation failed due to a previous error during capture

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

mlubin commented May 12, 2026

/ok to test 2af4f87

@mlubin mlubin added the bug (Something isn't working) and non-breaking (Introduces a non-breaking change) labels on May 12, 2026
@mlubin mlubin force-pushed the concurrent-exception branch from 2af4f87 to 4969003 Compare May 12, 2026 19:49
Join concurrent solver worker threads before rethrowing exceptions so std::thread destructors do not terminate the process. Add a deterministic regression test that exercises a PDLP validation error after concurrent workers start.

Co-authored-by: Codex <codex@openai.com>
@mlubin mlubin force-pushed the concurrent-exception branch from 4969003 to b806ca3 Compare May 12, 2026 20:00
mlubin commented May 12, 2026

/ok to test b806ca3

@mlubin mlubin marked this pull request as ready for review May 12, 2026 21:29
@mlubin mlubin requested a review from a team as a code owner May 12, 2026 21:29
@mlubin mlubin requested review from hlinsen and nguidotti May 12, 2026 21:29
coderabbitai Bot commented May 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a5df6753-ddbd-412b-8b8d-1eee9b09be43

📥 Commits

Reviewing files that changed from the base of the PR and between 3df9b05 and b806ca3.

📒 Files selected for processing (2)
  • cpp/src/pdlp/solve.cu
  • cpp/tests/linear_programming/pdlp_test.cu

📝 Walkthrough

In run_concurrent, dual-simplex and barrier threads now capture exceptions and request concurrent halt via a shared flag. The main thread joins both threads after run_pdlp completes, cleans up the barrier handle, and rethrows any captured exceptions. A new test validates this exception-handling path in concurrent mode.
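The flow described in the walkthrough can be sketched as follows. The names `concurrent_halt` and `request_concurrent_halt` come from the review summary, but every function body here is an assumption made for illustration, `run_concurrent_sketch` and `solver_fn` are hypothetical, and the barrier handle cleanup is omitted. This is a sketch of the pattern, not the code in cpp/src/pdlp/solve.cu.

```cpp
#include <atomic>
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical solver signature: each solver polls the shared halt flag.
using solver_fn = std::function<void(const std::atomic<bool>&)>;

inline void run_concurrent_sketch(const solver_fn& pdlp,
                                  const solver_fn& dual_simplex,
                                  const solver_fn& barrier)
{
  std::atomic<bool> concurrent_halt{false};
  auto request_concurrent_halt = [&concurrent_halt] { concurrent_halt.store(true); };

  std::exception_ptr pdlp_error, dual_simplex_error, barrier_error;
  std::thread dual_simplex_thread, barrier_thread;

  try {
    // Each worker stores its own exception and asks the other solvers to stop.
    dual_simplex_thread = std::thread([&] {
      try {
        dual_simplex(concurrent_halt);
      } catch (...) {
        dual_simplex_error = std::current_exception();
        request_concurrent_halt();
      }
    });
    barrier_thread = std::thread([&] {
      try {
        barrier(concurrent_halt);
      } catch (...) {
        barrier_error = std::current_exception();
        request_concurrent_halt();
      }
    });
    pdlp(concurrent_halt);  // PDLP runs on the calling thread
  } catch (...) {
    pdlp_error = std::current_exception();
  }

  // Assumption of this sketch: once PDLP returns or fails, the workers are
  // always asked to stop so that the joins below cannot block forever.
  request_concurrent_halt();
  if (dual_simplex_thread.joinable()) { dual_simplex_thread.join(); }
  if (barrier_thread.joinable()) { barrier_thread.join(); }

  // Rethrow only after both workers are joined; join() makes the workers'
  // writes to the exception_ptr variables visible to this thread.
  if (pdlp_error) { std::rethrow_exception(pdlp_error); }
  if (dual_simplex_error) { std::rethrow_exception(dual_simplex_error); }
  if (barrier_error) { std::rethrow_exception(barrier_error); }
}
```

Storing a std::exception_ptr per solver keeps the exception off the worker's stack, where an uncaught exception would itself call std::terminate, and lets the calling thread decide which error to surface once nothing is still running.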

Changes

Concurrent Exception Handling

| Layer / File(s) | Summary |
| --- | --- |
| Worker thread exception capture and halt signaling (cpp/src/pdlp/solve.cu) | Dual-simplex and barrier threads are spawned with lambdas that capture exceptions to std::exception_ptr. A new request_concurrent_halt helper sets the shared concurrent_halt flag when any worker thread fails, signaling the other solver to stop. |
| Main thread exception handling and cleanup (cpp/src/pdlp/solve.cu) | Main thread captures exceptions from run_pdlp, joins both worker threads when joinable, destroys the barrier handle, and rethrows any captured exceptions from PDLP, dual-simplex, or barrier threads. |
| Test validation of concurrent exception handling (cpp/tests/linear_programming/pdlp_test.cu) | New test concurrent_pdlp_exception_joins_worker_threads exercises the exception path in Concurrent mode with an invalid setting, verifying that ValidationError is returned with the expected message and worker threads are properly joined. |

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Fix concurrent LP exception cleanup' directly summarizes the main change: improving exception handling in concurrent LP solvers by properly joining worker threads before rethrowing. |
| Description check | ✅ Passed | The description clearly explains the fix (joining concurrent worker threads before rethrowing exceptions), adds context about regression testing, and demonstrates the problem with before/after failure examples. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


mlubin commented May 13, 2026

/merge

@rapids-bot rapids-bot Bot merged commit c8c1a4d into NVIDIA:main May 13, 2026
206 of 209 checks passed
@mlubin mlubin deleted the concurrent-exception branch May 13, 2026 00:41
rgsl888prabhu added a commit to mlubin/cuopt that referenced this pull request May 14, 2026
PR NVIDIA#1206 (Fix concurrent LP exception cleanup) landed on main while
this branch already renamed cuopt::mps_parser -> cuopt::linear_programming::io.
The merge auto-resolved the file but the two parser references added by
NVIDIA#1206 in the new test (concurrent_pdlp_exception_joins_worker_threads)
kept the old namespace. Rename them to match the rest of the file.

Signed-off-by: Ramakrishna Prabhu <ramakrishnap@nvidia.com>