Stabilize logger config in LP BatchSolve and version logging #1286

Merged
rapids-bot[bot] merged 1 commit into NVIDIA:release/26.06 from mlubin:fix-logger-race-batch-solve
May 26, 2026

Conversation

mlubin commented May 22, 2026

(Claude-driven bug fix, potentially fixing the crash in https://github.com/NVIDIA/cuopt/actions/runs/26249434727/job/77262219725?pr=1103. The issue seems rare enough that I wasn't able to reliably reproduce locally to confirm the fix.)

init_logger_t reconfigures the process-global cuopt::default_logger() sinks under g_guard_mutex, but CUOPT_LOG_* calls go through that same global logger without holding the mutex. Three risky entry points existed:

  • call_batch_solve() ran an OpenMP loop where every worker constructed its own init_logger_t inside solve_lp(). When workers finished at different times the last guard's destructor reset sinks while siblings were still logging — a likely source of rare pure virtual method called -> std::terminate -> SIGABRT aborts.
  • solve_lp() and solve_qp() called print_version_info() (which uses CUOPT_LOG_*) before constructing their init_logger_t, so version logging ran through whatever global sink configuration happened to be active at that moment.

Fix:

  • Construct an outer init_logger_t at the top of call_batch_solve() before any CUOPT_LOG_* and before the OpenMP region. Worker-local init_logger_t instances now ref-count onto it and never tear down global sinks.
  • Reorder solve_lp() and solve_qp() so init_logger_t is constructed before print_version_info().
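
The ref-counting idea behind the first fix can be illustrated with a minimal single-file sketch. This is not cuOpt's actual `init_logger_t`: the names `sink_state_t`, `logger_guard_t`, `fake_solve`, and `teardowns_during_batch` are hypothetical, and the workers run one after another as a deterministic stand-in for OpenMP workers finishing at different times.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the process-global logger sink state.
struct sink_state_t {
  std::mutex mutex;     // plays the role of g_guard_mutex
  int live_guards = 0;  // guards currently alive
  int teardowns   = 0;  // number of times the sinks were reset
};

inline sink_state_t& sink_state()
{
  static sink_state_t state;
  return state;
}

// Hypothetical ref-counted RAII guard (an assumption, not the real init_logger_t).
class logger_guard_t {
 public:
  logger_guard_t()
  {
    auto& s = sink_state();
    std::lock_guard<std::mutex> lock(s.mutex);
    ++s.live_guards;  // the first guard would configure the sinks here
  }
  ~logger_guard_t()
  {
    auto& s = sink_state();
    std::lock_guard<std::mutex> lock(s.mutex);
    assert(s.live_guards > 0);
    if (--s.live_guards == 0) { ++s.teardowns; }  // only the last guard resets sinks
  }
  logger_guard_t(const logger_guard_t&)            = delete;
  logger_guard_t& operator=(const logger_guard_t&) = delete;
};

// Stand-in for solve_lp(): every call owns a worker-local guard.
inline void fake_solve() { logger_guard_t worker_guard; }

inline int current_teardowns()
{
  auto& s = sink_state();
  std::lock_guard<std::mutex> lock(s.mutex);
  return s.teardowns;
}

// Counts sink teardowns that happen while the batch loop is still running.
inline int teardowns_during_batch(bool hold_outer_guard, int num_workers)
{
  const int before = current_teardowns();
  if (hold_outer_guard) {
    logger_guard_t batch_guard;  // lives for the whole batch
    for (int i = 0; i < num_workers; ++i) { fake_solve(); }
    // The return value is computed before batch_guard is destroyed.
    return current_teardowns() - before;
  }
  for (int i = 0; i < num_workers; ++i) { fake_solve(); }
  return current_teardowns() - before;
}
```

Under these assumptions, `teardowns_during_batch(false, 4)` counts four sink resets (one per worker), while `teardowns_during_batch(true, 4)` counts none until the batch guard goes out of scope. In a real parallel region, any of those mid-batch resets could overlap with a sibling worker that is still logging.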

Verified by running test_parser_and_batch_solver 300x in one process and 2x300 in concurrent processes without any abort signature.

The changes to solve_lp and solve_qp are incidental and probably unrelated to the original crash. They prevent a similar potential issue when solve_lp or solve_qp is called from multiple threads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Miles Lubin <mlubin@nvidia.com>
@mlubin mlubin requested a review from a team as a code owner May 22, 2026 16:08
@mlubin mlubin requested review from aliceb-nv and rg20 May 22, 2026 16:08
copy-pr-bot Bot commented May 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mlubin mlubin added the bug and non-breaking labels May 22, 2026
@mlubin mlubin requested a review from Kh4ster May 22, 2026 16:08
mlubin commented May 22, 2026

/ok to test a4efec7

coderabbitai Bot commented May 22, 2026

📝 Walkthrough

This PR reorders logger initialization in PDLP solver functions to occur before version printing, and adds batch-scoped logger initialization in parallel batch execution to preserve logger configuration across parallel worker threads during batch solve operations.

Changes

Logger lifecycle and initialization

| Layer / File(s) | Summary |
| --- | --- |
| Logger initialization ordering in solver functions (cpp/src/pdlp/solve.cu) | In solve_qp and solve_lp, the logger stream is now initialized before the conditional print_version_info() call, reversing the initialization order so logging is ready before version information is printed. |
| Batch-scoped logger initialization (cpp/src/pdlp/utilities/cython_solve.cu) | Adds utilities/logger.hpp include and creates a batch-lifetime logger in call_batch_solve from solver settings, ensuring logger configuration is preserved across the entire parallel batch operation for worker-local logging instances to reuse. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested reviewers

  • tmckayus
  • hlinsen
  • Bubullzz
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately captures the main changes: stabilizing logger configuration in batch solve operations and reordering version logging initialization. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description clearly explains a race condition bug in logger initialization, identifies the problematic entry points, describes the fixes implemented, and provides verification details. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/pdlp/utilities/cython_solve.cu`:
- Around line 262-265: call_batch_solve dereferences solver_settings when
constructing batch_log without a null check; add validation to return a typed
validation error if solver_settings is null. In call_batch_solve, before
creating init_logger_t batch_log, check whether solver_settings is nullptr and
handle the error path (returning or throwing the existing validation/error code
used by this module) instead of proceeding; use the same error reporting
mechanism as other boundary checks and reference
solver_settings->get_pdlp_settings() only after the null check so
init_logger_t(batch_log) is only constructed with a valid solver_settings.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1e40d6af-6dd6-4442-b482-d5d2427af244

📥 Commits

Reviewing files that changed from the base of the PR and between 28eacd9 and a4efec7.

📒 Files selected for processing (2)
  • cpp/src/pdlp/solve.cu
  • cpp/src/pdlp/utilities/cython_solve.cu

Comment on lines +262 to +265
// Hold the logger configuration for the whole batch so that worker-local
// init_logger_t instances inside solve_lp() reuse it.
init_logger_t batch_log(solver_settings->get_pdlp_settings().log_file,
                        solver_settings->get_pdlp_settings().log_to_console);

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add null-check for solver_settings before logger initialization.

call_batch_solve now dereferences solver_settings to build batch_log without validating the pointer first; at this boundary, null input would crash instead of returning a typed validation error.

Suggested fix
 std::pair<std::vector<std::unique_ptr<solver_ret_t>>, double> call_batch_solve(
   std::vector<cuopt::linear_programming::io::data_model_view_t<int, double>*> data_models,
   cuopt::linear_programming::solver_settings_t<int, double>* solver_settings)
 {
   raft::common::nvtx::range fun_scope("Call batch solve");
+  cuopt_expects(
+    solver_settings != nullptr, error_type_t::ValidationError, "solver_settings cannot be null");

   if (cuopt::linear_programming::is_remote_execution_enabled()) {
     return solve_batch_remote(data_models, solver_settings);
   }

As per coding guidelines, "Flag missing input validation at library and server boundaries".

Suggested change
-  // Hold the logger configuration for the whole batch so that worker-local
-  // init_logger_t instances inside solve_lp() reuse it.
-  init_logger_t batch_log(solver_settings->get_pdlp_settings().log_file,
-                          solver_settings->get_pdlp_settings().log_to_console);
+std::pair<std::vector<std::unique_ptr<solver_ret_t>>, double> call_batch_solve(
+  std::vector<cuopt::linear_programming::io::data_model_view_t<int, double>*> data_models,
+  cuopt::linear_programming::solver_settings_t<int, double>* solver_settings)
+{
+  raft::common::nvtx::range fun_scope("Call batch solve");
+  cuopt_expects(
+    solver_settings != nullptr, error_type_t::ValidationError, "solver_settings cannot be null");
+  if (cuopt::linear_programming::is_remote_execution_enabled()) {
+    return solve_batch_remote(data_models, solver_settings);
+  }
+  // Hold the logger configuration for the whole batch so that worker-local
+  // init_logger_t instances inside solve_lp() reuse it.
+  init_logger_t batch_log(solver_settings->get_pdlp_settings().log_file,
+                          solver_settings->get_pdlp_settings().log_to_console);

rg20 commented May 26, 2026

FYI, @mlubin call_batch_solve is being deprecated in its current form.

rg20 commented May 26, 2026

/merge

@rapids-bot rapids-bot Bot merged commit 1282e4c into NVIDIA:release/26.06 May 26, 2026
192 of 194 checks passed