Stabilize logger config in LP BatchSolve and version logging#1286
Conversation
`init_logger_t` reconfigures the process-global `cuopt::default_logger()` sinks under `g_guard_mutex`, but `CUOPT_LOG_*` calls go through that same global logger without holding the mutex. Three risky entry points existed: - `call_batch_solve()` ran an OpenMP loop where every worker constructed its own `init_logger_t` inside `solve_lp()`. When workers finished at different times the last guard's destructor reset sinks while siblings were still logging — a likely source of rare `pure virtual method called -> std::terminate -> SIGABRT` aborts. - `solve_lp()` and `solve_qp()` called `print_version_info()` (which uses `CUOPT_LOG_*`) before constructing their `init_logger_t`, so version logging ran through whatever global sink configuration happened to be active at that moment. Fix: - Construct an outer `init_logger_t` at the top of `call_batch_solve()` before any `CUOPT_LOG_*` and before the OpenMP region. Worker-local `init_logger_t` instances now ref-count onto it and never tear down global sinks. - Reorder `solve_lp()` and `solve_qp()` so `init_logger_t` is constructed before `print_version_info()`. Verified by running `test_parser_and_batch_solver` 300x in one process and 2x300 in concurrent processes without any abort signature. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Miles Lubin <mlubin@nvidia.com>
|
/ok to test a4efec7 |
📝 WalkthroughWalkthroughThis PR reorders logger initialization in PDLP solver functions to occur before version printing, and adds batch-scoped logger initialization in parallel batch execution to preserve logger configuration across parallel worker threads during batch solve operations. ChangesLogger lifecycle and initialization
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/src/pdlp/utilities/cython_solve.cu`:
- Around line 262-265: call_batch_solve dereferences solver_settings when
constructing batch_log without a null check; add validation to return a typed
validation error if solver_settings is null. In call_batch_solve, before
creating init_logger_t batch_log, check whether solver_settings is nullptr and
handle the error path (returning or throwing the existing validation/error code
used by this module) instead of proceeding; use the same error reporting
mechanism as other boundary checks and reference
solver_settings->get_pdlp_settings() only after the null check so
init_logger_t(batch_log) is only constructed with a valid solver_settings.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1e40d6af-6dd6-4442-b482-d5d2427af244
📒 Files selected for processing (2)
cpp/src/pdlp/solve.cucpp/src/pdlp/utilities/cython_solve.cu
| // Hold the logger configuration for the whole batch so that worker-local | ||
| // init_logger_t instances inside solve_lp() reuse it. | ||
| init_logger_t batch_log(solver_settings->get_pdlp_settings().log_file, | ||
| solver_settings->get_pdlp_settings().log_to_console); |
There was a problem hiding this comment.
Add null-check for solver_settings before logger initialization.
call_batch_solve now dereferences solver_settings to build batch_log without validating the pointer first; at this boundary, null input would crash instead of returning a typed validation error.
Suggested fix
std::pair<std::vector<std::unique_ptr<solver_ret_t>>, double> call_batch_solve(
std::vector<cuopt::linear_programming::io::data_model_view_t<int, double>*> data_models,
cuopt::linear_programming::solver_settings_t<int, double>* solver_settings)
{
raft::common::nvtx::range fun_scope("Call batch solve");
+ cuopt_expects(
+ solver_settings != nullptr, error_type_t::ValidationError, "solver_settings cannot be null");
if (cuopt::linear_programming::is_remote_execution_enabled()) {
return solve_batch_remote(data_models, solver_settings);
}As per coding guidelines, "Flag missing input validation at library and server boundaries".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Hold the logger configuration for the whole batch so that worker-local | |
| // init_logger_t instances inside solve_lp() reuse it. | |
| init_logger_t batch_log(solver_settings->get_pdlp_settings().log_file, | |
| solver_settings->get_pdlp_settings().log_to_console); | |
| std::pair<std::vector<std::unique_ptr<solver_ret_t>>, double> call_batch_solve( | |
| std::vector<cuopt::linear_programming::io::data_model_view_t<int, double>*> data_models, | |
| cuopt::linear_programming::solver_settings_t<int, double>* solver_settings) | |
| { | |
| raft::common::nvtx::range fun_scope("Call batch solve"); | |
| cuopt_expects( | |
| solver_settings != nullptr, error_type_t::ValidationError, "solver_settings cannot be null"); | |
| if (cuopt::linear_programming::is_remote_execution_enabled()) { | |
| return solve_batch_remote(data_models, solver_settings); | |
| } | |
| // Hold the logger configuration for the whole batch so that worker-local | |
| // init_logger_t instances inside solve_lp() reuse it. | |
| init_logger_t batch_log(solver_settings->get_pdlp_settings().log_file, | |
| solver_settings->get_pdlp_settings().log_to_console); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/src/pdlp/utilities/cython_solve.cu` around lines 262 - 265,
call_batch_solve dereferences solver_settings when constructing batch_log
without a null check; add validation to return a typed validation error if
solver_settings is null. In call_batch_solve, before creating init_logger_t
batch_log, check whether solver_settings is nullptr and handle the error path
(returning or throwing the existing validation/error code used by this module)
instead of proceeding; use the same error reporting mechanism as other boundary
checks and reference solver_settings->get_pdlp_settings() only after the null
check so init_logger_t(batch_log) is only constructed with a valid
solver_settings.
|
FYI, @mlubin |
|
/merge |
1282e4c
into
NVIDIA:release/26.06
(Claude-driven bug fix, potentially fixing the crash in https://github.com/NVIDIA/cuopt/actions/runs/26249434727/job/77262219725?pr=1103. The issue seems rare enough that I wasn't able to reliably reproduce locally to confirm the fix.)
init_logger_treconfigures the process-globalcuopt::default_logger()sinks underg_guard_mutex, butCUOPT_LOG_*calls go through that same global logger without holding the mutex. Three risky entry points existed:call_batch_solve()ran an OpenMP loop where every worker constructed its owninit_logger_tinsidesolve_lp(). When workers finished at different times the last guard's destructor reset sinks while siblings were still logging — a likely source of rarepure virtual method called -> std::terminate -> SIGABRTaborts.solve_lp()andsolve_qp()calledprint_version_info()(which usesCUOPT_LOG_*) before constructing theirinit_logger_t, so version logging ran through whatever global sink configuration happened to be active at that moment.Fix:
init_logger_tat the top ofcall_batch_solve()before anyCUOPT_LOG_*and before the OpenMP region. Worker-localinit_logger_tinstances now ref-count onto it and never tear down global sinks.solve_lp()andsolve_qp()soinit_logger_tis constructed beforeprint_version_info().Verified by running
test_parser_and_batch_solver300x in one process and 2x300 in concurrent processes without any abort signature.The changes to
solve_lpandsolve_qpare incidental and probably not related to the original crash. They're preventing a potential similar issue ifsolve_lporsolve_qpare called from multiple threads.