CuD-PDLP#1391
Conversation
…he cycle seems to be fixed, cuopt compiles
+ style too
compiles and runs
…rather than total size hehehehe
…e lose a lot of time on actal partitionning and data movements. Everything seems to be working
…nt transit on master device !
📝 WalkthroughWalkthroughThis PR adds comprehensive distributed multi-GPU support to the PDLP linear programming solver, enabling it to handle large-scale problems that exceed single-GPU memory. The implementation partitions constraint-variable bipartite graphs (via METIS or KaMinPar), distributes subproblems to per-GPU solver shards coordinated by NCCL, and reimplements key linear algebra operations (SpMV, scaling, convergence checking) to execute in a distributed manner. ChangesDistributed Multi-GPU PDLP
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Suggested labels
Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 11
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/src/pdlp/solve.cu (1)
769-784:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReject distributed
problem_tcalls before any early return.The new guard sits below the zero-constraint return and the FP32 fallback. With
use_distributed_pdlp=trueplusSinglePrecision, this path returnsrun_pdlp_solver_in_fp32(...)instead of raising the intended validation error, so an unsupported distributed configuration silently runs the single-GPU solver.Suggested fix
static optimization_problem_solution_t<i_t, f_t> run_pdlp_solver( detail::problem_t<i_t, f_t>& problem, pdlp_solver_settings_t<i_t, f_t> const& settings, const timer_t& timer, bool is_batch_mode) { + cuopt_expects(!settings.hyper_params.use_distributed_pdlp, + error_type_t::ValidationError, + "Distributed PDLP must be entered via solve_lp(mps_data_model, ...) " + "so the master GPU never materializes the full problem. Call sites " + "with a problem_t cannot dispatch to distributed mode."); + detail::pdlp_graph_disabled_flag().store(settings.hyper_params.pdlp_disable_graph, std::memory_order_relaxed); if (problem.n_constraints == 0) { ... } `#if` PDLP_INSTANTIATE_FLOAT || CUOPT_INSTANTIATE_FLOAT if constexpr (std::is_same_v<f_t, double>) { if (settings.pdlp_precision == pdlp_precision_t::SinglePrecision) { return run_pdlp_solver_in_fp32(problem, settings, timer, is_batch_mode); } } `#endif` - cuopt_expects(!settings.hyper_params.use_distributed_pdlp, ...);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/pdlp/solve.cu` around lines 769 - 784, The distributed-mode validation (cuopt_expects(!settings.hyper_params.use_distributed_pdlp, ...)) must be performed before any early returns so a distributed call cannot accidentally take the FP32 fallback or zero-constraint path; move or duplicate that check to occur before the SinglePrecision/FP32 branch and before the zero-constraint return so that when settings.hyper_params.use_distributed_pdlp is true (for problem_t inputs) the function immediately raises the ValidationError rather than calling run_pdlp_solver_in_fp32 or returning early. Ensure the check references the same validation message and error_type_t::ValidationError used currently.cpp/src/pdlp/pdlp.cu (1)
3063-3079:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winThe distributed average path is still unsafe in release builds.
When
multi_gpu_engineis present andnever_restart_to_averageis false, Line 3071 uses plainassert(false). In release builds that disappears, and the subsequentraft::copywritesprimal_size_h_/dual_size_h_elements intounscaled_*_avg_solution_, which were never resized for the distributed ctor. That turns this TODO into an invalid device-copy / wrong-result path instead of a clean runtime rejection.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/pdlp/pdlp.cu` around lines 3063 - 3079, The path that handles multi-GPU (multi_gpu_engine) uses assert(false) which vanishes in release builds and leads to invalid device copies into unscaled_primal_avg_solution_ / unscaled_dual_avg_solution_; fix by replacing the assert with a deterministic runtime guard: either resize/allocate unscaled_primal_avg_solution_ and unscaled_dual_avg_solution_ to primal_size_h_ and dual_size_h_ (and synchronize/validate device pointers) before calling raft::copy from pdhg_solver_.get_primal_solution() / get_dual_solution(), or explicitly fail early by logging and throwing a runtime_error when multi_gpu_engine is true so the copy is never attempted; update the branch around internal_solver_iterations_ <= 1 where multi_gpu_engine is checked to implement one of these safe behaviors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/cuopt_cli.cpp`:
- Around line 180-184: When lp_settings.hyper_params.use_distributed_pdlp is
true, guard the distributed PDLP call by checking that handle_ptr is non-null
before invoking cuopt::linear_programming::solve_lp(handle_ptr.get(), ...); if
handle_ptr is null, fail fast with a clear error (e.g., log and exit or throw)
rather than calling the distributed overload; update the branch that currently
chooses between solve_lp(handle_ptr.get(), mps_data_model, lp_settings) and
solve_lp(problem_interface.get(), lp_settings) to validate handle_ptr first and
only call the distributed overload when handle_ptr is valid.
- Around line 439-447: The code currently computes requested_gpus and then uses
std::min(...) to compute provisioned_gpus and
memory_resources.reserve(provisioned_gpus) without validating requested_gpus;
add explicit validation after computing requested_gpus (and after remapping -1
when use_distributed_pdlp is true) to ensure requested_gpus > 0 and that
raft::device_setter::get_device_count() > 0 before calling std::min or reserve.
If either value is non-positive, return/log an error or throw an exception
(consistent with surrounding error handling) referencing the parameters obtained
via settings.get_parameter<int>(CUOPT_NUM_GPUS) and
settings.get_parameter<int>(CUOPT_DISTRIBUTED_PDLP_NUM_GPUS) so the code never
calls memory_resources.reserve with a non-positive size.
In `@cpp/src/pdlp/cusparse_view.cu`:
- Around line 501-511: The mixed-precision branch still sizes and recreates FP32
matrices using op_problem_scaled.nnz which can differ per shard; update that
block to use the shard-local nnz values (e.g. static_cast<int64_t>(A_.size())
and static_cast<int64_t>(A_T_.size())) when allocating/sizing A_mixed_ and
A_T_mixed_ and when copying/transposing data for A_T.create / A.create so you
don't overrun A_T_ or leave stale nnz metadata; ensure any metadata fields set
during the FP32 recreate follow the shard-local sizes and that all
transforms/read ranges use those local sizes (A_, A_T_, A_mixed_, A_T_mixed_).
In `@cpp/src/pdlp/distributed_pdlp/partition_loader.cu`:
- Around line 77-87: Validate partition and CSR metadata before any
slicing/indexing: check that parts.size() >= nb_cstr + nb_vars before creating
cstr_parts/var_parts, ensure all entries in parts are within [0, nb_parts)
before using them to index rank_data_t<i_t,f_t>, and verify CSR arrays
(offsets/indices) have expected lengths (e.g., offsets.size() >= rows+1 and
indices.size() == nnz) before dereferencing in functions that build/iterate the
CSR (referencing variables parts, nb_cstr, nb_vars, rank_data_t, and the CSR
offset/index containers); use cuopt_expects (or the existing error path) to fail
early with clear messages when any check fails.
In `@cpp/src/pdlp/pdlp.cu`:
- Around line 821-825: The distributed gather of the current iterate is missing
on several return paths so master buffers can be stale; call the multi-GPU
gather before any return that serializes the current solution. Specifically,
ensure pdhg_solver_.get_mgpu_engine() and its method
gather_potential_next_solutions_to_master(pdhg_solver_,
current_termination_strategy_.get_convergence_information().get_reduced_cost())
is invoked centrally before any code that calls
fill_return_problem_solution(...), and add the same centralized gather call on
the other identified return sites (including the ConcurrentLimit and
PrimalFeasible/infeasibility exits referenced around lines ~859-863 and
~1541-1545) so the master full-size solution/reduced-cost buffers are populated
on every distributed return path.
- Around line 387-393: The distributed constructor pdlp_solver_t( problem_t<i_t,
f_t>& placeholder_problem, ... ) currently delegates to the regular ctor before
shard sizes exist, causing functions that use primal_size_h_/dual_size_h_ (e.g.,
set_initial_primal_solution, handling of initial_primal_solution and
initial_dual_solution and warm-start data) to operate on zero-length buffers;
update this constructor to either (a) validate and reject any initial-state
options (initial_primal_solution, initial_dual_solution, warm-start) up front
and return an error, or (b) defer all logic that applies initial iterates (calls
to set_initial_primal_solution / set_initial_dual_solution and warm-start
handling) until after shard construction when primal_size_h_ and dual_size_h_
are set, ensuring no modulo/divide-by-zero or zero-length copies occur.
In `@cpp/src/pdlp/solve.cu`:
- Around line 759-760: The global flag detail::pdlp_graph_disabled_flag() is
being mutated per-solve causing races; instead make the graph-disable decision
local to each solver instance and avoid writing the process-global flag from
solve entrypoints. Change callers that currently
store(settings.hyper_params.pdlp_disable_graph, ...) to pass the
pdlp_disable_graph boolean into the solver instance (or ctor) and have
ping_pong_graph_t::run() and related graph code read that instance-level flag
rather than detail::pdlp_graph_disabled_flag(); remove writes to the global flag
in solve functions so concurrent solves do not flip each other’s mode.
- Around line 2129-2134: The current overload erroneously hard-fails via
cuopt_expects when settings.hyper_params.use_distributed_pdlp is false and
always forwards to solve_lp_distributed_from_mps, removing the original
single-GPU/direct-MPS path; restore the prior behavior by replacing the
hard-fail with a branch: if settings.hyper_params.use_distributed_pdlp is true
call solve_lp_distributed_from_mps(handle_ptr, mps_data_model, settings,
problem_checking, use_pdlp_solver_mode) else call the non-distributed/MPS
entrypoint (the original direct-MPS function used previously—e.g.,
solve_lp_from_mps or the equivalent direct-MPS routine) so both paths are
supported, and keep or adjust cuopt_expects to validate only unsupported
parameter combinations if needed.
- Around line 2157-2205: solve_lp_distributed_from_mps builds
detail::pdlp_solver_t using settings_resolved but never applies settings.method
or calls set_pdlp_solver_mode, so requested PDLP modes/presets are ignored; fix
by checking settings_resolved.use_pdlp_solver_mode (and/or
settings_resolved.method) before constructing the solver and call
set_pdlp_solver_mode(settings_resolved) to map the preset/method into the solver
settings (or apply the mapping to settings_resolved) so the subsequent
detail::pdlp_solver_t(placeholder_problem, mps_data_model, settings_resolved) is
constructed with the intended PDLP mode.
In `@cpp/tests/linear_programming/pdlp_test.cu`:
- Around line 188-191: The test currently sets distributed_pdlp_num_gpus = -1
which lets a single-GPU run bypass the multi-GPU/NCCL path; change the test to
first query the available GPU count and if fewer than 2 GPUs are present skip
the test, otherwise set pdlp_solver_settings_t::distributed_pdlp_num_gpus to at
least 2 (e.g., max(2, available_gpus)) before calling solve_lp(&handle, problem,
dist_settings) so the distributed PDLP path is actually exercised (use
pdlp_solver_settings_t, dist_settings, distributed_pdlp_num_gpus and solve_lp as
the loci to modify).
- Around line 248-252: The test pdlp_class::distributed_parity_square41 is
loading the wrong dataset; change the argument to
expect_distributed_matches_base in that test so it points to
"linear_programming/square41/square41.mps" instead of
"linear_programming/neos3/neos3.mps" so the regression covers the intended
square41 case (update the call site in the distributed_parity_square41 test that
invokes expect_distributed_matches_base).
---
Outside diff comments:
In `@cpp/src/pdlp/pdlp.cu`:
- Around line 3063-3079: The path that handles multi-GPU (multi_gpu_engine) uses
assert(false) which vanishes in release builds and leads to invalid device
copies into unscaled_primal_avg_solution_ / unscaled_dual_avg_solution_; fix by
replacing the assert with a deterministic runtime guard: either resize/allocate
unscaled_primal_avg_solution_ and unscaled_dual_avg_solution_ to primal_size_h_
and dual_size_h_ (and synchronize/validate device pointers) before calling
raft::copy from pdhg_solver_.get_primal_solution() / get_dual_solution(), or
explicitly fail early by logging and throwing a runtime_error when
multi_gpu_engine is true so the copy is never attempted; update the branch
around internal_solver_iterations_ <= 1 where multi_gpu_engine is checked to
implement one of these safe behaviors.
In `@cpp/src/pdlp/solve.cu`:
- Around line 769-784: The distributed-mode validation
(cuopt_expects(!settings.hyper_params.use_distributed_pdlp, ...)) must be
performed before any early returns so a distributed call cannot accidentally
take the FP32 fallback or zero-constraint path; move or duplicate that check to
occur before the SinglePrecision/FP32 branch and before the zero-constraint
return so that when settings.hyper_params.use_distributed_pdlp is true (for
problem_t inputs) the function immediately raises the ValidationError rather
than calling run_pdlp_solver_in_fp32 or returning early. Ensure the check
references the same validation message and error_type_t::ValidationError used
currently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7df2a4b9-585b-4517-afcb-1aa089ecb1c1
📒 Files selected for processing (41)
cpp/CMakeLists.txtcpp/cmake/thirdparty/get_kaminpar.cmakecpp/cuopt_cli.cppcpp/include/cuopt/linear_programming/constants.hcpp/include/cuopt/linear_programming/pdlp/pdlp_hyper_params.cuhcpp/include/cuopt/linear_programming/pdlp/solver_settings.hppcpp/src/math_optimization/solver_settings.cucpp/src/pdlp/CMakeLists.txtcpp/src/pdlp/cusparse_view.cucpp/src/pdlp/distributed_pdlp/kaminpar_partitioner.cppcpp/src/pdlp/distributed_pdlp/kaminpar_partitioner.hppcpp/src/pdlp/distributed_pdlp/metis_partitioner.cucpp/src/pdlp/distributed_pdlp/metis_partitioner.hppcpp/src/pdlp/distributed_pdlp/multi_gpu_engine.cucpp/src/pdlp/distributed_pdlp/multi_gpu_engine.hppcpp/src/pdlp/distributed_pdlp/partition_loader.cucpp/src/pdlp/distributed_pdlp/partition_loader.hppcpp/src/pdlp/distributed_pdlp/partitioner.cucpp/src/pdlp/distributed_pdlp/partitioner.hppcpp/src/pdlp/distributed_pdlp/rank_data.hppcpp/src/pdlp/distributed_pdlp/shard.cucpp/src/pdlp/distributed_pdlp/shard.hppcpp/src/pdlp/initial_scaling_strategy/initial_scaling.cucpp/src/pdlp/initial_scaling_strategy/initial_scaling.cuhcpp/src/pdlp/pdhg.cucpp/src/pdlp/pdhg.hppcpp/src/pdlp/pdlp.cucpp/src/pdlp/pdlp.cuhcpp/src/pdlp/restart_strategy/pdlp_restart_strategy.cucpp/src/pdlp/saddle_point.cucpp/src/pdlp/solve.cucpp/src/pdlp/solve.cuhcpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cucpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.hppcpp/src/pdlp/termination_strategy/convergence_information.cucpp/src/pdlp/termination_strategy/convergence_information.hppcpp/src/pdlp/termination_strategy/termination_strategy.cucpp/src/pdlp/termination_strategy/termination_strategy.hppcpp/src/pdlp/utilities/mgpu_trace.cuhcpp/src/pdlp/utilities/ping_pong_graph.cuhcpp/tests/linear_programming/pdlp_test.cu
| if (lp_settings.hyper_params.use_distributed_pdlp) { | ||
| cuopt::linear_programming::solve_lp(handle_ptr.get(), mps_data_model, lp_settings); | ||
| } else { | ||
| cuopt::linear_programming::solve_lp(problem_interface.get(), lp_settings); | ||
| } |
There was a problem hiding this comment.
Guard distributed solve path against null GPU handle.
When distributed PDLP is enabled, this branch can pass a null handle_ptr if the selected memory backend is not GPU. Fail fast before calling the distributed overload.
Suggested fix
- if (lp_settings.hyper_params.use_distributed_pdlp) {
+ if (lp_settings.hyper_params.use_distributed_pdlp) {
+ if (handle_ptr == nullptr) {
+ throw std::runtime_error("Distributed PDLP requires GPU memory backend.");
+ }
cuopt::linear_programming::solve_lp(handle_ptr.get(), mps_data_model, lp_settings);
} else {
cuopt::linear_programming::solve_lp(problem_interface.get(), lp_settings);
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/cuopt_cli.cpp` around lines 180 - 184, When
lp_settings.hyper_params.use_distributed_pdlp is true, guard the distributed
PDLP call by checking that handle_ptr is non-null before invoking
cuopt::linear_programming::solve_lp(handle_ptr.get(), ...); if handle_ptr is
null, fail fast with a clear error (e.g., log and exit or throw) rather than
calling the distributed overload; update the branch that currently chooses
between solve_lp(handle_ptr.get(), mps_data_model, lp_settings) and
solve_lp(problem_interface.get(), lp_settings) to validate handle_ptr first and
only call the distributed overload when handle_ptr is valid.
| int requested_gpus = use_distributed_pdlp | ||
| ? settings.get_parameter<int>(CUOPT_DISTRIBUTED_PDLP_NUM_GPUS) | ||
| : settings.get_parameter<int>(CUOPT_NUM_GPUS); | ||
| if (use_distributed_pdlp && requested_gpus == -1) { | ||
| requested_gpus = raft::device_setter::get_device_count(); | ||
| } | ||
| const int provisioned_gpus = std::min(raft::device_setter::get_device_count(), requested_gpus); | ||
|
|
||
| memory_resources.reserve(std::min(raft::device_setter::get_device_count(), num_gpus)); | ||
| for (int i = 0; i < std::min(raft::device_setter::get_device_count(), num_gpus); ++i) { | ||
| memory_resources.reserve(provisioned_gpus); |
There was a problem hiding this comment.
Validate GPU count before provisioning memory resources.
requested_gpus is used to size provisioning without enforcing a positive value (other than distributed -1 remap). Add explicit validation before std::min(...)/reserve(...) to avoid invalid allocation/setup paths.
Suggested fix
- if (use_distributed_pdlp && requested_gpus == -1) {
- requested_gpus = raft::device_setter::get_device_count();
- }
- const int provisioned_gpus = std::min(raft::device_setter::get_device_count(), requested_gpus);
+ const int visible_gpus = raft::device_setter::get_device_count();
+ if (use_distributed_pdlp && requested_gpus == -1) {
+ requested_gpus = visible_gpus;
+ }
+ if (requested_gpus <= 0) {
+ std::cerr << "Invalid GPU count: " << requested_gpus
+ << " (must be > 0, or -1 only with distributed PDLP)." << std::endl;
+ return 1;
+ }
+ const int provisioned_gpus = std::min(visible_gpus, requested_gpus);
+ if (provisioned_gpus <= 0) {
+ std::cerr << "No visible GPUs available for GPU backend." << std::endl;
+ return 1;
+ }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/cuopt_cli.cpp` around lines 439 - 447, The code currently computes
requested_gpus and then uses std::min(...) to compute provisioned_gpus and
memory_resources.reserve(provisioned_gpus) without validating requested_gpus;
add explicit validation after computing requested_gpus (and after remapping -1
when use_distributed_pdlp is true) to ensure requested_gpus > 0 and that
raft::device_setter::get_device_count() > 0 before calling std::min or reserve.
If either value is non-positive, return/log an error or throw an exception
(consistent with surrounding error handling) referencing the parameters obtained
via settings.get_parameter<int>(CUOPT_NUM_GPUS) and
settings.get_parameter<int>(CUOPT_DISTRIBUTED_PDLP_NUM_GPUS) so the code never
calls memory_resources.reserve with a non-positive size.
| static_cast<int64_t>(A_.size()), | ||
| const_cast<i_t*>(op_problem_scaled.offsets.data()), | ||
| const_cast<i_t*>(op_problem_scaled.variables.data()), | ||
| const_cast<f_t*>(op_problem_scaled.coefficients.data())); | ||
|
|
||
| // A_T can have a different nnz than A in multi-GPU shards | ||
| // A is just what is needed to compute A_x for owned constraints | ||
| // A_T is just what is needed to compute A_T_y for owned variables | ||
| A_T.create(op_problem_scaled.n_variables, | ||
| op_problem_scaled.n_constraints, | ||
| op_problem_scaled.nnz, | ||
| static_cast<int64_t>(A_T_.size()), |
There was a problem hiding this comment.
Use shard-local nnz in the mixed-precision setup too.
Now that this ctor allows A and A_T to have different nnz on a shard, the later mixed-precision block still sizes, transforms, and recreates the FP32 matrices with op_problem_scaled.nnz (Lines 777-804). On any distributed shard where those lengths diverge, that path can overrun A_T_ or build A_mixed_ / A_T_mixed_ with stale nnz metadata.
Suggested follow-up
- A_float_.resize(op_problem_scaled.nnz, handle_ptr->get_stream());
- A_T_float_.resize(op_problem_scaled.nnz, handle_ptr->get_stream());
+ A_float_.resize(A_.size(), handle_ptr->get_stream());
+ A_T_float_.resize(A_T_.size(), handle_ptr->get_stream());
- RAFT_CUDA_TRY(cub::DeviceTransform::Transform(op_problem_scaled.coefficients.data(),
+ RAFT_CUDA_TRY(cub::DeviceTransform::Transform(A_.data(),
A_float_.data(),
- op_problem_scaled.nnz,
+ A_.size(),
double_to_float_functor{},
handle_ptr->get_stream().value()));
RAFT_CUDA_TRY(cub::DeviceTransform::Transform(A_T_.data(),
A_T_float_.data(),
- op_problem_scaled.nnz,
+ A_T_.size(),
double_to_float_functor{},
handle_ptr->get_stream().value()));
A_mixed_.create(op_problem_scaled.n_constraints,
op_problem_scaled.n_variables,
- op_problem_scaled.nnz,
+ static_cast<int64_t>(A_.size()),
const_cast<i_t*>(op_problem_scaled.offsets.data()),
const_cast<i_t*>(op_problem_scaled.variables.data()),
A_float_.data());
A_T_mixed_.create(op_problem_scaled.n_variables,
op_problem_scaled.n_constraints,
- op_problem_scaled.nnz,
+ static_cast<int64_t>(A_T_.size()),
const_cast<i_t*>(A_T_offsets_.data()),
const_cast<i_t*>(A_T_indices_.data()),
A_T_float_.data());As per coding guidelines, "Prevent invalid memory access (out-of-bounds, use-after-free, host/device confusion) in GPU code."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/src/pdlp/cusparse_view.cu` around lines 501 - 511, The mixed-precision
branch still sizes and recreates FP32 matrices using op_problem_scaled.nnz which
can differ per shard; update that block to use the shard-local nnz values (e.g.
static_cast<int64_t>(A_.size()) and static_cast<int64_t>(A_T_.size())) when
allocating/sizing A_mixed_ and A_T_mixed_ and when copying/transposing data for
A_T.create / A.create so you don't overrun A_T_ or leave stale nnz metadata;
ensure any metadata fields set during the FP32 recreate follow the shard-local
sizes and that all transforms/read ranges use those local sizes (A_, A_T_,
A_mixed_, A_T_mixed_).
| cuopt_expects(A_values.size() == A_values_scaled.size(), | ||
| error_type_t::ValidationError, | ||
| "A_values and A_values_scaled must have the same length"); | ||
| cuopt_expects(A_t_values.size() == A_t_values_scaled.size(), | ||
| error_type_t::ValidationError, | ||
| "A_t_values and A_t_values_scaled must have the same length"); | ||
|
|
||
| std::vector<rank_data_t<i_t, f_t>> rank_data(nb_parts, rank_data_t<i_t, f_t>(nb_parts)); | ||
| std::vector<i_t> cstr_parts(parts.begin(), parts.begin() + nb_cstr); | ||
| std::vector<i_t> var_parts(parts.begin() + nb_cstr, parts.begin() + nb_cstr + nb_vars); | ||
|
|
There was a problem hiding this comment.
Validate partition metadata and CSR dimensions before indexing.
Line 85/Line 86 slice parts without checking size, Line 90 indexes rank_data using unvalidated part IDs, and Line 115/Line 157 dereference CSR offsets/indices without shape checks. A malformed partition file or inconsistent matrix metadata can cause out-of-bounds access/UB.
Proposed guard block (at function entry)
std::vector<rank_data_t<i_t, f_t>> partition_loader_t<i_t, f_t>::create_rank_data_from_parts(
@@
{
+ cuopt_expects(nb_parts > 0, error_type_t::ValidationError, "nb_parts must be > 0");
+ cuopt_expects(nb_cstr >= 0 && nb_vars >= 0 && nnz >= 0,
+ error_type_t::ValidationError,
+ "nb_cstr/nb_vars/nnz must be non-negative");
+
+ const auto expected_parts =
+ static_cast<size_t>(nb_cstr) + static_cast<size_t>(nb_vars);
+ cuopt_expects(parts.size() == expected_parts,
+ error_type_t::ValidationError,
+ "parts size mismatch: expected nb_cstr + nb_vars");
+
+ for (auto p : parts) {
+ cuopt_expects(p >= 0 && p < nb_parts,
+ error_type_t::ValidationError,
+ "partition id out of range [0, nb_parts)");
+ }
+
+ cuopt_expects(A_row_offsets.size() == static_cast<size_t>(nb_cstr) + 1,
+ error_type_t::ValidationError,
+ "A_row_offsets size must be nb_cstr + 1");
+ cuopt_expects(A_t_row_offsets.size() == static_cast<size_t>(nb_vars) + 1,
+ error_type_t::ValidationError,
+ "A_t_row_offsets size must be nb_vars + 1");
+
+ cuopt_expects(A_col_indices.size() == static_cast<size_t>(nnz) &&
+ A_values.size() == static_cast<size_t>(nnz) &&
+ A_values_scaled.size() == static_cast<size_t>(nnz),
+ error_type_t::ValidationError,
+ "A buffers must match nnz");
+ cuopt_expects(A_t_col_indices.size() == static_cast<size_t>(nnz) &&
+ A_t_values.size() == static_cast<size_t>(nnz) &&
+ A_t_values_scaled.size() == static_cast<size_t>(nnz),
+ error_type_t::ValidationError,
+ "A_t buffers must match nnz");
}As per coding guidelines: “Validate input at library and server boundaries.”
Also applies to: 89-94, 115-121, 156-163
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/src/pdlp/distributed_pdlp/partition_loader.cu` around lines 77 - 87,
Validate partition and CSR metadata before any slicing/indexing: check that
parts.size() >= nb_cstr + nb_vars before creating cstr_parts/var_parts, ensure
all entries in parts are within [0, nb_parts) before using them to index
rank_data_t<i_t,f_t>, and verify CSR arrays (offsets/indices) have expected
lengths (e.g., offsets.size() >= rows+1 and indices.size() == nnz) before
dereferencing in functions that build/iterate the CSR (referencing variables
parts, nb_cstr, nb_vars, rank_data_t, and the CSR offset/index containers); use
cuopt_expects (or the existing error path) to fail early with clear messages
when any check fails.
| pdlp_solver_t<i_t, f_t>::pdlp_solver_t( | ||
| problem_t<i_t, f_t>& placeholder_problem, | ||
| cuopt::linear_programming::io::mps_data_model_t<i_t, f_t> const& mps, | ||
| pdlp_solver_settings_t<i_t, f_t> const& settings) | ||
| // Makes all inner feilds of master 0 size | ||
| : pdlp_solver_t(placeholder_problem, settings, /*is_legacy_batch_mode=*/false) | ||
| { |
There was a problem hiding this comment.
Reject initial-state options before delegating to the shape-0 base ctor.
Line 392 delegates into the regular constructor before the distributed path knows the real dimensions. That constructor eagerly handles initial_primal_solution, initial_dual_solution, and warm-start data using primal_size_h_/dual_size_h_ from the placeholder, and set_initial_primal_solution() does size() % primal_size_h_. Any distributed solve with initial iterates can therefore divide by zero or copy into zero-length buffers before the shards exist. Please either reject those options up front for distributed PDLP or defer applying them until after shard construction.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/src/pdlp/pdlp.cu` around lines 387 - 393, The distributed constructor
pdlp_solver_t( problem_t<i_t, f_t>& placeholder_problem, ... ) currently
delegates to the regular ctor before shard sizes exist, causing functions that
use primal_size_h_/dual_size_h_ (e.g., set_initial_primal_solution, handling of
initial_primal_solution and initial_dual_solution and warm-start data) to
operate on zero-length buffers; update this constructor to either (a) validate
and reject any initial-state options (initial_primal_solution,
initial_dual_solution, warm-start) up front and return an error, or (b) defer
all logic that applies initial iterates (calls to set_initial_primal_solution /
set_initial_dual_solution and warm-start handling) until after shard
construction when primal_size_h_ and dual_size_h_ are set, ensuring no
modulo/divide-by-zero or zero-length copies occur.
| detail::pdlp_graph_disabled_flag().store(settings.hyper_params.pdlp_disable_graph, | ||
| std::memory_order_relaxed); |
There was a problem hiding this comment.
Scope the graph-disable switch per solve instead of mutating the process-global flag.
Both entrypoints write directly into detail::pdlp_graph_disabled_flag(). ping_pong_graph_t::run() reads that same static flag for every solver instance, so two solves running with different pdlp_disable_graph values can flip each other's execution mode mid-run.
Also applies to: 2159-2160
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/src/pdlp/solve.cu` around lines 759 - 760, The global flag
detail::pdlp_graph_disabled_flag() is being mutated per-solve causing races;
instead make the graph-disable decision local to each solver instance and avoid
writing the process-global flag from solve entrypoints. Change callers that
currently store(settings.hyper_params.pdlp_disable_graph, ...) to pass the
pdlp_disable_graph boolean into the solver instance (or ctor) and have
ping_pong_graph_t::run() and related graph code read that instance-level flag
rather than detail::pdlp_graph_disabled_flag(); remove writes to the global flag
in solve functions so concurrent solves do not flip each other’s mode.
| cuopt_expects( | ||
| settings.hyper_params.use_distributed_pdlp, | ||
| error_type_t::ValidationError, | ||
| "solve_lp from mps_data_model: settings.hyper_params.use_distributed_pdlp must be true"); | ||
| return solve_lp_distributed_from_mps( | ||
| handle_ptr, mps_data_model, settings, problem_checking, use_pdlp_solver_mode); |
There was a problem hiding this comment.
Preserve the existing non-distributed direct-MPS entrypoint.
This unchanged overload now hard-fails unless use_distributed_pdlp is enabled, which removes the previous single-GPU/direct-MPS route behind the same public signature. That is a breaking regression for callers that build an mps_data_model_t but do not want distributed PDLP.
Suggested fix
optimization_problem_solution_t<i_t, f_t> solve_lp(
raft::handle_t const* handle_ptr,
const cuopt::linear_programming::io::mps_data_model_t<i_t, f_t>& mps_data_model,
pdlp_solver_settings_t<i_t, f_t> const& settings,
bool problem_checking,
bool use_pdlp_solver_mode)
{
- cuopt_expects(
- settings.hyper_params.use_distributed_pdlp,
- error_type_t::ValidationError,
- "solve_lp from mps_data_model: settings.hyper_params.use_distributed_pdlp must be true");
- return solve_lp_distributed_from_mps(
- handle_ptr, mps_data_model, settings, problem_checking, use_pdlp_solver_mode);
+ if (settings.hyper_params.use_distributed_pdlp) {
+ return solve_lp_distributed_from_mps(
+ handle_ptr, mps_data_model, settings, problem_checking, use_pdlp_solver_mode);
+ }
+ auto op_problem = mps_data_model_to_optimization_problem(handle_ptr, mps_data_model);
+ return solve_lp(op_problem, settings, problem_checking, use_pdlp_solver_mode, false);
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/src/pdlp/solve.cu` around lines 2129 - 2134, The current overload
erroneously hard-fails via cuopt_expects when
settings.hyper_params.use_distributed_pdlp is false and always forwards to
solve_lp_distributed_from_mps, removing the original single-GPU/direct-MPS path;
restore the prior behavior by replacing the hard-fail with a branch: if
settings.hyper_params.use_distributed_pdlp is true call
solve_lp_distributed_from_mps(handle_ptr, mps_data_model, settings,
problem_checking, use_pdlp_solver_mode) else call the non-distributed/MPS
entrypoint (the original direct-MPS function used previously—e.g.,
solve_lp_from_mps or the equivalent direct-MPS routine) so both paths are
supported, and keep or adjust cuopt_expects to validate only unsupported
parameter combinations if needed.
| pdlp_solver_settings_t<i_t, f_t> settings_resolved = settings; | ||
|
|
||
| detail::pdlp_graph_disabled_flag().store(settings_resolved.hyper_params.pdlp_disable_graph, | ||
| std::memory_order_relaxed); | ||
|
|
||
| if (settings_resolved.distributed_pdlp_num_gpus == -1) { | ||
| settings_resolved.distributed_pdlp_num_gpus = raft::device_setter::get_device_count(); | ||
| CUOPT_LOG_INFO( | ||
| "solve_lp_distributed_from_mps: distributed_pdlp_num_gpus == -1, auto-detected " | ||
| "%d visible CUDA device(s)", | ||
| settings_resolved.distributed_pdlp_num_gpus); | ||
| } | ||
| // PDLP precision validations (mirror the checks in run_pdlp; distributed | ||
| // path only supports the default-precision, non-batch double config). | ||
| cuopt_expects(settings_resolved.pdlp_precision == pdlp_precision_t::DefaultPrecision, | ||
| error_type_t::ValidationError, | ||
| "Distributed PDLP only supports DefaultPrecision (double)."); | ||
| cuopt_expects(!settings_resolved.inside_mip, | ||
| error_type_t::ValidationError, | ||
| "Distributed PDLP is not yet supported from inside MIP."); | ||
|
|
||
| init_logger_t log(settings_resolved.log_file, settings_resolved.log_to_console); | ||
| print_version_info(); | ||
| init_handler(handle_ptr); | ||
|
|
||
| const i_t n_vars = static_cast<i_t>(mps_data_model.get_objective_coefficients().size()); | ||
| const i_t n_cstr = static_cast<i_t>(mps_data_model.get_constraint_lower_bounds().size()); | ||
| const i_t nnz = static_cast<i_t>(mps_data_model.get_constraint_matrix_values().size()); | ||
| CUOPT_LOG_INFO( | ||
| "Solving a problem with %d constraints, %d variables (%d integers), and %d " | ||
| "nonzeros (distributed mps-direct path)", | ||
| n_cstr, | ||
| n_vars, | ||
| 0, | ||
| nnz); | ||
|
|
||
| auto lp_timer = cuopt::timer_t(settings_resolved.time_limit); | ||
|
|
||
| // Shape-0 placeholder: needed to build an empty pdlp_solver | ||
| cuopt::linear_programming::optimization_problem_t<i_t, f_t> placeholder_op(handle_ptr); | ||
| { | ||
| std::vector<i_t> empty_offsets = {0}; | ||
| placeholder_op.set_csr_constraint_matrix( | ||
| nullptr, 0, nullptr, 0, empty_offsets.data(), static_cast<i_t>(empty_offsets.size())); | ||
| } | ||
| detail::problem_t<i_t, f_t> placeholder_problem(placeholder_op); | ||
|
|
||
| detail::pdlp_solver_t<i_t, f_t> solver(placeholder_problem, mps_data_model, settings_resolved); | ||
|
|
There was a problem hiding this comment.
Validate the method and apply the preset-mode mapping in the distributed path.
solve_lp_distributed_from_mps() always constructs detail::pdlp_solver_t, but it never checks settings.method and never calls set_pdlp_solver_mode(settings_resolved) when use_pdlp_solver_mode is true. A caller can request Barrier/Concurrent or a preset PDLP mode and silently get default-parameter PDLP instead.
Suggested fix
pdlp_solver_settings_t<i_t, f_t> settings_resolved = settings;
+ cuopt_expects(settings_resolved.method == method_t::PDLP,
+ error_type_t::ValidationError,
+ "Distributed MPS solve currently supports only method_t::PDLP");
+ if (use_pdlp_solver_mode) { set_pdlp_solver_mode(settings_resolved); }
detail::pdlp_graph_disabled_flag().store(settings_resolved.hyper_params.pdlp_disable_graph,
std::memory_order_relaxed);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/src/pdlp/solve.cu` around lines 2157 - 2205,
solve_lp_distributed_from_mps builds detail::pdlp_solver_t using
settings_resolved but never applies settings.method or calls
set_pdlp_solver_mode, so requested PDLP modes/presets are ignored; fix by
checking settings_resolved.use_pdlp_solver_mode (and/or
settings_resolved.method) before constructing the solver and call
set_pdlp_solver_mode(settings_resolved) to map the preset/method into the solver
settings (or apply the mapping to settings_resolved) so the subsequent
detail::pdlp_solver_t(placeholder_problem, mps_data_model, settings_resolved) is
constructed with the intended PDLP mode.
| pdlp_solver_settings_t<int, double> dist_settings = base_settings; | ||
| dist_settings.hyper_params.use_distributed_pdlp = true; | ||
| dist_settings.distributed_pdlp_num_gpus = -1; | ||
| auto dist = solve_lp(&handle, problem, dist_settings); |
There was a problem hiding this comment.
Require at least 2 GPUs here or skip the test.
With distributed_pdlp_num_gpus = -1 and the explicit “1 GPU is fine” behavior, this helper can pass without ever exercising the multi-GPU/NCCL path the PR is adding. That leaves the highest-risk path effectively untested in single-GPU CI.
Suggested change
pdlp_solver_settings_t<int, double> dist_settings = base_settings;
dist_settings.hyper_params.use_distributed_pdlp = true;
- dist_settings.distributed_pdlp_num_gpus = -1;
+ int device_count = 0;
+ RAFT_CUDA_TRY(cudaGetDeviceCount(&device_count));
+ if (device_count < 2) {
+ GTEST_SKIP() << "distributed parity requires at least 2 GPUs";
+ }
+ dist_settings.distributed_pdlp_num_gpus = 2;
auto dist = solve_lp(&handle, problem, dist_settings);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| pdlp_solver_settings_t<int, double> dist_settings = base_settings; | |
| dist_settings.hyper_params.use_distributed_pdlp = true; | |
| dist_settings.distributed_pdlp_num_gpus = -1; | |
| auto dist = solve_lp(&handle, problem, dist_settings); | |
| pdlp_solver_settings_t<int, double> dist_settings = base_settings; | |
| dist_settings.hyper_params.use_distributed_pdlp = true; | |
| int device_count = 0; | |
| RAFT_CUDA_TRY(cudaGetDeviceCount(&device_count)); | |
| if (device_count < 2) { | |
| GTEST_SKIP() << "distributed parity requires at least 2 GPUs"; | |
| } | |
| dist_settings.distributed_pdlp_num_gpus = 2; | |
| auto dist = solve_lp(&handle, problem, dist_settings); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/tests/linear_programming/pdlp_test.cu` around lines 188 - 191, The test
currently sets distributed_pdlp_num_gpus = -1 which lets a single-GPU run bypass
the multi-GPU/NCCL path; change the test to first query the available GPU count
and if fewer than 2 GPUs are present skip the test, otherwise set
pdlp_solver_settings_t::distributed_pdlp_num_gpus to at least 2 (e.g., max(2,
available_gpus)) before calling solve_lp(&handle, problem, dist_settings) so the
distributed PDLP path is actually exercised (use pdlp_solver_settings_t,
dist_settings, distributed_pdlp_num_gpus and solve_lp as the loci to modify).
| TEST(pdlp_class, distributed_parity_square41) | ||
| { | ||
| const raft::handle_t handle{}; | ||
| expect_distributed_matches_base(handle, "linear_programming/neos3/neos3.mps"); | ||
| } |
There was a problem hiding this comment.
Point the square41 test at square41, not neos3.
This test is named distributed_parity_square41 but currently loads linear_programming/neos3/neos3.mps. That means the intended square41 regression is missing, and it also reintroduces an instance this file already treats as TODO/problematic later on.
Suggested change
TEST(pdlp_class, distributed_parity_square41)
{
const raft::handle_t handle{};
- expect_distributed_matches_base(handle, "linear_programming/neos3/neos3.mps");
+ expect_distributed_matches_base(handle, "linear_programming/square41/square41.mps");
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| TEST(pdlp_class, distributed_parity_square41) | |
| { | |
| const raft::handle_t handle{}; | |
| expect_distributed_matches_base(handle, "linear_programming/neos3/neos3.mps"); | |
| } | |
| TEST(pdlp_class, distributed_parity_square41) | |
| { | |
| const raft::handle_t handle{}; | |
| expect_distributed_matches_base(handle, "linear_programming/square41/square41.mps"); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cpp/tests/linear_programming/pdlp_test.cu` around lines 248 - 252, The test
pdlp_class::distributed_parity_square41 is loading the wrong dataset; change the
argument to expect_distributed_matches_base in that test so it points to
"linear_programming/square41/square41.mps" instead of
"linear_programming/neos3/neos3.mps" so the regression covers the intended
square41 case (update the call site in the distributed_parity_square41 test that
invokes expect_distributed_matches_base).
Not review ready
Not merge ready
Just to let team have a look at it but definitely needs a big clean up
closes #891