CuD-PDLP#1391

Open
Bubullzz wants to merge 68 commits into NVIDIA:main from Bubullzz:cuD-PDLP

@Bubullzz
Contributor

@Bubullzz Bubullzz commented Jun 4, 2026

Not review ready
Not merge ready

Just to let the team have a look at it, but it definitely needs a big cleanup.
closes #891

Bubullzz added 30 commits May 7, 2026 15:07
@Bubullzz Bubullzz requested review from a team as code owners June 4, 2026 15:24

copy-pr-bot Bot commented Jun 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Bubullzz Bubullzz added the "do not merge" label (Do not merge if this flag is set) Jun 4, 2026

coderabbitai Bot commented Jun 4, 2026

Walkthrough

This PR adds comprehensive distributed multi-GPU support to the PDLP linear programming solver, enabling it to handle large-scale problems that exceed single-GPU memory. The implementation partitions constraint-variable bipartite graphs (via METIS or KaMinPar), distributes subproblems to per-GPU solver shards coordinated by NCCL, and reimplements key linear algebra operations (SpMV, scaling, convergence checking) to execute in a distributed manner.
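As an illustration of the partitioning input described above, here is a minimal host-only sketch of turning a constraint matrix in CSR form into a constraint-variable bipartite graph. It assumes constraint `i` becomes node `i` and variable `j` becomes node `m + j`; the `bipartite_graph` struct and `build_bipartite_graph` function are hypothetical names, and the `xadj`/`adjncy` field names simply follow the METIS adjacency convention. The PR's actual graph construction is not shown in this conversation and may differ.

```cpp
#include <vector>

// Adjacency structure in compressed form (xadj/adjncy follow METIS naming).
struct bipartite_graph {
  std::vector<int> xadj;    // size n_nodes + 1
  std::vector<int> adjncy;  // size 2 * nnz (each nonzero yields two arcs)
};

// Hypothetical helper: builds the symmetric bipartite graph of an m x n CSR
// matrix. Constraint i maps to node i, variable j maps to node m + j.
bipartite_graph build_bipartite_graph(const std::vector<int>& offsets,
                                      const std::vector<int>& cols,
                                      int m,
                                      int n)
{
  const int n_nodes = m + n;
  std::vector<int> degree(n_nodes, 0);
  for (int i = 0; i < m; ++i) {
    for (int k = offsets[i]; k < offsets[i + 1]; ++k) {
      ++degree[i];
      ++degree[m + cols[k]];
    }
  }

  bipartite_graph g;
  g.xadj.assign(n_nodes + 1, 0);
  for (int v = 0; v < n_nodes; ++v) { g.xadj[v + 1] = g.xadj[v] + degree[v]; }
  g.adjncy.resize(g.xadj[n_nodes]);

  // cursor[v] is the next free slot in node v's adjacency range.
  std::vector<int> cursor(g.xadj.begin(), g.xadj.end() - 1);
  for (int i = 0; i < m; ++i) {
    for (int k = offsets[i]; k < offsets[i + 1]; ++k) {
      const int var_node           = m + cols[k];
      g.adjncy[cursor[i]++]        = var_node;
      g.adjncy[cursor[var_node]++] = i;
    }
  }
  return g;
}
```

A k-way partitioner would then assign one part id per node, with the first `m` ids covering constraints and the next `n` covering variables.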

Changes

Distributed Multi-GPU PDLP

| Layer / File(s) | Summary |
|---|---|
| **Build system & NCCL/METIS/KaMinPar integration**<br>cpp/CMakeLists.txt, cpp/cmake/thirdparty/get_kaminpar.cmake, cpp/src/pdlp/CMakeLists.txt | CMake discovers NCCL and METIS libraries from environment, creates imported targets, locates/builds KaMinPar via CPM, and links all three into the cuopt shared library alongside new distributed PDLP source files. |
| **Configuration parameters & CLI support**<br>cpp/include/cuopt/linear_programming/constants.h, cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp, cpp/src/math_optimization/solver_settings.cu, cpp/cuopt_cli.cpp | Adds configuration macros and solver settings fields for distributed PDLP (GPU count, partition files, partitioner selection). CLI conditionally routes LP solving through distributed entry point when enabled and auto-detects GPU count for RMM provisioning. |
| **Graph partitioner interface & implementations**<br>cpp/src/pdlp/distributed_pdlp/partitioner.hpp, cpp/src/pdlp/distributed_pdlp/partitioner.cu, cpp/src/pdlp/distributed_pdlp/metis_partitioner.hpp, cpp/src/pdlp/distributed_pdlp/metis_partitioner.cu, cpp/src/pdlp/distributed_pdlp/kaminpar_partitioner.hpp, cpp/src/pdlp/distributed_pdlp/kaminpar_partitioner.cpp | Abstract partitioner interface with dummy/METIS/KaMinPar implementations to split bipartite constraint-variable graphs into k-way balanced partitions; includes factory and validation routines. |
| **Rank data & partition loading**<br>cpp/src/pdlp/distributed_pdlp/rank_data.hpp, cpp/src/pdlp/distributed_pdlp/partition_loader.hpp, cpp/src/pdlp/distributed_pdlp/partition_loader.cu | Per-rank metadata structures holding ownership ranges, local CSR matrices, global↔local index maps, and per-peer halo communication plans; file-based partition I/O for reproducibility. |
| **Shard infrastructure & NCCL communication**<br>cpp/src/pdlp/distributed_pdlp/shard.hpp, cpp/src/pdlp/distributed_pdlp/shard.cu | Non-copyable per-GPU shard owning local optimization problem, NCCL communicator, scaled/unscaled buffers, and pre-staged GPU halo-exchange buffers for constraint/variable synchronization. |
| **Multi-GPU engine core orchestration**<br>cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.hpp, cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.cu | Orchestrates computation across shards via halo exchange, NCCL all-reduce, distributed L2 norm, power-iteration max-singular-value estimation, and scaling (bound/objective rescaling, Ruiz inf-norm, Pock–Chambolle); coordinates CUDA graph fork/join for reflected operations. |
| **Distributed scaling strategy refactoring**<br>cpp/src/pdlp/initial_scaling_strategy/initial_scaling.cuh, cpp/src/pdlp/initial_scaling_strategy/initial_scaling.cu | Splits Ruiz and Pock–Chambolle scaling into composable local-compute and cumulative-apply stages; adds setters for externally-injected scaling factors; makes compute_scaling_vectors public for distributed usage. |
| **PDHG multi-GPU wiring & graph capture**<br>cpp/src/pdlp/pdhg.hpp, cpp/src/pdlp/pdhg.cu | Adds multi-GPU state (engine pointer, is_multi_gpu flag), per-shard SpMV helpers (spmv_A_into/spmv_At_into), reflected projection transform wrappers, and CUDA graph fork/join orchestration during reflected primal/dual updates. |
| **PDLP distributed constructor & initialization**<br>cpp/src/pdlp/pdlp.cuh, cpp/src/pdlp/pdlp.cu | New constructor ingests MPS model directly, partitions problem, constructs multi-GPU engine with per-shard solvers, initializes distributed scaling and convergence norms via all-reduce, wires multi-GPU execution into master solver. |
| **PDLP solver loop multi-GPU integration**<br>cpp/src/pdlp/pdlp.cu | Adapts solver main loop initialization (gated for non-distributed only), termination checks, fixed-error computation, restart strategy, and Halpern updates to execute per-shard and synchronize state via multi-GPU engine when present. |
| **Distributed convergence information computation**<br>cpp/src/pdlp/termination_strategy/convergence_information.hpp, cpp/src/pdlp/termination_strategy/convergence_information.cu, cpp/src/pdlp/termination_strategy/termination_strategy.hpp, cpp/src/pdlp/termination_strategy/termination_strategy.cu | Extends convergence information to compute primal/dual residuals and objectives in multi-GPU mode via halo exchange, per-shard partial computation, all-reduce aggregation, and per-shard reference norm handling; adds mutable accessors. |
| **Adaptive step-size strategy multi-GPU support**<br>cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.hpp, cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu | Adds mutable getters for interaction/norm buffers and owned-prefix reduction parameters to enable per-shard movement computation in multi-GPU context. |
| **Entry point routing & graph disabling**<br>cpp/src/pdlp/solve.cuh, cpp/src/pdlp/solve.cu | New solve_lp_distributed_from_mps entry point validates distributed preconditions, routes to MPS-based distributed PDLP constructor, applies graph-disabled flag, and handles solution post-processing and file output. |
| **Tracing & diagnostic utilities**<br>cpp/src/pdlp/utilities/mgpu_trace.cuh, cpp/src/pdlp/utilities/ping_pong_graph.cuh | Lightweight environment-gated multi-GPU tracing macros (MGPU_TRACE) and atomic CUDA graph disabling control for debugging distributed execution. |
| **cuSPARSE descriptor nnz binding fix**<br>cpp/src/pdlp/cusparse_view.cu | Fixes cuSPARSE descriptor nnz binding to use actual device buffer sizes instead of problem metadata fields, ensuring correctness for partitioned matrices. |
| **Saddle point size relaxation for distributed**<br>cpp/src/pdlp/saddle_point.cu | Relaxes saddle point constructor precondition to allow zero-sized primal/dual dimensions needed for master problem construction in distributed context. |
| **Distributed PDLP tests**<br>cpp/tests/linear_programming/pdlp_test.cu | Adds partition export/import round-trip test and distributed-vs-single-GPU parity tests verifying matching termination status, objectives, step counts (within 15%), and solution vectors across AFIRO, square41, and a2864 instances. |
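To make the "distributed L2 norm" and "owned-prefix reduction" entries in the summary more concrete, here is a minimal host-only sketch. It assumes each shard stores its owned entries first and halo copies after them, so only the owned prefix contributes to the partial sum. The names `shard_view`, `local_sum_of_squares`, and `distributed_l2_norm` are hypothetical, and the plain loop over shards stands in for what would presumably be an NCCL sum all-reduce in the PR.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-shard view: the first `owned_size` entries are owned by
// this shard; any remaining entries are halo copies owned by peers.
struct shard_view {
  std::vector<double> values;
  std::size_t owned_size;
};

// Partial sum of squares over the owned prefix only, so that halo entries
// are not counted once per shard that mirrors them.
double local_sum_of_squares(const shard_view& s)
{
  double acc = 0.0;
  for (std::size_t i = 0; i < s.owned_size && i < s.values.size(); ++i) {
    acc += s.values[i] * s.values[i];
  }
  return acc;
}

// Stand-in for "all-reduce(sum) of the partials, then sqrt on every rank".
double distributed_l2_norm(const std::vector<shard_view>& shards)
{
  double total = 0.0;
  for (const auto& s : shards) { total += local_sum_of_squares(s); }
  return std::sqrt(total);
}
```

Reducing the squared partials and taking the square root once at the end keeps the result independent of how entries are split across shards, up to floating-point summation order.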

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

feature request, improvement

Suggested reviewers

  • rgsl888prabhu
  • mlubin
  • bdice

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/src/pdlp/solve.cu (1)

769-784: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject distributed problem_t calls before any early return.

The new guard sits below the zero-constraint return and the FP32 fallback. With use_distributed_pdlp=true plus SinglePrecision, this path returns run_pdlp_solver_in_fp32(...) instead of raising the intended validation error, so an unsupported distributed configuration silently runs the single-GPU solver.

Suggested fix
 static optimization_problem_solution_t<i_t, f_t> run_pdlp_solver(
   detail::problem_t<i_t, f_t>& problem,
   pdlp_solver_settings_t<i_t, f_t> const& settings,
   const timer_t& timer,
   bool is_batch_mode)
 {
+  cuopt_expects(!settings.hyper_params.use_distributed_pdlp,
+                error_type_t::ValidationError,
+                "Distributed PDLP must be entered via solve_lp(mps_data_model, ...) "
+                "so the master GPU never materializes the full problem. Call sites "
+                "with a problem_t cannot dispatch to distributed mode.");
+
   detail::pdlp_graph_disabled_flag().store(settings.hyper_params.pdlp_disable_graph,
                                            std::memory_order_relaxed);
 
   if (problem.n_constraints == 0) {
     ...
   }
 #if PDLP_INSTANTIATE_FLOAT || CUOPT_INSTANTIATE_FLOAT
   if constexpr (std::is_same_v<f_t, double>) {
     if (settings.pdlp_precision == pdlp_precision_t::SinglePrecision) {
       return run_pdlp_solver_in_fp32(problem, settings, timer, is_batch_mode);
     }
   }
 #endif
-  cuopt_expects(!settings.hyper_params.use_distributed_pdlp, ...);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/pdlp/solve.cu` around lines 769 - 784, The distributed-mode
validation (cuopt_expects(!settings.hyper_params.use_distributed_pdlp, ...))
must be performed before any early returns so a distributed call cannot
accidentally take the FP32 fallback or zero-constraint path; move or duplicate
that check to occur before the SinglePrecision/FP32 branch and before the
zero-constraint return so that when settings.hyper_params.use_distributed_pdlp
is true (for problem_t inputs) the function immediately raises the
ValidationError rather than calling run_pdlp_solver_in_fp32 or returning early.
Ensure the check references the same validation message and
error_type_t::ValidationError used currently.
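
The ordering problem can be reproduced in isolation. The sketch below is a simplified stand-in, not the cuOpt code: `settings_sketch`, `problem_sketch`, and `run_solver_sketch` are hypothetical names, and a `std::invalid_argument` replaces the `cuopt_expects` validation error. It shows the guard placed ahead of both early returns, so the unsupported combination can never reach the FP32 fallback.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical, simplified stand-ins for the real settings and problem types.
struct settings_sketch {
  bool use_distributed_pdlp = false;
  bool single_precision     = false;
};

struct problem_sketch {
  int n_constraints = 0;
};

// Returns a label for the path taken. The distributed guard runs first, so it
// fires even when the zero-constraint or FP32 branch would otherwise return.
std::string run_solver_sketch(const problem_sketch& problem, const settings_sketch& settings)
{
  if (settings.use_distributed_pdlp) {
    throw std::invalid_argument("distributed PDLP cannot be entered with a problem_t");
  }
  if (problem.n_constraints == 0) { return "trivial"; }
  if (settings.single_precision) { return "fp32"; }
  return "fp64";
}
```

With the guard moved below the two early returns, a call with both `use_distributed_pdlp` and `single_precision` set would return `"fp32"` instead of throwing, which is the silent fallback this comment describes.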
cpp/src/pdlp/pdlp.cu (1)

3063-3079: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

The distributed average path is still unsafe in release builds.

When multi_gpu_engine is present and never_restart_to_average is false, Line 3071 uses plain assert(false). In release builds that disappears, and the subsequent raft::copy writes primal_size_h_/dual_size_h_ elements into unscaled_*_avg_solution_, which were never resized for the distributed ctor. That turns this TODO into an invalid device-copy / wrong-result path instead of a clean runtime rejection.
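
A minimal host-side sketch of the safer pattern follows, assuming a plain `std::vector` in place of the device buffers and a hypothetical `checked_copy` helper in place of `raft::copy`. Unlike `assert`, which expands to nothing when `NDEBUG` is defined, the explicit check below is still compiled into release builds.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: copies `src` into `dst` only when the destination was
// sized for it. The check survives NDEBUG, so a release build fails loudly
// instead of writing past the end of an unresized buffer.
void checked_copy(const std::vector<double>& src, std::vector<double>& dst)
{
  if (dst.size() < src.size()) {
    throw std::runtime_error("destination buffer was never resized for this path");
  }
  for (std::size_t i = 0; i < src.size(); ++i) { dst[i] = src[i]; }
}
```

In the PR, the equivalent would be either resizing the average-solution buffers in the distributed constructor or rejecting the configuration with a runtime error before the copy is attempted.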

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/pdlp/pdlp.cu` around lines 3063 - 3079, The path that handles
multi-GPU (multi_gpu_engine) uses assert(false) which vanishes in release builds
and leads to invalid device copies into unscaled_primal_avg_solution_ /
unscaled_dual_avg_solution_; fix by replacing the assert with a deterministic
runtime guard: either resize/allocate unscaled_primal_avg_solution_ and
unscaled_dual_avg_solution_ to primal_size_h_ and dual_size_h_ (and
synchronize/validate device pointers) before calling raft::copy from
pdhg_solver_.get_primal_solution() / get_dual_solution(), or explicitly fail
early by logging and throwing a runtime_error when multi_gpu_engine is true so
the copy is never attempted; update the branch around
internal_solver_iterations_ <= 1 where multi_gpu_engine is checked to implement
one of these safe behaviors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/cuopt_cli.cpp`:
- Around line 180-184: When lp_settings.hyper_params.use_distributed_pdlp is
true, guard the distributed PDLP call by checking that handle_ptr is non-null
before invoking cuopt::linear_programming::solve_lp(handle_ptr.get(), ...); if
handle_ptr is null, fail fast with a clear error (e.g., log and exit or throw)
rather than calling the distributed overload; update the branch that currently
chooses between solve_lp(handle_ptr.get(), mps_data_model, lp_settings) and
solve_lp(problem_interface.get(), lp_settings) to validate handle_ptr first and
only call the distributed overload when handle_ptr is valid.
- Around line 439-447: The code currently computes requested_gpus and then uses
std::min(...) to compute provisioned_gpus and
memory_resources.reserve(provisioned_gpus) without validating requested_gpus;
add explicit validation after computing requested_gpus (and after remapping -1
when use_distributed_pdlp is true) to ensure requested_gpus > 0 and that
raft::device_setter::get_device_count() > 0 before calling std::min or reserve.
If either value is non-positive, return/log an error or throw an exception
(consistent with surrounding error handling) referencing the parameters obtained
via settings.get_parameter<int>(CUOPT_NUM_GPUS) and
settings.get_parameter<int>(CUOPT_DISTRIBUTED_PDLP_NUM_GPUS) so the code never
calls memory_resources.reserve with a non-positive size.

In `@cpp/src/pdlp/cusparse_view.cu`:
- Around line 501-511: The mixed-precision branch still sizes and recreates FP32
matrices using op_problem_scaled.nnz which can differ per shard; update that
block to use the shard-local nnz values (e.g. static_cast<int64_t>(A_.size())
and static_cast<int64_t>(A_T_.size())) when allocating/sizing A_mixed_ and
A_T_mixed_ and when copying/transposing data for A_T.create / A.create so you
don't overrun A_T_ or leave stale nnz metadata; ensure any metadata fields set
during the FP32 recreate follow the shard-local sizes and that all
transforms/read ranges use those local sizes (A_, A_T_, A_mixed_, A_T_mixed_).

In `@cpp/src/pdlp/distributed_pdlp/partition_loader.cu`:
- Around line 77-87: Validate partition and CSR metadata before any
slicing/indexing: check that parts.size() >= nb_cstr + nb_vars before creating
cstr_parts/var_parts, ensure all entries in parts are within [0, nb_parts)
before using them to index rank_data_t<i_t,f_t>, and verify CSR arrays
(offsets/indices) have expected lengths (e.g., offsets.size() >= rows+1 and
indices.size() == nnz) before dereferencing in functions that build/iterate the
CSR (referencing variables parts, nb_cstr, nb_vars, rank_data_t, and the CSR
offset/index containers); use cuopt_expects (or the existing error path) to fail
early with clear messages when any check fails.

In `@cpp/src/pdlp/pdlp.cu`:
- Around line 821-825: The distributed gather of the current iterate is missing
on several return paths so master buffers can be stale; call the multi-GPU
gather before any return that serializes the current solution. Specifically,
ensure pdhg_solver_.get_mgpu_engine() and its method
gather_potential_next_solutions_to_master(pdhg_solver_,
current_termination_strategy_.get_convergence_information().get_reduced_cost())
is invoked centrally before any code that calls
fill_return_problem_solution(...), and add the same centralized gather call on
the other identified return sites (including the ConcurrentLimit and
PrimalFeasible/infeasibility exits referenced around lines ~859-863 and
~1541-1545) so the master full-size solution/reduced-cost buffers are populated
on every distributed return path.
- Around line 387-393: The distributed constructor pdlp_solver_t( problem_t<i_t,
f_t>& placeholder_problem, ... ) currently delegates to the regular ctor before
shard sizes exist, causing functions that use primal_size_h_/dual_size_h_ (e.g.,
set_initial_primal_solution, handling of initial_primal_solution and
initial_dual_solution and warm-start data) to operate on zero-length buffers;
update this constructor to either (a) validate and reject any initial-state
options (initial_primal_solution, initial_dual_solution, warm-start) up front
and return an error, or (b) defer all logic that applies initial iterates (calls
to set_initial_primal_solution / set_initial_dual_solution and warm-start
handling) until after shard construction when primal_size_h_ and dual_size_h_
are set, ensuring no modulo/divide-by-zero or zero-length copies occur.

In `@cpp/src/pdlp/solve.cu`:
- Around line 759-760: The global flag detail::pdlp_graph_disabled_flag() is
being mutated per-solve causing races; instead make the graph-disable decision
local to each solver instance and avoid writing the process-global flag from
solve entrypoints. Change callers that currently
store(settings.hyper_params.pdlp_disable_graph, ...) to pass the
pdlp_disable_graph boolean into the solver instance (or ctor) and have
ping_pong_graph_t::run() and related graph code read that instance-level flag
rather than detail::pdlp_graph_disabled_flag(); remove writes to the global flag
in solve functions so concurrent solves do not flip each other’s mode.
- Around line 2129-2134: The current overload erroneously hard-fails via
cuopt_expects when settings.hyper_params.use_distributed_pdlp is false and
always forwards to solve_lp_distributed_from_mps, removing the original
single-GPU/direct-MPS path; restore the prior behavior by replacing the
hard-fail with a branch: if settings.hyper_params.use_distributed_pdlp is true
call solve_lp_distributed_from_mps(handle_ptr, mps_data_model, settings,
problem_checking, use_pdlp_solver_mode) else call the non-distributed/MPS
entrypoint (the original direct-MPS function used previously—e.g.,
solve_lp_from_mps or the equivalent direct-MPS routine) so both paths are
supported, and keep or adjust cuopt_expects to validate only unsupported
parameter combinations if needed.
- Around line 2157-2205: solve_lp_distributed_from_mps builds
detail::pdlp_solver_t using settings_resolved but never applies settings.method
or calls set_pdlp_solver_mode, so requested PDLP modes/presets are ignored; fix
by checking settings_resolved.use_pdlp_solver_mode (and/or
settings_resolved.method) before constructing the solver and call
set_pdlp_solver_mode(settings_resolved) to map the preset/method into the solver
settings (or apply the mapping to settings_resolved) so the subsequent
detail::pdlp_solver_t(placeholder_problem, mps_data_model, settings_resolved) is
constructed with the intended PDLP mode.

In `@cpp/tests/linear_programming/pdlp_test.cu`:
- Around line 188-191: The test currently sets distributed_pdlp_num_gpus = -1
which lets a single-GPU run bypass the multi-GPU/NCCL path; change the test to
first query the available GPU count and if fewer than 2 GPUs are present skip
the test, otherwise set pdlp_solver_settings_t::distributed_pdlp_num_gpus to at
least 2 (e.g., max(2, available_gpus)) before calling solve_lp(&handle, problem,
dist_settings) so the distributed PDLP path is actually exercised (use
pdlp_solver_settings_t, dist_settings, distributed_pdlp_num_gpus and solve_lp as
the loci to modify).
- Around line 248-252: The test pdlp_class::distributed_parity_square41 is
loading the wrong dataset; change the argument to
expect_distributed_matches_base in that test so it points to
"linear_programming/square41/square41.mps" instead of
"linear_programming/neos3/neos3.mps" so the regression covers the intended
square41 case (update the call site in the distributed_parity_square41 test that
invokes expect_distributed_matches_base).

---

Outside diff comments:
In `@cpp/src/pdlp/pdlp.cu`:
- Around line 3063-3079: The path that handles multi-GPU (multi_gpu_engine) uses
assert(false) which vanishes in release builds and leads to invalid device
copies into unscaled_primal_avg_solution_ / unscaled_dual_avg_solution_; fix by
replacing the assert with a deterministic runtime guard: either resize/allocate
unscaled_primal_avg_solution_ and unscaled_dual_avg_solution_ to primal_size_h_
and dual_size_h_ (and synchronize/validate device pointers) before calling
raft::copy from pdhg_solver_.get_primal_solution() / get_dual_solution(), or
explicitly fail early by logging and throwing a runtime_error when
multi_gpu_engine is true so the copy is never attempted; update the branch
around internal_solver_iterations_ <= 1 where multi_gpu_engine is checked to
implement one of these safe behaviors.

In `@cpp/src/pdlp/solve.cu`:
- Around line 769-784: The distributed-mode validation
(cuopt_expects(!settings.hyper_params.use_distributed_pdlp, ...)) must be
performed before any early returns so a distributed call cannot accidentally
take the FP32 fallback or zero-constraint path; move or duplicate that check to
occur before the SinglePrecision/FP32 branch and before the zero-constraint
return so that when settings.hyper_params.use_distributed_pdlp is true (for
problem_t inputs) the function immediately raises the ValidationError rather
than calling run_pdlp_solver_in_fp32 or returning early. Ensure the check
references the same validation message and error_type_t::ValidationError used
currently.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7df2a4b9-585b-4517-afcb-1aa089ecb1c1

📥 Commits

Reviewing files that changed from the base of the PR and between d6d6f9e and 91b1ae5.

📒 Files selected for processing (41)
  • cpp/CMakeLists.txt
  • cpp/cmake/thirdparty/get_kaminpar.cmake
  • cpp/cuopt_cli.cpp
  • cpp/include/cuopt/linear_programming/constants.h
  • cpp/include/cuopt/linear_programming/pdlp/pdlp_hyper_params.cuh
  • cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp
  • cpp/src/math_optimization/solver_settings.cu
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/cusparse_view.cu
  • cpp/src/pdlp/distributed_pdlp/kaminpar_partitioner.cpp
  • cpp/src/pdlp/distributed_pdlp/kaminpar_partitioner.hpp
  • cpp/src/pdlp/distributed_pdlp/metis_partitioner.cu
  • cpp/src/pdlp/distributed_pdlp/metis_partitioner.hpp
  • cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.cu
  • cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.hpp
  • cpp/src/pdlp/distributed_pdlp/partition_loader.cu
  • cpp/src/pdlp/distributed_pdlp/partition_loader.hpp
  • cpp/src/pdlp/distributed_pdlp/partitioner.cu
  • cpp/src/pdlp/distributed_pdlp/partitioner.hpp
  • cpp/src/pdlp/distributed_pdlp/rank_data.hpp
  • cpp/src/pdlp/distributed_pdlp/shard.cu
  • cpp/src/pdlp/distributed_pdlp/shard.hpp
  • cpp/src/pdlp/initial_scaling_strategy/initial_scaling.cu
  • cpp/src/pdlp/initial_scaling_strategy/initial_scaling.cuh
  • cpp/src/pdlp/pdhg.cu
  • cpp/src/pdlp/pdhg.hpp
  • cpp/src/pdlp/pdlp.cu
  • cpp/src/pdlp/pdlp.cuh
  • cpp/src/pdlp/restart_strategy/pdlp_restart_strategy.cu
  • cpp/src/pdlp/saddle_point.cu
  • cpp/src/pdlp/solve.cu
  • cpp/src/pdlp/solve.cuh
  • cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu
  • cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.hpp
  • cpp/src/pdlp/termination_strategy/convergence_information.cu
  • cpp/src/pdlp/termination_strategy/convergence_information.hpp
  • cpp/src/pdlp/termination_strategy/termination_strategy.cu
  • cpp/src/pdlp/termination_strategy/termination_strategy.hpp
  • cpp/src/pdlp/utilities/mgpu_trace.cuh
  • cpp/src/pdlp/utilities/ping_pong_graph.cuh
  • cpp/tests/linear_programming/pdlp_test.cu

Comment thread cpp/cuopt_cli.cpp
Comment on lines +180 to +184
if (lp_settings.hyper_params.use_distributed_pdlp) {
cuopt::linear_programming::solve_lp(handle_ptr.get(), mps_data_model, lp_settings);
} else {
cuopt::linear_programming::solve_lp(problem_interface.get(), lp_settings);
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard distributed solve path against null GPU handle.

When distributed PDLP is enabled, this branch can pass a null handle_ptr if the selected memory backend is not GPU. Fail fast before calling the distributed overload.

Suggested fix
-      if (lp_settings.hyper_params.use_distributed_pdlp) {
+      if (lp_settings.hyper_params.use_distributed_pdlp) {
+        if (handle_ptr == nullptr) {
+          throw std::runtime_error("Distributed PDLP requires GPU memory backend.");
+        }
         cuopt::linear_programming::solve_lp(handle_ptr.get(), mps_data_model, lp_settings);
       } else {
         cuopt::linear_programming::solve_lp(problem_interface.get(), lp_settings);
       }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/cuopt_cli.cpp` around lines 180 - 184, When
lp_settings.hyper_params.use_distributed_pdlp is true, guard the distributed
PDLP call by checking that handle_ptr is non-null before invoking
cuopt::linear_programming::solve_lp(handle_ptr.get(), ...); if handle_ptr is
null, fail fast with a clear error (e.g., log and exit or throw) rather than
calling the distributed overload; update the branch that currently chooses
between solve_lp(handle_ptr.get(), mps_data_model, lp_settings) and
solve_lp(problem_interface.get(), lp_settings) to validate handle_ptr first and
only call the distributed overload when handle_ptr is valid.

Comment thread cpp/cuopt_cli.cpp
Comment on lines +439 to +447
int requested_gpus = use_distributed_pdlp
? settings.get_parameter<int>(CUOPT_DISTRIBUTED_PDLP_NUM_GPUS)
: settings.get_parameter<int>(CUOPT_NUM_GPUS);
if (use_distributed_pdlp && requested_gpus == -1) {
requested_gpus = raft::device_setter::get_device_count();
}
const int provisioned_gpus = std::min(raft::device_setter::get_device_count(), requested_gpus);

memory_resources.reserve(std::min(raft::device_setter::get_device_count(), num_gpus));
for (int i = 0; i < std::min(raft::device_setter::get_device_count(), num_gpus); ++i) {
memory_resources.reserve(provisioned_gpus);

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate GPU count before provisioning memory resources.

requested_gpus is used to size provisioning without enforcing a positive value (other than distributed -1 remap). Add explicit validation before std::min(...)/reserve(...) to avoid invalid allocation/setup paths.

Suggested fix
-    if (use_distributed_pdlp && requested_gpus == -1) {
-      requested_gpus = raft::device_setter::get_device_count();
-    }
-    const int provisioned_gpus = std::min(raft::device_setter::get_device_count(), requested_gpus);
+    const int visible_gpus = raft::device_setter::get_device_count();
+    if (use_distributed_pdlp && requested_gpus == -1) {
+      requested_gpus = visible_gpus;
+    }
+    if (requested_gpus <= 0) {
+      std::cerr << "Invalid GPU count: " << requested_gpus
+                << " (must be > 0, or -1 only with distributed PDLP)." << std::endl;
+      return 1;
+    }
+    const int provisioned_gpus = std::min(visible_gpus, requested_gpus);
+    if (provisioned_gpus <= 0) {
+      std::cerr << "No visible GPUs available for GPU backend." << std::endl;
+      return 1;
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/cuopt_cli.cpp` around lines 439 - 447, The code currently computes
requested_gpus and then uses std::min(...) to compute provisioned_gpus and
memory_resources.reserve(provisioned_gpus) without validating requested_gpus;
add explicit validation after computing requested_gpus (and after remapping -1
when use_distributed_pdlp is true) to ensure requested_gpus > 0 and that
raft::device_setter::get_device_count() > 0 before calling std::min or reserve.
If either value is non-positive, return/log an error or throw an exception
(consistent with surrounding error handling) referencing the parameters obtained
via settings.get_parameter<int>(CUOPT_NUM_GPUS) and
settings.get_parameter<int>(CUOPT_DISTRIBUTED_PDLP_NUM_GPUS) so the code never
calls memory_resources.reserve with a non-positive size.
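
The same validation order can be expressed as a small pure function, which is easier to unit test than the CLI code path. This is only a sketch: `resolve_provisioned_gpus` is a hypothetical name, `visible` stands in for the value returned by `raft::device_setter::get_device_count()`, and exceptions replace the CLI's error return.

```cpp
#include <algorithm>
#include <stdexcept>

// Hypothetical helper mirroring the validation order described above.
// `requested` may be -1 only in distributed mode, meaning "all visible GPUs".
int resolve_provisioned_gpus(int requested, int visible, bool distributed)
{
  if (visible <= 0) { throw std::runtime_error("no visible GPUs"); }
  if (distributed && requested == -1) { requested = visible; }
  if (requested <= 0) { throw std::invalid_argument("GPU count must be positive"); }
  // Never provision more devices than are actually visible.
  return std::min(visible, requested);
}
```

Because the result is always at least 1 when the function returns, a later `reserve` call on that value cannot receive a non-positive size.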

Comment thread cpp/src/pdlp/cusparse_view.cu
Comment on lines +501 to +511
static_cast<int64_t>(A_.size()),
const_cast<i_t*>(op_problem_scaled.offsets.data()),
const_cast<i_t*>(op_problem_scaled.variables.data()),
const_cast<f_t*>(op_problem_scaled.coefficients.data()));

// A_T can have a different nnz than A in multi-GPU shards
// A is just what is needed to compute A_x for owned constraints
// A_T is just what is needed to compute A_T_y for owned variables
A_T.create(op_problem_scaled.n_variables,
op_problem_scaled.n_constraints,
op_problem_scaled.nnz,
static_cast<int64_t>(A_T_.size()),

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Use shard-local nnz in the mixed-precision setup too.

Now that this ctor allows A and A_T to have different nnz on a shard, the later mixed-precision block still sizes, transforms, and recreates the FP32 matrices with op_problem_scaled.nnz (Lines 777-804). On any distributed shard where those lengths diverge, that path can overrun A_T_ or build A_mixed_ / A_T_mixed_ with stale nnz metadata.

Suggested follow-up
-      A_float_.resize(op_problem_scaled.nnz, handle_ptr->get_stream());
-      A_T_float_.resize(op_problem_scaled.nnz, handle_ptr->get_stream());
+      A_float_.resize(A_.size(), handle_ptr->get_stream());
+      A_T_float_.resize(A_T_.size(), handle_ptr->get_stream());

-      RAFT_CUDA_TRY(cub::DeviceTransform::Transform(op_problem_scaled.coefficients.data(),
+      RAFT_CUDA_TRY(cub::DeviceTransform::Transform(A_.data(),
                                                     A_float_.data(),
-                                                    op_problem_scaled.nnz,
+                                                    A_.size(),
                                                     double_to_float_functor{},
                                                     handle_ptr->get_stream().value()));

       RAFT_CUDA_TRY(cub::DeviceTransform::Transform(A_T_.data(),
                                                     A_T_float_.data(),
-                                                    op_problem_scaled.nnz,
+                                                    A_T_.size(),
                                                     double_to_float_functor{},
                                                     handle_ptr->get_stream().value()));

       A_mixed_.create(op_problem_scaled.n_constraints,
                       op_problem_scaled.n_variables,
-                      op_problem_scaled.nnz,
+                      static_cast<int64_t>(A_.size()),
                       const_cast<i_t*>(op_problem_scaled.offsets.data()),
                       const_cast<i_t*>(op_problem_scaled.variables.data()),
                       A_float_.data());

       A_T_mixed_.create(op_problem_scaled.n_variables,
                         op_problem_scaled.n_constraints,
-                        op_problem_scaled.nnz,
+                        static_cast<int64_t>(A_T_.size()),
                         const_cast<i_t*>(A_T_offsets_.data()),
                         const_cast<i_t*>(A_T_indices_.data()),
                         A_T_float_.data());

As per coding guidelines, "Prevent invalid memory access (out-of-bounds, use-after-free, host/device confusion) in GPU code."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/pdlp/cusparse_view.cu` around lines 501 - 511, The mixed-precision
branch still sizes and recreates FP32 matrices using op_problem_scaled.nnz, which
can differ per shard; update that block to use the shard-local nnz values (e.g.
static_cast<int64_t>(A_.size()) and static_cast<int64_t>(A_T_.size())) when
resizing A_float_ and A_T_float_, when running the double-to-float transforms,
and when calling A_mixed_.create / A_T_mixed_.create, so you don't read past the
end of A_ or A_T_ or leave stale nnz metadata; ensure any metadata fields set
during the FP32 recreate follow the shard-local sizes and that all
transforms/read ranges use those local sizes (A_, A_T_, A_mixed_, A_T_mixed_).
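
To make the sizing rule concrete, here is a minimal host-side sketch in plain C++ (no CUDA, no cuOpt types). The helper name `to_float_shard` is hypothetical and only illustrates the idea of deriving the FP32 buffer length from the shard's own buffer rather than from a global nnz; the real code would do the same with `A_.size()` on device vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of cuOpt: converts one shard's
// double-precision values to float. The output length is taken from the
// shard's own buffer, never from a problem-wide nnz count.
inline std::vector<float> to_float_shard(const std::vector<double>& shard_values)
{
  std::vector<float> out(shard_values.size());
  std::transform(shard_values.begin(),
                 shard_values.end(),
                 out.begin(),
                 [](double v) { return static_cast<float>(v); });
  return out;
}
```

Because the read range and the destination size come from the same container, a shard holding fewer nonzeros than the global problem cannot be overrun.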

Comment on lines +77 to +87
cuopt_expects(A_values.size() == A_values_scaled.size(),
error_type_t::ValidationError,
"A_values and A_values_scaled must have the same length");
cuopt_expects(A_t_values.size() == A_t_values_scaled.size(),
error_type_t::ValidationError,
"A_t_values and A_t_values_scaled must have the same length");

std::vector<rank_data_t<i_t, f_t>> rank_data(nb_parts, rank_data_t<i_t, f_t>(nb_parts));
std::vector<i_t> cstr_parts(parts.begin(), parts.begin() + nb_cstr);
std::vector<i_t> var_parts(parts.begin() + nb_cstr, parts.begin() + nb_cstr + nb_vars);

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Validate partition metadata and CSR dimensions before indexing.

Line 85/Line 86 slice parts without checking size, Line 90 indexes rank_data using unvalidated part IDs, and Line 115/Line 157 dereference CSR offsets/indices without shape checks. A malformed partition file or inconsistent matrix metadata can cause out-of-bounds access/UB.

Proposed guard block (at function entry)
 std::vector<rank_data_t<i_t, f_t>> partition_loader_t<i_t, f_t>::create_rank_data_from_parts(
@@
 {
+  cuopt_expects(nb_parts > 0, error_type_t::ValidationError, "nb_parts must be > 0");
+  cuopt_expects(nb_cstr >= 0 && nb_vars >= 0 && nnz >= 0,
+                error_type_t::ValidationError,
+                "nb_cstr/nb_vars/nnz must be non-negative");
+
+  const auto expected_parts =
+    static_cast<size_t>(nb_cstr) + static_cast<size_t>(nb_vars);
+  cuopt_expects(parts.size() == expected_parts,
+                error_type_t::ValidationError,
+                "parts size mismatch: expected nb_cstr + nb_vars");
+
+  for (auto p : parts) {
+    cuopt_expects(p >= 0 && p < nb_parts,
+                  error_type_t::ValidationError,
+                  "partition id out of range [0, nb_parts)");
+  }
+
+  cuopt_expects(A_row_offsets.size() == static_cast<size_t>(nb_cstr) + 1,
+                error_type_t::ValidationError,
+                "A_row_offsets size must be nb_cstr + 1");
+  cuopt_expects(A_t_row_offsets.size() == static_cast<size_t>(nb_vars) + 1,
+                error_type_t::ValidationError,
+                "A_t_row_offsets size must be nb_vars + 1");
+
+  cuopt_expects(A_col_indices.size() == static_cast<size_t>(nnz) &&
+                  A_values.size() == static_cast<size_t>(nnz) &&
+                  A_values_scaled.size() == static_cast<size_t>(nnz),
+                error_type_t::ValidationError,
+                "A buffers must match nnz");
+  cuopt_expects(A_t_col_indices.size() == static_cast<size_t>(nnz) &&
+                  A_t_values.size() == static_cast<size_t>(nnz) &&
+                  A_t_values_scaled.size() == static_cast<size_t>(nnz),
+                error_type_t::ValidationError,
+                "A_t buffers must match nnz");
 }

As per coding guidelines: “Validate input at library and server boundaries.”

Also applies to: 89-94, 115-121, 156-163

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/pdlp/distributed_pdlp/partition_loader.cu` around lines 77 - 87,
Validate partition and CSR metadata before any slicing/indexing: check that
parts.size() >= nb_cstr + nb_vars before creating cstr_parts/var_parts, ensure
all entries in parts are within [0, nb_parts) before using them to index
rank_data_t<i_t,f_t>, and verify CSR arrays (offsets/indices) have expected
lengths (e.g., offsets.size() >= rows+1 and indices.size() == nnz) before
dereferencing in functions that build/iterate the CSR (referencing variables
parts, nb_cstr, nb_vars, rank_data_t, and the CSR offset/index containers); use
cuopt_expects (or the existing error path) to fail early with clear messages
when any check fails.
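
The checks described above can be sketched in standalone C++ as follows. This is an illustration under stated assumptions, not the cuOpt implementation: `expect` is a hypothetical stand-in for `cuopt_expects` that throws `std::invalid_argument`, and `validate_partition_inputs` is a made-up name covering a single CSR matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for cuopt_expects: throws instead of using cuOpt's error path.
inline void expect(bool cond, const std::string& msg)
{
  if (!cond) { throw std::invalid_argument(msg); }
}

// Validates a partition vector and one CSR matrix before any slicing or indexing.
// parts holds one partition id per constraint, followed by one per variable.
inline void validate_partition_inputs(const std::vector<int>& parts,
                                      int nb_parts,
                                      int nb_cstr,
                                      int nb_vars,
                                      const std::vector<int>& row_offsets,
                                      const std::vector<int>& col_indices)
{
  expect(nb_parts > 0, "nb_parts must be > 0");
  expect(nb_cstr >= 0 && nb_vars >= 0, "nb_cstr/nb_vars must be non-negative");

  const auto expected_parts =
    static_cast<std::size_t>(nb_cstr) + static_cast<std::size_t>(nb_vars);
  expect(parts.size() == expected_parts, "parts size must be nb_cstr + nb_vars");
  for (int p : parts) { expect(p >= 0 && p < nb_parts, "partition id out of range"); }

  // CSR shape: nb_cstr + 1 offsets, starting at 0, non-decreasing, ending at nnz.
  expect(row_offsets.size() == static_cast<std::size_t>(nb_cstr) + 1,
         "row_offsets size must be nb_cstr + 1");
  expect(row_offsets.front() == 0, "row_offsets must start at 0");
  for (std::size_t r = 0; r + 1 < row_offsets.size(); ++r) {
    expect(row_offsets[r] <= row_offsets[r + 1], "row_offsets must be non-decreasing");
  }
  expect(static_cast<std::size_t>(row_offsets.back()) == col_indices.size(),
         "last row offset must equal nnz");
  for (int c : col_indices) { expect(c >= 0 && c < nb_vars, "column index out of range"); }
}
```

Running all checks at function entry means the later slicing of `parts` and the per-row loops can assume well-formed input.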

Comment thread cpp/src/pdlp/pdlp.cu
Comment on lines +387 to +393
pdlp_solver_t<i_t, f_t>::pdlp_solver_t(
problem_t<i_t, f_t>& placeholder_problem,
cuopt::linear_programming::io::mps_data_model_t<i_t, f_t> const& mps,
pdlp_solver_settings_t<i_t, f_t> const& settings)
// Makes all inner feilds of master 0 size
: pdlp_solver_t(placeholder_problem, settings, /*is_legacy_batch_mode=*/false)
{

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Reject initial-state options before delegating to the shape-0 base ctor.

Line 392 delegates into the regular constructor before the distributed path knows the real dimensions. That constructor eagerly handles initial_primal_solution, initial_dual_solution, and warm-start data using primal_size_h_/dual_size_h_ from the placeholder, and set_initial_primal_solution() does size() % primal_size_h_. Any distributed solve with initial iterates can therefore divide by zero or copy into zero-length buffers before the shards exist. Please either reject those options up front for distributed PDLP or defer applying them until after shard construction.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/pdlp/pdlp.cu` around lines 387 - 393, The distributed constructor
pdlp_solver_t( problem_t<i_t, f_t>& placeholder_problem, ... ) currently
delegates to the regular ctor before shard sizes exist, causing functions that
use primal_size_h_/dual_size_h_ (e.g., set_initial_primal_solution, handling of
initial_primal_solution and initial_dual_solution and warm-start data) to
operate on zero-length buffers; update this constructor to either (a) validate
and reject any initial-state options (initial_primal_solution,
initial_dual_solution, warm-start) up front and return an error, or (b) defer
all logic that applies initial iterates (calls to set_initial_primal_solution /
set_initial_dual_solution and warm-start handling) until after shard
construction when primal_size_h_ and dual_size_h_ are set, ensuring no
modulo/divide-by-zero or zero-length copies occur.
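
A rough sketch of option (a), rejecting initial-state options up front, is shown below in standalone C++. The struct `initial_state_t` and both function names are hypothetical and simplified; the actual settings type and its field names may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical, simplified view of the initial-state options; field names are
// illustrative only and do not mirror pdlp_solver_settings_t.
struct initial_state_t {
  std::vector<double> initial_primal;
  std::vector<double> initial_dual;
  bool has_warm_start = false;
};

// Option (a): fail fast before any constructor touches zero-sized buffers.
inline void reject_initial_state_for_distributed(const initial_state_t& s)
{
  if (!s.initial_primal.empty() || !s.initial_dual.empty() || s.has_warm_start) {
    throw std::invalid_argument(
      "distributed PDLP does not accept initial iterates or warm-start data");
  }
}

// The hazard being guarded: size % primal_size is undefined when primal_size == 0.
inline bool is_valid_batch_size(std::size_t solution_size, std::size_t primal_size)
{
  if (primal_size == 0) { return false; }  // never evaluate the modulo on a placeholder
  return solution_size % primal_size == 0;
}
```

Calling the rejection check before delegating to the base constructor keeps the shape-0 placeholder from ever seeing user-supplied iterates.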

Comment thread cpp/src/pdlp/solve.cu
Comment on lines +759 to +760
detail::pdlp_graph_disabled_flag().store(settings.hyper_params.pdlp_disable_graph,
std::memory_order_relaxed);

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Scope the graph-disable switch per solve instead of mutating the process-global flag.

Both entrypoints write directly into detail::pdlp_graph_disabled_flag(). ping_pong_graph_t::run() reads that same static flag for every solver instance, so two solves running with different pdlp_disable_graph values can flip each other's execution mode mid-run.

Also applies to: 2159-2160

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/pdlp/solve.cu` around lines 759 - 760, The global flag
detail::pdlp_graph_disabled_flag() is being mutated per-solve causing races;
instead make the graph-disable decision local to each solver instance and avoid
writing the process-global flag from solve entrypoints. Change callers that
currently store(settings.hyper_params.pdlp_disable_graph, ...) to pass the
pdlp_disable_graph boolean into the solver instance (or ctor) and have
ping_pong_graph_t::run() and related graph code read that instance-level flag
rather than detail::pdlp_graph_disabled_flag(); remove writes to the global flag
in solve functions so concurrent solves do not flip each other’s mode.
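
One way to scope the switch per solve, sketched in standalone C++ with hypothetical stand-in types (`graph_runner_t` and `solver_t` are not the cuOpt classes): the flag is passed through the constructor and stored on the instance, so no shared state is written.

```cpp
#include <cassert>

// Hypothetical stand-in for the graph runner: the disable flag is fixed at
// construction and owned by the instance instead of a process-global atomic.
struct graph_runner_t {
  explicit graph_runner_t(bool disable_graph) : graph_disabled_(disable_graph) {}

  // Reports which execution mode run() would take for this instance.
  bool uses_graph() const { return !graph_disabled_; }

 private:
  const bool graph_disabled_;
};

// Hypothetical stand-in for the solver: forwards its own setting to its runner.
struct solver_t {
  explicit solver_t(bool pdlp_disable_graph) : runner_(pdlp_disable_graph) {}

  bool uses_graph() const { return runner_.uses_graph(); }

 private:
  graph_runner_t runner_;
};
```

With this shape, two solvers built with different `pdlp_disable_graph` values keep their own execution mode regardless of construction order or concurrency.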

Comment thread cpp/src/pdlp/solve.cu
Comment on lines +2129 to +2134
cuopt_expects(
settings.hyper_params.use_distributed_pdlp,
error_type_t::ValidationError,
"solve_lp from mps_data_model: settings.hyper_params.use_distributed_pdlp must be true");
return solve_lp_distributed_from_mps(
handle_ptr, mps_data_model, settings, problem_checking, use_pdlp_solver_mode);

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve the existing non-distributed direct-MPS entrypoint.

This unchanged overload now hard-fails unless use_distributed_pdlp is enabled, which removes the previous single-GPU/direct-MPS route behind the same public signature. That is a breaking regression for callers that build an mps_data_model_t but do not want distributed PDLP.

Suggested fix
 optimization_problem_solution_t<i_t, f_t> solve_lp(
   raft::handle_t const* handle_ptr,
   const cuopt::linear_programming::io::mps_data_model_t<i_t, f_t>& mps_data_model,
   pdlp_solver_settings_t<i_t, f_t> const& settings,
   bool problem_checking,
   bool use_pdlp_solver_mode)
 {
-  cuopt_expects(
-    settings.hyper_params.use_distributed_pdlp,
-    error_type_t::ValidationError,
-    "solve_lp from mps_data_model: settings.hyper_params.use_distributed_pdlp must be true");
-  return solve_lp_distributed_from_mps(
-    handle_ptr, mps_data_model, settings, problem_checking, use_pdlp_solver_mode);
+  if (settings.hyper_params.use_distributed_pdlp) {
+    return solve_lp_distributed_from_mps(
+      handle_ptr, mps_data_model, settings, problem_checking, use_pdlp_solver_mode);
+  }
+  auto op_problem = mps_data_model_to_optimization_problem(handle_ptr, mps_data_model);
+  return solve_lp(op_problem, settings, problem_checking, use_pdlp_solver_mode, false);
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/pdlp/solve.cu` around lines 2129 - 2134, The current overload
hard-fails via cuopt_expects when settings.hyper_params.use_distributed_pdlp is
false and otherwise always forwards to solve_lp_distributed_from_mps, removing
the original single-GPU/direct-MPS path; restore the prior behavior by replacing
the hard-fail with a branch: if settings.hyper_params.use_distributed_pdlp is
true, call solve_lp_distributed_from_mps(handle_ptr, mps_data_model, settings,
problem_checking, use_pdlp_solver_mode); otherwise convert the model with
mps_data_model_to_optimization_problem(handle_ptr, mps_data_model) and call the
non-distributed solve_lp overload, as in the suggested fix, so both paths are
supported. Keep or adjust cuopt_expects to validate only unsupported parameter
combinations if needed.

Comment thread cpp/src/pdlp/solve.cu
Comment on lines +2157 to +2205
pdlp_solver_settings_t<i_t, f_t> settings_resolved = settings;

detail::pdlp_graph_disabled_flag().store(settings_resolved.hyper_params.pdlp_disable_graph,
std::memory_order_relaxed);

if (settings_resolved.distributed_pdlp_num_gpus == -1) {
settings_resolved.distributed_pdlp_num_gpus = raft::device_setter::get_device_count();
CUOPT_LOG_INFO(
"solve_lp_distributed_from_mps: distributed_pdlp_num_gpus == -1, auto-detected "
"%d visible CUDA device(s)",
settings_resolved.distributed_pdlp_num_gpus);
}
// PDLP precision validations (mirror the checks in run_pdlp; distributed
// path only supports the default-precision, non-batch double config).
cuopt_expects(settings_resolved.pdlp_precision == pdlp_precision_t::DefaultPrecision,
error_type_t::ValidationError,
"Distributed PDLP only supports DefaultPrecision (double).");
cuopt_expects(!settings_resolved.inside_mip,
error_type_t::ValidationError,
"Distributed PDLP is not yet supported from inside MIP.");

init_logger_t log(settings_resolved.log_file, settings_resolved.log_to_console);
print_version_info();
init_handler(handle_ptr);

const i_t n_vars = static_cast<i_t>(mps_data_model.get_objective_coefficients().size());
const i_t n_cstr = static_cast<i_t>(mps_data_model.get_constraint_lower_bounds().size());
const i_t nnz = static_cast<i_t>(mps_data_model.get_constraint_matrix_values().size());
CUOPT_LOG_INFO(
"Solving a problem with %d constraints, %d variables (%d integers), and %d "
"nonzeros (distributed mps-direct path)",
n_cstr,
n_vars,
0,
nnz);

auto lp_timer = cuopt::timer_t(settings_resolved.time_limit);

// Shape-0 placeholder: needed to build an empty pdlp_solver
cuopt::linear_programming::optimization_problem_t<i_t, f_t> placeholder_op(handle_ptr);
{
std::vector<i_t> empty_offsets = {0};
placeholder_op.set_csr_constraint_matrix(
nullptr, 0, nullptr, 0, empty_offsets.data(), static_cast<i_t>(empty_offsets.size()));
}
detail::problem_t<i_t, f_t> placeholder_problem(placeholder_op);

detail::pdlp_solver_t<i_t, f_t> solver(placeholder_problem, mps_data_model, settings_resolved);

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate the method and apply the preset-mode mapping in the distributed path.

solve_lp_distributed_from_mps() always constructs detail::pdlp_solver_t, but it never checks settings.method and never calls set_pdlp_solver_mode(settings_resolved) when use_pdlp_solver_mode is true. A caller can request Barrier/Concurrent or a preset PDLP mode and silently get default-parameter PDLP instead.

Suggested fix
   pdlp_solver_settings_t<i_t, f_t> settings_resolved = settings;
+  cuopt_expects(settings_resolved.method == method_t::PDLP,
+                error_type_t::ValidationError,
+                "Distributed MPS solve currently supports only method_t::PDLP");
+  if (use_pdlp_solver_mode) { set_pdlp_solver_mode(settings_resolved); }
 
   detail::pdlp_graph_disabled_flag().store(settings_resolved.hyper_params.pdlp_disable_graph,
                                            std::memory_order_relaxed);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/pdlp/solve.cu` around lines 2157 - 2205,
solve_lp_distributed_from_mps builds detail::pdlp_solver_t from
settings_resolved but never validates settings_resolved.method and never calls
set_pdlp_solver_mode, so a requested Barrier/Concurrent method or PDLP preset is
silently ignored; fix by checking, before the solver is constructed, that
settings_resolved.method is method_t::PDLP (fail with a ValidationError
otherwise) and, when the use_pdlp_solver_mode argument is true, calling
set_pdlp_solver_mode(settings_resolved), so the subsequent
detail::pdlp_solver_t(placeholder_problem, mps_data_model, settings_resolved) is
constructed with the intended PDLP mode.

Comment on lines +188 to +191
pdlp_solver_settings_t<int, double> dist_settings = base_settings;
dist_settings.hyper_params.use_distributed_pdlp = true;
dist_settings.distributed_pdlp_num_gpus = -1;
auto dist = solve_lp(&handle, problem, dist_settings);

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require at least 2 GPUs here or skip the test.

With distributed_pdlp_num_gpus = -1 and the explicit “1 GPU is fine” behavior, this helper can pass without ever exercising the multi-GPU/NCCL path the PR is adding. That leaves the highest-risk path effectively untested in single-GPU CI.

Suggested change
   pdlp_solver_settings_t<int, double> dist_settings = base_settings;
   dist_settings.hyper_params.use_distributed_pdlp   = true;
-  dist_settings.distributed_pdlp_num_gpus           = -1;
+  int device_count                                  = 0;
+  RAFT_CUDA_TRY(cudaGetDeviceCount(&device_count));
+  if (device_count < 2) {
+    GTEST_SKIP() << "distributed parity requires at least 2 GPUs";
+  }
+  dist_settings.distributed_pdlp_num_gpus = 2;
   auto dist                                         = solve_lp(&handle, problem, dist_settings);
As per coding guidelines, "ensure new distributed paths have correctness assertions (not just “runs”)."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/linear_programming/pdlp_test.cu` around lines 188 - 191, The test
currently sets distributed_pdlp_num_gpus = -1 which lets a single-GPU run bypass
the multi-GPU/NCCL path; change the test to first query the available GPU count
and if fewer than 2 GPUs are present skip the test, otherwise set
pdlp_solver_settings_t::distributed_pdlp_num_gpus to at least 2 (e.g., max(2,
available_gpus)) before calling solve_lp(&handle, problem, dist_settings) so the
distributed PDLP path is actually exercised (use pdlp_solver_settings_t,
dist_settings, distributed_pdlp_num_gpus and solve_lp as the loci to modify).

Comment on lines +248 to +252
TEST(pdlp_class, distributed_parity_square41)
{
const raft::handle_t handle{};
expect_distributed_matches_base(handle, "linear_programming/neos3/neos3.mps");
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Point the square41 test at square41, not neos3.

This test is named distributed_parity_square41 but currently loads linear_programming/neos3/neos3.mps. That means the intended square41 regression is missing, and it also reintroduces an instance this file already treats as TODO/problematic later on.

Suggested change
 TEST(pdlp_class, distributed_parity_square41)
 {
   const raft::handle_t handle{};
-  expect_distributed_matches_base(handle, "linear_programming/neos3/neos3.mps");
+  expect_distributed_matches_base(handle, "linear_programming/square41/square41.mps");
 }
As per coding guidelines, when a bug fix lands, a regression test should cover the specific case, and the referenced PDLP dataset docs explicitly list `square41` as downloaded test data.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/linear_programming/pdlp_test.cu` around lines 248 - 252, The test
pdlp_class::distributed_parity_square41 is loading the wrong dataset; change the
argument to expect_distributed_matches_base in that test so it points to
"linear_programming/square41/square41.mps" instead of
"linear_programming/neos3/neos3.mps" so the regression covers the intended
square41 case (update the call site in the distributed_parity_square41 test that
invokes expect_distributed_matches_base).

Labels

do not merge Do not merge if this flag is set

Development

Successfully merging this pull request may close these issues.

[FEA] Multi GPU PDLP

1 participant