CuD-PDLP by Bubullzz · Pull Request #1391 · NVIDIA/cuopt

Bubullzz · 2026-06-04T15:24:15Z

Not review ready
Not merge ready

Sharing this so the team can take a look, but it definitely needs a big cleanup.
closes #891

…he cycle seems to be fixed, cuopt compiles

…olver !!!

+ style too

…k on main

compiles and runs

… but works

…t was single

…rather than total size hehehehe

…e lose a lot of time on actual partitioning and data movements. Everything seems to be working

…nt transit on master device !

… in the end

copy-pr-bot · 2026-06-04T15:24:19Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


coderabbitai · 2026-06-04T15:50:48Z

📝 Walkthrough

This PR adds comprehensive distributed multi-GPU support to the PDLP linear programming solver, enabling it to handle large-scale problems that exceed single-GPU memory. The implementation partitions constraint-variable bipartite graphs (via METIS or KaMinPar), distributes subproblems to per-GPU solver shards coordinated by NCCL, and reimplements key linear algebra operations (SpMV, scaling, convergence checking) to execute in a distributed manner.

Changes

Distributed Multi-GPU PDLP

| Layer / File(s) | Summary |
| --- | --- |
| Build system & NCCL/METIS/KaMinPar integration `cpp/CMakeLists.txt`, `cpp/cmake/thirdparty/get_kaminpar.cmake`, `cpp/src/pdlp/CMakeLists.txt` | CMake discovers NCCL and METIS libraries from environment, creates imported targets, locates/builds KaMinPar via CPM, and links all three into the cuopt shared library alongside new distributed PDLP source files. |
| Configuration parameters & CLI support `cpp/include/cuopt/linear_programming/constants.h`, `cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp`, `cpp/src/math_optimization/solver_settings.cu`, `cpp/cuopt_cli.cpp` | Adds configuration macros and solver settings fields for distributed PDLP (GPU count, partition files, partitioner selection). CLI conditionally routes LP solving through distributed entry point when enabled and auto-detects GPU count for RMM provisioning. |
| Graph partitioner interface & implementations `cpp/src/pdlp/distributed_pdlp/partitioner.hpp`, `cpp/src/pdlp/distributed_pdlp/partitioner.cu`, `cpp/src/pdlp/distributed_pdlp/metis_partitioner.hpp`, `cpp/src/pdlp/distributed_pdlp/metis_partitioner.cu`, `cpp/src/pdlp/distributed_pdlp/kaminpar_partitioner.hpp`, `cpp/src/pdlp/distributed_pdlp/kaminpar_partitioner.cpp` | Abstract partitioner interface with dummy/METIS/KaMinPar implementations to split bipartite constraint-variable graphs into k-way balanced partitions; includes factory and validation routines. |
| Rank data & partition loading `cpp/src/pdlp/distributed_pdlp/rank_data.hpp`, `cpp/src/pdlp/distributed_pdlp/partition_loader.hpp`, `cpp/src/pdlp/distributed_pdlp/partition_loader.cu` | Per-rank metadata structures holding ownership ranges, local CSR matrices, global↔local index maps, and per-peer halo communication plans; file-based partition I/O for reproducibility. |
| Shard infrastructure & NCCL communication `cpp/src/pdlp/distributed_pdlp/shard.hpp`, `cpp/src/pdlp/distributed_pdlp/shard.cu` | Non-copyable per-GPU shard owning local optimization problem, NCCL communicator, scaled/unscaled buffers, and pre-staged GPU halo-exchange buffers for constraint/variable synchronization. |
| Multi-GPU engine core orchestration `cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.hpp`, `cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.cu` | Orchestrates computation across shards via halo exchange, NCCL all-reduce, distributed L2 norm, power-iteration max-singular-value estimation, and scaling (bound/objective rescaling, Ruiz inf-norm, Pock–Chambolle); coordinates CUDA graph fork/join for reflected operations. |
| Distributed scaling strategy refactoring `cpp/src/pdlp/initial_scaling_strategy/initial_scaling.cuh`, `cpp/src/pdlp/initial_scaling_strategy/initial_scaling.cu` | Splits Ruiz and Pock–Chambolle scaling into composable local-compute and cumulative-apply stages; adds setters for externally-injected scaling factors; makes `compute_scaling_vectors` public for distributed usage. |
| PDHG multi-GPU wiring & graph capture `cpp/src/pdlp/pdhg.hpp`, `cpp/src/pdlp/pdhg.cu` | Adds multi-GPU state (engine pointer, is_multi_gpu flag), per-shard SpMV helpers (`spmv_A_into`/`spmv_At_into`), reflected projection transform wrappers, and CUDA graph fork/join orchestration during reflected primal/dual updates. |
| PDLP distributed constructor & initialization `cpp/src/pdlp/pdlp.cuh`, `cpp/src/pdlp/pdlp.cu` | New constructor ingests MPS model directly, partitions problem, constructs multi-GPU engine with per-shard solvers, initializes distributed scaling and convergence norms via all-reduce, wires multi-GPU execution into master solver. |
| PDLP solver loop multi-GPU integration `cpp/src/pdlp/pdlp.cu` | Adapts solver main loop initialization (gated for non-distributed only), termination checks, fixed-error computation, restart strategy, and Halpern updates to execute per-shard and synchronize state via multi-GPU engine when present. |
| Distributed convergence information computation `cpp/src/pdlp/termination_strategy/convergence_information.hpp`, `cpp/src/pdlp/termination_strategy/convergence_information.cu`, `cpp/src/pdlp/termination_strategy/termination_strategy.hpp`, `cpp/src/pdlp/termination_strategy/termination_strategy.cu` | Extends convergence information to compute primal/dual residuals and objectives in multi-GPU mode via halo exchange, per-shard partial computation, all-reduce aggregation, and per-shard reference norm handling; adds mutable accessors. |
| Adaptive step-size strategy multi-GPU support `cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.hpp`, `cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu` | Adds mutable getters for interaction/norm buffers and owned-prefix reduction parameters to enable per-shard movement computation in multi-GPU context. |
| Entry point routing & graph disabling `cpp/src/pdlp/solve.cuh`, `cpp/src/pdlp/solve.cu` | New `solve_lp_distributed_from_mps` entry point validates distributed preconditions, routes to MPS-based distributed PDLP constructor, applies graph-disabled flag, and handles solution post-processing and file output. |
| Tracing & diagnostic utilities `cpp/src/pdlp/utilities/mgpu_trace.cuh`, `cpp/src/pdlp/utilities/ping_pong_graph.cuh` | Lightweight environment-gated multi-GPU tracing macros (`MGPU_TRACE`) and atomic CUDA graph disabling control for debugging distributed execution. |
| cuSPARSE descriptor nnz binding fix `cpp/src/pdlp/cusparse_view.cu` | Fixes cuSPARSE descriptor nnz binding to use actual device buffer sizes instead of problem metadata fields, ensuring correctness for partitioned matrices. |
| Saddle point size relaxation for distributed `cpp/src/pdlp/saddle_point.cu` | Relaxes saddle point constructor precondition to allow zero-sized primal/dual dimensions needed for master problem construction in distributed context. |
| Distributed PDLP tests `cpp/tests/linear_programming/pdlp_test.cu` | Adds partition export/import round-trip test and distributed-vs-single-GPU parity tests verifying matching termination status, objectives, step counts (within 15%), and solution vectors across AFIRO, square41, and a2864 instances. |
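
The "distributed L2 norm" mentioned for the multi-GPU engine can be illustrated with a host-only sketch. This is not the PR's code: `shard_values_t`, `local_sum_of_squares`, `all_reduce_sum`, and `distributed_l2_norm` are hypothetical names, and the all-reduce is simulated with a plain sum where the real engine would presumably use NCCL. The point it shows is that each shard reduces only the entries it owns, and the square root is taken once, after the global sum.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the slice of a global vector owned by one shard.
using shard_values_t = std::vector<double>;

// Each shard reduces only its owned entries. Halo copies of entries owned by
// other shards must be excluded, or they would be counted more than once.
double local_sum_of_squares(const shard_values_t& owned)
{
  return std::inner_product(owned.begin(), owned.end(), owned.begin(), 0.0);
}

// Stand-in for a sum all-reduce: every rank would receive this same total.
double all_reduce_sum(const std::vector<double>& partials)
{
  return std::accumulate(partials.begin(), partials.end(), 0.0);
}

// Global norm = sqrt of the reduced sum of squares (not the sum of local norms).
double distributed_l2_norm(const std::vector<shard_values_t>& shards)
{
  std::vector<double> partials;
  partials.reserve(shards.size());
  for (const auto& shard : shards) { partials.push_back(local_sum_of_squares(shard)); }
  return std::sqrt(all_reduce_sum(partials));
}
```

Reducing the squared partial sums, rather than the per-shard norms, is what makes the result independent of how the vector is partitioned.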

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

feature request, improvement

Suggested reviewers

rgsl888prabhu
mlubin
bdice

coderabbitai

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

cpp/src/pdlp/solve.cu (1)

769-784: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject distributed problem_t calls before any early return.

The new guard sits below the zero-constraint return and the FP32 fallback. With use_distributed_pdlp=true plus SinglePrecision, this path returns run_pdlp_solver_in_fp32(...) instead of raising the intended validation error, so an unsupported distributed configuration silently runs the single-GPU solver.

Suggested fix

 static optimization_problem_solution_t<i_t, f_t> run_pdlp_solver(
   detail::problem_t<i_t, f_t>& problem,
   pdlp_solver_settings_t<i_t, f_t> const& settings,
   const timer_t& timer,
   bool is_batch_mode)
 {
+  cuopt_expects(!settings.hyper_params.use_distributed_pdlp,
+                error_type_t::ValidationError,
+                "Distributed PDLP must be entered via solve_lp(mps_data_model, ...) "
+                "so the master GPU never materializes the full problem. Call sites "
+                "with a problem_t cannot dispatch to distributed mode.");
+
   detail::pdlp_graph_disabled_flag().store(settings.hyper_params.pdlp_disable_graph,
                                            std::memory_order_relaxed);
 
   if (problem.n_constraints == 0) {
     ...
   }
 #if PDLP_INSTANTIATE_FLOAT || CUOPT_INSTANTIATE_FLOAT
   if constexpr (std::is_same_v<f_t, double>) {
     if (settings.pdlp_precision == pdlp_precision_t::SinglePrecision) {
       return run_pdlp_solver_in_fp32(problem, settings, timer, is_batch_mode);
     }
   }
 #endif
-  cuopt_expects(!settings.hyper_params.use_distributed_pdlp, ...);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/pdlp/solve.cu` around lines 769 - 784, The distributed-mode
validation (cuopt_expects(!settings.hyper_params.use_distributed_pdlp, ...))
must be performed before any early returns so a distributed call cannot
accidentally take the FP32 fallback or zero-constraint path; move or duplicate
that check to occur before the SinglePrecision/FP32 branch and before the
zero-constraint return so that when settings.hyper_params.use_distributed_pdlp
is true (for problem_t inputs) the function immediately raises the
ValidationError rather than calling run_pdlp_solver_in_fp32 or returning early.
Ensure the check references the same validation message and
error_type_t::ValidationError used currently.
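
The ordering problem described above can be reduced to a small standalone sketch. Everything here is hypothetical and simplified (`settings_t`, `path_t`, and `run_solver` are illustrative names, and a standard exception stands in for `cuopt_expects`); it only shows why the rejection has to come before the zero-constraint and FP32 early returns.

```cpp
#include <stdexcept>

// Hypothetical, simplified stand-ins for the solver settings and result.
struct settings_t {
  bool use_distributed  = false;
  bool single_precision = false;
};

enum class path_t { trivial, fp32, fp64 };

path_t run_solver(int n_constraints, const settings_t& settings)
{
  // Reject the unsupported configuration first, so that neither early
  // return below can silently bypass the check.
  if (settings.use_distributed) {
    throw std::invalid_argument("distributed mode cannot be entered from this call site");
  }
  if (n_constraints == 0) { return path_t::trivial; }      // early return 1
  if (settings.single_precision) { return path_t::fp32; }  // early return 2
  return path_t::fp64;
}
```

With the guard at the top, a distributed request combined with single precision or an empty problem raises the error instead of quietly running a different solver.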

cpp/src/pdlp/pdlp.cu (1)

3063-3079: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

The distributed average path is still unsafe in release builds.

When multi_gpu_engine is present and never_restart_to_average is false, Line 3071 uses plain assert(false). In release builds that disappears, and the subsequent raft::copy writes primal_size_h_/dual_size_h_ elements into unscaled_*_avg_solution_, which were never resized for the distributed ctor. That turns this TODO into an invalid device-copy / wrong-result path instead of a clean runtime rejection.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/pdlp/pdlp.cu` around lines 3063 - 3079, The path that handles
multi-GPU (multi_gpu_engine) uses assert(false) which vanishes in release builds
and leads to invalid device copies into unscaled_primal_avg_solution_ /
unscaled_dual_avg_solution_; fix by replacing the assert with a deterministic
runtime guard: either resize/allocate unscaled_primal_avg_solution_ and
unscaled_dual_avg_solution_ to primal_size_h_ and dual_size_h_ (and
synchronize/validate device pointers) before calling raft::copy from
pdhg_solver_.get_primal_solution() / get_dual_solution(), or explicitly fail
early by logging and throwing a runtime_error when multi_gpu_engine is true so
the copy is never attempted; update the branch around
internal_solver_iterations_ <= 1 where multi_gpu_engine is checked to implement
one of these safe behaviors.
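
As a rough illustration of the "fail early" option, the sketch below replaces a debug-only `assert(false)` with checks that remain active when `NDEBUG` is defined. It is host-only and hypothetical: `copy_average_checked` and its parameters are illustrative names, and `std::copy` stands in for the device copy.

```cpp
#include <algorithm>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the copy into the average-solution buffer.
void copy_average_checked(const std::vector<double>& src,
                          std::vector<double>& dst,
                          bool multi_gpu)
{
  // Unlike assert(false), this check is still present in release builds.
  if (multi_gpu) {
    throw std::runtime_error("restart-to-average is not supported in distributed mode");
  }
  // Never copy more elements than the destination was sized for.
  if (dst.size() < src.size()) { throw std::length_error("destination buffer is too small"); }
  std::copy(src.begin(), src.end(), dst.begin());
}
```

The second check is an extra safety net beyond what the review asks for; it covers the case where the buffer was never resized for some other reason.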

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/cuopt_cli.cpp`:
- Around line 180-184: When lp_settings.hyper_params.use_distributed_pdlp is
true, guard the distributed PDLP call by checking that handle_ptr is non-null
before invoking cuopt::linear_programming::solve_lp(handle_ptr.get(), ...); if
handle_ptr is null, fail fast with a clear error (e.g., log and exit or throw)
rather than calling the distributed overload; update the branch that currently
chooses between solve_lp(handle_ptr.get(), mps_data_model, lp_settings) and
solve_lp(problem_interface.get(), lp_settings) to validate handle_ptr first and
only call the distributed overload when handle_ptr is valid.
- Around line 439-447: The code currently computes requested_gpus and then uses
std::min(...) to compute provisioned_gpus and
memory_resources.reserve(provisioned_gpus) without validating requested_gpus;
add explicit validation after computing requested_gpus (and after remapping -1
when use_distributed_pdlp is true) to ensure requested_gpus > 0 and that
raft::device_setter::get_device_count() > 0 before calling std::min or reserve.
If either value is non-positive, return/log an error or throw an exception
(consistent with surrounding error handling) referencing the parameters obtained
via settings.get_parameter<int>(CUOPT_NUM_GPUS) and
settings.get_parameter<int>(CUOPT_DISTRIBUTED_PDLP_NUM_GPUS) so the code never
calls memory_resources.reserve with a non-positive size.

In `@cpp/src/pdlp/cusparse_view.cu`:
- Around line 501-511: The mixed-precision branch still sizes and recreates FP32
matrices using op_problem_scaled.nnz which can differ per shard; update that
block to use the shard-local nnz values (e.g. static_cast<int64_t>(A_.size())
and static_cast<int64_t>(A_T_.size())) when allocating/sizing A_mixed_ and
A_T_mixed_ and when copying/transposing data for A_T.create / A.create so you
don't overrun A_T_ or leave stale nnz metadata; ensure any metadata fields set
during the FP32 recreate follow the shard-local sizes and that all
transforms/read ranges use those local sizes (A_, A_T_, A_mixed_, A_T_mixed_).

In `@cpp/src/pdlp/distributed_pdlp/partition_loader.cu`:
- Around line 77-87: Validate partition and CSR metadata before any
slicing/indexing: check that parts.size() >= nb_cstr + nb_vars before creating
cstr_parts/var_parts, ensure all entries in parts are within [0, nb_parts)
before using them to index rank_data_t<i_t,f_t>, and verify CSR arrays
(offsets/indices) have expected lengths (e.g., offsets.size() >= rows+1 and
indices.size() == nnz) before dereferencing in functions that build/iterate the
CSR (referencing variables parts, nb_cstr, nb_vars, rank_data_t, and the CSR
offset/index containers); use cuopt_expects (or the existing error path) to fail
early with clear messages when any check fails.

In `@cpp/src/pdlp/pdlp.cu`:
- Around line 821-825: The distributed gather of the current iterate is missing
on several return paths so master buffers can be stale; call the multi-GPU
gather before any return that serializes the current solution. Specifically,
ensure pdhg_solver_.get_mgpu_engine() and its method
gather_potential_next_solutions_to_master(pdhg_solver_,
current_termination_strategy_.get_convergence_information().get_reduced_cost())
is invoked centrally before any code that calls
fill_return_problem_solution(...), and add the same centralized gather call on
the other identified return sites (including the ConcurrentLimit and
PrimalFeasible/infeasibility exits referenced around lines ~859-863 and
~1541-1545) so the master full-size solution/reduced-cost buffers are populated
on every distributed return path.
- Around line 387-393: The distributed constructor pdlp_solver_t( problem_t<i_t,
f_t>& placeholder_problem, ... ) currently delegates to the regular ctor before
shard sizes exist, causing functions that use primal_size_h_/dual_size_h_ (e.g.,
set_initial_primal_solution, handling of initial_primal_solution and
initial_dual_solution and warm-start data) to operate on zero-length buffers;
update this constructor to either (a) validate and reject any initial-state
options (initial_primal_solution, initial_dual_solution, warm-start) up front
and return an error, or (b) defer all logic that applies initial iterates (calls
to set_initial_primal_solution / set_initial_dual_solution and warm-start
handling) until after shard construction when primal_size_h_ and dual_size_h_
are set, ensuring no modulo/divide-by-zero or zero-length copies occur.

In `@cpp/src/pdlp/solve.cu`:
- Around line 759-760: The global flag detail::pdlp_graph_disabled_flag() is
being mutated per-solve causing races; instead make the graph-disable decision
local to each solver instance and avoid writing the process-global flag from
solve entrypoints. Change callers that currently
store(settings.hyper_params.pdlp_disable_graph, ...) to pass the
pdlp_disable_graph boolean into the solver instance (or ctor) and have
ping_pong_graph_t::run() and related graph code read that instance-level flag
rather than detail::pdlp_graph_disabled_flag(); remove writes to the global flag
in solve functions so concurrent solves do not flip each other’s mode.
- Around line 2129-2134: The current overload erroneously hard-fails via
cuopt_expects when settings.hyper_params.use_distributed_pdlp is false and
always forwards to solve_lp_distributed_from_mps, removing the original
single-GPU/direct-MPS path; restore the prior behavior by replacing the
hard-fail with a branch: if settings.hyper_params.use_distributed_pdlp is true
call solve_lp_distributed_from_mps(handle_ptr, mps_data_model, settings,
problem_checking, use_pdlp_solver_mode) else call the non-distributed/MPS
entrypoint (the original direct-MPS function used previously—e.g.,
solve_lp_from_mps or the equivalent direct-MPS routine) so both paths are
supported, and keep or adjust cuopt_expects to validate only unsupported
parameter combinations if needed.
- Around line 2157-2205: solve_lp_distributed_from_mps builds
detail::pdlp_solver_t using settings_resolved but never applies settings.method
or calls set_pdlp_solver_mode, so requested PDLP modes/presets are ignored; fix
by checking settings_resolved.use_pdlp_solver_mode (and/or
settings_resolved.method) before constructing the solver and call
set_pdlp_solver_mode(settings_resolved) to map the preset/method into the solver
settings (or apply the mapping to settings_resolved) so the subsequent
detail::pdlp_solver_t(placeholder_problem, mps_data_model, settings_resolved) is
constructed with the intended PDLP mode.

In `@cpp/tests/linear_programming/pdlp_test.cu`:
- Around line 188-191: The test currently sets distributed_pdlp_num_gpus = -1
which lets a single-GPU run bypass the multi-GPU/NCCL path; change the test to
first query the available GPU count and if fewer than 2 GPUs are present skip
the test, otherwise set pdlp_solver_settings_t::distributed_pdlp_num_gpus to at
least 2 (e.g., max(2, available_gpus)) before calling solve_lp(&handle, problem,
dist_settings) so the distributed PDLP path is actually exercised (use
pdlp_solver_settings_t, dist_settings, distributed_pdlp_num_gpus and solve_lp as
the loci to modify).
- Around line 248-252: The test pdlp_class::distributed_parity_square41 is
loading the wrong dataset; change the argument to
expect_distributed_matches_base in that test so it points to
"linear_programming/square41/square41.mps" instead of
"linear_programming/neos3/neos3.mps" so the regression covers the intended
square41 case (update the call site in the distributed_parity_square41 test that
invokes expect_distributed_matches_base).

---

Outside diff comments:
In `@cpp/src/pdlp/pdlp.cu`:
- Around line 3063-3079: The path that handles multi-GPU (multi_gpu_engine) uses
assert(false) which vanishes in release builds and leads to invalid device
copies into unscaled_primal_avg_solution_ / unscaled_dual_avg_solution_; fix by
replacing the assert with a deterministic runtime guard: either resize/allocate
unscaled_primal_avg_solution_ and unscaled_dual_avg_solution_ to primal_size_h_
and dual_size_h_ (and synchronize/validate device pointers) before calling
raft::copy from pdhg_solver_.get_primal_solution() / get_dual_solution(), or
explicitly fail early by logging and throwing a runtime_error when
multi_gpu_engine is true so the copy is never attempted; update the branch
around internal_solver_iterations_ <= 1 where multi_gpu_engine is checked to
implement one of these safe behaviors.

In `@cpp/src/pdlp/solve.cu`:
- Around line 769-784: The distributed-mode validation
(cuopt_expects(!settings.hyper_params.use_distributed_pdlp, ...)) must be
performed before any early returns so a distributed call cannot accidentally
take the FP32 fallback or zero-constraint path; move or duplicate that check to
occur before the SinglePrecision/FP32 branch and before the zero-constraint
return so that when settings.hyper_params.use_distributed_pdlp is true (for
problem_t inputs) the function immediately raises the ValidationError rather
than calling run_pdlp_solver_in_fp32 or returning early. Ensure the check
references the same validation message and error_type_t::ValidationError used
currently.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7df2a4b9-585b-4517-afcb-1aa089ecb1c1

📥 Commits

Reviewing files that changed from the base of the PR and between d6d6f9e and 91b1ae5.

📒 Files selected for processing (41)

cpp/CMakeLists.txt
cpp/cmake/thirdparty/get_kaminpar.cmake
cpp/cuopt_cli.cpp
cpp/include/cuopt/linear_programming/constants.h
cpp/include/cuopt/linear_programming/pdlp/pdlp_hyper_params.cuh
cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp
cpp/src/math_optimization/solver_settings.cu
cpp/src/pdlp/CMakeLists.txt
cpp/src/pdlp/cusparse_view.cu
cpp/src/pdlp/distributed_pdlp/kaminpar_partitioner.cpp
cpp/src/pdlp/distributed_pdlp/kaminpar_partitioner.hpp
cpp/src/pdlp/distributed_pdlp/metis_partitioner.cu
cpp/src/pdlp/distributed_pdlp/metis_partitioner.hpp
cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.cu
cpp/src/pdlp/distributed_pdlp/multi_gpu_engine.hpp
cpp/src/pdlp/distributed_pdlp/partition_loader.cu
cpp/src/pdlp/distributed_pdlp/partition_loader.hpp
cpp/src/pdlp/distributed_pdlp/partitioner.cu
cpp/src/pdlp/distributed_pdlp/partitioner.hpp
cpp/src/pdlp/distributed_pdlp/rank_data.hpp
cpp/src/pdlp/distributed_pdlp/shard.cu
cpp/src/pdlp/distributed_pdlp/shard.hpp
cpp/src/pdlp/initial_scaling_strategy/initial_scaling.cu
cpp/src/pdlp/initial_scaling_strategy/initial_scaling.cuh
cpp/src/pdlp/pdhg.cu
cpp/src/pdlp/pdhg.hpp
cpp/src/pdlp/pdlp.cu
cpp/src/pdlp/pdlp.cuh
cpp/src/pdlp/restart_strategy/pdlp_restart_strategy.cu
cpp/src/pdlp/saddle_point.cu
cpp/src/pdlp/solve.cu
cpp/src/pdlp/solve.cuh
cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.cu
cpp/src/pdlp/step_size_strategy/adaptive_step_size_strategy.hpp
cpp/src/pdlp/termination_strategy/convergence_information.cu
cpp/src/pdlp/termination_strategy/convergence_information.hpp
cpp/src/pdlp/termination_strategy/termination_strategy.cu
cpp/src/pdlp/termination_strategy/termination_strategy.hpp
cpp/src/pdlp/utilities/mgpu_trace.cuh
cpp/src/pdlp/utilities/ping_pong_graph.cuh
cpp/tests/linear_programming/pdlp_test.cu

coderabbitai · 2026-06-04T15:50:52Z

+      if (lp_settings.hyper_params.use_distributed_pdlp) {
+        cuopt::linear_programming::solve_lp(handle_ptr.get(), mps_data_model, lp_settings);
+      } else {
+        cuopt::linear_programming::solve_lp(problem_interface.get(), lp_settings);
+      }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard distributed solve path against null GPU handle.

When distributed PDLP is enabled, this branch can pass a null handle_ptr if the selected memory backend is not GPU. Fail fast before calling the distributed overload.

Suggested fix

-      if (lp_settings.hyper_params.use_distributed_pdlp) {
+      if (lp_settings.hyper_params.use_distributed_pdlp) {
+        if (handle_ptr == nullptr) {
+          throw std::runtime_error("Distributed PDLP requires GPU memory backend.");
+        }
         cuopt::linear_programming::solve_lp(handle_ptr.get(), mps_data_model, lp_settings);
       } else {
         cuopt::linear_programming::solve_lp(problem_interface.get(), lp_settings);
       }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/cuopt_cli.cpp` around lines 180 - 184, When lp_settings.hyper_params.use_distributed_pdlp is true, guard the distributed PDLP call by checking that handle_ptr is non-null before invoking cuopt::linear_programming::solve_lp(handle_ptr.get(), ...); if handle_ptr is null, fail fast with a clear error (e.g., log and exit or throw) rather than calling the distributed overload; update the branch that currently chooses between solve_lp(handle_ptr.get(), mps_data_model, lp_settings) and solve_lp(problem_interface.get(), lp_settings) to validate handle_ptr first and only call the distributed overload when handle_ptr is valid.

coderabbitai · 2026-06-04T15:50:52Z

+    int requested_gpus              = use_distributed_pdlp
+                                        ? settings.get_parameter<int>(CUOPT_DISTRIBUTED_PDLP_NUM_GPUS)
+                                        : settings.get_parameter<int>(CUOPT_NUM_GPUS);
+    if (use_distributed_pdlp && requested_gpus == -1) {
+      requested_gpus = raft::device_setter::get_device_count();
+    }
+    const int provisioned_gpus = std::min(raft::device_setter::get_device_count(), requested_gpus);

-    memory_resources.reserve(std::min(raft::device_setter::get_device_count(), num_gpus));
-    for (int i = 0; i < std::min(raft::device_setter::get_device_count(), num_gpus); ++i) {
+    memory_resources.reserve(provisioned_gpus);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate GPU count before provisioning memory resources.

requested_gpus is used to size provisioning without enforcing a positive value (other than distributed -1 remap). Add explicit validation before std::min(...)/reserve(...) to avoid invalid allocation/setup paths.

Suggested fix

-    if (use_distributed_pdlp && requested_gpus == -1) {
-      requested_gpus = raft::device_setter::get_device_count();
-    }
-    const int provisioned_gpus = std::min(raft::device_setter::get_device_count(), requested_gpus);
+    const int visible_gpus = raft::device_setter::get_device_count();
+    if (use_distributed_pdlp && requested_gpus == -1) {
+      requested_gpus = visible_gpus;
+    }
+    if (requested_gpus <= 0) {
+      std::cerr << "Invalid GPU count: " << requested_gpus
+                << " (must be > 0, or -1 only with distributed PDLP)." << std::endl;
+      return 1;
+    }
+    const int provisioned_gpus = std::min(visible_gpus, requested_gpus);
+    if (provisioned_gpus <= 0) {
+      std::cerr << "No visible GPUs available for GPU backend." << std::endl;
+      return 1;
+    }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/cuopt_cli.cpp` around lines 439 - 447, The code currently computes requested_gpus and then uses std::min(...) to compute provisioned_gpus and memory_resources.reserve(provisioned_gpus) without validating requested_gpus; add explicit validation after computing requested_gpus (and after remapping -1 when use_distributed_pdlp is true) to ensure requested_gpus > 0 and that raft::device_setter::get_device_count() > 0 before calling std::min or reserve. If either value is non-positive, return/log an error or throw an exception (consistent with surrounding error handling) referencing the parameters obtained via settings.get_parameter<int>(CUOPT_NUM_GPUS) and settings.get_parameter<int>(CUOPT_DISTRIBUTED_PDLP_NUM_GPUS) so the code never calls memory_resources.reserve with a non-positive size.
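
The validation order suggested above can also be expressed as a small pure function, which makes it easy to test without a GPU. This is a sketch under assumptions: `resolve_provisioned_gpus` is a hypothetical helper, `visible_gpus` stands in for the result of the device-count query, and exceptions replace the CLI's `return 1` error path.

```cpp
#include <algorithm>
#include <stdexcept>

// Hypothetical helper: decides how many GPUs to provision memory resources for.
int resolve_provisioned_gpus(int requested_gpus, int visible_gpus, bool use_distributed)
{
  if (visible_gpus <= 0) { throw std::runtime_error("no visible GPUs available"); }
  // -1 means "use every visible GPU" and is only meaningful in distributed mode.
  if (use_distributed && requested_gpus == -1) { requested_gpus = visible_gpus; }
  if (requested_gpus <= 0) {
    throw std::invalid_argument("GPU count must be > 0 (or -1 with distributed PDLP)");
  }
  // Both operands are positive here, so the result is always at least 1.
  return std::min(visible_gpus, requested_gpus);
}
```

Keeping the remap, the range check, and the clamp in one place means the caller can pass the result straight to `reserve` without further checks.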

coderabbitai · 2026-06-04T15:50:52Z

+           static_cast<int64_t>(A_.size()),
           const_cast<i_t*>(op_problem_scaled.offsets.data()),
           const_cast<i_t*>(op_problem_scaled.variables.data()),
           const_cast<f_t*>(op_problem_scaled.coefficients.data()));

+  // A_T can have a different nnz than A in multi-GPU shards
+  // A is just what is needed to compute A_x for owned constraints
+  // A_T is just what is needed to compute A_T_y for owned variables
  A_T.create(op_problem_scaled.n_variables,
             op_problem_scaled.n_constraints,
-             op_problem_scaled.nnz,
+             static_cast<int64_t>(A_T_.size()),


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Use shard-local nnz in the mixed-precision setup too.

Now that this ctor allows A and A_T to have different nnz on a shard, the later mixed-precision block still sizes, transforms, and recreates the FP32 matrices with op_problem_scaled.nnz (Lines 777-804). On any distributed shard where those lengths diverge, that path can overrun A_T_ or build A_mixed_ / A_T_mixed_ with stale nnz metadata.

Suggested follow-up

-    A_float_.resize(op_problem_scaled.nnz, handle_ptr->get_stream());
-    A_T_float_.resize(op_problem_scaled.nnz, handle_ptr->get_stream());
+    A_float_.resize(A_.size(), handle_ptr->get_stream());
+    A_T_float_.resize(A_T_.size(), handle_ptr->get_stream());
-    RAFT_CUDA_TRY(cub::DeviceTransform::Transform(op_problem_scaled.coefficients.data(),
+    RAFT_CUDA_TRY(cub::DeviceTransform::Transform(A_.data(),
                                                   A_float_.data(),
-                                                  op_problem_scaled.nnz,
+                                                  A_.size(),
                                                   double_to_float_functor{},
                                                   handle_ptr->get_stream().value()));
     RAFT_CUDA_TRY(cub::DeviceTransform::Transform(A_T_.data(),
                                                   A_T_float_.data(),
-                                                  op_problem_scaled.nnz,
+                                                  A_T_.size(),
                                                   double_to_float_functor{},
                                                   handle_ptr->get_stream().value()));
     A_mixed_.create(op_problem_scaled.n_constraints,
                     op_problem_scaled.n_variables,
-                    op_problem_scaled.nnz,
+                    static_cast<int64_t>(A_.size()),
                     const_cast<i_t*>(op_problem_scaled.offsets.data()),
                     const_cast<i_t*>(op_problem_scaled.variables.data()),
                     A_float_.data());
     A_T_mixed_.create(op_problem_scaled.n_variables,
                       op_problem_scaled.n_constraints,
-                      op_problem_scaled.nnz,
+                      static_cast<int64_t>(A_T_.size()),
                       const_cast<i_t*>(A_T_offsets_.data()),
                       const_cast<i_t*>(A_T_indices_.data()),
                       A_T_float_.data());

As per coding guidelines, "Prevent invalid memory access (out-of-bounds, use-after-free, host/device confusion) in GPU code."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/pdlp/cusparse_view.cu` around lines 501 - 511, The mixed-precision branch still sizes and recreates FP32 matrices using op_problem_scaled.nnz which can differ per shard; update that block to use the shard-local nnz values (e.g. static_cast<int64_t>(A_.size()) and static_cast<int64_t>(A_T_.size())) when allocating/sizing A_mixed_ and A_T_mixed_ and when copying/transposing data for A_T.create / A.create so you don't overrun A_T_ or leave stale nnz metadata; ensure any metadata fields set during the FP32 recreate follow the shard-local sizes and that all transforms/read ranges use those local sizes (A_, A_T_, A_mixed_, A_T_mixed_).
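
The underlying principle, sizing every derived buffer from the buffer it is converted from rather than from a shared nnz field, can be shown on the host. This is only an analogy for the device code: `to_fp32` is a hypothetical helper and `std::transform` stands in for the CUB transform.

```cpp
#include <algorithm>
#include <vector>

// Converts one shard's FP64 coefficients to FP32. The output length comes from
// the input buffer itself, so A and A_T may hold different numbers of nonzeros
// without either conversion reading or writing out of range.
std::vector<float> to_fp32(const std::vector<double>& values)
{
  std::vector<float> out(values.size());
  std::transform(values.begin(), values.end(), out.begin(),
                 [](double v) { return static_cast<float>(v); });
  return out;
}
```

Calling it once for the shard's A values and once for its A_T values yields two FP32 buffers whose lengths track their own sources, which is the property the review asks the mixed-precision block to preserve.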

coderabbitai · 2026-06-04T15:50:52Z

+  cuopt_expects(A_values.size() == A_values_scaled.size(),
+                error_type_t::ValidationError,
+                "A_values and A_values_scaled must have the same length");
+  cuopt_expects(A_t_values.size() == A_t_values_scaled.size(),
+                error_type_t::ValidationError,
+                "A_t_values and A_t_values_scaled must have the same length");
+
+  std::vector<rank_data_t<i_t, f_t>> rank_data(nb_parts, rank_data_t<i_t, f_t>(nb_parts));
+  std::vector<i_t> cstr_parts(parts.begin(), parts.begin() + nb_cstr);
+  std::vector<i_t> var_parts(parts.begin() + nb_cstr, parts.begin() + nb_cstr + nb_vars);
+


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Validate partition metadata and CSR dimensions before indexing.

Line 85/Line 86 slice parts without checking size, Line 90 indexes rank_data using unvalidated part IDs, and Line 115/Line 157 dereference CSR offsets/indices without shape checks. A malformed partition file or inconsistent matrix metadata can cause out-of-bounds access/UB.

Proposed guard block (at function entry)

 std::vector<rank_data_t<i_t, f_t>> partition_loader_t<i_t, f_t>::create_rank_data_from_parts(
 @@
 {
+  cuopt_expects(nb_parts > 0, error_type_t::ValidationError, "nb_parts must be > 0");
+  cuopt_expects(nb_cstr >= 0 && nb_vars >= 0 && nnz >= 0,
+                error_type_t::ValidationError,
+                "nb_cstr/nb_vars/nnz must be non-negative");
+
+  const auto expected_parts =
+    static_cast<size_t>(nb_cstr) + static_cast<size_t>(nb_vars);
+  cuopt_expects(parts.size() == expected_parts,
+                error_type_t::ValidationError,
+                "parts size mismatch: expected nb_cstr + nb_vars");
+
+  for (auto p : parts) {
+    cuopt_expects(p >= 0 && p < nb_parts,
+                  error_type_t::ValidationError,
+                  "partition id out of range [0, nb_parts)");
+  }
+
+  cuopt_expects(A_row_offsets.size() == static_cast<size_t>(nb_cstr) + 1,
+                error_type_t::ValidationError,
+                "A_row_offsets size must be nb_cstr + 1");
+  cuopt_expects(A_t_row_offsets.size() == static_cast<size_t>(nb_vars) + 1,
+                error_type_t::ValidationError,
+                "A_t_row_offsets size must be nb_vars + 1");
+
+  cuopt_expects(A_col_indices.size() == static_cast<size_t>(nnz) &&
+                  A_values.size() == static_cast<size_t>(nnz) &&
+                  A_values_scaled.size() == static_cast<size_t>(nnz),
+                error_type_t::ValidationError,
+                "A buffers must match nnz");
+  cuopt_expects(A_t_col_indices.size() == static_cast<size_t>(nnz) &&
+                  A_t_values.size() == static_cast<size_t>(nnz) &&
+                  A_t_values_scaled.size() == static_cast<size_t>(nnz),
+                error_type_t::ValidationError,
+                "A_t buffers must match nnz");
 }

As per coding guidelines: “Validate input at library and server boundaries.”

Also applies to: 89-94, 115-121, 156-163

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/pdlp/distributed_pdlp/partition_loader.cu` around lines 77 - 87, Validate partition and CSR metadata before any slicing/indexing: check that parts.size() >= nb_cstr + nb_vars before creating cstr_parts/var_parts, ensure all entries in parts are within [0, nb_parts) before using them to index rank_data_t<i_t,f_t>, and verify CSR arrays (offsets/indices) have expected lengths (e.g., offsets.size() >= rows+1 and indices.size() == nnz) before dereferencing in functions that build/iterate the CSR (referencing variables parts, nb_cstr, nb_vars, rank_data_t, and the CSR offset/index containers); use cuopt_expects (or the existing error path) to fail early with clear messages when any check fails.
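
To make the checks above concrete outside of cuOpt, here is a minimal standalone sketch. `validate_partition_inputs` is a hypothetical helper, `std::invalid_argument` stands in for `cuopt_expects`, and `int` replaces `i_t`; none of this is the project's API. It validates the partition vector and one CSR matrix before anything is sliced or indexed.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: throws if partition metadata or CSR shapes are inconsistent.
inline void validate_partition_inputs(const std::vector<int>& parts,
                                      int nb_parts,
                                      int nb_cstr,
                                      int nb_vars,
                                      const std::vector<int>& row_offsets,
                                      const std::vector<int>& col_indices)
{
  auto require = [](bool ok, const char* msg) {
    if (!ok) { throw std::invalid_argument(msg); }
  };

  require(nb_parts > 0, "nb_parts must be > 0");
  require(nb_cstr >= 0 && nb_vars >= 0, "nb_cstr/nb_vars must be non-negative");

  // One partition id per constraint followed by one per variable.
  require(parts.size() == static_cast<std::size_t>(nb_cstr) + static_cast<std::size_t>(nb_vars),
          "parts size mismatch: expected nb_cstr + nb_vars");
  for (int p : parts) {
    require(p >= 0 && p < nb_parts, "partition id out of range [0, nb_parts)");
  }

  // CSR shape: nb_cstr + 1 monotone offsets spanning exactly col_indices.
  require(row_offsets.size() == static_cast<std::size_t>(nb_cstr) + 1,
          "row_offsets size must be nb_cstr + 1");
  require(row_offsets.front() == 0, "row_offsets must start at 0");
  for (std::size_t i = 1; i < row_offsets.size(); ++i) {
    require(row_offsets[i - 1] <= row_offsets[i], "row_offsets must be non-decreasing");
  }
  require(static_cast<std::size_t>(row_offsets.back()) == col_indices.size(),
          "row_offsets must span col_indices");
  for (int c : col_indices) {
    require(c >= 0 && c < nb_vars, "column index out of range [0, nb_vars)");
  }
}
```

Running all checks at function entry keeps the later slicing of `parts` and the per-row loops free of bounds logic, at the cost of one extra pass over the inputs.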

coderabbitai · 2026-06-04T15:50:52Z

+pdlp_solver_t<i_t, f_t>::pdlp_solver_t(
+  problem_t<i_t, f_t>& placeholder_problem,
+  cuopt::linear_programming::io::mps_data_model_t<i_t, f_t> const& mps,
+  pdlp_solver_settings_t<i_t, f_t> const& settings)
+  // Makes all inner feilds of master 0 size
+  : pdlp_solver_t(placeholder_problem, settings, /*is_legacy_batch_mode=*/false)
+{


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Reject initial-state options before delegating to the shape-0 base ctor.

Line 392 delegates into the regular constructor before the distributed path knows the real dimensions. That constructor eagerly handles initial_primal_solution, initial_dual_solution, and warm-start data using primal_size_h_/dual_size_h_ from the placeholder, and set_initial_primal_solution() does size() % primal_size_h_. Any distributed solve with initial iterates can therefore divide by zero or copy into zero-length buffers before the shards exist. Please either reject those options up front for distributed PDLP or defer applying them until after shard construction.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/pdlp/pdlp.cu` around lines 387 - 393, The distributed constructor pdlp_solver_t( problem_t<i_t, f_t>& placeholder_problem, ... ) currently delegates to the regular ctor before shard sizes exist, causing functions that use primal_size_h_/dual_size_h_ (e.g., set_initial_primal_solution, handling of initial_primal_solution and initial_dual_solution and warm-start data) to operate on zero-length buffers; update this constructor to either (a) validate and reject any initial-state options (initial_primal_solution, initial_dual_solution, warm-start) up front and return an error, or (b) defer all logic that applies initial iterates (calls to set_initial_primal_solution / set_initial_dual_solution and warm-start handling) until after shard construction when primal_size_h_ and dual_size_h_ are set, ensuring no modulo/divide-by-zero or zero-length copies occur.
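
Option (a) can be sketched with a guard that runs inside the delegation argument, so it executes before the delegated constructor does. This is an illustrative sketch under assumptions: `settings_sketch_t`, `solver_sketch_t`, `distributed_tag`, and `reject_initial_state` are hypothetical names, and the real `pdlp_solver_t` has many more members.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical, simplified settings: only the fields relevant to the finding.
struct settings_sketch_t {
  std::vector<double> initial_primal_solution;
  std::vector<double> initial_dual_solution;
  bool has_warm_start = false;
};

// Returns the settings unchanged when they are safe for a zero-sized placeholder.
inline const settings_sketch_t& reject_initial_state(const settings_sketch_t& s)
{
  if (!s.initial_primal_solution.empty() || !s.initial_dual_solution.empty() ||
      s.has_warm_start) {
    throw std::invalid_argument("initial iterates and warm start are not supported here");
  }
  return s;
}

class solver_sketch_t {
 public:
  struct distributed_tag {};

  // Regular constructor: relies on primal_size for a modulo check.
  solver_sketch_t(std::size_t primal_size, const settings_sketch_t& s) : primal_size_(primal_size)
  {
    if (!s.initial_primal_solution.empty()) {
      if (primal_size_ == 0 || s.initial_primal_solution.size() % primal_size_ != 0) {
        throw std::invalid_argument("initial primal size is not a multiple of primal_size");
      }
    }
  }

  // Distributed-style constructor: the guard is evaluated as a delegation argument,
  // so it fires before the regular constructor sees the zero-sized placeholder.
  solver_sketch_t(distributed_tag, const settings_sketch_t& s)
    : solver_sketch_t(std::size_t{0}, reject_initial_state(s))
  {
  }

  std::size_t primal_size() const { return primal_size_; }

 private:
  std::size_t primal_size_;
};
```

Option (b), deferring the initial iterates until the shards exist, would keep the feature available but needs the per-shard sizes, so it is not shown here.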

coderabbitai · 2026-06-04T15:50:52Z

+  detail::pdlp_graph_disabled_flag().store(settings.hyper_params.pdlp_disable_graph,
+                                           std::memory_order_relaxed);


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Scope the graph-disable switch per solve instead of mutating the process-global flag.

Both entrypoints write directly into detail::pdlp_graph_disabled_flag(). ping_pong_graph_t::run() reads that same static flag for every solver instance, so two solves running with different pdlp_disable_graph values can flip each other's execution mode mid-run.

Also applies to: 2159-2160

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/pdlp/solve.cu` around lines 759 - 760, The global flag detail::pdlp_graph_disabled_flag() is being mutated per-solve causing races; instead make the graph-disable decision local to each solver instance and avoid writing the process-global flag from solve entrypoints. Change callers that currently store(settings.hyper_params.pdlp_disable_graph, ...) to pass the pdlp_disable_graph boolean into the solver instance (or ctor) and have ping_pong_graph_t::run() and related graph code read that instance-level flag rather than detail::pdlp_graph_disabled_flag(); remove writes to the global flag in solve functions so concurrent solves do not flip each other’s mode.
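
The difference between the two designs can be shown in a few lines. This is a sketch only: `global_graph_disabled` and `graph_runner_sketch_t` are hypothetical stand-ins for `detail::pdlp_graph_disabled_flag()` and `ping_pong_graph_t`, whose real interfaces are not reproduced here.

```cpp
#include <atomic>
#include <cassert>

// Process-global flag: every solver instance reads the same value, so the last
// writer wins (the pattern the finding warns about).
inline std::atomic<bool>& global_graph_disabled()
{
  static std::atomic<bool> flag{false};
  return flag;
}

// Per-instance alternative: the decision is captured at construction and is
// never shared with other solves.
class graph_runner_sketch_t {
 public:
  explicit graph_runner_sketch_t(bool disable_graph) : disable_graph_(disable_graph) {}

  bool uses_graph() const { return !disable_graph_; }

 private:
  const bool disable_graph_;
};
```

With the per-instance version, two solves constructed with different `pdlp_disable_graph` values keep their own mode for their whole lifetime, and no atomic is needed on the read path.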

coderabbitai · 2026-06-04T15:50:52Z

+  cuopt_expects(
+    settings.hyper_params.use_distributed_pdlp,
+    error_type_t::ValidationError,
+    "solve_lp from mps_data_model: settings.hyper_params.use_distributed_pdlp must be true");
+  return solve_lp_distributed_from_mps(
+    handle_ptr, mps_data_model, settings, problem_checking, use_pdlp_solver_mode);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve the existing non-distributed direct-MPS entrypoint.

This unchanged overload now hard-fails unless use_distributed_pdlp is enabled, which removes the previous single-GPU/direct-MPS route behind the same public signature. That is a breaking regression for callers that build an mps_data_model_t but do not want distributed PDLP.

Suggested fix

 optimization_problem_solution_t<i_t, f_t> solve_lp(
   raft::handle_t const* handle_ptr,
   const cuopt::linear_programming::io::mps_data_model_t<i_t, f_t>& mps_data_model,
   pdlp_solver_settings_t<i_t, f_t> const& settings,
   bool problem_checking,
   bool use_pdlp_solver_mode)
 {
-  cuopt_expects(
-    settings.hyper_params.use_distributed_pdlp,
-    error_type_t::ValidationError,
-    "solve_lp from mps_data_model: settings.hyper_params.use_distributed_pdlp must be true");
-  return solve_lp_distributed_from_mps(
-    handle_ptr, mps_data_model, settings, problem_checking, use_pdlp_solver_mode);
+  if (settings.hyper_params.use_distributed_pdlp) {
+    return solve_lp_distributed_from_mps(
+      handle_ptr, mps_data_model, settings, problem_checking, use_pdlp_solver_mode);
+  }
+  auto op_problem = mps_data_model_to_optimization_problem(handle_ptr, mps_data_model);
+  return solve_lp(op_problem, settings, problem_checking, use_pdlp_solver_mode, false);
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/pdlp/solve.cu` around lines 2129 - 2134, The current overload hard-fails via cuopt_expects when settings.hyper_params.use_distributed_pdlp is false and otherwise always forwards to solve_lp_distributed_from_mps, which removes the original single-GPU/direct-MPS path; restore the prior behavior by replacing the hard-fail with a branch: if settings.hyper_params.use_distributed_pdlp is true, call solve_lp_distributed_from_mps(handle_ptr, mps_data_model, settings, problem_checking, use_pdlp_solver_mode); otherwise call the non-distributed direct-MPS entrypoint that was used previously (for example solve_lp_from_mps or the equivalent direct-MPS routine), so that both paths are supported, and keep or adjust cuopt_expects to validate only unsupported parameter combinations if needed.

coderabbitai · 2026-06-04T15:50:52Z

+  pdlp_solver_settings_t<i_t, f_t> settings_resolved = settings;
+
+  detail::pdlp_graph_disabled_flag().store(settings_resolved.hyper_params.pdlp_disable_graph,
+                                           std::memory_order_relaxed);
+
+  if (settings_resolved.distributed_pdlp_num_gpus == -1) {
+    settings_resolved.distributed_pdlp_num_gpus = raft::device_setter::get_device_count();
+    CUOPT_LOG_INFO(
+      "solve_lp_distributed_from_mps: distributed_pdlp_num_gpus == -1, auto-detected "
+      "%d visible CUDA device(s)",
+      settings_resolved.distributed_pdlp_num_gpus);
+  }
+  // PDLP precision validations (mirror the checks in run_pdlp; distributed
+  // path only supports the default-precision, non-batch double config).
+  cuopt_expects(settings_resolved.pdlp_precision == pdlp_precision_t::DefaultPrecision,
+                error_type_t::ValidationError,
+                "Distributed PDLP only supports DefaultPrecision (double).");
+  cuopt_expects(!settings_resolved.inside_mip,
+                error_type_t::ValidationError,
+                "Distributed PDLP is not yet supported from inside MIP.");
+
+  init_logger_t log(settings_resolved.log_file, settings_resolved.log_to_console);
+  print_version_info();
+  init_handler(handle_ptr);
+
+  const i_t n_vars = static_cast<i_t>(mps_data_model.get_objective_coefficients().size());
+  const i_t n_cstr = static_cast<i_t>(mps_data_model.get_constraint_lower_bounds().size());
+  const i_t nnz    = static_cast<i_t>(mps_data_model.get_constraint_matrix_values().size());
+  CUOPT_LOG_INFO(
+    "Solving a problem with %d constraints, %d variables (%d integers), and %d "
+    "nonzeros (distributed mps-direct path)",
+    n_cstr,
+    n_vars,
+    0,
+    nnz);
+
+  auto lp_timer = cuopt::timer_t(settings_resolved.time_limit);
+
+  // Shape-0 placeholder: needed to build an empty pdlp_solver
+  cuopt::linear_programming::optimization_problem_t<i_t, f_t> placeholder_op(handle_ptr);
+  {
+    std::vector<i_t> empty_offsets = {0};
+    placeholder_op.set_csr_constraint_matrix(
+      nullptr, 0, nullptr, 0, empty_offsets.data(), static_cast<i_t>(empty_offsets.size()));
+  }
+  detail::problem_t<i_t, f_t> placeholder_problem(placeholder_op);
+
+  detail::pdlp_solver_t<i_t, f_t> solver(placeholder_problem, mps_data_model, settings_resolved);
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate the method and apply the preset-mode mapping in the distributed path.

solve_lp_distributed_from_mps() always constructs detail::pdlp_solver_t, but it never checks settings.method and never calls set_pdlp_solver_mode(settings_resolved) when use_pdlp_solver_mode is true. A caller can request Barrier/Concurrent or a preset PDLP mode and silently get default-parameter PDLP instead.

Suggested fix

   pdlp_solver_settings_t<i_t, f_t> settings_resolved = settings;
+  cuopt_expects(settings_resolved.method == method_t::PDLP,
+                error_type_t::ValidationError,
+                "Distributed MPS solve currently supports only method_t::PDLP");
+  if (use_pdlp_solver_mode) { set_pdlp_solver_mode(settings_resolved); }

   detail::pdlp_graph_disabled_flag().store(settings_resolved.hyper_params.pdlp_disable_graph,
                                            std::memory_order_relaxed);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/pdlp/solve.cu` around lines 2157 - 2205, solve_lp_distributed_from_mps builds detail::pdlp_solver_t from settings_resolved but never validates settings_resolved.method and never calls set_pdlp_solver_mode, so a requested non-PDLP method or PDLP preset is silently ignored; fix by validating that settings_resolved.method is method_t::PDLP and, when the use_pdlp_solver_mode argument is true, calling set_pdlp_solver_mode(settings_resolved) before constructing the solver, so that detail::pdlp_solver_t(placeholder_problem, mps_data_model, settings_resolved) is built with the intended PDLP mode.

coderabbitai · 2026-06-04T15:50:52Z

+  pdlp_solver_settings_t<int, double> dist_settings = base_settings;
+  dist_settings.hyper_params.use_distributed_pdlp   = true;
+  dist_settings.distributed_pdlp_num_gpus           = -1;
+  auto dist                                         = solve_lp(&handle, problem, dist_settings);


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Require at least 2 GPUs here or skip the test.

With distributed_pdlp_num_gpus = -1 and the explicit “1 GPU is fine” behavior, this helper can pass without ever exercising the multi-GPU/NCCL path the PR is adding. That leaves the highest-risk path effectively untested in single-GPU CI.

Suggested change

   pdlp_solver_settings_t<int, double> dist_settings = base_settings;
   dist_settings.hyper_params.use_distributed_pdlp   = true;
-  dist_settings.distributed_pdlp_num_gpus           = -1;
+  int device_count = 0;
+  RAFT_CUDA_TRY(cudaGetDeviceCount(&device_count));
+  if (device_count < 2) {
+    GTEST_SKIP() << "distributed parity requires at least 2 GPUs";
+  }
+  dist_settings.distributed_pdlp_num_gpus = 2;
   auto dist = solve_lp(&handle, problem, dist_settings);

As per coding guidelines, "ensure new distributed paths have correctness assertions (not just “runs”)."

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tests/linear_programming/pdlp_test.cu` around lines 188 - 191, The test currently sets distributed_pdlp_num_gpus = -1 which lets a single-GPU run bypass the multi-GPU/NCCL path; change the test to first query the available GPU count and if fewer than 2 GPUs are present skip the test, otherwise set pdlp_solver_settings_t::distributed_pdlp_num_gpus to at least 2 (e.g., max(2, available_gpus)) before calling solve_lp(&handle, problem, dist_settings) so the distributed PDLP path is actually exercised (use pdlp_solver_settings_t, dist_settings, distributed_pdlp_num_gpus and solve_lp as the loci to modify).

coderabbitai · 2026-06-04T15:50:52Z

+TEST(pdlp_class, distributed_parity_square41)
+{
+  const raft::handle_t handle{};
+  expect_distributed_matches_base(handle, "linear_programming/neos3/neos3.mps");
+}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Point the square41 test at square41, not neos3.

This test is named distributed_parity_square41 but currently loads linear_programming/neos3/neos3.mps. That means the intended square41 regression is missing, and it also reintroduces an instance this file already treats as TODO/problematic later on.

Suggested change

 TEST(pdlp_class, distributed_parity_square41)
 {
   const raft::handle_t handle{};
-  expect_distributed_matches_base(handle, "linear_programming/neos3/neos3.mps");
+  expect_distributed_matches_base(handle, "linear_programming/square41/square41.mps");
 }

As per coding guidelines, when a bug fix lands, a regression test should cover the specific case, and the referenced PDLP dataset docs explicitly list `square41` as downloaded test data.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tests/linear_programming/pdlp_test.cu` around lines 248 - 252, The test pdlp_class::distributed_parity_square41 is loading the wrong dataset; change the argument to expect_distributed_matches_base in that test so it points to "linear_programming/square41/square41.mps" instead of "linear_programming/neos3/neos3.mps" so the regression covers the intended square41 case (update the call site in the distributed_parity_square41 test that invokes expect_distributed_matches_base).

Bubullzz added 30 commits May 7, 2026 15:07

first commit !! added multi_gpu_partition file to solver settings

1e0bd53

slowly skeletonning

978d17b

better shard.cuh

dd0c0ef

wip

2037eca

added a bit of skeleton. Forward declared pdlp_solver in shard.hpp, t…

0f62eff

…he cycle seems to be fixed, cuopt compiles

still wip but going well

d89c85a

cursor broke everything grrr

5534ff0

partition loader now partition loads

dd935c5

big advancements ayo ! We can soon start working on imlementing the s…

09eb20b

…olver !!!

added pre loop setup need to manage boxing

b5ebfd2

+ style too

added distributed transform

0965a60

added semicolon and existing runtime error enum

d4d1cab

added } and fixed cuot_expects in partition loader

6659dd9

small bug fixes

b2ed271

a version that compiles #heheha 😎😎😎😎

50d16ce

removed use of engine:transaform

359d9f4

added multi-gpu SpMV #heheha

910a49a

transformed a transform. it compiles hehe

76c0b3f

updated take step for distributed. compiles but doesnt run. will chec…

5ec7138

…k on main

Merge branch 'main' into cuD-PDLP

1f02afd

support spmvop on multi-gpu

de19f38

compile ready

0030a6c

can run now

172ebc2

passing all tests, good merge

23d0798

fixed the errors hihi, finished distributed part for compte_fixed_error

30881ce

style

c33faf2

now manage halpern update in multi-gpu pdlp

98e0ce6

small fix to calls of multi_gpu_engine_ and scale/unscale solutions.

84128bf

compiles and runs

comments

abe4dd2

added is multi gpu to pdhg

5c41497

Bubullzz added 20 commits May 29, 2026 06:29

force re-inject offset and variables to undo the sort, cheap and ugly…

a1ffe1d

… but works

few style changes, better args and prints

c9394d9

added disable_graph flag, afiro gets solved on non-graph just as if i…

4faa7df

…t was single

makes reductions in compute interraction adn movement use owned_size …

61acddb

…rather than total size hehehehe

added emtis partitionner, still need it in the env. it is FAST. but w…

b8b59bf

…e lose a lot of time on actal partitionning and data movements. Everything seems to be working

forgot to push a file, maybe doesnt compile lol

7d74e74

fixed dummy partitionner on single gpu

859a299

added some plumbing, will not load full problem on gpu

7daa740

added guard to ensure presolver is not supported in mGPU

8a39e8c

plumbed pdlp_distributed_solver with mps_data_model and now data does…

5a3b9ce

…nt transit on master device !

removed usage of problem_t for distributed PDLP

e4739b5

added a cuopt assert for solve_lp in mgpu mode

1903f4b

style

0aacb4f

fixed bound/objective rescaling, now afiro on 8 shards work but hangs…

6df8145

… in the end

actually disable the graph ^^ (kms)

df9f793

added option to export parts file

4c8bcd1

addded test for import export parts file

a8a8054

added full solve tests

5abcd2e

added kaminpar partitionner and possibility to chose the partitionner

0b0ce2c

style

91b1ae5

Bubullzz requested review from a team as code owners June 4, 2026 15:24

Bubullzz requested review from hlinsen, mlubin and rgsl888prabhu June 4, 2026 15:24

Bubullzz added the do not merge Do not merge if this flag is set label Jun 4, 2026

coderabbitai Bot reviewed Jun 4, 2026
