68 commits
1e0bd53
first commit !! added multi_gpu_partition file to solver settings
Bubullzz May 7, 2026
978d17b
slowly skeletonning
Bubullzz May 7, 2026
dd0c0ef
better shard.cuh
Bubullzz May 7, 2026
2037eca
wip
Bubullzz May 10, 2026
0f62eff
added a bit of skeleton. Forward declared pdlp_solver in shard.hpp, t…
Bubullzz May 18, 2026
d89c85a
still wip but going well
Bubullzz May 19, 2026
5534ff0
cursor broke everything grrr
Bubullzz May 19, 2026
dd935c5
partition loader now partition loads
Bubullzz May 19, 2026
09eb20b
big advancements ayo ! We can soon start working on imlementing the s…
Bubullzz May 19, 2026
b5ebfd2
added pre loop setup need to manage boxing
Bubullzz May 20, 2026
0965a60
added distributed transform
Bubullzz May 20, 2026
d4d1cab
added semicolon and existing runtime error enum
Bubullzz May 20, 2026
6659dd9
added } and fixed cuot_expects in partition loader
Bubullzz May 20, 2026
b2ed271
small bug fixes
Bubullzz May 20, 2026
50d16ce
a version that compiles #heheha 😎😎😎😎
Bubullzz May 20, 2026
359d9f4
removed use of engine:transaform
Bubullzz May 21, 2026
910a49a
added multi-gpu SpMV #heheha
Bubullzz May 22, 2026
76c0b3f
transformed a transform. it compiles hehe
Bubullzz May 22, 2026
5ec7138
updated take step for distributed. compiles but doesnt run. will chec…
Bubullzz May 22, 2026
1f02afd
Merge branch 'main' into cuD-PDLP
Bubullzz May 22, 2026
de19f38
support spmvop on multi-gpu
Bubullzz May 22, 2026
0030a6c
compile ready
Bubullzz May 22, 2026
172ebc2
can run now
Bubullzz May 22, 2026
23d0798
passing all tests, good merge
Bubullzz May 22, 2026
30881ce
fixed the errors hihi, finished distributed part for compte_fixed_error
Bubullzz May 22, 2026
c33faf2
style
Bubullzz May 22, 2026
98e0ce6
now manage halpern update in multi-gpu pdlp
Bubullzz May 26, 2026
84128bf
small fix to calls of multi_gpu_engine_ and scale/unscale solutions.
Bubullzz May 26, 2026
abe4dd2
comments
Bubullzz May 26, 2026
5c41497
added is multi gpu to pdhg
Bubullzz May 26, 2026
37b1fda
added pdhg get mgpu engine
Bubullzz May 26, 2026
57c7061
added non const convergence information getter
Bubullzz May 26, 2026
9f78d05
compute_convergence_information is now on multi-gpu
Bubullzz May 26, 2026
c484485
fill_return_problem_solutionis now ready !!
Bubullzz May 26, 2026
fc46080
added reduced cost in gathering of solution, builds and runs
Bubullzz May 26, 2026
6538382
updated mgpu scale/unscale logic
Bubullzz May 27, 2026
a88285a
wired mgpu restart
Bubullzz May 27, 2026
b34c5f6
dummy version locally seems to work ?????
Bubullzz May 27, 2026
b784a44
added dummy partitionner
Bubullzz May 27, 2026
ca7d7a9
added stream forking for cuda graph
Bubullzz May 27, 2026
0310d50
updated convergence information to use potential_next rather than cu…
Bubullzz May 27, 2026
f811bc8
disabled graph, can sole afiro hehe
Bubullzz May 27, 2026
4d7e2fc
added join_from_shards in convergence_info, now afiro is erfect 510 b…
Bubullzz May 28, 2026
7ad4606
use spmvop in mgpu and fixed small bug of increment_iteration_since_l…
Bubullzz May 28, 2026
03d1259
re-enabled graph. not working
Bubullzz May 28, 2026
cdc912b
Cleaner sync semantics, ez ez ez, single mGPU gives exact same result…
Bubullzz May 28, 2026
04d22cf
pad local matrices for easier integration and allow mismatch of nnz b…
Bubullzz May 29, 2026
b41df45
copy scalars to host rather than direct d2d. better
Bubullzz May 29, 2026
a1ffe1d
force re-inject offset and variables to undo the sort, cheap and ugly…
Bubullzz May 29, 2026
c9394d9
few style changes, better args and prints
Bubullzz May 29, 2026
4faa7df
added disable_graph flag, afiro gets solved on non-graph just as if i…
Bubullzz May 29, 2026
61acddb
makes reductions in compute interraction adn movement use owned_size …
Bubullzz May 31, 2026
b8b59bf
added emtis partitionner, still need it in the env. it is FAST. but w…
Bubullzz May 31, 2026
7d74e74
forgot to push a file, maybe doesnt compile lol
Bubullzz May 31, 2026
859a299
fixed dummy partitionner on single gpu
Bubullzz Jun 1, 2026
7daa740
added some plumbing, will not load full problem on gpu
Bubullzz Jun 1, 2026
8a39e8c
added guard to ensure presolver is not supported in mGPU
Bubullzz Jun 1, 2026
5a3b9ce
plumbed pdlp_distributed_solver with mps_data_model and now data doe…
Bubullzz Jun 2, 2026
e4739b5
removed usage of problem_t for distributed PDLP
Bubullzz Jun 2, 2026
1903f4b
added a cuopt assert for solve_lp in mgpu mode
Bubullzz Jun 2, 2026
0aacb4f
style
Bubullzz Jun 2, 2026
6df8145
fixed bound/objective rescaling, now afiro on 8 shards work but hangs…
Bubullzz Jun 2, 2026
df9f793
actually disable the graph ^^ (kms)
Bubullzz Jun 2, 2026
4c8bcd1
added option to export parts file
Bubullzz Jun 4, 2026
a8a8054
addded test for import export parts file
Bubullzz Jun 4, 2026
5abcd2e
added full solve tests
Bubullzz Jun 4, 2026
0b0ce2c
added kaminpar partitionner and possibility to chose the partitionner
Bubullzz Jun 4, 2026
91b1ae5
style
Bubullzz Jun 4, 2026
78 changes: 78 additions & 0 deletions cpp/CMakeLists.txt
@@ -314,6 +314,81 @@ create_logger_macros(CUOPT "cuopt::default_logger()" include/cuopt)

find_package(CUDSS REQUIRED)

# ##################################################################################################
# - NCCL (multi-GPU distributed PDLP) -------------------------------------------------------------
# NCCL is shipped via the conda env; no canonical CMake config target, so look it
# up by name in the standard lib paths (plus CONDA_PREFIX as a hint).
set(NCCL_HINT_PREFIXES "")
if (DEFINED ENV{CONDA_PREFIX} AND NOT "$ENV{CONDA_PREFIX}" STREQUAL "")
list(APPEND NCCL_HINT_PREFIXES "$ENV{CONDA_PREFIX}")
endif ()
find_path(NCCL_INCLUDE_DIR
NAMES nccl.h
HINTS ${NCCL_HINT_PREFIXES}
PATH_SUFFIXES include
)
find_library(NCCL_LIBRARY
NAMES nccl
HINTS ${NCCL_HINT_PREFIXES}
PATH_SUFFIXES lib lib64
)
if (NOT NCCL_INCLUDE_DIR OR NOT NCCL_LIBRARY)
message(FATAL_ERROR "NCCL not found. Looked in ${NCCL_HINT_PREFIXES}. Install nccl-dev / libnccl-dev in the active env.")
endif ()
add_library(nccl_external UNKNOWN IMPORTED GLOBAL)
set_target_properties(nccl_external PROPERTIES
IMPORTED_LOCATION "${NCCL_LIBRARY}"
INTERFACE_INCLUDE_DIRECTORIES "${NCCL_INCLUDE_DIR}"
)
message(STATUS "Using NCCL: ${NCCL_LIBRARY}")

# ##################################################################################################
# - METIS (graph partitioning for distributed PDLP) -----------------------------------------------
# Found by searching CONDA_PREFIX first, then CUOPT_METIS_ROOT (cmake var or env)
# if the user wants to pull METIS from a different conda env / system path.
set(METIS_HINT_PREFIXES "")
if (DEFINED ENV{CONDA_PREFIX} AND NOT "$ENV{CONDA_PREFIX}" STREQUAL "")
list(APPEND METIS_HINT_PREFIXES "$ENV{CONDA_PREFIX}")
endif ()
if (DEFINED CUOPT_METIS_ROOT AND NOT "${CUOPT_METIS_ROOT}" STREQUAL "")
list(APPEND METIS_HINT_PREFIXES "${CUOPT_METIS_ROOT}")
endif ()
if (DEFINED ENV{CUOPT_METIS_ROOT} AND NOT "$ENV{CUOPT_METIS_ROOT}" STREQUAL "")
list(APPEND METIS_HINT_PREFIXES "$ENV{CUOPT_METIS_ROOT}")
endif ()
find_path(METIS_INCLUDE_DIR
NAMES metis.h
HINTS ${METIS_HINT_PREFIXES}
PATH_SUFFIXES include
)
find_library(METIS_LIBRARY
NAMES metis libmetis
HINTS ${METIS_HINT_PREFIXES}
PATH_SUFFIXES lib lib64
)
if (NOT METIS_INCLUDE_DIR OR NOT METIS_LIBRARY)
message(FATAL_ERROR "METIS not found. Looked in: ${METIS_HINT_PREFIXES}. "
"Install it via 'conda install -c conda-forge metis' in the active env, "
"or set CUOPT_METIS_ROOT to a prefix containing include/metis.h and lib/libmetis.{so,a}.")
endif ()
add_library(metis_external UNKNOWN IMPORTED GLOBAL)
set_target_properties(metis_external PROPERTIES
IMPORTED_LOCATION "${METIS_LIBRARY}"
INTERFACE_INCLUDE_DIRECTORIES "${METIS_INCLUDE_DIR}"
)
message(STATUS "Using METIS: ${METIS_LIBRARY}")

# ##################################################################################################
# - KaMinPar (multi-threaded partitioning for distributed PDLP) ------------------------------------
# Brought in the RAPIDS way (rapids_cpm_find): uses an installed KaMinPar (deb/rpm/conda,
# discovered via its CMake config) if present, otherwise builds the pinned source via CPM.
# Distributed PDLP prefers KaMinPar over METIS.
include(cmake/thirdparty/get_kaminpar.cmake)
if (NOT TARGET KaMinPar::KaMinPar)
message(FATAL_ERROR "KaMinPar::KaMinPar was not made available by get_kaminpar.cmake")
endif ()
message(STATUS "Using KaMinPar (distributed PDLP prefers KaMinPar over METIS)")

# ##################################################################################################
# - gRPC and Protobuf setup -----------------------------------------------------------------------

@@ -576,6 +651,9 @@ target_link_libraries(cuopt
${CUDSS_LIB_FILE}
PRIVATE
${CUOPT_PRIVATE_CUDA_LIBS}
nccl_external
metis_external
KaMinPar::KaMinPar
$<$<BOOL:${CUOPT_ENABLE_GRPC}>:protobuf::libprotobuf>
$<$<BOOL:${CUOPT_ENABLE_GRPC}>:gRPC::grpc++>
)
48 changes: 48 additions & 0 deletions cpp/cmake/thirdparty/get_kaminpar.cmake
@@ -0,0 +1,48 @@
# cmake-format: off
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
# cmake-format: on

# Multi-threaded graph partitioner for distributed PDLP.
# Uses rapids_cpm_find so a system / conda / .deb install of KaMinPar (which ships a
# CMake config package exporting KaMinPar::KaMinPar) is used when available, and
# otherwise the pinned source is cloned and built via CPM. KaMinPar depends on TBB,
# which cuOpt already requires (see find_package(TBB) for papilo).
function(find_and_configure_kaminpar)
set(oneValueArgs VERSION PINNED_TAG)
cmake_parse_arguments(PKG "" "${oneValueArgs}" "" ${ARGN})

rapids_cpm_find(KaMinPar ${PKG_VERSION}
GLOBAL_TARGETS KaMinPar::KaMinPar
CPM_ARGS
GIT_REPOSITORY https://github.com/KaHIP/KaMinPar.git
GIT_TAG ${PKG_PINNED_TAG}
EXCLUDE_FROM_ALL
OPTIONS
"KAMINPAR_BUILD_APPS OFF"
"KAMINPAR_BUILD_TOOLS OFF"
"KAMINPAR_BUILD_TESTS OFF"
"KAMINPAR_BUILD_BENCHMARKS OFF"
"KAMINPAR_BUILD_EXAMPLES OFF"
"KAMINPAR_BUILD_DISTRIBUTED OFF"
# Timers use global state and force single-threaded use of the library
# interface; disable so cuOpt can call the partitioner freely.
"KAMINPAR_ENABLE_TIMERS OFF"
# Avoid an extra hard dependency on Google Sparsehash.
"KAMINPAR_BUILD_WITH_SPARSEHASH OFF"
# cuOpt's TBB is discovered via a legacy find that only exposes TBB::tbb
# (no TBB::tbbmalloc target); disable KaMinPar's optional tbbmalloc use.
"KAMINPAR_ENABLE_TBB_MALLOC OFF"
# Large LP constraint graphs can exceed 2^31 directed edges.
"KAMINPAR_64BIT_EDGE_IDS ON"
"INSTALL_KAMINPAR OFF"
)

if(KaMinPar_ADDED)
message(VERBOSE "CUOPT: Using KaMinPar located in ${KaMinPar_SOURCE_DIR}")
else()
message(VERBOSE "CUOPT: Using KaMinPar located in ${KaMinPar_DIR}")
endif()
endfunction()

find_and_configure_kaminpar(VERSION 3.7.3 PINNED_TAG v3.7.3)
24 changes: 20 additions & 4 deletions cpp/cuopt_cli.cpp
@@ -176,7 +176,12 @@ int run_single_file(const std::string& file_path,
auto solution = cuopt::linear_programming::solve_mip(problem_interface.get(), mip_settings);
} else {
auto& lp_settings = settings.get_pdlp_settings();
- auto solution = cuopt::linear_programming::solve_lp(problem_interface.get(), lp_settings);
+
+ if (lp_settings.hyper_params.use_distributed_pdlp) {
+ cuopt::linear_programming::solve_lp(handle_ptr.get(), mps_data_model, lp_settings);
+ } else {
+ cuopt::linear_programming::solve_lp(problem_interface.get(), lp_settings);
+ }
Comment on lines +180 to +184

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard distributed solve path against null GPU handle.

When distributed PDLP is enabled, this branch can pass a null handle_ptr if the selected memory backend is not GPU. Fail fast before calling the distributed overload.

Suggested fix
-      if (lp_settings.hyper_params.use_distributed_pdlp) {
+      if (lp_settings.hyper_params.use_distributed_pdlp) {
+        if (handle_ptr == nullptr) {
+          throw std::runtime_error("Distributed PDLP requires GPU memory backend.");
+        }
         cuopt::linear_programming::solve_lp(handle_ptr.get(), mps_data_model, lp_settings);
       } else {
         cuopt::linear_programming::solve_lp(problem_interface.get(), lp_settings);
       }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/cuopt_cli.cpp` around lines 180 - 184, When
lp_settings.hyper_params.use_distributed_pdlp is true, guard the distributed
PDLP call by checking that handle_ptr is non-null before invoking
cuopt::linear_programming::solve_lp(handle_ptr.get(), ...); if handle_ptr is
null, fail fast with a clear error (e.g., log and exit or throw) rather than
calling the distributed overload; update the branch that currently chooses
between solve_lp(handle_ptr.get(), mps_data_model, lp_settings) and
solve_lp(problem_interface.get(), lp_settings) to validate handle_ptr first and
only call the distributed overload when handle_ptr is valid.

}
} catch (const std::exception& e) {
fprintf(stderr, "cuopt_cli error: %s\n", e.what());
@@ -426,10 +431,21 @@ int main(int argc, char* argv[])
std::vector<rmm::mr::cuda_async_memory_resource> memory_resources;

if (memory_backend == cuopt::linear_programming::memory_backend_t::GPU) {
- const int num_gpus = settings.get_parameter<int>(CUOPT_NUM_GPUS);
+ // Distributed PDLP scales one shard per GPU and uses its own knob; everything else
+ // (concurrent, batch, MIP) uses num_gpus which is capped at 2.
+ // For distributed PDLP, -1 means "auto-detect": resolve to the visible device
+ // count so the RMM memory pools match what solve.cu will eventually dispatch.
+ const bool use_distributed_pdlp = settings.get_parameter<bool>(CUOPT_USE_DISTRIBUTED_PDLP);
+ int requested_gpus = use_distributed_pdlp
+ ? settings.get_parameter<int>(CUOPT_DISTRIBUTED_PDLP_NUM_GPUS)
+ : settings.get_parameter<int>(CUOPT_NUM_GPUS);
+ if (use_distributed_pdlp && requested_gpus == -1) {
+ requested_gpus = raft::device_setter::get_device_count();
+ }
+ const int provisioned_gpus = std::min(raft::device_setter::get_device_count(), requested_gpus);

- memory_resources.reserve(std::min(raft::device_setter::get_device_count(), num_gpus));
- for (int i = 0; i < std::min(raft::device_setter::get_device_count(), num_gpus); ++i) {
+ memory_resources.reserve(provisioned_gpus);
Comment on lines +439 to +447

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate GPU count before provisioning memory resources.

requested_gpus is used to size provisioning without enforcing a positive value (other than distributed -1 remap). Add explicit validation before std::min(...)/reserve(...) to avoid invalid allocation/setup paths.

Suggested fix
-    if (use_distributed_pdlp && requested_gpus == -1) {
-      requested_gpus = raft::device_setter::get_device_count();
-    }
-    const int provisioned_gpus = std::min(raft::device_setter::get_device_count(), requested_gpus);
+    const int visible_gpus = raft::device_setter::get_device_count();
+    if (use_distributed_pdlp && requested_gpus == -1) {
+      requested_gpus = visible_gpus;
+    }
+    if (requested_gpus <= 0) {
+      std::cerr << "Invalid GPU count: " << requested_gpus
+                << " (must be > 0, or -1 only with distributed PDLP)." << std::endl;
+      return 1;
+    }
+    const int provisioned_gpus = std::min(visible_gpus, requested_gpus);
+    if (provisioned_gpus <= 0) {
+      std::cerr << "No visible GPUs available for GPU backend." << std::endl;
+      return 1;
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/cuopt_cli.cpp` around lines 439 - 447, The code currently computes
requested_gpus and then uses std::min(...) to compute provisioned_gpus and
memory_resources.reserve(provisioned_gpus) without validating requested_gpus;
add explicit validation after computing requested_gpus (and after remapping -1
when use_distributed_pdlp is true) to ensure requested_gpus > 0 and that
raft::device_setter::get_device_count() > 0 before calling std::min or reserve.
If either value is non-positive, return/log an error or throw an exception
(consistent with surrounding error handling) referencing the parameters obtained
via settings.get_parameter<int>(CUOPT_NUM_GPUS) and
settings.get_parameter<int>(CUOPT_DISTRIBUTED_PDLP_NUM_GPUS) so the code never
calls memory_resources.reserve with a non-positive size.
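
For reference, the validation order suggested above can be sketched as a standalone helper. This is a minimal sketch, not code from the PR: `resolve_provisioned_gpus` and its parameters are hypothetical, and the visible device count is passed in rather than queried through `raft::device_setter::get_device_count()`, so the logic can be checked without a GPU.

```cpp
#include <algorithm>

// Hypothetical helper (not part of cuOpt). Returns the number of GPUs to
// provision, or 0 when the request is invalid.
//   requested   - user-supplied GPU count (-1 means auto-detect, distributed PDLP only)
//   visible     - number of visible devices
//   distributed - whether distributed PDLP is enabled
int resolve_provisioned_gpus(int requested, int visible, bool distributed)
{
  // Remap the auto-detect sentinel first, and only for distributed PDLP.
  if (distributed && requested == -1) { requested = visible; }
  // Reject non-positive requests and hosts without a visible GPU.
  if (requested <= 0 || visible <= 0) { return 0; }
  // Never provision more devices than are visible.
  return std::min(visible, requested);
}
```

A caller would treat a return value of 0 as an error and exit before reaching `memory_resources.reserve(...)`.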

+ for (int i = 0; i < provisioned_gpus; ++i) {
RAFT_CUDA_TRY(cudaSetDevice(i));
memory_resources.emplace_back();
rmm::mr::set_per_device_resource(rmm::cuda_device_id{i}, memory_resources.back());
22 changes: 14 additions & 8 deletions cpp/include/cuopt/linear_programming/constants.h
@@ -80,14 +80,20 @@
#define CUOPT_MIP_STRONG_BRANCHING_SIMPLEX_ITERATION_LIMIT \
"mip_strong_branching_simplex_iteration_limit"

- #define CUOPT_SOLUTION_FILE "solution_file"
- #define CUOPT_NUM_CPU_THREADS "num_cpu_threads"
- #define CUOPT_NUM_GPUS "num_gpus"
- #define CUOPT_USER_PROBLEM_FILE "user_problem_file"
- #define CUOPT_PRESOLVE_FILE "presolve_file"
- #define CUOPT_RANDOM_SEED "random_seed"
- #define CUOPT_PDLP_PRECISION "pdlp_precision"
- #define CUOPT_MIP_SEMICONTINUOUS_BIG_M "mip_semi_continuous_big_m"
+ #define CUOPT_SOLUTION_FILE "solution_file"
+ #define CUOPT_NUM_CPU_THREADS "num_cpu_threads"
+ #define CUOPT_NUM_GPUS "num_gpus"
+ #define CUOPT_DISTRIBUTED_PDLP_NUM_GPUS "distributed_pdlp_num_gpus"
+ #define CUOPT_MULTI_GPU_PARTITION_FILE "multi_gpu_partition_file"
+ #define CUOPT_MULTI_GPU_EXPORT_PARTITION_FILE "multi_gpu_export_partition_file"
+ #define CUOPT_DISTRIBUTED_PDLP_PARTITIONER "distributed_pdlp_partitioner"
+ #define CUOPT_USE_DISTRIBUTED_PDLP "use_distributed_pdlp"
+ #define CUOPT_PDLP_DISABLE_GRAPH "pdlp_disable_graph"
+ #define CUOPT_USER_PROBLEM_FILE "user_problem_file"
+ #define CUOPT_PRESOLVE_FILE "presolve_file"
+ #define CUOPT_RANDOM_SEED "random_seed"
+ #define CUOPT_PDLP_PRECISION "pdlp_precision"
+ #define CUOPT_MIP_SEMICONTINUOUS_BIG_M "mip_semi_continuous_big_m"

#define CUOPT_MIP_HYPER_HEURISTIC_POPULATION_SIZE "mip_hyper_heuristic_population_size"
#define CUOPT_MIP_HYPER_HEURISTIC_NUM_CPUFJ_THREADS "mip_hyper_heuristic_num_cpufj_threads"
16 changes: 10 additions & 6 deletions cpp/include/cuopt/linear_programming/pdlp/pdlp_hyper_params.cuh
@@ -47,12 +47,16 @@ struct pdlp_hyper_params_t {
bool bound_objective_rescaling = true;
bool use_reflected_primal_dual = true;
bool use_fixed_point_error = true;
- double reflection_coefficient = 1.0;
- double restart_k_p = 0.99;
- double restart_k_i = 0.01;
- double restart_k_d = 0.0;
- double restart_i_smooth = 0.3;
- bool use_conditional_major = true;
+ bool use_distributed_pdlp = false;
+ // Debug/diagnostic knob: when true, PDLP bypasses CUDA-graph capture in
+ // ping_pong_graph_t and executes each iteration eagerly
+ bool pdlp_disable_graph = false;
+ double reflection_coefficient = 1.0;
+ double restart_k_p = 0.99;
+ double restart_k_i = 0.01;
+ double restart_k_d = 0.0;
+ double restart_i_smooth = 0.3;
+ bool use_conditional_major = true;
};

// TODO most likely we want to get rid of pdlp_solver_mode and just have prebuilt
19 changes: 19 additions & 0 deletions cpp/include/cuopt/linear_programming/pdlp/solver_settings.hpp
@@ -307,6 +307,25 @@ class pdlp_solver_settings_t {
presolver_t presolver{presolver_t::Default};
bool dual_postsolve{true};
int num_gpus{1};
// Number of GPUs to use specifically for distributed PDLP (use_distributed_pdlp=true).
// -1 means auto-detect
int distributed_pdlp_num_gpus{-1};
std::string multi_gpu_partition_file{""};
// If non-empty, the partition computed for distributed PDLP is written to this
// path (one part-id per line) right after partitioning. The file can be fed
// back via multi_gpu_partition_file. Exposed as the multi_gpu_export_partition_file
// parameter (CLI: --multi-gpu-export-partition-file <path>).
std::string multi_gpu_export_partition_file{""};
// Which graph partitioner distributed PDLP uses. One of:
// "auto" - 1 GPU => Dummy; otherwise KaMinPar
// "dummy" - round-robin, no graph (trivial)
// "metis" - serial METIS_PartGraphKway
// "kaminpar" - multi-threaded KaMinPar
// Exposed as the distributed_pdlp_partitioner parameter
// (CLI: --distributed-pdlp-partitioner <auto|dummy|metis|kaminpar>).
std::string distributed_pdlp_partitioner{"auto"};
// Set to true inside the shards
bool is_distributed_sub_pdlp{false};
method_t method{method_t::Concurrent};
bool inside_mip{false};
// For concurrent termination
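
The `auto` rule documented in the `distributed_pdlp_partitioner` comment can be illustrated with a small resolver. This is a sketch under assumptions: `partitioner_kind` and `resolve_partitioner` are hypothetical names, not the PR's API, and rejecting unknown strings with an exception is a choice made here for illustration.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical enum; the PR's actual partitioner types may differ.
enum class partitioner_kind { dummy, metis, kaminpar };

// Maps the string setting to a concrete partitioner.
// "auto" picks the dummy partitioner on a single GPU and KaMinPar otherwise,
// as described in the settings comment.
partitioner_kind resolve_partitioner(const std::string& name, int num_gpus)
{
  if (name == "auto") {
    return num_gpus == 1 ? partitioner_kind::dummy : partitioner_kind::kaminpar;
  }
  if (name == "dummy") { return partitioner_kind::dummy; }
  if (name == "metis") { return partitioner_kind::metis; }
  if (name == "kaminpar") { return partitioner_kind::kaminpar; }
  // Assumption: unknown names are rejected rather than silently defaulted.
  throw std::invalid_argument("unknown distributed_pdlp_partitioner: " + name);
}
```

An explicit choice such as `"metis"` is honored regardless of GPU count; only `"auto"` depends on `num_gpus`.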
8 changes: 7 additions & 1 deletion cpp/src/math_optimization/solver_settings.cu
@@ -140,6 +140,7 @@ solver_settings_t<i_t, f_t>::solver_settings_t() : pdlp_settings(), mip_settings
{CUOPT_MIP_REDUCED_COST_STRENGTHENING, &mip_settings.reduced_cost_strengthening, -1, std::numeric_limits<i_t>::max(), -1},
{CUOPT_NUM_GPUS, &pdlp_settings.num_gpus, 1, 2, 1},
{CUOPT_NUM_GPUS, &mip_settings.num_gpus, 1, 2, 1},
{CUOPT_DISTRIBUTED_PDLP_NUM_GPUS, &pdlp_settings.distributed_pdlp_num_gpus, -1, 576, -1},
{CUOPT_MIP_BATCH_PDLP_STRONG_BRANCHING, &mip_settings.mip_batch_pdlp_strong_branching, 0, 2, 0},
{CUOPT_MIP_BATCH_PDLP_RELIABILITY_BRANCHING, &mip_settings.mip_batch_pdlp_reliability_branching, 0, 2, 0},
{CUOPT_MIP_STRONG_BRANCHING_SIMPLEX_ITERATION_LIMIT, &mip_settings.strong_branching_simplex_iteration_limit, -1,std::numeric_limits<i_t>::max(), -1},
@@ -177,6 +178,8 @@ solver_settings_t<i_t, f_t>::solver_settings_t() : pdlp_settings(), mip_settings
{CUOPT_DUAL_POSTSOLVE, &pdlp_settings.dual_postsolve, true},
{CUOPT_BARRIER_ITERATIVE_REFINEMENT, &pdlp_settings.barrier_iterative_refinement, true},
{CUOPT_MIP_PROBING, &mip_settings.probing, true},
{CUOPT_USE_DISTRIBUTED_PDLP, &pdlp_settings.hyper_params.use_distributed_pdlp, false},
{CUOPT_PDLP_DISABLE_GRAPH, &pdlp_settings.hyper_params.pdlp_disable_graph, false},
};
// String parameters
string_parameters = {
@@ -187,7 +190,10 @@ solver_settings_t<i_t, f_t>::solver_settings_t() : pdlp_settings(), mip_settings
{CUOPT_USER_PROBLEM_FILE, &mip_settings.user_problem_file, ""},
{CUOPT_USER_PROBLEM_FILE, &pdlp_settings.user_problem_file, ""},
{CUOPT_PRESOLVE_FILE, &mip_settings.presolve_file, ""},
- {CUOPT_PRESOLVE_FILE, &pdlp_settings.presolve_file, ""}
+ {CUOPT_PRESOLVE_FILE, &pdlp_settings.presolve_file, ""},
+ {CUOPT_MULTI_GPU_PARTITION_FILE, &pdlp_settings.multi_gpu_partition_file, ""},
+ {CUOPT_MULTI_GPU_EXPORT_PARTITION_FILE, &pdlp_settings.multi_gpu_export_partition_file, ""},
+ {CUOPT_DISTRIBUTED_PDLP_PARTITIONER, &pdlp_settings.distributed_pdlp_partitioner, "auto"},
};
// clang-format on
}
6 changes: 6 additions & 0 deletions cpp/src/pdlp/CMakeLists.txt
@@ -29,6 +29,12 @@ set(LP_CORE_FILES
${CMAKE_CURRENT_SOURCE_DIR}/termination_strategy/convergence_information.cu
${CMAKE_CURRENT_SOURCE_DIR}/optimal_batch_size_handler/optimal_batch_size_handler.cu
${CMAKE_CURRENT_SOURCE_DIR}/utilities/ping_pong_graph.cu
${CMAKE_CURRENT_SOURCE_DIR}/distributed_pdlp/shard.cu
${CMAKE_CURRENT_SOURCE_DIR}/distributed_pdlp/multi_gpu_engine.cu
${CMAKE_CURRENT_SOURCE_DIR}/distributed_pdlp/partition_loader.cu
${CMAKE_CURRENT_SOURCE_DIR}/distributed_pdlp/partitioner.cu
${CMAKE_CURRENT_SOURCE_DIR}/distributed_pdlp/metis_partitioner.cu
${CMAKE_CURRENT_SOURCE_DIR}/distributed_pdlp/kaminpar_partitioner.cpp
)

# C and Python adapter files
17 changes: 11 additions & 6 deletions cpp/src/pdlp/cusparse_view.cu
@@ -498,14 +498,17 @@ cusparse_view_t<i_t, f_t>::cusparse_view_t(
// setup cusparse view
A.create(op_problem_scaled.n_constraints,
op_problem_scaled.n_variables,
- op_problem_scaled.nnz,
+ static_cast<int64_t>(A_.size()),
const_cast<i_t*>(op_problem_scaled.offsets.data()),
const_cast<i_t*>(op_problem_scaled.variables.data()),
const_cast<f_t*>(op_problem_scaled.coefficients.data()));

+ // A_T can have a different nnz than A in multi-GPU shards
+ // A is just what is needed to compute A_x for owned constraints
+ // A_T is just what is needed to compute A_T_y for owned variables
A_T.create(op_problem_scaled.n_variables,
op_problem_scaled.n_constraints,
- op_problem_scaled.nnz,
+ static_cast<int64_t>(A_T_.size()),
Comment on lines +501 to +511

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Use shard-local nnz in the mixed-precision setup too.

Now that this ctor allows A and A_T to have different nnz on a shard, the later mixed-precision block still sizes, transforms, and recreates the FP32 matrices with op_problem_scaled.nnz (Lines 777-804). On any distributed shard where those lengths diverge, that path can overrun A_T_ or build A_mixed_ / A_T_mixed_ with stale nnz metadata.

Suggested follow-up
-      A_float_.resize(op_problem_scaled.nnz, handle_ptr->get_stream());
-      A_T_float_.resize(op_problem_scaled.nnz, handle_ptr->get_stream());
+      A_float_.resize(A_.size(), handle_ptr->get_stream());
+      A_T_float_.resize(A_T_.size(), handle_ptr->get_stream());

-      RAFT_CUDA_TRY(cub::DeviceTransform::Transform(op_problem_scaled.coefficients.data(),
+      RAFT_CUDA_TRY(cub::DeviceTransform::Transform(A_.data(),
                                                     A_float_.data(),
-                                                    op_problem_scaled.nnz,
+                                                    A_.size(),
                                                     double_to_float_functor{},
                                                     handle_ptr->get_stream().value()));

       RAFT_CUDA_TRY(cub::DeviceTransform::Transform(A_T_.data(),
                                                     A_T_float_.data(),
-                                                    op_problem_scaled.nnz,
+                                                    A_T_.size(),
                                                     double_to_float_functor{},
                                                     handle_ptr->get_stream().value()));

       A_mixed_.create(op_problem_scaled.n_constraints,
                       op_problem_scaled.n_variables,
-                      op_problem_scaled.nnz,
+                      static_cast<int64_t>(A_.size()),
                       const_cast<i_t*>(op_problem_scaled.offsets.data()),
                       const_cast<i_t*>(op_problem_scaled.variables.data()),
                       A_float_.data());

       A_T_mixed_.create(op_problem_scaled.n_variables,
                         op_problem_scaled.n_constraints,
-                        op_problem_scaled.nnz,
+                        static_cast<int64_t>(A_T_.size()),
                         const_cast<i_t*>(A_T_offsets_.data()),
                         const_cast<i_t*>(A_T_indices_.data()),
                         A_T_float_.data());

As per coding guidelines, "Prevent invalid memory access (out-of-bounds, use-after-free, host/device confusion) in GPU code."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/pdlp/cusparse_view.cu` around lines 501 - 511, The mixed-precision
branch still sizes and recreates FP32 matrices using op_problem_scaled.nnz which
can differ per shard; update that block to use the shard-local nnz values (e.g.
static_cast<int64_t>(A_.size()) and static_cast<int64_t>(A_T_.size())) when
allocating/sizing A_mixed_ and A_T_mixed_ and when copying/transposing data for
A_T.create / A.create so you don't overrun A_T_ or leave stale nnz metadata;
ensure any metadata fields set during the FP32 recreate follow the shard-local
sizes and that all transforms/read ranges use those local sizes (A_, A_T_,
A_mixed_, A_T_mixed_).
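
To illustrate why A and A_T can carry different nnz on a shard, here is a host-only sketch that counts the nonzeros each local matrix would hold. It assumes a shard keeps the rows of A for the constraints it owns and the rows of A_T for the variables it owns, as the comment in the diff describes. `csr_pattern_t`, `local_nnz_A`, and `local_nnz_AT` are hypothetical names, and the PR's partition loader may build its local matrices differently (for example with padding).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sparsity pattern of a global CSR matrix (illustrative only).
struct csr_pattern_t {
  int n_rows;
  int n_cols;
  std::vector<int> offsets;  // size n_rows + 1
  std::vector<int> indices;  // column index of every nonzero
};

// Nonzeros in the rows of A owned by a shard (needed for A * x).
std::size_t local_nnz_A(const csr_pattern_t& m, const std::vector<int>& owned_rows)
{
  std::size_t nnz = 0;
  for (int r : owned_rows) {
    assert(r >= 0 && r < m.n_rows);
    nnz += static_cast<std::size_t>(m.offsets[r + 1] - m.offsets[r]);
  }
  return nnz;
}

// Nonzeros in the rows of A^T owned by a shard, i.e. the columns of A
// for its owned variables (needed for A^T * y).
std::size_t local_nnz_AT(const csr_pattern_t& m, const std::vector<bool>& owns_col)
{
  assert(static_cast<int>(owns_col.size()) == m.n_cols);
  std::size_t nnz = 0;
  for (int c : m.indices) {
    if (owns_col[c]) { ++nnz; }
  }
  return nnz;
}
```

For a 2x3 matrix whose rows touch columns {0, 1, 2} and {2}, a shard owning constraint 1 and variables 0 and 1 holds 1 nonzero in its local A and 2 in its local A_T, so a single shared nnz value cannot describe both buffers.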

const_cast<i_t*>(A_T_offsets_.data()),
const_cast<i_t*>(A_T_indices_.data()),
const_cast<f_t*>(A_T_.data()));
@@ -914,14 +917,14 @@ cusparse_view_t<i_t, f_t>::cusparse_view_t(
// setup cusparse view
A.create(op_problem.n_constraints,
op_problem.n_variables,
- op_problem.nnz,
+ static_cast<int64_t>(A_.size()),
const_cast<i_t*>(op_problem.offsets.data()),
const_cast<i_t*>(op_problem.variables.data()),
const_cast<f_t*>(op_problem.coefficients.data()));

A_T.create(op_problem.n_variables,
op_problem.n_constraints,
- op_problem.nnz,
+ static_cast<int64_t>(A_T_.size()),
const_cast<i_t*>(A_T_offsets_.data()),
const_cast<i_t*>(A_T_indices_.data()),
const_cast<f_t*>(A_T_.data()));
@@ -1129,16 +1132,18 @@ cusparse_view_t<i_t, f_t>::cusparse_view_t(
// Copying them from the existing cuSparse view is a bad practice and creates segfault post
// CUDA 12.4 Using the saved pointer of the existing cusparse view to make sure we capture the
// correct pointer
+ // See comment in the PDHG cusparse_view_t ctor: bind the descriptor nnz to
+ // the actual value-buffer length so A and A_T stay symmetric and shard-safe.
A.create(op_problem.n_constraints,
op_problem.n_variables,
- op_problem.nnz,
+ static_cast<int64_t>(A_.size()),
const_cast<i_t*>(A_offsets_.data()),
const_cast<i_t*>(A_indices_.data()),
const_cast<f_t*>(A_.data()));

A_T.create(op_problem.n_variables,
op_problem.n_constraints,
- op_problem.nnz,
+ static_cast<int64_t>(existing_cusparse_view.A_T_.size()),
const_cast<i_t*>(existing_cusparse_view.A_T_offsets_.data()),
const_cast<i_t*>(existing_cusparse_view.A_T_indices_.data()),
const_cast<f_t*>(existing_cusparse_view.A_T_.data()));