ROUTING_TEST retail_float CVRPTW_Retail/18 intermittent SIGABRT from out-of-bounds route priority sort #1221

@bdice

Summary

ROUTING_TEST intermittently aborts in level0_retail/retail_float_test_t.CVRPTW_Retail/18 with host heap corruption. The same class of failure appeared in CI as "double free or corruption (out)", and a local ASan run points at an out-of-bounds host read in adapted_sol_t::priority_remove_diff_routes.

CI failure:
https://github.com/NVIDIA/cuopt/actions/runs/25819296468/job/75863364466?pr=1198

Failing Test

./tests/routing/ROUTING_TEST \
  --gtest_filter='level0_retail/retail_float_test_t.CVRPTW_Retail/18'

This is the parameter from cpp/tests/routing/level0/l0_routing_test.cu:

retail_params_t{}.set_vehicle_fixed_costs().set_multi_capacity().set_vehicle_tw()

Release Reproducer

The release build failure is intermittent, but reproduced locally when varying glibc heap perturbation per process:

cd /home/coder/cuopt/cpp/build/conda/cuda-13.2/release

for i in $(seq 1 128); do
  perturb=$(( (i * 37) % 255 + 1 ))
  MALLOC_CHECK_=3 MALLOC_PERTURB_="$perturb" \
    ./tests/routing/ROUTING_TEST \
    --gtest_filter='level0_retail/retail_float_test_t.CVRPTW_Retail/18' \
    > "/tmp/cuopt-routing-retail18-perturb-${perturb}-run-${i}.log" 2>&1
  status=$?
  printf 'retail18 run %s perturb=%s exit=%s\n' "$i" "$perturb" "$status"
  test "$status" -eq 0 || break
done

Observed failure:

retail18 run 90 perturb=16 exit=134
Note: Google Test filter = level0_retail/retail_float_test_t.CVRPTW_Retail/18
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from level0_retail/retail_float_test_t
[ RUN      ] level0_retail/retail_float_test_t.CVRPTW_Retail/18
double free or corruption (out)

ASan Reproducer

Configured sanitizer build:

LIBCUOPT_BUILD_DIR=/home/coder/cuopt/cpp/build/asan \
  PARALLEL_LEVEL=8 \
  ./build.sh libcuopt -fsanitize -n --skip-grpc-build --skip-c-python-adapters

Ran the same focused gtest under ASan/UBSan with varied MALLOC_PERTURB_:

cd /home/coder/cuopt/cpp/build/asan
ASAN_RT=$(/home/coder/.conda/envs/rapids/bin/x86_64-conda-linux-gnu-g++ -print-file-name=libasan.so)

for i in $(seq 1 32); do
  if [ "$i" -eq 1 ]; then
    perturb=16
  else
    perturb=$(( (i * 37) % 255 + 1 ))
  fi

  LD_PRELOAD="$ASAN_RT" \
  ASAN_OPTIONS='protect_shadow_gap=0:replace_intrin=0:abort_on_error=1:detect_leaks=0:halt_on_error=1:fast_unwind_on_malloc=0:alloc_dealloc_mismatch=1:detect_odr_violation=0' \
  UBSAN_OPTIONS='halt_on_error=1:abort_on_error=1:print_stacktrace=1' \
  MALLOC_PERTURB_="$perturb" \
    ./tests/routing/ROUTING_TEST \
    --gtest_filter='level0_retail/retail_float_test_t.CVRPTW_Retail/18' \
    > "/tmp/cuopt-routing-retail18-asan-perturb-${perturb}-run-${i}.log" 2>&1
  status=$?
  printf 'asan retail18 run %s perturb=%s exit=%s\n' "$i" "$perturb" "$status"
  test "$status" -eq 0 || break
done

Observed failure:

asan retail18 run 24 perturb=124 exit=134

ASan output:

Note: Google Test filter = level0_retail/retail_float_test_t.CVRPTW_Retail/18
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from level0_retail/retail_float_test_t
[ RUN      ] level0_retail/retail_float_test_t.CVRPTW_Retail/18
AddressSanitizer:DEADLYSIGNAL
=================================================================
==78192==ERROR: AddressSanitizer: SEGV on unknown address 0x502000260008 (pc 0x7d5e40b4c11a bp 0x000000000004 sp 0x7ffc5ed16a80 T0)
==78192==The signal is caused by a READ memory access.
    #0 0x7d5e40b4c11a in void std::__insertion_sort<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_comp_iter<cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1>::priority_remove_diff_routes(cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1> const&)::{lambda(auto:1, auto:2)#1}> >(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_comp_iter<cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1>::priority_remove_diff_routes(cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1> const&)::{lambda(auto:1, auto:2)#1}>) (/home/coder/cuopt/cpp/build/asan/libcuopt.so+0x454c11a)
    #1 0x7d5e40b8de7d in cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1>::priority_remove_diff_routes(cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1> const&) (/home/coder/cuopt/cpp/build/asan/libcuopt.so+0x458de7d)
    #2 0x7d5e40ba6a92 in cuopt::routing::detail::adapted_modifier_t<int, float, (cuopt::routing::request_t)1>::equalize_routes_and_nodes(cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1>&, cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1>&, std::array<double, 9ul>, bool) (/home/coder/cuopt/cpp/build/asan/libcuopt.so+0x45a6a92)
    #3 0x7d5e40b8c992 in cuopt::routing::solve<cuopt::routing::detail::pool_allocator_t<int, float, cuopt::routing::detail::solution_t<int, float, (cuopt::routing::request_t)1>, cuopt::routing::detail::problem_t<int, float> >, cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1>, cuopt::routing::detail::problem_t<int, float>, cuopt::routing::detail::adapted_generator_t<int, float, (cuopt::routing::request_t)1>, cuopt::routing::detail::adapted_modifier_t<int, float, (cuopt::routing::request_t)1> >::recombine(cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1>&, cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1>&, bool&, bool) (/home/coder/cuopt/cpp/build/asan/libcuopt.so+0x458c992)
    #4 0x7d5e40b3310d in cuopt::routing::solve<cuopt::routing::detail::pool_allocator_t<int, float, cuopt::routing::detail::solution_t<int, float, (cuopt::routing::request_t)1>, cuopt::routing::detail::problem_t<int, float> >, cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1>, cuopt::routing::detail::problem_t<int, float>, cuopt::routing::detail::adapted_generator_t<int, float, (cuopt::routing::request_t)1>, cuopt::routing::detail::adapted_modifier_t<int, float, (cuopt::routing::request_t)1> >::improve_population_fixed_threshold(cuopt::routing::population<cuopt::routing::detail::pool_allocator_t<int, float, cuopt::routing::detail::solution_t<int, float, (cuopt::routing::request_t)1>, cuopt::routing::detail::problem_t<int, float> >, cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1>, cuopt::routing::detail::problem_t<int, float> >&, int, int, bool) [clone .isra.0] (/home/coder/cuopt/cpp/build/asan/libcuopt.so+0x453310d)
    #5 0x7d5e40b9e26a in cuopt::routing::solve<cuopt::routing::detail::pool_allocator_t<int, float, cuopt::routing::detail::solution_t<int, float, (cuopt::routing::request_t)1>, cuopt::routing::detail::problem_t<int, float> >, cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1>, cuopt::routing::detail::problem_t<int, float>, cuopt::routing::detail::adapted_generator_t<int, float, (cuopt::routing::request_t)1>, cuopt::routing::detail::adapted_modifier_t<int, float, (cuopt::routing::request_t)1> >::run_working_loop() (/home/coder/cuopt/cpp/build/asan/libcuopt.so+0x459e26a)
    #6 0x7d5e40ba18d3 in cuopt::routing::ges_solver_t<int, float, (cuopt::routing::request_t)1>::compute_ges_solution(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (/home/coder/cuopt/cpp/build/asan/libcuopt.so+0x45a18d3)
    #7 0x7d5e40cccd80 in cuopt::routing::assignment_t<int> cuopt::routing::solver_t<int, float>::run_ges_solver<(cuopt::routing::request_t)1>(int) (/home/coder/cuopt/cpp/build/asan/libcuopt.so+0x46ccd80)
    #8 0x7d5e40ccd9ec in cuopt::routing::solver_t<int, float>::solve() (/home/coder/cuopt/cpp/build/asan/libcuopt.so+0x46cd9ec)
    #9 0x5b4851b6dc5e in cuopt::routing::test::routing_retail_test_t<int, float>::test_cvrptw() (/home/coder/cuopt/cpp/build/asan/tests/routing/ROUTING_TEST+0x80c5e)
    #10 0x5b4851bb275b in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/home/coder/cuopt/cpp/build/asan/tests/routing/ROUTING_TEST+0xc575b)
    #11 0x5b4851bb2a30 in testing::Test::Run() (/home/coder/cuopt/cpp/build/asan/tests/routing/ROUTING_TEST+0xc5a30)
    #12 0x5b4851bb2dd6 in testing::TestInfo::Run() (/home/coder/cuopt/cpp/build/asan/tests/routing/ROUTING_TEST+0xc5dd6)
    #13 0x5b4851bb3255 in testing::TestSuite::Run() (/home/coder/cuopt/cpp/build/asan/tests/routing/ROUTING_TEST+0xc6255)
    #14 0x5b4851bb6d62 in testing::internal::UnitTestImpl::RunAllTests() (/home/coder/cuopt/cpp/build/asan/tests/routing/ROUTING_TEST+0xc9d62)
    #15 0x5b4851bb7250 in testing::UnitTest::Run() (/home/coder/cuopt/cpp/build/asan/tests/routing/ROUTING_TEST+0xca250)
    #16 0x5b4851b208df in main (/home/coder/cuopt/cpp/build/asan/tests/routing/ROUTING_TEST+0x338df)
    #17 0x7d5e3bf2f1c9  (/lib/x86_64-linux-gnu/libc.so.6+0x2a1c9) (BuildId: 8e9fd827446c24067541ac5390e6f527fb5947bb)
    #18 0x7d5e3bf2f28a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2a28a) (BuildId: 8e9fd827446c24067541ac5390e6f527fb5947bb)
    #19 0x5b4851b2235d in _start (/home/coder/cuopt/cpp/build/asan/tests/routing/ROUTING_TEST+0x3535d)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/home/coder/cuopt/cpp/build/asan/libcuopt.so+0x454c11a) in void std::__insertion_sort<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_comp_iter<cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1>::priority_remove_diff_routes(cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1> const&)::{lambda(auto:1, auto:2)#1}> >(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__ops::_Iter_comp_iter<cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1>::priority_remove_diff_routes(cuopt::routing::detail::adapted_sol_t<int, float, (cuopt::routing::request_t)1> const&)::{lambda(auto:1, auto:2)#1}>)
==78192==ABORTING

Suspected Root Cause

cpp/src/routing/adapters/adapted_sol.cuh, in adapted_sol_t::priority_remove_diff_routes, builds route_priority with one entry per route in remove_route_ids, then indexes that vector by route id inside the comparator:

std::vector<i_t> route_priority;
route_priority.reserve(remove_route_ids.size());
for (auto& id : remove_route_ids) {
  route_priority.push_back(routes[id].length);
}
std::sort(remove_route_ids.begin(), remove_route_ids.end(), [&](auto i, auto j) {
  return route_priority[i] < route_priority[j];
});

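i and j are route ids, not indices into route_priority, which has only remove_route_ids.size() entries stored in insertion order. Whenever a route id is greater than remove_route_ids.size() - 1, the comparator reads out of bounds. Even when every id is in range, route_priority[i] holds the length of the i-th removed route rather than the length of route i, so the resulting order can be wrong. The out-of-bounds read matches the ASan stack in std::__insertion_sort and is consistent with the intermittent allocator corruption in release builds.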

A minimal local fix is to sort by the route lengths directly:

-      std::vector<i_t> route_priority;
-      route_priority.reserve(remove_route_ids.size());
-      for (auto& id : remove_route_ids) {
-        route_priority.push_back(routes[id].length);
-      }
       std::sort(remove_route_ids.begin(), remove_route_ids.end(), [&](auto i, auto j) {
-        return route_priority[i] < route_priority[j];
+        return routes[i].length < routes[j].length;
       });

Fix Validation

After applying the minimal comparator fix locally:

cmake --build /home/coder/cuopt/cpp/build/asan --target ROUTING_TEST -j 8
cmake --build /home/coder/cuopt/cpp/build/conda/cuda-13.2/release --target ROUTING_TEST -j 8

Focused ASan validation passed 24/24, including the previously ASan-failing MALLOC_PERTURB_=124 and release-failing MALLOC_PERTURB_=16 values.

Focused release heap-check validation passed 24/24, including MALLOC_PERTURB_=16 and MALLOC_PERTURB_=124.

An earlier validation run with the fix applied also passed 40/40 under ASan and 40/40 under the release heap check.

Notes

This is very hard to reproduce. After I intentionally restored the broken comparator, a fresh rerun did not fail in 64 ASan attempts, and the release rerun reached 118 attempts before timing out. The ASan and release failures shown above are raw logs from the local repro, and the CI log shows the same host allocator abort class, so I think it is the same problem.
