Segfault in SCIPsolveConcurrent() #2

Open
Mizux opened this issue Jul 30, 2021 · 5 comments

Mizux commented Jul 30, 2021

Currently, the tests segfault on CI when using the GitHub Linux runner.

Unfortunately, I didn't manage to reproduce it locally...
EDIT: I may have a way to reproduce it.

Steps to Reproduce the Problem

  1. configure the project using cmake -S. -Bbuild -DCMAKE_BUILD_TYPE=Debug
    note: By default it will build in RelWithDebInfo
  2. build using cmake --build build -v
  3. run the Foo program in a loop: start gdb build/bin/Foo, then type the following commands inside gdb:
    set pagination off
    break exit
    commands
    run
    end
    run
    

3bis. You can also put this in a .gdbinit file:

set pagination off
# set breakpoint pending on
break exit
commands
  run
end
run

Then run gdb build/bin/Foo; it will automatically run the test in a loop.

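note2: gdb may decline to auto-load a .gdbinit from the project directory; in that case you may need the following in ~/.gdbinit: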

cat ~/.gdbinit 
set auto-load safe-path /

Expected Behavior

no segfault

Actual Behavior

After a few minutes, gdb should stop on the failure.
note: on my machine it takes ~250 attempts (~2 min) to reproduce it, while on the CI it fails in nearly 80-90% of the runs
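
To get a feel for how long the loop has to run, the numbers above can be plugged into a simple model. This is a back-of-the-envelope sketch that assumes runs are independent and that "~250 attempts" corresponds to a per-run failure probability of about 1/250; both are assumptions, and the function names are made up for illustration.

```c
/* Probability of seeing at least one failure in n independent runs,
 * each failing with probability p: 1 - (1 - p)^n. */
static double prob_at_least_one_failure(double p, int n)
{
   double all_pass = 1.0;
   for (int i = 0; i < n; ++i)
      all_pass *= (1.0 - p);
   return 1.0 - all_pass;
}

/* Smallest number of runs for which the probability of at least one
 * failure reaches `confidence`, or -1 if max_runs is not enough. */
static int runs_needed(double p, double confidence, int max_runs)
{
   double all_pass = 1.0;
   for (int n = 1; n <= max_runs; ++n) {
      all_pass *= (1.0 - p);
      if (1.0 - all_pass >= confidence)
         return n;
   }
   return -1;
}
```

Under these assumptions, a local loop needs roughly 750 runs to hit the failure with 95% probability, whereas at a per-run failure rate of 80% (the lower CI figure) two runs already suffice.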

Possible trace:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7abe537 in __GI_abort () at abort.c:79
#2  0x00007ffff7abe40f in __assert_fail_base (fmt=0x7ffff7c27128 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x555556147030 "idx >= 0 && idx < nconcsolvers", file=0x555556146ce0 "/usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concurrent.c", line=518, function=<optimized out>) at assert.c:92
#3  0x00007ffff7acd662 in __GI___assert_fail (assertion=0x555556147030 "idx >= 0 && idx < nconcsolvers", file=0x555556146ce0 "/usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concurrent.c", line=518, function=0x555556147340 <__PRETTY_FUNCTION__.1> "SCIPconcurrentSolve") at assert.c:101
#4  0x0000555555f62d19 in SCIPconcurrentSolve (scip=0x5555561f6eb0) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concurrent.c:518
#5  0x00005555558fbf28 in SCIPsolveConcurrent (scip=0x5555561f6eb0) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/scip_solve.c:3043
#6  0x0000555555558eed in main () at /usr/local/google/home/corentinl/dev/scip-multithread/Foo/src/main.cpp:120

DevNote

It seems I can't reproduce it when clang's ThreadSanitizer is enabled (i.e. by using -DCLANG_TSAN=ON)
ref: https://clang.llvm.org/docs/ThreadSanitizer.html

@Mizux Mizux added the bug Something isn't working label Jul 30, 2021
@Mizux Mizux self-assigned this Jul 30, 2021
@Mizux Mizux changed the title jobs are flaky on Linux CI: Linux jobs are flaky Jul 30, 2021
Mizux commented Jul 30, 2021

Compiling in debug, I got the following trace:

cmake -S. -Bbuild -DCMAKE_BUILD_TYPE=Debug
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7abe537 in __GI_abort () at abort.c:79
#2  0x00007ffff7abe40f in __assert_fail_base (fmt=0x7ffff7c27128 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x555556147030 "idx >= 0 && idx < nconcsolvers", file=0x555556146ce0 "/usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concurrent.c", line=518, function=<optimized out>) at assert.c:92
#3  0x00007ffff7acd662 in __GI___assert_fail (assertion=0x555556147030 "idx >= 0 && idx < nconcsolvers", file=0x555556146ce0 "/usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concurrent.c", line=518, function=0x555556147340 <__PRETTY_FUNCTION__.1> "SCIPconcurrentSolve") at assert.c:101
#4  0x0000555555f62d19 in SCIPconcurrentSolve (scip=0x5555561f6eb0) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concurrent.c:518
#5  0x00005555558fbf28 in SCIPsolveConcurrent (scip=0x5555561f6eb0) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/scip_solve.c:3043
#6  0x0000555555558eed in main () at /usr/local/google/home/corentinl/dev/scip-multithread/Foo/src/main.cpp:120

(gdb) backtrace full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
        set = {__val = {0, 140737348986476, 8517419008, 93825045131888, 93825045131989, 93825045131888, 93825045131888, 93825045132067, 93825045132188, 93825045131888, 93825045132188, 0, 0, 0, 0, 0}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x00007ffff7abe537 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x5555587c7270, sa_sigaction = 0x5555587c7270}, sa_mask = {__val = {0, 93825045131888, 179, 0, 0, 0, 21474836480, 140737488346544, 140737488346368, 140737350118352, 140737350103336, 0, 9938603782819116032, 140737350086616, 140737354117120, 140737350103336}}, sa_flags = 1444179168, sa_restorer = 0x206}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007ffff7abe40f in __assert_fail_base (fmt=0x7ffff7c27128 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x555556147030 "idx >= 0 && idx < nconcsolvers", file=0x555556146ce0 "/usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concurrent.c", line=518, function=<optimized out>) at assert.c:92
        str = 0x5555587c7270 ""
        total = 4096
#3  0x00007ffff7acd662 in __GI___assert_fail (assertion=0x555556147030 "idx >= 0 && idx < nconcsolvers", file=0x555556146ce0 "/usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concurrent.c", line=518, function=0x555556147340 <__PRETTY_FUNCTION__.1> "SCIPconcurrentSolve") at assert.c:101
No locals.
#4  0x0000555555f62d19 in SCIPconcurrentSolve (scip=0x5555561f6eb0) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concurrent.c:518
        syncstore = 0x55555621ad40
        idx = -1
        jobid = 1
        i = 16
        retcode = SCIP_OKAY
        concsolvers = 0x555558327350
        nconcsolvers = 16
        __PRETTY_FUNCTION__ = "SCIPconcurrentSolve"
#5  0x00005555558fbf28 in SCIPsolveConcurrent (scip=0x5555561f6eb0) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/scip_solve.c:3043
        retcode = 1444922816
        i = 16
        rndgen = 0x0
        minnthreads = 1
        maxnthreads = 16
        __PRETTY_FUNCTION__ = "SCIPsolveConcurrent"
#6  0x0000555555558eed in main () at /usr/local/google/home/corentinl/dev/scip-multithread/Foo/src/main.cpp:120
        scip_ = 0x5555561f6eb0
        x_ = 0x5555563663f8
        y_ = 0x5555563665c8
        z_ = 0x555556366fb8
        constraint_0_ = 0x555556233ec8
        vars_0 = {0x5555563665c8, 0x555556366fb8, 0x5555563663f8}
        vals_0 = {7, 3, 2}
        constraint_1_ = 0x555556233f40
        vars_1 = {0x5555563665c8, 0x555556366fb8, 0x5555563663f8}
        vals_1 = {-5, 7, 3}
        constraint_2_ = 0x555556233fb8
        vars_2 = {0x5555563665c8, 0x555556366fb8, 0x5555563663f8}
        vals_2 = {2, -6, 5}
        stage = 21845
        scip_status = 1431668704
        solution_number_ = 21845

Notice the idx = -1 in frame #4, which comes from the call to SCIPsyncstoreGetWinner() (see the code excerpt in a later comment).

Mizux commented Jul 30, 2021

When using RelWithDebInfo, the code is compiled with -O3, so you can't access the idx value...

Thread 1 "Foo" received signal SIGSEGV, Segmentation fault.
SCIPconcsolverGetSolvingData (concsolver=0xa1, scip=0x555555c3eeb0) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concsolver.c:343
343	   return concsolver->type->concsolvercopysolvdata(concsolver, scip);
(gdb) bt
#0  SCIPconcsolverGetSolvingData (concsolver=0xa1, scip=0x555555c3eeb0) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concsolver.c:343
#1  0x0000555555a9a037 in SCIPconcurrentSolve (scip=scip@entry=0x555555c3eeb0) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concurrent.c:520
#2  0x00005555557019db in SCIPsolveConcurrent (scip=0x555555c3eeb0) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/scip_solve.c:3043
#3  0x000055555570236f in SCIPsolveConcurrent (scip=<optimized out>) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/scip_solve.c:3049
#4  0x0000555555559856 in main () at /usr/local/google/home/corentinl/dev/scip-multithread/Foo/src/main.cpp:120
(gdb) bt full
#0  SCIPconcsolverGetSolvingData (concsolver=0xa1, scip=0x555555c3eeb0) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concsolver.c:343
No locals.
#1  0x0000555555a9a037 in SCIPconcurrentSolve (scip=scip@entry=0x555555c3eeb0) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/concurrent.c:520
        _restat_ = <optimized out>
        syncstore = 0x555555c62b40
        idx = <optimized out>
        jobid = <optimized out>
        i = <optimized out>
        retcode = SCIP_OKAY
        concsolvers = 0x555557d55560
        nconcsolvers = 16
#2  0x00005555557019db in SCIPsolveConcurrent (scip=0x555555c3eeb0) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/scip_solve.c:3043
        retcode = <optimized out>
        i = <optimized out>
        rndgen = 0x0
        minnthreads = <optimized out>
        maxnthreads = <optimized out>
        _restat_ = <optimized out>
        _restat_ = <optimized out>
#3  0x000055555570236f in SCIPsolveConcurrent (scip=<optimized out>) at /usr/local/google/home/corentinl/dev/scip-multithread/build/_deps/scip-src/src/scip/scip_solve.c:3049
        retcode = <optimized out>
        i = <optimized out>
        rndgen = <optimized out>
        minnthreads = <optimized out>
        maxnthreads = <optimized out>
        _restat_ = <optimized out>
        nconcsolvertypes = <optimized out>
        concsolvertypes = <optimized out>
        nthreads = <optimized out>
        memorylimit = <optimized out>
        solvertypes = <optimized out>
        weights = <optimized out>
        prios = <optimized out>
        ncandsolvertypes = <optimized out>
        prefpriosum = <optimized out>
        _restat_ = <optimized out>
        infeas = <optimized out>
        _restat_ = <optimized out>
        _restat_ = <optimized out>
        _restat_ = <optimized out>
        _restat_ = <optimized out>
        _restat_ = <optimized out>
        _restat_ = <optimized out>
        _restat_ = <optimized out>
        _restat_ = <optimized out>
        _restat_ = <optimized out>
        _restat_ = <optimized out>
        j = <optimized out>
        prio = <optimized out>
        _restat_ = <optimized out>
        concsolver = <optimized out>
        _restat_ = <optimized out>
        _restat_ = <optimized out>
        _restat_ = <optimized out>
        _restat_ = <optimized out>
        _restat_ = <optimized out>
#4  0x0000555555559856 in main () at /usr/local/google/home/corentinl/dev/scip-multithread/Foo/src/main.cpp:120
        scip_ = 0x555555c3eeb0
        x_ = 0x555555dacd78
        y_ = 0x555555dacf40
        z_ = 0x555555dad928
        constraint_0_ = 0x555555c76cf8
        vars_0 = {0x555555dacf40, 0x555555dad928, 0x555555dacd78}
        vals_0 = {7, 3, 2}
        constraint_1_ = 0x555555c76d68
        vars_1 = {0x555555dacf40, 0x555555dad928, 0x555555dacd78}
        vals_1 = {-5, 7, 3}
        constraint_2_ = 0x555555c76dd8
        vars_2 = {0x555555dacf40, 0x555555dad928, 0x555555dacd78}
        vals_2 = {2, -6, 5}
        stage = <optimized out>
        scip_status = <optimized out>
        solution_number_ = <optimized out>

Mizux commented Jul 30, 2021

SCIPsetIntParam(scip_, "parallel/maxnthreads", 8);
SCIPsolveConcurrent(scip_);

This will call SCIPconcurrentSolve()

SCIP_RETCODE SCIPconcurrentSolve(
   SCIP*                 scip                /**< pointer to scip datastructure */
   )
...
retcode = SCIPtpiCollectJobs(jobid);
idx = SCIPsyncstoreGetWinner(syncstore);
assert(idx >= 0 && idx < nconcsolvers);
SCIP_CALL( SCIPconcsolverGetSolvingData(concsolvers[idx], scip) );

ref: https://github.com/scipopt/scip/blob/a6142b7d1d7892f950b1e5127925f7c435151beb/src/scip/concurrent.c#L517-L518

and SCIPsyncstoreGetWinner() may return -1

/** gets the solver that had the best status, or -1 if solve is not stopped yet */
int SCIPsyncstoreGetWinner(
   SCIP_SYNCSTORE*       syncstore           /**< the synchronization store */
   )
{
   assert(syncstore != NULL);
   assert(syncstore->initialized);

   if( syncstore->lastsync == NULL || syncstore->lastsync->status == SCIP_STATUS_UNKNOWN )
      return -1;
...

ref: https://github.com/scipopt/scip/blob/0941d97923752d494e315223b4aef70cd2a54639/src/scip/syncstore.c#L515-L524

Thus the failed assertion in Debug builds and, in builds where assert() is compiled out (NDEBUG), the segfault when trying to access concsolvers[-1]...

It seems that on the GitHub-hosted Linux runner, SCIPsyncstoreGetWinner() returns -1 nearly every time.

Open Questions:

  • Why doesn't the code check the index value or retcode before using the returned value?
  • Why does SCIPsyncstoreGetWinner() return -1 so often?
  • Why is syncstore->lastsync empty while SCIPtpiCollectJobs(jobid) seems to have correctly collected all results?
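
Regarding the first question, here is a minimal sketch of what a runtime check on the winner index could look like. It uses hypothetical stand-in types and functions (MockSyncStore, mock_get_winner, mock_pick_solver), not SCIP's real API, and it is not the upstream fix. It only illustrates that an explicit check, unlike assert(), still guards the array access when NDEBUG is defined.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration only, NOT SCIP's real types. */
typedef enum { STATUS_UNKNOWN = 0, STATUS_SOLVED = 1 } MockStatus;
typedef struct { MockStatus status; int winner; } MockSyncData;
typedef struct { const MockSyncData *lastsync; } MockSyncStore;

/* Mirrors the contract quoted above: returns -1 if there is no last
 * synchronization yet or its status is still unknown. */
static int mock_get_winner(const MockSyncStore *store)
{
   if (store->lastsync == NULL || store->lastsync->status == STATUS_UNKNOWN)
      return -1;
   return store->lastsync->winner;
}

/* Returns the winning solver's entry, or NULL when there is no valid
 * winner. The range check is ordinary code, so it is still executed in
 * builds where assert() is compiled out. */
static const char *mock_pick_solver(const MockSyncStore *store,
                                    const char *const solvers[], int nsolvers)
{
   int idx = mock_get_winner(store);
   if (idx < 0 || idx >= nsolvers)
      return NULL;
   return solvers[idx];
}
```

The caller then has to handle the NULL case explicitly (e.g. by returning an error code) instead of indexing the solver array with -1.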

@Mizux Mizux changed the title CI: Linux jobs are flaky Segfault in SCIPsolveConcurrent() Jul 30, 2021
Mizux added a commit that referenced this issue Aug 2, 2021
@Mizux Mizux pinned this issue Mar 7, 2022
@iainfullelove

@Mizux has there been any progress on this issue with SCIP?
I have recently tried using the concurrent solver and ran into the same error

Mizux commented Jun 5, 2023

@Mizux has there been any progress on this issue with SCIP? I have recently tried using the concurrent solver and ran into the same error

IIRC, a few months ago the SCIP devs told me it is fixed on the master branch, but I still haven't had/taken the time to verify their claim...
note: there has been no release since v803, which is affected by this issue.
ref: https://github.com/scipopt/scip/tags
