Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP server crashes with fatal signal 11: C-stack trace labeled "crash" #43

Closed
triska opened this issue Jan 23, 2016 · 24 comments
Closed
Labels

Comments

@triska
Copy link
Member

triska commented Jan 23, 2016

Let server.pl consist of:

:- use_module(library(http/thread_httpd)).
:- use_module(library(http/http_dispatch)).
:- use_module(library(http/http_unix_daemon)).

:- initialization http_daemon.

:- http_handler(/, handle_request, []).

handle_request(_Request) :-
    format("Content-type: text/plain~n~n"),
    format("hello!!").

Start the server with:

$ swipl server.pl --interactive --port=3030

Let test_server.tcl consist of:

set i 0
while {true} {
    if {$i % 100 == 0} { puts "$i."}
    set s [socket localhost 3030]
    puts $s "GET /"
    set i [expr {$i + 1}]
}

and run this Tcl script with:

$ tclsh test_server.tcl

On OS-X 10.10.1, I get the following output from the test script:

$ tclsh test_server.tcl
0.
100.
200.
...
2500.
couldn't open socket: nodename nor servname provided, or not known

and the SWI-Prolog HTTP server crashes with:

% Started server at http://localhost:3030/
% Started server at port 3030
Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 7.3.14-62-g18f1547-DIRTY)
Copyright (c) 1990-2015 University of Amsterdam, VU Amsterdam
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to redistribute it under certain conditions.
Please visit http://www.swi-prolog.org for details.

For help, use ?- help(Topic). or ?- apropos(Word).

?- 
<b>SWI-Prolog [thread 3 (httpd@3030_2) at Sat Jan 23 23:38:32 2016]: received fatal signal 11 (segv)
C-stack trace labeled "crash":</b>
  [0] /usr/local/lib/swipl-7.1.27/lib/x86_64-darwin14.0.0/libswipl.dylib(save_backtrace+0xb9) [0x104a60d69]
  [1] /usr/local/lib/swipl-7.1.27/lib/x86_64-darwin14.0.0/libswipl.dylib(crashHandler+0xa8) [0x104a61348]
  [2] /usr/local/lib/swipl-7.1.27/lib/x86_64-darwin14.0.0/libswipl.dylib(dispatch_signal+0x2d3) [0x104a09333]
  [3] /usr/lib/system/libsystem_platform.dylib(_sigtramp+0x1a) [0x7fff91c49f1a]
  [5] /usr/local/lib/swipl-7.1.27/lib/x86_64-darwin14.0.0/libswipl.dylib(Sgetcode+0x260) [0x104a4e520]
  [6] /usr/local/lib/swipl-7.1.27/lib/x86_64-darwin14.0.0/readutil.dylib(read_line_to_codes3+0x7c) [0x104b30b0c]
  [7] /usr/local/lib/swipl-7.1.27/lib/x86_64-darwin14.0.0/libswipl.dylib(PL_next_solution+0x80c2) [0x10499dfd2]
  [8] /usr/local/lib/swipl-7.1.27/lib/x86_64-darwin14.0.0/libswipl.dylib(callProlog+0x151) [0x1049ebcc1]
  [9] /usr/local/lib/swipl-7.1.27/lib/x86_64-darwin14.0.0/libswipl.dylib(start_thread+0x140) [0x104a1e810]
  [10] /usr/lib/system/libsystem_pthread.dylib(_pthread_body+0x83) [0x7fff88bb72fc]
  [11] /usr/lib/system/libsystem_pthread.dylib(_pthread_body+0x0) [0x7fff88bb7279]
  [12] /usr/lib/system/libsystem_pthread.dylib(thread_start+0xd) [0x7fff88bb54b1]

On Debian 8.2, the test script test_server.tcl runs seemingly without problems, but as soon as I interrupt the script with Ctrl+C, the SWI-Prolog HTTP server also crashes on Debian, and its output is:

?- 
SWI-Prolog [thread 4 (httpd@3030_3) at Sat Jan 23 23:45:32 2016]: received fatal signal 11 (segv)

SWI-Prolog [thread 3 (httpd@3030_2) at Sat Jan 23 23:45:32 2016]: received fatal signal 11 (segv)

SWI-Prolog [thread 5 (httpd@3030_4) at Sat Jan 23 23:45:32 2016]: received fatal signal 11 (segv)
C-stack trace labeled "crash":
C-stack trace labeled "crash":

SWI-Prolog [thread 6 (httpd@3030_5) at Sat Jan 23 23:45:32 2016]: received fatal signal 11 (segv)

SWI-Prolog [thread 2 (httpd@3030_1) at Sat Jan 23 23:45:32 2016]: received fatal signal 11 (segv)
C-stack trace labeled "crash":
C-stack trace labeled "crash":
C-stack trace labeled "crash":
  [0] save_backtrace() at :? [0x7f5c64ef183a]
  [0] save_backtrace() at :? [0x7f5c64ef183a]
  [0] save_backtrace() at :? [0x7f5c64ef183a]
  [0] save_backtrace() at :? [0x7f5c64ef183a]
  [0] save_backtrace() at :? [0x7f5c64ef183a]
  [1] crashHandler() at pl-cstack.c:? [0x7f5c64ef1a04]
  [1] crashHandler() at pl-cstack.c:? [0x7f5c64ef1a04]
  [1] crashHandler() at pl-cstack.c:? [0x7f5c64ef1a04]
  [1] crashHandler() at pl-cstack.c:? [0x7f5c64ef1a04]
  [1] crashHandler() at pl-cstack.c:? [0x7f5c64ef1a04]
  [2] dispatch_signal() at pl-setup.c:? [0x7f5c64ea162b]
  [2] dispatch_signal() at pl-setup.c:? [0x7f5c64ea162b]
  [2] dispatch_signal() at pl-setup.c:? [0x7f5c64ea162b]
  [2] dispatch_signal() at pl-setup.c:? [0x7f5c64ea162b]
  [3] __restore_rt() at ??:? [0x7f5c64c0b8d0]
  [3] __restore_rt() at ??:? [0x7f5c64c0b8d0]
  [3] __restore_rt() at ??:? [0x7f5c64c0b8d0]
  [2] dispatch_signal() at pl-setup.c:? [0x7f5c64ea162b]
  [3] __restore_rt() at ??:? [0x7f5c64c0b8d0]
  [4] S__fillbuf() at ??:? [0x7f5c64ee361d]
  [4] S__fillbuf() at ??:? [0x7f5c64ee361d]
  [4] S__fillbuf() at ??:? [0x7f5c64ee361d]
  [3] __restore_rt() at ??:? [0x7f5c64c0b8d0]
  [5] Sgetcode() at ??:? [0x7f5c64ee40d0]
  [4] S__fillbuf() at ??:? [0x7f5c64ee361d]
  [5] Sgetcode() at ??:? [0x7f5c64ee40d0]
  [4] S__fillbuf() at ??:? [0x7f5c64ee361d]
  [5] Sgetcode() at ??:? [0x7f5c64ee40d0]
  [6] read_line_to_codes3() at readutil.c:? [0x7f5c63645c67]
  [6] read_line_to_codes3() at readutil.c:? [0x7f5c63645c67]
  [5] Sgetcode() at ??:? [0x7f5c64ee40d0]
  [6] read_line_to_codes3() at readutil.c:? [0x7f5c63645c67]
  [7] PL_next_solution() at ??:? [0x7f5c64e55a65]
  [7] PL_next_solution() at ??:? [0x7f5c64e55a65]
  [5] Sgetcode() at ??:? [0x7f5c64ee40d0]
  [8] callProlog() at :? [0x7f5c64e8ada9]
  [6] read_line_to_codes3() at readutil.c:? [0x7f5c63645c67]
  [8] callProlog() at :? [0x7f5c64e8ada9]
  [7] PL_next_solution() at ??:? [0x7f5c64e55a65]
  [6] read_line_to_codes3() at readutil.c:? [0x7f5c63645c67]
  [7] PL_next_solution() at ??:? [0x7f5c64e55a65]
  [8] callProlog() at :? [0x7f5c64e8ada9]
  [9] start_thread() at pl-thread.c:? [0x7f5c64eb7fb2]
  [9] start_thread() at pl-thread.c:? [0x7f5c64eb7fb2]
  [8] callProlog() at :? [0x7f5c64e8ada9]
  [7] PL_next_solution() at ??:? [0x7f5c64e55a65]
  [9] start_thread() at pl-thread.c:? [0x7f5c64eb7fb2]
  [10] start_thread() at ??:? [0x7f5c64c040a4]
  [10] start_thread() at ??:? [0x7f5c64c040a4]
  [8] callProlog() at :? [0x7f5c64e8ada9]
  [9] start_thread() at pl-thread.c:? [0x7f5c64eb7fb2]
  [10] start_thread() at ??:? [0x7f5c64c040a4]
  [9] start_thread() at pl-thread.c:? [0x7f5c64eb7fb2]
  [10] start_thread() at ??:? [0x7f5c64c040a4]
  [11] clone() at ??:? [0x7f5c6493904d]
@JanWielemaker
Copy link
Member

JanWielemaker commented Jan 24, 2016

You do not read and close the input. As a result, you'll run out of open files. What and where depends a lot on timing. In my case it is Tcl running out first. Look like in your case it is Prolog running out first and probably somewhere ignores an error. You might get an idea after compiling for debugging and running under GDB.

@triska
Copy link
Member Author

triska commented Jan 24, 2016

Here is the GDB trace, does it help?

$ gdb --args swipl server.pl --interactive --port=3030
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from swipl...done.
(gdb) run
Starting program: /usr/local/bin/swipl server.pl --interactive --port=3030
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff49bf700 (LWP 10039)]
[New Thread 0x7fffeffff700 (LWP 10040)]
[New Thread 0x7fffef7fe700 (LWP 10041)]
[New Thread 0x7fffeeffd700 (LWP 10042)]
[New Thread 0x7fffee7fc700 (LWP 10043)]
[New Thread 0x7fffedffb700 (LWP 10044)]
% Started server at http://localhost:3030/
% Started server at port 3030
Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 7.3.15-19-g08f20ea)
Copyright (c) 1990-2015 University of Amsterdam, VU Amsterdam
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to redistribute it under certain conditions.
Please visit http://www.swi-prolog.org for details.

For help, use ?- help(Topic). or ?- apropos(Word).

?- 
<b>Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff49bf700 (LWP 10039)]
0x00007ffff7b6d95d in S__fillbuf (s=0x1) at os/pl-stream.c:553
553     if ( (rc=S__wait(s)) < 0 )</b>
(gdb) bt
#0  0x00007ffff7b6d95d in S__fillbuf (s=0x1) at os/pl-stream.c:553
#1  0x00007ffff7b6e410 in get_byte (s=0x7ffff00356b0) at os/pl-stream.c:694
#2  Sgetcode (s=0x7ffff00356b0) at os/pl-stream.c:969
#3  0x00007ffff62cfc67 in read_line_to_codes3 (stream=<optimized out>, 
    codes=235, tail=0) at readutil.c:64
#4  0x00007ffff7adfce9 in PL_next_solution (qid=8509856) at pl-vmi.c:3594
#5  0x00007ffff7b15069 in callProlog (module=0x7bbde0, goal=<optimized out>, 
    flags=<optimized out>, ex=0x7ffff49beec0) at pl-pro.c:319
#6  0x00007ffff7b42272 in start_thread (closure=0x0) at pl-thread.c:1353
#7  0x00007ffff788e0a4 in start_thread (arg=0x7ffff49bf700)
    at pthread_create.c:309
#8  0x00007ffff75c304d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) 

@JanWielemaker
Copy link
Member

#0 0x00007ffff7b6d95d in S__fillbuf (s=0x1) at os/pl-stream.c:553
#1 0x00007ffff7b6e410 in get_byte (s=0x7ffff00356b0) at os/pl-stream.c:694

Is Prolog compiled for optimization? This is clearly wrong. get_byte() passes the same IOSTREAM to S__fillbuf(). When compiled with optimization, this may well be gdb that is mistaken. Use COFLAGS=-gdwarf-2 -g3. I do expect IOSTREAM to be broken. The only question is how and why. So, please print relevant variables in the relevant functions.

@triska
Copy link
Member Author

triska commented Jan 24, 2016

I have now compiled it without optimization, and using the COFLAGS you gave.

The crash now yields:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff3764700 (LWP 15670)]
0x00007ffff7b63bad in S__fillbuf (s=0x265000007fffe807) at os/pl-stream.c:558
558   if ( s->flags & SIO_NBUF )

but the backtrace apparently no longer works:

(gdb) bt
#0  0x00007ffff7b63bad in S__fillbuf (s=0x265000007fffe807)
    at os/pl-stream.c:558
#1  0x268800007ffff7b6 in ?? ()
#2  0x880000007ffff376 in ?? ()
#3  0x268000007fffe807 in ?? ()
#4  0x880000007ffff376 in ?? ()
#5  0x26c000007fffe807 in ?? ()
#6  0x49ef00007ffff376 in ?? ()
#7  0xbcb000007ffff7b6 in ?? ()
...

This happens with several different compilation options, including "-O0 -g3" and "-O0 -g". I tried this on Debian 8.2.

@JanWielemaker
Copy link
Member

JanWielemaker commented Jan 24, 2016

Looks like a stack corruption. This made me check on a system with an older version of GCC (4.8.4) as that may result in a different stack layout. Same result. As is, Tcl crashes on too many open files. If I use ulimit -n 2500 in the shell running Tcl, the Prolog server starts printing exceptions about too many open files, but it doesn't crash.

Ran Prolog under Valgrind. Doesn't show any issues. Makes it hard to debug :( Two things may help: try Valgrind in your case or use GCC's stack protector to crash before the stack is corrupt. Never tried the latter route. Should be by passing -fstack-protector-strong to GCC.

Note that you can probably make the crash happen quicker using e.g., ulimit -n 100 before running Prolog. Valgrind is slow ... Under Valgrind, you get this warning. As far as I can tell, this happens internally inside glibc. Prolog nicely does accept() on fd=3 and gets -1 returned.

Warning: invalid file descriptor 97 in syscall accept()

@triska
Copy link
Member Author

triska commented Jan 24, 2016

I am using GCC 4.9.2.

When the crash happens, valgrind shows:

?- ==3256== Thread 6:
==3256== Invalid read of size 4
==3256==    at 0x4EFF95D: S(float, int, long, long, bool, float __restrict) (pl-stream.c:553)
==3256==    by 0x4F0040F: get_byte (pl-stream.c:694)
==3256==    by 0x4F0040F: Sgetcode (pl-stream.c:969)
==3256==    by 0x6B41B4B: read_line_to_codes3 (readutil.c:64)
==3256==    by 0x6B41DA4: read_line_to_codes2 (readutil.c:112)
==3256==    by 0x4E71CE8: PL_next_solution (pl-vmi.c:3594)
==3256==    by 0x4EA7068: callProlog (pl-pro.c:319)
==3256==    by 0x4ED4271: start_thread (pl-thread.c:1353)
==3256==    by 0x51760A3: start_thread (pthread_create.c:309)
==3256==  Address 0x2d is not stack'd, malloc'd or (recently) free'd
==3256== 

SWI-Prolog [thread 6 (httpd@3030_5) at Sun Jan 24 13:37:06 2016]: received fatal signal 11 (segv)

SWI-Prolog [thread 5 (httpd@3030_4) at Sun Jan 24 13:37:06 2016]: received fatal signal 11 (segv)

SWI-Prolog [thread 3 (httpd@3030_2) at Sun Jan 24 13:37:06 2016]: received fatal signal 11 (segv)

SWI-Prolog [thread 4 (httpd@3030_3) at Sun Jan 24 13:37:06 2016]: received fatal signal 11 (segv)

SWI-Prolog [thread 2 (httpd@3030_1) at Sun Jan 24 13:37:06 2016]: received fatal signal 11 (segv)
C-stack trace labeled "crash":
C-stack trace labeled "crash":
C-stack trace labeled "crash":
C-stack trace labeled "crash":
C-stack trace labeled "crash":
  [0] save_backtrace() at swipl-devel/src/os/pl-cstack.c:310 [0x4f0db7a]
  [0] save_backtrace() at swipl-devel/src/os/pl-cstack.c:310 [0x4f0db7a]
  [0] save_backtrace() at swipl-devel/src/os/pl-cstack.c:310 [0x4f0db7a]
  [0] save_backtrace() at swipl-devel/src/os/pl-cstack.c:310 [0x4f0db7a]
  [0] save_backtrace() at swipl-devel/src/os/pl-cstack.c:310 [0x4f0db7a]
  [1] crashHandler() at swipl-devel/src/os/pl-cstack.c:465 [0x4f0dd44]
  [1] crashHandler() at swipl-devel/src/os/pl-cstack.c:465 [0x4f0dd44]
  [1] crashHandler() at swipl-devel/src/os/pl-cstack.c:465 [0x4f0dd44]
  [1] crashHandler() at swipl-devel/src/os/pl-cstack.c:465 [0x4f0dd44]
  [1] crashHandler() at swipl-devel/src/os/pl-cstack.c:465 [0x4f0dd44]
  [2] dispatch_signal() at swipl-devel/src/pl-setup.c:465 [0x4ebd8eb]
  [2] dispatch_signal() at swipl-devel/src/pl-setup.c:465 [0x4ebd8eb]
  [2] dispatch_signal() at swipl-devel/src/pl-setup.c:465 [0x4ebd8eb]
  [2] dispatch_signal() at swipl-devel/src/pl-setup.c:465 [0x4ebd8eb]
  [2] dispatch_signal() at swipl-devel/src/pl-setup.c:465 [0x4ebd8eb]
  [3] __restore_rt() at ??:? [0x517d8d0]
  [3] __restore_rt() at ??:? [0x517d8d0]
  [3] __restore_rt() at ??:? [0x517d8d0]
  [3] __restore_rt() at ??:? [0x517d8d0]
  [3] __restore_rt() at ??:? [0x517d8d0]
  [4] S__fillbuf() at swipl-devel/src/os/pl-stream.c:553 [0x4eff95d]
  [4] S__fillbuf() at swipl-devel/src/os/pl-stream.c:553 [0x4eff95d]
  [4] S__fillbuf() at swipl-devel/src/os/pl-stream.c:553 [0x4eff95d]
  [4] S__fillbuf() at swipl-devel/src/os/pl-stream.c:553 [0x4eff95d]
  [4] S__fillbuf() at swipl-devel/src/os/pl-stream.c:553 [0x4eff95d]
  [5] get_byte() at swipl-devel/src/os/pl-stream.c:694 [0x4f00410]
  [5] get_byte() at swipl-devel/src/os/pl-stream.c:694 [0x4f00410]
  [5] get_byte() at swipl-devel/src/os/pl-stream.c:694 [0x4f00410]
  [5] get_byte() at swipl-devel/src/os/pl-stream.c:694 [0x4f00410]
  [5] get_byte() at swipl-devel/src/os/pl-stream.c:694 [0x4f00410]
  [6] read_line_to_codes3() at swipl-devel/packages/clib/readutil.c:64 [0x6b41b4c]
  [6] read_line_to_codes3() at swipl-devel/packages/clib/readutil.c:64 [0x6b41b4c]
  [7] read_line_to_codes2() at swipl-devel/packages/clib/readutil.c:113 [0x6b41da5]
  [6] read_line_to_codes3() at swipl-devel/packages/clib/readutil.c:64 [0x6b41b4c]
  [6] read_line_to_codes3() at swipl-devel/packages/clib/readutil.c:64 [0x6b41b4c]
  [6] read_line_to_codes3() at swipl-devel/packages/clib/readutil.c:64 [0x6b41b4c]
  [7] read_line_to_codes2() at swipl-devel/packages/clib/readutil.c:113 [0x6b41da5]
  [7] read_line_to_codes2() at swipl-devel/packages/clib/readutil.c:113 [0x6b41da5]
  [7] read_line_to_codes2() at swipl-devel/packages/clib/readutil.c:113 [0x6b41da5]
  [7] read_line_to_codes2() at swipl-devel/packages/clib/readutil.c:113 [0x6b41da5]
  [8] PL_next_solution() at swipl-devel/src/pl-vmi.c:3594 [0x4e71ce9]
  [8] PL_next_solution() at swipl-devel/src/pl-vmi.c:3594 [0x4e71ce9]
  [8] PL_next_solution() at swipl-devel/src/pl-vmi.c:3594 [0x4e71ce9]
  [8] PL_next_solution() at swipl-devel/src/pl-vmi.c:3594 [0x4e71ce9]
  [8] PL_next_solution() at swipl-devel/src/pl-vmi.c:3594 [0x4e71ce9]
  [9] callProlog() at swipl-devel/src/pl-pro.c:325 [0x4ea7069]
  [9] callProlog() at swipl-devel/src/pl-pro.c:325 [0x4ea7069]
  [9] callProlog() at swipl-devel/src/pl-pro.c:325 [0x4ea7069]
  [9] callProlog() at swipl-devel/src/pl-pro.c:325 [0x4ea7069]
  [9] callProlog() at swipl-devel/src/pl-pro.c:325 [0x4ea7069]
  [10] start_thread() at swipl-devel/src/pl-thread.c:1353 [0x4ed4272]
  [10] start_thread() at swipl-devel/src/pl-thread.c:1353 [0x4ed4272]
  [10] start_thread() at swipl-devel/src/pl-thread.c:1353 [0x4ed4272]
  [10] start_thread() at swipl-devel/src/pl-thread.c:1353 [0x4ed4272]
  [10] start_thread() at swipl-devel/src/pl-thread.c:1353 [0x4ed4272]
  [11] start_thread() at /build/glibc-I9DIZl/glibc-2.19/nptl/pthread_create.c:309 (discriminator 2) [0x51760a4]
  [11] start_thread() at /build/glibc-I9DIZl/glibc-2.19/nptl/pthread_create.c:309 (discriminator 2) [0x51760a4]
  [11] start_thread() at /build/glibc-I9DIZl/glibc-2.19/nptl/pthread_create.c:309 (discriminator 2) [0x51760a4]
  [11] start_thread() at /build/glibc-I9DIZl/glibc-2.19/nptl/pthread_create.c:309 (discriminator 2) [0x51760a4]
  [11] start_thread() at /build/glibc-I9DIZl/glibc-2.19/nptl/pthread_create.c:309 (discriminator 2) [0x51760a4]
  [12] clone() at /build/glibc-I9DIZl/glibc-2.19/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:113 [0x547104d]
==3256== 
==3256== HEAP SUMMARY:
==3256==     in use at exit: 6,024,345 bytes in 53,359 blocks
==3256==   total heap usage: 61,381 allocs, 8,022 frees, 12,782,058 bytes allocated
==3256== 
==3256== LEAK SUMMARY:
==3256==    definitely lost: 66,384 bytes in 19 blocks
==3256==    indirectly lost: 0 bytes in 0 blocks
==3256==      possibly lost: 936,560 bytes in 927 blocks
==3256==    still reachable: 5,021,401 bytes in 52,413 blocks
==3256==         suppressed: 0 bytes in 0 blocks
==3256== Rerun with --leak-check=full to see details of leaked memory
==3256== 
==3256== For counts of detected and suppressed errors, rerun with: -v
==3256== ERROR SUMMARY: 5 errors from 1 contexts (suppressed: 0 from 0

@JanWielemaker
Copy link
Member

JanWielemaker commented Jan 24, 2016

I have no clue. It is totally non-reproducible. I'm afraid you'll have to dig deep using GDB to see what is going on with the IOSTREAM pointer. For now I see it as a not too critical error that might depend on a broken build. I'm more interested in the attributed var issue.

@triska
Copy link
Member Author

triska commented Jan 30, 2016

I have one more data point: The number of allowed open files on Debian is quite large, much larger than this test case requires:

$ ulimit -Hn; ulimit -Sn
65536
65536

Therefore, I think that this is not the underlying cause of this issue, though it may of course be related.

@JanWielemaker
Copy link
Member

JanWielemaker commented Jan 30, 2016

The devil is normally in the details. Luckily Ubuntu 15.10 seems to have stack protection enabled by default, so locating the bug was trivial. The issue is select(), which is used to wait for sockets that cannot handle file handles >= 1024. The bad news is that there is no simple fix. The only sensible way out seems to be to move to the poll() API, or better epoll(), but that only exists in Linux :( Unfortunately, this means all select() calls must be replaced to make any sense. That are about 20 of them in the whole code base. Not really high on my priority list. Keep ulimit -n below 1024 and you are fine.

Here is a reference article

@triska
Copy link
Member Author

triska commented Jan 30, 2016

The critical select call in this case seems to be the one in pl-stream.c, around line 400. If you show me how to replace it by poll, I and everyone else who is interested in serving more than 1024 clients can try to fix the remaining ones.

@JanWielemaker
Copy link
Member

The link quite nicely explains how to do the rewrite. But, once you have over 1024 file handles, anything you open will have a handle >= 1024 and thus all operations involving select that use file handles (some only use the timeout argument) anywhere in the system becomes dangerous.

@triska
Copy link
Member Author

triska commented Jan 31, 2016

I can trigger the crash on OSX 10.10.2 even when I set ulimit -n 100.

For the sake of robustness and portability, until the select calls are replaced by poll, please consider ensuring a safe value from within SWI-Prolog, i.e., refuse to open more files when the number would exceed FD_SETSIZE.

@JanWielemaker
Copy link
Member

JanWielemaker commented Feb 1, 2016

Either OSX has a ridiculously low limit or there is a different issue. Anyway, replace select() by poll() or set ulimit not too high. I suspect that the OS default is typically FD_SETSIZE. That is at least the case on Linux, where both are 1024. If you change that upward, you have to make sure the application is not using select(). I see no reason for temporary workarounds inside Prolog if the typical OS has proper defaults and surely a way around. Would have been nice if the C library was a bit more forgiving...

@JanWielemaker
Copy link
Member

JanWielemaker commented Feb 2, 2016

Did step one. Might fix your problem, but basically makes it a gamble what works and what not over FD_SETSIZE. Worse, it seems we need to keep the select() code as poll() isn't yet ll around :(

@wouterbeek wouterbeek added the bug label May 7, 2016
@wouterbeek
Copy link
Contributor

wouterbeek commented May 8, 2016

TR;DR

This bug is not OS-X specific. It reproduces differently on Debian/Ubuntu because of a different default setting for number of files.

The solution is to replace all select() calls with poll() calls (approx. 20 occurrences). But this solution is currently not available on Windows.

@JanWielemaker
Copy link
Member

poll() conforms to POSIX.1-2001. It is just Windows missing it. Note that some calls to select() are only used for timeout. These need not be changed either.

@wouterbeek
Copy link
Contributor

wouterbeek commented May 8, 2016

@JanWielemaker How do the Windows guys fix these types of issues then?

@JanWielemaker
Copy link
Member

I think you have to stop using the POSIX emulation by MS and instead either directly use the native Win32 API or roll your own (better) POSIX emulation. Considering the Linux subsystem in Windows 10, things may have improved. Using poll() is anyway the first step. I've done one. It isn't particularly hard, but it is quite a bit of typing.

@triska
Copy link
Member Author

triska commented Jun 17, 2016

Thank you Jan for your recent changes to use poll(). However, on OSX, I still get the same crash of SWI-Prolog after the file descriptors are exhausted by the Tcl script.

$ ulimit -S -n
2560

@JanWielemaker
Copy link
Member

I either missed one or poll() isn't detected. Is HAVE_POLL defined in config.h? Can you add a stack trace to indicate where it goes wrong?

@JanWielemaker
Copy link
Member

FYI, the original test now runs fine on Linux (Ubuntu 14.04).

@triska
Copy link
Member Author

triska commented Jun 17, 2016

While trying to reproduce the original issue on Debian 8.1, I now got also a different message:

$ swipl server.pl --port=3030 --interactive
% Started server at http://localhost:3030/
Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 7.3.23-4-g53e67a2)
Copyright (c) 1990-2015 University of Amsterdam, VU Amsterdam
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to redistribute it under certain conditions.
Please visit http://www.swi-prolog.org for details.
For help, use ?- help(Topic). or ?- apropos(Word).
?- WARNING: Race condition detected.  Please report at:
WARNING:   https://github.com/SWI-Prolog/swipl-devel/issues
C-stack trace labeled "addNewHTable":
  [0] save_backtrace() at :? [0x7fb4e8f2095a]
  [1] addNewHTable() at :? [0x7fb4e8f14c11]
  [2] getStreamContext() at pl-file.c:? [0x7fb4e8f001d2]
  [3] PL_unify_stream() at ??:? [0x7fb4e8f04264]
  [4] pl_open_socket() at socket.c:? [0x7fb4e7872b06]
  [5] PL_next_solution() at ??:? [0x7fb4e8e7ebcd]
  [6] callProlog() at :? [0x7fb4e8eb4449]
  [7] start_thread() at pl-thread.c:? [0x7fb4e8ee18e2]
  [8] start_thread() at ??:? [0x7fb4e8c2c0a4]
  [9] clone() at ??:? [0x7fb4e896104d]
Foreign predicate http_stream:cgi_discard/1 did not clear exception: error(io_error(write,(0x7fb4d80677d0)),context(http_stream:cgi_discard/1,Inappropriate ioctl for device))

For the client, I used:

$ ulimit -n 30000
$ tclsh test_server.tcl
0.
1.
...
28233.
couldn't open socket: cannot assign requested address

One other interesting data point:

When I log in to the shell on Debian 8.1, I initially get:

$ ulimit -a
...
open files                      (-n) 65536
...

yet I can reach barely half that limit with the client:

...
28232.
28233.
couldn't open socket: cannot assign requested address
    while executing
"socket localhost 3030"

It is tempting to assume that somehow twice as many files/sockets/whatever are opened, but that seems not to be the case, because I can attain the limit if I lower it (for example, to 100).

Please make sure to try this with higher limits. As far as I recall, Ubuntu uses considerably lower limits that do not trigger any issues. For comparison, with Debian 8.1, I get the following limits per default:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 3926
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 3926
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

@JanWielemaker
Copy link
Member

All seems fine on Ubuntu 16.04. Yes, TCL stops at roughly 28,000 with cannot assign requested address. If I lower the -n limit on the server too far I get Too many open files errors. Otherwise
nothing goes wrong.

I'm not sure what is causing this to fail around 28,000. Might be that you cannot have more than
64K connections on one IP address AFAIK. I don't know whether both having the client and server
on the same host make one connection count for two.

The reported race is probably a bug, but most likely not fatal. Dunno what the inappropriate ioctl
is. It fails to reproduce.

@triska
Copy link
Member Author

triska commented Jun 26, 2016

I have recompiled the whole system from the ground up on both Debian and OSX, and the issue seems gone now, so I'm closing this. Thank you very much Jan for this awesome scalability improvement!

@triska triska closed this as completed Jun 26, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants