HTTP server crashes with fatal signal 11: C-stack trace labeled "crash" #43
Comments
You do not read and close the input. As a result, you'll run out of open files. What fails and where depends a lot on timing. In my case it is Tcl running out first. It looks like in your case it is Prolog running out first and probably ignoring an error somewhere. You might get an idea after compiling for debugging and running under GDB.
Here is the GDB trace, does it help?
$ gdb --args swipl server.pl --interactive --port=3030
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from swipl...done.
(gdb) run
Starting program: /usr/local/bin/swipl server.pl --interactive --port=3030
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff49bf700 (LWP 10039)]
[New Thread 0x7fffeffff700 (LWP 10040)]
[New Thread 0x7fffef7fe700 (LWP 10041)]
[New Thread 0x7fffeeffd700 (LWP 10042)]
[New Thread 0x7fffee7fc700 (LWP 10043)]
[New Thread 0x7fffedffb700 (LWP 10044)]
% Started server at http://localhost:3030/
% Started server at port 3030
Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 7.3.15-19-g08f20ea)
Copyright (c) 1990-2015 University of Amsterdam, VU Amsterdam
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to redistribute it under certain conditions.
Please visit http://www.swi-prolog.org for details.
For help, use ?- help(Topic). or ?- apropos(Word).
?-
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff49bf700 (LWP 10039)]
0x00007ffff7b6d95d in S__fillbuf (s=0x1) at os/pl-stream.c:553
553 if ( (rc=S__wait(s)) < 0 )
(gdb) bt
#0 0x00007ffff7b6d95d in S__fillbuf (s=0x1) at os/pl-stream.c:553
#1 0x00007ffff7b6e410 in get_byte (s=0x7ffff00356b0) at os/pl-stream.c:694
#2 Sgetcode (s=0x7ffff00356b0) at os/pl-stream.c:969
#3 0x00007ffff62cfc67 in read_line_to_codes3 (stream=<optimized out>,
codes=235, tail=0) at readutil.c:64
#4 0x00007ffff7adfce9 in PL_next_solution (qid=8509856) at pl-vmi.c:3594
#5 0x00007ffff7b15069 in callProlog (module=0x7bbde0, goal=<optimized out>,
flags=<optimized out>, ex=0x7ffff49beec0) at pl-pro.c:319
#6 0x00007ffff7b42272 in start_thread (closure=0x0) at pl-thread.c:1353
#7 0x00007ffff788e0a4 in start_thread (arg=0x7ffff49bf700)
at pthread_create.c:309
#8 0x00007ffff75c304d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb)
Is Prolog compiled with optimization? This is clearly wrong: get_byte() passes the same IOSTREAM to S__fillbuf(), yet the trace shows s=0x1 in frame #0 and s=0x7ffff00356b0 in frame #1. When compiled with optimization, it may well be GDB that is mistaken. Use
I have now compiled it without optimization, and using the
The crash now yields:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff3764700 (LWP 15670)]
0x00007ffff7b63bad in S__fillbuf (s=0x265000007fffe807) at os/pl-stream.c:558
558 if ( s->flags & SIO_NBUF )
but the backtrace apparently no longer works:
(gdb) bt
#0 0x00007ffff7b63bad in S__fillbuf (s=0x265000007fffe807) at os/pl-stream.c:558
#1 0x268800007ffff7b6 in ?? ()
#2 0x880000007ffff376 in ?? ()
#3 0x268000007fffe807 in ?? ()
#4 0x880000007ffff376 in ?? ()
#5 0x26c000007fffe807 in ?? ()
#6 0x49ef00007ffff376 in ?? ()
#7 0xbcb000007ffff7b6 in ?? ()
...
This happens with several different compilation options, including "-O0 -g3" and "-O0 -g". I tried this on Debian 8.2.
Looks like a stack corruption. This made me check on a system with an older version of GCC (4.8.4), as that may result in a different stack layout. Same result. As is, Tcl crashes on too many open files. If I use
Ran Prolog under Valgrind. It doesn't show any issues, which makes it hard to debug :( Two things may help: try Valgrind in your case, or use GCC's stack protector to crash before the stack is corrupt. I never tried the latter route. Should be by passing
Note that you can probably make the crash happen quicker using e.g.,
Warning: invalid file descriptor 97 in syscall accept()
I am using GCC 4.9.2. When the crash happens,
I have no clue. It is totally non-reproducible. I'm afraid you'll have to dig deep using GDB to see what is going on with the
I have one more data point: The number of allowed open files on Debian is quite large, much larger than this test case requires:
$ ulimit -Hn; ulimit -Sn
65536
65536
Therefore, I think that this is not the underlying cause of this issue, though it may of course be related.
The devil is normally in the details. Luckily Ubuntu 15.10 seems to have stack protection enabled by default, so locating the bug was trivial. The issue is
Here is a reference article
The critical
The link explains quite nicely how to do the rewrite. But once you have over 1024 file handles, anything you open will have a handle >= 1024, and thus all operations involving select that use file handles (some only use the timeout argument) anywhere in the system become dangerous.
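The rewrite discussed here can be sketched roughly as follows. This is only an illustration of the general technique, not the code that went into SWI-Prolog: the helper name `wait_readable` and its return convention are assumptions made for this example. The point is that `poll()` receives the descriptor as a plain integer in a `struct pollfd`, so nothing is indexed by the descriptor value, whereas `FD_SET()` on a handle >= `FD_SETSIZE` (commonly 1024) writes outside the `fd_set`.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>

/* Hypothetical helper (not from the SWI-Prolog sources).
 * Waits until fd is readable or timeout_ms elapses (-1 waits forever).
 * Returns 1 if input, EOF or an error condition is pending on fd,
 * 0 on timeout, and -1 on failure with errno set. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    assert(fd >= 0);
    pfd.fd = fd;              /* passed by value: no FD_SETSIZE limit applies */
    pfd.events = POLLIN;
    pfd.revents = 0;

    do {
        /* Note: a retry after EINTR restarts the full timeout. */
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc <= 0)
        return rc;            /* 0: timeout, -1: error */
    if (pfd.revents & POLLNVAL) {
        errno = EBADF;        /* fd is not an open descriptor */
        return -1;
    }
    return 1;
}
```

A caller would use `wait_readable(fd, 250)` where it previously built an `fd_set` for a single descriptor and called `select()` with a 250 ms timeout.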
I can trigger the crash on OSX 10.10.2 even when I set
For the sake of robustness and portability, until the
Either OSX has a ridiculously low limit or there is a different issue. Anyway, replace
Did step one. Might fix your problem, but basically makes it a gamble what works and what not over
TL;DR: This bug is not OSX-specific. It reproduces differently on Debian/Ubuntu because of a different default setting for the number of open files. The solution is to replace all
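Whether a given system can run into this at all depends on how the open-file limit compares with `FD_SETSIZE`. A minimal sketch of that check, assuming a POSIX system; the function name is made up for this example:

```c
#include <sys/resource.h>
#include <sys/select.h>

/* Hypothetical helper: returns 1 if the soft open-file limit allows
 * descriptors >= FD_SETSIZE (so select() with fd_sets can be unsafe),
 * 0 if every descriptor fits in an fd_set, and -1 if the limit cannot
 * be read. */
int fds_may_exceed_fd_setsize(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    /* Descriptors range over 0 .. rlim_cur-1. */
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur > (rlim_t)FD_SETSIZE)
        return 1;
    return 0;
}
```

With the soft limit of 65536 reported for Debian in this thread, and assuming the usual `FD_SETSIZE` of 1024, this would return 1. With a soft limit of 1024 or lower it would return 0.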
poll() conforms to POSIX.1-2001. It is just Windows that is missing it. Note that some calls to select() are only used for their timeout. These need not be changed.
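For illustration, a timeout-only call looks roughly like the sketch below (the name `sleep_ms` is hypothetical). No `fd_set` is passed, so no descriptor number is ever used as an index and the 1024 boundary does not matter.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical helper: sleep for ms milliseconds using select() purely
 * as a timer. Returns 0 when the timeout expired and -1 on error, for
 * example when interrupted by a signal (errno == EINTR). */
int sleep_ms(long ms)
{
    struct timeval tv;

    assert(ms >= 0);
    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;
    /* nfds is 0 and all three fd_set pointers are NULL. */
    return select(0, NULL, NULL, NULL, &tv);
}
```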
@JanWielemaker How do the Windows guys fix these types of issues then?
I think you have to stop using the POSIX emulation by MS and instead either directly use the native Win32 API or roll your own (better) POSIX emulation. Considering the Linux subsystem in Windows 10, things may have improved. Using poll() is anyway the first step. I've done one. It isn't particularly hard, but it is quite a bit of typing.
Thank you Jan for your recent changes to use
$ ulimit -S -n 2560
I either missed one or poll() isn't detected. Is HAVE_POLL defined in config.h? Can you add a stack trace to indicate where it goes wrong?
FYI, the original test now runs fine on Linux (Ubuntu 14.04).
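The reason `HAVE_POLL` matters can be sketched as below. This is an assumed shape for such a compile-time switch, not the actual SWI-Prolog code, and `input_pending` is a hypothetical name. If the macro is not defined, the build silently falls back to `select()`, and the best the fallback can do is refuse descriptors that do not fit in an `fd_set`.

```c
#include <errno.h>
#include <stddef.h>
#ifdef HAVE_POLL              /* assumed to come from config.h */
#include <poll.h>
#else
#include <sys/select.h>
#endif

/* Hypothetical helper: non-blocking check for pending input.
 * Returns 1 if input is pending, 0 if not, -1 on error (errno set). */
int input_pending(int fd)
{
#ifdef HAVE_POLL
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    rc = poll(&pfd, 1, 0);
    if (rc < 0)
        return -1;
    if (pfd.revents & POLLNVAL) {
        errno = EBADF;        /* not an open descriptor */
        return -1;
    }
    return rc > 0 ? 1 : 0;
#else
    fd_set rfds;
    struct timeval tv = {0, 0};
    int rc;

    /* FD_SET() on fd >= FD_SETSIZE would write outside rfds. */
    if (fd < 0 || fd >= FD_SETSIZE) {
        errno = EINVAL;
        return -1;
    }
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    rc = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
#endif
}
```

Compiling the same file with and without `-DHAVE_POLL` is a quick way to see which branch a given build actually uses.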
While trying to reproduce the original issue on Debian 8.1, I now also got a different message:
$ swipl server.pl --port=3030 --interactive
% Started server at http://localhost:3030/
Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 7.3.23-4-g53e67a2)
Copyright (c) 1990-2015 University of Amsterdam, VU Amsterdam
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to redistribute it under certain conditions.
Please visit http://www.swi-prolog.org for details.
For help, use ?- help(Topic). or ?- apropos(Word).
?-
WARNING: Race condition detected. Please report at:
WARNING: https://github.com/SWI-Prolog/swipl-devel/issues
C-stack trace labeled "addNewHTable":
[0] save_backtrace() at :? [0x7fb4e8f2095a]
[1] addNewHTable() at :? [0x7fb4e8f14c11]
[2] getStreamContext() at pl-file.c:? [0x7fb4e8f001d2]
[3] PL_unify_stream() at ??:? [0x7fb4e8f04264]
[4] pl_open_socket() at socket.c:? [0x7fb4e7872b06]
[5] PL_next_solution() at ??:? [0x7fb4e8e7ebcd]
[6] callProlog() at :? [0x7fb4e8eb4449]
[7] start_thread() at pl-thread.c:? [0x7fb4e8ee18e2]
[8] start_thread() at ??:? [0x7fb4e8c2c0a4]
[9] clone() at ??:? [0x7fb4e896104d]
Foreign predicate http_stream:cgi_discard/1 did not clear exception: error(io_error(write,(0x7fb4d80677d0)),context(http_stream:cgi_discard/1,Inappropriate ioctl for device))
For the client, I used:
$ ulimit -n 30000
$ tclsh test_server.tcl
0.
1.
...
28233.
couldn't open socket: cannot assign requested address
One other interesting data point: When I log in to the shell on Debian 8.1, I initially get:
$ ulimit -a
...
open files (-n) 65536
...
yet I can reach barely half that limit with the client:
...
28232.
28233.
couldn't open socket: cannot assign requested address
while executing
"socket localhost 3030"
It is tempting to assume that somehow twice as many files/sockets/whatever are opened, but that seems not to be the case, because I can attain the limit if I lower it (for example, to 100).
Please make sure to try this with higher limits. As far as I recall, Ubuntu uses considerably lower limits that do not trigger any issues. For comparison, with Debian 8.1, I get the following limits by default:
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 3926
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 3926
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
All seems fine on Ubuntu 16.04. Yes, Tcl stops at roughly 28,000 with "cannot assign requested address". If I lower the -n limit on the server too far I get
I'm not sure what is causing this to fail around 28,000. Might be that you cannot have more than
The reported race is probably a bug, but most likely not fatal. Dunno what the inappropriate ioctl
I have recompiled the whole system from the ground up on both Debian and OSX, and the issue seems gone now, so I'm closing this. Thank you very much Jan for this awesome scalability improvement!
Let server.pl consist of:
Start the server with:
Let test_server.tcl consist of:
and run this Tcl script with:
On OS-X 10.10.1, I get the following output from the test script:
$ tclsh test_server.tcl
0.
100.
200.
...
2500.
couldn't open socket: nodename nor servname provided, or not known
and the SWI-Prolog HTTP server crashes with:
On Debian 8.2, the test script test_server.tcl runs seemingly without problems, but as soon as I interrupt the script with Ctrl+C, the SWI-Prolog HTTP server also crashes on Debian, and its output is: