SSL server can assume a state where it no longer answers to client requests #25
Unfortunately, the services we run here at the VU are hard to run that way because our infrastructure requires a proxy in between. This rules out eu.swi-prolog.org and swish.swi-prolog.org. I could use SSL on us.swi-prolog.org. I am not sure how the CDN will respond if one of its backends uses HTTPS and the other HTTP. It is also doubtful whether that works, as 99% of the connections will come from Fastly. Your issue might be caused by a misbehaving client (which of course should not lead to a DoS). Anyway, make sure the system is compiled with this line uncommented in
If it hangs, you can find the blocking thread using gdb, and you can (often) find the mutex and print its details to see whether it is an attempt at a recursive lock or which thread is holding the lock. You can also call PL_backtrace() to dump a Prolog stack trace to stderr for the current thread. If it is a deadlock, these things usually suffice to find out what is happening.
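The exact invocation was lost from this page capture; based on the gdb sessions that appear later in this thread, a typical sequence might look like the sketch below (the thread number is a placeholder, not taken from an actual session):

```
$ gdb -p $(pidof swipl)
(gdb) info threads               # list all threads and where they block
(gdb) thread 7                   # switch to a thread that looks stuck
(gdb) bt                         # if it sits in pthread_mutex_lock, print the
                                 #   mutex argument to inspect its owner
(gdb) call PL_backtrace(25, 1)   # dump a Prolog stack trace to stderr
```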
I have constructed a self-contained test case that shows the problem, or at least an issue that is likely very closely related. First, create a self-signed certificate:

```
openssl genrsa -out server.key 2048
openssl req -new -key server.key -out server.csr
cp server.key server.key.org
openssl rsa -in server.key.org -out server.key
openssl x509 -req -days 365 -in server.csr -signkey server.key -out server.crt
```

The Prolog server that shows the issue is:

```prolog
:- use_module(library(ssl)).

run :-
    ssl_context(server, SSL,
                [ key_file('server.key'),
                  certificate_file('server.crt'),
                  cipher_list('DEFAULT:!DES:!RSA')
                ]),
    tcp_socket(Socket),
    Port = 8371,
    tcp_bind(Socket, localhost:Port),
    format("listening on port: ~w\n", [Port]),
    tcp_listen(Socket, 5),
    tcp_open_socket(Socket, AcceptFd, _),
    repeat,
        tcp_accept(AcceptFd, Client, _PeerName),
        tcp_open_socket(Client, StreamPair),
        stream_pair(StreamPair, PlainRead, PlainWrite),
        catch(ssl_negotiate(SSL, PlainRead, PlainWrite, _Read, _Write), _, true),
        catch(close(StreamPair), _, true),
    false.

:- initialization(run).
```

When run, it listens on port 8371. To hammer the server with repeated connection attempts, we use the following shell script (thrash.sh):

```bash
#!/bin/bash
i=0
port=$1
while true
do
    echo thrashing $i
    sslscan localhost:$port | grep -i acc
    i=$((i+1))
done
```

You run the script via:

```
./thrash.sh 8371
```

(EDIT: corrected port) Depending on the ciphers that are enabled, it will hang after "thrashing 21", "thrashing 45" or a similar (rather small) number of runs. I have also produced a backtrace as suggested, but it does not tell me a lot:

```
[8] socket:tcp_accept/3 [PC=1 in supervisor]
[7] run/0 [PC=79 in clause 1]
[6] $toplevel:residue_vars/2 [PC=5 in clause 2]
[5] $toplevel:$execute_goal2/2 [PC=8 in clause 1]
[3] $toplevel:$query_loop/0 [PC=101 in clause 1]
[2] $toplevel:$runtoplevel/0 [PC=45 in clause 1]
[1] $toplevel:$toplevel/0 [PC=3 in clause 1]
[0] system:$c_call_prolog/0 [PC=0 in top query clause]
```

Please try to reproduce this and let me know if I can supply any additional information. If you send me your SSH public key, I will create an account for you to reproduce this.
Using the server.pl file, calls directly to server/1 fail, but after using the graphical debugger and the trace option I followed the steps of calling server/0. Just going through the navigator on the server.pl file, I see that server/0 and client/0 loop with exception handling, calling ssl.pl with server/1 and server/3. With the limited things I know about SSL, and comparing it to the issue of certificate verification when relying on root certificates, my guess is that the client fails to verify, so the server ends up calling server/0. When you call server/0 after consulting server.pl, it enters recursion. Since it is meant to go to ssl.pl after the ssl_init module is called, it could be that the server basically ends up "talking to itself", which is why it hangs. Since this happens at low thrash values, the negotiation is failing, but with any "root"-only attempts the "root" will be the only one to accept the connection. Another thing could be how ports are being handled. I only see one use of port(Port), and with SSL the required HTTPS port is 443, but most hard-coded ports in the files are 2443, which is UDP (ignoring any 1111 ports, which would obviously need to be changed by an admin). So it looks like the UDP default port of 2443 is getting instantiated in place of 443 in the multi-threaded setup.
Built and ran this on Fedora 24. Stopped it after 300,000 runs. No hangs, no leaks, all just fine. To some extent this is a bit strange to me, as I think the code should have been as below, closing the SSL streams.
@FunctionAnalysis I can't relate your post to this thread. There is only a run/0, no server/0, client/0, etc.
@triska A hanging process can be inspected with gdb as described above. With the source code at hand, that typically suffices to get a clue.
Looking at the latest ssl folder, and downloading the zip available in the official SWIPL repository, there are seven Prolog source files. I was looking through all of them, viewed in the navigator tool. There is a call using the server module, which itself calls a server/0. I will post screen captures to illustrate what I am referring to.
@JanWielemaker: There was a typo in the script invocation (I have now fixed it), please use the following to thrash the server (note the correct server port):

```
./thrash.sh 8371
```

I can reproduce the hang on a fresh Debian 8.5 install, please send me your public key if you want full root access to the system. The commands you posted yield:

```
(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f2d9d6f5700 (LWP 3163) "swipl" 0x00007f2d9cf92cb0 in __accept_nocancel () at ../sysdeps/unix/syscall-template.S:81
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f2d9d6f5700 (LWP 3163))]
#0  0x00007f2d9cf92cb0 in __accept_nocancel () at ../sysdeps/unix/syscall-template.S:81
81      in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0  0x00007f2d9cf92cb0 in __accept_nocancel () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f2d9b9180e6 in nbio_accept (master=<optimized out>, addr=addr@entry=0x7ffd240a4d60, addrlen=addrlen@entry=0x7ffd240a4d5c) at nonblockio.c:2138
#2  0x00007f2d9b916d43 in pl_accept (Master=<optimized out>, Slave=271, Peer=272) at socket.c:480
#3  0x00007f2d9d1df0c5 in PL_next_solution (qid=3, qid@entry=22) at pl-vmi.c:3678
#4  0x00007f2d9d21453f in query_loop (goal=goal@entry=316421, loop=0) at pl-pro.c:115
#5  0x00007f2d9d214cdb in prologToplevel (goal=316421) at pl-pro.c:428
#6  0x00007f2d9d24ead4 in PL_initialise (argc=1, argc@entry=2, argv=<optimized out>, argv@entry=0x7ffd240a5678) at pl-init.c:913
#7  0x0000000000400831 in main (argc=2, argv=0x7ffd240a5678) at pl-main.c:118
```
With the correct params I can at least reproduce your issue. First of all, the closing is not correct, so we need some work to fix that. In the end, the test program is now this:
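The program itself was not preserved in this capture. As a rough sketch of what the fix amounts to (run the accept loop on its own thread so the toplevel stays free, close the negotiated SSL streams on success, and close the plain pair if negotiation fails), one might write something like the following; the predicate names and structure are assumptions, not Jan's actual code:

```prolog
% Hypothetical reconstruction of the revised test server.
:- use_module(library(ssl)).
:- use_module(library(socket)).

run :-
    ssl_context(server, SSL,
                [ key_file('server.key'),
                  certificate_file('server.crt'),
                  cipher_list('DEFAULT:!DES:!RSA')
                ]),
    tcp_socket(Socket),
    tcp_bind(Socket, localhost:8371),
    tcp_listen(Socket, 5),
    tcp_open_socket(Socket, AcceptFd, _),
    thread_create(accept_loop(SSL, AcceptFd), _, [alias(server)]).

accept_loop(SSL, AcceptFd) :-
    repeat,
      tcp_accept(AcceptFd, Client, _Peer),
      tcp_open_socket(Client, StreamPair),
      stream_pair(StreamPair, PlainRead, PlainWrite),
      (   catch(ssl_negotiate(SSL, PlainRead, PlainWrite, SSLRead, SSLWrite),
                _, fail)
      ->  catch(close(SSLRead), _, true),      % close negotiated streams
          catch(close(SSLWrite), _, true)
      ;   catch(close(StreamPair), _, true)    % negotiation failed: close plain pair
      ),
    fail.
```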
The thread is not needed, but it frees the toplevel so we can look around. Using this, thrashing no longer leaks memory and file handles. It runs for quite a while before pausing. The pause is probably caused by sockets that are not closed correctly and are waiting out their lingering period; you can see that by just waiting, as at some point it resumes. As far as I can see, Prolog is closing all sockets nicely, so it might be
OK, I think the test case I constructed is not sufficient to elicit the hang I observe when running the HTTPS server in practice. Yesterday, I observed the actual issue again, after 5 days of continuous uptime. I have run GDB on the hanging process and obtained:

```
(gdb) info threads
  Id   Target Id         Frame
  7    Thread 0x7f1a17f3e700 (LWP 1522) "swipl" 0x00007f1a1bccaddb in __libc_recv (fd=4, buf=buf@entry=0x7f1a1038dc80, n=n@entry=4096, flags=-1, flags@entry=0) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
  6    Thread 0x7f19d3fff700 (LWP 1543) "swipl" 0x00007f1a1bccaddb in __libc_recv (fd=10, buf=buf@entry=0x7f19ec04ca70, n=n@entry=4096, flags=-1, flags@entry=0) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
  5    Thread 0x7f19d27fc700 (LWP 1546) "swipl" 0x00007f1a1bccaddb in __libc_recv (fd=7, buf=buf@entry=0x7f19f004ae10, n=n@entry=4096, flags=-1, flags@entry=0) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
  4    Thread 0x7f19ca7fc700 (LWP 1553) "swipl" 0x00007f1a1bccaddb in __libc_recv (fd=5, buf=buf@entry=0x7f19d800cf40, n=n@entry=4096, flags=-1, flags@entry=0) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
  3    Thread 0x7f19c9ffb700 (LWP 1554) "swipl" 0x00007f1a1bccaccd in accept () at ../sysdeps/unix/syscall-template.S:81
  2    Thread 0x7f19c97fa700 (LWP 1555) "swipl" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
* 1    Thread 0x7f1a1c42c700 (LWP 1520) "swipl" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f1a1c42c700 (LWP 1520))]
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
238     in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S
(gdb) bt
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f1a1bf7607f in dispatch_cond_wait () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#2  0x00007f1a1bf7b643 in get_message () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#3  0x00007f1a1bf7b9c1 in pl_thread_get_message1_va () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#4  0x00007f1a1bf17a92 in PL_next_solution () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#5  0x00007f1a1bf4c3fe in query_loop () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#6  0x00007f1a1bf4cbcb in prologToplevel () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#7  0x00007f1a1bf86914 in PL_initialise () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#8  0x0000000000400831 in main ()
```

I have also tried to obtain a Prolog backtrace with PL_backtrace:

```
(gdb) call PL_backtrace(25,1)

Program received signal SIGPIPE, Broken pipe.
0x00007f1a1bccaa7d in write () at ../sysdeps/unix/syscall-template.S:81
81      ../sysdeps/unix/syscall-template.S: No such file or directory.
The program being debugged was signaled while in a function called from GDB.
GDB remains in the frame where the signal was received.
To change this behavior use "set unwindonsignal on".
Evaluation of the expression containing the function
(PL_backtrace) will be abandoned.
When the function is done executing, GDB will silently stop.
(gdb) bt
#0  0x00007f1a1bccaa7d in write () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f1a1bfa97bf in S__flushbuf () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#2  0x00007f1a1bfab6b9 in Sflush () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#3  0x00007f1a1bfab99d in Svdprintf () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#4  0x00007f1a1bfaba4c in Sdprintf () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#5  0x00007f1a1bf675fe in writeContextFrame () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#6  0x00007f1a1bf676d2 in PL_backtrace () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#7  <signal handler called>
#8  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#9  0x00007f1a1bf7607f in dispatch_cond_wait () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#10 0x00007f1a1bf7b643 in get_message () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#11 0x00007f1a1bf7b9c1 in pl_thread_get_message1_va () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#12 0x00007f1a1bf17a92 in PL_next_solution () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#13 0x00007f1a1bf4c3fe in query_loop () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#14 0x00007f1a1bf4cbcb in prologToplevel () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#15 0x00007f1a1bf86914 in PL_initialise () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#16 0x0000000000400831 in main ()
```

Does this help in any way? Please let me know what else I can try to locate the root cause of this.
This may be perfectly OK as a state. What I do see is only 4 worker threads. That is very low for a real server; I typically use at least 16, depending on the load. Note that if keep-alive is used (the default), each worker will remain waiting for a second for new requests before dropping the line and becoming part of the available queue. If it permanently blocks, the threads hanging in recv are somehow doing something not sensitive to the timeout. That can be legitimate if you modify the timeout and read e.g. posted data. What we do need is a PL_backtrace() that returns a string, as the backtrace is commonly sent to stderr.
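For reference, the worker count is normally chosen when the HTTP server is started; a minimal sketch (the port and count below are arbitrary examples, not taken from this deployment):

```prolog
:- use_module(library(http/thread_httpd)).
:- use_module(library(http/http_dispatch)).

% Start the server with 16 worker threads rather than the default.
start_server :-
    http_server(http_dispatch, [port(8080), workers(16)]).
```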
It is strange: I am running this program via a The salient part of the service file is: [Service] UMask=022 Environment=LANG=en_US.utf8 Restart=on-abort StartLimitInterval=60 LimitNOFILE=1000 DefaultStartLimitBurst=5 WorkingDirectory=... ExecStart=/usr/local/bin/swipl ... \ --no-fork --user=... --https --certfile=... --keyfile=...\ --pidfile=/var/run/proloxy.pid --workers=32 --syslog=proloxy \ --cipherlist=... --keep_alive_timeout=600 The program is my HTTP-proxy Proloxy, which only relays the requests. One interesting piece of information, which may be a coincidence however: I see from the log files that yesterday, in the hours before the issue occurred, there were 4 HEAD requests, one at 5:38, one at 15:01, one at 15:03 and one at 15:19. The issue occurred after that, probably between 19:00 and 20:00 o'clock. Could it be related to the HEAD requests? All other requests yesterday were |
A keep_alive_timeout of 600 will need many more workers, or you must spawn jobs on new threads. Still, there are two mysteries: what happened with the 28 workers, and are the remaining 4 in keep-alive timeout or not? No clue whether HEAD requests are part of the story, but that should be easy to test, no?
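As an illustration of the "spawn jobs on new threads" alternative, handlers can be sent to a thread pool so that long keep-alive waits do not tie up the fixed worker set. A minimal sketch (the pool name and sizes are invented for the example):

```prolog
:- use_module(library(http/thread_httpd)).
:- use_module(library(http/http_dispatch)).
:- use_module(library(thread_pool)).

% A pool of up to 50 threads for request handling.
:- initialization(thread_pool_create(request_pool, 50, [])).

% Requests under / are handled on a pool thread, freeing the HTTP worker.
:- http_handler(root(.), handle_request, [prefix, spawn(request_pool)]).

handle_request(_Request) :-
    format("Content-type: text/plain~n~n"),
    format("hello~n").
```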
What did indeed happen to the other threads? They are all there when I restart the service, and also now, one day later, but they seem to pass away silently somewhere along the way. I will monitor this over the coming days. Is there anything I can do to find the reason why the threads go away?
I have now run hundreds of HEAD requests against the server.
One down. I glanced through the code, but I can't really find anything obvious. Somewhat suspect may be SSL negotiation errors, but I think they are handled correctly. I fear you need some debug statements. You can have a thread monitoring the others and print a message if one disappears, so you have a fairly precise time. In SWISH you find lib/replay.pl, which you can use as a starting point to stress-test your server from log files.
Failed SSL negotiation currently leaks file descriptors, please see #33 and the patch for HTTP. In the example above, though, there are still enough file descriptors. Thank you for the pointers!
I have made some progress with this issue by adding the following definitions to the Proloxy rules that run the site:

```prolog
:- initialization(watch_threads).

watch_threads :-
    thread_create(log_thread_listing, _, [alias(watcher)]).

threads(Ts) :-
    findall(I-D, thread_statistics(I, D), Ts).

log_format(F, Args) :-
    http_log(F, Args).

log_thread_listing :-
    sleep(5),
    threads(Ts),
    length(Ts, L),
    log_format("num_threads(~w).\n", [L]),
    maplist(log_thread, Ts),
    sleep(1800),
    log_thread_listing.

log_thread(I-T) :-
    (   thread_property(I, alias(A))
    ->  log_format("thread(~q, ~q).\n", [A,T.status])
    ;   true
    ).
```

These definitions log the threads' states every 30 minutes. The following has happened with a server that does not use HTTPS, so the underlying cause of this issue may in fact affect both HTTP and HTTPS servers.

When I start the server, all threads are running exactly as expected:

```
server(started, 1486898099).
num_threads(19).
thread(main, running).
thread(watcher, running).
thread('httpd@80_1', running).
thread('httpd@80_2', running).
thread('httpd@80_3', running).
thread('httpd@80_4', running).
thread('httpd@80_5', running).
thread('httpd@80_6', running).
thread('httpd@80_7', running).
thread('httpd@80_8', running).
thread('httpd@80_9', running).
thread('httpd@80_10', running).
thread('httpd@80_11', running).
thread('httpd@80_12', running).
thread('httpd@80_13', running).
thread('httpd@80_14', running).
thread('httpd@80_15', running).
thread('httpd@80_16', running).
thread('http@80', running).
/*Sun Feb 12 12:14:58 2017*/
```

etc. This list stays unchanged for weeks and months. Sometimes, I see error messages like the following in the log files:

```
/*Sun Mar 19 08:02:28 2017*/
request(1939, 1489906948.327, [peer(ip(181,26,188,206)),method(get),request_uri(/),path(/),http_version(1-0)]).
completed(1939, 0.007126103000018702, 0, 500, error("bytes('text/html', ...): remote hangup after 28,672 bytes")).
```

This is completely OK of course, and normally does not change the state of the threads that are reported in the log file. However, sometimes things seem to go wrong with such requests. Here is a concrete example. First, I get:

```
/*Sun Mar 19 09:37:41 2017*/
request(1941, 1489912661.103, [peer(ip(49,245,126,195)),method(get),request_uri(/),path(/),http_version(1-0)]).
completed(1941, 0.0073578000000225074, 0, 500, error("bytes('text/html', ...): remote hangup after 28,672 bytes")).
```

This looks innocent enough, just like in the earlier case. However, from then on, I get:

```
thread(main, running).
thread(watcher, running).
thread('httpd@80_1', running).
thread('httpd@80_2', running).
thread('httpd@80_3', running).
thread('httpd@80_4', running).
thread('httpd@80_5', running).
thread('httpd@80_6', running).
thread('httpd@80_7', running).
thread('httpd@80_8', running).
thread('httpd@80_9', running).
thread('httpd@80_10', running).
thread('httpd@80_11', running).
thread('httpd@80_12', running).
thread('httpd@80_13', running).
thread('httpd@80_14', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_15', running).
thread('httpd@80_16', running).
thread('http@80', running).
```

From this, it seems that one of the threads has a severe problem from now on, because the thread stays in this state, while the others still run as normal. Then, one day later (in this concrete case), I get two more requests:

```
/*Mon Mar 20 07:56:33 2017*/
request(1967, 1489992993.684, [peer(ip(189,129,142,140)),method(get),request_uri(/),path(/),http_version(1-0)]).
completed(1967, 0.008738023000006478, 0, 500, error("bytes('text/html', ...): remote hangup after 28,672 bytes")).
/*Mon Mar 20 08:02:50 2017*/
request(1968, 1489993370.896, [peer(ip(187,192,214,75)),method(get),request_uri(/),path(/),http_version(1-0)]).
completed(1968, 0.008106599000001324, 0, 500, error("bytes('text/html', ...): remote hangup after 29,135 bytes")).
```

From then on, two more threads exhibit the same problem:

```
num_threads(19).
thread(main, running).
thread(watcher, running).
thread('httpd@80_1', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_2', running).
thread('httpd@80_3', running).
thread('httpd@80_4', running).
thread('httpd@80_5', running).
thread('httpd@80_6', running).
thread('httpd@80_7', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_8', running).
thread('httpd@80_9', running).
thread('httpd@80_10', running).
thread('httpd@80_11', running).
thread('httpd@80_12', running).
thread('httpd@80_13', running).
thread('httpd@80_14', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_15', running).
thread('httpd@80_16', running).
thread('http@80', running).
```

This goes on for a while, until eventually (i.e., after several more requests like the above), we only get:

```
num_threads(19).
thread(main, running).
thread(watcher, running).
thread('httpd@80_1', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_2', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_3', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_4', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_5', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_6', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_7', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_8', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_9', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_10', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_11', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_12', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_13', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_14', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_15', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_16', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('http@80', running).
```

Thus, all worker threads have turned from running into this exception state. This is with SWI-Prolog ~7.5.0 (1a0ff1b33683571b81669a11f4683e82994b8112), from the end of January. Since then, a few patches have been applied to SWI-Prolog that may even solve this issue (098e6be854104a843be495911b699460b3fa57ec, maybe?), and I will of course restart the server with the latest SWI version to see if this reproduces.
Until then, I would be grateful for any additional feedback and hints.
Ah. This is progress! I think this implies that it is possible for catch/3 to miss an exception. I have a small hunch this may be related to 1a3c397bf0258138b85c061e97ae828bb455d742, which deals with an exception (e.g., timeout, abort) arriving inside the Prolog hook that diagnoses another exception. I.e., the writer gets an exception due to the I/O error, and while inside the exception hook it gets a timeout exception. Similar stuff seems to be going on in SWISH, which processes a lot more requests since it connects its users using websockets. If you look at http://swish.swi-prolog.org/example/stats.swinb, section "Number of threads and visitors", you see that the number of threads gets higher and higher when compared to the number of Pengines. As most of SWISH's requests are handled in forked threads rather than the main HTTP worker threads, it doesn't affect SWISH that much (a thread in terminated state is just a little structure that remembers the exit state). Besides, SWISH typically doesn't live long enough :( Now I need to think of a setup to replicate this. Possibly setting up an HTTP server with a short timeout and using a Prolog client that deliberately hangs the connection close to the timeout.
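A sketch of such a replication setup might look as follows; this is only an illustration of the idea (a short server timeout, plus a client that stalls until roughly that timeout and then drops the line), not the test that was actually run:

```prolog
:- use_module(library(http/thread_httpd)).
:- use_module(library(http/http_dispatch)).
:- use_module(library(socket)).

% Server with a deliberately short request timeout and few workers.
start_server :-
    http_server(http_dispatch, [port(8080), workers(4), timeout(2)]).

% Client that sends a partial request, stalls until roughly the server
% timeout, and then drops the connection without reading a reply.
stalling_client :-
    tcp_connect(localhost:8080, Stream, []),
    format(Stream, "GET / HTTP/1.1\r\nHost: localhost\r\n", []),
    flush_output(Stream),
    sleep(2),
    close(Stream, [force(true)]).
```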
Was trying to reproduce this. Getting the I/O error due to the client dropping the connection and the timeout at the same time does indeed kill the HTTP worker, but this is correctly picked up. I missed something before: your exception is a format/3 call to user_error, i.e., the 'Broken pipe' happens while writing a message to stderr.
I am using Proloxy, started using the following systemd unit file:

```
[Unit]
Description=Proloxy

[Service]
UMask=022
Environment=LANG=en_US.utf8
Restart=on-abort
StartLimitInterval=60
LimitNOFILE=1000
DefaultStartLimitBurst=5
WorkingDirectory=/home/web/proloxy
ExecStart=/usr/local/bin/swipl rules.pl proloxy.pl \
    --no-fork --port=80 --user=web \
    --pidfile=/var/run/proloxy.pid --workers=16 --syslog=proloxy

[Install]
WantedBy=multi-user.target
```

In the concrete case I described earlier, my Proloxy rules file contains:

```prolog
:- use_module(library(http/http_dispatch)).
:- use_module(library(http/http_log)).
:- use_module(library(http/http_files)).

% allow reading .well-known for Let's Encrypt
request_prefix_target(Request, '', _) :-
        memberchk(request_uri(URI), Request),
        atom_concat('/.well-known/', _, URI),
        http_reply_from_files('/home/web/html', [], Request).

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
   Redirect everything to https://...
   - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

request_prefix_target(Request, _, _) :-
        memberchk(request_uri(URI), Request),
        memberchk(host(Host), Request),
        atomic_list_concat(['https://',Host,URI], Target),
        http_redirect(moved, Target, Request).
```

This only redirects from HTTP to HTTPS.
If we look at https://bugzilla.redhat.com/show_bug.cgi?id=754938, we see that systemd indeed captures the output of a service using a socket. I think the question splits into two:
Here is some more support for my suspicion: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=771122 So, my working hypothesis is this:
This fits with the observation that all works well for a long time. At some point you do an
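The end of the sentence above was lost from this capture. The trigger suspected in the following comments is a restart (or HUP) of the journal/syslog daemon while the service keeps writing to the, by then dead, socket; on a systemd system that would amount to something like:

```
# Suspected trigger (see the manual reproduction further below):
# restart the journal daemon while the SWI-Prolog service is running,
# then have the service write to user_error.
sudo systemctl restart systemd-journald
```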
This may definitely be one (or even the) way to trigger this problem. Only for completeness, I would like to mention that I have not run anything like that myself. In fact, I think this may also happen due to syslogd; the logs contain entries like "[...] rsyslogd was HUPed". In such cases, is there a way to reconnect to the journal daemon? I mean, if the journal daemon restarts, SWI-Prolog should establish a connection to the new instance if possible.
One additional data point:

```
$ systemctl status rsyslog.service
● rsyslog.service - System Logging Service
   Loaded: loaded (/lib/systemd/system/rsyslog.service; enabled)
   Active: active (running) since Mon 2016-11-07 14:58:44 CET; 4 months 27 days ago
     Docs: man:rsyslogd(8)
           http://www.rsyslog.com/doc/
 Main PID: 714 (rsyslogd)
   CGroup: /system.slice/rsyslog.service
           └─714 /usr/sbin/rsyslogd -n
```

According to this, the rsyslog daemon was in fact running throughout the entire time the HTTP server was running.
It is hard to figure out what exactly happened. I'm not 100% convinced this is why things go wrong. From the Prolog messages this seems very likely. We need to find the exact relation between the service and systemd, and/or a way to reproduce the result, so that it is plausible or, better, provable that this is what happened. Possibly this can be found in other reports, as we don't seem to be the only victim. What is the OS version? Is that Debian Jessie? And no, there is no way to restore the connection as I understand it. The service has two options: ignore the errors from its output streams, or terminate and rely on the systemd restart policy.
It is not syslog that is causing the problem; it is the journal service of systemd that catches the output of the services. See e.g. https://bugs.freedesktop.org/show_bug.cgi?id=84923. If I query that on my desktop (Ubuntu 16.10), I get:
Note the last line. I don't really know what this means. I did see bug reports that relate a closed
Not very elegant, but possibly the most sensible solution is to add an option to streams to ignore write errors and enable that in the
When I query:

```
$ systemctl status systemd-journal.service
```

then I get:

```
● systemd-journal.service
   Loaded: not-found (Reason: No such file or directory)
   Active: inactive (dead)
```

This could suggest that this service may not be the reason for this problem in this particular instance. Please note that I have since restarted the HTTP server, and all threads are running exactly as expected. This system is:

```
$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 8.7 (jessie)
Release:        8.7
Codename:       jessie
```

The suggested option for streams indeed seems very inelegant to me. Can you use catch/3 for this instead?
I don't know what that result means. You can put a catch around the print_message/2, but in this case not inside your handler, as it is printing an error that results from the final communication, which is not executed under your handler. That is also why the thread dies: only the user part of the handler is guarded. I don't think we want to place a catch around every print_message/2 call :( So I think we have two options: as above, ignore I/O errors on user_error, or terminate and rely on the systemd restart policy. I am tempted to implement suicide in
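As a rough application-level illustration of the first option — this is only a sketch, not the mechanism that was eventually implemented in the libraries — one could guard message printing against I/O errors on user_error:

```prolog
% Sketch: print a message, but swallow I/O errors on user_error
% (e.g. the "Broken pipe" seen after the journal daemon went away),
% so that the calling thread survives.
safe_print_message(Kind, Message) :-
    catch(print_message(Kind, Message),
          error(io_error(write, user_error), _),
          true).
```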
```
● systemd-journald.service - Journal Service
   Loaded: loaded (/lib/systemd/system/systemd-journald.service; static)
   Active: active (running) since Mon 2017-03-20 04:42:43 CET; 2 weeks 6 days ago
     Docs: man:systemd-journald.service(8)
           man:journald.conf(5)
 Main PID: 8706 (systemd-journal)
   Status: "Processing requests..."
   CGroup: /system.slice/systemd-journald.service
           └─8706 /lib/systemd/systemd-journald

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
```

Let us call this the affected system. For comparison, I have a different machine that is in almost all respects identical, and the hang has not occurred there:

```
● systemd-journald.service - Journal Service
   Loaded: loaded (/lib/systemd/system/systemd-journald.service; static)
   Active: active (running) since Tue 2016-10-25 13:07:50 CEST; 5 months 14 days ago
     Docs: man:systemd-journald.service(8)
           man:journald.conf(5)
 Main PID: 12468 (systemd-journal)
   Status: "Processing requests..."
   CGroup: /system.slice/systemd-journald.service
           └─12468 /lib/systemd/systemd-journald

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
```

Note that both systems display the warning about the rotated journal, but only on

When I manually restart the journal service, I can reproduce the issue with the following Tcl script:

```tcl
#!/usr/bin/tclsh
set s [socket localhost 80]
puts $s "GET /\n"
flush $s
close $s
```

It simply sends a request and closes the channel immediately. I see that the latest git version contains a fix for this, thank you very much! I do not understand, though, why this happens the way it does. I have simply kept both systems running with as little interference as possible. This runs for weeks without problems. On Sunday, March 19 (4:40), the HTTPS server was restarted by the cron job. This is important due to the way the whole setup works.
That seems to be the conclusion. Together write the
In my view, this is not yet fully solved: instead of a "starved" server, users will now get a "terminated" server when using the sample service file that ships with SWI-Prolog itself. To reproduce the problem with the latest git version, please perform the following steps:
Expectation: the server halts and restarts. To solve this problem, please consider changing the Restart= setting in the sample service file.
Leaving it stopped is indeed not good. As you have it all set up, could you propose a fix?
In my setup, I have solved this by changing the value of the Restart= setting.
This is the setting that I have now also published for Proloxy and LetSWICrypt. Fedora has some guidelines, or at least a discussion of potential guidelines, and importantly a table that explains the values. Quoting Lennart: "I am fine with either on-abort or on-failure. I think on-failure is the nicer choice for daemons that are written somewhat cleanly, but on-abort is the safer choice." Please take a look and apply it as it best suits the overall infrastructure. Thank you!
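For illustration only — the exact value published is not shown in this capture, and Restart=on-failure below merely follows the preference quoted above — such a change amounts to a one-line edit in the unit's [Service] section:

```
[Service]
# Illustrative value; the actual published setting is not preserved here.
Restart=on-failure
```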
Thanks for sorting this out. I've decided to adjust the example to use
Every few days (sometimes weeks) of continuous uptime, the HTTPS server I am running assumes a state where it still accepts connections, but does not complete an SSL handshake.

In that state, when I try to connect to the server with openssl, I get the initial client output followed by nothing more, i.e., nothing further appears and the handshake seems to hang. When the server is still working correctly, the output continues beyond that point.
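The exact openssl invocation is not preserved in this capture; the usual way to exercise the handshake by hand is along these lines (hostname and port are placeholders):

```
$ openssl s_client -connect example.org:443
```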
Please let me know what I can do to trace this down. I can use gdb to trace locked threads, and I will try this the next time it hangs. So far, I could not replicate this issue with any tests. I have run several benchmarks on the server, using the Apache bench tool ab, and it seems to behave as expected in all scenarios I have tried.

When the hang occurs, there are still enough available resources on the server to easily make further connections. I can also use all other ports, and can SSH into the server without any problems. Only the HTTPS connections no longer work.

To collect experience with such scenarios, please consider moving the SWI website or other frequently used sites to HTTPS, using native SWI-Prolog to make the connection without intermediary tools. I would also greatly appreciate feedback from other users who use the SSL package, and advice from networking/openssl experts who are interested in helping with this. @chigley, I would greatly appreciate your input in particular, at your convenience.