
SSL server can assume a state where it no longer answers to client requests #25

Closed
triska opened this issue Aug 21, 2016 · 35 comments

@triska
Member

triska commented Aug 21, 2016

Every few days (sometimes weeks) of continuous uptime, the HTTPS server I am running enters a state where it still accepts connections but no longer completes the SSL handshake.

In that state, when I try to connect to the server with openssl, I get:

$ openssl s_client -host $HOST -port 443
CONNECTED(00000003)

followed by nothing more: no further output appears, and the handshake seems to hang. When the server is still working correctly, the output continues with:

depth=2 C = US, O = "thawte, Inc.", OU = Certification Services Division, OU = "(c) 2008 thawte, Inc. - For authorized use only", CN = thawte Primary Root CA - G3
verify return:1
depth=1 C = US, O = "thawte, Inc.", OU = Domain Validated SSL, CN = thawte DV SSL SHA256 CA
verify return:1
...

Please let me know what I can do to trace this down. I can use gdb to trace locked threads with:

gdb swipl 
(gdb) info threads
 ...
(gdb) thread <N>
(gdb) bt

I will try this the next time it hangs. So far, I could not replicate this issue with any tests. I have run several benchmarks on the server using the Apache benchmarking tool ab, and it behaves as expected in all scenarios I have tried. When the hang occurs, there are still enough resources available on the server to easily make further connections. I can also use all other ports and can SSH into the server without any problems; only the HTTPS connections no longer work.

To collect experience with such scenarios, please consider moving the SWI website or other frequently used sites to HTTPS, using native SWI-Prolog to make the connection without intermediary tools. I would also greatly appreciate feedback from other users who use the SSL package, and advice from networking/openssl experts who are interested in helping with this. @chigley, I would greatly appreciate your input in particular, at your convenience.

@JanWielemaker
Member

Unfortunately, the services we run here at the VU are hard to run that way because our infrastructure requires a proxy in between. This rules out eu.swi-prolog.org and swish.swi-prolog.org. I could use SSL on us.swi-prolog.org. I am not sure how the CDN will respond if one of its backends uses HTTPS and the other HTTP. It is also doubtful whether that would help, as 99% of the connections will come from Fastly. Your issue might be caused by a misbehaving client (which of course should not lead to a DoS).

Anyway, make sure the system is compiled with this line uncommented in the build script:

export COFLAGS="-O2 -gdwarf-2 -g3"

If it hangs, you can find the blocking thread using gdb, and you can (often) find the mutex and print its details to see whether it is an attempt at a recursive lock, or which thread is holding the lock. You can also use

(gdb) call PL_backtrace(25,1)

to dump a Prolog stack trace to stderr for the current thread. If it is a deadlock, these things usually suffice to find out what is happening.
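A Prolog-level alternative, for when gdb is not at hand, is sketched below (an illustration; dump_backtraces/0 is a made-up helper and it only works while the process still responds to thread signals): it asks every other running thread to print its own backtrace to stderr using backtrace/1 from library(prolog_stack).

:- use_module(library(prolog_stack)).

% Ask every running thread (except the calling one) to print a backtrace
% of up to 25 frames to user_error.
dump_backtraces :-
    thread_self(Me),
    forall(( thread_property(Id, status(running)), Id \== Me ),
           catch(thread_signal(Id, backtrace(25)), _, true)).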

@triska
Member Author

triska commented Sep 19, 2016

I have constructed a self-contained test case that shows the problem, or at least an issue that is likely very closely related.

First, please create server.key and server.crt (alternatively, you can use existing files instead):

openssl genrsa -out server.key 2048
openssl req -new -key server.key -out server.csr
cp server.key server.key.org
openssl rsa -in server.key.org -out server.key
openssl x509 -req -days 365 -in server.csr -signkey server.key -out server.crt

The Prolog server that shows the issue is server.pl:

:- use_module(library(ssl)).
run :-
        ssl_context(server, SSL, [key_file('server.key'),
                                  certificate_file('server.crt'),
                                  cipher_list('DEFAULT:!DES:!RSA')
                                  ]),
        tcp_socket(Socket),
        Port = 8371,
        tcp_bind(Socket, localhost:Port),
        format("listening on port: ~w\n", [Port]),
        tcp_listen(Socket, 5),
        tcp_open_socket(Socket, AcceptFd, _),
        repeat,
            tcp_accept(AcceptFd, Client, _PeerName),
            tcp_open_socket(Client, StreamPair),
            stream_pair(StreamPair, PlainRead, PlainWrite),
            catch(ssl_negotiate(SSL, PlainRead, PlainWrite, _Read, _Write),
                  _,
                  true),
            catch(close(StreamPair), _, true),
            false.
:- initialization(run).

Run it with:

$ swipl server.pl

It will listen on port 8371; you can easily change this in the source.

To hammer the server with repeated connection attempts, we use the sslscan tool, which you can install easily via apt-get. A single run is not enough, so we use the script thrash.sh to thrash the server:

#!/bin/bash
i=0
port=$1
while true
do
    echo thrashing $i
    sslscan localhost:$port | grep -i acc
    i=$((i+1))
done

You run the script via:

$ ./thrash.sh 8371

(EDIT: corrected port)

Depending on the ciphers that are enabled, it will hang after "thrashing 21", "thrashing 45" or a similar (rather small) number of runs.

I have also produced a backtrace as suggested, but it does not tell me a lot:

  [8] socket:tcp_accept/3 [PC=1 in supervisor]
  [7] run/0 [PC=79 in clause 1]
  [6] $toplevel:residue_vars/2 [PC=5 in clause 2]
  [5] $toplevel:$execute_goal2/2 [PC=8 in clause 1]
  [3] $toplevel:$query_loop/0 [PC=101 in clause 1]
  [2] $toplevel:$runtoplevel/0 [PC=45 in clause 1]
  [1] $toplevel:$toplevel/0 [PC=3 in clause 1]
  [0] system:$c_call_prolog/0 [PC=0 in top query clause]

Please try to reproduce this and let me know if I can supply any additional information. If you send me your SSH public key, I will create an account for you so you can reproduce this.

@FunctionAnalysis

Using the server.pl file, calls directly to server/1 fail, but after using the graphical debugger and trace option I followed the steps of calling server/0.

Just going through the navigator of the server.pl file, I see that server/0 and client/0 loop with exception handling calling the ssl.pl with server/1 and server/3.

With what limited things I know about ssl, and comparing it to the issue of certificate verification when relying on root certificates, my guess is that the client fails to verify, so the server ends up calling server/0.

When you call server/0 after consulting server.pl it enters recursion. Since it is meant to go to ssl.pl after the ssl_init module is called, it could be that the server basically ends up "talking to itself" which is why it hangs.

Since this happens at low thrash values, the negotiation is failing, but with any "root" only attempts the "root" will be the only one to accept the connection.

Another thing could be how ports are being handled. I only see one use of a port(Port) and with SSL the required https port is 443, but most hard-coded ports in the files are 2443 which is UDP (ignoring any 1111 ports which would need to obviously be switched by an admin).

So it looks like the UDP default port of 2443 is getting instantiated in place of a multi-threading with 443.

@JanWielemaker
Member

Built and ran this on Fedora 24. Stopped it after 300,000 runs. No hangs, no leaks, all just fine. To some extent this is a bit strange to me, as I think the code should have been as below, closing the SSL streams
rather than the raw streams. I added a print_message call just to be sure; no errors are printed. This
code runs just fine as well.

            stream_pair(StreamPair, PlainRead, PlainWrite),
            catch(ssl_negotiate(SSL, PlainRead, PlainWrite, Read, Write),
                  Essl,
                  print_message(error, Essl)),
            catch(call_cleanup(close(Write), close(Read)), E,
                  print_message(error, E)),
            false.

@JanWielemaker
Member

@FunctionAnalysis I can't relate your post to this thread. There is only a run/0, no server/0, client/0, etc.

@JanWielemaker
Member

@triska A hanging tcp_accept/3 is a little weird. There is either some issue in the Prolog layer around sockets, or there is some system reason to block. As you've got it in gdb, please run the gdb bt command against the hanging thread. That should give the C stack, from which we can deduce whether it is hanging on a Prolog lock, the accept() system call, or something else. With some luck you can figure out what is going on. GDB bluffer's guide:

  • info threads lists all threads
  • thread N switches to a thread
  • bt prints a backtrace
  • up and down go a frame up or down
  • p <expression> prints the value of an expression

With the source code at hand, that typically suffices to get a clue.

@FunctionAnalysis

Looking at the latest ssl folder, and downloading the zip available in the official SWI-Prolog repository, there are seven Prolog source files. I was looking through all of them in the navigator tool.

There is a call using the server module which itself calls a server/0.

I will post screen captures to illustrate what I am referring to.


@FunctionAnalysis

In the interest of my time, here is a screen capture of the main ssl package on the official repo:
[screenshot: ssl package source files]

@triska
Member Author

triska commented Sep 20, 2016

@JanWielemaker: There was a typo in the script invocation (I have now fixed it); please use the following to thrash the server (note the correct server port):

./thrash.sh 8371

I can reproduce the hang on a fresh Debian 8.5 install, please send me your public key if you want full root access to the system.

The commands you posted yield:

(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f2d9d6f5700 (LWP 3163) "swipl" 0x00007f2d9cf92cb0 in __accept_nocancel () at ../sysdeps/unix/syscall-template.S:81
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f2d9d6f5700 (LWP 3163))]
#0  0x00007f2d9cf92cb0 in __accept_nocancel ()
    at ../sysdeps/unix/syscall-template.S:81
81  in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0  0x00007f2d9cf92cb0 in __accept_nocancel ()
    at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f2d9b9180e6 in nbio_accept (master=<optimized out>,
    addr=addr@entry=0x7ffd240a4d60, addrlen=addrlen@entry=0x7ffd240a4d5c)
    at nonblockio.c:2138
#2  0x00007f2d9b916d43 in pl_accept (Master=<optimized out>, Slave=271,
    Peer=272) at socket.c:480
#3  0x00007f2d9d1df0c5 in PL_next_solution (qid=3, qid@entry=22)
    at pl-vmi.c:3678
#4  0x00007f2d9d21453f in query_loop (goal=goal@entry=316421, loop=0)
    at pl-pro.c:115
#5  0x00007f2d9d214cdb in prologToplevel (goal=316421) at pl-pro.c:428
#6  0x00007f2d9d24ead4 in PL_initialise (argc=1, argc@entry=2,
    argv=<optimized out>, argv@entry=0x7ffd240a5678) at pl-init.c:913
#7  0x0000000000400831 in main (argc=2, argv=0x7ffd240a5678) at pl-main.c:118

@JanWielemaker
Member

With the correct params I can at least reproduce your issue. First of all, the closing is not correct, so we need some work to fix that. Eventually, the test program is now this:

:- use_module(library(ssl)).

run :-
    ssl_context(server, SSL,
                [ key_file('server.key'),
                  certificate_file('server.crt'),
                  cipher_list('DEFAULT:!DES:!RSA'),
                  close_parent(true)
                ]),
    tcp_socket(Socket),
    tcp_setopt(Socket, reuseaddr),
    Port = 8371,
    tcp_bind(Socket, localhost:Port),
    current_prolog_flag(pid, PID),
    format("Process ~d: listening on port: ~w\n", [PID, Port]),
    tcp_listen(Socket, 5),
    tcp_open_socket(Socket, AcceptFd, _),
    between(1, infinite, Run),
        tcp_accept(AcceptFd, Client, _PeerName),
        tcp_open_socket(Client, StreamPair),
        stream_pair(StreamPair, PlainRead, PlainWrite),
        (   catch(ssl_negotiate(SSL, PlainRead, PlainWrite, Read, Write),
                  Essl,
                  print_message(error, Essl))
        ->  (   var(Essl)
            ->  format('Connected ~D~n', [Run]),
                catch(call_cleanup(close(Write), close(Read)), E,
                      print_message(error, E))
            ;   format('Failed ~D~n', [Run]),
                catch(close(StreamPair), Ec,
                      print_message(error, Ec))
            )
        ;   format('Negotiate failed ~D~n', [Run])
        ),
        false.

:- initialization
    thread_create(run, _, []).

lsfd :-
    current_prolog_flag(pid, PID),
    format(string(Cmd), "ls -l /proc/~w/fd", [PID]),
    shell(Cmd).

The thread is not needed, but it frees the toplevel so we can look around. Using this, thrashing no longer leaks memory and file handles. It runs for quite a while before pausing. If you wait for ./thrash.sh to stall, you can indeed find that Prolog is trapped in accept(). If you stop ./thrash.sh and just call sslscan without grep, you'll find that it gets an error in bind(). That is before it tries to contact the Prolog server, and that is why the Prolog server gets no connection attempts and hangs in accept(). Nothing wrong there.

The issue is probably caused by sockets that are not closed correctly and wait out their lingering period. You can see that by just waiting: at some point it resumes. As far as I can see, Prolog is closing all sockets nicely, so it might be sslscan that doesn't do so. You may try the thrash loop against another server.

@triska
Member Author

triska commented Oct 26, 2016

OK, I think the test case I constructed is not sufficient to elicit the hang I observe when running the HTTPS server in practice.

Yesterday, I observed the actual issue again, after 5 days of continuous uptime. I have run GDB on the hanging process and obtained:

(gdb) info threads
  Id   Target Id         Frame
  7    Thread 0x7f1a17f3e700 (LWP 1522) "swipl" 0x00007f1a1bccaddb in __libc_recv (fd=4,
    buf=buf@entry=0x7f1a1038dc80, n=n@entry=4096, flags=-1, flags@entry=0)
    at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
  6    Thread 0x7f19d3fff700 (LWP 1543) "swipl" 0x00007f1a1bccaddb in __libc_recv (fd=10,
    buf=buf@entry=0x7f19ec04ca70, n=n@entry=4096, flags=-1, flags@entry=0)
    at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
  5    Thread 0x7f19d27fc700 (LWP 1546) "swipl" 0x00007f1a1bccaddb in __libc_recv (fd=7,
    buf=buf@entry=0x7f19f004ae10, n=n@entry=4096, flags=-1, flags@entry=0)
    at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
  4    Thread 0x7f19ca7fc700 (LWP 1553) "swipl" 0x00007f1a1bccaddb in __libc_recv (fd=5,
    buf=buf@entry=0x7f19d800cf40, n=n@entry=4096, flags=-1, flags@entry=0)
    at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33
  3    Thread 0x7f19c9ffb700 (LWP 1554) "swipl" 0x00007f1a1bccaccd in accept ()
    at ../sysdeps/unix/syscall-template.S:81
  2    Thread 0x7f19c97fa700 (LWP 1555) "swipl" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
* 1    Thread 0x7f1a1c42c700 (LWP 1520) "swipl" pthread_cond_timedwait@@GLIBC_2.3.2 ()
    at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f1a1c42c700 (LWP 1520))]
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
238 in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S
(gdb) bt
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x00007f1a1bf7607f in dispatch_cond_wait () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#2  0x00007f1a1bf7b643 in get_message () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#3  0x00007f1a1bf7b9c1 in pl_thread_get_message1_va ()
   from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#4  0x00007f1a1bf17a92 in PL_next_solution () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#5  0x00007f1a1bf4c3fe in query_loop () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#6  0x00007f1a1bf4cbcb in prologToplevel () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#7  0x00007f1a1bf86914 in PL_initialise () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#8  0x0000000000400831 in main ()

I have also tried to obtain a backtrace with PL_backtrace, but I think this caused the program to terminate:

(gdb) call PL_backtrace(25,1)
Program received signal SIGPIPE, Broken pipe.
0x00007f1a1bccaa7d in write () at ../sysdeps/unix/syscall-template.S:81
81  ../sysdeps/unix/syscall-template.S: No such file or directory.
The program being debugged was signaled while in a function called from GDB.
GDB remains in the frame where the signal was received.
To change this behavior use "set unwindonsignal on".
Evaluation of the expression containing the function
(PL_backtrace) will be abandoned.
When the function is done executing, GDB will silently stop.
(gdb) bt
#0  0x00007f1a1bccaa7d in write () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f1a1bfa97bf in S__flushbuf () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#2  0x00007f1a1bfab6b9 in Sflush () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#3  0x00007f1a1bfab99d in Svdprintf () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#4  0x00007f1a1bfaba4c in Sdprintf () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#5  0x00007f1a1bf675fe in writeContextFrame () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#6  0x00007f1a1bf676d2 in PL_backtrace () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#7  <signal handler called>
#8  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#9  0x00007f1a1bf7607f in dispatch_cond_wait () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#10 0x00007f1a1bf7b643 in get_message () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#11 0x00007f1a1bf7b9c1 in pl_thread_get_message1_va ()
   from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#12 0x00007f1a1bf17a92 in PL_next_solution () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#13 0x00007f1a1bf4c3fe in query_loop () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#14 0x00007f1a1bf4cbcb in prologToplevel () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#15 0x00007f1a1bf86914 in PL_initialise () from /usr/local/lib/swipl-7.3.25/lib/x86_64-linux/libswipl.so.7.3
#16 0x0000000000400831 in main ()

Does this help in any way? Please let me know what else I can try to locate the root cause of this.

@JanWielemaker
Member

This may be perfectly OK as a state. What I do see is only 4 worker threads. That is very low for a real server; I typically use at least 16, depending on the load. Note that if keep-alive is used (the default), each worker will remain waiting for a second for new requests before dropping the line and becoming part of the pool of available workers.

If it blocks permanently, the threads hanging in recv() are somehow doing something that is not sensitive to the timeout. That can be legitimate if you modify the timeout and read, e.g., posted data.

What we do need is a PL_backtrace() that returns a string, as the backtrace is commonly sent to
an unknown stream :( Anyway, the above may give you a clue. Are we dealing with POST? How is the data read?

@triska
Member Author

triska commented Oct 26, 2016

It is strange: I am running this program via a systemd service file, using --workers=32!

The salient part of the service file is:

[Service]
UMask=022
Environment=LANG=en_US.utf8
Restart=on-abort
StartLimitInterval=60
LimitNOFILE=1000
DefaultStartLimitBurst=5
WorkingDirectory=...
ExecStart=/usr/local/bin/swipl ... \
   --no-fork --user=... --https --certfile=...  --keyfile=...\
   --pidfile=/var/run/proloxy.pid --workers=32 --syslog=proloxy \
   --cipherlist=... --keep_alive_timeout=600

The program is my HTTP-proxy Proloxy, which only relays the requests.

One interesting piece of information, which may, however, be a coincidence: I see from the log files that yesterday, in the hours before the issue occurred, there were 4 HEAD requests: one at 5:38, one at 15:01, one at 15:03 and one at 15:19. The issue occurred after that, probably between 19:00 and 20:00. Could it be related to the HEAD requests?

All other requests yesterday were GET requests. On the days before yesterday, I also had several HEAD requests, and a few POST requests that are related to Pengines. The vast majority of all requests were always GET.

@JanWielemaker
Member

A keep_alive_timeout of 600 will need many more workers, or the server needs to spawn jobs on new threads. Still, there are two mysteries: what happened to the 28 workers, and are the remaining 4 in keep-alive timeout or not? No clue whether HEAD requests are part of the story, but that should be easy to test, no?
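As an illustration of the second alternative (a sketch with made-up handler and pool names, not Proloxy's actual configuration): handlers registered with the spawn option of http_handler/3 run each request on a pooled thread, so a small, fixed set of accept/dispatch workers can tolerate long keep-alive timeouts.

:- use_module(library(http/thread_httpd)).
:- use_module(library(http/http_dispatch)).
:- use_module(library(thread_pool)).

% Pool of threads that actually handle requests; the HTTP workers only
% accept connections and dispatch.
:- initialization(thread_pool_create(handler_pool, 50, [backlog(20)])).

:- http_handler(root(.), say_hello, [prefix, spawn(handler_pool)]).

say_hello(_Request) :-
    format("Content-type: text/plain~n~n"),
    format("hello~n").

start(Port) :-
    http_server(http_dispatch,
                [ port(Port),
                  workers(4),
                  keep_alive_timeout(600)
                ]).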

@triska
Member Author

triska commented Oct 26, 2016

What did indeed happen to the other threads? They are all there when I restart the service, and also now, one day later, but they seem to pass away silently somewhere along the way. I will monitor this over the coming days. Is there anything I can do to find the reason why the threads go away?

@triska
Member Author

triska commented Oct 26, 2016

I have now run hundreds of HEAD requests against the server, and they are all handled correctly. All threads are still running.

@JanWielemaker
Member

One down. I glanced through the code, but I can't really find anything obvious. Somewhat suspect may be SSL negotiation errors, but I think they are handled correctly. I fear you need some debug statements. You can have a thread monitoring the others and print a message if one disappears, so you have a fairly precise time. In SWISH you find lib/replay.pl, which you can use as a starting point to stress test your server from log files.

@triska
Member Author

triska commented Oct 27, 2016

Failed SSL negotiation currently leaks file descriptors; please see #33 and the patch for HTTP.

In the example above though, there are still enough file descriptors. Thank you for the pointers!

@triska
Member Author

triska commented Apr 3, 2017

I have made some progress with this issue by adding the following definitions to the Proloxy rules that run the site:

:- initialization(watch_threads).

watch_threads :-
        thread_create(log_thread_listing, _, [alias(watcher)]).

threads(Ts) :-
        findall(I-D, thread_statistics(I, D), Ts).

log_format(F, Args) :- http_log(F, Args).

log_thread_listing :-
        sleep(5),
        threads(Ts),
        length(Ts, L),
        log_format("num_threads(~w).\n", [L]),
        maplist(log_thread, Ts),
        sleep(1800),
        log_thread_listing.

log_thread(I-T) :-
        (   thread_property(I, alias(A)) ->
            log_format("thread(~q, ~q).\n", [A,T.status])
        ;   true
        ).

These definitions log the threads' states every 30 minutes. The following happened with a server that does not use HTTPS, so the underlying cause of this issue may in fact affect both HTTP and HTTPS servers.

When I start the server, all threads are running exactly as expected:

server(started, 1486898099).
num_threads(19).
thread(main, running).
thread(watcher, running).
thread('httpd@80_1', running).
thread('httpd@80_2', running).
thread('httpd@80_3', running).
thread('httpd@80_4', running).
thread('httpd@80_5', running).
thread('httpd@80_6', running).
thread('httpd@80_7', running).
thread('httpd@80_8', running).
thread('httpd@80_9', running).
thread('httpd@80_10', running).
thread('httpd@80_11', running).
thread('httpd@80_12', running).
thread('httpd@80_13', running).
thread('httpd@80_14', running).
thread('httpd@80_15', running).
thread('httpd@80_16', running).
thread('http@80', running).
/*Sun Feb 12 12:14:58 2017*/ etc.

This list stays unchanged for weeks and months. Sometimes, I see error messages like the following in the log files:

/*Sun Mar 19 08:02:28 2017*/ request(1939, 1489906948.327, [peer(ip(181,26,188,206)),method(get),request_uri(/),path(/),http_version(1-0)]).
completed(1939, 0.007126103000018702, 0, 500, error("bytes('text/html', ...): remote hangup after 28,672 bytes")).

This is completely OK of course, and normally does not change the state of the threads that are reported in the log file.

However, sometimes things seem to go wrong with such requests. Here is a concrete example. First, I get:

/*Sun Mar 19 09:37:41 2017*/ request(1941, 1489912661.103, [peer(ip(49,245,126,195)),method(get),request_uri(/),path(/),http_version(1-0)]).
completed(1941, 0.0073578000000225074, 0, 500, error("bytes('text/html', ...): remote hangup after 28,672 bytes")).

This looks innocent enough, just like in the earlier case. However, from then on, I get:

thread(main, running).
thread(watcher, running).
thread('httpd@80_1', running).
thread('httpd@80_2', running).
thread('httpd@80_3', running).
thread('httpd@80_4', running).
thread('httpd@80_5', running).
thread('httpd@80_6', running).
thread('httpd@80_7', running).
thread('httpd@80_8', running).
thread('httpd@80_9', running).
thread('httpd@80_10', running).
thread('httpd@80_11', running).
thread('httpd@80_12', running).
thread('httpd@80_13', running).
thread('httpd@80_14', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_15', running).
thread('httpd@80_16', running).
thread('http@80', running).

From this point on, one of the threads appears to have a severe problem: it stays in this state, while the others still run as normal.

Then, one day later (in this concrete case), I get two more requests:

/*Mon Mar 20 07:56:33 2017*/ request(1967, 1489992993.684, [peer(ip(189,129,142,140)),method(get),request_uri(/),path(/),http_version(1-0)]).
completed(1967, 0.008738023000006478, 0, 500, error("bytes('text/html', ...): remote hangup after 28,672 bytes")).
/*Mon Mar 20 08:02:50 2017*/ request(1968, 1489993370.896, [peer(ip(187,192,214,75)),method(get),request_uri(/),path(/),http_version(1-0)]).
completed(1968, 0.008106599000001324, 0, 500, error("bytes('text/html', ...): remote hangup after 29,135 bytes")).

From then on, two more threads exhibit the same problem:

num_threads(19).
thread(main, running).
thread(watcher, running).
thread('httpd@80_1', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_2', running).
thread('httpd@80_3', running).
thread('httpd@80_4', running).
thread('httpd@80_5', running).
thread('httpd@80_6', running).
thread('httpd@80_7', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_8', running).
thread('httpd@80_9', running).
thread('httpd@80_10', running).
thread('httpd@80_11', running).
thread('httpd@80_12', running).
thread('httpd@80_13', running).
thread('httpd@80_14', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_15', running).
thread('httpd@80_16', running).
thread('http@80', running).

This goes on for a while, until eventually (i.e., after several more requests like the above), we only get:

num_threads(19).
thread(main, running).
thread(watcher, running).
thread('httpd@80_1', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_2', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_3', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_4', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_5', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_6', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_7', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_8', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_9', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_10', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_11', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_12', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_13', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_14', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_15', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('httpd@80_16', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).
thread('http@80', running).

Thus, all worker threads have turned from running to exception, making the server completely unusable: Clients can still connect, but they never receive a response. In fact, even the timeout no longer works in this state.

This is with SWI-Prolog ~7.5.0 (1a0ff1b33683571b81669a11f4683e82994b8112), from end of January.

Since then, a few patches have been applied to SWI-Prolog that may even solve this issue (098e6be854104a843be495911b699460b3fa57ec, maybe?), and I will of course restart the server with the latest SWI version to see whether this reproduces. Until then, I would be grateful for any additional feedback and hints.

@JanWielemaker
Member

Ah. This is progress! I think this implies that it is possible for catch/3 to miss an exception. I have a small hunch this may be related to 1a3c397bf0258138b85c061e97ae828bb455d742, which deals with an exception (e.g., timeout, abort) arriving inside the Prolog hook that diagnoses another exception, i.e., the writer gets an exception due to the I/O error and, while inside the exception hook, it gets a timeout exception.

Similar stuff seems to be going on in SWISH, which processes a lot more requests since it connects its users using websockets. If you look at http://swish.swi-prolog.org/example/stats.swinb, section Number of threads and visitors, you see that the number of threads gets higher and higher when compared to the number of Pengines. As most of SWISH's requests are handled in forked threads rather than the main HTTP worker threads, it doesn't affect SWISH that much (a thread in the terminated state is just a little structure that remembers the exit state). Besides, SWISH typically doesn't live long enough :(

Now I need to think of a setup to replicate this. Possibly setting up an HTTP server with a short timeout and using a Prolog client that deliberately hangs the connection close to the timeout.
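A rough sketch of such a client (an illustration only; slow_client/2 is a made-up predicate and it assumes library(socket)'s tcp_connect/3 and a plain-HTTP server on the given port): it sends an incomplete request and then sits idle until just before the server's timeout, keeping the worker occupied.

:- use_module(library(socket)).

% slow_client(+Port, +HangSeconds): connect, send a request without the
% terminating blank line, then hang for HangSeconds before closing.
slow_client(Port, HangSeconds) :-
    tcp_connect(localhost:Port, Stream, []),
    format(Stream, "GET / HTTP/1.1~nHost: localhost~n", []),
    flush_output(Stream),
    sleep(HangSeconds),
    close(Stream).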

@JanWielemaker
Member

thread('httpd@80_10', exception(error(io_error(write,user_error),context(system:format/3,'Broken pipe')))).

I was trying to reproduce this. Getting the I/O error from the client dropping the connection and the timeout at the same time does indeed kill the HTTP worker, but this is correctly picked up by recreate_worker/2, so no harm is done. It is still dubious that this can happen, and I need to figure out why, as that may also be a reason why SWISH keeps leaking resources, but it is not very likely that this is the problem we have here.

I missed something before: your exception is a format to user_error! That is not part of the normal HTTP workflow and is basically only used for message printing. Could it be that user_error is connected to some socket of the service manager and the service manager stopped listening to it? If that is the case, messages printed by the server outside the guarded area will cause the thread to die. Are you using systemd? With what config? Is your code somewhere doing something with user_error?

@triska
Member Author

triska commented Apr 5, 2017

I am using Proloxy, started with the following systemd configuration file:

[Unit]
Description=Proloxy

[Service]
UMask=022
Environment=LANG=en_US.utf8
Restart=on-abort
StartLimitInterval=60
LimitNOFILE=1000
DefaultStartLimitBurst=5
WorkingDirectory=/home/web/proloxy
ExecStart=/usr/local/bin/swipl rules.pl proloxy.pl \
 --no-fork --port=80 --user=web \
 --pidfile=/var/run/proloxy.pid --workers=16 --syslog=proloxy

[Install]
WantedBy=multi-user.target

In the concrete case I described earlier, my rules.pl file looks essentially only like this:

:- use_module(library(http/http_dispatch)).
:- use_module(library(http/http_log)).
:- use_module(library(http/http_files)).

% allow reading .well-known for Let's Encrypt
request_prefix_target(Request, '', _) :-
        memberchk(request_uri(URI), Request),
        atom_concat('/.well-known/', _, URI),
        http_reply_from_files('/home/web/html', [], Request).

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
   Redirect everything to https://...
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

request_prefix_target(Request, _, _) :-
        memberchk(request_uri(URI), Request),
        memberchk(host(Host), Request),
        atomic_list_concat(['https://',Host,URI], Target),
        http_redirect(moved, Target, Request).

This only redirects from HTTP to HTTPS.

@JanWielemaker
Copy link
Member

If we look at https://bugzilla.redhat.com/show_bug.cgi?id=754938, we see that systemd indeed captures the output of a service using a socket. I think the question splits into two:

  • Does systemd close the socket? Why?
  • What should we do if a write to stderr fails? Ignore the error? Halt the service and hope we
    get restarted? It is definitely a bad idea to let the thread die, but the alternatives are unclear:
    • Ignoring the errors means we stop making log messages available, but we keep going.
    • Halting might mean we get restarted, but it might also mean the service stays stopped.

Here is some more backup for my suspicion: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=771122

So, my working hypothesis is this:

  • Restart the journal daemon.
  • Make sure a client disconnects. This causes the Prolog server to log a message, which it cannot
    do, and this crashes the server thread.

This fits with the observation that all works well for a long time. At some point you do an apt-get upgrade,
systemd components are updated, the journal daemon is restarted, and shortly after that you start losing threads. Do you have the setup to verify that easily?
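One quick way to test this hypothesis from the server's Prolog toplevel is sketched below (an illustration, not something that was actually run here): restart systemd-journald and then try to write to user_error. If the journal socket behind stderr is gone, the write raises the same error(io_error(write,user_error),_) exception that shows up in the worker status listings above.

% Minimal probe: succeeds quietly if stderr is still writable, reports
% on user_output if writing to user_error raises the I/O error.
probe_stderr :-
    catch(( format(user_error, "stderr probe~n", []),
            flush_output(user_error)
          ),
          error(io_error(write, user_error), _),
          format("user_error is broken~n", [])).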

@triska
Member Author

triska commented Apr 5, 2017

This may definitely be one (or even the) way to trigger this problem. Only for completeness, I would like to mention that I have not run apt-get upgrade during the entire time the server was running. However, I assume there may also be other causes that make the syslog daemon restart.

In fact, I think this may also happen due to logrotate, because in /var/log/messages, I see several cases of:

syslogd: [...] rsyslogd was HUPed

In such cases, is there a way to reconnect to the journal daemon? I mean: if the journal daemon restarts, SWI-Prolog should establish a connection to the new instance, if possible.

@triska
Member Author

triska commented Apr 5, 2017

One additional data point:

$ systemctl status rsyslog.service
● rsyslog.service - System Logging Service
   Loaded: loaded (/lib/systemd/system/rsyslog.service; enabled)
   Active: active (running) since Mon 2016-11-07 14:58:44 CET; 4 months 27 days ago
     Docs: man:rsyslogd(8)
           http://www.rsyslog.com/doc/
 Main PID: 714 (rsyslogd)
   CGroup: /system.slice/rsyslog.service
           └─714 /usr/sbin/rsyslogd -n

According to this, the rsyslog daemon was in fact running throughout the entire time the HTTP server was running.

@JanWielemaker
Member

It is hard to figure out what exactly happened. I'm not 100% convinced this is why things go wrong, although judging from the Prolog messages it seems very likely. We need to find the exact relation between the service and systemd, and/or a way to reproduce the result, so that it becomes plausible or, better, provable that this is what happened. Possibly this can be found in other reports, as we don't seem to be the only victim. What is the OS version? Is that Debian Jessie?

And no, there is no way to restore the connection as I understand it. The service has two options: ignore the errors from its output streams, or terminate and rely on the systemd restart policy.

@JanWielemaker
Member

It is not syslog that is causing the problem; it is the journal service of systemd that catches the output of the services. See, e.g., https://bugs.freedesktop.org/show_bug.cgi?id=84923. If I query that on my desktop (Ubuntu 16.10), I get

● systemd-journald.service - Journal Service
   Loaded: loaded (/lib/systemd/system/systemd-journald.service; static; vendor preset: enabled)
   Active: active (running) since Wed 2017-04-05 10:45:22 CEST; 2 days ago
     Docs: man:systemd-journald.service(8)
           man:journald.conf(5)
 Main PID: 283 (systemd-journal)
   Status: "Processing requests..."
    Tasks: 1
   Memory: 135.6M
      CPU: 35.429s
   CGroup: /system.slice/systemd-journald.service
           └─283 /lib/systemd/systemd-journald

Apr 05 10:45:22 hppc823 systemd-journald[283]: Runtime journal (/run/log/journal/) is 8.0M, max 320.4M, 312.4M free.
Apr 05 10:45:22 hppc823 systemd-journald[283]: Journal started
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Note the last line. I don't really know what this means. I did see bug reports that relate a closed stderr to logrotate. On the other hand, the systemd docs say you do not need logrotate with systemd as it disposes of old log records itself.

Not very elegant, but possibly the most sensible solution is to add an option to streams to ignore write errors and to enable that in the http_unix_daemon.pl module?

@triska
Member Author

triska commented Apr 7, 2017

When I query:

$ systemctl status systemd-journal

then I get:

● systemd-journal.service
   Loaded: not-found (Reason: No such file or directory)
   Active: inactive (dead)

This could suggest that this service may not be the reason for this problem in this particular instance. Please note that I have since restarted the HTTP server, and all threads are running exactly as expected.

This system is:

$ lsb_release  -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 8.7 (jessie)
Release:	8.7
Codename:	jessie

The suggested option for streams indeed seems very inelegant to me. Can you use catch/3 to catch this error within the HTTP layer?

@JanWielemaker
Member

I don't know what it means that systemd-journal is not present. I guess it is named differently or is part of something else in that version of systemd. No matter what, in all versions of systemd the services chat via a named pipe to the service daemon, and under some conditions this named pipe is destroyed and re-created. Older versions seem to suffer from this issue more than recent ones, but the reports suggest the issue isn't fixed.

You can put a catch around the print_message/2, but in this case not inside your handler, as it is printing an error that results from the final communication, which is not executed under your handler. That is also why the thread dies: only the user part of the handler is guarded. I don't think we want to place a catch around every print_message/2 call :( So I think we have two options: as above, ignore I/O errors on stderr (user_error), or extend the code in thread_httpd.pl that monitors the death of HTTP workers to terminate the server if there is an I/O error on user_error (next to managing the service another way).

I am tempted to implement both suicide in thread_httpd.pl and ignoring output errors. The user can then choose between ignoring output errors and making the server die as a whole, the only two sensible ways to react I can think of.
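A user-level approximation of the second option is sketched below (illustration only; the real change would live in thread_httpd.pl as described): a monitoring goal exits with a non-zero status as soon as any thread has died with the user_error write error, so that a Restart=on-failure unit would bring the whole service back.

% Sketch: halt with a non-zero status if any thread died on the stderr
% write error, leaving the restart to the service manager.
halt_if_stderr_broken :-
    (   thread_property(_,
            status(exception(error(io_error(write, user_error), _))))
    ->  halt(1)
    ;   true
    ).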

@triska
Member Author

triska commented Apr 9, 2017

systemd-journald restarting is indeed most likely the reason for these hangs. On the system where this most recently appeared (i.e., the test case above), I get, after correcting the service name:

    ● systemd-journald.service - Journal Service
       Loaded: loaded (/lib/systemd/system/systemd-journald.service; static)
       Active: active (running) since Mon 2017-03-20 04:42:43 CET; 2 weeks 6 days ago
         Docs: man:systemd-journald.service(8)
               man:journald.conf(5)
     Main PID: 8706 (systemd-journal)
       Status: "Processing requests..."
       CGroup: /system.slice/systemd-journald.service
               └─8706 /lib/systemd/systemd-journald

    Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Let us call this system F (faulty).

For comparison, I have a different machine that is in almost all respects identical, and the hang has not occurred there:

● systemd-journald.service - Journal Service
   Loaded: loaded (/lib/systemd/system/systemd-journald.service; static)
   Active: active (running) since Tue 2016-10-25 13:07:50 CEST; 5 months 14 days ago
     Docs: man:systemd-journald.service(8)
           man:journald.conf(5)
 Main PID: 12468 (systemd-journal)
   Status: "Processing requests..."
   CGroup: /system.slice/systemd-journald.service
           └─12468 /lib/systemd/systemd-journald

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Note that both systems display the warning about the rotated journal, but only on F has the HTTP server starved, apparently because the service was restarted there.

When I manually restart the journal service, I can reproduce the issue with the following Tcl script:

#!/usr/bin/tclsh
set s [socket localhost 80]
puts $s "GET /\n"
flush $s
close $s

It simply sends a request and closes the channel immediately.
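For reference, the same probe expressed in SWI-Prolog itself (a sketch; probe/1 is a made-up name and it assumes library(socket)'s tcp_connect/3):

:- use_module(library(socket)).

% Connect, send a bare request, and close the connection immediately.
probe(Port) :-
    tcp_connect(localhost:Port, Stream, []),
    format(Stream, "GET /~n~n", []),
    flush_output(Stream),
    close(Stream).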

I see that the latest git version contains a fix for this, thank you very much! I do not understand, though, why systemd-journald was so recently restarted only on system F, while on the other it has been running without any issues for several months.

I have simply kept both systems running with as little interference as possible (no dist-upgrade, no manual service restarts, almost no logins). The only immediate difference between these systems is: On F, I have 2 cron jobs that automatically renew the certificate and restart the HTTPS server. The setup is:

  • Sunday 4:30: renew certificate
  • Sunday 4:40: restart the HTTPS server.

This runs for weeks without problems. On Sunday, March 19 (4:40), the HTTPS server was restarted by the cron job. systemd-journald was restarted on Monday, March 20, also around 4:40, i.e., almost exactly one day later. Why? Is there anything in the HTTP daemon that could cause a systemd-journald restart?

This is important due to the way the whole systemd infrastructure reacts to this kind of problem: If systemd-journald restarts, this may indirectly also affect other services...

@JanWielemaker
Member

This is important due to the way the whole systemd infrastructure reacts to this kind of problem: If systemd-journald restarts, this may indirectly also affect other services...

That seems to be the conclusion. Together with the write_errors(ignore) stream property, we can now react in the two ways that make sense. Various posts suggest this issue is recognised and addressed in systemd. This closes the issue for SWI-Prolog. Thanks for all the time you invested in resolving this.
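For illustration, a daemon could opt in to that behaviour along the lines of the sketch below (it assumes the property is settable via set_stream/2; the catch keeps it harmless on versions that do not support it):

% Sketch only: ignore further write errors on stderr instead of raising
% io_error exceptions; the property name follows the mention above.
ignore_stderr_write_errors :-
    catch(set_stream(user_error, write_errors(ignore)), _, true).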

@triska
Member Author

triska commented Apr 10, 2017

In my view, this is not yet fully solved: instead of a "starved" server, users will now get a "terminated" server when using the sample service file that ships with SWI-Prolog itself.

To reproduce the problem with the latest git version, please perform the following steps:

  1. start a server with the sample systemd configuration file
  2. restart journald via $ sudo systemctl restart systemd-journald
  3. run the Tcl script shown above.

Expectation: The server halts and restarts.
Actual result: The server halts and stays terminated.

To solve this problem, please consider changing the systemd configuration file, terminating the server in such a way that systemd restarts it when necessary, or documenting that users must adjust the configuration file to avoid this exact issue. Thank you!

@JanWielemaker
Member

Leaving it stopped is indeed not good. As you have it all set up, could you propose a fix?

@triska
Member Author

triska commented Apr 10, 2017

In my setup, I have solved this by changing the value of Restart from on-abort to on-failure in my .service file:

Restart=on-failure

This is the setting that I have now also published for Proloxy and LetSWICrypt.

Fedora has some guidelines, or at least a discussion of potential guidelines, and importantly a table that explains the values.

Quoting Lennart:

I am fine with either on-abort or on-failure. I think on-failure is the nicer choice for daemons that are written somewhat cleanly, but on-abort is the safer choice.

A failure in the sense of "on-failure" is five different things:

  1. the main daemon exits and dumps core
  2. the main daemon exits with an abortive signal
  3. the main daemon exits with an exit code != 0
  4. the watchdog feature is enabled for a service and a service fails to respond
  5. some action such as service reload is executed but this times out.

OTOH "on-abort" only restarts a service on the first two cases, and just moves the services into a "failed" state otherwise.

Please take a look and apply it as it best suits the overall infrastructure. Thank you!

@JanWielemaker
Member

Thanks for sorting this out. I've decided to adjust the example to use Restart=on-failure. Otherwise I would have to use something like the code below, and I don't like that very much either.

current_prolog_flag(pid, PID),
kill(PID, abrt).
