outbound RADIUS/TLS connection stalls #3501

Open
pauldekkers opened this issue Jun 18, 2020 · 9 comments


pauldekkers commented Jun 18, 2020

Issue type

  • Defect - Unexpected behaviour (obvious or verified by project member).

Defect

How to reproduce the issue

Using the freeradius/freeradius-server:3.0.21 Docker image (on Ubuntu 18.04 LTS with its Docker 18.09.7):

Because of remote network problems, one of our RadSec peers was unreachable in such a way that TCP sessions were set up slowly, and the TLS challenge either never happened (before the TCP session was closed) or happened only after a huge delay. openssl's s_client also timed out after a while.

The result was that the entire server stalled and no requests were processed at all (with several hundred requests per second coming in). This happened again fairly quickly after a restart, probably because of the high traffic volume, so restarting did not resume the handling of (other) requests.

Output of [radiusd|freeradius] -X showing issue occurring

Trying SSL to port 2083
Requiring Server certificate
(0) (other): before SSL initialization
(0) TLS_connect: before SSL initialization
(0) >>> send TLS 1.2  [length 00b1]
(0) TLS_connect: SSLv3/TLS write client hello

... and that's where it stalled.

I have no backtrace: the daemons didn't crash (at least not within 15 minutes, after which I had to act). I inspected them with strace (output that I unfortunately didn't keep); all workers appeared to be simply blocked, with not much happening.

I believe FreeRADIUS 4 would be better in the sense that it does some things asynchronously, but I also believe it doesn't have RADIUS/TLS support yet (otherwise I'd give it a go, also because we may have a better chance of making things dynamic if we can dynamically create home_servers, but that's an entirely different topic).

I think I would be able to reproduce this by connecting to a simple destination that accepts the TCP session but doesn't do anything afterwards. Happy to give it a go if you have more ideas about debugging this.

Member

alandekok commented Aug 12, 2020

> Because of remote network problems, one of our RadSec peers was unreachable in such a way that TCP sessions were set up slowly, and the TLS challenge either never happened (before the TCP session was closed) or happened only after a huge delay. openssl's s_client also timed out after a while.

The short answer is "if the network is broken, things will stall".

For v4, the outbound connections are done asynchronously. We don't yet have TLS support, but it's possible with a bit of work.

Unless there are mitigating reasons, I'm going to close this as "known, can't fix in v3". Making every write asynchronous would require major changes to the server core, hence all of the work on v4. :(

Author

pauldekkers commented Aug 13, 2020

Isn't it just/also a timeout missing in (calling the) tls_new_client_session?

I understand your point about things being synchronous in this release, but there are other parts of the code that deal with connection errors better (i.e. marking a server dead if a connection can't be established, with a shorter timeout).

We use FreeRADIUS in eduroam: in the global scope, it's (very) likely some network connections or servers will fail at some point. Well, QED in this case.

It was very easy to reproduce BTW; I had a peer listening with "nc -l -p 2083", and tried to connect to it. The bt from the gcore gives me:

(gdb) bt
#0  0x00007f188f48135e in __libc_read (fd=14, buf=0x564ecde2a583, nbytes=5) at ../sysdeps/unix/sysv/linux/read.c:27
#1  0x00007f188fde6b7e in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
#2  0x00007f188fde1fba in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
#3  0x00007f188fde0e53 in ?? () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
#4  0x00007f188fde1403 in BIO_read () from /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
#5  0x00007f188faca913 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#6  0x00007f188facefbd in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#7  0x00007f188facc6c2 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#8  0x00007f188fafe6e8 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#9  0x00007f188faf461d in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#10 0x00007f188fae04c4 in SSL_do_handshake () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#11 0x0000564eccdcb3e8 in tls_new_client_session (ctx=ctx@entry=0x564ecde12b70, conf=0x564ecdce0690, fd=14, certs=certs@entry=0x564ecde12c40) at src/main/tls.c:541
#12 0x0000564eccda6792 in proxy_new_listener (ctx=<optimized out>, home=0x564ecdb0e6e0, src_port=src_port@entry=0) at src/main/listen.c:2873
#13 0x0000564eccdbaac1 in insert_into_proxy_hash (request=request@entry=0x564ecde07550) at src/main/process.c:2302
#14 0x0000564eccdbf7d0 in request_proxy (request=request@entry=0x564ecde07550) at src/main/process.c:3337
#15 0x0000564eccdbf95c in request_proxy (request=0x564ecde07550) at src/main/process.c:3307
#16 request_running (request=0x564ecde07550, action=<optimized out>) at src/main/process.c:1644
#17 0x0000564eccdbc110 in request_queue_or_run (request=request@entry=0x564ecde07550, process=process@entry=0x564eccdbf820 <request_running>) at src/main/process.c:1106
#18 0x0000564eccdbcfa7 in request_receive (ctx=ctx@entry=0x564ecde07350, listener=listener@entry=0x564ecddd0ce0, packet=<optimized out>, client=client@entry=0x564ecdd91a20,
    fun=fun@entry=0x564eccd9a070 <rad_authenticate>) at src/main/process.c:1892
#19 0x0000564eccda5656 in auth_socket_recv (listener=0x564ecddd0ce0) at src/main/listen.c:1597
#20 0x0000564eccdb94ae in event_socket_handler (xel=<optimized out>, fd=<optimized out>, ctx=<optimized out>) at src/main/process.c:4867
#21 0x00007f189022765f in fr_event_loop (el=0x564ecdd935f0) at src/lib/event.c:649
#22 0x0000564eccd99295 in main (argc=<optimized out>, argv=<optimized out>) at src/main/radiusd.c:634

In the debugging output, again:

Trying SSL to port 2083
Requiring Server certificate
(0) (other): before SSL initialization
(0) TLS_connect: before SSL initialization
(0) >>> send TLS 1.2  [length 00b1]
(0) TLS_connect: SSLv3/TLS write client hello

Now that I had this reproduced, I could check the actual timeout (by waiting for it). It appears to be 5 minutes, after which the attempt ends with:

(0) TLS_connect: Need to read more data: SSLv3/TLS write client hello
tls: System call (I/O) error (-1)

Member

alandekok commented Aug 13, 2020

> Isn't it just/also a timeout missing in (calling the) tls_new_client_session?

It's a lot more than that, unfortunately. The timeout is controlled by the BIO_*() functions. We're using functions supplied by OpenSSL, which are blocking by default.

What might work is to edit src/main/listen.c, and change:

		/*
		 *	FIXME: connect() is blocking!
		 *	We do this with the proxy mutex locked, which may
		 *	cause large delays!
		 *
		 *	http://www.developerweb.net/forum/showthread.php?p=13486
		 */
		this->fd = fr_socket_client_tcp(&home->src_ipaddr,
						&home->ipaddr, home->port, true); // change 'false' to 'true'

That sets the underlying socket to be non-blocking. This MAY work. It also may have other side effects if things go wrong.

If that seems to work for you, we can add it as a configurable flag in v3. I'm wary of changing existing behavior. So I would want to ensure that the new functionality gets used only if explicitly enabled.

@alandekok alandekok reopened this Aug 13, 2020
Author

pauldekkers commented Aug 13, 2020

The side-effects are serious indeed; it doesn't work.

It typically gives a "tls: TLS_connect: Error in SSLv2/v3 write client hello B" on connect, and after some more attempts I've sometimes seen the peer certificate, but also segfaults.

Member

alandekok commented Aug 13, 2020

That's unhappy. :(

I'll see if I can figure something out. But generally speaking v3 is synchronous. Adding outbound TLS to v4 is likely a week or so of work. Right now we're booked on a lot of other things.

Member

alandekok commented Aug 13, 2020

nonblock.patch.gz

Maybe this patch will help? It will set the socket as non-blocking, and then keep retrying the SSL_connect() until it succeeds, or fails permanently.

There's no timeout yet. But if the code works, a timeout can be added.

Author

pauldekkers commented Aug 17, 2020

This patch does help, except that when the peer is reachable and healthy, the first request always fails (I guess because it is proxied before the connection is actually ready). To be honest, I would not use it in production because of that; it would do more harm than good. But perhaps your idea of a timeout resolves that :-)

Member

alandekok commented Aug 17, 2020

That's good feedback. The main issues left then are:

  • timeout on connect
  • saving copies of packets to be sent, until the connect either succeeds or fails
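
The second item can be pictured as a small FIFO of packet copies that is flushed once the connect succeeds and discarded if it fails. This is a sketch under assumptions, not FreeRADIUS code: every name below (pending_queue_t, pending_push(), pending_drain(), pending_count_bytes()) is hypothetical, and the send callback stands in for whatever would actually write to the TLS session.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct pending_packet {
	struct pending_packet *next;
	size_t len;
	uint8_t data[];			/* private copy of the encoded packet */
} pending_packet_t;

typedef struct {
	pending_packet_t *head;
	pending_packet_t **tail;	/* where the next node gets linked */
	size_t count;
} pending_queue_t;

typedef int (*pending_send_t)(void *ctx, uint8_t const *data, size_t len);

void pending_init(pending_queue_t *q)
{
	q->head = NULL;
	q->tail = &q->head;
	q->count = 0;
}

/* Copy a packet into the queue. Returns 0 on success, -1 if malloc fails. */
int pending_push(pending_queue_t *q, uint8_t const *data, size_t len)
{
	pending_packet_t *p = malloc(sizeof(*p) + len);

	if (!p) return -1;
	p->next = NULL;
	p->len = len;
	memcpy(p->data, data, len);
	*q->tail = p;
	q->tail = &p->next;
	q->count++;
	return 0;
}

/*
 * Empty the queue in FIFO order. With a non-NULL send_fn (connect
 * succeeded) each packet is handed to it; with NULL (connect failed) the
 * packets are just discarded. Returns how many packets send_fn accepted,
 * i.e. returned 0 for.
 */
size_t pending_drain(pending_queue_t *q, pending_send_t send_fn, void *ctx)
{
	size_t sent = 0;
	pending_packet_t *p = q->head;

	while (p) {
		pending_packet_t *next = p->next;

		if (send_fn && send_fn(ctx, p->data, p->len) == 0) sent++;
		free(p);
		p = next;
	}
	pending_init(q);
	return sent;
}

/* Example callback: adds each packet's length to a size_t total in ctx. */
int pending_count_bytes(void *ctx, uint8_t const *data, size_t len)
{
	(void)data;
	*(size_t *)ctx += len;
	return 0;
}
```

With something like this in place, a request that arrives while the handshake is still in progress would be queued rather than failing, which is the "first request always fails" behaviour reported above.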

Author

pauldekkers commented Apr 16, 2021

@ajrass AFAIK there is no good patch yet - I wouldn't use what you see here in production, unless you'd like to accept that all first connections fail.

@FreeRADIUS FreeRADIUS deleted a comment from ajrass Apr 18, 2021
@FreeRADIUS FreeRADIUS deleted a comment from alandekok Apr 18, 2021