outbound RADIUS/TLS connection stalls #3501
Comments
The short answer is "if the network is broken, things will stall". For v4, the outbound connections are done asynchronously. We don't yet have TLS support in v4, but it's possible with a bit of work. Unless there are mitigating reasons, I'm going to close this as "known, can't fix in v3". Making every write asynchronous requires major changes to the server core, hence all of the work on v4. :(
Isn't it just/also a timeout missing in (calling the) tls_new_client_session? I understand your point about things being synchronous in this release, but there are other parts of the code that deal with connection errors better (i.e. marking a server dead if a connection can't be established, with a shorter timeout). We use FreeRADIUS in eduroam: in the global scope, it's (very) likely some network connections or servers will fail at some point. Well, QED in this case. It was very easy to reproduce BTW; I had a peer listening with "nc -l -p 2083", and tried to connect to it. The bt from the gcore gives me:
In the debugging output, again:
Now that I had this reproduced, I could check the actual timeout (and wait for it): it appears to be 5 minutes before the attempt ends with:
It's a lot more than that, unfortunately. The timeout is controlled by the What might work is to edit
That sets the underlying socket to be non-blocking. This MAY work. It also may have other side effects if things go wrong. If that seems to work for you, we can add it as a configurable flag in v3. I'm wary of changing existing behavior. So I would want to ensure that the new functionality gets used only if explicitly enabled.
The side-effects are serious indeed; it doesn't work. Typically gives a
That's unhappy. :( I'll see if I can figure something out. But generally speaking v3 is synchronous. Adding outbound TLS to v4 is likely a week or so of work. Right now we're booked on a lot of other things.
Maybe this patch will help? It will set the socket as non-blocking, and then keep retrying the there's no timeout yet. But if the code works, a timeout can be added.
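To illustrate the "non-blocking socket, keep retrying, add a timeout" idea outside of FreeRADIUS, here is a minimal Python sketch. It is an analogy only, not the patch discussed here: the function names `handshake_with_deadline` and `open_nonblocking_tls` are hypothetical, and certificate verification is switched off purely to keep the demo short.

```python
import select
import socket
import ssl
import time


def handshake_with_deadline(tls_sock, deadline_s):
    """Retry a non-blocking TLS handshake until it completes or the deadline passes.

    Returns True on success, False if the deadline expired first.
    """
    deadline = time.monotonic() + deadline_s
    while True:
        try:
            tls_sock.do_handshake()
            return True
        except ssl.SSLWantReadError:
            # The handshake needs more data from the peer before it can continue.
            want_read, want_write = [tls_sock], []
        except ssl.SSLWantWriteError:
            # The handshake needs the socket to become writable.
            want_read, want_write = [], [tls_sock]
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Sleep until the socket is ready or the remaining time runs out.
        select.select(want_read, want_write, [], remaining)


def open_nonblocking_tls(host, port, deadline_s):
    """Connect, switch to non-blocking mode, then handshake with a deadline."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False        # demo only: no peer verification
    ctx.verify_mode = ssl.CERT_NONE   # demo only: no peer verification
    raw = socket.create_connection((host, port), timeout=deadline_s)
    raw.setblocking(False)
    tls = ctx.wrap_socket(raw, do_handshake_on_connect=False)
    if handshake_with_deadline(tls, deadline_s):
        return tls
    tls.close()
    return None
```

Against a peer that accepts TCP but never answers the TLS hello, `open_nonblocking_tls` returns `None` once the deadline passes instead of blocking. A real server would presumably return to its event loop between retries rather than wait inline as this sketch does.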
This patch does help, apart from the fact that if the peer is reachable and healthy, the first request always fails (because it is proxied, I guess, before the connection is actually ready). To be honest, I would not use it in production because of that; it would do more harm. But perhaps your idea of a timeout resolves that :-)
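If the guess above is right and the first request is sent before the connection is usable, one generic way around it is to hold requests until the handshake has finished. The Python sketch below shows only that idea; `OutboundConnection`, `submit` and `mark_ready` are hypothetical names, and nothing here reflects how FreeRADIUS actually queues proxied requests.

```python
from collections import deque


class OutboundConnection:
    """Hypothetical model of a proxy connection that is unusable until TLS is up."""

    def __init__(self, send):
        self._send = send        # callable that actually writes one request
        self._ready = False      # becomes True once the handshake has completed
        self._pending = deque()  # requests submitted while still connecting

    def submit(self, request):
        """Send immediately if the connection is ready, otherwise hold the request."""
        if self._ready:
            self._send(request)
        else:
            self._pending.append(request)

    def mark_ready(self):
        """Call once the handshake is done; flushes held requests in arrival order."""
        self._ready = True
        while self._pending:
            self._send(self._pending.popleft())


if __name__ == "__main__":
    sent = []
    conn = OutboundConnection(sent.append)
    conn.submit("Access-Request 1")  # held: the handshake is not finished yet
    conn.mark_ready()                # flushes the held request
    conn.submit("Access-Request 2")  # sent immediately
    print(sent)
```

A production version would also need a bound on the queue and a way to fail held requests if the handshake never completes.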
That's good feedback. The main issues left then are:
@ajrass AFAIK there is no good patch yet - I wouldn't use what you see here in production, unless you'd like to accept that all first connections fail.
pauldekkers commented Jun 18, 2020
Issue type
Defect
How to reproduce the issue
Using the freeradius/freeradius-server:3.0.21 Docker image (on Ubuntu 18.04 LTS with its Docker 18.09.7):
Because of remote network problems, one of our RadSec peers was unreachable in such a way that TCP sessions were set up slowly, and the TLS challenge either never happened (before the TCP session was closed) or happened only after a huge delay. Also, openssl's s_client timed out after a while.
The result is that the entire server stalled and no requests were processed at all (with some 100s reqs/s coming in). This happened fairly quickly after a restart as well, probably because of the high traffic volume, so a restart didn't resume the handling of (other) requests.
Output of [radiusd|freeradius] -X showing issue occurring
... and that's where it stalled.
I have no backtrace; the daemons didn't crash (or at least not within the 15 minutes, after which I had to act). I inspected them with strace (output that I unfortunately didn't keep), and all workers were just blocked, I guess, with not much happening.
I believe FreeRADIUS 4 would be better in the sense that it does some things asynchronously, but I also believe it doesn't have RADIUS/TLS support yet (otherwise I'd give it a go, also because we may have a higher chance of making things dynamic as well, if we can dynamically create home_servers - but that's an entirely different topic).
I think I would be able to reproduce when I connect to a simple destination that accepts the TCP session, but doesn't do anything after. Happy to give it a go if you have more ideas about debugging this.
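A self-contained way to model that kind of destination is sketched below in Python. It is not a FreeRADIUS test: the helper names `silent_listener` and `try_tls_handshake` are hypothetical, the client is a plain TLS client rather than a RadSec one, and verification is disabled for the demo. The listener never calls accept, so the kernel completes the TCP handshake but nothing ever replies to the TLS hello.

```python
import socket
import ssl
import time


def silent_listener():
    """Listen on an ephemeral localhost port without ever speaking TLS.

    The kernel completes TCP handshakes from the listen backlog, so a client
    sees an open connection that simply never answers.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    return srv


def try_tls_handshake(port, timeout):
    """Attempt one TLS handshake bounded by `timeout` seconds.

    Returns (outcome, seconds_spent_in_handshake).
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False        # demo only: no peer verification
    ctx.verify_mode = ssl.CERT_NONE   # demo only: no peer verification
    raw = socket.create_connection(("127.0.0.1", port), timeout=timeout)
    start = time.monotonic()
    try:
        # The handshake runs inside wrap_socket and inherits the socket timeout.
        with ctx.wrap_socket(raw):
            outcome = "connected"
    except socket.timeout:
        outcome = "handshake timed out"
    finally:
        raw.close()
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    srv = silent_listener()
    print(try_tls_handshake(srv.getsockname()[1], timeout=2.0))
    srv.close()
```

With the socket timeout set, the attempt ends after roughly `timeout` seconds; on a blocking socket with no timeout, the same handshake would wait for as long as the silent peer keeps the TCP session open.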