Not all endpoints can reconnect due to "Client TLS handshake failed" error after "reload or restart" #6517
After upgrading to Icinga 2.9.0 and 2.9.1 we ran into a huge problem with reconnecting to our endpoints.
Do a "systemctl restart icinga2", a "systemctl reload icinga2", or a reload via Icinga Web 2 and Director 1.4.3.
Between 15:31:12 and 15:31:14 a reconnect is triggered (and logged) for all 6897 endpoints.
For a few endpoints this is successful: the connection is established and config files are synced and updated.
Suddenly the "Client TLS handshake failed" errors appear.
On the client side (10.127.11.99) the following occurs in the log file:
The API also becomes unresponsive now and you can't query anything. System and database load is no higher than with 2.8.4 at this time.
A few minutes later, some systems succeed in reconnecting and syncing, others do not.
Another few minutes later, about 1,000 systems manage a successful reconnect without any TLS errors, but then the TLS errors in the log file rise again. After half an hour only ~3,000 endpoints have reconnected; all others keep running into the strange TLS handshake failures.
With 2.9.0 and 2.9.1 it isn't possible to reconnect all endpoints in a time-effective manner.
A rollback to 2.8.4 fixes this problem immediately. After approximately 5 minutes all endpoints are reconnected.
Reconnection in 2.9.0 and 2.9.1 should work as in 2.8.x and before.
It's not possible to reconnect ALL endpoints. Only a few (~1,500 to 2,000) manage a successful reconnect. All others run into TLS errors and possible timeouts.
Additionally, a restart or reload creates a process in state
Rollback to 2.8.4 at the moment ...
Steps to Reproduce (for bugs)
Hard to reproduce, because you need at least ~7,000 endpoints and ~140,000 services connected.
Rollback done to 2.8.4-1
Disabled features: compatlog debuglog elasticsearch gelf graphite livestatus opentsdb perfdata statusdata syslog
I know that it might be insane in that setup, but I'd be interested in a full gdb backtrace from both running processes (master and client) at the time of such a reconnect attempt. Anything which looks suspicious helps: network bandwidth peaks, CPU/IO changes, a full verbose look into the system, and a reproducer - be it many clients, a specific connection scenario, etc. From a first look this can be anything and nothing.
Found a method to reproduce this issue:
Use socat to mock the Icinga Agents:
Let Icinga connect to socat 7000 times:
And as the last step you need to disable the hostname check on your master:
This will result in:
Same problem here, after restart
I tried the following:
Between a lot of TLS failures, I see also messages like the following:
That max data length thing is a different problem which will be solved with #6595.
The rest is - after days and many hours of reading the code and debugging different changes - a mix of asynchronous parallel connection and TLS handshake handling. More insights and possible patches in the coming days.
The problem is with a master actively connecting to 17000 endpoints on startup, and later in a 60 seconds reconnect timer. In our test setup, we don't have that many containers. Noah created a scenario in 2 VMs which involves a master VM and a "satellite" VM hosting 500-800 clients as Docker containers, where this is reproducible.
In addition to the master connections, the reported setup also includes IDO queries and Director deployments. @jschanz was so kind to provide logs and gdb traces offline (I asked him since there's no support contract or NDA signed, but we know each other in person).
The reason I've asked for gdb backtraces from the running process is simply explained - connection problems normally indicate a resource problem. The logs from Jens also show threads handling TCP connections (and later performing TLS handshakes).
At some point either the network stack gives up on 1000+ parallel connections, or the TLS handshake is so damn expensive (CPU-wise) that other operations slow down. Generally speaking, threads are fighting for resources, locks are contended, and there are likely many context switches.
This problem isn't new at all, there's no key difference between 2.8.4 or 2.9.1 or 2.7.2 even. The possible influencers for not seeing it under specific circumstances are simple: CPU resources and fast network.
A patch from 2015/2016 moved the synchronous TLS handling into an asynchronous event loop with wake-up signals via multiplexed socket pairs. Threads are waiting for other threads to wake them up, and with the heavy socket IO from 500 to 17000 connection attempts, the kernel might be overloaded.
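As a rough illustration of that pattern (names and structure are mine, not the actual Icinga 2 code): a poll loop multiplexes the registered sockets together with the read end of a socket pair, and other threads wake the loop up by writing a single byte to the other end.

```cpp
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

#include <vector>

// Hypothetical sketch of an event loop woken up via a socket pair.
class EventLoop
{
public:
	EventLoop()
	{
		(void)socketpair(AF_UNIX, SOCK_STREAM, 0, m_WakeupFds);
	}

	// Called from other threads: writing one byte makes poll() return.
	void WakeUp()
	{
		char c = 1;
		(void)write(m_WakeupFds[1], &c, 1);
	}

	void Run()
	{
		for (;;) {
			std::vector<pollfd> pfds = m_Sockets; // registered TLS sockets
			pfds.push_back(pollfd{m_WakeupFds[0], POLLIN, 0});

			(void)poll(pfds.data(), pfds.size(), -1); // blocks until IO or wake-up

			if (pfds.back().revents & POLLIN) {
				char buf[64];
				(void)read(m_WakeupFds[0], buf, sizeof(buf)); // drain wake-up bytes
				// ... pick up changed socket registrations here ...
			}
			// ... dispatch POLLIN/POLLOUT events to the registered handlers ...
		}
	}

private:
	int m_WakeupFds[2];
	std::vector<pollfd> m_Sockets;
};
```

With thousands of connection attempts, every registration change triggers such a wake-up write, which is one plausible spot where the kernel's socket buffers and scheduler get hammered.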
In our analysis, next to trying many things which may have influenced the problem, we've also started to reverse engineer the current code base and its socket IO with events to get a full picture. During the analysis we've also looked into alternative implementations, since our code implements these design patterns on its own rather than relying on external libraries, which might make it more error-prone.
As we've agreed during our team weekend, we will push this knowledge into the docs underneath "19-technical-concepts.md".
In order to fully debug the problem, we've created some custom docker image builds from source (still using release builds). Then docker-compose fires up the client containers with a fixed port range. The master gets a looped zones.conf for all these endpoints to connect to (same IP, different port).
A small patch was made against Icinga 2: disable the name checks for the certificate and the connecting CN. Otherwise we could not re-use a single certificate to verify the CA trust chain.
Another opt-in patch disables the "Liveness"-Checks which would disconnect idle connections after 60 seconds.
In addition to the above changes, we've added many log messages to see the exact actions and calls in the event loop and state machine. With this many threads it is impossible to debug with gdb and its variants. These log messages won't be added to release builds though, only those which help the user. One of them is #6602
The resources and time to build the above add up to roughly 10 work days.
Tested with a WQ with infinite tasks and 16 threads, on a 16-core system with 32 GB RAM. Requires an additional patch which moves the "endpoint is connecting" flag into the same place where the asynchronous enqueuing happens - to avoid duplicate connection tasks in the WQ from an ongoing timer run (see the sketch below).
Decreased the reconnect timer interval to 10s; not sure whether this is a good idea at this point, this definitely needs more tests.
Works pretty well, and moves the API request processing into "batches". At some point the queue is empty again (see the last log line).
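For illustration, a minimal sketch of the de-duplication idea with hypothetical names: the "connecting" flag is tested and set at enqueue time, so a timer run that fires while a task is still queued cannot enqueue a second connect for the same endpoint.

```cpp
#include <atomic>

// Hypothetical endpoint with a "connecting" guard flag.
struct Endpoint
{
	std::atomic<bool> Connecting{false};
};

// Called from the periodic reconnect timer for every endpoint. The flag
// flips at enqueue time, not when a worker picks the task up, so a second
// timer run cannot queue a duplicate connect task for the same endpoint.
template<typename WorkQueue>
void TryEnqueueConnect(Endpoint& endpoint, WorkQueue& wq)
{
	bool expected = false;
	if (!endpoint.Connecting.compare_exchange_strong(expected, true))
		return; // a connect task is already queued or running

	wq.Enqueue([&endpoint]() {
		// ... TCP connect + TLS handshake ...
		endpoint.Connecting = false; // allow the next timer run to retry
	});
}
```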
Find a reasonable number for
Create a patch and PR.
Tests by those affected.
Below is an analysis I've been writing with our docs in mind. Will result in a separate PR.
TLS Network IO
TLS Connection Handling
TLS handshake timeouts occur if the server is busy with reconnect handling and other tasks which run in isolated threads. Icinga 2 uses threads in many ways, e.g. for timers which wake them up, for waiting on check results, etc.
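As a hedged illustration of where such timeouts come from (the names below are made up, not Icinga 2's actual API): the handshake caller waits for the event loop to make progress, bounded by a deadline, and if the event threads are starved the deadline fires even though the peer is healthy.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Illustrative only: a handshake that waits for the event loop to make
// progress, bounded by a deadline. If the event threads are starved,
// the wait times out even though the peer is perfectly healthy.
bool HandshakeWithTimeout(std::mutex& mtx, std::condition_variable& cv,
    const bool& handshakeDone, std::chrono::seconds timeout)
{
	std::unique_lock<std::mutex> lock(mtx);
	auto deadline = std::chrono::steady_clock::now() + timeout;

	while (!handshakeDone) {
		// An IO thread signals cv once SSL_do_handshake() has completed.
		if (cv.wait_until(lock, deadline) == std::cv_status::timeout)
			return false; // surfaces as "TLS handshake failed: timeout"
	}
	return true;
}
```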
In terms of the cluster communication, the following flow applies.
Client Processes Connection
Data Transmission between Server and Client Role
Once the TLS handshake and certificate verification is completed, the role is either client or server.
Asynchronous Socket IO
Everything runs through TLS; we don't use any "raw" connections or plain message handling.
The TLS handshake and further read/write operations are not performed in a synchronous fashion in the new client's thread. Instead, all clients share an asynchronous "event pool".
The TlsStream constructor registers a new SocketEvent by calling its constructor. It binds the previously created TCP socket and itself into the created SocketEvent object.
The selected engine is stored as
By default, there are 8 of these worker threads.
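A rough structural sketch of this setup, assuming simplified names (this is not the real implementation): a fixed number of IO threads, each polling the sockets assigned to it, with new sockets distributed round-robin at registration time.

```cpp
#include <poll.h>

#include <atomic>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Rough structural sketch, not the real implementation: N IO threads,
// each polling the sockets assigned to it.
class SocketEventPool
{
public:
	explicit SocketEventPool(std::size_t threads = 8)
		: m_Loops(threads), m_Mutexes(threads)
	{
		for (std::size_t i = 0; i < threads; i++)
			m_Threads.emplace_back([this, i]() { RunLoop(i); });
	}

	// A new TlsStream-like object registers its fd here; sockets are
	// distributed round-robin across the IO threads.
	void Register(int fd)
	{
		std::size_t idx = m_Next++ % m_Loops.size();
		std::lock_guard<std::mutex> lock(m_Mutexes[idx]);
		m_Loops[idx].push_back(pollfd{fd, POLLIN, 0});
	}

private:
	void RunLoop(std::size_t idx)
	{
		for (;;) {
			std::vector<pollfd> fds;
			{
				std::lock_guard<std::mutex> lock(m_Mutexes[idx]);
				fds = m_Loops[idx]; // copy; real code uses a wake-up fd instead
			}
			if (fds.empty()) {
				std::this_thread::sleep_for(std::chrono::milliseconds(100));
				continue;
			}
			(void)poll(fds.data(), fds.size(), 1000); // short timeout picks up new fds
			// ... dispatch readable/writable events to the stream owners ...
		}
	}

	std::vector<std::vector<pollfd>> m_Loops;
	std::vector<std::mutex> m_Mutexes;
	std::vector<std::thread> m_Threads;
	std::atomic<std::size_t> m_Next{0};
};
```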
On Socket Event State Machine
Once TlsStream->Handshake() is called, this initializes the current action to
Once the handshake is completed, current action is changed to either
This also depends on the returned error codes of the SSL interface functions. Whenever
In the scenario where the master actively connects to the clients, the client will wait for data and change the event sockets to
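For reference, here is the general non-blocking handshake pattern with plain OpenSSL calls (a sketch, not the exact Icinga 2 code): SSL_do_handshake() reports via SSL_get_error() whether it next wants to read or write, and the poll events are adjusted accordingly.

```cpp
#include <poll.h>

#include <openssl/ssl.h>

// General non-blocking handshake pattern (not the exact Icinga 2 code):
// drive SSL_do_handshake() and translate WANT_READ/WANT_WRITE into the
// poll events the socket has to wait for next.
enum class TlsAction { Handshake, Read, Write, Done, Failed };

TlsAction ContinueHandshake(SSL* ssl, short& pollEvents)
{
	int ret = SSL_do_handshake(ssl);
	if (ret == 1)
		return TlsAction::Done; // handshake finished, switch to read/write

	switch (SSL_get_error(ssl, ret)) {
		case SSL_ERROR_WANT_READ:
			pollEvents = POLLIN; // wait until the socket is readable
			return TlsAction::Handshake;
		case SSL_ERROR_WANT_WRITE:
			pollEvents = POLLOUT; // wait until the socket is writable
			return TlsAction::Handshake;
		default:
			return TlsAction::Failed; // genuine TLS error
	}
}
```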
TLS Error Handling
From this question:
Successful TLS Actions
Once a stream has data available, it calls
All of them read data from the stream and process the messages. At this point the string is already available as JSON and is later decoded (e.g. into Icinga data structures, such as Dictionary).
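For context, the cluster messages are netstring-framed JSON ("&lt;length&gt;:&lt;payload&gt;,"). A minimal illustrative parser (not the actual NetString class) looks like this:

```cpp
#include <cstddef>
#include <string>

// Illustrative netstring parser ("<length>:<payload>,"), the framing used
// for the cluster JSON-RPC messages. Returns false until a full frame has
// arrived; real code must also validate the length prefix.
bool TryParseNetString(std::string& buffer, std::string& payload)
{
	std::size_t colon = buffer.find(':');
	if (colon == std::string::npos)
		return false; // length prefix not complete yet

	std::size_t len = std::stoul(buffer.substr(0, colon));
	if (buffer.size() < colon + 1 + len + 1)
		return false; // payload (plus trailing ',') not complete yet

	payload = buffer.substr(colon + 1, len); // the JSON document
	buffer.erase(0, colon + 1 + len + 1);    // drop the frame incl. ','
	return true;
}
```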
General Design Patterns
Alternative Implementations and Libraries
While analysing Icinga 2's socket IO event handling, the libraries and implementations
Our main "problem" with Icinga 2 are modern compilers supporting the full C++11 feature set.
Given the below projects, we are also not fans of wrapping C interfaces into
One key thing for external code is license compatibility with GPLv2.
Hello, I'll try to add some of my observations.
I hope this might give you another look at the problem.
For the TLS handshake timeout, I would opt for making this configurable to allow users to fine-tune it (at the cost of performance with lagging connections).
@Icebird2000 thanks, but that would mean that both your master and your satellite are trying to connect to each other, or am I mistaken here? Meaning to say, the endpoints have the
Yes, you are right. The
So, this seems to be a problem with the master actively using Tcp->Connect(), but timing out on the firewall prohibiting this?
In short, it works.
I had removed the
Now I run with the
The workqueue in the middle still needs discussion, especially with threads waiting endlessly until TCP/TLS timeouts happen. The queue may also block pending tasks if all workers are occupied. A better approach would likely be to a) use a connection pool ourselves with wake-up events, but no threads waiting for tasks, with the possibility to add on-demand threads if performance doesn't scale (a sketch of this follows below), or b) drop the SocketEvent implementation and move back to synchronous socket handshakes/polls.
We're reading and evaluating more on this now.
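To make option a) concrete, here is a hedged sketch (hypothetical names, shutdown handling omitted): workers sleep on a condition variable, an enqueue wakes exactly one of them, and a new thread is only spawned when every worker is busy, e.g. stuck in a TCP timeout.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hedged sketch of option a): workers sleep on a condition variable and a
// new thread is only spawned when all existing workers are busy.
// Shutdown/joining is omitted for brevity.
class ConnectionPool
{
public:
	void Enqueue(std::function<void()> task)
	{
		std::unique_lock<std::mutex> lock(m_Mutex);
		m_Tasks.push_back(std::move(task));
		if (m_Idle == 0) // every worker is busy, e.g. stuck in a TCP timeout
			m_Workers.emplace_back([this]() { Work(); }); // grow on demand
		m_CV.notify_one(); // wake-up event for exactly one sleeping worker
	}

private:
	void Work()
	{
		std::unique_lock<std::mutex> lock(m_Mutex);
		for (;;) {
			m_Idle++;
			m_CV.wait(lock, [this]() { return !m_Tasks.empty(); });
			m_Idle--;

			std::function<void()> task = std::move(m_Tasks.front());
			m_Tasks.pop_front();

			lock.unlock();
			task(); // connect + TLS handshake run outside the lock
			lock.lock();
		}
	}

	std::mutex m_Mutex;
	std::condition_variable m_CV;
	std::deque<std::function<void()>> m_Tasks;
	std::vector<std::thread> m_Workers;
	int m_Idle = 0;
};
```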
Moving the TLS handshake part out of the async IO design pattern partially works - until you have to handle incoming socket events where no events are triggered anymore. Once you decide to move the remaining read/write operations out as well, you may read from the plain sockets, but there's no way to signal data availability to the registered JSON-RPC/HTTP handlers later on.
One thing I've recognized during my rewrite is that SocketEvents are statically initialized exactly once. This doesn't happen during application startup, but once the first TlsStream object is constructed - maybe too late for the socket IO when 7000 socket connect/handshake events happen at that moment.
Something like this without any connection made yet.
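To illustrate the lazy one-time initialization pattern (hypothetical names): with a function-local once-flag, the engine only starts when the first stream object is constructed, which can coincide with thousands of pending connects.

```cpp
#include <iostream>
#include <mutex>

// Hypothetical illustration only: the IO machinery starts when the first
// stream object is constructed, not at application startup.
static void InitializeSocketEventEngine()
{
	// In the real code this would spawn the socket IO worker threads.
	std::cout << "socket event engine initialized\n";
}

class TlsStreamLike
{
public:
	TlsStreamLike()
	{
		static std::once_flag initFlag;
		// Runs exactly once, on the FIRST construction - which may be the
		// moment thousands of connect/handshake events are already pending.
		std::call_once(initFlag, InitializeSocketEventEngine);
	}
};

int main()
{
	TlsStreamLike a; // triggers the initialization here, not at startup
	TlsStreamLike b; // no-op
}
```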
Dynamic Thread Pool for Connections
45 secs to fully reconnect 500 clients, 400 after 10 seconds.
Simulated a reload in between (stop, start) at a good spot (it seems the kernel did not drop the connections yet, or some sort of keepalive re-use happened). 5 seconds is very good.
Hmmm, 7000 endpoints with socat crashes my MacBook, since fork() fails at some point.
Moving to bigger test VMs.
Starts to peak at 93 threads.
Takes ~82 seconds to connect 10000 endpoints.
One observation - a dynamic thread pool, compared to spawning and destroying threads, is probably the solution here. 7000 threads for connections are very expensive, and I assume that our switch from
Tests against the REST API
stampede creates 1000 hosts simultaneously and therefore opens many connections. This leads to up to 2000 threads for that second, and once all connections are established, it fires the requests. 2 out of 1000 result in a 500 error.
Following the design pattern for many connections on HTTP servers, it is totally fine to spawn this many threads for a moment. Since the thread pool implementation also kills off at most 2 threads at a time, the resources for dropping idle threads are not as wasted as before - where 5000 threads were simply destroyed at once (see the sketch below).
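A hedged sketch of that reaping behaviour (illustrative, not the actual pool code): a periodic maintenance step retires at most two idle threads per run, so a burst that spawned thousands of threads shrinks gradually instead of being torn down at once.

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative only: called periodically by a maintenance timer.
// 'idleThreads' is however many workers currently have no task.
// Retiring at most 2 per run avoids the cost of destroying thousands
// of threads at once after a connection burst.
std::size_t ReapIdleThreads(std::size_t idleThreads, std::size_t minThreads)
{
	std::size_t excess = idleThreads > minThreads ? idleThreads - minThreads : 0;
	std::size_t toReap = std::min<std::size_t>(excess, 2);
	// ... signal 'toReap' idle workers to exit their loop ...
	return toReap;
}
```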