New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not all Endpoints can't reconnect due to "Client TLS handshake failed" error after "reload or restart" #6517

Closed
jschanz opened this Issue Aug 3, 2018 · 33 comments

Comments

Projects
None yet
7 participants
@jschanz
Contributor

jschanz commented Aug 3, 2018

Not all Endpoints can't reconnect due to "Client TLS handshake failed" error after "reload or restart"

After upgrading to Icinga-2.9.0 and 2.9.1 we ran into a huge problem with reconnecting to our endpoints.

Do a "systemctl restart icinga2", a "systemctl reload icinga2" or a reload in Icingaweb2 and Director-1.4.3.
The following is happening:

[2018-08-02 15:29:29 +0200] information/ExternalCommandListener: 'command' stopped.
[2018-08-02 15:31:10 +0200] information/WorkQueue: #10 (DaemonCommand::Run) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-08-02 15:31:12 +0200] information/ApiListener: 'api' started.
[2018-08-02 15:31:12 +0200] information/ApiListener: Copying 1 zone configuration files for zone 'mon-icinga2-02.nst.example.org' to '/var/lib/icinga2/api/zones/mon-icinga2-02.nst.example.org'.
[2018-08-02 15:31:12 +0200] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/mon-icinga2-02.nst.example.org' (0 Bytes). Received timestamp '2018-08-02 15:31:12 +0200' (1533216672.088238), Current timestamp '2018-08-02 13:32:44 +0200' (1533209564.472608).
[2018-08-02 15:31:12 +0200] information/ApiListener: Copying 1 zone configuration files for zone 'mon-icinga2-03.nst.example.org' to '/var/lib/icinga2/api/zones/mon-icinga2-03.nst.example.org'.
[2018-08-02 15:31:12 +0200] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/mon-icinga2-03.nst.example.org' (0 Bytes). Received timestamp '2018-08-02 15:31:12 +0200' (1533216672.097991), Current timestamp '2018-08-02 13:32:44 +0200' (1533209564.484660).
[2018-08-02 15:31:12 +0200] information/ApiListener: Copying 10 zone configuration files for zone 'mon-icinga2-01.nst.example.org' to '/var/lib/icinga2/api/zones/mon-icinga-01.nst.example.org'.
[2018-08-02 15:31:12 +0200] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/mon-icinga-01.example.org' (0 Bytes). Received timestamp '2018-08-02 15:31:12 +0200' (1533216672.178659), Current timestamp '2018-08-02 13:32:44 +0200' (1533209564.565199).
[2018-08-02 15:31:12 +0200] information/ApiListener: Copying 12 zone configuration files for zone 'director-global' to '/var/lib/icinga2/api/zones/director-global'.
[2018-08-02 15:31:12 +0200] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/director-global' (0 Bytes). Received timestamp '2018-08-02 15:31:12 +0200' (1533216672.184592), Current timestamp '2018-08-02 13:32:44 +0200' (1533209564.571902).
[2018-08-02 15:31:12 +0200] information/ApiListener: Adding new listener on port '5665'
[2018-08-02 15:31:12 +0200] information/ApiListener: Reconnecting to endpoint 'system1.example.org' via host '10.0.1.104' and port '5665'
[2018-08-02 15:31:12 +0200] information/ApiListener: Reconnecting to endpoint 'system2.example.org' via host '10.0.2.104' and port '5665'
[2018-08-02 15:31:12 +0200] information/ApiListener: Reconnecting to endpoint 'system3.example.org' via host '10.0.3.104' and port '5665'
...
[2018-08-02 15:31:14 +0200] information/ApiListener: Reconnecting to endpoint 'system1134.example.org' via host '10.5.1.101' and port '5665'
...

Between 15:31:12 and 15:31:14 a reconnect is triggered (and logged) for all 6897 endpoints.

For a few endpoint this is successful, the connection is established and config files are synced and updated.

[2018-08-02 15:31:22 +0200] information/ApiListener: Finished sending config file updates for endpoint 'system5.example.org' in zone 'system5.example.org'.
[2018-08-02 15:31:22 +0200] information/ApiListener: Syncing runtime objects to endpoint 'system5.example.org'.

Suddenly the CLient TLS handshakes appear

[2018-08-02 15:31:22 +0200] critical/ApiListener: Client TLS handshake failed (to [10.3.2.103]:5665)
Context:
        (0) Handling new API client connection

[2018-08-02 15:31:22 +0200] critical/ApiListener: Client TLS handshake failed (to [10.1.27.154]:5665)
Context:
        (0) Handling new API client connection

[2018-08-02 15:31:22 +0200] critical/ApiListener: Client TLS handshake failed (to [10.245.10.103]:5665)
Context:
        (0) Handling new API client connection
...
[2018-08-02 15:31:22 +0200] critical/ApiListener: Client TLS handshake failed (to [10.127.11.99]:5665)
Context:
        (0) Handling new API client connection

On client side (10.127.11.99) the following will occur in the logfile

...
[2018-08-02 15:31:23 +0200] critical/ApiListener: Client TLS handshake failed (from [10.195.1.10]:33143)
Context:
        (0) Handling new API client connection

[2018-08-02 15:31:31 +0200] information/JsonRpcConnection: No messages for identity 'mon-icinga2-01.nst.example.org' have been received in the last 60 seconds.
[2018-08-02 15:31:31 +0200] warning/JsonRpcConnection: API client disconnected for identity 'mon-icinga2-01.nst.example.org'
[2018-08-02 15:31:31 +0200] warning/JsonRpcConnection: API client disconnected for identity 'mon-icinga2-01.nst.example.org'
[2018-08-02 15:31:31 +0200] warning/ApiListener: Removing API client for endpoint 'mon-icinga2-01.nst.example.org'. 0 API clients left.
...
[2018-08-02 15:32:24 +0200] critical/ApiListener: Client TLS handshake failed (from [10.195.1.10]:39162)
Context:
        (0) Handling new API client connection

[2018-08-02 15:32:44 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2018-08-02 15:33:25 +0200] critical/ApiListener: Client TLS handshake failed (from [10.195.1.10]:44709)
Context:
        (0) Handling new API client connection

[2018-08-02 15:34:25 +0200] information/ApiListener: New client connection for identity 'mon-icinga2-01.nst.example.org' from [10.195.1.10]:51952
[2018-08-02 15:34:30 +0200] warning/TlsStream: TLS stream was disconnected.
[2018-08-02 15:34:30 +0200] warning/ApiListener: No data received on new API connection for identity 'mon-icinga2-01.nst.example.org'. Ensure that the remote endpoints are properly configured in a cluster setup.
Context:
        (0) Handling new API client connection

[2018-08-02 15:35:13 +0200] information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);

Also the API gets unresponsive now and you can't query something. System and database load isn't higher as in 2.8.4 at this time.

A few minutes later, some system succeed in reconneting and syncing, other not.

[2018-08-02 15:45:39 +0200] information/ApiListener: Finished reconnecting to endpoint 'system479.example.org' via host '10.52.217.105' and port '5665'
[2018-08-02 15:45:39 +0200] information/ApiListener: Finished reconnecting to endpoint 'system325.example.org' via host '10.71.10.158' and port '5665'
[2018-08-02 15:45:39 +0200] critical/ApiListener: Client TLS handshake failed (to [10.5.169.2]:5665)
Context:
        (0) Handling new API client connection

[2018-08-02 15:45:39 +0200] critical/ApiListener: Client TLS handshake failed (to [10.11.1.3]:5665)
Context:
        (0) Handling new API client connection

[2018-08-02 15:45:39 +0200] critical/ApiListener: Client TLS handshake failed (to [10.8.64.43]:5665)
Context:
        (0) Handling new API client connection

[2018-08-02 15:45:39 +0200] information/ApiListener: Finished reconnecting to endpoint 'system123.example.org' via host '10.25.78.109' and port '5665'
[2018-08-02 15:45:39 +0200] information/ApiListener: Finished reconnecting to endpoint 'system5.example.org' via host '10.61.80.159' and port '5665'
[2018-08-02 15:45:39 +0200] information/ApiListener: Finished reconnecting to endpoint 'system12.example.org' via host '10.25.149.110' and port '5665'

Another few minutes later, about 1000 systems could do a successful reconnect without any TLS errors, but then TLS errors in logfile raises again. In half an hour there are only ~ 3000 endpoints reconnected, all others have to deal with the strange TLS handshake failures

[2018-08-02 15:31:23 +0200] critical/ApiListener: Client TLS handshake failed (from [10.195.1.10]:33143)
Context:
        (0) Handling new API client connection

[2018-08-02 15:31:31 +0200] information/JsonRpcConnection: No messages for identity 'mon-icinga-01.nst.example.org' have been received in the last 60 seconds.

With 2.9.0 and 2.9.1 it isn't possible to reconnect all endpoints in a time effective manner.

A Rollback to 2.8.4 fixes this problem immediately. After approximately 5 minutes all endpoints are reconnected.

Expected Behavior

Reconnection in 2.9.0 and 2.9.1 should work as in 2.8.x and before.

Current Behavior

It's not possible to do reconnect for ALL endpoints. Only a few ( ~ 1500 till 2000) could do a successful reconnect. All others have to deal with TLS errors nd possible timeouts.

Additionally a restart or reload creates a process in state

root     22070   823  0 15:30 ?        00:00:00 [icinga2] <defunct>
root     22257   823  0 12:02 ?        00:00:00 [icinga2] <defunct>
[2018-08-02 15:31:31 +0200] information/JsonRpcConnection: No messages for identity 'mon-icinga-01.nst.example.org' have been received in the last 60 seconds.

Possible Solution

Rollback to 2.8.4 at the moment ...

Steps to Reproduce (for bugs)

Hard to reproduce, because you need at least ~ 7000 endpoints and ~ 140.000 services connected.

Context

Your Environment

  • Version used (icinga2 --version):

Rollback done to 2.8.4-1

icinga2 - The Icinga 2 network monitoring daemon (version: r2.8.4-1)

Copyright (c) 2012-2017 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: Debian GNU/Linux
  Platform version: 8 (jessie)
  Kernel: Linux
  Kernel version: 3.16.0-6-amd64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.9.2
  Build host: 289f70f9cece
  • Operating System and version:

    • Debian 8, x86_64
  • Enabled features (icinga2 feature list):

** Disabled features: compatlog debuglog elasticsearch gelf graphite livestatus opentsdb perfdata statusdata syslog
* Enabled features: api checker command ido-mysql influxdb mainlog notification

  • Icinga Web 2 version and modules (System - About):

  • Config validation (icinga2 daemon -C):

  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.

    • icinga2 object list --type Zone | grep Object | wc -l
      q6898

    • icinga2 object list --type Zone | grep Object | wc -l
      6899

@Icebird2000

This comment has been minimized.

Icebird2000 commented Aug 3, 2018

Hello Jens,

I have the same Problem. dnsmichi told me in
#6445 (comment) i should open an Bugreport for this, but i had no time.

@dnsmichi dnsmichi added the Cluster label Aug 6, 2018

@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Aug 6, 2018

I know that it might be insane in that setup, but I'd be interested in a gdb full backtrace from both running processes (master and client) at the time of such a reconnect attempt. Anything which looks suspicious, also network bandwidth peak usage, or CPU/IO changes. A full verbose look into the system, and a reproducer - be it many clients, a specific scenario with connection scenarios, etc. From a first look this can be anything and nothing.

@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Aug 7, 2018

Oh, and I'd like to see the zones.conf from both ends, especially I am interested in the connection direction and log_duration.

@N-o-X

This comment has been minimized.

Member

N-o-X commented Aug 9, 2018

Founds a method to reproduce this issue:

Use socat to mock the Icinga Agents:

socat OPENSSL-LISTEN:5665,certificate=icinga-certs/sat1.crt,key=icinga-certs/sat1.key,cafile=icinga-certs/ca.crt,fork - > /dev/null

Let Icinga connect to socat 7000 times:

for (i in range(7000)) {
  object Endpoint "sat" + i {
    host = "10.211.55.63"
  }
  object Zone "sat" + i {
    parent = ZoneName
    endpoints = [ name ]
  }
}

And as the last step you need to disable the hostname check on our master:

diff --git a/lib/remote/apilistener.cpp b/lib/remote/apilistener.cpp
@@ -471,1 +471,1 @@ void ApiListener::NewClientHandlerInternal
-      identity = GetCertificateCN(cert);
+      identity = hostname;

This will result in:

[2018-08-09 10:45:55 +0200] critical/ApiListener: Client TLS handshake failed (to [10.211.55.63]:5665) Error: Timeout during handshake.
Context:
	(0) Handling new API client connection

TlsStream::Handshake() will throw std::runtime_error("Timeout during handshake.")) after 10 seconds, but this is ignored at ApiListener::NewClientHandlerInternal:448. Don't really know why it's running into that timeout and can't say if thats the one we're searching for.

@JanaFischinger

This comment has been minimized.

JanaFischinger commented Aug 17, 2018

Same problem here, after restart

[2018-08-17 13:47:01 +0200] warning/TlsStream: TLS stream was disconnected.
[2018-08-17 13:47:01 +0200] critical/ApiListener: Client TLS handshake failed (from [::ffff:123.123.123.123]:49778)
Context:
	(0) Handling new API client connection

[2018-08-17 13:48:01 +0200] warning/TlsStream: TLS stream was disconnected.
[2018-08-17 13:48:01 +0200] critical/ApiListener: Client TLS handshake failed (from [::ffff:123.123.123.123]:49870)
Context:
	(0) Handling new API client connection

i tried the following:

  • uninstall icinga2 agent
  • remove icinga2 folder from C:\Programdata
  • install current agent 2.9
  • run icinga2-agent-kickstart.ps1 script from director.
    but problem still persits.

@Crunsher Crunsher added this to the 2.9.2 milestone Aug 20, 2018

@jschanz

This comment has been minimized.

Contributor

jschanz commented Aug 27, 2018

@dnsmichi Do you need further informations for this issue? I'm back from holiday now ...

@N-o-X

This comment has been minimized.

Member

N-o-X commented Aug 31, 2018

@jschanz I'm currently trying to reproduce this correctly. Do those TLS handshake failures also happen on 2.8.4, or is this only happening on 2.9.x? Because in my tests 2.8.4 and 2.9.1 both get those TLS handshake failures every now and then with 1000+ Endpoints.

@jschanz

This comment has been minimized.

Contributor

jschanz commented Sep 4, 2018

@N-o-X
Is see this errors also in 2.8.4 and before, but not so many as in 2.9.1. I think it has something to do with a lot of connections and a timeout somewhere set.
In 2.8.4 this happens sometimes on WAN clients with a bad connection. Maybe packets get lost, dropped or whatever or the amount of data to transfer is to large for the small bandwidth.
I'll try to upgrade back to 2.9.x in the next days, but before i need to split the client connections to more icinga2 satellites to avoid further problems.

@jschanz

This comment has been minimized.

Contributor

jschanz commented Sep 4, 2018

Between a lot of TLS failures, I see also messages like the following:

[2018-09-04 14:43:52 +0200] warning/JsonRpcConnection: Error while reading JSON-RPC message for identity 'randomhost.randomdomain': Error: Max data length exceeded: 64 KB

	(0) icinga2: icinga::NetString::ReadStringFromStream(boost::intrusive_ptr<icinga::Stream> const&, icinga::String*, icinga::StreamReadContext&, bool, long) (+0x40c) [0x5ec1dc]
	(1) icinga2: icinga::JsonRpc::ReadMessage(boost::intrusive_ptr<icinga::Stream> const&, icinga::String*, icinga::StreamReadContext&, bool, long) (+0x2e) [0x691e1e]
	(2) icinga2: icinga::JsonRpcConnection::ProcessMessage() (+0x67) [0x691f17]
	(3) icinga2: icinga::JsonRpcConnection::DataAvailableHandler() (+0x98) [0x6ae418]
	(4) /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2() [0x618c8e]
	(5) icinga2: icinga::Stream::SignalDataAvailable() (+0x3d) [0x673a8d]
	(6) icinga2: icinga::TlsStream::OnEvent(int) (+0x2bc) [0x68040c]
	(7) icinga2: icinga::SocketEventEngineEpoll::ThreadProc(int) (+0x3eb) [0x67b71b]
	(8) libstdc++.so.6: <unknown function> (+0xb6970) [0x2b9281969970]
	(9) libpthread.so.0: <unknown function> (+0x8064) [0x2b9280b8b064]
	(10) libc.so.6: clone (+0x6d) [0x2b92821bd62d]
@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Sep 4, 2018

Error: Max data length exceeded: 64 KB. Ouch @N-o-X.

@N-o-X

This comment has been minimized.

Member

N-o-X commented Sep 5, 2018

Mhm... that should only happen with unauthorized clients. Known endpoints should not be limited by that. Might need some investigation.

@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Sep 11, 2018

That max data length thing is a different problem which will be solved with #6595.

The rest is - after days and many hours of reading the code and debugging different changes - a mix of asynchronous parallel connection and TLS handshake handling. More insights and possible patches in the coming days.

@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Sep 11, 2018

icinga2_socket_tls_timeouts_tcp_dump

icinga2_socket_tls_timeouts_code_analysis_01
icinga2_socket_tls_timeouts_code_analysis_02

@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Sep 12, 2018

@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Sep 12, 2018

Analysis

The problem is with a master connecting actively to 17000 endpoints on startup, and later in a 60 seconds reconnect timer. In our test setup, we don't have so many containers. Noah created a scenario in 2 VMs which involves a master VM, and a "satellite" VM which is the home for 500-800 clients built in Docker containers where this is reproducible.

In addition to the master connections, the reported setup also includes IDO queries and Director deployments. @jschanz was so kind to provide logs and gdb traces offline (I asked him since there's no support contract or NDA signed, but we know each other in person).

The reason I've asked for gdb backtraces from the running process is simply explained - connection problems normally indicate a resource problem. The logs from Jens also show threads handling TCP connections (and later performing TLS handshakes).

At some point either the network stack gives up on 1000+ parallel connections, or the TLS handshake is so damn expensive (CPU wise) that other operations slow down. Generally speaking, threads are fighting for resources and locks will be there, and likely many context switches.

This problem isn't new at all, there's no key difference between 2.8.4 or 2.9.1 or 2.7.2 even. The possible influencers for not seeing it under specific circumstances are simple: CPU resources and fast network.

A patch from 2015/2016 moved the synchronous TLS handling into an asynchronous event loop with wakeup signals via multiplexed socket pairs. Threads are waiting for others threads waking them up, and with the many socket IO from 500 to 17000 connection attempts, the Kernel might be overloaded.

In our analysis, next to trying many things which may have influenced the problem, we've also started to reverse engineer the current code base and its socket IO with events to get a full picture. During the analysis we've also looked into alternative implementations since our code implements these design patterns, but does not use external code which might be more error-prone.

As we've agreed during our team weekend, we will push this knowledge into the docs underneath "19-technical-concepts.md".

Test Setup

In order to fully debug the problem, we've created some custom docker image builds from source (still using release builds). Then docker-compose fires up the client containers with a fixed port range. The master gets a looped zones.conf for all these endpoints to connect to (same IP, different port).

A small patch is made against Icinga 2: Disable the name checks in the certificate and the connecting CN. Otherwise we could not re-use a single certificate to verify the CA trust chain.

Another opt-in patch disables the "Liveness"-Checks which would disconnect idle connections after 60 seconds.

In addition to the above changes, we've added many log messages to see the exact actions and calls in the event loop and state machine. This many threads make it impossible to debug with gdb and variants. These log messages won't be added to release builds though, only those which help the user. One of them is #6602

The resources and time to build the above can be counted as round about 10 work days.

Things Tried

  • Drop the Socket-Timeout, or make it last longer. This results into stale TLS Handshakes and leaks connection threads.

  • Revert IsEof() handling, which previously fixed false positive. Doesn't solve anything.

  • Deal with SSL_shutdown() not closed cleanly, but this doesn't influence the handshake itself. https://bugs.ruby-lang.org/issues/4237

  • SSL_do_handshake sometimes never returns success, and needs a repeated handshake. Our sockets are non-blocking, so a TLS handshake might need more retries. Tried it, made it worse and best of luck sometimes solved the handshake problem - the data processing and connection problem still exists.

  • Reconnect timer -> all 60 seconds, get all endpoints and connect if possible. Maybe do that more often, but only with a shard of endpoints. Currently 500 clients create 500 tcp connect() calls which may overload the system itself.

    • random sleep for tests
    • as client and as server, also test the REST API.
  • Client Connect Queue

Tested with a WQ with infinite tasks and 16 threads, on a 16 core system with 32 gb ram. Requires an additional patch which moves the "endpoint is connecting" into the same place where the asynchronous enqueuing happens - to avoid duplicated connection tasks in the WQ from an ongoing timer run.

Decreased the reconnect timer interval to 10s, not sure whether this is a good idea at this point, this definitely needs more tests.

  • Server API WQ

Works pretty good, and moves the API request processing into "batches". At some point the queue is empty again (in the last log line)

michi@mbmif ~/coding/netways/elippmann/icinga2-api-stampede (master *=) $ php -d memory_limit=512M bin/stampede.php  --hosts 1000 --verbose

[2018-09-11 17:33:01 +0200] information/WorkQueue: #7 (ApiListener, IncomingConnectQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2018-09-11 17:34:01 +0200] information/WorkQueue: #7 (ApiListener, IncomingConnectQueue) items: 127, rate: 14.2833/s (857/min 857/5min 857/15min);
[2018-09-11 17:33:51 +0200] information/WorkQueue: #7 (ApiListener, IncomingConnectQueue) items: 441, rate: 9.05/s (543/min 543/5min 543/15min); empty in less than 1 millisecond
[2018-09-11 17:33:41 +0200] information/WorkQueue: #7 (ApiListener, IncomingConnectQueue) items: 812, rate: 2.86667/s (172/min 172/5min 172/15min); empty in 10 seconds
[2018-09-11 17:34:10 +0200] information/ConfigItem: Activated all objects.

Next Steps

Find a reasonable number for

  • socket event IO threads
  • Connection queues
  • Reconnect interval (keep it at 60 or 10 seconds, the rest is discussed in a separate issue nothing for 2.9.2)

Create a patch and PR.

Tests by those affected.

More Insights

Below is an analysis I've been writing with our docs in mind. Will result in a separate PR.

TLS Network IO

TLS Connection Handling

TLS-Handshake timeouts occur if the server is busy with reconnect handling and other tasks which run in isolated threads. Icinga 2 uses threads in many ways, e.g. for timers to wake them up, wait for check results, etc.

In terms of the cluster communication, the following flow applies.

Master Connects

  • The master initializes the connection in a loop through all known zones it should connect to, extracting the endpoints and their host/port attribute.
  • This calls AddConnection() whereas a Tcp::Connect() is called to create a TCP socket.
  • A new thread is spawned for future connection handling, this binds ApiListener::NewClientHandler().
  • On top of the TCP socket, a new TLS stream is created.
  • The master performs a TLS->Handshake()
  • Certificates are verified and the endpoint name is compared to the CN.

Clients Processes Connection

  • The client listens for new incoming connections as 'TCP server' pattern inside ListenerThreadProc() with an endless loop.
  • Once a new connection is detected, TCP->Accept() performs the initial socket establishment.
  • A new thread is spawned for future connection handling, this binds ApiListener::NewClientHandler(), Role being Server.
  • On top of the TCP socket, a new TLS stream is created.
  • The client performs a TLS->Handshake().

Data Transmission between Server and Client Role

Once the TLS handshake and certificate verification is completed, the role is either Client or Server.

  • Client: Send "Hello" message.
  • Server: TLS->WaitForData() waits for incoming messages from the remote client.

Client in this case is the instance which initiated the connection. If the master is doing this, the Icinga 2 client/agent acts as "server" which accepts incoming connections.

Asynchronous Socket IO

Everything runs through TLS, we don't use any "raw" connections nor plain message handling.

The TLS handshake and further read/write operations are not performed in a synchronous fashion in the new client's thread. Instead, all clients share an asynchronous "event pool".

The TlsStream constructor registers a new SocketEvent by calling its constructor. It binds the previously created TCP socket and itself into the created SocketEvent object.

SocketEvent::InitializeEngine() takes care of whether to use epoll (Linux) or poll (BSD, Unix, Windows) as preferred socket poll engine. epoll has proven to be faster on Linux systems.

The selected engine is stored as l_SocketIOEngine and later Start() ensures to do the following:

  • Use a fixed number for creating IO threads.
  • Create a dumb_socketpair which basically is a pipe from in->out and multiplexes the TCP socket into a local Unix socket. This removes the complexity and slowlyness of the kernel dealing with the TCP stack and new events.
  • InitializeThread() prepares epoll with epoll_create, socket descriptors and event mapping for later wakeup.
  • Each event FD has its own "worker event thread" which deals with incoming data, called ThreadProc as endless loop.

By default, there are 8 of these worker threads.

In the ThreadProc loop, the following happens:

  • epoll_wait gets called and provides an event whether new data is ready (via socket IO from the Kernel).
  • The event created with epoll_event holds the .fd.data attribute which references the multiplexed event FD (and therefore tcp socket FD).
  • All events in this cycle are stored with their descriptors in a list.
  • Once the epoll loop is finished, the collected events are processed and the socketevent descriptor (which is the TlsStream object) calls OnEvent().

On Socket Event State Machine

OnEvent implements the "state machine" depending on the current desired action. By default, this is TlsActionNone.

Once TlsStream->Handshake() is called, this initializes the current action to TlsActionHandshake and performs SSL_do_handshake(). This function returns > 0 when successful, anything below needs to be dealt separately.
If the handshake was successful, the registered condition variable m_CV gets signalled and the thread waiting for the handshake in TlsStream->Handshake() wakes up and continues within the ApiListener::NewClientHandler() function.

Once the handshake is completed, current action is changed to either TlsActionRead or TlsActionWrite. This happens
in the beginning of the state machine when there is no action selected yet.

  • Read: Received events indicate POLLIN (or POLLERR/POLLHUP as error, but normally mean "read").
  • Write: The send buffer of the TLS stream is greater 0 bytes, and the received events allow POLLOUT on the event socket.
  • Nothing matched: Change the event sockets to POLLIN ("read"), and return, waiting for the next event.

This also depends on the returned error codes of the SSL interface functions. Whenever SSL_WANT_READ occurs, the event polling needs be changed to use POLLIN, vice versa for SSL_WANT_WRITE and POLLOUT.

In the scenario where the master actively connects to the clients, the client will wait for data and change the event sockets to Read once there's something coming on the sockets.

Action Description
Read Calls SSL_read() with a fixed buffer size of 64 KB. If rc > 0, the receive buffer of the TLS stream is filled and success indicated. This endless loop continues until a) SSL_pending() says no more data from remote b) Maximum bytes are read. If success is true, the condition variable notifies the thread in WaitForData to wake up.
Write The send buffer of the TLS stream Peek()s the first 64KB and calls SSL_write() to send them over the socket. The returned value is the number of bytes written, this is adjusted within the send buffer in the Read() call (it also optimizes the memory usage).
Handshake Calls SSL_do_handshake() and if successful, the condition variable wakes up the thread waiting for it in Handshake().
TLS Error Handling
TLS error code Description
SSL_WANT_READ The next event should read again, change events to POLLIN.
SSL_ERROR_WANT_WRITE The next event should write, change events to POLLOUT.
SSL_ERROR_ZERO_RETURN Nothing was returned, close the TLS stream and immediately return.
default Extract the error code and log a fancy error for the user. Close the connection.

From this question:

With non-blocking sockets, SSL_WANT_READ means "wait for the socket to be readable, then call this function again."; conversely, SSL_WANT_WRITE means "wait for the socket to be writeable, then call this function again.". You can get either SSL_WANT_WRITE or SSL_WANT_READ from both an SSL_read() or SSL_write() call.
Successful TLS Actions
  • Initialize the next TLS action to none. This re-evaluates the conditions upon next event call.
  • If the stream still contains data, adjust the socket events.
    • If the send buffer contains data, change events to POLLIN|POLLOUT.
    • Otherwise POLLIN to wait for data.
  • Process data when the receive buffer has them available and we are actively handling events.
  • If the TLS stream is supposed to shutdown, close everything including the TLS connection.

Data Processing

Once a stream has data available, it calls SignalDataAvailable(). This holds a condition variable which wakes up another thread in a handled which was previously registered, e.g. for JsonRpcConnection, HttpServerConnection or HttpClientConnection objects.

All of them read data from the stream and process the messages. At this point the string is available as JSON already and later decoded (e.g. Icinga data structures, as Dictionary).

General Design Patterns

Taken from https://www.ibm.com/developerworks/aix/library/au-libev/index.html

One of the biggest problems facing many server deployments, particularly web server deployments, is the ability to handle a large number of connections. Whether you are building cloud-based services to handle network traffic, distributing your application over IBM Amazon EC instances, or providing a high-performance component for your web site, you need to be able to handle a large number of simultaneous connections.

A good example is the recent move to more dynamic web applications, especially those using AJAX techniques. If you are deploying a system that allows many thousands of clients to update information directly within a web page, such as a system providing live monitoring of an event or issue, then the speed at which you can effectively serve the information is vital. In a grid or cloud situation, you might have permanent open connections from thousands of clients simultaneously, and you need to be able to serve the requests and responses to each client.

Before looking at how libevent and libev are able to handle multiple network connections, let's take a brief look at some of the traditional solutions for handling this type of connectivity.

### Handling multiple clients

There are a number of different traditional methods that handle multiple connections, but usually they result in an issue handling large quantities of connections, either because they use too much memory, too much CPU, or they reach an operating system limit of some kind.

The main solutions used are:

* Round-robin: The early systems use a simple solution of round-robin selection, simply iterating over a list of open network connections and determining whether there is any data to read. This is both slow (especially as the number of connections increases) and inefficient (since other connections may be sending requests and expecting responses while you are servicing the current one). The other connections have to wait while you iterate through each one. If you have 100 connections and only one has data, you still have to work through the other 99 to get to the one that needs servicing.
* poll, epoll, and variations: This uses a modification of the round-robin approach, using a structure to hold an array of each of the connections to be monitored, with a callback mechanism so that when data is identified on a network socket, the handling function is called. The problem with poll is that the size of the structure can be quite large, and modifying the structure as you add new network connections to the list can increase the load and affect performance.
* select: The select() function call uses a static structure, which had previously been hard-coded to a relatively small number (1024 connections), which makes it impractical for very large deployments.
There are other implementations on individual platforms (such as /dev/poll on Solaris, or kqueue on FreeBSD/NetBSD) that may perform better on their chosen OS, but they are not portable and don't necessarily resolve the upper level problems of handling requests.

All of the above solutions use a simple loop to wait and handle requests, before dispatching the request to a separate function to handle the actual network interaction. The key is that the loop and network sockets need a lot of management code to ensure that you are listening, updating, and controlling the different connections and interfaces.

An alternative method of handling many different connections is to make use of the multi-threading support in most modern kernels to listen and handle connections, opening a new thread for each connection. This shifts the responsibility back to the operating system directly but implies a relatively large overhead in terms of RAM and CPU, as each thread will need it's own execution space. And if each thread (ergo network connection) is busy, then the context switching to each thread can be significant. Finally, many kernels are not designed to handle such a large number of active threads.

Alternative Implementations and Libraries

While analysing Icinga 2's socket IO event handling, the libraries and implementations
below have been collected too. This thread
also sheds more light in modern programming techniques.

Our main "problem" with Icinga 2 are modern compilers supporting the full C++11 feature set.
Recent analysis have proven that gcc on CentOS 6 or SLES11 are too old to use modern
programming techniques or anything which implemens C++14 at least.

Given the below projects, we are also not fans of wrapping C interfaces into
C++ code in case you want to look into possible patches.

One key thing for external code is license compatibility with GPLv2.
Modified BSD and Boost can be pulled into the third-party/ directory, best header only and compiled
into the Icinga 2 binary.

C

C++

@Icebird2000

This comment has been minimized.

Icebird2000 commented Sep 12, 2018

Hallo, I try to add some of my observations.
We have about 100 Hosts in 30 Zones.
On my System the Problem starts with 2.9.0/2.9.1 on the Master and are gone after downgrade to 2.8.4 with the same config.
IMHO it is related to a communication Problem between the Master and the Satellite due to a firewall which block the Communication from the Master to the Satellite and allow the Communication from Satellite to Master.
The Problem is only on the Master-System. The Satellite can work 2.9.1 with no Problem.

I hope this might give you another look at the problem.

@ekeih

This comment has been minimized.

Contributor

ekeih commented Sep 13, 2018

I really like this:

As we've agreed during our team weekend, we will push this knowledge into the docs underneath "19-technical-concepts.md".

It is super great to have detailed technical documentation that explains how Icinga2 works! Thank you! 👍

@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Sep 13, 2018

Thanks.

For the TLS handshake timeout, I would opt for making this configurable to allow users fine tuning (at the cost of performance with lagging connections).

[2018-09-13 12:41:55 +0000] critical/ApiListener: Client TLS handshake failed (to [10.77.20.8]:10062): Error: Timeout during handshake.

	(0) Handling new API client connection

Context:
	(0) Handling new API client connection

@Icebird2000 thanks, but that would mean that both, your master and your satellite are trying to connect to each other, or am I mistaken here? Meaning to say, the endpoints have the host attribute set everywhere on all involved nodes.

@Icebird2000

This comment has been minimized.

Icebird2000 commented Sep 13, 2018

Yes, you are right. Thehost attribute is set on all involved nodes.

zone.conf master

object Zone "sat1.dc.example.com" {
  endpoints = [ "sat1.dc.example.com" ]
  parent = "master"
}

object Endpoint "sat1.dc.example.com" {
  host = "XX.XX.XX.X"
}

/*
 * Global zone for templates
 */
object Zone "global-templates" {
  global = true
}

zone.conf satellite

object Endpoint "mas1.dc.example.com" {
        host = "mas1.dc.example.com"
        port = "5665"
}

object Zone "master" {
        endpoints = [ "mas1.dc.example.com" ]
}

object Endpoint NodeName {
}

object Zone ZoneName {
        endpoints = [ NodeName ]
        parent = "master"
}

/*
 * Global zone for templates
 */
object Zone "global-templates" {
  global = true
}
@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Sep 13, 2018

So, this seems to be a problem with the master actively using Tcp->Connect(), but timing out on the firewall prohibiting this?
What happens in your setup, if you ensure that the master does not have the host attribute for all satellite endpoints? (instead, the satellite nodes should have the host attribute for the master endpoint) ... does this change something with 2.9.1?

@Icebird2000

This comment has been minimized.

Icebird2000 commented Sep 13, 2018

In short, it works.

I had removed the host attribute and test it for 15 Minutes and no error occured.
I change readd the host attribute and the errors are back

[2018-09-13 17:28:12 +0200] critical/ApiListener: Client TLS handshake failed (to [XX.XX.XX.XX]:5665)
Context:
        (0) Handling new API client connection

[2018-09-13 17:28:12 +0200] critical/ApiListener: Client TLS handshake failed (to [XX.XX.XX.XX]:5665)
Context:
        (0) Handling new API client connection

Now i run with the host attribute removed - config.

@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Sep 13, 2018

Hm. This would indicate that a TCP->Connect() was successful, and the TLS->Handshake continues. This really is strange. Thanks for the insights, much appreciated.

dnsmichi added a commit that referenced this issue Sep 13, 2018

Add ApiListener#tls_handshake_timeout option
This allows to specify the previously hardcoded
timeout of 10s.

refs #6517
@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Sep 14, 2018

Further tests from a possible patch will be in #6619, once this is merged you can test the snapshot packages.

@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Sep 21, 2018

The workqueue in the middle still needs discussion, especially with threads waiting endlessly until tcp/tls timeouts happen. The queue may also block pending tasks if workers are acquired. A better approach would likely be a) use a connection pool ourselves with wake-up events, but no threads waiting for tasks, with the possibility to add on-demand threads if performance doesn't scale. b) drop the socketevent implementation and move back to synchronous socket handshakes/polls.

We're reading and evaluating more on this now.
http://www.kegel.com/c10k.html

@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Sep 24, 2018

Moving the TLS handshake part out of the async IO design pattern partially works. Until you'll have to handle incoming socket events where no events are triggered anymore. Once you'll decide to move the remaining read/write operations out of this, you may read from the plain sockets but there's no way to signal data availability to registered handlers late away in JSON-RPC/HTTP.

One thing I've recognized during my rewrite is that SocketEvents are statically initialized once. This doesn't happen during application startup, but once the first TlsStream object is constructed - maybe too late for the socket IO thing when 7000 socket connect/handshake events then happen.

@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Sep 24, 2018

Something like this without any connection made yet.

(lldb) thr list
Process 47793 stopped
* thread #1: tid = 0xea700b, 0x00007fff59f88d82 libsystem_kernel.dylib`__semwait_signal + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  thread #2: tid = 0xea7023, 0x00007fff59f8a09a libsystem_kernel.dylib`poll + 10
  thread #3: tid = 0xea7024, 0x00007fff59f8a09a libsystem_kernel.dylib`poll + 10
  thread #4: tid = 0xea7025, 0x00007fff59f8a09a libsystem_kernel.dylib`poll + 10
  thread #5: tid = 0xea7026, 0x00007fff59f8a09a libsystem_kernel.dylib`poll + 10
  thread #6: tid = 0xea7027, 0x00007fff59f8a09a libsystem_kernel.dylib`poll + 10
  thread #7: tid = 0xea7028, 0x00007fff59f8a09a libsystem_kernel.dylib`poll + 10
  thread #8: tid = 0xea7029, 0x00007fff59f8a09a libsystem_kernel.dylib`poll + 10
  thread #9: tid = 0xea702a, 0x00007fff59f8a09a libsystem_kernel.dylib`poll + 10
@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Sep 24, 2018

Dynamic Thread Pool for Connections

watch -n 1 -c "curl -k -s -u root:icinga 'https://localhost:5665/v1/status/ApiListener?pretty=1' | grep num_conn_endpoints"
watch -n 1 -c "ps -T -C icinga2 | wc -l"

500 Clients.
Reconnect-Interval lowered from 60 to 10 seconds.

45 secs to fully reconnect 500 clients, 400 after 10 seconds.
Thread peak was at 130, then drops to 100 and now 50 once all connections have been established.

Simulated reload in between (stop, start) and in a good spot (seems that the Kernel did not drop the connection yet, or any sort of keepalive re-using). 5 Seconds is very good.

root@nhilverling2:~/icinga2# icinga2 daemon
[2018-09-24 15:21:16 +0000] information/cli: Icinga application loader (version: v2.9.1-190-g91989ff)
...
[2018-09-24 15:21:20 +0000] information/ApiListener: Finished syncing endpoint 'icinga-sat-124' in zone 'icinga-sat-124'.
@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Sep 24, 2018

Hmmm 7000 endpoints with socat crashes my macbook, since fork() fails at some point.

mbmif /usr/local/icinga2/var/lib/icinga2/certs (master *) # socat OPENSSL-LISTEN:7000,certificate=mbmif.int.netways.de.crt,key=mbmif.int.netways.de.key,cafile=ca.crt,fork - > /dev/null
  • 700, works fine.

Moving to bigger test VMs.

  • 5000
root@nhilverling2:/usr/local/var/lib/icinga2/certs# socat OPENSSL-LISTEN:7000,certificate=nhilverling2.crt,key=nhilverling2.key,cafile=ca.crt,fork - /dev/null
root@nhilverling2:~/icinga2# vim /usr/local/etc/icinga2/zones.conf

/* SOCAT */
for (i in range(5000)) {
  object Endpoint "sat" + i {
    host = "127.0.0.1"
    port = 7000
  }
  object Zone "sat" + i {
    parent = "master"
    endpoints = [ name ]
  }
}
root@nhilverling2:~# watch -n 1 -c "ps -T -C icinga2 | wc -l"
watch -n 1 -c "curl -k -s -u root:icinga 'https://localhost:5665/v1/status/ApiListener?pretty=1' | grep num_conn_endpoints"

Starts to peak at 93 threads.
Starts at :16, 5000 clients are connected at :46 ... so it takes ~30 seconds to reconnect.

More tomorrow.

@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Sep 25, 2018

  • 10000 clients with socat.

Takes ~82 seconds to connect 10000 endpoints.
Threads stay at 93.
After a while, connections drop since there's nothing returned by the dummy clients. In a real world scenario, the cluster heartbeat would be returned and clients stay connected.

@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Sep 25, 2018

One observation - a dynamic thread pool compared to spawning and destroying threads is probably the solution here. 7000 threads for connections are very expensive, and I assume that our switch from boost::thread to std::thread also causes troubles here (2.8 -> 2.9 development tasks for C++11). An interesting read: https://www.meetingcpp.com/blog/items/c11-and-boost.html

Tests against the REST API

stampede creates 1000 hosts simultaneously, therefore it opens many connections. This leads up to 2000 threads for this second, and once all connections are established, it fires the requests. 2 out of 1000 result in a 500 error.
The dynamic thread pool resizes immediately once the many tasks are finished. The problem with this test are the config object creation which slow down Icinga 2 - it involves compiling the object and starting it into memory. A little less operations like querying the object status would return faster.

Following the design pattern for many connections on HTTP servers, this is totally fine to spawn this many threads for a moment. Since the thread pool implementation also kills off 2 threads at max, the resources for dropping idle threads are not that wasted as before - where 5000 threads were just destroyed.

dnsmichi added a commit that referenced this issue Sep 25, 2018

dnsmichi added a commit that referenced this issue Sep 25, 2018

Use a dynamic thread pool for API connections
The full analysis is located in #6517.

fixes #6517
@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Sep 25, 2018

Re-opening for test feedback. Triggering snapshot builds as we speak.

@dnsmichi

This comment has been minimized.

Member

dnsmichi commented Oct 11, 2018

#6514 provided external results that this works, therefore I am closing here. I cannot hold off 2.10 for any longer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment