Icinga 2 Cluster Problem: endpoints are not connected while checks keep working, version 2.10.4-1 #7075
Comments
Today, just a few minutes ago, cu-icinga-01 reloaded, and this time it lost the connection with nkg-icinga-01; there were error messages on cu-icinga-01.
On the client node, the debug log stalled completely until I restarted icinga2 on it; there are no log entries between 2019-04-03 18:23:12 -0700 and 2019-04-03 18:42:09 -0700.
Actually, I am really confused; these logs look different from last time. Any suggestion is welcome, thank you very much. |
One note on the configuration inside zones.conf: since you've copied it, each side (master and satellites) will now connect to the other. This is done by the host attribute set on the endpoint objects. From your logs it looks like the master is not able to connect to the satellites and always runs into TLS handshake timeouts. This essentially blocks and eats quite a few CPU resources. Meanwhile, the satellites establish the connections already, and the check results run through. Your master(s) are just overloaded; fix the connection-direction issue first before looking into anything else. Cheers, |
Thank you Michael. Let me confirm that I understand correctly: in a correct cluster setup, either the master connects to the satellites or the satellites connect to the master; they should not connect to each other at the same time. To disable the bidirectional connections, I should remove the host attribute from the satellites' endpoint objects on the master nodes, and just keep the satellites connecting to the master nodes. Thanks. |
Here are the docs with some more details: https://icinga.com/docs/icinga2/latest/doc/06-distributed-monitoring/#endpoint-connection-direction |
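For illustration, here is a minimal zones.conf sketch of the direction recommended above (satellites connect to the masters). The hostnames come from this thread; the FQDN values are assumptions, and this is not the reporter's actual file:

```
// Hypothetical zones.conf, deployed identically on masters and satellites.
// Only the master endpoints carry a host attribute, so only the satellites
// initiate connections; the masters just accept them.
object Endpoint "cu-icinga-01" {
  host = "cu-icinga-01.example.com"   // assumed FQDN
}
object Endpoint "cu-icinga-02" {
  host = "cu-icinga-02.example.com"   // assumed FQDN
}
object Endpoint "njg-icinga-01" { }   // no host: nobody dials this satellite
object Endpoint "njg-icinga-02" { }

object Zone "master" {
  endpoints = [ "cu-icinga-01", "cu-icinga-02" ]
}
object Zone "njg" {
  endpoints = [ "njg-icinga-01", "njg-icinga-02" ]
  parent = "master"
}
```

Since the same file is deployed everywhere, leaving host unset on the satellite endpoints means no node ever dials the satellites; the satellites open the connection to the masters themselves.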
Thank you Michael. I have corrected the configuration; let me watch it for one week, with the master nodes connecting to the satellites and the satellites not connecting to the master nodes. Thanks. |
Hi Michael @dnsmichi, here is a description of the issue:
I can see with netstat that the master node keeps attempting to reach njg-icinga-01: 10.35.60.236 is the master node that lost the connection; it tries to connect to the satellite's port 5665, disconnects, then connects again, over and over.
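A quick way to watch that connect/disconnect loop (these commands are illustrative, not from the original report):

```
# Refresh every second; the short-lived outgoing connections to the
# satellite's API port appear and disappear in this output
watch -n1 "netstat -tn | grep ':5665'"
```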
On the master node, the log keeps printing attempts to reconnect to the satellite.
From tcpdump, it looks like the master node disconnects the connection by itself because the satellite did not send any certificate information after the Client Hello (packet file attached). After the TCP connection was established, the master node sent a "Client Hello" and the satellite acknowledged it, but no certificate packets followed, so after 10 seconds the master node disconnected. Is there any possible problem in my environment? Thanks. |
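For anyone reproducing this, a capture like the attached one can be taken with something along these lines (an assumption; the report only includes the resulting pcap):

```
# Record the cluster TLS handshake on port 5665 into a pcap for Wireshark
tcpdump -i any -s 0 -w icinga2-5665.pcap tcp port 5665
```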
After restarting icinga2 on the satellite, I captured the correct behavior: the satellite sends a Server Hello to the master node after the ack packet (cap file attached; from packet 163 onward it looks good). So the issue appears to be on the satellite node: it fails to respond, which causes a timeout, and the master node disconnects. I am checking the satellite further for any issue at the system level. |
Likely this is also related to the cluster and API overloads we've seen with 2.9 and 2.10, which we are aiming to solve with 2.11. You can test the snapshot packages if the problem persists; all details are collected in #7041. |
Hm, not sure if this helps, but I found that satellites with too many open file handles are the ones that fail to connect. I have set up a cron job to collect "lsof | grep icinga2 | grep CLOSE | wc -l" every minute. At present, every satellite has fewer than 11,000 open file handles; I reloaded the master node several times, but the issue did not occur. I will try again tomorrow and check whether it is related to the number of open file descriptors. If it is, I think there might be some defect in how connections are closed.
Thanks. |
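A sketch of such a cron job, assuming an /etc/cron.d entry and a log path chosen purely for illustration:

```
# /etc/cron.d/icinga2-fd-watch (hypothetical)
# Log the number of half-closed icinga2 sockets once a minute so spikes
# can be correlated with master reloads.
* * * * * root echo "$(date -Is) $(lsof -n | grep icinga2 | grep -c CLOSE)" >> /var/log/icinga2-close-fds.log
```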
Today the issue happened again; the file descriptor count grew to more than 20,000. I reloaded the master node at 16:31:01 with no cluster alert at that time, but the number of open fds kept growing, and when the master node reloaded again at 18:01, this satellite did not reconnect. For the other satellites the fd count is normal (similar to this one before 16:31). Actually, I set file-max = 65535, but it seems icinga2 cannot handle TLS connections anymore. I downgraded half of the satellites to 2.8.1 to watch for one week.
|
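Some commands to cross-check those limits against what the daemon actually holds (illustrative, not from the report):

```
sysctl fs.file-max                        # system-wide open-file limit
cat /proc/$(pidof -s icinga2)/limits      # per-process limits of the daemon
ls /proc/$(pidof -s icinga2)/fd | wc -l   # fds currently held by icinga2
```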
So the master connects to the satellites, and leaves the connection open.
Just curious, which FQDN points to 10.60.89.10 (dig -x 10.60.89.10)? Sometimes this seems to be the first master, too.
Since you're using FQDNs for the host attributes: if you re-iterate this on the master, how many stale sockets are open over there? Likely this is not only the case for the satellites, but for the masters too. And last but not least, if you can, please test the snapshot packages (master and satellite in this case); the parts handling TLS connections have been rewritten from scratch. |
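One way to count those stale sockets on the master side as well (a suggestion, not part of the original comment):

```
# Count half-closed connections involving the cluster port
ss -tn state close-wait '( sport = :5665 or dport = :5665 )' | wc -l
```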
I checked DNS too; it is OK. 10.60.89.10 is a security scanning server in our internal network; it probes all listening ports on all hosts, so it keeps connecting to and disconnecting from port 5665, and since that server has no DNS records, we can only see its IP. This suggests there is some problem in the TCP connection handling, and I think the master nodes should have the same issue too, but I did not collect fd counts on the master nodes; I can do that as a next step. Hm, because this is our production monitoring system, last time it took a big effort to convince our team to approve the upgrade from 2.5.4 to 2.8.4 to solve a segfault crash; after hitting this issue, I don't think I can get approval to run a snapshot version :( Anyway, thank you very much. I will watch for one week to see whether 2.8.1 has this issue; if not, I will downgrade to 2.8.1, there is no other way... |
Uh.
And we have a winner: this is likely the same issue as #6559 and #6562, where an external tool scans the API via TCP and HTTP and somehow connections are not closed properly. This leaves stale connections all around; it is nothing related to the Icinga cluster connections themselves. This was hidden in 2.8.4 with corked connections and became more visible with 2.9 and 2.10. A fix is in the making; that's why I asked whether you could test the snapshot packages. It really would help to know whether everything is resolved prior to a release :) |
Thank you Michael for confirming. Yes, after investigation I don't think it is a cluster connection issue; rather, stale connections caused the cluster connection issue. I think we can test this in a lab environment next week. I will do it like this:
Use an automation script that connects to port 5665, disconnects, and repeats quickly, to see whether only 2.10.4 shows this phenomenon (too many CLOSE_WAIT fds); a sketch of such a script follows below. By the way, as of now I see that one of my 2.10.4 satellites has the same issue while all 2.8.1 satellites are OK. I will test again to check whether only this node fails to reconnect after a master node reload; this needs time to verify :( Thanks. |
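A minimal sketch of that reproducer, assuming a lab satellite reachable as njg-icinga-01 (hostname and iteration count are placeholders):

```
#!/usr/bin/env bash
# Open and immediately drop TCP connections to the API port, mimicking
# the security scanner, then watch CLOSE_WAIT sockets pile up on the node.
HOST="njg-icinga-01"   # placeholder: point this at a lab satellite
for i in $(seq 1 5000); do
  # nc opens the connection, sends nothing, and closes right away
  nc -w 1 "$HOST" 5665 < /dev/null > /dev/null 2>&1
done
# Afterwards, on the target node:
#   lsof -n | grep icinga2 | grep -c CLOSE
```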
Update:
The two problems above are already tracked in other issues, so I want to close this one to reduce duplicated topics. Thanks Michael (@dnsmichi) for discussing this issue. I suggest that icinga2 provide some LTS versions: no new features for those versions, just bug and security fixes; then users like me will feel safer :) If extra effort is needed to provide LTS, I would like to contribute where I can. Thanks. |
LTS is on the roadmap, but it is nothing that directly affects Icinga 2 as the core at the moment. 2.8.x won't receive that anymore; if it runs for you, good, but you may encounter different bugs with it (notification handling, etc.) which will always require updated versions. 2.11 is a good starting point where many of the mentioned problems are being fixed. That is why I asked you to test the snapshot packages, to be sure your ongoing problems are fixed :) Cheers, |
Describe the bug
I have two nodes in the master zone and 4 satellite zones; each zone has 2 satellites.
Master nodes: cu-icinga-01, cu-icinga-02
2 satellites in one sample zone: njg-icinga-01, njg-icinga-02
Detailed configuration files are attached below.
After making a configuration change on cu-icinga-01 and reloading icinga2 ("systemctl reload icinga2"), Cluster Problem alerts appear; the affected endpoints are random.
On the satellites there are debug-level logs: when the master node reloads, they note that the API clients left, and then they keep alerting "Not connecting to Endpoint 'cu-icinga-01' because we're already trying to connect to it."
To Reproduce
Expected behavior
There should not be alerts of this type.
Screenshots
Your Environment
Include as many relevant details about the environment you experienced the problem in:
- Version used (icinga2 --version)
- Enabled features (icinga2 feature list)
- Config validation (icinga2 daemon -C)
- The zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes

zones.conf on master nodes:
The zones.conf on the satellite nodes is the same as on the master nodes; I copied it from the master nodes.
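The actual files are not reproduced here, but a minimal sketch of the layout described above (hostnames from the report, host values assumed) shows why both sides dial out: every endpoint carries a host attribute.

```
// Hypothetical reconstruction of the copied zones.conf. Because every
// endpoint has a host attribute, masters and satellites connect to each
// other at the same time (the bidirectional setup discussed above).
object Endpoint "cu-icinga-01" { host = "cu-icinga-01" }
object Endpoint "cu-icinga-02" { host = "cu-icinga-02" }
object Endpoint "njg-icinga-01" { host = "njg-icinga-01" }
object Endpoint "njg-icinga-02" { host = "njg-icinga-02" }

object Zone "master" {
  endpoints = [ "cu-icinga-01", "cu-icinga-02" ]
}
object Zone "njg" {
  endpoints = [ "njg-icinga-01", "njg-icinga-02" ]
  parent = "master"
}
```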
Additional context
I need to restart icinga2 on all endpoints that are not connected; then the cluster recovers.
The cluster was built with version 2.8.4 as a new cluster, and I hit the endpoint-not-connected issue with a "TLS handshake failed" error message after reloading the master node. I then updated to 2.10.4-1 as suggested by the community, after which the logs appeared as shown above.
Am I missing anything?
Thanks