[dev.icinga.com #11046] Icinga2 agent gets stuck after disconnect and won't relay messages #3881

Closed · icinga-migration opened this issue Jan 28, 2016 · 24 comments

This issue has been migrated from Redmine: https://dev.icinga.com/issues/11046

Created by ziaunys on 2016-01-28 04:31:31 +00:00

Assignee: gbeutner
Status: Resolved (closed on 2016-05-11 08:15:03 +00:00)
Target Version: 2.4.8
Last Update: 2016-08-08 11:21:58 +00:00 (in Redmine)

Icinga Version: 2.4.1
Backport?: Not yet backported
Include in Changelog: 1

I'm running Icinga2 with agents relaying checks to a master zone with a single master. I had an issue where an Icinga2 agent lost its connection to the master and triggered the cluster-zone check I had set up for it. The agent was still running but never recovered. I waited, hoping it would eventually reconnect and replay its log, but that never happened and the agent was down for days. When I finally restarted the agent, I saw this in the log on the disconnected agent:

[2016-01-27 20:25:56 -0800] information/ApiListener: Syncing runtime objects to endpoint 'icinga-master01-prod.ops.puppetlabs.net'.
[2016-01-27 20:25:56 -0800] information/ApiListener: Finished sending updates for endpoint 'icinga-master01-prod.ops.puppetlabs.net'.
[2016-01-27 20:25:57 -0800] information/ApiListener: Replayed 61882 messages.

So apparently it had been faithfully running checks the whole time, but never relayed any messages. This doesn't seem like normal behavior, because I haven't had this happen before. I can provide additional info if needed.
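
For context on the mechanism this report relies on: while an endpoint is disconnected, cluster messages are buffered in a replay log and flushed to the peer once the connection is re-established, which is what the "Replayed 61882 messages" line reflects. A minimal, illustrative C++ sketch of that idea (hypothetical names, not Icinga2's actual implementation):

#include <cstddef>
#include <deque>
#include <functional>
#include <iostream>
#include <string>

// Illustrative sketch only: messages queued while disconnected are replayed
// to the peer after a successful reconnect.
class ReplayLogSketch {
public:
    void QueueMessage(const std::string& msg) { m_Backlog.push_back(msg); }

    // Called after reconnecting: send the whole backlog and report how much was replayed.
    std::size_t Replay(const std::function<void(const std::string&)>& send) {
        std::size_t count = m_Backlog.size();
        while (!m_Backlog.empty()) {
            send(m_Backlog.front());
            m_Backlog.pop_front();
        }
        return count;
    }

private:
    std::deque<std::string> m_Backlog;
};

int main() {
    ReplayLogSketch log;
    log.QueueMessage("event::CheckResult for host1");
    log.QueueMessage("event::CheckResult for host2");
    std::size_t n = log.Replay([](const std::string& m) { std::cout << m << '\n'; });
    std::cout << "Replayed " << n << " messages.\n";
}

The bug described here is that this replay step never happened until the agent process was restarted.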

Changesets

2016-05-11 08:09:54 +00:00 by gbeutner 2adfcb5

Improve error handling for TlsStream::Handshake

fixes #11046

2016-05-12 09:11:02 +00:00 by gbeutner 6344539

Improve error handling for TlsStream::Handshake

fixes #11046

@icinga-migration commented Jan 29, 2016

Updated by mfriedrich on 2016-01-29 09:21:13 +00:00

I'm not sure whether this has already been fixed in git master, but we've seen this in customer setups in the past as well. Do you have a test stage where you could run the latest snapshot packages to verify whether the problem is fixed?

@icinga-migration commented Feb 1, 2016

Updated by ziaunys on 2016-02-01 18:19:02 +00:00

dnsmichi wrote:

I'm not sure whether this has already been fixed in git master, but we've seen this in customer setups in the past as well. Do you have a test stage where you could run the latest snapshot packages to verify whether the problem is fixed?

I could use the snapshot packages in a dev environment. I'm just unsure of how I can reproduce the issue.

@icinga-migration commented Feb 24, 2016

Updated by mfriedrich on 2016-02-24 20:10:27 +00:00

  • Status changed from New to Feedback
  • Assigned to set to ziaunys

Can you please re-test this with 2.4.3?

@icinga-migration commented Feb 25, 2016

Updated by ziaunys on 2016-02-25 19:14:51 +00:00

  • File added icinga2.log

I haven't seen this particular issue again, but after upgrading the agent on all nodes, a large set of them alert after the master restarts and won't recover. Looking at the logs, I see a ton of connection timeouts. As far as I can tell, they won't recover until the node restarts. When the master restarts, it logs messages like this:

[2016-02-25 06:25:14 -0800] critical/TcpSocket: Invalid socket: Network is unreachable
[2016-02-25 06:25:14 -0800] critical/ApiListener: Cannot connect to host 'burji3.puppetlabs.com' on port '5665'
[2016-02-25 06:25:14 -0800] critical/TcpSocket: Invalid socket: Connection timed out

The set of hosts that time out is pretty random. Let me know if I should file a separate issue.

@icinga-migration commented Mar 4, 2016

Updated by mfriedrich on 2016-03-04 15:54:20 +00:00

  • Parent Id set to 11313

@icinga-migration commented Mar 11, 2016

Updated by mfriedrich on 2016-03-11 13:58:18 +00:00

Hm, that looks more like either a network/service problem to me, or the master not currently running (e.g. during a reload). Maybe the DNS names/IPs also changed somehow, so that the static endpoint configuration is not up to date anymore?

@icinga-migration commented Mar 15, 2016

Updated by ziaunys on 2016-03-15 16:53:33 +00:00

dnsmichi wrote:

Hm, that looks more like either a network/service problem to me, or the master not currently running (e.g. during a reload). Maybe the DNS names/IPs also changed somehow, so that the static endpoint configuration is not up to date anymore?

This happens every time the master reloads. On reload, many agents disconnect and never reconnect until the master is reloaded a second time, which I don't really understand. In 2.4.1 there was an issue where reloading caused cluster-zone checks to fail, but they would always recover. I can't reproduce the issue in our test environment, but it only has 78 hosts as opposed to around 300 in production. I'll investigate possible network issues; perhaps, since we have more agents now, we're starting to hit some kind of network limitation.

@icinga-migration commented Mar 15, 2016

Updated by ziaunys on 2016-03-15 22:44:07 +00:00

I wanted to update this ticket and note that I just encountered the error of agents getting stuck and not relaying messages again. This time a few random hosts disconnected and reconnected; however, the master still showed these agents as disconnected, and they never replayed their log to the master.

Here's the log data for the timeline of the incident.

[2016-03-15 14:42:41 -0700] warning/JsonRpcConnection: API client disconnected for identity 'icinga-master01-prod.ops.puppetlabs.net'
[2016-03-15 14:42:41 -0700] warning/ApiListener: Removing API client for endpoint 'icinga-master01-prod.ops.puppetlabs.net'. 0 API clients left.

At this point the master showed this client as disconnected and it went into the DOWN state. According to the following log line, the agent reconnected to the master; however, the master was never updated.

[2016-03-15 14:42:42 -0700] information/JsonRpcConnection: Reconnecting to API endpoint 'icinga-master01-prod.ops.puppetlabs.net' via host 'icinga-master01-prod.ops.puppetlabs.net' and port '5665'
[2016-03-15 14:43:47 -0700] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2016-03-15 14:48:47 -0700] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2016-03-15 14:53:47 -0700] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'

At this point I restarted the master agent and saw this error in the client's log.

[2016-03-15 14:54:40 -0700] warning/TlsStream: TLS stream was disconnected.
[2016-03-15 14:54:40 -0700] critical/ApiListener: Client TLS handshake failed
Context:
(0) Handling new API client connection

Finally, the client was able to connect to the master and replay its log, which then caused the host to recover.

[2016-03-15 14:54:41 -0700] information/ApiListener: New client connection for identity 'icinga-master01-prod.ops.puppetlabs.net'
[2016-03-15 14:54:41 -0700] information/ApiListener: Sending config updates for endpoint 'icinga-master01-prod.ops.puppetlabs.net'.
[2016-03-15 14:54:41 -0700] information/ApiListener: Syncing runtime objects to endpoint 'icinga-master01-prod.ops.puppetlabs.net'.
[2016-03-15 14:54:41 -0700] information/ApiListener: Finished sending config updates for endpoint 'icinga-master01-prod.ops.puppetlabs.net'.
[2016-03-15 14:54:41 -0700] information/ApiListener: Sending replay log for endpoint 'icinga-master01-prod.ops.puppetlabs.net'.
[2016-03-15 14:54:41 -0700] information/ApiListener: Replayed 270 messages.
[2016-03-15 14:54:41 -0700] information/ApiListener: Finished sending replay log for endpoint 'icinga-master01-prod.ops.puppetlabs.net'.

@icinga-migration commented Mar 18, 2016

Updated by mfriedrich on 2016-03-18 14:26:24 +00:00

And all your nodes are using the same version, e.g. 2.4.4?

@icinga-migration commented Mar 18, 2016

Updated by ziaunys on 2016-03-18 16:33:00 +00:00

dnsmichi wrote:

And all your nodes are using the same version, e.g. 2.4.4?

They're all running 2.4.3 now. I haven't had the chance to upgrade. Were there any bug fixes that might contribute to resolving this issue?

@icinga-migration commented Mar 22, 2016

Updated by ziaunys on 2016-03-22 22:05:19 +00:00

I upgraded all my agents to 2.4.4 and I'm still having this issue. I do have a little more info: the agents getting stuck appears to be triggered by the master reloading. The reload causes a handful of host checks to fail, and the agents are then stuck in a state where they stop replaying their logs. If I schedule a forced check, nothing happens. When restarting the master a second time, most of the cluster-zone checks that were failing recover, but not all of them. Restarting the agent also fixes the issue. Somehow both the master and the client get into a state where they stop communicating.

I'm wondering if it's just a scale issue, since I don't have this problem in our staging environment. The Icinga2 master has 4 CPUs and 2 GB of memory, and the Icinga2 process only uses around 130 MB. The 15-minute load average is around 1 and there isn't enough network activity to saturate the NIC, so it doesn't seem like the master is under-provisioned.

@icinga-migration commented Apr 5, 2016

Updated by ziaunys on 2016-04-05 03:06:01 +00:00

I'm still seeing this issue and it's becoming a problem, because it happens about a quarter of the time the Icinga2 master reloads. We've basically had to stop Puppet from running on a regular cadence, and because of how random the issue is, it's a burden for the people who are on-call. Is it possible that we're running more Icinga2 agents than other users? I would expect others to see this bug as well.

I do have some new information. The only agents that won't recover when the master reloads are the ones that the master doesn't actively connect to. In our environment we take advantage of the agent's bi-directional communication; agents never recover in situations where the agent can connect to the master but the master can't connect to the agent.

Do you guys have any recommended workaround until there's a fix? If I haven't provided enough data, I can try running some hosts in debug mode for a while. It's just more difficult to reproduce in our test environment, since the issue seems to be brought on by the roughly 300 agents in production.

@icinga-migration commented Apr 5, 2016

Updated by ziaunys on 2016-04-05 17:02:59 +00:00

So I decided to strace an agent in the broken state and found that the agent appears to sleep indefinitely.

Here's the output of strace for an agent that has disconnected from the master on reload:

Process 24849 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
nanosleep({2, 500000000}, NULL) = 0
nanosleep({2, 500000000}, NULL) = 0
nanosleep({2, 500000000}, NULL) = 0
nanosleep({2, 500000000}, NULL) = 0
nanosleep({2, 500000000}, NULL) = 0
nanosleep({2, 500000000}, NULL) = 0
nanosleep({2, 500000000}, NULL) = 0
nanosleep({2, 500000000}, NULL) = 0
nanosleep({2, 500000000}, NULL) = 0

The agent seems to call "nanosleep" indefinitely. There are no log messages indicating that the agent has attempted to reconnect.

After making this discovery I ran gdb on the running process with the following output:

(gdb) bt

#0  0x00007f30338f2e5d in nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f303391d514 in usleep () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f30361bb46d in icinga::Application::RunEventLoop() () from /usr/lib/x86_64-linux-gnu/icinga2/libbase.so
#3  0x00007f3030821d87 in icinga::IcingaApplication::Main() () from /usr/lib/x86_64-linux-gnu/icinga2/libicinga.so
#4  0x00007f30361a977a in icinga::Application::Run() () from /usr/lib/x86_64-linux-gnu/icinga2/libbase.so
#5  0x00007f3035ad0d3e in icinga::DaemonCommand::Run(boost::program_options::variables_map const&, std::vector > const&) const () from /usr/lib/x86_64-linux-gnu/icinga2/libcli.so
#6  0x00000000004136b8 in ?? ()
#7  0x000000000041109a in ?? ()
#8  0x00007f3033863ead in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#9  0x000000000041116d in ?? ()
#10 0x00007ffc1ef666c8 in ?? ()
#11 0x000000000000001c in ?? ()
#12 0x0000000000000006 in ?? ()
#13 0x00007ffc1ef66f17 in ?? ()
#14 0x00007ffc1ef66f46 in ?? ()
#15 0x00007ffc1ef66f58 in ?? ()
#16 0x00007ffc1ef66f5f in ?? ()
#17 0x00007ffc1ef66f62 in ?? ()
#18 0x00007ffc1ef66f65 in ?? ()
#19 0x0000000000000000 in ?? ()

Based on this output it seems like the issue is somewhere in the event loop?
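
A main thread sitting in nanosleep/usleep is what an idle event loop normally looks like, so this trace by itself doesn't pinpoint the hang; the blocked thread shows up in the full backtrace further down. A rough, hypothetical C++ sketch of such a loop, purely to illustrate the pattern (not Icinga2's actual code):

#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical sketch: a main event loop that mostly sleeps while worker
// threads service connections. Its backtrace therefore shows nanosleep()
// even when a worker thread is stuck somewhere else.
static std::atomic<bool> g_ShutdownRequested{false};

static void RunEventLoopSketch() {
    while (!g_ShutdownRequested.load()) {
        // Periodic housekeeping (timers, state dumps, ...) would run here.
        std::this_thread::sleep_for(std::chrono::milliseconds(2500)); // ~ the 2.5 s nanosleep seen in the strace output
    }
}

int main() {
    std::thread loop(RunEventLoopSketch);
    std::this_thread::sleep_for(std::chrono::seconds(5)); // stand-in for the rest of the process
    g_ShutdownRequested = true;
    loop.join();
}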

Let me know if I can do any additional troubleshooting.

@icinga-migration commented Apr 5, 2016

Updated by mfriedrich on 2016-04-05 17:52:02 +00:00

Can you please generate a backtrace of that running process as described here?
http://docs.icinga.org/icinga2/latest/doc/module/icinga2/toc#!/icinga2/latest/doc/module/icinga2/chapter/development#development-debug-gdb-backtrace

@icinga-migration commented Apr 5, 2016

Updated by ziaunys on 2016-04-05 20:19:18 +00:00

dnsmichi wrote:

Can you please generate a backtrace of that running process as described here?
http://docs.icinga.org/icinga2/latest/doc/module/icinga2/toc#!/icinga2/latest/doc/module/icinga2/chapter/development#development-debug-gdb-backtrace

Yes. I now realize that first backtrace wasn't useful lol.

@icinga-migration commented Apr 5, 2016

Updated by ziaunys on 2016-04-05 21:52:17 +00:00

  • File added gdb_bt.log

I've attached a better gdb backtrace following the development guide. I also have a core dump if you need it.

@icinga-migration commented Apr 14, 2016

Updated by mfriedrich on 2016-04-14 10:20:06 +00:00

  • Status changed from Feedback to New
  • Assigned to deleted ziaunys
  • Priority changed from Normal to High
  • Target Version set to 2.4.6

@icinga-migration commented Apr 14, 2016

Updated by mfriedrich on 2016-04-14 10:22:17 +00:00

Looks like a deadlock.

Thread 18 (Thread 0x7f302b53b700 (LWP 24865)):
#0  0x00007f3036496344 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x00007f30361e671b in boost::condition_variable::wait(boost::unique_lock&) () from /usr/lib/x86_64-linux-gnu/icinga2/libbase.so
No symbol table info available.
#2  0x00007f303617d400 in icinga::TlsStream::Handshake() () from /usr/lib/x86_64-linux-gnu/icinga2/libbase.so
No symbol table info available.
#3  0x00007f303578aed3 in icinga::ApiListener::NewClientHandlerInternal(boost::intrusive_ptr const&, icinga::String const&, icinga::ConnectionRole) () from /usr/lib/x86_64-linux-gnu/icinga2/libremote.so
No symbol table info available.
#4  0x00007f303578be4d in icinga::ApiListener::NewClientHandler(boost::intrusive_ptr const&, icinga::String const&, icinga::ConnectionRole) () from /usr/lib/x86_64-linux-gnu/icinga2/libremote.so
No symbol table info available.
#5  0x00007f303578c179 in icinga::ApiListener::AddConnection(boost::intrusive_ptr const&) () from /usr/lib/x86_64-linux-gnu/icinga2/libremote.so
No symbol table info available.
#6  0x00007f30368ad629 in ?? () from /usr/lib/libboost_thread.so.1.49.0
No symbol table info available.
#7  0x00007f3036491b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#8  0x00007f303392330d in clone () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#9  0x0000000000000000 in ?? ()
No symbol table info available.

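The telling frame is #2: icinga::TlsStream::Handshake() is blocked in a condition-variable wait with nothing left to wake it up once the underlying connection has gone away. A hypothetical C++ sketch of this failure pattern and the usual remedy (signal the waiter on error, and/or bound the wait), using std:: primitives and made-up names rather than the actual Icinga2 code touched by the changesets above:

#include <chrono>
#include <condition_variable>
#include <mutex>
#include <stdexcept>

// Hypothetical sketch of the hang and a hardened variant.
class HandshakeWaiter {
public:
    // Problematic pattern: waits only for success, with no error path and no
    // timeout, so the thread blocks forever if the peer disappears mid-handshake.
    void HandshakeBlocking() {
        std::unique_lock<std::mutex> lock(m_Mutex);
        while (!m_Done)
            m_CV.wait(lock);
    }

    // Hardened pattern: also wakes on error and bounds the wait.
    void HandshakeWithErrorHandling() {
        std::unique_lock<std::mutex> lock(m_Mutex);
        if (!m_CV.wait_for(lock, std::chrono::seconds(10),
                           [this] { return m_Done || m_Error; }))
            throw std::runtime_error("TLS handshake timed out");
        if (m_Error)
            throw std::runtime_error("TLS handshake failed");
    }

    // Called from the I/O path on completion or failure.
    void SignalDone()  { { std::lock_guard<std::mutex> l(m_Mutex); m_Done = true; }  m_CV.notify_all(); }
    void SignalError() { { std::lock_guard<std::mutex> l(m_Mutex); m_Error = true; } m_CV.notify_all(); }

private:
    std::mutex m_Mutex;
    std::condition_variable m_CV;
    bool m_Done = false;
    bool m_Error = false;
};

The changeset title "Improve error handling for TlsStream::Handshake" suggests the eventual fix went in a similar direction, ensuring the handshake cannot wait indefinitely when the stream errors out.
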
@icinga-migration commented Apr 20, 2016

Updated by gbeutner on 2016-04-20 16:35:40 +00:00

  • Target Version changed from 2.4.6 to 2.4.7

@icinga-migration commented Apr 21, 2016

Updated by gbeutner on 2016-04-21 07:47:00 +00:00

  • Target Version changed from 2.4.7 to 2.4.8

@icinga-migration commented May 10, 2016

Updated by mfriedrich on 2016-05-10 11:25:28 +00:00

  • Relates set to 11730

@icinga-migration commented May 11, 2016

Updated by gbeutner on 2016-05-11 08:14:39 +00:00

  • Status changed from New to Assigned
  • Assigned to set to gbeutner

I'm not entirely sure whether this patch fixes the problem; unfortunately, I'm unable to reproduce it here. Please check whether the problem still occurs with the current master branch.

@icinga-migration commented May 11, 2016

Updated by gbeutner on 2016-05-11 08:15:03 +00:00

  • Status changed from Assigned to Resolved
  • Done % changed from 0 to 100

Applied in changeset 2adfcb5.

@icinga-migration commented Aug 8, 2016

Updated by mfriedrich on 2016-08-08 11:21:59 +00:00

  • Parent Id deleted 11313