
[dev.icinga.com #10002] Deadlock in WorkQueue::Enqueue #3324

Closed
icinga-migration opened this issue Aug 26, 2015 · 11 comments

Comments

@icinga-migration commented Aug 26, 2015

This issue has been migrated from Redmine: https://dev.icinga.com/issues/10002

Created by aledermueller on 2015-08-26 13:34:46 +00:00

Assignee: gbeutner
Status: Resolved (closed on 2015-10-15 13:19:22 +00:00)
Target Version: 2.3.11
Last Update: 2015-10-15 13:19:22 +00:00 (in Redmine)

Icinga Version: 2.3.8
Backport?: Already backported
Include in Changelog: 1

Hey,

Agents (zones): approx. 400 (mixed versions with 2.3.8 and 2.3.9)
Masters: 2 (Version 2.3.8)

After a while, Icinga2 on one master hangs without using resources such as CPU or I/O. netstat shows full Recv-Qs (data from the agents) and empty Send-Qs. About two thirds of the connections are in CLOSE_WAIT; the remaining third are ESTABLISHED.

A stacktrace is attached, captured with: gdb -p <pid> -ex 'thread apply all bt full' -ex detach -ex quit -batch > debug

The debug log mainly contains the following entries; the pending-task counter keeps growing:

[2015-08-26 14:05:18 +0200] notice/ThreadPool: Pool #1: Pending tasks: 13; Average latency: 0ms; Threads: 4; Pool utilization: 14.7925%
[2015-08-26 14:05:33 +0200] notice/ThreadPool: Pool #1: Pending tasks: 86; Average latency: 34ms; Threads: 5; Pool utilization: 24.8859%
[2015-08-26 14:05:48 +0200] notice/ThreadPool: Pool #1: Pending tasks: 5; Average latency: 0ms; Threads: 4; Pool utilization: 20.9834%
[2015-08-26 14:06:03 +0200] notice/ThreadPool: Pool #1: Pending tasks: 372; Average latency: 0ms; Threads: 8; Pool utilization: 71.0408%
[2015-08-26 14:06:18 +0200] notice/ThreadPool: Pool #1: Pending tasks: 584; Average latency: 0ms; Threads: 36; Pool utilization: 75.625%
[2015-08-26 14:06:33 +0200] notice/ThreadPool: Pool #1: Pending tasks: 858; Average latency: 0ms; Threads: 64; Pool utilization: 92.9821%
[2015-08-26 14:06:48 +0200] notice/ThreadPool: Pool #1: Pending tasks: 1091; Average latency: 0ms; Threads: 64; Pool utilization: 99.7029%
[2015-08-26 14:07:03 +0200] notice/ThreadPool: Pool #1: Pending tasks: 1257; Average latency: 0ms; Threads: 64; Pool utilization: 99.9874%
[2015-08-26 14:07:18 +0200] notice/ThreadPool: Pool #1: Pending tasks: 1371; Average latency: 0ms; Threads: 64; Pool utilization: 99.9995%
[2015-08-26 14:07:33 +0200] notice/ThreadPool: Pool #1: Pending tasks: 1401; Average latency: 0ms; Threads: 64; Pool utilization: 100%
[2015-08-26 14:07:48 +0200] notice/ThreadPool: Pool #1: Pending tasks: 1485; Average latency: 0ms; Threads: 64; Pool utilization: 100%
[2015-08-26 14:08:03 +0200] notice/ThreadPool: Pool #1: Pending tasks: 1545; Average latency: 0ms; Threads: 64; Pool utilization: 100%
...
[2015-08-26 15:27:45 +0200] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
[2015-08-26 15:27:45 +0200] notice/ThreadPool: Pool #1: Pending tasks: 2453; Average latency: 0ms; Threads: 64; Pool utilization: 100%
[2015-08-26 15:27:45 +0200] notice/ThreadPool: Thread pool; current: 16; adjustment: 2
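A quick way to confirm that the counter is only ever growing is to extract it from the debug log. A minimal sketch (not part of Icinga; the regex is assumed from the log format quoted above):

```python
import re

# Pattern assumed from the ThreadPool log lines quoted above.
PENDING_RE = re.compile(r"Pending tasks: (\d+);")

def pending_tasks(log_lines):
    """Extract the pending-task counter from ThreadPool log lines."""
    return [int(m.group(1))
            for line in log_lines
            if (m := PENDING_RE.search(line)) is not None]

log = [
    "[2015-08-26 14:07:33 +0200] notice/ThreadPool: Pool #1: Pending tasks: 1401; Average latency: 0ms; Threads: 64; Pool utilization: 100%",
    "[2015-08-26 14:07:48 +0200] notice/ThreadPool: Pool #1: Pending tasks: 1485; Average latency: 0ms; Threads: 64; Pool utilization: 100%",
    "[2015-08-26 15:27:45 +0200] notice/ThreadPool: Thread pool; current: 16; adjustment: 2",
]
counts = pending_tasks(log)
assert counts == [1401, 1485]
assert counts == sorted(counts)  # a stuck daemon shows monotonic growth
```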

Thanks, Achim
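The hang reported above matches a self-deadlock on a bounded work queue: a task already running on the queue's worker thread calls Enqueue, the queue is at its limit, and the blocking enqueue waits for space that only this very thread could free. A minimal Python sketch of the pattern (not Icinga code; the queue sizes and the 0.2 s timeout are illustrative stand-ins for "blocks forever"):

```python
import queue

def enqueue_from_worker(wq):
    """Simulate an Enqueue issued from a task that is already running on
    the queue's only worker thread. If the queue is bounded and full, a
    blocking put() waits for space that only this thread could free --
    the deadlock this issue describes."""
    try:
        wq.put("follow-up work", timeout=0.2)
        return "enqueued"
    except queue.Full:
        return "deadlock"

bounded = queue.Queue(maxsize=1)   # a default WQ limit
bounded.put("pending item")        # queue already at its limit
assert enqueue_from_worker(bounded) == "deadlock"

unbounded = queue.Queue()          # removing the default limit avoids it
unbounded.put("pending item")
assert enqueue_from_worker(unbounded) == "enqueued"
```

This also illustrates why the eventual fix pairs a deadlock fix in the relay path with removing the default work-queue limits.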

Attachments

Changesets

2015-09-02 05:46:30 +00:00 by (unknown) 5c77e6e

Fix deadlock in ApiListener::RelayMessage

fixes #10002

2015-09-02 07:16:20 +00:00 by (unknown) 35acba7

Remove default WQ limits

refs #10002

2015-10-15 13:16:51 +00:00 by (unknown) e480af3

Remove default WQ limits

refs #10002

2015-10-15 13:18:02 +00:00 by (unknown) c8d24b6

Fix deadlock in ApiListener::RelayMessage

fixes #10002


@icinga-migration commented Aug 27, 2015

Updated by aledermueller on 2015-08-27 07:10:51 +00:00

  • File added debug-100-master1-idomaster
  • File added debug-100-master2

The same thing happened again; now the second master shows the same behavior and logs. A stacktrace of each is attached; master1 is the host writing to the IDO.

Thanks, Achim

@icinga-migration commented Aug 27, 2015

Updated by mfriedrich on 2015-08-27 14:52:46 +00:00

  • Relates set to 9983

@icinga-migration commented Aug 31, 2015

Updated by mfrosch on 2015-08-31 11:23:54 +00:00

Maybe this is also connected to #9976?

@icinga-migration commented Aug 31, 2015

Updated by mfrosch on 2015-08-31 11:24:00 +00:00

  • Relates set to 9976

@icinga-migration commented Aug 31, 2015

Updated by mfrosch on 2015-08-31 14:24:42 +00:00

  • Relates set to 9798

@icinga-migration commented Sep 2, 2015

Updated by gbeutner on 2015-09-02 05:46:59 +00:00

There's an experimental patch in the master branch which needs further testing.

@icinga-migration commented Sep 2, 2015

Updated by Anonymous on 2015-09-02 05:47:02 +00:00

  • Status changed from New to Resolved
  • Done % changed from 0 to 100

Applied in changeset 5c77e6e.

@icinga-migration commented Sep 2, 2015

Updated by gbeutner on 2015-09-02 05:47:19 +00:00

  • Category set to Cluster
  • Status changed from Resolved to Feedback
  • Assigned to set to aledermueller
  • Target Version set to 2.4.0

@icinga-migration commented Sep 14, 2015

Updated by mfriedrich on 2015-09-14 08:22:08 +00:00

According to Achim and Blerim, the fixes made it work again (2.3.10 without the fixes causes trouble; the snapshot packages have been running fine for nearly a week now). We'll test this a little more and may backport it into 2.3.11 next week.

@icinga-migration commented Sep 14, 2015

Updated by mfriedrich on 2015-09-14 08:23:04 +00:00

  • Priority changed from Normal to High

@icinga-migration commented Oct 15, 2015

Updated by mfriedrich on 2015-10-15 13:19:22 +00:00

  • Subject changed from Pool utilization: 100% to Deadlock in WorkQueue::Enqueue
  • Status changed from Feedback to Resolved
  • Assigned to changed from aledermueller to gbeutner
  • Target Version changed from 2.4.0 to 2.3.11
  • Backport? changed from TBD to Yes