
Drop ApiListener#m_RelayQueue #9855

Open

Al2Klimov wants to merge 2 commits into master

Conversation

Al2Klimov
Member

Instead, run the previously enqueued function directly. That enqueued function doesn't really do anything more than this (see the sketch after the call tree):

  • ApiListener::SyncRelayMessage
    • ApiListener::RelayMessageOne
      • ApiListener::SyncSendMessage
        • JsonRpcConnection::SendMessage
          • strand::post <-- The enqueued function just enqueues a function here!
    • ApiListener::PersistMessage
      • NetString::WriteStringToStream
        • StdioStream::Write
          • basic_ostream::write <-- Will end up in user space buffer or in Linux disk cache – so strictly speaking in yet another queue.
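
To illustrate the point, here is a minimal, generic sketch (plain Boost.Asio plus a std::deque, not the actual ApiListener code; the names relayQueue and sendMessage are made up): the queued callback only defers the real work onto another executor anyway, so invoking it directly is equivalent.

// Generic illustration, NOT Icinga 2 source. It merely demonstrates that a queue
// whose entries only re-enqueue work elsewhere can be dropped without losing anything.
#include <boost/asio.hpp>
#include <deque>
#include <functional>
#include <iostream>

int main()
{
    boost::asio::io_context io;
    auto strand = boost::asio::make_strand(io); // stands in for the connection's strand

    // The "send" callback only defers again, compare JsonRpcConnection::SendMessage above.
    auto sendMessage = [&strand] {
        boost::asio::post(strand, [] { std::cout << "message written\n"; });
    };

    // Old shape: a dedicated relay queue whose sole job is to run callbacks like that one.
    std::deque<std::function<void()>> relayQueue;
    relayQueue.push_back(sendMessage);
    while (!relayQueue.empty()) {
        relayQueue.front()(); // in real life a worker thread drains this
        relayQueue.pop_front();
    }

    // New shape (conceptually what this PR does): just call it directly.
    sendMessage();

    io.run(); // both paths end up posting the same work onto the strand
    return 0;
}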

But this queue isn't just useless, it's also actively harmful (even when not counting the needless malloc(3) calls and mutex locks):

  • Hardware specs: 56 cores, 256 GB RAM
  • Icinga 2.14 config: see below

Icinga just works as long as I don't start this:

for i in {1..30}; do (while curl -ksSo /dev/null -d '{"filter":"service.name==string('$(($RANDOM % 10))')&&host.name==string('$RANDOM')","pretty":1}' -X GET -u root:a97ccaf00cbaf91a 'https://127.0.0.1:5665/v1/objects/services'; do true; done) & done

Then the queue dropped by this PR just grows and grows – and so does the memory usage, of course.
This PR fixes that memory leak.
It may slow Icinga down a bit in critical situations where I'm literally DDoSing it, but that's better than an OOM crash.

for (i in range(750000)) {
	object Host i {
		check_command = "dummy"
		vars.foo1 = "foo1"
		vars.foo2 = "foo2"
		vars.foo3 = "foo3"
		vars.foo4 = "foo4"
		vars.foo5 = "foo5"
		vars.foo6 = "foo6"
		vars.foo7 = "foo7"
		vars.foo8 = "foo8"
		vars.foo9 = "foo9"
		vars.foo10 = "foo10"
		vars.foo11 = "foo11"
		vars.foo12 = "foo12"
	}
}

for (i in range(10)) {
	apply Service i {
		check_command = "dummy"
		vars.foo1 = "foo1"
		vars.foo2 = "foo2"
		vars.foo3 = "foo3"
		vars.foo4 = "foo4"
		vars.foo5 = "foo5"
		vars.foo6 = "foo6"
		vars.foo7 = "foo7"
		vars.foo8 = "foo8"
		vars.foo9 = "foo9"
		vars.foo10 = "foo10"
		vars.foo11 = "foo11"
		vars.foo12 = "foo12"
		assign where true
	}
}

Behaviour in Icinga 2.10

  • JsonRpcConnection::SendMessage
    • JsonRpc::SendMessage
      • NetString::WriteStringToStream
        • TlsStream::Write
          • FIFO::Write <-- Actually yet another queue 🤷‍♂️

Behaviour in Icinga 2.2

Actual synchronous writing, the original reason for the relay queue (rough sketch below):

  • TlsStream::Write
    • Socket::Poll
    • SSL_write
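
For context, here is a rough sketch of what such an actual synchronous write looks like (generic poll(2)/OpenSSL code, not the 2.2 source; SyncWrite is a made-up helper): the caller itself blocks until the peer accepts the bytes, which is exactly the kind of stall the relay queue was originally meant to keep away from its callers.

// Hypothetical helper, only illustrating the blocking nature of a synchronous TLS write.
#include <openssl/ssl.h>
#include <poll.h>
#include <cstddef>

bool SyncWrite(SSL* ssl, int fd, const char* buf, std::size_t len)
{
    std::size_t written = 0;
    while (written < len) {
        pollfd pfd{fd, POLLOUT, 0};
        if (poll(&pfd, 1, -1) < 0)       // block until the socket is writable
            return false;
        int rc = SSL_write(ssl, buf + written, static_cast<int>(len - written));
        if (rc <= 0)
            return false;                // real code would inspect SSL_get_error() here
        written += static_cast<std::size_t>(rc);
    }
    return true;
}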

@Al2Klimov added the bug, area/distributed and core/crash labels on Aug 24, 2023
@cla-bot added the cla/signed label on Aug 24, 2023
@julianbrost
Contributor

Is there more than the local Endpoint object in the config? If not, shouldn't ApiListener::SyncRelayMessage() pretty much only conclude that there's nothing to do? If no other endpoint exists, there is no endpoint where the message can be sent to and there should never be the need to write a replay log. Or in other words: in that scenario, ApiListener::SyncRelayMessage() should be a pretty inexpensive no-op.

What is the CPU utilization in your tests of both the complete system and just the icinga2 process, with and without your curl loops running? I mean, these requests shouldn't put any additional items into the RelayQueue, so they should only make it tip over if they take away resources that could otherwise be used for processing the queue. That makes me wonder: can you reproduce the same behavior by limiting the number of CPU cores available to icinga2 to a value that's too low (i.e. just less than what it would use if it wasn't limited and therefore fully using the limit)?

Also, when you observe checks per second, does this change? I think your PR might fix it by putting some back pressure there instead of overflowing a queue.

@Al2Klimov
Member Author

shouldn't ApiListener::SyncRelayMessage() pretty much only conclude that there's nothing to do? If no other endpoint exists, there is no endpoint where the message can be sent to and there should never be the need to write a replay log. Or in other words: in that scenario, ApiListener::SyncRelayMessage() should be a pretty inexpensive no-op.

+1

I think your PR might fix it by putting some back pressure there instead of overflowing a queue.

This is the idea behind it. A DDoSed app should take the best of all available (bad) options, and in this case that's back pressure.
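
To make that concrete, here is a tiny generic demo (standard C++ only, made-up timings, not Icinga code): with a fire-and-forget queue an overloaded consumer just means an ever-growing backlog, whereas running the slow work inline throttles the producer to the consumer's pace.

// Generic sketch: unbounded backlog vs. back pressure. All names and numbers are illustrative.
#include <atomic>
#include <chrono>
#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>

using namespace std::chrono_literals;

std::mutex g_Mutex;
std::deque<std::function<void()>> g_Backlog; // unbounded, like a work-queue backlog

void SlowWork() { std::this_thread::sleep_for(2ms); } // e.g. persist + send

int main()
{
    std::atomic<bool> stop{false};

    // Consumer thread: drains the backlog at its own (too slow) pace.
    std::thread consumer([&stop] {
        while (!stop) {
            std::function<void()> fn;
            {
                std::lock_guard<std::mutex> lock(g_Mutex);
                if (!g_Backlog.empty()) {
                    fn = std::move(g_Backlog.front());
                    g_Backlog.pop_front();
                }
            }
            if (fn)
                fn();
            else
                std::this_thread::sleep_for(1ms);
        }
    });

    // Variant 1: enqueue and forget. The producer never waits, so the backlog
    // (and with it the memory usage) keeps growing for as long as the overload lasts.
    for (int i = 0; i < 1000; ++i) {
        std::lock_guard<std::mutex> lock(g_Mutex);
        g_Backlog.push_back(SlowWork);
    }
    {
        std::lock_guard<std::mutex> lock(g_Mutex);
        std::cout << "backlog after enqueueing: " << g_Backlog.size() << "\n";
    }

    // Variant 2: run the work inline. The producer is throttled to the pace of
    // the actual work ("back pressure") and nothing piles up.
    for (int i = 0; i < 100; ++i)
        SlowWork();

    stop = true;
    consumer.join();
    return 0;
}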

@Al2Klimov
Member Author

What is the CPU utilization in your tests of both the complete system and just the icinga2 process with and without your curl loops running?

I assume you mean v2.14.

| CPU utilization | icinga2 process | complete system (% of all cores) |
| --- | --- | --- |
| without curl | %CPU 1071 | %Cpu(s): 17,8 us, 2,2 sy, 1,6 ni, 77,9 id, 0,0 wa, 0,3 hi, 0,2 si, 0,0 st |
| with curl | %CPU 3383 | %Cpu(s): 62,9 us, 1,7 sy, 0,0 ni, 34,8 id, 0,0 wa, 0,5 hi, 0,1 si, 0,0 st |

can you reproduce the same behavior by limiting the number of CPU cores available to icinga2 to a value that's too low (i.e. just less than what it would use if it wasn't limited and therefore fully using the limit)?

I don't quite understand the question, especially given the two different situations (under load / not under load) above. Please specify an exact number of cores. In theory:

less parallelism = less processing per unit of time = more pile-up per unit of time with the same config

Also, when you observe checks per second, does this change?

v2.14 or this PR?

Forget the question, I don't get a complete answer beyond what's shown below anyway.

W/o curl

            "name": "CIB",
            "perfdata": [],
            "status": {
                "active_host_checks": 0,
                "active_host_checks_15min": 750001,
                "active_host_checks_1min": 0,
                "active_host_checks_5min": 247959,
                "active_service_checks": 21551.3,
                "active_service_checks_15min": 6189561,
                "active_service_checks_1min": 1293078,
                "active_service_checks_5min": 6164680,
                "avg_execution_time": 7.9963848839634e-06,
                "avg_latency": 320.15754831003255,
                "current_concurrent_checks": 0,
                "current_pending_callbacks": 0,
                "max_execution_time": 43.529618978500366,
                "max_latency": 428.3694260120392,
                "min_execution_time": 0,
                "min_latency": 216.07322096824646,

W/ curl

> GET /v1/status HTTP/1.1
> Host: 127.0.0.1:5665
> Authorization: Basic cm9vdDphOTdjY2FmMDBjYmFmOTFh
> User-Agent: curl/7.85.0
> Accept: application/json
> Content-Length: 12
> Content-Type: application/x-www-form-urlencoded
>
* TLSv1.2 (IN), TLS header, Supplemental data (23):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.2 (IN), TLS header, Supplemental data (23):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* TLSv1.2 (IN), TLS header, Supplemental data (23):
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: Icinga/r2.14.0-1
< Content-Type: application/json
< Content-Length: 12091
<
{
    "results": [
        {
            "name": "ApiListener",
            "perfdata": [
                {
                    "counter": false,
                    "crit": null,
                    "label": "api_num_conn_endpoints",
                    "max": null,
                    "min": null,
                    "type": "PerfdataValue",
                    "unit": "",
                    "value": 0,
                    "warn": null
                },
                {
                    "counter": false,
                    "crit": null,
                    "label": "api_num_endpoints",
                    "max": null,
                    "min": null,
                    "type": "PerfdataValue",
                    "unit": "",
                    "value": 0,
                    "warn": null
                },
                {
                    "counter": false,
                    "crit": null,
* TLSv1.2 (IN), TLS header, Supplemental data (23):
                    "label": "api_num_http_clients",
                    "max": null,
                    "min": null,
                    "type": "PerfdataValue",
                    "unit": "",
                    "value": 31,
                    "warn": null
                },
                {
                    "counter": false,
                    "crit": null,
                    "label": "api_num_json_rpc_anonymous_clients",
                    "max": null,
                    "min": null,
                    "type": "PerfdataValue",
                    "unit": "",
                    "value": 0,
                    "warn": null
                },
                {
                    "counter": false,
                    "crit": null,
                    "label": "api_num_json_rpc_relay_queue_item_rate",
                    "max": null,
                    "min": null,
                    "type": "PerfdataValue",
                    "unit": "",
                    "value": 93500.96666666666,

@julianbrost
Contributor

%Cpu(s): 62,9 us, 1,7 sy, 0,0 ni, 34,8 id, 0,0 wa, 0,5 hi, 0,1 si, 0,0 st

Do I interpret this correctly that 34.8% of 56 cores were idle, i.e. plenty more CPU resources available?

can you reproduce the same behavior by limiting the number of CPU cores available to icinga2 to a value that's too low (i.e. just less than what it would use if it wasn't limited and therefore fully using the limit)?

I don't quite understand the question, especially given two different situations (under load / not under load) above. Please specify an exact number of cores. In theory:

As the process used 10.71 cores without curl running, I'd start with a limit of 10 cores and see if that makes the process use 1000% CPU constantly; if not, go for fewer cores. So the idea is to deliberately provide too little CPU and see if this results in the same issue, pretty much in an effort to isolate it and figure out whether those curl loops actually contribute anything more than CPU load.

Also, when you observe checks per second, does this change?

v2.14 or this PR?

Both, that would have been a test for that theory. So kind of the picture perfect result would have been something like:

  • 2.14: when starting curl, check rate stays the same, queue grows.
  • PR: when starting curl, check rate goes down, because instead of filling a queue, checks wait.

@Al2Klimov
Member Author

I don't get a complete answer beyond what's shown below anyway.

Ah, finally!

            "name": "CIB",
            "perfdata": [],
            "status": {
                "active_host_checks": 1619.7,
                "active_host_checks_15min": 2250006,
                "active_host_checks_1min": 97182,
                "active_host_checks_5min": 750004,
                "active_service_checks": 21400.216666666667,
                "active_service_checks_15min": 16562549,
                "active_service_checks_1min": 1284014,
                "active_service_checks_5min": 7713247,
                "avg_execution_time": 6.380216790137454e-06,
                "avg_latency": 32.46134595272463,
                "current_concurrent_checks": 0,
                "current_pending_callbacks": 0,
                "max_execution_time": 47.82143807411194,
                "max_latency": 83.98208498954773,
                "min_execution_time": 0,
             "min_latency": 6.508827209472656e-05,

@Al2Klimov
Member Author

%Cpu(s): 62,9 us, 1,7 sy, 0,0 ni, 34,8 id, 0,0 wa, 0,5 hi, 0,1 si, 0,0 st

Do I interpret this correctly that 34.8% of 56 cores were idle, i.e. plenty more CPU resources available?

+1

Also, when you observe checks per second, does this change?

v2.14 or this PR?

Both, that would have been a test for that theory. So kind of the picture perfect result would have been something like:

  • 2.14: when starting curl, check rate stays the same, queue grows.
  • PR: when starting curl, check rate goes down, because instead of filling a queue, checks wait.

Ok, now with this PR:

W/o curl

            "name": "CIB",
            "perfdata": [],
            "status": {
                "active_host_checks": 0,
                "active_host_checks_15min": 750001,
                "active_host_checks_1min": 0,
                "active_host_checks_5min": 56367,
                "active_service_checks": 21757.916666666668,
                "active_service_checks_15min": 6583331,
                "active_service_checks_1min": 1305475,
                "active_service_checks_5min": 6172776,
                "avg_execution_time": 7.127906680212371e-06,
                "avg_latency": 331.47053883474115,
                "current_concurrent_checks": 0,
                "current_pending_callbacks": 2,
                "max_execution_time": 42.98665714263916,
                "max_latency": 451.1101641654968,
                "min_execution_time": 0,
                "min_latency": 211.6266384124756,

W/ curl

            "name": "CIB",
            "perfdata": [],
            "status": {
                "active_host_checks": 0.016666666666666666,
                "active_host_checks_15min": 2250009,
                "active_host_checks_1min": 1,
                "active_host_checks_5min": 750005,
                "active_service_checks": 29830.35,
                "active_service_checks_15min": 20534785,
                "active_service_checks_1min": 1789821,
                "active_service_checks_5min": 7502486,
                "avg_execution_time": 6.717918023529078e-06,
                "avg_latency": 3.698655656124068,
                "current_concurrent_checks": 0,
                "current_pending_callbacks": 0,
                "max_execution_time": 50.350712060928345,
                "max_latency": 7.383245944976807,
                "min_execution_time": 0,
                "min_latency": 6.723403930664063e-05,

@julianbrost
Contributor

Those numbers don't help if you don't say how/when you obtained them. For example, how long was the process running when you got the "W/o curl" numbers, and for how long were your curl loops running when you obtained the "W/ curl" numbers? Otherwise I don't even know which of the *_1min/5min/15min metrics even contain useful values.

Anyways, the config you shared has 750 000 hosts and 7 500 000 services, i.e. a total of 8 250 000 checkables. check_command = "dummy" without extra settings means always OK, so only check_interval is relevant which isn't set explicitly and defaults to 5m, so the expected check rates are:

  • host: 750 000/5min = 2 250 000/15min = 150 000/1min = 2 500/1sec
  • service: 7 500 000/5min = 22 500 000/15min = 1 500 000/1min = 25 000/1sec
  • total: 8 250 000/5min = 24 750 000/15min = 1 650 000/1min = 27 500/1sec

Which doesn't look anywhere close to any of the numbers you've shared, so I have no idea how to interpret them. Again, I don't know how they were obtained, but given check_interval = 5m, probably only the 5min and 15min numbers are useful at all, and those are only meaningful if the process was running for long enough.

@Al2Klimov
Member Author

I think we have figured out what the problem is: https://community.icinga.com/t/memory-leak-despite-2-14-upgrade-how-to-debug/12380

Forget the memory leak part, but IMAO the queue is still useless.
