
Drop ApiListener#m_RelayQueue #9855

Open

Al2Klimov wants to merge 2 commits into master

Conversation

Al2Klimov
Member

Instead, run the previously enqueued function directly. That enqueued function doesn't really do anything more than this (see the sketch after the call tree):

  • ApiListener::SyncRelayMessage
    • ApiListener::RelayMessageOne
      • ApiListener::SyncSendMessage
        • JsonRpcConnection::SendMessage
          • strand::post <-- The enqueued function just enqueues a function here!
    • ApiListener::PersistMessage
      • NetString::WriteStringToStream
        • StdioStream::Write
          • basic_ostream::write <-- Will end up in user space buffer or in Linux disk cache – so strictly speaking in yet another queue.
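
To illustrate the point, here is a minimal, generic sketch (plain Boost.Asio plus a std::deque, not the actual ApiListener code; the names relayQueue and sendMessage are made up): the queued callback only defers the real work onto another executor anyway, so invoking it directly is equivalent.

// Generic illustration, NOT Icinga 2 source. It merely demonstrates that a queue
// whose entries only re-enqueue work elsewhere can be dropped without losing anything.
#include <boost/asio.hpp>
#include <deque>
#include <functional>
#include <iostream>

int main()
{
    boost::asio::io_context io;
    auto strand = boost::asio::make_strand(io); // stands in for the connection's strand

    // The "send" callback only defers again, compare JsonRpcConnection::SendMessage above.
    auto sendMessage = [&strand] {
        boost::asio::post(strand, [] { std::cout << "message written\n"; });
    };

    // Old shape: a dedicated relay queue whose sole job is to run callbacks like that one.
    std::deque<std::function<void()>> relayQueue;
    relayQueue.push_back(sendMessage);
    while (!relayQueue.empty()) {
        relayQueue.front()(); // in real life a worker thread drains this
        relayQueue.pop_front();
    }

    // New shape (conceptually what this PR does): just call it directly.
    sendMessage();

    io.run(); // both paths end up posting the same work onto the strand
    return 0;
}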

But this queue isn't just useless, it's also actively harmful (even when not counting the needless malloc(3) calls and mutex locks):

  • Hardware specs: 56 cores, 256 GB RAM
  • Icinga 2.14 config: see below

Icinga just works as long as I don't start this:

for i in {1..30}; do (while curl -ksSo /dev/null -d '{"filter":"service.name==string('$(($RANDOM % 10))')&&host.name==string('$RANDOM')","pretty":1}' -X GET -u root:a97ccaf00cbaf91a 'https://127.0.0.1:5665/v1/objects/services'; do true; done) & done

Then the queue dropped by this PR just grows and grows – and so does the memory usage, of course.
This PR fixes that memory leak.
It may slow Icinga down a bit in critical situations where I'm literally DDoSing it, but that's better than an OOM crash.

for (i in range(750000)) {
	object Host i {
		check_command = "dummy"
		vars.foo1 = "foo1"
		vars.foo2 = "foo2"
		vars.foo3 = "foo3"
		vars.foo4 = "foo4"
		vars.foo5 = "foo5"
		vars.foo6 = "foo6"
		vars.foo7 = "foo7"
		vars.foo8 = "foo8"
		vars.foo9 = "foo9"
		vars.foo10 = "foo10"
		vars.foo11 = "foo11"
		vars.foo12 = "foo12"
	}
}

for (i in range(10)) {
	apply Service i {
		check_command = "dummy"
		vars.foo1 = "foo1"
		vars.foo2 = "foo2"
		vars.foo3 = "foo3"
		vars.foo4 = "foo4"
		vars.foo5 = "foo5"
		vars.foo6 = "foo6"
		vars.foo7 = "foo7"
		vars.foo8 = "foo8"
		vars.foo9 = "foo9"
		vars.foo10 = "foo10"
		vars.foo11 = "foo11"
		vars.foo12 = "foo12"
		assign where true
	}
}

Behaviour in Icinga 2.10

  • JsonRpcConnection::SendMessage
    • JsonRpc::SendMessage
      • NetString::WriteStringToStream
        • TlsStream::Write
          • FIFO::Write <-- Actually yet another queue 🤷‍♂️

Behaviour in Icinga 2.2

Actual synchronous writing, the original reason for the relay queue (rough sketch below):

  • TlsStream::Write
    • Socket::Poll
    • SSL_write
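
For context, here is a rough sketch of what such an actual synchronous write looks like (generic poll(2)/OpenSSL code, not the 2.2 source; SyncWrite is a made-up helper): the caller itself blocks until the peer accepts the bytes, which is exactly the kind of stall the relay queue was originally meant to keep away from its callers.

// Hypothetical helper, only illustrating the blocking nature of a synchronous TLS write.
#include <openssl/ssl.h>
#include <poll.h>
#include <cstddef>

bool SyncWrite(SSL* ssl, int fd, const char* buf, std::size_t len)
{
    std::size_t written = 0;
    while (written < len) {
        pollfd pfd{fd, POLLOUT, 0};
        if (poll(&pfd, 1, -1) < 0)       // block until the socket is writable
            return false;
        int rc = SSL_write(ssl, buf + written, static_cast<int>(len - written));
        if (rc <= 0)
            return false;                // real code would inspect SSL_get_error() here
        written += static_cast<std::size_t>(rc);
    }
    return true;
}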

@Al2Klimov added the bug, area/distributed and core/crash labels on Aug 24, 2023
@cla-bot added the cla/signed label on Aug 24, 2023
@julianbrost
Contributor

Is there more than the local Endpoint object in the config? If not, shouldn't ApiListener::SyncRelayMessage() pretty much only conclude that there's nothing to do? If no other endpoint exists, there is no endpoint where the message can be sent to and there should never be the need to write a replay log. Or in other words: in that scenario, ApiListener::SyncRelayMessage() should be a pretty inexpensive no-op.

What is the CPU utilization in your tests of both the complete system and just the icinga2 process, with and without your curl loops running? I mean, these requests shouldn't put any additional items into the RelayQueue, so they should only make it tip over if they take away resources that could otherwise be used for processing the queue. That makes me wonder: can you reproduce the same behavior by limiting the number of CPU cores available to icinga2 to a value that's too low (i.e. just less than what it would use if it wasn't limited and therefore fully using the limit)?

Also, when you observe checks per second, does this change? I think your PR might fix it by putting some back pressure there instead of overflowing a queue.

@Al2Klimov
Member Author

shouldn't ApiListener::SyncRelayMessage() pretty much only conclude that there's nothing to do? If no other endpoint exists, there is no endpoint where the message can be sent to and there should never be the need to write a replay log. Or in other words: in that scenario, ApiListener::SyncRelayMessage() should be a pretty inexpensive no-op.

+1

I think your PR might fix it by putting some back pressure there instead of overflowing a queue.

This is the idea behind it. A DDoSed app should take the best of all available (bad) options, and in this case that's back pressure.
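
To make that concrete, here is a tiny generic demo (standard C++ only, made-up timings, not Icinga code): with a fire-and-forget queue an overloaded consumer just means an ever-growing backlog, whereas running the slow work inline throttles the producer to the consumer's pace.

// Generic sketch: unbounded backlog vs. back pressure. All names and numbers are illustrative.
#include <atomic>
#include <chrono>
#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>

using namespace std::chrono_literals;

std::mutex g_Mutex;
std::deque<std::function<void()>> g_Backlog; // unbounded, like a work-queue backlog

void SlowWork() { std::this_thread::sleep_for(2ms); } // e.g. persist + send

int main()
{
    std::atomic<bool> stop{false};

    // Consumer thread: drains the backlog at its own (too slow) pace.
    std::thread consumer([&stop] {
        while (!stop) {
            std::function<void()> fn;
            {
                std::lock_guard<std::mutex> lock(g_Mutex);
                if (!g_Backlog.empty()) {
                    fn = std::move(g_Backlog.front());
                    g_Backlog.pop_front();
                }
            }
            if (fn)
                fn();
            else
                std::this_thread::sleep_for(1ms);
        }
    });

    // Variant 1: enqueue and forget. The producer never waits, so the backlog
    // (and with it the memory usage) keeps growing for as long as the overload lasts.
    for (int i = 0; i < 1000; ++i) {
        std::lock_guard<std::mutex> lock(g_Mutex);
        g_Backlog.push_back(SlowWork);
    }
    {
        std::lock_guard<std::mutex> lock(g_Mutex);
        std::cout << "backlog after enqueueing: " << g_Backlog.size() << "\n";
    }

    // Variant 2: run the work inline. The producer is throttled to the pace of
    // the actual work ("back pressure") and nothing piles up.
    for (int i = 0; i < 100; ++i)
        SlowWork();

    stop = true;
    consumer.join();
    return 0;
}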

@Al2Klimov
Member Author

What is the CPU utilization in your tests of both the complete system and just the icinga2 process with and without your curl loops running?

I assume you mean v2.14.

| CPU utilization | icinga2 process | complete system (% of all cores) |
| --- | --- | --- |
| without curl | %CPU 1071 | %Cpu(s): 17,8 us, 2,2 sy, 1,6 ni, 77,9 id, 0,0 wa, 0,3 hi, 0,2 si, 0,0 st |
| with curl | %CPU 3383 | %Cpu(s): 62,9 us, 1,7 sy, 0,0 ni, 34,8 id, 0,0 wa, 0,5 hi, 0,1 si, 0,0 st |

can you reproduce the same behavior by limiting the number of CPU cores available to icinga2 to a value that's too low (i.e. just less than what it would use if it wasn't limited and therefore fully using the limit)?

I don't quite understand the question, especially given the two different situations (under load / not under load) above. Please specify an exact number of cores. In theory:

less parallelism = less processing per unit of time = more pile-up per unit of time with the same config

Also, when you observe checks per second, does this change?

v2.14 or this PR?

Forget the question, I don't get a complete answer beyond what's shown below anyway.

W/o curl

            "name": "CIB",
            "perfdata": [],
            "status": {
                "active_host_checks": 0,
                "active_host_checks_15min": 750001,
                "active_host_checks_1min": 0,
                "active_host_checks_5min": 247959,
                "active_service_checks": 21551.3,
                "active_service_checks_15min": 6189561,
                "active_service_checks_1min": 1293078,
                "active_service_checks_5min": 6164680,
                "avg_execution_time": 7.9963848839634e-06,
                "avg_latency": 320.15754831003255,
                "current_concurrent_checks": 0,
                "current_pending_callbacks": 0,
                "max_execution_time": 43.529618978500366,
                "max_latency": 428.3694260120392,
                "min_execution_time": 0,
                "min_latency": 216.07322096824646,

W/ curl

> GET /v1/status HTTP/1.1
> Host: 127.0.0.1:5665
> Authorization: Basic cm9vdDphOTdjY2FmMDBjYmFmOTFh
> User-Agent: curl/7.85.0
> Accept: application/json
> Content-Length: 12
> Content-Type: application/x-www-form-urlencoded
>
* TLSv1.2 (IN), TLS header, Supplemental data (23):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.2 (IN), TLS header, Supplemental data (23):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* TLSv1.2 (IN), TLS header, Supplemental data (23):
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: Icinga/r2.14.0-1
< Content-Type: application/json
< Content-Length: 12091
<
{
    "results": [
        {
            "name": "ApiListener",
            "perfdata": [
                {
                    "counter": false,
                    "crit": null,
                    "label": "api_num_conn_endpoints",
                    "max": null,
                    "min": null,
                    "type": "PerfdataValue",
                    "unit": "",
                    "value": 0,
                    "warn": null
                },
                {
                    "counter": false,
                    "crit": null,
                    "label": "api_num_endpoints",
                    "max": null,
                    "min": null,
                    "type": "PerfdataValue",
                    "unit": "",
                    "value": 0,
                    "warn": null
                },
                {
                    "counter": false,
                    "crit": null,
* TLSv1.2 (IN), TLS header, Supplemental data (23):
                    "label": "api_num_http_clients",
                    "max": null,
                    "min": null,
                    "type": "PerfdataValue",
                    "unit": "",
                    "value": 31,
                    "warn": null
                },
                {
                    "counter": false,
                    "crit": null,
                    "label": "api_num_json_rpc_anonymous_clients",
                    "max": null,
                    "min": null,
                    "type": "PerfdataValue",
                    "unit": "",
                    "value": 0,
                    "warn": null
                },
                {
                    "counter": false,
                    "crit": null,
                    "label": "api_num_json_rpc_relay_queue_item_rate",
                    "max": null,
                    "min": null,
                    "type": "PerfdataValue",
                    "unit": "",
                    "value": 93500.96666666666,

@julianbrost
Contributor

%Cpu(s): 62,9 us, 1,7 sy, 0,0 ni, 34,8 id, 0,0 wa, 0,5 hi, 0,1 si, 0,0 st

Do I interpret this correctly that 34.8% of 56 cores were idle, i.e. plenty more CPU resources available?

can you reproduce the same behavior by limiting the number of CPU cores available to icinga2 to a value that's too low (i.e. just less than what it would use if it wasn't limited and therefore fully using the limit)?

I don't quite understand the question, especially given two different situations (under load / not under load) above. Please specify an exact number of cores. In theory:

As the process used 10.71 cores without curl running, I'd start with a limit of 10 cores and see if that makes the process use 1000% CPU constantly; if not, go for fewer cores. So the idea is to deliberately provide too little CPU and see if this results in the same issue, pretty much in an effort to isolate it and figure out whether those curl loops actually contribute anything more than CPU load.

Also, when you observe checks per second, does this change?

v2.14 or this PR?

Both, that would have been a test for that theory. So kind of the picture perfect result would have been something like:

  • 2.14: when starting curl, check rate stays the same, queue grows.
  • PR: when starting curl, check rate goes down, because instead of filling a queue, checks wait.

@Al2Klimov
Member Author

I don't get a complete answer beyond what's shown below anyway.

Ah, finally!

            "name": "CIB",
            "perfdata": [],
            "status": {
                "active_host_checks": 1619.7,
                "active_host_checks_15min": 2250006,
                "active_host_checks_1min": 97182,
                "active_host_checks_5min": 750004,
                "active_service_checks": 21400.216666666667,
                "active_service_checks_15min": 16562549,
                "active_service_checks_1min": 1284014,
                "active_service_checks_5min": 7713247,
                "avg_execution_time": 6.380216790137454e-06,
                "avg_latency": 32.46134595272463,
                "current_concurrent_checks": 0,
                "current_pending_callbacks": 0,
                "max_execution_time": 47.82143807411194,
                "max_latency": 83.98208498954773,
                "min_execution_time": 0,
             "min_latency": 6.508827209472656e-05,

@Al2Klimov
Member Author

%Cpu(s): 62,9 us, 1,7 sy, 0,0 ni, 34,8 id, 0,0 wa, 0,5 hi, 0,1 si, 0,0 st

Do I interpret this correctly that 34.8% of 56 cores were idle, i.e. plenty more CPU resources available?

+1

Also, when you observe checks per second, does this change?

v2.14 or this PR?

Both, that would have been a test for that theory. So kind of the picture perfect result would have been something like:

  • 2.14: when starting curl, check rate stays the same, queue grows.
  • PR: when starting curl, check rate goes down, because instead of filling a queue, checks wait.

Ok, now with this PR:

W/o curl

            "name": "CIB",
            "perfdata": [],
            "status": {
                "active_host_checks": 0,
                "active_host_checks_15min": 750001,
                "active_host_checks_1min": 0,
                "active_host_checks_5min": 56367,
                "active_service_checks": 21757.916666666668,
                "active_service_checks_15min": 6583331,
                "active_service_checks_1min": 1305475,
                "active_service_checks_5min": 6172776,
                "avg_execution_time": 7.127906680212371e-06,
                "avg_latency": 331.47053883474115,
                "current_concurrent_checks": 0,
                "current_pending_callbacks": 2,
                "max_execution_time": 42.98665714263916,
                "max_latency": 451.1101641654968,
                "min_execution_time": 0,
                "min_latency": 211.6266384124756,

W/ curl

            "name": "CIB",
            "perfdata": [],
            "status": {
                "active_host_checks": 0.016666666666666666,
                "active_host_checks_15min": 2250009,
                "active_host_checks_1min": 1,
                "active_host_checks_5min": 750005,
                "active_service_checks": 29830.35,
                "active_service_checks_15min": 20534785,
                "active_service_checks_1min": 1789821,
                "active_service_checks_5min": 7502486,
                "avg_execution_time": 6.717918023529078e-06,
                "avg_latency": 3.698655656124068,
                "current_concurrent_checks": 0,
                "current_pending_callbacks": 0,
                "max_execution_time": 50.350712060928345,
                "max_latency": 7.383245944976807,
                "min_execution_time": 0,
                "min_latency": 6.723403930664063e-05,

@julianbrost
Contributor

Those numbers don't help if you don't say how/when you obtained them. For example, how long was the process running when you got the "W/o curl" numbers, and for how long were your curl loops running when you obtained the "W/ curl" numbers? Otherwise I don't even know which of the *_1min/5min/15min metrics even contain useful values.

Anyways, the config you shared has 750 000 hosts and 7 500 000 services, i.e. a total of 8 250 000 checkables. check_command = "dummy" without extra settings means always OK, so only check_interval is relevant which isn't set explicitly and defaults to 5m, so the expected check rates are:

  • host: 750 000/5min = 2 250 000/15min = 150 000/1min = 2 500/1sec
  • service: 7 500 000/5min = 22 500 000/15min = 1 500 000/1min = 25 000/1sec
  • total: 8 250 000/5min = 24 750 000/15min = 1 650 000/1min = 27 500/1sec

Which doesn't look anywhere close to any of the numbers you've shared, so I have no idea how to interpret them. Again, I don't know how they were obtained, but given check_interval = 5m, probably only the 5min and 15min numbers are useful at all, and those are only meaningful if the process was running for long enough.

@Al2Klimov
Member Author

I think we have figured out what the problem is: https://community.icinga.com/t/memory-leak-despite-2-14-upgrade-how-to-debug/12380

Forget the memory leak part, but IMAO the queue is still useless.
