high load average every 1h 45m #3465

Closed

akaDJon opened this issue Nov 13, 2017 · 65 comments

Labels: bug (unexpected problem or unintended behavior)

Comments

@akaDJon commented Nov 13, 2017

Every 1h 45m the load average on my server goes up to 3-4 (normally 0.2-0.5).
If I stop the telegraf service, the load average doesn't spike every 1h 45m.

Why is this happening?
Is it possible to adjust the time or period?

telegraf v1.4.3
influxdb v1.3.7

danielnelson added the discussion label on Nov 13, 2017
@danielnelson (Contributor)

Unless you have an input configured to run every 1h45m, there aren't any periodic tasks that would be run by Telegraf. Is InfluxDB on the same system?
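For reference, a per-input interval override in telegraf.conf looks roughly like the snippet below; the exec input and the script path are hypothetical, only to illustrate what a "runs every 1h45m" input would look like.

[[inputs.exec]]
  ## hypothetical command, shown only to illustrate a per-plugin interval override
  commands = ["/usr/local/bin/some-check.sh"]
  data_format = "influx"
  interval = "1h45m"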

@akaDJon (Author) commented Nov 14, 2017

I have the standard config for Telegraf and don't have any special inputs. InfluxDB is on the same server. Maybe Telegraf sends data to InfluxDB and InfluxDB drives up the load average?
[screenshot]

@SM616 commented Nov 14, 2017

Also having this on my Telegraf servers (version 1.4.3, InfluxDB version 1.3.6). I have no specific periodic tasks configured to run every 1.5 hours, and I'm using almost default configuration with one HTTP listener input and two InfluxDB outputs.

Is it something like garbage collection?

@danielnelson (Contributor)

Could you try using the procstat input to watch the Telegraf and InfluxDB processes? This should let us see which process is causing the extra load. I think it could also be useful to enable the internal plugin, which reports Telegraf's memory usage, and you could use the influxdb input to monitor InfluxDB.
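A minimal config for that could look like the following sketch, assuming the InfluxDB process is named influxd and that its /debug/vars endpoint is on the default port:

[[inputs.procstat]]
  pattern = "telegraf"

[[inputs.procstat]]
  pattern = "influxd"

## Telegraf's own runtime and memory stats
[[inputs.internal]]
  collect_memstats = true

## InfluxDB internal metrics
[[inputs.influxdb]]
  urls = ["http://localhost:8086/debug/vars"]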

@akaDJon (Author) commented Nov 16, 2017

Error in plugin [inputs.influxdb]: [url=http://127.0.0.1:8086/debug/vars]: Get http://127.0.0.1:8086/debug/vars: dial tcp 127.0.0.1:8086: getsockopt: connection refused

But wget "http://127.0.0.1:8086/debug/vars" works normally and returns JSON.

@akaDJon (Author) commented Nov 16, 2017

procstat info
[screenshots]

@danielnelson (Contributor)

Hmm, nothing appears to line up. Can you try to discover which process is causing the high load?

@akaDJon (Author) commented Nov 17, 2017

I tried stopping the telegraf service and collecting load average data with this script:

while true; do uptime >> uptime.log; sleep 10; done

and the load average didn't go up. Now trying again.

@akaDJon (Author) commented Nov 17, 2017

I ran "systemctl stop telegraf" and for 4 hours the load average didn't even go above 0.9.

uptime.zip

As you can see, the problem is clearly in Telegraf, or in running it together with InfluxDB.

@danielnelson (Contributor)

I suspect it may be caused by a process that Telegraf is starting, or possibly a subprocess of InfluxDB, because when we monitored the Telegraf process there was no corresponding spike in cpu usage. Can you try to find the name of the process that is producing the cpu usage during one of the spikes? It might be easiest to just run top during the time period with the increase.
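For example, a rough sketch of a batch-mode top log that could be left running across a spike window (the 5-second cadence and the log path are arbitrary):

while true; do date; top -b -n 1 | head -n 20; sleep 5; done >> /tmp/top.log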

@akaDJon (Author) commented Nov 17, 2017

Visually in htop I see 3-4 influxdb and telegraf processes. I'll try to take screenshots of htop during a spike.

@akaDJon (Author) commented Nov 17, 2017

Within 2 minutes the load average went from 1 to 5.

[htop screenshots]

@danielnelson (Contributor)

It looks to me like the load is not caused by CPU; Telegraf is still only at 7% at its peak. But I do notice that the influxdb process is often in state D (uninterruptible sleep). It might be interesting to take a look at the data collected by the processes input, which is enabled by default in Telegraf.

@akaDJon (Author) commented Nov 17, 2017

What do I need to do?

@danielnelson (Contributor)

If you have it, let's take a look at a graph of the processes measurement with all fields.

@akaDJon (Author) commented Nov 17, 2017

[screenshot]

This?

@danielnelson (Contributor)

Yeah, can you remove total and sleeping so we can see the blocked processes?
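Against the default telegraf database, a query along these lines should show the remaining fields (field names assumed from the stock Linux processes plugin; adjust if yours differ):

SELECT mean("blocked"), mean("running"), mean("stopped"), mean("zombies") FROM "processes" WHERE time > now() - 6h GROUP BY time(10s)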

@akaDJon (Author) commented Nov 17, 2017

[screenshots]

@danielnelson (Contributor)

Can you try looking at these two sample queries with the diskio input?
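For illustration only (not necessarily the two queries meant here), diskio throughput and device busy time can be graphed with queries along these lines, assuming the default diskio fields and the "name" device tag:

SELECT non_negative_derivative(last("read_bytes"), 1s), non_negative_derivative(last("write_bytes"), 1s) FROM "diskio" WHERE time > now() - 6h GROUP BY time(10s), "name"
SELECT non_negative_derivative(last("io_time"), 1s) FROM "diskio" WHERE time > now() - 6h GROUP BY time(10s), "name"

The io_time field is cumulative milliseconds spent doing I/O, so its per-second rate approximates how busy each device is.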

@akaDJon (Author) commented Nov 18, 2017

At 6:00 the system ran its automatic backup.

[screenshot]

@akaDJon (Author) commented Nov 18, 2017

I have all the default statistics and nothing shows a problem except the load average. If you contact me on vk.com or facebook.com I can give you access to the Grafana web UI.

@danielnelson (Contributor)

Who would have thought it could be so hard to track down the source of the high system load averages... It would be interesting to separate the servers that Telegraf and InfluxDB are running on; is this something you would be able to try?

My best guess is it has something to do with the InfluxDB cache writing.

@akaDJon (Author) commented Nov 18, 2017

Sorry, I have only one server. If you give me a connection to your InfluxDB I can send the statistics to you.

@akaDJon (Author) commented Nov 18, 2017

My system is Ubuntu 16.04.3 LTS (xenial), kernel 4.4.0-98-generic, on a KVM VPS.
All programs are up to date.

@danielnelson (Contributor)

I think this is probably just normal behavior of InfluxDB, either writing its cache or perhaps compacting the data files. I would have liked to be able to find the exact cause of the load, but this turned out to be tricky, and I don't think it is something that we can solve in Telegraf anyway.

You could try disabling some of the inputs, and see if this will result in an adjusted period between load increases.

@akaDJon (Author) commented Nov 18, 2017

The more statistics Telegraf collects, the higher the load average at the peak moment. If I turn off statistics collection, the load goes away. But it is not normal for the server to be highly loaded when collecting a minimum of statistics. And the 1h 45m period is very strange.

@akaDJon (Author) commented Nov 18, 2017

Should I open a ticket on the InfluxDB forum? Maybe their specialists can help me? Or would it be better to buy the InfluxDB cloud service?

@danielnelson (Contributor)

Thanks for opening the issue on InfluxDB, I'm sure they will be able to help you in more detail. If you have questions about InfluxCloud you can also contact sales@influxdata.com.

@danielnelson (Contributor)

I don't think so; it seemed too long between occurrences to be GC-related, and there doesn't seem to be much change in memory usage.

@danielnelson (Contributor)

You should be able to see the last GC time by starting Telegraf with --pprof-addr :6060 and browsing to http://localhost:6060/debug/pprof. Under heap you can see LastGC.
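Concretely, something like this (the curl/grep is just one way to read the value; LastGC is reported as nanoseconds since the Unix epoch):

telegraf --config /etc/telegraf/telegraf.conf --pprof-addr :6060
curl -s 'http://localhost:6060/debug/pprof/heap?debug=1' | grep -E 'LastGC|NumGC'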

@james-lawrence (Contributor) commented Dec 8, 2017

Memory isn't always returned to the OS after a GC, since the assumption is that the program will use it again. But the machines in the htop images don't have enough RAM to account for a minute-plus duration of GC.

GC also doesn't trigger unless memory is consumed, so it could be a slow build-up.

Another thing I find interesting, and which points me towards other sources of load: htop is showing almost no CPU utilization. What about locks, or timers?
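One low-effort way to rule GC in or out would be to run Telegraf in the foreground with the Go runtime's GC trace enabled and compare the timestamps against the load spikes; a sketch, with an arbitrary log path:

GODEBUG=gctrace=1 telegraf --config /etc/telegraf/telegraf.conf 2> /tmp/telegraf-gctrace.log

Each "gc N @...s ..." line in that log is stamped with seconds since process start, so it can be lined up against the 1h 45m spikes.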

danielnelson added the bug label and removed the discussion label on Jan 2, 2018
@akaDJon (Author) commented Jan 8, 2018

This?
[screenshot]

@akaDJon (Author) commented Jan 8, 2018

Or the full log?
heap.txt

@akaDJon (Author) commented Jan 8, 2018

After an hour:
heap2.txt

[screenshot]

@ekeih commented Jan 18, 2018

After upgrading to 1.5.1 yesterday I have the same issue with all my virtual machines and the hardware machine itself. The interval is also 1h 45m but the start time is a little different on every machine.
I will try to collect some more information from my machines as soon as possible.
I attached a screenshot. hetzner02 is the hardware machine and git and data are two virtual machines running on hetzner02.

[screenshot]

@aldem commented Jan 19, 2018

Same issue:

[screenshot]

This is an idle physical system (16GB RAM, Intel i5-6400T, Ubuntu 16.04.3), running only telegraf (1.5.0).

I observe a similar effect on another system (a bit more powerful), which serves as a Proxmox node (though with very little activity).

I have tried to find out what is going on during those peaks, but found nothing. No disk, CPU, or any other activity, nothing that could cause such load (on the second system it was above 4(!), while normal is around 0.1-0.2). Context switches, interrupts, fork rate: no changes, only this ephemeral load. Used memory changes are within 500KB, no anomalies.

Increasing the collection interval from 10 to 60 seconds significantly reduces the load, but it is still there (with the same period between spikes).

With collectd running on the same system (instead of telegraf), nothing like this happens.

@danielnelson (Contributor)

It would be useful to know if it matters what plugins are enabled, or if the load occurs with any plugin so long as there is enough traffic. I think the best way to check would be to enable only a single plugin and see if the issue still occurs; if it does, enable another single plugin and retest.

@akaDJon (Author) commented Jan 19, 2018

It would be useful to know if it matters what plugins are enabled, or if the load occurs with any plugin so long as there is enough traffic. I think the best way to check would be to enable only a single plugin and see if the issue still occurs; if it does, enable another single plugin and retest.

I checked that already. The load average is reduced, but the interval is the same. If very little data is collected, the load average spike is almost invisible.

@aldem commented Jan 19, 2018

It would be useful to know if it matters what plugins are enabled, or if the load occurs with any plugin so long as there is enough traffic.

Well, with a single plugin enabled (system, obviously), the situation is even "worse":

[screenshot]

Now I have a constant load of 1. I do not believe that querying the system load every 10 seconds could produce such load...

PS: After analyzing strace output, I am starting to suspect that this behavior is not related to Telegraf alone, as it only uses futex/epoll/pselect/read/write calls, and not that often. Most likely this is related to how Linux computes the load average based on process states, and several sleeping threads (depending on state and method) may cause such strange behavior (especially when user space is involved, as in the futex case).

@apooniajjn commented Feb 6, 2018

I am seeing the same behavior on hosts where the telegraf agent is installed, and it is happening every 7 hours. The CPU load increases and triggers an alert (I am also using Zabbix to monitor these hosts). All hosts where I have installed the telegraf agent show the same behavior. Setting collection_jitter = "3s" didn't solve the issue either.

@danielnelson (Contributor)

@apooniajjn When this issue occurs there does not seem to be a cpu increase, only load average. Please ask over at the InfluxData Community site and I'll help you there.

@apooniajjn

@danielnelson thanks .. yeah my bad I meant system load ... let me reach out to you there

@danielnelson (Contributor)

@apooniajjn If it matches this issue closely other than the period, then you can just use this issue. At this time it is unknown what might be causing the problem.

@apooniajjn

@danielnelson yeah it matches closely to this issue except the period ...

@ekeih commented Feb 6, 2018

I just want to point out that 7 hours = 4 × 1h45m 😉

@danielnelson (Contributor)

https://youtu.be/J9Y9GsPtbmQ

@8h2a commented Feb 22, 2018

I have the same issue, but with collectd (with rrdcached as backend, on Debian Stretch) instead of Telegraf:
[screenshot]

When searching the internet for "load every 105 minutes" you can find more instances of this problem that are unrelated to Telegraf.

@gentstr (Contributor) commented Feb 23, 2018

Here's a good article about how the load average is calculated in the Linux kernel and why this happens:
https://blog.avast.com/investigation-of-regular-high-load-on-unused-machines-every-7-hours
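If I'm reading it right, the arithmetic for this thread goes roughly like this (assuming the kernel's LOAD_FREQ of 5*HZ+1 ticks and Telegraf's default 10s interval): the kernel samples the run queue every 5 seconds plus one tick, so a collector that wakes exactly every 10 seconds drifts by two ticks per collection relative to the sampling grid. With HZ=250 (4 ms ticks) that is 8 ms of drift per 10 s, and sweeping through a full 5.004 s sampling period takes about 5.004 / 0.008 × 10 s ≈ 6255 s ≈ 1h 44m, which matches the period seen here; with HZ=1000 the same calculation gives ≈ 25005 s ≈ 6h 57m, matching the ~7 hour reports above.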

@Zbychad commented Feb 26, 2018

Very useful article, thanks for sharing. Based on that, we've changed "collection_jitter" from 0 to 5s. Here's the result:
[screenshot]

@danielnelson (Contributor)

@gentstr I think that pretty much explains it, thanks for the link. Though I do wonder why in our case the interference occurs so frequently, and not every 14 hours since most users probably have a 10s interval.

I'm going to close this issue since there isn't an action to take on our part; anyone who wants to reduce this artifact can use collection_jitter.
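For reference, the relevant agent settings look like this (the 5s values are only examples; anything that staggers collections relative to the kernel's 5-second load sampling helps):

[agent]
  interval = "10s"
  collection_jitter = "5s"
  flush_jitter = "5s"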

@dynek (Contributor) commented Oct 23, 2019

I was experiencing the same behaviour and really liked the article explaining the artefact.
I ended up using collection and flush jitter, and it went beyond what I was expecting: it even lowers the CPU frequencies (using the powersave governor):
[screenshot]

Note that this machine (an Intel NUC) is running a couple of virtual machines, each with telegraf installed.

@alpiua commented Oct 1, 2021

I faced the same issue, with a high load average every 6h 54m.
All systems except one were cured by setting jitter in the telegraf config.
The one remaining host had a problem with other software (the Vector pipeline tool also had an issue with its file buffers). Fixed that as well.
Thanks for the article above.

I'll add this to the collection:
https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
