Polling cycles may not always complete as expected #4460
Comments
What is the output of strace on one of the zombie processes? Try running strace -p (poller.php process) and show some of the output. Also, while the zombie poller.php is running, please show the output of this mysql statement.
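For reference, a rough way to capture that (the PID below is a placeholder, and the mysql call assumes the default cacti database name):
# find the stuck poller.php and attach strace to it
ps -eo pid,etime,args | grep '[p]oller.php'
strace -f -p 12345 -o /tmp/poller-strace.txt
# while it is still hung, see how much is sitting in poller_output
mysql cacti -e "SELECT COUNT(*) FROM poller_output;"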
I've attached 2 files with the requested information. MySQL is configured with no upper limit, and the filesystem in which the mysql files are created still has 5.6GB free space. Relevant configuration:
/dev/mapper/vg_root-lv_root 12G 5,6G 5,6G 51% /
And before anyone loses more time figuring out what may be the problem, I think I should include more background information. I reported this as a 1.2.19 problem, but really the problem started when I was running 1.2.18 (cacti) + 1.2.16 (spine). 2 weeks ago, after I spent many hours creating a new error graph template with % of discards/errors and finally understood (or kind of) CDEF, RPN etc., I went on a homogenization frenzy and ran a lot of both "sync graph to template" and "reapply suggested names", plus all the relevant CLI scripts (poller_reindex_hosts.php, rebuild_poller_cache, repair_database, repair_templates...). Some time (1 day maybe) after that I noticed that my weathermaps always had a 0 value on LINKs (weirdly, for me, cpu counters related to NODEs and not LINKs were alright; the TARGET of each LINK seemed right and pointed to existing rrd files which existed as local_id in the database). That was the moment I decided to move to 1.2.19 in the hope everything would fix itself. Now I have all plugins (I only use weathermap) disabled.
This database has been upgraded all the way from 0.8.something; maybe I restarted from scratch at some point in 1.something, not sure, but anyway it has seen a lot of upgrades and all kinds of host and graph template imports.
Another possibly meaningful piece of data: when I reach the "poller_output is full" state, poller_output is fixed at this value:
MariaDB [cacti]> select count(*) from poller_output;
MariaDB [cacti]> show table status like 'poller_output%';
MariaDB [cacti]> show create table poller_output;
Thank you very much!
I've raised max_heap_table_size to 140M, did a reboot, and more things are happening in the log (during the boost write-to-disk process).
Moved the mysql files to another partition, which fixed the space problem. Also raised max_heap_table_size, first to 140M (which gave a maximum of 361195 rows in poller_output), then to 500M, which should be enough to accommodate the million-row set as Maximum Records in Settings->Performance. I swear max_heap_table_size was 1G some weeks ago, but oh well, I also updated the systems, so maybe the mysql config file was overwritten at some point. Attached refreshed versions of the query and strace output:
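Roughly how I checked and persisted the setting, in case it is useful to anyone else (the config file path is from my setup and will differ on other distros):
# current limit and poller_output table status
mysql cacti -e "SHOW VARIABLES LIKE 'max_heap_table_size'; SHOW TABLE STATUS LIKE 'poller_output';"
# persist a larger value and restart MariaDB
printf '[mysqld]\nmax_heap_table_size = 512M\n' >> /etc/my.cnf.d/99-cacti.cnf
systemctl restart mariadb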
Execution of cacti/cli/poller_output_empty.php also hangs; it seems to enter an infinite while loop.
Yeah, I see the same thing: when you try to run poller_output_empty.php it just hangs. I'm having the exact same problem you are. It seems the output table doesn't clear; in my case any device that times out clogs that table and poller.php never completes.
…On Wed., Nov. 10, 2021, 07:48 Nicolás Victorero, ***@***.***> wrote:
Execution of cacti/cli/poller_output_empty.php also hangs; it seems to enter an infinite while loop.
Setting inside the loop a print "nico $rrdtool_pipe
rrds_processed=$rrds_processed\n"; results in:
OV01LNXVZ4:/opt/cacti # cli/poller_output_empty.php
nico Resource id #43 rrds_processed=0
nico Resource id #43 rrds_processed=0
[...]
Maybe this is related to the problems I'm having; it seems like the function process_poller_output in lib/poller.php is having problems. At the end, nothing seems to be deleted from the poller_output table. Trying to see where it's getting stuck.
Seems like these conditions are very rarely met. The rrd_update_array contains:
And the $items for that rrd file path:
For that run the rrd_update_array size was 1266, so I think both conditions were met. But I've put another print inside the loop, at the point where the DELETE statement should be generated, and it is reached, but k needs to be 10000 as well... mmm. Still trying to figure things out. The output of a 3-second log from poller_output.php, redirected to a file while print_r'ing $rrd_update_array and $item, is 55MB long.
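To check whether process_poller_output() ever drains the table between cycles, I'm watching the row count from a second shell (assuming the default cacti database name):
# the count should drop towards 0 after each cycle; if it only grows, the DELETE is never reached
watch -n 5 "mysql -N cacti -e 'SELECT COUNT(*) FROM poller_output'"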
By chance, are you seeing these messages in your OS logs?
I also see that when I run poller.php --force -d it eventually gets to a seg fault and the php script server crashes. Try turning off cron/cactid and running the poller manually the same way, and see if you get the same thing.
Oh BTW, which OS and php version are you running? I tried using the previous version of poller.php and the same issue is happening. I'm seeing segfaults for pcre2 on rhel 8 when poller.php exits.
SUSE Linux Enterprise Server 15 SP2
Running sudo -uwwwrun php -q /opt/cacti/cli/poller.php --force --debug ends after 299 seconds:
PHP Fatal error: Maximum execution time of 299 seconds exceeded in /opt/cacti/lib/poller.php on line 540
I see nothing like your errors in /var/log/messages, dmesg output or /opt/cacti/log/cacti.log.
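One thing I may try while debugging (only a guess, since Cacti can re-apply its own limit internally) is lifting the CLI time limit for a single run, so the poller can show where it is actually stuck instead of dying at 299 seconds:
# 0 = no PHP execution time limit for this one invocation
sudo -uwwwrun php -d max_execution_time=0 -q /opt/cacti/cli/poller.php --force --debug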
Hold off on closing for now; I am also seeing the same thing you are. I'm on rhel, not suse, but I am running php 7.4.
…On Fri., Nov. 12, 2021, 02:30 Nicolás Victorero, ***@***.***> wrote:
SUSE Linux Enterprise Server 15 SP2
RRDtool 1.7.0
Ver 15.1 Distrib 10.4.17-MariaDB
PHP 7.4.6 (cli) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
Running sudo -uwwwrun php -q /opt/cacti/cli/poller.php --force --debug
ends after 299 seconds
PHP Fatal error: Maximum execution time of 299 seconds exceeded in
/opt/cacti/lib/poller.php on line 540
I see nothing like your error in /var/log/messages dmesg output or
/opt/cacti/log/cacti.log
I've decided to start my cacti database from scratch, wondering if there is an easy way to recover or export/import at least the devices with their snmp configurations and user authorization (authentication is done against an external LDAP).
In my system it's not only that the poller processes are not clearing the poller_output table; the values I'm getting for many graphs seem to be broken, like registering the same information for the in and out counters.
Something went wrong at some point during my many changes of the last weeks, os upgrades, and graph, device and template cleanup/reapply etc., and I'm not sure when or why things broke.
I will have a lot of work to do remapping graph ids to weathermap configs, not to mention the historical data, but oh well...
I'll wait a few days and close this issue; I really don't think it is 1.2.19 related anyway.
Looking at the attachment you sent of the select statement output from poller_output, you have the same thing I'm seeing: entries from 3 polling cycles, and only the U values are sticking around.
OK, this is what I did and it seems to be working very well: I used lib/poller.php from 1.2.16 (I had it downloaded already; I am sure 1.2.18 will work too).
I took lib/poller.php from 1.2.18 and it also fixes my issue, so something is up with the 1.2.19 lib/poller.php.
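Roughly what I did, for anyone wanting to try the same workaround (paths and the tarball name are from my install; keep a backup so the 1.2.19 file can be restored once a fix lands):
# keep the 1.2.19 copy, then drop in the file from the 1.2.18 release tarball
cp /opt/cacti/lib/poller.php /opt/cacti/lib/poller.php.1.2.19.bak
tar -xzf cacti-1.2.18.tar.gz cacti-1.2.18/lib/poller.php
cp cacti-1.2.18/lib/poller.php /opt/cacti/lib/poller.php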
OK, I looked at that and didn't see anything obvious, but I'll take another look later. Remind me if I've not posted by Sunday :)
The only side effect so far is that poller_output doesn't empty out for devices that time out, so the entries creep up slowly. poller_output_empty.php still hangs and does nothing at all.
I guess I see the same issue: after updating to 1.2.19 from 1.2.15, the poller consumed all memory until the oom-killer got active. I increased the memory of that VM from 4 GB (which was more than enough with 1.2.15) to 16 GB, but now I see the maximum execution time message. Before, it took approx. 90 seconds to finish. I noticed mysqld being very busy as well, although "show processlist" never shows anything. Running on CentOS 7 using the EPEL 7 cacti rpm with standard EL7 php and mariadb. I have just downgraded back to 1.2.15 (yes, I know, downgrade...) and it's all back to normal.
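In case it helps pin down what mysqld was busy with, sampling the process list repeatedly tends to catch queries that an interactive "show processlist" misses (credentials are placeholders):
# dump the process list once per second while the poller is running away
mysqladmin -u root -p -i 1 processlist > /tmp/mysql-processlist.log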
I am also experiencing this issue with 1.2.19. I had no problems with approximately 6000 devices with ~24,000 data sources, but as I imported more, some time around reaching 7000 devices, the poller.php processes started to stack up on my main poller (I was running with 2 remote pollers) and the CPU load ran away until services started failing. I tried adding 2 more remote pollers (since their polling times were quite long) but that did not help. I run spine 1.2.19 on all instances. I have tried adding memory and CPUs to the main poller without any relief. Running poller.php --debug --force on the main instance produces the following results:
After which it hangs for an excessively long time until the maximum execution time runs out (for me, 10 minutes). Immediately running it again gives a 5-10 second pause at the same place before you get the rest of the output and the poller exits normally (logs taken from a previous run so the times are a little inconsistent):
It seems to crop up somewhat at random. Sometimes it hangs on the first poller run, sometimes I get 5-10 runs in before they start to run away. Based on other open issues I think this might be related to #4450 in that they may be describing the same problem. Reading the posts above it was recommended to run strace on one of the processes, and here is a section of the output. Obviously the output itself is very large so this is truncated, but I can provide more if it would be useful.
I don't know what additional output / information I can give to help diagnose this problem, but I will do my best to provide anything requested.
As an experiment, can you try using lib/poller.php from 1.2.18? In my testing that has resolved the issue. Seems there is an issue with the poller_output cleanup process. Also check your poller_output table for stale entries; I bet you will have a bunch.
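A rough query for spotting them (this assumes the stock poller_output columns; the 10 minute cutoff is arbitrary, anything older than one polling cycle is suspect):
mysql cacti -e "SELECT local_data_id, MIN(time) AS oldest, COUNT(*) AS stuck FROM poller_output GROUP BY local_data_id HAVING oldest < NOW() - INTERVAL 10 MINUTE ORDER BY stuck DESC LIMIT 20;"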
…On Fri., Nov. 12, 2021, 16:38 Chris Rowat, ***@***.***> wrote:
I am also experiencing this issue with 1.2.19. I had no problems with
approximately 6000 devices with ~24,000 data sources but as I imported more
some time around reaching 7000 devices the poller.php processes started
to stack up on my main poller (I was running with 2 remote pollers) and the
CPU load ran away until services started failing. I tried adding 2 more
remote pollers (since their polling times were quite long) but that did not
help. I run spine 1.2.19 on all instances. I have tried adding memory and
CPUs to the main poller without any relief.
Running poller.php --debug --force on the main instance produces the
following results:
# docker exec cacti_poller_1 /usr/local/bin/php /opt/cacti/cacti/poller.php --force --debug
2021-11-12 21:31:38 - POLLER: Poller[1] PID[6726] NOTE: Poller Int: '60', Cron Int: '60', Time Since Last: '2496.54', Max Runtime '58', Poller Runs: '1'
2021-11-12 21:31:38 - POLLER: Poller[1] PID[6726] WARNING: Cron is out of sync with the Poller Interval! The Poller Interval is '60' seconds, with a maximum of a '60' second Cron, but 2,496.5 seconds have passed since the last poll!
2021-11-12 21:31:38 - POLLER: Poller[1] PID[6726] WARNING: Poller Output Table not Empty. Issues: 2282, DS[28355, 45122, 7779, 45335, 45335, 45138, 7932, 7962, 7403, 7832, 45503, 41431, 45391, 45174, 45503, 12510, 44763, 28345, 45201, 5835], Additional Issues Remain. Only showing first 20
2021-11-12 21:31:39 - POLLER: Poller[1] PID[6726] DEBUG: About to Spawn a Remote Process [CMD: /opt/cacti/spine, ARGS: --poller=1 --first=0 --last=1 --mibs]
Waiting on 1 of 1 pollers.
2021-11-12 21:31:40 - POLLER: Poller[1] PID[6726] Parsed MULTI output field '1min:0.10' [map 1min->load_1min]
2021-11-12 21:31:40 - POLLER: Poller[1] PID[6726] Parsed MULTI output field '5min:0.22' [map 5min->load_5min]
2021-11-12 21:31:40 - POLLER: Poller[1] PID[6726] Parsed MULTI output field '10min:1.10' [map 10min->load_15min]
After which it hangs for an excessively long time until the maximum
execution time runs out (for me, 10 minutes)
Immediately running it again gives a 5-10 second pause at the same place
before you get the rest of the output and the poller exits normally (logs
taken from a previous run so the times are a little inconsistent):
2021-11-12 20:21:49 - POLLER: Poller[Main Poller] PID[4489] DEBUG: About to Spawn a Remote Process [CMD: /usr/local/bin/php, ARGS: -q '/opt/cacti/cacti/poller_maintenance.php']
2021-11-12 20:21:49 - POLLER: Poller[Main Poller] PID[4489] DEBUG: About to Spawn a Remote Process [CMD: /usr/local/bin/php, ARGS: -q '/opt/cacti/cacti/poller_automation.php' -M]
2021-11-12 20:21:49 - POLLER: Poller[Main Poller] PID[4489] DEBUG: About to Spawn a Remote Process [CMD: /usr/local/bin/php, ARGS: -q '/opt/cacti/cacti/poller_spikekill.php']
2021-11-12 20:21:49 - POLLER: Poller[Main Poller] PID[4489] DEBUG: About to Spawn a Remote Process [CMD: /usr/local/bin/php, ARGS: -q /opt/cacti/cacti/poller_reports.php]
2021-11-12 20:21:49 - POLLER: Poller[Main Poller] PID[4489] DEBUG: About to Spawn a Remote Process [CMD: /usr/local/bin/php, ARGS: -q /opt/cacti/cacti/poller_boost.php]
It seems to crop up somewhat at random. Sometimes it hangs on the first
poller run, sometimes I get 5-10 runs in before they start to run away.
Based on other open issues I think this might be related to #4450 in that they may be describing the same problem.
I don't know what additional output / information I can give to help
diagnose this problem but I will do my best to provide anything requested.
The EAGAIN message smells to me of spine. @bmfmancini try using the latest poller.php, but the 1.2.16 version of spine? The real question is what is causing the timeout/resource issue. It definitely seems race related.
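It is also worth confirming which spine binary and version each poller is actually invoking; something like this should do it (the binary path is taken from the spine config shown in the original report, and the settings name is the usual one but treat it as an assumption):
# reported spine version, and the path Cacti is configured to call
/usr/local/spine/bin/spine --version
mysql cacti -e "SELECT name, value FROM settings WHERE name LIKE 'path_spine%';"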
I am using spine 1.2.19 |
Is this related to #205?
It may be @skyjou but alas, we had no feedback on that issue. This does seem to be a race condition of sorts with a large number of devices. So far, I haven't been able to reproduce it myself on systems with ~1000 devices, but the conditions to make it happen may not be occurring there. |
I took the plunge and started a new cacti install, so I cannot provide more feedback (I saved a DB dump).
That particular block of code hasn't been changed in three years, but the surrounding code has, in commit 43baa25, where I think the logic is going wrong or is becoming too recursive, thus eating up CPU/cycles. I've given @bmfmancini a patch to try out before I release it to the wild, but your comments helped me see something that I missed when I looked over that code last week.
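For anyone running from a git checkout who wants to see the change in question, something along these lines shows the recent history of that file and the diff of the suspect commit:
# recent changes to lib/poller.php, then the commit itself
git -C /opt/cacti log --oneline -n 5 -- lib/poller.php
git -C /opt/cacti show 43baa25 -- lib/poller.php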
The above commit contains changes that should help things complete. Please let us know if there are any further issues.
I pulled the committed files and when my poller ran the next time, I received the following in my cacti.log file. In the lines around 517, "issues_list" was used; I removed the "s" and the poller worked as expected during the next poll.
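For anyone checking their own copy, a quick grep shows whether the two spellings are mixed up around that line (the path assumes an /opt/cacti install):
# all hits should use the same spelling; a mismatch near line 517 is the typo
grep -nE 'issues?_list' /opt/cacti/lib/poller.php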
I have been testing with the above commit as well. The runaway poller problem has not occurred in the past 8 hours, so that part feels like it is fixed. I did run into some leftover entries in poller_output, though.
It's not always the exact same number of entries left over, but it usually contains exactly the same list of interfaces. Perhaps it just didn't run well enough to get those before. I am not 100% sure they are related, but they are unexpected (to me at least).
I should have corrected the typo now. |
Hello, so far it's running well again; will report back with news. After 8 days everything keeps working OK. No more problems.
Closing as this one is a duplicate.
I have a setup with one main poller and 2 additional pollers, with boost enabled.
Upgraded recently from 1.2.18 to 1.2.19. I was running spine 1.2.16, because more recent versions (1.2.17 or 18) were giving errors; now I'm running both cacti and spine 1.2.19. Apparently spine ends without problems, but the "php /opt/cacti/poller.php" process is unable to finish on the main poller, and stays there eating memory and cpu. It ends up being killed after 299 seconds. Weirdly enough, 3 of those processes are running at the same time, which is strange given they are killed after 299 seconds and cron is set to run every 300 seconds.
It happens every polling interval. spine ends on all pollers without apparent problems, but cmd.php keeps running on main. Most of the actual polling is run on one of the remote pollers; the main poller has no device associated (now it has one for testing).
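This is roughly how I see the overlapping processes and how long each has been running, in seconds:
# any poller.php older than ~300 seconds is a leftover from a previous cycle
ps -eo pid,etimes,args | grep '[p]oller.php'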
For a typical polling cycle, this is the final output I see in the log file of each poller:
poller 01:
2021/11/08 14:30:41 - SYSTEM STATS: Time:38.2217 Method:spine Processes:1 Threads:30 Hosts:170 HostsPerProcess:170 DataSources:20848 RRDsProcessed:0
poller 02:
2021/11/08 14:30:07 - SYSTEM STATS: Time:5.1605 Method:spine Processes:1 Threads:30 Hosts:27 HostsPerProcess:27 DataSources:468 RRDsProcessed:0
main poller:
2021/11/08 14:30:03 - SPINE: Poller[1] PID[5695] PT[140715085393664] Time: 0.5676 s, Threads: 30, Devices: 1
2021/11/08 14:30:03 - SPINE: Poller[1] PID[5731] PT[139950535016192] Time: 0.5673 s, Threads: 30, Devices: 2
After a time I'll see a timeout line, but weirdly more like 10 minutes after start (not 5, although the log says timeout after 299 seconds):
2021/11/08 14:39:55 - ERROR PHP ERROR: Maximum execution time of 299 seconds exceeded in file: /opt/cacti/lib/poller.php on line: 462
2021/11/08 14:39:55 - CMDPHP PHP ERROR Backtrace: (CactiShutdownHandler())
While these apparently zombie poller.php processes are running, mysql cpu usage goes up (and stays there, since by the time one process is killed up to 3 new processes will be running).
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3521 mysql 20 0 2827852 837396 24028 S 103,6 20,83 56:01.36 /usr/sbin/mysqld --defaults-file=/etc/my.cnf --user=mysql
5659 wwwrun 20 0 353616 224988 17760 R 52,32 5,596 2:54.58 php /opt/cacti/poller.php
5815 wwwrun 20 0 329912 201416 17748 S 42,72 5,009 0:23.14 php /opt/cacti/poller.php
I stopped the cron calls to poller.php on all 3 servers, changed the poller to cmd.php and tried to run poller.php --debug --force, but I see no logging lines. Tried this also with the debug level set to maximum (DEVEL) in the webui settings, but it made no difference.
sudo -uwwwrun php /opt/cacti/poller.php --debug --force
2021/11/08 12:05:08 - POLLER: Poller[1] PID[3135] NOTE: Poller Int: '300', Cron Int: '300', Time Since Last: '72.33', Max Runtime '298', Poller Runs: '1'
2021/11/08 12:05:09 - POLLER: Poller[1] PID[3135] WARNING: Poller Output Table not Empty. Issues: 0, DS[64, 74, 23793, 24258, 864, 20776, 24410, 22623, 8301, 8302, 23817, 23817, 23818, 23818, 23819, 23819, 20776, 8301, 20775, 24411]
2021/11/08 12:05:10 - POLLER: Poller[1] PID[3135] DEBUG: About to Spawn a Remote Process [CMD: /usr/local/spine/bin/spine, ARGS: -C '/etc/cacti/spine.conf' --poller=1 --first=0 --last=0 --mibs]
Waiting on 1 of 1 pollers.
PHP Fatal error: Maximum execution time of 299 seconds exceeded in /opt/cacti/lib/database.php on line 287
I don't have a clue what it is doing until the timeout. I don't see information in the log or console.
I'm running SLES15SP2 (main and 1 poller) and Debian 11 (another poller).