Polling cycles may not always complete as expected #4460

Closed · nicolatron opened this issue Nov 8, 2021 · 31 comments

Labels: bug (Undesired behaviour), confirmed (Bug is confirmed by dev team), duplicate (Duplicate of another issue), poller (Data Collection related issue), resolved (A fixed issue), unverified (Some days we don't have a clue)
Milestone: v1.2.20
@nicolatron commented Nov 8, 2021

I have a setup with one main poller and 2 additional pollers, with boost enabled.

I recently upgraded from 1.2.18 to 1.2.19. I had been running spine 1.2.16, because the more recent versions (1.2.17 or 1.2.18) were giving errors; now both Cacti and spine are on 1.2.19. Spine apparently finishes without problems, but the "php /opt/cacti/poller.php" process never finishes on the main poller and sits there eating memory and CPU until it is killed after 299 seconds. Strangely enough, 3 of those processes end up running at the same time, which should not be possible if they are killed after 299 seconds and cron runs every 300 seconds.

It happens every polling interval: spine finishes on all pollers without apparent problems, but poller.php keeps running on the main poller. Most of the actual polling is done by one of the remote pollers; the main poller has no devices associated (now it has one for testing).

For a typical polling cycle, this is the final output I see in the log file of each poller:
poller 01:
2021/11/08 14:30:41 - SYSTEM STATS: Time:38.2217 Method:spine Processes:1 Threads:30 Hosts:170 HostsPerProcess:170 DataSources:20848 RRDsProcessed:0
poller 02:
2021/11/08 14:30:07 - SYSTEM STATS: Time:5.1605 Method:spine Processes:1 Threads:30 Hosts:27 HostsPerProcess:27 DataSources:468 RRDsProcessed:0
main poller:
2021/11/08 14:30:03 - SPINE: Poller[1] PID[5695] PT[140715085393664] Time: 0.5676 s, Threads: 30, Devices: 1
2021/11/08 14:30:03 - SPINE: Poller[1] PID[5731] PT[139950535016192] Time: 0.5673 s, Threads: 30, Devices: 2
After a while I see a timeout line, but strangely it shows up more like 10 minutes after the start (not 5, even though the log says the timeout is 299 seconds):
2021/11/08 14:39:55 - ERROR PHP ERROR: Maximum execution time of 299 seconds exceeded in file: /opt/cacti/lib/poller.php on line: 462
2021/11/08 14:39:55 - CMDPHP PHP ERROR Backtrace: (CactiShutdownHandler())

While these apparently zombie poller.php processes are running, MySQL CPU usage goes up (and stays up, because by the time one process is killed up to 3 new ones are already running).

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3521 mysql 20 0 2827852 837396 24028 S 103,6 20,83 56:01.36 /usr/sbin/mysqld --defaults-file=/etc/my.cnf --user=mysql
5659 wwwrun 20 0 353616 224988 17760 R 52,32 5,596 2:54.58 php /opt/cacti/poller.php
5815 wwwrun 20 0 329912 201416 17748 S 42,72 5,009 0:23.14 php /opt/cacti/poller.php

I stopped the cron calls to poller.php on all 3 servers, switched the poller to cmd.php and tried to run poller.php --debug --force, but I see no logging lines. I also tried with the debug level set to maximum (DEVEL) in the web UI settings, but it made no difference.

sudo -uwwwrun php /opt/cacti/poller.php --debug --force
2021/11/08 12:05:08 - POLLER: Poller[1] PID[3135] NOTE: Poller Int: '300', Cron Int: '300', Time Since Last: '72.33', Max Runtime '298', Poller Runs: '1'
2021/11/08 12:05:09 - POLLER: Poller[1] PID[3135] WARNING: Poller Output Table not Empty. Issues: 0, DS[64, 74, 23793, 24258, 864, 20776, 24410, 22623, 8301, 8302, 23817, 23817, 23818, 23818, 23819, 23819, 20776, 8301, 20775, 24411]
2021/11/08 12:05:10 - POLLER: Poller[1] PID[3135] DEBUG: About to Spawn a Remote Process [CMD: /usr/local/spine/bin/spine, ARGS: -C '/etc/cacti/spine.conf' --poller=1 --first=0 --last=0 --mibs]
Waiting on 1 of 1 pollers.
PHP Fatal error: Maximum execution time of 299 seconds exceeded in /opt/cacti/lib/database.php on line 287

I have no clue what it is doing until the timeout; I see no information in the log or on the console.

I'm running SLES15SP2 (main and 1 poller) and Debian 11 (another poller).

nicolatron added the bug (Undesired behaviour) and unverified (Some days we don't have a clue) labels on Nov 8, 2021
@bmfmancini (Member) commented Nov 8, 2021

What is the output of strace on one of the zombie processes?

Try running strace -p <PID of the poller.php process> and show some of the output.

Also, while the zombie poller.php is running, please show the output of this MySQL statement:

select * from poller_output where output = "U";
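
For reference, a quick way to watch whether those 'U' rows ever drain while the stuck poller.php is running — a rough sketch only, assuming it is dropped into cacti/cli/ so the standard CLI include path resolves and Cacti's db_fetch_cell() helper is available:

<?php
// watch_poller_output.php -- hypothetical helper, not shipped with Cacti.
// Prints how many poller_output rows are still pending ('U') every 5 seconds.
include(dirname(__FILE__) . '/../include/cli_check.php');

while (true) {
    $pending = db_fetch_cell("SELECT COUNT(*) FROM poller_output WHERE output = 'U'");
    $total   = db_fetch_cell('SELECT COUNT(*) FROM poller_output');

    printf("%s  pending=%d  total=%d\n", date('H:i:s'), $pending, $total);

    if ($total == 0) {
        break;   // table drained, so the poller finished processing this cycle
    }

    sleep(5);
}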

@nicolatron (Author) commented Nov 9, 2021

I've attached 2 files with the requested information.
I've now noticed more problems. After about 4-5 polling runs I get these errors:
2021/11/09 08:10:03 - SYSTEM WARNING: Primary Admin account notifications disabled! Unable to send administrative Email.
2021/11/09 08:10:07 - SPINE: Poller[1] PID[10032] PT[140558032713472] ERROR: SQL Failed! Error:'1114', Message:'The table 'poller_output' is full', SQL Fragment:'INSERT INTO poller_output (local_data_id, rrd_name, time, output) VALUES (24075, 'ping', FROM_UNIXTIME(1636441805), '1.73'),(22963, '5min_cpu', FROM_UNIXTIME(1636441805), '6') ON DUPLICATE KEY UPDATE output=VALUES(output)'
2021/11/09 08:10:58 - ERROR PHP ERROR: Maximum execution time of 299 seconds exceeded in file: /opt/cacti/lib/poller.php on line: 462
2021/11/09 08:10:58 - CMDPHP PHP ERROR Backtrace: (CactiShutdownHandler())

MySQL is configured with no upper limit, and the filesystem where the MySQL files are created still has 5.6GB of free space. Relevant configuration:
collation_server=utf8mb4_unicode_ci
character_set_server=utf8mb4
max_connections=250
max_allowed_packet=16777216
max_heap_table_size=70M
tmp_table_size=70M
join_buffer_size=135M
innodb_file_per_table=ON
innodb_file_format=Barracuda
innodb_large_prefix=1
innodb_buffer_pool_size=1024M
innodb_doublewrite=OFF
innodb_lock_wait_timeout=50
innodb_flush_method=O_DIRECT
innodb_flush_log_at_timeout=3
innodb_read_io_threads=8
innodb_write_io_threads=4
innodb_buffer_pool_instances=9
innodb_io_capacity=200
innodb_io_capacity_max=2000
innodb_use_atomic_writes=ON
innodb_data_file_path = ibdata1:10M:autoextend
skip-name-resolve

/dev/mapper/vg_root-lv_root 12G 5,6G 5,6G 51% /

And before anyone loses more time figuring out what the problem may be, I think I should include more background. I reported this as a 1.2.19 problem, but the problem really started when I was running 1.2.18 (Cacti) + 1.2.16 (spine). Two weeks ago, after spending many hours creating a new error graph template with % of discards/errors, and finally (sort of) understanding CDEF, RPN etc., I went on a homogenization frenzy and ran a lot of "sync graphs to template", "reapply suggested names" and every possible CLI script (poller_reindex_hosts.php, rebuild_poller_cache, repair_database, repair_templates...). Some time (maybe a day) after that I noticed that my weathermaps always showed 0 on LINKs (weirdly, for me, the CPU counters tied to NODEs rather than LINKs were fine, and the TARGET of each LINK seemed right, pointing to rrd files that existed as local_ids in the database). That was when I decided to move to 1.2.19, in the hope everything would fix itself. Now I have all plugins disabled (I only use weathermap). This database has been upgraded all the way from 0.8.something; maybe I restarted from scratch at some point in 1.something, I'm not sure, but in any case it has seen a lot of upgrades and all kinds of host and graph template imports.

Another possibly meaningful piece of data: when I reach the "poller_output is full" state, the poller_output row count is stuck at this value:

MariaDB [cacti]> select count(*) from poller_output;
+----------+
| count(*) |
+----------+
|   180810 |
+----------+
1 row in set (0,000 sec)

MariaDB [cacti]> show table status like 'poller_output%';
+------------------------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+--------------------+---------+------------------+-----------+
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time | Update_time | Check_time | Collation | Checksum | Create_options | Comment | Max_index_length | Temporary |
+------------------------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+--------------------+---------+------------------+-----------+
| poller_output | MEMORY | 10 | Fixed | 180810 | 286 | 52078400 | 51705368 | 9312191 | 0 | NULL | 2021-11-08 16:45:02 | NULL | NULL | utf8mb4_unicode_ci | NULL | row_format=DYNAMIC | | 0 | N |
| poller_output_boost | InnoDB | 10 | Dynamic | 4189381 | 76 | 318767104 | 0 | 0 | 7340032 | NULL | 2021-11-08 15:05:02 | 2021-11-09 08:25:17 | NULL | utf8mb4_unicode_ci | NULL | row_format=DYNAMIC | | 0 | N |
| poller_output_boost_local_data_ids | MEMORY | 10 | Fixed | 0 | 9 | 0 | 13762557 | 0 | 0 | NULL | 2021-11-08 16:48:22 | NULL | NULL | utf8mb4_unicode_ci | NULL | | | 0 | N |
| poller_output_boost_processes | InnoDB | 10 | Dynamic | 2 | 8192 | 16384 | 0 | 0 | 0 | 3 | 2020-02-13 13:09:52 | NULL | NULL | utf8mb4_unicode_ci | NULL | row_format=DYNAMIC | | 0 | N |
| poller_output_realtime | InnoDB | 10 | Dynamic | 0 | 0 | 16384 | 0 | 32768 | 0 | NULL | 2021-09-20 12:16:30 | NULL | NULL | utf8mb4_unicode_ci | NULL | | | 0 | N |
+------------------------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+------------+--------------------+----------+--------------------+---------+------------------+-----------+
5 rows in set (0,001 sec)

MariaDB [cacti]> show create table poller_output;
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| poller_output | CREATE TABLE poller_output (
local_data_id int(10) unsigned NOT NULL DEFAULT 0,
rrd_name varchar(19) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
time timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
output varchar(50) COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
PRIMARY KEY (local_data_id,rrd_name,time) USING BTREE
) ENGINE=MEMORY DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC |
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0,000 sec)
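
For context, poller_output is a MEMORY-engine table, so its capacity is bounded by max_heap_table_size — roughly max_heap_table_size divided by the average row length — which is why the row count above stops growing at a fixed value once that ceiling is hit. A rough way to check the ceiling (a sketch only, assuming Cacti's CLI include path and its db_fetch_row()/db_fetch_cell() helpers):

<?php
// heap_capacity_check.php -- hypothetical helper, not shipped with Cacti.
// Estimates how many rows the MEMORY-backed poller_output table can hold.
include(dirname(__FILE__) . '/../include/cli_check.php');

$status = db_fetch_row("SHOW TABLE STATUS LIKE 'poller_output'");
$heap   = db_fetch_cell('SELECT @@max_heap_table_size');

if (cacti_sizeof($status) && $status['Avg_row_length'] > 0) {
    printf("engine=%s rows=%d avg_row_len=%d heap_limit=%d approx_row_cap=%d\n",
        $status['Engine'],
        $status['Rows'],
        $status['Avg_row_length'],
        $heap,
        intdiv((int) $heap, (int) $status['Avg_row_length']));
}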

Thank you very much!

select_result.txt
strace_output.txt

@nicolatron (Author) commented:
I've raised max_heap_table_size to 140M and did a reboot; now more things are happening in the log (during the boost write-to-disk process):
2021/11/09 09:11:05 - CMDPHP ERROR: A DB Row Failed!, Error: Error writing file '/tmp/MYEGVfxQ' (Errcode: 28 "No space left on device")
I've asked for 30G more disk for the VM.

@nicolatron (Author) commented Nov 9, 2021

I moved the MySQL files to another partition, which fixed the space problem. I also raised max_heap_table_size, first to 140M (which gave a maximum of 361195 rows in poller_output) and then to 500M, which should be enough to accommodate the one million rows set as the maximum Records in Settings->Performance. I swear max_heap_table_size was 1G some weeks ago, but oh well; I also updated the systems, so maybe the MySQL config file was overwritten at some point.
I also cleaned up a lot of polling errors, like data sources/graphs with wrong values, or polling of non-existent OIDs.
The situation is still as described in the first post: the poller keeps running on the main poller until the timeout.

Attached are refreshed versions of the query and strace:
select_result2.txt
strace2.txt

@nicolatron (Author) commented:
Execution of cacti/cli/poller_output_empty.php also hangs; it seems to enter an infinite while loop.
Adding a print "nico $rrdtool_pipe rrds_processed=$rrds_processed\n"; inside the loop results in:
OV01LNXVZ4:/opt/cacti # cli/poller_output_empty.php
nico Resource id #43 rrds_processed=0
nico Resource id #43 rrds_processed=0
[...]
Maybe this is related to the problems I'm having; it looks like the function process_poller_output in lib/poller.php is having trouble. In the end, nothing seems to be deleted from the poller_output table.
I'm trying to see where it's getting stuck.

@bmfmancini (Member) commented Nov 10, 2021 via email

@nicolatron (Author) commented Nov 10, 2021

It seems these conditions are very rarely met:
if (isset($rrd_update_array[$rrd_path]['times'][$unix_time])) {
if ($item['rrd_num'] <= cacti_sizeof($rrd_update_array[$rrd_path]['times'][$unix_time])) {

The rrd_update_array contains:

(
    [/opt/cacti/rra/core-vss_cisco_memfree_32.rrd] => Array
        (
            [local_data_id] => 32
            [times] => Array
                (
                    [1636550705] => Array
                        (
                            [cisco_memfree] => 755039240
                        )

                )

        )

And the $item for that rrd file path:
nico end rrd_update_array

Array
(
    [output] => 755039240
    [time] => 2021-11-10 14:25:05
    [unix_time] => 1636550705
    [local_data_id] => 32
    [data_template_id] => 53
    [rrd_path] => /opt/cacti/rra/core-vss_cisco_memfree_32.rrd
    [rrd_name] => cisco_memfree
    [rrd_num] => 1
)

and for that run the rrd_update_array size was 1266, so I think both conditions were met. But I've put another debug statement inside the loop, where the DELETE statement should be generated, and it isn't reached; $k needs to hit a multiple of 10000 as well... hmm.

Still trying to figure things out. The output of a 3-second log from poller_output.php, redirected to a file while print_r'ing $rrd_update_array and $item, is 55MB long.
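
To make those two conditions concrete, here is a stand-alone re-check against the sample data above (hypothetical stand-ins for the real poller state, using plain count() in place of Cacti's cacti_sizeof()):

<?php
// Simplified re-check of the two quoted conditions using the sample arrays
// from this comment. Illustration only, not the real process_poller_output().
$rrd_update_array = array(
    '/opt/cacti/rra/core-vss_cisco_memfree_32.rrd' => array(
        'local_data_id' => 32,
        'times' => array(
            1636550705 => array('cisco_memfree' => 755039240),
        ),
    ),
);

$item = array(
    'rrd_path'  => '/opt/cacti/rra/core-vss_cisco_memfree_32.rrd',
    'unix_time' => 1636550705,
    'rrd_num'   => 1,
);

$rrd_path  = $item['rrd_path'];
$unix_time = $item['unix_time'];

if (isset($rrd_update_array[$rrd_path]['times'][$unix_time])) {
    if ($item['rrd_num'] <= count($rrd_update_array[$rrd_path]['times'][$unix_time])) {
        // Both conditions hold for this sample: one value collected, rrd_num = 1.
        echo 'both conditions met for ' . $rrd_path . "\n";
    }
}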

@bmfmancini (Member) commented:
By chance, are you seeing these messages in your OS logs?

[1124946.932296] php[4171170]: segfault at 7fff25bd3fe0 ip 00007fa5fe087d64 sp 00007fff25bd3fd0 error 6 in libpcre2-8.so.0.7.1[7fa5fe043000+83000]
[1124946.935261] Code: 10 00 00 48 83 0c 24 00 4c 39 dc 75 ef 48 81 ec 88 02 00 00 4c 8b a4 24 c0 52 00 00 48 89 4c 24 38 48 89 c8 49 89 ff 49 89 f6 <4c> 89 4c 24 10 48 01 f0 49 89 d2 64 48 8b 1c 25 28 00 00 00 48 89
[1125310.011036] php[4174275]: segfault at 7ffea47a7ff8 ip 00007fbb3572dd56 sp 00007ffea47a7fc0 error 6 in libpcre2-8.so.0.7.1[7fbb356e9000+83000]
[1125310.013465] Code: 55 53 4c 8d 9c 24 00 b0 ff ff 48 81 ec 00 10 00 00 48 83 0c 24 00 4c 39 dc 75 ef 48 81 ec 88 02 00 00 4c 8b a4 24 c0 52 00 00 <48> 89 4c 24 38 48 89 c8 49 89 ff 49 89 f6 4c 89 4c 24 10 48 01 f0

I also see that when I run poller.php --force -d it eventually hits a segfault and the PHP script server crashes.

Try turning off cron/cactid and running the poller manually the same way, to see if you get the same thing.

@bmfmancini (Member) commented Nov 12, 2021

Oh BTW, which OS and PHP version are you running?

I tried using the previous version of poller.php and the same issue is happening.

I'm seeing segfaults for pcre2 on RHEL 8 when poller.php exits.

@nicolatron (Author) commented Nov 12, 2021

SUSE Linux Enterprise Server 15 SP2
RRDtool 1.7.0
Ver 15.1 Distrib 10.4.17-MariaDB
PHP 7.4.6 (cli) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies

Running sudo -uwwwrun php -q /opt/cacti/cli/poller.php --force --debug ends after 299 seconds

PHP Fatal error: Maximum execution time of 299 seconds exceeded in /opt/cacti/lib/poller.php on line 540

I see nothing like your errors in /var/log/messages, the dmesg output or /opt/cacti/log/cacti.log.
I've decided to start my Cacti database from scratch, and I'm wondering whether there is an easy way to recover or export/import at least the devices with their SNMP configuration and the user authorization (authentication is done against an external LDAP).
On my system it's not only that the poller processes are not clearing the poller_output table; the values I'm getting for many graphs seem to be broken, like registering the same information for both the in and out counters.
Something went wrong at some point during my many changes of the last weeks (OS upgrades, graph, device and template cleanup/reapply, etc.), and I'm not sure when or why things broke.
I'll have a lot of work remapping graph IDs to weathermap configs, not to mention the historical data, but oh well...
I'll wait a few days and then close this issue; I don't think it is really 1.2.19 related anyway.

@bmfmancini (Member) commented Nov 12, 2021 via email

@bmfmancini (Member) commented:
Looking at the attachment you posted of the select statement from poller_output, you have the same thing I'm seeing: entries from 3 polling cycles, and only the 'U' values are sticking around.

@bmfmancini (Member) commented:
OK, this is what I did and it seems to be working very well:

I used lib/poller.php from 1.2.16 (I had it downloaded already; I am sure 1.2.18 will work too).
Polling is working just fine now.

@bmfmancini (Member) commented Nov 12, 2021 via email

@netniV (Member) commented Nov 12, 2021

OK, I looked at that and didn't see anything obvious, but I'll take another look later. Remind me if I've not posted by Sunday :)

@bmfmancini (Member) commented:
The only side effect so far is that poller_output doesn't empty out for devices that time out, so the entries creep up slowly; there was a fix for this not so long ago.

poller_output_empty.php still hangs and does nothing at all.

@gvde commented Nov 12, 2021

I guess I'm seeing the same issue: after updating from 1.2.15 to 1.2.19, the poller consumed all memory until the oom-killer kicked in. I increased the memory of that VM from 4 GB (which was more than enough with 1.2.15) to 16 GB, but now I see the maximum execution time message. Before, it took approx. 90 seconds to finish.

I also noticed mysqld being very busy, although "show processlist" never shows anything.

Running on CentOS 7 using the EPEL 7 cacti rpm with standard EL7 php and mariadb:
php-5.4.16-48.el7.x86_64
mariadb-5.5.68-1.el7.x86_64

I have just downgraded back to 1.2.15 (yes, I know, downgrade...) and it's all back to normal.

@ctrowat (Contributor) commented Nov 12, 2021

I am also experiencing this issue with 1.2.19. I had no problems with approximately 6000 devices and ~24,000 data sources, but as I imported more, somewhere around 7000 devices the poller.php processes started to stack up on my main poller (I was running with 2 remote pollers) and the CPU load ran away until services started failing. I tried adding 2 more remote pollers (since their polling times were quite long), but that did not help. I run spine 1.2.19 on all instances. I have also tried adding memory and CPUs to the main poller without any relief.

Running poller.php --debug --force on the main instance produces the following results:

# docker exec cacti_poller_1 /usr/local/bin/php /opt/cacti/cacti/poller.php --force --debug
2021-11-12 21:31:38 - POLLER: Poller[1] PID[6726] NOTE: Poller Int: '60', Cron Int: '60', Time Since Last: '2496.54', Max Runtime '58', Poller Runs: '1'
2021-11-12 21:31:38 - POLLER: Poller[1] PID[6726] WARNING: Cron is out of sync with the Poller Interval!  The Poller Interval is '60' seconds, with a maximum of a '60' second Cron, but 2,496.5 seconds have passed since the last poll!
2021-11-12 21:31:38 - POLLER: Poller[1] PID[6726] WARNING: Poller Output Table not Empty.  Issues: 2282, DS[28355, 45122, 7779, 45335, 45335, 45138, 7932, 7962, 7403, 7832, 45503, 41431, 45391, 45174, 45503, 12510, 44763, 28345, 45201, 5835], Additional Issues Remain.  Only showing first 20
2021-11-12 21:31:39 - POLLER: Poller[1] PID[6726] DEBUG: About to Spawn a Remote Process [CMD: /opt/cacti/spine, ARGS:  --poller=1 --first=0 --last=1 --mibs]
Waiting on 1 of 1 pollers.
2021-11-12 21:31:40 - POLLER: Poller[1] PID[6726] Parsed MULTI output field '1min:0.10' [map 1min->load_1min]
2021-11-12 21:31:40 - POLLER: Poller[1] PID[6726] Parsed MULTI output field '5min:0.22' [map 5min->load_5min]
2021-11-12 21:31:40 - POLLER: Poller[1] PID[6726] Parsed MULTI output field '10min:1.10' [map 10min->load_15min]

After which it hangs for an excessively long time until the maximum execution time runs out (for me, 10 minutes)

Immediately running it again gives a 5-10 second pause at the same place before you get the rest of the output and the poller exits normally (logs taken from a previous run so the times are a little inconsistent):


2021-11-12 20:21:49 - POLLER: Poller[Main Poller] PID[4489] DEBUG: About to Spawn a Remote Process [CMD: /usr/local/bin/php, ARGS: -q '/opt/cacti/cacti/poller_maintenance.php']
2021-11-12 20:21:49 - POLLER: Poller[Main Poller] PID[4489] DEBUG: About to Spawn a Remote Process [CMD: /usr/local/bin/php, ARGS: -q '/opt/cacti/cacti/poller_automation.php' -M]
2021-11-12 20:21:49 - POLLER: Poller[Main Poller] PID[4489] DEBUG: About to Spawn a Remote Process [CMD: /usr/local/bin/php, ARGS: -q '/opt/cacti/cacti/poller_spikekill.php']
2021-11-12 20:21:49 - POLLER: Poller[Main Poller] PID[4489] DEBUG: About to Spawn a Remote Process [CMD: /usr/local/bin/php, ARGS: -q /opt/cacti/cacti/poller_reports.php]
2021-11-12 20:21:49 - POLLER: Poller[Main Poller] PID[4489] DEBUG: About to Spawn a Remote Process [CMD: /usr/local/bin/php, ARGS: -q /opt/cacti/cacti/poller_boost.php]

It seems to crop up somewhat at random. Sometimes it hangs on the first poller run, sometimes I get 5-10 runs in before they start to run away.

Based on other open issues I think this might be related to #4450 in that they may be describing the same problem.

Reading the posts above, it was recommended to run strace on one of the processes; here is a section of the output. Obviously the full output is very large, so this is truncated, but I can provide more if it would be useful.

read(3, 0x556e6aef5df3, 5)              = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 86400000) = 1 ([{fd=3, revents=POLLIN}])
read(3, "\27\3\3\37\343", 5)            = 5
read(3, "\0\223\346\210\203\373\233\323\227\302\337\255?\210;\320\362\321\5`$t\304\217c\3\255>\327\363&_"..., 8163) = 8163
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
write(3, "\27\3\3\1\274+E\240\336e\306\376dO\301ow=\260\311\25>9\256\223dp\2559\372\340X"..., 449) = 449
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
read(3, 0x556e6aef5df3, 5)              = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 86400000) = 1 ([{fd=3, revents=POLLIN}])
read(3, "\27\3\3\0\34", 5)              = 5
read(3, "\261\20F\237\276l\207E%2\334bd2g\352!\343N\340o\1-\332\304\206\21i", 28) = 28
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
write(3, "\27\3\3\0?M\323\261\250\352\265\25`\212r\276\7\274\230O\31\303,\216,<\200\34\226\r8,"..., 68) = 68
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
read(3, "\27\3\3\1\325", 5)             = 5
read(3, "\305\377,\270I\226\36\327\33>\374\263\324\27\214\253\23\363\363\232\357,\251wb\372,\25\7\222.\34"..., 469) = 469
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
write(3, "\27\3\3\0^d\323\357\205\324\340\213\332\v>%\215\312s\361\6x\3129{\3140\355\21\37GE"..., 99) = 99
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
read(3, 0x556e6aef5df3, 5)              = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 86400000) = 1 ([{fd=3, revents=POLLIN}])
read(3, "\27\3\3\0\34", 5)              = 5
read(3, "\235\33@\226\23\354E\215|\357(\324\10\334}\354\32`rH\271 \251\371'\257\335a", 28) = 28
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
write(3, "\27\3\3\0D\367GB[\344\310\314\346Ze\225\354B\3124\337\370\34\277\275|&6w\31\376\260"..., 73) = 73
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
read(3, "\27\3\3\0_", 5)                = 5
read(3, "\374\10\332ss\252\377\331\247v~\377\25\35_\330b\332k\333\22\360)\247K^\355U\23s-\305"..., 95) = 95
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
write(3, "\27\3\3\0D\16\2!\374\211\351$i`\230\2\353\1\370\221^\347\266\1\247\275\10E\23\301\226\363"..., 73) = 73
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
read(3, "\27\3\3\0_", 5)                = 5
read(3, "@\337\200\240\343`\363\244Q\325\215\335\333\"Eos\333<\23\26D\24\303\201\217\240\252\320\255\262\n"..., 95) = 95
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
write(3, "\27\3\3\1j\3(`K,6\340N\317\377fP\311^l\353\357\255'\321^\347\360\0\251\216#"..., 367) = 367
fcntl(3, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(3, F_SETFL, O_RDWR)               = 0
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
read(3, 0x556e6aef5df3, 5)              = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 86399999♥strace: Process 31090 detached
 <detached ...>

I don't know what additional output / information I can give to help diagnose this problem but I will do my best to provide anything requested.

@bmfmancini (Member) commented Nov 12, 2021 via email

@netniV (Member) commented Nov 13, 2021

The EAGAIN message smells to me of spine. @bmfmancini, could you try using the latest poller.php but the 1.2.16 version of spine? The real question is what is causing the timeout/resource issue. It definitely seems race related.

@bmfmancini (Member) commented:
I am using spine 1.2.19 and lib/poller.php from 1.2.18.

@skyjou (Contributor) commented Nov 13, 2021

Is this related to #205?

@netniV (Member) commented Nov 14, 2021

It may be, @skyjou, but alas we had no feedback on that issue. This does seem to be a race condition of sorts with a large number of devices. So far I haven't been able to reproduce it myself on systems with ~1000 devices, but the conditions that trigger it may simply not be occurring there.

@nicolatron (Author) commented Nov 15, 2021

I took the plunge and started a new Cacti install, so I cannot provide much more feedback (I saved a DB dump).
I'm not sure all the people reporting here have the same root issue, especially in my case (I don't trust my very old and ill-treated DB data).
I only had 195 devices and around 30K data sources (that's a lot of DS per device).
From what I saw, the problem showed up in the function process_poller_output in lib/poller.php.
The poller_output table was processed, and I didn't lose data, but rows were not deleted from the table, so poller_output was never empty and the process never ended.
I set a debug statement at line 547, and it seemed to never enter this if block:

if ($k % 10000 == 0) { db_execute('DELETE FROM poller_output WHERE local_data_id IN (' . implode(',', $data_ids) . ')'); [...]
I also set a debug statement before that if block outputting $k and sizeof($data_ids); both scaled up to 40K and then back to 1 with no DELETE query being executed (so it seems it never entered that if block).
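
For readers following along, this is the general shape of that batching — a simplified sketch of the pattern only, not the actual Cacti code nor the eventual fix: processed local_data_ids are collected and deleted in chunks of 10000, and any remainder has to be flushed after the loop or those rows stay behind.

<?php
// Simplified sketch of the chunked-delete pattern around that if block.
// Not the real process_poller_output(); run against a test database only.
include(dirname(__FILE__) . '/../include/cli_check.php');

$results  = db_fetch_assoc('SELECT local_data_id, rrd_name, time, output FROM poller_output');
$k        = 0;
$data_ids = array();

foreach ($results as $item) {
    // ... the real code feeds $item into $rrd_update_array here ...

    $data_ids[] = $item['local_data_id'];
    $k++;

    if ($k % 10000 == 0) {
        db_execute('DELETE FROM poller_output WHERE local_data_id IN (' . implode(',', $data_ids) . ')');
        $data_ids = array();
    }
}

// Final flush: without it, any run whose processed-row count is not an exact
// multiple of 10000 leaves the tail of rows sitting in poller_output.
if (cacti_sizeof($data_ids)) {
    db_execute('DELETE FROM poller_output WHERE local_data_id IN (' . implode(',', $data_ids) . ')');
}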

@netniV (Member) commented Nov 15, 2021

That particular block of code hasn't been changed in three years, but the surrounding code has, in commit 43baa25, and that is where I think the logic is going wrong or becoming too recursive, eating up CPU cycles. I've given @bmfmancini a patch to try out before I release it to the wild, but your comments helped me see something that I missed when I looked over that code last week.

netniV changed the title from "[1.2.19] php <cacti_base>cacti/poller.php hangs on main poller until exec timeout" to "Polling does not complete as expected on larger systems" on Nov 18, 2021
@netniV (Member) commented Nov 18, 2021

The above commit should contain changes that should help things complete. Please let us know if there are any further issues.

netniV added the confirmed (Bug is confirmed by dev team), poller (Data Collection related issue) and unverified labels and removed the unverified label on Nov 18, 2021
netniV added this to the v1.2.20 milestone on Nov 18, 2021
@trayhons commented:
The above commit should contain changes that should help things complete. Please let us know if there are any further issues.

I pulled the committed files, and when my poller ran the next time I received the following in my cacti.log file. Around line 517, "issues_list" was used; I removed the "s" (making it "issue_list") and the poller worked as expected during the next poll.

2021/11/19 11:55:02 - ERROR PHP NOTICE: Undefined variable: issues_list in file: /usr/share/cacti/poller.php on line: 517
2021/11/19 11:55:02 - CMDPHP PHP ERROR NOTICE Backtrace: (/poller.php[517]:CactiErrorHandler())
2021/11/19 11:55:02 - ERROR PHP WARNING: implode(): Invalid arguments passed in file: /usr/share/cacti/poller.php on line: 517
2021/11/19 11:55:02 - CMDPHP PHP ERROR WARNING Backtrace: (/poller.php[517]:implode(), CactiErrorHandler())
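
Those two log entries come from a single mistyped variable name: implode() is handed a variable that was never assigned, so PHP 7 raises the undefined-variable notice first and then the implode() warning. A minimal stand-alone reproduction (generic PHP, not the Cacti source):

<?php
// Reproduces the pair of errors above: one mistyped variable name produces
// both the NOTICE and the implode() WARNING on PHP 7.x.
$issue_list = array('DS[64]', 'DS[74]', 'DS[23793]');

echo implode(', ', $issues_list) . "\n";   // typo: NOTICE (undefined variable: issues_list),
                                           // then WARNING (implode(): Invalid arguments passed)

echo implode(', ', $issue_list) . "\n";    // intended variable: prints the list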

@ctrowat (Contributor) commented Nov 19, 2021

I have been testing with the above commit as well. The runaway poller problem has not occurred in the past 8 hours, so that part feels like it is fixed. I did run into the issues_list vs issue_list problem listed above as well and just manually fixed it. Now on my main instance I am getting some "poller output table not empty" errors like this:

2021-11-19 23:02:02 - POLLER: Poller[Main Poller] PID[23582] WARNING: Poller Output Table not Empty. Issues: 18, DS[ ... ]

It's not always the exact same number of entries left over but it usually contains exactly the same list of interfaces.

Perhaps it just didn't run well enough to get those before. I am not 100% sure they are related but they are unexpected (to me at least).

netniV added a commit that referenced this issue Nov 20, 2021
@netniV (Member) commented Nov 20, 2021

The typo should be corrected now.

@nicolatron (Author) commented Nov 24, 2021

Hello,
I started a new Cacti installation with an almost fresh database; I only imported the host table and the user-related ones (user_auth, user_auth_cache, user_auth_group, user_auth_group_members, user_auth_group_perms, user_auth_group_realm, user_auth_perms, user_auth_realm, user_domains, user_domains_ldap) into the new database.
I'm adding graphs as needed; so far only 180 devices and around 2200 data sources.
Yesterday it started to fail again: several poller processes running at the same time, and this time data was lost (graphs with NaN values).
I was using a 1.2.x snapshot from about 14 days ago.
I didn't investigate; I just killed all PHP processes and upgraded my Cacti install to the current 1.2.x branch:

git clone -b 1.2.x http://github.com/cacti/cacti.git cacti-develop
/bin/cp -rfp cacti-develop/* /opt/cacti
chown -R apache:apache /opt/cacti

So far it's running well again. I will report back with any news.
Thanks, best regards.

Update after 8 days: everything keeps working OK, no more problems.

@TheWitness (Member) commented:
Closing, as this one is a duplicate.

TheWitness added the duplicate (Duplicate of another issue) label on Dec 10, 2021
The github-actions bot locked and limited conversation to collaborators on Mar 11, 2022
netniV changed the title from "Polling does not complete as expected on larger systems" to "Polling cycles may not always complete as expected" on Apr 3, 2022