Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Continued timeout of registered processes #4271

Closed
TheWitness opened this issue May 19, 2021 · 17 comments
Closed

Continued timeout of registered processes #4271

TheWitness opened this issue May 19, 2021 · 17 comments
Labels
bug Undesired behaviour resolved A fixed issue
Milestone

Comments

@TheWitness
Copy link
Member

Describe the bug

For some unknown reason, processes are still continuing to be killed by the register_process function though they have not timed out.

image

To Reproduce

  1. Reboot a large number of servers, switches and routers enough where the re-index will take several minutes for all devices.

  2. When the next poller run starts not the re-index log entries as in the image above

  3. Note at that at the next poller start, you see the poller command script being restarted.

Expected behavior

The poller should respect the timeout.

@TheWitness TheWitness added the bug Undesired behaviour label May 19, 2021
@TheWitness
Copy link
Member Author

Timeout is 300 seconds, yet this system with a 30 second collect frequency continually generate this error.

image

@bmfmancini
Copy link
Member

bmfmancini commented May 19, 2021 via email

@TheWitness
Copy link
Member Author

Going to move the timeout entirely into SQL as there seems to be continual problems with PHP and the web server.

TheWitness added a commit that referenced this issue May 19, 2021
By moving the calculation fully under SQL, this should solve the problem.
@TheWitness
Copy link
Member Author

Please test the commit that just went in.

@bmfmancini
Copy link
Member

bmfmancini commented May 19, 2021 via email

@TheWitness TheWitness added the resolved A fixed issue label May 19, 2021
@bmfmancini
Copy link
Member

OK put it in our lab will monitor and let you know

@bmfmancini
Copy link
Member

@TheWitness still seeing the same behaviour

2021-05-19 15:01:23 - POLLER: Poller[Main Poller] PID[14130] ERROR: Process being killed due to timeout! (commands, master, 1, 11769)
2021-05-19 14:56:21 - POLLER: Poller[Main Poller] PID[11769] ERROR: Process being killed due to timeout! (commands, master, 1, 2580)
2021-05-19 14:50:20 - POLLER: Poller[Main Poller] PID[2580] ERROR: Process being killed due to timeout! (commands, master, 1, 2069)
2021-05-19 14:44:23 - POLLER: Poller[Main Poller] PID[2069] ERROR: Process being killed due to timeout! (commands, master, 1, 25419)
2021-05-19 14:38:23 - POLLER: Poller[Main Poller] PID[25419] ERROR: Process being killed due to timeout! (commands, master, 1, 23135)
2021-05-19 14:33:22 - POLLER: Poller[Main Poller] PID[23135] ERROR: Process being killed due to timeout! (commands, master, 1, 20928)
2021-05-19 14:28:20 - POLLER: Poller[Main Poller] PID[20928] ERROR: Process being killed due to timeout! (commands, master, 1, 11871)
2021-05-19 14:22:21 - POLLER: Poller[Main Poller] PID[11871] ERROR: Process being killed due to timeout! (commands, master, 1, 9686)
2021-05-19 14:17:20 - POLLER: Poller[Main Poller] PID[9686] ERROR: Process being killed due to timeout! (commands, master, 1, 493)
2021-05-19 14:11:23 - POLLER: Poller[Main Poller] PID[493] ERROR: Process being killed due to timeout! (commands, master, 1, 23789)
2021-05-19 14:05:21 - POLLER: Poller[Main Poller] PID[23789] ERROR: Process being killed due to timeout! (commands, master, 1, 14866)
2021-05-19 13:59:25 - POLLER: Poller[Main Poller] PID[14866] ERROR: Process being killed due to timeout! (commands, master, 1, 5825)
2021-05-19 13:53:19 - POLLER: Poller[Main Poller] PID[5825] ERROR: Process being killed due to timeout! (commands, master, 1, 29003)
2021-05-19 13:47:21 - POLLER: Poller[Main Poller] PID[29003] ERROR: Process being killed due to timeout! (commands, master, 1, 2888)
2021-05-19 13:42:20 - POLLER: Poller[Main Poller] PID[2888] ERROR: Process being killed due to timeout! (commands, master, 1, 25962)
2021-05-19 13:36:21 - POLLER: Poller[Main Poller] PID[25962] ERROR: Process being killed due to timeout! (commands, master, 1, 16682)
2021-05-19 13:30:24 - POLLER: Poller[Main Poller] PID[16682] ERROR: Process being killed due to timeout! (commands, master, 1, 7456)

Made sure I grabbed the latest commits

diff poller.php poller.php.0519
1931c1931
<       $r = db_fetch_row_prepared('SELECT *, IF(UNIX_TIMESTAMP(started) + timeout < UNIX_TIMESTAMP(), 1, 0) AS timeout_exceeded
---
>       $r = db_fetch_row_prepared('SELECT *
1942c1942
<       } elseif ($r['timeout_exceeded']) {
---
>       } elseif (strtotime($r['started']) + $r['timeout'] < time()) {

netniV added a commit that referenced this issue May 20, 2021
@netniV
Copy link
Member

netniV commented May 20, 2021

I've made a change to be a bit more verbose on the debug, plus to ensure we definitely insert a start time and do not rely on MySQL to set it.

@bmfmancini
Copy link
Member

Cool I will test it in the am

@bmfmancini
Copy link
Member

OK so testing this out the extra verbose code really helps narrow this down
I see two variations of errors one for intropage and one for poller maint due to current_timestamp on both

2021-05-21 10:52:24 - CMDPHP SQL Backtrace: (/poller_maintenance.php[92]:register_process_start(), /lib/poller.php[1936]:db_fetch_row_prepared(), /lib/database.php[473]:db_execute_prepared())
2021-05-21 10:52:24 - CMDPHP ERROR: A DB Row Failed!, Error: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'current_timestamp FROM processes WHERE tasktype = 'maintenance' AND taskname = '' at line 1
2021-05-21 10:52:24 - CMDPHP SQL Backtrace: (/poller_spikekill.php[81]:register_process_start(), /lib/poller.php[1936]:db_fetch_row_prepared(), /lib/database.php[473]:db_execute_prepared())
2021-05-21 10:52:24 - CMDPHP ERROR: A DB Row Failed!, Error: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'current_timestamp FROM processes WHERE tasktype = 'spikekill' AND taskname = 'ma' at line 1
2021-05-21 10:52:24 - CMDPHP SQL Backtrace: (/poller_reports.php[136]:register_process_start(), /lib/poller.php[1936]:db_fetch_row_prepared(), /lib/database.php[473]:db_execute_prepared())
2021-05-21 10:52:24 - CMDPHP ERROR: A DB Row Failed!, Error: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'current_timestamp FROM processes WHERE tasktype = 'reports' AND taskname = 'mast' at line 1
2021-05-21 10:54:30 - CMDPHP SQL Backtrace: (/plugins/intropage/poller_intropage.php[98]:register_process_start(), /lib/poller.php[1936]:db_fetch_row_prepared(), /lib/database.php[473]:db_execute_prepared())
2021-05-21 10:54:30 - CMDPHP ERROR: A DB Row Failed!, Error: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'current_timestamp FROM processes WHERE tasktype = 'intropage' AND taskname = 'ma' at line 1

@netniV
Copy link
Member

netniV commented May 21, 2021

I think the error here is that CURRENT_TIMESTAMP is a reserved word probably and isn't escaped.

netniV added a commit that referenced this issue May 21, 2021
@netniV
Copy link
Member

netniV commented May 21, 2021

Try that commit and see if it resolves things for you

@bmfmancini
Copy link
Member

All good now

@netniV
Copy link
Member

netniV commented May 23, 2021

Are you still seeing events being timeout ?

@bmfmancini
Copy link
Member

bmfmancini commented May 23, 2021 via email

@bmfmancini
Copy link
Member

A bit late with the update but yea not a single timeout message since the update !

@netniV netniV closed this as completed May 26, 2021
@netniV netniV added this to the 1.2.18 milestone May 26, 2021
@TheWitness
Copy link
Member Author

This issue was not actually resolved. I've made an update this morning.

@TheWitness TheWitness reopened this Aug 28, 2021
@TheWitness TheWitness modified the milestones: 1.2.18, 1.2.19 Aug 28, 2021
TheWitness added a commit that referenced this issue Aug 28, 2021
The logic that was placed into the service check was flawed resulting in incorrect timestamps in the database.

I've yet to check the timeout handling.
@netniV netniV closed this as completed Sep 5, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Dec 5, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Undesired behaviour resolved A fixed issue
Projects
None yet
Development

No branches or pull requests

3 participants