
Infinite recursion is possible during a database failure #5490

Closed
jsgh opened this issue Sep 17, 2023 · 6 comments
Labels: bug (Undesired behaviour), confirmed (Bug is confirmed by dev team), resolved (A fixed issue)
Milestone: 1.2.26

Comments

@jsgh (Contributor) commented Sep 17, 2023

Using the EPEL cacti-1.2.23-1.el8.noarch.rpm package (though I can't find an existing issue for this here, nor any recent changes in the develop branch that would prevent it), we've come across a problem...

Under some circumstances we find that the poller_automation.php processes go into a tight loop; their memory usage skyrockets, allocating all available RAM and then all available swap, until they bring the whole machine to its knees.

After some investigation it appears this happens if the connection to the MySQL server is lost, whether it crashes or is stopped. In fact you can easily repro this by su-ing to the appropriate user, running "php /usr/share/cacti/poller_automation.php -M --force", waiting for a bit (10-20 seconds is usually enough), then doing a "systemctl stop mariadb".

stracing the poller_automation processes shows that they attempt to write their next SQL query (usually one of the "SELECT id, snmp_version ..." ones), get an EPIPE because the remote end has closed, close the actual socket file descriptor, then go into a loop on mmap for 2MB chunks (which is the behaviour of malloc when leaking in a tight loop) until death (machine or process).

That doesn't really tell us much, so sprinkling a few "echo" statements around the area shows what's really going on:

03:39:59.473838 write(1, "db_execute_prepared SELECT id, snmp_version, status, deleted\n\t\t\t\tFROM host\n\t\t\t\tWHERE hostname IN (?,?)\n", 103) = 103 <0.000022>
03:39:59.473959 sendto(4, "g\0\0\0\3SELECT id, snmp_version, status, deleted FROM host WHERE hostname IN ('192.168.1.233','192.168.1.233')", 107, MSG_DONTWAIT, NULL, 0) = -1 EPIPE (Broken pipe) <0.000031>
03:39:59.474090 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=150646, si_uid=48} ---
03:39:59.474209 getpid()                = 150646 <0.000019>
03:39:59.474277 close(4)                = 0 <0.000037>
03:39:59.474373 write(1, "err 8 PDOStatement::execute(): send of 107 bytes failed with errno=32 Broken pipe\n", 82) = 82 <0.000019>
03:39:59.474468 write(1, "cacti_log 5 ERROR: A DB Row Failed!, Error: 8, SQL: 'SELECT id, snmp_version, status, deleted FROM host WHERE hostname IN (?,?)'\n", 129) = 129 <0.000019>
03:39:59.474538 write(1, "read_config_option log_verbosity\n", 33) = 33 <0.000028>
03:39:59.474649 write(1, "read_config_option log_verbosity = 2\n", 37) = 37 <0.000063>
03:39:59.474779 write(1, "cacti_log  ERROR: A DB Row Failed!, Error: PDOStatement::execute(): send of 107 bytes failed with errno=32 Broken pipe\n", 119) = 119 <0.000019>
03:39:59.474849 write(1, "read_config_option client_timezone_support\n", 43) = 43 <0.000017>
03:39:59.474927 write(1, "db_execute_prepared SELECT value FROM settings WHERE name = ?\n", 62) = 62 <0.000017>
03:39:59.475038 write(1, "cacti_log 5 ERROR: A DB Row Failed!, Error: 2006, SQL: 'SELECT value FROM settings WHERE name = ?'\n", 99) = 99 <0.000019>
03:39:59.475105 write(1, "read_config_option log_verbosity\n", 33) = 33 <0.000030>
03:39:59.475203 write(1, "read_config_option log_verbosity = 2\n", 37) = 37 <0.000054>
03:39:59.475315 write(1, "cacti_log  ERROR: A DB Row Failed!, Error: MySQL server has gone away\n", 70) = 70 <0.000018>
03:39:59.475379 write(1, "read_config_option client_timezone_support\n", 43) = 43 <0.000022>
03:39:59.475455 write(1, "db_execute_prepared SELECT value FROM settings WHERE name = ?\n", 62) = 62 <0.000017>
03:39:59.475556 write(1, "cacti_log 5 ERROR: A DB Row Failed!, Error: 2006, SQL: 'SELECT value FROM settings WHERE name = ?'\n", 99) = 99 <0.000020>
03:39:59.475669 write(1, "read_config_option log_verbosity\n", 33) = 33 <0.000025>
03:39:59.475747 write(1, "read_config_option log_verbosity = 2\n", 37) = 37 <0.000063>
03:39:59.475866 write(1, "cacti_log  ERROR: A DB Row Failed!, Error: MySQL server has gone away\n", 70) = 70 <0.000017>
03:39:59.475930 write(1, "read_config_option client_timezone_support\n", 43) = 43 <0.000021>
03:39:59.476004 write(1, "db_execute_prepared SELECT value FROM settings WHERE name = ?\n", 62) = 62 <0.000018>
03:39:59.476106 write(1, "cacti_log 5 ERROR: A DB Row Failed!, Error: 2006, SQL: 'SELECT value FROM settings WHERE name = ?'\n", 99) = 99 <0.000018>
03:39:59.476171 write(1, "read_config_option log_verbosity\n", 33) = 33 <0.000023>
03:39:59.476240 write(1, "read_config_option log_verbosity = 2\n", 37) = 37 <0.000063>
03:39:59.476358 write(1, "cacti_log  ERROR: A DB Row Failed!, Error: MySQL server has gone away\n", 70) = 70 <0.000017>
03:39:59.476422 write(1, "read_config_option client_timezone_support\n", 43) = 43 <0.000021>
03:39:59.476495 write(1, "db_execute_prepared SELECT value FROM settings WHERE name = ?\n", 62) = 62 <0.000017>
03:39:59.476625 write(1, "cacti_log 5 ERROR: A DB Row Failed!, Error: 2006, SQL: 'SELECT value FROM settings WHERE name = ?'\n", 99) = 99 <0.000023>
03:39:59.476705 write(1, "read_config_option log_verbosity\n", 33) = 33 <0.000023>
03:39:59.476776 write(1, "read_config_option log_verbosity = 2\n", 37) = 37 <0.000056>
03:39:59.476888 write(1, "cacti_log  ERROR: A DB Row Failed!, Error: MySQL server has gone away\n", 70) = 70 <0.000017>
03:39:59.476951 write(1, "read_config_option client_timezone_support\n", 43) = 43 <0.000017>
03:39:59.477024 write(1, "db_execute_prepared SELECT value FROM settings WHERE name = ?\n", 62) = 62 <0.000017>
03:39:59.477128 write(1, "cacti_log 5 ERROR: A DB Row Failed!, Error: 2006, SQL: 'SELECT value FROM settings WHERE name = ?'\n", 99) = 99 <0.000018>

So, within db_execute_prepared() we try to execute() the SQL statement, which fails (and internally the database handle, which knows the error is unrecoverable, closes its file descriptor). That failure is detected and cacti tries to log it - in fact for this particular error two calls to cacti_log() are made: the first reports the statement, the second reports the error. The behaviour of cacti_log() is modified by certain configuration options, so it tries to read those; but config options are stored in the database, so for at least some of them it calls back into the database layer to fetch them. The handle has already been closed, so the failure is immediately reported back to cacti, which tries to log that failure... we are now in an infinite recursion which chews up RAM for the PHP stack frames and local variables until none is left. If there's no additional I/O to do (as in the original version, without my extra "echo" statements and with things like DEBUG_SQL_CMD turned off) that actually happens pretty fast, so the machine becomes unresponsive soon after.
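
To make the cycle concrete, here is a minimal PHP sketch of the loop. It is not Cacti's actual code: the function names (db_execute_prepared(), read_config_option(), cacti_log()) are the ones that appear in the trace above, but the bodies are stripped-down stand-ins that show only the calls forming the cycle.

<?php
// Minimal sketch of the recursion described above -- NOT Cacti's actual code.

function db_execute_prepared($sql, $params = array(), $log = true) {
    try {
        // The real $stmt->execute() fails here: the server connection is already gone.
        throw new PDOException('MySQL server has gone away');
    } catch (PDOException $e) {
        if ($log) {
            // Logging the failure is what re-enters the database layer below.
            cacti_log('ERROR: A DB Row Failed!, Error: ' . $e->getMessage());
        }
        return false;
    }
}

function read_config_option($name) {
    // Settings live in the database, so reading one issues another query,
    // which fails on the dead handle and gets logged again...
    return db_execute_prepared('SELECT value FROM settings WHERE name = ?', array($name));
}

function cacti_log($message) {
    // cacti_log() consults configuration options before writing anything.
    $verbosity = read_config_option('log_verbosity');           // cached -> harmless
    $timezone  = read_config_option('client_timezone_support'); // not cached -> hits the DB
    // ...write $message to the log file...
}

// cacti_log -> read_config_option -> db_execute_prepared -> cacti_log -> ...
// Each cycle allocates new stack frames and locals until memory is exhausted.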

So:

  1. It is unfortunate that PDO notices the fatal error but cacti keeps trying to use the handle. Then again, I looked and couldn't find any obvious way to tell whether an error is due to the specific SQL statement or something more permanent. I guess after an error cacti should try a "SELECT 1;" just to see if the handle is completely unusable.
  2. It is also unfortunate that reporting DB errors attempts to access the database at all. I have worked around this locally by changing read_config_option() to pass $log=false in its call to db_fetch_row_prepared() (which then calls db_execute_prepared()); this bypasses the entire block of logging code and appears to do the job well (a sketch appears after this list). (You might argue that read_config_option() calls outside of cacti_log() should continue to log: that's possible, but it requires cacti_log() to punch the $log=false parameter through several intermediate layers, which felt both uglier and more fragile.)
  3. Trying to cache those config options isn't on its own sufficient: note from the strace above that reading log_verbosity doesn't cause a problem because it has been cached (probably by prime_common_config_settings()), but adding client_timezone_support to the list in that function isn't enough, because options are only cached if they exist in the database. In our case there was no client_timezone_support row in the settings table, and that doesn't feel like something that should be relied upon to avoid a serious crash. There are other settings beyond that one too, and it's again fragile should other settings get added to cacti_log() in the future.
  4. With my local change in place, poller_automation.php logs the error, then pootles round for a bit longer, making a few other queries (which fail), until it eventually gets back to its outer loop, where at the beginning ("// Check for cancel") it hits the empty() case rather than an explicit cancel, because with the database gone it can't even check for its own existence any more. It nevertheless attempts to remove itself from the database (which fails) before it exits. That is at least a much better outcome than crashing the box, but I can't help feeling that a graceful exit-on-error would look a bit more... direct than that.
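
For reference, a rough sketch of the point-2 workaround and the point-1 probe idea follows. This is a simplified illustration rather than the actual Cacti code, and the exact signature of db_fetch_row_prepared() (in particular that $log is its third argument) is assumed from the description above.

<?php
// Sketch of the point-2 workaround, assuming db_fetch_row_prepared() accepts a
// $log flag as its third argument and forwards it to db_execute_prepared().
// The real read_config_option() does far more (caching, defaults, etc.); only
// the relevant call is shown.

function read_config_option($config_name, $force = false) {
    // ...cache lookup omitted...

    $row = db_fetch_row_prepared(
        'SELECT value FROM settings WHERE name = ?',
        array($config_name),
        false   // $log = false: never log from here, so a dead DB cannot re-enter cacti_log()
    );

    return (is_array($row) && isset($row['value'])) ? $row['value'] : null;
}

// Hypothetical helper for the point-1 idea: after an execute() error, probe the
// handle with a trivial statement (again with logging suppressed) to decide
// whether the connection itself is dead rather than just that statement.
function db_connection_is_dead() {
    $probe = db_fetch_row_prepared('SELECT 1', array(), false);
    return empty($probe);
}
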
jsgh added the bug (Undesired behaviour) and unverified (Some days we don't have a clue) labels on Sep 17, 2023
@TheWitness (Member) commented

What MySQL/MariaDB version are you using on the backend?

@jsgh (Contributor, Author) commented Sep 22, 2023

Well, it really doesn't matter, as should be pretty obvious from the diagnosis above: who cares what version a server is when it's unreachable or not actually running.

However, since you asked, I repro'd against MariaDB-server-10.5.15-1.el8.x86_64 (probably the CentOS version thereof).

@TheWitness (Member) commented

Yea, I understand, but there have been some recent changes as well. This has been a known issue for some time. Can you generate a pull request?

jsgh added a commit to jsgh/cacti that referenced this issue Sep 24, 2023
@jsgh (Contributor, Author) commented Sep 24, 2023

PR raised.

"This has been a known issue for some time."

Has it? Because I did look and couldn't find any issue that sounded like it...

@TheWitness (Member) commented

Thanks!

jsgh added a commit to jsgh/cacti that referenced this issue Sep 26, 2023
TheWitness pushed a commit that referenced this issue Sep 27, 2023
TheWitness added the resolved (A fixed issue) and confirmed (Bug is confirmed by dev team) labels and removed the unverified (Some days we don't have a clue) label on Sep 30, 2023
TheWitness added this to the 1.2.26 milestone on Sep 30, 2023
TheWitness added a commit that referenced this issue Sep 30, 2023: "Database failure causes infinite recursion in cacti_log() followed by machine lockup"
TheWitness added a commit that referenced this issue Sep 30, 2023: "Fixing this in 1.2.26 too."
@TheWitness (Member) commented

Closing this now that I'm satisfied. Thanks for the help!

netniV changed the title from "Database failure causes infinite recursion in cacti_log() followed by machine lockup" to "Infinite recursion is possible during a database failure" on Dec 16, 2023
The github-actions bot locked and limited conversation to collaborators on Mar 16, 2024