Spine Memory map errors show up randomly #154

Closed
bmfmancini opened this issue Apr 8, 2020 · 21 comments
Labels
unverfied Not verified by a developer

Comments

@bmfmancini

bmfmancini commented Apr 8, 2020

Hello Everyone

I have Spine 1.2.9 on our prod system and I notice that the following errors sometimes come up:

======= Memory map: ========
/lib64/libc.so.6(clone+0x6d)[0x7f71426eb02d]
/lib64/libpthread.so.0(+0x7dd5)[0x7f7142ec7dd5]
/usr/local/spine/bin/spine[0x40f29c]
/usr/local/spine/bin/spine[0x40cf2d]
/lib64/libmysqlclient.so.18(mysql_close+0x1a)[0x7f7143782a4a]
/lib64/libmysqlclient.so.18(+0x2da14)[0x7f7143782a14]
/lib64/libmysqlclient.so.18(+0x30606)[0x7f7143785606]
/lib64/libmysqlclient.so.18(+0x2aab5)[0x7f714377fab5]
/lib64/libmysqlclient.so.18(+0x5a2f7)[0x7f71437af2f7]
/lib64/libc.so.6(+0x81609)[0x7f714266e609]
======= Backtrace: =========
*** Error in `/usr/local/spine/bin/spine': double free or corruption (out): 0x00007f712400bc70 ***
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
7ffe16cb7000-7ffe16cb9000 r-xp 00000000 00:00 0 [vdso]
7ffe16b80000-7ffe16c81000 rw-p 00000000 00:00 0 [stack]
7fad3dd04000-7fad3dd05000 rw-p 00000000 00:00 0

This is followed by Spine having a hard time polling; eventually the errors stop with no intervention and polling continues as normal.

We recently moved from 1.2.7 to 1.2.9, and in the last 3 days this is the 2nd time this has happened. There are no memory issues on the host, and you will see this on the remote collectors as well.

Please let me know if this is a known Spine-level issue or if this could be something else.

Thanks !

@bmfmancini

bmfmancini commented Apr 8, 2020

OK... I've traced back what was done before this happened.

A modification was made to a threshold, and shortly after that the above backtrace happened. Shortly after that:

2020/04/08 14:59:51 - CMDPHP PHP ERROR Backtrace: (CactiShutdownHandler())
2020/04/08 14:59:51 - ERROR PHP ERROR: Maximum execution time of 59 seconds exceeded in file: /var/www/html/cacti/lib/database.php on line: 497

Followed by

2020/04/08 15:04:13 - CMDPHP PHP ERROR Backtrace: (CactiShutdownHandler())
2020/04/08 15:04:13 - ERROR PHP ERROR: Maximum execution time of 59 seconds exceeded in file: /var/www/html/cacti/lib/database.php on line: 259
2020/04/08 15:01:58 - CMDPHP ERRORS DETECTED - DISABLING PLUGIN 'thold'
2020/04/08 15:01:58 - CMDPHP PHP ERROR Backtrace: (CactiShutdownHandler())
2020/04/08 15:01:58 - ERROR PHP ERROR in Plugin 'thold': Maximum execution time of 59 seconds exceeded in file: /var/www/html/cacti/plugins/thold/thold_functions.php on line: 2040
2020/04/08 15:00:53 - CMDPHP PHP ERROR Backtrace: (CactiShutdownHandler())
2020/04/08 15:00:53 - ERROR PHP ERROR: Maximum execution time of 59 seconds exceeded in file: /var/www/html/cacti/lib/database.php on line: 259

I did report in Thold that a couple of weeks ago, when editing a threshold, things went nuts. I have since upgraded to Thold 1.4 and that didn't happen again, so perhaps there is something lurking either in there or in a Spine condition...

@bmfmancini

Thold report:

Cacti/plugin_thold#408

@netniV

netniV commented Apr 10, 2020

To me, and I could be wrong here, the double free is occurring within the MySQL client library itself. Have you updated MySQL recently?

@bmfmancini

No changes to MySQL. The version I have installed is:
Ver 15.1 Distrib 10.1.41-MariaDB, for Linux (x86_64) using readline 5.1

These are the library versions that are installed:

rpm -qa | grep -i mariadb
MariaDB-shared-10.1.41-1.el7.centos.x86_64
MariaDB-common-10.1.41-1.el7.centos.x86_64
MariaDB-client-10.1.41-1.el7.centos.x86_64
MariaDB-server-10.1.41-1.el7.centos.x86_64
MariaDB-devel-10.1.41-1.el7.centos.x86_64

I can't easily reproduce this. The only thing I caught was that, last time, Thold went bonkers and shut down, and that happened before as well, but I'm not sure if that is the cause, or a side effect of Spine crashing or having an issue that causes the timeouts for Thold.

@bmfmancini

Oh, by the way, I found this in my messages log:

[604396.693209] spine[18286]: segfault at 0 ip (null) sp 00007f277f7d8608 error 14 in spine[400000+1c000]

@bmfmancini

I caught this today

2020/04/15 10:59:17 - SPINE: Poller[Main Poller] PID[23607] FATAL: Spine Encountered a Segmentation Fault [9, Bad file descriptor] (Spine thread)
2020/04/15 10:59:17 - SPINE: Poller[Main Poller] PID[23555] FATAL: Spine Encountered a Segmentation Fault [9, Bad file descriptor] (Spine thread)
2020/04/15 10:59:17 - SPINE: Poller[Main Poller] PID[23581] FATAL: Spine Encountered a Segmentation Fault [9, Bad file descriptor] (Spine thread)
2020/04/15 04:00:17 - SPINE: Poller[Main Poller] PID[32692] FATAL: Spine Encountered a Segmentation Fault [9, Bad file descriptor] (Spine thread)

It happens during replication of the remote pollers

@TheWitness

Capture a core file and then load it with gdb and do a backtrace for us. We need to know where it's failing.
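A minimal sketch of one way to do that on a CentOS 7 box like this one (the core_pattern path and the PID placeholder are examples; the ulimit has to apply to the environment that actually launches the poller):

# allow core files to be written (set this where the poller is launched, or system-wide)
ulimit -c unlimited
# optional, as root: put cores somewhere predictable (example path)
echo '/var/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

# after the next crash, open the core with gdb and capture the full backtrace
gdb /usr/local/spine/bin/spine /var/tmp/core.spine.<pid>
(gdb) thread apply all bt full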

@TheWitness TheWitness added the unverfied Not verified by a developer label Apr 25, 2020
@TheWitness

Is this case still occurring @bmfmancini ?

@bmfmancini

@TheWitness Just saw this same type of behavior on 1.2.12

======= Memory map: ========
/lib64/libc.so.6(clone+0x6d)[0x7f6926b3c02d]
/lib64/libpthread.so.0(+0x7dd5)[0x7f6927084dd5]
/usr/local/spine/bin/spine[0x40f34c]
/usr/local/spine/bin/spine[0x40cfad]
/lib64/libmysqlclient.so.18(mysql_close+0x1a)[0x7f6927e45a4a]
/lib64/libmysqlclient.so.18(+0x2da14)[0x7f6927e45a14]
/lib64/libmysqlclient.so.18(+0x30606)[0x7f6927e48606]
/lib64/libmysqlclient.so.18(+0x2aab5)[0x7f6927e42ab5]
/lib64/libmysqlclient.so.18(+0x5a2f7)[0x7f6927e722f7]
/lib64/libc.so.6(+0x81609)[0x7f6926abf609]
======= Backtrace: =========
*** Error in `/usr/local/spine/bin/spine': double free or corruption (out): 0x00007f68f401e9a0 ***
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
7ffe9d1a0000-7ffe9d1a2000 r-xp 00000000 00:00 0 [vdso]
7ffe9d043000-7ffe9d144000 rw-p 00000000 00:00 0 [stack]
7f2544cd3000-7f2544cd4000 rw-p 00000000 00:00 0

I'm not seeing any actual impact during the collection process. I seem to be able to pinpoint it to a blip in polling times, possibly a networking event; it happened overnight, so I am fairly sure that was it. Polling time was exceeded on one poll cycle and then dropped back to normal ranges for the rest of the night/day.

So I am thinking these errors must just be verbose errors produced when Spine times out? And since this scenario thankfully doesn't happen often, that's why I only see these once in a blue moon, without any overall impact.

@bmfmancini bmfmancini changed the title Spine 1.2.9 Memory map errors show up randomly Spine Memory map errors show up randomly Jul 20, 2020
@TheWitness

You are likely running out of connections.

SHOW GLOBAL VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'max_used_connections';

@bmfmancini

Connections seem fine

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE 'max_connections';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 1000  |
+-----------------+-------+
1 row in set (0.00 sec)

MariaDB [(none)]> SHOW GLOBAL STATUS LIKE 'max_used_connections';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Max_used_connections | 727   |
+----------------------+-------+
1 row in set (0.00 sec)

@TheWitness

Could be shutdown handler. Look for poller runtime very close or above the poller interval in the cacti.log for that time range. Within 2 seconds.
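A quick sketch of one way to check that, assuming the cacti.log lives under this install's webroot (the path is an assumption) and using the SYSTEM STATS summary line Cacti writes at the end of each poller run:

grep 'SYSTEM STATS' /var/www/html/cacti/log/cacti.log | grep '2020/04/15'
# compare the Time: value against the poller interval; anything within
# about 2 seconds of the interval (e.g. 58-60s on a 60-second interval)
# suggests the run hit the shutdown handler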

@bmfmancini

And bingo, that is it. The few times this has happened, when I search the log it is when the poller timed out. So it's just Spine taking a dump when it times out; the memory map message can throw you off, though.

@netniV

netniV commented Jul 31, 2020

So it sounds like Spine needs better handling of the loss of connections, by cascading that up.

@TheWitness

Just did a commit. You should test with it. Check the CHANGELOG. We've had other customers reporting connection issues lately.

@TheWitness

Bad file descriptor? Hmm.

@TheWitness

Try this in /etc/security/limits.d

[root@vmhost3 limits.d]# cat 21-cacti.conf
*          soft    nofile    65535
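To confirm the higher limit is actually in effect for the running poller, something like the following works (the pgrep lookup is just an illustration; adjust for how spine is launched on your system):

grep 'Max open files' /proc/$(pgrep -o spine)/limits
ulimit -n    # for the shell that will launch the poller after re-login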

@TheWitness

Update to the latest 1.2.x branch and report back @bmfmancini.

@bmfmancini

Will do, give me a few days.

@TheWitness

Sean, please update to the latest 1.2.x. I found an issue in the re-index logic that was causing segfaults. I think this explains the somewhat random nature of the crashes.

@TheWitness

Related to #193
