Spine Memory map errors show up randomly #154

Closed
bmfmancini opened this issue Apr 8, 2020 · 21 comments
Labels
unverfied Not verified by a developer

Comments

@bmfmancini

bmfmancini commented Apr 8, 2020

Hello Everyone

I have Spine 1.2.9 on our prod system and I notice that the following errors sometimes come up:

======= Memory map: ========
/lib64/libc.so.6(clone+0x6d)[0x7f71426eb02d]
/lib64/libpthread.so.0(+0x7dd5)[0x7f7142ec7dd5]
/usr/local/spine/bin/spine[0x40f29c]
/usr/local/spine/bin/spine[0x40cf2d]
/lib64/libmysqlclient.so.18(mysql_close+0x1a)[0x7f7143782a4a]
/lib64/libmysqlclient.so.18(+0x2da14)[0x7f7143782a14]
/lib64/libmysqlclient.so.18(+0x30606)[0x7f7143785606]
/lib64/libmysqlclient.so.18(+0x2aab5)[0x7f714377fab5]
/lib64/libmysqlclient.so.18(+0x5a2f7)[0x7f71437af2f7]
/lib64/libc.so.6(+0x81609)[0x7f714266e609]
======= Backtrace: =========
*** Error in `/usr/local/spine/bin/spine': double free or corruption (out): 0x00007f712400bc70 ***
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
7ffe16cb7000-7ffe16cb9000 r-xp 00000000 00:00 0 [vdso]
7ffe16b80000-7ffe16c81000 rw-p 00000000 00:00 0 [stack]
7fad3dd04000-7fad3dd05000 rw-p 00000000 00:00 0

This is followed by Spine having a hard time polling; eventually the errors stop with no intervention and polling continues as normal.

We recently moved from 1.2.7 to 1.2.9, and in the last 3 days this is the 2nd time this has happened. There are no memory issues on the host, and you will see this on the remote collectors as well.

Please let me know if this is a known Spine-level issue or if this could be something else.

Thanks !

@bmfmancini

bmfmancini commented Apr 8, 2020

OK... I've traced back what was done before this happened.

A modification was made to a threshold, and shortly after that the above backtrace happened. Shortly after that:

2020/04/08 14:59:51 - CMDPHP PHP ERROR Backtrace: (CactiShutdownHandler())
2020/04/08 14:59:51 - ERROR PHP ERROR: Maximum execution time of 59 seconds exceeded in file: /var/www/html/cacti/lib/database.php on line: 497

Followed by

2020/04/08 15:04:13 - CMDPHP PHP ERROR Backtrace: (CactiShutdownHandler())
2020/04/08 15:04:13 - ERROR PHP ERROR: Maximum execution time of 59 seconds exceeded in file: /var/www/html/cacti/lib/database.php on line: 259
2020/04/08 15:01:58 - CMDPHP ERRORS DETECTED - DISABLING PLUGIN 'thold'
2020/04/08 15:01:58 - CMDPHP PHP ERROR Backtrace: (CactiShutdownHandler())
2020/04/08 15:01:58 - ERROR PHP ERROR in Plugin 'thold': Maximum execution time of 59 seconds exceeded in file: /var/www/html/cacti/plugins/thold/thold_functions.php on line: 2040
2020/04/08 15:00:53 - CMDPHP PHP ERROR Backtrace: (CactiShutdownHandler())
2020/04/08 15:00:53 - ERROR PHP ERROR: Maximum execution time of 59 seconds exceeded in file: /var/www/html/cacti/lib/database.php on line: 259

I did report in Thold that a couple of weeks ago, when editing a threshold, things went nuts. I have since upgraded to Thold 1.4 and that didn't happen again, so perhaps there is something lurking either in there or in a Spine condition...

@bmfmancini

Thold report:

Cacti/plugin_thold#408

@netniV

netniV commented Apr 10, 2020

To me, and I could be wrong here, the double free is occurring within the MySQL client library itself. Have you updated MySQL recently?

@bmfmancini

No changes to MySQL. The version I have installed is:
Ver 15.1 Distrib 10.1.41-MariaDB, for Linux (x86_64) using readline 5.1

These are the library versions that are installed:

rpm -qa | grep -i mariadb
MariaDB-shared-10.1.41-1.el7.centos.x86_64
MariaDB-common-10.1.41-1.el7.centos.x86_64
MariaDB-client-10.1.41-1.el7.centos.x86_64
MariaDB-server-10.1.41-1.el7.centos.x86_64
MariaDB-devel-10.1.41-1.el7.centos.x86_64

I can't easily reproduce this. The only thing I caught was that, last time, Thold went bonkers and shut down, and that happened before as well, but I'm not sure if that is the cause, or a side effect of Spine crashing or having an issue that causes the timeouts for Thold.

@bmfmancini

Oh, by the way, I found this in my messages log:

[604396.693209] spine[18286]: segfault at 0 ip (null) sp 00007f277f7d8608 error 14 in spine[400000+1c000]

@bmfmancini

I caught this today

2020/04/15 10:59:17 - SPINE: Poller[Main Poller] PID[23607] FATAL: Spine Encountered a Segmentation Fault [9, Bad file descriptor] (Spine thread)
2020/04/15 10:59:17 - SPINE: Poller[Main Poller] PID[23555] FATAL: Spine Encountered a Segmentation Fault [9, Bad file descriptor] (Spine thread)
2020/04/15 10:59:17 - SPINE: Poller[Main Poller] PID[23581] FATAL: Spine Encountered a Segmentation Fault [9, Bad file descriptor] (Spine thread)
2020/04/15 04:00:17 - SPINE: Poller[Main Poller] PID[32692] FATAL: Spine Encountered a Segmentation Fault [9, Bad file descriptor] (Spine thread)

It happens during replication of the remote pollers

@TheWitness

Capture a core file and then load it with gdb and do a backtrace for us. We need to know where it's failing.
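A minimal sketch of one way to do that on a CentOS 7 box like this one (the core_pattern path and the PID placeholder are examples; the ulimit has to apply to the environment that actually launches the poller):

# allow core files to be written (set this where the poller is launched, or system-wide)
ulimit -c unlimited
# optional, as root: put cores somewhere predictable (example path)
echo '/var/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

# after the next crash, open the core with gdb and capture the full backtrace
gdb /usr/local/spine/bin/spine /var/tmp/core.spine.<pid>
(gdb) thread apply all bt full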

@TheWitness TheWitness added the unverfied Not verified by a developer label Apr 25, 2020
@TheWitness

Is this case still occurring @bmfmancini ?

@bmfmancini

@TheWitness Just saw this same type of behavior on 1.2.12

======= Memory map: ========
/lib64/libc.so.6(clone+0x6d)[0x7f6926b3c02d]
/lib64/libpthread.so.0(+0x7dd5)[0x7f6927084dd5]
/usr/local/spine/bin/spine[0x40f34c]
/usr/local/spine/bin/spine[0x40cfad]
/lib64/libmysqlclient.so.18(mysql_close+0x1a)[0x7f6927e45a4a]
/lib64/libmysqlclient.so.18(+0x2da14)[0x7f6927e45a14]
/lib64/libmysqlclient.so.18(+0x30606)[0x7f6927e48606]
/lib64/libmysqlclient.so.18(+0x2aab5)[0x7f6927e42ab5]
/lib64/libmysqlclient.so.18(+0x5a2f7)[0x7f6927e722f7]
/lib64/libc.so.6(+0x81609)[0x7f6926abf609]
======= Backtrace: =========
*** Error in `/usr/local/spine/bin/spine': double free or corruption (out): 0x00007f68f401e9a0 ***
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
7ffe9d1a0000-7ffe9d1a2000 r-xp 00000000 00:00 0 [vdso]
7ffe9d043000-7ffe9d144000 rw-p 00000000 00:00 0 [stack]
7f2544cd3000-7f2544cd4000 rw-p 00000000 00:00 0

I'm not seeing any actual impact during the collection process. I seem to be able to pinpoint it to a blip in polling times, possibly a networking event; it happened overnight, so I am fairly sure that was it. Polling time was exceeded on one poll cycle and then dropped back to normal ranges for the rest of the night/day.

So I am thinking these errors must just be verbose errors produced when Spine times out? And since this scenario thankfully doesn't happen often, that's why I only see these once in a blue moon, without any overall impact.

@bmfmancini bmfmancini changed the title Spine 1.2.9 Memory map errors show up randomly Spine Memory map errors show up randomly Jul 20, 2020
@TheWitness

You are likely running out of connections.

SHOW GLOBAL VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'max_used_connections';

@bmfmancini

Connections seem fine

MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE 'max_connections';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| max_connections | 1000  |
+-----------------+-------+
1 row in set (0.00 sec)

MariaDB [(none)]> SHOW GLOBAL STATUS LIKE 'max_used_connections';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Max_used_connections | 727   |
+----------------------+-------+
1 row in set (0.00 sec)

@TheWitness

Could be shutdown handler. Look for poller runtime very close or above the poller interval in the cacti.log for that time range. Within 2 seconds.
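A quick sketch of one way to check that, assuming the cacti.log lives under this install's webroot (the path is an assumption) and using the SYSTEM STATS summary line Cacti writes at the end of each poller run:

grep 'SYSTEM STATS' /var/www/html/cacti/log/cacti.log | grep '2020/04/15'
# compare the Time: value against the poller interval; anything within
# about 2 seconds of the interval (e.g. 58-60s on a 60-second interval)
# suggests the run hit the shutdown handler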

@bmfmancini

And bingo, that is it. The few times this has happened, when I search the log it is when the poller timed out. So it's just Spine taking a dump when it times out; the memory map message can throw you off, though.

@netniV

netniV commented Jul 31, 2020

So it sounds like Spine needs better handling of the loss of connections, by cascading that up.

@TheWitness

Just did a commit. You should test with it. Check the CHANGELOG. We've had other customers reporting connection issues lately.

@TheWitness

Bad file descriptor? Hmm.

@TheWitness

Try this in /etc/security/limits.d

[root@vmhost3 limits.d]# cat 21-cacti.conf
*          soft    nofile    65535
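To confirm the higher limit is actually in effect for the running poller, something like the following works (the pgrep lookup is just an illustration; adjust for how spine is launched on your system):

grep 'Max open files' /proc/$(pgrep -o spine)/limits
ulimit -n    # for the shell that will launch the poller after re-login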

@TheWitness

Update to the latest 1.2.x branch and report back @bmfmancini.

@bmfmancini

Will do, give me a few days.

@TheWitness

Sean, please update to the latest 1.2.x. I found an issue in the re-index logic that was causing segfaults. I think this explains the somewhat random nature of the crashes.

@TheWitness

Related to #193
