Icinga2 Segmentation Fault #6520

Closed
firatalkis opened this issue Aug 6, 2018 · 18 comments
Labels
core/crash (Shouldn't happen, requires attention), duplicate (This issue or pull request already exists)

Comments

@firatalkis

We are running Icinga2 version r2.9.1-1 on a VM (Red Hat 7.5, Maipo). Our architecture has 1 master and 9 slave servers. The Icinga2 service installed on one of the slave servers crashes frequently. When we check the messages log, we see this pattern: SIGSEGV. We followed the GDB steps from the docs and got the results below. If anyone has the same issue, please share your comments.

icinga2.zip

icinga2.log is in the attachment.

GDB Output

[root@hostname cores]# gdb /usr/lib64/icinga2/sbin/icinga2 core.icinga2.36862
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/lib64/icinga2/sbin/icinga2...Reading symbols from /usr/lib64/icinga2/sbin/icinga2...(no debugging symbols found)...done.
(no debugging symbols found)...done.
[New LWP 93781]
[New LWP 93799]
[New LWP 91795]
[New LWP 94443]
[New LWP 93751]
[New LWP 94342]
[New LWP 93754]
[New LWP 93757]
[New LWP 94346]
[New LWP 93752]
[New LWP 36999]
[New LWP 93770]
[New LWP 93800]
[New LWP 93763]
[New LWP 39430]
[New LWP 93798]
[New LWP 105815]
[New LWP 36862]
[New LWP 105818]
[New LWP 93766]
[New LWP 105814]
[New LWP 93765]
[New LWP 61141]
[New LWP 124709]
[New LWP 93724]
[New LWP 124608]
[New LWP 93756]
[New LWP 93750]
[New LWP 94306]
[New LWP 105817]
[New LWP 94254]
[New LWP 33679]
[New LWP 93755]
[New LWP 93764]
[New LWP 93753]
[New LWP 93801]
[Thread debugging using libthread_db enabled]
Using host libthread_db library “/lib64/libthread_db.so.1”.
Core was generated by `/usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon -e /var/log/icinga2/er'.
Program terminated with signal 11, Segmentation fault.
#0 0x00002ba91e938d58 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /lib64/libstdc++.so.6
Missing separate debuginfos, use: debuginfo-install icinga2-bin-2.9.1-1.el7.icinga.x86_64
(gdb) bt
#0 0x00002ba91e938d58 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /lib64/libstdc++.so.6
#1 0x0000000000960225 in icinga::Comment::RemoveComment(icinga::String const&, boost::intrusive_ptr<icinga::MessageOrigin> const&) ()
#2 0x00000000008a0cf6 in icinga::Checkable::RemoveCommentsByType(int) ()
#3 0x0000000000a10364 in icinga::Checkable::ProcessCheckResult(boost::intrusive_ptr<icinga::CheckResult> const&, boost::intrusive_ptr<icinga::MessageOrigin> const&) ()
#4 0x0000000000a1f3d1 in icinga::ClusterEvents::CheckResultAPIHandler(boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&) ()
#5 0x000000000078f69f in std::_Function_handler<icinga::Value (boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&), icinga::Value (*)(boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&)>::_M_invoke(std::_Any_data const&, boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&) ()
#6 0x00000000009b4923 in icinga::JsonRpcConnection::MessageHandler(icinga::String const&) ()
#7 0x00000000009b54ab in icinga::JsonRpcConnection::MessageHandlerWrapper(icinga::String const&) ()
#8 0x000000000071f469 in icinga::WorkQueue::RunTaskFunction(std::function<void ()> const&) ()
#9 0x000000000073f0f7 in icinga::WorkQueue::WorkerThreadProc() ()
#10 0x00002ba91d18d27a in thread_proxy () from /lib64/libboost_thread-mt.so.1.53.0
#11 0x00002ba91f0a0dd5 in start_thread () from /lib64/libpthread.so.0
#12 0x00002ba91f3b3b3d in clone () from /lib64/libc.so.6
(gdb)
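
The "no debugging symbols found" messages and the debuginfo-install hint mean the frames in this backtrace carry no file or line information. A minimal sketch, assuming the debuginfo package named by gdb has since been installed and Python 3.8+ is available, of building a non-interactive gdb call that dumps the stacks of all threads with their locals. The helper name `gdb_backtrace_command` is made up for illustration; the paths are the ones from the session above.

```python
import shlex


def gdb_backtrace_command(binary, core):
    """Build a batch-mode gdb invocation that prints a full backtrace of every thread."""
    return [
        "gdb",
        "-batch",                           # exit after running the commands
        "-ex", "thread apply all bt full",  # backtrace with locals for all threads
        binary,
        core,
    ]


cmd = gdb_backtrace_command(
    "/usr/lib64/icinga2/sbin/icinga2",  # binary from the session above
    "core.icinga2.36862",               # core file from the session above
)
# Print a copy-pasteable shell command instead of running it here.
print(shlex.join(cmd))
```

Redirecting the output of that command to a file gives something that can be attached to the issue in one piece.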

@Crunsher
Contributor

Crunsher commented Aug 6, 2018

There should be a crashlog with the other logs, could you provide that one as well please?

Crunsher added the needs feedback and core/crash labels on Aug 6, 2018

@firatalkis
Author

No crash log was written to the crash/ directory. What can I do to generate one? I couldn't find anything about that in the documentation.

@Crunsher
Contributor

Crunsher commented Aug 6, 2018

Interesting, a crash log should always be written. If it's not, that's at least a hint ^_^

@dnsmichi
Contributor

dnsmichi commented Aug 6, 2018

For some reason, the check result processed here puts the checkable into a state of Recovery. This triggers the removal of the Acknowledgement.

For some reason, there are no comments associated with this acknowledgement. This would point to a broken cluster where one node has a broken API package and has not loaded all the comments.

Still, it shouldn't crash just by that.

@firatalkis
Author

firatalkis commented Aug 14, 2018

The problem continues; the latest GDB output is below. Do you have any other suggestions?

GDB Output

[root@cluster1 cores]# gdb /usr/lib64/icinga2/sbin/icinga2 core.icinga2.26932
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/lib64/icinga2/sbin/icinga2...Reading symbols from /usr/lib64/icinga2/sbin/icinga2...(no debugging symbols found)...done.
(no debugging symbols found)...done.
[New LWP 26950]
[New LWP 51291]
[New LWP 51265]
[New LWP 51922]
[New LWP 51292]
[New LWP 26951]
[New LWP 51280]
[New LWP 51290]
[New LWP 51287]
[New LWP 51288]
[New LWP 51299]
[New LWP 51297]
[New LWP 51279]
[New LWP 107174]
[New LWP 51274]
[New LWP 51834]
[New LWP 107145]
[New LWP 107195]
[New LWP 26949]
[New LWP 51833]
[New LWP 51298]
[New LWP 57062]
[New LWP 51276]
[New LWP 51277]
[New LWP 51281]
[New LWP 27345]
[New LWP 51275]
[New LWP 51296]
[New LWP 51278]
[New LWP 74163]
[New LWP 51818]
[New LWP 51289]
[New LWP 51850]
[New LWP 26932]
[New LWP 51256]
[New LWP 107190]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon -e /var/log/icinga2/er'.
Program terminated with signal 11, Segmentation fault.
#0 0x00002b514cce7d58 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /lib64/libstdc++.so.6
Missing separate debuginfos, use: debuginfo-install icinga2-bin-2.9.1-1.el7.icinga.x86_64
(gdb) bt
#0 0x00002b514cce7d58 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /lib64/libstdc++.so.6
#1 0x0000000000960225 in icinga::Comment::RemoveComment(icinga::String const&, boost::intrusive_ptr<icinga::MessageOrigin> const&) ()
#2 0x00000000008a0cf6 in icinga::Checkable::RemoveCommentsByType(int) ()
#3 0x0000000000a10364 in icinga::Checkable::ProcessCheckResult(boost::intrusive_ptr<icinga::CheckResult> const&, boost::intrusive_ptr<icinga::MessageOrigin> const&) ()
#4 0x0000000000a16d0b in icinga::PluginCheckTask::ProcessFinishedHandler(boost::intrusive_ptr<icinga::Checkable> const&, boost::intrusive_ptr<icinga::CheckResult> const&, icinga::Value const&, icinga::ProcessResult const&) ()
#5 0x0000000000806cfa in icinga::ThreadPool::WorkerThread::ThreadProc(icinga::ThreadPool::Queue&) ()
#6 0x00002b514b53c27a in thread_proxy () from /lib64/libboost_thread-mt.so.1.53.0
#7 0x00002b514d44fdd5 in start_thread () from /lib64/libpthread.so.0
#8 0x00002b514d762b3d in clone () from /lib64/libc.so.6
(gdb)

/var/log/messages

Aug 14 15:24:33 cluster1 kernel: [1914456.090348] icinga2[26950]: segfault at 48 ip 00002b514cce7d58 sp 00002b5152af5430 error 4 in libstdc++.so.6.0.19[2b514cc29000+e9000]
Aug 14 15:24:34 cluster1 systemd[1]: icinga2.service: main process exited, code=killed, status=11/SEGV
Aug 14 15:24:34 cluster1 systemd[1]: Unit icinga2.service entered failed state.
Aug 14 15:24:34 cluster1 systemd[1]: icinga2.service failed.
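
The kernel line above already says a fair amount about the crash. A rough sketch of decoding it, assuming the usual x86 page-fault error bits (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode) and that the error value is printed in hex. The function name `decode_segfault` is made up for illustration.

```python
import re

# Matches the kernel's x86 segfault line as seen in /var/log/messages.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Extract the fault address, library offset and access type from a segfault line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    err = int(m["err"], 16)
    return {
        "fault_address": int(m["addr"], 16),
        "library": m["lib"],
        # Instruction pointer relative to the library's load address.
        "offset_in_library": int(m["ip"], 16) - int(m["base"], 16),
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
        "cause": "protection violation" if err & 1 else "page not present",
    }


line = (
    "Aug 14 15:24:33 cluster1 kernel: [1914456.090348] icinga2[26950]: "
    "segfault at 48 ip 00002b514cce7d58 sp 00002b5152af5430 error 4 "
    "in libstdc++.so.6.0.19[2b514cc29000+e9000]"
)
info = decode_segfault(line)
print(hex(info["fault_address"]), hex(info["offset_in_library"]), info["access"], info["cause"])
```

For this line the result is a user-mode read of an unmapped page at address 0x48. An address that small is consistent with reading a member through a null pointer, which would fit the std::string copy in frame #0 of the backtrace, though that interpretation is a guess without debug symbols.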

@Crunsher
Contributor

Looking at the code I'm uncertain how RemoveComment could fail in such a spectacular way. We are going to need a way to reproduce this.

@ghost

ghost commented Aug 28, 2018

Hello, we are getting a lot of segfaults since we upgraded from 2.8 to 2.9.1. All servers are CentOS 7.5, and all are having the same issues. Client nodes just randomly die. No crash logs or anything useful that we could find.

[7468808.154021] icinga2[21086]: segfault at 7ff9a44b2dc0 ip 00007ff9a120362c sp 00007ffc65ac92f0 error 4 in libc-2.17.so[7ff9a1183000+1c3000]

kernel 3.10.0-862.3.2.el7.x86_64

`
============== GENERAL INFORMATION ==============

Application version: r2.9.1-1
Installation root: /usr
Sysconf directory: /etc
Run directory: /run
Local state directory: /var
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid

Enabled features:
api mainlog
Disabled features:
checker command compatlog debuglog elasticsearch gelf graphite influxdb livestatus notification opentsdb perfdata statusdata syslog

########################
checker is disabled, no checks can be run from this instance
########################

########################
debuglog is disabled, please activate it and rerun icinga2
########################

============== OBJECT INFORMATION ==============

Checking object file from /var/cache/icinga2/icinga2.debug
Found the 248 objects:
Type : Count
ApiListener : 1
ApiUser : 1
CheckCommand : 238
Endpoint : 2
FileLogger : 1
IcingaApplication : 1
Zone : 4

The objects origins are:

/etc/icinga2/conf.d/api-users.conf
/etc/icinga2/conf.d/commands.conf
/etc/icinga2/features-enabled/api.conf
/etc/icinga2/features-enabled/mainlog.conf
/etc/icinga2/zones.conf
/usr/share/icinga2/include/command-icinga.conf
/usr/share/icinga2/include/command-nscp-local.conf
/usr/share/icinga2/include/command-plugins-manubulon.conf
/usr/share/icinga2/include/command-plugins.conf
/usr/share/icinga2/include/plugins-contrib.d/databases.conf
/usr/share/icinga2/include/plugins-contrib.d/hardware.conf
/usr/share/icinga2/include/plugins-contrib.d/icingacli.conf
/usr/share/icinga2/include/plugins-contrib.d/ipmi.conf
/usr/share/icinga2/include/plugins-contrib.d/logmanagement.conf
/usr/share/icinga2/include/plugins-contrib.d/metrics.conf
/usr/share/icinga2/include/plugins-contrib.d/network-components.conf
/usr/share/icinga2/include/plugins-contrib.d/network-services.conf
/usr/share/icinga2/include/plugins-contrib.d/operating-system.conf
/usr/share/icinga2/include/plugins-contrib.d/raid-controller.conf
/usr/share/icinga2/include/plugins-contrib.d/smart-attributes.conf
/usr/share/icinga2/include/plugins-contrib.d/storage.conf
/usr/share/icinga2/include/plugins-contrib.d/virtualization.conf
/usr/share/icinga2/include/plugins-contrib.d/vmware.conf
/usr/share/icinga2/include/plugins-contrib.d/web.conf

============== LOGS AND CRASH REPORTS ==============

Getting the last 20 lines of 1 FileLogger objects.
Logger main-log at path: /var/log/icinga2/icinga2.log
[begin: '/var/log/icinga2/icinga2.log' line: 0]
[end: '/var/log/icinga2/icinga2.log' line: 0]

########################
/var/log/icinga2/icinga2.log either does not exist or is empty
########################

No crash logs found in /var/log/icinga2/crash/
`

@Crunsher
Contributor

@fedepires Is there really no log at all? The original reporter could not find any either. And were you able to discern some kind of pattern for the crashes?

@ghost

ghost commented Aug 30, 2018

Nothing in the logs, we checked several times. No crash logs, and icinga2.log looks as usual and then just stops. There's no apparent pattern in the crashes, all nodes are mostly the same (same OS, same kernel, similar hardware and resources).

@N-o-X
Contributor

N-o-X commented Sep 3, 2018

@firatalkis would it be possible to have a look at your configuration? Icinga should not spam "Cannot create object .. already exists." that often. There might be something wrong in your cluster setup.

Also, is there a reason for creating multiple comments every second, especially on the agent?

@dnsmichi
Contributor

dnsmichi commented Sep 6, 2018

A full core dump of the crash would help in both cases.

@firatalkis
Author

@N-o-X You can find conf files in the attachment.

When we add an acknowledgement or comment, Icinga2 servers randomly fail with SIGSEGV.

confs.zip

@dnsmichi
Contributor

That won't work, as it is known that more than 2 endpoints in a zone create a loop with routing. In your case this would explain all these log entries and crashes later on.

object Zone "checker" {
  endpoints = [ "hosntame3", "hosntame4", "hosntame5", "hosntame6","hosntame2", "hosntame7", "hosntame1", "hosntame8", "hosntame9" ]
  parent = "master"
}
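
To spot zones like this one before they cause trouble, here is a rough sketch that scans zones.conf text for Zone objects listing more than two endpoints. It is a plain regex pass, not Icinga's config parser, so it assumes simple `object Zone "name" { ... }` blocks without nested braces. The function name `oversized_zones` and the "master1" endpoint are made up for illustration.

```python
import re

# Simplified patterns; they assume no nested braces inside a Zone block.
ZONE_RE = re.compile(r'object\s+Zone\s+"(?P<name>[^"]+)"\s*\{(?P<body>.*?)\}', re.S)
ENDPOINTS_RE = re.compile(r'endpoints\s*=\s*\[(?P<list>.*?)\]', re.S)


def oversized_zones(config_text, limit=2):
    """Return {zone name: endpoint list} for zones with more than `limit` endpoints."""
    result = {}
    for zone in ZONE_RE.finditer(config_text):
        match = ENDPOINTS_RE.search(zone["body"])
        if match is None:
            continue  # zone without an endpoints attribute
        endpoints = re.findall(r'"([^"]+)"', match["list"])
        if len(endpoints) > limit:
            result[zone["name"]] = endpoints
    return result


sample = '''
object Zone "master" {
  endpoints = [ "master1" ]
}

object Zone "checker" {
  endpoints = [ "hosntame3", "hosntame4", "hosntame5", "hosntame6","hosntame2", "hosntame7", "hosntame1", "hosntame8", "hosntame9" ]
  parent = "master"
}
'''

for name, endpoints in oversized_zones(sample).items():
    print(f'Zone "{name}" has {len(endpoints)} endpoints (limit is 2)')
```

Run against the config above, only the "checker" zone is reported, with its nine endpoints.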

dnsmichi added the duplicate label on Sep 17, 2018

@firatalkis
Author

Hi @dnsmichi,

Is it possible for 2 endpoints to handle the workload of 7600 servers and 14500 service checks?
The system checks (CPU, memory, storage) run at a 2m check interval.

@dnsmichi
Contributor

When you throw enough resources at it, sure. We don't know the specs unfortunately.
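
For a sense of scale, a back-of-the-envelope estimate using the numbers from the question. It assumes all 14500 service checks run at the 2m interval and are split evenly across the two endpoints; neither is confirmed in the thread, so treat the result as a rough figure only.

```python
def checks_per_second(total_checks, interval_seconds, endpoints):
    """Return (overall rate, per-endpoint rate) in checks per second."""
    total_rate = total_checks / interval_seconds
    return total_rate, total_rate / endpoints


# 14500 service checks, 2 minute interval, 2 endpoints (figures from the question).
total, per_endpoint = checks_per_second(14500, 120, 2)
print(f"{total:.1f} checks/s overall, {per_endpoint:.1f} per endpoint")
# prints "120.8 checks/s overall, 60.4 per endpoint"
```

Whether roughly 60 checks per second per node is comfortable depends on the plugins being run and on the hardware, which is why the specs matter.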

dnsmichi removed the needs feedback label on Sep 28, 2018

@dnsmichi
Contributor

Closing since it is a known problem with #3533

@firatalkis
Author

Thanks @dnsmichi, our problem has been solved.
After we limited the zone to two endpoints in the zones file, the icinga2 service stopped crashing and Icinga2 runs more stably.

@ghost

ghost commented Oct 17, 2018

For the record, we are still seeing this on 2.9.2 and we don't have multiple endpoints in a zone anywhere. We will soon test whether this still happens on 2.10.0.
