JSON-RPC Crashes with 2.11 #7532

Closed
4 of 8 tasks
lazyfrosch opened this issue Sep 24, 2019 · 144 comments · Fixed by #7837
Assignees
Labels
area/distributed Distributed monitoring (master, satellites, clients) blocker Blocks a release or needs immediate attention core/crash Shouldn't happen, requires attention ref/NC
Milestone

Comments

@lazyfrosch
Contributor

lazyfrosch commented Sep 24, 2019

Task List

Mitigations

#7532 (comment)

Analysis

Related issues

#7569
#7687
#7624
#7470

ref/NC/636691
ref/NC/644339
ref/NC/644553
ref/NC/647127
ref/NC/652035
ref/NC/652071
ref/NC/652087

Original Report

The setup is a dual master system which was upgraded to 2.11 around noon yesterday.

In the late evening crashes started to appear and are now consistent. The system ran on 2.11-rc1 before.

The user had started upgrading agents to 2.11; this may be related.

ref/NC/636691

Latest crash

  Application version: r2.11.0-1

System information:
  Platform: Ubuntu
  Platform version: 18.04.3 LTS (Bionic Beaver)
  Kernel: Linux
  Kernel version: 4.15.0-1050-aws
  Architecture: x86_64

Build information:
  Compiler: GNU 8.3.0
  Build host: runner-LTrJQZ9N-project-298-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
Stacktrace:

        (0) libc.so.6: gsignal (+0xc7) [0x7f9a99cb4e97]
        (1) libc.so.6: abort (+0x141) [0x7f9a99cb6801]
        (2) libc.so.6: <unknown function> (+0x3039a) [0x7f9a99ca639a]
        (3) libc.so.6: <unknown function> (+0x30412) [0x7f9a99ca6412]
        (4) icinga2: <unknown function> (+0x3656c3) [0x55c11ec5d6c3]
        (5) icinga2: icinga::NotificationComponent::NotificationTimerHandler() (+0x1116) [0x55c11ec77a06]
        (6) icinga2: <unknown function> (+0x6da759) [0x55c11efd2759]
        (7) icinga2: icinga::Timer::Call() (+0x2d) [0x55c11eff151d]
        (8) icinga2: <unknown function> (+0x6f0acd) [0x55c11efe8acd]
        (9) icinga2: boost::asio::detail::executor_op<boost::asio::detail::work_dispatcher<icinga::ThreadPool::Post<std::function<void ()> >(std::function<void ()>, icinga::SchedulerPolicy)::{lambda()#1}>, std::allocator<void>, boost::asio::detail::scheduler_operation>::do_complete(void*, std::allocator<void>*, boost::system::error_code const&, unsigned long) (+0xc5) [0x55c11ef9e5d5]
        (10) icinga2: <unknown function> (+0x75807b) [0x55c11f05007b]
        (11) icinga2: <unknown function> (+0x6a2555) [0x55c11ef9a555]
        (12) icinga2: boost_asio_detail_posix_thread_function (+0xf) [0x55c11f047f3f]
        (13) libpthread.so.0: <unknown function> (+0x76db) [0x7f9a98a756db]
        (14) libc.so.6: clone (+0x3f) [0x7f9a99d9788f]

***
* This would indicate a runtime problem or configuration error. If you believe this is a bug in Icinga 2
* please submit a bug report at https://github.com/Icinga/icinga2 and include this stack trace as well as any other
* information that might be useful in order to reproduce this problem.
***

Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
The program is not being run.

Alternative crashes

        (0) libc.so.6: gsignal (+0xc7) [0x7f32355c0e97]
        (1) libc.so.6: abort (+0x141) [0x7f32355c2801]
        (2) libc.so.6: <unknown function> (+0x89897) [0x7f323560b897]
        (3) libc.so.6: <unknown function> (+0x9090a) [0x7f323561290a]
        (4) libc.so.6: cfree (+0x4dc) [0x7f3235619e2c]
        (5) icinga2: <unknown function> (+0x6d90d1) [0x56119dd120d1]
        (6) icinga2: icinga::JsonRpcConnection::SendMessageInternal(boost::intrusive_ptr<icinga::Dictionary> const&) (+0x4f) [0x56119dc4870f]
        (7) icinga2: <unknown function> (+0x5f6eac) [0x56119dc2feac]
        (8) icinga2: boost::asio::detail::strand_service::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) (+0x75) [0x56119dc8ce85]
        (9) icinga2: <unknown function> (+0x75807b) [0x56119dd9107b]
        (10) icinga2: icinga::IoEngine::RunEventLoop() (+0x5e) [0x56119dd85b1e]
        (11) libstdc++.so.6: <unknown function> (+0xbd66f) [0x7f323349b66f]
        (12) libpthread.so.0: <unknown function> (+0x76db) [0x7f32343816db]
        (13) libc.so.6: clone (+0x3f) [0x7f32356a388f]
@lazyfrosch
Contributor Author

I was not able to run Icinga 2 under GDB or attach to it:

/build/gdb-JPMZNV/gdb-8.1/gdb/dictionary.c:690: internal-error: void insert_symbol_hashed(dictionary*, symbol*): Assertion `SYMBOL_LANGUAGE (sym) == DICT_LANGUAGE (dict)->la_language' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.

@lazyfrosch
Contributor Author

After disabling the notification feature and running for a few minutes:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000563ab115b5de in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_local_data (
    this=<optimized out>, this=<optimized out>) at /usr/include/c++/8/ext/new_allocator.h:81
81      /usr/include/c++/8/ext/new_allocator.h: No such file or directory.
[Current thread is 1 (Thread 0x7f72580d4700 (LWP 20323))]
(gdb) bt
#0  0x0000563ab115b5de in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_local_data (
    this=<optimized out>, this=<optimized out>) at /usr/include/c++/8/ext/new_allocator.h:81
#1  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string (__str=...,
    this=<optimized out>, this=<optimized out>, __str=...) at /usr/include/c++/8/bits/basic_string.h:542
#2  icinga::String::String (other=..., this=<optimized out>, this=<optimized out>, other=...) at ./lib/base/string.cpp:37
#3  __gnu_cxx::new_allocator<icinga::String>::construct<icinga::String, icinga::String> (this=0x7f7220590c88,
    __p=0x7473756c43273a27) at /usr/include/c++/8/ext/new_allocator.h:136
#4  std::allocator_traits<std::allocator<icinga::String> >::construct<icinga::String, icinga::String> (__a=...,
    __p=0x7473756c43273a27) at /usr/include/c++/8/bits/alloc_traits.h:475
#5  std::vector<icinga::String, std::allocator<icinga::String> >::emplace_back<icinga::String> (this=<optimized out>)
    at /usr/include/c++/8/bits/vector.tcc:103
#6  0x0000563ab108d70f in icinga::JsonRpcConnection::SendMessageInternal(boost::intrusive_ptr<icinga::Dictionary> const&) ()
    at ./lib/remote/jsonrpcconnection.cpp:186
#7  0x0000563ab1074eac in boost::asio::detail::completion_handler<icinga::JsonRpcConnection::SendMessage(boost::intrusive_ptr<icinga::Dictionary> const&)::{lambda()#1}>::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) [clone .lto_priv.4649] () at ./lib/remote/jsonrpcconnection.cpp:173
#8  0x0000563ab10d1e85 in boost::asio::detail::strand_service::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long) () at /usr/include/icinga-boost/boost/asio/detail/scheduler_operation.hpp:40
/build/gdb-JPMZNV/gdb-8.1/gdb/dictionary.c:690: internal-error: void insert_symbol_hashed(dictionary*, symbol*): Assertion `SYMBOL_LANGUAGE (sym) == DICT_LANGUAGE (dict)->la_language' failed.
A problem internal to GDB has been detected,
further debugging may prove unreliable.

@bunghi

bunghi commented Sep 24, 2019

I have exactly the same issue. Today I upgraded from 2.10 to 2.11 and the service started to crash. I also noticed that memory usage is much higher than before.
I'm thinking of downgrading.

@lazyfrosch
Contributor Author

@bunghi please share logs and details, "exactly" the same won't help us...

@bunghi

bunghi commented Sep 24, 2019

The Icinga2 environment looks like this:

  • web server running icingaweb2
  • dual node master
  • 13 zones (one endpoint each, except for one zone with 2 endpoints)

This morning we upgraded icinga2 on all 15 servers (masters + zone endpoints). Since then, both endpoints in the two-endpoint zone have crashed (that zone has a lot of hosts).

Crash reports:

$ cat report.1569317444.555398
Caught unhandled exception.
Current time: 2019-09-24 11:30:44 +0200

  Application version: r2.11.0-1

System information:
  Platform: Debian GNU/Linux
  Platform version: 9 (stretch)
  Kernel: Linux
  Kernel version: 4.9.0-11-amd64
  Architecture: x86_64

Build information:
  Compiler: GNU 6.3.0
  Build host: runner-LTrJQZ9N-project-298-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

Error: Function call 'opendir' for file '/var/lib/icinga2/api/zones-stage//global' failed with error code 2, 'No such file or directory'

***
* This would indicate a runtime problem or configuration error. If you believe this is a bug in Icinga 2
* please submit a bug report at https://github.com/Icinga/icinga2 and include this stack trace as well as any other
* information that might be useful in order to reproduce this problem.
***
Failed to launch GDB: No such file or directory
$ cat report.1569312889.121968
Caught unhandled exception.
Current time: 2019-09-24 10:14:49 +0200

  Application version: r2.11.0-1

System information:
  Platform: Debian GNU/Linux
  Platform version: 9 (stretch)
  Kernel: Linux
  Kernel version: 4.9.0-11-amd64
  Architecture: x86_64

Build information:
  Compiler: GNU 6.3.0
  Build host: runner-LTrJQZ9N-project-298-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

Error: [json.exception.parse_error.101] parse error at line 1, column 28: syntax error while parsing value - unexpected '{'; expected end of input


        (0) icinga2: icinga::JsonDecode(icinga::String const&) (+0x779) [0x55e373ddfac9]
        (1) icinga2: icinga::Process::DoEvents() (+0x295) [0x55e373e33bf5]
        (2) icinga2: icinga::Process::IOThreadProc(int) (+0x3cd) [0x55e373e3529d]
        (3) libstdc++.so.6: <unknown function> (+0xb9e6f) [0x7fa157ae2e6f]
        (4) libpthread.so.0: <unknown function> (+0x74a4) [0x7fa1589194a4]
        (5) libc.so.6: clone (+0x3f) [0x7fa157257d0f]



***
* This would indicate a runtime problem or configuration error. If you believe this is a bug in Icinga 2
* please submit a bug report at https://github.com/Icinga/icinga2 and include this stack trace as well as any other
* information that might be useful in order to reproduce this problem.
***
Failed to launch GDB: No such file or directory

Since the upgrade this morning it has crashed 5 times with a kernel out-of-memory error.
kern_node1.log
kern_node2.log

Memory usage after upgrade:

image

@bunghi

bunghi commented Sep 25, 2019

I don't know if I should open a different issue for performance degradation after upgrade. Both RAM and CPU usage increased:

image

image

@dnsmichi
Contributor

Memory and CPU usage are expected to rise with the introduction of userland threads via Boost Coroutines. That is a separate matter from this issue.

Also, the JSON error between the main and spawn helper processes is a new issue; please move it into a dedicated issue, as requested in https://github.com/Icinga/icinga2/issues/7531#issuecomment-534547311
The opendir error is different as well; new issue, please.

@Al2Klimov
Member

@dnsmichi It's not because of the dependencies, otherwise this config would crash:

for (i in range(259)) {
	object Host i {
		check_command = "dummy"
		check_interval = 5s
	}

	object Dependency i use (i) {
		parent_host_name = i
		child_host_name = (i + 1) % 259
		disable_checks = true
	}
}
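
For readers unfamiliar with the DSL: the loop above makes each host `i` the parent of host `(i + 1) % 259`, so the 259 Dependency objects close into a single ring. A small Python sketch (not Icinga code; the names `N`, `child` and `ring_length` are made up for illustration) walks the same parent-to-child mapping and checks that it returns to host 0 only after visiting every host:

```python
N = 259  # number of hosts created by the config above

# child[i] mirrors child_host_name = (i + 1) % 259 for parent_host_name = i
child = {i: (i + 1) % N for i in range(N)}

def ring_length(start, mapping):
    """Follow parent -> child links until we are back at `start`."""
    steps, node = 0, start
    while True:
        node = mapping[node]
        steps += 1
        if node == start:
            return steps

print(ring_length(0, child))  # 259: one cycle covering all hosts
```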

@dnsmichi dnsmichi added area/distributed Distributed monitoring (master, satellites, clients) core/crash Shouldn't happen, requires attention labels Oct 2, 2019
@dnsmichi
Contributor

dnsmichi commented Oct 8, 2019

It could be related to the JsonEncode crashes seen in various other places as well, maybe in how memory is allocated and later freed when encoding dictionaries.

@dnsmichi dnsmichi self-assigned this Oct 8, 2019
@dnsmichi dnsmichi added this to the 2.12.0 milestone Oct 8, 2019
@dnsmichi dnsmichi added the blocker Blocks a release or needs immediate attention label Oct 17, 2019
@bunghi

bunghi commented Oct 24, 2019

Hi,

Today it crashed again after a while. Maybe the output helps:

$ cat report.1571908186.877839
  Application version: r2.11.1-1

System information:
  Platform: Debian GNU/Linux
  Platform version: 9 (stretch)
  Kernel: Linux
  Kernel version: 4.9.0-11-amd64
  Architecture: x86_64

Build information:
  Compiler: GNU 6.3.0
  Build host: runner-LTrJQZ9N-project-298-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
Stacktrace:

        (0) libc.so.6: gsignal (+0xcf) [0x7f1a2f41dfff]
        (1) libc.so.6: abort (+0x16a) [0x7f1a2f41f42a]
        (2) libc.so.6: <unknown function> (+0x70c00) [0x7f1a2f45bc00]
        (3) libc.so.6: <unknown function> (+0x76fc6) [0x7f1a2f461fc6]
        (4) libc.so.6: <unknown function> (+0x7780e) [0x7f1a2f46280e]
        (5) icinga2: icinga::ObjectImpl<icinga::CheckResult>::~ObjectImpl() (+0x7f) [0x558be62d3e1f]
        (6) icinga2: <unknown function> (+0x55b123) [0x558be6363123]
        (7) icinga2: icinga::Checkable::ProcessCheckResult(boost::intrusive_ptr<icinga::CheckResult> const&, boost::intrusive_ptr<icinga::MessageOrigin> const&) (+0x17a2) [0x558be62f9502]
        (8) icinga2: icinga::ClusterEvents::CheckResultAPIHandler(boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&) (+0xa6c) [0x558be62fbb7c]
        (9) icinga2: std::_Function_handler<icinga::Value (boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&), icinga::Value (*)(boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&)>::_M_invoke(std::_Any_data const&, boost::intrusive_ptr<icinga::MessageOrigin> const&, boost::intrusive_ptr<icinga::Dictionary> const&) (+0x23) [0x558be6205713]
        (10) icinga2: icinga::JsonRpcConnection::MessageHandler(icinga::String const&) (+0x531) [0x558be6151451]
        (11) icinga2: icinga::JsonRpcConnection::HandleIncomingMessages(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) (+0x283) [0x558be615ad03]
        (12) icinga2: <unknown function> (+0x43a18d) [0x558be624218d]
        (13) libboost_context.so.1.67.0: make_fcontext (+0x2f) [0x7f1a31dad72f]

***
* This would indicate a runtime problem or configuration error. If you believe this is a bug in Icinga 2
* please submit a bug report at https://github.com/Icinga/icinga2 and include this stack trace as well as any other
* information that might be useful in order to reproduce this problem.
***

Failed to launch GDB: No such file or directory

@dnsmichi
Contributor

@lippserd @Al2Klimov my suspicion is that this is related to the JSON library with encode/decode, likewise object serialization and a possible leak in there. I haven't run Valgrind yet, but this would be the next thing to try.

@Baboon92

ref/NC/644339

@mwaldmueller
Contributor

ref/NC/644553

@MarkNReynolds

I am experiencing exactly the same crash output as @bunghi posted above. I have a 2 node master cluster and around 200 agent instances. Prior to the upgrade to 2.11 the masters were stable, now both nodes crash several times a day.

Build information:
  Compiler: GNU 4.8.5
  Build host: runner-LTrJQZ9N-project-322-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
Stacktrace:

        (0) libc.so.6: gsignal (+0x37) [0x7f3a87108337]
        (1) libc.so.6: abort (+0x148) [0x7f3a87109a28]
        (2) libc.so.6: <unknown function> (+0x78e87) [0x7f3a8714ae87]
        (3) libc.so.6: <unknown function> (+0x7f7c4) [0x7f3a871517c4]
        (4) libc.so.6: <unknown function> (+0x82f00) [0x7f3a87154f00]
        (5) libc.so.6: __libc_malloc (+0x4c) [0x7f3a87157adc]
        (6) libstdc++.so.6: operator new(unsigned long) (+0x1d) [0x7f3a87c32ecd]
        (7) /usr/lib64/icinga2/sbin/icinga2() [0x66093f]
        (8) /usr/lib64/icinga2/sbin/icinga2() [0x6b9c12]
        (9) icinga2: icinga::JsonDecode(icinga::String const&) (+0x4ad) [0x917b2d]
        (10) icinga2: icinga::ApiListener::ReplayLog(boost::intrusive_ptr<icinga::JsonRpcConnection> const&) (+0xa4d) [0xb4497d]
        (11) icinga2: icinga::ApiListener::SyncClient(boost::intrusive_ptr<icinga::JsonRpcConnection> const&, boost::intrusive_ptr<icinga::Endpoint> const&, bool) (+0x61f) [0xb4699f]
        (12) /usr/lib64/icinga2/sbin/icinga2() [0xb47a7a]
        (13) libboost_context.so.1.69.0: make_fcontext (+0x2f) [0x7f3a89bf718f]

***
* This would indicate a runtime problem or configuration error. If you believe this is a bug in Icinga 2
* please submit a bug report at https://github.com/Icinga/icinga2 and include this stack trace as well as any other
* information that might be useful in order to reproduce this problem.
***

Failed to launch GDB: No such file or directory

I've installed gdb on the masters to see if it provides any useful details.

Mark

@carraroj

ref/NC/647127

@dnsmichi dnsmichi assigned Al2Klimov and htriem and unassigned dnsmichi Nov 11, 2019
@Napsty
Contributor

Napsty commented Nov 15, 2019

Yesterday (Nov 14th 2019) we experienced the same crash as @bunghi mentioned.
Dual master setup here, 4 zones.

Both masters run 2.11.2-1.xenial.

Caught unhandled exception.
Current time: 2019-11-14 15:30:05 +0100

  Application version: r2.11.2-1

System information:
  Platform: Ubuntu
  Platform version: 16.04.6 LTS (Xenial Xerus)
  Kernel: Linux
  Kernel version: 4.4.0-101-generic
  Architecture: x86_64

Build information:
  Compiler: GNU 5.4.0
  Build host: runner-LTrJQZ9N-project-298-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

Error: [json.exception.parse_error.101] parse error at line 1, column 27: syntax error while parsing value - unexpected '{'; expected end of input


        (0) icinga2: icinga::JsonDecode(icinga::String const&) (+0xcb2) [0x5fca02]
        (1) icinga2: icinga::Process::DoEvents() (+0x27c) [0x63b85c]
        (2) icinga2: icinga::Process::IOThreadProc(int) (+0x3b7) [0x6400a7]
        (3) libstdc++.so.6: <unknown function> (+0xb8c80) [0x7fb42e48ec80]
        (4) libpthread.so.0: <unknown function> (+0x76ba) [0x7fb42f0456ba]
        (5) libc.so.6: clone (+0x6d) [0x7fb43035041d]



***
* This would indicate a runtime problem or configuration error. If you believe this is a bug in Icinga 2
* please submit a bug report at https://github.com/Icinga/icinga2 and include this stack trace as well as any other
* information that might be useful in order to reproduce this problem.
***
Failed to launch GDB: No such file or directory
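
The `parse_error.101` in this report and in the earlier one ("unexpected '{'; expected end of input") is the kind of message a JSON parser gives when a second document starts right after a complete first one. Whether that is what actually happens inside Icinga 2 is an assumption here, not something this thread confirms. A minimal Python sketch of that error class, using invented payloads and a hypothetical `decode_stream` helper that tolerates back-to-back documents:

```python
import json

def decode_stream(buffer):
    """Decode every JSON document found back to back in `buffer`."""
    decoder = json.JSONDecoder()
    docs, pos = [], 0
    while pos < len(buffer):
        if buffer[pos].isspace():  # raw_decode does not skip leading whitespace
            pos += 1
            continue
        doc, pos = decoder.raw_decode(buffer, pos)
        docs.append(doc)
    return docs

# Invented payloads: two complete objects written without a separator.
first = '{"command": "spawn", "id": 1}'
second = '{"command": "spawn", "id": 2}'
buffer = first + second

try:
    json.loads(buffer)  # a strict one-document parse rejects the trailing object
except json.JSONDecodeError as err:
    print(err.msg, "at offset", err.pos)

print(decode_stream(buffer))
```

A strict single-document decode fails on such input, while a decoder that tracks its read position can recover both objects.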

@Al2Klimov
Member

I'll take care of that.

@Al2Klimov
Member

Testing e930efd and 699047e, just to be sure. I'll give you an update on Thursday.

@Al2Klimov
Member

e930efd has purred like a cat for two days. Green light for v2.12rc1.

@davekempe

I have a similar crash with a large number of endpoints that have been removed from the config but are still attempting to connect. There are about 400 endpoints hammering away, and icinga2 can only stay up for around 20 hours until I get this error:

icinga2: /usr/include/icinga-boost/boost/smart_ptr/intrusive_ptr.hpp:199: T* boost::intrusive_ptr<T>::operator->() const [with T = icinga::Endpoint]: Assertion `px != 0' failed.
Caught SIGABRT.
Current time: 2020-03-02 10:15:08 +0000

Will the latest snapshot discussed here fix this? Or should I lodge a separate bug report?
The de-configured endpoints will be re-enabled shortly, so I expect the problem to go away, but it's clear that many endpoints being rejected shouldn't be grounds for a crash.

This is a sample of the debuglog:

[2020-03-01 22:28:48 +0000] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'fp-mlb-au.example.com'.
[2020-03-01 22:28:48 +0000] notice/ClusterEvents: Discarding 'next check changed' message from 'fp-mlb-au.example.com': Invalid endpoint origin (client not allowed)

These messages happen over and over - the debug log gets very large quickly. There are about 10000 hosts behind the collective endpoints by the way, not sure that makes a difference, but some of those endpoints would have a ton of updates not sent.

@Al2Klimov
Member

Al2Klimov commented Mar 3, 2020

@davekempe Please try v2.11.3 once it has been released (later today).

@N-o-X N-o-X modified the milestones: 2.12.0, 2.11.3 Mar 3, 2020
@Al2Klimov Al2Klimov unpinned this issue Mar 3, 2020
@Al2Klimov
Member

Could all of you please test v2.11.3 and tell us whether it has fixed your particular problem?

@davekempe

Hey, sorry, I was going to get back but the bug was closed. Happy to report the issue is fixed. I was able to simulate the problem reliably, as it happened every 24 hours in our environment if we removed the endpoints via automation. After the update it has been fine with no crashes.

@hardoverflow

Unfortunately, we still have the problem.

[2020-03-04 09:43:14 +0100] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'host.abc.com'
Error: Connection reset by peer


        (0) icinga2: icinga::JsonRpc::SendRawMessage(std::shared_ptr<icinga::AsioTlsStream> const&, icinga::String const&, boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::exe
        (1) icinga2: icinga::JsonRpcConnection::WriteOutgoingMessages(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) (+0x231) [0xb64351]
        (2) /usr/lib64/icinga2/sbin/icinga2() [0xb6477a]
        (3) /usr/lib64/icinga2/sbin/icinga2() [0xb64c58]
        (4) libboost_context.so.1.69.0: make_fcontext (+0x2f) [0x7f81b595b18f]



[2020-03-04 09:43:14 +0100] warning/JsonRpcConnection: API client disconnected for identity 'host.abc.com'

@Al2Klimov
Member

@hardoverflow ... and it still crashes?

@mcktr
Member

mcktr commented Mar 6, 2020

@Al2Klimov @N-o-X Was the master branch (snapshot packages) affected by this bug? If so, the master branch is currently not fixed, since the fixes were merged directly into the support/2.11 branch.

The master branch should be tested prior to a 2.12 release to ensure the bug is fixed there too.

@N-o-X
Contributor

N-o-X commented Mar 9, 2020

@mcktr every change has also been merged into the master branch: #7841 #7837 #7737

@Al2Klimov
Member

... and tested successfully.

@hardoverflow

@Al2Klimov Once the cluster is up, it stays up. The error occurs sporadically when deploying. After a while, the ConfigMaster crashed. The second master is still running.

● icinga2.service - Icinga host/service/network monitoring system
   Loaded: loaded (/usr/lib/systemd/system/icinga2.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2020-03-17 10:14:15 CET; 3min 29s ago
  Process: 34392 ExecReload=/usr/lib/icinga2/safe-reload /etc/sysconfig/icinga2 (code=exited, status=0/SUCCESS)
  Process: 30207 ExecStart=/usr/sbin/icinga2 daemon --close-stdio -e ${ICINGA2_ERROR_LOG} (code=exited, status=139)
 Main PID: 30207 (code=exited, status=139)

Mar 17 07:11:03 master-a.fqdn.de safe-reload[18089]: Validating config files: Done
Mar 17 07:11:03 master-a.fqdn.de safe-reload[18089]: Reloading Icinga 2: Done
Mar 17 07:12:00 master-a.fqdn.de systemd[1]: Reloaded Icinga host/service/network monitoring system.
Mar 17 10:06:47 master-a.fqdn.de systemd[1]: Reloading Icinga host/service/network monitoring system.
Mar 17 10:07:12 master-a.fqdn.de safe-reload[34392]: Validating config files: Done
Mar 17 10:07:12 master-a.fqdn.de safe-reload[34392]: Reloading Icinga 2: Done
Mar 17 10:08:16 master-a.fqdn.de systemd[1]: Reloaded Icinga host/service/network monitoring system.
Mar 17 10:14:15 master-a.fqdn.de systemd[1]: icinga2.service: main process exited, code=exited, status=139/n/a
Mar 17 10:14:15 master-a.fqdn.de systemd[1]: Unit icinga2.service entered failed state.
Mar 17 10:14:15 master-a.fqdn.de systemd[1]: icinga2.service failed.
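
`status=139` in the journal above fits the common convention of 128 plus the signal number: 139 - 128 = 11, which is SIGSEGV on Linux. That reading assumes the exiting Icinga 2 process follows this convention; the thread does not say so explicitly. A small Python sketch of the arithmetic (the helper name `describe_exit_status` is made up):

```python
import signal

def describe_exit_status(status):
    """Interpret an exit status using the 128 + signal-number convention."""
    if status > 128:
        signum = status - 128
        try:
            return f"terminated by {signal.Signals(signum).name}"
        except ValueError:  # number does not map to a known signal
            return f"terminated by signal {signum}"
    return f"exited with code {status}"

print(describe_exit_status(139))
```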

We also observe the following network bandwidth for masters and satellites.
image

@Al2Klimov
Member

Please share the output of icinga2 --version run on the crashed node.

@hardoverflow

icinga2 - The Icinga 2 network monitoring daemon (version: 2.11.3-1)

Copyright (c) 2012-2020 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-1062.4.1.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: runner-LTrJQZ9N-project-322-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
[root@master-a.fqdn.de]#

@Al2Klimov
Member

Damn. It's actually v2.11.3.

@lippserd @N-o-X @htriem We've got a (hopefully not so) big problem.

@lippserd
Member

@hardoverflow Could you please upload core dumps here?

@hardoverflow

@lippserd Done. Can you confirm?

@Al2Klimov
Member

Al2Klimov commented Mar 18, 2020

Confirmed.

All of you: if you give us core dumps, please gzip them, and if you request them, ask for them gzipped. Not all of us necessarily have enterprise downlinks due to COVID-19.
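
For anyone who prefers a script over the `gzip` command line, here is a minimal Python sketch that compresses a core dump before upload. The helper name `gzip_file` is a placeholder for illustration, not anything Icinga ships:

```python
import gzip
import shutil
from pathlib import Path

def gzip_file(path):
    """Write `<path>.gz` next to `path` and return the new file's path."""
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks, so large dumps are fine
    return dst
```

Calling `gzip_file` with the path of a core file leaves the original in place and writes the compressed copy beside it.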

N-o-X pushed a commit that referenced this issue May 8, 2020
This includes the following fixes:

nlohmann/json#1436

> For a deeply-nested JSON object, the recursive implementation of json_value::destroy function causes stack overflow.

nlohmann/json#1708
nlohmann/json#1722

Stack size

nlohmann/json#1693 (comment)

Integer Overflow

nlohmann/json#1447

UTF8, json dump out of bounds

nlohmann/json#1445

Possibly influences #7532
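
The quoted nlohmann/json#1436 describes a recursive destroy overflowing the stack on deeply nested input. As a rough analogy only (this is Python, not the library's C++ code, and `nest`, `depth_recursive` and `depth_iterative` are invented names), the sketch below walks the same deeply nested structure twice: the recursive walk runs out of stack, while the version with an explicit work list does not.

```python
def nest(levels):
    """Build a list nested `levels` deep, iteratively."""
    root = cur = []
    for _ in range(levels - 1):
        child = []
        cur.append(child)
        cur = child
    return root

def depth_recursive(node):
    """Uses one Python stack frame per nesting level."""
    deepest = 0
    for child in node:
        deepest = max(deepest, depth_recursive(child))
    return 1 + deepest

def depth_iterative(node):
    """Same result, but pending nodes live on a heap-allocated list."""
    deepest, todo = 0, [(node, 1)]
    while todo:
        cur, depth = todo.pop()
        deepest = max(deepest, depth)
        todo.extend((child, depth + 1) for child in cur)
    return deepest

deep = nest(50_000)
try:
    depth_recursive(deep)
except RecursionError:
    print("recursive walk exceeded the stack limit")
print(depth_iterative(deep))
```

Python raises a catchable `RecursionError` where a C++ program would typically crash, but the trade-off is the same: moving the pending work from the call stack to a heap container removes the depth limit.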