Icinga 2.11.x crashing every 30 minutes in JsonRpcConnection::HandleAndWriteHeartbeats() #7569
Comments
I'll try to get gdb and the debug symbols on there so that I can give you more information.
Never mind, the debug symbol packages don't seem to be available in the icinga2 package repo.
Hmm, this is curious. These are pretty much the last few lines of the debug log:
[2019-10-09 13:45:47 +0000] warning/JsonRpcConnection: Client for endpoint 'icinga-satellite.servers.ca.internal.networkradius.com' has requested heartbeat message but hasn't responded in time. Closing connection.
This seems to be new code that was written for 2.11. Could it be causing the crash?
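A quick way to check how often that warning shows up right before a crash is to grep the debug log for it (a sketch, assuming the default debuglog path /var/log/icinga2/debug.log):

    # Show the most recent occurrences of the heartbeat warning quoted above
    grep 'has requested heartbeat message' /var/log/icinga2/debug.log | tail -n 5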
It looks like network problems might be causing it to crash. It just crashed again; debug log output:
Interesting find. Can you provoke this e.g. with firewall rules? The debug packages for Ubuntu are located in the …
I'll see if I can provoke it using firewall rules. I tried installing the debug packages and I just get this error message:
I tried blocking all access to port 5665 from the master for a good half an hour today and I still couldn't get it to crash. Any ideas on how to install the debug packages?
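For reference, one way such a block could look on the satellite is with iptables rules like the following (a sketch for illustration only; the port 5665 and the idea of cutting the connection to the master come from this thread):

    # Drop all inbound and outbound Icinga API traffic on port 5665
    iptables -A INPUT -p tcp --dport 5665 -j DROP
    iptables -A OUTPUT -p tcp --dport 5665 -j DROP
    # Remove the rules again afterwards
    iptables -D INPUT -p tcp --dport 5665 -j DROP
    iptables -D OUTPUT -p tcp --dport 5665 -j DROP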
Hm, the problem with the debug packages sounds a little like apt priorities or pinning. Do you have that in place for the icinga package repository?
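A way to inspect this on the affected host (a sketch; the package names are the ones used elsewhere in this thread, and the explicit version string is an assumption that has to be adjusted to what apt-cache policy actually shows):

    # Show which versions apt sees and which pin priorities apply
    apt-cache policy icinga2 icinga2-bin
    # List any pinning rules that might shadow the Icinga repository
    grep -r . /etc/apt/preferences.d/ 2>/dev/null
    # Install a specific version explicitly if a pin keeps pulling an older one
    # (version suffix is an assumption; copy the exact string from apt-cache policy)
    apt-get install icinga2-bin=2.11.0-2.bionic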
Figured it out: I have a weird priority setting on that server that I can't track down, but I managed to get it installed by specifying the version. I also installed gdb. A new crash file is attached, but it doesn't seem to have much more information. Do I need to run icinga2 outside of systemd to get the crash reporting to work properly? Also, I noticed that the version reported in the crash report is 2.11.0-1 while the version reported by dpkg is 2.11.0-2; could this be causing a problem?
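Regarding crash reporting under systemd: running the daemon in the foreground is not strictly necessary if core dumps are captured by systemd-coredump; a sketch, assuming that package and gdb are installed:

    # List recorded crashes of the icinga2 binary
    coredumpctl list icinga2
    # Open the most recent core dump in gdb
    coredumpctl gdb icinga2
    # (inside gdb) thread apply all bt full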
I had to incorporate additional package changes when releasing 2.11 (human error, I forgot them), so all the Debian packages now say …
Ah, I thought that was the case. I made sure that the latest version of icinga2 was running. It gets restarted on a pretty regular basis, so it's definitely the latest version.
Can you remove the needs-feedback tag? Or specify what further feedback is needed?
Hi team, I've experienced the same issue since our dual-master setup was upgraded from 2.10 to 2.11.2-1 (almost a week ago). My first instance crashes, and the second one goes down a few minutes later. A colleague noticed that the second instance once ran for quite a long time. The day before yesterday both instances ran smoothly for about 6 hours, but then things got worse, and now icinga2 crashes every half an hour. I can't submit reproduction steps as they're unclear to me. Attaching some information on my environment.
Icinga2 version:
Icinga2 features:
Config validation:
The stacktrace of all crashes looks pretty much the same:
Another one for the list @lippserd.
Hi team, any updates on this? Do you need any further details or feedback from us? We're really looking forward to this fix. Many thanks.
🚀 I did it! 🚀
139 - 128 = 11, i.e. SIGSEGV.
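For context, shells report a process killed by a signal as 128 plus the signal number, so 139 maps to signal 11 (SIGSEGV) and the exit code 134 from the original report maps to signal 6 (SIGABRT). This can be checked with:

    kill -l 11   # prints SEGV
    kill -l 6    # prints ABRT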
@dnsmichi I'd suggest making a v2.12-rc1 without this issue being fixed.
Umm, yes, it's SIGSEGV, but that doesn't mean it's not a bug...
@Al2Klimov Since you've mentioned that notifications are now enabled, a possible indicator would be the new NotificationResult cluster message merged in 2.11. This involves JSON encoding/decoding of nested dictionaries. Please look in this direction.
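One way to check whether these messages are actually flowing shortly before a crash is to grep the debug log for them (a sketch; the term NotificationResult is taken from the comment above, and the default debuglog path is assumed):

    grep -i 'NotificationResult' /var/log/icinga2/debug.log | tail -n 20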
For me it's not every 30 minutes, but every day.

Terraform:

resource "openstack_compute_instance_v2" "aklimov-636691-master" {
count = 2
name = "aklimov-636691-master-${count.index}"
image_name = "Debian 9"
flavor_name = "s1.xxlarge"
network {
name = "${var.tenant_network}"
fixed_ip_v4 = "10.77.28.${count.index + 3}"
}
security_groups = [ "${openstack_compute_secgroup_v2.iliketrains.name}" ]
key_pair = "${var.openstack_keypair}"
}
resource "openstack_compute_instance_v2" "aklimov-636691-agent-1" {
count = 251
name = "aklimov-636691-agent-${count.index}"
image_name = "Debian 9"
flavor_name = "s1.2310-custom-1-0.5-13"
network {
name = "${var.tenant_network}"
fixed_ip_v4 = "10.77.28.${count.index + 5}"
}
security_groups = [ "${openstack_compute_secgroup_v2.iliketrains.name}" ]
key_pair = "${var.openstack_keypair}"
}
resource "openstack_compute_instance_v2" "aklimov-636691-agent-2" {
count = 49
name = "aklimov-636691-agent-${count.index + 251}"
image_name = "Debian 9"
flavor_name = "s1.2310-custom-1-0.5-13"
network {
name = "${var.tenant_network}"
fixed_ip_v4 = "10.77.29.${count.index}"
}
security_groups = [ "${openstack_compute_secgroup_v2.iliketrains.name}" ]
key_pair = "${var.openstack_keypair}"
}

Ansible:

---
- hosts: all
become: yes
become_method: sudo
tasks:
- apt_repository:
filename: buster
repo: |
deb http://deb.debian.org/debian buster main contrib non-free
deb-src http://deb.debian.org/debian buster main contrib non-free
deb http://security.debian.org/ buster/updates main contrib non-free
deb-src http://security.debian.org/ buster/updates main contrib non-free
deb http://deb.debian.org/debian buster-updates main contrib non-free
deb-src http://deb.debian.org/debian buster-updates main contrib non-free
- apt:
upgrade: dist
register: upgrade
- when: upgrade.changed
reboot: {}
- hosts: icingas
become: yes
become_method: sudo
tasks:
- apt_key:
url: https://packages.icinga.com/icinga.key
- apt_repository:
filename: icinga
repo: >
deb http://packages.icinga.com/{{ ansible_lsb.id |lower }}
icinga-{{ ansible_lsb.codename }} main
- loop:
- icinga2-bin
- monitoring-plugins
apt:
name: '{{ item }}'
- hosts: aklimov-636691-master-0
become: yes
become_method: sudo
tasks:
- shell: >
icinga2 node setup
--zone master
--listen 0.0.0.0,5665
--cn {{ inventory_hostname }}
--master
--disable-confd;
rm -f /var/cache/icinga2/icinga2.vars
args:
creates: /var/lib/icinga2/certs/ca.crt
notify: Restart Icinga 2
- shell: icinga2 daemon -C
args:
creates: /var/cache/icinga2/icinga2.vars
- with_inventory_hostnames:
- 'icingas:!{{ inventory_hostname }}'
shell: >
icinga2 pki ticket --cn {{ item }}
>/var/cache/icinga2/{{ item }}.ticket
args:
creates: '/var/cache/icinga2/{{ item }}.ticket'
- with_inventory_hostnames:
- 'icingas:!{{ inventory_hostname }}'
fetch:
dest: .tempfiles
src: '/var/cache/icinga2/{{ item }}.ticket'
- name: Fetch Icinga 2 master cert
fetch:
dest: .tempfiles
src: '/var/lib/icinga2/certs/{{ inventory_hostname }}.crt'
handlers:
- name: Restart Icinga 2
service:
name: icinga2
state: restarted
- hosts: 'icingas:!aklimov-636691-master-0'
become: yes
become_method: sudo
tasks:
- copy:
dest: /var/cache/icinga2/trusted.crt
owner: nagios
group: nagios
mode: '0644'
src: .tempfiles/aklimov-636691-master-0/var/lib/icinga2/certs/aklimov-636691-master-0.crt
- copy:
dest: /var/cache/icinga2/my.ticket
owner: nagios
group: nagios
mode: '0600'
src: '.tempfiles/aklimov-636691-master-0/var/cache/icinga2/{{ inventory_hostname }}.ticket'
- shell: >
icinga2 node setup
--zone {{ inventory_hostname }}
--endpoint aklimov-636691-master-0,{{ hostvars['aklimov-636691-master-0'].ansible_all_ipv4_addresses[0] }},5665
--parent_host {{ hostvars['aklimov-636691-master-0'].ansible_all_ipv4_addresses[0] }},5665
--parent_zone master
--listen 0.0.0.0,5665
--ticket `cat /var/cache/icinga2/my.ticket`
--trustedcert /var/cache/icinga2/trusted.crt
--cn {{ inventory_hostname }}
--accept-config
--accept-commands
--disable-confd
args:
creates: /var/lib/icinga2/certs/ca.crt
notify: Restart Icinga 2
handlers:
- name: Restart Icinga 2
service:
name: icinga2
state: restarted
- hosts: icingas
become: yes
become_method: sudo
tasks:
- file:
path: /etc/icinga2/zones.conf.d
owner: root
group: root
mode: '0755'
state: directory
- loop:
- aklimov-636691-master-0
- aklimov-636691-master-1
copy:
dest: '/etc/icinga2/zones.conf.d/{{ item }}.conf'
owner: root
group: root
mode: '0644'
content: |
object Endpoint "{{ item }}" {
host = "{{ hostvars[item].ansible_all_ipv4_addresses[0] }}"
}
notify: Restart Icinga 2
- copy:
dest: /etc/icinga2/zones.conf.d/master.conf
owner: root
group: root
mode: '0644'
content: |
object Zone "master" {
endpoints = [ "aklimov-636691-master-0", "aklimov-636691-master-1" ]
}
notify: Restart Icinga 2
- copy:
dest: /etc/icinga2/zones.conf.d/global.conf
owner: root
group: root
mode: '0644'
content: |
object Zone "global" {
global = true
}
notify: Restart Icinga 2
handlers:
- name: Restart Icinga 2
service:
name: icinga2
state: restarted
- hosts: 'aklimov-636691-master-0:aklimov-636691-master-1'
become: yes
become_method: sudo
tasks:
- with_inventory_hostnames: 'icingas:!aklimov-636691-master-0:!aklimov-636691-master-1'
copy:
dest: '/etc/icinga2/zones.conf.d/{{ item }}.conf'
owner: root
group: root
mode: '0644'
content: |
object Endpoint "{{ item }}" {
host = "{{ hostvars[item].ansible_all_ipv4_addresses[0] }}"
}
object Zone "{{ item }}" {
parent = "master"
endpoints = [ "{{ item }}" ]
}
notify: Restart Icinga 2
handlers:
- name: Restart Icinga 2
service:
name: icinga2
state: restarted
- hosts: 'icingas:!aklimov-636691-master-0:!aklimov-636691-master-1'
become: yes
become_method: sudo
tasks:
- copy:
dest: '/etc/icinga2/zones.conf.d/{{ inventory_hostname }}.conf'
owner: root
group: root
mode: '0644'
content: |
object Endpoint "{{ inventory_hostname }}" {
}
object Zone "{{ inventory_hostname }}" {
parent = "master"
endpoints = [ "{{ inventory_hostname }}" ]
}
notify: Restart Icinga 2
handlers:
- name: Restart Icinga 2
service:
name: icinga2
state: restarted
- hosts: icingas
become: yes
become_method: sudo
tasks:
- copy:
dest: /etc/icinga2/zones.conf
content: 'include "zones.conf.d/*.conf"'
notify: Restart Icinga 2
handlers:
- name: Restart Icinga 2
service:
name: icinga2
state: restarted
- hosts: aklimov-636691-master-0
become: yes
become_method: sudo
tasks:
- loop:
- global
- master
file:
path: '/etc/icinga2/zones.d/{{ item }}'
owner: root
group: root
mode: '0755'
state: directory
- with_inventory_hostnames: 'icingas:!aklimov-636691-master-0:!aklimov-636691-master-1'
file:
path: '/etc/icinga2/zones.d/{{ item }}'
owner: root
group: root
mode: '0755'
state: directory
- copy:
dest: '/etc/icinga2/zones.d/global/global.conf'
owner: root
group: root
mode: '0644'
content: |
object User "navalny" {
}
object NotificationCommand "stabilnost" {
command = [ "/bin/true" ]
}
for (i in range(200)) {
apply Service i {
check_command = "dummy"
command_endpoint = host.name
check_interval = 5m
max_check_attempts = 1
var that = this
vars.dummy_state = function() use(that) {
return if (that.last_check_result && that.last_check_result.state) { 0 } else { 2 }
}
assign where true
}
}
apply Notification "stabilnost" to Service {
command = "stabilnost"
users = [ "navalny" ]
assign where true
}
notify: Restart Icinga 2
- loop:
- aklimov-636691-master-0
- aklimov-636691-master-1
copy:
dest: '/etc/icinga2/zones.d/master/{{ item }}.conf'
owner: root
group: root
mode: '0644'
content: |
object Host "{{ item }}" {
check_command = "passive"
enable_active_checks = false
}
notify: Restart Icinga 2
- with_inventory_hostnames: 'icingas:!aklimov-636691-master-0:!aklimov-636691-master-1'
copy:
dest: '/etc/icinga2/zones.d/{{ item }}/{{ item }}.conf'
owner: root
group: root
mode: '0644'
content: |
object Host "{{ item }}" {
check_command = "passive"
enable_active_checks = false
}
notify: Restart Icinga 2
handlers:
- name: Restart Icinga 2
service:
name: icinga2
state: restarted
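For completeness, such a playbook would be applied with something like the following (a sketch; the inventory and playbook file names are hypothetical, and the inventory has to define the icingas group plus the master and agent hosts used above):

    ansible-playbook -i inventory.ini reproduce-7569.yml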
For me it ended up being kind of random. Sometimes the server would stay up for a few hours at a time, sometimes it would go down in 20 minutes or so.
@mattrose In case you are able to measure that, how many notifications are sent in this period of time? Are there any failing notification scripts involved? Also, a different question: do you have many agents which still request a certificate and are steadily reconnecting?
On the first question, I can't answer that immediately; right now I have systemd set to restart the icinga daemon automatically when it exits on SIGSEGV. On the second, I shouldn't have any agents that don't have a certificate, as the certificate exchange is done through Salt (http://www.saltstack.com). Is there any way to tell through the Icinga logs if this is happening?
Hi, thanks for your information so far. We think this is related to #7532, so I'll close this one as a duplicate. If anything further comes up, please do not hesitate to participate in #7532. And @mattrose, to answer your question: yes, the Icinga logs indicate certificate requests and failing connections, e.g. loads of "Reconnecting to endpoint ..." and "Certificate validation failed for endpoint ..." messages. Best,
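A quick way to look for those symptoms is to grep the main log for the two messages quoted above (a sketch, assuming the default main log path under /var/log/icinga2):

    grep -E 'Reconnecting to endpoint|Certificate validation failed' /var/log/icinga2/icinga2.log | tail -n 20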
Describe the bug
Icinga 2.11.0-1 seems to be crashing every 30 minutes, exiting with return code 134 (128 + 6, i.e. SIGABRT).
For example:
root@icinga-master.servers.fr:/var/log/icinga2/crash#
[1]+ Exit 134 nohup icinga2 daemon > /var/log/icinga-matt.log (wd: ~)
(wd now: /var/log/icinga2/crash)
To Reproduce
Unsure of reproduction steps. I upgraded from 2.10 to 2.11 on Ubuntu, using the icinga2 package repository, and it started crashing every 30 minutes or so.
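The upgrade itself was the standard package upgrade; a minimal sketch, assuming the Icinga package repository is already configured on the host:

    apt-get update
    apt-get install --only-upgrade icinga2 icinga2-bin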
Expected behavior
Not to crash
Your Environment
Include as many relevant details about the environment you experienced the problem in
- Icinga2 features (icinga2 feature list):
- Config validation (icinga2 daemon -C): with some private information replaced with 'xxx'
- zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes: we run multiple icinga2 instances, but much of the information in that output is considered private
Additional context
root@icinga-master.servers.fr:/var/log/icinga2/crash# cat report.1570628747.543626
Application version: r2.11.0-1
System information:
Platform: Ubuntu
Platform version: 18.04.1 LTS (Bionic Beaver)
Kernel: Linux
Kernel version: 4.15.0-43-generic
Architecture: x86_64
Build information:
Compiler: GNU 8.3.0
Build host: runner-LTrJQZ9N-project-298-concurrent-0
Application information:
General paths:
Config directory: /etc/icinga2
Data directory: /var/lib/icinga2
Log directory: /var/log/icinga2
Cache directory: /var/cache/icinga2
Spool directory: /var/spool/icinga2
Run directory: /run/icinga2
Old paths (deprecated):
Installation root: /usr
Sysconf directory: /etc
Run directory (base): /run
Local state directory: /var
Internal paths:
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid
Stacktrace:
Failed to launch GDB: No such file or directory
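The "Failed to launch GDB" line means gdb was not installed when the crash handler tried to produce a stack trace; installing it lets future crash reports include one:

    apt-get install gdb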