
retry_interval has no effect when soft-critical is triggered by a passive check result #6795

Closed
ekeih opened this issue Nov 22, 2018 · 12 comments · Fixed by #6825
Labels
area/checks Check execution and results
Milestone
2.10.3

Comments

@ekeih
Contributor

ekeih commented Nov 22, 2018

Expected Behavior

  1. Service is OK and checked with check_interval
  2. Service is in a soft state and checked with retry_interval

Current Behavior

  1. Service is OK and checked with check_interval
  2. Service is in a soft state but still checked with check_interval, as visible in the Last check and Next check fields of Icinga Web 2.

Steps to Reproduce

I used the current standalone vagrant box to reproduce this with a minimal configuration: https://github.com/Icinga/icinga-vagrant/tree/master/standalone

object Host "icinga2.vagrant.demo.icinga.com" {
  import "generic-host"
  address = "127.0.0.1"
}

apply Service "load" {
  import "generic-service"

  check_command = "load"
  check_interval = 5m
  retry_interval = 1m

  assign where host.name == NodeName
}
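
The object dump below was presumably produced with the icinga2 object list command; a sketch of the invocation, assuming the object name from the config above:

# Presumable command behind the dump below (not stated in the issue)
icinga2 object list --type Service --name 'icinga2.vagrant.demo.icinga.com!load'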
Object 'icinga2.vagrant.demo.icinga.com!load' of type 'Service':
  % declared in '/etc/icinga2/conf.d/services.conf', lines 1:0-1:19
  * __name = "icinga2.vagrant.demo.icinga.com!load"
  * action_url = ""
  * check_command = "load"
    % = modified in '/etc/icinga2/conf.d/services.conf', lines 4:3-4:24
  * check_interval = 300
    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 11:3-11:21
    % = modified in '/etc/icinga2/conf.d/services.conf', lines 5:3-5:21
  * check_period = ""
  * check_timeout = null
  * command_endpoint = ""
  * display_name = "load"
  * enable_active_checks = true
  * enable_event_handler = true
  * enable_flapping = false
  * enable_notifications = true
  * enable_passive_checks = true
  * enable_perfdata = true
  * event_command = ""
  * flapping_threshold = 0
  * flapping_threshold_high = 30
  * flapping_threshold_low = 25
  * groups = [ ]
  * host_name = "icinga2.vagrant.demo.icinga.com"
    % = modified in '/etc/icinga2/conf.d/services.conf', lines 1:0-1:19
  * icon_image = ""
  * icon_image_alt = ""
  * max_check_attempts = 5
    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 10:3-10:24
  * name = "load"
    % = modified in '/etc/icinga2/conf.d/services.conf', lines 1:0-1:19
  * notes = ""
  * notes_url = ""
  * package = "_etc"
    % = modified in '/etc/icinga2/conf.d/services.conf', lines 1:0-1:19
  * retry_interval = 60
    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 12:3-12:22
    % = modified in '/etc/icinga2/conf.d/services.conf', lines 6:3-6:21
  * source_location
    * first_column = 0
    * first_line = 1
    * last_column = 19
    * last_line = 1
    * path = "/etc/icinga2/conf.d/services.conf"
  * templates = [ "load", "generic-service" ]
    % = modified in '/etc/icinga2/conf.d/services.conf', lines 1:0-1:19
    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 9:1-9:34
  * type = "Service"
  * vars = null
  * volatile = false
  * zone = ""

As you can see above, the Icinga 2 object contains the correct retry_interval = 60, but according to Icinga Web 2 the check is still executed every five minutes while in a soft state.
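
For cross-checking without Icinga Web 2, the same attributes can be read via the REST API; a minimal sketch, assuming the API feature is enabled and using placeholder credentials:

# Placeholder credentials; print the configured intervals and the scheduled check times
curl -k -s -u 'root:icinga' \
  'https://localhost:5665/v1/objects/services/icinga2.vagrant.demo.icinga.com!load' \
  | jq '.results[0].attrs | {check_interval, retry_interval, last_check, next_check}'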

Your Environment

The current icinga-vagrant standalone box with Icinga2 2.10.2.

@dnsmichi
Contributor

Proof as text or screenshot?

@ekeih
Contributor Author

ekeih commented Nov 22, 2018

With the configuration I posted in the initial issue I get the following output.

(Update: I forced the critical state by using Process check result in Icingaweb2.)

(screenshot: Icinga Web 2 service detail view, 2018-11-22 17:35)

@ekeih
Contributor Author

ekeih commented Nov 23, 2018

Can anyone confirm or refute this issue?

I tried to prove my own bug report wrong, but unfortunately I am still able to reproduce it.

@ekeih
Contributor Author

ekeih commented Nov 28, 2018

(Update: I forced the critical state by using Process check result in Icingaweb2.)

The bug only happens when we submit the check result passively (Icinga Web 2/API). When the soft state is caused by an active check, the retry_interval works as expected.
I was able to reproduce this with 2.10.2, 2.9.2, 2.9.0, 2.8.4 and 2.8.0; I did not test earlier versions.
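
For clarity, a minimal sketch of how such a passive result can be submitted via the REST API (host, service and credentials are placeholders; the exact calls used for the reproduction are in the script further below):

# Hypothetical example: submit a passive CRITICAL result for a service
curl -k -s -u 'root:icinga' -H 'Accept: application/json' \
  -X POST 'https://localhost:5665/v1/actions/process-check-result?service=example.localdomain!load' \
  -d '{ "exit_status": 2, "plugin_output": "forced CRITICAL for testing" }'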

ekeih changed the title from "retry_interval has no effect" to "retry_interval has no effect when soft-critical is triggered by a passive check result" on Nov 28, 2018
@dnsmichi
Contributor

Is this a problem with the IDO database backend and presented in Icinga Web 2, or does the REST API provide the same results for the next_check timestamp?

@ekeih
Contributor Author

ekeih commented Nov 29, 2018

The REST API also provides the same result.
I used the following script with git bisect to reproduce the issue. (Sorry for the mess... it was a lot of trial and error ;)) It works inside the centos7-dev vagrant box.
The script forces a soft CRITICAL state via the API and then checks the difference between next_check and last_check, which is always around 600 seconds in this example (it should be 3).

# /root/api-users.conf
/**
 * The ApiUser objects are used for authentication against the API.
 */
object ApiUser "root" {
  password = "71ced997017304e3"
  // client_cn = ""

  permissions = [ "*" ]
}

# /root/hosts.conf
object Host NodeName {
  import "generic-host"
  address = "127.0.0.1"
}

# /root/services.conf
apply Service "ping4" {
  import "generic-service"

  check_command = "ping4"
  retry_interval = 3
  check_interval = 600

  assign where host.address
}
#!/usr/bin/env bash

rm -rf /usr/local/icinga2
mkdir -p /usr/local/icinga2/var/run/icinga2

cd /root/icinga2
mkdir -p debug
cd debug
cmake -DCMAKE_BUILD_TYPE=Debug -DICINGA2_UNITY_BUILD=OFF -DCMAKE_INSTALL_PREFIX=/usr/local/icinga2 .. || exit 125
make -j4 || exit 125
make -j4 install || exit 125
cd ..
cp /root/hosts.conf /usr/local/icinga2/etc/icinga2/conf.d/hosts.conf
cp /root/services.conf /usr/local/icinga2/etc/icinga2/conf.d/services.conf

chown -R icinga:icinga /usr/local/icinga2
icinga2 api setup
cp /root/api-users.conf /usr/local/icinga2/etc/icinga2/conf.d/api-users.conf
(cd /usr/local/icinga2 && /usr/local/icinga2/sbin/icinga2 daemon) >/dev/null &

sleep 5

# reset to hard ok
curl -k -s -u "root:71ced997017304e3" -H 'Accept: application/json' -X POST 'https://localhost:5665/v1/actions/process-check-result?service=icinga2-centos7-dev.vagrant.demo.icinga.com!ping4' \
-d '{ "exit_status": 0, "plugin_output": "PING OK - Packet loss = 0%", "performance_data": [ "rta=5000.000000ms;3000.000000;5000.000000;0.000000", "pl=100%;80;100;0" ], "check_source": "example.localdomain", "pretty": true }'

# force soft critical
curl -k -s -u "root:71ced997017304e3" -H 'Accept: application/json' -X POST 'https://localhost:5665/v1/actions/process-check-result?service=icinga2-centos7-dev.vagrant.demo.icinga.com!ping4' \
-d '{ "exit_status": 2, "plugin_output": "PING CRITICAL - Packet loss = 100%", "performance_data": [ "rta=5000.000000ms;3000.000000;5000.000000;0.000000", "pl=100%;80;100;0" ], "check_source": "example.localdomain", "pretty": true }'

# get service state
next_check=$(curl -k -s -u 'root:71ced997017304e3' https://localhost:5665/v1/objects/services | jq -r '.results[0].attrs.next_check - .results[0].attrs.last_check' | cut -d '.' -f 1)

pidof icinga2 | xargs -n 1 kill
rm -f /usr/local/icinga2/var/cache/icinga2/icinga2.vars
rm -f /usr/local/icinga2/var/lib/icinga2/icinga2.state

if [[ "${next_check}" -lt 100 ]]
then
  exit 0
else
  exit 1
fi
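
For reference, a script like the one above is typically driven by git bisect run (which treats exit code 125 as "skip", matching the || exit 125 on build failures); a sketch with the script path and revision range as placeholders:

# Hypothetical bisect session; refs and script path are placeholders
cd /root/icinga2
git bisect start
git bisect bad <known-bad-ref>      # a revision where the bug reproduces
git bisect good <known-good-ref>    # a revision where retry_interval behaves as expected
git bisect run /root/reproduce.sh   # exit 0 = good, 1 = bad, 125 = skip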

@dnsmichi
Contributor

Just a quick note from a not-too-deep reading: the next_check variable in your script takes the difference between last_check and next_check. What if last_check isn't updated accordingly? Then it would fail as well.
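
One way to rule that out would be to capture last_check before and after submitting the passive result; a sketch reusing the credentials and service name from the script above:

# Capture last_check before and after the passive CRITICAL result
before=$(curl -k -s -u 'root:71ced997017304e3' \
  'https://localhost:5665/v1/objects/services/icinga2-centos7-dev.vagrant.demo.icinga.com!ping4' \
  | jq -r '.results[0].attrs.last_check')
# ... submit the passive CRITICAL result as in the script above ...
after=$(curl -k -s -u 'root:71ced997017304e3' \
  'https://localhost:5665/v1/objects/services/icinga2-centos7-dev.vagrant.demo.icinga.com!ping4' \
  | jq -r '.results[0].attrs.last_check')
echo "last_check before: ${before}, after: ${after}"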

@ekeih
Contributor Author

ekeih commented Nov 29, 2018

I think I manually verified that last_check and next_check were updated in the # force soft critical step. But to be honest I am not 100% sure because I tried a lot of different things during the debugging.
I can try to check this again tomorrow.

@dnsmichi
Contributor

dnsmichi commented Dec 5, 2018

Just a thought: if you feed in passive check results via the API, the next expected check is scheduled with the check_interval offset, which acts as the freshness check interval. In contrast, an actively scheduled check proceeds with the soft-state logic. This would explain the behaviour you're seeing.

https://github.com/Icinga/icinga2/blob/master/lib/icinga/checkable-check.cpp#L344
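
That would also match what the script above measures; a small sketch to observe it directly (credentials and service name as in the script):

# After a passive CRITICAL result, the offset follows check_interval (600), not retry_interval (3)
curl -k -s -u 'root:71ced997017304e3' \
  'https://localhost:5665/v1/objects/services/icinga2-centos7-dev.vagrant.demo.icinga.com!ping4' \
  | jq '.results[0].attrs | {check_interval, retry_interval, offset: (.next_check - .last_check)}'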

dnsmichi added the area/checks (Check execution and results) label on Dec 5, 2018
@ekeih
Contributor Author

ekeih commented Dec 5, 2018

Yes, I think that is exactly what is happening. Is this the expected behaviour?

@dnsmichi
Contributor

dnsmichi commented Dec 5, 2018

Yep, since passive checks are not actively scheduled. The only measurable time here is the next expected check result timestamp, which is always the check_interval by design. In 1.x there was a separate setting for that.

@ekeih
Contributor Author

ekeih commented Dec 5, 2018

Okay, I always assumed that a passive check result would also affect the next active check.
I updated the documentation.

dnsmichi pushed a commit that referenced this issue Feb 11, 2019
Al2Klimov added this to the 2.10.3 milestone on Sep 4, 2020