Service state appears to switch directly to hard without passing through soft #9835

Closed
aval13 opened this issue Jul 17, 2023 · 4 comments

@aval13 commented Jul 17, 2023

Describe the bug

Hello,

I'm not 100% sure the following is a bug, but after reading the documentation and observing the behaviour I don't understand why it happens.
If the behaviour below is normal, please treat this issue as a question and help me understand why it happens.

In short, the documentation says:

  • "When detecting a problem with a host/service, Icinga re-checks the object a number of times (based on the max_check_attempts and retry_interval settings) before sending notifications."
  • "max_check_attempts ... The number of times a host/service is re-checked before changing into a hard state. Defaults to 3."
We have set a service with max_check_attempts to 4 and retry_interval to 30s.
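
For context, here is a minimal sketch of that setup in plain Icinga 2 DSL (the service and host names are placeholders; our real objects are generated by Icinga Director):

```
object Service "apache2-procs" {
  host_name          = "example-host"   // placeholder
  check_command      = "procs"          // check_procs, as seen in the plugin output below
  max_check_attempts = 4
  retry_interval     = 30s
}
```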
What we expect:

  • service is HARD OK
  • check returns CRITICAL
  • service goes to SOFT CRITICAL
  • 4 retries at 30s intervals
  • last retry sets the service on HARD CRITICAL
  • notification gets sent

What happens:
  • service is HARD OK
  • check returns CRITICAL
  • service goes to HARD CRITICAL
  • 3 retries at 30s intervals
  • after the 3rd retry the notification gets sent

As an observation, the transition through SOFT to HARD works properly on Hosts.
The question is why Services go directly to a HARD state on the first check that fails.

Also, considering the behaviour, it seems to me that max_check_attempts also includes the first check that failed, not just the retries.
So max_check_attempts = 4 means 1 fail + 3 retries.

To Reproduce

We manage our hosts, service templates and apply rules via Icinga Director.
Please let us know if it would be useful to provide the Director configuration JSONs, or to extract the data via the Icinga API.

Expected behavior

See the bug description

Screenshots

Screenshots attached of:

  • IcingaWeb2 screenshot of the events (image attached)
  • mysql select of the events (image attached)
  • redis key extracted for the service (showing the hard_state, soft_state and previous_-prefixed fields):
    before the first critical check
    { "check_attempt": 1, "check_commandline": "'/usr/lib/nagios/plugins/check_procs' '-C' 'apache2' '-c' '1:1' '-p' '1' '-u' 'root' '-w' '1:1'", "check_source": "[REDACTED]", "check_timeout": 60000, "environment_id": "3f6acd65f0d7d3677481c1eeb047bc71ef57b0ec", "execution_time": 10, "hard_state": 0, "host_id": "75306fea55a71675ad96ba85056ab9b9f68ac501", "id": "a9d11b26e6060c8b54a3795122e37aa9d26a05d4", "in_downtime": false, "is_acknowledged": 0, "is_active": true, "is_flapping": false, "is_handled": false, "is_problem": false, "is_reachable": true, "last_state_change": 1689584909460, "last_update": 1689589534101, "latency": 0, "next_check": 1689589712223, "next_update": 1689589892223, "normalized_performance_data": "procs=1", "output": "PROCS OK: 1 process with command name 'apache2', PPID = 1, UID = 0 (root) ", "performance_data": "procs=1;1:1;1:1;0;", "previous_hard_state": 0, "previous_soft_state": 2, "scheduling_source": "[REDACTED]", "service_id": "a9d11b26e6060c8b54a3795122e37aa9d26a05d4", "severity": 0, "soft_state": 0, "state_type": 1 }
    and after the first check that returned critical
    { "check_attempt": 1, "check_commandline": "'/usr/lib/nagios/plugins/check_procs' '-C' 'apache2' '-c' '1:1' '-p' '1' '-u' 'root' '-w' '1:1'", "check_source": "[REDACTED]", "check_timeout": 60000, "environment_id": "3f6acd65f0d7d3677481c1eeb047bc71ef57b0ec", "execution_time": 10, "hard_state": 2, "host_id": "75306fea55a71675ad96ba85056ab9b9f68ac501", "id": "a9d11b26e6060c8b54a3795122e37aa9d26a05d4", "in_downtime": false, "is_acknowledged": 0, "is_active": true, "is_flapping": false, "is_handled": false, "is_problem": true, "is_reachable": true, "last_state_change": 1689589894101, "last_update": 1689589894101, "latency": 0, "next_check": 1689589922222, "next_update": 1689589952222, "normalized_performance_data": "procs=0", "output": "PROCS CRITICAL: 0 processes with command name 'apache2', PPID = 1, UID = 0 (root) ", "performance_data": "procs=0;1:1;1:1;0;", "previous_hard_state": 0, "previous_soft_state": 0, "scheduling_source": "[REDACTED]", "service_id": "a9d11b26e6060c8b54a3795122e37aa9d26a05d4", "severity": 2176, "soft_state": 2, "state_type": 0 }

Your Environment


  • Version used (icinga2 --version): 2.13.7
  • Operating System and version: Ubuntu 20.04
  • Enabled features (icinga2 feature list): api checker icingadb mainlog notification (on masters)
  • Icinga Web 2 version and modules (System - About): 2.11.4 - icingadb 1.0.2 - cube 1.3.0 - director 1.10.2 - incubator 0.20.0 - reporting 0.10.0 - x509 1.1.2
  • Config validation (icinga2 daemon -C):
```

icinga2 daemon -C

[2023-07-17 11:33:51 +0000] information/cli: Icinga application loader (version: r2.13.7-1)
[2023-07-17 11:33:51 +0000] information/cli: Loading configuration file(s).
[2023-07-17 11:33:51 +0000] information/ConfigItem: Committing config item(s).
[2023-07-17 11:33:51 +0000] information/ApiListener: My API identity: [REDACTED]
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 1 IcingaApplication.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 3 HostGroups.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 78 Hosts.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 1 FileLogger.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 1 CheckerComponent.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 943 Notifications.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 1 IcingaDB.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 76 Zones.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 76 Endpoints.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 4 ApiUsers.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 1 ApiListener.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 1 NotificationComponent.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 265 CheckCommands.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 2 UserGroups.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 1 TimePeriod.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 2 Users.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 866 Services.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 2 NotificationCommands.
[2023-07-17 11:33:51 +0000] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2023-07-17 11:33:51 +0000] information/cli: Finished validating the configuration file(s).
```

Additional context

Thank you.

@log1-c (Contributor) commented Jul 18, 2023

At first glance this sounds like the volatile setting is enabled.
https://icinga.com/docs/icinga-2/latest/doc/08-advanced-topics/#volatile-services-and-hosts

Please check whether your service or templates have that setting set to true.
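
For illustration, a sketch in plain Icinga 2 DSL of how a volatile = true inherited from a template would produce exactly this behaviour (template and object names are made up; in a Director setup the flag would sit on a service template or on the service itself):

```
template Service "generic-service" {
  max_check_attempts = 4
  retry_interval     = 30s
  // With volatile enabled, every state change is treated as a HARD
  // state change, so the service never passes through SOFT states
  // and notifications are sent immediately.
  volatile = true
}

object Service "apache2-procs" {
  import "generic-service"   // silently inherits volatile = true

  host_name     = "example-host"
  check_command = "procs"
}
```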

@aval13 (Author) commented Jul 18, 2023

<insert huge facepalm emoticon here>

You're absolutely right: in a dark corner of the deployment setup there was indeed a volatile set to true that I had missed.
That explains everything; it now works exactly as expected.

As a last note on max_check_attempts: can anybody confirm that the documentation's wording,
"max_check_attempts ... The number of times a host/service is re-checked before changing into a hard state. Defaults to 3."
actually means max_check_attempts = first failed check + (max_check_attempts - 1) retries?
For instance, max_check_attempts = 4 would mean 1 failed check + 3 retries.
This is what we see happening, but the documentation says "is re-checked", which to me implies retries only.

Thank you.

@log1-c (Contributor) commented Jul 18, 2023

Glad it helped. I suggest closing this issue then ;)

As for the check attempts: yes, the first check that detects the problem already counts towards the max_check_attempts value, so max_check_attempts = 4 means the initial failed check plus three retries.

@aval13 (Author) commented Jul 18, 2023

Closing the issue, as the reported behaviour was caused by a configuration error and the extra question has been answered.
Thank you :)

@aval13 closed this as completed Jul 18, 2023