Service state appears to switch directly to hard without passing through soft #9835

Closed
aval13 opened this issue Jul 17, 2023 · 4 comments

@aval13 commented Jul 17, 2023

Describe the bug

Hello,

I'm not 100% sure the following is a bug, but after reading the documentation and observing the behaviour I don't understand why it happens.
If the behaviour below is normal, please treat this issue as a question and help me understand why it happens.

In short, the documentation says:

  • "When detecting a problem with a host/service, Icinga re-checks the object a number of times (based on the max_check_attempts and retry_interval settings) before sending notifications."
  • "max_check_attempts ... The number of times a host/service is re-checked before changing into a hard state. Defaults to 3."
We have set a service with max_check_attempts to 4 and retry_interval to 30s.
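
For context, here is a minimal sketch of that setup in plain Icinga 2 DSL (the service and host names are placeholders; our real objects are generated by Icinga Director):

```
object Service "apache2-procs" {
  host_name          = "example-host"   // placeholder
  check_command      = "procs"          // check_procs, as seen in the plugin output below
  max_check_attempts = 4
  retry_interval     = 30s
}
```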
What we expect:

  • service is HARD OK
  • check returns CRITICAL
  • service goes to SOFT CRITICAL
  • 4 retries at 30s intervals
  • last retry sets the service on HARD CRITICAL
  • notification gets sent

What happens:
  • service is HARD OK
  • check returns CRITICAL
  • service goes to HARD CRITICAL
  • 3 retries at 30s intervals
  • after the 3rd retry the notification gets sent

As an observation, the transition through SOFT to HARD works properly on Hosts.
The question is why Services go directly to a HARD state on the first check that fails.

Also, considering the behaviour, it seems to me that max_check_attempts also includes the first check that failed, not just the retries.
So max_check_attempts = 4 means 1 fail + 3 retries.

To Reproduce

We manage our hosts, service templates and apply rules via Icinga Director.
Please let us know if it would be useful to provide the Director configuration JSONs, or to extract the data via the Icinga API.

Expected behavior

See the bug description

Screenshots

Screenshots attached of:

  • IcingaWeb2 screenshot of the events (image attached)
  • mysql select of the events (image attached)
  • redis key extracted for the service (showing the hard_state, soft_state and previous_-prefixed fields):
    before the first critical check
    { "check_attempt": 1, "check_commandline": "'/usr/lib/nagios/plugins/check_procs' '-C' 'apache2' '-c' '1:1' '-p' '1' '-u' 'root' '-w' '1:1'", "check_source": "[REDACTED]", "check_timeout": 60000, "environment_id": "3f6acd65f0d7d3677481c1eeb047bc71ef57b0ec", "execution_time": 10, "hard_state": 0, "host_id": "75306fea55a71675ad96ba85056ab9b9f68ac501", "id": "a9d11b26e6060c8b54a3795122e37aa9d26a05d4", "in_downtime": false, "is_acknowledged": 0, "is_active": true, "is_flapping": false, "is_handled": false, "is_problem": false, "is_reachable": true, "last_state_change": 1689584909460, "last_update": 1689589534101, "latency": 0, "next_check": 1689589712223, "next_update": 1689589892223, "normalized_performance_data": "procs=1", "output": "PROCS OK: 1 process with command name 'apache2', PPID = 1, UID = 0 (root) ", "performance_data": "procs=1;1:1;1:1;0;", "previous_hard_state": 0, "previous_soft_state": 2, "scheduling_source": "[REDACTED]", "service_id": "a9d11b26e6060c8b54a3795122e37aa9d26a05d4", "severity": 0, "soft_state": 0, "state_type": 1 }
    and after the first check that returned critical
    { "check_attempt": 1, "check_commandline": "'/usr/lib/nagios/plugins/check_procs' '-C' 'apache2' '-c' '1:1' '-p' '1' '-u' 'root' '-w' '1:1'", "check_source": "[REDACTED]", "check_timeout": 60000, "environment_id": "3f6acd65f0d7d3677481c1eeb047bc71ef57b0ec", "execution_time": 10, "hard_state": 2, "host_id": "75306fea55a71675ad96ba85056ab9b9f68ac501", "id": "a9d11b26e6060c8b54a3795122e37aa9d26a05d4", "in_downtime": false, "is_acknowledged": 0, "is_active": true, "is_flapping": false, "is_handled": false, "is_problem": true, "is_reachable": true, "last_state_change": 1689589894101, "last_update": 1689589894101, "latency": 0, "next_check": 1689589922222, "next_update": 1689589952222, "normalized_performance_data": "procs=0", "output": "PROCS CRITICAL: 0 processes with command name 'apache2', PPID = 1, UID = 0 (root) ", "performance_data": "procs=0;1:1;1:1;0;", "previous_hard_state": 0, "previous_soft_state": 0, "scheduling_source": "[REDACTED]", "service_id": "a9d11b26e6060c8b54a3795122e37aa9d26a05d4", "severity": 2176, "soft_state": 2, "state_type": 0 }

Your Environment


  • Version used (icinga2 --version): 2.13.7
  • Operating System and version: Ubuntu 20.04
  • Enabled features (icinga2 feature list): api checker icingadb mainlog notification (on masters)
  • Icinga Web 2 version and modules (System - About): 2.11.4 - icingadb 1.0.2 - cube 1.3.0 - director 1.10.2 - incubator 0.20.0 - reporting 0.10.0 - x509 1.1.2
  • Config validation (icinga2 daemon -C):
```

icinga2 daemon -C

[2023-07-17 11:33:51 +0000] information/cli: Icinga application loader (version: r2.13.7-1)
[2023-07-17 11:33:51 +0000] information/cli: Loading configuration file(s).
[2023-07-17 11:33:51 +0000] information/ConfigItem: Committing config item(s).
[2023-07-17 11:33:51 +0000] information/ApiListener: My API identity: [REDACTED]
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 1 IcingaApplication.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 3 HostGroups.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 78 Hosts.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 1 FileLogger.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 1 CheckerComponent.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 943 Notifications.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 1 IcingaDB.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 76 Zones.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 76 Endpoints.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 4 ApiUsers.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 1 ApiListener.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 1 NotificationComponent.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 265 CheckCommands.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 2 UserGroups.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 1 TimePeriod.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 2 Users.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 866 Services.
[2023-07-17 11:33:51 +0000] information/ConfigItem: Instantiated 2 NotificationCommands.
[2023-07-17 11:33:51 +0000] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2023-07-17 11:33:51 +0000] information/cli: Finished validating the configuration file(s).
```

Additional context

Thank you.

@log1-c (Contributor) commented Jul 18, 2023

At first glance this sounds like the volatile setting is enabled.
https://icinga.com/docs/icinga-2/latest/doc/08-advanced-topics/#volatile-services-and-hosts

Please check whether your service or templates have that setting set to true.
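
For illustration, a sketch in plain Icinga 2 DSL of how a volatile = true inherited from a template would produce exactly this behaviour (template and object names are made up; in a Director setup the flag would sit on a service template or on the service itself):

```
template Service "generic-service" {
  max_check_attempts = 4
  retry_interval     = 30s
  // With volatile enabled, every state change is treated as a HARD
  // state change, so the service never passes through SOFT states
  // and notifications are sent immediately.
  volatile = true
}

object Service "apache2-procs" {
  import "generic-service"   // silently inherits volatile = true

  host_name     = "example-host"
  check_command = "procs"
}
```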

@aval13 (Author) commented Jul 18, 2023

<insert huge facepalm emoticon here>

You're absolutely right: in a dark corner of the deployment setup there was indeed a volatile set to true that I had missed.
That explains everything; it now works exactly as expected.

As a last note on max_check_attempts: can anybody confirm that the documentation's wording,
"max_check_attempts ... The number of times a host/service is re-checked before changing into a hard state. Defaults to 3."
actually means max_check_attempts = first failed check + (max_check_attempts - 1) retries?
For instance, max_check_attempts = 4 would mean 1 failed check + 3 retries.
This is what we see happening, but the documentation says "is re-checked", which to me implies retries only.

Thank you.

@log1-c (Contributor) commented Jul 18, 2023

Glad it helped. I suggest closing this issue then ;)

As for the check attempts: yes, the first check that detects the problem already counts towards the max_check_attempts value, so max_check_attempts = 4 means the initial failed check plus three retries.

@aval13 (Author) commented Jul 18, 2023

Closing the issue, as the reported behaviour was caused by a configuration error and the extra question has been answered.
Thank you :)

@aval13 closed this as completed Jul 18, 2023