All recoveries are HARD #575

Closed
ludmilmm opened this issue Aug 30, 2018 · 7 comments

@ludmilmm

The issue is described on the Nagios XI support forum here:

https://support.nagios.com/forum/viewtopic.php?t=50067#261007

According to the Nagios Core official documentation:

When a service or host recovers from a soft error. This is considered a soft recovery.

https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/statetypes.html

However, this is not happening in Nagios Core 4.4.2: a recovery from a SOFT problem state is logged as HARD.

Example (Core 4.4.2):
[1535649657] SERVICE ALERT: CentOS6-NRPE;Users;CRITICAL;SOFT;1;USERS CRITICAL - 2 users currently logged in
[1535649657] GLOBAL SERVICE EVENT HANDLER: CentOS6-NRPE;Users;CRITICAL;SOFT;1;xi_service_event_handler
[1535649717] SERVICE ALERT: CentOS6-NRPE;Users;CRITICAL;SOFT;2;USERS CRITICAL - 2 users currently logged in
[1535649717] GLOBAL SERVICE EVENT HANDLER: CentOS6-NRPE;Users;CRITICAL;SOFT;2;xi_service_event_handler
[1535649776] SERVICE ALERT: CentOS6-NRPE;Users;OK;HARD;1;USERS OK - 1 users currently logged in
[1535649776] GLOBAL SERVICE EVENT HANDLER: CentOS6-NRPE;Users;OK;HARD;1;xi_service_event_handler

For comparison (Core 4.2.4):
[1535650778] SERVICE ALERT: localhost;Current Users;CRITICAL;SOFT;1;USERS CRITICAL - 2 users currently logged in
[1535650778] GLOBAL SERVICE EVENT HANDLER: localhost;Current Users;CRITICAL;SOFT;1;xi_service_event_handler
[1535650841] SERVICE ALERT: localhost;Current Users;CRITICAL;SOFT;2;USERS CRITICAL - 2 users currently logged in
[1535650841] GLOBAL SERVICE EVENT HANDLER: localhost;Current Users;CRITICAL;SOFT;2;xi_service_event_handler
[1535650902] SERVICE ALERT: localhost;Current Users;OK;SOFT;3;USERS OK - 1 users currently logged in
[1535650902] GLOBAL SERVICE EVENT HANDLER: localhost;Current Users;OK;SOFT;3;xi_service_event_handler
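
The difference between the two excerpts can be checked mechanically. The following is a minimal sketch, not part of Nagios: it assumes the `SERVICE ALERT` line layout shown in the logs above, and the names `parse_alert` and `classify_recoveries` are hypothetical helpers. It reports, for every OK alert that follows a non-OK alert, the state type that was logged next to the state type the documentation quoted above would suggest.

```python
import re

# Assumed layout, taken from the log excerpts in this issue:
# [timestamp] SERVICE ALERT: host;service;state;state_type;attempt;plugin output
ALERT_RE = re.compile(
    r"^\[(?P<ts>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>SOFT|HARD);(?P<attempt>\d+);(?P<output>.*)$"
)


def parse_alert(line):
    """Return a dict for a SERVICE ALERT line, or None for any other line."""
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    alert = match.groupdict()
    alert["ts"] = int(alert["ts"])
    alert["attempt"] = int(alert["attempt"])
    return alert


def classify_recoveries(lines):
    """List (ts, host, service, logged_type, expected_type) for each recovery.

    A recovery is an OK alert whose previous alert for the same service was
    non-OK. The expected type is SOFT if that previous problem was SOFT,
    otherwise HARD.
    """
    last_alert = {}
    recoveries = []
    for line in lines:
        alert = parse_alert(line)
        if alert is None:
            continue
        key = (alert["host"], alert["service"])
        prev = last_alert.get(key)
        if prev is not None and prev["state"] != "OK" and alert["state"] == "OK":
            expected = "SOFT" if prev["state_type"] == "SOFT" else "HARD"
            recoveries.append(
                (alert["ts"], key[0], key[1], alert["state_type"], expected)
            )
        last_alert[key] = alert
    return recoveries
```

Run against the 4.4.2 excerpt, the logged and expected types disagree (HARD versus SOFT); against the 4.2.4 excerpt they agree.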

jomann09 added a commit that referenced this issue Nov 5, 2018
Soft OK states were not being triggered when a soft non-OK state turned back into an OK state.
jomann09 added this to the 4.4.3 milestone Nov 5, 2018
jomann09 self-assigned this Nov 5, 2018
jomann09 added the Bug label Nov 5, 2018
@jomann09
Contributor

jomann09 commented Nov 5, 2018

Made this change in the maint branch for 4.4.3.

@jomann09 jomann09 closed this as completed Nov 5, 2018
@dvoryanchikov

Hi!
After 766d0d9, some of our services never reach a HARD state:

[1542056400] CURRENT HOST STATE: server;UP;HARD;1;TCP OK - 0.050 second response time on port 22
[1542056400] CURRENT SERVICE STATE: server;Other check;OK;HARD;1;OK test
[1542056400] CURRENT SERVICE STATE: server;Check;OK;HARD;1;Check OK
[1542057399] SERVICE ALERT: server;Check;CRITICAL;SOFT;1;CRITICAL Test
[1542057463] SERVICE ALERT: server;Check;CRITICAL;SOFT;2;CRITICAL Test
[1542057528] SERVICE ALERT: server;Check;CRITICAL;SOFT;3;CRITICAL Test
[1542057595] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542057657] SERVICE ALERT: server;Check;OK;SOFT;1;Check OK
[1542058621] SERVICE ALERT: server;Check;CRITICAL;SOFT;1;CRITICAL Test
[1542058689] SERVICE ALERT: server;Check;CRITICAL;SOFT;2;CRITICAL Test
[1542058754] SERVICE ALERT: server;Check;CRITICAL;SOFT;3;CRITICAL Test
[1542058819] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542058885] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542058950] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542059015] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542059080] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542059142] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542059207] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test

max_check_attempts 4
check_interval 5
retry_interval 1

Nothing else is special about this service.

@jomann09
Contributor

jomann09 commented Nov 13, 2018

Did you apply this patch to 4.4.2, or did you use the current maint branch when you rebuilt Core?
I only tested this fix as applied to the current 4.4.2 version. I just ran a test and my service went into a HARD state:

[1542117154] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;SOFT;1;test
[1542117159] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;SOFT;2;test
[1542117163] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;SOFT;3;test123
[1542117167] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;SOFT;4;tset
[1542117172] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;OK;SOFT;1;segseg
[1542117176] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;SOFT;1;seseg
[1542117180] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;SOFT;2;esgseg
[1542117183] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;SOFT;3;segseges
[1542117187] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;SOFT;4;segseges
[1542117190] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;HARD;5;segseg

I can try it again using only active checks too.
Also, if max_check_attempts is set to 4, shouldn't the service go into HARD critical during the 4th soft state check?
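
To make the expectation in that question concrete, here is a small sketch of the state-type rules as I read them from the state types documentation linked at the top of this issue. It is a toy model, not Core's actual code: `simulate` is a hypothetical function, every non-OK result is treated as the same problem state, and the attempt number reported on a recovery is simply reset to 1 (the logs in this thread show that this number differs between Core versions).

```python
def simulate(results, max_check_attempts):
    """Return a list of (state, state_type, attempt) for a sequence of results.

    Assumed rules (a simplified reading of the documentation):
    - a problem stays SOFT until it has been checked max_check_attempts times,
      then becomes HARD;
    - an OK result after a SOFT problem is a SOFT recovery;
    - any other OK result is HARD.
    """
    state, state_type, attempt = "OK", "HARD", 1
    timeline = []
    for result in results:
        if result == "OK":
            if state != "OK" and state_type == "SOFT":
                state_type = "SOFT"  # soft recovery
            else:
                state_type = "HARD"  # hard recovery or steady OK
            attempt = 1
        else:
            if state == "OK":
                attempt = 1  # first failing check after an OK state
            else:
                attempt = min(attempt + 1, max_check_attempts)
            state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
        state = result
        timeline.append((state, state_type, attempt))
    return timeline
```

Under these assumptions, `simulate(["CRITICAL"] * 5, 4)` yields SOFT for attempts 1 to 3 and HARD from attempt 4 onward, which is the behaviour the question above expects.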

@dvoryanchikov

Hi,
Sorry, I've compiled maint again and was unable to reproduce this behavior.
Maybe I missed something...
I'll write back if I find out why the service got stuck in a SOFT state.

@jomann09
Contributor

Great, please let me know if it happens again. I will also be testing it again just to be sure.

@dvoryanchikov

Hi,
yes, it happens again!

How to reproduce:

  1. Compile the latest maint branch (with 766d0d9).
  2. Make the test service go CRITICAL.
  3. Trigger a "soft recovery" (make the service OK after 2 of 4 checks):
[1542140109] SERVICE ALERT: server;Check;CRITICAL;SOFT;1;CRITICAL Test
[1542140228] SERVICE ALERT: server;Check;CRITICAL;SOFT;2;CRITICAL Test
[1542140311] SERVICE ALERT: server;Check;OK;SOFT;1;Check OK
  4. Wait until midnight (for state retention, log rotation, etc.):
[1542130135] Auto-save of retention data completed successfully.
[1542133735] Auto-save of retention data completed successfully.
[1542137335] Auto-save of retention data completed successfully.
[1542140934] Auto-save of retention data completed successfully.

[1542142800] LOG ROTATION: DAILY
[1542142800] LOG VERSION: 2.0
  5. Make the service CRITICAL again; it now stalls in the SOFT state:
[1542142800] CURRENT SERVICE STATE: server;Check;OK;HARD;1;Check OK
[1542143131] SERVICE ALERT: server;Check;CRITICAL;SOFT;1;CRITICAL Test
[1542143252] SERVICE ALERT: server;Check;CRITICAL;SOFT;2;CRITICAL Test
[1542143374] SERVICE ALERT: server;Check;CRITICAL;SOFT;3;CRITICAL Test
[1542143496] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542143617] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542143739] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542143861] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542143982] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
  6. The web interface CGIs also show an odd status:
Current Attempt: 4/4  (SOFT state)

This is not good, and nobody gets notified either.

Service definition:

define service {
        host_name       server
        service_description     Check
        check_period    24x7
        check_command   check_test
        contacts        contact-a
        notification_period     24x7
        initial_state   o
        check_interval  5
        retry_interval  1
        max_check_attempts      4
        is_volatile     0
        parallelize_check       1
        active_checks_enabled   1
        passive_checks_enabled  1
        obsess  1
        event_handler_enabled   1
        low_flap_threshold      0
        high_flap_threshold     0
        flap_detection_enabled  1
        flap_detection_options  a
        freshness_threshold     0
        check_freshness 0
        notification_options    r,w,c,s
        notifications_enabled   1
        notification_interval   60
        first_notification_delay        0
        stalking_options        w,u,c
        process_perf_data       1
        retain_status_information       1
        retain_nonstatus_information    1
        }

Tested with active checks, not forced.

I've analyzed the logs and noticed that the repeated "CRITICAL;SOFT;X" alerts (where X is max_check_attempts) happen the day after a soft recovery.

This never happened before our Core 4.4.2 was upgraded to the maint branch.

I've reverted 766d0d9 and will test again. If it happens again, I'll revert the latest commits one by one and test each.
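
For anyone who wants to look for the same pattern in their own logs, the repeated "CRITICAL;SOFT;X" signature described above can be searched for with a short script. This is only a sketch under assumptions: it relies on the `SERVICE ALERT` layout seen in this thread, takes a single `max_check_attempts` value for all services, and `find_stuck_soft` is a hypothetical helper, not a Nagios tool.

```python
import re

# Assumed layout: [timestamp] SERVICE ALERT: host;service;state;state_type;attempt;output
ALERT_RE = re.compile(
    r"^\[(\d+)\] SERVICE ALERT: ([^;]+);([^;]+);([^;]+);(SOFT|HARD);(\d+);"
)


def find_stuck_soft(lines, max_check_attempts):
    """Return {(host, service): streak} for services that logged two or more
    consecutive non-OK SOFT alerts at attempt == max_check_attempts."""
    streak = {}
    stuck = {}
    for line in lines:
        match = ALERT_RE.match(line.strip())
        if match is None:
            continue  # not a SERVICE ALERT line
        _ts, host, service, state, state_type, attempt = match.groups()
        key = (host, service)
        if (
            state != "OK"
            and state_type == "SOFT"
            and int(attempt) == max_check_attempts
        ):
            streak[key] = streak.get(key, 0) + 1
            if streak[key] >= 2:
                stuck[key] = streak[key]
        else:
            streak[key] = 0  # any other alert for this service breaks the streak
    return stuck
```

A single SOFT alert at the maximum attempt is not flagged, because the test log posted earlier in this thread shows one before the HARD alert; only repeats are reported.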

@dvoryanchikov

Seems related to #576

jomann09 added a commit that referenced this issue Dec 31, 2018
…Core. In order to do this we are moving some of the resetting logic for service OK states so that the notification for soft recovery goes out before setting it to a HARD OK state.
msdiamanti pushed a commit to gwos/nagioscore that referenced this issue Feb 24, 2023
Soft OK states were not being triggered when a soft non-OK state turned back into an OK state.
msdiamanti pushed a commit to gwos/nagioscore that referenced this issue Feb 24, 2023
…ther versions of Core. In order to do this we are moving some of the resetting logic for service OK states so that the notification for soft recovery goes out before setting it to a HARD OK state.