All recoveries are HARD #575

ludmilmm · 2018-08-30T18:12:42Z

The issue is described on the Nagios XI support forum here:

https://support.nagios.com/forum/viewtopic.php?t=50067#261007

According to the Nagios Core official documentation:

When a service or host recovers from a soft error. This is considered a soft recovery.

https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/statetypes.html

However, this is not happening in Nagios Core 4.4.2.

Example (Core 4.4.2):
[1535649657] SERVICE ALERT: CentOS6-NRPE;Users;CRITICAL;SOFT;1;USERS CRITICAL - 2 users currently logged in
[1535649657] GLOBAL SERVICE EVENT HANDLER: CentOS6-NRPE;Users;CRITICAL;SOFT;1;xi_service_event_handler
[1535649717] SERVICE ALERT: CentOS6-NRPE;Users;CRITICAL;SOFT;2;USERS CRITICAL - 2 users currently logged in
[1535649717] GLOBAL SERVICE EVENT HANDLER: CentOS6-NRPE;Users;CRITICAL;SOFT;2;xi_service_event_handler
[1535649776] SERVICE ALERT: CentOS6-NRPE;Users;OK;HARD;1;USERS OK - 1 users currently logged in
[1535649776] GLOBAL SERVICE EVENT HANDLER: CentOS6-NRPE;Users;OK;HARD;1;xi_service_event_handler

For comparison (Core 4.2.4):
[1535650778] SERVICE ALERT: localhost;Current Users;CRITICAL;SOFT;1;USERS CRITICAL - 2 users currently logged in
[1535650778] GLOBAL SERVICE EVENT HANDLER: localhost;Current Users;CRITICAL;SOFT;1;xi_service_event_handler
[1535650841] SERVICE ALERT: localhost;Current Users;CRITICAL;SOFT;2;USERS CRITICAL - 2 users currently logged in
[1535650841] GLOBAL SERVICE EVENT HANDLER: localhost;Current Users;CRITICAL;SOFT;2;xi_service_event_handler
[1535650902] SERVICE ALERT: localhost;Current Users;OK;SOFT;3;USERS OK - 1 users currently logged in
[1535650902] GLOBAL SERVICE EVENT HANDLER: localhost;Current Users;OK;SOFT;3;xi_service_event_handler
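
To make the expected behavior concrete, here is a minimal sketch of the state-type rules as the documentation describes them. It is a simplified model, not Nagios Core's actual logic: the function name `classify_checks` is hypothetical, all non-OK states are treated alike, and the attempt number logged on a recovery is left out because the two versions above disagree on it.

```python
def classify_checks(results, max_check_attempts):
    """Return (state, state_type) per check result under the documented rules.

    Simplified model (an assumption, not Nagios Core code): the service
    starts in a HARD OK state and every non-OK state counts the same way.
    """
    timeline = []
    prev_state, prev_type = "OK", "HARD"
    attempt = 0
    for state in results:
        if state == "OK":
            # Recovering from a SOFT problem is a soft recovery;
            # any other OK result is a HARD OK.
            if prev_state != "OK" and prev_type == "SOFT":
                state_type = "SOFT"
            else:
                state_type = "HARD"
            attempt = 0
        else:
            # A problem becomes HARD once it has been (re)checked
            # max_check_attempts times.
            attempt += 1
            state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
        timeline.append((state, state_type))
        prev_state, prev_type = state, state_type
    return timeline


if __name__ == "__main__":
    for entry in classify_checks(["CRITICAL", "CRITICAL", "OK"], 4):
        print(entry)
```

Under this model the third check in the example above is an OK;SOFT result, which matches the Core 4.2.4 log and not the Core 4.4.2 one.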

Soft OK states were not being triggered when a soft non-OK state turned back into an OK state.

jomann09 · 2018-11-05T19:08:05Z

Made this change in the maint branch for 4.4.3.

dvoryanchikov · 2018-11-13T08:27:19Z

Hi!
After 766d0d9 we have some services that never reach a hard state:

[1542056400] CURRENT HOST STATE: server;UP;HARD;1;TCP OK - 0.050 second response time on port 22
[1542056400] CURRENT SERVICE STATE: server;Other check;OK;HARD;1;OK test
[1542056400] CURRENT SERVICE STATE: server;Check;OK;HARD;1;Check OK
[1542057399] SERVICE ALERT: server;Check;CRITICAL;SOFT;1;CRITICAL Test
[1542057463] SERVICE ALERT: server;Check;CRITICAL;SOFT;2;CRITICAL Test
[1542057528] SERVICE ALERT: server;Check;CRITICAL;SOFT;3;CRITICAL Test
[1542057595] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542057657] SERVICE ALERT: server;Check;OK;SOFT;1;Check OK
[1542058621] SERVICE ALERT: server;Check;CRITICAL;SOFT;1;CRITICAL Test
[1542058689] SERVICE ALERT: server;Check;CRITICAL;SOFT;2;CRITICAL Test
[1542058754] SERVICE ALERT: server;Check;CRITICAL;SOFT;3;CRITICAL Test
[1542058819] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542058885] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542058950] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542059015] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542059080] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542059142] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542059207] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test

max_check_attempts 4
check_interval 5
retry_interval 1

Nothing else is special about this service.

jomann09 · 2018-11-13T14:14:53Z

Did you apply this patch to 4.4.2, or did you use the current maint branch when you rebuilt Core?
I only tested this fix as applied to the current 4.4.2 version. I just did this as a test and my service went into a hard state:

[1542117154] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;SOFT;1;test
[1542117159] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;SOFT;2;test
[1542117163] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;SOFT;3;test123
[1542117167] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;SOFT;4;tset
[1542117172] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;OK;SOFT;1;segseg
[1542117176] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;SOFT;1;seseg
[1542117180] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;SOFT;2;esgseg
[1542117183] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;SOFT;3;segseges
[1542117187] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;SOFT;4;segseges
[1542117190] SERVICE ALERT: 192.168.5.41;Port 1 Bandwidth 155555;CRITICAL;HARD;5;segseg

I can try it again using only active checks too.
Also, if max_check_attempts is set to 4, shouldn't the service go into HARD critical during the 4th soft state check?

dvoryanchikov · 2018-11-13T16:33:44Z

Hi,
Sorry, I've compiled maint again and was unable to reproduce this behavior.
Maybe I missed something...
I'll write back if I find out why the service got stuck in a soft state.

jomann09 · 2018-11-13T16:37:14Z

Great, please let me know if it happens again. I will also test it again just to be sure.

dvoryanchikov · 2018-11-13T23:57:30Z

Hi,
yes, it happens again!

How to reproduce:

Compile the latest maint branch (with 766d0d9).
Make the test service go CRITICAL.
Trigger a "soft recovery" (make the service OK after 2/4 checks):

[1542140109] SERVICE ALERT: server;Check;CRITICAL;SOFT;1;CRITICAL Test
[1542140228] SERVICE ALERT: server;Check;CRITICAL;SOFT;2;CRITICAL Test
[1542140311] SERVICE ALERT: server;Check;OK;SOFT;1;Check OK

Wait until midnight (for state retention, log rotation, etc.):

[1542130135] Auto-save of retention data completed successfully.
[1542133735] Auto-save of retention data completed successfully.
[1542137335] Auto-save of retention data completed successfully.
[1542140934] Auto-save of retention data completed successfully.

[1542142800] LOG ROTATION: DAILY
[1542142800] LOG VERSION: 2.0

Make the service CRITICAL again; it now stalls in the soft state:

[1542142800] CURRENT SERVICE STATE: server;Check;OK;HARD;1;Check OK
[1542143131] SERVICE ALERT: server;Check;CRITICAL;SOFT;1;CRITICAL Test
[1542143252] SERVICE ALERT: server;Check;CRITICAL;SOFT;2;CRITICAL Test
[1542143374] SERVICE ALERT: server;Check;CRITICAL;SOFT;3;CRITICAL Test
[1542143496] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542143617] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542143739] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542143861] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test
[1542143982] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test

The web interface CGIs also show something strange:

Current Attempt: 4/4  (SOFT state)

which is not good, and no one gets notified either.

Service definition:

define service {
        host_name       server
        service_description     Check
        check_period    24x7
        check_command   check_test
        contacts        contact-a
        notification_period     24x7
        initial_state   o
        check_interval  5
        retry_interval  1
        max_check_attempts      4
        is_volatile     0
        parallelize_check       1
        active_checks_enabled   1
        passive_checks_enabled  1
        obsess  1
        event_handler_enabled   1
        low_flap_threshold      0
        high_flap_threshold     0
        flap_detection_enabled  1
        flap_detection_options  a
        freshness_threshold     0
        check_freshness 0
        notification_options    r,w,c,s
        notifications_enabled   1
        notification_interval   60
        first_notification_delay        0
        stalking_options        w,u,c
        process_perf_data       1
        retain_status_information       1
        retain_nonstatus_information    1
        }

Tested with active checks, not forced.

I've analyzed the logs and noticed that the repeated "CRITICAL;SOFT;X" alerts (where X is max_check_attempts) happen the day after a soft recovery.

This never happened before our Core 4.4.2 was upgraded to the maint branch.

I've reverted 766d0d9 and will test again. If it happens again, I'll revert the latest commits one by one and test...
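
For anyone who wants to check their own logs for the same pattern, here is a rough sketch that flags SERVICE ALERT lines still marked SOFT at or past max_check_attempts. The field layout is taken from the log lines in this thread; the function name `find_stalled_soft_alerts` is hypothetical, and passing one max_check_attempts value for every service is a simplifying assumption (it is really a per-service setting).

```python
import re

# Layout as seen in the logs in this thread:
# [timestamp] SERVICE ALERT: host;service;state;state_type;attempt;output
ALERT_RE = re.compile(
    r"^\[(\d+)\] SERVICE ALERT: ([^;]*);([^;]*);([^;]*);(SOFT|HARD);(\d+);(.*)$"
)


def find_stalled_soft_alerts(lines, max_check_attempts):
    """Return (timestamp, host, service, attempt) for each non-OK alert
    that is still SOFT although attempt >= max_check_attempts."""
    stalled = []
    for line in lines:
        match = ALERT_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a SERVICE ALERT line
        ts, host, service, state, state_type, attempt, _output = match.groups()
        if (
            state != "OK"
            and state_type == "SOFT"
            and int(attempt) >= max_check_attempts
        ):
            stalled.append((int(ts), host, service, int(attempt)))
    return stalled


if __name__ == "__main__":
    sample = [
        "[1542142800] CURRENT SERVICE STATE: server;Check;OK;HARD;1;Check OK",
        "[1542143374] SERVICE ALERT: server;Check;CRITICAL;SOFT;3;CRITICAL Test",
        "[1542143496] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test",
        "[1542143617] SERVICE ALERT: server;Check;CRITICAL;SOFT;4;CRITICAL Test",
    ]
    for hit in find_stalled_soft_alerts(sample, max_check_attempts=4):
        print(hit)
```

With the lines from the log above and max_check_attempts set to 4, only the two "CRITICAL;SOFT;4" entries are reported.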

dvoryanchikov · 2018-11-14T00:11:56Z

Seems related to #576

…Core. In order to do this we are moving some of the resetting logic for service OK states so that the notification for soft recovery goes out before setting it to a HARD OK state.


hydrapolic mentioned this issue Sep 3, 2018

net-analyzer/nagios: bump to 4.4.2 gentoo/gentoo#9770

Closed

jomann09 added a commit that referenced this issue Nov 5, 2018

Fixed soft state recoveries #575

766d0d9

Soft OK states were not being triggered when a soft non-OK state turned back into an OK state.

jomann09 added this to the 4.4.3 milestone Nov 5, 2018

jomann09 self-assigned this Nov 5, 2018

jomann09 added the Bug label Nov 5, 2018

jomann09 closed this as completed Nov 5, 2018

sawolf mentioned this issue Apr 23, 2020

HARD OK state on recovery #757

Closed

msdiamanti pushed a commit to gwos/nagioscore that referenced this issue Feb 24, 2023

Fixed soft state recoveries NagiosEnterprises#575

7b1168a

Soft OK states were not being triggered when a soft non-OK state turned back into an OK state.

msdiamanti pushed a commit to gwos/nagioscore that referenced this issue Feb 28, 2023

Fixed soft state recoveries NagiosEnterprises#575

b08311b

Soft OK states were not being triggered when a soft non-OK state turned back into an OK state.

msdiamanti pushed a commit to gwos/nagioscore that referenced this issue Mar 1, 2023

Fixed soft state recoveries NagiosEnterprises#575

591a837

Soft OK states were not being triggered when a soft non-OK state turned back into an OK state.

msdiamanti pushed a commit to gwos/nagioscore that referenced this issue Mar 1, 2023

Fixed soft state recoveries NagiosEnterprises#575

796ef9f

Soft OK states were not being triggered when a soft non-OK state turned back into an OK state.
