Prioritise output by criticality #70

peternewman · 2021-06-22T11:32:49Z

I set up check_smart on a system which was already sick and had the following statuses:
sda - Okay
sdb - Warning - unrecoverable errors
sdc - Critical - due to die soon

It would be nice if the output listed sdc, sdb, sda so you know what to prioritise.

I had a quick look, and I think something like adding to a hash of arrays based on the local level, then joining them back up would do the trick, but didn't get a chance to implement it at the time.
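
For illustration, a minimal sketch of the hash-of-arrays idea. The `%drives_by_level` hash, the `@results` list and the sample data are assumptions made up for this example and are not taken from check_smart.pl:

```perl
use strict;
use warnings;

# Hypothetical per-drive results, in discovery order. In the plugin these
# would come from the individual SMART checks; the data here is made up.
my @results = (
    { device => '/dev/sda', level => 'OK',       message => 'Device is clean' },
    { device => '/dev/sdb', level => 'WARNING',  message => 'Reported_Uncorrect is non-zero (1)' },
    { device => '/dev/sdc', level => 'CRITICAL', message => 'Health status: FAILED!' },
);

# Hash of arrays: one bucket per level, each filled in discovery order.
my %drives_by_level;
for my $r (@results) {
    push @{ $drives_by_level{ $r->{level} } }, "[$r->{device}] - $r->{message}";
}

# Join the buckets back up, most severe level first.
# An empty bucket contributes nothing (the // operator needs Perl 5.10+).
my @ordered = map { @{ $drives_by_level{$_} // [] } } qw(CRITICAL WARNING OK);
my $status_string = join(' --- ', @ordered);
print "$status_string\n";
```

Within each level the drives keep their discovery order, so only the grouping by severity changes.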

Napsty · 2021-06-22T11:37:52Z

Hi @peternewman
Could you please share the current usage and output? I assume you're using -g?

peternewman · 2021-06-23T09:42:14Z

Yes I am. It'll be a little while until I can do so, and I've replaced the failed drive, but I should be able to get the historic output from Nagios.

It was correctly reporting the overall status of the check as Critical, and listing the associated faults and level with each drive, but it was listing as sdb, sdc, sda (i.e. broken first, but just by letter/discovery order, not respective criticality).

peternewman · 2021-07-07T04:55:13Z

Usage was:
check_smart.pl -g '/dev/sd[a-z]' -i auto --selftest --ssd-lifetime

Output was:
CRITICAL: [/dev/sdb] - Reported_Uncorrect is non-zero (1), Current_Pending_Sector is non-zero (1), Offline_Uncorrectable is non-zero (1) --- [/dev/sdc] - Health status: FAILED!, Attribute Raw_Read_Error_Rate failed at In_the_past, Attribute Reallocated_Sector_Ct failed at FAILING_NOW, Reallocated_Sector_Ct is non-zero (2047), Reported_Uncorrect is non-zero (133), Attribute Reallocated_Event_Count failed at FAILING_NOW, Reallocated_Event_Count is non-zero (2047) --- [/dev/sda] - Device is clean --- [/dev/sdd] - Device is clean

Reformatted into bullets to make my point clearer, it was like this:

CRITICAL:
[/dev/sdb] - Reported_Uncorrect is non-zero (1), Current_Pending_Sector is non-zero (1), Offline_Uncorrectable is non-zero (1)
[/dev/sdc] - Health status: FAILED!, Attribute Raw_Read_Error_Rate failed at In_the_past, Attribute Reallocated_Sector_Ct failed at FAILING_NOW, Reallocated_Sector_Ct is non-zero (2047), Reported_Uncorrect is non-zero (133), Attribute Reallocated_Event_Count failed at FAILING_NOW, Reallocated_Event_Count is non-zero (2047)
[/dev/sda] - Device is clean
[/dev/sdd] - Device is clean

What I wanted was:

CRITICAL:
[/dev/sdc] - Health status: FAILED!, Attribute Raw_Read_Error_Rate failed at In_the_past, Attribute Reallocated_Sector_Ct failed at FAILING_NOW, Reallocated_Sector_Ct is non-zero (2047), Reported_Uncorrect is non-zero (133), Attribute Reallocated_Event_Count failed at FAILING_NOW, Reallocated_Event_Count is non-zero (2047)
[/dev/sdb] - Reported_Uncorrect is non-zero (1), Current_Pending_Sector is non-zero (1), Offline_Uncorrectable is non-zero (1)
[/dev/sda] - Device is clean
[/dev/sdd] - Device is clean

i.e. sorted by the return status you'd get by checking each drive individually.

Napsty · 2021-07-07T05:36:12Z

The "problem" here is that the priority sorting already happens. CRITICAL drives are shown before OK drives. You can see this, as the critical drives /dev/sdb and /dev/sdc are showing up before the ok drives /dev/sda and /dev/sdd.
However check_smart does not know which critical drive (sdb or sdc) is more important or which state is more important.

You could work around this by using -w and setting higher thresholds for some defective sectors. E.g. -w Reported_Uncorrect=2,Current_Pending_Sector=2,Offline_Uncorrectable. This should then set drive sdb into OK state.

By the way: Although handy for quick checks, I'm not using -g parameter in real world monitoring, as I want each drive separately monitored and want to keep historical data of the SMART values, see https://www.claudiokuenzler.com/blog/1077/when-is-solid-state-drive-ssd-dead-analysis-crucial-mx500-1tb for example.

peternewman · 2021-07-07T06:11:05Z

> The "problem" here is that the priority sorting already happens. CRITICAL drives are shown before OK drives. You can see this, as the critical drives /dev/sdb and /dev/sdc are showing up before the ok drives /dev/sda and /dev/sdd.
> However check_smart does not know which critical drive (sdb or sdc) is more important or which state is more important.

I think it does. I'm pretty certain that when I checked them individually, sdb was WARNING and sdc was CRITICAL. It's just that it doesn't currently use that distinction, only the binary good/bad state.

> You could work around this by using -w and setting higher thresholds for some defective sectors. E.g. -w Reported_Uncorrect=2,Current_Pending_Sector=2,Offline_Uncorrectable. This should then set drive sdb into OK state.

As above, I'm more interested in the general WARNING/CRITICAL ordering.

> By the way: Although handy for quick checks, I'm not using -g parameter in real world monitoring, as I want each drive separately monitored and want to keep historical data of the SMART values, see https://www.claudiokuenzler.com/blog/1077/when-is-solid-state-drive-ssd-dead-analysis-crucial-mx500-1tb for example.

Thanks for the heads up. I figured I'd start by getting a check in place across all the machines with software RAID (and hence no proper disk monitoring) and go from there. I'm always rather nervous about manual config like that, as it becomes rather easy to miss a drive if a machine has more disks than expected. Fortunately for me, the data is pretty transient, so I'm really just interested in knowing that a drive has failed, or is about to fail, then rebuilding things and carrying on.

Napsty · 2021-07-07T06:13:39Z

> As above, I'm more interested in the general WARNING/CRITICAL ordering.

Yes, this should definitely happen.

So if you do a manual check of sdb right now, is it CRITICAL or WARNING?

peternewman · 2021-07-08T05:02:10Z

> > As above, I'm more interested in the general WARNING/CRITICAL ordering.
>
> Yes, this should definitely happen.

That's certainly what I'd like. I don't see any code to do so currently (I'm not sure whether you're saying you think it already does this, or agreeing it's a feature to implement):

check_smart/check_smart.pl, lines 694 to 696 in 956f236:

    if ($opt_g) {
        $status_string = $label.join(', ', @error_messages);
    }

And e.g.:

check_smart/check_smart.pl, lines 664 to 665 in 956f236:

    push(@error_messages, 'Disk temperature is higher than maximum');
    escalate_status('CRITICAL');

versus

check_smart/check_smart.pl, lines 677 to 678 in 956f236:

    push(@error_messages, 'Disk start_stop is higher than maximum');
    escalate_status('WARNING');

> So if you do a manual check of sdb right now, is it CRITICAL or WARNING?

Yeah that works as expected:

/usr/local/bin/check_smart.pl -d /dev/sdb -i auto --selftest --ssd-lifetime; echo $?
WARNING: Drive  <REDACTED> S/N <REDACTED>:  Reported_Uncorrect is non-zero (5)|Raw_Read_Error_Rate=166912319 Spin_Up_Time=0 Start_Stop_Count=16 Reallocated_Sector_Ct=0 Seek_Error_Rate=346936906 Power_On_Hours=63372 Spin_Retry_Count=0 Power_Cycle_Count=16 End-to-End_Error=0 Reported_Uncorrect=5 Command_Timeout=4295041085 High_Fly_Writes=0 Airflow_Temperature_Cel=36 Temperature_Celsius=36 Hardware_ECC_Recovered=166912319 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0
1

Napsty · 2021-07-09T09:46:38Z

@peternewman can you please try with the 6.11 branch?
https://github.com/Napsty/check_smart/tree/6.11
How does that behave when you have both CRITICAL and WARNING drives in the same system?

peternewman · 2021-07-09T10:36:28Z

> How does that behave when you have both CRITICAL and WARNING drives in the same system?

Thanks @Napsty. I've swapped my failed drive now unfortunately, so I would need to fake it by turning an existing warning into a critical.

I do see one big issue though:

check_smart/check_smart.pl, lines 703 to 704 in 7eecae6:

    $status_string = join(', ', @error_messages);
    $status_string = join(', ', @warning_messages);

You'll only ever get warning messages out, as you're not concatenating the two joins together, just setting $status_string twice!
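
A minimal sketch of one way to combine the two lists, assuming the same array names as the snippet above and made-up message contents. It is only an illustration, not necessarily what the eventual fix looks like:

```perl
use strict;
use warnings;

# Array names mirror the quoted snippet; the contents are illustrative only.
my @error_messages   = ('Reported_Uncorrect is test critical');
my @warning_messages = ('Reallocated_Sector_Ct is non-zero (25)',
                        'Current_Pending_Sector is non-zero (55)');

# A single join over both lists, criticals first, so that no second
# assignment overwrites the first and every message gets a separator.
my $status_string = join(', ', @error_messages, @warning_messages);
print "$status_string\n";
```

Passing both arrays to one `join` also avoids a missing separator at the boundary between the critical and warning messages.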

Napsty · 2021-07-09T10:44:27Z

Thx for pointing that out. Should be fixed now with commit d3a85e9

peternewman · 2021-07-21T11:15:11Z

This still doesn't fix it in global mode unfortunately @Napsty. Note how /dev/sdc, where I fudged the threshold check to generate my test critical, is listed after /dev/sdb, which only has warnings:

CRITICAL: [/dev/sdb] - [/dev/sdb] - Reallocated_Sector_Ct is non-zero (25), Reported_Uncorrect is non-zero (139), Reallocated_Event_Count is non-zero (25), Current_Pending_Sector is non-zero (55) --- [/dev/sdc] - Reported_Uncorrect is test critical[/dev/sdc] -  --- [/dev/sda] - Device is clean --- [/dev/sdd] - Device is clean|
2

It does work in single device mode though (N.B. I've changed to a different drive and a different threshold here), i.e. critical messages are now correctly listed before warnings on a per-drive basis:

CRITICAL: Drive  <redacted> S/N <redacted>:  Reported_Uncorrect is test criticalReallocated_Sector_Ct is non-zero (25), Reallocated_Event_Count is non-zero (25), Current_Pending_Sector is non-zero (55)|Raw_Read_Error_Rate=65320951 Spin_Up_Time=0 Start_Stop_Count=17 Reallocated_Sector_Ct=25 Seek_Error_Rate=347559879 Power_On_Hours=63690 Spin_Retry_Count=0 Power_Cycle_Count=17 End-to-End_Error=0 Reported_Uncorrect=139 Command_Timeout=4295041085 High_Fly_Writes=0 Airflow_Temperature_Cel=38 Temperature_Celsius=38 Hardware_ECC_Recovered=65320951 Reallocated_Event_Count=25 Current_Pending_Sector=55 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0
2

I think in your current model you've got @drives_status_not_okay and @drives_status_okay. You either need to switch to a hash-based model, or split @drives_status_not_okay into warning and critical.

See also related #71 to improve the current formatting in global mode.

Napsty · 2021-09-16T12:53:29Z

Hi @peternewman. Can you try it with the newest check_smart.pl from the 6.11 branch, please:
https://github.com/Napsty/check_smart/blob/6.11/check_smart.pl

Napsty · 2021-09-16T12:59:43Z

Commit 5dbacc7 now also adds an internal "notice" status for attributes appearing as "less than threshold".

Before the commit, attributes would show up in their lookup order, even when different thresholds are given:

root@server:~# ./check_smart.pl -d /dev/sg3 -i sat -w "Reallocated_Sector_Ct=500"
WARNING: Drive  WD2000FYYZ-23UL S/N XXX:  Reallocated_Sector_Ct is non-zero (372) (but less than threshold 500), Reallocated_Event_Count is non-zero (36), Current_Pending_Sector is non-zero (2), Offline_Uncorrectable is non-zero (2)|Raw_Read_Error_Rate=0 Spin_Up_Time=4500 Start_Stop_Count=79 Reallocated_Sector_Ct=372 Seek_Error_Rate=0 Power_On_Hours=47246 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=35 G-Sense_Error_Rate=1 Power-Off_Retract_Count=30 Load_Cycle_Count=48 Temperature_Celsius=36 Hardware_ECC_Recovered=0 Reallocated_Event_Count=36 Current_Pending_Sector=2 Offline_Uncorrectable=2 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=2

After the commit, the "Reallocated_Sector_Ct" is moved to the end of the output:

root@server:~# ./check_smart.pl -d /dev/sg3 -i sat -w "Reallocated_Sector_Ct=500" 
WARNING: Drive  WD2000FYYZ-23UL S/N XXX:  Reallocated_Event_Count is non-zero (36), Current_Pending_Sector is non-zero (2), Offline_Uncorrectable is non-zero (2), Reallocated_Sector_Ct is non-zero (372) (but less than threshold 500)|Raw_Read_Error_Rate=0 Spin_Up_Time=4500 Start_Stop_Count=79 Reallocated_Sector_Ct=372 Seek_Error_Rate=0 Power_On_Hours=47247 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=35 G-Sense_Error_Rate=1 Power-Off_Retract_Count=30 Load_Cycle_Count=48 Temperature_Celsius=36 Hardware_ECC_Recovered=0 Reallocated_Event_Count=36 Current_Pending_Sector=2 Offline_Uncorrectable=2 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=2
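
As a rough illustration of that reordering, a minimal sketch in which findings tagged with a "notice" level are emitted after the warnings. The `@findings` structure and the `@notice_messages` name are assumptions for this example, not the plugin's actual internals:

```perl
use strict;
use warnings;

# Hypothetical findings in lookup order, each tagged with an internal level.
my @findings = (
    { level => 'notice',  text => 'Reallocated_Sector_Ct is non-zero (372) (but less than threshold 500)' },
    { level => 'warning', text => 'Reallocated_Event_Count is non-zero (36)' },
    { level => 'warning', text => 'Current_Pending_Sector is non-zero (2)' },
);

# Separate the two levels, keeping lookup order within each.
my @warning_messages = map { $_->{text} } grep { $_->{level} eq 'warning' } @findings;
my @notice_messages  = map { $_->{text} } grep { $_->{level} eq 'notice'  } @findings;

# Warnings first, notices (below their threshold) last.
my $status_string = join(', ', @warning_messages, @notice_messages);
print "$status_string\n";
```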

5dbacc7 also splits the "not_okay" drives into "critical" and "warning" drives (as suggested by you). The critical (first) and warning (second) drives are then merged together into the "not_okay" drives. This should ensure that critical drives appear first in the output.
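
A minimal sketch of that split-and-merge, for illustration only. `@drives_status_not_okay` and `@drives_status_okay` are the names mentioned earlier in this thread; `@drives_status_critical`, `@drives_status_warning`, `@checked` and the sample data are assumptions:

```perl
use strict;
use warnings;

# Hypothetical per-drive outcomes as [device, status, message] triples,
# in discovery order.
my @checked = (
    [ '/dev/sdb', 'WARNING',  'Reallocated_Sector_Ct is non-zero (25)' ],
    [ '/dev/sdc', 'CRITICAL', 'Reported_Uncorrect is test critical' ],
    [ '/dev/sda', 'OK',       'Device is clean' ],
);

# Sort each drive into one of three lists according to its own status.
my (@drives_status_critical, @drives_status_warning, @drives_status_okay);
for my $drive (@checked) {
    my ($device, $status, $message) = @$drive;
    my $entry = "[$device] - $message";
    if    ($status eq 'CRITICAL') { push @drives_status_critical, $entry }
    elsif ($status eq 'WARNING')  { push @drives_status_warning,  $entry }
    else                          { push @drives_status_okay,     $entry }
}

# Merge: critical drives first, then warning drives, then clean drives.
my @drives_status_not_okay = (@drives_status_critical, @drives_status_warning);
my $status_string = join(' --- ', @drives_status_not_okay, @drives_status_okay);
print "$status_string\n";
```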

Napsty · 2021-10-04T11:09:05Z

Fixed in #72

Napsty added the improvement label Jun 26, 2021

Napsty added this to the 6.11.0 milestone Jul 9, 2021

Napsty self-assigned this Jul 9, 2021

Napsty mentioned this issue Oct 4, 2021

6.11 #72

Merged

Napsty closed this as completed Oct 4, 2021
