Prioritise output by criticality #70

Closed
peternewman opened this issue Jun 22, 2021 · 14 comments

@peternewman
Contributor

peternewman commented Jun 22, 2021

I set up check_smart on a system which was already sick and had the following statuses:
sda - Okay
sdb - Warning - unrecoverable errors
sdc - Critical - due to die soon

It would be nice if the output listed sdc, sdb, sda so you know what to prioritise.

I had a quick look, and I think something like adding to a hash of arrays based on the local level, then joining them back up would do the trick, but didn't get a chance to implement it at the time.
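As a rough illustration of the hash-of-arrays idea above, here is a minimal Perl sketch. The `@results` data, the `%by_status` hash and the severity order (including where UNKNOWN sits) are assumptions made for the example, not code taken from check_smart.pl.

```perl
use strict;
use warnings;

# Hypothetical per-drive results: [device, status, message].
my @results = (
    ['/dev/sda', 'OK',       'Device is clean'],
    ['/dev/sdb', 'WARNING',  'Reported_Uncorrect is non-zero (1)'],
    ['/dev/sdc', 'CRITICAL', 'Health status: FAILED!'],
);

# Group the formatted drive entries into a hash of arrays keyed by status.
my %by_status;
for my $result (@results) {
    my ($device, $status, $message) = @$result;
    push @{ $by_status{$status} }, "[$device] - $message";
}

# Join the groups back up, most severe first (assumed order).
# Statuses with no drives fall back to an empty list.
my @ordered = map { @{ $by_status{$_} // [] } } qw(CRITICAL WARNING UNKNOWN OK);
my $output  = join(' --- ', @ordered);
print "$output\n";
```

Within each status group the drives keep their discovery order, so only the grouping changes.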

@Napsty
Owner

Napsty commented Jun 22, 2021

Hi @peternewman
Could you please share the current usage and output? I assume you're using -g?

@peternewman
Contributor Author

Yes I am. It'll be a little while until I can do so, and I've replaced the failed drive, but I should be able to get the historic output from Nagios.

It was correctly reporting the overall status of the check as Critical, and listing the associated faults and level with each drive, but it was listing as sdb, sdc, sda (i.e. broken first, but just by letter/discovery order, not respective criticality).

@peternewman
Contributor Author

Usage was:
check_smart.pl -g '/dev/sd[a-z]' -i auto --selftest --ssd-lifetime

Output was:
CRITICAL: [/dev/sdb] - Reported_Uncorrect is non-zero (1), Current_Pending_Sector is non-zero (1), Offline_Uncorrectable is non-zero (1) --- [/dev/sdc] - Health status: FAILED!, Attribute Raw_Read_Error_Rate failed at In_the_past, Attribute Reallocated_Sector_Ct failed at FAILING_NOW, Reallocated_Sector_Ct is non-zero (2047), Reported_Uncorrect is non-zero (133), Attribute Reallocated_Event_Count failed at FAILING_NOW, Reallocated_Event_Count is non-zero (2047) --- [/dev/sda] - Device is clean --- [/dev/sdd] - Device is clean

Reformatted into bullets to make my point clearer, it was like this:

  1. CRITICAL:
  2. [/dev/sdb] - Reported_Uncorrect is non-zero (1), Current_Pending_Sector is non-zero (1), Offline_Uncorrectable is non-zero (1)
  3. [/dev/sdc] - Health status: FAILED!, Attribute Raw_Read_Error_Rate failed at In_the_past, Attribute Reallocated_Sector_Ct failed at FAILING_NOW, Reallocated_Sector_Ct is non-zero (2047), Reported_Uncorrect is non-zero (133), Attribute Reallocated_Event_Count failed at FAILING_NOW, Reallocated_Event_Count is non-zero (2047)
  4. [/dev/sda] - Device is clean
  5. [/dev/sdd] - Device is clean

What I wanted was:

  1. CRITICAL:
  2. [/dev/sdc] - Health status: FAILED!, Attribute Raw_Read_Error_Rate failed at In_the_past, Attribute Reallocated_Sector_Ct failed at FAILING_NOW, Reallocated_Sector_Ct is non-zero (2047), Reported_Uncorrect is non-zero (133), Attribute Reallocated_Event_Count failed at FAILING_NOW, Reallocated_Event_Count is non-zero (2047)
  3. [/dev/sdb] - Reported_Uncorrect is non-zero (1), Current_Pending_Sector is non-zero (1), Offline_Uncorrectable is non-zero (1)
  4. [/dev/sda] - Device is clean
  5. [/dev/sdd] - Device is clean

i.e. sorted by the return status you'd get by checking each drive individually.

@Napsty
Owner

Napsty commented Jul 7, 2021

The "problem" here is that the priority sorting already happens. CRITICAL drives are shown before OK drives. You can see this, as the critical drives /dev/sdb and /dev/sdc are showing up before the ok drives /dev/sda and /dev/sdd.
However check_smart does not know which critical drive (sdb or sdc) is more important or which state is more important.

You could work around this by using -w and setting higher thresholds for some defective sectors. E.g. -w Reported_Uncorrect=2,Current_Pending_Sector=2,Offline_Uncorrectable. This should then set drive sdb into OK state.

By the way: Although handy for quick checks, I'm not using -g parameter in real world monitoring, as I want each drive separately monitored and want to keep historical data of the SMART values, see https://www.claudiokuenzler.com/blog/1077/when-is-solid-state-drive-ssd-dead-analysis-crucial-mx500-1tb for example.

@peternewman
Contributor Author

> The "problem" here is that the priority sorting already happens. CRITICAL drives are shown before OK drives. You can see this, as the critical drives /dev/sdb and /dev/sdc are showing up before the ok drives /dev/sda and /dev/sdd.
> However check_smart does not know which critical drive (sdb or sdc) is more important or which state is more important.

I think it does. I'm pretty certain that when I checked them individually sdb was WARNING and sdc was CRITICAL; it just doesn't currently use the subtlety of that info, only the binary good/bad state.

> You could work around this by using -w and setting higher thresholds for some defective sectors. E.g. -w Reported_Uncorrect=2,Current_Pending_Sector=2,Offline_Uncorrectable. This should then set drive sdb into OK state.

As above, I'm more interested in the general WARNING/CRITICAL ordering.

> By the way: Although handy for quick checks, I'm not using -g parameter in real world monitoring, as I want each drive separately monitored and want to keep historical data of the SMART values, see https://www.claudiokuenzler.com/blog/1077/when-is-solid-state-drive-ssd-dead-analysis-crucial-mx500-1tb for example.

Thanks for the heads up. I figured I'd start by getting a check in place across all the machines with software RAID, and hence no proper disk monitoring, and go from there. I'm always rather nervous with manual config like that, as it becomes rather easy to miss a drive if a machine has more disks than expected. Fortunately for me, the data is pretty transient, so I'm really just interested in knowing that a drive has failed, or is about to fail, rebuilding things and carrying on.

@Napsty
Owner

Napsty commented Jul 7, 2021

> As above, I'm more interested in the general WARNING/CRITICAL ordering.

Yes, this should definitely happen.

So if you do a manual check of sdb right now, is it CRITICAL or WARNING?

@peternewman
Contributor Author

> > As above, I'm more interested in the general WARNING/CRITICAL ordering.
>
> Yes, this should definitely happen.

That's certainly what I'd like, but I don't see any code to do so currently (I'm not sure if you're saying you think it should, or agreeing it's a feature to implement):

check_smart/check_smart.pl

Lines 694 to 696 in 956f236

if ($opt_g) {
    $status_string = $label.join(', ', @error_messages);
}

And e.g.:

check_smart/check_smart.pl

Lines 664 to 665 in 956f236

push(@error_messages, 'Disk temperature is higher than maximum');
escalate_status('CRITICAL');

versus

check_smart/check_smart.pl

Lines 677 to 678 in 956f236

push(@error_messages, 'Disk start_stop is higher than maximum');
escalate_status('WARNING');

> So if you do a manual check of sdb right now, is it CRITICAL or WARNING?

Yeah that works as expected:

/usr/local/bin/check_smart.pl -d /dev/sdb -i auto --selftest --ssd-lifetime; echo $?
WARNING: Drive  <REDACTED> S/N <REDACTED>:  Reported_Uncorrect is non-zero (5)|Raw_Read_Error_Rate=166912319 Spin_Up_Time=0 Start_Stop_Count=16 Reallocated_Sector_Ct=0 Seek_Error_Rate=346936906 Power_On_Hours=63372 Spin_Retry_Count=0 Power_Cycle_Count=16 End-to-End_Error=0 Reported_Uncorrect=5 Command_Timeout=4295041085 High_Fly_Writes=0 Airflow_Temperature_Cel=36 Temperature_Celsius=36 Hardware_ECC_Recovered=166912319 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0
1

@Napsty Napsty added this to the 6.11.0 milestone Jul 9, 2021
@Napsty
Owner

Napsty commented Jul 9, 2021

@peternewman can you please try with the 6.11 branch?
https://github.com/Napsty/check_smart/tree/6.11
How does that behave when you have both CRITICAL and WARNING drives in the same system?

@peternewman
Contributor Author

> How does that behave when you have both CRITICAL and WARNING drives in the same system?

Thanks @Napsty . I've swapped my failed drive now unfortunately, so would need to fake it by making an existing warning a critical.

I do see one big issue though:

check_smart/check_smart.pl

Lines 703 to 704 in 7eecae6

$status_string = join(', ', @error_messages);
$status_string = join(', ', @warning_messages);

You'll only ever get warning messages out, as you're not concatenating the two joins together, just setting $status_string twice!
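For illustration, one way to keep both lists is to pass them to a single `join`. This is a minimal sketch that reuses the variable names from the excerpt above with sample messages from this thread; it is not necessarily how the branch ends up fixing it.

```perl
use strict;
use warnings;

# Sample messages; in check_smart.pl these arrays are filled during the checks.
my @error_messages   = ('Reported_Uncorrect is test critical');
my @warning_messages = (
    'Reallocated_Sector_Ct is non-zero (25)',
    'Current_Pending_Sector is non-zero (55)',
);

# One join over both lists keeps the errors first and also puts the
# separator between the last error and the first warning.
my $status_string = join(', ', @error_messages, @warning_messages);
print "$status_string\n";
```

Joining the combined list, rather than concatenating two separately joined strings, avoids a missing separator where the two lists meet.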

@Napsty
Owner

Napsty commented Jul 9, 2021

Thx for pointing that out. Should be fixed now with commit d3a85e9

@Napsty Napsty self-assigned this Jul 9, 2021
@peternewman
Contributor Author

This still doesn't fix it in global mode unfortunately @Napsty . Note how /dev/sdc, where I fudged being under the threshold to generate my test critical, is listed after /dev/sdb, which only has warnings:

CRITICAL: [/dev/sdb] - [/dev/sdb] - Reallocated_Sector_Ct is non-zero (25), Reported_Uncorrect is non-zero (139), Reallocated_Event_Count is non-zero (25), Current_Pending_Sector is non-zero (55) --- [/dev/sdc] - Reported_Uncorrect is test critical[/dev/sdc] -  --- [/dev/sda] - Device is clean --- [/dev/sdd] - Device is clean|
2

It does in single device mode though (N.B. I've changed to a different drive here and a different threshold), i.e. criticals are now correctly listed before warnings on a per-drive basis:

CRITICAL: Drive  <redacted> S/N <redacted>:  Reported_Uncorrect is test criticalReallocated_Sector_Ct is non-zero (25), Reallocated_Event_Count is non-zero (25), Current_Pending_Sector is non-zero (55)|Raw_Read_Error_Rate=65320951 Spin_Up_Time=0 Start_Stop_Count=17 Reallocated_Sector_Ct=25 Seek_Error_Rate=347559879 Power_On_Hours=63690 Spin_Retry_Count=0 Power_Cycle_Count=17 End-to-End_Error=0 Reported_Uncorrect=139 Command_Timeout=4295041085 High_Fly_Writes=0 Airflow_Temperature_Cel=38 Temperature_Celsius=38 Hardware_ECC_Recovered=65320951 Reallocated_Event_Count=25 Current_Pending_Sector=55 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0
2

I think in your current model you've got @drives_status_not_okay and @drives_status_okay; you either need to switch to a hash-based model, or split @drives_status_not_okay into warning and critical.

See also related #71 to improve the current formatting in global mode.

@Napsty
Owner

Napsty commented Sep 16, 2021

Hi @peternewman . Can you try it with the newest check_smart.pl from the 6.11 branch please:
https://github.com/Napsty/check_smart/blob/6.11/check_smart.pl

@Napsty
Owner

Napsty commented Sep 16, 2021

Commit 5dbacc7 now also adds an internal "notice" status for attributes appearing as "less than threshold".

Before the commit, attributes would show up in their lookup order, even when different thresholds are given:

root@server:~# ./check_smart.pl -d /dev/sg3 -i sat -w "Reallocated_Sector_Ct=500"
WARNING: Drive  WD2000FYYZ-23UL S/N XXX:  Reallocated_Sector_Ct is non-zero (372) (but less than threshold 500), Reallocated_Event_Count is non-zero (36), Current_Pending_Sector is non-zero (2), Offline_Uncorrectable is non-zero (2)|Raw_Read_Error_Rate=0 Spin_Up_Time=4500 Start_Stop_Count=79 Reallocated_Sector_Ct=372 Seek_Error_Rate=0 Power_On_Hours=47246 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=35 G-Sense_Error_Rate=1 Power-Off_Retract_Count=30 Load_Cycle_Count=48 Temperature_Celsius=36 Hardware_ECC_Recovered=0 Reallocated_Event_Count=36 Current_Pending_Sector=2 Offline_Uncorrectable=2 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=2

After the commit, the "Reallocated_Sector_Ct" is moved to the end of the output:

root@server:~# ./check_smart.pl -d /dev/sg3 -i sat -w "Reallocated_Sector_Ct=500" 
WARNING: Drive  WD2000FYYZ-23UL S/N XXX:  Reallocated_Event_Count is non-zero (36), Current_Pending_Sector is non-zero (2), Offline_Uncorrectable is non-zero (2), Reallocated_Sector_Ct is non-zero (372) (but less than threshold 500)|Raw_Read_Error_Rate=0 Spin_Up_Time=4500 Start_Stop_Count=79 Reallocated_Sector_Ct=372 Seek_Error_Rate=0 Power_On_Hours=47247 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=35 G-Sense_Error_Rate=1 Power-Off_Retract_Count=30 Load_Cycle_Count=48 Temperature_Celsius=36 Hardware_ECC_Recovered=0 Reallocated_Event_Count=36 Current_Pending_Sector=2 Offline_Uncorrectable=2 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=2

5dbacc7 also splits the "not_okay" drives into "critical" and "warning" drives (as suggested by you). Then critical (first) and warning (second) drives are merged together into the "not_okay" drives. This should ensure that critical drives appear first in the output.
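The split-and-merge described above could look roughly like the following sketch. The `@drives` data and the array names `@drives_status_critical` and `@drives_status_warning` are hypothetical; only `@drives_status_not_okay` and `@drives_status_okay` are names mentioned in this thread, and the real commit may differ.

```perl
use strict;
use warnings;

# Hypothetical per-drive entries in discovery order.
my @drives = (
    { status => 'WARNING',  text => '[/dev/sdb] - Reported_Uncorrect is non-zero (1)' },
    { status => 'CRITICAL', text => '[/dev/sdc] - Health status: FAILED!' },
    { status => 'OK',       text => '[/dev/sda] - Device is clean' },
    { status => 'OK',       text => '[/dev/sdd] - Device is clean' },
);

# Split the drives into one array per status (everything else counts as okay here).
my (@drives_status_critical, @drives_status_warning, @drives_status_okay);
for my $drive (@drives) {
    if    ($drive->{status} eq 'CRITICAL') { push @drives_status_critical, $drive->{text}; }
    elsif ($drive->{status} eq 'WARNING')  { push @drives_status_warning,  $drive->{text}; }
    else                                   { push @drives_status_okay,     $drive->{text}; }
}

# Merge critical (first) and warning (second) back into the not-okay list.
my @drives_status_not_okay = (@drives_status_critical, @drives_status_warning);
my $status_string = join(' --- ', @drives_status_not_okay, @drives_status_okay);
print "CRITICAL: $status_string\n";
```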

@Napsty Napsty mentioned this issue Oct 4, 2021
Merged
@Napsty
Owner

Napsty commented Oct 4, 2021

Fixed in #72

@Napsty Napsty closed this as completed Oct 4, 2021