Percent_Lifetime_Remain threshold unset with -w #92

Closed
ymartin-ovh opened this issue Sep 14, 2023 · 19 comments

@ymartin-ovh
Contributor

Hello

There seems to be an issue with the handling of the -w option. When I give a threshold for a particular smartctl attribute (other than lifetime), the Percent_Lifetime_Remain threshold is not set to 90%:

warning => ./check_smart -i auto -g '/dev/sda' -w Reallocated_Sector_Ct=250 -l
ok => ./check_smart -i auto -g '/dev/sda' -w Reallocated_Sector_Ct=250,Percent_Lifetime_Remain=90 -l
ok => ./check_smart -i auto -g '/dev/sda' -l

Before working on a patch, can you tell me whether this behaviour is intended?

Regards

@Napsty
Owner

Napsty commented Sep 14, 2023

Can you show the current Percent_Lifetime_Remain value?

@ymartin-ovh
Contributor Author

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
...
202 Percent_Lifetime_Remain 0x0030   099   099   001    Old_age   Offline      -       1
...

@Napsty
Owner

Napsty commented Sep 15, 2023

Agree, simply adding a -l parameter should be enough to check for the Percent_Lifetime_Remain attribute. Need to check why this didn't work.

@Napsty Napsty self-assigned this Sep 15, 2023
@Napsty Napsty added the bug label Sep 15, 2023
@Napsty
Owner

Napsty commented Sep 15, 2023

@ymartin-ovh
Contributor Author

Hello

Your patch fixes the warning threshold when it is not given, but introduces a new bug (as you set the value unconditionally):

ok (threshold set to 90%)
./check_smart.pl -i auto -g '/dev/sda' -w Reallocated_Sector_Ct=250 -l
./check_smart.pl --skip-load-cycles -l -i auto -g '/dev/{sdb,sda}'

ko =>
./check_smart.pl --skip-load-cycles -l -i auto -g '/dev/{sdb,sda}' -w Percent_Lifetime_Remain=85
OK: [/dev/sdb] - Device is clean [/dev/sdb] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90) --- [/dev/sda] - Device is clean [/dev/sda] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90)|

@ymartin-ovh
Contributor Author

Before your patch, I got 85% =>
/usr/lib/nagios/ovh/check_smart --skip-load-cycles -l -i auto -g '/dev/{sdb,sda}' -w Percent_Lifetime_Remain=85
OK: [/dev/sdb] - Device is clean [/dev/sdb] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 85) --- [/dev/sda] - Device is clean [/dev/sda] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 85)|

@Napsty
Owner

Napsty commented Sep 18, 2023

Can you please run with --debug as it's easier for me to find out what happens in the background, thx. You can combine with --hide-sn to hide sensitive serial numbers.

@ymartin-ovh
Contributor Author

./check_smart.pl --skip-load-cycles -l -i auto -g '/dev/{sdb,sda}' -w Percent_Lifetime_Remain=85 --debug --hide-sn
Found /dev/sdb
Found /dev/sda
###########################################################
(debug) CHECK 1: getting overall SMART health status for /dev/sdb 
###########################################################


(debug) executing:
sudo /usr/sbin/smartctl -d auto -Hi /dev/sdb

(debug) output:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.124-ovh-vps-grsec-zfs-classid] (local build)
 Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
 
 === START OF INFORMATION SECTION ===
 Model Family:     Micron 5100 Pro / 52x0 / 5300 SSDs
 Device Model:     Micron_5300_MTFDDAK480TDS
 Serial Number:    22263A2BB86F
 LU WWN Device Id: 5 00a075 13a2bb86f
 Firmware Version: D3MU001
 User Capacity:    480,103,981,056 bytes [480 GB]
 Sector Sizes:     512 bytes logical, 4096 bytes physical
 Rotation Rate:    Solid State Device
 Form Factor:      2.5 inches
 TRIM Command:     Available, deterministic, zeroed
 Device is:        In smartctl database [for details use: -P show]
 ATA Version is:   ACS-4 (minor revision not indicated)
 SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
 Local Time is:    Mon Sep 18 11:59:31 2023 CEST
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled
 
 === START OF READ SMART DATA SECTION ===
 SMART overall-health self-assessment test result: PASSED
 


(debug) parsing line:
Device Model:     Micron_5300_MTFDDAK480TDS


(debug) found model:  Micron_5300_MTFDDAK480TDS

(debug) parsing line:
Serial Number:    22263A2BB86F


(debug) Hiding serial number

(debug) found serial number <HIDDEN>

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK
###########################################################
(debug) CHECK 2: getting silent SMART health check
###########################################################


(debug) executing:
sudo /usr/sbin/smartctl -d auto -q silent -A /dev/sdb

(debug) exit code:
0

(debug) zero exit code, status OK

###########################################################
(debug) CHECK 3: getting detailed statistics from attributes
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing
###########################################################


(debug) executing:
sudo /usr/sbin/smartctl -d auto -A /dev/sdb

(debug) output:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.124-ovh-vps-grsec-zfs-classid] (local build)
 Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
 
 === START OF READ SMART DATA SECTION ===
 SMART Attributes Data Structure revision number: 16
 Vendor Specific SMART Attributes with Thresholds:
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
   5 Reallocated_Sector_Ct   0x0032   100   100   001    Old_age   Always       -       0
   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       6549
  12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       27
 170 Reserved_Block_Pct      0x0033   100   100   010    Pre-fail  Always       -       0
 171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
 172 Erase_Fail_Count        0x0032   100   100   001    Old_age   Always       -       0
 173 Avg_Block-Erase_Count   0x0032   098   098   000    Old_age   Always       -       129
 174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       26
 183 SATA_Int_Downshift_Ct   0x0032   100   100   000    Old_age   Always       -       0
 184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       155
 194 Temperature_Celsius     0x0022   066   057   000    Old_age   Always       -       34 (Min/Max 16/43)
 195 Hardware_ECC_Recovered  0x0032   100   100   000    Old_age   Always       -       0
 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
 197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
 199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
 202 Percent_Lifetime_Remain 0x0030   098   098   001    Old_age   Offline      -       2
 206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
 246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       166575160859
 247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       5213407814
 248 Bckgnd_Program_Page_Cnt 0x0032   100   100   000    Old_age   Always       -       373948235
 180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   000    Pre-fail  Always       -       2161
 210 RAIN_Success_Recovered  0x0032   100   100   000    Old_age   Always       -       0
 211 Integ_Scan_Complete_Cnt 0x0032   100   100   000    Old_age   Always       -       63
 212 Integ_Scan_Folding_Cnt  0x0032   100   100   000    Old_age   Always       -       1
 


(debug) Raw Check List ATA: Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count,Erase_Fail_Count_Total
(debug) Raw Check List NVMe: Media_and_Data_Integrity_Errors
(debug) Exclude List for Checks: 
(debug) Exclude List for Perfdata: 
(debug) Warning Thresholds:
Percent_Lifetime_Remain=90

(debug) Raw_Read_Error_Rate not in raw check list (raw value: 0)

(debug) Reallocated_Sector_Ct is OK (0)

(debug) Power_On_Hours not in raw check list (raw value: 6549)

(debug) Power_Cycle_Count not in raw check list (raw value: 27)

(debug) Reserved_Block_Pct not in raw check list (raw value: 0)

(debug) Program_Fail_Count not in raw check list (raw value: 0)

(debug) Erase_Fail_Count not in raw check list (raw value: 0)

(debug) Avg_Block-Erase_Count not in raw check list (raw value: 129)

(debug) Unexpect_Power_Loss_Ct not in raw check list (raw value: 26)

(debug) SATA_Int_Downshift_Ct not in raw check list (raw value: 0)

(debug) End-to-End_Error not in raw check list (raw value: 0)

(debug) Reported_Uncorrect is OK (0)

(debug) Command_Timeout not in raw check list (raw value: 155)

(debug) Temperature_Celsius not in raw check list (raw value: 34)

(debug) Hardware_ECC_Recovered not in raw check list (raw value: 0)

(debug) Reallocated_Event_Count is OK (0)

(debug) Current_Pending_Sector is OK (0)

(debug) Offline_Uncorrectable is OK (0)

(debug) UDMA_CRC_Error_Count not in raw check list (raw value: 0)

(debug) Percent_Lifetime_Remain is non-zero (2) but less than 90

(debug) Write_Error_Rate not in raw check list (raw value: 0)

(debug) Total_LBAs_Written not in raw check list (raw value: 166575160859)

(debug) Host_Program_Page_Count not in raw check list (raw value: 5213407814)

(debug) Bckgnd_Program_Page_Cnt not in raw check list (raw value: 373948235)

(debug) Unused_Rsvd_Blk_Cnt_Tot not in raw check list (raw value: 2161)

(debug) RAIN_Success_Recovered not in raw check list (raw value: 0)

(debug) Integ_Scan_Complete_Cnt not in raw check list (raw value: 63)

(debug) Integ_Scan_Folding_Cnt not in raw check list (raw value: 1)

(debug) gathered perfdata:


###########################################################
(debug) LOCAL STATUS: OK, FINAL STATUS: OK
###########################################################


###########################################################
(debug) CHECK 1: getting overall SMART health status for /dev/sda 
###########################################################


(debug) executing:
sudo /usr/sbin/smartctl -d auto -Hi /dev/sda

(debug) output:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.124-ovh-vps-grsec-zfs-classid] (local build)
 Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
 
 === START OF INFORMATION SECTION ===
 Model Family:     Micron 5100 Pro / 52x0 / 5300 SSDs
 Device Model:     Micron_5300_MTFDDAK480TDS
 Serial Number:    22263A2BB83E
 LU WWN Device Id: 5 00a075 13a2bb83e
 Firmware Version: D3MU001
 User Capacity:    480,103,981,056 bytes [480 GB]
 Sector Sizes:     512 bytes logical, 4096 bytes physical
 Rotation Rate:    Solid State Device
 Form Factor:      2.5 inches
 TRIM Command:     Available, deterministic, zeroed
 Device is:        In smartctl database [for details use: -P show]
 ATA Version is:   ACS-4 (minor revision not indicated)
 SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
 Local Time is:    Mon Sep 18 11:59:31 2023 CEST
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled
 
 === START OF READ SMART DATA SECTION ===
 SMART overall-health self-assessment test result: PASSED
 


(debug) parsing line:
Device Model:     Micron_5300_MTFDDAK480TDS


(debug) found model:  Micron_5300_MTFDDAK480TDS

(debug) parsing line:
Serial Number:    22263A2BB83E


(debug) Hiding serial number

(debug) found serial number <HIDDEN>

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK
###########################################################
(debug) CHECK 2: getting silent SMART health check
###########################################################


(debug) executing:
sudo /usr/sbin/smartctl -d auto -q silent -A /dev/sda

(debug) exit code:
0

(debug) zero exit code, status OK

###########################################################
(debug) CHECK 3: getting detailed statistics from attributes
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing
###########################################################


(debug) executing:
sudo /usr/sbin/smartctl -d auto -A /dev/sda

(debug) output:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.124-ovh-vps-grsec-zfs-classid] (local build)
 Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
 
 === START OF READ SMART DATA SECTION ===
 SMART Attributes Data Structure revision number: 16
 Vendor Specific SMART Attributes with Thresholds:
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
   5 Reallocated_Sector_Ct   0x0032   100   100   001    Old_age   Always       -       0
   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       6549
  12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       27
 170 Reserved_Block_Pct      0x0033   100   100   010    Pre-fail  Always       -       0
 171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
 172 Erase_Fail_Count        0x0032   100   100   001    Old_age   Always       -       0
 173 Avg_Block-Erase_Count   0x0032   098   098   000    Old_age   Always       -       129
 174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       26
 183 SATA_Int_Downshift_Ct   0x0032   100   100   000    Old_age   Always       -       0
 184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       155
 194 Temperature_Celsius     0x0022   065   057   000    Old_age   Always       -       35 (Min/Max 16/43)
 195 Hardware_ECC_Recovered  0x0032   100   100   000    Old_age   Always       -       0
 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
 197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
 199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
 202 Percent_Lifetime_Remain 0x0030   098   098   001    Old_age   Offline      -       2
 206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
 246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       166523290925
 247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       5211799331
 248 Bckgnd_Program_Page_Cnt 0x0032   100   100   000    Old_age   Always       -       377450209
 180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   000    Pre-fail  Always       -       2161
 210 RAIN_Success_Recovered  0x0032   100   100   000    Old_age   Always       -       0
 211 Integ_Scan_Complete_Cnt 0x0032   100   100   000    Old_age   Always       -       63
 212 Integ_Scan_Folding_Cnt  0x0032   100   100   000    Old_age   Always       -       0
 


(debug) Raw Check List ATA: Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count,Erase_Fail_Count_Total
(debug) Raw Check List NVMe: Media_and_Data_Integrity_Errors
(debug) Exclude List for Checks: 
(debug) Exclude List for Perfdata: 
(debug) Warning Thresholds:
Percent_Lifetime_Remain=90

(debug) Raw_Read_Error_Rate not in raw check list (raw value: 0)

(debug) Reallocated_Sector_Ct is OK (0)

(debug) Power_On_Hours not in raw check list (raw value: 6549)

(debug) Power_Cycle_Count not in raw check list (raw value: 27)

(debug) Reserved_Block_Pct not in raw check list (raw value: 0)

(debug) Program_Fail_Count not in raw check list (raw value: 0)

(debug) Erase_Fail_Count not in raw check list (raw value: 0)

(debug) Avg_Block-Erase_Count not in raw check list (raw value: 129)

(debug) Unexpect_Power_Loss_Ct not in raw check list (raw value: 26)

(debug) SATA_Int_Downshift_Ct not in raw check list (raw value: 0)

(debug) End-to-End_Error not in raw check list (raw value: 0)

(debug) Reported_Uncorrect is OK (0)

(debug) Command_Timeout not in raw check list (raw value: 155)

(debug) Temperature_Celsius not in raw check list (raw value: 35)

(debug) Hardware_ECC_Recovered not in raw check list (raw value: 0)

(debug) Reallocated_Event_Count is OK (0)

(debug) Current_Pending_Sector is OK (0)

(debug) Offline_Uncorrectable is OK (0)

(debug) UDMA_CRC_Error_Count not in raw check list (raw value: 0)

(debug) Percent_Lifetime_Remain is non-zero (2) but less than 90

(debug) Write_Error_Rate not in raw check list (raw value: 0)

(debug) Total_LBAs_Written not in raw check list (raw value: 166523290925)

(debug) Host_Program_Page_Count not in raw check list (raw value: 5211799331)

(debug) Bckgnd_Program_Page_Cnt not in raw check list (raw value: 377450209)

(debug) Unused_Rsvd_Blk_Cnt_Tot not in raw check list (raw value: 2161)

(debug) RAIN_Success_Recovered not in raw check list (raw value: 0)

(debug) Integ_Scan_Complete_Cnt not in raw check list (raw value: 63)

(debug) Integ_Scan_Folding_Cnt not in raw check list (raw value: 0)

(debug) gathered perfdata:


###########################################################
(debug) LOCAL STATUS: OK, FINAL STATUS: OK
###########################################################


(debug) final status/output: OK
(debug) drives  ok: [/dev/sdb] - Device is clean [/dev/sdb] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90) [/dev/sda] - Device is clean [/dev/sda] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90)
(debug) drives nok: 
(debug)   msg_list: [/dev/sdb] - Device is clean [/dev/sdb] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90)^[/dev/sda] - Device is clean [/dev/sda] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90)

OK: [/dev/sdb] - Device is clean [/dev/sdb] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90) --- [/dev/sda] - Device is clean [/dev/sda] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90)|

@Napsty
Owner

Napsty commented Sep 18, 2023

To me it looks like the correct behaviour. Both your drives sda and sdb have a Percent_Lifetime_Remain value of 2:

 202 Percent_Lifetime_Remain 0x0030   098   098   001    Old_age   Offline      -       2
 202 Percent_Lifetime_Remain 0x0030   098   098   001    Old_age   Offline      -       2

The attribute list can be seen in the debug output.

So to test the warning threshold, you must set it equal to or lower than 2:

./check_smart.pl --skip-load-cycles -l -i auto -g '/dev/{sdb,sda}' -w Percent_Lifetime_Remain=2 --debug --hide-sn

Please try that and comment here again with your findings.
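For readers following along, the comparison being described can be sketched roughly as follows. This is only an illustration in Python, not the plugin's Perl code: the function name `classify` is hypothetical, and the `>=` comparison is an assumption inferred from the "non-zero (2) but less than 90" debug message and the "equal to or lower than 2" advice above.

```python
def classify(attribute: str, raw_value: int, warn_thresholds: dict) -> str:
    """Return 'WARNING' once the raw value has reached its threshold, else 'OK'.

    Assumption: a warning fires when raw_value >= threshold. Attributes
    without a configured threshold are never flagged by this sketch.
    """
    threshold = warn_thresholds.get(attribute)
    if threshold is not None and raw_value >= threshold:
        return "WARNING"
    return "OK"


# Raw value 2 (as on sda/sdb above) against the default and a lowered threshold.
print(classify("Percent_Lifetime_Remain", 2, {"Percent_Lifetime_Remain": 90}))
print(classify("Percent_Lifetime_Remain", 2, {"Percent_Lifetime_Remain": 2}))
```

Under that assumption, a threshold of 90 leaves both drives OK, while a threshold of 2 should flip them to a warning.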

PS: I just noticed that --hide-sn didn't properly work. But that's another issue to look at ;-)

@ymartin-ovh
Contributor Author

ymartin-ovh commented Sep 18, 2023

No, there is an issue in your patch:

./check_smart.pl --skip-load-cycles -l -i auto -g '/dev/{sdb,sda}' -w Percent_Lifetime_Remain=85
OK: [/dev/sdb] - Device is clean [/dev/sdb] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90) --- [/dev/sda] -

I put 85 and the output mentions 90 => but less than threshold 90

Also, in SMART, the lifetime value is inverted: the raw value counts the lifetime already used, not the percentage remaining that the attribute name suggests. This is explained in the drive datasheet and also in the check_smart Perl code.
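To make the inversion concrete, here is a small Python sketch that parses the attribute line from the debug output above. The helper name `parse_attribute_line` is hypothetical, and treating the raw value as "percent used" is an assumption based on this comment and on the Micron output shown in this thread; other vendors may report it differently.

```python
def parse_attribute_line(line: str) -> dict:
    """Split one row of `smartctl -A` output into the fields used here.

    Column order follows the header printed by smartctl:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    fields = line.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),  # normalized value, e.g. "098" -> 98
        "raw": int(fields[9]),    # first token of RAW_VALUE
    }


line = " 202 Percent_Lifetime_Remain 0x0030   098   098   001    Old_age   Offline      -       2"
attr = parse_attribute_line(line)

# Assumption: the raw value is the percentage of lifetime already used.
remaining = 100 - attr["raw"]
print(attr["name"], "raw:", attr["raw"], "remaining:", remaining)
```

For this drive the computed remaining percentage matches the normalized VALUE column, which is why a raw value of 2 is far from a threshold of 90.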

@Napsty
Owner

Napsty commented Sep 19, 2023

I put 85 and the output mentions 90 => but less than threshold 90

Ah yes, now I see it.

@Napsty
Owner

Napsty commented Sep 19, 2023

Let me try to comprehend the issue correctly.

When you enable the Percent_Lifetime_Remain check with -l, the check works and alerts automatically once the value reaches 90. As long as the value is below 90, the plugin prints the value but stays below the warning level:

$ ./check_smart.pl -d /dev/sda -i auto --debug -l
[...]
(debug) Warning Thresholds:
Percent_Lifetime_Remain=90
[...]
(debug) Percent_Lifetime_Remain is non-zero (2) but less than 90
[...]
OK: Drive  Samsung SSD 850 EVO 500GB S/N XXX: no SMART errors detected.  Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90)|Reallocated_Sector_Ct=0 Power_On_Hours=26002 Power_Cycle_Count=934 Wear_Leveling_Count=35 Used_Rsvd_Blk_Cnt_Tot=0 Program_Fail_Cnt_Total=0 Erase_Fail_Count_Total=0 Runtime_Bad_Block=0 Uncorrectable_Error_Cnt=0 Airflow_Temperature_Cel=32 ECC_Error_Rate=0 CRC_Error_Count=0 Percent_Lifetime_Remain=2 POR_Recovery_Count=12 Total_LBAs_Written=41523523747

But when you want to override the Percent_Lifetime_Remain threshold (let's say with 50), your own threshold is overwritten again with 90:

$ ./check_smart.pl -d /dev/sda -i auto --debug -l -w "Percent_Lifetime_Remain=50"
[...]
(debug) Warning Thresholds:
Percent_Lifetime_Remain=90
[...]
(debug) Percent_Lifetime_Remain is non-zero (2) but less than 90
[...]
OK: Drive  Samsung SSD 850 EVO 500GB S/N XXX: no SMART errors detected.  Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90)|Reallocated_Sector_Ct=0 Power_On_Hours=26002 Power_Cycle_Count=934 Wear_Leveling_Count=35 Used_Rsvd_Blk_Cnt_Tot=0 Program_Fail_Cnt_Total=0 Erase_Fail_Count_Total=0 Runtime_Bad_Block=0 Uncorrectable_Error_Cnt=0 Airflow_Temperature_Cel=32 ECC_Error_Rate=0 CRC_Error_Count=0 Percent_Lifetime_Remain=2 POR_Recovery_Count=12 Total_LBAs_Written=41523523747

Is that the problem this issue is about? Or did I misunderstand something?

Note: I faked the SMARTCTL output on this drive, as the Samsung SSDs don't have a Percent_Lifetime_Remain attribute.

@ymartin-ovh
Contributor Author

Initially, my issue was that when -w is used with another threshold definition like Reallocated_Sector_Ct, Percent_Lifetime_Remain=90 is not pushed in the warn_list (see:
https://github.com/Napsty/check_smart/blob/master/check_smart.pl#L231)

@Napsty
Owner

Napsty commented Sep 20, 2023

when -w is used with another threshold definition like Reallocated_Sector_Ct, Percent_Lifetime_Remain=90 is not pushed in the warn_list

Yep, but this should now work.

$ ./check_smart.pl -d /dev/sda -i auto --debug -l -w "Uncorrectable_Error_Cnt=10,Reallocated_Sector_Ct=10"
[...]
(debug) Warning Thresholds:
Percent_Lifetime_Remain=90
Reallocated_Sector_Ct=10
Uncorrectable_Error_Cnt=10
[...]

Can you confirm with the latest version? -> https://raw.githubusercontent.com/Napsty/check_smart/issue-92/check_smart.pl

ymartin-ovh added a commit to ymartin-ovh/check_smart that referenced this issue Sep 20, 2023
…en (issue Napsty#92)

  Address issue when threshold is not set in the following case:
  check_smart -i auto -g '/dev/sda' -w Reallocated_Sector_Ct=250 -l

  Instead of 90, the threshold is 0 when -w is given without
  Percent_Lifetime_Remain threshold.
@ymartin-ovh
Contributor Author

ymartin-ovh commented Sep 20, 2023

when -w is used with another threshold definition like Reallocated_Sector_Ct, Percent_Lifetime_Remain=90 is not pushed in the warn_list

Yep, but this should now work.

$ ./check_smart.pl -d /dev/sda -i auto --debug -l -w "Uncorrectable_Error_Cnt=10,Reallocated_Sector_Ct=10"
[...]
(debug) Warning Thresholds:
Percent_Lifetime_Remain=90
Reallocated_Sector_Ct=10
Uncorrectable_Error_Cnt=10
[...]

Can you confirm with the latest version? -> https://raw.githubusercontent.com/Napsty/check_smart/issue-92/check_smart.pl

No, your patch overwrites the user-given value because the default is pushed at the tail of the warn_list. The default value should be at the head of the list, so that a user-given value coming later takes precedence. I have provided a fix in #93.
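The head-versus-tail point can be illustrated with a short Python sketch. It is not the plugin's Perl code: `parse_warn_list` and `build_warn_list` are hypothetical helpers, and the "later entries win" behaviour is an assumption taken from this comment.

```python
DEFAULT_LIFETIME = "Percent_Lifetime_Remain=90"


def parse_warn_list(warn_list):
    """Turn 'name=value' entries into a mapping; later entries overwrite earlier ones."""
    thresholds = {}
    for entry in warn_list:
        name, _, value = entry.partition("=")
        thresholds[name] = int(value)
    return thresholds


def build_warn_list(user_arg, default_at_head):
    """Combine the -w argument with the default lifetime threshold."""
    user_entries = [e for e in user_arg.split(",") if e]
    if default_at_head:
        return [DEFAULT_LIFETIME] + user_entries
    return user_entries + [DEFAULT_LIFETIME]


# Default pushed at the tail: the user's 85 is clobbered by 90.
tail = parse_warn_list(build_warn_list("Percent_Lifetime_Remain=85", default_at_head=False))
# Default placed at the head: the user's 85 wins.
head = parse_warn_list(build_warn_list("Percent_Lifetime_Remain=85", default_at_head=True))
# Only another attribute is given: the default of 90 still applies.
other = parse_warn_list(build_warn_list("Reallocated_Sector_Ct=250", default_at_head=True))

print(tail, head, other)
```

With the default at the head, all three cases from this thread behave as expected: no -w, -w with a lifetime threshold, and -w with only another attribute.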

Regards

@Napsty
Owner

Napsty commented Sep 20, 2023

Thx for the PR. Please set your if condition in line 231:
https://github.com/Napsty/check_smart/blob/master/check_smart.pl#L231

This way the Percent_Lifetime_Remain threshold is only set once and added to the warn_list array from the beginning.

@ymartin-ovh
Contributor Author

ymartin-ovh commented Sep 20, 2023

The if condition at line 231 is not needed anymore, as it is implemented at line 240 in #93.

@Napsty
Owner

Napsty commented Sep 20, 2023

Just tested it locally, lgtm

  1. Using -l : Sets Percent_Lifetime_Remain=90 into warn_list ✔️
  2. Setting a different threshold with -l -w "Percent_Lifetime_Remain=70,CRC_Error_Count=10" works ✔️
  3. Using another attribute threshold -l -w "CRC_Error_Count=10" uses the default threshold of 90 again for Percent_Lifetime_Remain ✔️

Napsty pushed a commit that referenced this issue Sep 20, 2023
…en (issue #92) (#93)

Address issue when threshold is not set in the following case:
  check_smart -i auto -g '/dev/sda' -w Reallocated_Sector_Ct=250 -l

  Instead of 90, the threshold is 0 when -w is given without
  Percent_Lifetime_Remain threshold.
@Napsty
Owner

Napsty commented Sep 20, 2023

Fixed with #93

@Napsty Napsty closed this as completed Sep 20, 2023