Temperature reported twice in metrics for some drives #41

Closed
MichiK opened this issue Sep 30, 2019 · 4 comments

Comments

@MichiK
Contributor

commented Sep 30, 2019

I have some drives (Seagate Nytro XF1230) that report Temperature_Celsius twice: once with type Old_age and once with type Pre-fail. Both may be relevant performance data, but the current temperature, which is the most interesting value for metrics and fancy dashboards, is in the first one.

This is what these drives return in smartctl:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
...
194 Temperature_Celsius     0x0002   071   058   000    Old_age   Always       -       29 (Min/Max 17/42)
...
231 Temperature_Celsius     0x0033   100   100   001    Pre-fail  Always       -       100

A line from check_smart.pl for one of these drives looks like this:

OK: no SMART errors detected. |Raw_Read_Error_Rate=0 Reallocated_Sector_Ct=0 Power_On_Hours=3706 Power_Cycle_Count=302 Program_Fail_Count_Chip=0 Erase_Fail_Count_Chip=0 Wear_Leveling_Count=1432832 Used_Rsvd_Blk_Cnt_Chip=45 Used_Rsvd_Blk_Cnt_Tot=661 Unused_Rsvd_Blk_Cnt_Tot=0 Program_Fail_Cnt_Total=0 Erase_Fail_Count_Total=0 Runtime_Bad_Block=0 End-to-End_Error=0 Reported_Uncorrect=0 Command_Timeout=0 Unknown_SSD_Attribute=0 Airflow_Temperature_Cel=0 Temperature_Celsius=29 Hardware_ECC_Recovered=0 UDMA_CRC_Error_Count=0 Unknown_SSD_Attribute=0 Soft_ECC_Correction=0 Temperature_Celsius=100 Total_LBAs_Written=2973 Total_LBAs_Read=3900 Read_Error_Retry_Rate=992

This has become a problem after I upgraded my monitoring hosts from Debian stretch to buster. The Icinga 2 version 2.6 in stretch fed the first occurrence of the attribute in the performance data to Graphite, whereas the newer version 2.10 from buster seems to use the second. Therefore, all my disks now show a temperature of 99 or 100°C in the database.

As far as I understand it, labels in the performance data should be unique and the order of the label/value pairs in the performance data is irrelevant, so I think Icinga is not at fault here as the behavior in case of non-unique labels is undefined.
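
To make the ambiguity concrete, here is a minimal sketch in Python (used only for illustration; check_smart.pl itself is Perl). It splits a shortened copy of the performance data above into label/value pairs and shows that a consumer keeping the first occurrence sees 29, while one keeping the last sees 100. The helper names are hypothetical, the parsing assumes simple space-separated `label=value` items without quoting, and it does not claim to reproduce how Icinga handles this internally.

```python
from collections import Counter


def parse_perfdata(perfdata: str) -> list[tuple[str, str]]:
    """Split a space-separated perfdata string into (label, value) pairs.

    Assumes plain label=value items with no quoted labels or spaces.
    """
    pairs = []
    for item in perfdata.split():
        label, _, value = item.partition("=")
        pairs.append((label, value))
    return pairs


def duplicate_labels(pairs: list[tuple[str, str]]) -> list[str]:
    """Return the labels that occur more than once, sorted by name."""
    counts = Counter(label for label, _ in pairs)
    return sorted(label for label, n in counts.items() if n > 1)


# Shortened excerpt of the check_smart.pl output quoted in this issue.
sample = (
    "Airflow_Temperature_Cel=0 Temperature_Celsius=29 "
    "Hardware_ECC_Recovered=0 Temperature_Celsius=100 Total_LBAs_Written=2973"
)
pairs = parse_perfdata(sample)

# A consumer that keeps the first value seen for each label.
first_wins: dict[str, str] = {}
for label, value in pairs:
    first_wins.setdefault(label, value)

# A consumer that lets later values overwrite earlier ones.
last_wins = dict(pairs)

print(duplicate_labels(pairs))
print(first_wins["Temperature_Celsius"], last_wins["Temperature_Celsius"])
```

Running a check like `duplicate_labels` against plugin output is one way to spot affected drives before the data reaches the metrics backend.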

Since I deploy check_smart.pl via Ansible on all hosts and have a local copy of it in one of my Ansible roles, I fixed the problem by replacing the Regex on line 396 with something like this (could be better, but I'm not fluent in Perl and Regex, so this was the quick and dirty solution my brain came up with):

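# Capture groups: 1=name, 2=flag, 3=value, 4=worst, 5=thresh, 6=type, 7=updated, 8=when_failed, 9=raw value
next unless $line =~ /^\s*\d+\s(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)/;
my ($attribute_name, $type, $when_failed, $raw_value) = ($1, $6, $8, $9);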

and then a few lines later I added:

# Skip the second (Pre-fail) Temperature_Celsius entry so only the Old_age one is reported.
if ($attribute_name eq "Temperature_Celsius" and $type eq "Pre-fail") {
    next;
}
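
For readers who want to try the filtering logic outside the plugin, the same idea can be sketched in Python (illustration only; the actual script is Perl). The regex mirrors the one above but also captures the attribute ID, and the function name `perfdata_pairs` is hypothetical. Like the original, it assumes THRESH and the leading part of RAW_VALUE are numeric.

```python
import re

# Groups: 1=id, 2=name, 3=flag, 4=value, 5=worst, 6=thresh,
#         7=type, 8=updated, 9=when_failed, 10=leading digits of raw value
ATTR_RE = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def perfdata_pairs(smartctl_lines: list[str]) -> list[tuple[str, str]]:
    """Return (name, raw_value) pairs, skipping the Pre-fail temperature entry."""
    pairs = []
    for line in smartctl_lines:
        match = ATTR_RE.match(line)
        if not match:
            continue  # header lines, "..." and other non-attribute lines
        name, attr_type, raw = match.group(2), match.group(7), match.group(10)
        # Same condition as the Perl workaround: drop the duplicate entry.
        if name == "Temperature_Celsius" and attr_type == "Pre-fail":
            continue
        pairs.append((name, raw))
    return pairs


# The smartctl lines quoted earlier in this issue.
lines = [
    "ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE",
    "194 Temperature_Celsius     0x0002   071   058   000    Old_age   Always       -       29 (Min/Max 17/42)",
    "231 Temperature_Celsius     0x0033   100   100   001    Pre-fail  Always       -       100",
]
print(perfdata_pairs(lines))
```

With the two attribute lines from this report, only the Old_age entry with raw value 29 survives the filter.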

Maybe I'm not the only one with this or a similar issue, and maybe there is a more generic way to do this (e.g. adding the attribute type to the label in the performance data or implementing more flexible excludes?), so I'll just leave this here.
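
One possible shape for the "more generic" idea, sketched in Python for illustration: keep labels unchanged when a name is unique, and append the attribute ID only when a name occurs more than once. The `Name_ID` label scheme and the function `unique_labels` are assumptions made for this sketch, not something check_smart.pl implements. The ID 9 for Power_On_Hours is the usual SMART assignment and is included only as sample data.

```python
from collections import Counter


def unique_labels(attributes: list[tuple[int, str, str]]) -> list[tuple[str, str]]:
    """Turn (id, name, raw_value) tuples into perfdata pairs with unique labels.

    Names that appear once keep their label; duplicated names get the
    attribute ID appended (hypothetical scheme: Name_ID).
    """
    counts = Counter(name for _, name, _ in attributes)
    pairs = []
    for attr_id, name, raw in attributes:
        label = f"{name}_{attr_id}" if counts[name] > 1 else name
        pairs.append((label, raw))
    return pairs


attrs = [
    (9, "Power_On_Hours", "3706"),
    (194, "Temperature_Celsius", "29"),
    (231, "Temperature_Celsius", "100"),
]
perfdata = " ".join(f"{label}={value}" for label, value in unique_labels(attrs))
print(perfdata)
```

A trade-off of this approach is that the metric name for an affected drive changes (for example to `Temperature_Celsius_194`), so existing graphs would need to be pointed at the new label.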

@Napsty

Owner

commented Sep 30, 2019

Hi @der-michik , thanks for reporting!
That's a very interesting case and it might even be a misinterpretation of the attributes coming from smartctl (smartmontools). According to https://en.wikipedia.org/wiki/S.M.A.R.T., attribute id 194 is the most used attribute to show the current temperature, whereas attribute id 231 is Life Left (SSDs) or Temperature. As your drive is an SSD, this is more likely to be "Life Left".
Again from the Wikipedia page:

Indicates the approximate SSD life left, in terms of program/erase cycles or available reserved blocks.[67] A normalized value of 100 represents a new drive, with a threshold value at 10 indicating a need for replacement. A value of 0 may mean that the drive is operating in read-only mode to allow data recovery.[68] Previously (pre-2010) occasionally used for Drive Temperature (more typically reported at 0xC2).

Your SSD reports the value 100, which according to this attribute indicates a perfectly healthy drive.

Can you please check the smartctl/smartmontools version on this particular host? We should probably report this upstream.

Update: Seems already fixed in smartmontools, check out: https://github.com/smartmontools/smartmontools/blob/master/smartmontools/drivedb.h#L4082 and smartmontools/smartmontools@160ecb1#diff-5c51af8dba19f3a4f4187af4b46e415f

And the ultimate finding: smartmontools/smartmontools#4

@Napsty Napsty added the third party label Sep 30, 2019
@MichiK

Contributor Author

commented Sep 30, 2019

Ah, interesting, thanks for your research! That explains a lot. I did not think about having a detailed look at smartmontools as upgrading that on the affected systems is not really an option for me anyway whereas patching the script was an easy workaround.

Nevertheless, we should maybe think about a more flexible exclude option. Currently, -e only excludes attributes from failure reporting, and only by name. Names are known to differ somewhat between drive vendors (even if the information in the attributes is correct) and are not always unique, as in my example. IDs would probably be a bit more reliable, and being able to exclude attributes from the performance data as well would be nice to have. Then I could exclude the broken attributes for the affected hosts in the Icinga configuration instead of in the script on the hosts themselves (that configuration is built from Ansible using the monitored hosts' facts, so I could even detect it automatically).
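
A rough sketch of what such an exclude list could look like, written in Python for illustration (the plugin is Perl, and this is not how -e currently behaves): purely numeric entries are matched against the attribute ID, everything else against the attribute name, and the result is applied to the performance data. The function `is_excluded` and the comma-separated list format are assumptions for this sketch.

```python
def is_excluded(attr_id: int, name: str, excludes: list[str]) -> bool:
    """Return True if the attribute matches an exclude entry by ID or by name."""
    for entry in excludes:
        if entry.isdigit():
            # Numeric entries are treated as attribute IDs.
            if int(entry) == attr_id:
                return True
        elif entry == name:
            return True
    return False


# Hypothetical exclude list: one ID and one name.
excludes = "231,Airflow_Temperature_Cel".split(",")

attrs = [
    (190, "Airflow_Temperature_Cel", "0"),
    (194, "Temperature_Celsius", "29"),
    (231, "Temperature_Celsius", "100"),
]
kept = [
    (name, raw)
    for attr_id, name, raw in attrs
    if not is_excluded(attr_id, name, excludes)
]
print(kept)
```

Matching by ID keeps attribute 194 even though it shares its name with the excluded attribute 231, which a name-only exclude cannot do.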

Maybe I will have a look at it and do a pull request tomorrow.

@Napsty

Owner

commented Sep 30, 2019

Currently, -e only excludes attributes from failure reporting, and only by name. Names are known to differ somewhat between drive vendors (even if the information in the attributes is correct) and are not always unique, as in my example. IDs would probably be a bit more reliable

That was actually my intended answer here (to use -e attribute_id) :D
I had forgotten that attributes cannot be excluded by ID yet. But that is fairly easy to add.
If you want, I'll let you do the code change and PR. If you don't find the time, let me know.

@MichiK MichiK referenced this issue Oct 1, 2019
@MichiK

Contributor Author

commented Oct 1, 2019

Since this is an upstream issue that has already been solved there, and I now have a nice workaround that fits my workflow, I think this can be closed.

@MichiK MichiK closed this Oct 1, 2019