Seagate SMART parsing for smartmontools #108
Hi @jmhands, thanks for reporting this to us. We will look into how we can submit a patch for this.
…ot indicated as pre-fail
Adding a different indicator for attributes at/below the threshold that do not set the pre-failure bit. These are meant to indicate other warnings, such as old age or non-critical issues, so they needed a different way to indicate this to the user. I also adjusted the hybrid output for attributes 1, 7, 195, and 188 to display the error count, since this seems to be what people are most interested in viewing from the raw data. [Seagate/openSeaChest#108] Signed-off-by: Tyler Erickson <tyler.erickson@seagate.com>
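The indicator logic in the commit above can be sketched roughly as follows. This is a minimal illustration, not openSeaChest's actual code: the function name and the "ok"/"warning"/"failing" labels are hypothetical, it assumes the ATA convention that bit 0 of the attribute flags is the pre-failure/warranty bit, and it treats a threshold of 0 as "no threshold" (an assumption mirroring how smartmontools handles it).

```python
# ATA SMART attribute flags: bit 0 is the pre-failure/warranty bit.
PREFAIL_FLAG = 0x0001


def attribute_status(flags: int, nominal: int, threshold: int) -> str:
    """Classify one attribute as 'ok', 'warning', or 'failing' (hypothetical labels)."""
    # ASSUMPTION: a threshold of 0 means the attribute can never trip.
    if threshold == 0 or nominal > threshold:
        return "ok"
    # At/below threshold with the pre-fail bit set: likely failure indicator.
    if flags & PREFAIL_FLAG:
        return "failing"
    # At/below threshold without the pre-fail bit: old-age or non-critical warning.
    return "warning"


if __name__ == "__main__":
    print(attribute_status(0x000F, 83, 44))  # pre-fail, above threshold
    print(attribute_status(0x000F, 44, 44))  # pre-fail, at threshold
    print(attribute_status(0x0032, 20, 25))  # not pre-fail, below threshold
```

Keeping the non-pre-fail case as a separate "warning" result is what lets a tool show old-age wear without reporting the drive as failing.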
After some internal review, we should make this update for attributes 1, 7, and 195: raw bytes 6:4 should be the displayed counter, as in the analyzed SMART attributes output in openSeaChest. I just tagged a couple of updates to this issue that I thought were important and related, even though they belong in openSeaChest rather than smartmontools.
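A minimal sketch of splitting the raw data for attributes 1, 7, and 195 this way. It assumes the raw field is 7 bytes (the 6 standard raw bytes plus the vendor byte that follows them), stored least-significant byte first, and that raw bytes 3:0 hold a running operation count. Only the raw 6:4 counter is stated above, so treat the rest of the layout as an assumption; split_seagate_raw is a hypothetical helper name.

```python
from __future__ import annotations


def split_seagate_raw(raw: bytes) -> tuple[int, int]:
    """Split a 7-byte raw field into (error_count, operation_count).

    ASSUMPTION: raw[0] is least significant; bytes 6:4 are the error
    counter and bytes 3:0 the running operation count.
    """
    if len(raw) != 7:
        raise ValueError(f"expected 7 raw bytes, got {len(raw)}")
    value = int.from_bytes(raw, "little")
    errors = value >> 32             # raw bytes 6:4 (24 bits)
    operations = value & 0xFFFFFFFF  # raw bytes 3:0 (32 bits)
    return errors, operations


if __name__ == "__main__":
    # Made-up raw field: 5 errors over 1,000,000 operations.
    sample = ((5 << 32) | 1_000_000).to_bytes(7, "little")
    print(split_seagate_raw(sample))  # (5, 1000000)
```

Displaying only the first element is what turns a huge, alarming-looking raw number into a small error count.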
Exos X20 20TB also affected, users running
Or add a similar block like the Exos X16 for the Exos X20 in
@vonericsen could you show the -v settings for 188 and 195, or explain how the more important ones need to be interpreted in the above format?
@walterav1984, This issue has unfortunately been on hold due to resource constraints lately, but I have left it open as this is still important and should get updated when we get the time again.
Could you hint at how the 188 and 195 values should be interpreted for the EXOS X20, since 4 out of 8 new disks show It may be a non-worrying value left over from the 512 > 4K sector change, which disconnected and reconnected the drive itself when it finished.
@walterav1984, If you do For 188, this will give you a total count (matching the hybrid output), the number of timeouts > 5 seconds, and the number of timeouts > 7.5 seconds.

The attributes with the pre-fail/warranty flag are those most likely to indicate a failure. Other attributes range from purely informational (like temperature), to tracking a wear-out item (like start-stop count), to attributes like these two, which on their own don't indicate a failure or an issue and may increment for many other reasons. If a drive is having a problem, there are generally counts increasing in multiple attributes, including a pre-fail attribute, in a way that says something is happening. Correlating multiple counts to a single issue or event is very difficult and generally takes a lot of failure analysis across many failed drives.

SMART attributes use the "current" or "nominal" value to indicate a percentage of health, usually from 1-100%, with 100% being perfectly healthy. Some attributes may use a different scale, but this is generally what they are based on. A threshold marks the point at which the manufacturer considers the attribute to indicate a failure (pre-fail attributes), or to indicate that a non-failure, possibly old-age attribute is wearing out (anything without the pre-fail flag that reports a threshold).

If you recently switched from 512B to 4KB sectors, the drive begins a lot of background processing, which can take a long time (a couple of days). If you read through #117, I provided some information about this there, and the other users involved in that issue also reported how long the background activity took to complete.
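A sketch of how the three 188 counters described above could be pulled out of a 48-bit raw value. The comment names the three counters but not where each one is stored, so the word order here (total in raw bytes 1:0, > 5 s in 3:2, > 7.5 s in 5:4) is purely an assumption for illustration, as is the helper name split_command_timeouts.

```python
from __future__ import annotations


def split_command_timeouts(raw48: int) -> dict[str, int]:
    """Split attribute 188's 48-bit raw value into three 16-bit counters.

    ASSUMPTION: the word order is illustrative only (total in bytes 1:0,
    > 5 s in bytes 3:2, > 7.5 s in bytes 5:4).
    """
    total, over_5s, over_7_5s = ((raw48 >> shift) & 0xFFFF for shift in (0, 16, 32))
    return {"total": total, "over_5s": over_5s, "over_7_5s": over_7_5s}


if __name__ == "__main__":
    # Made-up raw value: 1 timeout > 7.5 s, 2 > 5 s, 3 in total.
    print(split_command_timeouts((1 << 32) | (2 << 16) | 3))
```

If the real layout differs, only the tuple of shifts needs to change.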
Thanks for the comprehensive answer, although I'm a little shocked by the so-called "background" processing time of disks, which might have caused the "timeout counter" to go up. The disks were set directly to 4K sectors on their first boot; it took a couple of minutes before it reported success, with a somewhat "shady" warning message hinting that it needs at least 1 hour for some background work and that it may take longer. But reading your comment, it looks like it can take something like 30 hours or more (a whole read/write cycle of the disk)... The machine was rebooted an hour after the success message that it was set to 4K. If it's very important not to interrupt it, wouldn't it be a better idea to make this 4K sector switch an "online" process in which the terminal command doesn't quit or go to the background? Or to add an option to view the actual background process?
I will figure out a better way to word that message. Its purpose was to tell the user not to worry if the utility appears to "hang" for an hour while the sector size change starts, and not to interrupt it during this time. Interrupting it while it is busy can leave the drive unable to work properly and require rerunning the format or some other vendor-unique recovery process.

The standards describe this command as follows: once the command is issued, the drive holds the bus in a "busy" state until the sector size change is done. There is no requirement regarding background processing; what happens after the drive has returned status is vendor unique. The background processing the drive is doing cannot be forced to the foreground unless you are writing data to the drive, and there is no way to view the status of background processes the drive is performing.

The intention was to let users write their own data rather than wait for a complete write of the entire drive (possibly taking a number of days with today's and future HDD capacities), so they can quickly set the drive up and begin using it. A complete overwrite of the entire drive would essentially force the background processing into the foreground (this only applies to changing the sector size; other background tasks may not be able to be forced to the foreground). You can do this with whatever data you want, but it is not necessary before using the disk.
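To put the cost of a full overwrite in perspective, its duration is roughly capacity divided by average sustained write throughput. The 250 MB/s average below is an assumed figure, not a Seagate specification; HDD throughput drops toward the inner tracks, so plug in a measured average for a real estimate.

```python
def full_write_hours(capacity_bytes: float, avg_mb_per_s: float) -> float:
    """Hours needed to write every byte once at an average throughput (MB = 10**6 bytes)."""
    seconds = capacity_bytes / (avg_mb_per_s * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    # 20 TB (decimal, as drive capacities are marketed) at an assumed 250 MB/s average.
    print(f"{full_write_hours(20e12, 250):.1f} h")  # 22.2 h
```

Even under this optimistic assumption, a 20 TB drive needs close to a full day of uninterrupted writing, which is why the background processing is left to happen as data is written.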
openSeaChest (obviously) correctly parses SMART for Seagate drives with -smartAttributes analyzed.
Many open-source tools still use smartctl (e.g. smartd, Prometheus / Node Exporter, TrueNAS, UnRAID).
Seagate SMART requires extra parsing for Raw_Read_Error_Rate and Seek_Error_Rate. Their raw values combine a running total of operations with an error count and need to be parsed; shown as a single number, they cause people to think their drives have failed. I recommend that the openSeaChest folks port this parsing over to smartmontools.
https://www.smartmontools.org/browser
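Until smartmontools handles this natively, a consumer of smartctl output could split the value itself. A minimal sketch, assuming the default smartctl -A table layout (RAW_VALUE as the tenth column) and a plain 48-bit integer raw display, in which the upper 16 bits hold the error count and the lower 32 bits the operation count. That display omits the extra vendor byte, so very large error counts would be truncated; if a -v override changes the raw format, the value is not a plain integer and the sketch leaves it unsplit. parse_attribute_line is a hypothetical name, and the sample line is made up.

```python
from __future__ import annotations

# Attributes whose raw value packs an error count with an operation count
# (Raw_Read_Error_Rate and Seek_Error_Rate on Seagate drives).
SEAGATE_SPLIT_IDS = {1, 7}


def parse_attribute_line(line: str) -> dict | None:
    """Parse one row of `smartctl -A` output; return None for non-attribute rows."""
    fields = line.split()
    # Columns: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    if len(fields) < 10 or not fields[0].isdigit():
        return None
    attr_id = int(fields[0])
    raw_text = fields[9]
    row = {
        "id": attr_id,
        "name": fields[1],
        "value": int(fields[3]),
        "thresh": int(fields[5]),
        "raw": raw_text,
    }
    # Only split plain-integer raw values; other -v formats stay as text.
    if attr_id in SEAGATE_SPLIT_IDS and raw_text.isdigit():
        raw48 = int(raw_text)
        row["errors"] = raw48 >> 32             # upper 16 bits (raw bytes 5:4)
        row["operations"] = raw48 & 0xFFFFFFFF  # lower 32 bits (raw bytes 3:0)
    return row


if __name__ == "__main__":
    # Made-up sample row.
    line = "  7 Seek_Error_Rate  0x000f   090   060   030    Pre-fail  Always       -       21475306591"
    print(parse_attribute_line(line))
```

For the sample row, the raw value 21475306591 splits into 5 errors and 470111 operations, which is far less alarming than the unparsed number.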