Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support transient SMART failures #375

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

dcelasun
Copy link

@dcelasun dcelasun commented Sep 24, 2022

As discussed in #374 some SMART errors are transient and should not be treated as permanent.

This commit adds support for a configurable list of ATA SMART attribute IDs, failures of which will be treated as transient. Drive health history is still recorded and notifications are sent, but the device itself is not marked as failed.

Fixes #374.

--

@AnalogJ apologies for the cosmetic changes, they are made automatically by goimports. I can revert them if you'd like.

As discussed in [1] some SMART errors are transient and should not
be treated as permanent.

This commit adds support for a configurable list of ATA SMART attribute
IDs, failures of which will be treated as transient. Drive health history
is still recorded and notifications are sent, but the device itself is
not marked as failed.

Fixes AnalogJ#374.

[1] AnalogJ#374
@AnalogJ
Copy link
Owner

AnalogJ commented Oct 13, 2022

Thanks so much for the PR 🥳

Apologies for taking so long to get to this, I've been a bit distracted by some other projects.
TBH, I think I would have implemented this slightly differently. Rather than "ignoring" the failure at the attribute processing stage, I think i would allow it to mark the attribute entry as "failed" however, I would filter out the failure before setting the DeviceStatus -- which persists across multiple SMART collector runs.

https://github.com/AnalogJ/scrutiny/blob/master/webapp/backend/pkg/web/handler/upload_device_metrics.go#L53-L59

Basically since the value fluctuates (and recovers) constantly, a failure at a single point in time should still be marked as a failure, but shouldn't cause the disk to the permanently marked as "bad"

Hopefully that makes sense?

@dcelasun
Copy link
Author

Hopefully that makes sense?

It does. Let me make some changes :)

@dcelasun
Copy link
Author

dcelasun commented Nov 7, 2022

Apologies for the delay on this. Now that I look at it again:

Basically since the value fluctuates (and recovers) constantly, a failure at a single point in time should still be marked as a failure, but shouldn't cause the disk to the permanently marked as "bad"

Isn't this already the case? The Status field in the Smart` type refers to the device's status, not the attributes, and my PR simply doesn't update that field for transient failures.

@enoch85
Copy link

enoch85 commented Sep 9, 2024

Would be great if you could have another look at this. 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Hardware ECC Recovered incorrectly reported as disk failure
3 participants