Support transient SMART failures #375

dcelasun · 2022-09-24T20:26:15Z

As discussed in #374, some SMART errors are transient and should not be treated as permanent.

This commit adds support for a configurable list of ATA SMART attribute IDs, failures of which will be treated as transient. Drive health history is still recorded and notifications are sent, but the device itself is not marked as failed.
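For readers skimming the thread, the idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `AttrResult`, `DeviceFailed`, and the `transient` set are hypothetical names, and the attribute IDs (195, 5) are chosen purely as examples.

```go
package main

import "fmt"

// AttrResult is a hypothetical, simplified view of one evaluated
// ATA SMART attribute (not a type from the Scrutiny codebase).
type AttrResult struct {
	ID     int  // ATA SMART attribute ID
	Failed bool // whether this attribute failed its check
}

// DeviceFailed reports whether the device should be marked as failed.
// Failures of attributes whose ID is in the transient set are skipped
// for this decision; every other failure marks the device as failed.
func DeviceFailed(attrs []AttrResult, transient map[int]bool) bool {
	for _, a := range attrs {
		if a.Failed && !transient[a.ID] {
			return true
		}
	}
	return false
}

func main() {
	// Assumed configuration: attribute 195 is treated as transient.
	transient := map[int]bool{195: true}

	attrs := []AttrResult{
		{ID: 195, Failed: true}, // transient failure
		{ID: 5, Failed: false},
	}
	fmt.Println("device failed:", DeviceFailed(attrs, transient))

	// A failure of an attribute outside the list still fails the device.
	attrs[1].Failed = true
	fmt.Println("device failed:", DeviceFailed(attrs, transient))
}
```

In this sketch the per-attribute `Failed` flag is left untouched, so history and notifications can still see the failure; only the device-level decision consults the configured list.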

Fixes #374.

--

@AnalogJ apologies for the cosmetic changes, they are made automatically by goimports. I can revert them if you'd like.


AnalogJ · 2022-10-13T03:40:02Z

Thanks so much for the PR 🥳

Apologies for taking so long to get to this, I've been a bit distracted by some other projects.
TBH, I think I would have implemented this slightly differently. Rather than "ignoring" the failure at the attribute processing stage, I would still allow it to mark the attribute entry as "failed", but filter out the failure before setting the DeviceStatus, which persists across multiple SMART collector runs.

https://github.com/AnalogJ/scrutiny/blob/master/webapp/backend/pkg/web/handler/upload_device_metrics.go#L53-L59

Basically, since the value fluctuates (and recovers) constantly, a failure at a single point in time should still be marked as a failure, but it shouldn't cause the disk to be permanently marked as "bad".
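A rough sketch of that split between the point-in-time result and the persisted status might look like the following. All names here (`DeviceStatus`, `Attribute`, `ProcessRun`) are hypothetical stand-ins rather than the actual types in `upload_device_metrics.go`, and the status values are simplified to two states.

```go
package main

import "fmt"

// DeviceStatus is a simplified, hypothetical stand-in for the
// persisted device status; it is sticky across collector runs.
type DeviceStatus int

const (
	StatusPassed DeviceStatus = iota
	StatusFailed
)

// Attribute is a hypothetical evaluated SMART attribute for one run.
type Attribute struct {
	ID     int
	Failed bool
}

// ProcessRun handles a single collector run. failedNow reflects any
// attribute failure in this run (what would be recorded and notified).
// next is the persisted status: it escalates only on non-transient
// failures and is never reset here, mirroring a sticky status.
func ProcessRun(prev DeviceStatus, attrs []Attribute, transient map[int]bool) (failedNow bool, next DeviceStatus) {
	next = prev
	for _, a := range attrs {
		if !a.Failed {
			continue
		}
		failedNow = true
		if !transient[a.ID] {
			next = StatusFailed
		}
	}
	return failedNow, next
}

func main() {
	transient := map[int]bool{195: true} // assumed configuration
	status := StatusPassed

	runs := [][]Attribute{
		{{ID: 195, Failed: true}},  // value dips: fails at this point in time
		{{ID: 195, Failed: false}}, // value recovers on the next run
	}
	for i, run := range runs {
		var failedNow bool
		failedNow, status = ProcessRun(status, run, transient)
		fmt.Printf("run %d: failedNow=%v deviceStatus=%d\n", i, failedNow, status)
	}
}
```

The point of the two return values is that the first can drive history entries and notifications for the run, while only the second is written back to the device record.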

Hopefully that makes sense?

dcelasun · 2022-10-13T08:23:59Z

> Hopefully that makes sense?

It does. Let me make some changes :)

dcelasun · 2022-11-07T10:07:55Z

Apologies for the delay on this. Now that I look at it again:

> Basically, since the value fluctuates (and recovers) constantly, a failure at a single point in time should still be marked as a failure, but it shouldn't cause the disk to be permanently marked as "bad".

Isn't this already the case? The `Status` field in the `Smart` type refers to the device's status, not the attributes', and my PR simply doesn't update that field for transient failures.

enoch85 · 2024-09-09T20:09:59Z

Would be great if you could have another look at this. 🙏

dcelasun force-pushed the transient-failures branch from 81920c5 to 2e04c0f on September 24, 2022 21:33
