MCE events filtered by the kernel are difficult to debug #111
Comments
@aegl Adding Tony Luck. In any case, it's a kernel problem, not mcelog itself.
A common culprit is CONFIG_RAS_CEC=y. When that is on, most correctable memory errors are filtered by the kernel.
I guess it would be reasonable to not print the "events logged" message in this case, as suggested by the original entry.
The thing is, right now I just don't know, because the kernel message isn't helpful. For all I know, it could be a spurious MCE event on this machine (SKX has known Intel errata). In any case, can CONFIG_RAS_CEC be dynamically disabled? Or, put differently, how can I find out if CONFIG_RAS_CEC is active? If I rebuild the kernel, is there a good line where I can add kprintfs to learn what is being filtered/ignored? Also, if it helps, I'm just using a pre-built Fedora kernel: 5.18.18-100.fc35.x86_64
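For anyone landing here with the same "is CONFIG_RAS_CEC active?" question, here is a minimal sketch of one way to check from user space. It rests on two assumptions that are not confirmed anywhere in this thread: that the distribution installs the kernel config as `/boot/config-<release>` (Fedora normally does), and that the collector can be switched off at boot with `ras=cec_disable`, as I understand the kernel's boot-parameter documentation. The function names are made up for illustration.

```python
import platform
import re
from pathlib import Path


def ras_cec_config_state(config_text):
    """Return the value of CONFIG_RAS_CEC ('y') or None if it is not set."""
    # Anchored so that "# CONFIG_RAS_CEC is not set" and
    # CONFIG_RAS_CEC_DEBUG=... do not match.
    match = re.search(r"^CONFIG_RAS_CEC=(\w+)$", config_text, re.MULTILINE)
    return match.group(1) if match else None


def cec_disabled_on_cmdline(cmdline):
    """True if a ras= boot option lists cec_disable (assumed option name)."""
    for token in cmdline.split():
        if token.startswith("ras="):
            if "cec_disable" in token[len("ras="):].split(","):
                return True
    return False


def cec_is_active(config_text, cmdline):
    """Built in and not disabled on the kernel command line."""
    built_in = ras_cec_config_state(config_text) == "y"
    return built_in and not cec_disabled_on_cmdline(cmdline)


if __name__ == "__main__":
    # Assumed location of the installed kernel config; adjust for other distros.
    config_path = Path("/boot") / f"config-{platform.release()}"
    try:
        config_text = config_path.read_text()
    except OSError as err:
        raise SystemExit(f"cannot read {config_path}: {err}")
    cmdline = Path("/proc/cmdline").read_text()
    print("RAS_CEC active:", cec_is_active(config_text, cmdline))
```

This only tells you whether the collector is compiled in and not disabled at boot. It says nothing about which individual events the kernel actually filtered.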
UPDATE -- I found the
If you could be so kind, is there any downside to the above config change? I.e. continue running
The Hardware Error Reporting Mechanism (HERM) in Red Hat Enterprise Linux 7 also introduces a new user-space daemon, rasdaemon, which replaces the tools previously included in the edac-utils package. The rasdaemon catches and handles all Reliability, Availability, and Serviceability (RAS) error events that come from the kernel tracing infrastructure, and logs them. HERM in Red Hat Enterprise Linux 7 also provides the tools to report the errors and is able to detect different types of errors, such as burst and sparse errors. Find the information below; it might help you.
Should be fine. Either can offline pages with corrected errors, but mcelog has more flexibility and configurability in doing so.
If it helps, it seems like the in-kernel RAS_CEC doesn't have a threshold for "too many" correctable errors, so DIMMs that have lots and lots (10 to 100+) of correctable errors per day but no uncorrectable errors never get logged for servicing later. Thankfully with
The upstream kernel recently got a patch to set the RAS_CEC threshold to "2" on Intel systems (AMD said their ECC systems did not require a low threshold). See
My kernel prints "mce: [Hardware Error]: Machine check events logged" regularly, but after a lot of debugging, it seems like the kernel must be filtering the messages before mcelog sees them. For context, I tried unloading the skx_edac driver, but that didn't make a difference.

It would be great if the kernel would either log all MCE events to mcelog (and tell the daemon which events the kernel recommends ignoring) OR change the kernel to not kprintf anything if MCE events are filtered out before mcelog sees them.
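
To put a number on how often the kernel prints the notice quoted above, here is a rough sketch that pulls the timestamps of those lines out of `dmesg` output, so they can be compared with whatever mcelog actually recorded. It assumes the default `dmesg` format with a bracketed seconds-since-boot timestamp at the start of each line; the function name is hypothetical.

```python
import re
import sys

# Matches e.g. "[ 1234.567890] mce: [Hardware Error]: Machine check events logged"
NOTICE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+mce: \[Hardware Error\]: Machine check events logged"
)


def mce_notice_times(dmesg_text):
    """Return seconds-since-boot timestamps of each 'events logged' notice."""
    times = []
    for line in dmesg_text.splitlines():
        match = NOTICE.match(line)
        if match:
            times.append(float(match.group(1)))
    return times


if __name__ == "__main__":
    # Usage: dmesg | python3 this_script.py
    times = mce_notice_times(sys.stdin.read())
    print(f"{len(times)} notices")
    for t in times:
        print(f"  {t:.6f}s after boot")
```

If the count here is well above the number of records mcelog has, that is consistent with the kernel filtering events before the daemon sees them, though it does not prove which component dropped them.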