MCE events filtered by the kernel are difficult to debug #111

Open
davezarzycki opened this issue Aug 24, 2022 · 10 comments

@davezarzycki

My kernel regularly prints "mce: [Hardware Error]: Machine check events logged", but after a lot of debugging, it seems like the kernel must be filtering the messages before mcelog sees them. For context, I tried unloading the skx_edac driver, but that didn't make a difference.

It would be great if the kernel would either log all MCE events to mcelog (and tell the daemon which events the kernel recommends ignoring) OR not kprintf anything when MCE events are filtered out before mcelog sees them.

@andikleen
Owner

@aegl Adding Tony Luck

In any case it's a kernel problem, not mcelog itself.

@aegl
Collaborator

aegl commented Aug 24, 2022

A common culprit is CONFIG_RAS_CEC=y. When that is on, most correctable memory errors are filtered by the kernel.

@andikleen
Owner

I guess it would be reasonable to not print the "events logged" message in this case, as suggested by the original entry.

@davezarzycki
Author

The thing is, right now I just don't know because the kernel message isn't helpful. For all I know it could be a spurious MCE event on this machine (SKX has known Intel errata). In any case, can CONFIG_RAS_CEC be dynamically disabled? Or said differently, how can I find out if CONFIG_RAS_CEC is active? If I rebuild the kernel is there a good line where I can add kprintfs to learn what is being filtered/ignored? Also if it helps, I'm just using a pre-built Fedora kernel: 5.18.18-100.fc35.x86_64
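One way to answer the "is CONFIG_RAS_CEC active?" question without rebuilding is to look the option up in the build config of the running kernel. The sketch below is only an illustration: it assumes the distribution installs the config as `/boot/config-<release>` (a common convention, not something confirmed in this thread), and the helper names are hypothetical.

```python
import platform
import re


def kernel_config_value(config_text, option):
    """Return the value assigned to `option` (e.g. 'y' or 'm'), or None if it is not set."""
    pattern = re.compile(r"^" + re.escape(option) + r"=(.*)$", re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1).strip() if match else None


def read_running_kernel_config():
    """Read the running kernel's build config.

    ASSUMPTION: the config lives at /boot/config-<release>; adjust for your distro.
    """
    path = "/boot/config-" + platform.release()
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()


if __name__ == "__main__":
    try:
        value = kernel_config_value(read_running_kernel_config(), "CONFIG_RAS_CEC")
    except OSError as exc:
        print("could not read kernel config:", exc)
    else:
        print("CONFIG_RAS_CEC =", value if value is not None else "not set")
```

Note that this only reports how the kernel was built; it says nothing about whether CEC was turned off at boot time.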

@davezarzycki
Author

UPDATE -- I found the ras=cec_disable boot arg and now the kernel no longer prints "mce: [Hardware Error]: Machine check events logged" during my torture test. Does this sound plausible? Is there any downside of leaving CEC disabled with mcelog running as a daemon?
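To confirm that the boot argument actually reached the kernel, the running command line can be inspected through `/proc/cmdline`. A minimal sketch follows; the function name is hypothetical.

```python
def cec_disabled_on_cmdline(cmdline):
    """Return True if the boot command line contains the exact token ras=cec_disable."""
    return "ras=cec_disable" in cmdline.split()


if __name__ == "__main__":
    try:
        with open("/proc/cmdline", encoding="utf-8") as handle:
            print("ras=cec_disable present:", cec_disabled_on_cmdline(handle.read()))
    except OSError as exc:
        print("could not read /proc/cmdline:", exc)
```

The check matches whole whitespace-separated tokens, so a misspelled argument such as `ras=cec_disabled` is not reported as present.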

@davezarzycki
Author

If you could be so kind, is there any downside to the above config change? I.e. continue running mcelog in daemon mode with the ras=cec_disable kernel boot arg?

@xiaochunlee

If you could be so kind, is there any downside to the above config change? I.e. continue running mcelog in daemon mode with the ras=cec_disable kernel boot arg?
Sharing my tests on Lenovo ThinkSystem: we append "ras=cec_disable" so that the mce daemon is notified.
Newer kernels register the CEC notifier at the highest priority, which can affect the other notifiers, such as /dev/mcelog (the mce daemon), the EDAC notifiers, etc. If you want mcelog and EDAC logs, the kernel parameter "ras=cec_disable" is needed on recent kernels, starting from 5.8-rc1; see commit 23ba710a. Starting with SLES 15.4, CONFIG_RAS_CEC is set, so CEC is running on our SLES 15.4 systems, but the RHEL series does not set it until RHEL 9.0.

Hardware Error Reporting Mechanism (HERM) in Red Hat Enterprise Linux 7 also introduces a new user space daemon, rasdaemon, which replaces the tools previously included in the edac-utils package. The rasdaemon catches and handles all Reliability, Availability, and Serviceability (RAS) error events that come from the kernel tracing infrastructure, and logs them. HERM in Red Hat Enterprise Linux 7 also provides the tools to report the errors and is able to detect different types of errors such as burst and sparse errors.

The information below might help you.
On RHEL7, mcelog has been deprecated in favor of rasdaemon. The HERM kernel architecture is the new, favored approach for hardware error detection in the kernel community, and requires rasdaemon. rasdaemon consolidates different approaches to monitoring hardware and reading sensors. rasdaemon can also supply vendor-specific information that maps hardware issues to real hardware, e.g. motherboard labels matched to EDAC entries. If these exist for the hardware where an issue is seen, one can see the actual DIMM name instead of generic information (RAM error in -dimm).
rasdaemon is not yet enabled by default on RHEL7 after installation. More information on rasdaemon can be found in the RHEL7 Release Notes, under Hardware Error Reporting Mechanism.
mcelog can set memory pages offline when a configured error threshold is exceeded, and can be extended using trigger scripts. These features are currently not implemented in rasdaemon. RHEL7 also ships mcelog.
mcelog operates through the /dev/mcelog interface, whereas rasdaemon captures kernel trace events.
In (private) bz1107804 it is requested that rasdaemon be able to keep information about RAM marked as defective across reboots. Currently there is no code upstream for this.

@andikleen
Owner

If you could be so kind, is there any downside to the above config change? I.e. continue running mcelog in daemon mode with the ras=cec_disable kernel boot arg?

Should be fine. Either can offline pages with corrected errors, but mcelog has more flexibility and configurability in doing so.

@davezarzycki
Author

If it helps, it seems like the in-kernel RAS_CEC doesn't have a threshold for "too many" correctable errors, so DIMMs that have lots and lots (10 to 100+) of correctable errors per day but no uncorrectable errors never get logged for later servicing.

Thankfully, with ras=cec_disable, mcelog can offline the pages and edac-util is useful for easily identifying the failing DIMMs. (But dmidecode is a pain for triage due to memory interleaving?)
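For triage along these lines, the per-DIMM corrected-error counters that the EDAC drivers export through sysfs can also be read directly. The sketch below assumes the per-DIMM layout `mc*/dimm*/dimm_ce_count` with an optional `dimm_label` under `/sys/devices/system/edac/mc`; some drivers expose `rank*` directories instead, and labels may be empty. The function names and the threshold policy are hypothetical, not something mcelog or edac-util implements this way.

```python
import glob
import os

# Usual EDAC memory-controller sysfs root; passed as a parameter so it can be overridden.
EDAC_MC_ROOT = "/sys/devices/system/edac/mc"


def _read_text(path):
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read().strip()


def dimm_ce_counts(root=EDAC_MC_ROOT):
    """Return {(controller, dimm, label): corrected_error_count} for every DIMM found."""
    counts = {}
    pattern = os.path.join(root, "mc*", "dimm*", "dimm_ce_count")
    for count_path in sorted(glob.glob(pattern)):
        dimm_dir = os.path.dirname(count_path)
        label_path = os.path.join(dimm_dir, "dimm_label")
        label = _read_text(label_path) if os.path.exists(label_path) else ""
        controller = os.path.basename(os.path.dirname(dimm_dir))
        counts[(controller, os.path.basename(dimm_dir), label)] = int(_read_text(count_path))
    return counts


def dimms_over_threshold(counts, threshold):
    """Return the entries whose count is at or above `threshold` (illustrative policy only)."""
    return {key: n for key, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    for key, n in sorted(dimm_ce_counts().items()):
        print(key, n)
```

The counters are cumulative since the driver was loaded, so a per-day rate like the one described above would need two samples taken some time apart.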

@aegl
Collaborator

aegl commented Oct 26, 2022

Upstream kernel recently got a patch to set the RAS_CEC threshold to "2" on Intel systems (AMD said their ECC systems did not require a low threshold). See
d25c6948a6aa ("RAS/CEC: Reduce offline page threshold for Intel systems")
in v6.1-rc1
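To see which threshold a running kernel is using, the CEC tunables can be read from debugfs. This is a sketch under assumptions: it expects debugfs to be mounted at `/sys/kernel/debug`, the CEC directory to be `ras/cec` with an `action_threshold` file (older kernels may use a different file name), and root privileges. The helper name is hypothetical.

```python
import os

# ASSUMPTION: debugfs is mounted at /sys/kernel/debug and CEC is enabled.
CEC_DEBUGFS_DIR = "/sys/kernel/debug/ras/cec"


def read_cec_setting(name, root=CEC_DEBUGFS_DIR):
    """Return the integer value of a CEC debugfs file, or None if it cannot be read."""
    try:
        with open(os.path.join(root, name), encoding="utf-8") as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        # Missing file (CEC disabled, debugfs not mounted), no permission, or odd content.
        return None


if __name__ == "__main__":
    for setting in ("action_threshold", "decay_interval"):
        print(setting, "=", read_cec_setting(setting))
```

A result of `None` does not by itself prove that CEC is off; it only means the file could not be read as an integer.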

4 participants