MCE events filtered by the kernel are difficult to debug #111

davezarzycki · 2022-08-24T14:10:31Z

My kernel prints "mce: [Hardware Error]: Machine check events logged" regularly but after a lot of debugging, it seems like the kernel must be filtering the messages before mcelog sees them. And for context, I tried unloading the skx_edac driver but that didn't make a difference.

It would be great if the kernel would either log all MCE events to mcelog (and tell the daemon which events the kernel recommends ignoring) or not printk anything when MCE events are filtered out before mcelog sees them.


andikleen · 2022-08-24T18:06:31Z

@aegl Adding Tony Luck

In any case it's a kernel problem, not mcelog itself.

aegl · 2022-08-24T18:11:14Z

A common culprit is CONFIG_RAS_CEC=y. When that is on, most correctable memory errors are filtered by the kernel.

andikleen · 2022-08-25T00:42:55Z

I guess it would be reasonable to not print the "events logged" message in this case, as suggested by the original entry.

davezarzycki · 2022-08-25T10:38:00Z

The thing is, right now I just don't know because the kernel message isn't helpful. For all I know it could be a spurious MCE event on this machine (SKX has known Intel errata). In any case, can CONFIG_RAS_CEC be dynamically disabled? Or said differently, how can I find out if CONFIG_RAS_CEC is active? If I rebuild the kernel, is there a good place where I can add printk calls to learn what is being filtered/ignored? Also, if it helps, I'm just using a pre-built Fedora kernel: 5.18.18-100.fc35.x86_64

davezarzycki · 2022-08-25T11:53:16Z

UPDATE -- I found the ras=cec_disable boot arg and now the kernel no longer prints "mce: [Hardware Error]: Machine check events logged" during my torture test. Does this sound plausible? Is there any downside of leaving CEC disabled with mcelog running as a daemon?
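
The two questions above (was the kernel built with CONFIG_RAS_CEC, and did ras=cec_disable actually reach the kernel) can be checked from user space. The following is only a minimal sketch: it assumes the distribution installs the build config as /boot/config-<release>, which is common on Fedora-style systems but not guaranteed, and the helper names `config_value` and `cec_disabled` are made up for this example.

```python
import platform
import re
from pathlib import Path
from typing import Optional


def config_value(config_text: str, option: str) -> Optional[str]:
    """Return the value of a kernel config option ("y", "m", ...), or None if it is not set."""
    # Lines such as "# CONFIG_FOO is not set" start with "#" and therefore never match.
    match = re.search(rf"^{re.escape(option)}=(\S+)$", config_text, re.MULTILINE)
    return match.group(1) if match else None


def cec_disabled(cmdline: str) -> bool:
    """True if the ras=cec_disable boot argument appears on the kernel command line."""
    return "ras=cec_disable" in cmdline.split()


def main() -> None:
    # Assumed location of the build config; adjust for your distribution.
    config_path = Path(f"/boot/config-{platform.release()}")
    cmdline_path = Path("/proc/cmdline")
    if config_path.exists():
        value = config_value(config_path.read_text(), "CONFIG_RAS_CEC")
        print(f"CONFIG_RAS_CEC={value or 'not set'}")
    else:
        print(f"{config_path} not found; cannot tell how the kernel was built")
    if cmdline_path.exists():
        print("ras=cec_disable present:", cec_disabled(cmdline_path.read_text()))


if __name__ == "__main__":
    main()
```

If the script reports CONFIG_RAS_CEC=y and no ras=cec_disable, the filtering described earlier in this thread is a plausible explanation for the missing mcelog entries.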

davezarzycki · 2022-09-28T11:29:58Z

If you could be so kind, is there any downside to the above config change? I.e. continue running mcelog in daemon mode with the ras=cec_disable kernel boot arg?

xiaochunlee · 2022-09-28T13:49:48Z

> If you could be so kind, is there any downside to the above config change? I.e. continue running mcelog in daemon mode with the ras=cec_disable kernel boot arg?

Sharing my tests on Lenovo ThinkSystem: we append "ras=cec_disable" so that the mcelog daemon is notified.
Newer kernels register the CEC notifier as the highest-priority notifier, which can affect the other notifiers, such as /dev/mcelog (the mcelog daemon), the EDAC notifiers, etc. If you want to get mcelog and EDAC logs, the kernel parameter "ras=cec_disable" is needed on recent kernels; this starts from 5.8-rc1. See commit 23ba710a. SLES sets CONFIG_RAS_CEC starting from SLES15.4, so CEC is running on our SLES15.4, but the RHEL series doesn't set it until RHEL9.0.
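
To make the notifier ordering easier to picture, here is a toy model of a priority-ordered notifier chain. It is not kernel code and makes simplifying assumptions: the `Event` and `Notifier` types, the priorities 1 to 3, and the "handled" flag are all invented for illustration. It only shows why consumers lower in the chain see nothing when a higher-priority collector absorbs correctable errors first.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Event:
    correctable: bool
    handled_by_cec: bool = False
    seen_by: List[str] = field(default_factory=list)


@dataclass
class Notifier:
    name: str
    priority: int
    callback: Callable[[Event], None]


def run_chain(notifiers: List[Notifier], event: Event) -> Event:
    """Call every notifier, from highest to lowest priority."""
    for notifier in sorted(notifiers, key=lambda x: x.priority, reverse=True):
        notifier.callback(event)
    return event


def make_cec(enabled: bool) -> Callable[[Event], None]:
    def cec(event: Event) -> None:
        # Hypothetical model: an enabled collector absorbs correctable errors.
        if enabled and event.correctable:
            event.handled_by_cec = True
            event.seen_by.append("cec")
    return cec


def make_logger(name: str) -> Callable[[Event], None]:
    def logger(event: Event) -> None:
        # Lower-priority consumers skip events the collector already absorbed.
        if not event.handled_by_cec:
            event.seen_by.append(name)
    return logger


def build_chain(cec_enabled: bool) -> List[Notifier]:
    return [
        Notifier("mcelog", 1, make_logger("mcelog")),
        Notifier("edac", 2, make_logger("edac")),
        Notifier("cec", 3, make_cec(cec_enabled)),
    ]
```

In this model, `run_chain(build_chain(True), Event(correctable=True))` is seen only by "cec", while `build_chain(False)` lets the same event reach "edac" and "mcelog", which mirrors the effect of ras=cec_disable described above.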

Hardware Error Reporting Mechanism (HERM) in Red Hat Enterprise Linux 7 also introduces a new user space daemon, rasdaemon, which replaces the tools previously included in the edac-utils package. The rasdaemon catches and handles all Reliability, Availability, and Serviceability (RAS) error events that come from the kernel tracing infrastructure, and logs them. HERM in Red Hat Enterprise Linux 7 also provides the tools to report the errors and is able to detect different types of errors such as burst and sparse errors.

The information below might help you.
On RHEL7, mcelog has been deprecated in favor of rasdaemon. The HERM kernel architecture is the new, favored approach for hardware error detection in the kernel community, and requires the rasdaemon. rasdaemon consolidates different approaches of monitoring hardware and reading sensors. rasdaemon can also hand out vendor-specific information to match hardware issues to real hardware, i.e. motherboard labels matching EDAC entries. If these exist for the hardware where an issue is seen, one can see the direct DIMM name instead of generic information (RAM error in -dimm).
rasdaemon is not yet enabled by default on RHEL7 after installation. More information on rasdaemon can be found in the RHEL7 Release Notes, looking for Hardware Error Reporting Mechanism.
mcelog has the ability to set memory pages offline when a certain configured error threshold is exceeded, and offers the opportunity to extend its operation using trigger scripts. These features are currently not implemented in rasdaemon. RHEL7 also ships mcelog.
mcelog operates through the /dev/mcelog interface, whereas rasdaemon captures kernel trace events.
In (private) bz1107804 it is requested that rasdaemon should be able to keep info about RAM marked as defective across reboots. Currently there is no code upstream for this.

andikleen · 2022-09-28T20:52:37Z

> If you could be so kind, is there any downside to the above config change? I.e. continue running mcelog in daemon mode with the ras=cec_disable kernel boot arg?

Should be fine. Either can offline pages with corrected errors, but mcelog has more flexibility and configurability in doing so.

davezarzycki · 2022-10-26T14:03:22Z

If it helps, it seems like the in-kernel RAS_CEC doesn't have a threshold for "too many" correctable errors, so DIMMs that have lots and lots (10 to 100+) of correctable errors per day but no uncorrectable errors never get logged for servicing later.

Thankfully, with ras=cec_disable, mcelog can offline the pages and edac-util is useful for easily identifying the failing DIMMs. (But dmidecode is a pain for triage due to memory interleaving?)
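
For readers wondering what a "too many correctable errors" threshold could look like, here is a rough sketch of a per-page sliding-window counter. It is not how mcelog or the kernel implements it; the class name `PageErrorTracker` and the default of 10 errors per 24 hours are assumptions picked only to match the error rates mentioned above.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class PageErrorTracker:
    """Count corrected errors per page within a sliding time window."""

    def __init__(self, threshold: int = 10, window_seconds: float = 24 * 3600):
        self.threshold = threshold
        self.window = window_seconds
        self._events: Dict[int, Deque[float]] = defaultdict(deque)

    def record(self, page: int, timestamp: float) -> bool:
        """Record one corrected error; return True once the page reaches the threshold."""
        events = self._events[page]
        events.append(timestamp)
        # Drop errors that have fallen out of the sliding window.
        while events and timestamp - events[0] > self.window:
            events.popleft()
        return len(events) >= self.threshold
```

A caller would feed each corrected error's page address and timestamp into `record()` and, when it returns True, log the page (or DIMM) for servicing or request that it be offlined.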

aegl · 2022-10-26T15:06:42Z

Upstream kernel recently got a patch to set the RAS_CEC threshold to "2" on Intel systems (AMD said their ECC systems did not require a low threshold). See
d25c6948a6aa ("RAS/CEC: Reduce offline page threshold for Intel systems")
in v6.1-rc1

andikleen added the kernel label Aug 24, 2022
