mcelog --client outputs nothing when an error is injected into memory, but the system log captures the error. #74

Open
xiaochunlee opened this issue Feb 22, 2019 · 4 comments

Comments

@xiaochunlee

We recently tested the RAS memory error injection feature on a Lenovo SR630 (Purley platform). The OS is RHEL 8, the kernel version is kernel-4.18.0-67.el8, and the mcelog version is 159.
The test steps are listed below:

  1. Load the einj module
    linux-1rz0:~ # modprobe einj param_extension=1
    linux-1rz0:~ #

  2. Start the mcelog daemon
    linux-1rz0:~ # mcelog --daemon
    linux-1rz0:~ #

  3. Check whether the einj module loaded successfully
    linux-1rz0:~ # cd /sys/kernel/debug/apei/einj/
    linux-1rz0:/sys/kernel/debug/apei/einj #
    linux-1rz0:/sys/kernel/debug/apei/einj # ls
    available_error_type error_inject error_type flags notrigger param1 param2 param3 param4 vendor vendor_flags
    linux-1rz0:/sys/kernel/debug/apei/einj #

  4. Inject an uncorrectable error into the memory mirror range
    linux-1rz0:/sys/kernel/debug/apei/einj # echo 0x10 > error_type
    linux-1rz0:/sys/kernel/debug/apei/einj # echo 0x12345 > param1
    linux-1rz0:/sys/kernel/debug/apei/einj #
    linux-1rz0:/sys/kernel/debug/apei/einj # echo 0xfffffffffffff000 > param2
    linux-1rz0:/sys/kernel/debug/apei/einj #
    linux-1rz0:/sys/kernel/debug/apei/einj # echo 1 > error_inject
    linux-1rz0:/sys/kernel/debug/apei/einj #
    linux-1rz0:/sys/kernel/debug/apei/einj # echo 1 > notrigger
    linux-1rz0:/sys/kernel/debug/apei/einj #

Below is some information about the outcome:

[root@rhel8-ose-test rastools]# systemctl status mcelog
● mcelog.service - Machine Check Exception Logging Daemon
Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2019-01-30 02:22:29 EST; 51min ago
Main PID: 1177 (mcelog)
Tasks: 1 (limit: 26213)
Memory: 856.0K
CGroup: /system.slice/mcelog.service
└─1177 /usr/sbin/mcelog --ignorenodev --daemon --foreground
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Error enabled
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_MISC register valid
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_ADDR register valid
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: SRAR
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCA: Data CACHE Level-0 Data-Read Error
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: STATUS bd80000000100134 MCGSTATUS f
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCGCAP f000c14 APICID 17 SOCKETID 0
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: PPIN 2f5f92f94c7e6989
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MICROCODE 2000055
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: CPUID Vendor Intel Family 6 Model 85

[root@rhel8-ose-test rastools]# tail -f /var/log/dmesg
Jan 30 02:29:26 rhel8-ose-test kernel: mce: Uncorrected hardware memory error in user-access at 6696d1040
Jan 30 02:29:26 rhel8-ose-test kernel: mce: [Hardware Error]: Machine check events logged
Jan 30 02:29:26 rhel8-ose-test kernel: Memory failure: 0x6696d1: Killing einj_mem_uc:8974 due to hardware memory corruption
Jan 30 02:29:26 rhel8-ose-test kernel: Memory failure: 0x6696d1: recovery action for dirty LRU page: Recovered
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Hardware event. This is not a software error.
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCE 0
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: CPU 21 BANK 1 TSC 8a4ce5d5aa0
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: RIP 33:403c4b
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MISC 86 ADDR 6696d1040
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: TIME 1548833366 Wed Jan 30 02:29:26 2019
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCG status:RIPV EIPV MCIP LMCE
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi status:
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Uncorrected error
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Error enabled
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_MISC register valid
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_ADDR register valid
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: SRAR
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCA: Data CACHE Level-0 Data-Read Error
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: STATUS bd80000000100134 MCGSTATUS f
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCGCAP f000c14 APICID 17 SOCKETID 0
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: PPIN 2f5f92f94c7e6989
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MICROCODE 2000055
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: CPUID Vendor Intel Family 6 Model 85
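
As a cross-check on the log above, the STATUS and ADDR values can be decoded by hand. The sketch below assumes the IA32_MCi_STATUS flag layout described in the Intel SDM and 4 KiB pages. The macro and function names are made up for illustration and are not taken from mcelog.

```c
#include <stdbool.h>
#include <stdint.h>

/* IA32_MCi_STATUS flag positions (assumed from the Intel SDM). */
#define MCI_STATUS_VAL   (1ULL << 63) /* register holds a valid error */
#define MCI_STATUS_UC    (1ULL << 61) /* uncorrected error */
#define MCI_STATUS_EN    (1ULL << 60) /* error reporting enabled */
#define MCI_STATUS_MISCV (1ULL << 59) /* MCi_MISC register valid */
#define MCI_STATUS_ADDRV (1ULL << 58) /* MCi_ADDR register valid */
#define MCI_STATUS_PCC   (1ULL << 57) /* processor context corrupt */
#define MCI_STATUS_S     (1ULL << 56) /* signaled as a machine check */
#define MCI_STATUS_AR    (1ULL << 55) /* software recovery action required */

/* Assumption: 4 KiB pages, as is usual on x86-64. */
#define PAGE_SHIFT_4K 12

/* SRAR: valid, uncorrected, enabled, context intact, signaled, action required. */
static bool is_srar(uint64_t status)
{
    return (status & MCI_STATUS_VAL) && (status & MCI_STATUS_UC) &&
           (status & MCI_STATUS_EN) && !(status & MCI_STATUS_PCC) &&
           (status & MCI_STATUS_S) && (status & MCI_STATUS_AR);
}

/* Page frame number that contains a physical address. */
static uint64_t addr_to_pfn(uint64_t addr)
{
    return addr >> PAGE_SHIFT_4K;
}
```

For STATUS bd80000000100134 this yields an uncorrected SRAR event with valid MISC and ADDR registers, which matches the decoded log lines. ADDR 6696d1040 falls in page frame 0x6696d1, the page named in the kernel's "Memory failure" messages.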

But when we run mcelog --client, nothing is output. I looked into the mcelog code and found that the event was blocked by the MCE status bit check. The relevant part of the mcelog source code is listed below:

    static int intel_memory_error(struct mce *m, unsigned recordlen)
    {
        u32 mca = m->status & 0xffff;
        if ((mca >> 7) == 1) {
            unsigned corr_err_cnt = 0;
            int channel[2] = { (mca & 0xf) == 0xf ? -1 : (int)(mca & 0xf), -1 };
            int dimm[2] = { -1, -1 };
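
To see how the logged event interacts with the check in the excerpt, the low 16 bits of STATUS can be tested directly. This is a minimal sketch, assuming the compound error code formats from the Intel SDM (memory controller errors: 000F 0000 1MMM CCCC; cache hierarchy errors: 000F 0001 RRRR TTLL). The helper names are hypothetical, and only the `(mca >> 7) == 1` test is copied from the excerpt.

```c
#include <stdint.h>

/* The low 16 bits of MCi_STATUS hold the MCA compound error code. */
static uint16_t mca_code(uint64_t status)
{
    return (uint16_t)(status & 0xffff);
}

/* Same test as the quoted mcelog line: true for codes 0x0080..0x00ff. */
static int is_memory_controller_error(uint64_t status)
{
    return (mca_code(status) >> 7) == 1;
}

/* Cache hierarchy format 000F 0001 RRRR TTLL; bit 12 (F) is masked out. */
static int is_cache_hierarchy_error(uint64_t status)
{
    return (mca_code(status) & 0xef00) == 0x0100;
}

/* Sub-fields of a cache hierarchy error code. */
static unsigned cache_request(uint64_t status) { return (mca_code(status) >> 4) & 0xf; } /* RRRR */
static unsigned cache_type(uint64_t status)    { return (mca_code(status) >> 2) & 0x3; } /* TT */
static unsigned cache_level(uint64_t status)   { return mca_code(status) & 0x3; }        /* LL */
```

With STATUS bd80000000100134 the error code is 0x0134, so `mca >> 7` is 2 and the memory controller branch is not taken. The code instead decodes as a cache hierarchy error with RRRR = 3 (data read), TT = 1 (data), and LL = 0 (level 0), which agrees with the "Data CACHE Level-0 Data-Read Error" line in the log. Whether this alone explains the empty mcelog --client output has not been confirmed in this thread.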

@xiaochunlee
Author

After investigating, our UEFI experts concluded that the firmware always handles the error first, so the OS can't capture it. Examples are the handling of the PFA threshold and of uncorrected errors in the mirror area. So my question is: does mcelog need to handle this case when Firmware First is enabled in UEFI? Looking forward to your reply, many thanks!

@xiaochunlee
Author

If the BIOS enables the Firmware First function, the OS can't get the error count correctly, so mcelog can't trigger the threshold events defined in '/etc/mcelog/mcelog.conf' as usual. The UEFI engineers want to know how the OS behaves when UEFI handles errors first, and whether UEFI should enable Firmware First at all. If it is enabled, is that reasonable?
Can anyone help? I would appreciate it!

@xiaochunlee
Author

As we know, the mcelog daemon accounts memory errors, and mcelog --client can be used to query the counts of corrected and uncorrected errors. The mcelog daemon can also execute triggers when configurable error thresholds are exceeded. This is used to implement a range of automatic predictive failure analysis algorithms, including bad page offlining and automatic cache error handling. User-defined actions can also be configured.
If UEFI enables the FIRMWARE_FIRST function (defined by the ACPI spec), UEFI handles errors first, without even notifying the OS. As a result, the threshold events defined in '/etc/mcelog/mcelog.conf' can't be triggered, so I think this is abnormal behavior.
In any case, the mcelog daemon's memory error accounting is inaccurate when testing RAS features with FIRMWARE_FIRST enabled in UEFI. With FIRMWARE_FIRST enabled, running 'mcelog --client' outputs nothing.
After studying the mcelog code, I found that the event is blocked at the MCE status bit check. The relevant part of the mcelog source code is listed below:

    static int intel_memory_error(struct mce *m, unsigned recordlen)
    {
        u32 mca = m->status & 0xffff;
        if ((mca >> 7) == 1) {
            unsigned corr_err_cnt = 0;
            int channel[2] = { (mca & 0xf) == 0xf ? -1 : (int)(mca & 0xf), -1 };
            int dimm[2] = { -1, -1 };

It seems that with FIRMWARE_FIRST enabled, UEFI handles errors first when they occur, but this changes the status of the MCA register, so mcelog can't get the correct information. Ultimately, the user-defined event cannot be triggered.
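
The dependency between error accounting and triggers described in this comment can be pictured with a small counter. This is only an illustrative sketch: the struct, the function, and the fixed-window policy are hypothetical and are not mcelog's actual accounting code. It shows a trigger decision that is made only when an error is accounted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fixed-window error counter (not mcelog's implementation). */
struct err_window {
    uint64_t window_start; /* start of the current window, in seconds */
    uint64_t window_len;   /* window length in seconds, e.g. 24 * 3600 */
    unsigned count;        /* errors accounted in the current window */
    unsigned threshold;    /* the trigger fires once count reaches this */
};

/*
 * Account one error observed at time `now` (seconds).
 * Returns true when the threshold is reached within the window.
 */
static bool err_window_account(struct err_window *w, uint64_t now)
{
    if (now - w->window_start >= w->window_len) {
        /* The window has expired: start a new one. */
        w->window_start = now;
        w->count = 0;
    }
    w->count++;
    return w->count >= w->threshold;
}
```

In this model, an error that the firmware consumes without notifying the OS never reaches the accounting call, so the count stays at zero and the trigger cannot fire. That is the behavior reported above.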

@ZSN666

ZSN666 commented Aug 8, 2022

Hi @xiaochunlee,
I ran into the same problem. How did you solve it? mcelog --client shows nothing, but /var/log/mcelog records the MCE event, and in the end the triggers are not invoked.
