New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mcelog --client output nothing when inject error to memory,But the messages can catch the error log. #74
Comments
For investigating by UEFI expert, They got a conclusion that the Firmware always handle the error first, So that the OS cant capture that. Such as the handling of PFA threshold and mirror area uncorrected errors. So, I have a question, does the mcelog need to handle that while enabled the firmware first on UEFI? Looking forward to your reply, many thanks! |
Because if the BIOS enable the Firmware First function, the OS can't get the error count correctly, So that the mcelog can't trigger the threshold events that defined by '/etc/mcelog/mcelog.conf ' as usual. The UEFI engineers only want to know how the OS will do when the UEFI handle error first, or whether the UEFI should enable the Firmware First, if enable it, whether it is reasonable. |
As we know, The mcelog daemon accounts memory errors, mcelog --client can be used to query the count of corrected and uncorrected errors. The mcelog daemon can also execute triggers when configurable error thresholds are exceeded. This is used to implement a range of automatic predictive failure analysis algorithms: including bad page offlining and automatic cache error handling. User defined actions can be also configured. 125 static int intel_memory_error(struct mce *m, unsigned recordlen) It seems that after enable the FIRMWARE_FIRST, UEFI handle errors first when errors occurred, But This behavior causes a change in the status of the MCA register, so mcelog can’t get the correct information, Ultimately, the user-defined event could not be triggered. |
hi [xiaochunlee]: |
Lately, We tested RAS function about memory inject error on the Purley platform of lenovo SR630, The OS is RHEL8,kernel version kernel-4.18.0-67.el8, mcelog version is 159.
The test steps list as below:
Mount the einj module
linux-1rz0:~ # modprobe einj param_extension=1
linux-1rz0:~ #
Start the mcelog daemon
linux-1rz0:~ # mcelog --daemon
linux-1rz0:~ #
Check whether the einj module loaded successfully
linux-1rz0:~ # cd /sys/kernel/debug/apei/einj/
linux-1rz0:/sys/kernel/debug/apei/einj #
linux-1rz0:/sys/kernel/debug/apei/einj # ls
available_error_type error_inject error_type flags notrigger param1 param2 param3 param4 vendor vendor_flags
linux-1rz0:/sys/kernel/debug/apei/einj #
4.Inject uncorrectable error to memory mirror range
linux-1rz0:/sys/kernel/debug/apei/einj # echo 0x10 > error_type
linux-1rz0:/sys/kernel/debug/apei/einj # echo 0x12345 > param1
linux-1rz0:/sys/kernel/debug/apei/einj #
linux-1rz0:/sys/kernel/debug/apei/einj # echo 0xfffffffffffff000 > param2
linux-1rz0:/sys/kernel/debug/apei/einj #
linux-1rz0:/sys/kernel/debug/apei/einj # echo 1 > error_inject
linux-1rz0:/sys/kernel/debug/apei/einj #
linux-1rz0:/sys/kernel/debug/apei/einj # echo 1 > notrigger
linux-1rz0:/sys/kernel/debug/apei/einj #
Below is some informations about the outcome:
[root@rhel8-ose-test rastools]# systemctl status mcelog
● mcelog.service - Machine Check Exception Logging Daemon
Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2019-01-30 02:22:29 EST; 51min ago
Main PID: 1177 (mcelog)
Tasks: 1 (limit: 26213)
Memory: 856.0K
CGroup: /system.slice/mcelog.service
└─1177 /usr/sbin/mcelog --ignorenodev --daemon --foreground
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Error enabled
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_MISC register valid
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_ADDR register valid
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: SRAR
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCA: Data CACHE Level-0 Data-Read Error
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: STATUS bd80000000100134 MCGSTATUS f
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCGCAP f000c14 APICID 17 SOCKETID 0
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: PPIN 2f5f92f94c7e6989
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MICROCODE 2000055
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: CPUID Vendor Intel Family 6 Model 85
[root@rhel8-ose-test rastools]#tail -f /var/log/dmesg
Jan 30 02:29:26 rhel8-ose-test kernel: mce: Uncorrected hardware memory error in user-access at 6696d1040
Jan 30 02:29:26 rhel8-ose-test kernel: mce: [Hardware Error]: Machine check events logged
Jan 30 02:29:26 rhel8-ose-test kernel: Memory failure: 0x6696d1: Killing einj_mem_uc:8974 due to hardware memory corruption
Jan 30 02:29:26 rhel8-ose-test kernel: Memory failure: 0x6696d1: recovery action for dirty LRU page: Recovered
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Hardware event. This is not a software error.
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCE 0
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: CPU 21 BANK 1 TSC 8a4ce5d5aa0
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: RIP 33:403c4b
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MISC 86 ADDR 6696d1040
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: TIME 1548833366 Wed Jan 30 02:29:26 2019
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCG status:RIPV EIPV MCIP LMCE
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi status:
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Uncorrected error
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: Error enabled
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_MISC register valid
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCi_ADDR register valid
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: SRAR
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCA: Data CACHE Level-0 Data-Read Error
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: STATUS bd80000000100134 MCGSTATUS f
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MCGCAP f000c14 APICID 17 SOCKETID 0
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: PPIN 2f5f92f94c7e6989
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: MICROCODE 2000055
Jan 30 02:29:26 rhel8-ose-test mcelog[1177]: CPUID Vendor Intel Family 6 Model 85
But when we execute the mcelog --client ,there is nothing output. So I research the code about the mcelog and found that it was blocked by mce status bit settings,The partial resource code of the mcelog tool list as below:
125 static int intel_memory_error(struct mce *m, unsigned recordlen)
126 {
127 u32 mca = m->status & 0xffff;
128 if ((mca >> 7) == 1) {
129 unsigned corr_err_cnt = 0;
130 int channel[2] = { (mca & 0xf) == 0xf ? -1 : (int)(mca & 0xf), -1 };
131 int dimm[2] = { -1, -1 };
The text was updated successfully, but these errors were encountered: