Kernel segfault when running grep -r / #101

Closed
nickgarvey opened this issue Jan 1, 2023 · 4 comments

@nickgarvey

I ran this command to trigger it: sudo grep -lre 'asahi-linux' /

Here's the resulting output from dmesg:

[Sat Dec 31 22:07:04 2022] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[Sat Dec 31 22:07:04 2022] Mem abort info:
[Sat Dec 31 22:07:04 2022]   ESR = 0x0000000086000004
[Sat Dec 31 22:07:04 2022]   EC = 0x21: IABT (current EL), IL = 32 bits
[Sat Dec 31 22:07:04 2022]   SET = 0, FnV = 0
[Sat Dec 31 22:07:04 2022]   EA = 0, S1PTW = 0
[Sat Dec 31 22:07:04 2022]   FSC = 0x04: level 0 translation fault
[Sat Dec 31 22:07:04 2022] user pgtable: 16k pages, 48-bit VAs, pgdp=00000008111da130
[Sat Dec 31 22:07:04 2022] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[Sat Dec 31 22:07:04 2022] Internal error: Oops: 0000000086000004 [#2] PREEMPT SMP
[Sat Dec 31 22:07:04 2022] Modules linked in: snd_seq_dummy snd_hrtimer snd_seq snd_seq_device rfcomm rpcsec_gss_krb5 joydev usbhid bnep des_generic libdes md4 nls_iso8859_1 hci_bcm4377 bluetooth tg3 xhci_pci ptp xhci_hcd brcmfmac brcmutil ecdh_generic ecc macsmc_hid apple_piodma appledrm macsmc_power snd_soc_macaudio macsmc_reboot cfg80211 snd_soc_cs42l83_i2c snd_soc_cs42l42 rfkill asahi apple_dcp snd_soc_tas2770 snd_soc_apple_mca drm_dma_helper apple_admac clk_apple_nco apple_soc_cpufreq crypto_user fuse nvmem_spmi_mfd rtc_macsmc gpio_macsmc tps6598x simple_mfd_spmi regmap_spmi pcie_apple pci_host_common phy_apple_atc typec dwc3 udc_core macsmc_rtkit macsmc mfd_core nvmem_apple_efuses spmi_apple_controller pinctrl_apple_gpio i2c_apple apple_dart nvme_apple apple_sart apple_rtkit apple_mailbox
[Sat Dec 31 22:07:04 2022] CPU: 7 PID: 1815 Comm: grep Tainted: G S    D            6.1.0-asahi-2-2-edge-ARCH #2
[Sat Dec 31 22:07:04 2022] Hardware name: Apple Mac mini (M1, 2020) (DT)
[Sat Dec 31 22:07:04 2022] pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[Sat Dec 31 22:07:04 2022] pc : 0x0
[Sat Dec 31 22:07:04 2022] lr : psci_debugfs_read+0x20/0x60
[Sat Dec 31 22:07:04 2022] sp : ffff80000a19bbd0
[Sat Dec 31 22:07:04 2022] x29: ffff80000a19bbd0 x28: ffff000011c4eae0 x27: 0000000000400cc0
[Sat Dec 31 22:07:04 2022] x26: 000000007fffc000 x25: ffff000011c4ead0 x24: 0000000000000000
[Sat Dec 31 22:07:04 2022] x23: ffff80000a19bca0 x22: ffff80000a19bcc8 x21: 0000000000000001
[Sat Dec 31 22:07:04 2022] x20: ffff000011c4eaa8 x19: ffff000011c4eaa8 x18: 0000000000000000
[Sat Dec 31 22:07:04 2022] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[Sat Dec 31 22:07:04 2022] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[Sat Dec 31 22:07:04 2022] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
[Sat Dec 31 22:07:04 2022] x8 : 00000000ffffffff x7 : 0000000000400cc0 x6 : 0000000000000001
[Sat Dec 31 22:07:04 2022] x5 : ffff000011c3c000 x4 : 0000000000000000 x3 : 0000000000000000
[Sat Dec 31 22:07:04 2022] x2 : ffff8000088548e0 x1 : ffff800009469000 x0 : 0000000000000000
[Sat Dec 31 22:07:04 2022] Call trace:
[Sat Dec 31 22:07:04 2022]  0x0
[Sat Dec 31 22:07:04 2022]  seq_read_iter+0x164/0x450
[Sat Dec 31 22:07:04 2022]  seq_read+0x84/0xc0
[Sat Dec 31 22:07:04 2022]  full_proxy_read+0x60/0xbc
[Sat Dec 31 22:07:04 2022]  vfs_read+0xc0/0x2a0
[Sat Dec 31 22:07:04 2022]  ksys_read+0x6c/0x100
[Sat Dec 31 22:07:04 2022]  __arm64_sys_read+0x1c/0x30
[Sat Dec 31 22:07:04 2022]  invoke_syscall.constprop.0+0x50/0xf0
[Sat Dec 31 22:07:04 2022]  do_el0_svc+0xbc/0xe0
[Sat Dec 31 22:07:04 2022]  el0_svc+0x30/0x120
[Sat Dec 31 22:07:04 2022]  el0t_64_sync_handler+0xf4/0x120
[Sat Dec 31 22:07:04 2022]  el0t_64_sync+0x18c/0x190
[Sat Dec 31 22:07:04 2022] Code: bad PC value
[Sat Dec 31 22:07:04 2022] ---[ end trace 0000000000000000 ]---

This is easily reproducible. Let me know if there is any other information I can provide.

@nickgarvey
Author

[ngarvey@macmini ~]$ sudo grep -lre 'asahi-linux' /
/home/ngarvey/.bash_history
/home/ngarvey/.viminfo
/home/ngarvey/.Xauthority
/home/ngarvey/.config/emailidentities
/home/ngarvey/.config/kdeconnect/config
/home/ngarvey/.local/share/akonadi/db_data/aria_log.00000001
/home/ngarvey/.local/share/akonadi/db_data/mysql/proxies_priv.MAD
/home/ngarvey/.local/share/akonadi/db_data/mysql/proxies_priv.MAI
/home/ngarvey/.local/share/akonadi/db_data/mysql/global_priv.MAI
/home/ngarvey/.local/share/akonadi/db_data/mysql/global_priv.MAD
grep: /sys/kernel/mm/hugepages/hugepages-32768kB/demote: Permission denied
grep: /sys/kernel/mm/hugepages/hugepages-1048576kB/demote: Permission denied
grep: /sys/kernel/security/apparmor/revision: Resource temporarily unavailable
grep: /sys/kernel/security/apparmor/.remove: Invalid argument
grep: /sys/kernel/security/apparmor/.replace: Invalid argument
grep: /sys/kernel/security/apparmor/.load: Invalid argument
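
A recursive search from / descends into kernel pseudo-filesystems such as /sys, where reads can fail or have side effects, as the errors above show. As a possible workaround until the kernel is fixed, the following is a minimal Python sketch that searches only files on the same device as the starting directory, so separately mounted filesystems (sysfs, debugfs, procfs) are never opened. The function names `files_on_root_device` and `files_containing` are hypothetical, chosen for illustration. Note the assumptions: it also skips real data on separate mounts such as a dedicated /home partition, and it reads each file fully into memory.

```python
import os
import stat


def files_on_root_device(root):
    """Yield regular files under root without crossing into other mounts."""
    root_dev = os.lstat(root).st_dev
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        keep = []
        for name in dirnames:
            try:
                # A different st_dev means this directory is a mount point
                # for another filesystem (e.g. sysfs or debugfs); prune it.
                if os.lstat(os.path.join(dirpath, name)).st_dev == root_dev:
                    keep.append(name)
            except OSError:
                pass
        dirnames[:] = keep  # in-place edit tells os.walk what to descend into
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode
            except OSError:
                continue
            if stat.S_ISREG(mode):  # skip sockets, FIFOs, device nodes, symlinks
                yield path


def files_containing(root, needle):
    """Return paths of files under root whose bytes contain needle."""
    hits = []
    for path in files_on_root_device(root):
        try:
            with open(path, "rb") as fh:
                if needle in fh.read():
                    hits.append(path)
        except OSError:
            continue  # unreadable file: skip it, as grep does after its warning
    return hits
```

Calling `files_containing("/", b"asahi-linux")` as root would roughly correspond to the grep invocation above, restricted to the root filesystem.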

@marcan
Member

marcan commented Jan 5, 2023

Looks like it crashes when reading /sys/kernel/debug/psci. That's an upstream bug (this is not a PSCI platform, so there is no PSCI support).
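
For readers wondering how the faulting file can be inferred from the log: the oops reports pc : 0x0 with lr : psci_debugfs_read+0x20/0x60 and an instruction abort at the current EL, which suggests the kernel branched to address zero from inside psci_debugfs_read, as a call through a NULL function pointer would. The following Python sketch pulls those fields out of a dmesg oops. The helper names `summarize_oops` and `looks_like_null_call` are hypothetical, and the parsing assumes the arm64 oops layout and bracketed timestamp prefix shown in this issue.

```python
import re

# Matches the bracketed timestamp prefix, e.g. "[Sat Dec 31 22:07:04 2022] ".
_PREFIX = re.compile(r"^\[[^\]]*\]\s?")
# Matches the program counter and link register lines of an arm64 oops.
_REG = re.compile(r"^(pc|lr)\s*:\s*(\S+)")

SAMPLE = """\
[Sat Dec 31 22:07:04 2022] pc : 0x0
[Sat Dec 31 22:07:04 2022] lr : psci_debugfs_read+0x20/0x60
[Sat Dec 31 22:07:04 2022] Call trace:
[Sat Dec 31 22:07:04 2022]  0x0
[Sat Dec 31 22:07:04 2022]  seq_read_iter+0x164/0x450
[Sat Dec 31 22:07:04 2022] Code: bad PC value
"""


def summarize_oops(log):
    """Extract pc, lr and the call-trace frames from an oops log."""
    summary = {"pc": None, "lr": None, "frames": []}
    in_trace = False
    for raw in log.splitlines():
        line = _PREFIX.sub("", raw, count=1)
        if in_trace:
            # Frames are indented; the first unindented line ends the trace.
            if line.startswith(" "):
                summary["frames"].append(line.strip())
                continue
            in_trace = False
        reg = _REG.match(line)
        if reg:
            summary[reg.group(1)] = reg.group(2)
        elif line.startswith("Call trace:"):
            in_trace = True
    return summary


def looks_like_null_call(summary):
    """True when pc is zero, i.e. execution jumped to a NULL address."""
    return summary["pc"] in ("0x0", "0")


if __name__ == "__main__":
    info = summarize_oops(SAMPLE)
    print(info["lr"], looks_like_null_call(info))
```

When `looks_like_null_call` returns true, the symbol in `lr` names the function that made the bad call, which here points at the PSCI debugfs read handler.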

@marcan
Member

marcan commented Jan 5, 2023

Upstream patch: https://lore.kernel.org/lkml/20230105090834.630238-1-maz@kernel.org/

@marcan
Member

marcan commented Jun 15, 2023

Should be fixed these days.

@marcan marcan closed this as completed Jun 15, 2023
svenpeter42 pushed a commit that referenced this issue Apr 17, 2024
[ Upstream commit 601429c ]

Why:
    The PCI error slot reset maybe triggered after inject ue to UMC multi times, this
    caused system hang.
    [  557.371857] amdgpu 0000:af:00.0: amdgpu: GPU reset succeeded, trying to resume
    [  557.373718] [drm] PCIE GART of 512M enabled.
    [  557.373722] [drm] PTB located at 0x0000031FED700000
    [  557.373788] [drm] VRAM is lost due to GPU reset!
    [  557.373789] [drm] PSP is resuming...
    [  557.547012] mlx5_core 0000:55:00.0: mlx5_pci_err_detected Device state = 1 pci_status: 0. Exit, result = 3, need reset
    [  557.547067] [drm] PCI error: detected callback, state(1)!!
    [  557.547069] [drm] No support for XGMI hive yet...
    [  557.548125] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 0. Enter
    [  557.607763] mlx5_core 0000:55:00.0: wait vital counter value 0x16b5b after 1 iterations
    [  557.607777] mlx5_core 0000:55:00.0: mlx5_pci_slot_reset Device state = 1 pci_status: 1. Exit, err = 0, result = 5, recovered
    [  557.610492] [drm] PCI error: slot reset callback!!
    ...
    [  560.689382] amdgpu 0000:3f:00.0: amdgpu: GPU reset(2) succeeded!
    [  560.689546] amdgpu 0000:5a:00.0: amdgpu: GPU reset(2) succeeded!
    [  560.689562] general protection fault, probably for non-canonical address 0x5f080b54534f611f: 0000 [#1] SMP NOPTI
    [  560.701008] CPU: 16 PID: 2361 Comm: kworker/u448:9 Tainted: G           OE     5.15.0-91-generic #101-Ubuntu
    [  560.712057] Hardware name: Microsoft C278A/C278A, BIOS C2789.5.BS.1C11.AG.1 11/08/2023
    [  560.720959] Workqueue: amdgpu-reset-hive amdgpu_ras_do_recovery [amdgpu]
    [  560.728887] RIP: 0010:amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu]
    [  560.736891] Code: ff 41 89 c6 e9 1b ff ff ff 44 0f b6 45 b0 e9 4f ff ff ff be 01 00 00 00 4c 89 e7 e8 76 c9 8b ff 44 0f b6 45 b0 e9 3c fd ff ff <48> 83 ba 18 02 00 00 00 0f 84 6a f8 ff ff 48 8d 7a 78 be 01 00 00
    [  560.757967] RSP: 0018:ffa0000032e53d80 EFLAGS: 00010202
    [  560.763848] RAX: ffa00000001dfd10 RBX: ffa0000000197090 RCX: ffa0000032e53db0
    [  560.771856] RDX: 5f080b54534f5f07 RSI: 0000000000000000 RDI: ff11000128100010
    [  560.779867] RBP: ffa0000032e53df0 R08: 0000000000000000 R09: ffffffffffe77f08
    [  560.787879] R10: 0000000000ffff0a R11: 0000000000000001 R12: 0000000000000000
    [  560.795889] R13: ffa0000032e53e00 R14: 0000000000000000 R15: 0000000000000000
    [  560.803889] FS:  0000000000000000(0000) GS:ff11007e7e800000(0000) knlGS:0000000000000000
    [  560.812973] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [  560.819422] CR2: 000055a04c118e68 CR3: 0000000007410005 CR4: 0000000000771ee0
    [  560.827433] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [  560.835433] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
    [  560.843444] PKRU: 55555554
    [  560.846480] Call Trace:
    [  560.849225]  <TASK>
    [  560.851580]  ? show_trace_log_lvl+0x1d6/0x2ea
    [  560.856488]  ? show_trace_log_lvl+0x1d6/0x2ea
    [  560.861379]  ? amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu]
    [  560.867778]  ? show_regs.part.0+0x23/0x29
    [  560.872293]  ? __die_body.cold+0x8/0xd
    [  560.876502]  ? die_addr+0x3e/0x60
    [  560.880238]  ? exc_general_protection+0x1c5/0x410
    [  560.885532]  ? asm_exc_general_protection+0x27/0x30
    [  560.891025]  ? amdgpu_device_gpu_recover.cold+0xbf1/0xcf5 [amdgpu]
    [  560.898323]  amdgpu_ras_do_recovery+0x1b2/0x210 [amdgpu]
    [  560.904520]  process_one_work+0x228/0x3d0
How:
    In RAS recovery, mode-1 reset is issued from RAS fatal error handling and expected
    all the nodes in a hive to be reset. no need to issue another mode-1 during this procedure.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>