drm: BUG: unable to handle page fault for address: 17ec6000 #1081

Closed
paulmenzel opened this issue Jul 7, 2020 · 19 comments
Labels
[FIXED][LINUX] 5.10: This bug was fixed in Linux 5.10
Reported upstream: This bug was filed on LLVM’s issue tracker, Phabricator, or the kernel mailing list.

Comments

@paulmenzel

On the Asus F2A85-M PRO with

00:01.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Richland [Radeon HD 8470D] [1002:9996]

running Debian Sid/unstable with Linux v5.8-rc4-25-gbfe91da29bfad (with some patches for LLVM/Clang/LLD) built with clang-11 and lld-11 1:11~++20200701093119+ffee8040534-1~exp1 from experimental, starting a graphical session (X.Org or Wayland) fails with a page fault:

[  502.044997] BUG: unable to handle page fault for address: 17ec6000
[  502.045650] #PF: supervisor write access in kernel mode
[  502.046301] #PF: error_code(0x0002) - not-present page
[  502.046956] *pde = 00000000 
[  502.047612] Oops: 0002 [#1] SMP
[  502.048269] CPU: 0 PID: 2125 Comm: Xorg.wrap Not tainted 5.8.0-rc4-00105-g4da71f1ee6263 #141
[  502.048967] Hardware name: System manufacturer System Product Name/F2A85-M PRO, BIOS 6601 11/25/2014
[  502.049686] EIP: __srcu_read_lock+0x11/0x20
[  502.050413] Code: 83 e0 03 50 56 68 72 c6 99 dd 68 46 c6 99 dd e8 3a c8 fe ff 83 c4 10 eb ce 0f 1f 44 00 00 55 89 e5 8b 48 68 8b 40 7c 83 e1 01 <64> ff 04 88 f0 83 44 24 fc 00 89 c8 5d c3 90 0f 1f 44 00 00 55 89
[  502.052027] EAX: 00000000 EBX: f36671b8 ECX: 00000000 EDX: 00000286
[  502.052856] ESI: f3f94eb8 EDI: f3e51c00 EBP: f303dd9c ESP: f303dd9c
[  502.053695] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
[  502.054543] CR0: 80050033 CR2: 17ec6000 CR3: 2eea2000 CR4: 000406d0
[  502.055402] Call Trace:
[  502.056275]  drm_minor_acquire+0x6f/0x140 [drm]
[  502.057162]  drm_stub_open+0x2e/0x110 [drm]
[  502.058049]  chrdev_open+0xdd/0x1e0
[  502.058937]  do_dentry_open+0x21d/0x330
[  502.059828]  vfs_open+0x23/0x30
[  502.060718]  path_openat+0x947/0xd60
[  502.061610]  ? unlink_anon_vmas+0x53/0x120
[  502.062504]  do_filp_open+0x6d/0x100
[  502.063404]  ? __alloc_fd+0x73/0x140
[  502.064305]  do_sys_openat2+0x1b3/0x2a0
[  502.065217]  __ia32_sys_openat+0x90/0xb0
[  502.066128]  ? prepare_exit_to_usermode+0xa/0x20
[  502.067046]  do_fast_syscall_32+0x68/0xd0
[  502.067970]  do_SYSENTER_32+0x12/0x20
[  502.068902]  entry_SYSENTER_32+0x9f/0xf2
[  502.069839] EIP: 0xb7ef14f9
[  502.070764] Code: Bad RIP value.
[  502.071689] EAX: ffffffda EBX: ffffff9c ECX: bfa6a2ac EDX: 00008002
[  502.072654] ESI: 00000000 EDI: b7ed1000 EBP: bfa6b2c8 ESP: bfa6a1c0
[  502.073630] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
[  502.074615] Modules linked in: af_packet k10temp r8169 realtek i2c_piix4 snd_hda_codec_realtek snd_hda_codec_generic ohci_pci ohci_hcd ehci_pci snd_hda_codec_hdmi ehci_hcd radeon i2c_algo_bit snd_hda_intel ttm snd_intel_dspcfg snd_hda_codec drm_kms_helper snd_hda_core snd_pcm cfbimgblt cfbcopyarea cfbfillrect snd_timer sysimgblt syscopyarea sysfillrect snd fb_sys_fops xhci_pci xhci_hcd soundcore acpi_cpufreq drm drm_panel_orientation_quirks agpgart ipv6 nf_defrag_ipv6
[  502.077895] CR2: 0000000017ec6000
[  502.079050] ---[ end trace ced4517b63a6db26 ]---
[  502.080214] EIP: __srcu_read_lock+0x11/0x20
[  502.081392] Code: 83 e0 03 50 56 68 72 c6 99 dd 68 46 c6 99 dd e8 3a c8 fe ff 83 c4 10 eb ce 0f 1f 44 00 00 55 89 e5 8b 48 68 8b 40 7c 83 e1 01 <64> ff 04 88 f0 83 44 24 fc 00 89 c8 5d c3 90 0f 1f 44 00 00 55 89
[  502.083891] EAX: 00000000 EBX: f36671b8 ECX: 00000000 EDX: 00000286
[  502.085148] ESI: f3f94eb8 EDI: f3e51c00 EBP: f303dd9c ESP: f303dd9c
[  502.086406] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
[  502.087675] CR0: 80050033 CR2: 17ec6000 CR3: 2eea2000 CR4: 000406d0

linux-5.8-rc4+-messages.txt
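
For reading the `#PF:` lines in the oops above, here is a minimal sketch that decodes the x86 page-fault error code. The bit layout is the architectural one; the function name `decode_pf_error_code` is just an illustrative helper, not something from the kernel tree.

```python
# Decode the x86 page-fault error code shown in the oops ("error_code(0x0002)").
# Helper name is hypothetical; bit meanings follow the x86 #PF error code layout.

def decode_pf_error_code(code: int) -> dict:
    return {
        "present": bool(code & 0x01),            # 0: not-present page, 1: protection violation
        "write": bool(code & 0x02),              # 1: the access was a write
        "user": bool(code & 0x04),               # 1: the access came from user mode
        "reserved_bit": bool(code & 0x08),       # 1: reserved bit set in a paging entry
        "instruction_fetch": bool(code & 0x10),  # 1: the fault was on an instruction fetch
    }

info = decode_pf_error_code(0x0002)
print(info)
```

For `0x0002` only the write bit is set, which matches the "supervisor write access in kernel mode" and "not-present page" lines in the log.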

@paulmenzel
Author

Should I report this to the DRM folks, or is it a LLVM/Clang issue because it works with GCC just fine?

@nathanchance
Member

> Should I report this to the DRM folks

It might be worth getting them involved due to the complexity of the system.

> is it a LLVM/Clang issue because it works with GCC just fine?

Just because it works fine with GCC doesn't mean it is an LLVM/Clang issue. See #735 for an instance of this with amdgpu.

Other than that, I do not have much else to offer at the moment from staring at the code.

nathanchance added a commit to nathanchance/continuous-integration that referenced this issue Jul 7, 2020
Note, as of next-20200707, arm32 and arm64 boot is broken but should be
fixed in next-20200708 for reasons unrelated to LLVM:

https://lore.kernel.org/dmaengine/5f036d83.1c69fb81.10199.06d0@mx.google.com/
https://lore.kernel.org/dmaengine/159404871194.45151.3076873396834992441.stgit@djiang5-desk3.ch.intel.com/

No presubmit testing is done for that reason but this has been verified
locally.

[skip ci]

Fixes: https://travis-ci.com/github/ClangBuiltLinux/continuous-integration/jobs/358277535
Link: ClangBuiltLinux/linux#1081
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
nickdesaulniers added the [ARCH] x86_64 (This bug impacts ARCH=x86_64) label Jul 7, 2020
@nickdesaulniers
Member

> with some patches for LLVM/Clang/LLD

What does that mean; they may be important to reproduce.

From the trace, it looks like just the 32b registers are being printed? Is this a 32b kernel image, or a 64b kernel image?

paulmenzel added [ARCH] x86 (This bug impacts ARCH=i386) and removed [ARCH] x86_64 (This bug impacts ARCH=x86_64) labels Jul 9, 2020
@paulmenzel
Author

> > with some patches for LLVM/Clang/LLD
>
> What does that mean; they may be important to reproduce.

Sorry, the two Linux commits for the two issues below are applied.

  1. x86: support i386 with Clang, invalid output size for constraint '=q' in arch/x86/events/amd/core.c #194
  2. x86/boot: allow a relocatable kernel to be linked with lld, can't create dynamic relocation R_386_32 with LLD #579

> From the trace, it looks like just the 32b registers are being printed? Is this a 32b kernel image, or a 64b kernel image?

This is a 32-bit (ARCH=i386) Linux kernel image.

@paulmenzel
Author

$ dmesg | ./scripts/decodecode
[ 55.784870] Code: 83 e0 03 50 56 68 ca c6 99 cf 68 9e c6 99 cf e8 3a c8 fe ff 83 c4 10 eb ce 0f 1f 44 00 00 55 89 e5 8b 48 68 8b 40 7c 83 e1 01 <64> ff 04 88 f0 83 44 24 fc 00 89 c8 5d c3 90 0f 1f 44 00 00 55 89
All code
========
   0:	83 e0 03             	and    $0x3,%eax
   3:	50                   	push   %eax
   4:	56                   	push   %esi
   5:	68 ca c6 99 cf       	push   $0xcf99c6ca
   a:	68 9e c6 99 cf       	push   $0xcf99c69e
   f:	e8 3a c8 fe ff       	call   0xfffec84e
  14:	83 c4 10             	add    $0x10,%esp
  17:	eb ce                	jmp    0xffffffe7
  19:	0f 1f 44 00 00       	nopl   0x0(%eax,%eax,1)
  1e:	55                   	push   %ebp
  1f:	89 e5                	mov    %esp,%ebp
  21:	8b 48 68             	mov    0x68(%eax),%ecx
  24:	8b 40 7c             	mov    0x7c(%eax),%eax
  27:	83 e1 01             	and    $0x1,%ecx
  2a:*	64 ff 04 88          	incl   %fs:(%eax,%ecx,4)		<-- trapping instruction
  2e:	f0 83 44 24 fc 00    	lock addl $0x0,-0x4(%esp)
  34:	89 c8                	mov    %ecx,%eax
  36:	5d                   	pop    %ebp
  37:	c3                   	ret    
  38:	90                   	nop
  39:	0f 1f 44 00 00       	nopl   0x0(%eax,%eax,1)
  3e:	55                   	push   %ebp
  3f:	89                   	.byte 0x89

Code starting with the faulting instruction
===========================================
   0:	64 ff 04 88          	incl   %fs:(%eax,%ecx,4)
   4:	f0 83 44 24 fc 00    	lock addl $0x0,-0x4(%esp)
   a:	89 c8                	mov    %ecx,%eax
   c:	5d                   	pop    %ebp
   d:	c3                   	ret    
   e:	90                   	nop
   f:	0f 1f 44 00 00       	nopl   0x0(%eax,%eax,1)
  14:	55                   	push   %ebp
  15:	89                   	.byte 0x89
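
To relate the trapping instruction to the fault address, here is a small sketch of the x86 address computation for `incl %fs:(%eax,%ecx,4)`, using the EAX, ECX and CR2 values from the register dump in the first report. The helper name is made up for illustration. The reading that `0x7c(%eax)` is the `->sda` pointer of the `srcu_struct` is an assumption based on the surrounding discussion, not something the disassembly itself proves.

```python
# Model of the 32-bit x86 address calculation: segment base + base + index*scale + disp.
# linear_address() is a hypothetical helper, not a kernel or library function.

def linear_address(seg_base: int, base: int, index: int, scale: int, disp: int = 0) -> int:
    return (seg_base + base + index * scale + disp) & 0xFFFFFFFF

# Values taken from the oops: EAX and ECX at the trap, CR2 is the faulting address.
eax, ecx, cr2 = 0x00000000, 0x00000000, 0x17EC6000

# Offset inside the %fs segment that the instruction tried to increment.
offset = linear_address(0, eax, ecx, 4)

# Whatever is left of CR2 must then have come from the %fs segment base.
implied_fs_base = (cr2 - offset) & 0xFFFFFFFF

print(hex(offset), hex(implied_fs_base))
```

With both registers zero the offset is zero, so the faulting address is just the `%fs` base. That would fit a NULL per-CPU pointer being dereferenced, assuming `%fs` is used for per-CPU data on this kernel.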

@kaniini

kaniini commented Sep 16, 2020

This is also happening for me on aarch64 when building unpatched Linux 5.8.9 from kernel.org with clang:

[   19.810753] Unable to handle kernel paging request at virtual address ffff802f0735a000
[   19.810755] Mem abort info:
[   19.810756]   ESR = 0x96000005
[   19.810758]   EC = 0x25: DABT (current EL), IL = 32 bits
[   19.810760]   SET = 0, FnV = 0
[   19.810761]   EA = 0, S1PTW = 0
[   19.810762] Data abort info:
[   19.810764]   ISV = 0, ISS = 0x00000005
[   19.810765]   CM = 0, WnR = 0
[   19.810767] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000e3578000
[   19.810768] [ffff802f0735a000] pgd=0000002fdffff003, p4d=0000002fdffff003, pud=0000000000000000
[   19.810773] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[   19.810886] Modules linked in: cdc_ether usbnet r8152 mii nls_utf8 nls_cp437 vfat fat snd_usb_audio aes_ce_blk crypto_simd cryptd snd_usbmidi_lib aes_ce_cipher snd_rawmidi crct10dif_ce ghash_ce snd_seq_device snd_hda_codec_hdmi joydev mousedev gf128mul mc input_leds af_packet sha2_ce efi_pstore evdev snd_hda_intel snd_intel_dspcfg snd_hda_codec sha256_arm64 sha1_ce snd_hda_core efivars sbsa_gwdt snd_hwdep snd_pcm snd_timer snd lm90 soundcore uio_pdrv_genirq uio fan thermal processor dwc3 efivarfs hid_generic usbhid hid amdgpu gpu_sched hwmon ttm drm_kms_helper drm cec fb_sys_fops syscopyarea sysfillrect sysimgblt i2c_algo_bit i2c_core nvme nvme_core ahci_platform libahci_platform libahci libata xhci_plat_hcd xhci_hcd loop usb_storage usbcore
[   19.816393] CPU: 15 PID: 2957 Comm: X Not tainted 5.8.9-1-edge #2-Alpine
[   19.816834] Hardware name: SolidRun Ltd. SolidRun CEX7 Platform, BIOS EDK II Jul 24 2020
[   19.816988] pstate: 20000005 (nzCv daif -PAN -UAO BTYPE=--)
[   19.817607] pc : __srcu_read_lock+0x38/0x7c
[   19.832584] lr : drm_minor_acquire+0xa8/0x11c [drm]
[   19.838341] sp : ffff800012f9baa0
[   19.838342] x29: ffff800012f9baa0 x28: 0000000000000041 
[   19.838344] x27: ffff002f100a24c0 x26: 0000000000000030 
[   19.838346] x25: ffff002f08879140 x24: ffff002f10a33328 
[   19.838347] x23: 0000000000000000 x22: ffff800008a5b4e8 
[   19.838349] x21: ffff002f0785c3f0 x20: ffff002f10af6000 
[   19.838351] x19: 0000000000000000 x18: 0000000000000000 
[   19.838352] x17: 0000000000000002 x16: ffffffffffffffff 
[   19.838354] x15: 0000000000000028 x14: 0000000000000005 
[   19.838356] x13: ffff002f08879148 x12: 0000000000000000 
[   19.838357] x11: ffff802f0735a000 x10: 0000000000000001 
[   19.838359] x9 : ffff802f0735a000 x8 : ffff002f100a24c0 
[   19.838360] x7 : 0000000000000000 x6 : 000000000000003f 
[   19.838362] x5 : 0000000000000000 x4 : 0000000000000000 
[   19.838364] x3 : 0000000000000001 x2 : ffff002f082e6400 
[   19.838365] x1 : 0000000000000000 x0 : ffff800008a794d0 
[   19.838368] Call trace:
[   19.838371]  __srcu_read_lock+0x38/0x7c
[   19.838382]  drm_minor_acquire+0xa8/0x11c [drm]
[   19.838391]  drm_stub_open+0x34/0x114 [drm]
[   19.838394]  chrdev_open+0x198/0x1f8
[   19.838396]  do_dentry_open+0x268/0x3a0
[   19.838398]  vfs_open+0x28/0x30
[   19.838399]  path_openat+0x888/0xc0c
[   19.838401]  do_filp_open+0x74/0x11c
[   19.838403]  do_sys_openat2+0x7c/0x14c
[   19.838404]  __arm64_sys_openat+0x68/0x8c
[   19.838407]  el0_svc_common+0x98/0x160
[   19.838409]  do_el0_svc+0x70/0x78
[   19.838412]  el0_sync_handler+0xd4/0x248
[   19.838413]  el0_sync+0x140/0x180
[   19.838416] Code: d538d08b 8b130d49 8b090169 5280002a (c85f7d2c) 
[   19.838419] ---[ end trace 32b105d7eb3c05e2 ]---
[   19.838434] note: X[2957] exited with preempt_count 1

I suspect the issue has to do with SRCU.
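
The "Mem abort info" lines in the trace above can be cross-checked by splitting the ESR value by hand. This is a minimal sketch assuming the usual AArch64 ESR_ELx layout for data aborts; `decode_esr` and `DFSC_NAMES` are illustrative names, and only the translation-fault codes are listed.

```python
# Split an AArch64 ESR value into the fields the kernel prints for a data abort.
# decode_esr() and DFSC_NAMES are hypothetical helpers for illustration only.

DFSC_NAMES = {
    0x04: "translation fault, level 0",
    0x05: "translation fault, level 1",
    0x06: "translation fault, level 2",
    0x07: "translation fault, level 3",
}

def decode_esr(esr: int) -> dict:
    iss = esr & 0x1FFFFFF              # instruction specific syndrome, bits [24:0]
    dfsc = iss & 0x3F                  # data fault status code, bits [5:0]
    return {
        "ec": (esr >> 26) & 0x3F,      # exception class, bits [31:26]
        "il": (esr >> 25) & 0x1,       # 1: 32-bit instruction
        "iss": iss,
        "wnr": (iss >> 6) & 0x1,       # 0: read access, 1: write access
        "dfsc": dfsc,
        "dfsc_name": DFSC_NAMES.get(dfsc, "other"),
    }

fields = decode_esr(0x96000005)
print(fields)
```

For `0x96000005` this gives EC = 0x25 and a level 1 translation fault on a read, which appears consistent with the `pud=0000000000000000` entry printed for the faulting address.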

@nickdesaulniers
Member

nickdesaulniers added [BUG] Untriaged (Something isn't working) and Reported upstream (This bug was filed on LLVM’s issue tracker, Phabricator, or the kernel mailing list.) and removed [ARCH] x86 (This bug impacts ARCH=i386) labels Sep 16, 2020
@kaniini

kaniini commented Sep 17, 2020

I will have to build a kernel by hand outside of the Alpine kernel packaging but will sprinkle in some printk() this weekend as requested in that thread.

@nickdesaulniers
Member

Thanks for the reports. @kaniini please attach disassembly of the bottom-most stack frame when posting traces; those go hand in hand and we typically need both to understand reports. They also need to come from precisely the same kernel image; rebuilding may change the object file (I'm not sure of the kernel's status as far as fully reproducible builds are concerned).

Paul McKenney (@paulmckrcu) had some suggestions:

0.      Did someone call srcu_read_lock() before init_srcu_struct()
        had been called on this srcu_struct structure?

Printing the srcu_struct pointer via printk %p (with kptr_restrict in mind) in both functions should help us spot whether the same address shows up in both, but perhaps in the wrong order.

1.      Did the init_srcu_struct() for this srcu_struct report an error?
        (Though with current mainline, that memory-allocation failure
        would more likely have page-faulted in init_srcu_struct().)

You can check dmesg more closely for any such reports. I noticed that some of these functions have different definitions when CONFIG_DEBUG_LOCK_ALLOC is enabled. You could try enabling that config.

2.      Has the srcu_struct in question already been passed to
        cleanup_srcu_struct()?

So in this case, I'd add printks of the %p pointer in cleanup_srcu_struct(). You might need kptr_restrict disabled to avoid seeing 0x00's in dmesg. If you see the same address as one that's been

3.      Has the value of %fs been clobbered?  Though that seems
        unlikely given that it also happens on aarch64.  Plus, the
        smoking gun seems to me to be the zero value of %eax.

I don't think this is the case: %fs had a value in @paulmenzel's report, and @kaniini's report is arm64.

4.      If the above three questions fail to provide enlightenment,
        I suggest recording the ->sda value and adding debug checks
        to anything that can unmap memory...  And recording the value
        of ->sda somewhere to check to see if it is being changed (it
        should remain constant from init_srcu_struct()'s return through
        the corresponding call to cleanup_srcu_struct()).

I kind of get the feeling that there may be a dangling reference to a value that's been cleaned up somewhere, too. I wonder if enabling KASAN would help find use after frees here?
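
Suggestions 0 and 2 boil down to checking the order of events per srcu_struct address. Below is a rough sketch of a script that could scan dmesg output for that, assuming someone has added printk lines by hand to init_srcu_struct(), __srcu_read_lock() and cleanup_srcu_struct(). The `srcu-debug: <event> ssp=<pointer>` line format is entirely hypothetical and would have to match whatever the added printks actually emit.

```python
import re

# Hypothetical format of hand-added debug lines, e.g.
#   "[    1.0] srcu-debug: init ssp=00000000aaaa0001"
EVENT_RE = re.compile(r"srcu-debug: (init|lock|cleanup) ssp=(\S+)")

def find_misuse(lines):
    """Return (pointer, problem) tuples for locks taken before init or after cleanup."""
    state = {}      # pointer -> last lifecycle event seen ("init" or "cleanup")
    problems = []
    for line in lines:
        match = EVENT_RE.search(line)
        if not match:
            continue  # unrelated dmesg line
        event, ptr = match.groups()
        if event == "lock":
            if ptr not in state:
                problems.append((ptr, "lock before init"))
            elif state[ptr] == "cleanup":
                problems.append((ptr, "lock after cleanup"))
        else:
            state[ptr] = event
    return problems

# Made-up sample input, only to show the expected shape of the log.
sample = [
    "[    1.0] srcu-debug: init ssp=00000000aaaa0001",
    "[    2.0] srcu-debug: lock ssp=00000000aaaa0001",
    "[    3.0] srcu-debug: lock ssp=00000000bbbb0002",
    "[    4.0] srcu-debug: cleanup ssp=00000000aaaa0001",
    "[    5.0] srcu-debug: lock ssp=00000000aaaa0001",
]
problems = find_misuse(sample)
for ptr, problem in problems:
    print(ptr, problem)
```

A "lock before init" hit should be read as a hint rather than proof, since a statically defined srcu_struct may be set up without passing through the instrumented init path.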

@kaniini

kaniini commented Sep 19, 2020

AFAIK, arm64 does not have scripts/decodecode available to it; otherwise I would :)

@kaniini

kaniini commented Sep 19, 2020

also, %fs was originally zeroed in @paulmenzel's report. the second mention of FS is 0xd8, which means clobbering is possible.
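
When comparing the FS values across the register dumps, it may help to split each selector into its parts. This is a minimal sketch of the standard x86 selector layout; `decode_selector` is an illustrative name. Note that in @paulmenzel's log the dumps with EIP in `__srcu_read_lock` show FS: 00d8, while the dump with `EIP: 0xb7ef14f9` shows FS: 0000. Which GDT index the 32-bit kernel uses for its per-CPU segment is not stated in this thread and would need checking against the kernel source.

```python
# Split an x86 segment selector into descriptor index, table indicator and RPL.
# decode_selector() is a hypothetical helper for illustration.

def decode_selector(sel: int) -> dict:
    return {
        "index": sel >> 3,        # descriptor table index
        "ti": (sel >> 2) & 0x1,   # 0: GDT, 1: LDT
        "rpl": sel & 0x3,         # requested privilege level
    }

kernel_fs = decode_selector(0x00D8)   # FS in the dumps with EIP in __srcu_read_lock
other_fs = decode_selector(0x0000)    # FS in the dump with EIP: 0xb7ef14f9
user_ds = decode_selector(0x007B)     # DS, shown for comparison

print(kernel_fs, other_fs, user_ds)
```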

@nathanchance
Member

So as it turns out, I think that @nickdesaulniers's recent SRCU patch actually fixes this issue: https://lore.kernel.org/lkml/20200929192549.501516-1-ndesaulniers@google.com/

I can reproduce these warnings on my Raspberry Pi on next-20201002 by just booting it up now that the DRM stack works fine.
As soon as I add Nick's patch, they go away. Further testing from @paulmenzel and @kaniini would be nice. If that patch fixes it, we should ask Paul to fast-track it to mainline with a CC stable tag.

@nathanchance
Member

nathanchance commented Oct 6, 2020

Actually, I just decided to reply on the mailing list with that information: https://lore.kernel.org/lkml/20201006065623.GA2418984@ubuntu-m3-large-x86/. Further testing would still be appreciated!

@paulmenzel
Author

paulmenzel commented Oct 6, 2020

Thank you for the update. I did a test again with

$ git describe --tags clang-lto/clang-lto # x86, build: allow LTO_CLANG and THINLTO to be selected
v5.9-rc8-174-gf37134efda8fd

with the KBUILD_LDFLAGS += -z notext change from #579 applied on top.

Using the packages clang-11 and lld-11 at version 11.0.0~+rc5-1, with non-versioned symbolic links added, an image built with

make bindeb-pkg -j32 ARCH=i386 LLVM=1

works, and the bug is not visible.

$ more /proc/version
Linux version 5.9.0-rc8+ (root@855cb05d002d) (Debian clang version 11.0.0-+rc5-1 , LLD 11.0.0) #205 SMP Tue Oct 6 08:09:19 UTC 2020

But it looks like Nick’s patch that you referenced is not in the branch,

git log --grep srcu -i --author=Nick

so my problem seems to have been something else. Can this issue be closed and a new one opened for yours on the Raspberry Pi?

@nathanchance
Member

Hmmm, good to know that your issue is resolved, although I cannot help but feel that the issues are somehow related, given that the call traces are extremely similar. I do not think we should split the bugs for now.

@kaniini

kaniini commented Oct 6, 2020

Applying the SRCU patch does seem to resolve it here in light testing.

@nickdesaulniers
Member

> But it looks like Nick’s patch that you referenced is not in the branch,

My patch hasn't landed in mainline yet. You'll need to pick it up and apply it manually.

@dileks
Collaborator

dileks commented Oct 9, 2020

@paulmckrcu

Paul included the (updated) patch in the rcu/next branch of his linux-rcu.git tree [1].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/patch/?id=9044d3d36df248cd15c98ccddb1509c4a8de19cb

@nathanchance
Member

I think this should be resolved by https://git.kernel.org/linus/33def8498fdde180023444b08e12b72a9efed41d. Feel free to reopen if not.

nathanchance added [FIXED][LINUX] 5.10 (This bug was fixed in Linux 5.10) and removed [BUG] Untriaged (Something isn't working) labels Apr 28, 2021