Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AMDGPU General Protection Faults #735

Closed
yshui opened this issue Oct 8, 2019 · 32 comments
Closed

AMDGPU General Protection Faults #735

yshui opened this issue Oct 8, 2019 · 32 comments
Assignees
Labels
[ARCH] x86_64 This bug impacts ARCH=x86_64 [BUG] linux A bug that should be fixed in the mainline kernel. [FIXED][LINUX] 5.4 This bug was fixed in Linux 5.4

Comments

@yshui
Copy link
Member

yshui commented Oct 8, 2019

dmesg

general protection fault: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 786 Comm: systemd-udevd Tainted: P           O      5.4.0-rc2-1-me-clang+ #3
Hardware name: System manufacturer System Product Name/ROG STRIX X470-F GAMING, BIOS 5007 06/17/2019
RIP: 0010:CalculatePrefetchSourceLines+0xe3/0x190 [amdgpu]
Code: f2 0f 58 05 37 8a 16 00 f2 48 0f 2c c0 31 d2 41 f7 f6 e9 a0 00 00 00 f2 0f 58 05 20 8a 16 00 44 89 f0 0f 57 c9 f2 48 0f 2a c8 <66> 0f 29 4d d0 f2 0f 5e c1 f2 0f 5a c0 f3 0f 10 0d f0 8f 16 00 e8
RSP: 0018:ffff943717476db8 EFLAGS: 00010246
RAX: 0000000000000001 RBX: ffff9436c8331f38 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9436c8331f38
RBP: ffff943717476de8 R08: 0000000000000000 R09: ffff9436c8336d10
R10: 0000000000000001 R11: ffff9436c83379b8 R12: ffff9436c8336cd0
R13: ffff9436c83367f8 R14: 0000000000000001 R15: ffff9436c8336d10
FS:  00007fb1af330840(0000) GS:ffff94371e800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000555e5db20d68 CR3: 0000000817472000 CR4: 00000000003406f0
Call Trace:
 dml20v2_ModeSupportAndSystemConfigurationFull+0x47ec/0x6c90 [amdgpu]
 ? get_page_from_freelist+0x370/0x680
 dml_get_voltage_level+0x1ea/0x200 [amdgpu]
 dcn20_fast_validate_bw+0x290/0x990 [amdgpu]
 ? __kmalloc+0x218/0x2b0
 dcn20_validate_bandwidth_internal+0x8d/0x1f0 [amdgpu]
 dcn20_validate_bandwidth+0x46/0xe0 [amdgpu]
 dc_validate_global_state+0x354/0x380 [amdgpu]
 amdgpu_dm_atomic_check+0xdc1/0xee0 [amdgpu]
 ? __alloc_pages_nodemask+0x16b/0x300
 ? idr_alloc+0x8e/0xc0
 ? __drm_mode_object_add+0x95/0xb0
 drm_atomic_check_only+0x410/0x710
 ? drm_atomic_add_affected_connectors+0xe6/0x100
 drm_atomic_commit+0x18/0x60
 drm_client_modeset_commit_atomic+0x158/0x1c0
 drm_client_modeset_commit_force+0x43/0x170
 drm_fb_helper_set_par+0x79/0xb0
 fbcon_init+0x483/0x620
 visual_init+0xd1/0x130
 do_bind_con_driver+0x340/0x620
 do_take_over_console+0x257/0x2a0
 fbcon_fb_registered+0x128/0x220
 register_framebuffer+0x27d/0x310
 __drm_fb_helper_initial_config_and_unlock+0x365/0x480
 amdgpu_fbdev_init+0xe3/0x100 [amdgpu]
 amdgpu_device_init+0x1935/0x1ca0 [amdgpu]
 ? __alloc_pages_nodemask+0x16b/0x300
 ? kmalloc_order_trace+0x77/0x130
 amdgpu_driver_load_kms+0x7b/0x230 [amdgpu]
 drm_dev_register+0xdb/0x2a0
 amdgpu_pci_probe+0xdd/0x130 [amdgpu]
 pci_device_probe+0x11f/0x1b0
 really_probe+0x283/0x560
 driver_probe_device+0x59/0xf0
 device_driver_attach+0x66/0xa0
 __driver_attach+0xac/0x130
 ? driver_attach+0x20/0x20
 bus_for_each_dev+0x83/0xb0
 bus_add_driver+0x11b/0x200
 ? 0xffffffffc2013000
 driver_register+0x86/0x120
 do_one_initcall+0x106/0x290
 ? __free_one_page+0x67/0x380
 ? _raw_spin_unlock+0x16/0x30
 ? preempt_count_add+0x56/0xa0
 ? __vunmap+0x267/0x2a0
 ? kmem_cache_alloc_trace+0x1ea/0x260
 ? __vunmap+0x267/0x2a0
 do_init_module+0x60/0x230
 load_module+0x3996/0x3ce0
 ? copy_user_generic_string+0x2c/0x40
 __se_sys_init_module+0xfe/0x130
 do_syscall_64+0x8a/0x120
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb1b0b4ed2e
Code: 48 8b 0d 55 01 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 22 01 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd73618d48 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
RAX: ffffffffffffffda RBX: 0000555e5dabdf40 RCX: 00007fb1b0b4ed2e
RDX: 0000555e5daba460 RSI: 000000000073a0b8 RDI: 0000555e5e2e29f0
RBP: 0000555e5daba460 R08: 0000555e5d97301a R09: 0000000000000004
R10: 0000555e5d973010 R11: 0000000000000246 R12: 0000555e5e2e29f0
R13: 0000555e5dac03b0 R14: 0000000000020000 R15: 0000555e5dabdf40
Modules linked in: vfat fat amdgpu(+) btusb btrtl btbcm btintel bluetooth mxm_wmi ecdh_generic ecc crc16 snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel gpu_sched k10temp snd_intel_nhlt snd_hda_codec snd_usb_audio ttm igb snd_hda_core snd_usbmidi_lib snd_hwdep snd_pcm snd_rawmidi snd_seq_device snd_timer snd soundcore usbkbd usbmouse wmi nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) sg efivarfs ip_tables hid_generic usbhid
---[ end trace bb36ee02fd414cc1 ]---
------------[ cut here ]------------

Disassembly

000000000022c760 <CalculatePrefetchSourceLines>:
  22c760:	e8 00 00 00 00       	call   22c765 <CalculatePrefetchSourceLines+0x5>
  22c765:	55                   	push   rbp
  22c766:	48 89 e5             	mov    rbp,rsp
  22c769:	41 57                	push   r15
  22c76b:	41 56                	push   r14
  22c76d:	41 54                	push   r12
  22c76f:	53                   	push   rbx
  22c770:	48 83 ec 10          	sub    rsp,0x10
  22c774:	4d 89 cf             	mov    r15,r9
  22c777:	41 89 ce             	mov    r14d,ecx
  22c77a:	48 89 fb             	mov    rbx,rdi
  22c77d:	f2 0f 58 c8          	addsd  xmm1,xmm0
  22c781:	f2 0f 58 0d 00 00 00 	addsd  xmm1,QWORD PTR [rip+0x0]        # 22c789 <CalculatePrefetchSourceLines+0x29>
  22c788:	00 
  22c789:	85 d2                	test   edx,edx
  22c78b:	75 18                	jne    22c7a5 <CalculatePrefetchSourceLines+0x45>
  22c78d:	40 0f b6 c6          	movzx  eax,sil
  22c791:	f2 0f 2a d0          	cvtsi2sd xmm2,eax
  22c795:	f2 0f 59 15 00 00 00 	mulsd  xmm2,QWORD PTR [rip+0x0]        # 22c79d <CalculatePrefetchSourceLines+0x3d>
  22c79c:	00 
  22c79d:	f2 0f 59 d0          	mulsd  xmm2,xmm0
  22c7a1:	f2 0f 58 ca          	addsd  xmm1,xmm2
  22c7a5:	4c 8b 65 10          	mov    r12,QWORD PTR [rbp+0x10]
  22c7a9:	f2 0f 59 0d 00 00 00 	mulsd  xmm1,QWORD PTR [rip+0x0]        # 22c7b1 <CalculatePrefetchSourceLines+0x51>
  22c7b0:	00 
  22c7b1:	0f 57 c0             	xorps  xmm0,xmm0
  22c7b4:	f2 0f 5a c1          	cvtsd2ss xmm0,xmm1
  22c7b8:	f3 0f 10 0d 00 00 00 	movss  xmm1,DWORD PTR [rip+0x0]        # 22c7c0 <CalculatePrefetchSourceLines+0x60>
  22c7bf:	00 
  22c7c0:	e8 00 00 00 00       	call   22c7c5 <CalculatePrefetchSourceLines+0x65>
  22c7c5:	f3 0f 5a c0          	cvtss2sd xmm0,xmm0
  22c7c9:	f2 41 0f 11 07       	movsd  QWORD PTR [r15],xmm0
  22c7ce:	80 bb e0 17 00 00 00 	cmp    BYTE PTR [rbx+0x17e0],0x0
  22c7d5:	74 59                	je     22c830 <CalculatePrefetchSourceLines+0xd0>
  22c7d7:	44 89 f0             	mov    eax,r14d
  22c7da:	0f 57 c9             	xorps  xmm1,xmm1
  22c7dd:	f2 48 0f 2a c8       	cvtsi2sd xmm1,rax
  22c7e2:	f2 0f 11 4d d0       	movsd  QWORD PTR [rbp-0x30],xmm1
  22c7e7:	f2 0f 5e c1          	divsd  xmm0,xmm1
  22c7eb:	f2 0f 5a c0          	cvtsd2ss xmm0,xmm0
  22c7ef:	f3 0f 10 0d 00 00 00 	movss  xmm1,DWORD PTR [rip+0x0]        # 22c7f7 <CalculatePrefetchSourceLines+0x97>
  22c7f6:	00 
  22c7f7:	e8 00 00 00 00       	call   22c7fc <CalculatePrefetchSourceLines+0x9c>
  22c7fc:	f3 48 0f 2c c8       	cvttss2si rcx,xmm0
  22c801:	41 89 0c 24          	mov    DWORD PTR [r12],ecx
  22c805:	f2 41 0f 10 07       	movsd  xmm0,QWORD PTR [r15]
  22c80a:	66 0f 2e 05 00 00 00 	ucomisd xmm0,QWORD PTR [rip+0x0]        # 22c812 <CalculatePrefetchSourceLines+0xb2>
  22c811:	00 
  22c812:	77 05                	ja     22c819 <CalculatePrefetchSourceLines+0xb9>
  22c814:	f2 0f 58 45 d0       	addsd  xmm0,QWORD PTR [rbp-0x30]
  22c819:	f2 0f 58 05 00 00 00 	addsd  xmm0,QWORD PTR [rip+0x0]        # 22c821 <CalculatePrefetchSourceLines+0xc1>
  22c820:	00 
  22c821:	f2 48 0f 2c c0       	cvttsd2si rax,xmm0
  22c826:	31 d2                	xor    edx,edx
  22c828:	41 f7 f6             	div    r14d
  22c82b:	e9 a0 00 00 00       	jmp    22c8d0 <CalculatePrefetchSourceLines+0x170>
  22c830:	f2 0f 58 05 00 00 00 	addsd  xmm0,QWORD PTR [rip+0x0]        # 22c838 <CalculatePrefetchSourceLines+0xd8>
  22c837:	00 
  22c838:	44 89 f0             	mov    eax,r14d
  22c83b:	0f 57 c9             	xorps  xmm1,xmm1
  22c83e:	f2 48 0f 2a c8       	cvtsi2sd xmm1,rax
  22c843:	66 0f 29 4d d0       	movapd XMMWORD PTR [rbp-0x30],xmm1      <============= here
  22c848:	f2 0f 5e c1          	divsd  xmm0,xmm1
  22c84c:	f2 0f 5a c0          	cvtsd2ss xmm0,xmm0
  22c850:	f3 0f 10 0d 00 00 00 	movss  xmm1,DWORD PTR [rip+0x0]        # 22c858 <CalculatePrefetchSourceLines+0xf8>
  22c857:	00 
  22c858:	e8 00 00 00 00       	call   22c85d <CalculatePrefetchSourceLines+0xfd>
  22c85d:	f3 0f 5a c0          	cvtss2sd xmm0,xmm0
  22c861:	f2 0f 10 0d 00 00 00 	movsd  xmm1,QWORD PTR [rip+0x0]        # 22c869 <CalculatePrefetchSourceLines+0x109>
  22c868:	00 
  22c869:	f2 0f 58 c1          	addsd  xmm0,xmm1
  22c86d:	f2 48 0f 2c c0       	cvttsd2si rax,xmm0
  22c872:	41 89 04 24          	mov    DWORD PTR [r12],eax
  22c876:	f2 41 0f 10 07       	movsd  xmm0,QWORD PTR [r15]
  22c87b:	66 0f 28 5d d0       	movapd xmm3,XMMWORD PTR [rbp-0x30]
  22c880:	f2 0f 58 d8          	addsd  xmm3,xmm0
  22c884:	f2 0f c2 c8 01       	cmpltsd xmm1,xmm0
  22c889:	66 0f 28 d1          	movapd xmm2,xmm1
  22c88d:	66 0f 55 d3          	andnpd xmm2,xmm3
  22c891:	66 0f 54 c8          	andpd  xmm1,xmm0
  22c895:	66 0f 56 ca          	orpd   xmm1,xmm2
  22c899:	f2 0f 58 0d 00 00 00 	addsd  xmm1,QWORD PTR [rip+0x0]        # 22c8a1 <CalculatePrefetchSourceLines+0x141>
  22c8a0:	00 
  22c8a1:	f2 48 0f 2c c1       	cvttsd2si rax,xmm1
  22c8a6:	31 d2                	xor    edx,edx
  22c8a8:	41 f7 f6             	div    r14d
  22c8ab:	0f 57 c0             	xorps  xmm0,xmm0
  22c8ae:	f2 48 0f 2a c2       	cvtsi2sd xmm0,rdx
  22c8b3:	0f 57 c9             	xorps  xmm1,xmm1
  22c8b6:	f2 0f 5a c8          	cvtsd2ss xmm1,xmm0
  22c8ba:	f3 0f 10 05 00 00 00 	movss  xmm0,DWORD PTR [rip+0x0]        # 22c8c2 <CalculatePrefetchSourceLines+0x162>
  22c8c1:	00 
  22c8c2:	e8 00 00 00 00       	call   22c8c7 <CalculatePrefetchSourceLines+0x167>
  22c8c7:	f3 48 0f 2c d0       	cvttss2si rdx,xmm0
  22c8cc:	41 8b 0c 24          	mov    ecx,DWORD PTR [r12]
  22c8d0:	41 0f af ce          	imul   ecx,r14d
  22c8d4:	01 d1                	add    ecx,edx
  22c8d6:	0f 57 c0             	xorps  xmm0,xmm0
  22c8d9:	f2 48 0f 2a c1       	cvtsi2sd xmm0,rcx
  22c8de:	48 83 c4 10          	add    rsp,0x10
  22c8e2:	5b                   	pop    rbx
  22c8e3:	41 5c                	pop    r12
  22c8e5:	41 5e                	pop    r14
  22c8e7:	41 5f                	pop    r15
  22c8e9:	5d                   	pop    rbp
  22c8ea:	c3                   	ret    
  22c8eb:	0f 1f 44 00 00       	nop    DWORD PTR [rax+rax*1+0x0]

rbp-30 is not aligned to 16 bytes boundary. Hopefully I didn't misunderstand anything?

@yshui yshui changed the title Unaligned movapd in CalculatePrefetchSourceLines Unaligned movapd in CalculatePrefetchSourceLines [amdgpu] Oct 8, 2019
@yshui
Copy link
Member Author

yshui commented Oct 8, 2019

File is drivers/gpu/drm/amd/display/dc/dml/dcn20/display_mode_vba_20v2.c

@nathanchance
Copy link
Member

I believe this is the runtime issue that @nickdesaulniers mentioned in IRC?

@nickdesaulniers
Copy link
Member

nickdesaulniers commented Oct 9, 2019

Yes, @yshui great job debugging. You've come to the same conclusions I did. While it seems that their driver is compiled w/ 16B stack alignment, the rest of the x86_64 kernel is compiled with 8B stack alignment (as a micro optimization). Thus, there are no guarantees that the stack is actually 16B aligned. I added the -msse2 flag to allow the AMDGPU driver to link (otherwise, clang would generate soft-fp calls to functions that would exist in userspace libgcc_s/compiler_rt, but aren't implemented in the kernel). -msse2 makes clang select instructions that have 16B stack alignment requirements (which now link, but result in GPF's at runtime; I didn't/don't have hardware to test on when I wrote the -msse2 patch). GCC at -msse -mno-sse2 <no further SIMD extensions> doesn't error for the use of float or double, but does make use of xmm registers (I guess without alignment requirements, though I'm unfamiliar with the differences between sse and sse2). Clang frequently crashes during instruction selection in the x86_64 backend when mixing float, double, and -mno-sse2. (Honestly, sse but no sse2 is kind of weird).

I don't know why their kernel uses a 16B stack alignment, since it's not possible to guarantee your stack is actually 16B aligned if called from 8B stack aligned translation units (without additional inline assembly to detect and realign such cases, which I would consider a hack upon a hack).

Even if the 16B alignment was removed from their driver, it would still be incorrect to use -msse2 w/ Clang. (because the sse2 instructions require 16B stack alignment). So that should be reverted at some point, though there's value in being able to at least build it and stop the warnings that are coming into the actively developed and large driver.

If we implemented the soft-fp routines in the kernel (there is precedent for lifting them from libgcc's source), we'd also need to remove the calls to kernel_fpu_begin()/kernel_fpu_end() because then we wouldn't be using the xmm registers in the kernel. We'd need multiple implementations since currently CONFIG_AMDGPU is available from non-x86_64 architectures, unless we'd want to disable it for those, though I suspect someone might be using it for arm64.

The soft-fp routines only make sense if the code is not performance critical. @Ajs1984 suspects it has something to do with setting mode lines; I'm not sure if that would be harmed by soft-fp, but it's worth a try. Once soft-fp routines were implemented, we could then remove the -mhard-float -msse that the driver is using for both compilers.

Finally, we could spend more time on x86_64 instruction selection w/ sse but no sse2, but it's a very limited use case that even LLVM's x86_64 backend maintainer had doubts about that approach.

@nickdesaulniers nickdesaulniers changed the title Unaligned movapd in CalculatePrefetchSourceLines [amdgpu] AMDGPU General Protection Faults Oct 9, 2019
@nickdesaulniers nickdesaulniers added [BUG] linux A bug that should be fixed in the mainline kernel. [BUG] llvm A bug that should be fixed in upstream LLVM labels Oct 9, 2019
@yshui
Copy link
Member Author

yshui commented Oct 9, 2019

According to the Makefile, clang should get the -mstack-alignment=16 flag. I think the stack should have been aligned to 16 bytes.

@yshui
Copy link
Member Author

yshui commented Oct 9, 2019

Hmm, the amdgpu module (specifically, the dc) is the only place the stack is 16 bytes aligned. All other parts of the kernel have 8 bytes alignment. How does this work with gcc? Is the stack pointer somehow aligned before calling the amdgpu functions?

Normally you would use -mincoming-stack-boundary, but I don't see that being used anywhere.

@yshui
Copy link
Member Author

yshui commented Oct 9, 2019

Probably need input from AMD people to understand.

@nickdesaulniers
Copy link
Member

nickdesaulniers commented Oct 9, 2019

How does this work with gcc?

I assume that either:

  1. sse[1] instructions do not have 16B stack alignment requirements or
  2. gcc at -msse -mno-sse2 does not select such instructions

Is the stack pointer somehow aligned before calling the amdgpu functions?

It is not, hence this GPF w/ Clang. You'd have to audit the translation unit boundaries, and annotate them. Without an automated way of doing so or validating them, it would very easy to break and would be a mental burden for AMDGPU driver developers to remember to do and check. Thus, I consider that brittle.

Normally you would use -mincoming-stack-boundary

Oh, interesting, I did not know about that flag. If clang implements it, I wonder if we can use it for just those translation units? (Normally, I wouldn't adjust the stack alignment, but such is life).

@yshui
Copy link
Member Author

yshui commented Oct 9, 2019

gcc at -msse -mno-sse2 does not select such instructions

If that is true, then passing -mpreferred-stack-boundary=4 to gcc would have been kind of pointless. (I raised question on #radeon@freenode, but I didn't get a reply).

@nickdesaulniers
Copy link
Member

If that is true, then passing -mpreferred-stack-boundary=4 to gcc would have been kind of pointless.

I agree, but maybe there's another reason it's used that I can't think of. I did not go read the commits that added it to see if maybe a good reason was provided, but that's a good place to start.

(I raised question on #radeon@freenode, but I didn't get a reply).

I've asked the AMDGPU maintainer to ask around and set up a call, and I've asked their partner engineer on an internal thread related to CrOS. I've also spoken in person with an AMDGPU LLVM developer about this FWIW.

I haven't heard back from them yet. I give them the benefit of the doubt, but should chase if I don't hear back soon.

@andyhhp
Copy link

andyhhp commented Oct 10, 2019

I'm unfamiliar with the differences between sse and sse2

To a very rough approximation, SSE is support for float[4], and SSE2 added support for double[2], as well as a replacement for MMX.

because the sse2 instructions require 16B stack alignment

Not quite. The aligned movs (MOVAPD in this case) require a 16b aligned memory operand, which in this case it doesn't get because it references a 16b object misaligned on an 8b stack.

An 8 byte alignment generally, with 16 in a callee is an ABI breakage, unless the function knows it need to cope with misaligned callers.

Either

  1. The global switch from 16=>8 alignment caused an ABI breakage by not fixing this driver or declaring the incoming stack alignment, or
  2. The incoming stack alignment is being specified, and the function prologue is not correct.

Code generation to use MOVAPD is correct in this case. It is on a temporary variable spilled to the stack, and because the stack is 16 bytes aligned, this is a safe instruction to schedule.

@nickdesaulniers
Copy link
Member

The aligned movs (MOVAPD in this case) require a 16b aligned memory operand, which in this case it doesn't get because it references a 16b object misaligned on an 8b stack.

Well put (more precise).

An 8 byte alignment generally, with 16 in a callee is an ABI breakage

Agreed.

The global switch from 16=>8 alignment caused an ABI breakage by not fixing this driver or declaring the incoming stack alignment, or

I believe the global switch occurred before the driver was upstreamed? Maybe it existed out of tree prior to the switch, and only for older kernels?

The incoming stack alignment is being specified, and the function prologue is not correct.

@yshui pointed out -mincoming-stack-boundary. Either:

  1. it's not be specified, or
  2. it is but clang doesn't support it (could be silently dropped via cc-option).
  3. it is and clang recognizes the flag but has bad codegen
$ cd linux-next
$ grep -r incoming-stack-boundary
$ echo $?
1

so 1, but also 2: https://godbolt.org/z/N9b67Q

@yshui
Copy link
Member Author

yshui commented Oct 11, 2019

What is the best solution here? It seems quite clear there is a bug in amdgpu at least, since it should have used -mincoming-stack-boundary.

So we need to:

  1. Add -mincoming-stack-boundary to amdgpu's Makefile
  2. Implement equivalent in clang
  3. Add clang equivalent flag to Makefile

?

@andyhhp
Copy link

andyhhp commented Oct 11, 2019

I don't immediately see anything in the driver which would necessitate the use of a 16 byte stack. I'd start by dropping that and seeing what explodes.

Once Clang sees that the stack is only 8b aligned, its no longer correct to schedule MOVAPD, and it will probably use MOVUPD instead which is the unaligned variant.

@andyhhp
Copy link

andyhhp commented Oct 11, 2019

so 1, but also 2: https://godbolt.org/z/N9b67Q

I was thinking more of https://godbolt.org/z/TeiLEr but yes. I don't see suitable prologue code being generated even in GCC.

@yshui
Copy link
Member Author

yshui commented Oct 11, 2019

@yshui
Copy link
Member Author

yshui commented Oct 11, 2019

-mincoming-stack-boundary= takes the same kind of arguments as -mpreferred-stack-boundary=, so for 8 bytes aligned you need to pass 3

@nickdesaulniers
Copy link
Member

What is the best solution here?
Add -mincoming-stack-boundary to amdgpu's Makefile
Implement equivalent in clang
Add clang equivalent flag to Makefile

This feels like papering over an ABI mismatch which is bad behavior we shouldn't bend over backwards to support IMO. While it is a possible route to a solution, I consider it suboptimal.

I'd start by dropping that and seeing what explodes.

Agreed.

its no longer correct to schedule MOVAPD, and it will probably use MOVUPD instead which is the unaligned variant.

Oh, good point, and if it doesn't, that's a bug in Clang.

I was thinking more of

Oh, I was just showing the flag being unrecognized. The code in the link was irrelevant.

I don't see suitable prologue code being generated even in GCC.

Should the flag be -mincoming-stack-boundary=8 (not -mincoming-stack-boundary=3)? I know GCC likes you to specify 2^X for the stack alignment flag. Not sure if this one is similar.

You both are making me so happy with your use of godbolt! 💟

@nickdesaulniers
Copy link
Member

nickdesaulniers commented Oct 11, 2019

diff --git a/drivers/gpu/drm/amd/display/dc/calcs/Makefile b/drivers/gpu/drm/amd/display/dc/calcs/Makefile
index 985633c08a26..3c3778de33b4 100644
--- a/drivers/gpu/drm/amd/display/dc/calcs/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/calcs/Makefile
@@ -24,13 +24,7 @@
 # It calculates Bandwidth and Watermarks values for HW programming
 #
 
-ifneq ($(call cc-option, -mpreferred-stack-boundary=4),)
-	cc_stack_align := -mpreferred-stack-boundary=4
-else ifneq ($(call cc-option, -mstack-alignment=16),)
-	cc_stack_align := -mstack-alignment=16
-endif
-
-calcs_ccflags := -mhard-float -msse $(cc_stack_align)
+calcs_ccflags := -mhard-float -msse
 
 ifdef CONFIG_CC_IS_CLANG
 calcs_ccflags += -msse2
diff --git a/drivers/gpu/drm/amd/display/dc/dcn20/Makefile b/drivers/gpu/drm/amd/display/dc/dcn20/Makefile
index ddb8d5649e79..13d1fd2bf8f8 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn20/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/dcn20/Makefile
@@ -10,13 +10,7 @@ ifdef CONFIG_DRM_AMD_DC_DSC_SUPPORT
 DCN20 += dcn20_dsc.o
 endif
 
-ifneq ($(call cc-option, -mpreferred-stack-boundary=4),)
-	cc_stack_align := -mpreferred-stack-boundary=4
-else ifneq ($(call cc-option, -mstack-alignment=16),)
-	cc_stack_align := -mstack-alignment=16
-endif
-
-CFLAGS_$(AMDDALPATH)/dc/dcn20/dcn20_resource.o := -mhard-float -msse $(cc_stack_align)
+CFLAGS_$(AMDDALPATH)/dc/dcn20/dcn20_resource.o := -mhard-float -msse
 
 ifdef CONFIG_CC_IS_CLANG
 CFLAGS_$(AMDDALPATH)/dc/dcn20/dcn20_resource.o += -msse2
diff --git a/drivers/gpu/drm/amd/display/dc/dcn21/Makefile b/drivers/gpu/drm/amd/display/dc/dcn21/Makefile
index ef673bffc241..c4042732c8a0 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn21/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/dcn21/Makefile
@@ -3,13 +3,7 @@
 
 DCN21 = dcn21_hubp.o dcn21_hubbub.o dcn21_resource.o
 
-ifneq ($(call cc-option, -mpreferred-stack-boundary=4),)
-	cc_stack_align := -mpreferred-stack-boundary=4
-else ifneq ($(call cc-option, -mstack-alignment=16),)
-	cc_stack_align := -mstack-alignment=16
-endif
-
-CFLAGS_$(AMDDALPATH)/dc/dcn21/dcn21_resource.o := -mhard-float -msse $(cc_stack_align)
+CFLAGS_$(AMDDALPATH)/dc/dcn21/dcn21_resource.o := -mhard-float -msse
 
 ifdef CONFIG_CC_IS_CLANG
 CFLAGS_$(AMDDALPATH)/dc/dcn21/dcn21_resource.o += -msse2
diff --git a/drivers/gpu/drm/amd/display/dc/dml/Makefile b/drivers/gpu/drm/amd/display/dc/dml/Makefile
index 5b2a65b42403..a604cdbaf317 100644
--- a/drivers/gpu/drm/amd/display/dc/dml/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/dml/Makefile
@@ -24,13 +24,7 @@
 # It provides the general basic services required by other DAL
 # subcomponents.
 
-ifneq ($(call cc-option, -mpreferred-stack-boundary=4),)
-	cc_stack_align := -mpreferred-stack-boundary=4
-else ifneq ($(call cc-option, -mstack-alignment=16),)
-	cc_stack_align := -mstack-alignment=16
-endif
-
-dml_ccflags := -mhard-float -msse $(cc_stack_align)
+dml_ccflags := -mhard-float -msse
 
 ifdef CONFIG_CC_IS_CLANG
 dml_ccflags += -msse2
diff --git a/drivers/gpu/drm/amd/display/dc/dsc/Makefile b/drivers/gpu/drm/amd/display/dc/dsc/Makefile
index b456cd23c6fa..8954ba2c284b 100644
--- a/drivers/gpu/drm/amd/display/dc/dsc/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/dsc/Makefile
@@ -1,13 +1,7 @@
 #
 # Makefile for the 'dsc' sub-component of DAL.
 
-ifneq ($(call cc-option, -mpreferred-stack-boundary=4),)
-	cc_stack_align := -mpreferred-stack-boundary=4
-else ifneq ($(call cc-option, -mstack-alignment=16),)
-	cc_stack_align := -mstack-alignment=16
-endif
-
-dsc_ccflags := -mhard-float -msse $(cc_stack_align)
+dsc_ccflags := -mhard-float -msse
 
 ifdef CONFIG_CC_IS_CLANG
 dsc_ccflags += -msse2

but I still see movaps 😭 :

$ llvm-objdump -d drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.o | grep movaps
     9b3: 0f 28 05 00 00 00 00         	movaps	(%rip), %xmm0
     fcc: 0f 28 05 00 00 00 00         	movaps	(%rip), %xmm0
$ llvm-objdump -d drivers/gpu/drm/amd/display/dc/calcs/dcn_calcs.o | grep movaps       
     d4c: 0f 28 c4                     	movaps	%xmm4, %xmm0
    1160: 0f 28 d0                     	movaps	%xmm0, %xmm2
...

@andyhhp
Copy link

andyhhp commented Oct 11, 2019

@andyhhp @nickdesaulniers https://godbolt.org/z/i-HQwU https://godbolt.org/z/XbdshX

Ah - much better.

You both are making me so happy with your use of godbolt! heart_decoration

Godbolt is awesome.

but I still see movaps sob :

Those 4 are all fine. The first two will be to objects store in .data etc, whose alignment doesn't depend on the stack at all. They are instruction-pointer relative accesses without outstanding relocations (hence the 0's) because the object files haven't been linked yet.

The second two are register to register moves, so have no alignment restrictions.

You'll want to grep for 'mova' because for the float variants alone, you've got mov{p,s}{s,d} which are {packed,scalar}{single,double}.

@yshui
Copy link
Member Author

yshui commented Oct 11, 2019

I have a kernel built without the stack alignment option, with clang. Seems to work fine so far.

@nickdesaulniers
Copy link
Member

nickdesaulniers commented Oct 12, 2019

Those 4 are all fine.

Ah, sorry, I should have included the full output, and used -r to show relocations.

They are instruction-pointer relative accesses without outstanding relocations (hence the 0's) because the object files haven't been linked yet.

You mean with outstanding relocations? If so, that makes sense, and I should have recognized the missing relocation info from the 4B's of zeros.

The second two are register to register moves, so have no alignment restrictions.

Right, but I think there were then stack references later on in the full output, IIRC. (Shall post full output). Maybe those had relocations as well though (I always forget -r to llvm-objdump).

I have a kernel built without the stack alignment option, with clang. Seems to work fine so far.

That's great news @yshui ! Can you test running it with Clang and GCC, and if it seems ok, I'm happy to send the above diff with your tested-by+debugged-by && @andyhhp 's suggested by tags (or whatever you all would like) to LKML.

@nickdesaulniers
Copy link
Member

nickdesaulniers commented Oct 12, 2019

(And if that works for both, I wonder if we can then further [re-]add -msse2 unconditionally for both compilers as we had originally. That's what I had submitted, but it was reverted when users w/ GCC built kernels reported "instability," which I'm guessing is the same GPF's as were seeing here due to ABI mismatch of stack alignment).

@nickdesaulniers nickdesaulniers added [ARCH] x86_64 This bug impacts ARCH=x86_64 [PATCH] Exists There is a patch that fixes this issue Inquiry sent An email to the maintainer/author has been sent and removed [BUG] llvm A bug that should be fixed in upstream LLVM labels Oct 12, 2019
@nickdesaulniers nickdesaulniers self-assigned this Oct 12, 2019
@andyhhp
Copy link

andyhhp commented Oct 12, 2019

You mean with outstanding relocations?

Yes I did. (Sorry - it was late here.)

If you want a S-by tag, I'm Andrew Cooper <andrew.cooper3@citrix.com>

@yshui
Copy link
Member Author

yshui commented Oct 12, 2019

@nickdesaulniers The gcc side seems to work fine as well, I didn't test adding -msse2.

On the clang side, I hit a GPU timeout, which I don't know if is related to this or not. I still haven't seen a GPF, which is good.

@nickdesaulniers
Copy link
Member

Great, thanks for the report. I've asked the AMD folks on our internal CrOS thread to test the diff as well. I'll send the patch by EOD, hopefully with 1 more tested by tag on it. 🏁

@nickdesaulniers
Copy link
Member

nickdesaulniers commented Oct 15, 2019

@yshui
Copy link
Member Author

yshui commented Oct 15, 2019

@nickdesaulniers Isn't enabling sse2 for gcc<7.1 also an option?

Edit: just tried, nope. https://godbolt.org/z/xXbjMr

@yshui
Copy link
Member Author

yshui commented Oct 15, 2019

Another question would be: is -msse absolutely required? Can we just remove it for gcc<7.1?

@nickdesaulniers
Copy link
Member

Another question would be: is -msse absolutely required?

Probably not. They use -mhard-float in the AMDGPU Makefile to seemingly prevent -mno-80387 from making GCC emit soft-fp calls (the way Clang did until I added -msse2).

@nickdesaulniers
Copy link
Member

nickdesaulniers commented Oct 16, 2019

@nickdesaulniers nickdesaulniers added [PATCH] Submitted A patch has been submitted for review and removed [PATCH] Exists There is a patch that fixes this issue Inquiry sent An email to the maintainer/author has been sent labels Oct 16, 2019
FireBurn pushed a commit to FireBurn/linux that referenced this issue Oct 27, 2019
The x86 kernel is compiled with an 8B stack alignment via
`-mpreferred-stack-boundary=3` for GCC since 3.6-rc1 via
commit d9b0cde ("x86-64, gcc: Use -mpreferred-stack-boundary=3 if supported")
or `-mstack-alignment=8` for Clang. Parts of the AMDGPU driver are
compiled with 16B stack alignment.

Generally, the stack alignment is part of the ABI. Linking together two
different translation units with differing stack alignment is dangerous,
particularly when the translation unit with the smaller stack alignment
makes calls into the translation unit with the larger stack alignment.
While 8B aligned stacks are sometimes also 16B aligned, they are not
always.

Multiple users have reported General Protection Faults (GPF) when using
the AMDGPU driver compiled with Clang. Clang is placing objects in stack
slots assuming the stack is 16B aligned, and selecting instructions that
require 16B aligned memory operands.

At runtime, syscall handlers with 8B aligned stack call into code that
assumes 16B stack alignment.  When the stack is a multiple of 8B but not
16B, these instructions result in a GPF.

Remove the code that added compatibility between the differing compiler
flags, as it will result in runtime GPFs when built with Clang. Cleanups
for GCC will be sent in later patches in the series.

Link: ClangBuiltLinux/linux#735
Tested-by: Shirish S <shirish.s@amd.com>
Debugged-by: Yuxuan Shui <yshuiv7@gmail.com>
Reported-by: Shirish S <shirish.s@amd.com>
Reported-by: Yuxuan Shui <yshuiv7@gmail.com>
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
ruscur pushed a commit to ruscur/linux that referenced this issue Oct 29, 2019
The x86 kernel is compiled with an 8B stack alignment via
`-mpreferred-stack-boundary=3` for GCC since 3.6-rc1 via
commit d9b0cde ("x86-64, gcc: Use -mpreferred-stack-boundary=3 if supported")
or `-mstack-alignment=8` for Clang. Parts of the AMDGPU driver are
compiled with 16B stack alignment.

Generally, the stack alignment is part of the ABI. Linking together two
different translation units with differing stack alignment is dangerous,
particularly when the translation unit with the smaller stack alignment
makes calls into the translation unit with the larger stack alignment.
While 8B aligned stacks are sometimes also 16B aligned, they are not
always.

Multiple users have reported General Protection Faults (GPF) when using
the AMDGPU driver compiled with Clang. Clang is placing objects in stack
slots assuming the stack is 16B aligned, and selecting instructions that
require 16B aligned memory operands.

At runtime, syscall handlers with 8B aligned stack call into code that
assumes 16B stack alignment.  When the stack is a multiple of 8B but not
16B, these instructions result in a GPF.

Remove the code that added compatibility between the differing compiler
flags, as it will result in runtime GPFs when built with Clang. Cleanups
for GCC will be sent in later patches in the series.

Link: ClangBuiltLinux#735
Tested-by: Shirish S <shirish.s@amd.com>
Debugged-by: Yuxuan Shui <yshuiv7@gmail.com>
Reported-by: Shirish S <shirish.s@amd.com>
Reported-by: Yuxuan Shui <yshuiv7@gmail.com>
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
@nathanchance
Copy link
Member

The patches have been accepted and presumably passed AMD's internal testing because they are on their way for 5.4: https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-fixes-5.4-2019-10-30

@nathanchance nathanchance added [PATCH] Accepted A submitted patch has been accepted upstream and removed [PATCH] Submitted A patch has been submitted for review labels Oct 31, 2019
@nathanchance
Copy link
Member

Nick's patches have been merged and released in 5.4-rc6.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[ARCH] x86_64 This bug impacts ARCH=x86_64 [BUG] linux A bug that should be fixed in the mainline kernel. [FIXED][LINUX] 5.4 This bug was fixed in Linux 5.4
Projects
None yet
Development

No branches or pull requests

4 participants