4.11.4 not working on R7 370 (GCN 1.0) #8

Closed
magicmyth opened this issue Jun 11, 2017 · 12 comments

Comments

@magicmyth

magicmyth commented Jun 11, 2017

Previous kernel versions up to and including 4.11.3+ would all boot into Xorg just fine but since 4.11.4+ sddm keeps crashing and Xorg log shows issues with loading the amdgpu module. Booting the same kernel using the "radeon" kernel module works just fine. All 4.9 series are working fine.

I've attached my Xorg log. Let me know if I can provide any other debugging information.

Xorg-amdgpu.0.txt

For clarity:
GPU: AMD Radeon R7 370 (XFX Xtreme Black Edition OC)
Good kernels:

  • 4.9.X series
  • 4.11.[0 - 3]

Tested broken kernels:

  • 4.11.4+
  • 4.11.5+
  • 4.11.8+
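
The good/bad lists above already pin the regression to one kernel step. A minimal sketch of finding that boundary follows, assuming version strings of the form `4.11.3+`; the helper names `version_key` and `regression_window` are made up for illustration and are not part of any kernel tooling.

```python
def version_key(version):
    """Turn '4.11.3+' into (4, 11, 3) so versions sort numerically."""
    return tuple(int(part) for part in version.rstrip("+").split("."))


def regression_window(good, bad):
    """Return (last_good, first_bad) on the branch of the earliest bad kernel."""
    first_bad = min(bad, key=version_key)
    branch = version_key(first_bad)[:2]  # e.g. (4, 11)
    older_good = [
        g for g in good
        if version_key(g)[:2] == branch and version_key(g) < version_key(first_bad)
    ]
    last_good = max(older_good, key=version_key) if older_good else None
    return last_good, first_bad


# Versions taken from this report; 4.9.31+ is on a different branch and is ignored.
good = ["4.9.31+", "4.11.1", "4.11.2+", "4.11.3+"]
bad = ["4.11.4+", "4.11.5+", "4.11.8+"]
print(regression_window(good, bad))
```

Restricting the search to the branch of the first bad kernel keeps working 4.9 kernels from being mistaken for the last good version.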
M-Bab added a commit that referenced this issue Jun 12, 2017
To test if it resolves the issue #8
@M-Bab
Owner

M-Bab commented Jun 12, 2017

Okay, this is a severe issue, although it works with my R9 380. The 4.11.4+ kernel was built directly before a couple of commits were made to the amd-staging kernel, so I merged the latest state, built, and uploaded again. Can you try again to see whether this fixes the issue?
Otherwise I will forward your report of the regression.

@magicmyth
Author

Unfortunately that did not fix it. I had not tested the latest 4.9 previously but just did and 4.9.31+ kernel boots just fine. However, I forgot to mention previously that since 4.11.3+ Vulkan apps hang my GPU. This happens with the 4.9.31+ kernel as well. 4.11.1 worked just fine. I've attached my dmesg log that I got during a hang from Mad Max Vulkan version (which I got via SSH).
dmesg-vulkan-hand.txt

@M-Bab
Owner

M-Bab commented Jun 13, 2017

Sorry to hear that, but thanks for testing. We should keep the issues from getting too messy: please state the precise kernel version where the problem started, like:
Xorg crashes while loading amdgpu module, R7 370, worked until kernel 4.9.XX & 4.11.XX and stopped working with kernel 4.9.XX & 4.11.XX.

The Vulkan problem sounds like it stopped working with a different kernel version, so it is a different regression. We should handle it in a separate issue. If I got this right, please open a new issue for that problem.

I will report these issues upstream later today.

@magicmyth
Author

Thanks for getting back so quick.

When I get the chance I'll test out 4.11.2+ and 4.11.1+ again to verify where the Vulkan issue starts and create a new issue. Unfortunately the latest Mesa from the Padoka PPA has broken radeonsi for me in general (the radeonsi_dri interface seems to be having issues). I'll wait a bit for that PPA to update, and if it's still broken or takes a while I'll roll back to the stable Mesa branch and test everything out.

@M-Bab
Owner

M-Bab commented Jun 16, 2017

New kernel, new chance? Can you test again with 4.11.5+?

@magicmyth
Author

Unfortunately amdgpu is still crashing on 4.11.5+. However, I have now tested with 4.11.2+ and can confirm that it not only boots just fine, but Vulkan as well as OpenGL apps are working too. I also tested 4.9.32+ and everything is working well there as well. It may have been a radv issue that caused the Vulkan problems previously, as my Mesa drivers have since been updated, though I was not expecting the userspace drivers to cause a full GPU crash. If it would be useful, I could test 4.9.31+ again to confirm whether Mesa was the cause and report it as a separate issue?

I'll update the original report with the new info.

FYI I did notice some errors in dmesg with 4.11.2 that I don't think I've seen before:

amdgpu 0000:04:00.0: GPU fault detected: 147 0x000ec802
[ 1335.750727] amdgpu 0000:04:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 1335.750729] amdgpu 0000:04:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E0C8002
[ 1335.750731] amdgpu 0000:04:00.0: VM fault (0x02, vmid 7) at page 0, read from '' (0x00000000) (200)
[ 1341.270164] amdgpu 0000:04:00.0: GPU fault detected: 147 0x0002c802
[ 1341.270168] amdgpu 0000:04:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 1341.270170] amdgpu 0000:04:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x020C8002
[ 1341.270173] amdgpu 0000:04:00.0: VM fault (0x02, vmid 1) at page 0, read from '' (0x00000000) (200)
[ 1345.295741] amdgpu 0000:04:00.0: GPU fault detected: 147 0x000ec802
[ 1345.295745] amdgpu 0000:04:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[ 1345.295747] amdgpu 0000:04:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E0C8002
[ 1345.295750] amdgpu 0000:04:00.0: VM fault (0x02, vmid 7) at page 0, read from '' (0x00000000) (200)

Everything was working OK though. I've not seen that warning with 4.9.32+.
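
For comparing runs, the VM fault lines above can be tallied per vmid. This is only a rough sketch that assumes the `VM fault (..., vmid N) at page ...` wording shown in the excerpt; the function name `faults_by_vmid` is hypothetical, and the code does not try to decode the fault status registers.

```python
import re
from collections import Counter

# Matches the summary line format seen in the dmesg excerpt above.
VM_FAULT = re.compile(r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+)\) at page (\d+)")


def faults_by_vmid(dmesg_text):
    """Count VM fault summary lines per vmid in a dmesg dump."""
    counts = Counter()
    for match in VM_FAULT.finditer(dmesg_text):
        counts[int(match.group(2))] += 1
    return dict(counts)


sample = """\
[ 1335.750731] amdgpu 0000:04:00.0: VM fault (0x02, vmid 7) at page 0, read from '' (0x00000000) (200)
[ 1341.270173] amdgpu 0000:04:00.0: VM fault (0x02, vmid 1) at page 0, read from '' (0x00000000) (200)
[ 1345.295750] amdgpu 0000:04:00.0: VM fault (0x02, vmid 7) at page 0, read from '' (0x00000000) (200)
"""
print(faults_by_vmid(sample))
```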

@M-Bab
Owner

M-Bab commented Jun 18, 2017

Basically there is no new code from the amd-staging side in the 4.9 kernel since 4.9.28+. In this particular case the retest would still be interesting, because 4.9.32 changed something in drivers/gpu/drm/amd/amdgpu/ci_dpm.c (though I would still be surprised if that were the cause).

I expect this to be a radv issue. If you can confirm that, I will only report your Xorg crash since 4.11.4+ upstream.

@magicmyth
Author

Sorry for taking so long. I tested 4.9.31+ and it's working fine. It seems the hang was related to the Mesa/libdrm/radv changes.

I've now tested 4.11.8+ and it's still not working for me using the amdgpu driver.

@M-Bab
Owner

M-Bab commented Jul 9, 2017

Thanks for your endurance and detailed testing. I reported the 4.11.X+ kernel issue upstream to the AMD graphics developers mailing list. Hopefully they find the reason for the regression.

If you are working on Ubuntu you can also try the new kernel variant with Ubuntu additions. For example, the vanilla-based kernel didn't play well with the apparmor tools.

@M-Bab
Owner

M-Bab commented Jul 18, 2017

Got feedback from the AMD developers - they would like to have a dmesg log after boot/crash. If it was just a conflict with the radeon module this should be obsolete now because I don't build the radeon module in the kernels anymore.

@piet8stevens

piet8stevens commented Jul 21, 2017

Hi, first of all thank you for this work - I would really like to avoid installing the amdgpupro driver. So, I have tried your 4.9.38 and 4.11.11+ kernels on 16.04 but unfortunately with no success. Now, I am not so familiar with this kind of testing/modifications. If I needed to open a new issue for this, just let me know, but it kind of looks similar to the main issue in this thread.

Visible sign of the issue: Xorg does not start up. I can open a console but not log in via the graphical user interface. Black screen with cursor in upper left hand corner for both 4.9.38 and 4.11.11+
My hardware: radeon HD 7750, amd64 Ubuntu 16.04.

When I start up my system with 4.9.0-040900-generic #201612111631 kernel, I have no graphics issue (and no sound).

In the Xorg.0.log with 4.11.11+, I see the following (EE) lines:

[     4.774] (EE) AMDGPU(0): amdgpu_device_initialize failed
[     4.775] (EE) AMDGPU(G0): amdgpu_device_initialize failed
[     4.776] (EE) AMDGPU(1): amdgpu_device_initialize failed
[     4.788] (EE) Screen 0 deleted because of no matching config section.
[     4.788] (EE) Screen 0 deleted because of no matching config section.
[     4.843] (EE) RADEON(0): glamor detected, failed to initialize EGL.
[     5.393] (EE) RADEON(0): failed to initialise surface manager
[     5.393] (EE) RADEON(0): radeon_setup_kernel_mem failed
[     5.393] (EE) 
[     5.393] (EE) AddScreen/ScreenInit failed for driver 0
[     5.393] (EE) 
[     5.393] (EE) 
[     5.393] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[     5.393] (EE) 
[     5.396] (EE) Server terminated with error (1). Closing log file.
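
Pulling the (EE) lines out of a full Xorg log can be scripted. The sketch below assumes the usual `[ time] (EE) message` line shape of Xorg.0.log; the function name `xorg_errors` is hypothetical, and empty (EE) spacer lines are skipped.

```python
import re

# "[     4.774] (EE) message" -> timestamp and message
EE_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+\(EE\)\s*(.*)$")


def xorg_errors(log_text):
    """Return (timestamp, message) pairs for non-empty (EE) lines."""
    errors = []
    for line in log_text.splitlines():
        match = EE_LINE.match(line.strip())
        if match and match.group(2):
            errors.append((float(match.group(1)), match.group(2)))
    return errors


sample = """\
[     4.774] (EE) AMDGPU(0): amdgpu_device_initialize failed
[     4.780] (II) some informational line
[     5.393] (EE) 
[     5.393] (EE) AddScreen/ScreenInit failed for driver 0
"""
for timestamp, message in xorg_errors(sample):
    print(timestamp, message)
```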

If you need more info, please let me know what. I can do testing if you let me know what to do.

EDIT: after reading some of the comments above, I also generated a 4.11.3+ kernel by going through your older commits and that worked with my 16.04. I also have a 17.04 on this machine and will try that tomorrow.
EDIT1: OK, very interesting - took me a while to figure out how to generate additional grub entries for a dual-boot system. 4.11.11+ works on my Ubuntu 17.04. I am not sure why it would work on 17.04 and not on 16.04.

@M-Bab
Owner

M-Bab commented Jul 22, 2017

Looks like I missed an important change that caused all the hassle. There is a new way to switch between radeon and amdgpu and it is not via blacklisting anymore.

Users of GCN 1.0 and GCN 1.1 cards have to add "radeon.si_support=0 radeon.cik_support=0 amdgpu.si_support=1 amdgpu.cik_support=1" to their kernel boot line (e.g. via GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub).
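
As an illustration of that edit, here is a minimal sketch that merges the four parameters into the GRUB_CMDLINE_LINUX_DEFAULT line of a grub defaults file. It works on text only and assumes the value is wrapped in double quotes; the function name `add_params` is hypothetical, and writing the result back to /etc/default/grub is left to the reader.

```python
import re

# Parameters quoted in the comment above.
REQUIRED_PARAMS = [
    "radeon.si_support=0",
    "radeon.cik_support=0",
    "amdgpu.si_support=1",
    "amdgpu.cik_support=1",
]

CMDLINE = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE)


def add_params(grub_text, params=REQUIRED_PARAMS):
    """Return grub_text with params merged into GRUB_CMDLINE_LINUX_DEFAULT."""
    keys = {p.split("=", 1)[0] for p in params}

    def merge(match):
        current = match.group(2).split()
        # Drop older settings for the same keys so the new values win.
        kept = [c for c in current if c.split("=", 1)[0] not in keys]
        return match.group(1) + " ".join(kept + list(params)) + match.group(3)

    new_text, count = CMDLINE.subn(merge, grub_text)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT line not found")
    return new_text


sample = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_params(sample))
```

On Ubuntu the changed file normally takes effect only after running update-grub and rebooting.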

Updated the Readme for this additional "feature". I assume this should fix this issue and close it. If there are still problems please open a fresh issue with Xorg and dmesg logs.

@M-Bab M-Bab closed this as completed Jul 22, 2017