steam crashes amdgpu with mesa 20.0.8 #92807

Closed
ashkitten opened this issue Jul 9, 2020 · 10 comments · Fixed by #92977

Comments

@ashkitten
Contributor

Describe the bug
running steam with mesa 20.0.8 on an amdgpu system freezes the entire display; dmesg shows fence and ring timeouts.

To Reproduce
Steps to reproduce the behavior:

  1. run steam on nixos-unstable
  2. entire display freezes up, leaving only the mouse able to move

dmesg output:

[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
show_signal_msg: 24 callbacks suppressed
GpuWatchdog[27365]: segfault at 0 ip 00007f271e83623d sp 00007f27038c5760 error 6 in libcef.so[7f271aab0000+69a4000]
Code: 00 79 09 48 8b 7d a0 e8 01 80 c1 02 41 8b 85 00 01 00 00 85 c0 0f 84 ab 00 00 00 49 8b 45 00 4c 89 ef be 01 00 00 00 ff 50 58 <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 a1 a5 37 03 01 80 bd 7f ff
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, but soft recovered

Additional context
amdgpu recovers without a hard reboot on kernel 5.7.7, but not on 5.4.50. the kernel version doesn't prevent the crash itself, though.

overriding hardware.opengl.package{,32} with mesa-20.1.3 does fix the issue on my machine

Notify maintainers
@primeos @vcunat

Metadata

  • system: "x86_64-linux"
  • host os: Linux 5.7.7, NixOS, 20.09pre-git (Nightingale)
  • multi-user?: yes
  • sandbox: yes
  • version: nix-env (Nix) 2.3.6
  • channels(root): ""
  • channels(ash): "nixos-19.09-19.09.809.5000b1478a1"
  • nixpkgs: /nix/store/3v5m83bfhwjy0k2y4yblh01cvqv00igr-nixpkgs

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
 - mesa
# a list of nixos modules affected by the problem
module:
@c0deaddict
Member

I also have this problem. Could you share how you have overridden mesa?

@ashkitten
Contributor Author

@c0deaddict ashkitten/nixos-config@a1f0312

@kira-bruneau
Contributor

kira-bruneau commented Jul 9, 2020

I don't seem to have this problem for some reason:

  • system: "x86_64-linux"
  • host os: Linux 5.7.7, NixOS, 20.09pre233323.dc80d7bc4a2 (Nightingale)
  • multi-user?: yes
  • sandbox: yes
  • version: nix-env (Nix) 2.3.6
  • channels(root): "nixos-20.09pre233323.dc80d7bc4a2"
  • channels(kira): "home-manager, nixos-20.03-20.03.2491.6a00eba02a3, nixos-unstable-20.09pre233323.dc80d7bc4a2, nixpkgs-unstable-20.09pre233849.1d801806827, nixpkgs-20.03-20.03.1812.14dd961b8d5"
  • nixpkgs: /nix/var/nix/profiles/per-user/root/channels/nixos

I'm using a Radeon RX 590 with OpenGL version string: 4.6 (Compatibility Profile) Mesa 20.0.8

@c0deaddict
Member

c0deaddict commented Jul 10, 2020

Thanks @ashkitten! With your fix I can start steam again 💯 Dirt Rally 2 still crashes, but Half Life 2 and Portal work. I did not have time to test more.

PS. I have a Radeon RX 5700XT

@jansol
Contributor

jansol commented Jul 11, 2020

Hmm, not sure if this is related, but while steam itself starts, I get a similar error when trying to launch anything that uses vulkan (so basically anything that runs on proton). Steam works fine, as do any games that use OpenGL for rendering (forcing the OpenGL-based WineD3D makes other games run too, but that is not a feasible workaround for performance reasons). With mesa 20.0.8 it corrupts the screen and messes up the X session (the cursor moves, but no interaction is possible); switching to a different console works, and I can kill the X session from there to recover. With mesa 20.1.3 vulkan applications simply crash. Trying to run a vulkan triangle tutorial, I get this in stdout:

ac_rtld error: !part->elf
ELF error: (null)
Segmentation fault (core dumped)

and dmesg has these lines:

triangle[4080]: segfault at d4 ip 00007fd5ad3a8d36 sp 00007fff9c4a2f40 error 4 in libvulkan_radeon.so[7fd5ad311000+3e6000]
Code: 4c 8b a3 68 04 00 00 31 f6 31 ff 4d 8b 95 50 1e 00 00 48 8d 8b 40 04 00 00 c7 83 c0 05 00 00 00 b9 00 00 4c 8d 83 70 04 00 00 <41> 0f b6 84 24 d4 00 00 00 41 b9 00 01 00 00 08 83 38 04 00 00 58

EDIT: Radeon RX5700XT here as well

@c0deaddict
Member

I reverted mesa to 20.0.2; with that version at least Dirt Rally 2 works again (not sure if it uses Vulkan).

@jansol
Contributor

jansol commented Jul 11, 2020

According to steam, Dirt Rally 2.0 requires DirectX 11, which gets translated to Vulkan by DXVK unless you put PROTON_USE_WINED3D=1 %command% in your launch options (with that env variable set, it uses the OpenGL-based WineD3D backend). If the game works that way (performance aside), it is likely a RADV issue separate from the steam crashes.

@primeos primeos mentioned this issue Jul 12, 2020
@primeos primeos linked a pull request Jul 12, 2020 that will close this issue
@primeos
Member

primeos commented Jul 25, 2020

Should be fixed with the next channel update as 0e93ae3 is now in master. Thanks for the bug report.

@primeos primeos closed this as completed Jul 25, 2020
@corngood
Contributor

Upgrading to 20.1.3/4 didn't fix this for me. I had to remove ~/.cache/radv_builtin_shaders*. Satisfactory was crashing 100% of the time with ac_rtld error: !part->elf, and now it works...
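
For anyone who wants to script the cache cleanup, here is a minimal sketch in Python. The function name `clear_radv_builtin_cache` and the `dry_run` flag are made up for illustration; only the `radv_builtin_shaders*` pattern and the `~/.cache` location come from this thread. It defaults to a dry run so nothing is deleted by accident.

```python
from pathlib import Path
from typing import List


def clear_radv_builtin_cache(cache_dir: Path, dry_run: bool = True) -> List[Path]:
    """List files named radv_builtin_shaders* in cache_dir and optionally delete them.

    Hypothetical helper: the name and the dry_run flag are illustrative only.
    """
    # Only regular files directly inside cache_dir are considered.
    matches = sorted(p for p in cache_dir.glob("radv_builtin_shaders*") if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling `clear_radv_builtin_cache(Path.home() / ".cache")` lists the candidates; passing `dry_run=False` removes them. The driver should recreate the files the next time a Vulkan application starts.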

I have a feeling caching is broken across the board in mesa right now due to disk_cache_get_function_identifier falling back to timestamps, which are 0 in the nix store.

@corngood corngood mentioned this issue Jul 27, 2020
@corngood
Contributor

I did some investigating and fortunately it does seem to be limited to radv, because radv isn't currently built with --build-id=sha1.

My proposed fix is in #93946.

primeos pushed a commit that referenced this issue Jul 28, 2020
Without this, the radv cache uuid would fall back to using the
timestamps of the radv and llvm shared libraries, which are fixed in
/nix/store.  This caused cache collisions, which resulted in crashes
(e.g. #92807).
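
To make the collision concrete, here is a simplified model in Python. It is not Mesa's actual code: `cache_identifier` is a hypothetical stand-in for the "use the build-id if present, otherwise fall back to the library timestamp" behaviour described above, and the build-id inputs are invented placeholders.

```python
import hashlib
from typing import Optional

# Assumption: every file in the store carries the same fixed mtime.
# The exact value does not matter here, only that it never changes.
FIXED_STORE_MTIME = 1


def cache_identifier(build_id: Optional[bytes], mtime: int) -> str:
    """Hypothetical cache key: prefer the build-id, else fall back to the mtime."""
    h = hashlib.sha1()
    if build_id is not None:
        h.update(b"build-id:" + build_id)
    else:
        h.update(b"mtime:" + str(mtime).encode())
    return h.hexdigest()


# Two different driver builds without a build-id get the same key,
# so the newer driver reads cache entries written by the older one.
old_key = cache_identifier(None, FIXED_STORE_MTIME)
new_key = cache_identifier(None, FIXED_STORE_MTIME)
collides = old_key == new_key

# With a build-id (placeholder bytes here), the keys differ per build.
old_key_with_id = cache_identifier(hashlib.sha1(b"radv build A").digest(), FIXED_STORE_MTIME)
new_key_with_id = cache_identifier(hashlib.sha1(b"radv build B").digest(), FIXED_STORE_MTIME)
distinct = old_key_with_id != new_key_with_id
```

In this model `collides` is true and `distinct` is true, which matches the reasoning in the commit message: once the library carries a build-id, each build gets its own cache namespace even though the timestamps are identical.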

7 participants