Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to reset GPUs after a crash #616

Closed
MoneroCrusher opened this issue Nov 19, 2018 · 8 comments
Closed

How to reset GPUs after a crash #616

MoneroCrusher opened this issue Nov 19, 2018 · 8 comments
Labels

Comments

@MoneroCrusher
Copy link

Hi there,

I usually run 12 GPUs in my system and it can happen that a GPU crashes.
I know there is a way to reset the GPU and/or amdgpu kernel module but I don't know how.
Could you please tell me how to put the whole AMD system back in "mint" condition like after a reboot?
Requirement: no reboot
@jlgreathouse @gstoner

Thanks

@jlgreathouse
Copy link
Collaborator

Hi @MoneroCrusher

Have you tried the gpu_recovery mechanism that is part of our GPU drivers now? In particular, if you set the kernel driver parameter amdgpu.gpu_recovery=1, then whenever there is a GPU timeout detected, it will automatically attempt to reset the GPU and bring it back up.

If you do not want to have these resets attempted automatically, you should be able to use the debugfs mechanism to perform a manual GPU reset, even if you have not set the gpu_recovery parameter. To do this, (as root) read /sys/kernel/debug/dri/N/amdgpu_gpu_recover. N in this case is the number of the GPU you wish to reset in the DRI subsystem.

@MoneroCrusher
Copy link
Author

MoneroCrusher commented Jan 1, 2019

Hi @jlgreathouse

Thanks for the potential solution.
I am currently not using ROCm because 11 out of 12 of my RX 550 GPUs don't get detected because of lacking PCIe atomics on the PCIe x1 slots.
So I am using AMDGPU-PRO 16.60 for now. Checking out the sysfs folder you mentioned, there was no file called /sys/kernel/debug/dri/N/amdgpu_gpu_recover, however, there was one called /sys/kernel/debug/dri/N/amdgpu_gpu_reset. Is a certain min kernel required for this?

What do you mean by "reading" it automatically resets the GPU (in the manual way)?
Just cat it or what (that didn't work, it just made my machine hang, (ssh'd in it and I had to physically go reset it)?
Also does your solution work for both core crashed GPUs as well as Mem crashed GPUs? And does it unload the faulty GPU from the amdgpu kernel? Because when I try to quit the program with the crashed GPU it's not possible to do so and I have to hard-reset the machine, not even a reboot works.

@jlgreathouse
Copy link
Collaborator

Hi @MoneroCrusher

I can't really offer any support for amdgpu-pro software here. I don't work with it, and this issue tracker is specifically for ROCm software. That said, in the open source amdgpu (which is also used as a base for amdgpu-pro), the name of that file was changed from amdgpu_gpu_reset in Linux 4.15 to amdgpu_gpu_recover in Linux 4.16. As shown above, the ROCm 2.0 rock-dkms driver uses the latter naming scheme, as a number of changes from post-4.15 have been pulled in for it.

It appears that the capability was initially made visible in Linux 4.6, but I can't speak towards whether it actually works well in kernels that old.

Yes, you should be able to just sudo cat /sys/kernel/debug/dri/N/amdgpu_gpu_reset. Maybe this reset mechanism doesn't work in that version of amdgpu-pro. The mechanism at play here is logically similar to completely resetting the GPU, so I suspect it should work for both types of lockups.

@MoneroCrusher
Copy link
Author

Hi @jlgreathouse

I have tried every kernel from 4.12 all the way to 4.20. I provocated a GPU crash by overclocking memory and also core too much, while undervolting and the running the miner program, the GPU would crash after about a minute and the system and ssh access remain intact. I then tried to cat the file but upon doing that the system would immediately freeze. Not even any kernel panics would be displayed on the screen hooked up to the worker, but the screen would just freeze on the spot.

I also changed the core clock of the lowest SCLK table from 200 something to 800SCLK without ever crashing the GPU, so amdgpu_pm_info would show 800 SCLK. I then read the file again, it read successfully and I called pm info again but the SCLK would still be at 800. I would expect a GPU reset to truly reset it, in this case it didn't?

But I found another way: echo 1 > sys/class/drm/cardN/device/remove and then echo 1 > sys/bus/pci/rescan would bring it back in mint condition and with all mods stripped (OC; UV), just like after a reboot. But unfortunately it doesn't do anything when the GPU is crashed, it just waits there indefinitely and doesn't remove it.

I feel like this is a general Linux/AMD GPU issue and not related to drivers, but maybe I'm wrong. Anyhow, I hope we can find a way to do this because it would add big value to the community because I've seen so many posts and complaints about this on dozens of forums.

I also tried AMDGPU-PRO 17.40 and it didn't make a difference. If necessary I can also test ROCm on one GPU but I doubt it would be any different.

Can you recreate the issue on your side?

@kentrussell
Copy link
Collaborator

There were some bugs in the GPU reset code previously (that have been addressed upstream), but I have tested it with the latest amd-staging-drm-next code and it seems to be working correctly when I caused a VM Fault. Can you give it a shot on 2.2 and see if the issue persists? If so, can you grab a dmesg and copy it in?

The other issue is that the GPU might be stuck due to how you crashed it, since it might not be the traditional "wptr!=rptr" check caused by your undervolting, and could be that the GPU is stuck in a hardware loop, which the GPU reset cannot address, since it's being undervolted. It could be a HW limitation, and thus wouldn't be able to utilize the amdgpu_gpu_recover functionality.

@kentrussell
Copy link
Collaborator

Also, I will be adding a --gpureset flag to the SMI, which will hopefully make it into 2.3, which does the "cat amdgpu_gpu_recover" command. It doesn't work for all GPU hangs, but definitely works for some (from experience)

@kentrussell
Copy link
Collaborator

GPU Reset is available from 2.3.

@Tectract
Copy link

Tectract commented Oct 6, 2023

gpu_reset seems to kill my user session :/ Is there a way to gpu_reset and restore my old user session?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants