How to reset GPUs after a crash #616
Comments
Have you tried the If you do not want to have these resets attempted automatically, you should be able to use the debugfs mechanism to perform a manual GPU reset, even if you have not set the |
Thanks for the potential solution. What do you mean by "reading" it automatically resets the GPU (in the manual way)? |
I can't really offer any support for amdgpu-pro software here. I don't work with it, and this issue tracker is specifically for ROCm software. That said, in the open source amdgpu (which is also used as a base for amdgpu-pro), the name of that file was changed from It appears that the capability was initially made visible in Linux 4.6, but I can't speak to whether it actually works well in kernels that old. Yes, you should be able to just |
I have tried every kernel from 4.12 all the way to 4.20. I provoked a GPU crash by overclocking the memory and the core too far while undervolting, and then running the miner program. The GPU would crash after about a minute, while the system and SSH access remained intact. I then tried to cat the file, but upon doing that the system would immediately freeze. No kernel panic was displayed on the screen hooked up to the worker; the screen just froze on the spot.
I also changed the core clock of the lowest SCLK table entry from 200-something to 800 without ever crashing the GPU, so amdgpu_pm_info showed an SCLK of 800. I then read the file again and it read successfully, but when I checked amdgpu_pm_info again the SCLK was still at 800. I would expect a GPU reset to truly reset it, but in this case it didn't.
But I found another way: echo 1 > /sys/class/drm/cardN/device/remove followed by echo 1 > /sys/bus/pci/rescan brings the card back in mint condition with all mods (OC, UV) stripped, just like after a reboot. Unfortunately it doesn't do anything when the GPU has crashed: the write just waits there indefinitely and the device is never removed.
I feel like this is a general Linux/AMD GPU issue and not related to drivers, but maybe I'm wrong. Anyhow, I hope we can find a way to do this, because it would add big value for the community: I've seen so many posts and complaints about this on dozens of forums. I also tried AMDGPU-PRO 17.40 and it didn't make a difference. If necessary I can also test ROCm on one GPU, but I doubt it would be any different. Can you recreate the issue on your side? |
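For anyone who wants to try the remove-and-rescan sequence from the comment above, here is a minimal sketch of it as a shell function. It assumes sysfs is mounted at /sys and that it runs as root. The function name `remove_and_rescan` and the `SYSFS_ROOT` and `RESCAN_DELAY` variables are made up for illustration (they let the function be dry-run against a scratch directory). As reported above, the write to `remove` may block indefinitely if the GPU is already hung, so this is not a fix for that case.

```shell
#!/bin/sh
# Sketch only: detach one GPU from the PCI bus, then ask the kernel to
# rediscover it. SYSFS_ROOT and RESCAN_DELAY are hypothetical overrides,
# not part of any driver or tool.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

remove_and_rescan() {
    card="$1"    # DRM card index, e.g. 0 for card0
    remove_node="$SYSFS_ROOT/class/drm/card$card/device/remove"
    rescan_node="$SYSFS_ROOT/bus/pci/rescan"

    if [ ! -e "$remove_node" ]; then
        echo "no such device node: $remove_node" >&2
        return 1
    fi

    # Detach the device; this is the step that can hang on a crashed GPU.
    echo 1 > "$remove_node" || return 1

    # Give the kernel a moment before rescanning the bus.
    sleep "${RESCAN_DELAY:-1}"

    # Rediscover devices; the driver re-probes the card from scratch.
    echo 1 > "$rescan_node"
}
```

Usage would be something like `remove_and_rescan 0` for card0, run as root.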
There were some bugs in the GPU reset code previously (they have since been addressed upstream), but I have tested it with the latest amd-staging-drm-next code and it seems to work correctly when I cause a VM fault. Can you give it a shot on 2.2 and see if the issue persists? If so, can you grab a dmesg and copy it in? The other issue is that the GPU might be stuck because of how you crashed it. The hang caused by your undervolting might not be the kind caught by the traditional "wptr!=rptr" check; the GPU could instead be stuck in a hardware loop, which the GPU reset cannot address while the card is being undervolted. It could be a HW limitation, in which case the amdgpu_gpu_recover functionality wouldn't be able to help. |
Also, I will be adding a --gpureset flag to the SMI that runs the "cat amdgpu_gpu_recover" command; it will hopefully make it into 2.3. It doesn't work for all GPU hangs, but it definitely works for some (from experience). |
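Until that flag is available, the manual equivalent is to read the amdgpu_gpu_recover file yourself. Below is a rough sketch, assuming debugfs is mounted at /sys/kernel/debug, the script runs as root, and the file sits under `dri/<index>/` for the card in question (the thread does not spell out the path, so treat it as an assumption). The function name `trigger_gpu_recover` and the `DEBUGFS_DRI` variable are made up for illustration.

```shell
#!/bin/sh
# Sketch only: ask amdgpu to attempt a GPU recovery by reading its
# debugfs file. DEBUGFS_DRI is a hypothetical override so the function
# can be pointed at a scratch directory for a dry run.
DEBUGFS_DRI="${DEBUGFS_DRI:-/sys/kernel/debug/dri}"

trigger_gpu_recover() {
    index="$1"    # DRI index of the card, e.g. 0
    node="$DEBUGFS_DRI/$index/amdgpu_gpu_recover"

    if [ ! -r "$node" ]; then
        echo "cannot read $node (is debugfs mounted, are you root?)" >&2
        return 1
    fi

    # Reading the file is what kicks off the recovery attempt.
    cat "$node"
}
```

For example, `trigger_gpu_recover 0` followed by `dmesg | tail` would show what the driver logged. As noted above, this does not recover every kind of hang.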
GPU Reset is available from 2.3. |
gpu_reset seems to kill my user session :/ Is there a way to gpu_reset and restore my old user session? |
Hi there,
I usually run 12 GPUs in my system and it can happen that a GPU crashes.
I know there is a way to reset the GPU and/or amdgpu kernel module but I don't know how.
Could you please tell me how to put the whole AMD system back in "mint" condition like after a reboot?
Requirement: no reboot
@jlgreathouse @gstoner
Thanks