GPU does not clock down after test load #621
Comments
The workload mentioned in #432 triggers the same behavior.
Hi @Bengt, can you provide the log after setting the following:
Hi @sunway513, running
During the test run, the last LED of the Fury X turns off, indicating a clock down to pstate 6 (1018 MHz vs. 1050 MHz in pstate 7). During the test run on the Vega 64, none of the LEDs turn off, so I suspect the clock management does not hit a thermal or power limit. After the run, the Vega 64 stays in pstate 7 and produces horrible coil whine. Executing

Regards,
Hi @Bengt, can you try to force the perf level to be "auto"?
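For reference, forcing the performance level back to "auto" can be done through the amdgpu sysfs interface or via ROCm SMI. This is a sketch: the `card0` index is an assumption and may differ on your system.

```shell
# Assumption: the Vega 64 is card0 -- check /sys/class/drm/ on your system.
CARD=/sys/class/drm/card0/device

# "auto" hands power-state selection back to the driver (requires root).
echo auto | sudo tee "$CARD/power_dpm_force_performance_level"

# Equivalent via the ROCm SMI tool:
# rocm-smi --setperflevel auto
```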
With the performance level forced to
@kentrussell, do you have any idea on this issue?
Not sure if it could be related to ROCm/ROC-smi#62, where Vega10 is having issues actually respecting voltage changes, especially after being at DPM7. Out of curiosity, if you just ditch the middle DPM states and set the sclk to "0 6 7" or "0 5 6 7", do things still appear the same? That way the card can still drop to DPM0, but skips the middle steps and only uses the top 2-3 levels. Just some spitballing. Hoping that once I finish my current task, I can get on these Vega10 issues.
Hi @kentrussell, as you suggested, I tried restricting the DPM states:
When I run

However, yes, when running the TensorFlow workload with the restricted DPM states, the power state does jump down to 0 (1 LED). So we might be on to something here.
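The restriction described above can be sketched as follows. This assumes the Vega 64 is `card0`; adjust the index for your system.

```shell
# Assumption: the Vega 64 is card0.
CARD=/sys/class/drm/card0/device

# Masking sclk levels only takes effect while the perf level is "manual".
echo manual | sudo tee "$CARD/power_dpm_force_performance_level"

# Keep only DPM levels 0, 6 and 7, as suggested.
echo "0 6 7" | sudo tee "$CARD/pp_dpm_sclk"

# Verify: the currently active level is marked with a trailing '*'.
cat "$CARD/pp_dpm_sclk"

# Equivalent via the ROCm SMI tool:
# rocm-smi --setsclk 0 6 7
```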
I've been wondering about the dynamic DPM and how it behaves under these workloads. My concern is that if the clocks are constantly switching between 8 potential levels, things can be a little less stable. If keeping the clocks high isn't an option (you cited coil whine and fans not returning), we can try to tweak things a bit. At least it might give a temporary workaround while we look into the issue more.
Thanks for the offer of tweaking things, @kentrussell. I can however live with the current workaround of restricting the DPM states until a proper fix is released. @iszotic, do you have a different opinion, or does this work for your use case, too? There is a similar issue at the ROCm repository:
This workaround works for clpeak, too. |
@Bengt @kentrussell setting all states also fixes the issue.
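Re-enabling the full state list can be sketched like this, under the same assumption that the Vega 64 is `card0` (Vega 10 exposes sclk levels 0-7).

```shell
# Assumption: the Vega 64 is card0; requires the perf level to be
# set to "manual" first, otherwise the mask is ignored.
CARD=/sys/class/drm/card0/device
echo manual | sudo tee "$CARD/power_dpm_force_performance_level"
echo "0 1 2 3 4 5 6 7" | sudo tee "$CARD/pp_dpm_sclk"

# Equivalent via the ROCm SMI tool:
# rocm-smi --setsclk 0 1 2 3 4 5 6 7
```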
@Bengt I'm going to close this issue since this is not tensorflow-rocm specific. |
I ran the test from issue #519 on my Vega 64 (gfx900, Vega 10) under Ubuntu 18.04.3 LTS. Afterwards, the card stays in PState 3 (4 LEDs of the GPUTach are on and there is some coil whine). The issue is somewhat transient, as `pp_dpm_sclk` sometimes reports PState 0. The GPU does not clock down like it does after exiting any other load.

Procedure to reproduce:
Output:
Shutting down the container does not help:
Starting and ending a 3D-load does not help:
Only a reboot fixes the issue for me: