
GPU does not clock down after test load #621

Closed
Bengt opened this issue Sep 2, 2019 · 14 comments

Bengt commented Sep 2, 2019

I ran the test from issue #519 on my Vega 64 (gfx900, Vega 10) under Ubuntu 18.04.3 LTS. Afterwards, the card stays in PState 3 (4 LEDs of the GPUTach are on and there is some coil whine). The issue is somewhat transient, as pp_dpm_sclk occasionally reports PState 0. The GPU does not clock down the way it does after exiting any other load.

Procedure to reproduce:

docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx rocm/tensorflow:rocm2.7-tf1.14-dev
python3 -m pip install scikit-image numpy keras efficientnet pytest
wget https://upload.wikimedia.org/wikipedia/commons/f/fe/Giant_Panda_in_Beijing_Zoo_1.JPG
wget https://gist.githubusercontent.com/Bengt/308c7d05dc755f1bfe0aeda9220e4eed/raw//test_efficientnet_gfx803.py
HIP_VISIBLE_DEVICES=0 python3 -m pytest -s test_efficientnet_gfx803.py
while true; do cat /sys/class/drm/card0/device/pp_dpm_sclk; sleep 1; clear; done

Output:

0: 852Mhz 
1: 991Mhz 
2: 1084Mhz 
3: 1138Mhz *
4: 1200Mhz 
5: 1401Mhz 
6: 1536Mhz 
7: 1630Mhz 
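
The line marked with an asterisk in pp_dpm_sclk is the currently active DPM level. Assuming card0 is the affected GPU, a shorter way to watch only the active level (equivalent to the loop above) could be:

$ watch -n 1 'grep "\*" /sys/class/drm/card0/device/pp_dpm_sclk'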

Shutting down the container does not help:

$ exit
$ while true; do cat /sys/class/drm/card0/device/pp_dpm_sclk; sleep 1; clear; done
0: 852Mhz 
1: 991Mhz 
2: 1084Mhz 
3: 1138Mhz *
4: 1200Mhz 
5: 1401Mhz 
6: 1536Mhz 
7: 1630Mhz 

Starting and stopping a 3D load does not help either:

$ glxgears
<Ctrl+C>
$ while true; do cat /sys/class/drm/card0/device/pp_dpm_sclk; sleep 1; clear; done
0: 852Mhz 
1: 991Mhz 
2: 1084Mhz 
3: 1138Mhz *
4: 1200Mhz 
5: 1401Mhz 
6: 1536Mhz 
7: 1630Mhz 

Only a reboot fixes the issue for me:

$ sudo reboot
$ while true; do cat /sys/class/drm/card0/device/pp_dpm_sclk; sleep 1; clear; done
0: 852Mhz *
1: 991Mhz 
2: 1084Mhz 
3: 1138Mhz 
4: 1200Mhz 
5: 1401Mhz 
6: 1536Mhz 
7: 1630Mhz 

Bengt commented Sep 2, 2019

The workload mentioned in #432 triggers the same behavior.

@sunway513

Hi @Bengt, can you provide the log after setting the following:
/opt/rocm/bin/rocm-smi --setperf high
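
As a side note, and assuming card0 is the affected Vega 64, the performance level that rocm-smi sets can also be inspected and forced directly through the amdgpu sysfs interface; this is a generic sketch, not a command from this thread:

$ cat /sys/class/drm/card0/device/power_dpm_force_performance_level
$ echo high | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level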

sunway513 self-assigned this Sep 3, 2019

Bengt commented Sep 3, 2019

Hi @sunway513,

Running /opt/rocm/bin/rocm-smi --setperf high turns on all the GPUTach LEDs of all installed GPUs. The log agrees:

0: 852Mhz 
1: 991Mhz 
2: 1084Mhz 
3: 1138Mhz
4: 1200Mhz 
5: 1401Mhz 
6: 1536Mhz 
7: 1630Mhz * 

During the test run, the last LED of the Fury X turns off, indicating that it clocks down to pstate 6 (1018 MHz vs. 1050 MHz in pstate 7). During the test run on the Vega 64, none of the LEDs turn off, so I suspect the clock management does not hit a thermal or power limit. After the run, the Vega 64 stays in pstate 7 and produces horrible coil whine. Executing /opt/rocm/bin/rocm-smi --setperf low makes the GPU turn off all but one LED, indicating pstate 0. However, that makes the fan ramp up to 48.63% and never come down again, even after idling for a few minutes at about 31 °C. After a reboot, everything is back to normal again.

Regards,
Bengt

@sunway513

Hi @Bengt, can you try to force the perf level to be "auto"?
FYI, you can also control the fan speed using rocm-smi; refer to the doc below:
https://github.com/RadeonOpenCompute/ROC-smi
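
A hedged sketch of handing fan control back to the driver after it gets stuck, assuming the card0 hwmon path applies to this setup and that the --resetfans flag exists in the installed ROC-smi version; in the amdgpu hwmon interface, writing 2 to pwm1_enable selects automatic fan control:

$ /opt/rocm/bin/rocm-smi --resetfans    # assumption: this flag is available in this ROC-smi version
$ echo 2 | sudo tee /sys/class/drm/card0/device/hwmon/hwmon*/pwm1_enable    # 2 = automatic fan control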


Bengt commented Sep 3, 2019

With the performance level forced to auto, all LEDs turn on at times during the workload and the fan ramps up a bit. Afterwards, the GPU gets stuck in pstate 3 (four LEDs), like before. Running /opt/rocm/bin/rocm-smi --setperf auto again afterwards does not help with clocking down.
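
A useful extra data point here, assuming the kernel exposes gpu_busy_percent for this card: if the reported GPU load is 0% while pp_dpm_sclk still shows pstate 3, the DPM logic is ignoring the idle state rather than reacting to residual load:

$ cat /sys/class/drm/card0/device/gpu_busy_percent    # GPU load in percent (newer amdgpu kernels)
$ cat /sys/class/drm/card0/device/pp_dpm_sclk         # active DPM level is marked with *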

@sunway513

@kentrussell, do you have any idea on this issue?

@kentrussell

Not sure if it could be related to ROCm/ROC-smi#62, where Vega10 is having issues actually respecting voltage changes, especially after being at DPM7.

Out of curiosity, if you just ditch the middle DPM states and set the sclk mask to "0 6 7" or "0 5 6 7", do things still appear the same? That way the card can still drop to DPM0, but it skips the middle steps and only uses the top 2-3 levels. Just some spitballing. I'm hoping that once I finish my current task, I can get on these Vega10 issues.
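
For completeness, a sketch of applying such a level mask without rocm-smi, assuming card0: switch the performance level to manual, then write the allowed level indices into pp_dpm_sclk:

$ echo manual | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level
$ echo "0 6 7" | sudo tee /sys/class/drm/card0/device/pp_dpm_sclk    # allow only levels 0, 6 and 7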


Bengt commented Sep 3, 2019

Hi @kentrussell,

As you suggested, I tried restricting the DPM states:

$ /opt/rocm/bin/rocm-smi --setsclk 0 6 7
 ========================ROCm System Management Interface========================
GPU[0] 		: Successfully set sclk frequency mask to Level 0 6 7
GPU[1] 		: Successfully set sclk frequency mask to Level 0 6 7
==============================End of ROCm SMI Log ==============================

When I run glxgears and resize the window, I can provoke any of the power states, so restricting them does not really seem to work.

However, when running the tensorflow workload with the restricted DPM states, the power state does jump back down to 0 (1 LED) afterwards. So we might be on to something here.
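
If the restricted mask needs to survive a reboot, one option (not discussed in this thread; the unit name here is only an example) could be a one-shot systemd unit that re-applies it at boot:

$ sudo tee /etc/systemd/system/amdgpu-sclk-mask.service > /dev/null <<'EOF'
[Unit]
Description=Restrict amdgpu sclk DPM levels (workaround for GPU not clocking down)

[Service]
Type=oneshot
ExecStart=/opt/rocm/bin/rocm-smi --setsclk 0 6 7

[Install]
WantedBy=multi-user.target
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now amdgpu-sclk-mask.service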

@kentrussell

I've been wondering about the dynamic DPM and how it behaves as the workload changes. My concern is that if the clocks are constantly switching between 8 potential levels, things can be a little less stable. If keeping the clocks at high isn't an option (you cited coil whine and fans not returning), we can try to tweak things a bit. At least it might give a temporary workaround while we look into the issue more.


Bengt commented Sep 3, 2019

Thanks for the offer to tweak things, @kentrussell. However, I can live with the current workaround of restricting the DPM states until a proper fix is released. @iszotic, do you have a different opinion, or does this work for your use case, too?

There is a similar issue at the ROCm repository:

ROCm/ROCm#857


Bengt commented Sep 3, 2019

This workaround works for clpeak, too.


iszotic commented Sep 4, 2019

@Bengt @kentrussell Setting all states also fixes the issue (/opt/rocm/bin/rocm-smi --setsclk 0 1 2 3 4 5 6 7), but if you set a single level with --setperf, the issue comes back; it is fixed again once all states are set while a program is running. I'm fine with this workaround; it's much better than manually setting each profile, and the default fan curve works too.
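
Summarizing this variant of the workaround as a command sequence, with the device index and card0 path assumed:

$ /opt/rocm/bin/rocm-smi --setsclk 0 1 2 3 4 5 6 7    # re-enable every DPM level while the workload is running
$ cat /sys/class/drm/card0/device/pp_dpm_sclk          # verify the active level drops back to 0 after the run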

@sunway513

@Bengt I'm going to close this issue since this is not tensorflow-rocm specific.


Bengt commented Apr 20, 2020

Folding at Home (F@H) can also trigger this behavior. Picture of the aftermath:

[Attached photo: IMG_20200420_133729]
