
GPU does not clock down after test load #621

Closed
Bengt opened this issue Sep 2, 2019 · 14 comments

Bengt commented Sep 2, 2019

I ran the test from issue #519 on my Vega 64 (gfx900, Vega 10) under Ubuntu 18.04.3 LTS. Afterwards, the card stays in PState 3 (4 LEDs of the GPUTach are on and there is some coil whine). The issue is somewhat transient, as pp_dpm_sclk occasionally reports PState 0. The GPU does not clock down the way it does after exiting any other load.

Procedure to reproduce:

docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx rocm/tensorflow:rocm2.7-tf1.14-dev
python3 -m pip install scikit-image numpy keras efficientnet pytest
wget https://upload.wikimedia.org/wikipedia/commons/f/fe/Giant_Panda_in_Beijing_Zoo_1.JPG
wget https://gist.githubusercontent.com/Bengt/308c7d05dc755f1bfe0aeda9220e4eed/raw//test_efficientnet_gfx803.py
HIP_VISIBLE_DEVICES=0 python3 -m pytest -s test_efficientnet_gfx803.py
while true; do cat /sys/class/drm/card0/device/pp_dpm_sclk; sleep 1; clear; done

Output:

0: 852Mhz 
1: 991Mhz 
2: 1084Mhz 
3: 1138Mhz *
4: 1200Mhz 
5: 1401Mhz 
6: 1536Mhz 
7: 1630Mhz 
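
The line marked with an asterisk in pp_dpm_sclk is the currently active DPM level. Assuming card0 is the affected GPU, a shorter way to watch only the active level (equivalent to the loop above) could be:

$ watch -n 1 'grep "\*" /sys/class/drm/card0/device/pp_dpm_sclk'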

Shutting down the container does not help:

$ exit
$ while true; do cat /sys/class/drm/card0/device/pp_dpm_sclk; sleep 1; clear; done
0: 852Mhz 
1: 991Mhz 
2: 1084Mhz 
3: 1138Mhz *
4: 1200Mhz 
5: 1401Mhz 
6: 1536Mhz 
7: 1630Mhz 

Starting and stopping a 3D load does not help either:

$ glxgears
<Ctrl+C>
$ while true; do cat /sys/class/drm/card0/device/pp_dpm_sclk; sleep 1; clear; done
0: 852Mhz 
1: 991Mhz 
2: 1084Mhz 
3: 1138Mhz *
4: 1200Mhz 
5: 1401Mhz 
6: 1536Mhz 
7: 1630Mhz 

Only a reboot fixes the issue for me:

$ sudo reboot
$ while true; do cat /sys/class/drm/card0/device/pp_dpm_sclk; sleep 1; clear; done
0: 852Mhz *
1: 991Mhz 
2: 1084Mhz 
3: 1138Mhz 
4: 1200Mhz 
5: 1401Mhz 
6: 1536Mhz 
7: 1630Mhz 

Bengt commented Sep 2, 2019

The workload mentioned in #432 triggers the same behavior.

@sunway513

Hi @Bengt, can you provide the log after setting the following:
/opt/rocm/bin/rocm-smi --setperf high
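
As a side note, and assuming card0 is the affected Vega 64, the performance level that rocm-smi sets can also be inspected and forced directly through the amdgpu sysfs interface; this is a generic sketch, not a command from this thread:

$ cat /sys/class/drm/card0/device/power_dpm_force_performance_level
$ echo high | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level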

sunway513 self-assigned this Sep 3, 2019

Bengt commented Sep 3, 2019

Hi @sunway513,

Running /opt/rocm/bin/rocm-smi --setperf high turns on all the GPUTach LEDs of all installed GPUs. The log agrees:

0: 852Mhz 
1: 991Mhz 
2: 1084Mhz 
3: 1138Mhz
4: 1200Mhz 
5: 1401Mhz 
6: 1536Mhz 
7: 1630Mhz * 

During the test run, the last LED of the Fury X turns off, indicating that it clocks down to pstate 6 (1018 MHz vs. 1050 MHz in pstate 7). During the test run on the Vega 64, none of the LEDs turn off, so I suspect the clock management does not hit a thermal or power limit. After the run, the Vega 64 stays in pstate 7 and produces horrible coil whine. Executing /opt/rocm/bin/rocm-smi --setperf low makes the GPU turn off all but one LED, indicating pstate 0. However, that makes the fan ramp up to 48.63% and never come down again, even after idling for a few minutes at about 31 °C. After a reboot, everything is back to normal again.

Regards,
Bengt

@sunway513

Hi @Bengt, can you try to force the perf level to be "auto"?
FYI, you can also control the fan speed using rocm-smi; refer to the doc below:
https://github.com/RadeonOpenCompute/ROC-smi
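
A hedged sketch of handing fan control back to the driver after it gets stuck, assuming the card0 hwmon path applies to this setup and that the --resetfans flag exists in the installed ROC-smi version; in the amdgpu hwmon interface, writing 2 to pwm1_enable selects automatic fan control:

$ /opt/rocm/bin/rocm-smi --resetfans    # assumption: this flag is available in this ROC-smi version
$ echo 2 | sudo tee /sys/class/drm/card0/device/hwmon/hwmon*/pwm1_enable    # 2 = automatic fan control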


Bengt commented Sep 3, 2019

With the performance level forced to auto, all LEDs turn on at times during the workload and the fan ramps up a bit. Afterwards, the GPU gets stuck in pstate 3 (four LEDs), like before. Running /opt/rocm/bin/rocm-smi --setperf auto again afterwards does not help with clocking down.
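
A useful extra data point here, assuming the kernel exposes gpu_busy_percent for this card: if the reported GPU load is 0% while pp_dpm_sclk still shows pstate 3, the DPM logic is ignoring the idle state rather than reacting to residual load:

$ cat /sys/class/drm/card0/device/gpu_busy_percent    # GPU load in percent (newer amdgpu kernels)
$ cat /sys/class/drm/card0/device/pp_dpm_sclk         # active DPM level is marked with *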

@sunway513

@kentrussell, do you have any idea on this issue?

@kentrussell

Not sure if it could be related to ROCm/ROC-smi#62, where Vega10 is having issues actually respecting voltage changes, especially after being at DPM7.

Out of curiosity, if you just ditch the middle DPM states and set the sclk mask to "0 6 7" or "0 5 6 7", do things still appear the same? That way the card can still drop to DPM0, but it skips the middle steps and only uses the top 2-3 levels. Just some spitballing. I'm hoping that once I finish my current task, I can get on these Vega10 issues.
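
For completeness, a sketch of applying such a level mask without rocm-smi, assuming card0: switch the performance level to manual, then write the allowed level indices into pp_dpm_sclk:

$ echo manual | sudo tee /sys/class/drm/card0/device/power_dpm_force_performance_level
$ echo "0 6 7" | sudo tee /sys/class/drm/card0/device/pp_dpm_sclk    # allow only levels 0, 6 and 7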


Bengt commented Sep 3, 2019

Hi @kentrussell,

As you suggested, I tried restricting the DPM states:

$ /opt/rocm/bin/rocm-smi --setsclk 0 6 7
 ========================ROCm System Management Interface========================
GPU[0] 		: Successfully set sclk frequency mask to Level 0 6 7
GPU[1] 		: Successfully set sclk frequency mask to Level 0 6 7
==============================End of ROCm SMI Log ==============================

When I run glxgears and resize the window, I can provoke any of the power states, so restricting them does not really seem to work.

However, when running the tensorflow workload with the restricted DPM states, the power state does jump back down to 0 (1 LED) afterwards. So we might be on to something here.
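
If the restricted mask needs to survive a reboot, one option (not discussed in this thread; the unit name here is only an example) could be a one-shot systemd unit that re-applies it at boot:

$ sudo tee /etc/systemd/system/amdgpu-sclk-mask.service > /dev/null <<'EOF'
[Unit]
Description=Restrict amdgpu sclk DPM levels (workaround for GPU not clocking down)

[Service]
Type=oneshot
ExecStart=/opt/rocm/bin/rocm-smi --setsclk 0 6 7

[Install]
WantedBy=multi-user.target
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now amdgpu-sclk-mask.service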

@kentrussell

I've been wondering about the dynamic DPM and how it behaves as the workload changes. My concern is that if the clocks are constantly switching between 8 potential levels, things can be a little less stable. If keeping the clocks at high isn't an option (you cited coil whine and fans not returning), we can try to tweak things a bit. At least it might give a temporary workaround while we look into the issue more.


Bengt commented Sep 3, 2019

Thanks for the offer to tweak things, @kentrussell. However, I can live with the current workaround of restricting the DPM states until a proper fix is released. @iszotic, do you have a different opinion, or does this work for your use case, too?

There is a similar issue at the ROCm repository:

ROCm/ROCm#857


Bengt commented Sep 3, 2019

This workaround works for clpeak, too.


iszotic commented Sep 4, 2019

@Bengt @kentrussell Setting all states also fixes the issue (/opt/rocm/bin/rocm-smi --setsclk 0 1 2 3 4 5 6 7), but if you set a single level with --setperf, the issue comes back; it is fixed again once all states are set while a program is running. I'm fine with this workaround; it's much better than manually setting each profile, and the default fan curve works too.
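
Summarizing this variant of the workaround as a command sequence, with the device index and card0 path assumed:

$ /opt/rocm/bin/rocm-smi --setsclk 0 1 2 3 4 5 6 7    # re-enable every DPM level while the workload is running
$ cat /sys/class/drm/card0/device/pp_dpm_sclk          # verify the active level drops back to 0 after the run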

@sunway513

@Bengt I'm going to close this issue since this is not tensorflow-rocm specific.


Bengt commented Apr 20, 2020

Folding at Home (F@H) can also trigger this behavior. Picture of the aftermath:

[Attached photo: IMG_20200420_133729]
