
The GPU usage shows huge % usage #110

Closed · shmalex opened this issue Jun 7, 2021 · 14 comments

shmalex commented Jun 7, 2021

Hi, found this issue today.

The GPU usage jumps to huge % number (screenshot below).

My System: Ubuntu 18.04.5 LTS
nvtop version 1.2.0

Screenshot from 2021-06-07 17-56-04

Please let me know what additional info would be helpful.

P.S. Thanks for the tool!

XuehaiPan commented Jun 8, 2021

It may be caused by illegal memory access (#107). What's your driver version (nvidia-smi)? Maybe you can try out the latest version (built from source).
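
For reference, the same driver-version string can also be read programmatically through NVML. A minimal sketch (an illustration, not nvtop code; it assumes libnvidia-ml is available and you link with -lnvidia-ml):

// driver_version.c -- print the driver version via NVML (illustrative only)
// Build (assumed): gcc driver_version.c -o driver_version -lnvidia-ml
#include <stdio.h>
#include <nvml.h>

int main(void) {
  char version[NVML_SYSTEM_DRIVER_VERSION_BUFFER_SIZE];
  if (nvmlInit() != NVML_SUCCESS)
    return 1;
  // Same string that nvidia-smi prints as "Driver Version"
  if (nvmlSystemGetDriverVersion(version, NVML_SYSTEM_DRIVER_VERSION_BUFFER_SIZE) == NVML_SUCCESS)
    printf("Driver Version: %s\n", version);
  nvmlShutdown();
  return 0;
}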

@XuehaiPan

As an alternative, you can try nvitop (written in Python):

sudo apt-get update
sudo apt-get install python3-dev python3-pip
pip3 install --user nvitop  # Or `sudo pip3 install nvitop`
nvitop -m

shmalex commented Jun 8, 2021

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:03:00.0  On |                  N/A |
|  0%   44C    P8    21W / 300W |    733MiB / 11018MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2113      G   /usr/lib/xorg/Xorg                 18MiB |
|    0   N/A  N/A      2284      G   /usr/bin/gnome-shell               72MiB |
|    0   N/A  N/A      2811      G   /usr/lib/xorg/Xorg                364MiB |
|    0   N/A  N/A      2947      G   /usr/bin/gnome-shell               64MiB |
|    0   N/A  N/A      3411      G   ...AAAAAAAA== --shared-files       29MiB |
|    0   N/A  N/A      3692      G   ...AAAAAAAAA= --shared-files       60MiB |
|    0   N/A  N/A      4158      G   ...AAAAAAAA== --shared-files       14MiB |
|    0   N/A  N/A     23484      G   whatsie                            14MiB |
|    0   N/A  N/A     26046      G   ...AAAAAAAAA= --shared-files       84MiB |
+-----------------------------------------------------------------------------+

Syllo commented Jun 8, 2021

Hi,

I couldn't find any reason in my code why such a huge value would be displayed.
I checked as far back as driver 390.87 from 2018 and the function had the same prototype, so it doesn't seem to be the same problem as the one encountered in #107.

The value is suspicious though: it is the maximum unsigned value UINT_MAX, which makes me think that either a computation in the driver returns -1 and it wraps around to the maximum, or the information is not available and it defaults to the maximum to indicate an error.
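
That wrap-around is ordinary unsigned arithmetic in C; a minimal illustration (not nvtop code):

// -1 stored into an unsigned int wraps modulo 2^32 to UINT_MAX (4294967295),
// which would show up exactly as a huge bogus percentage.
#include <limits.h>
#include <stdio.h>

int main(void) {
  int error = -1;                    // e.g. an error result inside the driver
  unsigned int utilization = error;  // implicit conversion wraps to UINT_MAX
  printf("%u == UINT_MAX (%u)\n", utilization, UINT_MAX);
  return 0;
}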

Could you please check if the patch in the branch fix_gpu_rate does the trick?

To do so:

git clone https://github.com/Syllo/nvtop.git
mkdir -p nvtop/build && cd nvtop/build
git checkout fix_gpu_rate
cmake ..
make
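
If the build succeeds, the freshly built nvtop binary can be run straight from the build tree to check whether the huge values are gone; no make install should be needed.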

XuehaiPan commented Jun 8, 2021

I have the same issue when a zombie process is consuming GPU memory, and nvtop works as expected when I filter out those GPUs. But I think my situation is slightly different from @shmalex's:
as the image below shows, nvtop extracts all process names correctly.

Screenshot from 2021-06-07 17-56-04

On my machine, there are some zombie processes on the GPU caused by PyTorch issues:

Screenshot 1

Screenshot 2

I added the following lines to the function gpuinfo_nvidia_get_process_utilization:

fprintf(stderr, "Count=%u\n", samples_count);
for (unsigned i = 0; i < samples_count; ++i) {
  fprintf(stderr, "PID=%u %%SM=%u %%ENC=%u %%DEC=%u TS=%llu\n", samples[i].pid, samples[i].smUtil, samples[i].encUtil, samples[i].decUtil, samples[i].timeStamp);
}
fprintf(stderr, "\n");

Then I ran nvtop -s 7 2>debug.txt and got the following output:

Count=100
PID=50554 %SM=37100 %ENC=50583 %DEC=50624 TS=217144956601725
PID=50021 %SM=38305 %ENC=49270 %DEC=50571 TS=158819300671603
PID=50017 %SM=0 %ENC=3313 %DEC=0 TS=49936
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
... # 94 lines same as above

The output from pynvml:

$ pip3 install nvidia-ml-py==11.450.51  # the official NVML Python Bindings (http://pypi.python.org/pypi/nvidia-ml-py/)
$ ipython3
Python 3.9.5 (default, May  3 2021, 15:11:33) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.24.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from pynvml import *

In [2]: nvmlInit()

In [3]: handle = nvmlDeviceGetHandleByIndex(7)

In [4]: for sample in nvmlDeviceGetProcessUtilization(handle, 0):
   ...:     print(sample)
   ...:     
c_nvmlProcessUtilizationSample_t(pid: 0, timeStamp: 0, smUtil: 0, memUtil: 0, encUtil: 0, decUtil: 0)
c_nvmlProcessUtilizationSample_t(pid: 0, timeStamp: 0, smUtil: 0, memUtil: 0, encUtil: 0, decUtil: 0)
c_nvmlProcessUtilizationSample_t(pid: 0, timeStamp: 0, smUtil: 0, memUtil: 0, encUtil: 0, decUtil: 0)
... # 97 lines same as above

The output of nvtop is different from the output of pynvml on the same GPU.
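
For completeness, here is a standalone C sketch of the same raw NVML call that both nvtop and pynvml wrap; the GPU index 7 and the two-call pattern (first call with a NULL buffer to get the sample count) follow the discussion above and are assumptions, not code lifted from nvtop:

// nvml_proc_util.c -- dump raw nvmlDeviceGetProcessUtilization samples
// Build (assumed): gcc nvml_proc_util.c -o nvml_proc_util -lnvidia-ml
#include <stdio.h>
#include <stdlib.h>
#include <nvml.h>

int main(void) {
  if (nvmlInit() != NVML_SUCCESS)
    return 1;

  nvmlDevice_t device;
  if (nvmlDeviceGetHandleByIndex(7, &device) != NVML_SUCCESS)  // GPU 7, as in the example above
    return 1;

  // First call with a NULL buffer only to learn how many samples the driver holds.
  unsigned int samples_count = 0;
  nvmlDeviceGetProcessUtilization(device, NULL, &samples_count, 0);

  nvmlProcessUtilizationSample_t *samples = calloc(samples_count, sizeof(*samples));
  if (samples &&
      nvmlDeviceGetProcessUtilization(device, samples, &samples_count, 0) == NVML_SUCCESS) {
    printf("Count=%u\n", samples_count);
    for (unsigned int i = 0; i < samples_count; ++i)
      printf("PID=%u %%SM=%u %%ENC=%u %%DEC=%u TS=%llu\n",
             samples[i].pid, samples[i].smUtil, samples[i].encUtil,
             samples[i].decUtil, samples[i].timeStamp);
  }

  free(samples);
  nvmlShutdown();
  return 0;
}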

@XuehaiPan

For me, this issue disappears when I run nvtop inside a docker container.

docker build --tag nvtop .
docker run --interactive --tty --rm --runtime=nvidia --gpus all --pid=host nvtop

Screenshot 3

Syllo commented Jun 8, 2021

Interesting, so indeed the function nvmlDeviceGetProcessUtilization can be a bit finicky.

@XuehaiPan What is the timestamp that is passed to the function in the variable internal->last_utilization_timestamp? If it is larger than or equal to the ones returned (the TS=...), this might be the way to filter the results, although this parameter is supposed to tell the driver to only return utilization samples more recent than that timestamp!
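
A minimal sketch of that timestamp filter, reusing the samples / samples_count / internal->last_utilization_timestamp names from the debug snippet above (illustrative, not the actual patch):

// Keep only samples strictly newer than the timestamp we passed to the driver;
// according to the docs the driver should already have done this for us.
unsigned int kept = 0;
for (unsigned int i = 0; i < samples_count; ++i) {
  if (samples[i].timeStamp <= internal->last_utilization_timestamp)
    continue;                 // stale (or zeroed) sample: drop it
  samples[kept++] = samples[i];
}
samples_count = kept;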

XuehaiPan commented Jun 8, 2021

What is the timestamp that is passed to the function in the variable internal->last_utilization_timestamp?

The first value is 0, and it gets a seemingly random value on each refresh (since internal->last_utilization_timestamp = samples[0].timeStamp was set on the previous update):

Count=100 TS=0
PID=50554 %SM=37100 %ENC=50583 %DEC=50624 TS=217144956601725
PID=50021 %SM=38305 %ENC=49270 %DEC=50571 TS=158819300671603
PID=50017 %SM=0 %ENC=93761 %DEC=0 TS=49936
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
# ...

Count=100 TS=217144956601725
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
# ...
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=82417 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
# ...

# ...

Count=100 TS=1047527424
PID=38576 %SM=47961 %ENC=1047527424 %DEC=0 TS=1047527424
PID=875831345 %SM=909586487 %ENC=926037305 %DEC=840970297 TS=3617010763600568368
PID=540094517 %SM=875573536 %ENC=874524960 %DEC=540029472 TS=4048798948548490784
PID=842145840 %SM=540024880 %ENC=875837238 %DEC=926103344 TS=2320532713153574197
PID=892680496 %SM=840970272 %ENC=909189170 %DEC=858863671 TS=3467824627879784480
PID=824193585 %SM=842342455 %ENC=825570848 %DEC=807415840 TS=3539882221917712416
PID=807415840 %SM=2609 %ENC=1178666720 %DEC=32532 TS=3539882221917712416
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=80353 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=1178665568 %SM=926234675 %ENC=1869116537 %DEC=1914712942 TS=80305
PID=1651076204 %SM=943071286 %ENC=825040944 %DEC=842150944 TS=3978147638086279258
PID=909586487 %SM=540095029 %ENC=842215732 %DEC=540226359 TS=3611941001401742391
PID=858993206 %SM=807417138 %ENC=842217015 %DEC=540024882 TS=2319389199467295520
PID=942743600 %SM=909194549 %ENC=807415840 %DEC=807415840 TS=4120854352152769588
PID=909455904 %SM=909586720 %ENC=807415840 %DEC=540487968 TS=3616728262244775473
PID=807416119 %SM=540024880 %ENC=540024880 %DEC=667953 TS=2319389199166349369
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0
PID=0 %SM=0 %ENC=0 %DEC=0 TS=0

Syllo commented Jun 8, 2021

I'm confused!
Or I don't get the same behavior with my driver version ...
I get this:

Samples 3, last TS 0
PID 18429, GPU% 1, ENC% 0, DEC% 0, TS 1623157629331552
PID 7044, GPU% 12, ENC% 0, DEC% 0, TS 1623157633175692
PID 7180, GPU% 7, ENC% 0, DEC% 0, TS 1623157632005826
Samples 2, last TS 1623157629331552
PID 7044, GPU% 11, ENC% 0, DEC% 0, TS 1623157633175692
PID 7180, GPU% 9, ENC% 0, DEC% 0, TS 1623157632005826
Samples 2, last TS 1623157633175692
PID 7044, GPU% 4, ENC% 0, DEC% 0, TS 1623157637687768
PID 7180, GPU% 16, ENC% 0, DEC% 0, TS 1623157639191912
Samples 2, last TS 1623157637687768
PID 7044, GPU% 1, ENC% 0, DEC% 0, TS 1623157639526261
PID 7180, GPU% 19, ENC% 0, DEC% 0, TS 1623157640194710

Did you insert your code after the second call to nvmlDeviceGetProcessUtilization? Just before

if (samples_count) {
      internal->last_utilization_timestamp = samples[0].timeStamp;
    }

XuehaiPan commented Jun 8, 2021

In reply to this: yes.

Did you insert your code after the second call to nvmlDeviceGetProcessUtilization? Just before

if (samples_count) {
  internal->last_utilization_timestamp = samples[0].timeStamp;
}

Normally, samples_count exactly equals the number of processes on the GPU.
For me, there are 2 zombie processes but samples_count=100, both for nvtop and pynvml.

I think it may be an upstream driver issue triggered by PyTorch (pytorch/pytorch#4293), or by TeamViewer for @shmalex (I'm not so sure). We can set the utilization rates to zero (not clamp them into [0, 100]) when NVML returns illegal values. Going further (optionally), we could print an error on exiting nvtop that asks the user to reset the GPU or reboot the machine. (We should distinguish between error processes and normal processes that are just exiting (zombie for 1-2 seconds).)
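
A minimal sketch of that sanitization, using the sample fields shown in the debug output above (illustrative only):

// Report 0 ("unknown") instead of clamping when the driver hands back
// impossible utilization rates.
for (unsigned int i = 0; i < samples_count; ++i) {
  if (samples[i].smUtil > 100)  samples[i].smUtil = 0;
  if (samples[i].memUtil > 100) samples[i].memUtil = 0;
  if (samples[i].encUtil > 100) samples[i].encUtil = 0;
  if (samples[i].decUtil > 100) samples[i].decUtil = 0;
}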

@XuehaiPan

I created a gist here: https://gist.github.com/XuehaiPan/bc4834bf40723fe0c994b03d9c0473e4

git clone https://gist.github.com/bc4834bf40723fe0c994b03d9c0473e4.git nvml-example
cd nvml-example

# C
sed -i 's|/usr/local/cuda-10.1|/usr/local/cuda|g' Makefile  # change the CUDA_HOME
make && ./example

# Python
pip3 install nvitop
python3 example.py

The files output-c.txt and output-py.txt in the gist are the output on my machine.

On my machine, I'm sure this issue is not caused by nvtop.

Syllo added a commit that referenced this issue Jun 9, 2021
The function nvmlDeviceGetProcessUtilization might return a number of
samples that exceeds the running number of processes on the GPU.
Furthermore, most of the returned samples are filled with values that
either do not make sense (e.g. >100% utilization rate) or carry a
timestamp older than the one we asked for.

Fixes #110.

Syllo commented Jun 9, 2021

Thanks for all the info.
So it seems that something is wrong with either the driver or the function nvmlDeviceGetProcessUtilization.
After looking at the results you provided, my solution to avoid the wrong samples is:

  • Filter out samples whose PID does not match an existing process as returned by nvmlDeviceGetComputeRunningProcesses or nvmlDeviceGetGraphicsRunningProcesses.
  • If the process exists, further filter out samples whose utilization rates are >100.
  • Lastly, filter out samples that are older than the last seen timestamp.

This should iron out the cases where the driver misbehaves for whatever reason; a rough sketch of the combined filter follows below.
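
A rough, illustrative sketch of the combined filter; pid_is_running() is a hypothetical helper standing in for a lookup into the lists returned by nvmlDeviceGetComputeRunningProcesses / nvmlDeviceGetGraphicsRunningProcesses, and none of these names come from the actual commit:

#include <stdbool.h>
#include <nvml.h>

// Hypothetical helper: checks the PID against the compute/graphics
// running-process lists fetched elsewhere.
extern bool pid_is_running(unsigned int pid);

static bool sample_is_valid(const nvmlProcessUtilizationSample_t *s,
                            unsigned long long last_seen_timestamp) {
  if (!pid_is_running(s->pid))                // 1. PID must belong to a live process
    return false;
  if (s->smUtil > 100 || s->memUtil > 100 ||
      s->encUtil > 100 || s->decUtil > 100)   // 2. rates must be plausible percentages
    return false;
  if (s->timeStamp <= last_seen_timestamp)    // 3. sample must be newer than the last one seen
    return false;
  return true;
}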

Syllo closed this as completed in 8fa9696 on Jun 9, 2021

shmalex commented Jun 9, 2021

I can confirm - the issue is not coming up again.
On 1.2.0 the issue reproduced; on 1.2.1 there is no issue.
@Syllo @XuehaiPan thank you very much.

Syllo commented Jun 10, 2021

All right.
Thanks for the feedback @shmalex.
Take care.
