The GPU usage shows huge % usage #110
It may be caused by an illegal memory access (#107). What's your driver version?
As an alternative, you can try nvitop (written in Python):

```bash
sudo apt-get update
sudo apt-get install python3-dev python3-pip
pip3 install --user nvitop  # Or `sudo pip3 install nvitop`
nvitop -m
```
Hi, I couldn't find any reason in my code why such a huge value would be displayed. The value is suspicious though: it is the maximum unsigned value UINT_MAX, which makes me think that either there is a computation in the driver that returns -1 and wraps to the maximum, or that the information is not available and defaults to the maximum to indicate an error. Could you please check if the patch in the branch fix_gpu_rate does the trick? To do so:

```bash
git clone https://github.com/Syllo/nvtop.git
mkdir -p nvtop/build && cd nvtop/build
git checkout fix_gpu_rate
cmake ..
make
```
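For reference, a minimal C snippet (not from the issue) illustrating the -1 wrap-around hypothesis mentioned above: assigning -1 to an unsigned value yields UINT_MAX, which is exactly the kind of huge percentage shown in the report.

```c
#include <limits.h>
#include <stdio.h>

int main(void) {
  int computed = -1;            /* e.g. a hypothetical error code from the driver */
  unsigned int rate = computed; /* implicit conversion wraps to UINT_MAX */
  printf("rate = %u (UINT_MAX = %u)\n", rate, UINT_MAX);
  return 0;
}
```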
I have the same issue when a zombie process is consuming GPU memory. On my machine, there are some zombie processes on the GPU caused by some issues with PyTorch. I added the following lines to the function:

```c
fprintf(stderr, "Count=%u\n", samples_count);
for (unsigned i = 0; i < samples_count; ++i) {
  fprintf(stderr, "PID=%u %%SM=%u %%ENC=%u %%DEC=%u TS=%llu\n",
          samples[i].pid, samples[i].smUtil, samples[i].encUtil,
          samples[i].decUtil, samples[i].timeStamp);
}
fprintf(stderr, "\n");
```

Running it, I get the following outputs:
The output from pynvml:

```
$ pip3 install nvidia-ml-py==11.450.51  # the official NVML Python Bindings (http://pypi.python.org/pypi/nvidia-ml-py/)
$ ipython3
Python 3.9.5 (default, May 3 2021, 15:11:33)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.24.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from pynvml import *

In [2]: nvmlInit()

In [3]: handle = nvmlDeviceGetHandleByIndex(7)

In [4]: for sample in nvmlDeviceGetProcessUtilization(handle, 0):
   ...:     print(sample)
   ...:
c_nvmlProcessUtilizationSample_t(pid: 0, timeStamp: 0, smUtil: 0, memUtil: 0, encUtil: 0, decUtil: 0)
c_nvmlProcessUtilizationSample_t(pid: 0, timeStamp: 0, smUtil: 0, memUtil: 0, encUtil: 0, decUtil: 0)
c_nvmlProcessUtilizationSample_t(pid: 0, timeStamp: 0, smUtil: 0, memUtil: 0, encUtil: 0, decUtil: 0)
... # 97 lines same as above
```

The output of
Interesting, so the problem does indeed come from the function nvmlDeviceGetProcessUtilization.
@XuehaiPan What is the timestamp that is passed to the function in the variable last_utilization_timestamp?
The first value is 0, and it gets a random value on each refresh.
I'm confused!
Did you insert your code after the second call to nvmlDeviceGetProcessUtilization, here?

```c
if (samples_count) {
  internal->last_utilization_timestamp = samples[0].timeStamp;
}
```
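For context, here is a rough sketch (not nvtop's actual code; the names `internal` and `last_utilization_timestamp` are taken from the snippet above) of the usual two-call pattern around nvmlDeviceGetProcessUtilization, where the stored timestamp is passed back as lastSeenTimeStamp on the next refresh:

```c
#include <stdlib.h>
#include <nvml.h>

/* Hypothetical helper: fetch fresh per-process samples and remember the
 * newest timestamp for the next refresh. */
static void refresh_process_utilization(nvmlDevice_t device,
                                        unsigned long long *last_seen_ts) {
  unsigned int samples_count = 0;
  /* First call with a NULL buffer asks how many samples exist since *last_seen_ts. */
  nvmlReturn_t ret =
      nvmlDeviceGetProcessUtilization(device, NULL, &samples_count, *last_seen_ts);
  if ((ret != NVML_SUCCESS && ret != NVML_ERROR_INSUFFICIENT_SIZE) || samples_count == 0)
    return;

  nvmlProcessUtilizationSample_t *samples = malloc(samples_count * sizeof(*samples));
  if (!samples)
    return;

  /* Second call actually fills the buffer. */
  ret = nvmlDeviceGetProcessUtilization(device, samples, &samples_count, *last_seen_ts);
  if (ret == NVML_SUCCESS && samples_count) {
    /* This is where the snippet above sits: keep the newest timestamp so the
     * next call only returns newer samples. */
    *last_seen_ts = samples[0].timeStamp;
    /* ... consume samples[i].pid / smUtil / encUtil / decUtil here ... */
  }
  free(samples);
}
```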
In reply to the above: yes, I inserted it after the second call.

I think it may be an upstream driver issue caused by PyTorch (pytorch/pytorch#4293) or, in @shmalex's case, by TeamViewer (I'm not so sure). We can set the utilization rates to zero (rather than clamping them into [0, 100]) when NVML returns illegal values. Further (and optionally), we could show an error asking the user to reset the GPU or reboot the machine when exiting nvtop.
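A minimal sketch of that suggestion (my wording, not a committed patch): treat any per-process rate above 100 as invalid and report 0 instead of clamping.

```c
/* Hypothetical helper, not nvtop's code: NVML utilization rates are
 * percentages, so anything above 100 (e.g. UINT_MAX from a wrapped -1)
 * is treated as garbage and reported as 0 instead of being clamped. */
static unsigned int sanitize_rate(unsigned int raw_rate) {
  return raw_rate > 100 ? 0 : raw_rate;
}

/* Usage: sample.smUtil = sanitize_rate(sample.smUtil); */
```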
I created a gist here: https://gist.github.com/XuehaiPan/bc4834bf40723fe0c994b03d9c0473e4

```bash
git clone https://gist.github.com/bc4834bf40723fe0c994b03d9c0473e4.git nvml-example
cd nvml-example

# C
sed -i 's|/usr/local/cuda-10.1|/usr/local/cuda|g' Makefile  # change the CUDA_HOME
make && ./example

# Python
pip3 install nvitop
python3 example.py
```

The files reproduce the problem outside of nvtop, so on my machine I'm sure this issue is not caused by nvtop.
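The gist's files are not reproduced in this thread; a minimal standalone C query in the same spirit (device index 7 is an assumption carried over from the pynvml session above) might look like this:

```c
/* Sketch only, not the gist's actual example.c.
 * Build with something like: gcc example.c -lnvidia-ml */
#include <stdio.h>
#include <stdlib.h>
#include <nvml.h>

int main(void) {
  if (nvmlInit() != NVML_SUCCESS)
    return 1;

  nvmlDevice_t device;
  if (nvmlDeviceGetHandleByIndex(7, &device) != NVML_SUCCESS) { /* index 7 as above */
    nvmlShutdown();
    return 1;
  }

  unsigned int count = 0;
  nvmlReturn_t ret = nvmlDeviceGetProcessUtilization(device, NULL, &count, 0);
  if ((ret == NVML_SUCCESS || ret == NVML_ERROR_INSUFFICIENT_SIZE) && count > 0) {
    nvmlProcessUtilizationSample_t *samples = malloc(count * sizeof(*samples));
    if (samples &&
        nvmlDeviceGetProcessUtilization(device, samples, &count, 0) == NVML_SUCCESS) {
      for (unsigned int i = 0; i < count; ++i)
        printf("PID=%u %%SM=%u %%ENC=%u %%DEC=%u TS=%llu\n",
               samples[i].pid, samples[i].smUtil, samples[i].encUtil,
               samples[i].decUtil, samples[i].timeStamp);
    }
    free(samples);
  }

  nvmlShutdown();
  return 0;
}
```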
The function nvmlDeviceGetProcessUtilization might return a number of samples that exceeds the number of processes running on the GPU. Furthermore, most of the returned samples are filled with values that either do not make sense (e.g. a >100% utilization rate) or carry a timestamp older than the one we asked for. Fixes #110.
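A rough sketch of that kind of sample validation (my reading of the fix description, not the actual patch): keep a sample only if its rates are plausible percentages and its timestamp is not older than the one passed to the call.

```c
#include <nvml.h>

/* Hypothetical check paraphrasing the fix description above
 * (not the actual nvtop patch). */
static int sample_is_valid(const nvmlProcessUtilizationSample_t *s,
                           unsigned long long last_seen_ts) {
  if (s->timeStamp < last_seen_ts)  /* older than what we asked for */
    return 0;
  if (s->smUtil > 100 || s->encUtil > 100 || s->decUtil > 100)  /* not a percentage */
    return 0;
  return 1;
}
```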
Thanks for all the info.
This should iron out the case where the driver misbehaves for whatever reason.
I can confirm - the issue is not coming up again.
All right.
Hi, found this issue today.
The GPU usage jumps to a huge % number (screenshot below).
My system:
Ubuntu 18.04.5 LTS
nvtop version 1.2.0
Please let me know what additional info would be helpful.
P.S. Thanks for the tool!