How to show processes in a container with the nvidia-smi command? #179
Comments
It's a current limitation of our driver, sorry about that!
Thanks for your reply. @flx42
mark it
@fredy12 your way works
This should be implemented. 👍
How can we help make this happen?
Are there plans to fix this soon? By the way, this is a problem not only with the nvidia-smi tool, but also when calling NVML directly.
Any news about this? It seems quite fundamental to me.
@flx42 Can we track this issue somewhere? It seems that this one is closed.
+1
Please reopen this
Re-opening. It's a driver limitation and it still exists. There is a mediocre workaround using node-manager, but it is unsatisfactory for some workflows. The requirement for closure is for nvidia-smi to at least show processes running in the same container. The underlying libraries and drivers can't support this under the current architecture, so don't expect it to be fixed any time soon. No ETA, sorry!
Is there any update on this? Are there any workarounds?
We definitely need this. A typical example is when using PyTorch and running more than one training process: often an OOM will kill the process without any info if we cannot monitor the GPU. Even assuming we use TF with a GPU memory fraction, like here:

```python
# Assume that you have 12GB of GPU memory and want to allocate ~4GB:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
```

I still need the GPU memory usage from within the container, to decide at run time which policy to adopt for running processes such as training or parallelized inference.
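For the device-level numbers specifically, NVML can be queried from inside the container even when the per-process list is unhelpful, because the memory totals are reported per device rather than per PID. A minimal sketch using the pynvml bindings (assuming they are installed and GPU 0 is visible to the container):

```python
import pynvml as N

def device_memory_mib(device_index=0):
    """Return (used, total) GPU memory in MiB for one device via NVML."""
    N.nvmlInit()
    try:
        handle = N.nvmlDeviceGetHandleByIndex(device_index)
        mem = N.nvmlDeviceGetMemoryInfo(handle)  # .total / .used / .free, in bytes
        return mem.used // (1024 * 1024), mem.total // (1024 * 1024)
    finally:
        N.nvmlShutdown()

if __name__ == "__main__":
    used, total = device_memory_mib(0)
    print(f"GPU 0: {used} MiB used of {total} MiB")
```

The used/total figures are usually enough to drive a run-time memory policy like the one described above, even while the per-process breakdown stays hidden.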
FYI:
@maaft as soon as you have the PID, how do you do with
Oh sorry, I guess I missed the topic on this one a bit. Anyway, if using Python is an option for you:

```python
import sys
import pynvml as N

MB = 1024 * 1024

def get_usage(device_index, my_pid):
    """Return the GPU memory (in MiB) used by my_pid on the given device."""
    N.nvmlInit()
    handle = N.nvmlDeviceGetHandleByIndex(device_index)
    # NVML reports compute (CUDA) and graphics processes separately;
    # usedGpuMemory can be None in some environments, so treat that as 0.
    usage = [(nv_process.usedGpuMemory or 0) // MB for nv_process in
             N.nvmlDeviceGetComputeRunningProcesses(handle) +
             N.nvmlDeviceGetGraphicsRunningProcesses(handle) if
             nv_process.pid == my_pid]
    if len(usage) == 1:
        usage = usage[0]
    else:
        raise KeyError("PID not found")
    return usage

if __name__ == "__main__":
    # Command-line arguments arrive as strings, so convert them to integers.
    print(get_usage(int(sys.argv[1]), int(sys.argv[2])))
```

Instead of calling
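A possible way to call the helper above from inside the training process itself, so that the PID argument is the caller's own (the module name gpu_usage is only illustrative):

```python
import os
from gpu_usage import get_usage  # assuming the snippet above was saved as gpu_usage.py

# Query this process's own memory footprint on GPU 0.
print(get_usage(0, os.getpid()), "MiB used by this process on GPU 0")
```

As the follow-up comments point out, this only matches when the PIDs NVML reports exist in the container's PID namespace, e.g. when the container shares the host PID namespace.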
No cigar. N.nvmlDeviceGetGraphicsRunningProcesses(handle) always returns the empty list for me, no matter what the actual load on the referenced GPU is.
@te0006 I just included it for completeness. If both lists are empty: Are you sure you started your docker container with:
Not really sure. Have to check with a colleague who does the administration. Thanks for the hint.
Can't believe it is still not working after 4 years. Now people are running GPU inference/training jobs in every Kubernetes cluster, and we really need a way to determine GPU usage inside Docker containers.
@dragonly you can use the workaround stated above in the meantime.
After consulting with my colleague and doing some testing, I can confirm that the above workaround works:
Given the options available to you, I am closing this:
Some of the information people want to see (like memory and GPU usage) might be better represented in a monitoring system (e.g. Prometheus), where a graph can provide better visualization across multiple nodes in a cluster. I would love to see a way to monitor processes cluster-wide, and not simply for a particular node (a bit out of scope for this issue). If there is a specific feature or enhancement to one of the options above, please open a new issue.
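For anyone who wants to go the monitoring route mentioned above, here is a minimal sketch of exposing per-GPU memory as Prometheus metrics from inside a container, assuming the pynvml and prometheus_client packages are available (a full solution would use NVIDIA's DCGM exporter, but this shows the idea):

```python
import time
import pynvml as N
from prometheus_client import Gauge, start_http_server

# One gauge, labeled per GPU index; Prometheus scrapes it over HTTP.
GPU_MEM_USED = Gauge("gpu_memory_used_mib", "GPU memory used in MiB", ["gpu"])

def collect():
    for i in range(N.nvmlDeviceGetCount()):
        handle = N.nvmlDeviceGetHandleByIndex(i)
        mem = N.nvmlDeviceGetMemoryInfo(handle)
        GPU_MEM_USED.labels(gpu=str(i)).set(mem.used / (1024 * 1024))

if __name__ == "__main__":
    N.nvmlInit()
    start_http_server(9400)  # metrics exposed at :9400/metrics
    while True:
        collect()
        time.sleep(15)
```

Device-level memory works from inside containers even when the per-process table is empty, which is why this kind of exporter sidesteps the namespace problem discussed in this issue.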
I stumbled across this ticket while looking for a way to get information in a Kubernetes cluster about which pod/container is using which NVIDIA GPU, even if the pod is not running in hostPID mode. I have developed a tool to get this information, in case it is useful for someone: https://github.com/dmrub/kube-utils/blob/master/kube-nvidia-get-processes.py Note that the tool runs a privileged pod on each node with a GPU to map node PIDs to container PIDs. Here is the sample output:
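For anyone curious how a node-PID-to-container mapping like this can work, here is a rough sketch of one common approach (not necessarily the tool's actual implementation): from a privileged pod, read /proc/<pid>/cgroup on the host and extract the container ID embedded in the cgroup path. The regex below assumes Docker's cgroup v1 path layout; other runtimes and cgroup v2 use different formats.

```python
import re
from pathlib import Path

def container_id_for_host_pid(pid):
    """Best-effort: extract a Docker container ID from /proc/<pid>/cgroup."""
    try:
        text = Path(f"/proc/{pid}/cgroup").read_text()
    except FileNotFoundError:
        return None  # process exited or PID not visible in this namespace
    # Matches '.../docker/<64-hex-id>' (cgroup v1) or 'docker-<id>.scope'.
    match = re.search(r"docker[/-]([0-9a-f]{64})", text)
    return match.group(1) if match else None
```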
Hello, any update on this issue?
Adding hostPID might not be a good idea for k8s clusters in which the GPU workers are shared among multiple tenants: users can potentially kill processes that belong to others if they are root in the container and can see all the PIDs on the host. I still hope this can be patched at the driver level to add PID namespace awareness.
I made a kernel module that patches the nvidia driver for correct PID namespace handling: https://github.com/gh2o/nvidia-pidns
This worked well for me for k8s: https://github.com/matpool/mpu
I have solved this by using https://github.com/matpool/mpu/
Hi, when I execute nvidia-smi inside the container, it shows nothing in the Processes table:

```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
```
But when I execute nvidia-smi on the physical host, it shows the process with PID 28290:

```
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     28290      C   python                                      1012MiB |
+-----------------------------------------------------------------------------+
```
How can I show the running processes inside the container?
One way I figured out is to run the container with --pid=host. Are there any other, more graceful ways?
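For completeness, the same process table that nvidia-smi reads can be queried through NVML from Python. A rough sketch using the pynvml bindings (assumed installed); inside a container that does not share the host PID namespace, the list may come back empty, or with PIDs that do not exist in the container's /proc, which is exactly the limitation discussed in the comments above:

```python
import pynvml as N

N.nvmlInit()
try:
    for i in range(N.nvmlDeviceGetCount()):
        handle = N.nvmlDeviceGetHandleByIndex(i)
        # Compute (CUDA) processes; graphics processes have a separate call.
        for proc in N.nvmlDeviceGetComputeRunningProcesses(handle):
            mem_mib = (proc.usedGpuMemory or 0) // (1024 * 1024)
            print(f"GPU {i}: pid={proc.pid} usedGpuMemory={mem_mib} MiB")
finally:
    N.nvmlShutdown()
```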