This repository was archived by the owner on Jan 22, 2024. It is now read-only.

How to show processes in container with cmd nvidia-smi? #179

Closed
fredy12 opened this issue Aug 24, 2016 · 30 comments

Comments

@fredy12

fredy12 commented Aug 24, 2016

Hi, when I execute nvidia-smi inside the container, it shows nothing in the processes table:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+

But when I execute nvidia-smi on the physical host, it shows the process with PID 28290:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 28290 C python 1012MiB |
+-----------------------------------------------------------------------------+

How can I show the running processes inside the container?
One way I figured out is to run the container with --pid=host; are there any more graceful ways?

@flx42
Member

flx42 commented Aug 24, 2016

It's a current limitation of our driver, sorry about that!
As you realized, it is related to PID namespaces: the driver is not aware of the PID namespace, and thus nvidia-smi in the container doesn't see any running processes.

@fredy12
Author

fredy12 commented Aug 25, 2016

Thanks for your reply, @flx42.
So we could consider adding PID namespace awareness to the driver.

@3XX0 3XX0 closed this as completed Aug 26, 2016
@qiaohaijun

mark it

@qiaohaijun

@fredy12 your way works

@loretoparisi

This should be implemented. 👍

@dharmeshkakadia

How can we help make this happen?
Checking GPU usage is part of the workflow when running large training jobs. Without support for nvidia-smi inside the container, there is no good way to keep an eye on usage. 😿

@therc

therc commented Dec 20, 2018

Are there plans to fix this soon? BTW, this is a problem not only with the nvidia-smi tool, but also when calling NVML directly.

@bhack

bhack commented Jan 16, 2019

Any news about this? It seems quite fundamental to me.
nvmlDeviceGetAccountingStats cannot work with a getpid() argument for the process itself inside nvidia-docker.
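
For illustration, a minimal pynvml sketch of the failing pattern described above (an assumption-laden sketch: it presumes pynvml is installed and accounting mode is enabled on the GPU). Inside a container started without --pid=host, getpid() returns the namespaced PID, while the driver only tracks host PIDs, so the lookup fails:

import os
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    # Ask the driver for per-process accounting stats for this process's own PID.
    stats = pynvml.nvmlDeviceGetAccountingStats(handle, os.getpid())
    print("max GPU memory used:", stats.maxMemoryUsage)
except pynvml.NVMLError as err:
    # Typically fails with "Not Found" inside a container, because the
    # container-namespaced PID does not match the host PID the driver recorded.
    print("accounting lookup failed:", err)
finally:
    pynvml.nvmlShutdown()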

@bhack

bhack commented Jan 17, 2019

@flx42 Can we track this issue somewhere? Because it seems that this one is closed.

@Breakend

+1

@bhack

bhack commented Sep 30, 2019

Please reopen this

@nvjmayo
Contributor

nvjmayo commented Oct 1, 2019

Re-opening. It's a driver limitation and it still exists. There is a mediocre workaround using node-manager, but it is unsatisfactory for some workflows.

The requirement for closure is for nvidia-smi to at least show processes running in the same container. The underlying libraries and drivers can't support this under the current architecture, so don't expect this to be fixed any time soon. No ETA, sorry!

@maaft

maaft commented Mar 11, 2020

Is there any update on this?
How can I get the GPU memory usage of a process inside a docker container?

Are there any workarounds?

@loretoparisi

loretoparisi commented Mar 11, 2020

We definitely need this. A typical example is when using PyTorch and running more than one training process. Often OOM will kill the process without any info if we cannot monitor the GPU. Even assuming we use TF with a GPU memory fraction like here:

import tensorflow as tf

# Assume that you have 12GB of GPU memory and want to allocate ~4GB:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

Within the container, I need the GPU memory usage to decide which policy to adopt at run time for running processes such as training, or for parallelizing inference, etc.
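
For reference, a minimal pynvml sketch (assuming pynvml is available inside the container) that reads device-level memory usage. It does not depend on per-process PIDs, so it works even without --pid=host, but it reports usage for the whole GPU rather than per process:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
# Device-level counters: total/used/free memory for the whole GPU.
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU 0: {mem.used // (1024 ** 2)} MiB used of {mem.total // (1024 ** 2)} MiB")
pynvml.nvmlShutdown()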

@maaft

maaft commented Mar 11, 2020

FYI:
As a workaround, I start the docker container with the --pid=host flag, which works perfectly fine with e.g. Python's os.getpid().

@loretoparisi

@maaft once you have the PID, what do you do with nvidia-smi? Thanks

@maaft

maaft commented Mar 12, 2020

Oh sorry, I guess I missed the topic on this one a bit.

Anyway, if using Python is an option for you:

import sys

import pynvml as N

MB = 1024 * 1024

def get_usage(device_index, my_pid):
    """Return the GPU memory (in MiB) used by my_pid on the given device."""
    N.nvmlInit()

    handle = N.nvmlDeviceGetHandleByIndex(device_index)

    # Look for my_pid among both compute and graphics processes on this device.
    usage = [nv_process.usedGpuMemory // MB for nv_process in
             N.nvmlDeviceGetComputeRunningProcesses(handle) + N.nvmlDeviceGetGraphicsRunningProcesses(handle) if
             nv_process.pid == my_pid]

    if len(usage) == 1:
        usage = usage[0]
    else:
        raise KeyError("PID not found")

    return usage

if __name__ == "__main__":
    # Arguments arrive as strings, so convert the device index and PID to int.
    print(get_usage(int(sys.argv[1]), int(sys.argv[2])))

Instead of calling nvidia-smi from your process, you could just call this little script with your GPU's device index and your process's PID as arguments, using popen or anything else, and read the result from stdout.
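
For example, a minimal usage sketch along those lines (assuming the snippet above is saved as gpu_usage.py, a hypothetical filename, and that the container was started with --pid=host so PIDs match the host namespace):

import os
import subprocess
import sys

# Run the helper script for GPU index 0 and this process's PID, then read stdout.
result = subprocess.run(
    [sys.executable, "gpu_usage.py", "0", str(os.getpid())],
    capture_output=True, text=True, check=True,
)
print("GPU memory used by this process (MiB):", result.stdout.strip())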

@te0006

te0006 commented Mar 27, 2020

No cigar. N.nvmlDeviceGetGraphicsRunningProcesses(handle) always returns the empty list for me, no matter what the actual load on the referenced GPU is.

@maaft

maaft commented Mar 27, 2020

@te0006 I just included it for completeness.
If your PID is not in N.nvmlDeviceGetGraphicsRunningProcesses(handle), it should be in N.nvmlDeviceGetComputeRunningProcesses(handle).

If both lists are empty: are you sure you started your docker container with --pid=host?

@te0006

te0006 commented Mar 27, 2020

Not really sure. Have to check with a colleague who does the administration. Thanks for the hint.

@dragonly

dragonly commented Apr 4, 2020

Can't believe it is still not working after 4 years. Now people are running GPU inference/training jobs in every Kubernetes cluster, and we really need a way to determine GPU usage inside docker containers.

@maaft

maaft commented Apr 8, 2020

@dragonly you can use the workaround stated above in the meantime.

@te0006

te0006 commented Apr 16, 2020

After consulting with my colleague and doing some testing, I can confirm that the above workaround works:
after he added 'hostPID: true' to the pod specification and restarted the container, nvidia-smi now shows the GPU-using Python processes correctly, with PID and GPU memory usage. Querying the GPU usage with maaft's Python code above works as well.

@nvjmayo
Contributor

nvjmayo commented Jun 17, 2020

With the following options available to you, I am closing this:

  1. add hostPID: true to the pod spec
  2. for Docker (rather than Kubernetes), run with --privileged or --pid=host. This is useful if you need to run nvidia-smi manually as an admin for troubleshooting.
  3. set up MIG partitions on a supported card

Some of the information people want to see (like memory and GPU usage) might be better represented in a monitoring system (e.g. Prometheus), where a graph can provide better visualization of multiple nodes in a cluster. I would love to see a way to monitor processes cluster-wide, and not simply for a particular node. (That is a bit out of scope for this issue.)

If there is a specific feature or enhancement to one of the options above, please open a new issue.

@nvjmayo nvjmayo closed this as completed Jun 17, 2020
@dmrub

dmrub commented May 13, 2021

I stumbled across this ticket while looking for a way to get information in a Kubernetes cluster about which pod/container is using which NVIDIA GPU, even if the pod is not running in hostPID mode. I have developed a tool to get this information, in case it is useful for someone:

https://github.com/dmrub/kube-utils/blob/master/kube-nvidia-get-processes.py

Note that the tool runs a privileged pod on each node with a GPU in order to map node PIDs to container PIDs.
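
As a rough illustration of that mapping (a sketch of one way it can be done, not necessarily how the linked tool works): with access to the host's /proc, the NSpid field in /proc/<pid>/status lists a process's PID in each nested PID namespace, and the last entry is the PID as seen inside the container.

def container_pid(host_pid, proc_root="/proc"):
    # NSpid lists the PID in each nested namespace, outermost first,
    # e.g. "NSpid:  27301   24927" -> 27301 on the host, 24927 in the container.
    with open(f"{proc_root}/{host_pid}/status") as f:
        for line in f:
            if line.startswith("NSpid:"):
                return int(line.split()[-1])
    return None  # field missing (very old kernel), so no mapping available

if __name__ == "__main__":
    # Host PID as reported by nvidia-smi / NVML on the node (from the sample output below).
    print(container_pid(27301))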

Here is the sample output:

NODE              POD                            NAMESPACE NODE_PID PID   GPU GPU_NAME             PCI_ADDRESS      GPU_MEMORY CMDLINE                                                                                                                             
proj-vm-1-node-02 pytorch-nvidia-794d7bb8d-mb7qp jodo01    27301    24927 0   Tesla V100-PCIE-32GB 00000000:00:07.0 813 MiB    /opt/conda/bin/python -m ipykernel_launcher -f /root/.local/share/jupyter/runtime/kernel-9a103c15-db05-4427-a850-cc79412cb4e8.json  
proj-vm-1-node-02 pytorch-nvidia-794d7bb8d-mb7qp jodo01    7744     25151 0   Tesla V100-PCIE-32GB 00000000:00:07.0 5705 MiB   /opt/conda/bin/python -m ipykernel_launcher -f /root/.local/share/jupyter/runtime/kernel-4049593b-313b-444f-8188-a33875108b10.json  

@yncxcw

yncxcw commented Jul 7, 2021

Hello, any update on this issue?

@pandaji

pandaji commented Sep 10, 2021

> With the following options available to you, I am closing this:
>
>   1. add hostPID: true to the pod spec
>   2. for Docker (rather than Kubernetes), run with --privileged or --pid=host. This is useful if you need to run nvidia-smi manually as an admin for troubleshooting.
>   3. set up MIG partitions on a supported card

Adding hostPID might not be a good idea for k8s clusters in which the GPU workers are shared among multiple tenants.

Users can potentially kill processes that belong to others if they are root in the container and can see all the PIDs on the host.

I still hope that this can be patched at the driver level to add PID namespace awareness.

@gh2o

gh2o commented Dec 10, 2021

I made a kernel module that patches the nvidia driver for correct PID namespace handling: https://github.com/gh2o/nvidia-pidns

@siaimes

siaimes commented Dec 20, 2021

This worked well for me for k8s: https://github.com/matpool/mpu

@gualixiang

I have solved this by using https://github.com/matpool/mpu/
