MPS Support #419
Short answer: it is not supported for now. However, we are looking at it for the 2.0 timeframe, but there are a lot of corner cases that need to be investigated. I'll update this issue with additional information once we are confident it can work properly. |
Hi, |
The lack of MPS support seems like it would be a blocker for creating service deployments in orchestration. I'll be following the outcome in anticipation of a pull request covering the Swarm or Kubernetes use case. |
Any progress? or is there any workaround so I can use CUDA Multi-Process Service in the container? |
Shouldn't it be the other way around? I.e., shouldn't MPS run on the host so it can allocate process time to multiple containers? Is that an already-supported architecture? |
With 2.0 it should work as long as you run the MPS server on the host and use `--ipc=host`:

```shell
# Start the MPS control daemon, pinned to the second GPU (on the host)
sudo CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1 nvidia-cuda-mps-control -d

# Launch two containers on the second GPU device
docker run -ti --rm -e NVIDIA_VISIBLE_DEVICES=1 --runtime=nvidia --ipc=host nvidia/cuda
docker run -ti --rm -e NVIDIA_VISIBLE_DEVICES=1 --runtime=nvidia --ipc=host nvidia/cuda

# Stop the MPS daemon when done
echo quit | sudo nvidia-cuda-mps-control
```
|
Does it mean that we can set and limit CUDA_MPS_ACTIVE_THREAD_PERCENTAGE for each container? Any examples of usage would really help. Could you please elaborate on what you mean by "better integration"? Thank you |
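As a hedged sketch of what per-container limits might look like, assuming the host-side MPS setup shown above (the image name and percentages are illustrative, not an official recipe):

```shell
# Illustrative sketch (unverified): CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is
# read by each CUDA client at startup, so passing it as a container
# environment variable should bound that container's share of SMs.
docker run -ti --rm --runtime=nvidia --ipc=host \
  -e NVIDIA_VISIBLE_DEVICES=1 \
  -e CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=30 \
  nvidia/cuda

docker run -ti --rm --runtime=nvidia --ipc=host \
  -e NVIDIA_VISIBLE_DEVICES=1 \
  -e CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=10 \
  nvidia/cuda
```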
mark |
@3XX0 How much does "-ipc=host" compromise security? Somebody asked the question on SO but no answer yet: https://stackoverflow.com/questions/38907708/docker-ipc-host-and-security |
@3XX0 Any update on when nvidia-docker will officially support MPS? |
@3XX0 I did some tests and --ipc=host does appear to work. But is there anything else we should pay attention to when running the current nvidia-docker 2 under MPS? Would you recommend using it in production? It would be super helpful if you could provide some guidance here. |
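If --ipc=host feels too broad from a security standpoint, one possible mitigation is Docker's shareable IPC namespaces. This is only a sketch under assumptions (a Docker version supporting `--ipc=shareable`, illustrative container/volume names), not a verified recipe:

```shell
# Illustrative sketch: scope IPC sharing to a dedicated MPS daemon container
# instead of exposing the host IPC namespace to every container.
docker run -d --name mps --runtime=nvidia --ipc=shareable \
  -e NVIDIA_VISIBLE_DEVICES=1 \
  -v nvidia_mps:/tmp/nvidia-mps \
  --entrypoint nvidia-cuda-mps-control \
  nvidia/cuda -f

# Clients join the daemon's IPC namespace and share the MPS pipe directory
docker run -ti --rm --runtime=nvidia --ipc=container:mps \
  -e NVIDIA_VISIBLE_DEVICES=1 \
  -v nvidia_mps:/tmp/nvidia-mps \
  nvidia/cuda
```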
I've added a wiki page on how to use MPS with Docker Compose: You can look at the |
Hi @flx42, is it possible to provide a compose file whose format version is 2.1? Lots of companies still use Docker 1.12 in their clusters and cannot upgrade to 17.06 in the short term. |
@azazhu are you running RHEL/Atomic's fork of Docker? If you do, you can just remove the If that's not what you are running, you won't be able to make it work since the |
Thanks, @flx42. Could you check whether my understanding is correct:
|
Yes, that should work. But you can also containerize the MPS daemon, like in the Docker Compose example.
IIRC you can set this value for the MPS daemon, or for all CUDA client apps. I think both work fine. |
Thanks @flx42, what do you mean by "containerize the MPS daemon"? Launch the MPS daemon (nvidia-cuda-mps-control) on both the host machine and in the container? |
Yes, you can launch it inside a container or on the host. Both ways will work. |
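As a rough illustration of the containerized option (a sketch only; the Docker Compose wiki page mentioned above is the authoritative reference, and the image name and flags here are assumptions):

```shell
# Illustrative sketch: run the MPS control daemon in its own container.
# -f keeps the daemon in the foreground so the container stays alive;
# the /tmp/nvidia-mps pipe directory must be shared with client containers.
docker run -d --name mps-daemon --runtime=nvidia --ipc=host \
  -e NVIDIA_VISIBLE_DEVICES=1 \
  -v /tmp/nvidia-mps:/tmp/nvidia-mps \
  --entrypoint nvidia-cuda-mps-control \
  nvidia/cuda -f
```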
hi @flx42 ,
|
@flx42 Does MPS support Pascal GPUs in nvidia-docker containers? |
@GoodJoey not with the approach documented above, you would need a Volta GPU. |
Seems like MPS is not supported on the newest Docker version. This example shows that the containers have some kind of problem with CUDA...
Would really love to see "usable" support of MPS with Docker |
Any update on this issue? |
Hi, have you solved this problem? |
Hi, have you solved this problem? I want to set a different CUDA_MPS_ACTIVE_THREAD_PERCENTAGE for each container, such as 3×30% and 1×10% on a specific GPU. |
any update? |
We are working on a DRA Driver for NVIDIA GPUs (https://github.com/NVIDIA/k8s-dra-driver) which will include better MPS support. If there are use cases not covered by this (e.g. outside of K8s), please create an issue describing the use case against https://github.com/NVIDIA/nvidia-container-toolkit. |
Hi,
When I use the "CUDA Multi-Process Service" (MPS) in an nvidia-docker environment, I ran into a couple of issues, so I'm wondering whether MPS is supported in nvidia-docker at all. Please help me, thanks in advance~
Here are the problems I have met:
1. When I run
```shell
nvidia-cuda-mps-control -d
```
to start the MPS daemon inside nvidia-docker, I can't see this process from `nvidia-smi`; however, I can see the process from the host machine. In comparison, when I run the same command on the host machine (physical server), I do see it from `nvidia-smi` (a GPU program needs to run first to start the MPS server).
2. Running a GPU program in the container fails with:
```
F0703 13:39:15.539633 97 common.cpp:165] Check failed: error == cudaSuccess (46 vs. 0) all CUDA-capable devices are busy or unavailable
```
In comparison, this works OK on the host (physical machine).

I'm trying this on a P100 GPU, Ubuntu 14, Docker version 17.04.0-ce, build 4845c56.
I hope this is the right place to ask, thanks again.