This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

GPU Docker Plugin #8

Closed
ruffsl opened this issue Nov 17, 2015 · 21 comments

@ruffsl
Contributor

ruffsl commented Nov 17, 2015

I've been looking at ways to use CUDA containers at my workplace, as our lab shares a common Nvidia workstation, and I'd like to interact with this server in a more abstract manner so that 1) I can more readily port my robotics work to any Nvidia workstation, and 2) I can minimize the impact of changes affecting others using the shared research workstation.

One gap I'm wrestling with is how to incorporate the current NVIDIA Docker wrapper with the rest of the existing Docker ecosystem: docker-compose, machine, and swarm. The current drop-in replacement for the docker run|create CLI is awesome, but it only gets us so far. The moment we need any additional tooling for abstracting or scaling up our apps, or for avoiding direct interaction with the host, it's hard to take that last step.

So I'm thinking this might be a case for making a relevant Docker plugin, harkening back to a recent post on the Docker blog, Extending Docker with Plugins. That post was perhaps geared more towards networking and storage drivers, but perhaps our issue here could be treated as custom volume management. I feel the same level of integration for GPU device options may be called for to achieve the desired user experience in cloud development or cluster computing with Nvidia. This'll most likely call for something more demanding than shell scripts to extend the needed interfaces, so I'd like to hear the rest of the community's and the Nvidia devs' take on this.
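
For context, the volume plugin mechanism from that blog post is just JSON over HTTP on a socket under /run/docker/plugins; a sketch of the activation handshake Docker would perform against a hypothetical nvidia plugin socket (socket name is my invention here):

# Docker activates a plugin by POSTing to its discovery socket
curl -s -XPOST --unix-socket /run/docker/plugins/nvidia.sock http://localhost/Plugin.Activate
# expected reply: {"Implements": ["VolumeDriver"]}
# Docker then drives it with VolumeDriver.Create/Mount/Unmount calls over the same socket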

@flx42
Member

flx42 commented Nov 17, 2015

Don't worry, we are already working on the plugin :)

@ruffsl
Contributor Author

ruffsl commented Nov 17, 2015

I figured, but you know I couldn't help but ask :P

@3XX0
Member

3XX0 commented Nov 17, 2015

It's being worked on as we speak :) I should have a working implementation fairly soon for you to play with.

@Kaixhin

Kaixhin commented Nov 18, 2015

The wrapper script was my biggest concern with this project, but the Docker plugin sounds like the ideal solution. Once this is ready (alongside #7 and hopefully #10), I'll be happy to port over the many DL images built on top of kaixhin/cuda. I'll keep old versions around for legacy purposes, but it'll be good to have CUDA on Docker looked after officially.

3XX0 added a commit that referenced this issue Dec 6, 2015
Leverage the Docker volume plugin mechanism introduced with Docker 1.9
This plugin also exports a few REST endpoints to ease remote NVIDIA Docker management
This should address issue #8
@3XX0
Member

3XX0 commented Dec 6, 2015

I just pushed an initial implementation of the plugin in the v1 branch.
This certainly needs additional work but people can start experimenting with it now.
The plugin has two REST endpoints that one can query to get GPU information:

  • localhost:3476/gpu/info
  • localhost:3476/gpu/status

In addition, it provides /docker/cli, which generates the proper Docker arguments given volume names and device numbers (it will probably be called from within nvidia-docker).

Example of running the CUDA runtime image with two GPUs, 0 and 1:

# Build the CUDA runtime image and the plugin binary
make runtime
cd plugin && make
# Launch the plugin daemon in the background
sudo sh -c "./bin/nvidia-docker-plugin &"

# Ask the plugin for the docker run arguments for the given devices and volumes
gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1\&vol=bin+cuda; }
docker run -ti `gpu 0+1` cuda:runtime
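
As an aside, the two info endpoints above can be queried directly for a quick look at the GPUs (assuming the plugin is running locally):

curl -s http://localhost:3476/gpu/info
curl -s http://localhost:3476/gpu/status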

@ruffsl
Contributor Author

ruffsl commented Dec 8, 2015

@3XX0, I've tried running the plugin using the snippet you posted above, but it looks like I'm having some issues with the ld.so.cache file:

~/git/NVIDIA/nvidia-docker/plugin$ sudo sh -c "./bin/plugin &"
nvidia-docker-plugin | 2015/12/08 11:29:07 Loading NVIDIA management library
nvidia-docker-plugin | 2015/12/08 11:29:07 Loading NVIDIA unified memory module
nvidia-docker-plugin | 2015/12/08 11:29:07 Discovering GPU devices
nvidia-docker-plugin | 2015/12/08 11:29:07 Creating volumes
nvidia-docker-plugin | 2015/12/08 11:29:07 Error: invalid ld.so.cache file

I've reproduced the same error on two different systems:
GPU Docker Plugin Debugging

Let me know of any more specifics or logs you'd need.

@3XX0
Member

3XX0 commented Dec 8, 2015

Weird, can you give the output of:

strings /etc/ld.so.cache | head -n 2           # header magic strings
hexdump -C -n 256 /etc/ld.so.cache             # raw bytes of the cache header
hexdump -C /etc/ld.so.cache | grep -A2 glibc   # locate the glibc format marker

@ruffsl
Contributor Author

ruffsl commented Dec 8, 2015

@3XX0, I've amended my gist above with the added output.

@3XX0
Member

3XX0 commented Dec 8, 2015

My bad... Thanks for the report, it should be fixed now.

@ruffsl
Contributor Author

ruffsl commented Dec 8, 2015

Success using the new nvidia-docker-plugin executable!

$ sudo sh -c "./bin/nvidia-docker-plugin &"
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Loading NVIDIA management library
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Loading NVIDIA unified memory
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Discovering GPU devices
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Creating volumes at /tmp/nvidia-volumes-355599703
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Serving plugin API at /run/docker/plugins
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Serving remote API at localhost:3476

$ gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1;vol=bin+cuda; }

$ docker run -ti `gpu 0` cuda:runtime

root@f4a4da5d68b1:/# nvidia-smi 
Tue Dec  8 20:28:20 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.63     Driver Version: 352.63         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:01:00.0      On |                  N/A |
| 22%   38C    P8    17W / 250W |    659MiB / 12287MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

So correct me on what I see going on so far:

  • The user builds the plugin executable via the Makefile
  • That kick-starts a docker build step using a golang image to compile the binary onto a host volume.
  • Then we launch the REST service in the background
  • Define a local shell function to curl the service for the docker run args
    • For mounting both GPUs 0 and 1, plus the bin and cuda volumes (although omitting vol=bin+cuda doesn't seem to modify the returned string)
  • Then provide the expanded output of the curl call to the docker run CLI.

I like the REST endpoints; it's kind of handy on its own just to be able to point a browser at x.x.x.x:3476/gpu/status to check on GPU usage. How would I use this remotely, i.e. when I can't rely on the gpu 0+1 nested shell command executing against the remote host (like with docker-machine)?

@flx42
Member

flx42 commented Dec 8, 2015

The query string separator is wrong, that's why the vol argument is not taken into account. It should be:

gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1\&vol=bin+cuda; }

@3XX0
Member

3XX0 commented Dec 8, 2015

@ruffsl Correct, to use it remotely you would do something like:

# On the docker-machine host
sudo ./bin/nvidia-docker-plugin -l :3476

# On the docker client
gpu(){
  host="$( docker-machine url $1 | sed 's;tcp://\(.*\):[0-9]\+;http://\1:3476;' );"
  curl -s $host/docker/cli?dev=$2\&vol=bin+cuda;
}

eval "$(docker-machine env <MACHINE>)"
docker run -ti `gpu <MACHINE> 0+1` cuda:runtime

Note that if your docker-machine is backed by a VM, you will need to enable GPU passthrough.

Eventually everything should be abstracted within nvidia-docker, so stay tuned.

@ruffsl
Contributor Author

ruffsl commented Dec 8, 2015

@flx42, what I'm seeing is that

  • gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1\&vol=bin+cuda; }
  • gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1; }

are both returning the same string of args:

--device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --volume-driver=nvidia --volume=bin:/usr/local/nvidia/bin --volume=cuda:/usr/local/nvidia

It's not likely that someone would want to use this without mounting the bin and cuda volumes, but I was thinking it'd behave as a parameter (omit it, and the volumes won't be included)?

@3XX0, wouldn't that assume you'd need to expose port 3476 of the remote machine to the world? If I'm recalling correctly, docker-machine binds the daemon to a TCP port secured via key exchange, and then does some ssh. How would the request reach the remote REST endpoint from the local client's shell session?

@flx42
Member

flx42 commented Dec 8, 2015

Yes, if you don't specify vol, by default it will take all volumes. But try vol=bin, it should be different.
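
For example, reusing the earlier shell function but requesting only the bin volume:

gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1\&vol=bin; }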

@ruffsl
Contributor Author

ruffsl commented Dec 8, 2015

I see, that does work. Is there a way to specify none?

@3XX0
Member

3XX0 commented Dec 9, 2015

@ruffsl no, we didn't implement none because it doesn't really make sense to ask for no volumes.

Regarding the REST API, if you want remote access, you need to expose it.
There has been some question of adding SSL to handle unsafe environments, but in practice you rarely need it.
You can always tunnel it through ssh and, in fact, I was thinking about adding a similar feature in the future nvidia-docker.
I'm not really sure how docker-machine works, but from my understanding it uses DOCKER_HOST, hence your Docker daemon needs to be exposed as well (I might be wrong though).
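
For instance, a plain ssh local forward keeps port 3476 off the network (user and host names here are placeholders):

# Forward the plugin's remote API over ssh instead of exposing it
ssh -f -N -L 3476:localhost:3476 user@gpu-host
# The endpoints are then reachable locally as before
curl -s http://localhost:3476/gpu/status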

@ruffsl
Contributor Author

ruffsl commented Dec 9, 2015

Well, it'd be tedious for people to override this while still leveraging the device detection and the rest of the mounting machinery the plugin here has to offer. Niche, I know, but this would be useful for those who'd like to use set Nvidia devices but needn't use CUDA. Remember, people like me still need to bake the drivers into the container for some apps to get things like OpenGL working; I think these volumes may blow away some files we'd need to preserve at runtime in that scenario.

Yeah, regarding remote access, I feel like there might be a better way to go about this. I'm wondering if there would be something better than just port forwarding with docker-machine ssh. Let's ask @psftw; maybe he'd know about this topic or know who to ask.

@3XX0
Member

3XX0 commented Dec 10, 2015

I don't get it, why would you want no volumes? GPU devices are unusable without at least one NVIDIA volume, and if you really need that then you don't need to use the /docker/cli endpoint, just use --device directly.
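
For example, passing the devices by hand with the paths reported earlier in this thread (image name is a placeholder):

docker run -ti --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 my-image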

Speaking of which, I'm wondering if the current volume separation (aka cuda, bin...) is worth the trouble. I might change it to a single driver volume unless there is a reason not to do so.

3XX0 added a commit that referenced this issue Dec 18, 2015
Leverage the Docker volume plugin mechanism introduced with Docker 1.9
This plugin also exports a few REST endpoints to ease remote NVIDIA Docker management
This should address issue #8
3XX0 added a commit that referenced this issue Jan 5, 2016
Leverage the Docker volume plugin mechanism introduced with Docker 1.9
This plugin also exports a few REST endpoints to ease remote NVIDIA Docker management
This should address issue #8
3XX0 added a commit that referenced this issue Jan 9, 2016
Leverage the Docker volume plugin mechanism introduced with Docker 1.9
This plugin also exports a few REST endpoints to ease remote NVIDIA Docker management
This should address issue #8
3XX0 closed this as completed in fda10b2 Jan 9, 2016
@cancan101

Is it now possible to use docker-compose along with nvidia-docker? If so, how?

@lsb

lsb commented Mar 2, 2016

I too would be interested in using docker-compose along with nvidia-docker.

@3XX0
Member

3XX0 commented Mar 2, 2016

See #39 ;)
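
For the impatient, a rough sketch of what a compose (v1 format) service might look like, simply translating the /docker/cli flags shown earlier in this thread; see #39 for the actual discussion, and note that devices and volume_driver support depends on your docker-compose version:

# docker-compose.yml (sketch, not an official example)
cuda:
  image: cuda:runtime
  devices:
    - /dev/nvidiactl
    - /dev/nvidia-uvm
    - /dev/nvidia0
  volume_driver: nvidia
  volumes:
    - bin:/usr/local/nvidia/bin
    - cuda:/usr/local/nvidia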
