This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

GPU Docker Plugin #8

Closed
ruffsl opened this issue Nov 17, 2015 · 21 comments

@ruffsl
Contributor

ruffsl commented Nov 17, 2015

I've been looking at ways to use CUDA containers at my workplace, as our lab shares a common Nvidia workstation, and I'd like to interact with this server in a more abstract manner so that 1) I can more readily port my robotics work to any Nvidia workstation, and 2) I can minimize the impact of changes affecting others using the shared research workstation.

One gap I'm wrestling with is how to incorporate the current NVIDIA Docker wrapper with the rest of the existing Docker ecosystem: docker-compose, machine, and swarm. The current drop-in replacement for the docker run|create CLI is awesome, but it only gets us so far. The moment we need any additional tooling for abstracting or scaling up our apps, or for avoiding direct interaction with the host, it's hard to take that last step.

So I'm thinking this might be a case for making a relevant Docker plugin, harkening back to a recent post on the Docker blog, Extending Docker with Plugins. That post was perhaps geared more towards networking and storage drivers, but perhaps our issue here could be treated as custom volume management. I feel the same level of integration for GPU device options may be called for to achieve the desired user experience in cloud development or cluster computing with Nvidia. This'll most likely call for something more demanding than shell scripts to extend the needed interfaces, so I'd like to hear the rest of the community's and the Nvidia devs' take on this.
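
For context, the volume plugin mechanism from that blog post is just JSON over HTTP on a socket under /run/docker/plugins; a sketch of the activation handshake Docker would perform against a hypothetical nvidia plugin socket (socket name is my invention here):

# Docker activates a plugin by POSTing to its discovery socket
curl -s -XPOST --unix-socket /run/docker/plugins/nvidia.sock http://localhost/Plugin.Activate
# expected reply: {"Implements": ["VolumeDriver"]}
# Docker then drives it with VolumeDriver.Create/Mount/Unmount calls over the same socket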

@flx42
Member

flx42 commented Nov 17, 2015

Don't worry, we are already working on the plugin :)

@ruffsl
Contributor Author

ruffsl commented Nov 17, 2015

I figured, but you know I couldn't help but ask :P

@3XX0
Member

3XX0 commented Nov 17, 2015

It's being worked on as we speak :) I should have a working implementation fairly soon for you to play with.

@Kaixhin

Kaixhin commented Nov 18, 2015

The wrapper script was my biggest concern with this project, but the Docker plugin sounds like the ideal solution. Once this is ready (alongside #7 and hopefully #10), I'll be happy to port over the many DL images built on top of kaixhin/cuda. I'll keep old versions around for legacy purposes, but it'll be good to have CUDA on Docker looked after officially.

3XX0 added a commit that referenced this issue Dec 6, 2015
Leverage the Docker volume plugin mechanism introduced with Docker 1.9
This plugin also exports a few REST endpoints to ease remote NVIDIA Docker management
This should address issue #8
@3XX0
Member

3XX0 commented Dec 6, 2015

I just pushed an initial implementation of the plugin in the v1 branch.
This certainly needs additional work but people can start experimenting with it now.
The plugin has two REST endpoints that one can query to get GPU information:

  • localhost:3476/gpu/info
  • localhost:3476/gpu/status

In addition, it provides /docker/cli, which generates the proper Docker arguments given volume names and device numbers (it will probably be called from within nvidia-docker).

Example of running the CUDA runtime image with two GPUs, 0 and 1:

# Build the CUDA runtime image and the plugin binary
make runtime
cd plugin && make
# Launch the plugin daemon in the background
sudo sh -c "./bin/nvidia-docker-plugin &"

# Ask the plugin for the docker run arguments for the given devices and volumes
gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1\&vol=bin+cuda; }
docker run -ti `gpu 0+1` cuda:runtime
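
As an aside, the two info endpoints above can be queried directly for a quick look at the GPUs (assuming the plugin is running locally):

curl -s http://localhost:3476/gpu/info
curl -s http://localhost:3476/gpu/status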

@ruffsl
Contributor Author

ruffsl commented Dec 8, 2015

@3XX0, I've tried running the plugin using the snippet you posted above, but it looks like I'm having some issues with the ld.so.cache file:

~/git/NVIDIA/nvidia-docker/plugin$ sudo sh -c "./bin/plugin &"
nvidia-docker-plugin | 2015/12/08 11:29:07 Loading NVIDIA management library
nvidia-docker-plugin | 2015/12/08 11:29:07 Loading NVIDIA unified memory module
nvidia-docker-plugin | 2015/12/08 11:29:07 Discovering GPU devices
nvidia-docker-plugin | 2015/12/08 11:29:07 Creating volumes
nvidia-docker-plugin | 2015/12/08 11:29:07 Error: invalid ld.so.cache file

I've reproduced the same error on two different systems:
GPU Docker Plugin Debugging

Let me know of any more specifics or logs you'd need.

@3XX0
Member

3XX0 commented Dec 8, 2015

Weird, can you give the output of:

strings /etc/ld.so.cache | head -n 2           # header magic strings
hexdump -C -n 256 /etc/ld.so.cache             # raw bytes of the cache header
hexdump -C /etc/ld.so.cache | grep -A2 glibc   # locate the glibc format marker

@ruffsl
Contributor Author

ruffsl commented Dec 8, 2015

@3XX0, I've amended my gist above with the added output.

@3XX0
Member

3XX0 commented Dec 8, 2015

My bad... Thanks for the report, it should be fixed now.

@ruffsl
Contributor Author

ruffsl commented Dec 8, 2015

Success using the new nvidia-docker-plugin executable!

$ sudo sh -c "./bin/nvidia-docker-plugin &"
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Loading NVIDIA management library
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Loading NVIDIA unified memory
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Discovering GPU devices
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Creating volumes at /tmp/nvidia-volumes-355599703
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Serving plugin API at /run/docker/plugins
./bin/nvidia-docker-plugin | 2015/12/08 15:27:37 Serving remote API at localhost:3476

$ gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1;vol=bin+cuda; }

$ docker run -ti `gpu 0` cuda:runtime

root@f4a4da5d68b1:/# nvidia-smi 
Tue Dec  8 20:28:20 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.63     Driver Version: 352.63         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 0000:01:00.0      On |                  N/A |
| 22%   38C    P8    17W / 250W |    659MiB / 12287MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

So correct me on what I see going on so far:

  • The user builds the plugin executable via the Makefile
  • That kick-starts a docker build step using a golang image to compile the binary onto a host volume.
  • Then we launch the REST service in the background
  • Define a local shell function to curl the service for the docker run args
    • For mounting both GPUs 0 and 1, plus the bin and cuda volumes (although omitting vol=bin+cuda doesn't seem to modify the returned string)
  • Then provide the expanded output of the curl call to the docker run CLI.

I like the REST endpoints; it's kind of handy on its own just to be able to point a browser at x.x.x.x:3476/gpu/status to check on GPU usage. How would I use this remotely, i.e. when I can't rely on the gpu 0+1 nested shell command executing against the remote host (like with docker-machine)?

@flx42
Member

flx42 commented Dec 8, 2015

The query string separator is wrong, that's why the vol argument is not taken into account. It should be:

gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1\&vol=bin+cuda; }

@3XX0
Member

3XX0 commented Dec 8, 2015

@ruffsl Correct, to use it remotely you would do something like:

# On the docker-machine host
sudo ./bin/nvidia-docker-plugin -l :3476

# On the docker client
gpu(){
  host="$( docker-machine url $1 | sed 's;tcp://\(.*\):[0-9]\+;http://\1:3476;' );"
  curl -s $host/docker/cli?dev=$2\&vol=bin+cuda;
}

eval "$(docker-machine env <MACHINE>)"
docker run -ti `gpu <MACHINE> 0+1` cuda:runtime

Note that if your docker-machine is backed by a VM, you will need to enable GPU passthrough.

Eventually everything should be abstracted within nvidia-docker, so stay tuned.

@ruffsl
Contributor Author

ruffsl commented Dec 8, 2015

@flx42, what I'm seeing is that

  • gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1\&vol=bin+cuda; }
  • gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1; }

are both returning the same string of args:

--device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --volume-driver=nvidia --volume=bin:/usr/local/nvidia/bin --volume=cuda:/usr/local/nvidia

It's not likely that someone would want to use this without mounting the bin and cuda volumes, but I was thinking it'd behave as a parameter (omit it, and the volumes won't be included)?

@3XX0, wouldn't that assume you'd need to expose port 3476 of the remote machine to the world? If I'm recalling correctly, docker-machine binds the daemon to a TCP port secured via key exchange, and then does some ssh. How would the request reach the remote REST endpoint from the local client's shell session?

@flx42
Member

flx42 commented Dec 8, 2015

Yes, if you don't specify vol, by default it will take all volumes. But try vol=bin, it should be different.
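
For example, reusing the earlier shell function but requesting only the bin volume:

gpu(){ curl -s http://localhost:3476/docker/cli?dev=$1\&vol=bin; }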

@ruffsl
Contributor Author

ruffsl commented Dec 8, 2015

I see, that does work. Is there a way to specify none?

@3XX0
Member

3XX0 commented Dec 9, 2015

@ruffsl no, we didn't implement none because it doesn't really make sense to ask for no volumes.

Regarding the REST API, if you want remote access, you need to expose it.
There has been some question of adding SSL to handle unsafe environments, but in practice you rarely need it.
You can always tunnel it through ssh and, in fact, I was thinking about adding a similar feature in the future nvidia-docker.
I'm not really sure how docker-machine works, but from my understanding it uses DOCKER_HOST, hence your Docker daemon needs to be exposed as well (I might be wrong though).
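
For instance, a plain ssh local forward keeps port 3476 off the network (user and host names here are placeholders):

# Forward the plugin's remote API over ssh instead of exposing it
ssh -f -N -L 3476:localhost:3476 user@gpu-host
# The endpoints are then reachable locally as before
curl -s http://localhost:3476/gpu/status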

@ruffsl
Contributor Author

ruffsl commented Dec 9, 2015

Well, it'd be tedious for people to override this while still leveraging the device detection and the rest of the mounting machinery the plugin here has to offer. Niche, I know, but this would be useful for those who'd like to use set Nvidia devices but needn't use CUDA. Remember, people like me still need to bake the drivers into the container for some apps to get things like OpenGL working; I think these volumes may blow away some files we'd need to preserve at runtime in that scenario.

Yeah, regarding remote access, I feel like there might be a better way to go about this. I'm wondering if there would be something better than just port forwarding with docker-machine ssh. Let's ask @psftw; maybe he'd know about this topic or know who to ask.

@3XX0
Member

3XX0 commented Dec 10, 2015

I don't get it, why would you want no volumes? GPU devices are unusable without at least one NVIDIA volume, and if you really need that then you don't need to use the /docker/cli endpoint, just use --device directly.
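
For example, passing the devices by hand with the paths reported earlier in this thread (image name is a placeholder):

docker run -ti --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 my-image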

Speaking of which, I'm wondering if the current volume separation (aka cuda, bin...) is worth the trouble. I might change it to a single driver volume unless there is a reason not to do so.

3XX0 added a commit that referenced this issue Dec 18, 2015
Leverage the Docker volume plugin mechanism introduced with Docker 1.9
This plugin also exports a few REST endpoints to ease remote NVIDIA Docker management
This should address issue #8
3XX0 added a commit that referenced this issue Jan 5, 2016
Leverage the Docker volume plugin mechanism introduced with Docker 1.9
This plugin also exports a few REST endpoints to ease remote NVIDIA Docker management
This should address issue #8
3XX0 added a commit that referenced this issue Jan 9, 2016
Leverage the Docker volume plugin mechanism introduced with Docker 1.9
This plugin also exports a few REST endpoints to ease remote NVIDIA Docker management
This should address issue #8
3XX0 closed this as completed in fda10b2 Jan 9, 2016
@cancan101

Is it now possible to use docker-compose along with nvidia-docker? If so, how?

@lsb

lsb commented Mar 2, 2016

I too would be interested in using docker-compose along with nvidia-docker.

@3XX0
Member

3XX0 commented Mar 2, 2016

See #39 ;)
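
For the impatient, a rough sketch of what a compose (v1 format) service might look like, simply translating the /docker/cli flags shown earlier in this thread; see #39 for the actual discussion, and note that devices and volume_driver support depends on your docker-compose version:

# docker-compose.yml (sketch, not an official example)
cuda:
  image: cuda:runtime
  devices:
    - /dev/nvidiactl
    - /dev/nvidia-uvm
    - /dev/nvidia0
  volume_driver: nvidia
  volumes:
    - bin:/usr/local/nvidia/bin
    - cuda:/usr/local/nvidia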
