
How to leverage the CUDA libraries on the host machine without installing CUDA in the Docker image #120

Closed

Cherishty opened this issue Jun 11, 2019 · 3 comments
Cherishty commented Jun 11, 2019

1. Issue or feature description

When training a TensorFlow model inside a k8s pod, how can I leverage (mount) the CUDA libraries on the host machine without installing CUDA in the Docker image?

2. Steps to reproduce the issue

Formerly, in k8s 1.9.5, kubelet could be started with feature-gate=accelerator=true.
With this feature, I could install the NVIDIA driver, CUDA, and cuDNN on the HOST machine, then mount that host path via a k8s volume.
In this case the image only needs to install the python + tensorflow-gpu libraries and configure the corresponding LD_LIBRARY_PATH, with NO need to install (or be based on) CUDA anymore, which is too heavy and costs flexibility.
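For context, a minimal sketch of that old setup as a pod definition, assuming the host paths described above (the image name and training script are hypothetical placeholders):

```yaml
# Sketch of the old accelerator feature-gate approach, assuming the host
# has CUDA 9.0 under /usr/local/cuda-9.0 and the driver libraries under
# /usr/lib64/nvidia. Image name and script are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: tf-train
spec:
  containers:
  - name: trainer
    image: my-registry/python36-tf-gpu:latest  # no CUDA baked into the image
    command: ["python", "tensor_mnist.py"]
    env:
    - name: LD_LIBRARY_PATH
      value: /usr/local/cuda-9.0/lib64:/usr/lib64/nvidia
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: cuda-libs
      mountPath: /usr/local/cuda-9.0
    - name: nvidia-driver
      mountPath: /usr/lib64/nvidia
  volumes:
  - name: cuda-libs
    hostPath:
      path: /usr/local/cuda-9.0
  - name: nvidia-driver
    hostPath:
      path: /usr/lib64/nvidia
```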


However, with device plugins, the former mechanism no longer seems to work:
the pod definition YAML is the same (meaning the same Docker image (python, tf versions) and the same environment variables; the only difference is that the requested GPU resource changed from alpha.kubernetes.io/nvidia-gpu to nvidia.com/gpu, as shown below).
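Concretely, the only intended change in the pod spec is the resource name:

```yaml
# Under the device plugin, the GPU request changes from the old
# alpha.kubernetes.io/nvidia-gpu resource to nvidia.com/gpu;
# everything else in the pod definition stays the same.
resources:
  limits:
    nvidia.com/gpu: 1
```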

Some debug info:

  • nvidia-smi returns the right value inside the container (which means the Docker runtime and the device-plugin DaemonSet do work)

  • LD_LIBRARY_PATH is set correctly as /usr/local/cuda-9.0/lib64:/usr/lib64/nvidia, which is where the CUDA and NVIDIA driver libraries (especially libcuda.so.1) are located

When running python tensor_mnist.py (which trains a model using the GPU), the following error is raised:

E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_UNKNOWN

Executing it with sudo gives more detail about the error:

ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

while libcublas.so.9.0 can be found under /usr/local/cuda-9.0/lib64 (which is on LD_LIBRARY_PATH).

We do need this behavior to keep the image size down and preserve flexibility.
Many thanks for any suggestions or clues!

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
  • Your docker configuration file (e.g.: /etc/docker/daemon.json)
  • The k8s-device-plugin container logs
  • The kubelet logs on the node (e.g.: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version: 18.09.6
  • Docker command, image and tag used: self-built Python 3.6 + TensorFlow-gpu image
  • Kernel version from uname -a: 3.10.0-957.1.3.el7.x86_64
  • Any relevant kernel output lines from dmesg
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • NVIDIA container library version from nvidia-container-cli -V: version: 1.0.2
  • NVIDIA container library logs (see troubleshooting)
@Cherishty (Author)

The error TensorFlow raises is similar to this one, for which the suggested solution is to install nvidia-modprobe.

However, as you know, that is not acceptable in my case, since I do NOT want to install any redundant packages.

@RenaudWasTaken (Contributor)

Hello!

The feature-gate=accelerator=true flag has been deprecated. Please use the device plugin feature gate instead (marked beta somewhere around 1.9).
All the documentation is based on that and doesn't require you to mount anything.


Cherishty commented Jun 12, 2019

@RenaudWasTaken Many thanks, and sorry for the confusion.

Yes, I do need to use the device plugin instead of feature-gate.accelerator. However, my scenario is to run tensorflow-gpu in a k8s pod without basing the image on any CUDA image and without installing the CUDA libraries manually in the image.

All the documentation I can find tells me to install CUDA or build from a CUDA image; is there any workaround to bypass this?
With feature-gate.accelerator I could satisfy this by mounting the NVIDIA and CUDA libraries. What can I do in device-plugin mode? (See the sketch at the end of this comment.)

BTW, I have upgraded my k8s to 1.14.2, along with Docker 18.09.6, so the device plugin is my only option for using GPUs in containers.
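To make the question concrete, here is a sketch of the setup being attempted under the device plugin, assuming the same host paths and hostPath volumes as in the first sketch above; this is exactly the configuration that currently fails with CUDA_ERROR_UNKNOWN:

```yaml
# Abbreviated pod spec: hostPath volumes and LD_LIBRARY_PATH are kept
# exactly as in the first sketch; only the GPU resource request differs.
spec:
  containers:
  - name: trainer
    env:
    - name: LD_LIBRARY_PATH
      value: /usr/local/cuda-9.0/lib64:/usr/lib64/nvidia
    resources:
      limits:
        nvidia.com/gpu: 1  # was: alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: cuda-libs                 # hostPath: /usr/local/cuda-9.0
      mountPath: /usr/local/cuda-9.0
    - name: nvidia-driver             # hostPath: /usr/lib64/nvidia
      mountPath: /usr/lib64/nvidia
```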
