1. Issue or feature description

When training a TensorFlow model inside a k8s pod, how can we leverage (mount) the CUDA libraries on the host machine without installing CUDA in the Docker image?
2. Steps to reproduce the issue
Formerly, in k8s 1.9.5, kubelet could be started with the Accelerators feature gate (--feature-gates=Accelerators=true).
With this feature, I could install the NVIDIA driver, CUDA, and cuDNN on the HOST machine, then mount those host paths into the pod via a k8s hostPath volume.
In that case, the image only needs the Python + tensorflow-gpu libraries and the matching LD_LIBRARY_PATH; there is NO need to install (or base the image on) CUDA anymore, which is heavy and costs flexibility. A minimal sketch of that setup is shown below.
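For context, here is a minimal sketch of that old setup (pod, volume, and image names are hypothetical; the host paths and the resource name match the ones described in this issue):

```yaml
# Sketch of the old hostPath approach under the Accelerators feature gate.
# Object and image names are hypothetical; host paths are the ones on my node.
apiVersion: v1
kind: Pod
metadata:
  name: tf-train
spec:
  containers:
  - name: trainer
    image: my-python36-tf-gpu:latest   # self-built python3.6 + tensorflow-gpu, no CUDA inside
    env:
    - name: LD_LIBRARY_PATH
      value: /usr/local/cuda-9.0/lib64:/usr/lib64/nvidia
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: cuda-libs
      mountPath: /usr/local/cuda-9.0
    - name: nvidia-libs
      mountPath: /usr/lib64/nvidia
  volumes:
  - name: cuda-libs
    hostPath:
      path: /usr/local/cuda-9.0
  - name: nvidia-libs
    hostPath:
      path: /usr/lib64/nvidia
```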
However, with device plugins, the former mechanism no longer seems to work:
For the same pod definition YAML (meaning the same Docker image (Python, TF version) and the same environment variables), the only difference is that the requested GPU resource changed from alpha.kubernetes.io/nvidia-gpu to nvidia.com/gpu.
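Concretely, the only change in the pod spec is the resource name, roughly:

```yaml
# Before (Accelerators feature gate):
resources:
  limits:
    alpha.kubernetes.io/nvidia-gpu: 1

# After (device plugin):
resources:
  limits:
    nvidia.com/gpu: 1
```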
Some debug info:
nvidia-smi returns the right values inside the container (which means the Docker runtime and the device-plugin DaemonSet do work)
LD_LIBRARY_PATH is set correctly to
/usr/local/cuda-9.0/lib64:/usr/lib64/nvidia
which is where the CUDA and NVIDIA driver libraries (especially libcuda.so.1) are located
When running python tensor_mnist.py (which trains the model on the GPU), the following error is raised:
E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_UNKNOWN
Executing it with sudo gives more detail about the error:
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
Yet libcublas.so.9.0 can be found under /usr/local/cuda-9.0/lib64 (which is on LD_LIBRARY_PATH).
We really need this behavior in order to keep the image small and to preserve flexibility.
Many thanks for any suggestions or clues!
3. Information to attach (optional if deemed irrelevant)
Common error checking:
The output of nvidia-smi -a on your host
Your docker configuration file (e.g. /etc/docker/daemon.json)
The k8s-device-plugin container logs
The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
Additional information that might help better understand your environment and reproduce the bug:
Docker version from docker version: 18.09.6
Docker command, image and tag used: self-built image with Python 3.6 and tensorflow-gpu
Kernel version from uname -a: 3.10.0-957.1.3.el7.x86_64
Any relevant kernel output lines from dmesg
NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
NVIDIA container library version from nvidia-container-cli -V: version: 1.0.2
The Accelerators feature gate (feature-gate=accelerator=true) has been marked deprecated. Please use the device plugin feature gate instead (marked beta somewhere around 1.9).
All the documentation is based on that and doesn't require you to mount anything.
@RenaudWasTaken Many thanks, and sorry for the confusion.
Yes, I do need to use the device plugin instead of the Accelerators feature gate. However, my scenario is to run tensorflow-gpu in a k8s pod without basing the image on any CUDA image and without installing the CUDA libraries manually in the image.
All the documentation I can find tells me to install CUDA or build from a CUDA image; is there any workaround to bypass that?
With the Accelerators feature gate I could satisfy this by mounting the NVIDIA and CUDA libraries from the host. What can I do in device-plugin mode? (A sketch of what I am attempting is below.)
BTW, I have upgraded my k8s to 1.14.2, as well as Docker to 18.09.6, so the device plugin is my only choice for leveraging GPUs in containers.
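For reference, the fragment of the pod spec I am attempting under the device plugin looks roughly like this (everything else is unchanged from the sketch in the issue description above; names are illustrative):

```yaml
# Attempted combination: device-plugin resource request plus the same
# hostPath mounts that worked under the Accelerators feature gate.
resources:
  limits:
    nvidia.com/gpu: 1              # changed from alpha.kubernetes.io/nvidia-gpu
volumeMounts:
- name: cuda-libs                  # hostPath volume: /usr/local/cuda-9.0
  mountPath: /usr/local/cuda-9.0
- name: nvidia-libs                # hostPath volume: /usr/lib64/nvidia
  mountPath: /usr/lib64/nvidia
```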