1. Issue or feature description

When training a TensorFlow model inside a k8s pod, how can we leverage (mount) the CUDA libraries on the host machine without installing CUDA in the Docker image?
2. Steps to reproduce the issue
Formerly, in k8s 1.9.5, kubelet could be started with the Accelerators feature gate (--feature-gates=Accelerators=true).
With this feature, I could install the NVIDIA driver, CUDA, and cuDNN on the HOST machine, then mount those host paths into the pod via a k8s hostPath volume.
In that case, the image only needs the Python + tensorflow-gpu libraries and the matching LD_LIBRARY_PATH; there is NO need to install (or base the image on) CUDA anymore, which is heavy and costs flexibility. A minimal sketch of that setup is shown below.
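For context, here is a minimal sketch of that old setup (pod, volume, and image names are hypothetical; the host paths and the resource name match the ones described in this issue):

```yaml
# Sketch of the old hostPath approach under the Accelerators feature gate.
# Object and image names are hypothetical; host paths are the ones on my node.
apiVersion: v1
kind: Pod
metadata:
  name: tf-train
spec:
  containers:
  - name: trainer
    image: my-python36-tf-gpu:latest   # self-built python3.6 + tensorflow-gpu, no CUDA inside
    env:
    - name: LD_LIBRARY_PATH
      value: /usr/local/cuda-9.0/lib64:/usr/lib64/nvidia
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: 1
    volumeMounts:
    - name: cuda-libs
      mountPath: /usr/local/cuda-9.0
    - name: nvidia-libs
      mountPath: /usr/lib64/nvidia
  volumes:
  - name: cuda-libs
    hostPath:
      path: /usr/local/cuda-9.0
  - name: nvidia-libs
    hostPath:
      path: /usr/lib64/nvidia
```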
However, with device plugins, the former mechanism no longer seems to work:
For the same pod definition YAML (meaning the same Docker image (Python, TF version) and the same environment variables), the only difference is that the requested GPU resource changed from alpha.kubernetes.io/nvidia-gpu to nvidia.com/gpu.
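Concretely, the only change in the pod spec is the resource name, roughly:

```yaml
# Before (Accelerators feature gate):
resources:
  limits:
    alpha.kubernetes.io/nvidia-gpu: 1

# After (device plugin):
resources:
  limits:
    nvidia.com/gpu: 1
```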
Some debug info:
nvidia-smi returns the right values inside the container (which means the Docker runtime and the device-plugin DaemonSet do work)
LD_LIBRARY_PATH is set correctly to
/usr/local/cuda-9.0/lib64:/usr/lib64/nvidia
which is where the CUDA and NVIDIA driver libraries (especially libcuda.so.1) are located
When running python tensor_mnist.py (which trains the model on the GPU), the following error is raised:
E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_UNKNOWN
Executing it with sudo gives more detail about the error:
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
Yet libcublas.so.9.0 can be found under /usr/local/cuda-9.0/lib64 (which is on LD_LIBRARY_PATH).
We really need this behavior in order to keep the image small and to preserve flexibility.
Many thanks for any suggestions or clues!
3. Information to attach (optional if deemed irrelevant)
Common error checking:
The output of nvidia-smi -a on your host
Your docker configuration file (e.g. /etc/docker/daemon.json)
The k8s-device-plugin container logs
The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
Additional information that might help better understand your environment and reproduce the bug:
Docker version from docker version: 18.09.6
Docker command, image and tag used: self-built image with Python 3.6 and tensorflow-gpu
Kernel version from uname -a: 3.10.0-957.1.3.el7.x86_64
Any relevant kernel output lines from dmesg
NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
NVIDIA container library version from nvidia-container-cli -V: version: 1.0.2
The Accelerators feature gate (feature-gate=accelerator=true) has been marked deprecated. Please use the device plugin feature gate instead (marked beta somewhere around 1.9).
All the documentation is based on that and doesn't require you to mount anything.
@RenaudWasTaken Many thanks, and sorry for the confusion.
Yes, I do need to use the device plugin instead of the Accelerators feature gate. However, my scenario is to run tensorflow-gpu in a k8s pod without basing the image on any CUDA image and without installing the CUDA libraries manually in the image.
All the documentation I can find tells me to install CUDA or build from a CUDA image; is there any workaround to bypass that?
With the Accelerators feature gate I could satisfy this by mounting the NVIDIA and CUDA libraries from the host. What can I do in device-plugin mode? (A sketch of what I am attempting is below.)
BTW, I have upgraded my k8s to 1.14.2, as well as Docker to 18.09.6, so the device plugin is my only choice for leveraging GPUs in containers.
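For reference, the fragment of the pod spec I am attempting under the device plugin looks roughly like this (everything else is unchanged from the sketch in the issue description above; names are illustrative):

```yaml
# Attempted combination: device-plugin resource request plus the same
# hostPath mounts that worked under the Accelerators feature gate.
resources:
  limits:
    nvidia.com/gpu: 1              # changed from alpha.kubernetes.io/nvidia-gpu
volumeMounts:
- name: cuda-libs                  # hostPath volume: /usr/local/cuda-9.0
  mountPath: /usr/local/cuda-9.0
- name: nvidia-libs                # hostPath volume: /usr/lib64/nvidia
  mountPath: /usr/lib64/nvidia
```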