
K8s-device-plugin does not work with Docker 19.03 and K8s v1.15.0 #139

Closed
ayadmarouane opened this issue Sep 30, 2019 · 6 comments


ayadmarouane commented Sep 30, 2019

1. Issue or feature description

I installed the latest version of Docker, which natively supports NVIDIA GPUs, along with the 410 NVIDIA driver, the latest version of nvidia-container-toolkit, and CUDA 10.0.
Docker works fine with NVIDIA GPUs without any changes to the /etc/docker/daemon.json file:

sudo docker run --gpus all --rm nvidia/cuda:10.0-base-ubuntu18.04 nvidia-smi
Mon Sep 30 12:44:17 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+

Then I installed the k8s-device-plugin by running the following command, but Kubernetes still cannot see the graphics card:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta/nvidia-device-plugin.yml

Logs from the daemonset:
2019/09/30 12:34:07 Loading NVML
2019/09/30 12:34:07 Failed to initialize NVML: could not load NVML library.
2019/09/30 12:34:07 If this is a GPU node, did you set the docker default runtime to nvidia?
2019/09/30 12:34:07 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2019/09/30 12:34:07 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

kubectl describe nodes scalio-ub-n01 | grep -B 5 gpu
Resource           Requests     Limits
cpu                200m (0%)    200m (0%)
memory             275Mi (0%)   75Mi (0%)
ephemeral-storage  0 (0%)       0 (0%)
nvidia.com/gpu     0            0

I also tried to deploy a pod with nvidia.com/gpu: 1 as both resource request and limit, but it doesn't work:
0/3 nodes are available: 3 Insufficient nvidia.com/gpu. default-scheduler-2
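
For reference, a minimal pod spec of the kind used for this test might look like the sketch below (the pod and container names are illustrative; the image is the same one used above; for extended resources such as nvidia.com/gpu, the request defaults to the limit when only a limit is set):

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test                      # illustrative name
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda
        image: nvidia/cuda:10.0-base-ubuntu18.04
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1             # request defaults to this limit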

For information, I have a 3-node cluster with 2 workers. The daemonset is deployed on both workers, and each worker has a GeForce RTX 2080 Ti.

2. Steps to reproduce the issue

Install 1.0.0-beta/nvidia-device-plugin.yml with K8s 1.15.0 and Docker 19.03

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi on your host
    Mon Sep 30 12:52:42 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
    |-------------------------------+----------------------+----------------------+
    | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
    | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
    |===============================+======================+======================|
    | 0 GeForce RTX 208... Off | 00000000:15:00.0 Off | N/A |
    | 32% 28C P8 22W / 260W | 0MiB / 10981MiB | 0% Default |
    +-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

  • The output of nvidia-smi -a on your host
    ==============NVSMI LOG==============

Timestamp : Mon Sep 30 12:52:51 2019
Driver Version : 410.104
CUDA Version : 10.0

Attached GPUs : 1
GPU 00000000:15:00.0
Product Name : GeForce RTX 2080 Ti
Product Brand : GeForce
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-d52420f5-d9d6-3d06-8c3e-02e8cb7b01d2
Minor Number : 0
VBIOS Version : 90.02.17.00.74
MultiGPU Board : No
Board ID : 0x1500
GPU Part Number : N/A
Inforom Version
Image Version : G001.0000.02.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x15
Device : 0x00
Domain : 0x0000
Device Id : 0x1E0710DE
Bus Id : 00000000:15:00.0
Sub System Id : 0x12FB196E
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 32 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 10981 MiB
Used : 0 MiB
Free : 10981 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 10 MiB
Free : 246 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 1 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 28 C
GPU Shutdown Temp : 94 C
GPU Slowdown Temp : 91 C
GPU Max Operating Temp : 89 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 22.36 W
Power Limit : 260.00 W
Default Power Limit : 260.00 W
Enforced Power Limit : 260.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2160 MHz
SM : 2160 MHz
Memory : 7000 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None

  • Your docker configuration file (e.g: /etc/docker/daemon.json)
    {
        "exec-opts": ["native.cgroupdriver=systemd"],
        "log-driver": "json-file",
        "log-opts": {
            "max-size": "100m"
        },
        "storage-driver": "overlay2"
    }

  • The k8s-device-plugin container logs
    2019/09/30 12:34:07 Loading NVML
    2019/09/30 12:34:07 Failed to initialize NVML: could not load NVML library.
    2019/09/30 12:34:07 If this is a GPU node, did you set the docker default runtime to nvidia?
    2019/09/30 12:34:07 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
    2019/09/30 12:34:07 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
    Docker version 19.03.2, build 6a30dfc

  • K8s version from kubeadm version && kubectl version
    kubeadm version: &version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:37:41Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:32:14Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

  • Docker command, image and tag used
    nvidia/cuda:10.0-base-ubuntu18.04

  • Kernel version from uname -a
    Linux scalio-ub-n01 4.16.0-041600-generic #201804012230 SMP Sun Apr 1 22:31:39 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

  • Any relevant kernel output lines from dmesg

  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
    libgldispatch0-nvidia
    libnvidia-cfg1-410:amd64 410.104-0ubuntu018.04.1
    libnvidia-cfg1-any
    libnvidia-common
    libnvidia-common-410 410.104-0ubuntu018.04.1
    libnvidia-compute-410:amd64 410.104-0ubuntu018.04.1
    libnvidia-container-tools 1.0.5-1
    libnvidia-container1:amd64 1.0.5-1
    libnvidia-decode
    libnvidia-decode-410:amd64 410.104-0ubuntu018.04.1
    libnvidia-encode
    libnvidia-encode-410:amd64 410.104-0ubuntu018.04.1
    libnvidia-fbc1
    libnvidia-fbc1-410:amd64 410.104-0ubuntu018.04.1
    libnvidia-gl
    libnvidia-gl-410:amd64 410.104-0ubuntu018.04.1
    libnvidia-ifr1
    libnvidia-ifr1-410:amd64 410.104-0ubuntu018.04.1
    nvidia-304
    nvidia-340
    nvidia-384
    nvidia-390
    nvidia-compute-utils-410 410.104-0ubuntu018.04.1
    nvidia-container-runtime
    nvidia-container-runtime-hook
    nvidia-container-toolkit 1.0.5-1
    nvidia-dkms-410 410.104-0ubuntu018.04.1
    nvidia-dkms-kernel
    nvidia-driver-410 410.104-0ubuntu018.04.1
    nvidia-driver-binary
    nvidia-kernel-common
    nvidia-kernel-common-410 410.104-0ubuntu018.04.1
    nvidia-kernel-source
    nvidia-kernel-source-410 410.104-0ubuntu018.04.1
    nvidia-legacy-304xx-vdpau-driver
    nvidia-legacy-340xx-vdpau-driver
    nvidia-modprobe 410.48-0ubuntu1
    nvidia-opencl-icd
    nvidia-persistenced
    nvidia-prime 0.8.8.2
    nvidia-settings 418.56-0ubuntu0gpu18.04.
    nvidia-settings-binary
    nvidia-smi
    nvidia-utils
    nvidia-utils-410 410.104-0ubuntu018.04.1
    nvidia-vdpau-driver
    xserver-xorg-video-nvidia-410 410.104-0ubuntu018.04.1

  • NVIDIA container library version from nvidia-container-cli -V
    version: 1.0.5
    build date: 2019-09-06T16:59+00:00
    build revision: 13b836390888f7b7c7dca115d16d7e28ab15a836
    build compiler: x86_64-linux-gnu-gcc-7 7.4.0
    build platform: x86_64
    build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

  • NVIDIA container library logs (see troubleshooting)

Thank you for your help.

@ayadmarouane
Author

I tried to run the k8s-device-plugin with Docker directly, and it works when I add the --gpus argument:

$ sudo docker run --rm nvidia/k8s-device-plugin:1.0.0-beta nvidia-device-plugin

2019/09/30 14:46:38 Loading NVML
2019/09/30 14:46:38 Failed to initialize NVML: could not load NVML library.
2019/09/30 14:46:38 If this is a GPU node, did you set the docker default runtime to nvidia?
2019/09/30 14:46:38 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2019/09/30 14:46:38 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

$ sudo docker run --gpus all --rm nvidia/k8s-device-plugin:1.0.0-beta nvidia-device-plugin

2019/09/30 14:46:49 Loading NVML
2019/09/30 14:46:49 Fetching devices.
2019/09/30 14:46:49 Starting FS watcher.
2019/09/30 14:46:49 Failed to created FS watcher.

Any idea?

@RenaudWasTaken
Contributor

Hello!

For Kubernetes, you should install the nvidia-docker2 package and then set the default Docker runtime to nvidia.
Hope this helps!
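
A minimal sketch of what that setup typically looks like (Ubuntu package names; the repository setup for the NVIDIA packages is not shown, and the runtime path assumes nvidia-container-runtime is installed on the default PATH):

    # Install nvidia-docker2, which pulls in the nvidia runtime
    sudo apt-get update && sudo apt-get install -y nvidia-docker2

    # /etc/docker/daemon.json -- register the nvidia runtime and make it the default,
    # since containers started by the kubelet are not launched with --gpus
    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }

    # Apply the change
    sudo systemctl restart docker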


rayh commented Dec 7, 2019

@RenaudWasTaken Since Docker 19.03, nvidia-docker2 is deprecated because Docker natively supports NVIDIA GPUs: https://github.com/NVIDIA/nvidia-docker

For now, it seems that we can keep using nvidia-docker2, but is there another approach that would avoid having to install it?

@RenaudWasTaken
Contributor

For now, there is no option other than to use these packages.
We won't remove them, and we will ensure that this keeps working with Kubernetes until another solution is adopted into upstream Kubernetes.

@valayDave

I was trying to figure out whether I could use CUDA 10.2 with GPU clusters on Kubernetes; the dependency chain led me here :). I was checking whether kops could let me deploy CUDA 10.2-capable GPU instances on AWS with NVIDIA Docker configured.

K8s v1.17 supports Docker 19.03. The NVIDIA Container Toolkit requires Docker 19.03 and supports CUDA 10.2.

kops supports Kubernetes 1.16, which runs on Docker 18.x. I now realize I will just have to wait. Can we reopen this, or track it as a separate issue?


gocpplua commented Dec 6, 2022

This issue has been resolved:

  1. vim /etc/docker/daemon.json
    add: "default-runtime": "nvidia",

  2. systemctl restart docker

Then:

kubectl describe node -A | grep nvidia

nvidia.com/gpu: 4
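
For the configuration posted earlier in this issue, the merged /etc/docker/daemon.json would look roughly like the sketch below, keeping the existing options and adding the nvidia entries (the "runtimes" block is normally installed by the nvidia-docker2 / nvidia-container-runtime packages, so treat it as an assumption if only nvidia-container-toolkit is present). The device-plugin pods may also need to be recreated so they start under the new default runtime.

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "nvidia-container-runtime",
                "runtimeArgs": []
            }
        },
        "exec-opts": ["native.cgroupdriver=systemd"],
        "log-driver": "json-file",
        "log-opts": {
            "max-size": "100m"
        },
        "storage-driver": "overlay2"
    }

    sudo systemctl restart docker
    # Recreate the plugin pods so they come up under the nvidia runtime
    # (the pod label is taken from the 1.0.0-beta manifest; verify it against the deployed version)
    kubectl -n kube-system delete pod -l name=nvidia-device-plugin-ds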
