
K8s-device-plugin does not work with Docker 19.03 and K8s v1.15.0 #139

Closed
ayadmarouane opened this issue Sep 30, 2019 · 6 comments


ayadmarouane commented Sep 30, 2019

1. Issue or feature description

I installed the latest version of Docker, which natively supports NVIDIA GPUs, along with the 410 NVIDIA driver, the latest version of nvidia-container-toolkit, and CUDA 10.0.
Docker works fine with NVIDIA GPUs without any changes to the /etc/docker/daemon.json file:

sudo docker run --gpus all --rm nvidia/cuda:10.0-base-ubuntu18.04 nvidia-smi
Mon Sep 30 12:44:17 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+

Then I installed the k8s-device-plugin by running the following command, but Kubernetes still cannot see the graphics card:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta/nvidia-device-plugin.yml

Logs from the daemonset:
2019/09/30 12:34:07 Loading NVML
2019/09/30 12:34:07 Failed to initialize NVML: could not load NVML library.
2019/09/30 12:34:07 If this is a GPU node, did you set the docker default runtime to nvidia?
2019/09/30 12:34:07 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2019/09/30 12:34:07 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

kubectl describe nodes scalio-ub-n01 | grep -B 5 gpu
Resource           Requests     Limits
cpu                200m (0%)    200m (0%)
memory             275Mi (0%)   75Mi (0%)
ephemeral-storage  0 (0%)       0 (0%)
nvidia.com/gpu     0            0

I also tried to deploy a pod with nvidia.com/gpu: 1 as both resource request and limit, but it doesn't work:
0/3 nodes are available: 3 Insufficient nvidia.com/gpu. default-scheduler-2
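
For reference, a minimal pod spec of the kind used for this test might look like the sketch below (the pod and container names are illustrative; the image is the same one used above; for extended resources such as nvidia.com/gpu, the request defaults to the limit when only a limit is set):

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test                      # illustrative name
    spec:
      restartPolicy: OnFailure
      containers:
      - name: cuda
        image: nvidia/cuda:10.0-base-ubuntu18.04
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1             # request defaults to this limit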

For information, I have a 3-node cluster with 2 workers. The daemonset is deployed on both workers, and each worker has a GeForce RTX 2080 Ti.

2. Steps to reproduce the issue

Install 1.0.0-beta/nvidia-device-plugin.yml with K8s 1.15.0 and Docker 19.03

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi on your host
    Mon Sep 30 12:52:42 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
    |-------------------------------+----------------------+----------------------+
    | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
    | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
    |===============================+======================+======================|
    | 0 GeForce RTX 208... Off | 00000000:15:00.0 Off | N/A |
    | 32% 28C P8 22W / 260W | 0MiB / 10981MiB | 0% Default |
    +-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

  • The output of nvidia-smi -a on your host
    ==============NVSMI LOG==============

Timestamp : Mon Sep 30 12:52:51 2019
Driver Version : 410.104
CUDA Version : 10.0

Attached GPUs : 1
GPU 00000000:15:00.0
Product Name : GeForce RTX 2080 Ti
Product Brand : GeForce
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-d52420f5-d9d6-3d06-8c3e-02e8cb7b01d2
Minor Number : 0
VBIOS Version : 90.02.17.00.74
MultiGPU Board : No
Board ID : 0x1500
GPU Part Number : N/A
Inforom Version
Image Version : G001.0000.02.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x15
Device : 0x00
Domain : 0x0000
Device Id : 0x1E0710DE
Bus Id : 00000000:15:00.0
Sub System Id : 0x12FB196E
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 32 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 10981 MiB
Used : 0 MiB
Free : 10981 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 10 MiB
Free : 246 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 1 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 28 C
GPU Shutdown Temp : 94 C
GPU Slowdown Temp : 91 C
GPU Max Operating Temp : 89 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 22.36 W
Power Limit : 260.00 W
Default Power Limit : 260.00 W
Enforced Power Limit : 260.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2160 MHz
SM : 2160 MHz
Memory : 7000 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None

  • Your docker configuration file (e.g: /etc/docker/daemon.json)
    {
        "exec-opts": ["native.cgroupdriver=systemd"],
        "log-driver": "json-file",
        "log-opts": {
            "max-size": "100m"
        },
        "storage-driver": "overlay2"
    }

  • The k8s-device-plugin container logs
    2019/09/30 12:34:07 Loading NVML
    2019/09/30 12:34:07 Failed to initialize NVML: could not load NVML library.
    2019/09/30 12:34:07 If this is a GPU node, did you set the docker default runtime to nvidia?
    2019/09/30 12:34:07 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
    2019/09/30 12:34:07 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
    Docker version 19.03.2, build 6a30dfc

  • K8s version from kubeadm version && kubectl version
    kubeadm version: &version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:37:41Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:32:14Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

  • Docker command, image and tag used
    nvidia/cuda:10.0-base-ubuntu18.04

  • Kernel version from uname -a
    Linux scalio-ub-n01 4.16.0-041600-generic #201804012230 SMP Sun Apr 1 22:31:39 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

  • Any relevant kernel output lines from dmesg

  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
    libgldispatch0-nvidia
    libnvidia-cfg1-410:amd64 410.104-0ubuntu018.04.1
    libnvidia-cfg1-any
    libnvidia-common
    libnvidia-common-410 410.104-0ubuntu018.04.1
    libnvidia-compute-410:amd64 410.104-0ubuntu018.04.1
    libnvidia-container-tools 1.0.5-1
    libnvidia-container1:amd64 1.0.5-1
    libnvidia-decode
    libnvidia-decode-410:amd64 410.104-0ubuntu018.04.1
    libnvidia-encode
    libnvidia-encode-410:amd64 410.104-0ubuntu018.04.1
    libnvidia-fbc1
    libnvidia-fbc1-410:amd64 410.104-0ubuntu018.04.1
    libnvidia-gl
    libnvidia-gl-410:amd64 410.104-0ubuntu018.04.1
    libnvidia-ifr1
    libnvidia-ifr1-410:amd64 410.104-0ubuntu018.04.1
    nvidia-304
    nvidia-340
    nvidia-384
    nvidia-390
    nvidia-compute-utils-410 410.104-0ubuntu018.04.1
    nvidia-container-runtime
    nvidia-container-runtime-hook
    nvidia-container-toolkit 1.0.5-1
    nvidia-dkms-410 410.104-0ubuntu018.04.1
    nvidia-dkms-kernel
    nvidia-driver-410 410.104-0ubuntu018.04.1
    nvidia-driver-binary
    nvidia-kernel-common
    nvidia-kernel-common-410 410.104-0ubuntu018.04.1
    nvidia-kernel-source
    nvidia-kernel-source-410 410.104-0ubuntu018.04.1
    nvidia-legacy-304xx-vdpau-driver
    nvidia-legacy-340xx-vdpau-driver
    nvidia-modprobe 410.48-0ubuntu1
    nvidia-opencl-icd
    nvidia-persistenced
    nvidia-prime 0.8.8.2
    nvidia-settings 418.56-0ubuntu0gpu18.04.
    nvidia-settings-binary
    nvidia-smi
    nvidia-utils
    nvidia-utils-410 410.104-0ubuntu018.04.1
    nvidia-vdpau-driver
    xserver-xorg-video-nvidia-410 410.104-0ubuntu018.04.1

  • NVIDIA container library version from nvidia-container-cli -V
    version: 1.0.5
    build date: 2019-09-06T16:59+00:00
    build revision: 13b836390888f7b7c7dca115d16d7e28ab15a836
    build compiler: x86_64-linux-gnu-gcc-7 7.4.0
    build platform: x86_64
    build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

  • NVIDIA container library logs (see troubleshooting)

Thank you for your help.

@ayadmarouane
Author

I tried to run the k8s-device-plugin with Docker directly, and it works when I add the --gpus argument:

$ sudo docker run --rm nvidia/k8s-device-plugin:1.0.0-beta nvidia-device-plugin

2019/09/30 14:46:38 Loading NVML
2019/09/30 14:46:38 Failed to initialize NVML: could not load NVML library.
2019/09/30 14:46:38 If this is a GPU node, did you set the docker default runtime to nvidia?
2019/09/30 14:46:38 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2019/09/30 14:46:38 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

$ sudo docker run --gpus all --rm nvidia/k8s-device-plugin:1.0.0-beta nvidia-device-plugin

2019/09/30 14:46:49 Loading NVML
2019/09/30 14:46:49 Fetching devices.
2019/09/30 14:46:49 Starting FS watcher.
2019/09/30 14:46:49 Failed to created FS watcher.

Any idea?

@RenaudWasTaken
Contributor

Hello!

For Kubernetes, you should install the nvidia-docker2 package and then set the default Docker runtime to nvidia.
Hope this helps!
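
A minimal sketch of what that setup typically looks like (Ubuntu package names; the repository setup for the NVIDIA packages is not shown, and the runtime path assumes nvidia-container-runtime is installed on the default PATH):

    # Install nvidia-docker2, which pulls in the nvidia runtime
    sudo apt-get update && sudo apt-get install -y nvidia-docker2

    # /etc/docker/daemon.json -- register the nvidia runtime and make it the default,
    # since containers started by the kubelet are not launched with --gpus
    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }

    # Apply the change
    sudo systemctl restart docker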


rayh commented Dec 7, 2019

@RenaudWasTaken Since Docker 19.03, nvidia-docker2 is deprecated because Docker natively supports NVIDIA GPUs: https://github.com/NVIDIA/nvidia-docker

For now, it seems that we can keep using nvidia-docker2, but is there another approach that would avoid having to install it?

@RenaudWasTaken
Contributor

For now, there is no option other than to use these packages.
We won't remove them, and we will ensure that this keeps working with Kubernetes until another solution is adopted into upstream Kubernetes.

@valayDave

I was trying to figure out whether I could use CUDA 10.2 with GPU clusters on Kubernetes; the dependency chain led me here :). I was checking whether kops could let me deploy CUDA 10.2-capable GPU instances on AWS with NVIDIA Docker configured.

K8s v1.17 supports Docker 19.03. The NVIDIA Container Toolkit requires Docker 19.03 and supports CUDA 10.2.

kops supports Kubernetes 1.16, which runs on Docker 18.x. I now realize I will just have to wait. Can we reopen this, or track it as a separate issue?


gocpplua commented Dec 6, 2022

This issue has been resolved:

  1. vim /etc/docker/daemon.json
    add: "default-runtime": "nvidia",

  2. systemctl restart docker

Then:

kubectl describe node -A | grep nvidia

nvidia.com/gpu: 4
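
For the configuration posted earlier in this issue, the merged /etc/docker/daemon.json would look roughly like the sketch below, keeping the existing options and adding the nvidia entries (the "runtimes" block is normally installed by the nvidia-docker2 / nvidia-container-runtime packages, so treat it as an assumption if only nvidia-container-toolkit is present). The device-plugin pods may also need to be recreated so they start under the new default runtime.

    {
        "default-runtime": "nvidia",
        "runtimes": {
            "nvidia": {
                "path": "nvidia-container-runtime",
                "runtimeArgs": []
            }
        },
        "exec-opts": ["native.cgroupdriver=systemd"],
        "log-driver": "json-file",
        "log-opts": {
            "max-size": "100m"
        },
        "storage-driver": "overlay2"
    }

    sudo systemctl restart docker
    # Recreate the plugin pods so they come up under the nvidia runtime
    # (the pod label is taken from the 1.0.0-beta manifest; verify it against the deployed version)
    kubectl -n kube-system delete pod -l name=nvidia-device-plugin-ds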
