K8s-device-plugin does not work with Docker v.19.03 and K8s v1.15.0 #139
Comments
I tried to run the k8s-device-plugin with Docker directly, and it only works when I add the --gpus argument:

$ sudo docker run --rm nvidia/k8s-device-plugin:1.0.0-beta nvidia-device-plugin
2019/09/30 14:46:38 Loading NVML

$ sudo docker run --gpus all --rm nvidia/k8s-device-plugin:1.0.0-beta nvidia-device-plugin
2019/09/30 14:46:49 Loading NVML

Any idea?
Hello! For Kubernetes, you should install the nvidia-docker2 package and set nvidia as the Docker default runtime: the kubelet never passes --gpus, so containers it starts get the default runtime.
@RenaudWasTaken Since Docker 19.03, nvidia-docker2 is deprecated because Docker natively supports NVIDIA GPUs: https://github.com/NVIDIA/nvidia-docker. For now it seems we can keep using nvidia-docker2, but is there another approach that would avoid having to install it?
For now, there is no option other than to use these packages.
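For reference, a rough sketch of that setup on Ubuntu, assuming the nvidia-docker apt repository is already configured per NVIDIA's install docs:

# Install nvidia-docker2, which provides the nvidia runtime for Docker.
sudo apt-get update && sudo apt-get install -y nvidia-docker2

# Restart Docker so it picks up the runtime registered in /etc/docker/daemon.json.
sudo systemctl restart docker

# Sanity check: run nvidia-smi through the nvidia runtime (no --gpus needed).
sudo docker run --runtime=nvidia --rm nvidia/cuda:10.0-base-ubuntu18.04 nvidia-smi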
I was trying to figure out whether I could use CUDA 10.2 with GPU clusters on Kubernetes, and the dependency chain led me here :). Specifically, I wanted to know whether kops could let me deploy CUDA 10.2-capable GPU instances on AWS with NVIDIA Docker configured. K8s v1.17 supports Docker 19.03; the NVIDIA Container Toolkit requires Docker 19.03 and supports CUDA 10.2; but kops supports Kubernetes 1.16, which runs on Docker 18.x. I now realize I will just have to wait. Can we reopen this, or track it as a separate issue?
This issue has been resolved. Verifying:

$ kubectl describe node -A | grep nvidia
  nvidia.com/gpu: 4
1. Issue or feature description
I installed the latest version of Docker, which natively supports NVIDIA GPUs, along with the NVIDIA 410 drivers, the latest version of nvidia-container-toolkit, and CUDA 10.0.
Docker works fine with NVIDIA GPUs without any changes to the /etc/docker/daemon.json file:
sudo docker run --gpus all --rm nvidia/cuda:10.0-base-ubuntu18.04 nvidia-smi
Mon Sep 30 12:44:17 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
Then I installed the k8s-device-plugin with the following command, but Kubernetes cannot see the graphics cards:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta/nvidia-device-plugin.yml
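As an aside, a quick way to confirm the plugin pods were actually scheduled; the 1.0.0-beta manifest creates the DaemonSet in kube-system, and the pod name below is a placeholder:

# List the device-plugin pods created by the DaemonSet.
kubectl get pods -n kube-system -o wide | grep nvidia-device-plugin

# Tail the logs of one of them (replace the pod name with a real one).
kubectl logs -n kube-system nvidia-device-plugin-daemonset-xxxxx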
Logs from the DaemonSet pods:
2019/09/30 12:34:07 Loading NVML
2019/09/30 12:34:07 Failed to initialize NVML: could not load NVML library.
2019/09/30 12:34:07 If this is a GPU node, did you set the docker default runtime to nvidia?
2019/09/30 12:34:07 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2019/09/30 12:34:07 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
kubectl describe nodes scalio-ub-n01 | grep -B 5 gpu
Resource Requests Limits
cpu 200m (0%) 200m (0%)
memory 275Mi (0%) 75Mi (0%)
ephemeral-storage 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
I also tried to deploy a pod with nvidia.com/gpu: 1 as both a resource request and limit, but it doesn't work:
0/3 nodes are available: 3 Insufficient nvidia.com/gpu. default-scheduler-2
For information: I have a 3-node cluster with 2 workers. The DaemonSet is deployed on both workers, and each worker has a GeForce RTX 2080 Ti; a sketch of the kind of pod I tried follows.
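For reference, a minimal sketch of such a pod (pod and container names are placeholders; the image is the one used earlier in this report):

cat <<'EOF' | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test            # placeholder name
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvidia/cuda:10.0-base-ubuntu18.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # for extended resources, the limit also sets the request
EOF

Until the device plugin registers the GPUs with the kubelet, this pod stays Pending with the same "Insufficient nvidia.com/gpu" event shown above.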
2. Steps to reproduce the issue
Install 1.0.0-beta/nvidia-device-plugin.yml with K8s 1.15.0 and Docker 19.03
3. Information to attach (optional if deemed irrelevant)
Common error checking:
nvidia-smi on your host:
Mon Sep 30 12:52:42 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:15:00.0 Off | N/A |
| 32% 28C P8 22W / 260W | 0MiB / 10981MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
nvidia-smi -a on your host:
==============NVSMI LOG==============
Timestamp : Mon Sep 30 12:52:51 2019
Driver Version : 410.104
CUDA Version : 10.0
Attached GPUs : 1
GPU 00000000:15:00.0
Product Name : GeForce RTX 2080 Ti
Product Brand : GeForce
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-d52420f5-d9d6-3d06-8c3e-02e8cb7b01d2
Minor Number : 0
VBIOS Version : 90.02.17.00.74
MultiGPU Board : No
Board ID : 0x1500
GPU Part Number : N/A
Inforom Version
Image Version : G001.0000.02.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x15
Device : 0x00
Domain : 0x0000
Device Id : 0x1E0710DE
Bus Id : 00000000:15:00.0
Sub System Id : 0x12FB196E
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 32 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 10981 MiB
Used : 0 MiB
Free : 10981 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 10 MiB
Free : 246 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 1 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 28 C
GPU Shutdown Temp : 94 C
GPU Slowdown Temp : 91 C
GPU Max Operating Temp : 89 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 22.36 W
Power Limit : 260.00 W
Default Power Limit : 260.00 W
Enforced Power Limit : 260.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 300 MHz
SM : 300 MHz
Memory : 405 MHz
Video : 540 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2160 MHz
SM : 2160 MHz
Memory : 7000 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
Your docker configuration file (e.g. /etc/docker/daemon.json):
{
"exec-opts": ["native.cgroupdriver=systemd"],
"log-driver": "json-file",
"log-opts": {
"max-size": "100m"
},
"storage-driver": "overlay2"
}
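Note that this daemon.json contains no nvidia runtime entry and no default-runtime, which is consistent with the NVML failure above: Kubernetes never passes --gpus, so the plugin container runs under plain runc. A hedged sketch of the configuration the plugin's prerequisites describe (runtime keys from the nvidia-docker2 docs, merged with the settings already shown):

# Rewrite daemon.json to make nvidia the default runtime, then restart Docker.
sudo tee /etc/docker/daemon.json <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "exec-opts": ["native.cgroupdriver=systemd"],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m"
    },
    "storage-driver": "overlay2"
}
EOF
sudo systemctl restart docker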
The k8s-device-plugin container logs
2019/09/30 12:34:07 Loading NVML
2019/09/30 12:34:07 Failed to initialize NVML: could not load NVML library.
2019/09/30 12:34:07 If this is a GPU node, did you set the docker default runtime to nvidia?
2019/09/30 12:34:07 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2019/09/30 12:34:07 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
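For what it's worth, after changing the Docker default runtime the plugin pods must be recreated before NVML can load; the DaemonSet brings them back automatically. The label selector below is an assumption based on the 1.0.0-beta manifest:

# Delete the plugin pods; the DaemonSet recreates them under the new runtime.
kubectl delete pods -n kube-system -l name=nvidia-device-plugin-ds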
The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:
Docker version from
docker version
Docker version 19.03.2, build 6a30dfc
K8s version from
kubeadm version && kubectl version
kubeadm version: &version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:37:41Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:32:14Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Docker command, image and tag used
nvidia/cuda:10.0-base-ubuntu18.04
Kernel version from
uname -a
Linux scalio-ub-n01 4.16.0-041600-generic #201804012230 SMP Sun Apr 1 22:31:39 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Any relevant kernel output lines from
dmesg
NVIDIA packages version from
dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
libgldispatch0-nvidia
libnvidia-cfg1-410:amd64          410.104-0ubuntu0.18.04.1
libnvidia-cfg1-any
libnvidia-common
libnvidia-common-410              410.104-0ubuntu0.18.04.1
libnvidia-compute-410:amd64       410.104-0ubuntu0.18.04.1
libnvidia-container-tools         1.0.5-1
libnvidia-container1:amd64        1.0.5-1
libnvidia-decode
libnvidia-decode-410:amd64        410.104-0ubuntu0.18.04.1
libnvidia-encode
libnvidia-encode-410:amd64        410.104-0ubuntu0.18.04.1
libnvidia-fbc1
libnvidia-fbc1-410:amd64          410.104-0ubuntu0.18.04.1
libnvidia-gl
libnvidia-gl-410:amd64            410.104-0ubuntu0.18.04.1
libnvidia-ifr1
libnvidia-ifr1-410:amd64          410.104-0ubuntu0.18.04.1
nvidia-304
nvidia-340
nvidia-384
nvidia-390
nvidia-compute-utils-410          410.104-0ubuntu0.18.04.1
nvidia-container-runtime
nvidia-container-runtime-hook
nvidia-container-toolkit          1.0.5-1
nvidia-dkms-410                   410.104-0ubuntu0.18.04.1
nvidia-dkms-kernel
nvidia-driver-410                 410.104-0ubuntu0.18.04.1
nvidia-driver-binary
nvidia-kernel-common
nvidia-kernel-common-410          410.104-0ubuntu0.18.04.1
nvidia-kernel-source
nvidia-kernel-source-410          410.104-0ubuntu0.18.04.1
nvidia-legacy-304xx-vdpau-driver
nvidia-legacy-340xx-vdpau-driver
nvidia-modprobe                   410.48-0ubuntu1
nvidia-opencl-icd
nvidia-persistenced
nvidia-prime                      0.8.8.2
nvidia-settings                   418.56-0ubuntu0~gpu18.04.1
nvidia-settings-binary
nvidia-smi
nvidia-utils
nvidia-utils-410                  410.104-0ubuntu0.18.04.1
nvidia-vdpau-driver
xserver-xorg-video-nvidia-410     410.104-0ubuntu0.18.04.1
NVIDIA container library version from
nvidia-container-cli -V
version: 1.0.5
build date: 2019-09-06T16:59+00:00
build revision: 13b836390888f7b7c7dca115d16d7e28ab15a836
build compiler: x86_64-linux-gnu-gcc-7 7.4.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
NVIDIA container library logs (see troubleshooting)
Thank you for your help.