kubernetes 1.10 Failed to initialize NVML: could not load NVML library. #60

xikunyang · 2018-06-29T09:45:30Z

I have deployed Kubernetes 1.10 and want to enable a GPU node.

I have installed nvidia-docker2, and /etc/docker/daemon.json contains:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
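A quick way to confirm that the file really sets the default runtime is a small grep-based check. This is only a sketch: the function name `has_nvidia_default_runtime` is made up for illustration, and it does a textual match rather than a real JSON parse.

```shell
# Hypothetical helper (not part of nvidia-docker2): succeeds if the given
# daemon.json sets "default-runtime" to "nvidia". Textual match only.
has_nvidia_default_runtime() {
    grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "$1"
}

# Usage on a GPU node:
#   has_nvidia_default_runtime /etc/docker/daemon.json && echo "default runtime is nvidia"
# After restarting Docker, the active value can also be checked with:
#   docker info | grep -i 'default runtime'
```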

I then tested with docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi:

Fri Jun 29 09:38:08 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25                 Driver Version: 390.25                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 29%   25C    P8     7W / 250W |      0MiB / 11178MiB |      1%      Default |

I enabled GPU support in Kubernetes:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml

The log from the pod:

[root@k8s1 gpu]# kubectl logs  nvidia-device-plugin-daemonset-m5f68 -n kube-system
2018/06/29 08:52:25 Loading NVML
2018/06/29 08:52:25 Failed to initialize NVML: could not load NVML library.
2018/06/29 08:52:25 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2018/06/29 08:52:25 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2018/06/29 08:52:25 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

On the GPU node (Docker was installed on it before I made it a Kubernetes worker):

[root@gpu055 gpu]# ldconfig -p | grep nvidia-ml
	libnvidia-ml.so.1 (libc6,x86-64) => /lib64/libnvidia-ml.so.1
	libnvidia-ml.so.1 (libc6) => /lib/libnvidia-ml.so.1
	libnvidia-ml.so (libc6,x86-64) => /lib64/libnvidia-ml.so
	libnvidia-ml.so (libc6) => /lib/libnvidia-ml.so

I can run nvidia/k8s-device-plugin:1.10 directly with Docker:

[root@gpu055 gpu]# docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.10
Unable to find image 'nvidia/k8s-device-plugin:1.10' locally
1.10: Pulling from nvidia/k8s-device-plugin
683abbb4ea60: Pull complete
6bbf844c6d97: Pull complete
Digest: sha256:96f234268cdc87d9c57288d0589c1841c6c026b3e656ca8c9174c9923d049798
Status: Downloaded newer image for nvidia/k8s-device-plugin:1.10
2018/06/29 11:12:25 Loading NVML
2018/06/29 11:12:25 Fetching devices.
2018/06/29 11:12:25 Starting FS watcher.
2018/06/29 11:12:25 Starting OS watcher.
2018/06/29 11:12:25 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/06/29 11:12:25 Registered device plugin with Kubelet

Did I miss something? Any help is appreciated.

xikunyang · 2018-06-29T13:43:47Z

I figured it out.
I needed to change kubelet to use Docker instead of containerd.
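For anyone hitting the same error: the `default-runtime` setting in daemon.json only applies to containers started through Docker, so a kubelet that talks to containerd directly bypasses it. The runtime a node reports can be read from the node status field `.status.nodeInfo.containerRuntimeVersion`, which looks like `docker://...` or `containerd://...`. A minimal sketch, where `runtime_name` is a hypothetical helper and `NODE` is a placeholder for your node's name:

```shell
# Hypothetical helper: strips the version from a runtime string such as
# "docker://18.3.1", leaving only the runtime name.
runtime_name() {
    printf '%s\n' "${1%%://*}"
}

# Usage (needs kubectl access; NODE is a placeholder for your node's name):
#   v=$(kubectl get node "$NODE" -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}')
#   [ "$(runtime_name "$v")" = "docker" ] || echo "kubelet is not using docker: $v"
```

The same information appears in the CONTAINER-RUNTIME column of `kubectl get nodes -o wide`.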

RenaudWasTaken · 2018-06-29T14:33:54Z

I was at a loss when reading your logs :)

I was going to ask which container runtime you were using, but since you pasted the docker logs I assumed you had docker underneath :)

Great to see you solved that!

narender-singh · 2018-07-01T17:31:42Z

@xikunyang can you please explain how you changed kubelet to use docker?

xikunyang · 2018-07-01T17:41:24Z

@narender-singh

1. Install Docker on the GPU node.
2. Install nvidia-docker.
3. Update /etc/docker/daemon.json to add "default-runtime": "nvidia".
4. Restart Docker.
5. Update the kubelet systemd service config to use --container-runtime=docker.

For example:

cat <<EOF | sudo tee /etc/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/kubernetes/kubernetes
After=docker.service
Requires=docker.service

[Service]
ExecStart=/usr/local/bin/kubelet \\
  --config=/var/lib/kubelet/kubelet-config.yaml \\
  --container-runtime=docker \\
  --image-pull-progress-deadline=2m \\
  --kubeconfig=/var/lib/kubelet/kubeconfig \\
  --network-plugin=cni \\
  --register-node=true \\
  --v=2
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
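After rewriting the unit file, systemd has to reload it and kubelet has to be restarted before the new flag takes effect. A sketch of that follow-up, with a hypothetical helper `unit_container_runtime` that reads the flag back from the unit file:

```shell
# Hypothetical helper: prints the value of --container-runtime found in a
# kubelet unit file (prints nothing if the flag is absent).
unit_container_runtime() {
    sed -n 's/.*--container-runtime=\([A-Za-z0-9_-]*\).*/\1/p' "$1"
}

# Usage on the GPU node (path taken from the example above):
#   unit_container_runtime /etc/systemd/system/kubelet.service
#   sudo systemctl daemon-reload
#   sudo systemctl restart kubelet
```

Once kubelet is back up, the device plugin pod may need to be recreated so that it starts under Docker with the nvidia runtime.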

RenaudWasTaken · 2018-07-11T13:37:31Z

Closing as this is resolved.

RenaudWasTaken closed this as completed Jul 11, 2018
