kubernetes 1.10 Failed to initialize NVML: could not load NVML library. #60

Closed
xikunyang opened this issue Jun 29, 2018 · 5 comments

xikunyang commented Jun 29, 2018

I have deployed Kubernetes 1.10 and want to enable a GPU node.

I have installed nvidia-docker2, and /etc/docker/daemon.json contains:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

I tested it with docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi, which works:

Fri Jun 29 09:38:08 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25                 Driver Version: 390.25                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 29%   25C    P8     7W / 250W |      0MiB / 11178MiB |      1%      Default |
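
Passing --runtime=nvidia explicitly selects the nvidia runtime even when it is not Docker's default, so this test alone does not confirm the "default-runtime" setting. A minimal sketch of a check on the config file follows. The function name has_nvidia_default_runtime is hypothetical, and the grep pattern assumes the key and value sit on one line, as in the daemon.json above.

```shell
# Hypothetical helper (not part of Docker or nvidia-docker2): succeeds if the
# given daemon.json sets "default-runtime" to "nvidia".
# Assumption: key and value are on the same line.
has_nvidia_default_runtime() {
  grep -Eq '"default-runtime"[[:space:]]*:[[:space:]]*"nvidia"' "${1:-/etc/docker/daemon.json}"
}

# Usage on the GPU node:
#   has_nvidia_default_runtime /etc/docker/daemon.json && echo "nvidia is the default runtime"
```

Running docker info --format '{{.DefaultRuntime}}' should report the runtime the running daemon actually uses, which also catches a daemon that was not restarted after the file was edited.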

Then I enabled GPU support in Kubernetes:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.10/nvidia-device-plugin.yml

The log from the device plugin pod:

[root@k8s1 gpu]# kubectl logs  nvidia-device-plugin-daemonset-m5f68 -n kube-system
2018/06/29 08:52:25 Loading NVML
2018/06/29 08:52:25 Failed to initialize NVML: could not load NVML library.
2018/06/29 08:52:25 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2018/06/29 08:52:25 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2018/06/29 08:52:25 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start

On the GPU node (Docker was installed on it before I made it a Kubernetes worker):

[root@gpu055 gpu]# ldconfig -p | grep nvidia-ml
	libnvidia-ml.so.1 (libc6,x86-64) => /lib64/libnvidia-ml.so.1
	libnvidia-ml.so.1 (libc6) => /lib/libnvidia-ml.so.1
	libnvidia-ml.so (libc6,x86-64) => /lib64/libnvidia-ml.so
	libnvidia-ml.so (libc6) => /lib/libnvidia-ml.so

I can run nvidia/k8s-device-plugin:1.10 directly with Docker:

[root@gpu055 gpu]# docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.10
Unable to find image 'nvidia/k8s-device-plugin:1.10' locally
1.10: Pulling from nvidia/k8s-device-plugin
683abbb4ea60: Pull complete
6bbf844c6d97: Pull complete
Digest: sha256:96f234268cdc87d9c57288d0589c1841c6c026b3e656ca8c9174c9923d049798
Status: Downloaded newer image for nvidia/k8s-device-plugin:1.10
2018/06/29 11:12:25 Loading NVML
2018/06/29 11:12:25 Fetching devices.
2018/06/29 11:12:25 Starting FS watcher.
2018/06/29 11:12:25 Starting OS watcher.
2018/06/29 11:12:25 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2018/06/29 11:12:25 Registered device plugin with Kubelet

Am I missing something? Any help is appreciated.

@xikunyang
Author

I have figured it out.
I needed to change the kubelet to use Docker instead of containerd.
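
For context: when the kubelet starts pods through containerd, the "default-runtime" in /etc/docker/daemon.json is never consulted, which would explain why the same image works under docker run but fails as a pod. One way to see which runtime a node reports is sketched below. runtime_name is a hypothetical helper and <node-name> is a placeholder.

```shell
# Hypothetical helper: extract the runtime name from a node's
# containerRuntimeVersion string by dropping everything from "://" onward.
runtime_name() {
  printf '%s\n' "${1%%://*}"
}

# Usage (needs kubectl access to the cluster; <node-name> is a placeholder):
#   version=$(kubectl get node <node-name> \
#     -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}')
#   runtime_name "$version"
```

kubectl get nodes -o wide shows the same value in its CONTAINER-RUNTIME column.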

@RenaudWasTaken
Contributor

I was at a loss when reading your logs :)

I was going to ask which container runtime you were using, but since you pasted the Docker logs I assumed you had Docker underneath :)

Great to see you solved that!

narender-singh commented Jul 1, 2018

@xikunyang can you please explain how you changed the kubelet to use Docker?

@xikunyang
Author

@narender-singh

  1. Install Docker on the GPU node.
  2. Install nvidia-docker2.
  3. Update /etc/docker/daemon.json to add "default-runtime": "nvidia".
  4. Restart Docker.
  5. Update the kubelet systemd service config so the kubelet runs with --container-runtime=docker.

For example:

cat <<EOF | sudo tee /etc/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/kubernetes/kubernetes
After=docker.service
Requires=docker.service

[Service]
ExecStart=/usr/local/bin/kubelet \\
  --config=/var/lib/kubelet/kubelet-config.yaml \\
  --container-runtime=docker \\
  --image-pull-progress-deadline=2m \\
  --kubeconfig=/var/lib/kubelet/kubeconfig \\
  --network-plugin=cni \\
  --register-node=true \\
  --v=2
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
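
After the unit file is written, systemd has to reload it and the kubelet has to be restarted before the new flag takes effect. The sketch below first confirms which runtime the unit file requests. kubelet_runtime_flag is a hypothetical helper, and the default path is the one used in the example above.

```shell
# Hypothetical helper: print the value of the first --container-runtime flag
# found in a kubelet unit file (prints nothing if the flag is absent).
kubelet_runtime_flag() {
  grep -o -- '--container-runtime=[A-Za-z0-9_-]*' "${1:-/etc/systemd/system/kubelet.service}" \
    | head -n 1 | cut -d= -f2
}

# Usage on the GPU node:
#   kubelet_runtime_flag /etc/systemd/system/kubelet.service
#   sudo systemctl daemon-reload
#   sudo systemctl restart kubelet
```

Once the kubelet is back up, the device plugin pod usually needs to be recreated so that it is started through Docker with the nvidia runtime.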

@RenaudWasTaken
Contributor

Closing as this is resolved.
