
K8s 1.26 failed to schedule using GPU-(error code CUDA driver)- could not load NVML library: libnvidia-ml.so.1: cannot #604

Closed
luhong123 opened this issue Mar 18, 2024 · 11 comments



luhong123 commented Mar 18, 2024

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04.4 LTS
  • Kernel Version: Linux 5.15.0-94-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s v1.26.0

2. Issue or feature description

These are the steps I ran:

root@zc:~/k8s# nvidia-smi
Tue May  7 15:16:49 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        Off |   00000000:41:00.0 Off |                  N/A |
|  0%   25C    P8              9W /  370W |       1MiB /  10240MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@zc:~/k8s# sudo nvidia-ctk runtime configure --runtime=containerd
INFO[0000] Loading config from /etc/containerd/config.toml
INFO[0000] Wrote updated config to /etc/containerd/config.toml
INFO[0000] It is recommended that containerd daemon be restarted.

root@zc:~/k8s# sudo systemctl restart containerd
root@zc:~/k8s# cat  /etc/containerd/config.toml
root = "/var/lib/containerd"
state = "/run/containerd"
version = 2

[cgroup]
  path = ""

[debug]
  address = ""
  gid = 0
  level = ""
  uid = 0

[grpc]
  address = "/run/containerd/containerd.sock"
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216
  uid = 0

[metrics]
  address = ""
  grpc_histogram = false

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.8"

    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
      conf_template = ""
      max_conf_num = 1

    [plugins."io.containerd.grpc.v1.cri".containerd]

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true

    [plugins."io.containerd.grpc.v1.cri".registry]

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]

        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
          endpoint = ["https://registry-1.docker.io"]

[timeouts]
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"

[ttrpc]
  address = ""
  gid = 0
  uid = 0
root@zc:~/k8s#
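The config above registers an nvidia runtime pointing at /usr/bin/nvidia-container-runtime, but it is not marked as the default. A quick sanity check that the runtime binary and the driver's NVML library are actually present on the host (a minimal sketch; the exact output will vary):

ls -l /usr/bin/nvidia-container-runtime
nvidia-container-cli info        # queries the driver through the NVML library
ldconfig -p | grep libnvidia-ml  # the library the device plugin reports as missing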

But GPU workloads still fail in Kubernetes:

root@zc:~/k8s# kubectl  logs   nvidia-device-plugin-daemonset-mrgz4    -n  kube-system
I0507 07:14:58.366091       1 main.go:178] Starting FS watcher.
I0507 07:14:58.366181       1 main.go:185] Starting OS watcher.
I0507 07:14:58.366403       1 main.go:200] Starting Plugins.
I0507 07:14:58.366437       1 main.go:257] Loading configuration.
I0507 07:14:58.366893       1 main.go:265] Updating config with default resource matching patterns.
I0507 07:14:58.367145       1 main.go:276]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0507 07:14:58.367155       1 main.go:279] Retrieving plugins.
W0507 07:14:58.367204       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0507 07:14:58.367241       1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0507 07:14:58.367284       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0507 07:14:58.367291       1 factory.go:112] Incompatible platform detected
E0507 07:14:58.367295       1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0507 07:14:58.367299       1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0507 07:14:58.367304       1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0507 07:14:58.367310       1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0507 07:14:58.367316       1 main.go:308] No devices found. Waiting indefinitely.

root@zc:~/k8s# kubectl  logs   gpu-pod
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!

The same CUDA sample runs fine when launched directly with Docker.
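The gpu-pod manifest is not shown here; a minimal sketch of such a CUDA vectorAdd test pod (the pod name and image tag below are assumptions) looks like:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1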

luhong123 (Author) commented:

@elezar

elezar (Member) commented Mar 18, 2024

@luhong123 could you please confirm your device plugin and NVIDIA Container Toolkit versions?

luhong123 (Author) commented:

NVIDIA-SMI 535.161.07
GPU 0: NVIDIA GeForce RTX 3090
@elezar

andy108369 commented:

Wrong runtime? In your logs:

E0318 09:59:50.207221       1 factory.go:77] Failed to initialize NVML: ERROR_UNKNOWN.
E0318 09:59:50.207267       1 factory.go:78] If this is a GPU node, did you set the docker default runtime to `nvidia`?
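That message is worded for a Docker-based setup, where the fix is setting the default runtime in /etc/docker/daemon.json, roughly as in this sketch (on a containerd node, as here, the equivalent is containerd's default runtime setting):

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}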

luhong123 (Author) commented:

Wrong runtime? In your logs:

E0318 09:59:50.207221       1 factory.go:77] Failed to initialize NVML: ERROR_UNKNOWN.
E0318 09:59:50.207267       1 factory.go:78] If this is a GPU node, did you set the docker default runtime to `nvidia`?

I don't quite understand; the system time is correct.

luhong123 (Author) commented:

@elezar Help me

luhong123 changed the title from "K8s 1.24 failed to schedule using GPU-(error code CUDA driver" to "K8s 1.26 failed to schedule using GPU-(error code CUDA driver)- could not load NVML library: libnvidia-ml.so.1: cannot" on May 7, 2024
elezar (Member) commented May 7, 2024

Since nvidia is not set as your default runtime in your containerd config, you also need to create a runtime class:

kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
handler: nvidia
kind: RuntimeClass
metadata:
  name: nvidia
EOF

Then, when deploying the device plugin, specify this runtime class. If Helm is used, you can add --set runtimeClassName=nvidia to the helm install or helm upgrade command; otherwise you would have to update your pod spec to include:

spec:
  runtimeClassName: nvidia
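For example, with the Helm chart for this device plugin, the command might look like the following sketch (the release name and namespace are illustrative):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set runtimeClassName=nvidia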

luhong123 (Author) commented:

> Since nvidia is not set as your default runtime in your containerd config, you also need to create a runtime class [...]

How do I set the default runtime in containerd?

elezar (Member) commented May 7, 2024

You can run:

 sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default

and then restart containerd with:

sudo systemctl restart containerd
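The --set-as-default flag updates /etc/containerd/config.toml so the nvidia runtime is used for all containers; the relevant change should look roughly like this (a sketch of the affected lines only):

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"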

luhong123 (Author) commented:

[image]

Thank you very much, this resolved my confusion.
