
K8s 1.26 failed to schedule using GPU-(error code CUDA driver)- could not load NVML library: libnvidia-ml.so.1: cannot #604

Closed
luhong123 opened this issue Mar 18, 2024 · 11 comments



luhong123 commented Mar 18, 2024

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04.4 LTS
  • Kernel Version: Linux 5.15.0-94-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s v1.26.0

2. Issue or feature description

These are the steps I ran:

root@zc:~/k8s# nvidia-smi
Tue May  7 15:16:49 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        Off |   00000000:41:00.0 Off |                  N/A |
|  0%   25C    P8              9W /  370W |       1MiB /  10240MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
root@zc:~/k8s# sudo nvidia-ctk runtime configure --runtime=containerd
INFO[0000] Loading config from /etc/containerd/config.toml
INFO[0000] Wrote updated config to /etc/containerd/config.toml
INFO[0000] It is recommended that containerd daemon be restarted.

root@zc:~/k8s# sudo systemctl restart containerd
root@zc:~/k8s# cat  /etc/containerd/config.toml
root = "/var/lib/containerd"
state = "/run/containerd"
version = 2

[cgroup]
  path = ""

[debug]
  address = ""
  gid = 0
  level = ""
  uid = 0

[grpc]
  address = "/run/containerd/containerd.sock"
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216
  uid = 0

[metrics]
  address = ""
  grpc_histogram = false

[plugins]

  [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "registry.cn-beijing.aliyuncs.com/kubesphereio/pause:3.8"

    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
      conf_template = ""
      max_conf_num = 1

    [plugins."io.containerd.grpc.v1.cri".containerd]

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true

    [plugins."io.containerd.grpc.v1.cri".registry]

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]

        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
          endpoint = ["https://registry-1.docker.io"]

[timeouts]
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"

[ttrpc]
  address = ""
  gid = 0
  uid = 0
root@zc:~/k8s#
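The config above registers an nvidia runtime pointing at /usr/bin/nvidia-container-runtime, but it is not marked as the default. A quick sanity check that the runtime binary and the driver's NVML library are actually present on the host (a minimal sketch; the exact output will vary):

ls -l /usr/bin/nvidia-container-runtime
nvidia-container-cli info        # queries the driver through the NVML library
ldconfig -p | grep libnvidia-ml  # the library the device plugin reports as missing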

But GPU workloads still fail in Kubernetes:

root@zc:~/k8s# kubectl  logs   nvidia-device-plugin-daemonset-mrgz4    -n  kube-system
I0507 07:14:58.366091       1 main.go:178] Starting FS watcher.
I0507 07:14:58.366181       1 main.go:185] Starting OS watcher.
I0507 07:14:58.366403       1 main.go:200] Starting Plugins.
I0507 07:14:58.366437       1 main.go:257] Loading configuration.
I0507 07:14:58.366893       1 main.go:265] Updating config with default resource matching patterns.
I0507 07:14:58.367145       1 main.go:276]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0507 07:14:58.367155       1 main.go:279] Retrieving plugins.
W0507 07:14:58.367204       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0507 07:14:58.367241       1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0507 07:14:58.367284       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0507 07:14:58.367291       1 factory.go:112] Incompatible platform detected
E0507 07:14:58.367295       1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0507 07:14:58.367299       1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0507 07:14:58.367304       1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0507 07:14:58.367310       1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0507 07:14:58.367316       1 main.go:308] No devices found. Waiting indefinitely.

root@zc:~/k8s# kubectl  logs   gpu-pod
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!

The same CUDA sample runs fine when launched directly with Docker.
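The gpu-pod manifest is not shown here; a minimal sketch of such a CUDA vectorAdd test pod (the pod name and image tag below are assumptions) looks like:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04
      resources:
        limits:
          nvidia.com/gpu: 1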

luhong123 (Author) commented:

@elezar

elezar (Member) commented Mar 18, 2024

@luhong123 could you please confirm your device plugin and NVIDIA Container Toolkit versions?

luhong123 (Author) commented:

NVIDIA-SMI 535.161.07
GPU 0: NVIDIA GeForce RTX 3090
@elezar

andy108369 commented:

Wrong runtime? In your logs:

E0318 09:59:50.207221       1 factory.go:77] Failed to initialize NVML: ERROR_UNKNOWN.
E0318 09:59:50.207267       1 factory.go:78] If this is a GPU node, did you set the docker default runtime to `nvidia`?
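That message is worded for a Docker-based setup, where the fix is setting the default runtime in /etc/docker/daemon.json, roughly as in this sketch (on a containerd node, as here, the equivalent is containerd's default runtime setting):

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}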

luhong123 (Author) commented:

Wrong runtime? In your logs:

E0318 09:59:50.207221       1 factory.go:77] Failed to initialize NVML: ERROR_UNKNOWN.
E0318 09:59:50.207267       1 factory.go:78] If this is a GPU node, did you set the docker default runtime to `nvidia`?

I don't quite understand; the system time is correct.

luhong123 (Author) commented:

@elezar Help me

luhong123 changed the title from "K8s 1.24 failed to schedule using GPU-(error code CUDA driver" to "K8s 1.26 failed to schedule using GPU-(error code CUDA driver)- could not load NVML library: libnvidia-ml.so.1: cannot" on May 7, 2024
elezar (Member) commented May 7, 2024

Since nvidia is not set as your default runtime in your containerd config, you also need to create a runtime class:

kubectl apply -f - <<EOF
apiVersion: node.k8s.io/v1
handler: nvidia
kind: RuntimeClass
metadata:
  name: nvidia
EOF

Then, when deploying the device plugin, specify this runtime class. If Helm is used, you can add --set runtimeClassName=nvidia to the helm install or helm upgrade command; otherwise you would have to update your pod spec to include:

spec:
  runtimeClassName: nvidia
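For example, with the Helm chart for this device plugin, the command might look like the following sketch (the release name and namespace are illustrative):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set runtimeClassName=nvidia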

luhong123 (Author) commented:

> Since nvidia is not set as your default runtime in your containerd config, you also need to create a runtime class [...]

How do I set the default runtime in containerd?

elezar (Member) commented May 7, 2024

You can run:

 sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default

and then restart containerd with:

sudo systemctl restart containerd
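The --set-as-default flag updates /etc/containerd/config.toml so the nvidia runtime is used for all containers; the relevant change should look roughly like this (a sketch of the affected lines only):

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"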

luhong123 (Author) commented:

[image]

Thank you very much, this resolved my confusion.
