Failed to get sandbox runtime: no runtime for nvidia is configured #432

Open · 3 of 16 tasks
Bec-k opened this issue Nov 2, 2022 · 39 comments

Comments

@Bec-k commented Nov 2, 2022

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description

nov 02 18:00:58 beck containerd[10237]: time="2022-11-02T18:00:58.738797825+02:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:gpu-feature-discovery-qfjgk,Uid:02c7d4ad-db02-4145-846b-616a94416008,Namespace:gpu-operator,Attempt:2,} failed, error" error="failed to get sandbox runtime: no runtime for \"nvidia\" is configured"

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces

  • kubernetes daemonset status: kubectl get ds --all-namespaces

  • If a pod/ds is in an error or pending state: kubectl describe pod -n NAMESPACE POD_NAME

  • If a pod/ds is in an error or pending state: kubectl logs -n NAMESPACE POD_NAME

  • Output of running a container on the GPU machine: docker run -it alpine echo foo

  • Docker configuration file: cat /etc/docker/daemon.json

  • Docker runtime configuration: docker info | grep runtime

  • NVIDIA shared directory: ls -la /run/nvidia

  • NVIDIA packages directory: ls -la /usr/local/nvidia/toolkit

  • NVIDIA driver directory: ls -la /run/nvidia/driver

  • kubelet logs: journalctl -u kubelet > kubelet.logs

(base) beck@beck:/$ ls -la /run/nvidia/
total 4
drwxr-xr-x  4 root root  100 nov  2 18:48 .
drwxr-xr-x 39 root root 1140 nov  2 18:47 ..
drwxr-xr-x  2 root root   40 nov  2 17:59 driver
-rw-r--r--  1 root root    7 nov  2 18:48 toolkit.pid
drwxr-xr-x  2 root root   80 nov  2 18:48 validations

Driver folder is empty:

(base) beck@beck:/$ ls -la /run/nvidia/driver/
total 0
drwxr-xr-x 2 root root 40 nov  2 17:59 .
drwxr-xr-x 4 root root 80 nov  2 18:48 ..

@Bec-k (Author) commented Nov 2, 2022

(base) beck@beck:/$ sudo ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 \
    cuda-11.0.3-base-ubuntu20.04 nvidia-smi
Wed Nov  2 16:50:04 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   53C    P0    46W /  N/A |    601MiB /  8192MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

@Bec-k (Author) commented Nov 2, 2022

When I launch the nvidia/cuda image via the containerd CLI, it correctly detects and outputs my NVIDIA GeForce video card, but for some reason it isn't visible inside pods when deployed via Helm.

@shivamerla (Contributor) commented:

Can you run kubectl get pods -n gpu-operator to show which pods are running? If you deployed with the driver enabled, it takes 3-5 minutes for the driver to be installed, followed by the nvidia runtime setup. If you have already installed them on the host, please specify --set driver.enabled=false --set toolkit.enabled=false.
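
For reference, a minimal sketch of that install command with both components disabled (release name, namespace, and chart reference are assumptions based on the rest of this thread; adjust to your setup):

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=false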

@Bec-k (Author) commented Nov 2, 2022

I was checking /etc/containerd/config.toml; it is constantly being changed back and forth.
containerd keeps getting restarted by itself because it fails to clean up sandboxes and dead shims.
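
One way to see what runtime entries containerd is actually exposing at a given moment (a diagnostic sketch; the config path and tools assume a default containerd/CRI setup, not anything specific to this node):

# Show any nvidia runtime entries in the on-disk config
grep -n -A5 'runtimes.nvidia' /etc/containerd/config.toml

# Dump the live CRI runtime configuration that the kubelet sees
crictl info | grep -i -A5 nvidia

# Check how often containerd has been restarting
journalctl -u containerd --since "30 minutes ago" | grep -i "starting containerd"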

@Bec-k (Author) commented Nov 2, 2022

(base) beck@beck:/$ kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS            RESTARTS      AGE
gpu-feature-discovery-w7vk6                                       1/1     Running           0             6m23s
gpu-operator-59b9d49c6f-7282l                                     1/1     Running           0             6m41s
nvidia-container-toolkit-daemonset-9rvz8                          1/1     Running           7 (67s ago)   5m56s
nvidia-cuda-validator-7mp9j                                       0/1     Init:0/1          0             4m4s
nvidia-dcgm-exporter-2ktzc                                        0/1     PodInitializing   0             6m24s
nvidia-device-plugin-daemonset-wvvh4                              0/1     PodInitializing   0             5m57s
nvidia-gpu-operator-node-feature-discovery-master-68495df8t9vd7   1/1     Running           0             6m41s
nvidia-gpu-operator-node-feature-discovery-worker-8gc88           1/1     Running           0             6m40s
nvidia-gpu-operator-node-feature-discovery-worker-stwpp           1/1     Running           9 (26s ago)   5m58s
nvidia-operator-validator-ptdgd                                   0/1     Init:Error        0             5m55s

@shivamerla (Contributor) commented:

You can disable the toolkit as well by running kubectl edit clusterpolicy and setting toolkit.enabled=false. It looks like you already have nvidia-container-runtime configured on the host and the containerd config updated manually?
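
A non-interactive way to flip the same flag (a sketch; it assumes the ClusterPolicy instance created by the chart is named cluster-policy, which is the usual default):

kubectl patch clusterpolicy/cluster-policy --type merge \
  -p '{"spec": {"toolkit": {"enabled": false}}}'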

@shivamerla (Contributor) commented:

Can you also paste the logs of the nvidia-container-toolkit-daemonset-9rvz8 pod? Curious as to why it is restarting. Which containerd and OS versions are these?

@Bec-k (Author) commented Nov 2, 2022

Nope, that didn't help. I updated it, the pod was removed, and it is still complaining about:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

@Bec-k (Author) commented Nov 2, 2022

I have removed all pods to trigger everything from scratch.

@Bec-k (Author) commented Nov 2, 2022

(base) beck@beck:/$ kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-8b8ls                                       0/1     Init:0/1   0          115s
gpu-operator-59b9d49c6f-gkk4j                                     1/1     Running    0          2m20s
nvidia-dcgm-exporter-6bmlt                                        0/1     Init:0/1   0          115s
nvidia-device-plugin-daemonset-f7xgb                              0/1     Init:0/1   0          117s
nvidia-gpu-operator-node-feature-discovery-master-68495df8kscw7   1/1     Running    0          2m20s
nvidia-gpu-operator-node-feature-discovery-worker-pcxwq           1/1     Running    0          2m20s
nvidia-gpu-operator-node-feature-discovery-worker-s2jjn           1/1     Running    0          2m20s
nvidia-operator-validator-rwt6z                                   0/1     Init:0/4   0          117s

@Bec-k (Author) commented Nov 2, 2022

Here are the errors from the systemd containerd logs:
https://gist.github.com/denissabramovs/a77e97972b5aa01c86955d812d3e8188

@Bec-k (Author) commented Nov 2, 2022

At least now containerd is not constantly restarting; it has already been up for 9 minutes:

● containerd.service - containerd container runtime
     Loaded: loaded (/lib/systemd/system/containerd.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2022-11-02 18:57:44 EET; 9min ago

@Bec-k (Author) commented Nov 2, 2022

All 3 systemd services are up and running on the GPU node:

(base) beck@beck:/$ sudo systemctl status --no-pager kubelet containerd docker | grep active
     Active: active (running) since Wed 2022-11-02 18:57:49 EET; 14min ago
     Active: active (running) since Wed 2022-11-02 18:57:44 EET; 14min ago
     Active: active (running) since Wed 2022-11-02 19:09:12 EET; 3min 11s ago

@Bec-k (Author) commented Nov 2, 2022

Sorry, missed your message. Here it is:

(base) beck@beck:/$ cat /etc/os-release 
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

@Bec-k (Author) commented Nov 2, 2022

(base) beck@beck:/$ containerd --version
containerd containerd.io 1.6.9 1c90a442489720eec95342e1789ee8a5e1b9536f

@wjentner commented Nov 2, 2022

@denissabramovs this is a wild guess: are you using containerd 1.6.9? I believe we had problems with this version and the operator. We downgraded to containerd 1.6.8 and things started working again.
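
For anyone wanting to try the same downgrade on Ubuntu, a rough sketch (the exact package revision string below is an assumption; check what your repository actually offers first):

# List the containerd.io builds available from the Docker repository
apt-cache madison containerd.io

# Downgrade to a 1.6.8 build and hold it so it is not upgraded again
sudo apt-get install --allow-downgrades containerd.io=1.6.8-1
sudo apt-mark hold containerd.io
sudo systemctl restart containerd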

@Bec-k (Author) commented Nov 2, 2022

revision=9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6 version=1.6.8:

nov 02 19:20:49 beck containerd[202761]: time="2022-11-02T19:20:49.723337417+02:00" level=info msg="starting containerd" revision=9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6 version=1.6.8
...
...
...
nov 02 19:22:34 beck containerd[202761]: time="2022-11-02T19:22:34.246953180+02:00" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:nvidia-device-plugin-daemonset-lbnzw,Uid:fd4f1d3f-29d2-4d11-a724-96f4ed107cd5,Namespace:gpu-operator,Attempt:0,} failed, error" error="failed to get sandbox runtime: no runtime for \"nvidia\" is configured"

Killed/re-scheduled all pods in the gpu-operator namespace after downgrading containerd.

@Bec-k (Author) commented Nov 2, 2022

Oh wow! @wjentner, you were actually right. I re-enabled the above-mentioned toolkit, and after the downgrade it finished without problems; all pods are up and running now!

@Bec-k (Author) commented Nov 2, 2022

Good thing I captured both logs, @shivamerla; adding them below.

These logs are from failing toolkit:
https://gist.github.com/denissabramovs/0c3ad150ea2b9450a91b430a91704d08

These from successful toolkit:
https://gist.github.com/denissabramovs/343c8fb0169866133fa1cc35b9d5365c

Hope this helps to find and resolve the problem. It seems the two logs do differ after all.

@shivamerla (Contributor) commented:

Thanks @denissabramovs, will check these out and try to repro with containerd 1.6.9.

@Bec-k (Author) commented Nov 2, 2022

If you aren't able to reproduce it, please ping me and I'll try to reproduce it locally again. Then we could catch the issue and possibly put a patch together.
In any case, thank you, guys.

@klueska (Contributor) commented Nov 8, 2022

The issue has been diagnosed; a workaround MR can be found here:
https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/568

hsinhoyeh added a commit to FootprintAI/multikf that referenced this issue Dec 2, 2022
As kind has upgraded its containerd version to 1.9, which triggered issues with gpu-operator (see issue NVIDIA/gpu-operator#432), we stuck with the kind version that ships containerd 1.8.

Also fixes GPU installation.

@wjentner commented Dec 3, 2022

@klueska thanks! When will this be released? I assume it has also been tested with containerd 1.6.10, which was released recently?

@cdesiniotis (Contributor) commented:

Hi @denissabramovs @wjentner. We just released v22.9.1. This includes the workaround mentioned above for resolving the containerd issues. Please give it a try and let us know if there are any issues.

@wjentner commented Dec 16, 2022

Thanks @cdesiniotis, I can confirm that it works with containerd 1.6.12 as well.
Edit: 1.6.14 is also working.

@tuxtof commented Jan 9, 2023

Hi @cdesiniotis @klueska

It seems I have exactly the same issue with:

OS: CentOS 7.9.2009
Kernel: 3.10.0-1160.76.1.el7.x86_64
Containerd: 1.6.9 & 1.6.14 (tested both)
Gpu Operator: v22.9.1

My nvidia-driver-daemonset is looping. The module build seems OK; I see the modules appear in lsmod, but after a few seconds they disappear and everything restarts.

It fails after:

nvidia-driver-ctr Post-install sanity check passed.
nvidia-driver-ctr
nvidia-driver-ctr Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 525.60.13) is now complete.
nvidia-driver-ctr
nvidia-driver-ctr Parsing kernel module parameters...
nvidia-driver-ctr Loading ipmi and i2c_core kernel modules...
nvidia-driver-ctr Loading NVIDIA driver kernel modules...
nvidia-driver-ctr + modprobe nvidia
nvidia-driver-ctr + modprobe nvidia-uvm
nvidia-driver-ctr + modprobe nvidia-modeset
nvidia-driver-ctr + set +o xtrace -o nounset
nvidia-driver-ctr Starting NVIDIA persistence daemon...
nvidia-driver-ctr ls: cannot access /proc/driver/nvidia-nvswitch/devices/*: No such file or directory
nvidia-driver-ctr Mounting NVIDIA driver rootfs...
nvidia-driver-ctr Done, now waiting for signal
nvidia-driver-ctr Caught signal
nvidia-driver-ctr Stopping NVIDIA persistence daemon...
nvidia-driver-ctr Unloading NVIDIA driver kernel modules...
nvidia-driver-ctr Unmounting NVIDIA driver rootfs...

If I downgrade containerd to 1.6.8, everything is fixed.

@xhejtman commented Jan 10, 2023

There is another issue with containerd:
containerd/containerd#7843

If containerd is restarted (version 1.6.9 and above), most pods are restarted, so together with the nvidia container toolkit pod they end up in an endless restart loop: the toolkit tries to restart containerd, which restarts the toolkit and driver, and everything loops again. There is a fix for containerd, but it may not have landed everywhere yet.

@tuxtof, I think you are hitting exactly this issue.

@shivamerla (Contributor) commented:

Thanks @xhejtman for linking the relevant issue.

@tuxtof commented Jan 10, 2023

Thanks @xhejtman.

So what is the situation? Is the GPU operator no longer working with containerd version 1.6.9 and above?

@danlenar (Contributor) commented:

I am no longer experiencing the issue after upgrading to containerd 1.6.15.

Containerd 1.6.15 contains the fix for containerd/containerd#7843.

@tuxtof commented Jan 11, 2023

OK, I confirm that the freshly released Docker RPM with containerd 1.6.15 fixes the issue on my side too.

Nice

@msherm2 commented Oct 12, 2023

I am currently having this issue with RHEL 8.8, rke2, containerd 1.6.24.

gpu-operator gpu-operator-6f97b7b47c-vzfnm 1/1 Running 0 19h
gpu-operator gpu-operator-node-feature-discovery-master-77984d5f58-zd88s 1/1 Running 0 19h
gpu-operator gpu-operator-node-feature-discovery-worker-4k9mt 1/1 Running 0 19h
gpu-operator gpu-operator-node-feature-discovery-worker-kch9g 1/1 Running 0 19h
gpu-operator gpu-operator-node-feature-discovery-worker-mwqpc 1/1 Running 0 19h
gpu-operator nvidia-container-toolkit-daemonset-79p9v 1/1 Running 0 17h
gpu-operator nvidia-container-toolkit-daemonset-hnrwf 1/1 Running 0 17h
gpu-operator nvidia-container-toolkit-daemonset-vnddf 1/1 Running 0 17h
gpu-operator nvidia-dcgm-exporter-khlll 0/1 Init:0/1 0 19h
gpu-operator nvidia-dcgm-exporter-qmd4t 0/1 Init:0/1 0 19h
gpu-operator nvidia-dcgm-exporter-ts5hs 0/1 Init:0/1 0 19h
gpu-operator nvidia-device-plugin-daemonset-6zw7m 0/1 Init:0/1 0 19h
gpu-operator nvidia-device-plugin-daemonset-btgtz 0/1 Init:0/1 0 19h
gpu-operator nvidia-device-plugin-daemonset-n4lhk 0/1 Init:0/1 0 19h
gpu-operator nvidia-operator-validator-bqrsb 0/1 Init:0/4 0 19h
gpu-operator nvidia-operator-validator-fnn5w 0/1 Init:0/4 0 19h
gpu-operator nvidia-operator-validator-g95nc 0/1 Init:0/4 0 19h

The following seems to function properly as long as the runtime is the default or set to runc, but if the runtime is set to nvidia, there is an error:

nerdctl run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

FATA[0000] failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/default/f3c271c63533254369a34950e71f085e5119af387e5a85793203057ac0c7f5d4/log.json: no such file or directory): exec: "nvidia": executable file not found in $PATH: unknown

nerdctl run --rm --gpus all ubuntu nvidia-smi

Thu Oct 12 14:43:22 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 25C P8 9W / 70W | 2MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

@shivamerla (Contributor) commented:

@msherm2 did you configure the container-toolkit correctly for RKE2 as documented here?

toolkit:
   env:
   - name: CONTAINERD_CONFIG
     value: /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
   - name: CONTAINERD_SOCKET
     value: /run/k3s/containerd/containerd.sock
   - name: CONTAINERD_RUNTIME_CLASS
     value: nvidia
   - name: CONTAINERD_SET_AS_DEFAULT
     value: "true"

@msherm2 commented Oct 12, 2023

Yes, @shivamerla, this is my Helm chart configuration:

Note: I have tested both files for CONTAINERD_CONFIG,
/etc/containerd/config.toml as well as /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl

helm install gpu-operator -n gpu-operator \
  nvidia/gpu-operator $HELM_OPTIONS \
  --set driver.enabled=false \
  --set gfd.enabled=false \
  --set operator.defaultRuntime="containerd" \
  --set toolkit.enabled=true \
  --set toolkit.version=v1.14.2-ubi8 \
  --set toolkit.env[0].name=CONTAINERD_CONFIG \
  --set toolkit.env[0].value=/etc/containerd/config.toml \
  --set toolkit.env[1].name=CONTAINERD_SOCKET \
  --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
  --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
  --set toolkit.env[2].value=nvidia \
  --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
  --set-string toolkit.env[3].value="true"

@msherm2 commented Oct 13, 2023

Update: I followed the instructions here to install containerd using this method, and I believe the critical part is enabling the systemd cgroup driver. Since doing this, I am able to schedule the pods and workloads.
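
For reference, a minimal sketch of that step (regenerating the stock containerd config and enabling the systemd cgroup driver; the sed expression is an assumption that only holds for a freshly generated default config):

containerd config default | sudo tee /etc/containerd/config.toml
# Enable SystemdCgroup for the runc runtime
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd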

@hotspoons commented:

> Update: I followed the instructions here to install containerd using this method, and I believe the critical part is enabling systemd cgroup. Since doing this, I am able to schedule the pods and workloads.

Totally unrelated to the gpu operator, but this fixed my problem with getting the spin wasm shim working on a Rocky 8 cluster. Many thanks!

@SaadKaleem commented Jan 9, 2024

> Update: I followed the instructions here to install containerd using this method, and I believe the critical part is enabling systemd cgroup. Since doing this, I am able to schedule the pods and workloads.

Thanks! With sudo privileges, I generated the configuration via containerd config default and also modified it to add the NVIDIA runtime (below). GPU Feature Discovery alongside the NVIDIA device plugin as DaemonSets seems to be working fine now on a cluster managed via kubeadm.

I'm not using the GPU-operator, since I already have the drivers and container toolkit installed on the host machine.

Will keep monitoring for any intermittent pod sandbox crashes though.

Versions:
containerd.io 1.6.25
nvdp/nvidia-device-plugin 0.14.3 (via Helm with runtimeClassName as "nvidia", with gfd enabled)

[/etc/containerd/config.toml]:

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "runc"
      disable_snapshot_annotations = true
      discard_unpacked_layers = false
      ignore_rdt_not_enabled_errors = false
      no_pivot = false
      snapshotter = "overlayfs"

      [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
        base_runtime_spec = ""
        cni_conf_dir = ""
        cni_max_conf_num = 0
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        runtime_engine = ""
        runtime_path = ""
        runtime_root = ""
        runtime_type = ""

        [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]

     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          base_runtime_spec = ""
          cni_conf_dir = ""
          cni_max_conf_num = 0
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_path = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            BinaryName = ""
            CriuImagePath = ""
            CriuPath = ""
            CriuWorkPath = ""
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            NoPivotRoot = false
            Root = ""
            ShimCgroup = ""
            SystemdCgroup = true

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"
          runtime_engine = ""
          runtime_root = ""
          privileged_without_host_devices = false
            [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
              BinaryName = "/usr/bin/nvidia-container-runtime"
              SystemdCgroup = true
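
To actually schedule pods onto that runtime, a RuntimeClass whose handler matches the containerd runtime name is needed; a minimal sketch (this object is an assumption about the setup described above, not something posted in the thread):

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia    # referenced by pods via runtimeClassName: nvidia
handler: nvidia   # must match the containerd runtime name configured above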
              

@chokosabe commented:

> @shivamerla yes this is my helm chart configuration:
>
> Note: I have tested both files for CONTAINERD_CONFIG, /etc/containerd/config.toml as well as /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
>
> helm install gpu-operator -n gpu-operator
> nvidia/gpu-operator $HELM_OPTIONS
> --set driver.enabled=false
> --set gfd.enabled=false
> --set operator.defaultRuntime="containerd"
> --set toolkit.enabled=true
> --set toolkit.version=v1.14.2-ubi8
> --set toolkit.env[0].name=CONTAINERD_CONFIG
> --set toolkit.env[0].value=/etc/containerd/config.toml
> --set toolkit.env[1].name=CONTAINERD_SOCKET
> --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock
> --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS
> --set toolkit.env[2].value=nvidia
> --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT
> --set-string toolkit.env[3].value="true"

@msherm2, which containerd did you end up using? From your answer it isn't entirely clear: the RKE2-bundled containerd or the node's own containerd?

Thanks.
