
feat: Use CDI for GPU injection instead of nvidia-container-cli #398

@klueska

Description

Problem Statement

GPU access currently relies on the legacy nvidia-container-runtime +
nvidia-container-cli stack at two layers: once when Docker injects GPUs
into the k3s cluster container, and again when the nvidia-device-plugin +
nvidia-container-runtime inject them into individual sandbox pods.

Proposed Design

Both layers should be migrated to CDI instead. The general idea:

  1. Generate a CDI spec on the host before starting the cluster:
    nvidia-ctk cdi generate
  2. Use Docker's native CDI support (available since Docker 25) to pass GPUs
    into the k3s container: --device nvidia.com/gpu=all
  3. Mount /etc/cdi into the k3s container, set enable_cdi = true in the
    containerd CRI config, and configure the nvidia-device-plugin to emit CDI
    device IDs so containerd handles injection natively
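As a rough sketch of the three steps (flags follow the public nvidia-ctk and Docker documentation; the container name, image, and paths here are illustrative, not OpenShell's actual ones):

```shell
# 1. Generate the CDI spec on the host.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Docker 25's CDI support is off by default; enable it in
# /etc/docker/daemon.json first:  { "features": { "cdi": true } }

# 2. Request GPUs via CDI and mount the spec dir (read-only, since nothing
#    in the cluster should rewrite host specs) for the nested containerd.
docker run -d --name k3s \
  --device nvidia.com/gpu=all \
  -v /etc/cdi:/etc/cdi:ro \
  rancher/k3s:latest server

# 3. Inside the k3s container, enable CDI in containerd's CRI plugin
#    (k3s reads a config template at
#    /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl):
#
#      [plugins."io.containerd.grpc.v1.cri"]
#        enable_cdi = true
#        cdi_spec_dirs = ["/etc/cdi"]
#
#    and run the nvidia-device-plugin with --device-list-strategy=cdi-cri
#    (supported since v0.14) so kubelet passes CDI device IDs over CRI.
```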

CDI is the canonical way NVIDIA supports GPU access in containerized
environments going forward. Some platforms require CDI and are incompatible
with the legacy runtime stack, so this would also broaden the set of platforms
OpenShell can run on. It also makes what gets injected explicit and
auditable via the CDI spec rather than delegating to a CLI with broad host
access.
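For illustration, a generated spec looks roughly like the following (heavily abridged; real specs enumerate per-GPU device nodes, driver library mounts, and hooks, and paths vary by driver install):

```yaml
cdiVersion: "0.6.0"
kind: nvidia.com/gpu
devices:
  - name: "0"
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
containerEdits:
  deviceNodes:
    - path: /dev/nvidiactl
  mounts:
    - hostPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
      containerPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
      options: ["ro", "nosuid", "nodev", "bind"]
```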

/cc @elezar @jgehrcke

Alternatives Considered

None

Agent Investigation

No response

Checklist

  • I've reviewed existing issues and the architecture docs
  • This is a design proposal, not a "please build this" request
