fix(gpu): select one CDI GPU by default for Docker and Podman #1477

@elezar

Description

Update Docker and Podman GPU sandbox defaults so --gpu prefers one CDI GPU device instead of defaulting to nvidia.com/gpu=all.

This is part of the GPU roadmap in #1444. The --gpu flag requests the active driver's default GPU behavior. For GPU-enabled drivers, that default should inject or allocate one suitable GPU whenever the runtime supports individual device selection.

Context

Parent roadmap: #1444

Current local-container behavior maps a GPU request with no explicit gpu_device to nvidia.com/gpu=all through the shared CDI helper. That makes Docker and Podman inconsistent with Kubernetes and VM behavior, where a default GPU request maps to one GPU.

Docker has priority for implementation because OpenShell's Docker GPU path and CDI discovery are more mature today. Podman should be handled in the same task, but may require additional runtime support or an out-of-band CDI device discovery path. Upstream Podman behavior such as containers/podman#28712 may be relevant.

Proposed Scope

  • Define local-container default GPU selection semantics for Docker and Podman.
  • Change Docker default --gpu behavior to prefer one CDI GPU device instead of nvidia.com/gpu=all.
  • Change Podman default --gpu behavior to prefer one CDI GPU device instead of nvidia.com/gpu=all.
  • Prefer runtime-reported CDI inventory when available.
  • Preserve explicit --gpu-device behavior as a driver-native advanced option.
  • Do not add multi-GPU count support in this task.
  • Do not require OpenShell-managed GPU assignment/exclusivity tracking in this task.
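
Preferring runtime-reported CDI inventory depends on telling individual GPU devices apart from the aggregate all entry. The following is a minimal sketch of that classification, assuming CDI fully qualified device names of the form vendor/class=name. The names parse_cdi_name and split_gpu_inventory are hypothetical and are not existing OpenShell helpers; how the inventory is obtained from Docker or Podman is left open, as in the Open Questions below.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple

GPU_KIND = "nvidia.com/gpu"  # CDI kind used throughout this issue


class CdiDevice(NamedTuple):
    kind: str  # e.g. "nvidia.com/gpu"
    name: str  # e.g. "0" or "all"


def parse_cdi_name(qualified: str) -> Optional[CdiDevice]:
    """Split 'vendor/class=name' into kind and name; return None if malformed."""
    kind, sep, name = qualified.partition("=")
    if not sep or not name or "/" not in kind:
        return None
    return CdiDevice(kind, name)


def split_gpu_inventory(names: Iterable[str]) -> Tuple[List[str], bool]:
    """Return (individual GPU device IDs in reported order, whether 'all' is reported)."""
    individual: List[str] = []
    has_all = False
    for raw in names:
        dev = parse_cdi_name(raw.strip())
        if dev is None or dev.kind != GPU_KIND:
            continue  # ignore malformed names and non-GPU CDI devices
        if dev.name == "all":
            has_all = True
        else:
            individual.append(f"{dev.kind}={dev.name}")
    return individual, has_all
```

Keeping this step free of any runtime call means the same classification could serve both a runtime-reported inventory and an out-of-band discovery path for Podman.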

Target Behavior

Default GPU selection should use this order:

  1. If the runtime reports individual CDI GPU devices, select one individual device.
  2. If reliable CDI inventory is unavailable but individual device IDs are expected to work, fall back to nvidia.com/gpu=0.
  3. If the runtime/platform only reports or supports nvidia.com/gpu=all, such as some WSL2-based setups, use nvidia.com/gpu=all as a compatibility fallback.
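
The three-step order above can be sketched as a single pure function. This is an illustration under stated assumptions, not the proposed implementation: select_default_gpu and the individual_ids_expected flag are hypothetical names, and picking the first reported individual device is an assumption, since the issue only requires that one device be selected.

```python
from typing import Optional, Sequence

GPU_PREFIX = "nvidia.com/gpu="
GPU_ALL = "nvidia.com/gpu=all"
GPU_FIRST = "nvidia.com/gpu=0"


def select_default_gpu(
    inventory: Optional[Sequence[str]],
    individual_ids_expected: bool = True,
) -> str:
    """Pick the default CDI device for a --gpu request with no explicit device.

    inventory is the list of CDI device IDs reported by the runtime, or None
    when no reliable inventory is available.
    """
    if inventory:
        individual = [
            d for d in inventory if d.startswith(GPU_PREFIX) and d != GPU_ALL
        ]
        if individual:
            # Step 1: the runtime reports individual devices; take one.
            # Assumption: first in reported order.
            return individual[0]
        if GPU_ALL in inventory:
            # Step 3: the runtime only reports the aggregate device.
            return GPU_ALL
    # No usable GPU entries were reported (assumption: treat this the same
    # as a missing inventory rather than as an error).
    if individual_ids_expected:
        # Step 2: no reliable inventory, but individual IDs should work.
        return GPU_FIRST
    # Step 3: compatibility fallback, for example some WSL2-based setups.
    return GPU_ALL
```

For example, select_default_gpu(None) yields nvidia.com/gpu=0, while an inventory containing only nvidia.com/gpu=all yields nvidia.com/gpu=all.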

Additional behavior:

  • openshell sandbox create --gpu ... on Docker injects one CDI GPU device when individual device selection is available.
  • openshell sandbox create --gpu ... on Podman injects one CDI GPU device when individual device selection is available.
  • openshell sandbox create --gpu --gpu-device nvidia.com/gpu=0 ... continues to pass the explicit CDI device ID through.
  • The fallback to nvidia.com/gpu=all should be intentional and documented, not the default for platforms with individual device selection.
  • Non-zero gpu_count remains unsupported unless a driver explicitly implements count-based allocation.
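
How explicit pass-through, the default path, and the gpu_count restriction fit together can be sketched as follows. The GpuRequest type and resolve_cdi_devices function are hypothetical; only the gpu_device and gpu_count field names come from this issue. Returning no devices when --gpu is absent is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class GpuRequest:
    gpu: bool = False                 # --gpu was passed
    gpu_device: Optional[str] = None  # --gpu-device value, if any
    gpu_count: int = 0                # count-based allocation, unsupported here


def resolve_cdi_devices(
    req: GpuRequest,
    default_selector: Callable[[], str],
) -> List[str]:
    """Return the CDI device IDs to inject for a sandbox request."""
    if not req.gpu:
        return []  # assumption: no GPU injection without --gpu
    if req.gpu_count != 0:
        # Non-zero counts stay unsupported for drivers without
        # count-based allocation.
        raise ValueError("gpu_count is not supported by this driver")
    if req.gpu_device is not None:
        # Explicit device IDs pass through unchanged.
        return [req.gpu_device]
    # Default path: delegate to the driver's single-device selection.
    return [default_selector()]
```

Passing the default selector in as a callable keeps the pass-through logic identical for Docker and Podman while letting each driver supply its own inventory source.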

Out of Scope

This task fixes only the cardinality of default GPU device selection: one device instead of all. It does not require OpenShell to track active GPU assignments or prevent two OpenShell sandboxes from selecting the same default GPU.

If multiple sandboxes are created concurrently, selecting the same default fallback device is acceptable until a separate allocation/exclusivity task is implemented.

Open Questions

  • Where should CDI inventory discovery live: shared OpenShell core helper, driver-specific code, or both?
  • What should Podman use as the authoritative CDI device inventory source before runtime-level enumeration is reliable?
  • Should assignment/exclusivity tracking be added later at the driver level or as part of a broader resource allocation model?

Definition of Done

  • Docker default --gpu prefers one individual CDI GPU device when available.
  • Podman default --gpu prefers one individual CDI GPU device when available.
  • If reliable CDI inventory is unavailable and individual IDs are expected to work, default selection falls back to nvidia.com/gpu=0.
  • If individual selection is unavailable, nvidia.com/gpu=all remains available as a documented compatibility fallback.
  • Explicit --gpu-device pass-through behavior is preserved for Docker and Podman.
  • Tests cover individual-device default selection, fallback selection, and explicit device pass-through.
  • Docs describe Docker/Podman default GPU behavior, compatibility fallback behavior, and --gpu-device as an advanced driver-native option.
