Improve GB200 networking, runtime idempotency, Docker pinning, and Ubuntu keyring handling #141

Draft
ericrife wants to merge 4 commits into NVIDIA:master from ericrife:bugfix/gb200-device-names

@ericrife

Summary

This PR carries a focused set of Cloud Native Stack playbook improvements:

  • Add enP* Calico interface autodetection for GB200-style NIC names.
  • Make GPU Operator Helm installs retry-safe by switching from generated Helm
    release names to helm upgrade --install gpu-operator.
  • Add Docker CE version and architecture variables, then use them in Ubuntu apt
    repository/package selection.
  • Replace active Ubuntu apt_key usage with explicit keyring downloads
    (dearmored where needed) and signed-by apt repository entries.

Closes #140.

Motivation

These changes address issues found while validating newer NVIDIA systems and
newer Ubuntu runtimes:

  • Some systems expose NIC names matching enP*, which were not covered by the
    existing Calico autodetection prefix list.
  • helm install --generate-name is fragile in retry/reapply workflows because
    it can create duplicate generated releases or fail after partial installs.
  • Docker CE package selection was unpinned and the apt architecture was
    implicit, so the installed version can drift between reruns or mirrors.
  • Newer Ubuntu images may not include apt-key, causing Ansible apt_key
    tasks to fail before repositories are configured.
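
The first point comes down to case sensitivity: Linux interface names are case-sensitive, so a lowercase `enp` prefix does not cover a name that starts with `enP`. The sketch below illustrates this with plain shell glob matching. It does not reproduce Calico's own matcher, and `enp3s0` and `enP5p9s0` are hypothetical sample names, not names taken from a real GB200 system.

```shell
#!/bin/sh
# Illustration only: shell-glob prefix matching, NOT Calico's matcher.

# Succeeds if the name matches one of the prefixes in the existing list.
matches_old_list() {
  case "$1" in
    ens*|eth*|enc*|bond*|enp*|eno*) return 0 ;;
    *) return 1 ;;
  esac
}

# Same list with the enP* prefix added by this PR.
matches_new_list() {
  case "$1" in
    ens*|eth*|enc*|bond*|enp*|eno*|enP*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical sample interface names.
for name in enp3s0 enP5p9s0; do
  if matches_old_list "$name"; then old=yes; else old=no; fi
  if matches_new_list "$name"; then new=yes; else new=no; fi
  echo "$name old=$old new=$new"
done
```

Under glob semantics the lowercase name matches both lists, while the `enP` name matches only the extended one.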

Changes

  1. Extend Calico autodetection from
    interface=ens*,eth*,enc*,bond*,enp*,eno* to include enP*.
  2. Convert GPU Operator install paths to use a stable gpu-operator Helm
    release name with helm upgrade --install.
  3. Add Docker CE versioning and apt architecture variables:
    • docker_ce_version
    • docker_ce_apt_version
    • docker_apt_arch
  4. Replace active Ubuntu apt_key module usage with direct keyring management
    for Kubernetes, CRI-O, Docker, and NVIDIA Container Toolkit repositories.
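
As a rough sketch of change 2, a fixed release name lets a rerun upgrade the existing release in place rather than create a new generated one. The chart reference `nvidia/gpu-operator` and the namespace `nvidia-gpu-operator` below are assumptions for illustration; the playbooks may use different values. The function prints the command by default and runs it only when `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch under stated assumptions; not the playbook's actual task.
RELEASE_NAME="gpu-operator"        # stable release name proposed in this PR
CHART="nvidia/gpu-operator"        # assumed chart reference
NAMESPACE="nvidia-gpu-operator"    # assumed namespace

install_gpu_operator() {
  # Build the command as positional parameters so quoting is preserved.
  set -- helm upgrade --install "$RELEASE_NAME" "$CHART" \
    --namespace "$NAMESPACE" --create-namespace
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"     # dry run: print the command only
  else
    "$@"          # run helm; safe to repeat because the release name is fixed
  fi
}

install_gpu_operator
```

Running the same command twice targets the same release both times, which is the retry-safe property the change is after.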

Validation

Local validation passed for this patch series:

  • git diff --check
  • YAML parse checks for touched YAML files
  • Ansible syntax checks for:
    • playbooks/prerequisites.yaml
    • playbooks/nvidia-driver.yaml
    • playbooks/nvidia-docker.yaml
    • playbooks/operators-install.yaml
    • playbooks/k8s-install.yaml
    • playbooks/cns.yaml
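
The exact commands used are not recorded above, so the following is one plausible way to repeat the whitespace and syntax checks from the repository root, assuming `git` and `ansible-playbook` are on the PATH. The helper name `run_checks` is hypothetical, and the YAML parse step is omitted because the tool used for it is not stated.

```shell
#!/bin/sh
# Sketch: re-run the whitespace and Ansible syntax checks listed above.
PLAYBOOKS="playbooks/prerequisites.yaml
playbooks/nvidia-driver.yaml
playbooks/nvidia-docker.yaml
playbooks/operators-install.yaml
playbooks/k8s-install.yaml
playbooks/cns.yaml"

run_checks() {
  # Fails on whitespace errors or conflict markers in the working tree diff.
  git diff --check || return 1
  for pb in $PLAYBOOKS; do
    # Parses the playbook without executing any tasks.
    ansible-playbook --syntax-check "$pb" || return 1
  done
}

# Call run_checks from the repository root to execute the checks.
```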

Maintainer Review Notes

  • Please confirm whether gpu-operator is acceptable as the stable Helm
    release name.
  • Please confirm Docker CE version defaults for CNS 16.0, 16.1, and 17.0.
  • Please confirm the enP* Calico autodetection prefix matches expected GB200
    NIC naming.
  • Please confirm keyring paths are acceptable:
    • /etc/apt/keyrings/kubernetes-apt-keyring.gpg
    • /etc/apt/keyrings/cri-o-apt-keyring.gpg
    • /etc/apt/keyrings/docker.asc
    • /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
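
For context on how such a path is consumed, here is a minimal sketch of a `signed-by` apt source entry built from an architecture and a keyring path. The helper name `apt_source_line`, the `arm64` architecture, and the `noble` suite are sample values chosen for illustration, not the playbook defaults. apt accepts ASCII-armored keys with an `.asc` extension and binary (dearmored) keys with a `.gpg` extension.

```shell
#!/bin/sh
# Sketch: compose a one-line apt source entry that pins the signing key.
# Arguments: arch, keyring path, repository URL, suite, component.
apt_source_line() {
  echo "deb [arch=$1 signed-by=$2] $3 $4 $5"
}

# Sample values only; arm64 and noble are assumptions for illustration.
apt_source_line arm64 /etc/apt/keyrings/docker.asc \
  https://download.docker.com/linux/ubuntu noble stable

# A binary keyring is typically produced by piping the downloaded key
# through "gpg --dearmor -o <keyring path>" before the entry is written.
```

Scoping the key to a single repository this way avoids the global trust that `apt-key` relied on.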

robdaly and others added 4 commits February 6, 2026 14:53
- Introduced `docker_ce_version`, `docker_ce_apt_version`, and `docker_apt_arch` variables for consistent Docker installation across playbooks.
- Updated Docker repository definitions to include architecture specifications for Ubuntu installations.
- Modified Docker installation tasks to use versioned package names, ensuring compatibility and allowing for downgrades if necessary.

Development

Successfully merging this pull request may close these issues.

GB200 and newer Ubuntu runtime fixes needed in Cloud Native Stack playbooks

2 participants