Package NVIDIA Container Toolkit install as a reusable Ansible role #147

Open
Hardikrepo wants to merge 2 commits into NVIDIA:master from Hardikrepo:feat/nvidia-container-toolkit-role

@Hardikrepo

Resolves #74.

Summary

The NVIDIA Container Toolkit install steps were only available embedded inside playbooks/nvidia-docker.yaml, tightly coupled to Cloud Native Stack's own variables (nvidia_docker_exists, cns_version, etc.). As the issue points out, other NVIDIA customers could not reuse just this part without pulling in the whole CNS playbook.

Adds roles/nvidia_container_toolkit/, a self-contained, standalone Ansible role:

  • defaults/main.yml: nvidia_container_toolkit_version, nvidia_container_toolkit_runtime (docker/containerd/cri-dockerd/cri-o), nvidia_container_toolkit_enable_cdi, nvidia_container_toolkit_remove_existing
  • tasks/main.yml: apt/yum signing key + repo setup, pinned package install on Ubuntu/Debian and RHEL/CentOS, and nvidia-ctk CDI configuration for the chosen runtime
  • meta/main.yml + README.md documenting usage via:
    ansible-galaxy role install git+https://github.com/NVIDIA/cloud-native-stack,master#/roles/nvidia_container_toolkit
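
For illustration, once the role is installed, a consuming playbook might look roughly like the sketch below. This is not taken from the role's README: the `gpu_nodes` host group is hypothetical, the version string is a placeholder, the role is assumed to be installed under the name `nvidia_container_toolkit`, and the two flag variables are assumed to be booleans.

```yaml
# Hypothetical playbook consuming the standalone role.
# Assumptions: role installed as "nvidia_container_toolkit"; "gpu_nodes"
# is an example inventory group; the flag variables accept booleans.
- name: Install NVIDIA Container Toolkit on GPU hosts
  hosts: gpu_nodes
  become: true
  roles:
    - role: nvidia_container_toolkit
      vars:
        # Placeholder: substitute the toolkit version you want pinned.
        nvidia_container_toolkit_version: "<toolkit-version>"
        # One of: docker, containerd, cri-dockerd, cri-o
        nvidia_container_toolkit_runtime: containerd
        nvidia_container_toolkit_enable_cdi: true
        nvidia_container_toolkit_remove_existing: false
```

Passing the settings under `vars:` on the role entry keeps them scoped to this role, which fits the goal of using it without any Cloud Native Stack variables.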
    

This is purely additive — playbooks/nvidia-docker.yaml is unchanged and the existing CNS install flow continues to work exactly as before.

Test plan

  • Reviewed task logic against the equivalent steps in playbooks/nvidia-docker.yaml for parity
  • Not run end-to-end against a live host in this environment (no real GPU/Docker/containerd target available)

…ker CE, modern apt keyring, Calico enP* support

Resolves NVIDIA#140: replaces deprecated apt_key module with get_url+gpg
--dearmor, pins docker-ce to 29.4.3 with arch detection instead of
state: latest, switches GPU/Network Operator helm installs from
install --generate-name to upgrade --install for retry-safety, and
adds enP* to Calico IP_AUTODETECTION_METHOD for GB200/Grace NIC names.
Resolves NVIDIA#74: the toolkit install steps were only available embedded
inside playbooks/nvidia-docker.yaml, tightly coupled to Cloud Native
Stack's own variables (nvidia_docker_exists, cns_version, etc.), so
other NVIDIA customers couldn't reuse just this part.

Adds roles/nvidia_container_toolkit/, a self-contained role with:
- defaults/main.yml: nvidia_container_toolkit_version,
  nvidia_container_toolkit_runtime, nvidia_container_toolkit_enable_cdi
- tasks/main.yml: apt/yum key + repo setup, pinned package install on
  Ubuntu/Debian and RHEL/CentOS, and nvidia-ctk CDI configuration for
  docker, containerd, cri-dockerd, or cri-o
- meta/main.yml + README.md documenting standalone usage via
  `ansible-galaxy role install git+https://github.com/NVIDIA/cloud-native-stack,master#/roles/nvidia_container_toolkit`

This is additive only; the existing nvidia-docker.yaml playbook is
unchanged and continues to work as before.
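
As a rough illustration of the "get_url + gpg --dearmor" pattern mentioned in the commit message above, replacing `apt_key` could look something like the tasks below. This is a generic sketch and not the tasks from this PR: `example_repo_key_url`, `example_repo_url`, `example_repo_suite`, and the `example-repo` file names are hypothetical placeholders.

```yaml
# Generic sketch of the apt keyring pattern; all "example_*" names are
# hypothetical and must be supplied by the caller.
- name: Ensure the keyring directory exists
  ansible.builtin.file:
    path: /etc/apt/keyrings
    state: directory
    mode: "0755"

- name: Download the ASCII-armored signing key
  ansible.builtin.get_url:
    url: "{{ example_repo_key_url }}"
    dest: /tmp/example-repo.asc
    mode: "0644"

- name: Convert the key to a binary keyring (skipped if already present)
  ansible.builtin.command:
    cmd: gpg --dearmor -o /etc/apt/keyrings/example-repo.gpg /tmp/example-repo.asc
    creates: /etc/apt/keyrings/example-repo.gpg

- name: Add the repository, scoped to that keyring via signed-by
  ansible.builtin.apt_repository:
    repo: "deb [signed-by=/etc/apt/keyrings/example-repo.gpg] {{ example_repo_url }} {{ example_repo_suite }} main"
    filename: example-repo
    state: present
```

The `creates:` argument keeps the `gpg` step from re-running once the keyring exists, and `signed-by` restricts the key to this one repository rather than trusting it globally as `apt_key` did.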
Hardikrepo force-pushed the feat/nvidia-container-toolkit-role branch from a37e691 to dccf543 on June 24, 2026 12:58
@agudanv

agudanv commented Jul 1, 2026

Collaborator

Thank you for the contribution. Please fork the cns-dev branch, make the changes there, and create the PR against cns-dev instead of master. These changes need QA verification and will be added to the next release if they are not urgent.

Development

Successfully merging this pull request may close these issues.

Request for ansible galaxy collections to reuse software
