Improve GB200 networking, runtime idempotency, Docker pinning, and Ubuntu keyring handling#141
Draft
ericrife wants to merge 4 commits into
- Introduced `docker_ce_version`, `docker_ce_apt_version`, and `docker_apt_arch` variables for consistent Docker installation across playbooks.
- Updated Docker repository definitions to include architecture specifications for Ubuntu installations.
- Modified Docker installation tasks to use versioned package names, ensuring compatibility and allowing for downgrades if necessary.
Summary
This PR carries a focused set of Cloud Native Stack playbook improvements:
- Add `enP*` Calico interface autodetection for GB200-style NIC names.
- Switch the GPU Operator install from generated release names to `helm upgrade --install gpu-operator`.
- Pin Docker repository/package selection.
- Replace `apt_key` usage with explicit keyring downloads/dearmor and `signed-by` apt repository handling.

Closes #140.
Motivation
These changes address issues found while validating newer NVIDIA systems and
newer Ubuntu runtimes:
- GB200-style systems can expose NIC names beginning with `enP*`, which were not covered by the existing Calico autodetection prefix list.
- `helm install --generate-name` is fragile in retry/reapply workflows because it can create duplicate generated releases or fail after partial installs.
- Unpinned Docker repository/package selection can drift between reruns or mirrors.
- Newer Ubuntu releases no longer provide `apt-key`, causing Ansible `apt_key` tasks to fail before repositories are configured.
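The Helm point above can be sketched in shell. This is a minimal illustration, not the playbook task itself: the chart reference `nvidia/gpu-operator`, the `gpu-operator` namespace, and the `DRY_RUN` switch are assumptions introduced here. Only the fixed release name and `helm upgrade --install` come from this PR.

```shell
#!/bin/sh
# Sketch: idempotent GPU Operator install with a fixed release name.
# Assumptions (not from the PR): chart reference, namespace, DRY_RUN switch.

RELEASE_NAME="gpu-operator"   # stable release name proposed in this PR

gpu_operator_cmd() {
  # $1 = chart reference, $2 = namespace (hypothetical defaults)
  printf 'helm upgrade --install %s %s --namespace %s --create-namespace\n' \
    "$RELEASE_NAME" "${1:-nvidia/gpu-operator}" "${2:-gpu-operator}"
}

install_gpu_operator() {
  cmd=$(gpu_operator_cmd "$@")
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$cmd"   # default: print the command only
  else
    $cmd          # unquoted on purpose so the string splits into arguments
  fi
}

# Prints the command; set DRY_RUN=0 to actually run helm.
install_gpu_operator
```

Because the release name is fixed, rerunning the same command upgrades the existing release instead of creating a second, randomly named one.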
Changes
- Extend the Calico autodetection list `interface=ens*,eth*,enc*,bond*,enp*,eno*` to include `enP*`.
- Use a stable `gpu-operator` Helm release name with `helm upgrade --install`.
- Add Docker pinning variables:
  - `docker_ce_version`
  - `docker_ce_apt_version`
  - `docker_apt_arch`
- Replace `apt_key` module usage with direct keyring management for Kubernetes, CRI-O, Docker, and NVIDIA Container Toolkit repositories.
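As a rough illustration of how the Docker pinning variables listed above could feed into apt, the sketch below builds a versioned package spec and an architecture-qualified repository line. The helper names and the example values (`arm64`, `noble`, the version string) are hypothetical; the actual values live in the playbook variables and may differ.

```shell
#!/bin/sh
# Sketch: turn pinning variables into apt inputs.
# Helper names and example values are hypothetical, not taken from the playbooks.

docker_pkg_spec() {
  # $1 = package name, $2 = apt version string -> "name=version"
  printf '%s=%s\n' "$1" "$2"
}

docker_repo_line() {
  # $1 = apt architecture, $2 = Ubuntu codename
  printf 'deb [arch=%s signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu %s stable\n' \
    "$1" "$2"
}

# Example values only; `apt-cache madison docker-ce` lists what a host can install.
docker_apt_arch="arm64"
docker_ce_apt_version="5:27.5.1-1~ubuntu.24.04~noble"

docker_repo_line "$docker_apt_arch" noble
docker_pkg_spec docker-ce "$docker_ce_apt_version"
```

A pinned install could then pass the spec to `apt-get install -y --allow-downgrades`, which lets a host that already has a newer package move back to the pinned version.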
Validation
Local validation passed for this patch series:
- `git diff --check`
- `playbooks/prerequisites.yaml`
- `playbooks/nvidia-driver.yaml`
- `playbooks/nvidia-docker.yaml`
- `playbooks/operators-install.yaml`
- `playbooks/k8s-install.yaml`
- `playbooks/cns.yaml`
Maintainer Review Notes
- Confirm that `gpu-operator` is acceptable as the stable Helm release name.
- Confirm that the `enP*` Calico autodetection prefix matches expected GB200 NIC naming.
- Confirm the keyring paths:
  - `/etc/apt/keyrings/kubernetes-apt-keyring.gpg`
  - `/etc/apt/keyrings/cri-o-apt-keyring.gpg`
  - `/etc/apt/keyrings/docker.asc`
  - `/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg`
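For reviewers checking the keyring paths, the following sketch shows the general shape of keyring-based repository setup without `apt-key`. It is not the playbook code: the function names are hypothetical, the key URL is supplied by the caller, and the Kubernetes repository URL with `v1.30` is only an example value.

```shell
#!/bin/sh
# Sketch: apt repository setup with explicit keyrings instead of apt-key.
# Function names and the example repository URL are assumptions for illustration.

fetch_keyring() {
  # $1 = key URL, $2 = keyring path, $3 = "dearmor" to convert ASCII armor to binary
  install -d -m 0755 "$(dirname "$2")"
  if [ "${3:-}" = "dearmor" ]; then
    # --batch --yes lets a rerun overwrite an existing keyring without prompting
    curl -fsSL "$1" | gpg --batch --yes --dearmor -o "$2"
  else
    curl -fsSL "$1" -o "$2"   # keep the armored key as-is (e.g. a .asc file)
  fi
  chmod 0644 "$2"
}

signed_by_line() {
  # $1 = keyring path, $2 = repository URL, remaining args = suite and components
  keyring=$1
  url=$2
  shift 2
  printf 'deb [signed-by=%s] %s %s\n' "$keyring" "$url" "$*"
}

# Prints a sources.list entry; no network access happens here.
signed_by_line /etc/apt/keyrings/kubernetes-apt-keyring.gpg \
  https://pkgs.k8s.io/core:/stable:/v1.30/deb/ /
```

Tying each repository to its own keyring through `signed-by` means a key is trusted only for that repository, rather than for every configured source as with the old global `apt-key` keyring.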