Refresh workload, RoCE, and registry examples #1356

Merged
michael-balint merged 2 commits into master from dholt/example-roce-refresh on Jul 2, 2026
@dholt dholt commented Jul 2, 2026

Contributor

Summary

  • Refresh the active workload and container examples to current NGC images: the RAPIDS/Dask notebooks move to nvcr.io/nvidia/rapidsai/notebooks:26.04-cuda12-py3.13, the PyTorch Kubernetes smoke job and the Slurm validation base container move to current NGC tags, and the rootless Docker smoke test moves to nvcr.io/nvidia/cuda:13.0.2-base-ubuntu24.04.
  • Update the RAPIDS example Dockerfile and prepare.sh for the current notebooks image layout (notebooks under /opt/rapids/notebooks).
  • Refresh RoCE backend defaults and docs to MLNX_OFED_LINUX-24.10-4.1.4.0-ubuntu24.04-x86_64 and current k8snetworkplumbingwg Multus/SR-IOV manifest URLs; document that new Kubernetes RDMA/RoCE deployments should prefer the NVIDIA Network Operator, with the DeepOps role retained as a legacy direct-host path.
  • Replace the stale public NCCL test image example with site-owned image placeholders and update NGC Ready defaults/docs for PyTorch and TensorFlow.
  • Update registry-cache defaults to include registry.k8s.io and correct the documented registry proxy image to rpardini/docker-registry-proxy:0.6.5.
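
For reviewers who want to see which of the image references above would be served through the registry cache, here is a minimal sketch in Python. It is not code from this PR: the function names and the `CACHED_REGISTRIES` set are hypothetical, and only `registry.k8s.io` is confirmed by the summary as a cached registry (the `nvcr.io` entry is an assumption for illustration). The parsing follows the common convention that the first path component is a registry host only if it contains a dot or a colon, or is `localhost`. Digest references (`@sha256:...`) are not handled.

```python
# Hypothetical cache list; only registry.k8s.io is confirmed by the PR summary.
CACHED_REGISTRIES = {"nvcr.io", "registry.k8s.io"}


def split_image_ref(ref):
    """Split an image reference into (registry, repository, tag)."""
    last_slash = ref.rfind("/")
    last_colon = ref.rfind(":")
    # A colon only marks a tag when it comes after the last slash;
    # otherwise it belongs to a host:port registry prefix.
    if last_colon > last_slash:
        name, tag = ref[:last_colon], ref[last_colon + 1:]
    else:
        name, tag = ref, "latest"
    first, sep, rest = name.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first, rest, tag
    # No explicit registry host: assume Docker Hub.
    return "docker.io", name, tag


def uses_cache(ref, cached=CACHED_REGISTRIES):
    """Return True if the reference's registry is in the cached set."""
    registry, _, _ = split_image_ref(ref)
    return registry in cached


if __name__ == "__main__":
    for image in (
        "nvcr.io/nvidia/rapidsai/notebooks:26.04-cuda12-py3.13",
        "nvcr.io/nvidia/cuda:13.0.2-base-ubuntu24.04",
        "rpardini/docker-registry-proxy:0.6.5",
    ):
        print(image, split_image_ref(image), uses_cache(image))
```

Under these assumptions the two NGC images resolve to `nvcr.io`, while the proxy image itself resolves to Docker Hub and would bypass the hypothetical cache list.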

Validation

  • git diff --check origin/master..HEAD
  • YAML parse for all changed YAML files
  • Ansible syntax-check suite plus focused checks for playbooks/k8s-cluster/roce.yaml and playbooks/slurm-cluster/slurm-validation.yml
  • Docker manifest inspection for the referenced NGC/CUDA/RAPIDS image tags and HTTP verification of the OFED download URL
  • OS compatibility audit against this branch: high=0, medium=0, low=45, info=30
  • Public sanitizer check on the PR body
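
The first three checks above could be scripted roughly as follows. This is a sketch under stated assumptions, not the tooling actually used for this PR: the helper names are hypothetical, and only `git diff --name-only` and `ansible-playbook --syntax-check` are real commands. The YAML parse step itself is left out because it needs a third-party parser (for example PyYAML's `yaml.safe_load`); the sketch only selects the files that such a step would read.

```python
import subprocess
from pathlib import PurePosixPath

YAML_SUFFIXES = {".yml", ".yaml"}


def changed_files(base="origin/master", head="HEAD"):
    """List paths changed between base and head via `git diff --name-only`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base}..{head}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def yaml_paths(paths):
    """Keep only the paths that end in .yml or .yaml."""
    return [p for p in paths if PurePosixPath(p).suffix.lower() in YAML_SUFFIXES]


def syntax_check_commands(playbooks):
    """Build one `ansible-playbook --syntax-check` command per playbook."""
    return [["ansible-playbook", "--syntax-check", p] for p in playbooks]


if __name__ == "__main__":
    # Must be run inside a git checkout of the branch under review.
    for path in yaml_paths(changed_files()):
        print("would parse:", path)
    focused = [
        "playbooks/k8s-cluster/roce.yaml",
        "playbooks/slurm-cluster/slurm-validation.yml",
    ]
    for command in syntax_check_commands(focused):
        print("would run:", " ".join(command))
```

The commands are built as argument lists rather than shell strings so that they can be passed to `subprocess.run` without quoting issues.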

Notes

  • This is an examples/docs/defaults refresh; it does not change deployment playbook behavior.
  • The RoCE backend role is kept as a legacy direct-host path; Network Operator remains the recommended route for new Kubernetes RDMA deployments.

@dholt dholt requested a review from michael-balint July 2, 2026 13:04
@michael-balint michael-balint merged commit 9533349 into master Jul 2, 2026
41 of 42 checks passed