22.04.1

@ajdecon ajdecon released this 02 May 17:09
· 2 commits to release-22.04 since this release
1a7b13c

DeepOps 22.04.1 Release Notes

Bugfix release

NVIDIA rotated the signing keys for the CUDA repositories on April 27, breaking installs from DeepOps 22.04, which had been released a few days earlier.

This release, 22.04.1, starts from 22.04 and adds PR #1167 to handle the updated key.

The previous release notes from 22.04 appear below.

Known Issues

  • Kubeflow deployment is currently broken due to an incompatibility between the current Kubeflow release and Kubernetes 1.22. The Kubeflow deployment will be updated to add support once Kubeflow 1.6 is released. See #1147.

General

  • Extensive improvements to automated testing with Jenkins, Ansible Molecule, and ansible-lint
  • Update MIG playbook to use the new nvidia-mig-manager systemd service
  • Updates to roles for nvidia-docker and GPU driver
  • Various bug fixes

Slurm

  • Enhanced NCCL tests for Slurm cluster validation
  • Make use of pam_slurm_adopt optional
  • Break out multiple sections in Slurm inventory file

Kubernetes

  • Update to Kubernetes 1.22.6
  • Update default container runtime from dockershim to containerd
  • Add support for NVIDIA Network Operator
  • Add support to deploy NVIDIA Deep Learning Examples on Kubernetes clusters
  • Update to GPU Operator 1.9

Changes

Bugs/Enhancements

  • Fixes for rsyslog server role (#1096, #1098)
  • Update NetApp Trident default version number and branding (#1105)
  • Introduce a common script library (#953)
  • Update versions of monitoring stack components (#1107)
  • Updates to Jenkins testing (#1112, #1127, #1133, #1137, #1138, #1139, #1150, #1151)
  • Fixes for setup script (#1114)
  • Automated testing of DeepOps roles using Molecule (#1094, #1116, #1158)
  • Update nvidia.nvidia_docker role to v1.2.4 (#1121)
  • Automated deployment of Deep Learning Examples (#1083, #1145)
  • Make it optional to use pam_slurm_adopt (#1111)
  • Convert MIG playbook to use nvidia-mig-manager service (#1106)
  • Update to GPU Operator 1.9 (#1074)
  • Automatically run ansible-lint on each role (#1129)
  • Update Kubeflow deployment script to Kubeflow 1.4 (#1104)
  • Remove old build dirs during Slurm upgrade (#1101)
  • Fixes to ood-wrapper role (#1125)
  • Documentation of network ports (#1126)
  • Set missing defaults in playbooks (#1134)
  • Update to Kubespray v2.18.1 and containerd (#1043, #1141)
  • Fix GPU Operator config (#1136)
  • Break out functional host groups in Slurm inventory (#1087)
  • Fix ordering in k8s cluster deployment (#1128)
  • Update nvidia.nvidia_driver role to v2.2.0 (#1143, #1160)
  • Add support for NVIDIA Network Operator (#1113, #1156)
  • Enhanced NCCL tests for Slurm validation (#1042)
  • Fix git.io shortlinks (#1163)
  • Check for SELinux disabled in SELinux tasks (#1162)

Upgrade Steps

If you are upgrading to this version of DeepOps from a previous release, follow the upgrade section of the Slurm or Kubernetes Deployment Guide. In addition, you must re-run the ./scripts/setup.sh script and add any new variables from the config.example files to your existing config. For a full diff from release 22.01, run git diff 22.01 22.04 -- config.example/. If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.
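As a rough aid for spotting new variables, the sketch below lists top-level variable names that appear in an example file but not in your existing config. It is not part of DeepOps: the function names and the sample variable names are hypothetical, and it assumes the files are flat YAML where each variable starts at the beginning of a line. It ignores nested keys and commented-out lines, so treat its output as a hint and confirm against the git diff.

```python
import re

# Matches an uncommented, unindented "name:" at the start of a line.
# Assumption: variables of interest are top-level YAML keys.
_KEY_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)[ \t]*:", re.MULTILINE)


def top_level_keys(yaml_text):
    """Return the set of top-level key names found in the given YAML text."""
    return set(_KEY_RE.findall(yaml_text))


def missing_variables(example_text, config_text):
    """Return sorted key names present in example_text but absent from config_text."""
    return sorted(top_level_keys(example_text) - top_level_keys(config_text))


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    example = "existing_var: 1\nnew_var: true\n# commented_var: 2\n"
    existing = "existing_var: 1\n"
    print(missing_variables(example, existing))
```

To use it on real files, read the text of a file under config.example/ and the matching file in your own config directory, then pass both strings to missing_variables.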

Notes