20.12

@dholt dholt released this 15 Dec 21:44
· 17 commits to release-20.12 since this release

NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.

If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.
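As a rough illustration of the override described above, the sketch below sets slurm_version in a YAML config file. The helper name set_slurm_version and the file path in the usage comment are assumptions made for this example, not part of DeepOps; point it at whichever file in your DeepOps configuration holds the Slurm variables.

```shell
# Hypothetical helper: set or add slurm_version in a YAML config file.
# Usage (the path is an assumption, adjust to your own config layout):
#   set_slurm_version config/group_vars/slurm-cluster.yml 20.11.7
set_slurm_version() {
  file="$1"
  version="$2"
  if grep -q '^slurm_version:' "$file"; then
    # Replace the existing top-level setting; a .bak copy of the file is kept.
    sed -i.bak "s/^slurm_version:.*/slurm_version: ${version}/" "$file"
  else
    # No existing setting found, so append one on its own line.
    printf '\nslurm_version: %s\n' "$version" >> "$file"
  fi
}
```

The function only touches an uncommented, top-level slurm_version line, so a commented-out default in the file is left alone and a new line is appended instead.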

DeepOps 20.12 Release Notes

What's New

  • Support for DGX OS 5.0
  • Support for Ubuntu 20.04
  • Support for CentOS 8
  • MAAS bare-metal provisioning documentation
  • Initial support for Slurm high-availability
  • Caching container registry for Slurm and k8s
  • Slurm and Open OnDemand usage guide
  • MIG support in K8s and documentation

Changes

  • HPC SDK 20.9
  • Slurm 20.02.4, Pyxis v0.8.1, Enroot v3.1.1
  • Kubernetes v1.18.10 (Kubespray v2.14.2), Helm 3, GPU Operator v0.6.0
  • Kubeflow v1.2 w/ MPI Operator (kfctl -> v1.1.0, istio_dex -> v1.0.2, istio -> v1.1.0)
  • K8s GPU device plugin v0.7.0 & GPU Feature Discovery v0.2.0 with support for NVIDIA A100 and MIG
  • Docker 19.03
  • NVIDIA driver role v2.0 (see notes: [1])

Bugs/Enhancements

  • Fix paths from repo re-org
  • Update MIG playbook to enable MIG per device rather than all
  • Rook/Ceph install script improvements
  • Added Slurm tests to QA
  • Move Helm charts off deprecated repo
  • Fix OpenMPI build on CentOS
  • Fix MAAS repo location
  • Fix Enroot removing cache during existing jobs
  • Updates to use Helm 3
  • Fix all GPUs being visible when connecting via SSH to a Slurm compute node
  • Fix python bootstrap script to support python3 on CentOS
  • Allow disabling docker/nvidia-docker install
  • Update Kubeflow deployment to allow custom configurations/kustomization in the workloads directory, with an example culling configuration
  • Update Kubeflow default containers to example NGC containers
  • Update nvidia-dgx-firmware role to work with the new update container, with more verifications
  • Use a persistent volume for Prometheus metrics
  • Limit CPU usage for Prom node exporters
  • Many more bug fixes

Upgrade steps

If you are upgrading to this version of DeepOps from a previous release, follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, re-run the ./scripts/setup.sh script and add any new variables from the config.example files to your existing config. For a full diff from release 20.10, run git diff 20.10 20.12 -- config.example/. There are many changes in this release; if you encounter problems, please open a GitHub issue. See the update guide for additional guidance.
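To help spot variables that still need to be carried over, the following sketch lists top-level YAML keys that appear in an example file but not in the matching file of an existing config. The function names and the file paths in the usage comment are hypothetical, and the key matching is deliberately simple: it only sees uncommented, unindented "key:" lines.

```shell
# Print the uncommented top-level keys ("key: value" at column 0) of a YAML file.
top_level_keys() {
  sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\):.*/\1/p' "$1" | sort -u
}

# Print keys found in the example file ($1) but missing from the current file ($2).
# Usage (paths are assumptions, adjust to your checkout):
#   missing_keys config.example/group_vars/all.yml config/group_vars/all.yml
missing_keys() {
  current_keys=$(mktemp)
  top_level_keys "$2" > "$current_keys"
  # -F fixed strings, -x whole-line match, -v keep only keys not in the current file.
  top_level_keys "$1" | grep -Fxv -f "$current_keys" || true
  rm -f "$current_keys"
}
```

This complements the git diff of config.example/ rather than replacing it: the diff shows what changed between releases, while the key comparison shows what your own config is still missing.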

Notes

[1] On Ubuntu, this update changes the default behavior to install the nvidia-headless-450-server package instead of the cuda-drivers package. See release notes for the driver role for more information.