21.06

@dholt dholt released this 14 Jun 23:07
· 1 commit to release-21.06 since this release

DeepOps 21.06 Release Notes

What's New

General

  • Documentation-based support for NGC Ready configuration in offline environments
  • New setup script (see notes below [1])
  • Rootless Docker support
  • Update burn-in test to use NGC container (v21.4)
  • UFM OS added to packer-maas repo

Slurm

  • Slurm version 20.11.7 (previously 20.11.3) (NOTE: Slurm versions prior to 20.11.7 are affected by CVE-2021-31215 [2])
  • HPC SDK 21.3 (previously 21.2)
  • Open OnDemand support for Ubuntu 20.04
  • Open OnDemand v1.8.20
  • Playbook for single-node Slurm cluster
  • CUDA toolkit 11.3
  • Singularity 3.7.3
  • Slurm Pyxis plugin 0.9.1
  • Hwloc 2.4.1, pmix 3.2.3
  • Spack v0.16.1

K8s

  • Kubernetes version v1.19.9 (kubespray v2.15.1)
  • Helm version v3.5.3 (previously v3.4.1)
  • GPU Operator v1.6.0 (previously v1.5.2)
  • GPU Device Plugin v0.9.0 (previously v0.8.2)
  • GPU Feature Discovery v0.4.1
  • Update Trident deployment role to use Helm chart
  • Update k8s examples to run on A100

Changes

  • Move nvidia-peer-memory logic into a role
  • Option to allow force install of GPU driver
  • Install DCGM via CUDA repos
  • Change namespace of k8s ingress controller
  • Simplify GPU Operator support and change vGPU deployment method
  • Update Triton Kubeflow pipeline to leverage nfs-client and download examples
  • Ansible version 2.9.21

Bugs/Enhancements

  • Update MIG playbook to enable MIG per device rather than all
  • Fixes for DGX firmware update role
  • Reorganize Slurm config file
  • Various fixes to QA tests
  • Documentation updates
  • Improve NHC checks for DGX A100
  • Skip automatic re-installation of NV HPC SDK
  • Use correct namespace when checking helm status of metallb
  • Install python-docker/docker-py via yum instead of pip
  • Updates/fixes for DCGM exporter
  • Correctly install docker python SDK

Upgrade steps

If you are upgrading to this version of DeepOps from a previous release, follow the upgrade section of the Slurm or Kubernetes Deployment Guide. In addition, re-run the ./scripts/setup.sh script and add any new variables from the config.example files to your existing config. For a full diff from release 21.03, run git diff 21.03 21.06 -- config.example/. If you encounter a problem, please open a GitHub issue. See the update guide for additional guidance.
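
The steps above can be sketched as a short shell session. This is a minimal sketch, assuming an existing git checkout of DeepOps in which both the 21.03 and 21.06 tags have been fetched. The helper name deepops_config_diff is hypothetical and not part of DeepOps; it only wraps the git diff command quoted above so that the changed example files can be listed before merging new variables by hand.

```shell
# Hypothetical helper (not part of DeepOps): list the example config files
# that changed between two release tags, so that new variables can be
# merged into the existing config by hand.
deepops_config_diff() {
    old_tag="$1"
    new_tag="$2"
    git diff --name-only "$old_tag" "$new_tag" -- config.example/
}

# Typical sequence, run from the root of the DeepOps checkout
# (assumes the tags are named 21.03 and 21.06, as in these notes):
#   git fetch --tags
#   git checkout 21.06
#   ./scripts/setup.sh
#   deepops_config_diff 21.03 21.06          # which example files changed
#   git diff 21.03 21.06 -- config.example/  # the full diff
```

Listing file names first keeps the review manageable; the full diff can then be read one file at a time.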

Notes

[1] This release makes significant modifications to the control machine setup script (scripts/setup.sh) with the goal of making fewer modifications to the system and conflicting less with existing software. Re-running the setup script will create a Python virtual environment where Ansible and other dependencies are installed (except for any required system packages). The script attempts to make this new virtual environment part of the user's path, but some manual intervention may be required to use the newly installed Ansible version.
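
One way to check whether the manual intervention mentioned in note [1] is needed is to see which ansible the shell actually resolves. The following is a minimal sketch, not part of DeepOps: the function name ansible_from_venv is hypothetical, and the virtual environment path is a placeholder to be replaced with the location the setup script used on your machine.

```shell
# Hypothetical helper (not part of DeepOps): succeed if the `ansible` found
# first on PATH lives inside the given virtual environment directory.
ansible_from_venv() {
    venv_dir="$1"
    ansible_path="$(command -v ansible)" || return 1
    case "$ansible_path" in
        "$venv_dir"/bin/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (the path below is a placeholder, not a documented default):
#   VENV_DIR=/path/to/deepops/venv
#   . "$VENV_DIR/bin/activate"     # standard Python venv activation script
#   ansible_from_venv "$VENV_DIR" && ansible --version
```

If the check fails after re-running the setup script, sourcing the environment's activate script in the current shell is the usual way to put its bin directory first on the path.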

[2] https://www.schedmd.com/archives.php