
Releases: NVIDIA/deepops

23.08

28 Aug 18:27
d248b65
Merge pull request #1296 from dholt/release-23.08

Release 23.08

22.08

24 Aug 16:26
5fdde40

DeepOps 22.08 Release Notes

Known Issues

  • Kubeflow deployment is currently broken due to incompatibility between current Kubeflow and Kubernetes 1.22. Kubeflow deployment will be updated to add support when Kubeflow releases 1.6.

General

  • Re-work of a large portion of the documentation
  • Updates to NCCL tests
  • Various bug fixes

Slurm

  • Update to Slurm 22.05.2
  • Add Alertmanager integration
  • Option to share Slurm configuration among nodes via NFS
  • Enhancements to Slurm re-install/re-build tasks

Kubernetes

  • Update to Kubernetes 1.24.4
  • Update to GPU Operator 1.11.1 (GPU driver branch 515)

Changes

Bugs/Enhancements

  • Update NVIDIA driver role (#1216)
  • Update Kubespray submodule URL (#1200)
  • Add Alertmanager to Slurm cluster deployment (#1198)
  • Fix Slurm configuration GRES syntax (#1196)
  • Update Pyxis image cache size (#1191)
  • Updates to documentation (#1188)
  • Fix Slurm reinstall/rebuild tasks (#1187)
  • Update MetalLB helm repo (#1185)
  • Update EPEL GPG key (#1184)
  • Add option to share Slurm configuration among nodes (#1182)
  • Update NCCL tests (#1180, #1209)
  • NetApp Trident: fix PATH (#1176)
  • Update default Slurm version to 21.08.8 (#1169, #1171)
  • Update NVIDIA signing key (#1166, #1167)
  • Update Ansible (#1165)

Upgrade Steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the ./scripts/setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 22.04, run git diff 22.04 22.08 -- config.example/. If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.
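
For reference, a minimal upgrade sequence might look like the following sketch; the checkout path is illustrative and the tag names should match the releases you are moving between:

# Fetch tags and check out the new release (path is illustrative)
cd /path/to/deepops
git fetch --tags
git checkout 22.08

# Re-run the setup script to refresh Ansible and other dependencies
./scripts/setup.sh

# Compare example configs between releases and merge any new variables
# into your existing config directory
git diff 22.04 22.08 -- config.example/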

Notes

22.04.2

12 May 21:44
b903d78

DeepOps 22.04.2 Release Notes

Bugfix release - 22.04.2

SchedMD released Slurm 21.08.8 in order to address multiple CVEs. As part of this process, they un-published earlier versions of Slurm. This prevents earlier releases of DeepOps from installing with the default value of slurm_version.

Release 22.04.2 updates the default value of slurm_version to point to the latest available version, which should install successfully (#1171).

Bugfix release - 22.04.1

NVIDIA rotated the signing keys for the CUDA repositories on April 27, breaking installs from DeepOps 22.04 released a few days prior.

This release, 22.04.1, starts from 22.04 and adds PR #1167 to handle the updated key.

The previous release notes from 22.04 appear below.

Known Issues

  • Kubeflow deployment is currently broken due to incompatibility between current Kubeflow and Kubernetes 1.22. Kubeflow deployment will be updated to add support when Kubeflow releases 1.6. See #1147.

General

  • Extensive improvements to automated testing with Jenkins, Ansible Molecule, and ansible-lint
  • Update MIG playbook to use the new nvidia-mig-manager systemd service
  • Updates to roles for nvidia-docker and GPU driver
  • Various bug fixes

Slurm

  • Enhanced NCCL tests for Slurm cluster validation
  • Make use of pam_slurm_adopt optional
  • Break out multiple sections in Slurm inventory file

Kubernetes

  • Update to Kubernetes 1.22.6
  • Update default container runtime from dockershim to containerd
  • Add support for NVIDIA Network Operator
  • Add support to deploy NVIDIA Deep Learning Examples on Kubernetes clusters
  • Update to GPU Operator 1.9

Changes

Bugs/Enhancements

  • Fixes for rsyslog server role (#1096, #1098)
  • Update NetApp Trident default version number and branding (#1105)
  • Introduce a common script library (#953)
  • Update versions of monitoring stack components (#1107)
  • Updates to Jenkins testing (#1112, #1127, #1133, #1137, #1138, #1139, #1150, #1151)
  • Fixes for setup script (#1114)
  • Automated testing of DeepOps roles using Molecule (#1094, #1116, #1158)
  • Update nvidia.nvidia_docker role to v1.2.4 (#1121)
  • Automated deployment of Deep Learning Examples (#1083, #1145)
  • Make it optional to use pam_slurm_adopt (#1111)
  • Convert MIG playbook to use nvidia-mig-manager service (#1106)
  • Update to GPU Operator 1.9 (#1074)
  • Automatically run ansible-lint on each role (#1129)
  • Update Kubeflow deployment script to Kubeflow 1.4 (#1104)
  • Remove old build dirs during Slurm upgrade (#1101)
  • Fixes to ood-wrapper role (#1125)
  • Documentation of network ports (#1126)
  • Set missing defaults in playbooks (#1134)
  • Update to Kubespray v2.18.1 and containerd (#1043, #1141)
  • Fix GPU Operator config (#1136)
  • Break out functional host groups in Slurm inventory (#1087)
  • Fix ordering in k8s cluster deployment (#1128)
  • Update nvidia.nvidia_driver role to v2.2.0 (#1143, #1160)
  • Add support for NVIDIA Network Operator (#1113, #1156)
  • Enhanced NCCL tests for Slurm validation (#1042)
  • Fix git.io shortlinks (#1163)
  • Check for SELinux disabled in SELinux tasks (#1162)

Upgrade Steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the ./scripts/setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 22.01, run git diff 22.01 22.04 -- config.example/. If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.

Notes

22.04.1

02 May 17:09
1a7b13c

DeepOps 22.04.1 Release Notes

Bugfix release

NVIDIA rotated the signing keys for the CUDA repositories on April 27, breaking installs from DeepOps 22.04 released a few days prior.

This release, 22.04.1, starts from 22.04 and adds PR #1167 to handle the updated key.

The previous release notes from 22.04 appear below.

Known Issues

  • Kubeflow deployment is currently broken due to incompatibility between current Kubeflow and Kubernetes 1.22. Kubeflow deployment will be updated to add support when Kubeflow releases 1.6. See #1147.

General

  • Extensive improvements to automated testing with Jenkins, Ansible Molecule, and ansible-lint
  • Update MIG playbook to use the new nvidia-mig-manager systemd service
  • Updates to roles for nvidia-docker and GPU driver
  • Various bug fixes

Slurm

  • Enhanced NCCL tests for Slurm cluster validation
  • Make use of pam_slurm_adopt optional
  • Break out multiple sections in Slurm inventory file

Kubernetes

  • Update to Kubernetes 1.22.6
  • Update default container runtime from dockershim to containerd
  • Add support for NVIDIA Network Operator
  • Add support to deploy NVIDIA Deep Learning Examples on Kubernetes clusters
  • Update to GPU Operator 1.9

Changes

Bugs/Enhancements

  • Fixes for rsyslog server role (#1096, #1098)
  • Update NetApp Trident default version number and branding (#1105)
  • Introduce a common script library (#953)
  • Update versions of monitoring stack components (#1107)
  • Updates to Jenkins testing (#1112, #1127, #1133, #1137, #1138, #1139, #1150, #1151)
  • Fixes for setup script (#1114)
  • Automated testing of DeepOps roles using Molecule (#1094, #1116, #1158)
  • Update nvidia.nvidia_docker role to v1.2.4 (#1121)
  • Automated deployment of Deep Learning Examples (#1083, #1145)
  • Make it optional to use pam_slurm_adopt (#1111)
  • Convert MIG playbook to use nvidia-mig-manager service (#1106)
  • Update to GPU Operator 1.9 (#1074)
  • Automatically run ansible-lint on each role (#1129)
  • Update Kubeflow deployment script to Kubeflow 1.4 (#1104)
  • Remove old build dirs during Slurm upgrade (#1101)
  • Fixes to ood-wrapper role (#1125)
  • Documentation of network ports (#1126)
  • Set missing defaults in playbooks (#1134)
  • Update to Kubespray v2.18.1 and containerd (#1043, #1141)
  • Fix GPU Operator config (#1136)
  • Break out functional host groups in Slurm inventory (#1087)
  • Fix ordering in k8s cluster deployment (#1128)
  • Update nvidia.nvidia_driver role to v2.2.0 (#1143, #1160)
  • Add support for NVIDIA Network Operator (#1113, #1156)
  • Enhanced NCCL tests for Slurm validation (#1042)
  • Fix git.io shortlinks (#1163)
  • Check for SELinux disabled in SELinux tasks (#1162)

Upgrade Steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the ./scripts/setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 22.01, run git diff 22.01 22.04 -- config.example/. If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.

Notes

22.01

19 Jan 21:48
009bdeb

DeepOps 22.01 Release Notes

General

  • Updates for Slurm and Kubernetes
  • Bug fixes

Slurm

  • Slurm version 21.08.5
  • HPC SDK 22.1
  • Open OnDemand v2.0.9
  • CUDA toolkit 11.5
  • Slurm Pyxis plugin 0.11.1
  • Enroot container runtime v3.2.0
  • Hwloc 2.5.0, pmix 3.2.3
  • Spack v0.16.2

K8s

  • Kubernetes version v1.20.7 (kubespray v2.17.1)
  • Helm version v3.7.1
  • GPU Operator v1.8.2 (GPU driver 470.57.02)
  • GPU Device Plugin v0.9.0
  • GPU Feature Discovery v0.4.1
  • NFS Client Provisioner v4.0.13

Changes

Bugs/Enhancements

  • Add new HPL files for DGX A100 (#1047)
  • Fix vagrant_startup.sh on Ubuntu 20.04 (#1049)
  • Improve documentation and playbook for DGX firmware upgrade (#1058)
  • Update firmware docs (#1063)
  • Fix python interpreter (#1061)
  • GPU Operator automation with NVIDIA AI Enterprise (#1059)
  • [Open OnDemand] Remove task for ood_auth_map.regex permissions (#1068)
  • Change default interpreter in Ansible to the system default instead of Python3 (#1078)
  • Add Log4Shell mitigation to ES statefulset example (#1080)
  • Default to testing in Ubuntu 20.04 (#1051)
  • Update k8s logging doc to use Elastic stack (#1081)
  • Rewrite of DeepOps update documentation (#1050)
  • Update Slurm ElasticSearch logging playbook for log4shell (#1079)
  • Introduce a common script library, config for env vars, and inject these into all scripts (#953)
  • Add proxy config to standalone container registry (#1090)
  • Stop systemd-resolved on Ubuntu 20.04 (#1089)
  • Add Molecule testing for Singularity, plus infra for more roles (#1088)

Upgrade steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the ./scripts/setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 21.09, run git diff 21.09 22.01 -- config.example/. If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.

Notes

21.09

01 Oct 00:02

DeepOps 21.09 Release Notes

What's New

Release 21.09 is primarily a bug-fix release.

General

  • Support for DGX OS 5 in nvidia-dgx role

Slurm

  • Slurm version 21.08.1
  • HPC SDK 21.9
  • Open OnDemand v2.0.9
  • CUDA toolkit 11.4
  • Slurm Pyxis plugin 0.11.1
  • Enroot container runtime v3.2.0
  • Hwloc 2.5.0, pmix 3.2.3
  • Spack v0.16.2

K8s

  • Kubernetes version v1.20.7 (kubespray v2.16.0)
  • Helm version v3.5.4
  • GPU Operator v1.8.2 (GPU driver 470.57.02)
  • GPU Device Plugin v0.9.0
  • GPU Feature Discovery v0.4.1
  • NFS Client Provisioner v4.0.13

Changes

  • Docker version 20.10

Bugs/Enhancements

  • Improved cleanup in Slurm epilog (#965)
  • Fix disabling NVIDIA driver install on Slurm cluster install (#948)
  • Permit SFTP in default SSHD config (#980)
  • Address different possible DCGM service names depending on version (#983)
  • Fix PAM Slurm adopt/login (#989)
  • Enroot: adjust cache directory to be per-user (#997)
  • Add proxy support for downloading hwloc, pmix, nhc, and slurm (#1002)
  • Remove broken offline deployment support and clarify documentation (#1012)
  • Grafana: add var for custom config template (#994)
  • EasyBuild: Enable both shells on all distros (#993)
  • Default to building Slurm with dynamic libs (#1021)
  • ood-wrapper: Don't install python3-passlib on CentOS 7 (#995)
  • Update ansible-role-enroot to 0.5.0 (#1030)

Upgrade steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the ./scripts/setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 21.06, run git diff 21.06 21.09 -- config.example/. If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.

Notes

21.06

14 Jun 23:07

DeepOps 21.06 Release Notes

What's New

General

  • Documentation-based support for NGC Ready configuration in offline environment
  • New setup script (see notes below [1])
  • Rootless Docker support
  • Update burn-in test to use NGC container (v21.4)
  • UFM OS added to packer-maas repo

Slurm

  • Slurm version 20.11.7 (previously 20.11.3) (NOTE: Slurm versions prior to 20.11.7 are affected by CVE-2021-31215 [2])
  • HPC SDK 21.3 (previously 21.2)
  • Open OnDemand support for Ubuntu 20.04
  • Open OnDemand v1.8.20
  • Playbook for single-node Slurm cluster
  • CUDA toolkit 11.3
  • Singularity 3.7.3
  • Slurm Pyxis plugin 0.9.1
  • Hwloc 2.4.1, pmix 3.2.3
  • Spack v0.16.1

K8s

  • Kubernetes version v1.19.9 (kubespray v2.15.1)
  • Helm version v3.5.3 (previously v3.4.1)
  • GPU Operator v1.6.0 (previously v1.5.2)
  • GPU Device Plugin v0.9.0 (previously v0.8.2)
  • GPU Feature Discovery v0.4.1
  • Update Trident deployment role to use Helm chart
  • Update k8s examples to run on A100

Changes

  • Move nvidia-peer-memory logic into a role
  • Option to allow force install of GPU driver
  • Install DCGM via CUDA repos
  • Change namespace of k8s ingress controller
  • Simplify GPU Operator support and change vGPU deployment method
  • Update Triton Kubeflow pipeline to leverage nfs-client and download examples
  • Ansible version 2.9.21

Bugs/Enhancements

  • Update MIG playbook to enable MIG per device rather than all
  • Fixes for DGX firmware update role
  • Reorganize Slurm config file
  • Various fixes to QA tests
  • Documentation updates
  • Improve NHC checks for DGX A100
  • Skip automatic re-installation of NV HPC SDK
  • Use correct ns when checking helm status of metallb
  • Install python-docker/docker-py via yum vs. pip
  • Updates/fixes for DCGM exporter
  • Correctly install docker python SDK

Upgrade steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the ./scripts/setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 21.03, run git diff 21.03 21.06 -- config.example/. If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.

Notes

[1] This release makes significant modifications to the control machine setup script (scripts/setup.sh) with the goal of making fewer modifications to the system and conflicting less with existing software. Re-running the setup script will create a Python virtual environment where Ansible and other dependencies are installed (except for any required system packages). The script attempts to make this new virtual environment part of the user's path, but some manual intervention may be required to use the newly installed Ansible version.
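
As a rough sketch of what that intervention might look like (the virtual environment path below is an assumption; check the output of setup.sh for the location it actually reports):

# Re-run the control machine setup script
./scripts/setup.sh

# If the new Ansible is not picked up automatically, activate the
# virtual environment created by the script (path is an assumption)
source /opt/deepops/env/bin/activate

# Confirm that the expected Ansible version is now on the PATH
which ansible
ansible --version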

[2] https://www.schedmd.com/archives.php

21.03

11 Mar 22:33

DeepOps 21.03 Release Notes

What's New

General

  • Rsyslog client/server for K8s & Slurm deployments
  • Examples for running Ansible and configuring Inventory file
  • Improved support for Ubuntu 20.04 and CentOS 8
  • Docker login convenience playbook
  • Marked air-gap as "experimental"
  • Vagrant/virtual 2.2.14 (previously 2.2.3)

Slurm

  • Slurm version 20.11.3 (previously 20.02.4)
  • HPC SDK 21.2 (previously 2020_207)

K8s

  • Helm version v3.4.1 (previously v3.1.2)
  • NFS Client Provisioner as K8s Default StorageClass
  • GPU Operator v1.5.2 (previously v1.1.7)
  • GPU Device Plugin v0.8.2 (previously v0.7.0)
  • GPU Feature Discovery v0.4.1 (previously v0.2.0)
  • Example NGC Dockerfiles bumped to 20.12 with improved documentation
  • New example yaml files for launching single node/multi node training and jupyter notebooks
  • RoCE performance playbook

Changes

  • Deprecation of Rook-Ceph deployment script
  • Removed default MPI Operator install for K8s
  • NFS server is now deployed on kube-master[0] by default with path /export/deepops_nfs
  • New log bundling tool (debug.sh) for K8s
  • Enroot marked as "not fully automated" for CentOS (simple workaround is to bump enroot Ansible Galaxy role from v0.3.2 to v0.4.0 and re-run setup.sh)
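
A minimal sketch of that workaround, assuming the role is pinned in roles/requirements.yml (the file path is an assumption; adjust it to match your checkout):

# Update the pinned enroot role version from v0.3.2 to v0.4.0
# in the Ansible Galaxy requirements file (path is an assumption)
${EDITOR:-vi} roles/requirements.yml

# Re-run setup so the updated Galaxy role is installed
./scripts/setup.sh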

Bugs/Enhancements

  • K8s monitoring metrics now persist by default using NFS-backed PVs.
  • Additional testing for Ubuntu 20.04, CentOS 8, GPU Operator, enroot, mpi, and testing.md
  • Addressed firewall issues in CentOS
  • Add vGPU support for GPU Operator installs
  • Address intermittent download failures in Slurm install

Upgrade steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the ./scripts/setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 20.12, run git diff 20.12 21.03 -- config.example/. Note that the majority of the config changes are around new functionality such as nfs-client-provisioner, rsyslog, and persistent monitoring metrics in K8s. If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.

Notes

20.12

15 Dec 21:44

NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.

If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.
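
As an illustrative sketch, one way to pin a supported version (the group_vars file name below is an assumption; use whichever file in your config directory holds Slurm settings):

# Pin a supported Slurm version in the DeepOps configuration
# (the group_vars file name is an assumption)
echo "slurm_version: 20.11.7" >> config/group_vars/slurm-cluster.yml

# Then re-run the Slurm deployment playbook as described in the
# Slurm Deployment Guide so the pinned version is used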

DeepOps 20.12 Release Notes

What's New

  • Support for DGX OS 5.0
  • Support for Ubuntu 20.04
  • Support for CentOS 8
  • MAAS bare-metal provisioning documentation
  • Initial support for Slurm high-availability
  • Caching container registry for Slurm and k8s
  • Slurm and Open OnDemand usage guide
  • MIG support in K8s and documentation

Changes

  • HPC SDK 20.9
  • Slurm 20.02.4, Pyxis v0.8.1, Enroot v3.1.1
  • Kubernetes v1.18.10 (Kubespray v2.14.2), Helm 3, GPU Operator v0.6.0
  • Kubeflow v1.2 w/ MPI Operator (kfctl -> v1.1.0, istio_dex -> v1.0.2, istio -> v1.1.0)
  • K8s GPU device plugin v0.7.0 & GPU Feature Discovery v0.2.0 with support for NVIDIA A100 and MIG
  • Docker 19.03
  • NVIDIA driver role v2.0 (see notes: [1])

Bugs/Enhancements

  • Fix paths from repo re-org
  • Update MIG playbook to enable MIG per device rather than all
  • Rook/Ceph install script improvements
  • Added Slurm tests to QA
  • Move Helm charts off deprecated repo
  • Fix OpenMPI build on CentOS
  • Fix MAAS repo location
  • Fix Enroot removing cache during existing jobs
  • Updates to use Helm 3
  • Fix all GPUs being visible when SSHing into a Slurm compute node
  • Fix python bootstrap script to support python3 on CentOS
  • Allow disabling docker/nvidia-docker install
  • Update Kubeflow deployment to allow custom configurations/kustomizations in the workloads directory, with an example culling configuration
  • Update Kubeflow default containers to example NGC containers
  • Update nvidia-dgx-firmware role to work with the new update container and add more verifications
  • Use a persistent volume for Prometheus metrics
  • Limit CPU usage for Prom node exporters
  • Many more bug fixes

Upgrade steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the ./scripts/setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 20.10, run git diff 20.10 20.12 -- config.example/. Note that there are many changes in this release; if you encounter problems, please open a GitHub issue. See the update guide for additional guidance.

Notes

[1] On Ubuntu, this update changes the default behavior to use nvidia-headless-450-server package by default, instead of the cuda-drivers package. See release notes for the driver role for more information.

20.10

05 Oct 17:42

NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.

If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.

DeepOps 20.10 Release Notes

What's New

  • Repo reorganization
  • Slurm cluster node health check
  • HPL burn-in test version 1.0 (adds multi-node test)
  • Playbook to disable cloud-init on Ubuntu
  • Playbook to install NVIDIA DCGM on non-DGX servers
  • GPU feature discovery plugin with MIG support for K8S

Changes

  • Slurm 20.02.4, Pyxis v0.8.1, Enroot v3.1.1
  • Kubernetes v1.18.9 (Kubespray v2.14.1), Helm 3, GPU Operator v0.6.0
  • Kubeflow v1.1.0 w/ MPI Operator (kfctl -> v1.1.0, istio_dex -> v1.0.2, istio -> v1.1.0)
  • K8s GPU device plugin v0.7.0 & GPU Feature Discovery v0.2.0 with support for NVIDIA A100 and MIG
  • Docker 19.03
  • NVIDIA driver role v1.2.2

Bugs/Enhancements

  • Add additional Slurm cluster deployment validation tests
  • Fix Rook script to properly delete cluster with Helm 3
  • Fix default OpenMPI build compatibility with Slurm
  • Fix unnecessary rebuild in PMIx
  • Update Slurm install to contain SSH sessions by default
  • Fix bug with NVIDIA GPU driver role on RHEL/CentOS
  • Clean up and consolidate Rook scripts (poll_ceph.sh and rmrook.sh rolled into deploy_rook.sh -d and -w)
  • Additional testing (MPI, Rook, ...)

New Directory Structure

├── config.example
│   ├── airgap
│   ├── helm
│   └── pxe
│       └── machines
├── docs
│   ├── airgap
│   ├── deepops
│   ├── img
│   ├── k8s-cluster
│   ├── ngc-ready
│   ├── pxe
│   └── slurm-cluster
├── playbooks
│   ├── airgap
│   ├── bootstrap
│   ├── container
│   ├── generic
│   ├── k8s-cluster
│   ├── nvidia-dgx
│   ├── nvidia-egx
│   ├── nvidia-software
│   ├── provisioning
│   ├── slurm-cluster
│   └── utilities
├── roles
│   ├── autofs
│   ├── container-registry
│   ├── dns-config
│   ├── easy-build
│   ├── easy-build-packages
│   ├── facts
│   ├── grafana
│   ├── kerberos-client
│   ├── lmod
│   ├── move-home-dirs
│   ├── netapp-trident
│   ├── nfs
│   ├── nhc
│   ├── nis-client
│   ├── nvidia-cuda
│   ├── nvidia-dcgm
│   ├── nvidia-dcgm-exporter
│   ├── nvidia-dgx
│   ├── nvidia-dgx-firmware
│   ├── nvidia-gpu-operator
│   ├── nvidia-gpu-operator-node-prep
│   ├── nvidia-gpu-tests
│   ├── nvidia-hpc-sdk
│   ├── nvidia-k8s-gpu-device-plugin
│   ├── nvidia-k8s-gpu-feature-discovery
│   ├── nvidia-ml
│   ├── offline-repo-mirrors
│   ├── ood-wrapper
│   ├── openmpi
│   ├── openshift
│   ├── prometheus
│   ├── prometheus-node-exporter
│   ├── prometheus-slurm-exporter
│   ├── pyxis
│   ├── roce_backend
│   ├── slurm
│   └── spack
├── scripts
│   ├── airgap
│   ├── deepops
│   ├── generic
│   ├── k8s
│   └── pxe
├── src
│   ├── containers
│   │   ├── ansible
│   │   ├── dgx-firmware
│   │   ├── dgxie
│   │   ├── kubeflow-jupyter-web-app
│   │   ├── nccl-tests
│   │   ├── ngc
│   │   │   ├── pytorch
│   │   │   ├── rapids
│   │   │   └── tensorflow
│   │   ├── pixiecore
│   │   └── pxe
│   │       └── dhcp
│   ├── dashboards
│   └── repo
├── submodules
│   └── kubespray
├── virtual
│   └── scripts
└── workloads
    ├── burn-in
    ├── examples
    │   ├── k8s
    │   │   ├── dask-rapids
    │   │   ├── kubeflow-pipeline-deploy
    │   │   ├── services
    │   │   │   └── logging
    │   │   └── users
    │   └── slurm
    │       ├── dask-rapids
    │       └── mpi-hello
    ├── jenkins
    │   └── scripts
    └── services
        └── k8s
            └── dgxie

Upgrade Steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 20.08.1, run git diff 20.08.1 20.10 -- config.example/.

It is also necessary to upgrade Helm on your provisioner node. This can be done manually, using ./scripts/install_helm.sh as a reference.
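
As a hedged sketch of that step (running the script directly is one option; following its steps by hand is another):

# Review the Helm install script to see which version it installs
less ./scripts/install_helm.sh

# Run it (or follow the equivalent steps manually) on the provisioner
# node, then confirm the new Helm version
./scripts/install_helm.sh
helm version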