
Releases: NVIDIA/deepops

23.08

28 Aug 18:27
d248b65
Merge pull request #1296 from dholt/release-23.08

Release 23.08

22.08

24 Aug 16:26
5fdde40

DeepOps 22.08 Release Notes

Known Issues

  • Kubeflow deployment is currently broken due to incompatibility between current Kubeflow and Kubernetes 1.22. Kubeflow deployment will be updated to add support when Kubeflow releases 1.6.

General

  • Re-work of a large portion of the documentation
  • Updates to NCCL tests
  • Various bug fixes

Slurm

  • Update to Slurm 22.05.2
  • Add Alertmanager integration
  • Option to share Slurm configuration among nodes via NFS
  • Enhancements to Slurm re-install/re-build tasks

Kubernetes

  • Update to Kubernetes 1.24.4
  • Update to GPU Operator 1.11.1 (GPU driver branch 515)

Changes

Bugs/Enhancements

  • Update NVIDIA driver role (#1216)
  • Update Kubespray submodule URL (#1200)
  • Add Alertmanager to Slurm cluster deployment (#1198)
  • Fix Slurm configuration GRES syntax (#1196)
  • Update Pyxis image cache size (#1191)
  • Updates to documentation (#1188)
  • Fix Slurm reinstall/rebuild tasks (#1187)
  • Update MetalLB helm repo (#1185)
  • Update EPEL GPG key (#1184)
  • Add option to share Slurm configuration among nodes (#1182)
  • Update NCCL tests (#1180, #1209)
  • NetApp Trident: fix PATH (#1176)
  • Update default Slurm version to 21.08.8 (#1169, #1171)
  • Update NVIDIA signing key (#1166, #1167)
  • Update Ansible (#1165)

Upgrade Steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the ./scripts/setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 22.04, run git diff 22.04 22.08 -- config.example/. If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.
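
For reference, a minimal upgrade sequence might look like the following sketch; the checkout path is illustrative and the tag names should match the releases you are moving between:

# Fetch tags and check out the new release (path is illustrative)
cd /path/to/deepops
git fetch --tags
git checkout 22.08

# Re-run the setup script to refresh Ansible and other dependencies
./scripts/setup.sh

# Compare example configs between releases and merge any new variables
# into your existing config directory
git diff 22.04 22.08 -- config.example/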

Notes

22.04.2

12 May 21:44
b903d78

DeepOps 22.04.2 Release Notes

Bugfix release - 22.04.2

SchedMD released Slurm 21.08.8 in order to address multiple CVEs. As part of this process, they un-published earlier versions of Slurm. This prevents earlier releases of DeepOps from installing with the default value of slurm_version.

Release 22.04.2 updates the default value of slurm_version to point to the latest available version, which should install successfully (#1171).

Bugfix release - 22.04.1

NVIDIA rotated the signing keys for the CUDA repositories on April 27, breaking installs from DeepOps 22.04 released a few days prior.

This release, 22.04.1, starts from 22.04 and adds PR #1167 to handle the updated key.

The previous release notes from 22.04 appear below.

Known Issues

  • Kubeflow deployment is currently broken due to incompatibility between current Kubeflow and Kubernetes 1.22. Kubeflow deployment will be updated to add support when Kubeflow releases 1.6. See #1147.

General

  • Extensive improvements to automated testing with Jenkins, Ansible Molecule, and ansible-lint
  • Update MIG playbook to use the new nvidia-mig-manager systemd service
  • Updates to roles for nvidia-docker and GPU driver
  • Various bug fixes

Slurm

  • Enhanced NCCL tests for Slurm cluster validation
  • Make use of pam_slurm_adopt optional
  • Break out multiple sections in Slurm inventory file

Kubernetes

  • Update to Kubernetes 1.22.6
  • Update default container runtime from dockershim to containerd
  • Add support for NVIDIA Network Operator
  • Add support to deploy NVIDIA Deep Learning Examples on Kubernetes clusters
  • Update to GPU Operator 1.9

Changes

Bugs/Enhancements

  • Fixes for rsyslog server role (#1096, #1098)
  • Update NetApp Trident default version number and branding (#1105)
  • Introduce a common script library (#953)
  • Update versions of monitoring stack components (#1107)
  • Updates to Jenkins testing (#1112, #1127, #1133, #1137, #1138, #1139, #1150, #1151)
  • Fixes for setup script (#1114)
  • Automated testing of DeepOps roles using Molecule (#1094, #1116, #1158)
  • Update nvidia.nvidia_docker role to v1.2.4 (#1121)
  • Automated deployment of Deep Learning Examples (#1083, #1145)
  • Make it optional to use pam_slurm_adopt (#1111)
  • Convert MIG playbook to use nvidia-mig-manager service (#1106)
  • Update to GPU Operator 1.9 (#1074)
  • Automatically run ansible-lint on each role (#1129)
  • Update Kubeflow deployment script to Kubeflow 1.4 (#1104)
  • Remove old build dirs during Slurm upgrade (#1101)
  • Fixes to ood-wrapper role (#1125)
  • Documentation of network ports (#1126)
  • Set missing defaults in playbooks (#1134)
  • Update to Kubespray v2.18.1 and containerd (#1043, #1141)
  • Fix GPU Operator config (#1136)
  • Break out functional host groups in Slurm inventory (#1087)
  • Fix ordering in k8s cluster deployment (#1128)
  • Update nvidia.nvidia_driver role to v2.2.0 (#1143, #1160)
  • Add support for NVIDIA Network Operator (#1113, #1156)
  • Enhanced NCCL tests for Slurm validation (#1042)
  • Fix git.io shortlinks (#1163)
  • Check for SELinux disabled in SELinux tasks (#1162)

Upgrade Steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the ./scripts/setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 22.01, run git diff 22.01 22.04 -- config.example/. If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.

Notes

22.04.1

02 May 17:09
1a7b13c

DeepOps 22.04.1 Release Notes

Bugfix release

NVIDIA rotated the signing keys for the CUDA repositories on April 27, breaking installs from DeepOps 22.04 released a few days prior.

This release, 22.04.1, starts from 22.04 and adds PR #1167 to handle the updated key.

The previous release notes from 22.04 appear below.

Known Issues

  • Kubeflow deployment is currently broken due to incompatibility between current Kubeflow and Kubernetes 1.22. Kubeflow deployment will be updated to add support when Kubeflow releases 1.6. See #1147.

General

  • Extensive improvements to automated testing with Jenkins, Ansible Molecule, and ansible-lint
  • Update MIG playbook to use the new nvidia-mig-manager systemd service
  • Updates to roles for nvidia-docker and GPU driver
  • Various bug fixes

Slurm

  • Enhanced NCCL tests for Slurm cluster validation
  • Make use of pam_slurm_adopt optional
  • Break out multiple sections in Slurm inventory file

Kubernetes

  • Update to Kubernetes 1.22.6
  • Update default container runtime from dockershim to containerd
  • Add support for NVIDIA Network Operator
  • Add support to deploy NVIDIA Deep Learning Examples on Kubernetes clusters
  • Update to GPU Operator 1.9

Changes

Bugs/Enhancements

  • Fixes for rsyslog server role (#1096, #1098)
  • Update NetApp Trident default version number and branding (#1105)
  • Introduce a common script library (#953)
  • Update versions of monitoring stack components (#1107)
  • Updates to Jenkins testing (#1112, #1127, #1133, #1137, #1138, #1139, #1150, #1151)
  • Fixes for setup script (#1114)
  • Automated testing of DeepOps roles using Molecule (#1094, #1116, #1158)
  • Update nvidia.nvidia_docker role to v1.2.4 (#1121)
  • Automated deployment of Deep Learning Examples (#1083, #1145)
  • Make it optional to use pam_slurm_adopt (#1111)
  • Convert MIG playbook to use nvidia-mig-manager service (#1106)
  • Update to GPU Operator 1.9 (#1074)
  • Automatically run ansible-lint on each role (#1129)
  • Update Kubeflow deployment script to Kubeflow 1.4 (#1104)
  • Remove old build dirs during Slurm upgrade (#1101)
  • Fixes to ood-wrapper role (#1125)
  • Documentation of network ports (#1126)
  • Set missing defaults in playbooks (#1134)
  • Update to Kubespray v2.18.1 and containerd (#1043, #1141)
  • Fix GPU Operator config (#1136)
  • Break out functional host groups in Slurm inventory (#1087)
  • Fix ordering in k8s cluster deployment (#1128)
  • Update nvidia.nvidia_driver role to v2.2.0 (#1143, #1160)
  • Add support for NVIDIA Network Operator (#1113, #1156)
  • Enhanced NCCL tests for Slurm validation (#1042)
  • Fix git.io shortlinks (#1163)
  • Check for SELinux disabled in SELinux tasks (#1162)

Upgrade Steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the ./scripts/setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 22.01, run git diff 22.01 22.04 -- config.example/. If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.

Notes

22.01

19 Jan 21:48
009bdeb

DeepOps 22.01 Release Notes

General

  • Updates for Slurm and Kubernetes
  • Bug fixes

Slurm

  • Slurm version 21.08.5
  • HPC SDK 22.1
  • Open OnDemand v2.0.9
  • CUDA toolkit 11.5
  • Slurm Pyxis plugin 0.11.1
  • Enroot container runtime v3.2.0
  • Hwloc 2.5.0, pmix 3.2.3
  • Spack v0.16.2

K8s

  • Kubernetes version v1.20.7 (kubespray v2.17.1)
  • Helm version v3.7.1
  • GPU Operator v1.8.2 (GPU driver 470.57.02)
  • GPU Device Plugin v0.9.0
  • GPU Feature Discovery v0.4.1
  • NFS Client Provisioner v4.0.13

Changes

Bugs/Enhancements

  • Add new HPL files for DGX A100 (#1047)
  • Fix vagrant_startup.sh on Ubuntu 20.04 (#1049)
  • Improve documentation and playbook for DGX firmware upgrade (#1058)
  • Update firmware docs (#1063)
  • Fix python interpreter (#1061)
  • GPU Operator automation with NVIDIA AI Enterprise (#1059)
  • [Open OnDemand] Remove task for ood_auth_map.regex permissions (#1068)
  • Change default interpreter in Ansible to the system default instead of Python3 (#1078)
  • Add Log4Shell mitigation to ES statefulset example (#1080)
  • Default to testing in Ubuntu 20.04 (#1051)
  • Update k8s logging doc to use Elastic stack (#1081)
  • Rewrite of DeepOps update documentation (#1050)
  • Update Slurm ElasticSearch logging playbook for log4shell (#1079)
  • Introduce a common script library, config for env vars, and inject these into all scripts (#953)
  • Add proxy config to standalone container registry (#1090)
  • Stop systemd-resolved on Ubuntu 20.04 (#1089)
  • Add Molecule testing for Singularity, plus infra for more roles (#1088)

Upgrade steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the ./scripts/setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 21.09, run git diff 21.09 22.01 -- config.example/. If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.

Notes

21.09

01 Oct 00:02

DeepOps 21.09 Release Notes

What's New

Release 21.09 is primarily a bug-fix release.

General

  • Support for DGX OS 5 in nvidia-dgx role

Slurm

  • Slurm version 21.08.1
  • HPC SDK 21.9
  • Open OnDemand v2.0.9
  • CUDA toolkit 11.4
  • Slurm Pyxis plugin 0.11.1
  • Enroot container runtime v3.2.0
  • Hwloc 2.5.0, pmix 3.2.3
  • Spack v0.16.2

K8s

  • Kubernetes version v1.20.7 (kubespray v2.16.0)
  • Helm version v3.5.4
  • GPU Operator v1.8.2 (GPU driver 470.57.02)
  • GPU Device Plugin v0.9.0
  • GPU Feature Discovery v0.4.1
  • NFS Client Provisioner v4.0.13

Changes

  • Docker version 20.10

Bugs/Enhancements

  • Improved cleanup in Slurm epilog (#965)
  • Fix disabling NVIDIA driver install on Slurm cluster install (#948)
  • Permit SFTP in default SSHD config (#980)
  • Address different possible DCGM service names depending on version (#983)
  • Fix PAM Slurm adopt/login (#989)
  • Enroot: adjust cache directory to be per-user (#997)
  • Add proxy support for downloading hwloc, pmix, nhc, and slurm (#1002)
  • Remove broken offline deployment support and clarify documentation (#1012)
  • Grafana: add var for custom config template (#994)
  • EasyBuild: Enable both shells on all distros (#993)
  • Default to building Slurm with dynamic libs (#1021)
  • ood-wrapper: Don't install python3-passlib on CentOS 7 (#995)
  • Update ansible-role-enroot to 0.5.0 (#1030)

Upgrade steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the ./scripts/setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 21.06, run git diff 21.06 21.09 -- config.example/. If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.

Notes

21.06

14 Jun 23:07

DeepOps 21.06 Release Notes

What's New

General

  • Documentation-based support for NGC Ready configuration in offline environment
  • New setup script (see notes below [1])
  • Rootless Docker support
  • Update burn-in test to use NGC container (v21.4)
  • UFM OS added to packer-maas repo

Slurm

  • Slurm version 20.11.7 (previously 20.11.3) (NOTE: Slurm versions prior to 20.11.7 are affected by CVE-2021-31215 [2])
  • HPC SDK 21.3 (previously 21.2)
  • Open OnDemand support for Ubuntu 20.04
  • Open OnDemand v1.8.20
  • Playbook for single-node Slurm cluster
  • CUDA toolkit 11.3
  • Singularity 3.7.3
  • Slurm Pyxis plugin 0.9.1
  • Hwloc 2.4.1, pmix 3.2.3
  • Spack v0.16.1

K8s

  • Kubernetes version v1.19.9 (kubespray v2.15.1)
  • Helm version v3.5.3 (previously v3.4.1)
  • GPU Operator v1.6.0 (previously v1.5.2)
  • GPU Device Plugin v0.9.0 (previously v0.8.2)
  • GPU Feature Discovery v0.4.1
  • Update Trident deployment role to use Helm chart
  • Update k8s examples to run on A100

Changes

  • Move nvidia-peer-memory logic into a role
  • Option to allow force install of GPU driver
  • Install DCGM via CUDA repos
  • Change namespace of k8s ingress controller
  • Simplify GPU Operator support and change vGPU deployment method
  • Update Triton Kubeflow pipeline to leverage nfs-client and download examples
  • Ansible version 2.9.21

Bugs/Enhancements

  • Update MIG playbook to enable MIG per device rather than all
  • Fixes for DGX firmware update role
  • Reorganize Slurm config file
  • Various fixes to QA tests
  • Documentation updates
  • Improve NHC checks for DGX A100
  • Skip automatic re-installation of NV HPC SDK
  • Use correct ns when checking helm status of metallb
  • Install python-docker/docker-py via yum vs. pip
  • Updates/fixes for DCGM exporter
  • Correctly install docker python SDK

Upgrade steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the ./scripts/setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 21.03, run git diff 21.03 21.06 -- config.example/. If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.

Notes

[1] This release makes significant modifications to the control machine setup script (scripts/setup.sh) with the goal of making fewer modifications to the system and conflicting less with existing software. Re-running the setup script will create a Python virtual environment where Ansible and other dependencies are installed (except for any required system packages). The script attempts to make this new virtual environment part of the user's path, but some manual intervention may be required to use the newly installed Ansible version.
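
As a rough sketch of what that intervention might look like (the virtual environment path below is an assumption; check the output of setup.sh for the location it actually reports):

# Re-run the control machine setup script
./scripts/setup.sh

# If the new Ansible is not picked up automatically, activate the
# virtual environment created by the script (path is an assumption)
source /opt/deepops/env/bin/activate

# Confirm that the expected Ansible version is now on the PATH
which ansible
ansible --version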

[2] https://www.schedmd.com/archives.php

21.03

11 Mar 22:33

DeepOps 21.03 Release Notes

What's New

General

  • Rsyslog client/server for K8s & Slurm deployments
  • Examples for running Ansible and configuring Inventory file
  • Improved support for Ubuntu 20.04 and CentOS 8
  • Docker login convenience playbook
  • Marked air-gap as "experimental"
  • Vagrant/virtual 2.2.14 (previously 2.2.3)

Slurm

  • Slurm version 20.11.3 (previously 20.02.4)
  • HPC SDK 21.2 (previously 2020_207)

K8s

  • Helm version v3.4.1 (previously v3.1.2)
  • NFS Client Provisioner as K8s Default StorageClass
  • GPU Operator v1.5.2 (previously v1.1.7)
  • GPU Device Plugin v0.8.2 (previously v0.7.0)
  • GPU Feature Discovery v0.4.1 (previously v0.2.0)
  • Example NGC Dockerfiles bumped to 20.12 with improved documentation
  • New example yaml files for launching single node/multi node training and jupyter notebooks
  • RoCE performance playbook

Changes

  • Deprecation of Rook-Ceph deployment script
  • Removed default MPI Operator install for K8s
  • NFS server is now deployed on kube-master[0] by default with path /export/deepops_nfs
  • New log bundling tool (debug.sh) for K8s
  • Enroot marked as "not fully automated" for CentOS (simple workaround is to bump enroot Ansible Galaxy role from v0.3.2 to v0.4.0 and re-run setup.sh)
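
A minimal sketch of that workaround, assuming the role is pinned in roles/requirements.yml (the file path is an assumption; adjust it to match your checkout):

# Update the pinned enroot role version from v0.3.2 to v0.4.0
# in the Ansible Galaxy requirements file (path is an assumption)
${EDITOR:-vi} roles/requirements.yml

# Re-run setup so the updated Galaxy role is installed
./scripts/setup.sh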

Bugs/Enhancements

  • K8s monitoring metrics now persist by default using NFS-backed PVs.
  • Additional testing for Ubuntu 20.04, CentOS 8, GPU Operator, enroot, mpi, and testing.md
  • Addressed firewall issues in CentOS
  • Add vGPU support for GPU Operator installs
  • Address intermittent download failures in Slurm install

Upgrade steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the ./scripts/setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 20.12, run git diff 20.12 21.03 -- config.example/. Note that the majority of the config changes are around new functionality such as nfs-client-provisioner, rsyslog, and persistent monitoring metrics in K8s. If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.

Notes

20.12

15 Dec 21:44

NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.

If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.
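
As an illustrative sketch, one way to pin a supported version (the group_vars file name below is an assumption; use whichever file in your config directory holds Slurm settings):

# Pin a supported Slurm version in the DeepOps configuration
# (the group_vars file name is an assumption)
echo "slurm_version: 20.11.7" >> config/group_vars/slurm-cluster.yml

# Then re-run the Slurm deployment playbook as described in the
# Slurm Deployment Guide so the pinned version is used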

DeepOps 20.12 Release Notes

What's New

  • Support for DGX OS 5.0
  • Support for Ubuntu 20.04
  • Support for CentOS 8
  • MAAS bare-metal provisioning documentation
  • Initial support for Slurm high-availability
  • Caching container registry for Slurm and k8s
  • Slurm and Open OnDemand usage guide
  • MIG support in K8s and documentation

Changes

  • HPC SDK 20.9
  • Slurm 20.02.4, Pyxis v0.8.1, Enroot v3.1.1
  • Kubernetes v1.18.10 (Kubespray v2.14.2), Helm 3, GPU Operator v0.6.0
  • Kubeflow v1.2 w/ MPI Operator (kfctl -> v1.1.0, istio_dex -> v1.0.2, istio -> v1.1.0)
  • K8s GPU device plugin v0.7.0 & GPU Feature Discovery v0.2.0 with support for NVIDIA A100 and MIG
  • Docker 19.03
  • NVIDIA driver role v2.0 (see notes: [1])

Bugs/Enhancements

  • Fix paths from repo re-org
  • Update MIG playbook to enable MIG per device rather than all
  • Rook/Ceph install script improvements
  • Added Slurm tests to QA
  • Move Helm charts off deprecated repo
  • Fix OpenMPI build on CentOS
  • Fix MAAS repo location
  • Fix Enroot removing cache during existing jobs
  • Updates to use Helm 3
  • Fix all GPUs being visible when SSHing into a Slurm compute node
  • Fix python bootstrap script to support python3 on CentOS
  • Allow disabling docker/nvidia-docker install
  • Update Kubeflow deployment to allow custom configurations/kustomizations in the workloads directory, with an example culling configuration
  • Update Kubeflow default containers to example NGC containers
  • Update nvidia-dgx-firmware role to work with the new update container and add more verifications
  • Use a persistent volume for Prometheus metrics
  • Limit CPU usage for Prom node exporters
  • Many more bug fixes

Upgrade steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the ./scripts/setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 20.10, run git diff 20.10 20.12 -- config.example/. Note that there are many changes in this release; if you encounter problems, please open a GitHub issue. See the update guide for additional guidance.

Notes

[1] On Ubuntu, this update changes the default behavior to use nvidia-headless-450-server package by default, instead of the cuda-drivers package. See release notes for the driver role for more information.

20.10

05 Oct 17:42

NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.

If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.

DeepOps 20.10 Release Notes

What's New

  • Repo reorganization
  • Slurm cluster node health check
  • HPL burn-in test version 1.0 (adds multi-node test)
  • Playbook to disable cloud-init on Ubuntu
  • Playbook to install NVIDIA DCGM on non-DGX servers
  • GPU feature discovery plugin with MIG support for K8S

Changes

  • Slurm 20.02.4, Pyxis v0.8.1, Enroot v3.1.1
  • Kubernetes v1.18.9 (Kubespray v2.14.1), Helm 3, GPU Operator v0.6.0
  • Kubeflow v1.1.0 w/ MPI Operator (kfctl -> v1.1.0, istio_dex -> v1.0.2, istio -> v1.1.0)
  • K8s GPU device plugin v0.7.0 & GPU Feature Discovery v0.2.0 with support for NVIDIA A100 and MIG
  • Docker 19.03
  • NVIDIA driver role v1.2.2

Bugs/Enhancements

  • Add additional Slurm cluster deployment validation tests
  • Fix Rook script to properly delete cluster with Helm 3
  • Fix default OpenMPI build compatibility with Slurm
  • Fix unnecessary rebuild in PMIx
  • Update Slurm install to contain SSH sessions by default
  • Fix bug with NVIDIA GPU driver role on RHEL/CentOS
  • Clean up and consolidate Rook scripts (poll_ceph.sh and rmrook.sh rolled into deploy_rook.sh -d and -w)
  • Additional testing (MPI, Rook, ...)

New Directory Structure

├── config.example
│   ├── airgap
│   ├── helm
│   └── pxe
│       └── machines
├── docs
│   ├── airgap
│   ├── deepops
│   ├── img
│   ├── k8s-cluster
│   ├── ngc-ready
│   ├── pxe
│   └── slurm-cluster
├── playbooks
│   ├── airgap
│   ├── bootstrap
│   ├── container
│   ├── generic
│   ├── k8s-cluster
│   ├── nvidia-dgx
│   ├── nvidia-egx
│   ├── nvidia-software
│   ├── provisioning
│   ├── slurm-cluster
│   └── utilities
├── roles
│   ├── autofs
│   ├── container-registry
│   ├── dns-config
│   ├── easy-build
│   ├── easy-build-packages
│   ├── facts
│   ├── grafana
│   ├── kerberos-client
│   ├── lmod
│   ├── move-home-dirs
│   ├── netapp-trident
│   ├── nfs
│   ├── nhc
│   ├── nis-client
│   ├── nvidia-cuda
│   ├── nvidia-dcgm
│   ├── nvidia-dcgm-exporter
│   ├── nvidia-dgx
│   ├── nvidia-dgx-firmware
│   ├── nvidia-gpu-operator
│   ├── nvidia-gpu-operator-node-prep
│   ├── nvidia-gpu-tests
│   ├── nvidia-hpc-sdk
│   ├── nvidia-k8s-gpu-device-plugin
│   ├── nvidia-k8s-gpu-feature-discovery
│   ├── nvidia-ml
│   ├── offline-repo-mirrors
│   ├── ood-wrapper
│   ├── openmpi
│   ├── openshift
│   ├── prometheus
│   ├── prometheus-node-exporter
│   ├── prometheus-slurm-exporter
│   ├── pyxis
│   ├── roce_backend
│   ├── slurm
│   └── spack
├── scripts
│   ├── airgap
│   ├── deepops
│   ├── generic
│   ├── k8s
│   └── pxe
├── src
│   ├── containers
│   │   ├── ansible
│   │   ├── dgx-firmware
│   │   ├── dgxie
│   │   ├── kubeflow-jupyter-web-app
│   │   ├── nccl-tests
│   │   ├── ngc
│   │   │   ├── pytorch
│   │   │   ├── rapids
│   │   │   └── tensorflow
│   │   ├── pixiecore
│   │   └── pxe
│   │       └── dhcp
│   ├── dashboards
│   └── repo
├── submodules
│   └── kubespray
├── virtual
│   └── scripts
└── workloads
    ├── burn-in
    ├── examples
    │   ├── k8s
    │   │   ├── dask-rapids
    │   │   ├── kubeflow-pipeline-deploy
    │   │   ├── services
    │   │   │   └── logging
    │   │   └── users
    │   └── slurm
    │       ├── dask-rapids
    │       └── mpi-hello
    ├── jenkins
    │   └── scripts
    └── services
        └── k8s
            └── dgxie

Upgrade Steps

If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the setup.sh script must be re-run, and any new variables in the config.example files should be added to your existing config. For a full diff from release 20.08.1, run git diff 20.08.1 20.10 -- config.example/.

It is also necessary to upgrade Helm on your provisioner node. This can be done manually, using ./scripts/install_helm.sh as a reference.
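
As a hedged sketch of that step (running the script directly is one option; following its steps by hand is another):

# Review the Helm install script to see which version it installs
less ./scripts/install_helm.sh

# Run it (or follow the equivalent steps manually) on the provisioner
# node, then confirm the new Helm version
./scripts/install_helm.sh
helm version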