diff --git a/gpu-operator/getting-started.rst b/gpu-operator/getting-started.rst index 596ca46ed..1e249748b 100644 --- a/gpu-operator/getting-started.rst +++ b/gpu-operator/getting-started.rst @@ -160,6 +160,11 @@ To view all the options, run ``helm show values nvidia/gpu-operator``. * - ``daemonsets.labels`` - Map of custom labels to add to all GPU Operator managed pods. - ``{}`` + + * - ``dcgmExporter.service.internalTrafficPolicy`` + - Specifies the `internalTrafficPolicy `_ for the DCGM Exporter service. + Available values are ``Cluster`` (default) or ``Local``. + - ``Cluster`` * - ``devicePlugin.config`` - Specifies the configuration for the NVIDIA Device Plugin as a config map. diff --git a/gpu-operator/life-cycle-policy.rst b/gpu-operator/life-cycle-policy.rst index 7ffdcac13..f55e3db06 100644 --- a/gpu-operator/life-cycle-policy.rst +++ b/gpu-operator/life-cycle-policy.rst @@ -89,10 +89,7 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information. - ${version} * - NVIDIA GPU Driver - - | `570.148.08 `_ (recommended) - | `570.133.20 `_ - | `570.124.06 `_ (default) - | `570.86.15 `_ + - | `570.148.08 `_ (default, recommended) | `550.163.01 `_ | `535.247.01 `_ @@ -100,26 +97,26 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information. - `v0.8.0 `__ * - NVIDIA Container Toolkit - - `1.17.5 `__ + - `1.17.8 `__ * - NVIDIA Kubernetes Device Plugin - - `0.17.1 `__ + - `0.17.2 `__ * - DCGM Exporter - - `4.1.1-4.0.4 `__ + - `4.2.3-4.1.3 `__ * - Node Feature Discovery - - `v0.17.2 `__ + - `v0.17.3 `__ * - | NVIDIA GPU Feature Discovery | for Kubernetes - - `0.17.1 `__ + - `0.17.2 `__ * - NVIDIA MIG Manager for Kubernetes - `0.12.1 `__ * - DCGM - - `4.1.1-2 `__ + - `4.2.3 `__ * - Validator for NVIDIA GPU Operator - ${version} @@ -141,7 +138,7 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information. - v0.1.1 * - NVIDIA GDRCopy Driver - - `v2.4.4 `__ + - `v2.5.0 `__ .. 
_gds-open-kernel: diff --git a/gpu-operator/platform-support.rst b/gpu-operator/platform-support.rst index aed71f8b3..512faff81 100644 --- a/gpu-operator/platform-support.rst +++ b/gpu-operator/platform-support.rst @@ -72,15 +72,18 @@ The following NVIDIA data center GPUs are supported on x86 based platforms: | | NVIDIA H200, | NVIDIA Hopper | | | NVIDIA H200 NVL | | +-------------------------+---------------------------+ - | NVIDIA HGX H200 | NVIDIA Hopper and | + | NVIDIA DGX H100 | NVIDIA Hopper and | | | NVSwitch | +-------------------------+---------------------------+ - | NVIDIA DGX H100 | NVIDIA Hopper and | + | NVIDIA DGX H200 | NVIDIA Hopper and | | | NVSwitch | +-------------------------+---------------------------+ | NVIDIA HGX H100 | NVIDIA Hopper and | | | NVSwitch | +-------------------------+---------------------------+ + | NVIDIA HGX H200 | NVIDIA Hopper and | + | | NVSwitch | + +-------------------------+---------------------------+ | | NVIDIA H100, | NVIDIA Hopper | | | NVIDIA H100 NVL | | +-------------------------+---------------------------+ @@ -170,15 +173,18 @@ The following NVIDIA data center GPUs are supported on x86 based platforms: +-------------------------+------------------------+ | Product | Architecture | +=========================+========================+ + | NVIDIA DGX B200 | NVIDIA Blackwell | + +-------------------------+------------------------+ | NVIDIA HGX B200 | NVIDIA Blackwell | +-------------------------+------------------------+ - | NVIDIA HGX GB200 NVL | NVIDIA Blackwell | + | NVIDIA HGX GB200 NVL72 | NVIDIA Blackwell | +-------------------------+------------------------+ .. note:: * HGX B200 requires a driver container version of 570.133.20 or later. + .. _gpu-operator-arm-platforms: Supported ARM Based Platforms @@ -242,6 +248,8 @@ Supported Operating Systems and Kubernetes Platforms .. |fn1| replace:: :sup:`1` .. _fn2: #ubuntu-kernel .. |fn2| replace:: :sup:`2` +.. _fn3: #rhel-9 +.. 
|fn3| replace:: :sup:`3` The GPU Operator has been validated in the following scenarios: @@ -271,25 +279,25 @@ The GPU Operator has been validated in the following scenarios: | NKP * - Ubuntu 20.04 LTS |fn2|_ - - 1.29---1.32 + - 1.29---1.33 - - 7.0 U3c, 8.0 U2, 8.0 U3 - - 1.29---1.32 + - 1.29---1.33 - - - 2.12, 2.13 * - Ubuntu 22.04 LTS |fn2|_ - - 1.29---1.32 + - 1.29---1.33 - - 8.0 U2, 8.0 U3 - - 1.29---1.32 + - 1.29---1.33 - - 1.26 - 2.12, 2.13 * - Ubuntu 24.04 LTS - - 1.29---1.32 + - 1.29---1.33 - - - @@ -308,27 +316,27 @@ The GPU Operator has been validated in the following scenarios: * - | Red Hat | Enterprise - | Linux 8.8, - | 8.10 - - 1.29---1.32 + | Linux 9.2, 9.4, 9.5, 9.6 |fn3|_ + - 1.29---1.33 - - - - 1.29---1.32 + - 1.29---1.33 - - - * - | Red Hat | Enterprise - | Linux 8.4, 8.5 - - + | Linux 8.8, + | 8.10 + - 1.29---1.33 - - + - 1.29---1.33 - - - 5.5 - - - + .. _kubernetes-version: :sup:`1` @@ -345,7 +353,12 @@ The GPU Operator has been validated in the following scenarios: `Ubuntu kernel lifecycle and enablement stack `_ page for more information. NVIDIA recommends disabling automatic updates for the Linux kernel that are performed by the ``unattended-upgrades`` package to prevent an upgrade to an unsupported kernel version. - + + .. _rhel-9: + + :sup:`3` + Non-precompiled driver containers for Red Hat Enterprise Linux 9.2, 9.4, 9.5, and 9.6 versions are available for x86 based platforms only. + They are not available for ARM based systems. .. 
note:: @@ -395,21 +408,21 @@ The GPU Operator has been validated in the following scenarios: | NKP * - Ubuntu 20.04 LTS - - 1.29--1.32 + - 1.29--1.33 - - 7.0 U3c, 8.0 U2, 8.0 U3 - 1.23---1.25 - 2.12, 2.13 * - Ubuntu 22.04 LTS - - 1.29--1.32 + - 1.29--1.33 - - 8.0 U2, 8.0 U3 - - 2.12, 2.13 * - Ubuntu 24.04 LTS - - 1.29--1.32 + - 1.29--1.33 - - - @@ -426,10 +439,10 @@ The GPU Operator has been validated in the following scenarios: | Enterprise | Linux 8.4, | 8.6---8.10 - - 1.29---1.32 + - 1.29---1.33 - - - - 1.29---1.32 + - 1.29---1.33 - @@ -469,6 +482,8 @@ The GPU Operator has been validated in the following scenarios: +----------------------------+------------------------+----------------+ | Red Hat Enterprise Linux 8 | Yes | Yes | +----------------------------+------------------------+----------------+ +| Red Hat Enterprise Linux 9 | Yes | Yes | ++----------------------------+------------------------+----------------+ Support for KubeVirt and OpenShift Virtualization @@ -521,6 +536,7 @@ Supported operating systems and NVIDIA GPU Drivers with GPUDirect RDMA. - Ubuntu 24.04 LTS with Network Operator 25.1.0. - Ubuntu 20.04 and 22.04 LTS with Network Operator 24.10.0. +- Red Hat Enterprise Linux 9.2, 9.4, 9.5, and 9.6 with Network Operator 25.1.0. - Red Hat OpenShift 4.12 and higher with Network Operator 23.10.0 For information about configuring GPUDirect RDMA, refer to :doc:`gpu-operator-rdma`. diff --git a/gpu-operator/release-notes.rst b/gpu-operator/release-notes.rst index 07ae4e775..1bdf111e4 100644 --- a/gpu-operator/release-notes.rst +++ b/gpu-operator/release-notes.rst @@ -33,6 +33,51 @@ See the :ref:`GPU Operator Component Matrix` for a list of software components a ---- +.. _v25.3.1: + +25.3.1 +====== + +.. 
_v25.3.1-new-features: + +New Features +------------ + +* Added support for the following software component versions: + + - NVIDIA Container Toolkit v1.17.8 + - NVIDIA DCGM v4.2.3 + - NVIDIA DCGM Exporter v4.2.3-4.1.3 + - NVIDIA Kubernetes Device Plugin v0.17.2 + - Node Feature Discovery v0.17.3 + - NVIDIA GDRCopy Driver v2.5.0 + +* Added support for the following NVIDIA Data Center GPU Driver versions: + + - 570.148.08 (default, recommended) + - 570.133.20 + - 550.163.01 + - 535.247.01 + +* Added support for Red Hat Enterprise Linux 9. + Non-precompiled driver containers for Red Hat Enterprise Linux 9.2, 9.4, 9.5, and 9.6 versions are available for x86 based platforms only. + They are not available for ARM based systems. + +* Added support for Kubernetes v1.33. + +* Added support for setting the ``internalTrafficPolicy`` for the DCGM Exporter service. + You can configure this in the Helm chart by setting ``dcgmExporter.service.internalTrafficPolicy`` to ``Local`` or ``Cluster`` (default). + Choose ``Local`` if you want to route internal traffic within the node only. + +.. _v25.3.1-fixed-issues: + +Fixed Issues +------------ + +* Fixed an issue where the NVIDIADriver controller could enter an endless loop of creating and deleting a DaemonSet. + This could occur when the NVIDIADriver DaemonSet does not tolerate a taint present on all nodes matching its configured nodeSelector, or when none of the DaemonSet pods have been scheduled yet. + Refer to GitHub `pull request #1416 `__ for more details. + .. _v25.3.0: 25.3.0 @@ -2263,3 +2308,4 @@ Known Limitations * After un-install of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using ``sudo rmmod nvidia nvidia_modeset nvidia_uvm`` command before re-installing GPU Operator again. 
+ diff --git a/gpu-operator/versions1.json b/gpu-operator/versions1.json index 5e2447674..2c269f403 100644 --- a/gpu-operator/versions1.json +++ b/gpu-operator/versions1.json @@ -1,6 +1,10 @@ [ { "preferred": "true", + "url": "../25.3.1", + "version": "25.3.1" + }, + { "url": "../25.3.0", "version": "25.3.0" }, @@ -19,9 +23,5 @@ { "url": "../24.6.2", "version": "24.6.2" - }, - { - "url": "../24.6.1", - "version": "24.6.1" } ] \ No newline at end of file diff --git a/repo.toml b/repo.toml index cd05e8bcb..1a701fda2 100644 --- a/repo.toml +++ b/repo.toml @@ -165,8 +165,8 @@ output_format = "linkcheck" docs_root = "${root}/gpu-operator" project = "gpu-operator" name = "NVIDIA GPU Operator" -version = "25.3.0" -source_substitutions = { version = "v25.3.0", recommended = "570.124.06" } +version = "25.3.1" +source_substitutions = { version = "v25.3.1", recommended = "570.148.08" } copyright_start = 2020 sphinx_exclude_patterns = [ "life-cycle-policy.rst",