5 changes: 5 additions & 0 deletions gpu-operator/getting-started.rst
@@ -160,6 +160,11 @@ To view all the options, run ``helm show values nvidia/gpu-operator``.
* - ``daemonsets.labels``
- Map of custom labels to add to all GPU Operator managed pods.
- ``{}``

* - ``dcgmExporter.service.internalTrafficPolicy``
- Specifies the `internalTrafficPolicy <https://kubernetes.io/docs/concepts/services-networking/service/#internal-traffic-policy>`_ for the DCGM Exporter service.
Available values are ``Cluster`` (default) or ``Local``.
- ``Cluster``

* - ``devicePlugin.config``
- Specifies the configuration for the NVIDIA Device Plugin as a config map.
19 changes: 8 additions & 11 deletions gpu-operator/life-cycle-policy.rst
@@ -89,37 +89,34 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
- ${version}

* - NVIDIA GPU Driver
- | `570.148.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-148-08/index.html>`_ (recommended)
| `570.133.20 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-133-20/index.html>`_
| `570.124.06 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-124-06/index.html>`_ (default)
| `570.86.15 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-86-15/index.html>`_
- | `570.148.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-148-08/index.html>`_ (default, recommended)
| `550.163.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-550-163-01/index.html>`_
| `535.247.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-247-01/index.html>`_

* - NVIDIA Driver Manager for Kubernetes
- `v0.8.0 <https://ngc.nvidia.com/catalog/containers/nvidia:cloud-native:k8s-driver-manager>`__

* - NVIDIA Container Toolkit
- `1.17.5 <https://github.com/NVIDIA/nvidia-container-toolkit/releases>`__
- `1.17.8 <https://github.com/NVIDIA/nvidia-container-toolkit/releases>`__

* - NVIDIA Kubernetes Device Plugin
- `0.17.1 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__
- `0.17.2 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__
Contributor

Also need to bump the version for GPU Feature Discovery.

Member

I have a feeling that I never shared with Abbie that some of the answers are in the back of the book. I would snoop merges to the GPU Op repo for changes to the default values.yaml and glean what I could. The slight curveball is NFD. If memory serves, I could snoop from the Chart.yaml file.

Fingers crossed this helps rather than causes confusion--any confusion and disregard this update!


* - DCGM Exporter
- `4.1.1-4.0.4 <https://github.com/NVIDIA/dcgm-exporter/releases>`__
- `4.2.3-4.1.3 <https://github.com/NVIDIA/dcgm-exporter/releases>`__

* - Node Feature Discovery
- `v0.17.2 <https://github.com/kubernetes-sigs/node-feature-discovery/releases/>`__
- `v0.17.3 <https://github.com/kubernetes-sigs/node-feature-discovery/releases/>`__

* - | NVIDIA GPU Feature Discovery
| for Kubernetes
- `0.17.1 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__
- `0.17.2 <https://github.com/NVIDIA/k8s-device-plugin/releases>`__

* - NVIDIA MIG Manager for Kubernetes
- `0.12.1 <https://github.com/NVIDIA/mig-parted/tree/main/deployments/gpu-operator>`__

* - DCGM
- `4.1.1-2 <https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html>`__
- `4.2.3 <https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html>`__

* - Validator for NVIDIA GPU Operator
- ${version}
@@ -141,7 +138,7 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
- v0.1.1

* - NVIDIA GDRCopy Driver
- `v2.4.4 <https://github.com/NVIDIA/gdrcopy/releases>`__
- `v2.5.0 <https://github.com/NVIDIA/gdrcopy/releases>`__

.. _gds-open-kernel:

60 changes: 38 additions & 22 deletions gpu-operator/platform-support.rst
@@ -72,15 +72,18 @@ The following NVIDIA data center GPUs are supported on x86 based platforms:
| | NVIDIA H200, | NVIDIA Hopper |
| | NVIDIA H200 NVL | |
+-------------------------+---------------------------+
| NVIDIA HGX H200 | NVIDIA Hopper and |
| NVIDIA DGX H100 | NVIDIA Hopper and |
| | NVSwitch |
+-------------------------+---------------------------+
| NVIDIA DGX H100 | NVIDIA Hopper and |
| NVIDIA DGX H200 | NVIDIA Hopper and |
| | NVSwitch |
+-------------------------+---------------------------+
| NVIDIA HGX H100 | NVIDIA Hopper and |
| | NVSwitch |
+-------------------------+---------------------------+
| NVIDIA HGX H200 | NVIDIA Hopper and |
| | NVSwitch |
+-------------------------+---------------------------+
| | NVIDIA H100, | NVIDIA Hopper |
| | NVIDIA H100 NVL | |
+-------------------------+---------------------------+
@@ -170,15 +173,18 @@ The following NVIDIA data center GPUs are supported on x86 based platforms:
+-------------------------+------------------------+
| Product | Architecture |
+=========================+========================+
| NVIDIA DGX B200 | NVIDIA Blackwell |
+-------------------------+------------------------+
| NVIDIA HGX B200 | NVIDIA Blackwell |
+-------------------------+------------------------+
| NVIDIA HGX GB200 NVL | NVIDIA Blackwell |
| NVIDIA HGX GB200 NVL72 | NVIDIA Blackwell |
+-------------------------+------------------------+

.. note::

* HGX B200 requires a driver container version of 570.133.20 or later.


.. _gpu-operator-arm-platforms:

Supported ARM Based Platforms
@@ -242,6 +248,8 @@ Supported Operating Systems and Kubernetes Platforms
.. |fn1| replace:: :sup:`1`
.. _fn2: #ubuntu-kernel
.. |fn2| replace:: :sup:`2`
.. _fn3: #rhel-9
.. |fn3| replace:: :sup:`3`

The GPU Operator has been validated in the following scenarios:

@@ -271,25 +279,25 @@ The GPU Operator has been validated in the following scenarios:
| NKP

* - Ubuntu 20.04 LTS |fn2|_
- 1.29---1.32
- 1.29---1.33
-
- 7.0 U3c, 8.0 U2, 8.0 U3
- 1.29---1.32
- 1.29---1.33
-
-
- 2.12, 2.13

* - Ubuntu 22.04 LTS |fn2|_
- 1.29---1.32
- 1.29---1.33
-
- 8.0 U2, 8.0 U3
- 1.29---1.32
- 1.29---1.33
-
- 1.26
- 2.12, 2.13

* - Ubuntu 24.04 LTS
- 1.29---1.32
- 1.29---1.33
-
-
-
@@ -308,27 +316,27 @@ The GPU Operator has been validated in the following scenarios:

* - | Red Hat
| Enterprise
| Linux 8.8,
| 8.10
- 1.29---1.32
| Linux 9.2, 9.4, 9.5, 9.6 |fn3|_
- 1.29---1.33
-
-
- 1.29---1.32
- 1.29---1.33
-
-
-

* - | Red Hat
| Enterprise
| Linux 8.4, 8.5
-
| Linux 8.8,
| 8.10
- 1.29---1.33
-
-
- 1.29---1.33
-
- 5.5
-
-

.. _kubernetes-version:

:sup:`1`
@@ -345,7 +353,12 @@ The GPU Operator has been validated in the following scenarios:
`Ubuntu kernel lifecycle and enablement stack <https://ubuntu.com/kernel/lifecycle>`_ page for more information.
NVIDIA recommends disabling automatic updates for the Linux kernel that are performed
by the ``unattended-upgrades`` package to prevent an upgrade to an unsupported kernel version.


.. _rhel-9:

:sup:`3`
Non-precompiled driver containers for Red Hat Enterprise Linux versions 9.2, 9.4, 9.5, and 9.6 are available for x86 based platforms only.
They are not available for ARM based systems.

.. note::

@@ -395,21 +408,21 @@ The GPU Operator has been validated in the following scenarios:
| NKP

* - Ubuntu 20.04 LTS
- 1.29--1.32
- 1.29--1.33
-
- 7.0 U3c, 8.0 U2, 8.0 U3
- 1.23---1.25
- 2.12, 2.13

* - Ubuntu 22.04 LTS
- 1.29--1.32
- 1.29--1.33
-
- 8.0 U2, 8.0 U3
-
- 2.12, 2.13

* - Ubuntu 24.04 LTS
- 1.29--1.32
- 1.29--1.33
-
-
-
@@ -426,10 +439,10 @@ The GPU Operator has been validated in the following scenarios:
| Enterprise
| Linux 8.4,
| 8.6---8.10
- 1.29---1.32
- 1.29---1.33
-
-
- 1.29---1.32
- 1.29---1.33
-


@@ -469,6 +482,8 @@ The GPU Operator has been validated in the following scenarios:
+----------------------------+------------------------+----------------+
| Red Hat Enterprise Linux 8 | Yes | Yes |
+----------------------------+------------------------+----------------+
| Red Hat Enterprise Linux 9 | Yes | Yes |
+----------------------------+------------------------+----------------+


Support for KubeVirt and OpenShift Virtualization
@@ -521,6 +536,7 @@ Supported operating systems and NVIDIA GPU Drivers with GPUDirect RDMA.

- Ubuntu 24.04 LTS with Network Operator 25.1.0.
- Ubuntu 20.04 and 22.04 LTS with Network Operator 24.10.0.
- Red Hat Enterprise Linux 9.2, 9.4, 9.5, and 9.6 with Network Operator 25.1.0.
- Red Hat OpenShift 4.12 and higher with Network Operator 23.10.0

For information about configuring GPUDirect RDMA, refer to :doc:`gpu-operator-rdma`.
46 changes: 46 additions & 0 deletions gpu-operator/release-notes.rst
@@ -33,6 +33,51 @@ See the :ref:`GPU Operator Component Matrix` for a list of software components a

----

.. _v25.3.1:

25.3.1
======

.. _v25.3.1-new-features:

New Features
------------

* Added support for the following software component versions:

- NVIDIA Container Toolkit version v1.17.8
- NVIDIA DCGM v4.2.3
- NVIDIA DCGM Exporter v4.2.3-4.1.3
- NVIDIA Kubernetes Device Plugin v0.17.2
- Node Feature Discovery v0.17.3
- NVIDIA GDRCopy Driver v2.5.0

* Added support for the following NVIDIA Data Center GPU Driver versions:

- 570.148.08 (default, recommended)
- 570.133.20
- 550.163.01
- 535.247.01

* Added support for Red Hat Enterprise Linux 9.
Non-precompiled driver containers for Red Hat Enterprise Linux versions 9.2, 9.4, 9.5, and 9.6 are available for x86 based platforms only.
They are not available for ARM based systems.

* Added support for Kubernetes v1.33.

* Added support for setting the ``internalTrafficPolicy`` for the DCGM Exporter service.
You can configure this in the Helm chart by setting the ``dcgmExporter.service.internalTrafficPolicy`` value to ``Local`` or ``Cluster`` (default).
Choose ``Local`` if you want to route internal traffic within the node only.
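
A minimal sketch of a Helm values override for this setting, assuming a hypothetical file name ``values-override.yaml``. The key path follows the ``dcgmExporter.service.internalTrafficPolicy`` chart value described in the getting-started table, and ``Local`` is one of the two documented values:

.. code-block:: yaml

   # values-override.yaml (hypothetical file name, not part of the chart)
   dcgmExporter:
     service:
       # Route in-cluster traffic only to the DCGM Exporter endpoint on the same node.
       # Use "Cluster" (the default) to allow routing to endpoints on any node.
       internalTrafficPolicy: Local

A file like this would typically be passed to Helm with ``-f values-override.yaml`` when installing or upgrading the chart.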

.. _v25.3.1-fixed-issues:

Fixed Issues
------------

* Fixed an issue where the NVIDIADriver controller may enter an endless loop of creating and deleting a DaemonSet.
This could occur when the NVIDIADriver DaemonSet does not tolerate a taint present on all nodes matching its configured nodeSelector, or when none of the DaemonSet pods have been scheduled yet.
Refer to GitHub `pull request #1416 <https://github.com/NVIDIA/gpu-operator/pull/1416>`__ for more details.

.. _v25.3.0:

25.3.0
@@ -2263,3 +2308,4 @@ Known Limitations

* After you uninstall the GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove the modules with the
``sudo rmmod nvidia nvidia_modeset nvidia_uvm`` command before reinstalling the GPU Operator.

8 changes: 4 additions & 4 deletions gpu-operator/versions1.json
@@ -1,6 +1,10 @@
[
{
"preferred": "true",
"url": "../25.3.1",
"version": "25.3.1"
},
{
"url": "../25.3.0",
"version": "25.3.0"
},
@@ -19,9 +23,5 @@
{
"url": "../24.6.2",
"version": "24.6.2"
},
{
"url": "../24.6.1",
"version": "24.6.1"
}
]
4 changes: 2 additions & 2 deletions repo.toml
@@ -165,8 +165,8 @@ output_format = "linkcheck"
docs_root = "${root}/gpu-operator"
project = "gpu-operator"
name = "NVIDIA GPU Operator"
version = "25.3.0"
source_substitutions = { version = "v25.3.0", recommended = "570.124.06" }
version = "25.3.1"
source_substitutions = { version = "v25.3.1", recommended = "570.148.08" }
copyright_start = 2020
sphinx_exclude_patterns = [
"life-cycle-policy.rst",