diff --git a/container-toolkit/arch-overview.md b/container-toolkit/arch-overview.md index 742ef81e0..10835da5b 100644 --- a/container-toolkit/arch-overview.md +++ b/container-toolkit/arch-overview.md @@ -92,7 +92,7 @@ a `prestart` hook into it, and then calls out to the native `runC`, passing it t For versions of the NVIDIA Container Runtime from `v1.12.0`, this runtime also performs additional modifications to the OCI runtime spec to inject specific devices and mounts not handled by the NVIDIA Container CLI. -It's important to note that this component is not necessarily specific to docker (but it is specific to `runC`). +It is important to note that this component is not necessarily specific to docker (but it is specific to `runC`). ### The NVIDIA Container Toolkit CLI diff --git a/gpu-operator/dra-intro-install.rst b/gpu-operator/dra-intro-install.rst index e1a71c938..27f8d73a7 100644 --- a/gpu-operator/dra-intro-install.rst +++ b/gpu-operator/dra-intro-install.rst @@ -12,7 +12,7 @@ Introduction With NVIDIA's DRA Driver for GPUs, your Kubernetes workload can allocate and consume the following two types of resources: -* **GPUs**: for controlled sharing and dynamic reconfiguration of GPUs. A modern replacement for the traditional GPU allocation method (using `NVIDIA's device plugin `_). We are excited about this part of the driver; it is however not yet fully supported (Technology Preview). +* **GPUs**: for controlled sharing and dynamic reconfiguration of GPUs. A modern replacement for the traditional GPU allocation method (using `NVIDIA's device plugin `_). NVIDIA is excited about this part of the driver; it is however not yet fully supported (Technology Preview). * **ComputeDomains**: for robust and secure Multi-Node NVLink (MNNVL) for NVIDIA GB200 and similar systems. Fully supported. A primer on DRA @@ -25,7 +25,7 @@ For NVIDIA devices, there are two particularly beneficial characteristics provid #. A clean way to allocate **cross-node resources** in Kubernetes (leveraged here for providing NVLink connectivity across pods running on multiple nodes). #. Mechanisms to explicitly **share, partition, and reconfigure** devices **on-the-fly** based on user requests (leveraged here for advanced GPU allocation). -To understand and make best use of NVIDIA's DRA Driver for GPUs, we recommend becoming familiar with DRA by working through the `official documentation `_. +To understand and make best use of NVIDIA's DRA Driver for GPUs, NVIDIA recommends becoming familiar with DRA by working through the `official documentation `_. The twofold nature of this driver @@ -34,7 +34,7 @@ The twofold nature of this driver NVIDIA's DRA Driver for GPUs is comprised of two subsystems that are largely independent of each other: one manages GPUs, and the other one manages ComputeDomains. Below, you can find instructions for how to install both parts or just one of them. -Additionally, we have prepared two separate documentation chapters, providing more in-depth information for each of the two subsystems: +Additionally, NVIDIA has prepared two separate documentation chapters, providing more in-depth information for each of the two subsystems: - :ref:`Documentation for ComputeDomain (MNNVL) support ` - :ref:`Documentation for GPU support ` @@ -52,7 +52,7 @@ Prerequisites - `CDI `_ must be enabled in the underlying container runtime (such as containerd or CRI-O). - NVIDIA GPU Driver 565 or later. 
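Editor's note: the CDI prerequisite above is the one most often missed. A minimal check, assuming containerd as the runtime; the ``enable_cdi`` and ``cdi_spec_dirs`` option names come from containerd's CRI plugin (on by default in containerd 2.x), not from this patch, and CRI-O enables CDI by default in recent releases:

.. code-block:: console

   # Inspect the effective containerd configuration for CDI support.
   $ sudo containerd config dump | grep -i cdi
   enable_cdi = true
   cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]

If ``enable_cdi`` is ``false``, set it to ``true`` in ``/etc/containerd/config.toml`` and restart containerd before installing the DRA driver.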
-For the last two items on the list above, as well as for other reasons, we recommend installing NVIDIA's GPU Operator v25.3.0 or later. +For the last two items on the list above, as well as for other reasons, NVIDIA recommends installing NVIDIA's GPU Operator v25.3.0 or later. For detailed instructions, see the official GPU Operator `installation documentation `__. Also note that, in the near future, the preferred method to install NVIDIA's DRA Driver for GPUs will be through the GPU Operator (the DRA driver will then no longer require installation as a separate Helm chart). @@ -65,8 +65,8 @@ Also note that, in the near future, the preferred method to install NVIDIA's DRA - Refer to the `docs on installing the GPU Operator with a pre-installed GPU driver `__. -Configure and Helm-install the driver -===================================== +Configure and install the driver with Helm +========================================== #. Add the NVIDIA Helm repository: @@ -103,15 +103,15 @@ All install-time configuration parameters can be listed by running ``helm show v .. note:: - A common mode of operation for now is to enable only the ComputeDomain subsystem (to have GPUs allocated using the traditional device plugin). The example above achieves that by setting ``resources.gpus.enabled=false``. - - Setting ``nvidiaDriverRoot=/run/nvidia/driver`` above expects a GPU Operator-provided GPU driver. That configuration parameter must be changed in case the GPU driver is installed straight on the host (typically at ``/``, which is the default value for ``nvidiaDriverRoot``). + - Setting ``nvidiaDriverRoot=/run/nvidia/driver`` above expects a GPU Operator-provided GPU driver. That configuration parameter must be changed in case the GPU driver is installed straight on the host (typically at ``/``, which is the default value for ``nvidiaDriverRoot``). Validate installation ===================== A lot can go wrong, depending on the exact nature of your Kubernetes environment and specific hardware and driver choices as well as configuration options chosen. -That is why we recommend to perform a set of validation tests to confirm the basic functionality of your setup. -To that end, we have prepared separate documentation: +That is why NVIDIA recommends performing a set of validation tests to confirm the basic functionality of your setup. +To that end, NVIDIA has prepared separate documentation: - `Testing ComputeDomain allocation `_ - `Testing GPU allocation `_ diff --git a/gpu-operator/getting-started.rst b/gpu-operator/getting-started.rst index ef5643a4e..a398f3fa5 100644 --- a/gpu-operator/getting-started.rst +++ b/gpu-operator/getting-started.rst @@ -277,7 +277,7 @@ To view all the options, run ``helm show values nvidia/gpu-operator``. - ``{}`` * - ``psp.enabled`` - - The GPU operator deploys ``PodSecurityPolicies`` if enabled. + - The GPU Operator deploys ``PodSecurityPolicies`` if enabled. - ``false`` * - ``sandboxWorkloads.defaultWorkload`` diff --git a/gpu-operator/gpu-operator-kata.rst b/gpu-operator/gpu-operator-kata.rst index c0d871c89..bb05c2c35 100644 --- a/gpu-operator/gpu-operator-kata.rst +++ b/gpu-operator/gpu-operator-kata.rst @@ -19,8 +19,8 @@ .. lingo: - It's "Kata Containers" when referring to the software component. - It's "Kata container" when it's a container that uses the Kata Containers runtime. + It is "Kata Containers" when referring to the software component. + It is "Kata container" when it is a container that uses the Kata Containers runtime. 
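Editor's note: the DRA driver install hunk above only shows the two Helm values. An end-to-end sketch of that flow, assuming NVIDIA's NGC Helm repository at ``https://helm.ngc.nvidia.com/nvidia`` and a chart published as ``nvidia-dra-driver-gpu``; the repository, chart, release, and namespace names are assumptions, not taken from this patch:

.. code-block:: console

   $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

   # List all install-time configuration parameters before installing.
   $ helm show values nvidia/nvidia-dra-driver-gpu

   # ComputeDomains only, with a GPU Operator-provided driver under /run/nvidia/driver.
   $ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
       --namespace nvidia-dra-driver-gpu --create-namespace \
       --set resources.gpus.enabled=false \
       --set nvidiaDriverRoot=/run/nvidia/driver

Drop the ``nvidiaDriverRoot`` override if the GPU driver is installed directly on the host.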
Treat our operands as proper nouns and use title case. ################################# diff --git a/gpu-operator/gpu-operator-kubevirt.rst b/gpu-operator/gpu-operator-kubevirt.rst index 5b012ddf5..e4f00a840 100644 --- a/gpu-operator/gpu-operator-kubevirt.rst +++ b/gpu-operator/gpu-operator-kubevirt.rst @@ -37,7 +37,7 @@ Given the following node configuration: * Node B is configured with the label ``nvidia.com/gpu.workload.config=vm-passthrough`` and configured to run virtual machines with Passthrough GPU. * Node C is configured with the label ``nvidia.com/gpu.workload.config=vm-vgpu`` and configured to run virtual machines with vGPU. -The GPU operator will deploy the following software components on each node: +The GPU Operator will deploy the following software components on each node: * Node A receives the following software components: * ``NVIDIA Datacenter Driver`` - to install the driver diff --git a/gpu-operator/gpu-operator-mig.rst b/gpu-operator/gpu-operator-mig.rst index 5bb239dec..890345fe1 100644 --- a/gpu-operator/gpu-operator-mig.rst +++ b/gpu-operator/gpu-operator-mig.rst @@ -102,7 +102,7 @@ Perform the following steps to install the Operator and configure MIG: Known Issue: For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01, GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs. This manifests as GPU pods getting stuck indefinitely in the ``Pending`` state. - It's recommended that you downgrade the driver to version 570.86.15 to work around this issue. + NVIDIA recommends that you downgrade the driver to version 570.86.15 to work around this issue. For more detailed information, see GitHub issue https://github.com/NVIDIA/gpu-operator/issues/1361. diff --git a/gpu-operator/install-gpu-operator-air-gapped.rst b/gpu-operator/install-gpu-operator-air-gapped.rst index bf8427438..c4a1b23c9 100644 --- a/gpu-operator/install-gpu-operator-air-gapped.rst +++ b/gpu-operator/install-gpu-operator-air-gapped.rst @@ -246,7 +246,7 @@ Sample of ``values.yaml`` for GPU Operator v1.9.0: Local Package Repository ************************ -The ``driver`` container deployed as part of the GPU operator requires certain packages to be available as part of the +The ``driver`` container deployed as part of the GPU Operator requires certain packages to be available as part of the driver installation. In restricted internet access or air-gapped installations, users are required to create a local mirror repository for their OS distribution and make the following packages available: diff --git a/gpu-operator/install-gpu-operator-outdated-kernels.rst b/gpu-operator/install-gpu-operator-outdated-kernels.rst index 0afa6875b..b83483d78 100644 --- a/gpu-operator/install-gpu-operator-outdated-kernels.rst +++ b/gpu-operator/install-gpu-operator-outdated-kernels.rst @@ -12,7 +12,7 @@ On GPU nodes where the running kernel is not the latest, the ``driver`` containe see the following error message: ``Could not resolve Linux kernel version``. In general, upgrading your system to the latest kernel should fix this issue. But if this is not an option, the following is a -workaround to successfully deploy the GPU operator when GPU nodes in your cluster may not be running the latest kernel. +workaround to successfully deploy the GPU Operator when GPU nodes in your cluster may not be running the latest kernel. 
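Editor's note: the workload labels in the KubeVirt hunk earlier in this patch decide which operands land on each node. A quick sketch of applying and verifying them; the node names are placeholders, while the label key and values are taken verbatim from the hunk:

.. code-block:: console

   $ kubectl label node node-b nvidia.com/gpu.workload.config=vm-passthrough
   $ kubectl label node node-c nvidia.com/gpu.workload.config=vm-vgpu

   # Confirm which workload type each node advertises.
   $ kubectl get nodes -L nvidia.com/gpu.workload.config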
Add Archived Package Repositories ================================= diff --git a/gpu-operator/life-cycle-policy.rst b/gpu-operator/life-cycle-policy.rst index 0c73c25d3..c4c084d8c 100644 --- a/gpu-operator/life-cycle-policy.rst +++ b/gpu-operator/life-cycle-policy.rst @@ -91,8 +91,9 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information. - ${version} * - NVIDIA GPU Driver |ki|_ - - | `575.57.08 `_ - | `570.172.08 `_ (default, recommended) + - | `580.65.06 `_ (recommended) + | `575.57.08 `_ + | `570.172.08 `_ (default) | `570.158.01 `_ | `570.148.08 `_ | `535.261.03 `_ @@ -152,7 +153,7 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information. Known Issue: For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01, GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs. This manifests as GPU pods getting stuck indefinitely in the ``Pending`` state. - It's recommended that you downgrade the driver to version 570.86.15 to work around this issue. + NVIDIA recommends that you downgrade the driver to version 570.86.15 to work around this issue. For more detailed information, see GitHub issue https://github.com/NVIDIA/gpu-operator/issues/1361. diff --git a/gpu-operator/overview.rst b/gpu-operator/overview.rst index 245262959..d9ba3f422 100644 --- a/gpu-operator/overview.rst +++ b/gpu-operator/overview.rst @@ -31,7 +31,7 @@ configuration of multiple software components such as drivers, container runtime and prone to errors. The NVIDIA GPU Operator uses the `operator framework `_ within Kubernetes to automate the management of all NVIDIA software components needed to provision GPU. These components include the NVIDIA drivers (to enable CUDA), Kubernetes device plugin for GPUs, the `NVIDIA Container Toolkit `_, -automatic node labelling using `GFD `_, `DCGM `_ based monitoring and others. +automatic node labeling using `GFD `_, `DCGM `_ based monitoring and others. .. card:: Red Hat OpenShift Container Platform diff --git a/gpu-operator/platform-support.rst b/gpu-operator/platform-support.rst index e11a98610..91a4f8673 100644 --- a/gpu-operator/platform-support.rst +++ b/gpu-operator/platform-support.rst @@ -459,8 +459,8 @@ The GPU Operator has been validated in the following scenarios: Supported Precompiled Drivers ----------------------------- -The GPU Operator has been validated with the following precomplied drivers. -See the :doc:`precompiled-drivers` page for more on using precompiled drivers. +The GPU Operator has been validated with the following precompiled drivers. +See the :doc:`precompiled-drivers` page for more information about using precompiled drivers. +----------------------------+------------------------+----------------+---------------------+ | Operating System | Kernel Flavor | Kernel Version | CUDA Driver Branch | @@ -477,10 +477,10 @@ See the :doc:`precompiled-drivers` page for more on using precompiled drivers. 
Supported Container Runtimes ---------------------------- -The GPU Operator has been validated in the following scenarios: +The GPU Operator has been validated for the following container runtimes: +----------------------------+------------------------+----------------+ -| Operating System | Containerd 1.6 - 2.0 | CRI-O | +| Operating System | Containerd 1.6 - 2.1 | CRI-O | +============================+========================+================+ | Ubuntu 20.04 LTS | Yes | Yes | +----------------------------+------------------------+----------------+ @@ -581,4 +581,4 @@ Additional Supported Container Management Tools ----------------------------------------------- * Helm v3 -* Red Hat Operator Lifecycle Manager (OLM) \ No newline at end of file +* Red Hat Operator Lifecycle Manager (OLM) diff --git a/gpu-operator/release-notes.rst b/gpu-operator/release-notes.rst index 6e6e63aad..a0900cb9a 100644 --- a/gpu-operator/release-notes.rst +++ b/gpu-operator/release-notes.rst @@ -43,14 +43,15 @@ See the :ref:`GPU Operator Component Matrix` for a list of software components a New Features ------------ -* Added support for the following software component versions: +* Updated software component versions: - NVIDIA Kubernetes Device Plugin/NVIDIA GPU Feature Discovery v0.17.3 - NVIDIA MIG Manager for Kubernetes v0.12.2 -* Added support for the following NVIDIA Data Center GPU Driver versions: +* Supports these NVIDIA Data Center GPU Driver versions: - - 570.172.08 (default, recommended) + - 580.65.06 (recommended) + - 570.172.08 (default) - 535.261.03 .. _v25.3.2-known-issues: @@ -58,14 +59,33 @@ New Features Known Issues ------------ +* The 580.65.06 driver container comes with Coherent Driver Memory Management (CDMM) enabled by default to support GB200 on Kubernetes. + For more details on CDMM, refer to the `release notes `__. + Note that the 580.65.06 driver container does not support MIG on GB200 when CDMM is enabled. + Additionally, CDMM enablement is only applicable to Grace-based systems like GH200 and GB200, so it is ignored on other GPU machines. + NVIDIA strongly recommends keeping CDMM enabled on these systems to avoid memory over-reporting and unrestricted GPU memory access issues. + * For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01, GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs. This manifests as GPU pods getting stuck indefinitely in the ``Pending`` state. NVIDIA recommends that you upgrade the driver to version 570.172.08 to avoid this issue. - For more detailed information, see GitHub issue https://github.com/NVIDIA/gpu-operator/issue/1361. + For more detailed information, see GitHub issue https://github.com/NVIDIA/gpu-operator/issues/1361. * Configuring the Operator to enable CDI is not supported on Rancher Kubernetes Engine 2 (RKE2). +* The ``nouveau`` driver must be blacklisted when using NVIDIA vGPU. + Otherwise the driver fails to initialize the GPU with the error ``Failed to enable MSI-X`` in the system journal logs. + Additionally, all GPU Operator pods become stuck in the ``Init`` state. + +.. _v25.3.2-fixed-issues: + +Fixed Issues +------------ + +* Fixed security vulnerabilities in NVIDIA Container Toolkit and related components. + This release addresses CVE-2025-23266 (Critical) and CVE-2025-23267 (High) which could allow + arbitrary code execution and link following attacks in container environments. + For complete details, refer to the `NVIDIA Security Bulletin `__. .. 
_v25.3.1: @@ -77,7 +97,7 @@ Known Issues New Features ------------ -* Added support for the following software component versions: +* Includes these software component versions: - NVIDIA Container Toolkit version v1.17.8 - NVIDIA DCGM v4.2.3 @@ -86,13 +106,14 @@ New Features - Node Feature Discovery v0.17.3 - NVIDIA GDRCopy Driver v2.5.0 -* Added support for the following NVIDIA Data Center GPU Driver versions: +* Supports these NVIDIA Data Center GPU Driver versions: - - 570.172.08 (default, recommended) - - 535.261.03 + - 580.65.06 (recommended) + - 570.172.08 (default) - 570.148.08 - - 570.133.20 + - 570.133.20 - 550.163.01 + - 535.261.03 - 535.247.01 * Added support for Red Hat Enterprise Linux 9. @@ -102,7 +123,7 @@ New Features * Added support for Kubernetes v1.33. * Added support for setting the internalTrafficPolicy for the DCGM Exporter service. - You can configure this in the Helm chart value by setting `dcgmexporter.service.internalTrafficPolicy` to `Local` or `Cluster` (default). + You can configure this in the Helm chart value by setting ``dcgmexporter.service.internalTrafficPolicy`` to ``Local`` or ``Cluster`` (default). Choose Local if you want to route internal traffic within the node only. .. _v25.3.1-known-issues: @@ -125,7 +146,7 @@ Fixed Issues * Fixed an issue where the NVIDIADriver controller may enter an endless loop of creating and deleting a DaemonSet. This could occur when the NVIDIADriver DaemonSet does not tolerate a taint present on all nodes matching its configured nodeSelector, or when none of the DaemonSet pods have been scheduled yet. - Refer to Github `pull request #1416 `__ for more details. + Refer to GitHub `pull request #1416 `__ for more details. .. _v25.3.0: @@ -137,7 +158,7 @@ Fixed Issues New Features ------------ -* Added support for the following software component versions: +* Includes these software component versions: - NVIDIA Container Toolkit version v1.17.5 - NVIDIA Driver Manager for Kubernetes v0.8.0 @@ -154,7 +175,7 @@ New Features * Added support for the NVIDIA GPU DRA Driver v25.3.0 component (coming soon) which enables Multi-Node NVLink through Kubernetes Dynamic Resource Allocation (DRA) and IMEX support. - This component is an additional component that can be installed alongside the the GPU Operator. + This component can be installed alongside the GPU Operator. It is supported on Kubernetes v1.32 clusters, running on NVIDIA HGX GB200 NVL, and with CDI enabled on your GPU Operator. * Transitioned to installing the open kernel modules by default starting with R570 driver containers. @@ -172,7 +193,7 @@ New Features In previous versions, the ``useOpenKernelModules`` field specified the driver containers to install the NVIDIA Open GPU kernel module driver. This field is now deprecated and will be removed in a future release. - If you were using the ``useOpenKernelModules`` field, it's recommended that you update your configuration to use the ``kernelModuleType`` field instead. + If you were using the ``useOpenKernelModules`` field, NVIDIA recommends that you update your configuration to use the ``kernelModuleType`` field instead. * Added support for Ubuntu 24.04 LTS. @@ -195,7 +216,7 @@ New Features * Added support for new MIG profiles with HGX B200. 
- * Added support for the following profiles: + * Supports these profiles: * ``1g.23gb`` * ``1g.23gb+me`` @@ -205,7 +226,7 @@ New Features * ``4g.90gb`` * ``7g.180gb`` - * Added an ``all-balanced`` profile creates the following GPU instances: + * Added an ``all-balanced`` profile that creates the following GPU instances: * ``1g.23gb`` :math:`\times` 2 * ``2g.45gb`` :math:`\times` 1 @@ -213,7 +234,7 @@ New Features * Added support for new MIG profiles with HGX GB200. - * Added support for the following profiles: + * Supports these profiles: * ``1g.24gb`` * ``1g.24gb+me`` @@ -223,7 +244,7 @@ New Features * ``4g.95gb`` * ``7g.189gb`` - * Added an ``all-balanced`` profile creates the following GPU instances: + * Added an ``all-balanced`` profile that creates the following GPU instances: * ``1g.24gb`` :math:`\times` 2 * ``2g.47gb`` :math:`\times` 1 @@ -247,7 +268,7 @@ Fixed Issues Long response times of the `lsmod` commands were causing timeout errors in the probe and unnecessary restarts of the container, resulting in the DaemonSet being in a bad state. * Fixed an issue where the GPU Operator failed to create a valid DaemonSet name on OpenShift Container Platform when using 64 kernel page size. - Refer to Github `issue #1207 `__ for more details. + Refer to GitHub `issue #1207 `__ for more details. * Removed deprecated `operator.defaultRuntime`` parameter. @@ -266,7 +287,7 @@ New Features Refer to the :ref:`GPU Operator Component Matrix` on the platform support page for more details on supported drivers. -* This release adds support for NVIDIA Container Toolkit 1.17.4. This version includes updates for `NVIDIA CVE-2025-23359 `__. +* Added support for NVIDIA Container Toolkit 1.17.4. This version includes updates for `NVIDIA CVE-2025-23359 `__. To view any published security bulletins for NVIDIA products, refer to the NVIDIA product security page at https://www.nvidia.com/en-us/security/. @@ -287,7 +308,7 @@ New Features Refer to the :ref:`GPU Operator Component Matrix` on the platform support page. -* This release adds support for NVIDIA Container Toolkit 1.17.3. +* Added support for NVIDIA Container Toolkit 1.17.3. This version includes updates for: * `NVIDIA CVE-2024-0135 `__ @@ -298,7 +319,7 @@ New Features For more information regarding NVIDIA security vulnerability remediation policies, refer to https://www.nvidia.com/en-us/security/psirt-policies/. -* Added support for the following software component versions: +* Includes these software component versions: - NVIDIA Container Toolkit v1.17.3 - NVIDIA DCGM v3.3.9-1 @@ -319,10 +340,10 @@ Fixed Issues ------------ * Fixed an issue where NVIDIA Container Toolkit would fail to start on Rancher RKE2, K3s, and Canonical MicroK8s. - Refer to Github `issue #1109 `__ for more details. + Refer to GitHub `issue #1109 `__ for more details. * Fixed an issue where events were not being generated by the NVIDIA driver upgrade controller. - Refer to Github `issue #1101 `__ for more details. + Refer to GitHub `issue #1101 `__ for more details. .. _v24.9.0: @@ -334,7 +355,7 @@ Fixed Issues New Features ------------ -* This release adds support for NVIDIA Container Toolkit 1.17.0. +* Added support for NVIDIA Container Toolkit 1.17.0. This version includes updates for the following CVEs: * `NVIDIA CVE-2024-0134 `__ @@ -349,7 +370,7 @@ New Features Refer to the :ref:`GPU Operator Component Matrix` on the platform support page. 
-* Added support for the following software component versions: +* Includes these software component versions: - NVIDIA Container Toolkit v1.17.0 - NVIDIA Driver Manager for Kubernetes v0.7.0 @@ -378,7 +399,7 @@ New Features * Added support for new MIG profiles with GH200 NVL2 144GB HBM3e. - * Added support for the following profiles: + * Supports these profiles: * ``1g.18gb`` * ``1g.18gb+me`` @@ -388,14 +409,14 @@ New Features * ``4g.72gb`` * ``7g.144gb`` - * Added an ``all-balanced`` profile creates the following GPU instances: + * Added an ``all-balanced`` profile that creates the following GPU instances: * ``1g.18gb`` :math:`\times` 2 * ``2g.36gb`` :math:`\times` 1 * ``3g.72gb`` :math:`\times` 1 * Added support for KubeVirt and OpenShift Virtualization with vGPU v17.4 for A30, A100, and H100 GPUs. - These GPUs are supported with an NVIDIA AI Enterpise subscription only and require building the NVIDIA vGPU Manager container image with the ``aie`` .run file. + These GPUs are supported with an NVIDIA AI Enterprise subscription only and require building the NVIDIA vGPU Manager container image with the ``aie`` .run file. * Revised roles and role-based access controls for the Operator. The Operator is revised to use Kubernetes controller-runtime caching that is limited to the Operator namespace and the OpenShift namespace, ``openshift``. @@ -410,7 +431,7 @@ New Features Fixed Issues ------------ -* Fixed an issue with the clean up CRD and upgrade CRD jobs that are triggered by Helm hooks. +* Fixed an issue with the cleanup CRD and upgrade CRD jobs that are triggered by Helm hooks. On clusters that have nodes with taints, even when ``operator.tolerations`` includes tolerations, the jobs are not scheduled. In this release, the tolerations that you specify for the Operator are applied to the jobs. For more information about the hooks, refer to :ref:`Option 2: Automatically Upgrading CRDs Using a Helm Hook`. @@ -443,7 +464,7 @@ New Features **This release provides critical security updates and is recommended for all users.** -This release adds support for NVIDIA Container Toolkit 1.16.2. +Added support for NVIDIA Container Toolkit 1.16.2. This version includes updates for the following CVEs: * `NVIDIA CVE-2024-0132 `__ @@ -463,7 +484,7 @@ For more information regarding NVIDIA security vulnerability remediation policie New Features ------------ -* Added support for the following software component versions: +* Includes these software component versions: - NVIDIA Kubernetes Device Plugin v0.16.2 - NVIDIA GPU Feature Discovery for Kubernetes v0.16.2 @@ -477,10 +498,10 @@ Fixed Issues ------------ * Fixed an issue with role-based access controls that prevented a service account from accessing config maps. - Refer to Github `issue #883 `__ for more details. + Refer to GitHub `issue #883 `__ for more details. * Fixed an issue with role-based access controls in the GPU Operator validator that prevented retrieving NVIDIA Driver daemon set information. On OpenShift Container Platform, this issue triggered `GPUOperatorNodeDeploymentDriverFailed` alerts. - Refer to Github `issue #892 `__ for more details. + Refer to GitHub `issue #892 `__ for more details. .. _v24.6.0: @@ -497,7 +518,7 @@ New Features Refer to the :ref:`GPU Operator Component Matrix` on the platform support page. 
-* Added support for the following software component versions: +* Includes these software component versions: - NVIDIA Container Toolkit v1.16.1 - NVIDIA Driver Manager for Kubernetes v0.6.10 @@ -550,7 +571,7 @@ New Features * ``4g.71gb`` * ``7g.141gb`` - * Added an ``all-balanced`` profile for H20 devices that creates the following GPU instances: + * Added an ``all-balanced`` profile for H200 devices that creates the following GPU instances: * ``1g.12gb`` :math:`\times` 2 * ``2g.24gb`` :math:`\times` 1 @@ -580,10 +601,10 @@ Fixed Issues * Fixed an issue with Node Feature Discovery (NFD). When an NFD pod was deleted or restarted, all NFD node labels were removed from the node and GPU Operator operands were restarted. The v0.16.2 release of NFD fixes the issue. - Refer to Github `issue #782 `__ for more details. + Refer to GitHub `issue #782 `__ for more details. * Fixed an issue with NVIDIA vGPU Manager not working correctly on nodes with GPUs that require Open Kernel module drivers and GPU System Processor (GSP) firmware. - Refer to Github `issue #761 `__ for more details. + Refer to GitHub `issue #761 `__ for more details. * DGCM is revised to use a cluster IP and a service with the internal traffic policy set to ``Local``. In previous releases, DCGM was a host networked pod. @@ -591,16 +612,16 @@ Fixed Issues * Fixed an issue that prevented enabling GDRCopy and additional volume mounts with the NVIDIA Driver custom resource. Previously, the driver daemon set did not update with the change and the Operator logs included an error message. - Refer to Github `issue #713 `__ for more details. + Refer to GitHub `issue #713 `__ for more details. * Fixed an issue with deleting GPU Driver daemon sets due to having misscheduled pods rather than zero pods. Previously, if a node had an untolerated taint such as ``node.kubernetes.io/unreachable:NoSchedule``, the Operator could repeatedly delete and recreate the driver daemon sets. - Refer to Github `issue #715 `__ for more details. + Refer to GitHub `issue #715 `__ for more details. * Fixed an issue with reporting the correct GPU capacity and allocatable resources from the KubeVirt GPU Device Plugin. Previously, if a GPU became unavailable, the reported GPU capacity and allocatable resources remained unchanged. - Refer to Github `issue #97 `__ for more details. + Refer to GitHub `issue #97 `__ for more details. .. _v24.6.0-known-limitations: @@ -628,7 +649,7 @@ Known Limitations The technical preview feature that provides :doc:`gpu-driver-configuration` is an alternative. * The ``nouveau`` driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error ``Failed to enable MSI-X`` in the system journal logs. - Additionally, all GPU operator pods become stuck in the ``Init`` state. + Additionally, all GPU Operator pods become stuck in the ``Init`` state. * When using RHEL 8 with containerd as the runtime and SELinux is enabled (either in permissive or enforcing mode) at the host level, containerd must also be configured for SELinux, such as setting the ``enable_selinux=true`` configuration option. @@ -658,7 +679,7 @@ New Features Refer to the :ref:`GPU Operator Component Matrix` on the platform support page. 
-* Added support for the following software component versions: +* Includes these software component versions: - NVIDIA Container Toolkit version v1.15.0 - NVIDIA MIG Manager version v0.7.0 @@ -705,17 +726,17 @@ Fixed Issues In this release, the init container no longer runs because the MOFED installation check is performed by the Kubernetes Driver Manager init container. * Previously, for Red Hat OpenShift Container Platform, the GPU driver installation would fail when the Linux kernel version did not match the ``/etc/os-release`` file. In this release, the Kernel version is determined from the running kernel to prevent the issue. - Refer to Github `issue #617 `__ for more details. + Refer to GitHub `issue #617 `__ for more details. * Previously, if the metrics for DCGM Exporter were configured in a config map and the cluster policy specified the name of the config map as ``:`` in the ``DCGM_EXPORTER_CONFIGMAP_DATA`` environment variable, the exporter pods could not read the configuration from the config map. In this release, the role used by the exporter is granted access to read from config maps. * Previously, under load, the Operator could fail with the message ``fatal error: concurrent map read and map write``. In this release, the Operator controller is refactored to prevent the race condition. - Refer to Github `issue #689 `__ for more details. + Refer to GitHub `issue #689 `__ for more details. * Previously, if any node in the cluster was in the ``NotReady`` state, the GPU driver upgrade controller failed to make progress. In this release, the upgrade library is updated and skips unhealthy nodes. - Refer to Github `issue #688 `__ for more details. + Refer to GitHub `issue #688 `__ for more details. .. _v24.3.0-known-limitations: @@ -784,7 +805,7 @@ Known Limitations The technical preview feature that provides :doc:`gpu-driver-configuration` is an alternative. * The ``nouveau`` driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error ``Failed to enable MSI-X`` in the system journal logs. - Additionally, all GPU operator pods become stuck in the ``Init`` state. + Additionally, all GPU Operator pods become stuck in the ``Init`` state. * When using RHEL 8 with containerd as the runtime and SELinux is enabled (either in permissive or enforcing mode) at the host level, containerd must also be configured for SELinux, such as setting the ``enable_selinux=true`` configuration option. @@ -812,7 +833,7 @@ New Features Refer to :ref:`Supported Operating Systems and Kubernetes Platforms` on the platform support page. -* Added support for the following software component versions: +* Includes these software component versions: - NVIDIA Data Center GPU Driver version 550.54.14 - NVIDIA Container Toolkit version v1.14.6 @@ -837,11 +858,11 @@ Fixed issues * Previously, common daemon set labels, annotations, and tolerations configured in ClusterPolicy were not also applied to the default NVIDIADriver CR instance. - Refer to Github `issue #665 `__ for more details. + Refer to GitHub `issue #665 `__ for more details. * Previously, the technical preview NVIDIA driver custom resource was failing to render the ``licensing-config`` volume mount that is required for licensing a vGPU guest driver. - Refer to Github `issue #672 `__ for more details. + Refer to GitHub `issue #672 `__ for more details. * Previously, the technical preview NVIDIA driver custom resource was broken when GDS was enabled. 
An OS suffix was not appended to the image path of the GDS driver container image. @@ -899,7 +920,7 @@ Known Limitations The technical preview feature that provides :doc:`gpu-driver-configuration` is an alternative. * The ``nouveau`` driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error ``Failed to enable MSI-X`` in the system journal logs. - Additionally, all GPU operator pods become stuck in the ``Init`` state. + Additionally, all GPU Operator pods become stuck in the ``Init`` state. * When using RHEL 8 with containerd as the runtime and SELinux is enabled (either in permissive or enforcing mode) at the host level, containerd must also be configured for SELinux, such as setting the ``enable_selinux=true`` configuration option. @@ -940,7 +961,7 @@ New Features - Refer to :doc:`gpu-driver-configuration` for information about setting ``spec.useOpenKernelModules`` if you manage the driver containers with the technology preview NVIDIA driver custom resource. -* Added support for the following software component versions: +* Includes these software component versions: - NVIDIA Data Center GPU Driver version 535.129.03 - NVIDIA Driver Manager for Kubernetes v0.6.5 @@ -975,7 +996,7 @@ Fixed issues * Previously, the technical preview NVIDIA driver custom resource did not support adding custom labels, annotations, or tolerations to the pods that run as part of the driver daemon set. This limitation prevented scheduling the driver daemon set in some environments. - Refer to GitHub `issue #602 `_ for more details. + Refer to GitHub `issue #602 `__ for more details. * Previously, when you specified the ``operator.upgradeCRD=true`` argument to the ``helm upgrade`` command, the pre-upgrade hook ran with the ``gpu-operator`` service account @@ -1048,7 +1069,7 @@ Known Limitations The technical preview feature that provides :doc:`gpu-driver-configuration` is an alternative. * The ``nouveau`` driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error ``Failed to enable MSI-X`` in the system journal logs. - Additionally, all GPU operator pods become stuck in the ``Init`` state. + Additionally, all GPU Operator pods become stuck in the ``Init`` state. * When using RHEL 8 with containerd as the runtime and SELinux is enabled (either in permissive or enforcing mode) at the host level, containerd must also be configured for SELinux, such as setting the ``enable_selinux=true`` configuration option. @@ -1083,7 +1104,7 @@ New Features When you view the cluster policy with a command like ``kubectl get cluster-policy``, the response now includes a ``Status.Conditions`` field. -* Added support for the following software component versions: +* Includes these software component versions: - NVIDIA Data Center GPU Driver version 535.104.12. - NVIDIA Driver Manager for Kubernetes v0.6.4 @@ -1145,7 +1166,7 @@ Known Limitations The technical preview feature that provides :doc:`gpu-driver-configuration` is an alternative. * The ``nouveau`` driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error ``Failed to enable MSI-X`` in the system journal logs. - Additionally, all GPU operator pods become stuck in the ``Init`` state. + Additionally, all GPU Operator pods become stuck in the ``Init`` state. 
* When using RHEL 8 with containerd as the runtime and SELinux is enabled (either in permissive or enforcing mode) at the host level, containerd must also be configured for SELinux, such as setting the ``enable_selinux=true`` configuration option. @@ -1315,7 +1336,7 @@ Known Limitations * The NVIDIA GPU Operator can only be used to deploy a single NVIDIA GPU Driver type and version. The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster. * The ``nouveau`` driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error ``Failed to enable MSI-X`` in the system journal logs. - Additionally, all GPU operator pods become stuck in the ``Init`` state. + Additionally, all GPU Operator pods become stuck in the ``Init`` state. * When using RHEL 8 with Kubernetes, SELinux must be enabled (either in permissive or enforcing mode) for use with the GPU Operator. Additionally, network-restricted environments are not supported. @@ -1402,7 +1423,7 @@ Known Limitations * The NVIDIA GPU Operator can only be used to deploy a single NVIDIA GPU Driver type and version. The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster. * The ``nouveau`` driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error ``Failed to enable MSI-X`` in the system journal logs. - Additionally, all GPU operator pods become stuck in the ``Init`` state. + Additionally, all GPU Operator pods become stuck in the ``Init`` state. * When using RHEL 8 with Kubernetes, SELinux must be enabled (either in permissive or enforcing mode) for use with the GPU Operator. Additionally, network-restricted environments are not supported. @@ -1435,8 +1456,8 @@ New Features on the platform support page. * Added support for Node Feature Discovery v0.12.1. - This release adds support for using the NodeFeature API CRD for labelling nodes - instead of labelling nodes over gRPC. + Added support for using the NodeFeature API CRD for labeling nodes + instead of labeling nodes over gRPC. The :ref:`documentation for upgrading the Operator manually ` is updated to include applying the custom resource definitions for Node Feature Discovery. @@ -1502,20 +1523,20 @@ Fixed issues is used. Previously, if you needed to provide CA certificates, the certificates were not installed correctly. The certificates are now installed in the correct directories. - Refer to GitHub `issue #299 `_ for more details. + Refer to GitHub `issue #299 `__ for more details. * Fixed an issue that created audit log records related to deprecated API requests for pod security policy. on Red Hat OpenShift Container Platform. Refer to GitHub `issue #451 `_ - and `issue #490 `_ for more details. + and `issue #490 `__ for more details. * Fixed an issue that caused the Operator to attempt to add a pod security policy on pre-release versions of Kubernetes v1.25. - Refer to GitHub `issue #484 `_ for more details. + Refer to GitHub `issue #484 `__ for more details. * Fixed a race condition that is related to preinstalled GPU drivers, validator pods, and the device plugin pods. The race condition can cause the device plugin pods to set the wrong path to the GPU driver. - Refer to GitHub `issue #508 `_ for more details. + Refer to GitHub `issue #508 `__ for more details. * Fixed an issue with the driver manager that prevented the manager from accurately detecting whether a node has preinstalled GPU drivers. 
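Editor's note: two host-level settings recur in the Known Limitations hunks in this patch without ever being shown: blacklisting ``nouveau`` on vGPU nodes and turning on SELinux support in containerd for RHEL 8. A conventional sketch of both; the ``blacklist-nouveau.conf`` filename is only a convention, the initramfs-rebuild command is distro-specific, and ``enable_selinux`` is a containerd CRI plugin option set in ``/etc/containerd/config.toml`` rather than anything named in this patch:

.. code-block:: console

   # Blacklist the nouveau driver so the NVIDIA driver can bind to the GPU.
   $ cat /etc/modprobe.d/blacklist-nouveau.conf
   blacklist nouveau
   options nouveau modeset=0
   $ sudo update-initramfs -u   # dracut --force on RHEL-family hosts
   $ sudo reboot

   # On RHEL 8 with SELinux enabled, confirm containerd's CRI plugin has SELinux on.
   $ sudo containerd config dump | grep enable_selinux
   enable_selinux = true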
@@ -1524,7 +1545,7 @@ Fixed issues and to check the output from executing the file. * Fixed an issue that prevented adding custom annotations to daemon sets that the Operator starts. - Refer to GitHub `issue #499 `_ for more details. + Refer to GitHub `issue #499 `__ for more details. * Fixed an issue that is related to not starting the GPU Feature Discovery (GFD) pods when the DCGM Exporter service monitor is enabled, but a service monitor custom resource definition does not exist. @@ -1567,7 +1588,7 @@ Common Vulnerabilities and Exposures (CVEs) ------------------------------------------- The ``gpu-operator:v23.3.0`` and ``gpu-operator-validator:v23.3.0`` images have the following known high-vulnerability CVEs. -These CVEs are from the base images and are not in libraries that are used by the GPU operator: +These CVEs are from the base images and are not in libraries that are used by the GPU Operator: * ``openssl-libs`` - `CVE-2023-0286 `_ * ``platform-python`` and ``python3-libs`` - `CVE-2023-24329 `_ @@ -1593,7 +1614,7 @@ Known Limitations * The NVIDIA GPU Operator can only be used to deploy a single NVIDIA GPU Driver type and version. The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster. * The ``nouveau`` driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error ``Failed to enable MSI-X`` in the system journal logs. - Additionally, all GPU operator pods become stuck in the ``Init`` state. + Additionally, all GPU Operator pods become stuck in the ``Init`` state. * When using RHEL 8 with Kubernetes, SELinux must be enabled (either in permissive or enforcing mode) for use with the GPU Operator. Additionally, network-restricted environments are not supported. @@ -1631,14 +1652,14 @@ Fixed issues * Fixed an issue where containers allocated GPU lose access to them when systemd is triggered to run some reevaluation of the cgroups it manages. The issue affects systems using runc configured with systemd cgroups. - Refer to Github `issue #430 `_ for more details. -* Fixed an issue that prevented the GPU operator from applying PSA labels on the namespace when no prior labels existed. + Refer to GitHub `issue #430 `__ for more details. +* Fixed an issue that prevented the GPU Operator from applying PSA labels on the namespace when no prior labels existed. Common Vulnerabilities and Exposures (CVEs) ------------------------------------------- The ``gpu-operator:v22.9.2`` and ``gpu-operator:v22.9.2-ubi8`` images have the following known high-vulnerability CVEs. -These CVEs are from the base images and are not in libraries that are used by the GPU operator: +These CVEs are from the base images and are not in libraries that are used by the GPU Operator: * ``libksba`` - `CVE-2022-47629 `_ @@ -1653,7 +1674,7 @@ Known Limitations * The NVIDIA GPU Operator can only be used to deploy a single NVIDIA GPU Driver type and version. The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster. * The ``nouveau`` driver must be blacklisted when using NVIDIA vGPU. Otherwise the driver fails to initialize the GPU with the error ``Failed to enable MSI-X`` in the system journal logs. - Additionally, all GPU operator pods become stuck in the ``Init`` state. + Additionally, all GPU Operator pods become stuck in the ``Init`` state. * When using RHEL 8 with Kubernetes, SELinux must be enabled (either in permissive or enforcing mode) for use with the GPU Operator. 
Additionally, network-restricted environments are not supported. @@ -1689,7 +1710,7 @@ Improvements Fixed issues ------------ -* Fix repeated restarts of container-toolkit when used with containerd versions ``v1.6.9`` and above. Refer to Github `issue #432 `_ for more details. +* Fix repeated restarts of container-toolkit when used with containerd versions ``v1.6.9`` and above. Refer to GitHub `issue #432 `__ for more details. * Disable creation of PodSecurityPolicies (PSP) with K8s versions ``1.25`` and above as it is removed. Common Vulnerabilities and Exposures (CVEs) @@ -2050,7 +2071,7 @@ Known Limitations Fixed issues ------------ -* NFD version bumped to `v0.8.2` to support correct kernel version labelling on Anthos nodes. See `NFD issue `_ for more details. +* NFD version bumped to `v0.8.2` to support correct kernel version labeling on Anthos nodes. See `NFD issue `__ for more details. ---- @@ -2087,8 +2108,8 @@ Fixed issues Known Limitations ------------------ * DCGM does not support profiling metrics on RTX 6000 and RTX 8000. Support will be added in a future release of DCGM Exporter. -* After un-install of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using - ``sudo rmmod nvidia nvidia_modeset nvidia_uvm`` command before re-installing GPU Operator again. +* After uninstall of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using + ``sudo rmmod nvidia nvidia_modeset nvidia_uvm`` command before re-installing GPU Operator. * When MIG strategy of ``mixed`` is configured, device-plugin-validation may stay in ``Pending`` state due to incorrect GPU resource request type. User would need to modify the pod spec to apply correct resource type to match the MIG devices configured in the cluster. @@ -2142,8 +2163,8 @@ Fixed issues Known Limitations ------------------ * DCGM does not support profiling metrics on RTX 6000 and RTX 8000. Support will be added in a future release of DCGM Exporter. -* After un-install of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using - ``sudo rmmod nvidia nvidia_modeset nvidia_uvm`` command before re-installing GPU Operator again. +* After uninstall of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using + ``sudo rmmod nvidia nvidia_modeset nvidia_uvm`` command before re-installing GPU Operator. * When MIG strategy of ``mixed`` is configured, device-plugin-validation may stay in ``Pending`` state due to incorrect GPU resource request type. User would need to modify the pod spec to apply correct resource type to match the MIG devices configured in the cluster. * ``gpu-operator-resources`` project in Red Hat OpenShift requires label ``openshift.io/cluster-monitoring=true`` for Prometheus to collect DCGM metrics. User will need to add this @@ -2204,8 +2225,8 @@ Known Limitations * The GPU Operator v1.5.x does not support mixed types of GPUs in the same cluster. All GPUs within a cluster need to be either NVIDIA vGPUs, GPU Passthrough GPUs or Bare Metal GPUs. * GPU Operator v1.5.x with NVIDIA vGPUs support Turing and newer GPU architectures. * DCGM does not support profiling metrics on RTX 6000 and RTX 8000. Support will be added in a future release of DCGM Exporter. -* After un-install of GPU Operator, NVIDIA driver modules might still be loaded. 
Either reboot the node or forcefully remove them using - ``sudo rmmod nvidia nvidia_modeset nvidia_uvm`` command before re-installing GPU Operator again. +* After uninstall of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using + ``sudo rmmod nvidia nvidia_modeset nvidia_uvm`` command before re-installing GPU Operator. * When MIG strategy of ``mixed`` is configured, device-plugin-validation may stay in ``Pending`` state due to incorrect GPU resource request type. User would need to modify the pod spec to apply correct resource type to match the MIG devices configured in the cluster. * ``gpu-operator-resources`` project in Red Hat OpenShift requires label ``openshift.io/cluster-monitoring=true`` for Prometheus to collect DCGM metrics. User will need to add this @@ -2253,8 +2274,8 @@ Fixed issues Known Limitations ------------------ -* After un-install of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using - ``sudo rmmod nvidia nvidia_modeset nvidia_uvm`` command before re-installing GPU Operator again. +* After uninstall of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using + ``sudo rmmod nvidia nvidia_modeset nvidia_uvm`` command before re-installing GPU Operator. ---- @@ -2277,12 +2298,12 @@ Improvements Fixed issues ------------ * Fixed the ordering of the state machine to ensure that the driver daemonset is deployed before the other components. This fix addresses the issue - where the NVIDIA container toolkit would be setup as the default runtime, causing the driver container initialization to fail. + where the NVIDIA container toolkit would be set up as the default runtime, causing the driver container initialization to fail. Known Limitations ------------------ -* After un-install of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using - ``sudo rmmod nvidia nvidia_modeset nvidia_uvm`` command before re-installing GPU Operator again. +* After uninstall of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using + ``sudo rmmod nvidia nvidia_modeset nvidia_uvm`` command before re-installing GPU Operator. ---- @@ -2301,14 +2322,14 @@ Improvements Fixed issues ------------ -* Fixed an issue where the GPU operator would not correctly detect GPU nodes due to inconsistent PCIe node labels. +* Fixed an issue where the GPU Operator would not correctly detect GPU nodes due to inconsistent PCIe node labels. * Fixed a race condition where some of the NVIDIA pods would start out of order resulting in some pods in ``RunContainerError`` state. * Fixed an issue in the driver container where the container would fail to install on systems with the ``linux-gke`` kernel due to not finding the kernel headers. Known Limitations ------------------ -* After un-install of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using - ``sudo rmmod nvidia nvidia_modeset nvidia_uvm`` command before re-installing GPU Operator again. +* After uninstall of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using + ``sudo rmmod nvidia nvidia_modeset nvidia_uvm`` command before re-installing GPU Operator. 
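Editor's note: the uninstall limitation repeated in these hunks is easy to check before reinstalling. One small refinement over the command quoted above, not taken from this patch: unloading the dependent modules (``nvidia_uvm``, ``nvidia_modeset``) before ``nvidia`` itself avoids an in-use error:

.. code-block:: console

   # Check whether NVIDIA kernel modules survived the uninstall.
   $ lsmod | grep nvidia

   # Either reboot the node, or unload the modules before reinstalling the GPU Operator.
   $ sudo rmmod nvidia_uvm nvidia_modeset nvidia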
---- @@ -2327,7 +2348,7 @@ Improvements Fixed issues ------------ -* Fixed an issue with the toolkit container which would setup the NVIDIA runtime under ``/run/nvidia`` with a symlink to ``/usr/local/nvidia``. +* Fixed an issue with the toolkit container which would set up the NVIDIA runtime under ``/run/nvidia`` with a symlink to ``/usr/local/nvidia``. If a node was rebooted, this would prevent any containers from being run with Docker as the container runtime configured in ``/etc/docker/daemon.json`` would not be available after reboot. * Fixed a race condition with the creation of the CRD and registration. @@ -2347,14 +2368,13 @@ Fixed Issues ------------- * The driver container now sets up the required dependencies on ``i2c`` and ``ipmi_msghandler`` modules. * Fixed an issue with the validation steps (for the driver and device plugin) taking considerable time. Node provisioning times are now improved by 5x. -* The SRO custom resource definition is setup as part of the operator. -* Fixed an issue with the clean up of driver mount files when deleting the operator from the cluster. This issue used to require a reboot of the node, which is no longer required. +* The SRO custom resource definition is set up as part of the operator. +* Fixed an issue with the cleanup of driver mount files when deleting the operator from the cluster. This issue previously required a reboot of the node. .. _operator-known-limitations: Known Limitations ------------------ -* After un-install of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using - ``sudo rmmod nvidia nvidia_modeset nvidia_uvm`` command before re-installing GPU Operator again. - +* After uninstall of GPU Operator, NVIDIA driver modules might still be loaded. Either reboot the node or forcefully remove them using + ``sudo rmmod nvidia nvidia_modeset nvidia_uvm`` command before re-installing GPU Operator. diff --git a/gpu-operator/troubleshooting.rst b/gpu-operator/troubleshooting.rst index 9fadab98d..71ddb9c34 100644 --- a/gpu-operator/troubleshooting.rst +++ b/gpu-operator/troubleshooting.rst @@ -145,7 +145,7 @@ Operator validator pods crashing with "error code system not yet initialized" ***************************************************************************** When the operator validator pods are crashing with this error, this most likely points to a GPU node that is NVSwitch-based and requires the nvidia-fabricmanager to be installed. -NVSwitch-based systems, like NVIDIA DGX and NVIDIA HGX server systems, require the memory fabric to be setup after the GPU driver is installed. +NVSwitch-based systems, like NVIDIA DGX and NVIDIA HGX server systems, require the memory fabric to be set up after the GPU driver is installed. Learn more about the Fabric Manager from the `Fabric Manager user guide `_ .. rubric:: Action @@ -304,7 +304,7 @@ The above node only advertises 7 GPU devices as allocatable when we expect it to DCGM Exporter pods go into CrashLoopBackoff ******************************************* -By default, the gpu-operator only deploys the ``dcgm-exporter`` while disabling the standalone ``dcgm``. In this setup, the ``dcgm-exporter`` spawns a dcgm process locally. If, however, ``dcgm`` is enabled and deployed as a separate pod/container, then the ``dcgm-exporter`` will attempt to connect to the ``dcgm`` pod through a Kubernetes service. 
If the cluster networking settings aren't applied correctly, you would likely see error messages as mentioned below in the ``dcgm-exporter`` logs: +By default, the GPU Operator only deploys the ``dcgm-exporter`` while disabling the standalone ``dcgm``. In this setup, the ``dcgm-exporter`` spawns a dcgm process locally. If, however, ``dcgm`` is enabled and deployed as a separate pod/container, then the ``dcgm-exporter`` will attempt to connect to the ``dcgm`` pod through a Kubernetes service. If the cluster networking settings aren't applied correctly, you would likely see error messages as mentioned below in the ``dcgm-exporter`` logs: .. code-block:: console @@ -376,7 +376,7 @@ This is due to a regression in NVML introduced in the R570 drivers starting from .. rubric:: Action :class: h4 -It's recommended that you downgrade to driver version 570.86.15 to work around this issue. +NVIDIA recommends that you downgrade to driver version 570.86.15 to work around this issue. **************************************************** GPU Operator Validator: Failed to Create Pod Sandbox @@ -601,7 +601,7 @@ The return values for the ``nvidia-smi`` command are listed below. .. rubric:: Root Cause :class: h4 -The ``nvidia-smi`` command should return a success code (return code 0) for the driver-validator container to pass and GPU operator to successfully deploy driver pod on the node. +The ``nvidia-smi`` command should return a success code (return code 0) for the driver-validator container to pass and GPU Operator to successfully deploy driver pod on the node. .. rubric:: Action :class: h4 diff --git a/gpu-operator/upgrade.rst b/gpu-operator/upgrade.rst index e2c656330..bc6fde092 100644 --- a/gpu-operator/upgrade.rst +++ b/gpu-operator/upgrade.rst @@ -57,7 +57,7 @@ Option 1: Manually Upgrading CRDs B["Upgrade by using Helm"] -With this procedure, all existing GPU operator resources are updated inline and the cluster policy resource is patched with updates from ``values.yaml``. +With this procedure, all existing GPU Operator resources are updated inline and the cluster policy resource is patched with updates from ``values.yaml``. #. Specify the Operator release tag in an environment variable: diff --git a/repo.toml b/repo.toml index 08960c88d..a9f9f1b76 100644 --- a/repo.toml +++ b/repo.toml @@ -167,7 +167,7 @@ docs_root = "${root}/gpu-operator" project = "gpu-operator" name = "NVIDIA GPU Operator" version = "25.3.2" -source_substitutions = { version = "v25.3.2", recommended = "570.172.08" } +source_substitutions = { version = "v25.3.2", recommended = "580.65.06" } copyright_start = 2020 sphinx_exclude_patterns = [ "life-cycle-policy.rst",