2 changes: 1 addition & 1 deletion container-toolkit/arch-overview.md
@@ -92,7 +92,7 @@ a `prestart` hook into it, and then calls out to the native `runC`, passing it t
For versions of the NVIDIA Container Runtime from `v1.12.0`, this runtime also performs additional modifications to the OCI runtime spec to inject
specific devices and mounts not handled by the NVIDIA Container CLI.

It's important to note that this component is not necessarily specific to docker (but it is specific to `runC`).
It is important to note that this component is not necessarily specific to docker (but it is specific to `runC`).

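For illustration only (an editorial sketch, not part of the original document): because the runtime plugs in at the `runC` level, it can be registered with any `runC`-based engine using the toolkit CLI described below. The commands assume the NVIDIA Container Toolkit is already installed on the host.

```console
# Register the NVIDIA runtime with Docker (updates /etc/docker/daemon.json), then restart the daemon.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# The same CLI can configure containerd instead, illustrating that the runtime is not Docker-specific.
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
```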
### The NVIDIA Container Toolkit CLI

18 changes: 9 additions & 9 deletions gpu-operator/dra-intro-install.rst
@@ -12,7 +12,7 @@ Introduction

With NVIDIA's DRA Driver for GPUs, your Kubernetes workload can allocate and consume the following two types of resources:

* **GPUs**: for controlled sharing and dynamic reconfiguration of GPUs. A modern replacement for the traditional GPU allocation method (using `NVIDIA's device plugin <https://github.com/NVIDIA/k8s-device-plugin>`_). We are excited about this part of the driver; it is however not yet fully supported (Technology Preview).
* **GPUs**: for controlled sharing and dynamic reconfiguration of GPUs. A modern replacement for the traditional GPU allocation method (using `NVIDIA's device plugin <https://github.com/NVIDIA/k8s-device-plugin>`_). NVIDIA is excited about this part of the driver; it is however not yet fully supported (Technology Preview).
* **ComputeDomains**: for robust and secure Multi-Node NVLink (MNNVL) for NVIDIA GB200 and similar systems. Fully supported.

A primer on DRA
@@ -25,7 +25,7 @@ For NVIDIA devices, there are two particularly beneficial characteristics provid
#. A clean way to allocate **cross-node resources** in Kubernetes (leveraged here for providing NVLink connectivity across pods running on multiple nodes).
#. Mechanisms to explicitly **share, partition, and reconfigure** devices **on-the-fly** based on user requests (leveraged here for advanced GPU allocation).

To understand and make best use of NVIDIA's DRA Driver for GPUs, we recommend becoming familiar with DRA by working through the `official documentation <https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/>`_.
To understand and make best use of NVIDIA's DRA Driver for GPUs, NVIDIA recommends becoming familiar with DRA by working through the `official documentation <https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/>`_.


The twofold nature of this driver
@@ -34,7 +34,7 @@ The twofold nature of this driver
NVIDIA's DRA Driver for GPUs is comprised of two subsystems that are largely independent of each other: one manages GPUs, and the other one manages ComputeDomains.

Below, you can find instructions for how to install both parts or just one of them.
Additionally, we have prepared two separate documentation chapters, providing more in-depth information for each of the two subsystems:
Additionally, NVIDIA has prepared two separate documentation chapters, providing more in-depth information for each of the two subsystems:

- :ref:`Documentation for ComputeDomain (MNNVL) support <dra_docs_compute_domains>`
- :ref:`Documentation for GPU support <dra_docs_gpus>`
@@ -52,7 +52,7 @@ Prerequisites
- `CDI <https://github.com/cncf-tags/container-device-interface?tab=readme-ov-file#how-to-configure-cdi>`_ must be enabled in the underlying container runtime (such as containerd or CRI-O).
- NVIDIA GPU Driver 565 or later.

For the last two items on the list above, as well as for other reasons, we recommend installing NVIDIA's GPU Operator v25.3.0 or later.
For the last two items on the list above, as well as for other reasons, NVIDIA recommends installing NVIDIA's GPU Operator v25.3.0 or later.
For detailed instructions, see the official GPU Operator `installation documentation <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-chart-customization-options>`__.
Also note that, in the near future, the preferred method to install NVIDIA's DRA Driver for GPUs will be through the GPU Operator (the DRA driver will then no longer require installation as a separate Helm chart).

@@ -65,8 +65,8 @@ Also note that, in the near future, the preferred method to install NVIDIA's DRA
- Refer to the `docs on installing the GPU Operator with a pre-installed GPU driver <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#pre-installed-nvidia-gpu-drivers>`__.


Configure and Helm-install the driver
=====================================
Configure and install the driver with Helm
==========================================

#. Add the NVIDIA Helm repository:

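   The exact commands are collapsed in this diff; as a hedged sketch only (the repository URL below is an assumption, not taken from the collapsed lines), this step typically looks like:

   .. code-block:: console

      $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
      $ helm repo update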
@@ -103,15 +103,15 @@ All install-time configuration parameters can be listed by running ``helm show v
.. note::

- A common mode of operation for now is to enable only the ComputeDomain subsystem (to have GPUs allocated using the traditional device plugin). The example above achieves that by setting ``resources.gpus.enabled=false``.
- Setting ``nvidiaDriverRoot=/run/nvidia/driver`` above expects a GPU Operator-provided GPU driver. That configuration parameter must be changed in case the GPU driver is installed straight on the host (typically at ``/``, which is the default value for ``nvidiaDriverRoot``).
- Setting ``nvidiaDriverRoot=/run/nvidia/driver`` above expects a GPU Operator-provided GPU driver. That configuration parameter must be changed in case the GPU driver is installed straight on the host (typically at ``/``, which is the default value for ``nvidiaDriverRoot``).
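As a hedged illustration only (the chart name, release name, and namespace below are assumptions, not taken from the original text; the two ``--set`` options are the ones discussed in the note above), an install combining these options might look like:

.. code-block:: console

   $ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
       --create-namespace \
       --namespace nvidia-dra-driver-gpu \
       --set resources.gpus.enabled=false \
       --set nvidiaDriverRoot=/run/nvidia/driver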


Validate installation
=====================

A lot can go wrong, depending on the exact nature of your Kubernetes environment and specific hardware and driver choices as well as configuration options chosen.
That is why we recommend to perform a set of validation tests to confirm the basic functionality of your setup.
To that end, we have prepared separate documentation:
That is why NVIDIA recommends performing a set of validation tests to confirm the basic functionality of your setup.
To that end, NVIDIA has prepared separate documentation:

- `Testing ComputeDomain allocation <https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Validate-setup-for-ComputeDomain-allocation>`_
- `Testing GPU allocation <https://github.com/NVIDIA/k8s-dra-driver-gpu/wiki/Validate-setup-for-GPU-allocation>`_
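As a quick first check before running those tests (a hedged sketch; the namespace is an assumption based on the install example above, and the second command assumes a cluster with the DRA APIs enabled):

.. code-block:: console

   $ kubectl get pods -n nvidia-dra-driver-gpu
   $ kubectl get resourceslices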
2 changes: 1 addition & 1 deletion gpu-operator/getting-started.rst
Original file line number Diff line number Diff line change
Expand Up @@ -277,7 +277,7 @@ To view all the options, run ``helm show values nvidia/gpu-operator``.
- ``{}``

* - ``psp.enabled``
- The GPU operator deploys ``PodSecurityPolicies`` if enabled.
- The GPU Operator deploys ``PodSecurityPolicies`` if enabled.
- ``false``

* - ``sandboxWorkloads.defaultWorkload``
4 changes: 2 additions & 2 deletions gpu-operator/gpu-operator-kata.rst
@@ -19,8 +19,8 @@
..
lingo:

It's "Kata Containers" when referring to the software component.
It's "Kata container" when it's a container that uses the Kata Containers runtime.
It is "Kata Containers" when referring to the software component.
It is "Kata container" when it is a container that uses the Kata Containers runtime.
Treat our operands as proper nouns and use title case.

#################################
2 changes: 1 addition & 1 deletion gpu-operator/gpu-operator-kubevirt.rst
@@ -37,7 +37,7 @@ Given the following node configuration:
* Node B is configured with the label ``nvidia.com/gpu.workload.config=vm-passthrough`` and configured to run virtual machines with Passthrough GPU.
* Node C is configured with the label ``nvidia.com/gpu.workload.config=vm-vgpu`` and configured to run virtual machines with vGPU.

The GPU operator will deploy the following software components on each node:
The GPU Operator will deploy the following software components on each node:

* Node A receives the following software components:
* ``NVIDIA Datacenter Driver`` - to install the driver
2 changes: 1 addition & 1 deletion gpu-operator/gpu-operator-mig.rst
@@ -102,7 +102,7 @@ Perform the following steps to install the Operator and configure MIG:
Known Issue: For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01,
GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs.
This manifests as GPU pods getting stuck indefinitely in the ``Pending`` state.
It's recommended that you downgrade the driver to version 570.86.15 to work around this issue.
NVIDIA recommends that you downgrade the driver to version 570.86.15 to work around this issue.
For more detailed information, see GitHub issue https://github.com/NVIDIA/gpu-operator/issues/1361.
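As a hedged illustration only (the release name, namespace, and the use of ``driver.version`` as the relevant chart parameter are assumptions, not confirmed by the original text), such a downgrade might be applied along these lines:

.. code-block:: console

   $ helm upgrade gpu-operator nvidia/gpu-operator \
       --namespace gpu-operator \
       --reuse-values \
       --set driver.version=570.86.15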


2 changes: 1 addition & 1 deletion gpu-operator/install-gpu-operator-air-gapped.rst
@@ -246,7 +246,7 @@ Sample of ``values.yaml`` for GPU Operator v1.9.0:
Local Package Repository
************************

The ``driver`` container deployed as part of the GPU operator requires certain packages to be available as part of the
The ``driver`` container deployed as part of the GPU Operator requires certain packages to be available as part of the
driver installation. In restricted internet access or air-gapped installations, users are required to create a
local mirror repository for their OS distribution and make the following packages available:

2 changes: 1 addition & 1 deletion gpu-operator/install-gpu-operator-outdated-kernels.rst
@@ -12,7 +12,7 @@ On GPU nodes where the running kernel is not the latest, the ``driver`` containe
see the following error message: ``Could not resolve Linux kernel version``.

In general, upgrading your system to the latest kernel should fix this issue. But if this is not an option, the following is a
workaround to successfully deploy the GPU operator when GPU nodes in your cluster may not be running the latest kernel.
workaround to successfully deploy the GPU Operator when GPU nodes in your cluster may not be running the latest kernel.

Add Archived Package Repositories
=================================
7 changes: 4 additions & 3 deletions gpu-operator/life-cycle-policy.rst
@@ -91,8 +91,9 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
- ${version}

* - NVIDIA GPU Driver |ki|_
- | `575.57.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-575-57-08/index.html>`_
| `570.172.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-172-08/index.html>`_ (default, recommended)
- | `580.65.06 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-580-65-06/index.html>`_ (recommended)
| `575.57.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-575-57-08/index.html>`_
| `570.172.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-172-08/index.html>`_ (default)
| `570.158.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-158-01/index.html>`_
| `570.148.08 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-570-148-08/index.html>`_
| `535.261.03 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-261-03/index.html>`_
@@ -152,7 +153,7 @@ Refer to :ref:`Upgrading the NVIDIA GPU Operator` for more information.
Known Issue: For drivers 570.124.06, 570.133.20, 570.148.08, and 570.158.01,
GPU workloads cannot be scheduled on nodes that have a mix of MIG slices and full GPUs.
This manifests as GPU pods getting stuck indefinitely in the ``Pending`` state.
It's recommended that you downgrade the driver to version 570.86.15 to work around this issue.
NVIDIA recommends that you downgrade the driver to version 570.86.15 to work around this issue.
For more detailed information, see GitHub issue https://github.com/NVIDIA/gpu-operator/issues/1361.


2 changes: 1 addition & 1 deletion gpu-operator/overview.rst
@@ -31,7 +31,7 @@ configuration of multiple software components such as drivers, container runtime
and prone to errors. The NVIDIA GPU Operator uses the `operator framework <https://coreos.com/blog/introducing-operator-framework>`_
within Kubernetes to automate the management of all NVIDIA software components needed to provision GPU. These components include the NVIDIA drivers (to enable CUDA),
Kubernetes device plugin for GPUs, the `NVIDIA Container Toolkit <https://github.com/NVIDIA/nvidia-container-toolkit>`_,
automatic node labelling using `GFD <https://github.com/NVIDIA/gpu-feature-discovery>`_, `DCGM <https://developer.nvidia.com/dcgm>`_ based monitoring and others.
automatic node labeling using `GFD <https://github.com/NVIDIA/gpu-feature-discovery>`_, `DCGM <https://developer.nvidia.com/dcgm>`_ based monitoring and others.


.. card:: Red Hat OpenShift Container Platform
10 changes: 5 additions & 5 deletions gpu-operator/platform-support.rst
@@ -459,8 +459,8 @@ The GPU Operator has been validated in the following scenarios:
Supported Precompiled Drivers
-----------------------------

The GPU Operator has been validated with the following precomplied drivers.
See the :doc:`precompiled-drivers` page for more on using precompiled drivers.
The GPU Operator has been validated with the following precompiled drivers.
See the :doc:`precompiled-drivers` page for more information about using precompiled drivers.

+----------------------------+------------------------+----------------+---------------------+
| Operating System | Kernel Flavor | Kernel Version | CUDA Driver Branch |
@@ -477,10 +477,10 @@ See the :doc:`precompiled-drivers` page for more on using precompiled drivers.
Supported Container Runtimes
----------------------------

The GPU Operator has been validated in the following scenarios:
The GPU Operator has been validated for the following container runtimes:

+----------------------------+------------------------+----------------+
| Operating System | Containerd 1.6 - 2.0 | CRI-O |
| Operating System | Containerd 1.6 - 2.1 | CRI-O |
+============================+========================+================+
| Ubuntu 20.04 LTS | Yes | Yes |
+----------------------------+------------------------+----------------+
@@ -581,4 +581,4 @@ Additional Supported Container Management Tools
-----------------------------------------------

* Helm v3
* Red Hat Operator Lifecycle Manager (OLM)
* Red Hat Operator Lifecycle Manager (OLM)