diff --git a/docs/advanced/advanced.rst b/docs/advanced/advanced.rst index 09c69f03..cb860d8d 100644 --- a/docs/advanced/advanced.rst +++ b/docs/advanced/advanced.rst @@ -22,6 +22,6 @@ Advanced Configurations .. toctree:: Proxy & Air-gapped - DOCA Driver Container + DOCA-OFED Driver Container Other Advanced Configurations Container Images Digests \ No newline at end of file diff --git a/docs/advanced/doca-drivers.rst b/docs/advanced/doca-drivers.rst index 1c18230e..0e54a6de 100644 --- a/docs/advanced/doca-drivers.rst +++ b/docs/advanced/doca-drivers.rst @@ -1,20 +1,20 @@ .. headings # #, * *, =, -, ^, ", ~ .. include:: ../common/vars.rst -**************************** -NVIDIA DOCA Driver Container -**************************** +********************************* +NVIDIA DOCA-OFED Driver Container +********************************* .. contents:: On this page :depth: 2 :local: :backlinks: none -================================================== -NVIDIA DOCA Driver Container Environment Variables -================================================== +======================================================= +NVIDIA DOCA-OFED Driver Container Environment Variables +======================================================= -The following are special environment variables supported by the NVIDIA DOCA Driver container to configure its behavior: +The following are special environment variables supported by the NVIDIA DOCA-OFED Driver container to configure its behavior: .. 
list-table:: :header-rows: 1 @@ -28,7 +28,7 @@ The following are special environment variables supported by the NVIDIA DOCA Dri - Create an udev rule to preserve "old-style" path based netdev names e.g enp3s0f0 * - UNLOAD_STORAGE_MODULES - "false" - - | Unload host storage modules prior to loading NVIDIA DOCA Driver modules: + - | Unload host storage modules prior to loading NVIDIA DOCA-OFED Driver modules: | * ib_isert | * nvme_rdma | * nvmet_rdma @@ -37,19 +37,19 @@ The following are special environment variables supported by the NVIDIA DOCA Dri | * ib_srpt * - ENABLE_NFSRDMA - "false" - - Enable loading of NFS & NVME related storage modules from a NVIDIA DOCA Driver container + - Enable loading of NFS & NVME related storage modules from an NVIDIA DOCA-OFED Driver container * - RESTORE_DRIVER_ON_POD_TERMINATION - "false" - Restore host drivers when a container -In addition, it is possible to specify any environment variables to be exposed to the NVIDIA DOCA Driver container, such as the standard "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY". +In addition, it is possible to specify any environment variables to be exposed to the NVIDIA DOCA-OFED Driver container, such as the standard "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY". .. warning:: CREATE_IFNAMES_UDEV is set automatically by the Network Operator, depending on the Operating System of the worker nodes in the cluster (the cluster is assumed to be homogenous). .. warning:: - When ENABLE_NFSRDMA is set to `true`, it is not possible to load NVME related storage modules from NVIDIA DOCA Driver container when they are in use by the system - (e.g the system has NVMe SSD drives in use). User should ensure the modules are not in use and blacklist them prior to the use of NVIDIA DOCA Driver container. + When ENABLE_NFSRDMA is set to `true`, it is not possible to load NVME related storage modules from the NVIDIA DOCA-OFED Driver container when they are in use by the system + (e.g., the system has NVMe SSD drives in use). 
User should ensure the modules are not in use and blacklist them prior to the use of NVIDIA DOCA-OFED Driver container. These variables can be set in the NicClusterPolicy. For example: @@ -71,9 +71,9 @@ These variables can be set in the NicClusterPolicy. For example: .. _advanced-configurations-precompiled: -========================================================================= -Precompiled Container Build Instructions for NVIDIA DOCA Driver Container -========================================================================= +============================================================================== +Precompiled Container Build Instructions for NVIDIA DOCA-OFED Driver Container +============================================================================== ------------- Prerequisites @@ -84,7 +84,7 @@ Before you begin, ensure that you have the following prerequisites: - Docker (Ubuntu) / Podman (RH) installed on your build system. - Web access to NVIDIA NIC drivers sources. Latest NIC drivers are published at `NVIDIA DOCA Downloads `_, for example: `https://linux.mellanox.com/public/repo/doca/2.10.0/SOURCES/MLNX_OFED/MLNX_OFED_SRC-debian-25.01-0.6.0.0.tgz `_ -**NOTE:** NVIDIA NIC driver sources are bundled as part of NVIDIA DOCA package. Both the DOCA package version and its corresponding NIC driver (DOCA Driver) version need to be specified to fetch the correct driver sources when building the driver container. +**NOTE:** NVIDIA NIC driver sources are bundled as part of NVIDIA DOCA package. Both the DOCA package version and its corresponding NIC driver (DOCA-OFED Driver) version need to be specified to fetch the correct driver sources when building the driver container. For example, given a DOCA package version (e.g `2.10.0`) you can find the corresponding MLNX_OFED version at the link: ``_ which is `25.01-0.6.0.0'` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -114,7 +114,7 @@ The Dockerfile consists of the following stages: 1. 
**Base Image Update**: The base image is updated and common requirements are installed. This stage sets up the basic environment for the subsequent stages. -2. **Download Driver Sources**: This stage downloads the NVIDIA DOCA Driver sources to the specified path. It prepares the necessary files for the driver build process. +2. **Download Driver Sources**: This stage downloads the NVIDIA DOCA-OFED Driver sources to the specified path. It prepares the necessary files for the driver build process. 3. **Build Driver**: The driver is built using the downloaded sources and installed on the container. This stage ensures that the driver is compiled and configured correctly for the target system. diff --git a/docs/advanced/images-sha256.rst b/docs/advanced/images-sha256.rst index e329146e..02af155a 100644 --- a/docs/advanced/images-sha256.rst +++ b/docs/advanced/images-sha256.rst @@ -125,9 +125,9 @@ NVIDIA Network Operator Container Images - v0.0.3 - sha256:d6a2546a8a65e1034d08ab7d85819f062769842dc96513b4fec44f75d3077316 -============================ -DOCA Driver Container Images -============================ +================================= +DOCA-OFED Driver Container Images +================================= .. list-table:: @@ -141,7 +141,7 @@ DOCA Driver Container Images - 25.04-0.6.1.0-2 -The followings tags are available for the above DOCA Driver container version: +The following tags are available for the above DOCA-OFED Driver container version: ------ Ubuntu diff --git a/docs/advanced/proxy-airgapped.rst b/docs/advanced/proxy-airgapped.rst index 87fef1b6..90f82303 100644 --- a/docs/advanced/proxy-airgapped.rst +++ b/docs/advanced/proxy-airgapped.rst @@ -72,7 +72,7 @@ This section describes how to successfully deploy the Network Operator in cluste By default, the Network Operator requires internet access for the following reasons: - The container images must be pulled during the Network Operator installation. 
- - The DOCA Driver container must download several OS packages prior to the driver installation. + - The DOCA-OFED Driver container must download several OS packages prior to the driver installation. To address these requirements, it may be necessary to create a local image registry and/or a local package repository, so that the necessary images and packages will be available for your cluster. Subsequent sections of this document detail how to configure the Network Operator to use local image registries and local package repositories. @@ -91,25 +91,25 @@ Pulling and Pushing Container Images to a Local Registry To pull the correct images from the NVIDIA registry, you can leverage the fields ``repository``, ``image`` and ``version`` specified in the ``values.yaml`` file or in the :ref:`container_images_digest` section. -NicClusterPolicy supports use of image container digest in the `version` field, except for DOCA driver. +NicClusterPolicy supports use of image container digest in the `version` field, except for DOCA-OFED driver. -There is one caveat with regards to the DOCA driver image. The version field must be appended by the OS name and Architecture running on the worker node. +There is one caveat with regards to the DOCA-OFED driver image. The version field must be appended by the OS name and Architecture running on the worker node. -For example for DOCA driver version |doca-driver-version|, the tag for Ubuntu 24.04 with X86 architecture is "|doca-driver-version|-ubuntu24.04-amd64". -Available DOCA driver image tags can be found at `NGC `_. +For example for DOCA-OFED driver version |doca-driver-version|, the tag for Ubuntu 24.04 with X86 architecture is "|doca-driver-version|-ubuntu24.04-amd64". +Available DOCA-OFED driver image tags can be found at `NGC `_. In case of local registry required authentication, make sure to create a pull secret and configure in NicClusterPolicy accordingly. .. 
note:: - NVIDIA Network Operator communicates with the Image Registry configured for the DOCA Driver in the NICClusterPolicy to list the available tags. - Specifying pull secret is required in the NicClusterPolicy DOCA Driver section, even if global container access credentials are configured on nodes. + NVIDIA Network Operator communicates with the Image Registry configured for the DOCA-OFED Driver in the NICClusterPolicy to list the available tags. + Specifying pull secret is required in the NicClusterPolicy DOCA-OFED Driver section, even if global container access credentials are configured on nodes. ----------------------------------- Configuring Local Registry TLS Cert ----------------------------------- -NVIDIA Network Operator communicates with the Image Registry configured for the DOCA Driver in the NICClusterPolicy to list the available tags. -This is required to verify the availability of precompiled DOCA Driver container images. +NVIDIA Network Operator communicates with the Image Registry configured for the DOCA-OFED Driver in the NICClusterPolicy to list the available tags. +This is required to verify the availability of precompiled DOCA-OFED Driver container images. If the Image Registry uses a TLS certificate that is not issued by a well-known Certificate Authority (CA), it is required to configure the NVIDIA Network Operator with the Certificate. @@ -184,7 +184,7 @@ Local Package Repository .. warning:: The instructions below are provided as reference examples to set up a local package repository for NVIDIA Network Operator. -The DOCA Driver container deployed as part of the Network Operator requires certain packages to be available for the driver installation. 
In restricted internet access or air-gapped installations, users are required to create a local mirror repository for their OS distribution, and make the following packages available: +The DOCA-OFED Driver container deployed as part of the Network Operator requires certain packages to be available for the driver installation. In restricted internet access or air-gapped installations, users are required to create a local mirror repository for their OS distribution, and make the following packages available: .. code-block:: diff --git a/docs/getting-started-kubernetes.rst b/docs/getting-started-kubernetes.rst index 4cd2119d..18e4318c 100644 --- a/docs/getting-started-kubernetes.rst +++ b/docs/getting-started-kubernetes.rst @@ -180,7 +180,7 @@ First install the Network Operator with NFD enabled: enabled: true Once the Network Operator is installed create a NicClusterPolicy with - * DOCA driver + * DOCA-OFED driver * RDMA Shared device plugin configured to a netdev with name ens1f0. @@ -261,7 +261,7 @@ First install the Network Operator with NFD enabled: enabled: true Once the Network Operator is installed create a NicClusterPolicy with: - * DOCA driver + * DOCA-OFED driver * RDMA Shared Device pluging with two RDMA resources - the first mapped to ens1f0 and ens1f1 and the second mapped to ens2f0 and ens2f1. Note: You may need to change the interface names in the NicClusterPolicy to those used by your target nodes. @@ -464,7 +464,7 @@ Network Operator Deployment with a Host Device Network In this mode, the Network Operator could be deployed on virtualized deployments as well. It supports both Ethernet and InfiniBand modes. From the Network Operator perspective, there is no difference between the deployment procedures. To work on a VM (virtual machine), the PCI passthrough must be configured for SR-IOV devices. The Network Operator works both with VF (Virtual Function) and PF (Physical Function) inside the VMs. -.. 
warning:: If the Host Device Network is used without the DOCA Driver, the following packages should be installed: +.. warning:: If the Host Device Network is used without the DOCA-OFED Driver, the following packages should be installed: * the linux-generic package on Ubuntu hosts * the kernel-modules-extra package on the RedHat-based hosts @@ -726,7 +726,7 @@ First install the Network Operator with NFD enabled: enabled: true Once the Network Operator is installed create a NicClusterPolicy with: - * DOCA driver + * DOCA-OFED driver * RDMA shared device plugin * Secondary network * Multus CNI @@ -897,7 +897,7 @@ Network Operator Deployment for GPUDirect Workloads GPUDirect requires the following: -* NVIDIA DOCA Driver v5.5-1.0.3.2 or newer +* NVIDIA DOCA-OFED Driver v5.5-1.0.3.2 or newer * GPU Operator v1.9.0 or newer * NVIDIA GPU and driver supporting GPUDirect e.g Quadro RTX 6000/8000 or NVIDIA T4/NVIDIA V100/NVIDIA A100 @@ -910,7 +910,7 @@ First install the Network Operator with NFD enabled: enabled: true Once the Network Operator is installed create a NicClusterPolicy with: - * DOCA driver + * DOCA-OFED driver * SR-IOV Device Plugin * Secondary network * Multus CNI @@ -1090,7 +1090,7 @@ First install the Network Operator with NFD and SRIOV Network Operator enabled: enabled: true Once the Network Operator is installed create a NicClusterPolicy with: - * DOCA driver + * DOCA-OFED driver * Secondary network * Multus CNI * IPoIB CNI @@ -1352,7 +1352,7 @@ Network Operator Deployment with an SR-IOV InfiniBand Network Network Operator deployment with InfiniBand network requires the following: -* NVIDIA DOCA Driver and OpenSM running. OpenSM runs on top of the NVIDIA DOCA Driver stack, so both the driver and the subnet manager should come from the same installation. Note that partitions that are configured by OpenSM should specify defmember=full to enable the SR-IOV functionality over InfiniBand. For more details, please refer to this `article `_. 
+* NVIDIA DOCA-OFED Driver and OpenSM running. OpenSM runs on top of the NVIDIA DOCA-OFED Driver stack, so both the driver and the subnet manager should come from the same installation. Note that partitions that are configured by OpenSM should specify defmember=full to enable the SR-IOV functionality over InfiniBand. For more details, please refer to this `article `_. * InfiniBand device – Both the host device and switch ports must be enabled in InfiniBand mode. * rdma-core package should be installed when an inbox driver is used. @@ -1367,7 +1367,7 @@ First install the Network Operator with NFD and SR-IOV Network Operator enabled: enabled: true Once the Network Operator is installed create a NicClusterPolicy with: - * DOCA driver + * DOCA-OFED driver * Secondary network * Multus CNI * Container Networking Plugins @@ -1512,7 +1512,7 @@ Network Operator Deployment with an SR-IOV InfiniBand Network with PKey Manageme Network Operator deployment with InfiniBand network requires the following: -* NVIDIA DOCA Driver and OpenSM running. OpenSM runs on top of the NVIDIA DOCA Driver stack, so both the driver and the subnet manager should come from the same installation. Note that partitions that are configured by OpenSM should specify defmember=full to enable the SR-IOV functionality over InfiniBand. For more details, please refer to `this article`_. +* NVIDIA DOCA-OFED Driver and OpenSM running. OpenSM runs on top of the NVIDIA DOCA-OFED Driver stack, so both the driver and the subnet manager should come from the same installation. Note that partitions that are configured by OpenSM should specify defmember=full to enable the SR-IOV functionality over InfiniBand. For more details, please refer to `this article`_. * NVIDIA UFM running on top of OpenSM. For more details, please refer to `the project documentation`_. * InfiniBand device – Both the host device and the switch ports must be enabled in InfiniBand mode. 
* rdma-core package should be installed when an inbox driver is used. @@ -1559,7 +1559,7 @@ First install the Network Operator with NFD enabled: resourcePrefix: "nvidia.com" Once the Network Operator is installed create a NicClusterPolicy with: - * DOCA driver + * DOCA-OFED driver * ibKubernetes * Secondary network * Multus CNI @@ -1645,7 +1645,7 @@ Create IPPool object for nv-ipam - key: node-role.kubernetes.io/worker operator: Exists -Wait for NVIDIA DOCA Driver to install and apply the following CRs: +Wait for NVIDIA DOCA-OFED Driver to install and apply the following CRs: ``sriov-ib-network-node-policy.yaml`` @@ -1759,7 +1759,7 @@ Network Operator Deployment for DPDK Workloads with NicClusterPolicy .. _HUGEPAGE: http://manpages.ubuntu.com/manpages/focal/man8/hugeadm.8.html -This deployment mode supports DPDK applications. In order to run DPDK applications, HUGEPAGE_ should be configured on the required K8s Worker Nodes. By default, the inbox operating system driver is used. For support of cases with specific requirements, DOCA Driver container should be deployed. +This deployment mode supports DPDK applications. In order to run DPDK applications, HUGEPAGE_ should be configured on the required K8s Worker Nodes. By default, the inbox operating system driver is used. For support of cases with specific requirements, DOCA-OFED Driver container should be deployed. Network Operator deployment with: @@ -1878,6 +1878,8 @@ Network Operator Deployment and OpenvSwitch offload - managed OpenvSwitch .. warning:: This feature is supported only for Vanilla Kubernetes deployments with SR-IOV Network Operator. +.. warning:: To use DOCA-OFED Driver container with this mode of operation, set the `RESTORE_DRIVER_ON_POD_TERMINATION` environment variable to `false` in the driver configuration section in the NicClusterPolicy. Restoration to the inbox driver is not supported for this feature. + .. warning:: Tech Preview feature. 
@@ -2196,7 +2198,7 @@ Please see the following DOCA documentation for OVS hardware offload verificatio Network Operator Deployment and OpenvSwitch offload - externally managed OpenvSwitch with VF lag ------------------------------------------------------------------------------------------------ -.. warning:: This feature is not compatible with the DOCA Driver container. +.. warning:: This feature is not compatible with the DOCA-OFED Driver container. .. warning:: This feature is supported only for Vanilla Kubernetes deployments with SR-IOV Network Operator. @@ -2938,7 +2940,7 @@ NIC Configuration Operator updates status conditions of the NicDevice CR to set message: Device firmware '20.42.1000' matches to recommended version '20.42.1000' lastTransitionTime: "2024-09-21T08:43:10Z" -`FirmwareConfigMatch` condition status is set to `Unknown` if DOCA Driver is not installed otherwise it notifies if current NIC firmware is recommended or not recommended by DOCA Driver. E.g.: +`FirmwareConfigMatch` condition status is set to `Unknown` if DOCA-OFED Driver is not installed otherwise it notifies if current NIC firmware is recommended or not recommended by DOCA-OFED Driver. E.g.: .. code-block:: bash diff --git a/docs/getting-started-openshift.rst b/docs/getting-started-openshift.rst index 0798896a..bb3db45a 100644 --- a/docs/getting-started-openshift.rst +++ b/docs/getting-started-openshift.rst @@ -36,7 +36,7 @@ Network Operator Deployment on an OpenShift Container Platform It is recommended to have dedicated control plane nodes for OpenShift deployments with NVIDIA Network Operator. .. warning:: - Automatic DOCA Driver Upgrade doesn't work on Single Node OpenShift (SNO) deployments. + Automatic DOCA-OFED Driver Upgrade doesn't work on Single Node OpenShift (SNO) deployments. 
---------------------- Node Feature Discovery @@ -107,7 +107,7 @@ If you are planning to use SR-IOV, follow these `instructions -n nvidia-network-operator -It is possible to remove all pods with secondary networks from all cluster nodes, and then restart the DOCA Driver pods on all nodes at once. +It is possible to remove all pods with secondary networks from all cluster nodes, and then restart the DOCA-OFED Driver pods on all nodes at once. The alternative option is to perform an upgrade in a rolling manner to reduce the impact of the driver upgrade on the cluster. The driver pod restart can be done on each node individually. In this case, pods with secondary networks should be removed from the single node only. There is no need to stop pods on all nodes. For each node, follow these steps to reload the driver on the node: 1. Remove pods with a secondary network from the node. -2. Restart the DOCA Driver pod. +2. Restart the DOCA-OFED Driver pod. 3. Return the pods with a secondary network to the node. -When the DOCA Driver is ready, proceed with the same steps for other nodes. +When the DOCA-OFED Driver is ready, proceed with the same steps for other nodes. #################################################### Removing Pods with a Secondary Network from the Node @@ -227,11 +227,11 @@ To remove pods with a secondary network from the node with node drain, run the f .. warning:: Replace with -l "network.nvidia.com/operator.mofed.wait=false" if you wish to drain all nodes at once. -############################## -Restarting the DOCA Driver Pod -############################## +################################### +Restarting the DOCA-OFED Driver Pod +################################### -Find the DOCA Driver pod name for the node: +Find the DOCA-OFED Driver pod name for the node: .. 
code-block:: bash @@ -243,25 +243,25 @@ Example for Ubuntu 20.04: kubectl get pod -l app=mofed-ubuntu20.04 -o wide -A -########################################## -Deleting the DOCA Driver Pod from the Node -########################################## +############################################### +Deleting the DOCA-OFED Driver Pod from the Node +############################################### -To delete the DOCA Driver pod from the node, run: +To delete the DOCA-OFED Driver pod from the node, run: .. code-block:: bash $ kubectl delete pod -n -.. warning:: Replace with -l app=mofed-ubuntu20.04 if you wish to remove DOCA Driver pods on all nodes at once. +.. warning:: Replace with -l app=mofed-ubuntu20.04 if you wish to remove DOCA-OFED Driver pods on all nodes at once. -A new version of the DOCA Driver pod will automatically start. +A new version of the DOCA-OFED Driver pod will automatically start. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Returning Pods with a Secondary Network to the Node ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -After the DOCA Driver pod is ready on the node, you can make the node schedulable again. +After the DOCA-OFED Driver pod is ready on the node, you can make the node schedulable again. The command below will uncordon (remove node.kubernetes.io/unschedulable:NoSchedule taint) the node, and return the pods to it: @@ -269,11 +269,11 @@ The command below will uncordon (remove node.kubernetes.io/unschedulable:NoSched $ kubectl uncordon -l "network.nvidia.com/operator.mofed.wait=false" ------------------------------- -Automatic DOCA Driver Upgrade ------------------------------- +---------------------------------- +Automatic DOCA-OFED Driver Upgrade +---------------------------------- -To enable automatic DOCA Driver upgrade, define the UpgradePolicy section for the ofedDriver in the NicClusterPolicy spec, and change the DOCA Driver version. 
+To enable automatic DOCA-OFED Driver upgrade, define the UpgradePolicy section for the ofedDriver in the NicClusterPolicy spec, and change the DOCA-OFED Driver version. ``nicclusterpolicy.yaml``: @@ -344,9 +344,9 @@ The status upgrade of each node is reflected in its nvidia.com/ofed-driver-upgra * - Unknown (empty) - The node has this state when the upgrade flow is disabled or the node has not been processed yet. * - ``upgrade-done`` - - Set when DOCA Driver POD is up-to-date and running on the node, the node is schedulable. + - Set when DOCA-OFED Driver POD is up-to-date and running on the node, the node is schedulable. * - ``upgrade-required`` - - Set when DOCA Driver POD on the node is not up-to-date and requires upgrade. No actions are performed at this stage. + - Set when DOCA-OFED Driver POD on the node is not up-to-date and requires upgrade. No actions are performed at this stage. * - ``node-maintenance-required`` - Set when requestor mode upgrade is used, e.g. `MAINTENANCE_OPERATOR_ENABLED=true`, post `upgrade-required` state. Essentially it will create a matching nodeMaintenance object for dedicated node(s), utilizing maintenance operator to perform its node operations. * - ``cordon-required`` @@ -356,9 +356,9 @@ The status upgrade of each node is reflected in its nvidia.com/ofed-driver-upgra * - ``drain-required`` - Set when the node is scheduled for drain. After the drain, the state is changed either to pod-restart-required or upgrade-failed. * - ``pod-restart-required`` - - Set when the DOCA Driver POD on the node is scheduled for restart. After the restart, the state is changed to uncordon-required. + - Set when the DOCA-OFED Driver POD on the node is scheduled for restart. After the restart, the state is changed to uncordon-required. * - ``uncordon-required`` - - Set when DOCA Driver POD on the node is up-to-date and has "Ready" status. 
After uncordone, the state is changed to upgrade-done + - Set when DOCA-OFED Driver POD on the node is up-to-date and has "Ready" status. After uncordon, the state is changed to upgrade-done * - ``upgrade-failed`` - Set when the upgrade on the node has failed. Manual interaction is required at this stage. See Troubleshooting section for more details. @@ -392,7 +392,7 @@ Upgrade modes .. _maintenance-operator repo: https://github.com/Mellanox/maintenance-operator -DOCA Driver upgrade supports the following modes: +DOCA-OFED Driver upgrade supports the following modes: .. list-table:: :header-rows: 1 @@ -402,7 +402,7 @@ DOCA Driver upgrade supports the following modes: * - In-place - In-place (legacy) mode is incorporates full driver upgrade lifecycle, including nodes operations e.g. cordon, pod eviction, drain, uncordon. It also maintains an internal scheduler for performing above node operations, according to provided ``maxParallelUpgrades`` under ``UpgradePolicy``. * - Requestor - - New ``requestor`` upgrade mode uses NVIDIA maintenance operator (please refer to `maintenance-operator repo`_) nodeMaintenance k8s API objects, to initiate the DOCA driver upgrade process. Essentially, it will retire current upgrade controller (in-place mode) from performing the following node operations: cordon, wait for pods completion, drain, uncordon. To enable requestor mode, the following environment variable should be enabled ``MAINTENANCE_OPERATOR_ENABLED=true``. + - The ``requestor`` upgrade mode uses the NVIDIA maintenance operator (please refer to `maintenance-operator repo`_) nodeMaintenance k8s API objects to initiate the DOCA-OFED driver upgrade process. Essentially, it relieves the current upgrade controller (in-place mode) of the following node operations: cordon, wait for pod completion, drain, uncordon. To enable requestor mode, set the environment variable ``MAINTENANCE_OPERATOR_ENABLED=true``. .. 
note:: Enabling requestor mode will require deployment of NVIDIA maintenance operator on the cluster. By default, upgrade controller will use in-place mode. @@ -431,11 +431,11 @@ Safe Driver Loading .. warning:: The state of this feature can be controlled with the ofedDriver.upgradePolicy.safeLoad option. -Upon node startup, the DOCA Driver container takes some time to compile and load the driver. During that time, workloads might get scheduled on that node. When DOCA Driver is loaded, all existing PODs that use NVIDIA NICs will lose their network interfaces. Some such PODs might silently fail or hang. To avoid this situation, before the DOCA Driver container is loaded, the node should get cordoned and drained to ensure all workloads are rescheduled. The node should be un-cordoned when the driver is ready on it. +Upon node startup, the DOCA-OFED Driver container takes some time to compile and load the driver. During that time, workloads might get scheduled on that node. When DOCA-OFED Driver is loaded, all existing PODs that use NVIDIA NICs will lose their network interfaces. Some such PODs might silently fail or hang. To avoid this situation, before the DOCA-OFED Driver container is loaded, the node should get cordoned and drained to ensure all workloads are rescheduled. The node should be un-cordoned when the driver is ready on it. -The safe driver loading feature is implemented as a part of the upgrade flow, meaning safe driver loading is a special scenario of the upgrade procedure, where we upgrade from the inbox driver to the containerized DOCA Driver. +The safe driver loading feature is implemented as a part of the upgrade flow, meaning safe driver loading is a special scenario of the upgrade procedure, where we upgrade from the inbox driver to the containerized DOCA-OFED Driver. -When this feature is enabled, the initial DOCA Driver driver rollout on the large cluster can take a while. 
To speed up the rollout, the initial deployment can be done with the safe driver loading feature disabled, and this feature can be enabled later by updating the NicClusterPolicy CRD. +When this feature is enabled, the initial DOCA-OFED Driver rollout on a large cluster can take a while. To speed up the rollout, the initial deployment can be done with the safe driver loading feature disabled, and this feature can be enabled later by updating the NicClusterPolicy CRD. ^^^^^^^^^^^^^^^ Troubleshooting @@ -448,16 +448,16 @@ Troubleshooting - Required Action * - The node is in upgrade-failed state. - * Drain the node manually by running kubectl drain --ignore-daemonsets. - * Delete the NVIDIA DOCA Driver pod on the node manually, by running the following command: ``kubectl delete pod -n `kubectl get pods --A --field-selector spec.nodeName= -l nvidia.com/ofed-driver --no-headers | awk '{print $1 " "$2}'```. + * Delete the NVIDIA DOCA-OFED Driver pod on the node manually, by running the following command: ``kubectl delete pod -n `kubectl get pods --A --field-selector spec.nodeName= -l nvidia.com/ofed-driver --no-headers | awk '{print $1 " "$2}'```. **NOTE:** If the "Safe driver loading" feature is enabled, you may also need to remove the ``nvidia.com/ofed-driver-upgrade.driver-wait-for-safe-load`` annotation from the node object to unblock the loading of the driver ``kubectl annotate node nvidia.com/ofed-driver-upgrade.driver-wait-for-safe-load-`` * Wait for the node to complete the upgrade. - * - The updated NVIDIA DOCA Driver pod failed to start/ a new version of NVIDIA DOCA Driver cannot be installed on the node. + * - The updated NVIDIA DOCA-OFED Driver pod failed to start/ a new version of NVIDIA DOCA-OFED Driver cannot be installed on the node. - Manually delete the pod by using ``kubectl delete -n ``. 
- If following the restart the pod still fails, change the NVIDIA DOCA Driver version in the NicClusterPolicy to the previous version or to another working version. + If following the restart the pod still fails, change the NVIDIA DOCA-OFED Driver version in the NicClusterPolicy to the previous version or to another working version. ================================= Uninstalling the Network Operator diff --git a/docs/platform-support.rst b/docs/platform-support.rst index e44580cf..47208e78 100644 --- a/docs/platform-support.rst +++ b/docs/platform-support.rst @@ -228,15 +228,15 @@ Supported Container Runtimes - No - -======================================================= -Supported Precompiled Container Images for DOCA Drivers -======================================================= +============================================================ +Supported Precompiled Container Images for DOCA-OFED Drivers +============================================================ -------- Overview -------- -To save startup time and operational effort, precompiled DOCA driver container images are available for common OS/flavor/kernel/architecture variants. +To save startup time and operational effort, precompiled DOCA-OFED driver container images are available for common OS/flavor/kernel/architecture variants. The container image tag pattern used for common variants is: **driver_ver-container_ver-kernel_ver-flavor-os-arch**. 
 For example: ``24.07-0.6.1.0-0-6.8.0-49-generic-ubuntu24.04-amd64``
@@ -246,7 +246,7 @@ The container image tag pattern used for common variants is: **driver_ver-contai
 Supported Operating Systems
 ---------------------------
-Currently precompiled DOCA driver container images are provided for the following operating systems:
+Currently precompiled DOCA-OFED driver container images are provided for the following operating systems:
 - Ubuntu 24.04 (amd64/arm64)
 - Ubuntu 22.04 (amd64/arm64)
@@ -255,7 +255,7 @@ Currently precompiled DOCA driver container images are provided for the followin
 Limitations
 -----------
-- NVIDIA supports precompiled driver containers for the most recently released DOCA GA drivers.
+- NVIDIA supports precompiled driver containers for the most recently released DOCA-OFED GA drivers.
 - NVIDIA builds precompiled driver containers for ``generic``, ``nvidia``, ``aws``, ``azure``, and ``oracle`` kernel flavors.
 - Precompiled driver containers are currently unsigned.
 - If your hosts use a different kernel variant, you can create a custom precompiled driver container and host it in your own container registry. Please refer to :ref:`advanced-configurations-precompiled` section.
diff --git a/docs/release-notes.rst b/docs/release-notes.rst
index 28c51cf2..7c264d0b 100644
--- a/docs/release-notes.rst
+++ b/docs/release-notes.rst
@@ -38,7 +38,7 @@ Changes and New Features
 - Description
 * - 25.4.0
 - | - Added support for NVIDIA NIC Configuration Operator deployment through NicClusterPolicy CR, since using Helm chart will be deprecated in future releases.
- | - Integrate NVIDIA Network Operator with NVIDIA Maintenance Operator for DOCA OFED container upgrade.
+ | - Integrate NVIDIA Network Operator with NVIDIA Maintenance Operator for DOCA-OFED Driver container upgrade.
 | - Added support for OpenShift 4.18.
 | - Added support for ConnectX-8 SuperNIC.
 | - Added support for NVIDIA Spectrum-X Operator deployment - Tech Preview.
@@ -66,16 +66,16 @@ Changes and New Features
 | - Added support for Ubuntu 24.04.
 | - Added support for NVIDIA Grace based ARM platforms with Ubuntu 22.04 and Upstream K8s as a Tech Preview feature.
 | - Added support for NVIDIA IGX Orin based ARM platforms with Ubuntu 22.04 and Upstream K8s as a GA feature.
- | - Added support for Precompiled DOCA Driver containers for Ubuntu 22.04.
+ | - Added support for Precompiled DOCA-OFED Driver containers for Ubuntu 22.04.
 | - Added support for Switchdev SR-IOV mode with SR-IOV Network Operator and OVS CNI as a Tech Preview feature.
 | - Added support for DOCA Telemetry Service (DTS) integration to expose network telemetry and NIC metrics in K8s.
 | - Added support for network namespace isolation of RDMA devices with RDMA CNI
 | - Added support for RHEL and OpenShift deployments with Real-time kernels.
- | - Enhanced DOCA Driver container deployment and significantly reduced compilation time after node reboots.
+ | - Enhanced DOCA-OFED Driver container deployment and significantly reduced compilation time after node reboots.
 * - 24.1.0
 - | - Added support for Ubuntu 22.04 with Upstream K8s on ARM platforms (NVIDIA IGX Orin) - Tech Preview.
 | - Added support for CNI bin directory configuration.
- | - Added support for OpenShift MOFED/DOCA driver container build and deployment via driver toolkit (DTK).
+ | - Added support for OpenShift MOFED/DOCA-OFED driver container build and deployment via driver toolkit (DTK).
 | - Added support for Ubuntu 22.04 deployments with Real-time kernels.
 | - Added the ability to disable SR-IOV VF for SR-IOV Network Operator (in systems with pre-configured SR-IOV).
 | - Added the ability to set resource request and limits on the network operator and it components.
@@ -218,10 +218,10 @@ Known Limitations
 - | - In Infiniband mode, due to a kernel bug, there is a limitation on the number of Virtual Functions (VFs) on a single Physical Function (PF). The recommendation is to create up to 16 VFs per PF.
 Larger number will cause "ip link show dev " to fail with a "Message too long" error.
 | - SR-IOV switchdev mode is not supported on SLES.
- | - In infiniband mode, in case of existing Intel NICs, loaded `irdma` module should be unloaded before deploying DOCA driver.
+ | - In infiniband mode, in case of existing Intel NICs, loaded `irdma` module should be unloaded before deploying DOCA-OFED driver.
 * - 24.10.0
- - | - There is a known limitation when using NVIDIA NICs as **primary network interfaces**. If the NVIDIA DOCA Driver container is configured to be deployed, we cannot guarantee that the inbox or pre-installed NVIDIA NIC driver will unload successfully if it remains in use.
- If the current driver does unload, it removes all NVIDIA NIC networking interfaces and netdevices. DOCA driver container then loads new drivers but only restores **basic configuration** (for example, IP addresses) on the primary network interface’s Physical Function (PF) and its Virtual Functions (VFs). More advanced settings (such as VLANs, bonding, and OVS) will **not** be restored automatically.
+ - | - There is a known limitation when using NVIDIA NICs as **primary network interfaces**. If the NVIDIA DOCA-OFED Driver container is configured to be deployed, we cannot guarantee that the inbox or pre-installed NVIDIA NIC driver will unload successfully if it remains in use.
+ If the current driver does unload, it removes all NVIDIA NIC networking interfaces and netdevices. DOCA-OFED driver container then loads new drivers but only restores **basic configuration** (for example, IP addresses) on the primary network interface’s Physical Function (PF) and its Virtual Functions (VFs). More advanced settings (such as VLANs, bonding, and OVS) will **not** be restored automatically.
 This limitation applies to **all** versions of the NVIDIA Network Operator.
 * - 24.10.0
 - | - There is a known limitation when using `docker` on RHEL 8 and 9.
 If you encounter this issue, it is recommended to use "the preferred, maintained, and supported container runtime of choice for Red Hat Enterprise Linux".
@@ -229,7 +229,7 @@ Known Limitations
 | - In NIC Configuration Operator template v0.1.14 BF2/BF3 DPUs (not SuperNICs) FW reset flow isn't supported.
 | - NVIDIA NIC Configuration Operator v0.1.14 Firmware Mismatch notification feature doesn't support NVIDIA BlueField-3 SuperNIC.
 * - 24.7.0
- - | - In case ENABLE_NFSRDMA is enabled for DOCA Driver container and NVMe modules are loaded in the host system, NVIDA DOCA Driver Container will fail to load.
+ - | - In case ENABLE_NFSRDMA is enabled for DOCA-OFED Driver container and NVMe modules are loaded in the host system, NVIDIA DOCA-OFED Driver Container will fail to load.
 | User should blacklist NVMe modules to prevent them from loading on system boot. If this is not possible (e.g when the system uses NVMe SSD drives) then ENABLE_NFSRDMA must be set to `false`.
 | Using features such as GPU Direct Storage is not supported in such case.
 * - 23.10.0
diff --git a/hack/release/release.go b/hack/release/release.go
index e9ad60db..450615fc 100644
--- a/hack/release/release.go
+++ b/hack/release/release.go
@@ -39,7 +39,7 @@ import (
 // ReleaseImageSpec contains ImageSpec in addition with Image SHA256.
 type ReleaseImageSpec struct {
- // Shas is a list of SHA2256. A list is needed for DOCA drivers that have multiple images.
+ // Shas is a list of SHA256. A list is needed for DOCA-OFED drivers that have multiple images.
 Shas []SHA256ImageRef
 mellanoxv1alpha1.ImageSpec
 }
diff --git a/hack/release/templates/image-sha256/images-sha256.template b/hack/release/templates/image-sha256/images-sha256.template
index c6c3fe7a..7558f742 100644
--- a/hack/release/templates/image-sha256/images-sha256.template
+++ b/hack/release/templates/image-sha256/images-sha256.template
@@ -125,9 +125,9 @@ NVIDIA Network Operator Container Images
 - {{ .SpectrumXOperator.Version }}
 - {{ (imageSha .SpectrumXOperator) }}
-============================
-DOCA Driver Container Images
-============================
+=================================
+DOCA-OFED Driver Container Images
+=================================
 .. list-table::
@@ -141,7 +141,7 @@ DOCA Driver Container Images
 - {{ .Mofed.Version}}
-The followings tags are available for the above DOCA Driver container version:
+The following tags are available for the above DOCA-OFED Driver container version:
 ------
 Ubuntu