From da9fb790794bef61d627293ac11ef029f78f7612 Mon Sep 17 00:00:00 2001 From: Vitaliy Emporopulo Date: Tue, 1 Jul 2025 18:19:32 +0300 Subject: [PATCH] Remove entitled driver builds support - Remove entitled NVIDIA driver builds support - Delete entitlement certificate instructions from appendix-ocp.rst - Update get-entitlement.rst to indicate entitled builds are no longer supported - Add comprehensive troubleshooting guidance for broken Driver Toolkit errors - Remove references to entitlement-based fallbacks in troubleshooting steps - Ensure users focus on resolving NFD/DTK root causes instead of entitled workarounds This addresses the deprecation of entitled driver builds and provides proper guidance for troubleshooting broken DTK errors on OpenShift 4.10+. Signed-off-by: Vitaliy Emporopulo --- openshift/appendix-ocp.rst | 223 ++++---------------------- openshift/get-entitlement.rst | 21 +-- openshift/steps-overview.rst | 14 +- openshift/troubleshooting-gpu-ocp.rst | 8 +- 4 files changed, 42 insertions(+), 224 deletions(-) diff --git a/openshift/appendix-ocp.rst b/openshift/appendix-ocp.rst index 0b6270fd0..22ce929ba 100644 --- a/openshift/appendix-ocp.rst +++ b/openshift/appendix-ocp.rst @@ -9,228 +9,59 @@ Appendix .. _cluster-entitlement: -Enabling a Cluster-wide entitlement -============================================ +Entitled NVIDIA Driver Builds No Longer Supported +================================================= Introduction ------------- -.. note:: +.. important:: - The Driver Toolkit, which enables entitlement-free deployments of the GPU Operator, is available for certain z-streams on OpenShift - 4.8 and all z-streams on OpenShift 4.9. However, some Driver Toolkit images are broken, so we recommend maintaining entitlements for - all OpenShift versions prior to 4.9.9. See :ref:`broken driver toolkit ` for more information. 
+ **Entitled NVIDIA driver builds are deprecated and not supported starting with Red Hat OpenShift 4.10.** -The **NVIDIA GPU Operator** deploys several pods used to manage and enable GPUs for use in the OpenShift Container Platform. -Some of these Pods require packages that are not available by default in the Universal Base Image (UBI) that OpenShift Container -Platform uses. To make packages available to the NVIDIA GPU driver container, you must enable cluster-wide entitled container builds in OpenShift. + The Driver Toolkit (DTK) enables entitlement-free deployments of the GPU Operator. In the past, entitled builds were used before the DTK was available and on some OpenShift versions where Driver Toolkit images were broken. -At a high level, enabling a cluster-wide entitlement involves three steps: + If you encounter the :ref:`"broken driver toolkit detected" ` warning on OpenShift 4.10 or later, you should :ref:`troubleshoot ` to find the root cause instead of falling back to entitled driver builds. -#. Download Red Hat OpenShift Container Platform subscription certificates from the `Red Hat Customer Portal `_ (access requires login credentials). + If the broken DTK warning is encountered on an older version of OpenShift, refer to the documentation for an older version of the NVIDIA GPU Operator to enable entitled builds. Keep in mind that older versions of OpenShift might no longer be supported. -#. Create a ``MachineConfig`` that enables the subscription manager and provides a valid subscription certificate. Wait for the ``MachineConfigOperator`` to reboot the node and finish applying the ``MachineConfig``. +.. _broken-dtk-troubleshooting: -#. Validate that cluster-wide entitlement is working properly. +Troubleshooting Broken Driver Toolkit Errors +-------------------------------------------- -These instructions assume you downloaded an entitlement encoded in base64 from the `Red Hat Customer Portal `_ or extracted it from an existing node. 
+The most likely reason for the broken DTK message is that Node Feature Discovery (NFD) is not working correctly. NFD might be disabled, failing, or not updating the kernel version label for other reasons. Another cause might be a missing or incomplete DTK image stream, for example because of broken mirroring. -Creating entitled containers requires that you assign machine configuration that has a valid Red Hat entitlement certificate to your worker nodes. This step is necessary because Red Hat Enterprise Linux (RHEL) CoreOS nodes are not yet automatically entitled. +Follow these steps for initial troubleshooting of Node Feature Discovery: -.. _obtain-entitlement: - -Obtaining an entitlement certificate ---------------------------------------- - -Follow the guidance below to edit obtain the entitlement certificate. - -#. Navigate to the `Red Hat Customer Portal systems management page `_ and click **New**. - - .. image:: graphics/cluster_entitlement_1.png - -#. Select **Hypervisor** and populate the **Name** field with the text **OpenShift-Entitlement**. - - .. image:: graphics/entitlement_hypervisor.png - -#. Click **CREATE**. - -#. Select the **Subscriptions** tab and click **Attach Subscriptions**. - - .. image:: graphics/cluster_entitlement_3.png - -#. Search for **Red Hat Developer Subscription** [content here may vary according to accounts], select one of them and click **Attach Subscriptions**. - - .. note:: - The **Red Hat Developer Subscription** is choosen here purely for illustrating this example. Choose an appropriate subscription relevant for your your needs. - -#. Click **Download Certificates**. - -.. image:: graphics/cluster_entitlement_5.png - -#. Download and extract the file. - -#. Extract the key *.pem* and test it with this command: - - .. code-block:: console - - $ curl -E .pem -Sfs -k https://cdn.redhat.com/content/dist/rhel8/8/x86_64/baseos/os/repodata/repomd.xml | head -3 - - .. 
note:: - - With a valid key, `curl` downloads the repository entrypoint and shows its `head` shown in the example below. - - With an invalid key, `curl` download is refused by the Red Hat package mirror. - - .. code-block:: console - - - - 1631130504 -Add a cluster-wide entitlement ---------------------------------------- - -Use the following procedure to add a cluster-wide entitlement: - -#. Create a local appropriately named directory. Change to this directory. - -#. Download the :download:`machine config YAML template ` for cluster-wide entitlements on OpenShift Container Platform. Save the downloaded file ``0003-cluster-wide-machineconfigs.yaml.template`` to the directory created in step 1. - -#. Copy the selected ``pem`` file from your entitlement certificate to a local file named ``nvidia.pem``: +#. **Check Node Feature Discovery (NFD) status:** .. code-block:: console - $ cp /.pem nvidia.pem - -#. Generate the MachineConfig file by appending the entitlement certificate: - - .. code-block:: console + $ oc get pods -n openshift-nfd - $ sed -i -f - 0003-cluster-wide-machineconfigs.yaml.template << EOF - s/BASE64_ENCODED_PEM_FILE/$(base64 -w0 nvidia.pem)/g - EOF + Ensure that the NFD pods are running and healthy. If NFD is not deployed or is failing, this can cause DTK issues. -#. Apply the machine config to the OpenShift cluster: +#. **Verify kernel version labels are present and correct:** .. code-block:: console - $ oc apply -f 0003-cluster-wide-machineconfigs.yaml.template + $ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{":\t"}{.metadata.labels.feature\.node\.kubernetes\.io/kernel-version\.full}{"\n"}{end}' - .. note:: This step triggers an update driven by the OpenShift Machine Config Operator and initiates a restart on all worker nodes one by one. + Ensure that every node has a kernel version label and that the label matches the current OpenShift version of the cluster. - .. 
code-block:: console - - machineconfig.machineconfiguration.openshift.io/50-rhsm-conf created - machineconfig.machineconfiguration.openshift.io/50-entitlement-pem created - machineconfig.machineconfiguration.openshift.io/50-entitlement-key-pem created - -#. Check the ``machineconfig``: +#. **Check the Driver Toolkit image stream:** .. code-block:: console - $ oc get machineconfig | grep entitlement + $ oc get -n openshift is/driver-toolkit - .. code-block:: console + Verify that the ``driver-toolkit`` image stream exists and that its tags correspond to the current OpenShift version. - 50-entitlement-key-pem 2.2.0 45s - 50-entitlement-pem 2.2.0 45s - -#. Monitor the ``MachineConfigPool`` object: - - .. code-block:: console - - $ oc get mcp/worker - - .. code-block:: console - - NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE - worker rendered-worker-5f1eaf24c760fb389d47d3c37ef41c29 True False False 2 2 2 0 7h15m - - Here you can see that the MCP is updated, not updating or degraded, so all the ``MachineConfig`` resources have been successfully applied to the nodes and you can proceed to validate the cluster. - -Validate the cluster-wide entitlement ---------------------------------------- - -Validate the cluster-wide entitlement with a test pod that queries a Red Hat subscription repo for the kernel-devel package. - -#. Create a test pod: - - .. code-block:: console - - $ cat << EOF >> mypod.yaml - - apiVersion: v1 - kind: Pod - metadata: - name: cluster-entitled-build-pod - namespace: default - spec: - containers: - - name: cluster-entitled-build - image: registry.access.redhat.com/ubi8:latest - command: [ "/bin/sh", "-c", "dnf search kernel-devel --showduplicates" ] - restartPolicy: Never - EOF - -#. Apply the test pod: - - .. code-block:: console - - $ oc create -f mypod.yaml - - .. code-block:: console - - pod/cluster-entitled-build-pod created - -#. Verify the test pod is created: - - .. 
code-block:: console - - $ oc get pods -n default - - .. code-block:: console - - NAME READY STATUS RESTARTS AGE - cluster-entitled-build-pod 1/1 Completed 0 64m - -#. Validate that the pod can locate the necessary kernel-devel packages: - - .. code-block:: console - - $ oc logs cluster-entitled-build-pod -n default - - .. code-block:: console +For additional troubleshooting resources: - Updating Subscription Management repositories. - Unable to read consumer identity - Subscription Manager is operating in container mode. - Red Hat Enterprise Linux 8 for x86_64 - AppStre 15 MB/s | 14 MB 00:00 - Red Hat Enterprise Linux 8 for x86_64 - BaseOS 15 MB/s | 13 MB 00:00 - Red Hat Universal Base Image 8 (RPMs) - BaseOS 493 kB/s | 760 kB 00:01 - Red Hat Universal Base Image 8 (RPMs) - AppStre 2.0 MB/s | 3.1 MB 00:01 - Red Hat Universal Base Image 8 (RPMs) - CodeRea 12 kB/s | 9.1 kB 00:00 - ====================== Name Exactly Matched: kernel-devel ====================== - kernel-devel-4.18.0-80.1.2.el8_0.x86_64 : Development package for building - : kernel modules to match the kernel - kernel-devel-4.18.0-80.el8.x86_64 : Development package for building kernel - : modules to match the kernel - kernel-devel-4.18.0-80.4.2.el8_0.x86_64 : Development package for building - : kernel modules to match the kernel - kernel-devel-4.18.0-80.7.1.el8_0.x86_64 : Development package for building - : kernel modules to match the kernel - kernel-devel-4.18.0-80.11.1.el8_0.x86_64 : Development package for building - : kernel modules to match the kernel - kernel-devel-4.18.0-147.el8.x86_64 : Development package for building kernel - : modules to match the kernel - kernel-devel-4.18.0-80.11.2.el8_0.x86_64 : Development package for building - : kernel modules to match the kernel - kernel-devel-4.18.0-80.7.2.el8_0.x86_64 : Development package for building - : kernel modules to match the kernel - kernel-devel-4.18.0-147.0.3.el8_1.x86_64 : Development package for building - : kernel modules to match 
the kernel - kernel-devel-4.18.0-147.0.2.el8_1.x86_64 : Development package for building - : kernel modules to match the kernel - kernel-devel-4.18.0-147.3.1.el8_1.x86_64 : Development package for building - : kernel modules to match the kernel - -Any Pod based on RHEL can now run entitled builds. +* `Node Feature Discovery documentation `_ +* `Red Hat Node Feature Discovery Operator documentation `_ +* `OpenShift Driver Toolkit documentation `_ +* `OpenShift Driver Toolkit GitHub repository `_ +* `OpenShift troubleshooting guide `_ diff --git a/openshift/get-entitlement.rst b/openshift/get-entitlement.rst index 057b0d088..2fda85b85 100644 --- a/openshift/get-entitlement.rst +++ b/openshift/get-entitlement.rst @@ -4,24 +4,11 @@ .. _get-entitlement: #################################################### -Obtaining an entitlement certificate +Entitled Driver Builds No Longer Supported #################################################### -Follow the guidance below to edit your cluster subscription setting and obtain the entitlement. +.. important:: -#. Navigate to `https://access.redhat.com/management/systems/`` and click **New**. -Log in to `access.redhat.com `_ . + **Entitled NVIDIA driver builds are deprecated and not supported.** -#. Fill "Virtual Server", "x86_64", 1 core, RHEL 8, and click Create. - - .. image:: graphics/locate-cluster-acm.png - -#. Go to the "Subscription" page and click "Attach Subscriptions"r. - -#. Search for "Red Hat Developer Subscription" [content here may vary according to accounts], tick one of them and click "Attach Subscriptions". - -#. Click "Download Certificates" - -#. Download and extract the file. - -#. 
Extract the key from "consumer_export.zip/export/entitlement_certificates/.pem" and test it with this command: + If you encounter issues with the NVIDIA GPU driver build that might require entitlement, please refer to the Driver Toolkit (DTK) troubleshooting section: :ref:`broken-dtk-troubleshooting` diff --git a/openshift/steps-overview.rst b/openshift/steps-overview.rst index 11cddeba0..80d2bd86d 100644 --- a/openshift/steps-overview.rst +++ b/openshift/steps-overview.rst @@ -120,13 +120,15 @@ A fix for this issue has been merged in the following releases: About the Broken Driver Toolkit ******************************* -OpenShift 4.8.19, 4.8.21, 4.9.8 are known to have a broken Driver Toolkit image. -The following messages are recorded in the driver pod containers. -Follow the guidance in :ref:`enabling a Cluster-wide entitlement `. -Afterward, the ``nvidia-driver-daemonset`` automatically uses an entitlement-based fallback. +.. important:: + + **Entitled NVIDIA driver builds are deprecated and not supported.** + +OpenShift 4.8.19, 4.8.21, and 4.9.8 are known to have a broken Driver Toolkit image. However, on newer OpenShift versions the driver builds rely on the Driver Toolkit (DTK). With these versions, entitled builds are not supported and might not work. + +When the DTK image is broken, the following messages are recorded in the driver pod containers. Follow the guidance in :ref:`broken-dtk-troubleshooting` to troubleshoot the underlying issue. -To disable the use of Driver Toolkit image altogether, edit the cluster policy instance and set ``operator.use_ocp_driver_toolkit`` option to ``false``. -Also, we recommend maintaining entitlements for OpenShift versions < 4.9.9. +If you need to force entitled builds, disable the use of the Driver Toolkit image by editing the cluster policy instance and setting the ``operator.use_ocp_driver_toolkit`` option to ``false``. #. 
View the logs from the OpenShift Driver Toolkit container: diff --git a/openshift/troubleshooting-gpu-ocp.rst b/openshift/troubleshooting-gpu-ocp.rst index 738f0f87f..da63a54c2 100644 --- a/openshift/troubleshooting-gpu-ocp.rst +++ b/openshift/troubleshooting-gpu-ocp.rst @@ -194,11 +194,9 @@ This is an illustrated example of a situation where the deployment of the Operat FATAL: failed to install elfutils packages. RHEL entitlement may be improperly deployed - This message maybe associated with the unsuccessful deployment of the driver toolkit. To confirm the driver toolkit is successfully deployed follow the guidance in :ref:`verify_toolkit`. - If you see this message a workaround is to edit the created ``gpu-cluster-policy`` YAML file in the OpenShift Container Platform console and set ``use_ocp_driver_toolkit`` to ``false``. - - Set up the entitlement. - Refer to :ref:`cluster-entitlement` for more information. + This message may be associated with the unsuccessful deployment of the driver toolkit. To confirm that the driver toolkit is successfully deployed, follow the guidance in :ref:`verify_toolkit`. + If you see this message, you should troubleshoot the underlying issue instead of relying on RHEL entitlement. Entitled driver builds are deprecated and not supported on recent versions of Red Hat OpenShift. + See :ref:`broken-dtk-troubleshooting` for more information. .. _verify_toolkit: