From a9e00e80eb4b23499b42bf40485a0561a89f1605 Mon Sep 17 00:00:00 2001 From: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com> Date: Thu, 10 Apr 2025 10:45:39 -0400 Subject: [PATCH 1/2] update openshift docs, update mirantis, add selinux notes, update release notes Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com> --- gpu-operator/getting-started.rst | 3 +- gpu-operator/release-notes.rst | 2 +- openshift/openshift-virtualization.rst | 12 +++--- partner-validated/mirantis-mke.rst | 59 +++++++++++++------------- 4 files changed, 40 insertions(+), 36 deletions(-) diff --git a/gpu-operator/getting-started.rst b/gpu-operator/getting-started.rst index 5836667d6..596ca46ed 100644 --- a/gpu-operator/getting-started.rst +++ b/gpu-operator/getting-started.rst @@ -347,7 +347,8 @@ with the NVIDIA GPU Operator. Refer to the :ref:`GPU Operator Component Matrix` on the platform support page. When using RHEL8 with Kubernetes, SELinux must be enabled either in permissive or enforcing mode for use with the GPU Operator. -Additionally, network restricted environments are not supported. +Additionally, when using RHEL8 with containerd as the runtime and SELinux is enabled (either in permissive or enforcing mode) at the host level, containerd must also be configured for SELinux, by setting the ``enable_selinux=true`` configuration option. +Note, network restricted environments are not supported. Pre-Installed NVIDIA GPU Drivers diff --git a/gpu-operator/release-notes.rst b/gpu-operator/release-notes.rst index 75360a4ba..8f6f4df3f 100644 --- a/gpu-operator/release-notes.rst +++ b/gpu-operator/release-notes.rst @@ -86,7 +86,7 @@ New Features * Added support for the NVIDIA Data Center GPU Driver version 570.124.06. -* Added support for KubeVirt and OpenShift Virtualization with vGPU v18 for A30, A100, and H100 GPUs. +* Added support for KubeVirt and OpenShift Virtualization with vGPU v18 on H200NVL. 
* Added support for NVIDIA Network Operator v25.1.0. Refer to :ref:`Support for GPUDirect RDMA` and :ref:`Support for GPUDirect Storage`. diff --git a/openshift/openshift-virtualization.rst b/openshift/openshift-virtualization.rst index ad89ed5d6..8fa89b4a2 100644 --- a/openshift/openshift-virtualization.rst +++ b/openshift/openshift-virtualization.rst @@ -129,6 +129,8 @@ Procedure version: 3.2.0 kernelArguments: - intel_iommu=on + # If you are using AMD CPU, include the following argument: + # - amd_iommu=on #. Create the new ``MachineConfig`` object: @@ -196,7 +198,7 @@ Use the following steps to build the vGPU Manager container and push it to a pri .. code-block:: console - $ cd vgpu-manager/rhel + $ cd vgpu-manager/rhel8 #. Copy the NVIDIA vGPU Manager from your extracted zip file: @@ -216,10 +218,10 @@ Use the following steps to build the vGPU Manager container and push it to a pri $ export PRIVATE_REGISTRY=my/private/registry VERSION=510.73.06 OS_TAG=rhcos4.11 CUDA_VERSION=11.7.1 - .. note:: +.. note:: - The recommended registry to use is the Integrated OpenShift Container Platform registry. - For more information about the registry, see `Accessing the registry `_. + The recommended registry to use is the Integrated OpenShift Container Platform registry. + For more information about the registry, see `Accessing the registry `_. #. Build the NVIDIA vGPU Manager image: @@ -242,7 +244,7 @@ Installing the NVIDIA GPU Operator using the CLI Install the NVIDIA GPU Operator using the guidance at :ref:`Installing the NVIDIA GPU Operator `. - .. note:: When prompted to create a cluster policy follow the guidance :ref:`Creating a ClusterPolicy for the GPU Operator`. +.. note:: When prompted to create a cluster policy follow the guidance :ref:`Creating a ClusterPolicy for the GPU Operator`. 
Create the secret ================= diff --git a/partner-validated/mirantis-mke.rst b/partner-validated/mirantis-mke.rst index 9c5c7189e..dd24286cd 100644 --- a/partner-validated/mirantis-mke.rst +++ b/partner-validated/mirantis-mke.rst @@ -44,6 +44,36 @@ Validated Configuration Matrix - NVIDIA GPU - Hardware Model + * - k0s v1.31.5+k0s / k0rdent 0.1.0 + - v24.9.2 + - | Ubuntu 22.04 + - containerd v1.7.24 with the NVIDIA Container Toolkit v1.17.4 + - 1.31.5 + - Helm v3 + - | 2x NVIDIA RTX 4000 SFF Ada 20GB GDDR6 (ECC) + - | Supermicro SuperServer 6028U-E1CNR4T+ + + | 1000W Supermicro PWS-1K02A-1R + + | 2x Intel Xeon E5-2630v4, 10C/20T 2.2/3.1 GHz LGA 2011-3 25MB 85W + + | 32GB DDR4-2666 RDIMM, M393A4K40BB2-CTD6Q + + | NVMe 960GB PM983 NVMe M.2, MZ1LB960HAJQ-00007 + + | 2 x NVIDIA RTX 4000 SFF Ada 20GB GDDR6 (ECC), 70W, PCIe 4.0x16, 4x + + | 4x Mini DisplayPort 1.4a + + * - MKE 3.8 + - v24.9.2 + - | Ubuntu 22.04 + - Mirantis Container Runtime (MCR) 25.0.1 + - 1.31.5 + - Helm v3 + - | NVIDIA T4 Tensor Core + - | AWS EC2 g4dn.2xlarge (8vcpus/32GB) + * - MKE 3.6.2+ and 3.5.7+ - v23.3.1 - | RHEL 8.7 @@ -71,35 +101,6 @@ Validated Configuration Matrix | 1x RAID Controller PERC H710 | 1x Network card FM487 - * - MKE 3.8 - - v24.9.2 - - | Ubuntu 22.04 - - Mirantis Container Runtime (MCR) 25.0.1 - - 1.31.5 - - Helm v3 - - | NVIDIA T4 Tensor Core - - | AWS EC2 g4dn.2xlarge (8vcpus/32GB) - * - k0s v1.31.5+k0s / k0rdent 0.1.0 - - v24.9.2 - - | Ubuntu 22.04 - - containerd v1.7.24 with the NVIDIA Container Toolkit v1.17.4 - - 1.31.5 - - Helm v3 - - | 2x NVIDIA RTX 4000 SFF Ada 20GB GDDR6 (ECC) - - | Supermicro SuperServer 6028U-E1CNR4T+ - - | 1000W Supermicro PWS-1K02A-1R - - | 2x Intel Xeon E5-2630v4, 10C/20T 2.2/3.1 GHz LGA 2011-3 25MB 85W - - | 32GB DDR4-2666 RDIMM, M393A4K40BB2-CTD6Q - - | NVMe 960GB PM983 NVMe M.2, MZ1LB960HAJQ-00007 - - | 2 x NVIDIA RTX 4000 SFF Ada 20GB GDDR6 (ECC), 70W, PCIe 4.0x16, 4x - - | 4x Mini DisplayPort 1.4a - ************* Prerequisites From 
dfaed61563ff93f8e7101d3bda7e674a91c9d69a Mon Sep 17 00:00:00 2001 From: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com> Date: Fri, 11 Apr 2025 12:15:52 -0400 Subject: [PATCH 2/2] remove cuda_version from vgpu manager image build steps Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com> --- openshift/openshift-virtualization.rst | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/openshift/openshift-virtualization.rst b/openshift/openshift-virtualization.rst index 8fa89b4a2..39c633469 100644 --- a/openshift/openshift-virtualization.rst +++ b/openshift/openshift-virtualization.rst @@ -212,11 +212,10 @@ Use the following steps to build the vGPU Manager container and push it to a pri * ``VERSION`` - The NVIDIA vGPU Manager version downloaded from the NVIDIA Software Portal. * ``OS_TAG`` - This must match the Guest OS version. For RedHat OpenShift, specify ``rhcos4.x`` where _x_ is the supported minor OCP version. - * ``CUDA_VERSION`` - CUDA base image version to build the driver image with. .. code-block:: console - $ export PRIVATE_REGISTRY=my/private/registry VERSION=510.73.06 OS_TAG=rhcos4.11 CUDA_VERSION=11.7.1 + $ export PRIVATE_REGISTRY=my/private/registry VERSION=510.73.06 OS_TAG=rhcos4.11 .. note:: @@ -229,7 +228,6 @@ Use the following steps to build the vGPU Manager container and push it to a pri $ docker build \ --build-arg DRIVER_VERSION=${VERSION} \ - --build-arg CUDA_VERSION=${CUDA_VERSION} \ -t ${PRIVATE_REGISTRY}/vgpu-manager:${VERSION}-${OS_TAG} . #. Push the NVIDIA vGPU Manager image to your private registry: