From 3a279fc7333586cda78bd93a692a55581849a3d2 Mon Sep 17 00:00:00 2001 From: Lee Xie Date: Mon, 14 Apr 2025 10:54:14 -0700 Subject: [PATCH 1/7] Create k0rdent.rst Adding k0RDENT to partner validated configurations Signed-off-by: Lee Xie Signed-off-by: Lee Xie --- partner-validated/k0rdent.rst | 163 ++++++++++++++++++++++++++++++++++ 1 file changed, 163 insertions(+) create mode 100644 partner-validated/k0rdent.rst diff --git a/partner-validated/k0rdent.rst b/partner-validated/k0rdent.rst new file mode 100644 index 000000000..2a6938d90 --- /dev/null +++ b/partner-validated/k0rdent.rst @@ -0,0 +1,163 @@ +.. headings # #, * *, =, -, ^, " + +.. |prod-name-long| replace:: Mirantis k0RDENT +.. |prod-name-short| replace:: k0RDENT + +############################################# +|prod-name-long| with the NVIDIA GPU Operator +############################################# + + +********************************************* +About |prod-name-short| with the GPU Operator +********************************************* + +|prod-name-short| is as a "super control plane" designed to ensure the consistent provisioning and lifecycle +management of kubernetes clusters and the services that make them useful. The goal of the k0rdent project is +to provide platform engineers with the means to deliver a distributed container management environment (DCME) +and enable them to compose unique internal developer platforms (IDP) to support a diverse range of complex +modern application workloads. + +The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate +both the deployment and management of all NVIDIA software components needed to provision NVIDIA GPUs. +These components include the NVIDIA GPU drivers to enable CUDA, Kubernetes device plugin for GPUs, +the NVIDIA Container Toolkit, automatic node labeling using GFD, DCGM based monitoring and others. 
+ + +****************************** +Validated Configuration Matrix +****************************** + +|prod-name-long| has self-validated with the following components and versions: + +.. list-table:: + :header-rows: 1 + + * - Version + - | NVIDIA + | GPU + | Operator + - | Operating + | System + - | Container + | Runtime + - Kubernetes + - Helm + - NVIDIA GPU + - Hardware Model + + * - k0rdent 0.2.0 / k0s v1.31.5+k0s + - v24.9.2 + - | Ubuntu 22.04 + - containerd v1.7.24 with the NVIDIA Container Toolkit v1.17.4 + - 1.31.5 + - Helm v3 + - | 2x NVIDIA RTX 4000 SFF Ada 20GB GDDR6 (ECC) + - | Supermicro SuperServer 6028U-E1CNR4T+ + + | 1000W Supermicro PWS-1K02A-1R + + | 2x Intel Xeon E5-2630v4, 10C/20T 2.2/3.1 GHz LGA 2011-3 25MB 85W + + | 32GB DDR4-2666 RDIMM, M393A4K40BB2-CTD6Q + + | NVMe 960GB PM983 NVMe M.2, MZ1LB960HAJQ-00007 + + | 2 x NVIDIA RTX 4000 SFF Ada 20GB GDDR6 (ECC), 70W, PCIe 4.0x16, 4x + + | 4x Mini DisplayPort 1.4a + + +************* +Prerequisites +************* + +* A running |prod-name-short| managed cluster with at least one control plane node and two worker nodes. + The recommended configuration is at least three control plane nodes and at least two worker nodes. + +* At least one worker node with an NVIDIA GPU physically installed. + The GPU Operator can locate the GPU and label the node accordingly. + +* The kubeconfig file for the |prod-name-short| managed cluster on the seed node. + You can get the file from the |prod-name-short| control plane. + +* You have access to the |prod-name-short| cluster. + + +********* +Procedure +********* + +Perform the following steps to prepare the |prod-name-short| cluster: + +#. Install template to k0rdent + + .. code-block:: console + + $ helm upgrade --install gpu-operator oci://ghcr.io/k0rdent/catalog/charts/kgst -n kcm-system \ + --set "helm.repository.url=https://helm.ngc.nvidia.com/nvidia" \ + --set "helm.charts[0].name=gpu-operator" \ + --set "helm.charts[0].version=24.9.2" + +#. 
Verify service template: + + .. code-block:: console + + $ kubectl get servicetemplates -A + + *Example Output* + + .. code-block:: output + + NAMESPACE NAME VALID + kcm-system gpu-operator-24-9-2 true + +#. Deploy service template to child cluster + + .. code-block:: console + + apiVersion: k0rdent.mirantis.com/v1alpha1 + kind: MultiClusterService + metadata: + name: gpu-operator + spec: + clusterSelector: + matchLabels: + group: demo + serviceSpec: + services: + - template: gpu-operator-24-9-2 + name: gpu-operator + namespace: gpu-operator + values: | + operator: + defaultRuntime: containerd + toolkit: + env: + - name: CONTAINERD_CONFIG + value: /etc/k0s/containerd.d/nvidia.toml + - name: CONTAINERD_SOCKET + value: /run/k0s/containerd.sock + - name: CONTAINERD_RUNTIME_CLASS + value: nvidia + + +The |prod-name-short| managed clusters will now have the Nvidia GPU operator + +************************************************* +Verifying |prod-name-short| with the GPU Operator +************************************************* + +Refer to :external+gpuop:ref:`running sample gpu applications` to verify the installation. + +*************** +Getting Support +*************** + +Refer to the k0RDENT product documentation for information about working with k0RDENT. 
+ +******************* +Related information +******************* + +* https://docs.k0rdent.io/v0.2.0/ From a60fe4955fd0655fa9a06817d86ad2265bebaee9 Mon Sep 17 00:00:00 2001 From: Lee Xie Date: Mon, 14 Apr 2025 10:54:42 -0700 Subject: [PATCH 2/7] Update index.rst Signed-off-by: Lee Xie Signed-off-by: Lee Xie --- partner-validated/index.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/partner-validated/index.rst b/partner-validated/index.rst index e6b3d28ad..24a84eae8 100644 --- a/partner-validated/index.rst +++ b/partner-validated/index.rst @@ -27,6 +27,7 @@ About Partner-Validated Configurations :hidden: self + k0rdent.rst mirantis-mke.rst Partner-validated configurations help end users who want to use @@ -153,4 +154,4 @@ What happens if the partner requires changes to the NVIDIA GPU Operator that are How are CVE fixes managed for partner software that is used by the NVIDIA GPU Operator? The partner is responsible for managing security issues and is advised to proactively notify users of issues and fixes. - When the partner provides users with software, such as a containerized GPU driver, the partner is responsible for notifying and resolving issues with the container image. \ No newline at end of file + When the partner provides users with software, such as a containerized GPU driver, the partner is responsible for notifying and resolving issues with the container image. From 0dc65702fbd09ba73c1c2f1f89c57bc6283ecb60 Mon Sep 17 00:00:00 2001 From: Lee Xie Date: Mon, 14 Apr 2025 10:57:36 -0700 Subject: [PATCH 3/7] Update mirantis-mke.rst remove k0RDENT from matrix since it will have its own page.
Signed-off-by: Lee Xie Signed-off-by: Lee Xie --- partner-validated/mirantis-mke.rst | 21 --------------------- 1 file changed, 21 deletions(-) diff --git a/partner-validated/mirantis-mke.rst b/partner-validated/mirantis-mke.rst index dd24286cd..deef2074d 100644 --- a/partner-validated/mirantis-mke.rst +++ b/partner-validated/mirantis-mke.rst @@ -44,27 +44,6 @@ Validated Configuration Matrix - NVIDIA GPU - Hardware Model - * - k0s v1.31.5+k0s / k0rdent 0.1.0 - - v24.9.2 - - | Ubuntu 22.04 - - containerd v1.7.24 with the NVIDIA Container Toolkit v1.17.4 - - 1.31.5 - - Helm v3 - - | 2x NVIDIA RTX 4000 SFF Ada 20GB GDDR6 (ECC) - - | Supermicro SuperServer 6028U-E1CNR4T+ - - | 1000W Supermicro PWS-1K02A-1R - - | 2x Intel Xeon E5-2630v4, 10C/20T 2.2/3.1 GHz LGA 2011-3 25MB 85W - - | 32GB DDR4-2666 RDIMM, M393A4K40BB2-CTD6Q - - | NVMe 960GB PM983 NVMe M.2, MZ1LB960HAJQ-00007 - - | 2 x NVIDIA RTX 4000 SFF Ada 20GB GDDR6 (ECC), 70W, PCIe 4.0x16, 4x - - | 4x Mini DisplayPort 1.4a - * - MKE 3.8 - v24.9.2 - | Ubuntu 22.04 From 47079dec7b1f72d0bc1af5bd5d0f75dc6e6b4a37 Mon Sep 17 00:00:00 2001 From: Lee Xie Date: Mon, 14 Apr 2025 10:59:41 -0700 Subject: [PATCH 4/7] Update k0rdent.rst Signed-off-by: Lee Xie Signed-off-by: Lee Xie --- partner-validated/k0rdent.rst | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/partner-validated/k0rdent.rst b/partner-validated/k0rdent.rst index 2a6938d90..b4a9a985a 100644 --- a/partner-validated/k0rdent.rst +++ b/partner-validated/k0rdent.rst @@ -94,10 +94,8 @@ Perform the following steps to prepare the |prod-name-short| cluster: .. 
code-block:: console - $ helm upgrade --install gpu-operator oci://ghcr.io/k0rdent/catalog/charts/kgst -n kcm-system \ - --set "helm.repository.url=https://helm.ngc.nvidia.com/nvidia" \ - --set "helm.charts[0].name=gpu-operator" \ - --set "helm.charts[0].version=24.9.2" + $ helm install gpu-operator oci://ghcr.io/k0rdent/catalog/charts/gpu-operator-service-template \ + --version 24.9.2 -n kcm-system #. Verify service template: From d3984ebcdc0a253139211ea37ed59ec1c324cba3 Mon Sep 17 00:00:00 2001 From: Lee Xie Date: Fri, 18 Apr 2025 09:39:58 -0700 Subject: [PATCH 5/7] Update partner-validated/k0rdent.rst Co-authored-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com> Signed-off-by: Lee Xie --- partner-validated/k0rdent.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/partner-validated/k0rdent.rst b/partner-validated/k0rdent.rst index b4a9a985a..53a294bfe 100644 --- a/partner-validated/k0rdent.rst +++ b/partner-validated/k0rdent.rst @@ -140,7 +140,7 @@ Perform the following steps to prepare the |prod-name-short| cluster: value: nvidia -The |prod-name-short| managed clusters will now have the Nvidia GPU operator +The |prod-name-short| managed clusters will now have the NVIDIA GPU operator ************************************************* Verifying |prod-name-short| with the GPU Operator From fe322b1fb9cbfea2a5077267a25a178768a32537 Mon Sep 17 00:00:00 2001 From: Lee Xie Date: Fri, 18 Apr 2025 09:40:09 -0700 Subject: [PATCH 6/7] Update partner-validated/k0rdent.rst Co-authored-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com> Signed-off-by: Lee Xie --- partner-validated/k0rdent.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/partner-validated/k0rdent.rst b/partner-validated/k0rdent.rst index 53a294bfe..f402d9888 100644 --- a/partner-validated/k0rdent.rst +++ b/partner-validated/k0rdent.rst @@ -110,7 +110,7 @@ Perform the following steps to prepare the |prod-name-short| cluster: NAMESPACE NAME 
VALID kcm-system gpu-operator-24-9-2 true -#. Deploy service template to child cluster +#. Deploy service template to child cluster: .. code-block:: console From f5aed46b0dd280ad864df17761da7ea2f2748163 Mon Sep 17 00:00:00 2001 From: Lee Xie Date: Fri, 18 Apr 2025 09:40:21 -0700 Subject: [PATCH 7/7] Update partner-validated/k0rdent.rst Co-authored-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com> Signed-off-by: Lee Xie --- partner-validated/k0rdent.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/partner-validated/k0rdent.rst b/partner-validated/k0rdent.rst index f402d9888..d2bb8cf29 100644 --- a/partner-validated/k0rdent.rst +++ b/partner-validated/k0rdent.rst @@ -13,7 +13,7 @@ About |prod-name-short| with the GPU Operator ********************************************* |prod-name-short| is as a "super control plane" designed to ensure the consistent provisioning and lifecycle -management of kubernetes clusters and the services that make them useful. The goal of the k0rdent project is +management of Kubernetes clusters and the services that make them useful. The goal of the k0rdent project is to provide platform engineers with the means to deliver a distributed container management environment (DCME) and enable them to compose unique internal developer platforms (IDP) to support a diverse range of complex modern application workloads.
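
The "Deploy service template to child cluster" step in k0rdent.rst (patch 1/7) lists the MultiClusterService manifest but not how it might be applied. The following is a minimal sketch, not part of the patch. It assumes the manifest is saved under the hypothetical file name `gpu-operator-mcs.yaml`, that the active kubeconfig context points at the k0rdent management cluster, and that the target child clusters carry the `group: demo` label. The YAML indentation is reconstructed from the flattened diff and should be checked against the rendered page. The `APPLY` switch is likewise an illustration-only convention.

```shell
#!/bin/sh
set -eu

# Hypothetical file name; the patch does not name the manifest file.
MANIFEST="${MANIFEST:-gpu-operator-mcs.yaml}"

# Write the MultiClusterService from the procedure to a file.
# Indentation is reconstructed (assumption); values are copied from the patch.
cat > "$MANIFEST" <<'EOF'
apiVersion: k0rdent.mirantis.com/v1alpha1
kind: MultiClusterService
metadata:
  name: gpu-operator
spec:
  clusterSelector:
    matchLabels:
      group: demo
  serviceSpec:
    services:
    - template: gpu-operator-24-9-2
      name: gpu-operator
      namespace: gpu-operator
      values: |
        operator:
          defaultRuntime: containerd
        toolkit:
          env:
          - name: CONTAINERD_CONFIG
            value: /etc/k0s/containerd.d/nvidia.toml
          - name: CONTAINERD_SOCKET
            value: /run/k0s/containerd.sock
          - name: CONTAINERD_RUNTIME_CLASS
            value: nvidia
EOF

# Only touch a cluster when explicitly requested (APPLY=1 is an assumption
# of this sketch, not a k0rdent convention).
if [ "${APPLY:-0}" = "1" ]; then
  kubectl apply -f "$MANIFEST"
  # Confirm that the object was created on the management cluster.
  kubectl get multiclusterservice gpu-operator
else
  echo "Wrote $MANIFEST; rerun with APPLY=1 to apply it."
fi
```

Running the script without `APPLY=1` only writes the file, which makes it easy to review the manifest before anything reaches a cluster.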