From 66d2babeec12e04bce0c60898752d68188e48cff Mon Sep 17 00:00:00 2001
From: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
Date: Fri, 14 Nov 2025 15:50:20 -0500
Subject: [PATCH 1/2] Add gke known issue

Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
---
 gpu-operator/google-gke.rst    | 14 ++++++++++++++
 gpu-operator/release-notes.rst | 11 +++++++++++
 2 files changed, 25 insertions(+)

diff --git a/gpu-operator/google-gke.rst b/gpu-operator/google-gke.rst
index c32c92a8b..ab3091aae 100644
--- a/gpu-operator/google-gke.rst
+++ b/gpu-operator/google-gke.rst
@@ -80,6 +80,20 @@ Prerequisites

 Refer to `GPU platforms `_ in the Google Cloud documentation.

+.. note::
+
+   When installing NVIDIA GPU Operator v25.10.0 on GKE, there is a known issue in the NVIDIA Container Toolkit v1.18.0, the default toolkit version, that will misconfigure the config.toml file and prevent GPU Operator containers from starting up correctly.
+
+   To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container to resolve this issue.
+   You can set this environment variable by setting the below in the ClusterPolicy CR:
+
+   .. code-block:: yaml
+
+      toolkit:
+        env:
+        - name: RUNTIME_CONFIG_SOURCE
+          value: "file"
+

 *********************************
 Using the Google Driver Installer
diff --git a/gpu-operator/release-notes.rst b/gpu-operator/release-notes.rst
index eb3a8fd7d..deba23159 100644
--- a/gpu-operator/release-notes.rst
+++ b/gpu-operator/release-notes.rst
@@ -188,6 +188,17 @@ Known Issues

 Create the ConfigMap, then update the ClusterPolicy with the name of the configMap in the ``vgpuDeviceManager.config.name``, and restart the vgpu-device-manager pod.

+- When using GKE, there is a known issue in the NVIDIA Container Toolkit v1.18.0 that will miss configure the config.toml file and prevent GPU Operator containers from starting up correctly.
+  To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container to resolve this issue.
+  You can set this environment variable by setting the below in the ClusterPolicy CR:
+
+  .. code-block:: yaml
+
+     toolkit:
+       env:
+       - name: RUNTIME_CONFIG_SOURCE
+         value: "file"
+
 .. _v25.3.4:

 25.3.4

From f75c755203114ff7f4ae4cb8dd2e1573af7bca43 Mon Sep 17 00:00:00 2001
From: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
Date: Mon, 17 Nov 2025 11:21:55 -0500
Subject: [PATCH 2/2] Apply suggestions from code review

Co-authored-by: Christopher Desiniotis
Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
---
 gpu-operator/google-gke.rst    | 2 +-
 gpu-operator/release-notes.rst | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/gpu-operator/google-gke.rst b/gpu-operator/google-gke.rst
index ab3091aae..822d7563c 100644
--- a/gpu-operator/google-gke.rst
+++ b/gpu-operator/google-gke.rst
@@ -82,7 +82,7 @@ Prerequisites

 .. note::

-   When installing NVIDIA GPU Operator v25.10.0 on GKE, there is a known issue in the NVIDIA Container Toolkit v1.18.0, the default toolkit version, that will misconfigure the config.toml file and prevent GPU Operator containers from starting up correctly.
+   When installing NVIDIA GPU Operator on GKE 1.33+, there is a known issue where NVIDIA Container Toolkit will misconfigure the containerd `config.toml` file and prevent GPU Operator containers from starting up correctly.

    To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container to resolve this issue.
    You can set this environment variable by setting the below in the ClusterPolicy CR:
diff --git a/gpu-operator/release-notes.rst b/gpu-operator/release-notes.rst
index deba23159..373aeae0a 100644
--- a/gpu-operator/release-notes.rst
+++ b/gpu-operator/release-notes.rst
@@ -188,7 +188,7 @@ Known Issues

 Create the ConfigMap, then update the ClusterPolicy with the name of the configMap in the ``vgpuDeviceManager.config.name``, and restart the vgpu-device-manager pod.

-- When using GKE, there is a known issue in the NVIDIA Container Toolkit v1.18.0 that will miss configure the config.toml file and prevent GPU Operator containers from starting up correctly.
+- When using GKE 1.33+, there is a known issue where NVIDIA Container Toolkit will misconfigure the containerd `config.toml` file and prevent GPU Operator containers from starting up correctly.
   To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container to resolve this issue.
   You can set this environment variable by setting the below in the ClusterPolicy CR: