diff --git a/gpu-operator/google-gke.rst b/gpu-operator/google-gke.rst
index c32c92a8b..822d7563c 100644
--- a/gpu-operator/google-gke.rst
+++ b/gpu-operator/google-gke.rst
@@ -80,6 +80,20 @@ Prerequisites
 Refer to `GPU platforms `_ in the Google Cloud documentation.
 
+.. note::
+
+   When you install the NVIDIA GPU Operator on GKE 1.33+, a known issue causes the NVIDIA Container Toolkit to misconfigure the containerd ``config.toml`` file, which prevents GPU Operator containers from starting correctly.
+
+   To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container.
+   You can set this environment variable by adding the following to the ClusterPolicy CR:
+
+   .. code-block:: yaml
+
+      toolkit:
+        env:
+        - name: RUNTIME_CONFIG_SOURCE
+          value: "file"
+
 *********************************
 Using the Google Driver Installer
diff --git a/gpu-operator/release-notes.rst b/gpu-operator/release-notes.rst
index eb3a8fd7d..373aeae0a 100644
--- a/gpu-operator/release-notes.rst
+++ b/gpu-operator/release-notes.rst
@@ -188,6 +188,17 @@ Known Issues
 Create the ConfigMap, then update the ClusterPolicy with the name of the configMap in the ``vgpuDeviceManager.config.name``, and restart the vgpu-device-manager pod.
 
+- On GKE 1.33+, a known issue causes the NVIDIA Container Toolkit to misconfigure the containerd ``config.toml`` file, which prevents GPU Operator containers from starting correctly.
+  To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container.
+  You can set this environment variable by adding the following to the ClusterPolicy CR:
+
+  .. code-block:: yaml
+
+    toolkit:
+      env:
+      - name: RUNTIME_CONFIG_SOURCE
+        value: "file"
+
 .. _v25.3.4:
 
 25.3.4
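
Both added snippets show only the ``toolkit`` fragment. As a minimal sketch of where that fragment sits in a full ClusterPolicy resource, the example below nests it under ``spec``. The resource name ``cluster-policy`` is an assumption (it is the name the Helm chart typically creates) and the rest of the spec is omitted, so adjust both to match the cluster.

.. code-block:: yaml

   # Sketch only: a real ClusterPolicy carries many more fields under spec.
   apiVersion: nvidia.com/v1
   kind: ClusterPolicy
   metadata:
     name: cluster-policy   # assumed default name; verify with `kubectl get clusterpolicy`
   spec:
     toolkit:
       env:
       # Environment variable and value as stated in the added note.
       - name: RUNTIME_CONFIG_SOURCE
         value: "file"

The value is quoted so that YAML passes it to the toolkit container as a plain string.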