14 changes: 14 additions & 0 deletions gpu-operator/google-gke.rst
@@ -80,6 +80,20 @@ Prerequisites
Refer to `GPU platforms <https://cloud.google.com/compute/docs/gpus>`_
in the Google Cloud documentation.

.. note::
Collaborator Author

This is possibly not the best place for this note. Open to suggestions :)


When installing NVIDIA GPU Operator on GKE 1.33+, there is a known issue where NVIDIA Container Toolkit will misconfigure the containerd ``config.toml`` file and prevent GPU Operator containers from starting up correctly.

To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container to resolve this issue.
Contributor

removing a small redundancy:

Suggested change
- To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container to resolve this issue.
+ To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container.

You can set this environment variable by adding the following to the ClusterPolicy CR:

.. code-block:: yaml

   toolkit:
     env:
     - name: RUNTIME_CONFIG_SOURCE
       value: "file"
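
For orientation, the ``toolkit`` fragment above belongs under ``spec`` in the ClusterPolicy resource. The following is a minimal sketch, assuming the resource is named ``cluster-policy`` (an assumption not stated here; confirm the name in your cluster) and omitting every other field:

.. code-block:: yaml

   # Sketch only: shows where the toolkit env entry sits in the CR.
   # The metadata name is an assumption; all other spec fields are omitted.
   apiVersion: nvidia.com/v1
   kind: ClusterPolicy
   metadata:
     name: cluster-policy
   spec:
     toolkit:
       env:
       - name: RUNTIME_CONFIG_SOURCE
         value: "file"

Merge the ``env`` entry into the existing resource rather than applying this sketch as-is, because a real ClusterPolicy carries many other settings.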


*********************************
Using the Google Driver Installer
11 changes: 11 additions & 0 deletions gpu-operator/release-notes.rst
@@ -188,6 +188,17 @@ Known Issues

Create the ConfigMap, then update the ClusterPolicy with the name of the configMap in the ``vgpuDeviceManager.config.name``, and restart the vgpu-device-manager pod.

- When using GKE 1.33+, there is a known issue where NVIDIA Container Toolkit will misconfigure the containerd ``config.toml`` file and prevent GPU Operator containers from starting up correctly.
To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container to resolve this issue.
Contributor

Suggested change
- To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container to resolve this issue.
+ To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container.

You can set this environment variable by adding the following to the ClusterPolicy CR:

.. code-block:: yaml

   toolkit:
     env:
     - name: RUNTIME_CONFIG_SOURCE
       value: "file"

.. _v25.3.4:

25.3.4