@a-mccarthy
Collaborator

Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
Refer to `GPU platforms <https://cloud.google.com/compute/docs/gpus>`_
in the Google Cloud documentation.

.. note::
Collaborator Author

This is possibly not the best place for this note. Open to suggestions :)

@github-actions

Documentation preview

https://nvidia.github.io/cloud-native-docs/review/pr-312

@a-mccarthy
Collaborator Author

thanks for the review @cdesiniotis! can you take another look and approve?

Co-authored-by: Christopher Desiniotis <chris.desiniotis@gmail.com>
Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
Contributor

@cdesiniotis cdesiniotis left a comment

Two minor suggestions, but otherwise lgtm. Feel free to merge this after addressing my comments.


When installing NVIDIA GPU Operator on GKE 1.33+, there is a known issue where NVIDIA Container Toolkit will misconfigure the containerd ``config.toml`` file and prevent GPU Operator containers from starting up correctly.

To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container to resolve this issue.
Contributor

removing a small redundancy:

Suggested change
- To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container to resolve this issue.
+ To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container.
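
For reference, the workaround discussed above could look like the following when the GPU Operator is installed through its Helm chart. This is a minimal sketch, not the text merged in this PR: it assumes the chart's `toolkit.env` list is used to pass environment variables to the toolkit container, and the file name `values.yaml` is only illustrative.

```yaml
# values.yaml (illustrative file name, not taken from the PR)
# Assumption: the GPU Operator Helm chart forwards entries under
# toolkit.env to the NVIDIA Container Toolkit container.
toolkit:
  env:
    # Workaround for the GKE 1.33+ containerd config.toml issue
    # described in the note under review.
    - name: RUNTIME_CONFIG_SOURCE
      value: "file"
```

Under the same assumption, the equivalent on the Helm command line would be `--set "toolkit.env[0].name=RUNTIME_CONFIG_SOURCE" --set "toolkit.env[0].value=file"`.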

Create the ConfigMap, set ``vgpuDeviceManager.config.name`` in the ClusterPolicy to the name of the ConfigMap, and then restart the vgpu-device-manager pod.

- When using GKE 1.33+, there is a known issue where NVIDIA Container Toolkit will misconfigure the containerd ``config.toml`` file and prevent GPU Operator containers from starting up correctly.
To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container to resolve this issue.
Contributor

Suggested change
- To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container to resolve this issue.
+ To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container.

@a-mccarthy a-mccarthy merged commit 4d468dc into NVIDIA:main Nov 18, 2025
2 checks passed
a-mccarthy added a commit to a-mccarthy/cloud-native-docs that referenced this pull request Nov 18, 2025
* Add gke known issue

Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Christopher Desiniotis <chris.desiniotis@gmail.com>
Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>

---------

Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
Co-authored-by: Christopher Desiniotis <chris.desiniotis@gmail.com>

2 participants