Add gke known issue #312
Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
| Refer to `GPU platforms <https://cloud.google.com/compute/docs/gpus>`_ | ||
| in the Google Cloud documentation. | ||
| .. note:: |
This is possibly not the best place for this note. open to suggestions :)
Documentation preview |
|
thanks for the review @cdesiniotis! can you take another look and approve? |
Co-authored-by: Christopher Desiniotis <chris.desiniotis@gmail.com>
Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
cdesiniotis
left a comment
Two minor suggestions, but otherwise lgtm. Feel free to merge this after addressing my comments.
| When installing NVIDIA GPU Operator on GKE 1.33+, there is a known issue where NVIDIA Container Toolkit will misconfigure the containerd `config.toml` file and prevent GPU Operator containers from starting up correctly. | ||
| To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container to resolve this issue. |
removing a small redundancy:
| To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container to resolve this issue. | |
| To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container. |
| Create the ConfigMap, then update the ClusterPolicy with the name of the configMap in the ``vgpuDeviceManager.config.name``, and restart the vgpu-device-manager pod. | ||
| - When using GKE 1.33+, there is a known issue where NVIDIA Container Toolkit will misconfigure the containerd `config.toml` file and prevent GPU Operator containers from starting up correctly. | ||
| To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container to resolve this issue. |
| To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container to resolve this issue. | |
| To resolve this issue, set the ``RUNTIME_CONFIG_SOURCE=file`` environment variable in the toolkit container. |
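For anyone applying the workaround from this note, here is a minimal sketch of how the variable might be set through Helm values. It assumes the GPU Operator chart exposes a `toolkit.env` list that is passed through to the toolkit container; that assumption is not stated in this PR, so check it against the chart version you deploy.

```yaml
# Helm values fragment for the GPU Operator chart (sketch, not taken from this PR).
# Assumption: entries under `toolkit.env` are injected into the toolkit container.
toolkit:
  env:
    # Tell NVIDIA Container Toolkit to read the runtime config from the file,
    # which is the workaround described for GKE 1.33+.
    - name: RUNTIME_CONFIG_SOURCE
      value: "file"
```

If the operator is already installed, the same `name`/`value` pair would presumably go under the toolkit section of the ClusterPolicy instead; the exact field path should be confirmed in the ClusterPolicy reference.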
* Add gke known issue

Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Christopher Desiniotis <chris.desiniotis@gmail.com>
Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>

---------

Signed-off-by: Abigail McCarthy <20771501+a-mccarthy@users.noreply.github.com>
Co-authored-by: Christopher Desiniotis <chris.desiniotis@gmail.com>
Related to NVIDIA/gpu-operator#1577