gpu: fix operator deployment instructions #20552

Open
wants to merge 1 commit into
base: master

Contributor

@gjulianm gjulianm commented Jun 19, 2025

What does this PR do?

Updates the GPU deployment instructions to fix the deployment with the Datadog operator.

Motivation

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

Contributor

@janine-c janine-c left a comment

Looks great! I made some minor writing suggestions to try to make the directions a bit easier to parse and to follow our style guide a little more closely. If you have any questions, don't hesitate to let me know!

For **mixed environments**, use the [DatadogAgentProfiles feature](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md) of the operator, which allows different configurations to be deployed for different nodes. In this case, it is not necessary to modify the DatadogAgent manifest. Instead, create a profile that enables the configuration on GPU nodes only:
For **mixed environments**, use the [DatadogAgentProfiles (DAP) feature](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md) of the operator, which allows different configurations to be deployed for different nodes. Note that this feature is disabled by default, so it needs to be enabled [as described here](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md#enabling-datadogagentprofiles).

Suggested change
For **mixed environments**, use the [DatadogAgentProfiles (DAP) feature](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md) of the operator, which allows different configurations to be deployed for different nodes. Note that this feature is disabled by default, so it needs to be enabled [as described here](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md#enabling-datadogagentprofiles).
For **mixed environments**, use the [DatadogAgentProfiles (DAP) feature](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md) of the operator, which allows different configurations to be deployed for different nodes. Note that this feature is disabled by default, so it needs to be enabled. For more information, see [Enabling DatadogAgentProfiles](https://github.com/DataDog/datadog-operator/blob/main/docs/datadog_agent_profiles.md#enabling-datadogagentprofiles).

Making the link more descriptive, which is a good accessibility practice 🙂
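For readers following the thread, a profile of the kind the quoted text describes might look like the sketch below. It is not taken from the PR diff: the profile name `gpu-nodes`, the `nvidia.com/gpu.present` node label, and the placement of `DD_ENABLE_NVML_DETECTION` under the `agent` container are assumptions for illustration, and the field layout should be checked against the linked DatadogAgentProfiles documentation for the operator version in use.

```yaml
# Hypothetical sketch of a DatadogAgentProfile that targets GPU nodes only.
# Names, labels, and values are illustrative assumptions, not the PR's manifest.
apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgentProfile
metadata:
  name: gpu-nodes            # assumed name
spec:
  profileAffinity:
    profileNodeAffinity:
      # Assumed label; use whatever label identifies GPU nodes in your cluster.
      - key: nvidia.com/gpu.present
        operator: In
        values:
          - "true"
  config:
    override:
      nodeAgent:
        containers:
          agent:
            env:
              # Env var quoted in the diff context of this PR.
              - name: DD_ENABLE_NVML_DETECTION
                value: "true"
```

Because the profile only applies to nodes matching the affinity rule, non-GPU nodes keep the configuration from the base DatadogAgent manifest.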

Comment on lines +211 to +215
Modifying the DatadogAgent manifest is necessary to enable certain features that are not supported by the DAP yet. First, the existing configuration should enable the `system-probe` container in the datadog-agent pods (this can be easily checked by looking at the list of containers when running `kubectl describe pod <datadog-agent-pod-name> -n <namespace>`). Because the DAP feature does not yet support conditionally enabling containers, a feature that uses `system-probe` needs to be enabled for all agent pods. We recommend enabling the `oomKill` integration, as it is lightweight and does not require any additional configuration or extra cost.

Additionally, the agent needs to be configured so that the NVIDIA container runtime exposes GPUs to the agent. This can be done via environment variables or volume mounts, depending on whether the `accept-nvidia-visible-devices-as-volume-mounts` parameter is set to `true` or `false` in the NVIDIA container runtime configuration. We recommend configuring the agent both ways, as it reduces the chance of misconfiguration and there are no side effects to having both.

Also, the PodResources socket needs to be exposed to the agent too to integrate with the Kubernetes Device Plugin. Again, this needs to be done globally as the DAP does not yet support conditional volume mounts.

Suggested change
Modifying the DatadogAgent manifest is necessary to enable certain features that are not supported by the DAP yet. First, the existing configuration should enable the `system-probe` container in the datadog-agent pods (this can be easily checked by looking at the list of containers when running `kubectl describe pod <datadog-agent-pod-name> -n <namespace>`). Because the DAP feature does not yet support conditionally enabling containers, a feature that uses `system-probe` needs to be enabled for all agent pods. We recommend enabling the `oomKill` integration, as it is lightweight and does not require any additional configuration or extra cost.
Additionally, the agent needs to be configured so that the NVIDIA container runtime exposes GPUs to the agent. This can be done via environment variables or volume mounts, depending on whether the `accept-nvidia-visible-devices-as-volume-mounts` parameter is set to `true` or `false` in the NVIDIA container runtime configuration. We recommend configuring the agent both ways, as it reduces the chance of misconfiguration and there are no side effects to having both.
Also, the PodResources socket needs to be exposed to the agent too to integrate with the Kubernetes Device Plugin. Again, this needs to be done globally as the DAP does not yet support conditional volume mounts.
Modifying the DatadogAgent manifest is necessary to enable certain features that are not supported by the DAP yet:
- In the existing configuration, enable the `system-probe` container in the datadog-agent pods. Because the DAP feature does not yet support conditionally enabling containers, a feature that uses `system-probe` needs to be enabled for all Agent pods.
  - You can check this by looking at the list of containers when running `kubectl describe pod <datadog-agent-pod-name> -n <namespace>`.
  - Datadog recommends enabling the `oomKill` integration, as it is lightweight and does not require any additional configuration or cost.
- Configure the Agent so that the NVIDIA container runtime exposes GPUs to the Agent.
  - You can do this using environment variables or volume mounts, depending on whether the `accept-nvidia-visible-devices-as-volume-mounts` parameter is set to `true` or `false` in the NVIDIA container runtime configuration.
  - Datadog recommends configuring the Agent both ways, as it reduces the chance of misconfiguration. There are no side effects to having both.
- Expose the PodResources socket to the Agent to integrate with the Kubernetes Device Plugin.
  - This needs to be done globally, as the DAP does not yet support conditional volume mounts.

I tried formatting this section a little bit differently, in the hopes of making it easier for customers to understand the action items they need to do.
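As a rough illustration of the three action items above, the global DatadogAgent changes could be sketched as follows. This is not the manifest from the PR: the resource name, the volume names, the `/dev/null` host path, and the `/var/run/nvidia-container-devices/all` mount path are assumptions based on the NVIDIA container runtime's volume-mount convention, so verify them against the final documentation.

```yaml
# Hypothetical sketch, not the manifest from the PR diff.
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog              # assumed name
spec:
  features:
    oomKill:
      # Only needed if nothing else already requires system-probe in all Agent pods.
      enabled: true
  override:
    nodeAgent:
      volumes:
        - name: nvidia-devices        # assumed volume name
          hostPath:
            path: /dev/null
        - name: pod-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources
      containers:
        agent:
          env:
            # Covers runtimes that read the device list from the environment.
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
          volumeMounts:
            # Covers runtimes with accept-nvidia-visible-devices-as-volume-mounts = true.
            - name: nvidia-devices
              mountPath: /var/run/nvidia-container-devices/all
            # readOnly is deliberately left unset: a reviewer in this thread
            # questions whether the socket directory must be mounted read-write.
            - name: pod-resources
              mountPath: /var/lib/kubelet/pod-resources
```

Setting both the environment variable and the volume mount mirrors the recommendation to configure the Agent both ways.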

Agree with this one: short one-line bullets are easier to follow, and harder to miss, than long paragraphs of plain text.

```yaml
spec:
  features:
    oomKill: # Only enable this feature if there is nothing else that requires the system-probe container in all agent pods
```

Suggested change
oomKill: # Only enable this feature if there is nothing else that requires the system-probe container in all agent pods
oomKill: # Only enable this feature if there is nothing else that requires the system-probe container in all Agent pods

Why do we need to enable oomKill for system-probe (SP)? Isn't enabling GPUM enough?

Contributor

@val06 val06 left a comment

reviewed

```yaml
spec:
  features:
    oomKill: # Only enable this feature if there is nothing else that requires the system-probe container in all agent pods
```

Why do we need to enable oomKill for system-probe (SP)? Isn't enabling GPUM enough?

  readOnly: true
- name: pod-resources
  mountPath: /var/lib/kubelet/pod-resources
  readOnly: true

I vaguely remember that we need it as rw (ref).

agent:
  env:
    - name: DD_ENABLE_NVML_DETECTION
      value: "true"
    # add this env var, if using operator versions 1.14.x or 1.15.x
      value: "true"

Why remove the original comment?
This env var is required for customers who still use operator version 1.14, since you added it explicitly to the gpu/feature only in 1.15.

agent:
  env:
    - name: DD_ENABLE_NVML_DETECTION
      value: "true"
    # add this env var, if using operator versions 1.14.x or 1.15.x

Same here.

3 participants