Add Datadog Agent Profiles #1083

davidor · 2024-02-12T07:49:37Z

What does this PR do?

Adds support for Datadog Agent Profiles.

It adds a new CRD DatadogAgentProfile that allows users to override the agent resources for the nodes of the cluster that match the given affinity.

This is an example:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogAgentProfile
metadata:
  name: profile-example
spec:
  profileAffinity:
    profileNodeAffinity:
      - key: disktype
        operator: In
        values:
          - ssd
  config:
    override:
      nodeAgent:
        containers:
          agent:
            resources:
              requests:
                cpu: 20m

Describe your test plan

This has already been tested, you can check our internal doc with the test plan for more details.
In general, try the feature and verify that everything works as you would expect.

Checklist

PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
PR has a milestone or the qa/skip-qa label

* Add profiles base * Fix ci + apply renames in crd

* [override/daemonset] Apply only if it's not empty * [config/rbac/role] Add profiles * Add basic version of profiles manager * [controllers/datadogagent] Create DaemonSets according to profiles * [override/container] Don't override resource if not specified * [agentprofile] Add support for overrides in all node agent containers * [agentprofile] Take into account namespace when setting DS name

* [datadogagent_controller_v2_test] Make create/delete funcs more generic * Add integration tests for agent profiles

* Add profile validation * At least one container resource must be defined * Remove redundant assert and update boilerplate text

* [profiles] Add support for profiles with multiple affinity requirements * [profiles] Cleanup integration tests * [datadogagent/finalizer] Delete profile labels Also fixes the finalizer handle so that it no longer ignores errors.

This avoids scheduling multiple agent pods of different profiles on the same node during rollouts.

* add node informer * add nodestore tests and update existing profiles tests * refactor profile store * update tests * update func name * cleanup * apply review suggestions * apply review suggestions * tests need fixing * Simplify nodestore * Review suggestions --------- Co-authored-by: khewonc <39867936+khewonc@users.noreply.github.com>

* Bump k8s dependencies These should have been updated when operator-sdk was updated to 1.23.0 according to the docs: https://sdk.operatorframework.io/docs/upgrading-sdk-version/v1.23.0/ * Run "make manifests" * Update licenses

* [profiles] Replace custom node cache with k8s client one * [profiles/test] Use random node names to avoid conflicts between tests * [profiles/test] Use Eventually() when checking node labels * [profiles/test] Comment node labels part for now

* [defaulting/images] Set default version of agent and cluster agent to `7.49.0` (#968) * Allow enabling SBOM collection for host and container images (#836) * Allow enabling SBOM collection for host and container images * small fixes and add test * actually add test * address comments --------- Co-authored-by: Celene <celene@datadoghq.com> * Bump go-grpc to 1.56.3 (#970) * Remove SecurityContextConstraints parameter and references (#977) * remove references to scc creation * rm commented tests * rm unused functions * Add GCR EU and Asia registries with auto defaulting (#978) * update examples for v2 datadogagent (#980) * Update default Agent/DCA version to 7.49.1 (#984) 7.49.1 has just been released. it would be great to have it in Operator 1.3.0 * Enable by default container-image collection (#983) We want to enabled container-image collection by default. it will be the case from agent 7.50.0. But in Operator 1.3.0, the Agent is defaulted to 7.49.0 since 7.50.0 is not yet released. This commit set the envvar DD_CONTAINER_IMAGE_ENABLED=true by default in the NodeAgent's "agent" container. * [Docs] OpenShift docs update (#991) * OpenShift docs update * Resize and crop image * Update integrations_autodiscovery.md (#985) * Update integrations_autodiscovery.md Noticed a mismatched https://a.cl.ly/qGuYnRqn * Update integrations_autodiscovery.md Updated the configuration to match the graph title "Use the spec.override.nodeAgent.extraConfd.configDataMap" -> "Use the spec.override.clusterAgent.extraConfd.configDataMap" apiVersion: datadoghq.com/v2alpha1 kind: DatadogAgent metadata: name: datadog spec: global: credentials: apiKey: "<DATADOG_API_KEY>" appKey: "<DATADOG_APP_KEY>" override: nodeAgent: extraConfd: configDataMap: http_check.yaml: |- init_config: instances: - url: "http://%%host%%" name: "My service" * update patch-bundle script to include spec.replaces field in bundle (#990) * Support `CanaryAutoPauseMaxSlowStartDuration` option (#997) * Add support for max slow start duration config param * Update max slow start arg name to include auto pause * fixup! Update max slow start arg name to include auto pause * update datadog-api-client-go (#986) * update datadog-api-client-go * update license * add dependabot.yaml file (#989) * Update boilerplate text (#1005) * Enabling APM by default (#1006) * feat: add NotificationPresetName to monitor options (#1001) * feat: add NotificationPresetName to monitor options In datadog, we have the possibility to toggle the "Content display in notification" for every monitor. This PR enables this. More info in https://docs.datadoghq.com/monitors/notify/#toggle-additional-content * pr review Co-authored-by: Celene <celene@datadoghq.com> * Update datadogmonitor_types.go Co-authored-by: Celene <celene@datadoghq.com> * Update datadogmonitor_types.go Co-authored-by: Celene <celene@datadoghq.com> * Update datadogmonitor_types.go Co-authored-by: Celene <celene@datadoghq.com> * update generated files --------- Co-authored-by: Celene <celene@datadoghq.com> * [APMON-406] Set Datadog namespace env variable on Cluster Agent (#1012) * [APMON-406] Set env variable for the Datadog resources namespace * DCA namespace resources env var added for V2 --------- Co-authored-by: julia-elsammak <julia.elsammak@datadoghq.com> Co-authored-by: Julia-elsammak <109159652+Julia-elsammak@users.noreply.github.com> * Add `createdAd`, `support` fields to bundle CSV file (#1018) * Add createAd, support fields to bundle CSV file * Update hack/patch-bundle.sh Co-authored-by: Celene <celene@datadoghq.com> * Update hack/patch-bundle.sh Co-authored-by: Celene <celene@datadoghq.com> --------- Co-authored-by: Celene <celene@datadoghq.com> * update feature default values and tests (#1015) * Provide default probe handler httpGet values if not configured in override (#998) * Provide default readinessProbe.httpGet if not configured * apply review suggestions * Update bundles to match 1.3.0 release (#1019) * Fix patch_bundle.sh `spec.replaces` formatting (#1026) * [defaulting/images] Update agents to 7.50.1 (#1029) * ContainerProcessStrategy, allow running non-privileged agents in one container (#921) * Add mono-container config to CRD * mono-container support implementation * livecontainer feature unit test and factory fix * cluster checks feature test; change in CC feature * ksm feature test; change in the feature * APM feature test * OTLP feature test; feature change * add ManageMonoContainerNodeAgent to ebpfcheck feature * add ManageMonoContainerNodeAgent to process discovery feature * Basic global_test.go test, minor refactor * feature factory test * Refactor tests to reduce direct use of mono-container CRD * Change CRD and container name * Rename ManageMonoContainerNodeAgent -> ManageMultiProcessNodeAgent * Updates after merge; doc update * refactor around ApplyGlobalSettings * drop 'mono' from naming, comments * Update tests with trace agent enabled by default now * Update apis/datadoghq/v2alpha1/datadogagent_types.go Co-authored-by: Celene <celene@datadoghq.com> * Update apis/datadoghq/v2alpha1/datadogagent_types.go Co-authored-by: Celene <celene@datadoghq.com> * Update apis/datadoghq/v2alpha1/datadogagent_types.go Co-authored-by: Celene <celene@datadoghq.com> * Update apis/datadoghq/v2alpha1/datadogagent_types.go Co-authored-by: Celene <celene@datadoghq.com> * Update controllers/datadogagent/feature/admissioncontroller/feature.go Co-authored-by: Celene <celene@datadoghq.com> * Update controllers/datadogagent/feature/factory.go Co-authored-by: Celene <celene@datadoghq.com> * updates on PR feedback * updates on PR feedback * Update apis/datadoghq/v2alpha1/test/builder.go Co-authored-by: Celene <celene@datadoghq.com> * rename UsesMultiProcessContainer * Update apis/datadoghq/v2alpha1/datadogagent_types.go Co-authored-by: Charly Fontaine <charly@datadoghq.com> * PR feedback updates --------- Co-authored-by: Celene <celene@datadoghq.com> Co-authored-by: Charly Fontaine <charly@datadoghq.com> * [defaulting/images] Update agents to 7.50.2 (#1033) * [Onboarding telemetry] Set install id, install time and install type env variables (#1034) * [Onboarding telemetry] Set install id, install time and install type env variables * Add APM prefix to env variables * [CECO-570] Add operator introspection (#817) * Add operator introspection * [defaulting/images] Update agents to 7.50.3 (#1038) * [gitlab] Add nightly operator and operator_check jobs (#1039) * add gitlab job to release nightly image * update non-privileged to unprivileged (#1040) * Change gcp to gke (#1046) * Mount host files for proper os detection in SBOMs (#1044) * [gitlab] auto push nightly image (#1048) * [introspection] Fix override name combining (#1049) * Fix override name combining * Add a test for the change --------- Co-authored-by: Levan Machablishvili <levan.machablishvili@datadoghq.com> * [override/podtemplatespec] Merge affinities (#1004) (#1052) Co-authored-by: David Ortiz <david.ortiz@datadoghq.com> * [gitlab] update runners for build jobs (#1056) * test image build jobs * rename docker-push-ci * test fix (#1057) * update monocontainer config (#1059) * update monocontainer config * Update docs/configuration.v2alpha1.md Co-authored-by: May Lee <mayl@alumni.cmu.edu> * fix generated file * fix missed var names --------- Co-authored-by: May Lee <mayl@alumni.cmu.edu> * Correct a malformed example command in installation.md (#1066) Fix installation document * Fix errors for tests and allow both introspection and profiles to run --------- Co-authored-by: Jennifer Chen <32009013+jennchenn@users.noreply.github.com> Co-authored-by: Sylvain Baubeau <lebauce@gmail.com> Co-authored-by: Celene <celene@datadoghq.com> Co-authored-by: Charly Fontaine <charly@datadoghq.com> Co-authored-by: Vincent Boulineau <58430298+vboulineau@users.noreply.github.com> Co-authored-by: Cedric Lamoriniere <cedric.lamoriniere@datadoghq.com> Co-authored-by: tbavelier <97530782+tbavelier@users.noreply.github.com> Co-authored-by: Julia-elsammak <109159652+Julia-elsammak@users.noreply.github.com> Co-authored-by: bakayolo <benjamin.apprederisse@gmail.com> Co-authored-by: Liliya Belaus <59583867+liliyadd@users.noreply.github.com> Co-authored-by: julia-elsammak <julia.elsammak@datadoghq.com> Co-authored-by: levan-m <116471169+levan-m@users.noreply.github.com> Co-authored-by: Fanny Jiang <fanny.jiang@datadoghq.com> Co-authored-by: David Ortiz <david.ortiz@datadoghq.com> Co-authored-by: Levan Machablishvili <levan.machablishvili@datadoghq.com> Co-authored-by: May Lee <mayl@alumni.cmu.edu> Co-authored-by: David Goffredo <dmgoffredo@gmail.com>

github-actions

This pull request does not contain a valid label. Please add one of the following labels: bug, enhancement, refactoring, documentation, tooling, dependencies

codecov-commenter · 2024-02-12T10:20:18Z

Codecov Report

Attention: 201 lines in your changes are missing coverage. Please review.

Comparison is base (cb2d3ba) 58.54% compared to head (791ae60) 58.74%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1083      +/-   ##
==========================================
+ Coverage   58.54%   58.74%   +0.19%     
==========================================
  Files         166      170       +4     
  Lines       20525    21033     +508     
==========================================
+ Hits        12017    12356     +339     
- Misses       7782     7920     +138     
- Partials      726      757      +31

Flag	Coverage Δ
unittests	`58.74% <65.81%> (+0.19%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
...is/datadoghq/v1alpha1/datadogagentprofile_types.go	`100.00% <100.00%> (ø)`
...tadoghq/v1alpha1/datadogagentprofile_validation.go	`100.00% <100.00%> (ø)`
...ontrollers/datadogagent/feature/oomkill/feature.go	`87.50% <ø> (ø)`
controllers/datadogagent/override/container.go	`94.70% <100.00%> (+0.26%)`	⬆️
controllers/setup.go	`60.00% <100.00%> (+2.71%)`	⬆️
pkg/kubernetes/const.go	`86.95% <ø> (ø)`
pkg/kubernetes/provider.go	`85.71% <100.00%> (-0.24%)`	⬇️
controllers/datadogagent/override/daemonset.go	`0.00% <0.00%> (ø)`
pkg/kubernetes/objects.go	`0.00% <0.00%> (ø)`
controllers/datadogagentprofile_controller.go	`64.70% <64.70%> (ø)`
... and 5 more

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cb2d3ba...791ae60. Read the comment docs.

levan-m · 2024-02-21T02:58:49Z

controllers/datadogagent/controller_reconcile_v2.go

+	}
+
+	if err = r.handleProfiles(ctx, profiles, profilesByNode, instance.Namespace); err != nil {
+		return reconcile.Result{}, err


I'm a bit confused by error handling here as some errors are collected into errs, some lead to return from function.
In this case, why are we returning here (but not on line 156)?
What happens if some profile/provider combinations succeed in line 162, shouldn't we doe Create/Update/Cleanup dependencies?

I don't think it makes a bit difference in the end because any error here means that the request needs to be requeued and retried. I think it's a bit clearer to ensure that all the agent daemonsets are created as expected and unnecessary ones deleted before trying to manage their dependencies.

For the providers part, @khewonc can answer, I'm not so familiar with that feature.

When adding introspection, I was also thinking we shouldn't block the rest of the reconcile even if there was an error with handleProviders since an error there would come from an issue with deleting old DaemonSets. Then, when merging main into the profiles branch, I tried to make minimal changes so the different approaches to error handling were kept. We can make the behavior for both consistent in another PR

levan-m · 2024-02-21T03:03:35Z

controllers/datadogagent/controller_reconcile_v2.go

+		}
+	}
+
+	if err = r.handleProfiles(ctx, profiles, profilesByNode, instance.Namespace); err != nil {


This function does node labeling, after DSs are already create. What happens if DS creation is successful but labeling fails (or keeps failing)? Do we reconcile such nodes?

The error handling logic follows the same pattern as in other places of the code.

All the steps in handleProfiles should be successful. If there's an error in any of the steps, we return an error in the Reconcile function so that the request is retried later.

If it keeps failing, we would have the same problem as in any other part of the code that creates, updates or deletes any Kubernetes objects. There isn't much we can do. Although the only case where I think the node labeling could be failing permanently (and not other operations) is if RBACs are not properly set, but that should be taken care of by our installation charts.

khewonc and others added 20 commits November 7, 2023 16:57

Add profiles base (#966)

5add791

* Add profiles base * Fix ci + apply renames in crd

Add profile sample manifest (#974)

02d3fa0

[profiles] Add label to DaemonSets (#1000)

cceca2f

[profiles] Add integration tests (#992)

d66091d

* [datadogagent_controller_v2_test] Make create/delete funcs more generic * Add integration tests for agent profiles

[profiles] Handle conflicts between profiles (#999)

94d3284

Profile manifest validation (#973)

c6eaaa8

* Add profile validation * At least one container resource must be defined * Remove redundant assert and update boilerplate text

[override/podtemplatespec] Merge affinities (#1004)

2778c1e

[profiles] Add pod-antiaffinity to agent pods (#1003)

00ffc02

This avoids scheduling multiple agent pods of different profiles on the same node during rollouts.

Validate dap (#1028)

bf2deaa

[profiles] Add support for EDS (#1032)

1256ac0

[profiles] Fix Watches call for Profiles (#1043)

f6846a9

[profiles] Fix container resources override with 0 (#1054)

210b609

Bump k8s dependencies to 0.24.x (#1064)

fb1469a

* Bump k8s dependencies These should have been updated when operator-sdk was updated to 1.23.0 according to the docs: https://sdk.operatorframework.io/docs/upgrading-sdk-version/v1.23.0/ * Run "make manifests" * Update licenses

[profiles] Reduce mem usage in pods cleanup (#1065)

7d50145

Change node labels to use dap ns-name format (#1071)

66a7941

github-actions bot requested changes Feb 12, 2024

View reviewed changes

davidor added the enhancement New feature or request label Feb 12, 2024

davidor added this to the v1.5.0 milestone Feb 12, 2024

github-actions bot approved these changes Feb 12, 2024

View reviewed changes

Merge remote-tracking branch 'origin/main' into datadog-agent-profiles

791ae60

davidor marked this pull request as ready for review February 12, 2024 10:17

davidor requested review from a team as code owners February 12, 2024 10:17

khewonc mentioned this pull request Feb 13, 2024

[introspection][profiles][CECO-851] Refactor introspection #1086

Merged

2 tasks

levan-m reviewed Feb 21, 2024

View reviewed changes

levan-m approved these changes Feb 21, 2024

View reviewed changes

khewonc merged commit a4cf7a3 into main Feb 21, 2024
19 checks passed

khewonc deleted the datadog-agent-profiles branch February 21, 2024 20:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Datadog Agent Profiles #1083

Add Datadog Agent Profiles #1083

davidor commented Feb 12, 2024 •

edited

github-actions bot left a comment

codecov-commenter commented Feb 12, 2024 •

edited

levan-m Feb 21, 2024

davidor Feb 21, 2024 •

edited

khewonc Feb 21, 2024

levan-m Feb 21, 2024

davidor Feb 21, 2024

Add Datadog Agent Profiles #1083

Add Datadog Agent Profiles #1083

Conversation

davidor commented Feb 12, 2024 • edited

What does this PR do?

Describe your test plan

Checklist

github-actions bot left a comment

Choose a reason for hiding this comment

codecov-commenter commented Feb 12, 2024 • edited

Codecov Report

levan-m Feb 21, 2024

Choose a reason for hiding this comment

davidor Feb 21, 2024 • edited

Choose a reason for hiding this comment

khewonc Feb 21, 2024

Choose a reason for hiding this comment

levan-m Feb 21, 2024

Choose a reason for hiding this comment

davidor Feb 21, 2024

Choose a reason for hiding this comment

davidor commented Feb 12, 2024 •

edited

codecov-commenter commented Feb 12, 2024 •

edited

davidor Feb 21, 2024 •

edited