FA rollout experiment fixes by levan-m · Pull Request #3035 · DataDog/datadog-operator

levan-m · 2026-05-21T20:44:46Z

What does this PR do?

Addressing several bugs/issues:

When the same config is flipped back and forth across rollouts, the two ControllerRevisions involved only get a revision bump and keep the same name/hash. If one of those config updates is rolled back, both marker annotations accumulate on the same revision:
```
    operator.datadoghq.com/experiment-promoted: "true"
    operator.datadoghq.com/experiment-rollback: "true"
```
Again when flipping the same config, the revision numbers of the two ControllerRevisions increase, while their creationTimestamp, which is used as a proxy for the experiment start time, doesn't change. Once these revisions are old enough, the controller times out the experiment and rolls back.
If an experiment times out, subsequent experiments fail unless a different config change forces creation of a new ControllerRevision.
The Operator doesn't send task abort or timeout errors to FA (regardless of how FA processes them; they are currently ignored).

Easiest to review commit by commit

b6ef9e5 Decouples experiment timing from ControllerRevision metadata by adding an explicit StartedAt field to ExperimentStatus. rev.CreationTimestamp might be stale if the new spec is the same as an earlier experiment's spec. This could lead to an immediate timeout if the spec matched a pre-existing ControllerRevision's spec.

d202cf7 Rehydrates installer state from the DatadogAgent on daemon startup so that state is reported with the correct task ID instead of an empty one, which forces FA to send start requests.
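
As a rough illustration of the rehydration idea, assuming the experiment ID and the originating task ID can both be read back from the DatadogAgent: the `ExperimentStatus` and `PackageState` shapes and the `rehydrate` helper below are hypothetical and only model the behavior described above.

```go
package main

import "fmt"

// ExperimentStatus holds the fields this sketch assumes are recorded on the DDA.
type ExperimentStatus struct {
	ID     string // experiment / config version ID
	TaskID string // ID of the FA task that started the experiment (assumed to be persisted)
}

// PackageState is a hypothetical stand-in for the state the daemon reports to FA.
type PackageState struct {
	ExperimentConfigVersion string
	TaskID                  string
}

// rehydrate seeds the in-memory state from the experiment recorded on the DDA.
// A nil status means no experiment is recorded, so the state stays empty.
func rehydrate(exp *ExperimentStatus) PackageState {
	if exp == nil {
		return PackageState{}
	}
	return PackageState{ExperimentConfigVersion: exp.ID, TaskID: exp.TaskID}
}

func main() {
	onCluster := &ExperimentStatus{ID: "config-v2", TaskID: "task-123"}

	// After a restart the in-memory state is empty unless it is rebuilt.
	fmt.Printf("without rehydration: %+v\n", PackageState{})
	fmt.Printf("with rehydration:    %+v\n", rehydrate(onCluster))
}
```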

6cf24ad Reports TaskState_ERROR for the original start task on timeout.

8509c37 Replaces the two annotations (*experiment-promoted, *experiment-rollback) on the ControllerRevision with a single annotation, to avoid accidentally accumulating two mutually exclusive annotations and to simplify the logic. The bug is reproducible by flipping the same config back and forth and intentionally failing an experiment to force a rollback.
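
A small sketch of the difference, under stated assumptions: the two legacy keys are quoted from the description above, while `stateKey`, its values, and both helpers are hypothetical (the real annotation name and values are defined in 8509c37).

```go
package main

import "fmt"

const (
	// Legacy keys, as quoted in the PR description.
	legacyPromoted = "operator.datadoghq.com/experiment-promoted"
	legacyRollback = "operator.datadoghq.com/experiment-rollback"
	// Hypothetical key for the single replacement annotation.
	stateKey = "operator.datadoghq.com/experiment-state"
)

// markLegacy models the accumulation described above: each outcome sets its
// own marker and nothing clears the other one.
func markLegacy(annotations map[string]string, key string) {
	annotations[key] = "true"
}

// markState records the outcome under one key, so a later outcome overwrites
// the earlier one instead of accumulating next to it.
func markState(annotations map[string]string, state string) {
	annotations[stateKey] = state
}

func main() {
	// The same revision is reused when a config is flipped back and forth.
	legacy := map[string]string{}
	markLegacy(legacy, legacyPromoted)
	markLegacy(legacy, legacyRollback)
	fmt.Println("legacy markers on one revision:", len(legacy))

	single := map[string]string{}
	markState(single, "promoted")
	markState(single, "rollback")
	fmt.Println("single annotation value:", single[stateKey])
}
```

With one key, "promoted" and "rollback" cannot both be present on a reused revision.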

5aa1b51 Similar to 6cf24ad: reports an abort error for the experiment task when the DDA is changed during an experiment.
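
The reporting in 6cf24ad and 5aa1b51 can be pictured roughly as below. This is a sketch under assumptions: `TaskState` is a local stand-in for the real task state enum (only the ERROR value is named in this PR), `TaskReport` and `reportFor` are hypothetical, and the phase strings `timed_out` and `aborted` are taken from the test plan.

```go
package main

import "fmt"

// TaskState is a local stand-in for the FA task state enum.
type TaskState int

const (
	TaskStateRunning TaskState = iota
	TaskStateDone
	TaskStateError
)

// TaskReport is a hypothetical shape for what gets published to FA.
type TaskReport struct {
	TaskID string
	State  TaskState
	Err    string
}

// reportFor maps a terminal experiment phase to an error report for the
// original start task. The boolean is false when nothing needs publishing.
func reportFor(startTaskID, phase string) (TaskReport, bool) {
	switch phase {
	case "timed_out":
		return TaskReport{TaskID: startTaskID, State: TaskStateError, Err: "experiment timed out"}, true
	case "aborted":
		return TaskReport{TaskID: startTaskID, State: TaskStateError, Err: "experiment aborted: DatadogAgent changed during experiment"}, true
	default:
		return TaskReport{}, false
	}
}

func main() {
	for _, phase := range []string{"running", "timed_out", "aborted"} {
		report, publish := reportFor("task-123", phase)
		fmt.Printf("%-9s publish=%v report=%+v\n", phase, publish, report)
	}
}
```

Both failure modes are attributed to the task that started the experiment rather than to a new or empty task ID.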

Motivation

What inspired you to submit this pull request?

Additional Notes

Anything else we should know when reviewing?

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

Agent: vX.Y.Z
Cluster Agent: vX.Y.Z

Describe your test plan

The args below need to be set on the Operator (debug-level logging recommended):

        - --createControllerRevisions=true
        - --remoteUpdatesEnabled=true
        - --remoteConfigEnabled=true

and env vars to configure the cluster name and endpoint:

        - name: DD_CLUSTER_NAME
          value: my-test-cluster
        - name: DD_SITE
          value: datad0g.com          # endpoint for internal testing and api/app key in the same org

Happy path

1. Apply the DDA.
2. Go to FA -> find cluster -> Configuration -> Edit -> Deploy.
3. Switch to the Deployments view in FA and confirm the deployment succeeds.
4. Check that the change was applied in the cluster.

Failures
timeout
2. Watch dda yaml | yq .status.experiment; once the experiment is running, scale the Operator to 0 replicas.
3. Wait 15 min.
4. Scale back to 1.
5. The Operator marks the experiment as timed_out and rolls it back. FA retries the same experiment, eventually failing the deployment.
6. Try the happy path.

abort
2. Watch dda yaml | yq .status.experiment; once the experiment is running, scale the Operator to 0 replicas.
3. Apply an updated DDA in the cluster.
4. The Operator marks the experiment as aborted, and the updated DDA should be reconciled in the cluster. FA retries the same experiment, eventually failing the deployment.
5. Try the happy path.

Checklist

PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
PR has a milestone or the qa/skip-qa label
All commits are signed (see: signing commits)

codecov-commenter · 2026-05-21T20:57:21Z

Codecov Report

❌ Patch coverage is 79.01235% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.31%. Comparing base (f70667d) to head (649b95d).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/fleet/daemon.go | 65.38% | 6 Missing and 3 partials ⚠️ |
| internal/controller/datadogagent/experiment.go | 76.92% | 4 Missing and 2 partials ⚠️ |
| pkg/fleet/daemon_worker.go | 92.85% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3035      +/-   ##
==========================================
+ Coverage   42.24%   42.31%   +0.06%     
==========================================
  Files         337      337              
  Lines       28951    29007      +56     
==========================================
+ Hits        12230    12273      +43     
- Misses      15916    15924       +8     
- Partials      805      810       +5

| Flag | Coverage Δ |
|---|---|
| unittests | `42.31% <79.01%> (+0.06%)` ⬆️ |

| Files with missing lines | Coverage Δ |
|---|---|
| internal/controller/datadogagent/revision.go | `78.46% <100.00%> (ø)` |
| pkg/fleet/daemon_worker.go | `78.18% <92.85%> (+1.95%)` ⬆️ |
| internal/controller/datadogagent/experiment.go | `77.52% <76.92%> (-0.08%)` ⬇️ |
| pkg/fleet/daemon.go | `69.34% <65.38%> (-0.60%)` ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
datadog-datadog-prod-us1-2 · 2026-05-21T20:57:39Z

🛑 Gate Violations

🎯 1 Code Coverage issue detected

A Patch coverage percentage gate may be blocking this PR.

• Patch coverage: 76.71% (threshold: 80.00%)

ℹ️ Info

🎯 Code Coverage (details)
• Patch Coverage: 76.71%
• Overall Coverage: 42.64% (+0.06%)

This comment will be updated automatically if new data arrives.

🔗 Commit SHA: 649b95d

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5aa1b51cb3

Mathew-Estafanous · 2026-05-22T21:29:20Z

+// `pkg.experimentConfigVersion == experiment.ID` never matches).
+//
+// Reads go through the API reader (not the cache) because the informer
+// cache may not be populated yet at the moment Start runs.


Instead of doing this, can we use cache.WaitForCacheSync to guarantee the informer is populated?

levan-m · May 22, 2026

I think using the APIReader for direct reads, and doing it before d.rcClient.Subscribe, makes the intent clearer. Technically we could use cache.WaitForCacheSync, then read the DDA from the cache and initialize the config version. That would be the same amount of code, but the RC subscription would block on the whole cache sync instead of a single DDA list. In practice the tradeoff is small.
Dropping this and relying only on WaitForCacheSync plus an informer event to trigger the config update would remove some code, but it would create a race between RC callbacks and informer events.
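
The ordering argument can be shown with a deterministic toy model. Everything here is a hypothetical stand-in except the `experimentConfigVersion` name, which appears in the quoted code comment: `Reader` plays the role of a direct (uncached) read, and `eagerSubscribe` simulates RC delivering an update as soon as the callback is registered.

```go
package main

import "fmt"

// Reader stands in for a direct, uncached read of the experiment recorded on the DDA.
type Reader interface {
	CurrentExperimentID() string
}

type fakeReader struct{ id string }

func (f fakeReader) CurrentExperimentID() string { return f.id }

// Daemon keeps the config version that RC callbacks compare against.
type Daemon struct {
	experimentConfigVersion string
	matches                 []bool
}

// onRemoteConfig records whether the incoming ID matches the known state.
func (d *Daemon) onRemoteConfig(experimentID string) {
	d.matches = append(d.matches, experimentID == d.experimentConfigVersion)
}

// startSeedFirst reads the state directly, then subscribes.
func (d *Daemon) startSeedFirst(r Reader, subscribe func(func(string))) {
	d.experimentConfigVersion = r.CurrentExperimentID()
	subscribe(d.onRemoteConfig)
}

// startSubscribeFirst subscribes and only fills the state afterwards,
// as a later informer event would.
func (d *Daemon) startSubscribeFirst(r Reader, subscribe func(func(string))) {
	subscribe(d.onRemoteConfig)
	d.experimentConfigVersion = r.CurrentExperimentID()
}

// eagerSubscribe simulates RC invoking the callback immediately on registration.
func eagerSubscribe(cb func(string)) { cb("exp-1") }

func main() {
	r := fakeReader{id: "exp-1"}

	a := &Daemon{}
	a.startSeedFirst(r, eagerSubscribe)
	fmt.Println("seed first:", a.matches)

	b := &Daemon{}
	b.startSubscribeFirst(r, eagerSubscribe)
	fmt.Println("subscribe first:", b.matches)
}
```

In this model the first callback only matches when the state is seeded before subscribing, which is the race the reply describes.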

* fix(experiment): anchor timeout on Status.Experiment.StartedAt
* fix(fleet): rehydrate installer state from DatadogAgent on daemon startup
* fix(experiment): report TaskState_ERROR for the original start task on timeout
* refactor(experiment): single experiment-state annotation on revisions
* fix(fleet): publish abort to Fleet Automation the same way as timeout

(cherry picked from commit c85321f)

Co-authored-by: levan-m <116471169+levan-m@users.noreply.github.com>

levan-m added this to the v1.27.0 milestone May 21, 2026

levan-m added the bug and enhancement labels May 21, 2026

github-actions Bot added the team/container-platform, team/container-autoscaling, and team/fleet labels May 21, 2026

levan-m added 5 commits May 21, 2026 17:02

b6ef9e5 fix(experiment): anchor timeout on Status.Experiment.StartedAt
d202cf7 fix(fleet): rehydrate installer state from DatadogAgent on daemon startup
6cf24ad fix(experiment): report TaskState_ERROR for the original start task on timeout
8509c37 refactor(experiment): single experiment-state annotation on revisions
5aa1b51 fix(fleet): publish abort to Fleet Automation the same way as timeout

levan-m force-pushed the levan-m/experiment-fixes branch from 86065ef to 5aa1b51 Compare May 21, 2026 21:07

levan-m marked this pull request as ready for review May 22, 2026 01:19

levan-m requested a review from a team May 22, 2026 01:19

levan-m requested review from a team as code owners May 22, 2026 01:19

chatgpt-codex-connector Bot reviewed May 22, 2026

Comment thread internal/controller/datadogagent/experiment.go

coignetp approved these changes May 22, 2026

Mathew-Estafanous reviewed May 22, 2026

Mathew-Estafanous approved these changes May 26, 2026

Merge branch 'main' into levan-m/experiment-fixes

649b95d

levan-m merged commit c85321f into main May 27, 2026
37 of 38 checks passed

levan-m deleted the levan-m/experiment-fixes branch May 27, 2026 14:22

levan-m added the backport/v1.27 label May 27, 2026

dd-octo-sts Bot mentioned this pull request May 27, 2026

[Backport v1.27] FA rollout experiment fixes #3049

Merged

3 tasks

levan-m merged 6 commits into main from levan-m/experiment-fixes