FA rollout experiment fixes#3035
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3035 +/- ##
==========================================
+ Coverage 42.24% 42.31% +0.06%
==========================================
Files 337 337
Lines 28951 29007 +56
==========================================
+ Hits 12230 12273 +43
- Misses 15916 15924 +8
- Partials 805 810 +5
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
🛑 Gate Violations
ℹ️ Info🎯 Code Coverage (details) Useful? React with 👍 / 👎 This comment will be updated automatically if new data arrives.🔗 Commit SHA: 649b95d | Docs | Datadog PR Page | Give us feedback! |
86065ef to
5aa1b51
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5aa1b51cb3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // `pkg.experimentConfigVersion == experiment.ID` never matches). | ||
| // | ||
| // Reads go through the API reader (not the cache) because the informer | ||
| // cache may not be populated yet at the moment Start runs. |
There was a problem hiding this comment.
Instead of doing this, can we use cache.WaitForCacheSync to guarantee the informer is populated?
There was a problem hiding this comment.
I think using APIReader for direct reads and doing it before d.rcClient.Subscribe makes intent clearer. Technically we could use cache.WaitForCacheSync and then read DDA from cache and initialize config version, would be same amount of code but RC subscription would block on whole cache sync vs single DDA list. In practice there is no big tradeoff.
Dropping this and relying only on WaitForCacheSync and informer event triggering config update would remove some code but create race between RC callbacks and informer events.
* fix(experiment): anchor timeout on Status.Experiment.StartedAt * fix(fleet): rehydrate installer state from DatadogAgent on daemon startup * fix(experiment): report TaskState_ERROR for the original start task on timeout * refactor(experiment): single experiment-state annotation on revisions * fix(fleet): publish abort to Fleet Automation the same way as timeout (cherry picked from commit c85321f) Co-authored-by: levan-m <116471169+levan-m@users.noreply.github.com>
What does this PR do?
Addressing several bugs/issues:
creationTimestampwhich is used as proxy for experiment start time, doesn't change. Once these revisions are old enough controller will timeout and rollback.Easiest to review Commit by commit
b6ef9e5 Decouples experiment timing from
ControllerRevisionmetadata by adding explicitStartedAtfield toExperimentStatus.rev.CreationTimestampmight be stale if new spec has is same as earlier experiment. This could lead to immediate timeout if spec matched pre-existing controller rev spec.d202cf7 rehydrate installer state from DatadogAgent on daemon startup to report state with correct task ID instead of empty one forcing FA to send start requests.
6cf24ad report TaskState_ERROR for the original start task on timeout.
8509c37 replaces two annotations (
*experiment-promoted,*experiment-rollback) on controller rev with a single annotation to avoid accidentally accumulating two mutually exclusive annotations and to simplify logic. Bug reproducible by flipping same config back and forth and intentionally failing experiment to force rollback.5aa1b51 similar to 6cf24ad, report
aborterror for the experiment task when DDA is changed during experiment.Motivation
What inspired you to submit this pull request?
Additional Notes
Anything else we should know when reviewing?
Minimum Agent Versions
Are there minimum versions of the Datadog Agent and/or Cluster Agent required?
Describe your test plan
Below args needs to be set on Operator (debug level logging recommended)
and env vars to configure cluster name and endpoint
Happy path
Failures
timeout
2. Watch dda yaml
| yq .status.experiment, once experiment is running scale Operator to 0 replicas.3. Wait 15min
4. Scale back to 1
5. Operator marks experiment as
timed_outand is rolled back. FA retries same experiment eventually failing deployment.6. try happy path
abort
2. Watch dda yaml
| yq .status.experiment, once experiment is running scale Operator to 0 replicas.3. Apply updated DDA in cluster
4. Operator marks experiment as
aborted, updated DDA should be reconciled in cluster. FA retries same experiment eventually failing deployment.5. Try happy path.
Checklist
bug,enhancement,refactoring,documentation,tooling, and/ordependenciesqa/skip-qalabel