[SLO] fix reconcile bugs, add predicate to builder #1185

Merged
merged 1 commit into from
May 24, 2024

Conversation

celenechang
Contributor

What does this PR do?

Update reconcile code to 1) not continue reconciling if tags have been updated; 2) set LastForceSyncTime on create and update, so that the status update does not trigger a reconcile; and 3) for good measure, add the GenerationChanged predicate filter so that any status-only updates do not trigger a reconcile.
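
As a rough illustration of point 3, the sketch below models what a generation-based filter does. It uses only the standard library; `objectMeta` and `generationChanged` are hypothetical stand-ins, not the operator's code. In controller-runtime the equivalent filter is `predicate.GenerationChangedPredicate`. The sketch assumes the CRD has the status subresource enabled, so that status writes change `resourceVersion` but leave `generation` untouched.

```go
package main

import "fmt"

// objectMeta is a hypothetical stand-in for the metadata fields that matter here.
type objectMeta struct {
	Generation      int64  // bumped by the API server on spec changes
	ResourceVersion string // bumped on every write, including status writes
}

// generationChanged mimics an update filter: it lets an update event through
// only when the generation differs between the old and new object.
func generationChanged(oldObj, newObj objectMeta) bool {
	return oldObj.Generation != newObj.Generation
}

func main() {
	before := objectMeta{Generation: 1, ResourceVersion: "100"}
	statusOnly := objectMeta{Generation: 1, ResourceVersion: "101"} // status write
	specChange := objectMeta{Generation: 2, ResourceVersion: "102"} // spec edit

	fmt.Println("status-only update triggers reconcile:", generationChanged(before, statusOnly))
	fmt.Println("spec update triggers reconcile:", generationChanged(before, specChange))
}
```

Metadata-only writes (labels, annotations, finalizers) do not bump the generation either, so a filter of this kind drops those events as well.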

Motivation

Address #1062 .

Additional Notes

Anything else we should know when reviewing?

Minimum Agent Versions

Are there minimum versions of the Datadog Agent and/or Cluster Agent required?

  • Agent: vX.Y.Z
  • Cluster Agent: vX.Y.Z

Describe your test plan

Write there any instructions and details you may have to test your PR.

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label

@celenechang celenechang added bug Something isn't working feature/slo labels May 15, 2024
@celenechang celenechang added this to the v1.7.0 milestone May 15, 2024
@celenechang celenechang requested review from a team as code owners May 15, 2024 12:20
@codecov-commenter

Codecov Report

Attention: Patch coverage is 54.54545%, with 10 lines in your changes missing coverage. Please review.

Project coverage is 59.57%. Comparing base (6b0c7bd) to head (97e7e57).
Report is 4 commits behind head on main.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #1185      +/-   ##
==========================================
+ Coverage   59.55%   59.57%   +0.02%     
==========================================
  Files         174      174              
  Lines       21559    21620      +61     
==========================================
+ Hits        12839    12880      +41     
- Misses       7941     7961      +20     
  Partials      779      779              
Flag Coverage Δ
unittests 59.57% <54.54%> (+0.02%) ⬆️


Files Coverage Δ
controllers/datadogslo_controller.go 0.00% <0.00%> (ø)
controllers/utils/tag.go 0.00% <0.00%> (ø)
controllers/datadogslo/controller.go 56.21% <66.66%> (+0.23%) ⬆️

... and 1 file with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@philippeVV

Hey @celenechang, Thanks for the quick feedback on this issue, it’s much appreciated.

I tested your code and it does fix the issue but there are caveats. When testing with 2 replicas or more, we can observe that it’s still possible to have what I assume is a race condition.

Testing with 2 replicas and the sample SLO, I obtained the following log 2 out of 3 times:

{"level":"ERROR","ts":"2024-05-16T15:05:50Z","logger":"controllers.DatadogSLO","msg":"failed to update DatadogSLO with required tags","datadogslo":"system/datadogslo-sample","error":"Operation cannot be fulfilled on datadogslos.datadoghq.com \"datadogslo-sample\": the object has been modified; please apply your changes to the latest version and try again"}

But only one SLO was created. Since the reconciliation loop stops after updating the tags, it no longer creates 2 SLOs. It fails to update the tags instead which is less impactful.
However, if you create an SLO with all the required tags, you can still create duplicate SLOs.
To test that, I added the following tag "generated:kubernetes" to the sample SLO. In that case, the controller didn’t have to update the tags so it created duplicate SLOs.
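
The control flow described above can be sketched roughly as follows. This is a simplified model, not the operator's code: `sloObject`, `missingTags` and `reconcileOnce` are hypothetical names, and the only required tag assumed is the one mentioned in this comment.

```go
package main

import "fmt"

// sloObject is a hypothetical, trimmed-down view of a DatadogSLO resource.
type sloObject struct {
	Tags      []string
	DatadogID string // empty until the SLO exists in Datadog
}

// missingTags returns the required tags that are absent from have.
func missingTags(have, required []string) []string {
	seen := make(map[string]bool, len(have))
	for _, t := range have {
		seen[t] = true
	}
	var missing []string
	for _, t := range required {
		if !seen[t] {
			missing = append(missing, t)
		}
	}
	return missing
}

// reconcileOnce models a single pass: if tags had to be added, it stops there;
// otherwise it falls through to the create step.
func reconcileOnce(obj *sloObject, required []string, create func() string) string {
	if missing := missingTags(obj.Tags, required); len(missing) > 0 {
		obj.Tags = append(obj.Tags, missing...)
		return "tags-updated" // early return: no create in this pass
	}
	if obj.DatadogID == "" {
		obj.DatadogID = create()
		return "created"
	}
	return "in-sync"
}

func main() {
	required := []string{"generated:kubernetes"}
	create := func() string { return "hypothetical-slo-id" }

	untagged := &sloObject{}
	fmt.Println(reconcileOnce(untagged, required, create))

	tagged := &sloObject{Tags: []string{"generated:kubernetes"}}
	fmt.Println(reconcileOnce(tagged, required, create))
}
```

In this model, an object that already carries every required tag skips the early return, so two passes that both see an empty `DatadogID` would each reach the create step.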

@philippeVV

I suggest we add validation early in the reconciliation loop to ensure it’s idempotent in all situations. The solution I suggested does that by modifying the sync status field before creating the SLO. If two loops are triggered at the same time, whichever one updates the DatadogSLO in the cluster first also creates the SLO in Datadog, while the second loop fails with a version conflict because the first one has already modified the DatadogSLO.

It might not be the best solution, but feel free to incorporate it into your PR if it’s good enough.

@celenechang
Contributor Author

Thanks for testing @philippeVV !

One clarification question for now - when you say with 2 replicas or more..., do you mean with leader election disabled?

@philippeVV

Hey @celenechang, sorry for the late answer. I was out for a couple of days.

The leader election is enabled, I can see in the logs a leader being successfully elected.
I've only been able to replicate the issue with more than 1 replica.
I don't know the link between the replica count and the issue.

I increased the replica count here

@philippeVV

I'm attaching the startup logs in case they're helpful.
In this example, the datadogSLO is already created. I'm only restarting the deployment.

Replica 1:

{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"setup","msg":"Version: v1.6.0-rc.4_97e7e578"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"setup","msg":"Build time: 2024-05-16/10:22:21"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"setup","msg":"Git Commit: 97e7e578f959c655da3c5a331b68b491eab5c69c"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"setup","msg":"Go Version: go1.19.13"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"setup","msg":"Go OS/Arch: linux/amd64"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"setup","msg":"Manager will be watching namespace","namespace":"system"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"setup","msg":"configuring manager health check","maximumGoroutines":400}
{"level":"INFO","ts":"2024-05-21T19:10:22Z","logger":"setup","msg":"Feature disabled, not starting the controller","controller":"DatadogMonitor"}
{"level":"INFO","ts":"2024-05-21T19:10:22Z","logger":"setup","msg":"Feature disabled, not starting the controller","controller":"DatadogAgentProfile"}
{"level":"INFO","ts":"2024-05-21T19:10:22Z","logger":"controller-runtime.builder","msg":"skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","GVK":"datadoghq.com/v2alpha1, Kind=DatadogAgent"}
{"level":"INFO","ts":"2024-05-21T19:10:22Z","logger":"controller-runtime.builder","msg":"skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called","GVK":"datadoghq.com/v2alpha1, Kind=DatadogAgent"}
{"level":"INFO","ts":"2024-05-21T19:10:22Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/convert"}
{"level":"INFO","ts":"2024-05-21T19:10:22Z","logger":"controller-runtime.builder","msg":"Conversion webhook enabled","GVK":"datadoghq.com/v2alpha1, Kind=DatadogAgent"}
{"level":"INFO","ts":"2024-05-21T19:10:22Z","logger":"setup","msg":"starting manager"}
{"level":"INFO","ts":"2024-05-21T19:10:22Z","logger":"controller-runtime.webhook.webhooks","msg":"Starting webhook server"}
{"level":"INFO","ts":"2024-05-21T19:10:22Z","logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"INFO","ts":"2024-05-21T19:10:22Z","logger":"controller-runtime.webhook","msg":"Serving webhook server","host":"","port":9443}
{"level":"INFO","ts":"2024-05-21T19:10:22Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"INFO","ts":"2024-05-21T19:10:22Z","msg":"Starting server","kind":"health probe","addr":"[::]:8081"}
{"level":"INFO","ts":"2024-05-21T19:10:22Z","logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
{"level":"INFO","ts":"2024-05-21T19:10:22Z","logger":"klog","msg":"attempting to acquire leader lease system/datadog-operator-lock...\n"}

Replica 2:

{"level":"INFO","ts":"2024-05-21T19:10:20Z","logger":"setup","msg":"Version: v1.6.0-rc.4_97e7e578"}
{"level":"INFO","ts":"2024-05-21T19:10:20Z","logger":"setup","msg":"Build time: 2024-05-16/10:22:21"}
{"level":"INFO","ts":"2024-05-21T19:10:20Z","logger":"setup","msg":"Git Commit: 97e7e578f959c655da3c5a331b68b491eab5c69c"}
{"level":"INFO","ts":"2024-05-21T19:10:20Z","logger":"setup","msg":"Go Version: go1.19.13"}
{"level":"INFO","ts":"2024-05-21T19:10:20Z","logger":"setup","msg":"Go OS/Arch: linux/amd64"}
{"level":"INFO","ts":"2024-05-21T19:10:20Z","logger":"setup","msg":"Manager will be watching namespace","namespace":"system"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"setup","msg":"configuring manager health check","maximumGoroutines":400}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"setup","msg":"Feature disabled, not starting the controller","controller":"DatadogMonitor"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"setup","msg":"Feature disabled, not starting the controller","controller":"DatadogAgentProfile"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"controller-runtime.builder","msg":"skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","GVK":"datadoghq.com/v2alpha1, Kind=DatadogAgent"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"controller-runtime.builder","msg":"skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called","GVK":"datadoghq.com/v2alpha1, Kind=DatadogAgent"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"controller-runtime.webhook","msg":"Registering webhook","path":"/convert"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"controller-runtime.builder","msg":"Conversion webhook enabled","GVK":"datadoghq.com/v2alpha1, Kind=DatadogAgent"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"setup","msg":"starting manager"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"controller-runtime.webhook.webhooks","msg":"Starting webhook server"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"controller-runtime.webhook","msg":"Serving webhook server","host":"","port":9443}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","msg":"Starting server","kind":"health probe","addr":"[::]:8081"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
{"level":"INFO","ts":"2024-05-21T19:10:21Z","logger":"klog","msg":"attempting to acquire leader lease system/datadog-operator-lock...\n"}
{"level":"INFO","ts":"2024-05-21T19:11:36Z","logger":"klog","msg":"successfully acquired lease system/datadog-operator-lock\n"}
{"level":"INFO","ts":"2024-05-21T19:11:36Z","msg":"Starting EventSource","controller":"datadogagent","controllerGroup":"datadoghq.com","controllerKind":"DatadogAgent","source":"kind source: *v2alpha1.DatadogAgent"}
{"level":"INFO","ts":"2024-05-21T19:11:36Z","msg":"Starting EventSource","controller":"datadogagent","controllerGroup":"datadoghq.com","controllerKind":"DatadogAgent","source":"kind source: *v1.Secret"}
{"level":"INFO","ts":"2024-05-21T19:11:36Z","msg":"Starting EventSource","controller":"datadogagent","controllerGroup":"datadoghq.com","controllerKind":"DatadogAgent","source":"kind source: *v1.ConfigMap"}
{"level":"INFO","ts":"2024-05-21T19:11:36Z","msg":"Starting EventSource","controller":"datadogagent","controllerGroup":"datadoghq.com","controllerKind":"DatadogAgent","source":"kind source: *v1.DaemonSet"}
{"level":"INFO","ts":"2024-05-21T19:11:36Z","msg":"Starting EventSource","controller":"datadogagent","controllerGroup":"datadoghq.com","controllerKind":"DatadogAgent","source":"kind source: *v1.Deployment"}
{"level":"INFO","ts":"2024-05-21T19:11:36Z","msg":"Starting EventSource","controller":"datadogagent","controllerGroup":"datadoghq.com","controllerKind":"DatadogAgent","source":"kind source: *v1.Role"}
{"level":"INFO","ts":"2024-05-21T19:11:36Z","msg":"Starting EventSource","controller":"datadogagent","controllerGroup":"datadoghq.com","controllerKind":"DatadogAgent","source":"kind source: *v1.RoleBinding"}
{"level":"INFO","ts":"2024-05-21T19:11:36Z","msg":"Starting EventSource","controller":"datadogagent","controllerGroup":"datadoghq.com","controllerKind":"DatadogAgent","source":"kind source: *v1.ServiceAccount"}
{"level":"INFO","ts":"2024-05-21T19:11:36Z","msg":"Starting EventSource","controller":"datadogagent","controllerGroup":"datadoghq.com","controllerKind":"DatadogAgent","source":"kind source: *v1.PodDisruptionBudget"}
{"level":"INFO","ts":"2024-05-21T19:11:36Z","msg":"Starting EventSource","controller":"datadogagent","controllerGroup":"datadoghq.com","controllerKind":"DatadogAgent","source":"kind source: *v1.NetworkPolicy"}
{"level":"INFO","ts":"2024-05-21T19:11:36Z","msg":"Starting EventSource","controller":"datadogagent","controllerGroup":"datadoghq.com","controllerKind":"DatadogAgent","source":"kind source: *v1.Node"}
{"level":"INFO","ts":"2024-05-21T19:11:36Z","msg":"Starting EventSource","controller":"datadogagent","controllerGroup":"datadoghq.com","controllerKind":"DatadogAgent","source":"kind source: *v1.ClusterRole"}
{"level":"INFO","ts":"2024-05-21T19:11:36Z","msg":"Starting EventSource","controller":"datadogagent","controllerGroup":"datadoghq.com","controllerKind":"DatadogAgent","source":"kind source: *v1.ClusterRoleBinding"}
{"level":"INFO","ts":"2024-05-21T19:11:36Z","msg":"Starting Controller","controller":"datadogagent","controllerGroup":"datadoghq.com","controllerKind":"DatadogAgent"}
{"level":"INFO","ts":"2024-05-21T19:11:36Z","msg":"Starting EventSource","controller":"datadogslo","controllerGroup":"datadoghq.com","controllerKind":"DatadogSLO","source":"kind source: *v1alpha1.DatadogSLO"}
{"level":"INFO","ts":"2024-05-21T19:11:36Z","msg":"Starting Controller","controller":"datadogslo","controllerGroup":"datadoghq.com","controllerKind":"DatadogSLO"}
{"level":"INFO","ts":"2024-05-21T19:11:37Z","msg":"Starting workers","controller":"datadogslo","controllerGroup":"datadoghq.com","controllerKind":"DatadogSLO","worker count":1}
{"level":"INFO","ts":"2024-05-21T19:11:37Z","msg":"Starting workers","controller":"datadogagent","controllerGroup":"datadoghq.com","controllerKind":"DatadogAgent","worker count":1}
{"level":"INFO","ts":"2024-05-21T19:11:37Z","logger":"controllers.DatadogSLO","msg":"Reconciling Datadog SLO","datadogslo":"system/datadogslo-sample","version":"v1.27.6+k3s1"}

@philippeVV

philippeVV commented May 21, 2024

Here are the logs when the duplicate SLO is created:

Tested with:

  • 2 replicas
  • generated:kubernetes added to the sample SLO
{"level":"INFO","ts":"2024-05-21T19:18:37Z","logger":"controllers.DatadogSLO","msg":"Reconciling Datadog SLO","datadogslo":"system/datadogslo-sample","version":"v1.27.6+k3s1"}
{"level":"INFO","ts":"2024-05-21T19:18:37Z","logger":"controllers.DatadogSLO","msg":"Object does not have a finalizer; adding finalizer","datadogslo":"system/datadogslo-sample","kind":"&TypeMeta{Kind:DatadogSLO,APIVersion:datadoghq.com/v1alpha1,}","datadogID":"","finalizername":"finalizer.slo.datadoghq.com"}
{"level":"INFO","ts":"2024-05-21T19:18:38Z","logger":"controllers.DatadogSLO","msg":"Created a new DatadogSLO","datadogslo":"system/datadogslo-sample","SLO ID":"be360314273352c9acb421a36c3b0f1d"}
{"level":"INFO","ts":"2024-05-21T19:18:38Z","logger":"controllers.DatadogSLO","msg":"Reconciling Datadog SLO","datadogslo":"system/datadogslo-sample","version":"v1.27.6+k3s1"}
{"level":"INFO","ts":"2024-05-21T19:18:38Z","logger":"controllers.DatadogSLO","msg":"Created a new DatadogSLO","datadogslo":"system/datadogslo-sample","SLO ID":"0553579a8b585f9387da73080ab3ccf3"}
{"level":"ERROR","ts":"2024-05-21T19:18:38Z","logger":"controllers.DatadogSLO","msg":"unable to update DatadogSLO status due to update conflict","datadogslo":"system/datadogslo-sample","error":"Operation cannot be fulfilled on datadogslos.datadoghq.com \"datadogslo-sample\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"INFO","ts":"2024-05-21T19:18:43Z","logger":"controllers.DatadogSLO","msg":"Reconciling Datadog SLO","datadogslo":"system/datadogslo-sample","version":"v1.27.6+k3s1"}

These logs are from the elected pod.

@celenechang
Copy link
Contributor Author

Thank you again @philippeVV , appreciate your time and the information you shared. I have been trying to reproduce the issue you had with multiple replicas but I could not. I think the main interesting item from your logs is the Reconcile that occurs within the same second of the Create event. In my testing, there is always a 1m (defaultRequeuePeriod) interval between these events (with/without multiple replicas, with/without the required tags).

Since the changes in this PR seem to be an improvement and fix at least one bug, I'm going to merge ahead of beginning our release of 1.7.0. We can continue the conversation on your PR about whether we should integrate it, as we have some reservations.

@celenechang celenechang merged commit 8981855 into main May 24, 2024
23 checks passed
@celenechang celenechang deleted the celene/fix_slo_reconcile branch May 24, 2024 16:19
Labels
bug Something isn't working feature/slo
4 participants