-
Notifications
You must be signed in to change notification settings - Fork 102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SLO] fix reconcile bugs, add predicate to builder #1185
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1185 +/- ##
==========================================
+ Coverage 59.55% 59.57% +0.02%
==========================================
Files 174 174
Lines 21559 21620 +61
==========================================
+ Hits 12839 12880 +41
- Misses 7941 7961 +20
Partials 779 779
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 1 file with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
Hey @celenechang, Thanks for the quick feedback on this issue, it’s much appreciated. I tested your code and it does fix the issue but there are caveats. When testing with 2 replicas or more, we can observe that it’s still possible to have what I assume is a race condition. Testing with 2 replicas and the sample SLO. I obtained the following log 2 out of 3 times:
But only one SLO was created. Since the reconciliation loop stops after updating the tags, it no longer creates 2 SLOs. It fails to update the tags instead which is less impactful. |
I suggest we add validation early in the reconciliation loop to ensure it’s idempotent in all situations. The solution I suggested does that by modifying the sync status field before creating the SLO. If two loops are triggered at the same time, whichever one updates the DatadogSLO in the cluster also creates the SLO in Datadog while the second loop will fail since the first one modified DatadogSLO creating a version conflict. It might not be the best solution, but feel free to incorporate it into your PR if it’s good enough. |
Thanks for testing @philippeVV ! One clarification question for now - when you say |
Hey @celenechang, sorry for the late answer. I was out for a couple of days. The leader election is enabled, I can see in the logs a leader being successfully elected. I increased the replica count here |
I'm joining the startup logs in case it's helpful. Replica 1:
replica 2:
|
Here are the logs when the duplicate SLO is created: Tested with:
Those logs a from the elected pod. |
Thank you again @philippeVV , appreciate your time and the information you shared. I have been trying to reproduce the issue you had with multiple replicas but I could not. I think the main interesting item from your logs is the Reconcile that occurs within the same second of the Create event. In my testing, there is always a 1m ( Since the changes in this PR seems to be an improvement and fixes at least one bug, I'm going to merge ahead of beginning our release of 1.7.0 . We can continue the conversation on your PR whether we should integrate that as we have some reservations. |
What does this PR do?
Update reconcile code to 1) not continue reconciling if tags have been updated; 2) set
LastForceSyncTime
on create and update, so that the status update does not trigger a reconcile; and 3) for good measure, add the GenerationChanged predicate filter so that any status-only updates do not trigger a reconcile.Motivation
Address #1062 .
Additional Notes
Anything else we should know when reviewing?
Minimum Agent Versions
Are there minimum versions of the Datadog Agent and/or Cluster Agent required?
Describe your test plan
Write there any instructions and details you may have to test your PR.
Checklist
bug
,enhancement
,refactoring
,documentation
,tooling
, and/ordependencies
qa/skip-qa
label