Fix broken runtime config sync #10013
Conversation
Force-pushed from c93869d to ceab5a6
This looks like it significantly changes what was only added in #9980. If it stays that way, it might make sense to revert #9980 as a whole and make a complete new attempt. That would make the backport of this cleaner (as you wouldn't have to backport a commit and its reversion at once). What's the reason for moving …? ceab5a6 seems to add a second independent …
Well, my first thought was that we might be using it even more than we thought in #9980, and that it might introduce a circular dependency. However, if you are strongly against it, then of course I can undo it.
Feels strange to use the same mutex for the cluster config sync and the API request. Otherwise, there's nothing against it, I guess.
I mainly asked the question because if it is moved, it's more code that was added in #9980 that wouldn't stay the way it was added there. So if it's moved, that would be more reason for me to revert #9980 completely and do a clean new PR. (I haven't looked at the PR in enough detail to assess whether it should be moved.)
Why does it feel strange? Think of what this mutex is supposed to protect: the consistency of the configuration of a specific object. So anything that attempts to modify the object at runtime should use the same mutex. Imagine someone were to send a …
Does the locking in …
IMHO yes, it does! It acquires the lock before it does …
That would be too late for the cluster config sync part.
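To illustrate the point of this exchange with a minimal, hypothetical sketch (the handler names and the single g_FooLock mutex are made up for the example and are not the actual Icinga 2 code): if the lock is only acquired deep inside one helper, the check-and-modify window at the top of the other path stays unprotected, whereas taking the same per-object lock at the entry of both the cluster config sync path and the API request path serializes them completely.

    // Hypothetical sketch, not the actual Icinga 2 implementation: both the
    // JSON-RPC (cluster config sync) handler and the HTTP API handler take the
    // same per-object lock before they inspect or modify the object, so neither
    // path can interleave with the other.
    #include <iostream>
    #include <mutex>
    #include <string>

    std::mutex g_FooLock; // stands in for "the" lock of one specific object

    void HandleConfigUpdateViaJsonRpc(const std::string& name)
    {
        std::lock_guard<std::mutex> lock(g_FooLock); // taken at the entry of the handler
        // ... compare object versions, write and activate the synced config ...
        std::cout << "JSON-RPC update for " << name << "\n";
    }

    void HandleConfigUpdateViaHttp(const std::string& name)
    {
        std::lock_guard<std::mutex> lock(g_FooLock); // same lock, no race window
        // ... check whether the object exists, then create/modify/delete it ...
        std::cout << "HTTP update for " << name << "\n";
    }

    int main()
    {
        HandleConfigUpdateViaJsonRpc("example-host");
        HandleConfigUpdateViaHttp("example-host");
    }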
Force-pushed from 2b40c6f to a259a68
What about that part? Currently, the PR adds different locking based on whether an object is created via HTTP or JSON-RPC. Both would create the same object though, so both should use the same locking. Wouldn't you end up with race conditions between HTTP and JSON-RPC otherwise?
I don't think so! In case of a JSON-RPC connection, the object is locked twice! There is only one case in which a race condition may occur: when the API request enters icinga2/lib/remote/apilistener-configsync.cpp, lines 117 to 133 (at a259a68).
Force-pushed from 5bd94ce to 4a43f0e
I've tested all the use cases of the config sync and it is working perfectly now. However, we still need to address the use of …
static std::mutex m_Mutex;
static std::condition_variable m_CV;
static std::map<Type*, std::set<String>> m_LockedObjectNames;
Consider unordered_{map,set}.
The exact types won't matter that much here as the overall size of those containers will stay rather small anyways.
For completeness, as I already talked with @yhabteab about it: I was thinking about maybe making this a std::set<std::pair<Type*, String>> instead (or a std::unordered_set for that matter, there won't be much of a difference), as that would reduce the map + set operations down to just the set operations. But this seems to bring no huge improvements for simplifying the code (which would have been my motivation here, not performance).
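For illustration only, a rough sketch of that alternative shape, with std::string standing in for Icinga's Type* and String (so explicitly not the code from this PR): the per-type map disappears and marking a name as locked becomes a single insert into one flat set. Note that a std::unordered_set of pairs would additionally need a custom hash, which is one reason std::set is the simpler drop-in.

    #include <mutex>
    #include <set>
    #include <string>
    #include <utility>

    // One flat set keyed by the (type, name) pair instead of a map of sets.
    static std::mutex l_Mutex;
    static std::set<std::pair<std::string, std::string>> l_LockedObjectNames;

    // Returns true if the pair was free and has now been marked as locked.
    bool TryLockObjectName(const std::string& type, const std::string& name)
    {
        std::lock_guard<std::mutex> lock(l_Mutex);
        return l_LockedObjectNames.insert({type, name}).second;
    }

    void UnlockObjectName(const std::string& type, const std::string& name)
    {
        std::lock_guard<std::mutex> lock(l_Mutex);
        l_LockedObjectNames.erase({type, name});
    }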
Looks fine in general to me. One commit message refers to the outdated class name ObjectNameMutex though. And if you're touching the PR anyways, you might do a slight improvement to one of the comments (see my inline comment).
Force-pushed from 4e265c1 to 40e71dd
ref/NC/804054
I'm somewhat confused how PRs and issues belong together here.
That part of the PR description is outdated, isn't it? That was already reverted in #10018. If I understand correctly, #10012 describes the problem introduced by #9980 and was thus fixed with the revert in #10018. What remains is the original issue from #9721 where this PR is now the second attempt to fix it. Therefore, I'd do the following:
Do you agree that this correctly represents the current state of these issues and PRs?
Force-pushed from 40e71dd to b468e0c
Done! I've also just rebased the PR!
Force-pushed from b468e0c to 546dea9
Are there any updates on this? The config sync between our masters and the satellites below has died again.
I did a slightly different test for this PR which tests both for the original issue (#9721) and for the new, temporarily introduced one (#10012):
Take a cluster with an HA master zone (master-1, master-2) and a satellite zone; it shouldn't matter if it's HA, in my case it was (zone satellite-a with nodes satellite-a-1 and satellite-a-2). Next, stop all satellites and run the loop from the PR with a slight modification to add the objects in the satellite-a zone (i.e. add "zone": "satellite-a" to the JSON attributes). Finally, after the objects were created, start the satellites again and observe how many of the newly created hosts were synced to which node.
v2.14.2
Sync to the online master works reliably, sync on reconnect to the satellites that were offline when the objects were created is unreliable (i.e. demonstrates the original issue #9721).
Number of test hosts present in /var/lib/icinga2/api/packages/_api/ per node:
100 master-1
100 master-2
72 satellite-a-1
95 satellite-a-2
04ef105 (merge commit of #9980)
Sync on reconnect worked fine with that; however, it broke the online sync (i.e. demonstrates the temporarily introduced issue #10012):
100 master-1
0 master-2
100 satellite-a-1
100 satellite-a-2
This PR
Now all nodes receive and store their config:
100 master-1
100 master-2
100 satellite-a-1
100 satellite-a-2
static std::mutex m_Mutex;
static std::condition_variable m_CV;
static std::map<Type*, std::set<String>> m_LockedObjectNames;
These raw pointers aren't even as potentially problematic as in #9844 (comment), as our Type#~Type() is never called.
Basically the same as in #9980, but this PR additionally prohibits concurrent API requests targeting the same object. Consequently, no API request is processed for an object while a cluster "config update/delete" event for that very same object is being processed, and vice versa.
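As a rough, hedged sketch of how such a per-object-name lock can be built from the m_Mutex / m_CV / m_LockedObjectNames members shown in the review snippets above (with std::string standing in for Icinga's Type* and String, and the class name only borrowed from the discussion, not taken from the merged code): the constructor blocks until no other thread holds the same (type, name) combination, and the destructor releases it and wakes up waiting threads.

    // Approximation, not the merged code: an RAII lock on a (type, name) pair.
    #include <condition_variable>
    #include <map>
    #include <mutex>
    #include <set>
    #include <string>

    class ObjectNameLock
    {
    public:
        ObjectNameLock(const std::string& type, const std::string& name)
            : m_Type(type), m_Name(name)
        {
            std::unique_lock<std::mutex> lock(m_Mutex);

            // Wait until no other thread has locked this (type, name) pair.
            m_CV.wait(lock, [this] {
                auto it = m_LockedObjectNames.find(m_Type);
                return it == m_LockedObjectNames.end() || !it->second.count(m_Name);
            });

            m_LockedObjectNames[m_Type].insert(m_Name);
        }

        ~ObjectNameLock()
        {
            {
                std::lock_guard<std::mutex> lock(m_Mutex);
                m_LockedObjectNames[m_Type].erase(m_Name);
            }

            // Wake up everyone waiting for an object name to become free.
            m_CV.notify_all();
        }

    private:
        std::string m_Type;
        std::string m_Name;

        static std::mutex m_Mutex;
        static std::condition_variable m_CV;
        static std::map<std::string, std::set<std::string>> m_LockedObjectNames;
    };

    std::mutex ObjectNameLock::m_Mutex;
    std::condition_variable ObjectNameLock::m_CV;
    std::map<std::string, std::set<std::string>> ObjectNameLock::m_LockedObjectNames;

    // Both the API request handler and the cluster config sync handler would
    // take such a lock before touching the targeted object, e.g.:
    //     ObjectNameLock lock("Host", "example-host");
    //     // ... create, modify or delete the object ...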
Tests
Master 1
Master 2
Satellite
Start both masters and create some objects via the API, wait until the two masters have synchronised their configuration and then start the satellite.
Tests for #10012
Start two Icinga endpoints that are connected to each other and must accept config updates from each other (accept_config = true). It no longer silently aborts the synchronisation attempts.
fixes #9721