Fix broken runtime config sync #10013
Conversation
Force-pushed from c93869d to ceab5a6
This looks like it significantly changes what was only added in #9980. If it stays that way, it might make sense to revert #9980 as a whole and make a complete new attempt. That would make the backport of this cleaner (as you wouldn't have to backport a commit and its reversion at once). What's the reason for moving …? ceab5a6 seems to add a second independent …
Well, my first thought was that we might be using it even more than we thought in #9980, and that it might introduce a circular dependency. However, if you are strongly against it, then of course I can undo it.
Feels strange to use the same mutex for the cluster config sync and the API request. Otherwise, there's nothing against it, I guess.
I mainly asked the question because if it is moved, it's more code that was added in #9980 that wouldn't stay the way it was added there. So if it's moved, that would be more reason for me to revert #9980 completely and do a clean new PR. (I haven't looked at the PR in enough detail to assess whether it should be moved.)
Why does it feel strange? Think of what this mutex is supposed to protect: the consistency of the configuration of a specific object. So anything that attempts to modify the object at runtime should use the same mutex. Imagine someone were to send a …
Does the locking in …
IMHO yes, it does! It acquires the lock before it does …
That would be too late for the cluster config sync part.
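To illustrate the point of this exchange with a minimal, hypothetical sketch (the handler names and the single g_FooLock mutex are made up for the example and are not the actual Icinga 2 code): if the lock is only acquired deep inside one helper, the check-and-modify window at the top of the other path stays unprotected, whereas taking the same per-object lock at the entry of both the cluster config sync path and the API request path serializes them completely.

    // Hypothetical sketch, not the actual Icinga 2 implementation: both the
    // JSON-RPC (cluster config sync) handler and the HTTP API handler take the
    // same per-object lock before they inspect or modify the object, so neither
    // path can interleave with the other.
    #include <iostream>
    #include <mutex>
    #include <string>

    std::mutex g_FooLock; // stands in for "the" lock of one specific object

    void HandleConfigUpdateViaJsonRpc(const std::string& name)
    {
        std::lock_guard<std::mutex> lock(g_FooLock); // taken at the entry of the handler
        // ... compare object versions, write and activate the synced config ...
        std::cout << "JSON-RPC update for " << name << "\n";
    }

    void HandleConfigUpdateViaHttp(const std::string& name)
    {
        std::lock_guard<std::mutex> lock(g_FooLock); // same lock, no race window
        // ... check whether the object exists, then create/modify/delete it ...
        std::cout << "HTTP update for " << name << "\n";
    }

    int main()
    {
        HandleConfigUpdateViaJsonRpc("example-host");
        HandleConfigUpdateViaHttp("example-host");
    }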
Force-pushed from 2b40c6f to a259a68
What about that part? Currently, the PR adds different locking based on whether an object is created via HTTP or JSON-RPC. Both would create the same object though, so both should use the same locking. Wouldn't you end up with race conditions between HTTP and JSON-RPC otherwise?
I don't think so! In case of a JSON-RPC connection, the object is locked twice! There is only one case in which a race condition may occur: when the API request enters icinga2/lib/remote/apilistener-configsync.cpp, lines 117 to 133 (at a259a68).
Force-pushed from 5bd94ce to 4a43f0e
I've tested all the use cases of the config sync and it is working perfectly now. However, we still need to address the use of …
static std::mutex m_Mutex;
static std::condition_variable m_CV;
static std::map<Type*, std::set<String>> m_LockedObjectNames;
Consider unordered_{map,set}.
The exact types won't matter that much here as the overall size of those containers will stay rather small anyways.
For completeness, as I already talked with @yhabteab about it: I was thinking about maybe making this a std::set<std::pair<Type*, String>> instead (or a std::unordered_set for that matter, there won't be much of a difference), as that would reduce the map + set operations down to just the set operations. But this seems to bring no huge improvements for simplifying the code (which would have been my motivation here, not performance).
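For illustration only, a rough sketch of that alternative shape, with std::string standing in for Icinga's Type* and String (so explicitly not the code from this PR): the per-type map disappears and marking a name as locked becomes a single insert into one flat set. Note that a std::unordered_set of pairs would additionally need a custom hash, which is one reason std::set is the simpler drop-in.

    #include <mutex>
    #include <set>
    #include <string>
    #include <utility>

    // One flat set keyed by the (type, name) pair instead of a map of sets.
    static std::mutex l_Mutex;
    static std::set<std::pair<std::string, std::string>> l_LockedObjectNames;

    // Returns true if the pair was free and has now been marked as locked.
    bool TryLockObjectName(const std::string& type, const std::string& name)
    {
        std::lock_guard<std::mutex> lock(l_Mutex);
        return l_LockedObjectNames.insert({type, name}).second;
    }

    void UnlockObjectName(const std::string& type, const std::string& name)
    {
        std::lock_guard<std::mutex> lock(l_Mutex);
        l_LockedObjectNames.erase({type, name});
    }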
Looks fine in general to me. One commit message refers to the outdated class name ObjectNameMutex though. And if you're touching the PR anyways, you might do a slight improvement to one of the comments (see my inline comment).
Force-pushed from 4e265c1 to 40e71dd
ref/NC/804054
I'm somewhat confused how PRs and issues belong together here.
That part of the PR description is outdated, isn't it? That was already reverted in #10018. If I understand correctly, #10012 describes the problem introduced by #9980 and was thus fixed with the revert in #10018. What remains is the original issue from #9721 where this PR is now the second attempt to fix it. Therefore, I'd do the following:
Do you agree that this correctly represents the current state of these issues and PRs?
Force-pushed from 40e71dd to b468e0c
Done! I've also just rebased the PR!
Force-pushed from b468e0c to 546dea9
Are there any updates on this? The config sync between our masters and the satellites below has died again.
I did a slightly different test for this PR which tests both for the original issue (#9721) and for the new, temporarily introduced one (#10012):
Take a cluster with an HA master zone (master-1, master-2) and a satellite zone; it shouldn't matter if it's HA, in my case it was (zone satellite-a with nodes satellite-a-1 and satellite-a-2). Next, stop all satellites and run the loop from the PR with a slight modification to add the objects in the satellite-a zone (i.e. add "zone": "satellite-a" to the JSON attributes). Finally, after the objects were created, start the satellites again and observe how many of the newly created hosts were synced to which node.
v2.14.2
Sync to the online master works reliably, sync on reconnect to the satellites that were offline when the objects were created is unreliable (i.e. demonstrates the original issue #9721).
Number of test hosts present in /var/lib/icinga2/api/packages/_api/ per node:
100 master-1
100 master-2
72 satellite-a-1
95 satellite-a-2
04ef105 (merge commit of #9980)
Sync on reconnect worked fine with that; however, it broke the online sync (i.e. demonstrates the temporarily introduced issue #10012):
100 master-1
0 master-2
100 satellite-a-1
100 satellite-a-2
This PR
Now all nodes receive and store their config:
100 master-1
100 master-2
100 satellite-a-1
100 satellite-a-2
static std::mutex m_Mutex;
static std::condition_variable m_CV;
static std::map<Type*, std::set<String>> m_LockedObjectNames;
These raw pointers aren't even as potentially problematic as in #9844 (comment), as our Type#~Type() is never called.
Basically the same as in #9980, but this PR additionally prohibits concurrent API requests targeting the same object. Consequently, no API request is processed for an object while a cluster "config update/delete" event for that very same object is being processed, and vice versa.
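As a rough, hedged sketch of how such a per-object-name lock can be built from the m_Mutex / m_CV / m_LockedObjectNames members shown in the review snippets above (with std::string standing in for Icinga's Type* and String, and the class name only borrowed from the discussion, not taken from the merged code): the constructor blocks until no other thread holds the same (type, name) combination, and the destructor releases it and wakes up waiting threads.

    // Approximation, not the merged code: an RAII lock on a (type, name) pair.
    #include <condition_variable>
    #include <map>
    #include <mutex>
    #include <set>
    #include <string>

    class ObjectNameLock
    {
    public:
        ObjectNameLock(const std::string& type, const std::string& name)
            : m_Type(type), m_Name(name)
        {
            std::unique_lock<std::mutex> lock(m_Mutex);

            // Wait until no other thread has locked this (type, name) pair.
            m_CV.wait(lock, [this] {
                auto it = m_LockedObjectNames.find(m_Type);
                return it == m_LockedObjectNames.end() || !it->second.count(m_Name);
            });

            m_LockedObjectNames[m_Type].insert(m_Name);
        }

        ~ObjectNameLock()
        {
            {
                std::lock_guard<std::mutex> lock(m_Mutex);
                m_LockedObjectNames[m_Type].erase(m_Name);
            }

            // Wake up everyone waiting for an object name to become free.
            m_CV.notify_all();
        }

    private:
        std::string m_Type;
        std::string m_Name;

        static std::mutex m_Mutex;
        static std::condition_variable m_CV;
        static std::map<std::string, std::set<std::string>> m_LockedObjectNames;
    };

    std::mutex ObjectNameLock::m_Mutex;
    std::condition_variable ObjectNameLock::m_CV;
    std::map<std::string, std::set<std::string>> ObjectNameLock::m_LockedObjectNames;

    // Both the API request handler and the cluster config sync handler would
    // take such a lock before touching the targeted object, e.g.:
    //     ObjectNameLock lock("Host", "example-host");
    //     // ... create, modify or delete the object ...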
Tests
Master 1
Master 2
Satellite
Start both masters and create some objects via the API, wait until the two masters have synchronised their configuration and then start the satellite.
Tests for #10012
Start two Icinga endpoints that are connected to each other and must accept config updates from each other (accept_config = true). It no longer silently aborts the synchronisation attempts.
fixes #9721