
Fix broken runtime config sync #10013

Merged: 5 commits merged into master from broken-runtime-config-sync on Aug 6, 2024

Conversation

@yhabteab (Member) commented Mar 1, 2024

Basically the same as in #9980, but this PR additionally prohibits concurrent API requests targeting the same object. Consequently, no API request is processed for an object while a cluster "config update/delete" event for that very same object is being handled, and vice versa.
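Conceptually, the added serialisation can be pictured as a scoped lock per (type, object name) pair. The following is only a rough sketch built around the static members quoted later in this thread (m_Mutex, m_CV, m_LockedObjectNames); the class name ObjectNameLock and its exact interface are illustrative and not necessarily what the diff contains, and the Icinga Type::Ptr and String types are assumed to be available.

#include <condition_variable>
#include <map>
#include <mutex>
#include <set>

class ObjectNameLock
{
public:
	ObjectNameLock(const Type::Ptr& ptype, const String& objName)
		: m_Type(ptype), m_ObjectName(objName)
	{
		std::unique_lock<std::mutex> lock(m_Mutex);

		// Block until no other thread holds this (type, name) pair, then claim it.
		m_CV.wait(lock, [this] {
			auto& names = m_LockedObjectNames[m_Type.get()];
			return names.find(m_ObjectName) == names.end();
		});

		m_LockedObjectNames[m_Type.get()].insert(m_ObjectName);
	}

	~ObjectNameLock()
	{
		{
			std::unique_lock<std::mutex> lock(m_Mutex);
			m_LockedObjectNames[m_Type.get()].erase(m_ObjectName);
		}

		// Wake all waiters; each one re-checks whether its own (type, name) pair is free now.
		m_CV.notify_all();
	}

private:
	Type::Ptr m_Type;
	String m_ObjectName;

	static std::mutex m_Mutex;
	static std::condition_variable m_CV;
	static std::map<Type*, std::set<String>> m_LockedObjectNames;
};

Both the HTTP handlers and the cluster config sync would then hold such a guard around their check-and-modify steps for an object, which is what rules out the interleaving described in the previous paragraph.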

Tests

Master 1
object Endpoint "master1" {
}

object Endpoint "master2" {
	host = "IP-Address"
	log_duration = 30m
}

object Zone "master" {
	endpoints = [ "master1", "master2" ]
}

object Endpoint "satellite1" {
	log_duration = 30m
}

object Endpoint "satellite2" {
	log_duration = 30m
}

object Zone "satellite" {
	endpoints = [ "satellite1", "satellite2" ]
	parent = "master"
}

object Zone "global-templates" {
	global = true
}
Master 2
object Endpoint "master1" {
//	host = "IP-Address"
//	port = "5665"
	log_duration = 30m
}

object Endpoint "master2" { }

object Zone "master" {
	endpoints = [ "master1", "master2" ]
}

object Endpoint "satellite1" {
	log_duration = 30m
}

object Endpoint "satellite2" {
	log_duration = 30m
}

object Zone "satellite" {
	endpoints = [ "satellite1", "satellite2" ]
	parent = "master"
}

object Zone "global-templates" {
	global = true
}
Satellite
object Endpoint "master1" {
	host = "IP"
	port = "5665"
}

object Endpoint "master2" {
	host = "IP"
	port = "5665"
}

object Zone "master" {
	endpoints = [ "master1", "master2" ]
}

object Endpoint "satellite1" {
}

object Endpoint "satellite2" {
	log_duration = 30m
}

object Zone "satellite" {
	endpoints = [ "satellite1", "satellite2" ]
	parent = "master"
}

object Zone "global-templates" {
	global = true
}

Start both masters and create some objects via the API, wait until the two masters have synchronised their configuration and then start the satellite.

Tests for #10012

Start two Icinga endpoints that are connected to each other and must accept config updates from each other (accept_config = true).

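# create 100 host objects via the REST API on one of the two endpoints; after the sync, both endpoints should end up with all of them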
for j in {1..100} ; do
    curl -k -s -S -i -u root:icinga -H 'Accept: application/json' \
    -X PUT "https://localhost:5665/v1/objects/hosts/example-$j" \
    -d '{ "templates": [ "generic-host" ], "attrs": { "address": "127.0.0.1", "check_command": "hostalive", "vars.os" : "Linux" }, "pretty": true }'
done

The synchronisation attempts are no longer silently aborted.

fixes #9721

@cla-bot added the cla/signed label on Mar 1, 2024
@icinga-probot added the area/distributed (Distributed monitoring: master, satellites, clients), area/runtime (Downtimes, comments, dependencies, events) and bug labels on Mar 1, 2024
@yhabteab added this to the 2.15.0 milestone on Mar 1, 2024
@yhabteab added the "consider backporting" (Should be considered for inclusion in a bugfix release) label on Mar 1, 2024
@yhabteab requested review from oxzi, julianbrost and Al2Klimov and removed the request for oxzi on March 1, 2024 10:02
@julianbrost added the blocker (Blocks a release or needs immediate attention) label on Mar 5, 2024
@julianbrost (Contributor)

This looks like it significantly changes what was only added in #9980. If it stays that way, it might make sense to revert #9980 as a whole and make a complete new attempt. That would make the backport of this cleaner (as you wouldn't have to backport a commit and its reversion at once).

What's the reason for moving ObjectNameMutex to types.{hpp,cpp}?

ceab5a6 seems to add a second independent ObjectNameMutex. What's the reason for that, shouldn't this use the same one?

@yhabteab (Member, Author) commented Mar 6, 2024

> What's the reason for moving ObjectNameMutex to types.{hpp,cpp}?

Well, my first thought was that we might be using it even more than we thought in #9980, and that it might introduce a circular dependency. However, if you are strongly against it, then of course I can undo it.

> ceab5a6 seems to add a second independent ObjectNameMutex. What's the reason for that, shouldn't this use the same one?

Feels strange to use the same mutex for the cluster config sync and the API request. Otherwise, there's nothing against it, I guess.

@yhabteab (Member, Author) commented Mar 6, 2024

Plus, ConfigObjectUtility::CreateObject() can also be used without an API listener.

Plus, ApiListener::ConfigUpdateObjectAPIHandler() calls CreateObject() while holding a lock on that object and its type, so that's a good reason not to use the same mutex, I'd say.

@julianbrost (Contributor)

> Well, my first thought was that we might be using it even more than we thought in #9980, and that it might introduce a circular dependency. However, if you are strongly against it, then of course I can undo it.

I mainly asked the question because if it is moved, it's more code that was added in #9980 that wouldn't stay the way it was added there. So if it's moved, that would be more reason for me to revert #9980 completely and do a clean new PR. (I haven't looked at the PR in enough detail to assess whether it should be moved.)

> Feels strange to use the same mutex for the cluster config sync and the API request. Otherwise, there's nothing against it, I guess.

Why does it feel strange? Think of what this mutex is supposed to protect: the consistency of the configuration of a specific object. So anything that attempts to modify the object at runtime should use the same mutex.

Imagine someone would send a PUT /v1/objects/hosts/foo request to Master 1 and then the same HTTP request to Master 2 at about the same time as it receives the JSON-RPC request from Master 1 to create the object. Should both requests be processed at the same time on Master 2?

> Plus ApiListener::ConfigUpdateObjectAPIHandler() calls CreateObject() while holding a lock on that object and its type, so that's a good reason not to use the same mutex, I'd say.

Does the locking in ConfigUpdateObjectAPIHandler() actually do anything useful anymore, or would the locking performed in CreateObject() already suffice for ConfigUpdateObjectAPIHandler()?

@yhabteab (Member, Author) commented Mar 6, 2024

> Does the locking in ConfigUpdateObjectAPIHandler() actually do anything useful anymore,

IMHO yes, it does! It acquires the lock before it does ctype->GetObject(objName); otherwise two concurrent config updates could both conclude that the object doesn't exist yet and both try to create it.

> or would the locking performed in CreateObject() already suffice for ConfigUpdateObjectAPIHandler()?

That would be too late for the cluster config sync part.
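In other words, the existence check itself has to happen while the name lock is held. Below is a minimal sketch of the difference, reusing the CreateObject() call exactly as it is quoted from the config sync handler further down in this thread, together with the hypothetical ObjectNameLock guard from the sketch near the top of this page; it is not the literal handler code.

// Racy ordering: the existence check runs before any per-name lock is taken, so
// two concurrent config updates for the same object can both see "not found"
// and both attempt to create it.
{
	ConfigObject::Ptr object = ctype->GetObject(objName);

	if (!object) {
		ConfigObjectUtility::CreateObject(ptype, objName, config, errors, nullptr, origin);
	}
}

// Safe ordering: take the per-name lock first and hold it across both the check
// and the creation, so the second update sees the object created by the first.
{
	ObjectNameLock nameLock(ptype, objName); // hypothetical guard, see the sketch above

	ConfigObject::Ptr object = ctype->GetObject(objName);

	if (!object) {
		ConfigObjectUtility::CreateObject(ptype, objName, config, errors, nullptr, origin);
	}
}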

@yhabteab force-pushed the broken-runtime-config-sync branch 2 times, most recently from 2b40c6f to a259a68 on March 6, 2024 17:22
@julianbrost (Contributor)

> Feels strange to use the same mutex for the cluster config sync and the API request. Otherwise, there's nothing against it, I guess.

> Why does it feel strange? Think of what this mutex is supposed to protect: the consistency of the configuration of a specific object. So anything that attempts to modify the object at runtime should use the same mutex.
>
> Imagine someone would send a PUT /v1/objects/hosts/foo request to Master 1 and then the same HTTP request to Master 2 at about the same time as it receives the JSON-RPC request from Master 1 to create the object. Should both requests be processed at the same time on Master 2?

What about that part? Currently, the PR adds different locking based on whether an object is created via HTTP or JSON-RPC. Both would create the same object though, so both should use the same locking. Wouldn't you end up with race conditions between HTTP and JSON-RPC otherwise?

@yhabteab (Member, Author) commented Mar 7, 2024

> What about that part? Currently, the PR adds different locking based on whether an object is created via HTTP or JSON-RPC. Both would create the same object though, so both should use the same locking. Wouldn't you end up with race conditions between HTTP and JSON-RPC otherwise?

I don't think so! In case of a JSON-RPC connection, the object is locked twice! There is only one case in which a race condition may occur: when the API request enters CreateObject() while the cluster sync is in this part of the code. It's hard to trigger, I'd say!

ConfigObject::Ptr object = ctype->GetObject(objName);

String config = params->Get("config");

bool newObject = false;

if (!object && !config.IsEmpty()) {
	newObject = true;

	/* object does not exist, create it through the API */
	Array::Ptr errors = new Array();

	/*
	 * Create the config object through our internal API.
	 * IMPORTANT: Pass the origin to prevent cluster sync loops.
	 */
	if (!ConfigObjectUtility::CreateObject(ptype, objName, config, errors, nullptr, origin)) {

@yhabteab (Member, Author) commented Mar 11, 2024

I've tested all the use cases of the config sync and it is working perfectly now. However, we still need to address the use of ObjectNameMutex in the modifyobjecthandler.cpp file, particularly whether it is even necessary to lock the object name there.

@yhabteab removed the request for review from julianbrost on March 11, 2024 15:06

Inline review comments on the following lines of the diff:
static std::mutex m_Mutex;
static std::condition_variable m_CV;
static std::map<Type*, std::set<String>> m_LockedObjectNames;

Member:

Consider unordered_{map,set}.

Contributor:

The exact types won't matter that much here as the overall size of those containers will stay rather small anyways.

For completeness, as I already talked with @yhabteab about it: I was thinking about maybe making this a std::set<std::pair<Type*, String>> instead (or std::unordered_set for that matter, there won't be much of a difference), as that would reduce the map + set operations down to just the set operations. But this seems to bring no huge improvements for simplifying the code (which would have been my motivation here, not performance).
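For illustration, the single-container bookkeeping described here could look roughly like the sketch below. This is only an illustration of the alternative, not what the PR implements; the LockObjectName()/UnlockObjectName() helpers are made up for the example, and the Icinga Type/String headers are assumed.

#include <condition_variable>
#include <mutex>
#include <set>
#include <utility>

// One container keyed by (type, name) instead of a map of per-type name sets.
static std::mutex m_Mutex;
static std::condition_variable m_CV;
static std::set<std::pair<Type*, String>> m_LockedObjectNames;

void LockObjectName(const Type::Ptr& ptype, const String& objName)
{
	std::unique_lock<std::mutex> lock(m_Mutex);

	// Wait until the pair is free, then claim it with a single set operation.
	m_CV.wait(lock, [&] {
		return m_LockedObjectNames.find(std::make_pair(ptype.get(), objName)) == m_LockedObjectNames.end();
	});

	m_LockedObjectNames.emplace(ptype.get(), objName);
}

void UnlockObjectName(const Type::Ptr& ptype, const String& objName)
{
	{
		std::unique_lock<std::mutex> lock(m_Mutex);
		m_LockedObjectNames.erase(std::make_pair(ptype.get(), objName));
	}

	m_CV.notify_all();
}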

@julianbrost (Contributor) left a review comment

Looks fine in general to me. One commit message refers to the outdated class name ObjectNameMutex though. And if you're touching the PR anyways, you might do a slight improvement to one of the comments (see my inline comment).

Inline comment on lib/remote/configobjectutility.cpp (outdated, resolved)
@tbauriedel (Member)

ref/NC/804054

@Al2Klimov previously approved these changes on Apr 2, 2024
@julianbrost (Contributor)

I'm somewhat confused about how the PRs and issues belong together here.

> This PR kind of reverts 008fcd1

That part of the PR description is outdated, isn't it? That was already reverted in #10018.

If I understand correctly, #10012 describes the problem introduced by #9980 and was thus fixed with the revert in #10018. What remains is the original issue from #9721 where this PR is now the second attempt to fix it.

Therefore, I'd do the following:

Do you agree that this correctly represents the current state of these issues and PRs?

@yhabteab (Member, Author)

Done! I've also just rebased the PR!

@julianbrost (Contributor) left a review comment

Can you please provide instructions on how to test this? Probably similar to #9980, but even that PR description misses information on when to create objects via the API and when to restart what. Additionally, how can one test now that this doesn't have the same problem as #10012?

Inline comment on lib/remote/apilistener-configsync.cpp (resolved)
Inline comment on lib/remote/configobjectslock.cpp (outdated, resolved)
@K0nne (Contributor) commented Jul 23, 2024

Are there any updates on this? The configsync between our masters and the satellites below has died again.

@julianbrost (Contributor) left a review comment

I did a slightly different test for this PR, covering both the original issue (#9721) and the new temporarily introduced one (#10012):

1. Take a cluster with a HA master zone (master-1, master-2) and a satellite zone; it shouldn't matter if it's HA, in my case it was (zone satellite-a with nodes satellite-a-1 and satellite-a-2).
2. Stop all satellites and run the loop from the PR with a slight modification to put the objects in the satellite-a zone (i.e. add "zone": "satellite-a" to the JSON attributes).
3. After the objects were created, start the satellites again and observe how many of the newly created hosts were synced to which node.

v2.14.2

Sync to the online master works reliably; sync on reconnect to the satellites that were offline when the objects were created is unreliable (i.e. it demonstrates the original issue #9721).

Number of test hosts present in /var/lib/icinga2/api/packages/_api/ per node:

    100 master-1
    100 master-2
     72 satellite-a-1
     95 satellite-a-2

04ef105 (merge commit of #9980)

Sync on reconnect worked fine with that; however, it broke the online sync (i.e. it demonstrates the temporarily introduced issue #10012):

    100 master-1
      0 master-2
    100 satellite-a-1
    100 satellite-a-2

This PR

Now all nodes receive and store their config:

    100 master-1
    100 master-2
    100 satellite-a-1
    100 satellite-a-2


Member (on the same lines quoted above):

These raw pointers aren't even as potentially problematic as the ones in #9844 (comment), since our Type destructors (Type#~Type()) are never called.

@julianbrost merged commit 07d2530 into master on Aug 6, 2024
26 checks passed
@julianbrost deleted the broken-runtime-config-sync branch on August 6, 2024 09:57
Labels: area/distributed, area/runtime, blocker, bug, cla/signed, consider backporting, ref/NC

Successfully merging this pull request may close these issues.

API ConficSync - Packages - Out of Sync/Missing