Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process config::update/delete cluster events gracefully #9980

Merged
merged 3 commits into from
Feb 19, 2024

Conversation

yhabteab
Copy link
Member

@yhabteab yhabteab commented Jan 25, 2024

Warning

This PR was reverted in #10018 because it cause other problems with the config sync as described in #10012. A new fix will be developed in #10013.

Since the internal config::Update cluster events are using ConfigObjectUtility::CreateObject() as well to create received runtime objects, we shouldn't persist the config file first to the disk and then load and validate it with ConfigCompiler::CompileFile(). Otherwise, we are forced to remove the newly created file whenever we can't validate, commit or activate it. This approach would also have the downside that two cluster events for the same object arriving at the same moment from two different endpoints would result in two different threads simultaneously creating and loading the same configuration file - whereby only one of them surpasses the validation process, while the other one is facing an object re-definition error and tries to remove that configuration file it mistakenly thinks it has created. As a consequence, an object successfully created by the former is implicitly deleted by the latter thread, causing the objects to mysteriously disappear.

This PR fixes this by preserving the config in a temporary file for the entire validation process and is only moved to the actual config when the object was successfully created, otherwise that temp file will be discarded. Doing so, guarantees that two different threads simultaneously creating and loading the same configuration file do not interfere with each other. When one thread fails to pass object validation, it only deletes its temporary file and does not affect the other thread in any way. This PR additionally prevents two cluster events relating to the same object from being processed simultaneously.

Tests

Master 1
object Endpoint "master1" {
}

object Endpoint "master2" {
	host = "IP-Address"
	log_duration = 30m
}

object Zone "master" {
	endpoints = [ "master1", "master2" ]
}

object Endpoint "satellite1" {
	log_duration = 30m
}

object Endpoint "satellite2" {
	log_duration = 30m
}

object Zone "satellite" {
	endpoints = [ "satellite1", "satellite2" ]
	parent = "master"
}

object Zone "global-templates" {
	global = true
}
Master 2
object Endpoint "master1" {
//	host = "IP-Address"
//	port = "5665"
	log_duration = 30m
}

object Endpoint "master2" { }

object Zone "master" {
	endpoints = [ "master1", "master2" ]
}

object Endpoint "satellite1" {
	log_duration = 30m
}

object Endpoint "satellite2" {
	log_duration = 30m
}

object Zone "satellite" {
	endpoints = [ "satellite1", "satellite2" ]
	parent = "master"
}

object Zone "global-templates" {
	global = true
}
Satellite
object Endpoint "master1" {
	host = "IP"
	port = "5665"
}

object Endpoint "master2" {
	host = "IP"
	port = "5665"
}

object Zone "master" {
	endpoints = [ "master1", "master2" ]
}

object Endpoint "satellite1" {
}

object Endpoint "satellite2" {
	log_duration = 30m
}

object Zone "satellite" {
	endpoints = [ "satellite1", "satellite2" ]
	parent = "master"
}

object Zone "global-templates" {
	global = true
}

Results

Before

Master 1
root@debian-11:~# ls /var/lib/icinga2/api/packages/_api/9ff3d418-8691-4ca1-aba6-0fe2efec029a/conf.d/hosts/ | wc -l
1000
Master 2
root@config-sync centos]# ls /var/lib/icinga2/api/packages/_api/97d86dee-d245-468f-b7dc-bd6044376e2a/conf.d/hosts | wc -l
1000
Satellite
[root@config-sync-2 ~]# ls /var/lib/icinga2/api/packages/_api/fd7da15d-7167-4e25-8da0-21fc97381c1b/conf.d/hosts | wc -l
755

To make things even more mysterious, when you query one of the missing hosts via the API, you get the desired result, but it refers to a non-existent path. Nevertheless, the host disappears forever and cannot even be found via the API when the icinga2 daemon is restarted.

~/Workspace/icinga2 (master ✗) curl -k -s -S -i -u root:icinga 'https://10.27.0.198:5665/v1/objects/hosts/example-197?pretty=1'

{
    "results": [
        {
            "attrs": {
                "__name": "example-197",
            ...
           "source_location": {
                    "first_column": 0,
                    "first_line": 1,
                    "last_column": 24,
                    "last_line": 1,
                    "path": "/var/lib/icinga2/api/packages/_api/fd7da15d-7167-4e25-8da0-21fc97381c1b/conf.d/hosts/example-197.conf"
                },
....
[root@config-sync-2 ~]# cat /var/lib/icinga2/api/packages/_api/fd7da15d-7167-4e25-8da0-21fc97381c1b/conf.d/hosts/example-197.conf
cat: /var/lib/icinga2/api/packages/_api/fd7da15d-7167-4e25-8da0-21fc97381c1b/conf.d/hosts/example-197.conf: Datei oder Verzeichnis nicht gefunden

After

Satellite:

root@config-sync-2 ~]# ls /var/lib/icinga2/api/packages/_api/760f4e88-1688-46fd-a017-48905eb7fbb4/conf.d/hosts | wc -l
1000

fixes #9721

@cla-bot cla-bot bot added the cla/signed label Jan 25, 2024
@icinga-probot icinga-probot bot added area/configuration DSL, parser, compiler, error handling area/distributed Distributed monitoring (master, satellites, clients) bug Something isn't working ref/NC labels Jan 25, 2024
@yhabteab yhabteab added area/runtime Downtimes, comments, dependencies, events consider backporting Should be considered for inclusion in a bugfix release and removed area/configuration DSL, parser, compiler, error handling labels Jan 26, 2024
@yhabteab yhabteab added this to the 2.15.0 milestone Jan 26, 2024
@yhabteab yhabteab marked this pull request as ready for review January 26, 2024 09:37
Al2Klimov
Al2Klimov previously approved these changes Jan 26, 2024
Copy link
Member

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could explain

ref/IP/48850

lib/remote/configobjectutility.cpp Show resolved Hide resolved
@yhabteab
Copy link
Member Author

There's one more point to clarify! Before this PR, faulty objects with the ignore_on_error flag set were written to disk, now they are not. Do you think this breaks anything? I mean, if the configuration is broken, why should we even bother saving it to the disk and deleting it when reloading the daemon?

/* Remove ignored Downtime/Comment objects. */
try {
String configDir = ConfigObjectUtility::GetConfigDir();
ConfigItem::RemoveIgnoredItems(configDir);

Copy link
Member

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

@julianbrost
Copy link
Contributor

There's one more point to clarify! Before this PR, faulty objects with the ignore_on_error flag set were written to disk, now they are not. Do you think this breaks anything? I mean, if the configuration is broken, why should we even bother saving it to the disk and deleting it when reloading the daemon?

You can probably construct cases where the object might survive if it references a object that was temporarily removed before the object was added and is added back before the next reload. But given that ignore_on_error API-created objects are somewhat volatile anyways due to possibly being deleted on startup, that should probably be fine.

There's something else where the behavior probably changed: instantiating the object in memory isn't the only operation that can fail, creating the file might fail as well. Please explain what happens in this case. If you want to try it, you probably don't even have to consider the cluster sync, simply observing what happens on the initial node should be enough. Such an error could be induced by removing write permissions for example.

Since the satellite receives two different events from both masters, it is not possible to avoid such situations, but the object will not be accidentally deleted from disk now.

Not possible? I doubt that, it would need a different approach though.

@Al2Klimov
Copy link
Member

There's something else where the behavior probably changed: instantiating the object in memory isn't the only operation that can fail, creating the file might fail as well.

Then what about creating the AtomicFile where the whole file has been created before, but committing where it's created now?

@yhabteab
Copy link
Member Author

Tests with AtomicFile:

---
[2024-01-26 13:34:27 +0000] critical/ApiListener: Error: Import references unknown template: 'generic-service'
[2024-01-26 13:34:27 +0000] critical/ApiListener: Could not create object 'example-999!service-999':
[2024-01-26 13:34:27 +0000] critical/ApiListener: Error: Import references unknown template: 'generic-service'
[2024-01-26 13:34:27 +0000] critical/ApiListener: Could not create object 'example-1000!service-1000':
[2024-01-26 13:34:27 +0000] critical/ApiListener: Error: Import references unknown template: 'generic-service'
---

File sync triggered an Icinga 2 daemon reload:

---
[2024-01-26 13:34:29 +0000] information/ApiListener: 'api' stopped.
[2024-01-26 13:34:29 +0000] information/ApiListener: 'api' started.
[2024-01-26 13:34:29 +0000] information/ApiListener: Started new listener on '[10.27.0.198]:5665'
---
[2024-01-26 13:34:29 +0000] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones-stage/global-templates' (2168 Bytes).
[2024-01-26 13:34:29 +0000] information/ApiListener: Received configuration updates (1) from endpoint 'master2' are equal to production, skipping validation and reload.
[2024-01-26 13:34:31 +0000] critical/ApiListener: Could not create object 'example-602':
[2024-01-26 13:34:31 +0000] critical/ApiListener: Error: Object 'example-602' of type 'Host' re-defined: in /var/lib/icinga2/api/packages/_api/97469d41-cc06-4486-b8c8-788c6ecb3aab/conf.d/hosts/example-602.conf: 1:0-1:24; previous definition: in /var/lib/icinga2/api/packages/_api/97469d41-cc06-4486-b8c8-788c6ecb3aab/conf.d/hosts/example-602.conf: 1:0-1:24
[2024-01-26 13:34:31 +0000] critical/ApiListener: Could not create object 'example-669':
[2024-01-26 13:34:31 +0000] critical/ApiListener: Error: Object 'example-669' of type 'Host' re-defined: in /var/lib/icinga2/api/packages/_api/97469d41-cc06-4486-b8c8-788c6ecb3aab/conf.d/hosts/example-669.conf: 1:0-1:24; previous definition: in /var/lib/icinga2/api/packages/_api/97469d41-cc06-4486-b8c8-788c6ecb3aab/conf.d/hosts/example-669.conf: 1:0-1:24
---
[root@config-sync-2 ~]# ls /var/lib/icinga2/api/packages/_api/97469d41-cc06-4486-b8c8-788c6ecb3aab/conf.d/hosts | wc -l
1000
[root@config-sync-2 ~]# ls /var/lib/icinga2/api/packages/_api/97469d41-cc06-4486-b8c8-788c6ecb3aab/conf.d/services | wc -l
1000

@yhabteab
Copy link
Member Author

Since the satellite receives two different events from both masters, it is not possible to avoid such situations, but the object will not be accidentally deleted from disk now.

Not possible? I doubt that, it would need a different approach though.

Without reworking the entire message processing? Note that this not only affects the config::Update cluster event, but all cluster events are processed twice. E.g. for the delete event you get the error message "Could not delete non-existent object ...", but it is only logged with notice level.

However, I can at least prevent the configuration updates from being processed simultaneously, if you wish!

@yhabteab
Copy link
Member Author

There's something else where the behavior probably changed: instantiating the object in memory isn't the only operation that can fail, creating the file might fail as well. Please explain what happens in this case. If you want to try it, you probably don't even have to consider the cluster sync, simply observing what happens on the initial node should be enough. Such an error could be induced by removing write permissions for example.

Same inconsistency! Icinga 2 fails with:

[2024-01-26 14:47:24 +0000] critical/ApiListener: Error: Function call 'mkstemp' for file '/var/lib/icinga2/api/packages/_api/166b6876-61cc-4feb-b10a-0efbe051dc2c/conf.d/hosts/example-2001.conf.tmp.oGHmxh' failed with error code 13, 'Permission denied'
[2024-01-26 14:47:24 +0000] critical/ApiListener: Error: Function call 'mkstemp' for file '/var/lib/icinga2/api/packages/_api/166b6876-61cc-4feb-b10a-0efbe051dc2c/conf.d/hosts/example-2002.conf.tmp.SK6dgw' failed with error code 13, 'Permission denied'

However, the objects can still be found via the API!

~/Workspace/icinga2 (Api-connect_timeout-DNS ✗) curl -k -s -S -i -u root:icinga 'https://10.27.0.198:5665/v1/objects/hosts/example-2001?pretty=1'

---
{
    "results": [
        {
            "attrs": {
                "__name": "example-2001",
                "source_location": {
                    "first_column": 0,
                    "first_line": 1,
                    "last_column": 25,
                    "last_line": 1,
                    "path": "/var/lib/icinga2/api/packages/_api/166b6876-61cc-4feb-b10a-0efbe051dc2c/conf.d/hosts/example-2001.conf"
                },
---

Copy link
Member

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's something else where the behavior probably changed: instantiating the object in memory isn't the only operation that can fail, creating the file might fail as well.

Then what about creating the AtomicFile where the whole file has been created before, but committing where it's created now?

Additionally, I'd also write everything ASAP, just as before. Only the commit has to be where it is right now in this PR.

@yhabteab
Copy link
Member Author

Additionally, I'd also write everything ASAP, just as before.

Which benefit do you get from doing so? It would just waste a write operation if the file isn't going to be committed.

@Al2Klimov
Copy link
Member

You'd ensure you have the permission (Hello SELinux!) and enough disk space to write the file. Also, while on it, call boost::iostreams::stream#flush() immediately after <<, just to be sure.

@julianbrost
Copy link
Contributor

@yhabteab Did you also try what happend when trying that without this PR? That should have returned an error without leaving behind the object in memory.

Also, as your logs show the error twice, I presume you also tested this in the config sync scenario. However, the same difference in behavior should show on the node handling an HTTP request for creating an object. If it couldn't be written to disk before, it wasn't created at all. With this PR, it's created in memory but not persisted on disk (and even though the client probably receives an error in this case, a subsequent GET request would return the object).

@yhabteab
Copy link
Member Author

@yhabteab Did you also try what happend when trying that without this PR? That should have returned an error without leaving behind the object in memory.

r2.14.2-1):

[2024-01-26 15:28:08 +0000] warning/HttpServerConnection: Error while processing HTTP request: Function call 'std::ifstream::open' for file '/var/lib/icinga2/api/packages/_api/166b6876-61cc-4feb-b10a-0efbe051dc2c/conf.d/hosts/example-2001.conf' failed with error code 2, 'No such file or directory'

---
{
    "error": 404,
    "status": "The requested path 'v1/objects/hosts/example-2002' could not be found or the request method is not valid for this path."
}

This PR: Not, so good!

{
    "results": [
        {
            "code": 500,
            "errors": [
                "Error: Function call 'mkstemp' for file '/var/lib/icinga2/api/packages/_api/166b6876-61cc-4feb-b10a-0efbe051dc2c/conf.d/hosts/example-2005.conf.tmp.zdkA6q' failed with error code 13, 'Permission denied'\n"
            ],
            "status": "Object could not be created."
        }
    ]
}

@yhabteab yhabteab requested review from julianbrost and Al2Klimov and removed request for julianbrost and Al2Klimov February 13, 2024 13:59
lib/remote/apilistener-configsync.cpp Outdated Show resolved Hide resolved
lib/remote/apilistener-configsync.cpp Outdated Show resolved Hide resolved
lib/remote/apilistener.hpp Outdated Show resolved Hide resolved
lib/remote/apilistener.hpp Outdated Show resolved Hide resolved
lib/remote/apilistener.hpp Outdated Show resolved Hide resolved
lib/remote/apilistener-configsync.cpp Outdated Show resolved Hide resolved
lib/remote/apilistener-configsync.cpp Outdated Show resolved Hide resolved
@yhabteab yhabteab force-pushed the config-sync-conflicts branch 2 times, most recently from f5f993d to 81d9909 Compare February 14, 2024 08:09
lib/remote/apilistener-configsync.cpp Outdated Show resolved Hide resolved
lib/remote/apilistener.hpp Outdated Show resolved Hide resolved
lib/remote/apilistener.hpp Show resolved Hide resolved
lib/remote/apilistener-configsync.cpp Outdated Show resolved Hide resolved
lib/remote/apilistener-configsync.cpp Outdated Show resolved Hide resolved
lib/remote/apilistener-configsync.cpp Outdated Show resolved Hide resolved
lib/remote/apilistener-configsync.cpp Show resolved Hide resolved
Copy link
Member

@Al2Klimov Al2Klimov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn, forget to send my review.

Defer unlockAndNotify([&listener, &ptype, &objName]{
listener->m_ObjectConfigChangeLock.UnLock(ptype, objName);
});

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just screams for a RAII style class like CpuBoundWork.

Defer unlockAndNotify([&listener, &ptype, &objName]{
listener->m_ObjectConfigChangeLock.UnLock(ptype, objName);
});

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Especially as there's >1 usage...

std::condition_variable m_CV;
std::unordered_map<Type*, std::set<String>> m_LockedObjectNames;
};

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... and even a dedicated class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have absolutely no idea what you're talking about! This class literally locks and unlocks Mutex and you want another Mutex on top of that?

Copy link
Member

@Al2Klimov Al2Klimov Feb 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your class is like std::mutex. And I'd like something like std::unique_lock on top. Btw. depending on how you name your methods, you could actually use std::unique_lock itself! 💡

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just screams for a RAII style class like CpuBoundWork.

I mean that would be nice to have.

Especially as there's >1 usage...

Well, the number is 2, so IMHO that's also fine with Defer.

Btw. depending on how you name your methods, you could actually use std::unique_lock itself!

I also thought about that before. Names aren't the only issue here. You can't pass extra parameters and std::unique_lock takes a reference to the underlying mutex, whereas with this class, if a name is not locked, there's no memory behind it so nothing that could be referenced.

Copy link
Contributor

@julianbrost julianbrost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code is fine for me now, please update the PR description though to reflect the changes to the implementation.

@yhabteab yhabteab changed the title Don't write runtime created objects to disk before validating them Enhance config update cluster events handling Feb 19, 2024
@yhabteab yhabteab changed the title Enhance config update cluster events handling Process config::update/delete cluster events gracefully Feb 19, 2024
@julianbrost julianbrost merged commit 04ef105 into master Feb 19, 2024
25 checks passed
@julianbrost julianbrost deleted the config-sync-conflicts branch February 19, 2024 12:49
@julianbrost julianbrost removed the consider backporting Should be considered for inclusion in a bugfix release label Mar 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/distributed Distributed monitoring (master, satellites, clients) area/runtime Downtimes, comments, dependencies, events bug Something isn't working cla/signed ref/IP ref/NC
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API ConficSync - Packages - Out of Sync/Missing
3 participants