API ConfigSync - Packages - Out of Sync/Missing #9721
What exactly does that mean? Are you setting up a new satellite, and this happens the first time that satellite receives any configuration? In general, logs from the time the host should have been synced would be interesting. When that was probably depends on the previous question; it could be the previous connection to parent/same-zone nodes or the time this host was created via the API.
Is there a particular reason to believe that it's related to the names? Like, are there also objects with more regular names and this only happens for the ones with these exotic names? I don't want to fully rule it out now, but if it worked to create the file for the service, that should also have worked for the host, as the service file name includes the host name and is even longer.
Because of missing host objects in the /var/lib/icinga2/api/packages folder we stopped Icinga on the satellites, deleted the folder /var/lib/icinga2/api as well as the state file, and started the Icinga process again. In the first minute after the initial rebuild of the api folder the config check on the satellites was OK. A few minutes later the number of hosts in the packages folder was still growing, but after it stopped it never reached the number of hosts in the master zone's api packages folder that should be synced to the specific zone. There were always fewer objects, and after another config check (~5 min later) the satellites' config was broken again because of missing host objects.
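For reference, a minimal shell sketch of the reset procedure described above, assuming systemd and the default state file location /var/lib/icinga2/icinga2.state; the `watch` line is only a convenience to follow the rebuild of the packages folder:

```sh
# Stop Icinga 2, move the api folder aside (safer than deleting it outright),
# remove the state file and start the daemon again.
systemctl stop icinga2
mv /var/lib/icinga2/api /var/lib/icinga2/api.bak
rm -f /var/lib/icinga2/icinga2.state
systemctl start icinga2

# Follow how many host config files arrive in the _api package on this node.
watch -n 30 "find /var/lib/icinga2/api/packages/_api -path '*/conf.d/hosts/*.conf' | wc -l"
```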
Exactly. In every occurrence of the problem, only objects of this specific naming scheme were mentioned by the config check. Side note: we will send you the logs.
I have uploaded the logfile.
In the logs, there are (as you pointed out already) errors saying that the affected objects were not created because they import templates that don't exist yet; these templates are only synced at a later point. I suspect this might have been introduced by #7936 (2.12.0, backported to 2.11.5 in #8093): that PR changed file-based config updates to be handled in the background, which can result in object-based updates being applied in a different order starting from these versions.
Was the version you upgraded from older than 2.11.5? That would be consistent with my theory then.
I think the names are just a coincidence here. Maybe these are the only objects using templates.
Looks like that problem already showed up while testing in #7742 (comment), but was attributed to a version mismatch by mistake (#7742 (comment) (1.)).
We have upgraded from 2.11.11 to 2.12.9.
How confident are you that this bug is new between these versions, i.e. did you do the same on 2.11.11 and it worked? Note that to trigger what I think is the bug here, an API-created object must reference something that comes from the file-based sync (/etc/icinga2/zones.d, the Director, or the config packages API), and that file must be synced on the same connection and must not exist on the target node before (otherwise the reference would work, potentially referring to an outdated version of that file).
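To make that trigger condition concrete, here is a hypothetical example of such an API-created object: a host created via the REST API that imports a template which is only delivered through the file-based sync (e.g. defined under /etc/icinga2/zones.d/ or in global-templates). Host name, endpoint, credentials and template name are placeholders, not taken from this setup:

```sh
# PUT a new host via the /v1/objects API; "generic-host" stands in for a template
# that only exists in the file-synced configuration on the target node.
curl -k -s -S -u root:icinga \
  -H 'Accept: application/json' \
  -X PUT 'https://master1.example.org:5665/v1/objects/hosts/placeholder-host-001' \
  -d '{ "templates": [ "generic-host" ], "attrs": { "address": "192.0.2.10", "check_command": "hostalive" }, "pretty": true }'
```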
I have been running this scenario since the beginning of November 2022 and it worked flawlessly (in terms of a valid config being built every hour). The problem emerged after our upgrade to 2.12.9, at the time when we found the outdated config on our satellites. To fix this we deleted the /var/lib/icinga2/api folder. After this the problem occurred. It might be possible that this delete operation (the first one since November 2022, afaik) triggered that bug for the first time and it is just a coincidence.
Looks like what's going on is a bit more complicated than what I originally imagined. If just some objects were missing because they reference templates that come from not yet synced and loaded files, that should fix itself with the next reload (which should also be triggered after the files were synced), as the objects are sent again. However, due to #7936 moving the file sync to the background, it may actually see intermediate files from the API/object-based config sync that cause it to fail, and the aforementioned reload never happens. I'd like to confirm that theory with your logs, but the debug.log and startup.log are from different times. Can you please also upload logs from around 2023-03-09 12:28 corresponding to the startup.log you already uploaded? It doesn't matter if these aren't debug logs; the normal logs are fine too.
Hi @julianbrost, I have uploaded the requested logfile to the NETWAYS Nextcloud.
That log doesn't seem to confirm my theory, unfortunately. I'd only have expected config validation errors ("Config validation failed for staged cluster config sync") close to object sync errors ("Could not create object"), but the latter only seem to happen around 10:01, while there are multiple config validation errors starting at 10:28. But we're currently trying to replicate the setup, so maybe we will just see the same behavior there.
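A simple way to line up the two messages chronologically, assuming the default log locations under /var/log/icinga2/ (adjust the paths if your logs live elsewhere):

```sh
# Pull both message types out of the main and debug log and sort them by the
# leading timestamp; the gap between 10:01 and 10:28 should then be easy to see.
grep -hE 'Could not create object|Config validation failed for staged cluster config sync' \
  /var/log/icinga2/icinga2.log /var/log/icinga2/debug.log | sort
```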
In the master-zone the config is always valid. |
Question about the hosts that show up in error messages like that one (also the ones from the startup.log you uploaded to Nextcloud, which had different names): were these objects created using the /v1/objects API?
Those objects were created with the /v1/objects API. The objects exist at the time of the error in the master zone and the config there is valid. Those hosts are placeholders for the underlying services, which are more volatile. If there's a referencing service, its host should always exist. Those hosts are not modified in any way. They are automatically deleted if their last service is deleted. We temporarily mitigated the problem by copying the missing hosts from the master zone to the satellites. After this the satellites' config is valid. If there's the need, you can have a look at the system.
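For illustration, a hypothetical sketch of the lifecycle described here: a volatile service is deleted via the REST API, and once the placeholder host has no services left, the external tooling removes the host as well. Object names, endpoint and credentials are placeholders:

```sh
# Delete the last remaining service of the placeholder host ...
curl -k -s -S -u root:icinga \
  -H 'Accept: application/json' \
  -X DELETE 'https://master1.example.org:5665/v1/objects/services/placeholder-host-001!volatile-service-001?cascade=1'

# ... and then the now service-less placeholder host itself.
curl -k -s -S -u root:icinga \
  -H 'Accept: application/json' \
  -X DELETE 'https://master1.example.org:5665/v1/objects/hosts/placeholder-host-001?cascade=1'
```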
This week we plan to trigger the bug again by deleting the API folder in another zone. In that zone we also have surviving API objects that were already deleted in the master zone and that we need to purge.
Can you please create backups of …
Please do so on both masters and satellites, even if you're just performing some operation on one of them. |
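Since the request above was cut off in this export, the exact paths are an assumption: the thread revolves around /var/lib/icinga2/api, so a sketch of a backup of that directory plus the state file, run on every master and satellite before changing anything, could look like this:

```sh
# Archive the api directory and the state file with hostname and timestamp in the
# file name, so backups from different nodes can be kept apart.
tar -czf "/root/icinga2-api-$(hostname -s)-$(date +%F-%H%M).tar.gz" \
  /var/lib/icinga2/api /var/lib/icinga2/icinga2.state
```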
I have uploaded the results.
Hello,
One of our satellites suddenly had an invalid config with missing components and was unable to sync into a valid stage again afterwards. This was satellite2 of a zone; satellite1 had a valid config. On satellite2 we stopped Icinga, removed /var/lib/icinga2/api/ plus the state file and started Icinga again. For 1-2 minutes its config was valid, before the sync entered an invalid state again. At this point we made a backup of the api dir of each satellite of the zone. We removed /var/lib/icinga2/api/ plus the state file of both satellites and restarted Icinga on both machines. After this, both satellites showed the same broken sync behaviour and satellite1 was now missing the same objects as satellite2. In this case API-created services were missing the templates they reference from global-templates. We restored the valid config backup of satellite1, its config check was OK and we got Icinga running again. On satellite2 we found that /var/lib/icinga2/api/zones/global-templates/ was empty. We then copied the directory /var/lib/icinga2/api/zones/global-templates from satellite1 to satellite2. The config check on satellite2 now showed missing API-created hosts. We then copied all hosts from the api dir /var/lib/icinga2/api/packages/_api//conf.d/hosts/ of satellite1 to satellite2. Since then the config is valid again.
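A rough sketch of that manual repair, run from satellite1 with Icinga 2 stopped on satellite2; `satellite2` and `<stage>` are placeholders (`<stage>` stands for the stage directory inside the _api package that was elided in the path above), and the runtime user/group in the last line is an assumption to be adjusted to the installation:

```sh
# Copy the global-templates zone files and the API-created host files from the
# healthy satellite1 to satellite2, then fix ownership and start Icinga 2 again.
rsync -a /var/lib/icinga2/api/zones/global-templates/ \
  satellite2:/var/lib/icinga2/api/zones/global-templates/
rsync -a "/var/lib/icinga2/api/packages/_api/<stage>/conf.d/hosts/" \
  "satellite2:/var/lib/icinga2/api/packages/_api/<stage>/conf.d/hosts/"
ssh satellite2 'chown -R icinga:icinga /var/lib/icinga2/api && systemctl start icinga2'
```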
Hi @K0nne, sorry for the delay! We were able to reproduce your issue of some objects created via the API magically disappearing on the satellite endpoints, and we're working on it! Thank you for your exhaustive contributions!
We are happy to hear this! This issue has been haunting us every now and then for years.
Describe the bug
Due to heavy usage of the API for host/service creation with relatively long names (<128 characters), there seems to be an inconsistency in the API sync behaviour between the master and the satellites.
While everything seems to be fine on the HA masters, during a fresh config sync/deployment the satellites seem to receive an inconsistent config, with parts missing on one side and other parts missing on the other side.
This leads to neither satellite accepting the newly provided config, because each is missing the parts of the config that its partner satellite received, and vice versa.
This shows up in the Icinga check signaling the problem.
The startup.log shows which hosts/services are missing, and this varies because each satellite lacks the other half of the config.
Also, the synced zone content on both satellite partners is never the same, although it should be.
The startup.log shows the missing files, which have a (maybe problematic?) naming scheme.
Problematic host/service names consist of the following schema:
URL Encoded:
Also, the issue only seems to be present since the update from version 2.11.x to 2.12.x.
Expected behavior
Config sync should work and sync properly between the config master and the satellite partners, without losing the (half of the) config intended for the partner node/satellite, and the packages folder should also be synced properly.
Your Environment
Include as many relevant details about the environment you experienced the problem in:
- Version used (`icinga2 --version`): icinga2 - The Icinga 2 network monitoring daemon (version: r2.12.9-1)
- OS name: Red Hat Enterprise Linux Server
- OS version: 7.9 (Maipo)
- Config validation (`icinga2 daemon -C`):

Additional context
ref/NC/777526