Fixes #16773: Batch of new nodes can overflow rudder server with inventories #3367

fanf · 2020-11-15T13:32:51Z

https://issues.rudder.io/issues/16773

Add a saveInventoryBuffer that stays between inotify watcher (in charge of the complexe logic of making sense of inotify events and waiting for signature) and the part that [parse inventory, check signature, forward to backend] and is heavy on CPU and RAM.
That buffer can be big (chose 1024 without much rationnal safe that it's big and a power of 2, as asked in ZQueue comment).

The main benefits are:

info about the file to process are cheap (a string with name, 3 functions to open stream or test if the file exists)
we can deduplicate files while they are in the buffer super easily (and so avoid starvation case where some node goes made and send a lot of inventory).

Only the code path for inventories coming from file system has the buffer, because for REST, the API await the pre-processing status. So we need to wait for signature check etc, and so that part must be synchronous. A better API would be to just move uploaded inventory/signature to /var/rudder/inventories/incoming (or a dedicated incoming directory) and from that point, let them follow the usual path - but it's an API change which will need a major version for change.

Other important code changes:

in InventoryFileWatcher.scala, sendToProcessor was renamed to sendToProcessorBlocking and moved out of processFile. processFile now send files to process to the buffer.
in InventoryProcessor.scala, the backend waiting queue, queue was renamed to blockingQueue and as the name implies, it is now a BLOCKING queue. Its default size is reduced in configurations.properties.sample because now, the only interest of a big queue would be for people using A LOT rest endpoint,
a latch which dismiss new incoming requests when one is pending is added in front of blockingQueue in the REST API code path to keep previous, non-blocking behavior.

fanf · 2020-11-15T13:45:50Z

PR updated with a new commit

fanf · 2020-11-15T13:51:24Z

PR updated with a new commit

fanf · 2020-11-15T14:03:24Z

PR updated with a new commit

ncharles · 2020-11-15T21:18:44Z

.../rudder/rudder-core/src/main/scala/com/normation/rudder/inventory/InventoryFileWatcher.scala

+      inv <- saveInventoryBuffer.take
+      // deduplicate. TakeAll is not blocking
+      all <- saveInventoryBuffer.takeAll
+      _   <- saveInventoryBuffer.offerAll(all.filterNot { case (f, _) => inv._1.name == f.name })


isn't there a risk that we loose inventory ?
Doc says:

For Sliding Queue: uses Sliding Strategy

If there is room in the queue, it places the values otherwise it removes the old elements and

enqueues the new ones. Always returns true.

and it contradict the comment up there: if we reoffer old inventories, they become new

ncharles · 2020-11-15T21:21:04Z

.../rudder/rudder-core/src/main/scala/com/normation/rudder/inventory/InventoryFileWatcher.scala

+   * inventory.
+   * As only a small structure is saved (SaveInventory is a name and two function to create stream from file), buffer can be big.
+   * When queue is full, we prefer to drop old inventories to new ones. In the worst case, they will be catched up by the
+   * scheduler.


it can still be surprising. Why not using a bounded queue?

I prefer to have a non blocking, RAM-bounbed structure. If there is a bug somewhere in the code (which is far from impossible for concurrent and async processing which involves inotify and moving files around. A blocking structure can lead to thread accumulation or even deadlock, while here, in the worst case we loose an event (and so an inventory processing). But it's easy to set up workaround for that, which is: SchedulerMissedNotify.

ncharles · 2020-11-15T21:32:24Z

.../rudder/rudder-core/src/main/scala/com/normation/rudder/inventory/InventoryFileWatcher.scala

+    def safeMove[T](chunk: =>T): IOResult[Unit] = {
+      Task.effect{chunk ; ()}.catchAll {
+        case ex: NoSuchFileException => // ignore
+          InventoryProcessingLogger.debug(s"Ignored exception '${ex.getClass.getSimpleName} ${ex.getMessage}'. The was file was correctly handled.")


The error message is not very explicit
"The file '${inventory.pathAsString}' was correctly handled" maybe?

fanf · 2020-11-16T08:03:05Z

PR updated with a new commit

Normation-Quality-Assistant · 2020-11-16T08:34:34Z

This PR is not mergeable to upper versions.
Since it is "Ready for merge" you must merge it by yourself using the following command:
rudder-dev merge https://github.com/Normation/rudder/pull/3367
-- Your faithful QA
Kant merge: "In law a man is guilty when he violates the rights of others. In ethics he is guilty if he only thinks of doing so."
(https://ci.normation.com/jenkins/job/merge-accepted-pr/32523/console)

fanf · 2020-11-16T17:09:39Z

OK, squash merging this PR

…ntories

fanf requested a review from ncharles November 15, 2020 13:32

fanf added the Trigger test label Nov 15, 2020

fanf added Trigger test and removed Trigger test labels Nov 15, 2020

ncharles reviewed Nov 15, 2020

View reviewed changes

fanf added Trigger test and removed Trigger test labels Nov 16, 2020

ncharles approved these changes Nov 16, 2020

View reviewed changes

Normation-Quality-Assistant added the qa: Can't merge label Nov 16, 2020

Fixes #16773: Batch of new nodes can overflow rudder server with inve…

282148b

…ntories

fanf force-pushed the bug_16773/batch_of_new_nodes_can_overflow_rudder_server_with_inventories branch from 6eb9284 to 282148b Compare November 16, 2020 17:09

fanf merged commit 282148b into Normation:branches/rudder/6.1 Nov 16, 2020

fanf deleted the bug_16773/batch_of_new_nodes_can_overflow_rudder_server_with_inventories branch March 15, 2024 10:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes #16773: Batch of new nodes can overflow rudder server with inventories #3367

Fixes #16773: Batch of new nodes can overflow rudder server with inventories #3367

fanf commented Nov 15, 2020 •

edited

fanf commented Nov 15, 2020

fanf commented Nov 15, 2020

fanf commented Nov 15, 2020

ncharles Nov 15, 2020

ncharles Nov 15, 2020

ncharles Nov 15, 2020

fanf Nov 16, 2020

ncharles Nov 15, 2020

fanf Nov 16, 2020

fanf commented Nov 16, 2020

Normation-Quality-Assistant commented Nov 16, 2020

fanf commented Nov 16, 2020

Fixes #16773: Batch of new nodes can overflow rudder server with inventories #3367

Fixes #16773: Batch of new nodes can overflow rudder server with inventories #3367

Conversation

fanf commented Nov 15, 2020 • edited

fanf commented Nov 15, 2020

fanf commented Nov 15, 2020

fanf commented Nov 15, 2020

ncharles Nov 15, 2020

Choose a reason for hiding this comment

ncharles Nov 15, 2020

Choose a reason for hiding this comment

ncharles Nov 15, 2020

Choose a reason for hiding this comment

fanf Nov 16, 2020

Choose a reason for hiding this comment

ncharles Nov 15, 2020

Choose a reason for hiding this comment

fanf Nov 16, 2020

Choose a reason for hiding this comment

fanf commented Nov 16, 2020

Normation-Quality-Assistant commented Nov 16, 2020

fanf commented Nov 16, 2020

fanf commented Nov 15, 2020 •

edited