Fixes #23727: Group all node related access into one NodeFactRepository #5167
+31,928
−99,096
https://issues.rudder.io/issues/23727
overview
A lot of changes are going on, so let's start with the big picture.
So we go from that:
To that:
NodeFact and CoreNodeFact, mapping to old node classes
NodeFact and CoreNodeFact
The core concept is `NodeFact`, the whole representation of a node with all the inventory items, including software, and all Rudder settings (audit mode, state, run, etc.), properties, and agent config.
We use the fully serializable and parsable case class introduced in #4869.
We add a subset of that structure, `CoreNodeFact`, as an equivalent of `NodeInfo`, but cleaner. And to link both, they implement the trait `MinimalNodeFactInterface`.
`CoreNodeFact` serves the same role as `NodeInfo`:
You can go between the two node fact structures with translation methods like `NodeFact.toCore` or `NodeFact.fromMinimal`.
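The relationship between the two structures can be sketched as follows. This is only an illustration: the fields (`id`, `hostname`, `policyMode`, `software`, `processes`) are hypothetical and heavily reduced, and only the names `NodeFact`, `CoreNodeFact`, `MinimalNodeFactInterface`, `toCore` and `fromMinimal` come from the description above.

```scala
object NodeFactSketch {
  // Hypothetical, heavily reduced fields: the real structures carry far more data.
  trait MinimalNodeFactInterface {
    def id: String
    def hostname: String
    def policyMode: Option[String]
  }

  // The light structure, playing the role NodeInfo used to play.
  final case class CoreNodeFact(
      id:         String,
      hostname:   String,
      policyMode: Option[String]
  ) extends MinimalNodeFactInterface

  object CoreNodeFact {
    // Build a core fact from anything exposing the minimal interface.
    def fromMinimal(m: MinimalNodeFactInterface): CoreNodeFact =
      CoreNodeFact(m.id, m.hostname, m.policyMode)
  }

  // The full structure: core attributes plus heavy inventory parts (flat, no "core" sub-object).
  final case class NodeFact(
      id:         String,
      hostname:   String,
      policyMode: Option[String],
      software:   List[String],
      processes:  List[String]
  ) extends MinimalNodeFactInterface {
    def toCore: CoreNodeFact = CoreNodeFact.fromMinimal(this)
  }

  object NodeFact {
    // Going up from a minimal view: heavy attributes start empty.
    def fromMinimal(m: MinimalNodeFactInterface): NodeFact =
      NodeFact(m.id, m.hostname, m.policyMode, Nil, Nil)
  }
}
```

Because both case classes implement the same trait, code that only needs core attributes can accept a `MinimalNodeFactInterface` and work with either structure.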
Why not CoreNodeFact as a field in NodeFact?
We considered having `CoreNodeFact` as a subpart of `NodeFact`.
It would have the following benefits:
`core` in `NodeFact` to get/set
Benefits for that:
`slowGet` (see below), they will automatically switch to the cache version
Given the lesser external impact on change, I preferred to avoid the "core as subobject" path.
SelectFacts, SelectFactConfiguration
To allow fine-grained selection of which attributes need to be retrieved in a node fact, and to avoid loading all software, processes and ports when only the BIOS is needed, we introduce the concept of `SelectFacts`.
This configuration object is given to the methods retrieving or merging node facts to select exactly which attributes need to be retrieved among the ones outside of core node facts. `CoreNodeFact` attributes are always retrieved, because they are already in RAM and it would be more costly to recreate objects than to reuse the existing ones.
Several default configurations are provided: no attributes (equivalent to `CoreNodeFact`, useful as a base to create a config for just one attribute), all facts, all facts but software, and only software.
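As a rough sketch of the idea, and not the actual Rudder API, a selection object can be modelled as one flag per non-core attribute. The `Inventory` type, the three attributes, the `mask` helper and the names of the default values are all assumptions made for this example.

```scala
object SelectFactsSketch {
  // Hypothetical reduced inventory: only three optional (non-core) attributes.
  final case class Inventory(
      bios:      List[String],
      software:  List[String],
      processes: List[String]
  )

  // One flag per non-core attribute; core attributes are always retrieved.
  final case class SelectFacts(bios: Boolean, software: Boolean, processes: Boolean)

  object SelectFacts {
    val none:         SelectFacts = SelectFacts(false, false, false)
    val all:          SelectFacts = SelectFacts(true, true, true)
    val noSoftware:   SelectFacts = all.copy(software = false)
    val softwareOnly: SelectFacts = none.copy(software = true)
  }

  // Keep only the selected attributes, leaving the others empty.
  def mask(inv: Inventory, select: SelectFacts): Inventory =
    Inventory(
      bios      = if (select.bios) inv.bios else Nil,
      software  = if (select.software) inv.software else Nil,
      processes = if (select.processes) inv.processes else Nil
    )
}
```

Starting from `SelectFacts.none` and enabling a single flag, for example `SelectFacts.none.copy(bios = true)`, gives the "just one attribute" configuration mentioned above.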
`SelectFacts` is likely the most significant change in the architecture. It allows cherry-picking which parts of the big data structure are retrieved and loaded in memory. It is also an important part of the flexibility around change callbacks.
Mapping to old node classes
A lot of mapping from and to `FullInventory`, `Node`, `NodeInfo`, `NodeSummary`, and `Srv` was implemented.
The first three are the most important: `NodeInfo` is used everywhere in Rudder for read-only info, while `FullInventory` and `Node` are used as the primary source of updates for inventory and Rudder settings/node properties, respectively.
When possible, the existing behavior was preserved.
CoreNodeFactRepository
This is the central piece of the change (even if `NodeFactQuery` may be more complex, see below).
It will be the new nexus API for accessing nodes in Rudder, and it must be used for every node update or get, since it's how we will ensure that node ACLs are correctly and consistently applied.
why no CQRS?
Yes, why not? It's the textbook use case for that.
Well, we're in a monolith, and one repo for both changes and requests is just so much simpler.
So I chose an architecture that allows us to switch to CQRS later if we want (typically if the query part becomes more complex): all changes lead to events, which are what will be exposed outside. And the repo only manages the node cache consistency and centralizes the code around nodes.
CQRS would be a massive change in Rudder by itself, and that would have been too much risk in one refactoring.
Node update, change event and callbacks
Since that repository is now the only way to modify nodes, it provides:
`save(nodeFact)` method that can be tailored with a `SelectFacts` config to cherry-pick what is changed,
Any change method leads to a change event, i.e. information about the state evolution (node created, node updated, node accepted, etc.) that contains relevant info about the change.
The changes are `NodeFact+SelectFacts`.
We chose that because:
`CoreNodeFact`, the change will be missed).
`Callbacks` structure, which is a pain to manage and ends up in the lowest common denominator (i.e. `MinimalNodeFactApi`), forcing callbacks to retrieve more info if they need it, when it may already be present.
These change events are passed to callbacks that can be added to the repo at runtime, allowing other parts of Rudder to react to node changes in an extremely simple way.
Let's call that our poor man's messaging bus.
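The event and callback mechanism can be pictured with the following minimal sketch. It is an in-memory stand-in under simplifying assumptions: the event cases, the `Callback` shape, `registerCallback` and the `save(nodeId, hostname)` signature are hypothetical, and the real repository deals with effects, errors and `SelectFacts`, which are left out here.

```scala
object ChangeEventSketch {
  // Hypothetical, reduced event family; the real one has more cases and richer payloads.
  sealed trait NodeFactChangeEvent
  object NodeFactChangeEvent {
    final case class Created(nodeId: String)                   extends NodeFactChangeEvent
    final case class Updated(nodeId: String, hostname: String) extends NodeFactChangeEvent
    final case class Accepted(nodeId: String)                  extends NodeFactChangeEvent
  }

  // A named reaction to change events.
  final case class Callback(name: String, run: NodeFactChangeEvent => Unit)

  // In-memory stand-in for the repository: each change method emits one event.
  final class Repo {
    private var nodes     = Map.empty[String, String] // nodeId -> hostname
    private var callbacks = Vector.empty[Callback]

    // Callbacks can be added at runtime.
    def registerCallback(cb: Callback): Unit = {
      callbacks = callbacks :+ cb
    }

    // Save the node, derive the matching event, and pass it to every callback.
    def save(nodeId: String, hostname: String): NodeFactChangeEvent = {
      val event: NodeFactChangeEvent =
        if (nodes.contains(nodeId)) NodeFactChangeEvent.Updated(nodeId, hostname)
        else NodeFactChangeEvent.Created(nodeId)
      nodes = nodes + (nodeId -> hostname)
      callbacks.foreach(_.run(event))
      event
    }
  }
}
```

Since the event type is a sealed trait, a callback can pattern match on the cases it cares about and ignore the rest.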
It typically allows to:
`NodeFactChangeEvent` reification to
Storage
Storage (i.e. the backend in charge of persistence) also has change events, but with fewer cases (because it does not deal with business logic like node acceptance, etc.).
Three storages are implemented (noop, git, ldap), and the mechanism seems general enough to accommodate the different cases.
`StorageNodeFactChangeEvent` specialization
Events are specialised by type of action:
This allows more specific pattern matching and enrichment of events from several sources.
Pending nodes and rudder settings / properties
Now a pending node is created with all its aspects, i.e. not limited to inventory info, and including Rudder settings, properties, etc.
This lifts a lot of old limits on Rudder preconfiguration of nodes before acceptance.
get and slowGet
To make the performance contract clear, methods to retrieve nodes come in two flavors: a default get that only (but quickly) retrieves `CoreNodeFact` from the cache, and a `slow` version that retrieves the selected information but needs round trips to cold storage for that.
streamed getAll
The getAll methods return streams so that filtering can be applied as early as possible.
Perhaps in the future, we will be able to stream from cold storage to final usage.
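To illustrate the performance contract, here is a small sketch with a hypothetical in-memory cache and a fake cold storage that counts round trips. The signatures are assumptions (in particular, a plain `Boolean` stands in for a `SelectFacts` config and an `Iterator` stands in for a real stream type); only the `get` / `slowGet` / `getAll` naming comes from the description above.

```scala
object GetSketch {
  final case class CoreNodeFact(id: String, hostname: String)
  final case class NodeFact(id: String, hostname: String, software: List[String])

  // Hypothetical cold storage stand-in that counts round trips.
  final class ColdStorage(data: Map[String, List[String]]) {
    var roundTrips: Int = 0
    def software(id: String): List[String] = {
      roundTrips += 1
      data.getOrElse(id, Nil)
    }
  }

  final class Repo(cache: Map[String, CoreNodeFact], cold: ColdStorage) {
    // Fast path: core attributes only, served from memory.
    def get(id: String): Option[CoreNodeFact] = cache.get(id)

    // Slow path: cached core plus the selected attributes from cold storage.
    def slowGet(id: String, withSoftware: Boolean): Option[NodeFact] =
      cache.get(id).map { c =>
        val sw = if (withSoftware) cold.software(c.id) else Nil
        NodeFact(c.id, c.hostname, sw)
      }

    // Lazy traversal so callers can filter before anything is materialized.
    def getAll(): Iterator[CoreNodeFact] = cache.valuesIterator
  }
}
```

With this shape, a caller that only needs core attributes never touches cold storage, and the cost of a `slowGet` is visible in its name.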
NodeFactStorage
This is the part in charge of the cold storage (serialization, save, get, unserialization) of node information.
We wanted to change as little as possible here:
This led to:
`LdapFullInventoryRepository` and `WoNodeRepository` for node changes
`NodeInfoService` for retrieving core node facts (for example at boot)
Some logic was added on top of that to generate change events and selectively retrieve information based on `SelectFacts`, but the granularity is coarser than `SelectFacts`, and limited to inventory (yes/no) and software (yes/no).
This is because it allows the reuse of the existing repositories as they were.
Further iterations in future minor versions of Rudder will be able to implement optimizations here if relevant (but perhaps using postgres in place of ldap would be more relevant).
NodeFactQuery
This bit is a bit complicated. Luckily, the hard bits were done in the #4771 PoC.
So, we implemented a new query processor that primarily works on `CoreNodeFact` structures, and so can totally avoid I/O to ldap for queries on these attributes.
This is exactly what was done in #4771 but, contrary to it, not all data are in memory.
So we use `SelectFacts` to decide where the data lives, and build either a check on the `CoreNodeFact` structure or an ldap query. That choice is made during query analysis.
There's some grouping done here to try to reduce the number of ldap queries, and we take advantage of the existing `InternalLdapQueryProcessor` because it is a known, stabilized beast.
Apart from that, the new engine is a joy: extremely small, easy to understand, good logging, and easily extensible. Adding the ldap indirection was almost easy once `SelectFacts` were in place (that was the hard bit).
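The analysis step described above can be sketched roughly as follows. Everything here is hypothetical and much simpler than the real processor: the `Criterion` type, the attribute names, the `Check` cases, and the way LDAP criteria are grouped into a single AND filter (built without any value escaping) are assumptions made for the example.

```scala
object QuerySketch {
  final case class CoreNodeFact(id: String, hostname: String, os: String)

  // Hypothetical criterion: an attribute name and an expected value.
  final case class Criterion(attribute: String, value: String)

  // Result of analysis: either an in-memory check or a fragment of an LDAP filter.
  sealed trait Check
  final case class InMemory(test: CoreNodeFact => Boolean) extends Check
  final case class LdapFilter(filter: String)              extends Check

  // Query analysis: decide where the data lives for each criterion.
  def analyse(c: Criterion): Check = c.attribute match {
    case "hostname" => InMemory(_.hostname == c.value)
    case "os"       => InMemory(_.os == c.value)
    // Anything outside the core attributes goes to LDAP (no escaping in this sketch).
    case other      => LdapFilter(s"($other=${c.value})")
  }

  // Group all LDAP criteria into one AND filter to limit the number of queries.
  def groupLdap(checks: List[Check]): Option[String] = {
    val filters = checks.collect { case LdapFilter(f) => f }
    if (filters.isEmpty) None else Some(filters.mkString("(&", "", ")"))
  }
}
```

A query made only of core attributes yields no LDAP filter at all, which is the case where I/O to ldap is avoided entirely.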
NewNodeManager
This service got a big clean-up during the refactoring. It was overcomplicated, and now we just have to be able to:
`CoreNodeFactRepository#changeStatus` and let it manage the change.
Also removed more LDAP-related things, among them `NodeSummaryService`, which was only used here.
NodeDeletionService
This one was left almost untouched because: