Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #23727: Group all node related access into one NodeFactRepository #5167

Conversation

fanf
Copy link
Member

@fanf fanf commented Nov 10, 2023

https://issues.rudder.io/issues/23727

overview

A lot of change going on, so let's start from the big picture :

  • we want one central, performant node repository for any query do that ACL can be applied at the core of rudder
  • we want to minimize change to existing service (so some kind of proxy will be needed)
  • we want to have a simpler implementation (at least in the long term) we fewer concepts for nodes

So we go from that :

2023-11-08-ldap-node-repo-before-change

To that :

2023-11-08-core-node-fact-repo

NodeFact and CoreNodeFact, mapping to old node classes

NodeFact and CoreNodeFact

The core concept is NodeFact, the whole representation of a node with all the inventory items, including software, and all rudder settings (audit mode, state, run, etc), properties, agent config.
We use the fully serializable and parsable case class introduced in #4869.

We add a subset of that structure, CoreNodeFact, as an equivalent of NodeInfo, but cleaner. And to link both, they implement the trait MinimalNodeFactInterface.
CoreNodeFact serves the same role as NodeInfo:

  • hold enough always used information about node,
  • still smallish enough (similar to NodeInfo) so that all nodes can be kept in ram for cache.

You can go between the two node fact structures with translation methods like NodeFact.toCore or NodeFact.fromMinimal

Why not CoreNodeFact as a field in NodeFact?

It was though to have CoreNodeFact as a subpart of NodeFact.
It would have the following benefits :

  • much simpler translation between the two. Without, the translation methods need explicit copy of each fields, which is boilerplaty ; with it, it's just a field core in NodeFact to get/set
  • it leads to less object gc, since we would really just reuse core object exactly to build node facts (but it doesn't change the fact that the full node fact object must be created, so we're just talking if the space taken by 10 or so pointers in it)

Benefits for that :

  • the (json, direct) serialization of node fact is independant of what we needs to do for performance reasons in the webapp. If tomorrow software update goes in core, it doesn't change cold format
  • it's transparent to move field in/out core fact if needed : it's just adding or removing it from MinimalNodeFactInterface. And even if client were using slowGet (see below), they will automatically switch to the cache version
  • it's easier to refactore client code

Giving the lesser impact for external on change, I preferred to avoid the "core as subobject" path.

SelectFacts, SefectFactConfiguration

To allow fine grained selection of which attributes need to be retrieved in a node fact and avoid to load all softwares, processes and ports when only the bios was needed, we introduce the concept of SelectFacts.
This configuration object is given to method retrieving or merging node facts to select what attribute need to be retrieved exactly among the ones out of core node facts.

CoreNodeFact attribute are always retrieved, because they are already in ram and it would be more costly to recreate objects than just reuse the existing one.

Several default configuration are provided for none attribute (equivalent to CoreNodeFact, useful as a base to create a config for just one attribute), all facts, all facts but software, only software.

⚠️ SelectFacts is likely the most significant change in the architecture. It allows to cherry-pick what part of the big data structure are retrieved and loaded in memory. It is also an important part of the flexibility around change callbacks.

Mapping to old node classes

A lot of mapping from and to FullInventory, Node, NodeInfo, NodeSummary, Srv was implemented.

The first three are the most important, NodeInfo being used everywhere in Rudder for read only info, FullInventory and Node being used as a primary source of update for respectively inventory and rudder settings/node properties.
When possible, same behavior as the one existing today was done.

CoreNodeFactRepository

This is the central piece of change (even if NodeFactQuery may be more complex, see below).

It will be the new nexus API for accessing nodes in rudder, and it must be used for all and any node update or get since it's how we will ensure that node ACL are correctly and consistently applied.

why no CQRS?

Yes, why not? It's the text book use case for that.
Well, we're in a monolith, one repo for both change and request is just so much simpler.
So I choose to take an archi allowing to latter switch to CQRS if we want (typically if the query part becomes more complex) : all change lead to events that are what will be exposed outside. And the repo only manage the node cache consistency and centralized the code around node.

CQRS would be a massive change in rudder by itself, that would have been too much risk in one refactoring.

Node update, change event and callbacks

Since that repository is now the only way to modify nodes, it provides :

  • an update from inventory that is knowledgeable about how to create default, update pending, etc
  • a save(node Fact) method that can be tailored with SelectFscts config to cherry pick what is changed,
  • a change status method for accepting nodes
  • a delete method that delete (purge)

Any change method leads to a change event, ie an information of the state evolution (node created, node updated, node accepted, etc) that contains relevant info about the change.

The changes are NodeFact+SelectFacts.
We choose that b/c:

  • NodeFact are really big, we prefer to avoid passing them around
  • but we still need to be precise about changes, else important callbacks, like regeneration, may not be triggered (ex: is a change happened only in a software and we look only for changes in CoreNodeFact, the change will be missed).
  • it avoid the need for a type parameter in Callbacks structure, which is a pain to manage and ends-up in the lower common denominator (ie MinimalNodeFactApi), forcing callbaks to retrieve more info if they need it, when it may be already present.

These change event are pass to callbacks that can be runtime added to the repo and allows other part of rudder to react to node changes in an extremely simple way.
Let's call that our messaging bus of the poor.

It typically allows to:

  • store event log about node change
  • trigger an update of dynamic groups (it could even made that update minimal, only for groups really impacted)
  • trigger generation
  • invalidate other cache like removing from node compliance cache a deleted node
  • etc

NodeFactChangeEvent reification to Storage

Storage (ie the backend in charge of persistence) also has change event, but with less cases (because it does not deal with business logic like node acceptation / etc).

3 storages are implemented (noop, git, ldap) and it seems that the mechanism is general enought to accomodate the different cases.

StorageNodeFactChangeEvent specifialization

Events are specialised by type of action:

  • save (update, create, no-op)
  • delete
  • change status.

This is to allows more specific pattern matching and enrichment of events from severa sources.

Pending nodes and rudder settings / properties

Now a pending node is created with all its aspects, ie not limited to inventory infoand including rudder settings, properties etc

This lifts a lot of old limits about rudder preconfiguration of node before acceptation.

get and slowGet

To make clear the performance contract, methods to retrieve node come in two flavors : a default get that only (but quickly) retrieve CoreNodeFact from cache, and a slow version that retrieve the selected information but needs round trips to cold storage for that.

streamed getAll

The getAll methods return streams so that filtering can be applied as soon as possible.
Perhaps in some future, we will be able to stream from cold storage to final usage.

NodeFactStorage

This is the part in charge of the cold storage (stérilisation, save, get, unserialization) of node information.

We wanted to change as little things as possible here :

  • of course regarding the actual storage, ie LDAP for nodes. That refactoring is complicated enough, we
    • 1/ absolutely want to be able to compare for change with previous rudder version and
    • 2/ let people believe that nothing change in rudder if you look at the webapp as a black box
  • and for as much serialization/persistanc/query code as possible. That code is complicated and was debugged by a decade of field tests, we don't want to loose that.

This lead to :

  • directly reusing LdapFullInventiryRepository and WoNodeRepository for node chsnges
  • copying the old code from node acceptation for software update
  • copying the code from NodeInfoService for retrieving core node fact (for exemple at boot)

Some logic was added on top of that to generate change events and selectively retrieve information based on SelectFacts, but the atomicity is coarser that SelectFacts, and limited to inventory (yes/no), software (yes/no).
This is because it allows the reuse of exidting repositories as they were.

Further iterations in future minor versions of rudder will be able to implement optimization here if relevant (but perhaps more relevant would be to use postgres in place of ldap)

NodeFactQuery

This bit is a bit complicated. It was a chance the hard bits where done in #4771 PoC.

So, we reimplemented a new query processor that primary works on CoreNodeFact structures, and so can totally avoid I/O to ldap for queries on these attributes.
This exactly what was done in #4771, but contrary to it, not all data are in memory.
So we use SelectFacts to decide where the data lives, and either build a check on the CoreNodeFact structure, or an ldap query. That choice is done during query analisys.
There's some grouping done here to try to reduce the number of ldap queries and we take advantage of the existing InteenalLdapQueryProcessor because it us a known, stabilized beast.

Appart from that, the new engine is a joy : extremely small, easy to understand, good logging, and easily extensible. Adding the ldap in direction was almost easy once SelectFacts were in place (that was the hard bit).

NewNodeManager

This service got a big clean-up during the refactoring. It was overcomplicated and now, we just have to be able to:

  • check some precondition like duplicate hostname,
  • call CoreNodeFactRepository#changeStatys and let it manage the change.
    Also removed more LDAP related thing, along them NodeSummaryService that was only used here.

NodeDeletionService

This one was let almost untouch because:

  • it already get a big clean-up in 7.3,
  • our callback system is not potent enough to accomodate for pre- and post- hooks with possible interruption of flow in case of error (it is only fire-and-forget after the action is finished). So we kept that logic in the deletion service.

@fanf fanf marked this pull request as draft November 10, 2023 18:14
@fanf fanf marked this pull request as ready for review November 14, 2023 07:15
@fanf fanf force-pushed the arch_23727/group_all_node_related_access_into_one_nodefactrepository branch from cac6f48 to 3ac5db8 Compare November 16, 2023 21:33
tryMatches(value, a => MatchHolder[A](DebugInfo(NotEquals.id, Some(value)), extractor(n), _.forall(_ != a)))
case Regex =>
for {
m <- MatcherUtils.getRegex(value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I understand correctly, the regex is recompiled for each node

@fanf fanf force-pushed the arch_23727/group_all_node_related_access_into_one_nodefactrepository branch from ad4a81a to 2d4b821 Compare November 22, 2023 13:07
@fanf
Copy link
Member Author

fanf commented Nov 22, 2023

We reviewed the PR via hands-on/presentation/question, internal docs.
We validatied by checking it does not lead to policy generation difference, perf change are understood, and general structure is ok.

Merging it as is, and let get some useful change build on top of it now.

@fanf
Copy link
Member Author

fanf commented Nov 22, 2023

PR updated with a new commit

@Normation-Quality-Assistant
Copy link
Contributor

This PR is not mergeable to upper versions.
Since it is "Ready for merge" you must merge it by yourself using the following command:
rudder-dev merge https://github.com/Normation/rudder/pull/5167
-- Your faithful QA
Kant merge: "In law a man is guilty when he violates the rights of others. In ethics he is guilty if he only thinks of doing so."
(https://ci.normation.com/jenkins/job/merge-accepted-pr/76479/console)

@fanf
Copy link
Member Author

fanf commented Nov 22, 2023

OK, squash merging this PR

@fanf fanf force-pushed the arch_23727/group_all_node_related_access_into_one_nodefactrepository branch from 35daf6a to 1d93013 Compare November 22, 2023 14:53
@fanf fanf merged commit 1d93013 into Normation:master Nov 22, 2023
0 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants