Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #20716: Improve dynamic group computation speed #4155

Conversation

ncharles
Copy link
Member

@ncharles ncharles commented Feb 3, 2022

https://issues.rudder.io/issues/20716

This is a WIP - the code is not finished, and may land in another branch

The goal here is to explore how to do dynamic group computation with minimal LDAP queries, which are really expensive
We have many data in the nodeinfocache, we can use them

As a first approach, I changed the ldap search on hostname to a nodeinfofilter (so that it's a post filter). The drawback is that when we only search on hostname, there is no filter to apply, so it fetches all nodes to do the search. Kind of kill the advantage of doing so, that's why i resolve to add a boolean that says: ok, don't compute anything, and take all from cache

@ncharles ncharles requested a review from fanf February 3, 2022 21:06
@ncharles ncharles added Trigger test WIP Use that label for a Work In Progress PR that must not be merged yet labels Feb 3, 2022
@ncharles
Copy link
Member Author

ncharles commented Feb 3, 2022

Impact on simple search on hostname is dramatic: from 500ms to nearly nothing when the search don't return any node (the display part is really expensive on 10 000 nodes, so i can only measure when nothing is returned)

@@ -222,6 +222,31 @@ final case object NodeStateComparator extends NodeCriterionType {
}


final case object NodeStringComparator extends NodeCriterionType {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should make it a case class, with a parameter telling on which attribute of nodeinfo we should search

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, the big trouble here is that it breaks the detection on which group a pending node will appear, as the pending node is not in cache

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and properties are not correctly used to detect in which group the node will be

@@ -578,6 +578,10 @@ trait NodeInfoServiceCached extends NodeInfoService with NamedZioLogger with Cac
)
}

def getAllNodesEntry() = {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not reall proud of this one

@@ -174,7 +174,7 @@ class AcceptedNodesLDAPQueryProcessor(
val timePreCompute = System.currentTimeMillis

for {
res <- processor.internalQueryProcessor(query,select,limitToNodeIds,debugId).toBox
res <- processor.internalQueryProcessor(query,select,limitToNodeIds,debugId, nodeInfoService.getAllNodesEntry).toBox
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should not always we fetched, only when there is no filter

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and we are on the acceptedInventoryDIT only

// we should have a specific case here, saying: it's empty, we are AND, so we AND with all the cache
// case And => (None, true)

case And => (Some(Set[ExtendedFilter](LDAPFilter(BuildFilter.ALL))), true)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here for the test, it's always true, but it will cause many problems for the non accepted DIT

@ncharles
Copy link
Member Author

ncharles commented Feb 3, 2022

The queryChecker need also to be corrected/modified to make use of NodeInfoFilter, using the getPendingNodeInfos of NodeInfoService

Fixes #20716: Improve dynamic group computation speed
@ncharles
Copy link
Member Author

ncharles commented Feb 4, 2022

PR updated with a new commit

@ncharles
Copy link
Member Author

ncharles commented Feb 4, 2022

I managed, on my test bench, with the code at this point, to have a *2 perfs increase, from 2 minutes to 1 minute
I've been really surprised by the total time as most of my 497 query went from about 100-300ms to 30 ms
The sum of all the com.normation.rudder.services.queries.AcceptedNodesLDAPQueryProcessor - LDAP result: 10001 entries obtained in 236ms for query went from 71507ms to 8935ms (so the LDAP part is nearly 10 times faster)
The sum of all the postfilter filter:rudderNode] Found 15 nodes when filtering for info service existence and properties (294 ms) has understandably went up, from 7636ms to 22636ms (*3)

Investigating showed:

  • the time spent computing the group is now 66850 ms
  • the time spend computing and UPDATING the group is 86500 ms

so 1/3 of the time is to check that the group changed on ldap
but thee are nearly 33s unaccounted for (8935 + 22636 << 66850 )

@ncharles
Copy link
Member Author

ncharles commented Feb 4, 2022

Ok, i strongly suspect that looking in the nodeinfoservicecached fo matching nodeinfo, to get to LDAPNodeInfo, to convert thme to QueryResult (same looking object) to put them in a Seq, to finally get the NodeId is probably node the finest move in processOnlyId
Same for Process
Both could simply return NodeInfo

Fixes #20716: Improve dynamic group computation speed
@ncharles
Copy link
Member Author

ncharles commented Feb 6, 2022

PR updated with a new commit

case (true, false) => "System group '%s' (%s) can not be modified".format(oldGroup.name, oldGroup.id.value).fail
case (false, true) => "You can not modify a non system group (%s) with that method".format(oldGroup.name).fail
case _ => oldGroup.succeed
// check if old group is the same as new group
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change make the overhead of updated vs computed go from 18s to 13s

…speed

Fixes #20716: Improve dynamic group computation speed
@ncharles
Copy link
Member Author

ncharles commented Feb 6, 2022

PR updated with a new commit

@ncharles
Copy link
Member Author

ncharles commented Feb 6, 2022

Now, dynamic group update is 50s with this change on my test bench, so a bit better than *2 perf improvement
52s spend in updating, 40s in computing (still 12 seconds for updated that don't update anythingr) <- note it's ore than 50 because it's multi-threaded

Edit: the logger part add a massive overhead, removing the debug from com.normation.rudder.services.queries.AcceptedNodesLDAPQueryProcessor show 35s for dynamic group update, about *4 perf improvement

…tation speed

Fixes #20716: Improve dynamic group computation speed
@ncharles
Copy link
Member Author

ncharles commented Feb 7, 2022

PR updated with a new commit

@ncharles
Copy link
Member Author

ncharles commented Feb 7, 2022

PR updated with a new commit

… Improve dynamic group computation speed

Fixes #20716: Improve dynamic group computation speed
@ncharles
Copy link
Member Author

ncharles commented Feb 8, 2022

PR updated with a new commit

…#20716: Improve dynamic group computation speed

Fixes #20716: Improve dynamic group computation speed
@ncharles
Copy link
Member Author

ncharles commented Feb 8, 2022

PR updated with a new commit

… Fixes #20716: Improve dynamic group computation speed

Fixes #20716: Improve dynamic group computation speed
@ncharles
Copy link
Member Author

ncharles commented Feb 8, 2022

PR updated with a new commit

@@ -157,6 +147,7 @@ trait NodeInfoService {
*/
def getAllNodes() : IOResult[Map[NodeId, Node]]

def getAllNodeInfos():IOResult[Seq[NodeInfo]]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interestingly, having a Seq rather than a Set makes this much faster

}
}
def getAllNodeInfos():IOResult[Seq[NodeInfo]] = withUpToDateCache("all nodeinfos") { cache =>
cache.view.values.map(_._2).toSeq.succeed
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here, with a toSet, this method takes 88ms, with a toSeq, it take 5 ms

… fixup! Fixes #20716: Improve dynamic group computation speed

Fixes #20716: Improve dynamic group computation speed
@ncharles
Copy link
Member Author

ncharles commented Feb 8, 2022

PR updated with a new commit

… fixup! fixup! Fixes #20716: Improve dynamic group computation speed

Fixes #20716: Improve dynamic group computation speed
@ncharles
Copy link
Member Author

ncharles commented Feb 8, 2022

PR updated with a new commit

… fixup! fixup! fixup! Fixes #20716: Improve dynamic group computation speed

Fixes #20716: Improve dynamic group computation speed and fix inverted searched
@ncharles
Copy link
Member Author

ncharles commented Feb 9, 2022

PR updated with a new commit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Trigger test WIP Use that label for a Work In Progress PR that must not be merged yet
Projects
None yet
1 participant