perf(expand): pre-prune duplicate (source, descendant) actions (CE-703)#805
perf(expand): pre-prune duplicate (source, descendant) actions (CE-703)#805arreyder wants to merge 2 commits into
Conversation
When the action queue empties and RunSingleStep walks GetExpandableEntitlements × GetExpandableDescendantEntitlements to generate the next depth's actions, it previously appended without checking whether the same (source, descendant) edge was already in the queue or had been yielded earlier in the same iteration. In the SCC-merged case (and after some manual graph edits) the same entitlement ID can appear in a node's EntitlementIDs slice more than once, causing GetExpandableEntitlements to yield it repeatedly and the inner loop to enqueue identical actions. MarkEdgeExpanded is idempotent so correctness was not at risk, but every duplicate paid one full connector ListGrants round-trip plus the SQLite I/O to write the same expansion twice — directly multiplied by the slowest part of large enterprise tenants (Eli Lilly Entra-Global ran ~2161 actions over 12h with a 142 GB c1z file; even a small fraction of duplicates compounds across that envelope). The fix builds a `seen` set seeded from existing actions and skips duplicates as the iteration progresses. A skipped > 0 path emits an Info log with depth, queued, and skipped counts so operators can grep without flipping debug, but stays rate-limited because the loop fires once per depth-bump. Refs: CE-703
Address review feedback on the initial CE-703 commit:
- Drop the no-op TestExpander_DedupAlreadyQueuedAction. The test never
called RunSingleStep — it asserted only on the slice it had just set
itself. Replace with two real tests:
- TestExpander_DedupHandlesDestinationSideDuplicate covers the
inner-loop yield-twice case (destination node EntitlementIDs
duplicate), the mirror of the existing source-side test.
- TestExpander_DedupEmitsLogOnSkip verifies the operator-visible
"expander: pre-pruned duplicate (source, descendant) actions" log
line carries the expected `queued` and `skipped` fields, since
those are the signal operators will grep for to estimate dedup
win across syncs.
- Remove the dead seen-set seeding loop over e.graph.Actions. The
control flow above guarantees the queue is empty at the
action-generation branch, so iterating it was a no-op. Updated the
surrounding comment to explain why.
- Resize the seen map hint from len(e.graph.Actions) (always 0 here)
to len(e.graph.Nodes), a coarse upper bound that avoids rehashes on
large graphs.
- Drop the redundant zap.Int("depth", e.graph.Depth) field on the
dedup log line — `l.With(...)` at the top of RunSingleStep already
attaches it.
- Hoist edgeKey out of RunSingleStep to package scope so the type is
declared once.
Refs: CE-703
This is false. expansion does not have anything to do with connector ListGrants round-trips |
There was a problem hiding this comment.
I ran the benhmarks as so:
go test -run=NONE -bench='BenchmarkExpand|BenchmarkSCC_FixCycles' -benchmem -benchtime=1x -count=6 ./pkg/sync/expand/
I did this on main and this branch
- `ExpandSmall` — main 42.02 s → branch 59.40 s
- `ExpandSmallMedium` — main 55.02 s → branch 79.29 s
- `SCC_FixCycles_Large N=10000 M=100000` — main 64.82 ms → branch 93.19 ms
- `SCC_FixCycles_Large N=20000 M=200000` — main 155.1 ms → branch 242.1 ms
Seems like a regression in perf
|
I've tried with a few c1zs and can't seem to get a duplicate source + descendant action. I tested this out by checking out this branch and adding some logging in the conditional. It never went into that code. |
|
Performance improvements have been merged since this PR was created, so this branch would need to be rebased for benchmarks to be accurate. I notice no difference in performance, but that's because my test data never hits a duplicate source + descendant. |
|
Closing this — rebased onto current
All deltas statistically insignificant. The regression @btipling saw earlier was against pre-#807 Leaving CE-703 open so we can revisit if production ever surfaces real |
Summary
expander.RunSingleSteppreviously generated next-depth actions without checking whether the same(source, descendant)edge was already queued or had just been yielded earlier in the same iteration. This was a throughput cost —MarkEdgeExpandedis idempotent so correctness was not at risk, but every duplicate paid one full connector `ListGrants` round-trip plus the SQLite write to expand the same edge twice.Why this matters
The path runs once per depth-bump in
expandGrantsForEntitlements, which is the dominant phase for enterprise Entra-class connectors. Today's a large tenant Entra-Global expansion processed ~2161 actions over 12 hours with a c1z file growing to 142 GB. Even a small fraction of duplicates would compound across that envelope.The most likely real-world reproducer: SCC cycle reduction collapses a cycle into a single node whose
EntitlementIDsslice carries the merged entitlements. If the same ID ends up listed twice (which can happen across some manual graph edits or a JSON round-trip),GetExpandableEntitlementsyields the source twice, the inner loop runs twice, and we end up with two identical actions in the queue.Fix
Build a `seen` set seeded from existing `graph.Actions` and skip duplicates as we iterate:
When `skipped > 0` we log an Info line with `depth`, `queued`, and `skipped` so operators can grep duplicate-rate per depth without flipping debug globally. The log fires at most once per depth-bump (the same cadence as the action-generation loop).
Seeding the seen-set from existing `graph.Actions` is a safety net: the production code path only enters action-generation when the queue is empty, but a future code change (checkpoint-restore, retry-with-saved-queue) could call it with non-empty Actions, and the dedup would handle that correctly.
Tests
TestExpander_DedupQueuesEachEdgeOnce\— constructs a node whose `EntitlementIDs` slice contains "A" twice (the SCC-style reproducer), runs `RunSingleStep`, asserts the queue contains `(A,B)` exactly once instead of twice.TestExpander_DedupAlreadyQueuedAction\— companion test documenting the existing-queue safety-net invariant.Validation
go test -count=1 ./pkg/sync/expand/...\— greengo vet ./pkg/sync/expand/...\— cleangofmt -l pkg/sync/expand/\— cleanLinear
CE-703
Companion PR
#804 (CE-702) adds OTel metrics for grant-expansion progress. Once both ship and operators see real `skipped` counts on Entra-Global syncs, we'll know whether this fix is meaningful in practice or just a safety net.