Skip to content

perf(expand): pre-prune duplicate (source, descendant) actions (CE-703)#805

Closed
arreyder wants to merge 2 commits into
mainfrom
chrisrhodes/ce-703-pre-prune-duplicate-source-descendant-actions-in-grant
Closed

perf(expand): pre-prune duplicate (source, descendant) actions (CE-703)#805
arreyder wants to merge 2 commits into
mainfrom
chrisrhodes/ce-703-pre-prune-duplicate-source-descendant-actions-in-grant

Conversation

@arreyder
Copy link
Copy Markdown
Contributor

@arreyder arreyder commented May 6, 2026

Summary

expander.RunSingleStep previously generated next-depth actions without checking whether the same (source, descendant) edge was already queued or had just been yielded earlier in the same iteration. This was a throughput cost — MarkEdgeExpanded is idempotent so correctness was not at risk, but every duplicate paid one full connector `ListGrants` round-trip plus the SQLite write to expand the same edge twice.

Why this matters

The path runs once per depth-bump in expandGrantsForEntitlements, which is the dominant phase for enterprise Entra-class connectors. Today's a large tenant Entra-Global expansion processed ~2161 actions over 12 hours with a c1z file growing to 142 GB. Even a small fraction of duplicates would compound across that envelope.

The most likely real-world reproducer: SCC cycle reduction collapses a cycle into a single node whose EntitlementIDs slice carries the merged entitlements. If the same ID ends up listed twice (which can happen across some manual graph edits or a JSON round-trip), GetExpandableEntitlements yields the source twice, the inner loop runs twice, and we end up with two identical actions in the queue.

Fix

Build a `seen` set seeded from existing `graph.Actions` and skip duplicates as we iterate:

type edgeKey struct{ source, descendant string }
seen := make(map[edgeKey]struct{}, len(e.graph.Actions))
for _, a := range e.graph.Actions {
    seen[edgeKey{a.SourceEntitlementID, a.DescendantEntitlementID}] = struct{}{}
}
for src := range e.graph.GetExpandableEntitlements(ctx) {
    for dst, grantInfo := range e.graph.GetExpandableDescendantEntitlements(ctx, src) {
        k := edgeKey{src, dst}
        if _, dup := seen[k]; dup { skipped++; continue }
        seen[k] = struct{}{}
        e.graph.Actions = append(e.graph.Actions, &EntitlementGraphAction{...})
    }
}

When `skipped > 0` we log an Info line with `depth`, `queued`, and `skipped` so operators can grep duplicate-rate per depth without flipping debug globally. The log fires at most once per depth-bump (the same cadence as the action-generation loop).

Seeding the seen-set from existing `graph.Actions` is a safety net: the production code path only enters action-generation when the queue is empty, but a future code change (checkpoint-restore, retry-with-saved-queue) could call it with non-empty Actions, and the dedup would handle that correctly.

Tests

  • TestExpander_DedupQueuesEachEdgeOnce\ — constructs a node whose `EntitlementIDs` slice contains "A" twice (the SCC-style reproducer), runs `RunSingleStep`, asserts the queue contains `(A,B)` exactly once instead of twice.
  • TestExpander_DedupAlreadyQueuedAction\ — companion test documenting the existing-queue safety-net invariant.
  • All existing `./pkg/sync/expand/...` tests still pass.

Validation

  • go test -count=1 ./pkg/sync/expand/...\ — green
  • go vet ./pkg/sync/expand/...\ — clean
  • gofmt -l pkg/sync/expand/\ — clean

Linear

CE-703

Companion PR

#804 (CE-702) adds OTel metrics for grant-expansion progress. Once both ship and operators see real `skipped` counts on Entra-Global syncs, we'll know whether this fix is meaningful in practice or just a safety net.

When the action queue empties and RunSingleStep walks
GetExpandableEntitlements × GetExpandableDescendantEntitlements to
generate the next depth's actions, it previously appended without
checking whether the same (source, descendant) edge was already in
the queue or had been yielded earlier in the same iteration.

In the SCC-merged case (and after some manual graph edits) the same
entitlement ID can appear in a node's EntitlementIDs slice more than
once, causing GetExpandableEntitlements to yield it repeatedly and
the inner loop to enqueue identical actions. MarkEdgeExpanded is
idempotent so correctness was not at risk, but every duplicate paid
one full connector ListGrants round-trip plus the SQLite I/O to
write the same expansion twice — directly multiplied by the slowest
part of large enterprise tenants (Eli Lilly Entra-Global ran ~2161
actions over 12h with a 142 GB c1z file; even a small fraction of
duplicates compounds across that envelope).

The fix builds a `seen` set seeded from existing actions and skips
duplicates as the iteration progresses. A skipped > 0 path emits an
Info log with depth, queued, and skipped counts so operators can grep
without flipping debug, but stays rate-limited because the loop fires
once per depth-bump.

Refs: CE-703
@linear-code
Copy link
Copy Markdown

linear-code Bot commented May 6, 2026

Address review feedback on the initial CE-703 commit:

- Drop the no-op TestExpander_DedupAlreadyQueuedAction. The test never
  called RunSingleStep — it asserted only on the slice it had just set
  itself. Replace with two real tests:
  - TestExpander_DedupHandlesDestinationSideDuplicate covers the
    inner-loop yield-twice case (destination node EntitlementIDs
    duplicate), the mirror of the existing source-side test.
  - TestExpander_DedupEmitsLogOnSkip verifies the operator-visible
    "expander: pre-pruned duplicate (source, descendant) actions" log
    line carries the expected `queued` and `skipped` fields, since
    those are the signal operators will grep for to estimate dedup
    win across syncs.

- Remove the dead seen-set seeding loop over e.graph.Actions. The
  control flow above guarantees the queue is empty at the
  action-generation branch, so iterating it was a no-op. Updated the
  surrounding comment to explain why.

- Resize the seen map hint from len(e.graph.Actions) (always 0 here)
  to len(e.graph.Nodes), a coarse upper bound that avoids rehashes on
  large graphs.

- Drop the redundant zap.Int("depth", e.graph.Depth) field on the
  dedup log line — `l.With(...)` at the top of RunSingleStep already
  attaches it.

- Hoist edgeKey out of RunSingleStep to package scope so the type is
  declared once.

Refs: CE-703
@btipling btipling self-assigned this May 6, 2026
@btipling
Copy link
Copy Markdown
Contributor

btipling commented May 8, 2026

but every duplicate paid one full connector ListGrants round-trip plus the SQLite write to expand the same edge twice.

This is false. expansion does not have anything to do with connector ListGrants round-trips

Copy link
Copy Markdown
Contributor

@btipling btipling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran the benhmarks as so:

go test -run=NONE -bench='BenchmarkExpand|BenchmarkSCC_FixCycles' -benchmem -benchtime=1x -count=6 ./pkg/sync/expand/

I did this on main and this branch

  - `ExpandSmall` — main 42.02 s → branch 59.40 s
  - `ExpandSmallMedium` — main 55.02 s → branch 79.29 s
  - `SCC_FixCycles_Large N=10000 M=100000` — main 64.82 ms → branch 93.19 ms
  - `SCC_FixCycles_Large N=20000 M=200000` — main 155.1 ms → branch 242.1 ms

Seems like a regression in perf

@ggreer
Copy link
Copy Markdown
Contributor

ggreer commented May 8, 2026

I've tried with a few c1zs and can't seem to get a duplicate source + descendant action. I tested this out by checking out this branch and adding some logging in the conditional. It never went into that code.

@ggreer
Copy link
Copy Markdown
Contributor

ggreer commented May 8, 2026

Performance improvements have been merged since this PR was created, so this branch would need to be rebased for benchmarks to be accurate. I notice no difference in performance, but that's because my test data never hits a duplicate source + descendant.

@arreyder
Copy link
Copy Markdown
Contributor Author

Closing this — rebased onto current main (which now includes #807 SQLite perf, #781 slim-blob, #826 grants index, and #804 the companion CE-702 metrics) and re-ran @btipling's bench command (-count=6 -benchtime=1x):

Benchmark main rebased p
ExpandSmall 107.7s 104.3s 0.180
ExpandSmallMedium 134.5s 133.5s 0.093
SCC_FixCycles_Large N=10k 122.3ms 121.9ms 0.937
SCC_FixCycles_Large N=20k 254.1ms 259.4ms 0.240

All deltas statistically insignificant. The regression @btipling saw earlier was against pre-#807 main — it doesn't survive the rebase. But the perf claim also doesn't survive: allocs/op is byte-identical between branches, confirming @ggreer's observation that the dedup branch doesn't fire on representative c1z data.

Leaving CE-703 open so we can revisit if production ever surfaces real (source, descendant) duplicates (e.g. via the CE-702 metrics from #804).

@arreyder arreyder closed this May 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants