feat(sync): emit OTel metrics for grant-expansion progress (CE-702) #804
arreyder merged 2 commits on May 8, 2026
LogExpandProgress already emits structured log fields (actions_remaining,
decompressed_bytes, decompressed_bytes_delta) but no metrics. Operators
end up scraping log timestamps to answer "is this connector burning
through actions or wedged?" That is exactly what happened during the
recent Eli Lilly Entra-Global investigation, where confirming that a
12-hour expansion was actually progressing required a SQL query against
logs.

Add an optional metrics.Handler on ProgressLog and mirror the log fields
as four OTel instruments:
- baton.sync.expand.actions_remaining (gauge)
- baton.sync.expand.actions_burned (counter; delta of remaining)
- baton.sync.expand.decompressed_bytes (gauge)
- baton.sync.expand.decompressed_bytes_growth (counter; delta of size)

The counters guard against monotonicity violations: when the action
queue grows between samples (depth++ enqueueing the next layer), the
burned counter is held flat rather than going negative, so rate()
queries on operator dashboards remain meaningful.

Default is metrics.NewNoOpHandler, so callers that don't pass
WithMetricsHandler pay no cost. The caller is expected to pre-tag the
handler via Handler.WithTags with connector_id / tenant_id so emitted
metrics carry the dimensions operators slice by.

Refs: CE-702
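The optional-handler design described in this commit message can be sketched roughly as below. This is a minimal illustration, not the code from the PR: the `Handler` interface, its single `Gauge` method, `noOpHandler`, `recordingHandler`, and `NewProgressLog` are hypothetical stand-ins, and the real `metrics.Handler` in the repository almost certainly has a different shape. Only the names `ProgressLog`, `WithMetricsHandler`, `LogExpandProgress`, and the metric name come from the PR text.

```go
package main

import "fmt"

// Handler is a hypothetical, stripped-down stand-in for metrics.Handler.
type Handler interface {
	Gauge(name string, value int64)
}

// noOpHandler discards every sample; it plays the role the PR assigns to
// metrics.NewNoOpHandler (the real constructor is not reproduced here).
type noOpHandler struct{}

func (noOpHandler) Gauge(string, int64) {}

// recordingHandler remembers the last value per metric name. It exists only
// so this sketch can show that a supplied handler receives samples.
type recordingHandler struct{ last map[string]int64 }

func (r *recordingHandler) Gauge(name string, value int64) { r.last[name] = value }

// ProgressLog is a hypothetical reduction of the real type to one field.
type ProgressLog struct{ metrics Handler }

// Option configures a ProgressLog at construction time.
type Option func(*ProgressLog)

// WithMetricsHandler installs h; a nil handler leaves the no-op default.
func WithMetricsHandler(h Handler) Option {
	return func(p *ProgressLog) {
		if h != nil {
			p.metrics = h
		}
	}
}

// NewProgressLog (hypothetical constructor) starts from the no-op handler so
// callers that pass no option never hit a nil handler.
func NewProgressLog(opts ...Option) *ProgressLog {
	p := &ProgressLog{metrics: noOpHandler{}}
	for _, opt := range opts {
		opt(p)
	}
	return p
}

// LogExpandProgress mirrors one log field as a gauge.
func (p *ProgressLog) LogExpandProgress(actionsRemaining int64) {
	p.metrics.Gauge("baton.sync.expand.actions_remaining", actionsRemaining)
}

func main() {
	// No option passed: the sample goes to the no-op handler.
	NewProgressLog().LogExpandProgress(42)

	// Handler passed: the sample is recorded.
	rec := &recordingHandler{last: map[string]int64{}}
	NewProgressLog(WithMetricsHandler(rec)).LogExpandProgress(42)
	fmt.Println(rec.last["baton.sync.expand.actions_remaining"])
}
```

Defaulting to a no-op value rather than a nil field keeps the emit path free of nil checks, which is one plausible reading of "pay no cost" in the commit message.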
Address review feedback on the initial CE-702 commit:
- Rename baton.sync.expand.decompressed_bytes_growth →
baton.sync.expand.decompressed_bytes_delta to match the structured-
log field of the same name. Operators who see the field in a Datadog
log entry can now derive the metric name by mechanical substitution.
- Fix the counter descriptions on actions_burned and
decompressed_bytes_delta: OTel counters are cumulative, so the
description must reflect that and direct operators to use rate() for
per-window throughput. The previous wording ("since the previous
emission") implied a per-sample gauge, which a future operator
querying .last() instead of rate() would misread.
- Replace time.Sleep-gated rate-limit boundaries in two tests with
deterministic backdating of p.lastActionLog. The previous tests
relied on a 15ms sleep against a 10ms window — a 1.5× margin that's
flaky on coarse-clock CI runners (Windows; busy hosts). The new
tests set the rate-limit window to 1h and inject the precondition
directly, dropping the package's test runtime from ~240ms to ~100ms
and removing wall-clock dependency from assertions.
- Add a small comment on the metric-name constants block clarifying
that the string values are the stable external contract (operators
build dashboards on them) while the Go-level identifiers stay
unexported because callers don't need them.
Note (separate follow-up): WithMetricsHandler is registered on
ProgressLog but not yet wired through pkg/sync/syncer.go in any
production path, so the four instruments currently emit to a no-op
handler in production. Wiring belongs in a sibling PR — this PR
ships the registration-side contract and tests; the syncer.go change
will land separately under a new Linear follow-up.
Refs: CE-702
btipling
approved these changes
May 8, 2026
Summary
`LogExpandProgress` already emits structured log fields (`actions_remaining`, `decompressed_bytes`, `decompressed_bytes_delta`) but no metrics. Operators end up scraping log timestamps to answer "is this connector burning through actions or wedged?" A recent prod investigation required scraping log timestamps to confirm that a 12-hour grant expansion was making progress, not stuck.

This PR adds an optional `metrics.Handler` on `ProgressLog` and mirrors the log fields as four OTel instruments, so the same answer comes out of a Datadog query in seconds.
Metrics emitted
Counters are monotonic-safe: when the action queue grows between samples (depth++ enqueueing the next layer of expansion), the burned counter is held flat rather than going negative, so `rate()` queries on operator dashboards stay meaningful.
Behavior
Tests
Validation
Linear
CE-702
Follow-up
Sibling PR coming next: pre-prune duplicate `(source, descendant)` actions in the expander queueing loop (CE-703). The metric added here will tell us whether that fix is meaningful by counting duplicates over real Entra-Global syncs.