Skip to content

feat(sync): emit OTel metrics for grant-expansion progress (CE-702)#804

Merged
arreyder merged 2 commits into
mainfrom
chrisrhodes/ce-702-add-throughput-metric-for-grant-expansion-progress
May 8, 2026
Merged

feat(sync): emit OTel metrics for grant-expansion progress (CE-702)#804
arreyder merged 2 commits into
mainfrom
chrisrhodes/ce-702-add-throughput-metric-for-grant-expansion-progress

Conversation

@arreyder
Copy link
Copy Markdown
Contributor

@arreyder arreyder commented May 6, 2026

Summary

LogExpandProgress already emits structured log fields (actions_remaining, decompressed_bytes, decompressed_bytes_delta) but no metrics. Operators end up scraping log timestamps to answer "is this connector burning through actions or wedged?". A recent prod investigation required scraping log timestamps to confirm a 12-hour grant-expansion was making progress, not stuck.

This PR adds an optional metrics.Handler on ProgressLog and mirrors the four log fields as OTel instruments so the same answer comes out of a Datadog query in seconds.

Metrics emitted

Name Type Source field
`baton.sync.expand.actions_remaining` gauge `len(graph.Actions)`
`baton.sync.expand.actions_burned` counter delta of remaining since previous emission
`baton.sync.expand.decompressed_bytes` gauge `DBSizeProvider.CurrentDBSizeBytes()`
`baton.sync.expand.decompressed_bytes_growth` counter delta of size since previous emission

Counters are monotonic-safe: when the action queue grows between samples (depth++ enqueueing the next layer of expansion), the burned counter is held flat rather than going negative, so `rate()` queries on operator dashboards stay meaningful.

Behavior

  • Defaults to `metrics.NewNoOpHandler` so callers that don't pass `WithMetricsHandler` pay no cost.
  • Caller is expected to pre-tag the handler via `Handler.WithTags` for `connector_id` / `tenant_id` so emitted metrics carry the dimensions operators slice by.
  • The first sample doesn't emit a burned/growth delta — same reason `decompressed_bytes_delta` is omitted on the first log line: a delta against zero is a spurious spike on every dashboard.

Tests

  • `TestLogExpandProgress_EmitsMetrics` — pins gauge/counter values across three samples (5 → 3 → 7 actions, 100M → 130M bytes), including the queue-growth case where the burned counter must NOT go backwards.
  • `TestLogExpandProgress_NoMetricsHandlerIsSafe` — guards against a regression where a future refactor swaps the no-op default for nil and panics on `Observe`.
  • All existing tests still pass.

Validation

  • `go test -count=1 ./pkg/sync/progresslog/...` — green
  • `go vet ./pkg/sync/progresslog/...` — clean
  • `gofmt -l pkg/sync/progresslog/` — clean

Linear

CE-702

Follow-up

Sibling PR coming next: pre-prune duplicate `(source, descendant)` actions in the expander queueing loop (CE-703). The metric added here will tell us whether that fix is meaningful by counting duplicates over real Entra-Global syncs.

LogExpandProgress already emits structured log fields (actions_remaining,
decompressed_bytes, decompressed_bytes_delta) but no metrics. Operators
end up scraping log timestamps to answer "is this connector burning
through actions or wedged?" — exactly what happened during the recent
Eli Lilly Entra-Global investigation, where confirming a 12-hour
expansion was actually progressing required a SQL query against logs.

Add an optional metrics.Handler on ProgressLog and mirror the four log
fields as OTel instruments:

- baton.sync.expand.actions_remaining (gauge)
- baton.sync.expand.actions_burned (counter; delta of remaining)
- baton.sync.expand.decompressed_bytes (gauge)
- baton.sync.expand.decompressed_bytes_growth (counter; delta of size)

The counters guard against monotonic-violation: when the action queue
grows between samples (depth++ enqueueing the next layer), the burned
counter is held flat rather than going negative, so rate() queries on
operator dashboards remain meaningful.

Default is metrics.NewNoOpHandler so callers that don't pass
WithMetricsHandler pay no cost. Caller is expected to pre-tag the
handler via Handler.WithTags for connector_id / tenant_id so emitted
metrics carry the dimensions operators slice by.

Refs: CE-702
@linear-code
Copy link
Copy Markdown

linear-code Bot commented May 6, 2026

Address review feedback on the initial CE-702 commit:

- Rename baton.sync.expand.decompressed_bytes_growth →
  baton.sync.expand.decompressed_bytes_delta to match the structured-
  log field of the same name. Operators who see the field in a Datadog
  log entry can now derive the metric name by mechanical substitution.

- Fix the counter descriptions on actions_burned and
  decompressed_bytes_delta: OTel counters are cumulative, so the
  description must reflect that and direct operators to use rate() for
  per-window throughput. The previous wording ("since the previous
  emission") implied a per-sample gauge, which a future operator
  querying .last() instead of rate() would misread.

- Replace time.Sleep-gated rate-limit boundaries in two tests with
  deterministic backdating of p.lastActionLog. The previous tests
  relied on a 15ms sleep against a 10ms window — a 1.5× margin that's
  flaky on coarse-clock CI runners (Windows; busy hosts). The new
  tests set the rate-limit window to 1h and inject the precondition
  directly, dropping the package's test runtime from ~240ms to ~100ms
  and removing wall-clock dependency from assertions.

- Add a small comment on the metric-name constants block clarifying
  that the string values are the stable external contract (operators
  build dashboards on them) while the Go-level identifiers stay
  unexported because callers don't need them.

Note (separate follow-up): WithMetricsHandler is registered on
ProgressLog but not yet wired through pkg/sync/syncer.go in any
production path, so the four instruments currently emit to a no-op
handler in production. Wiring belongs in a sibling PR — this PR
ships the registration-side contract and tests; the syncer.go change
will land separately under a new Linear follow-up.

Refs: CE-702
@btipling btipling self-assigned this May 6, 2026
@btipling btipling removed their assignment May 8, 2026
@arreyder arreyder merged commit 843444c into main May 8, 2026
6 checks passed
@arreyder arreyder deleted the chrisrhodes/ce-702-add-throughput-metric-for-grant-expansion-progress branch May 8, 2026 02:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants