Skip to content

feat(data-warehouse): add tracked gRPC transport for warehouse sources#60439

Merged
Gilbert09 merged 2 commits into
masterfrom
tom/data-warehouse-grpc-tracked-transport
Jun 2, 2026
Merged

feat(data-warehouse): add tracked gRPC transport for warehouse sources#60439
Gilbert09 merged 2 commits into
masterfrom
tom/data-warehouse-grpc-tracked-transport

Conversation

@Gilbert09
Copy link
Copy Markdown
Member

Problem

Warehouse-source syncs make a lot of outbound gRPC traffic from the two gRPC sources — google_ads (GoogleAdsClient) and bigquery (Storage Read API) — and none of it is observable today. The tracked HTTP transport gave every requests-based source uniform logs, per-source latency/byte metrics, and an opt-in sample-capture pipeline, but it stops at HTTP. gRPC calls bypass all of it: no team_id/source_type on the logs, no metrics, no easy way to grab a real request/response pair as a test fixture, and nothing stopping a new gRPC source from dropping in a raw channel.

This brings the two gRPC sources to parity.

Changes

  • New tracked gRPC transport at posthog/temporal/data_imports/sources/common/grpc/. Client interceptors (TrackedUnaryUnaryClientInterceptor, TrackedUnaryStreamClientInterceptor) time, size, and log every RPC, emit three OTel instruments (data_import_grpc_requests_total, data_import_grpc_latency_ms, data_import_grpc_response_bytes) labelled by team_id, source_type, method, and a low-cardinality gRPC status_class (ok / client_error / resource_exhausted / unavailable / server_error / error), and feed opt-in sample capture. Streaming responses are sized-and-released — never buffered — so BigQuery's large reads keep streaming; only a small head of messages is retained, and only when capture is armed. All telemetry is wrapped so it can never raise into the call path.
  • Two SDK seams wired:
    • google_ads/google_ads.py passes interceptors=tracked_interceptors(host) to both get_service(...) calls. google-ads rebuilds the channel on every get_service, so the interceptors are re-supplied each time rather than wrapping a single persistent channel.
    • bigquery/bigquery.py builds a credential-bearing channel via BigQueryReadGrpcTransport.create_channel(...), wraps it with make_tracked_channel(...), and hands it to the transport. create_read_session (unary) and read_rows (server-streaming) both ride the interceptors. The REST/metadata path was already HTTP-tracked and is untouched.
  • Full-parity sample capture (common/grpc/sampling.py) plus a warehouse_sources_capture_grpc_samples management command. Redis-gated config with TTL, protobuf → scrubbed JSON via MessageToDict + scrubadub, auth keys (incl. developer_token / refresh_token / client_secret) dropped by name, streaming responses truncated to a small head with a truncated flag, and a 256 KiB payload ceiling. Samples land in S3 under warehouse-sources-grpc-samples/{capture_id}/{source_type}/{seq}.json. Rules match on the gRPC status class or numeric code.
  • Shared, transport-neutral foundations: JobContext (and the bind/scoped/current helpers) moved to common/job_context.py; common/http/context.py is now a re-export shim so existing HTTP imports are unchanged. CaptureRule / CaptureConfig and the scrubadub scrubbing moved to common/sample_scrub.py, shared by both transports.
  • Semgrep rule data-imports-grpc-transport bans raw grpc.*_channel(...) and direct BigQueryReadClient(...) / GoogleAdsClient(...) construction inside sources/** (excluding the common/grpc/ package and the two reference source files, whose enforced invariant is routing through the tracked transport). 3/3 fixtures pass; 0 findings on the sources tree.
  • Docs: SOURCES.md flips google_ads and bigquery's gRPC state to ✅ and updates the legend; the implementing-warehouse-sources skill gains an "Outbound gRPC must go through the tracked gRPC transport" section.

How did you test this code?

I'm an agent (Claude Opus 4.7). Only automated tests — I did not exercise either source against its real upstream API (no credentials locally).

  • 91 new unit tests across common/test/test_grpc_{transport,metrics,observer,sampling}.py, common/test/test_job_context.py, and the management-command test. Coverage includes: unary success/error/observer-raises (future returned unchanged so the SDK still raises), streaming success/lazy-consumption/error-mid-iteration/head-retention, the full status_class mapping over every grpc.StatusCode, metric instrument caching + null fallback, protobuf scrubbing + key redaction, rule matching on gRPC status, Redis slot reservation, payload truncation + size cap, S3 key layout, and the context shim re-export identity.
  • Existing HTTP transport + warehouse-source suites re-run green after the context / scrub extraction (the one HTTP test whose patch target moved was updated). The only failures in the broader sweep were pre-existing XMinioStorageFull MinIO infra errors unrelated to this change.
  • semgrep --test on the rule fixtures (3/3) and a full scan of sources/ (0 findings).
  • ruff format + ruff check --fix over all changed files.

Publish to changelog?

no

Docs update

n/a

🤖 Agent context

Authored end-to-end by an agent (Claude Opus 4.7) at the request of the human author, who reviews and owns the PR. Scoped from the existing HTTP-transport PR with three decisions made up front: (1) full parity including the sample-capture pipeline rather than logs/metrics only; (2) extract the shared JobContext into a transport-neutral module with a re-export shim, rather than have gRPC import from the HTTP package; (3) a semgrep guard banning raw channel/client construction rather than relying on docs alone.

Key non-obvious decisions worth a reviewer's eye:

  • google-ads rebuilds its gRPC channel per get_service call, so there is no single channel to wrap once — the interceptors must be passed on every call. Both call sites (get_schemas, get_rows) do this.
  • BigQuery's transport ignores credentials when a channel= is supplied, so the channel must be built with create_channel(credentials=...) before interception; passing both credentials and a channel would silently drop the credentials.
  • The unary interceptor returns the continuation's outcome unchanged (reading .code()/.result() off the already-resolved future for telemetry), so error propagation via the SDK's own .result() is preserved.
  • Sample capture writes to S3 from inside the call path but is best-effort and fully guarded; if it ever adds latency the observer hook is shaped to move to an async sink.

Copilot AI review requested due to automatic review settings May 28, 2026 13:41
@assign-reviewers-posthog assign-reviewers-posthog Bot requested a review from a team May 28, 2026 13:42
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 28, 2026

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
posthog/management/commands/warehouse_sources_capture_grpc_samples.py:149-158
**Implicit wildcard primary rule silently swallows `--rule` entries**

`_enable` always prepends a `primary` rule built from the flat flags (which default to `*/*/*/*/50`). Because `_first_match` stops at the first match, when an operator follows the documented multi-rule example — supplying only `--rule` flags and `--ttl` — every call matches the wildcard primary with `limit=50` before the explicit `--rule` entries are ever evaluated, so those rules are dead code.

For the exact command in the module docstring:
```
enable --rule source_type=google_ads,… --rule source_type=bigquery,… --ttl 1h
```
the effective rule list is `[CaptureRule(*, *, *, *, 50), google_ads rule, bigquery rule]`: the wildcard fires on every call, the two targeted rules never fire. The operator gets a noisy broad capture instead of the intended targeted one. The test `test_enable_supports_extra_rules` avoids the problem by explicitly passing `--source-type google_ads` on the primary, so it does not catch this case.

### Issue 2 of 2
posthog/temporal/data_imports/sources/common/grpc/proto_utils.py:34-45
**`code` in `_REDACT_PARAM_NAMES` may over-redact protobuf fields**

`_REDACT_PARAM_NAMES` is designed for OAuth URL parameters and includes generic names like `code`. When applied to proto-to-dict output via `_redact_keys`, any protobuf field named `code` — such as the `code` field on Google Ads `ErrorCode` or `Status` protos — will be replaced with `"REDACTED"`, making the captured sample less useful for debugging. The other names in the set (`client_secret`, `developer_token`, `refresh_token`, etc.) are specifically auth-bearing and are safe to share across HTTP and gRPC contexts, but `code` is broadly overloaded in protobuf schemas.

Reviews (1): Last reviewed commit: "chore(data-warehouse): document temporal..." | Re-trigger Greptile

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a tracked gRPC transport for warehouse-source syncs so outbound gRPC calls (notably Google Ads and BigQuery Storage Read API) get consistent logs, OTel metrics, and opt-in sample capture—bringing gRPC sources to parity with the existing tracked HTTP transport.

Changes:

  • Introduces tracked gRPC client interceptors + observer/metrics + sample-capture pipeline (Redis-gated, scrubbed payloads written to S3).
  • Wires the tracked gRPC transport into google_ads and bigquery sources, and updates shared context/scrubbing primitives.
  • Adds a Semgrep rule to prevent raw gRPC channel/client construction in sources/**, plus docs updates.

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
posthog/temporal/data_imports/workflow_activities/import_data_sync.py Switches job-context binding import to the new transport-neutral module.
posthog/temporal/data_imports/sources/SOURCES.md Updates source inventory/docs to reflect gRPC tracking support.
posthog/temporal/data_imports/sources/google_ads/google_ads.py Passes tracked gRPC interceptors into Google Ads get_service(...) calls.
posthog/temporal/data_imports/sources/bigquery/bigquery.py Builds a credential-bearing BigQuery Storage channel and wraps it with tracked interceptors.
posthog/temporal/data_imports/sources/common/job_context.py Adds transport-neutral JobContext + bind/scoped/current helpers.
posthog/temporal/data_imports/sources/common/http/context.py Re-exports JobContext from the neutral module to preserve existing imports.
posthog/temporal/data_imports/sources/common/sample_scrub.py Extracts shared capture config/rules + scrubadub-based scrubbing.
posthog/temporal/data_imports/sources/common/http/sampling.py Switches HTTP sample capture to shared primitives from sample_scrub.py.
posthog/temporal/data_imports/sources/common/http/url_utils.py Adds developer_token to redaction list.
posthog/temporal/data_imports/sources/common/grpc/init.py Exposes tracked gRPC transport API surface.
posthog/temporal/data_imports/sources/common/grpc/transport.py Implements unary + streaming client interceptors and channel wrapping.
posthog/temporal/data_imports/sources/common/grpc/observer.py Adds per-call logging, metrics emission, and sample-capture delegation.
posthog/temporal/data_imports/sources/common/grpc/metrics.py Adds OTel instruments + status-class mapping and instrument caching.
posthog/temporal/data_imports/sources/common/grpc/sampling.py Implements gRPC sample capture (Redis config, S3 writes, truncation/size caps).
posthog/temporal/data_imports/sources/common/grpc/proto_utils.py Adds proto sizing + proto→dict conversion + scrubbing/redaction helpers.
posthog/management/commands/warehouse_sources_capture_grpc_samples.py Adds management command to enable/disable/list gRPC sample capture.
posthog/management/commands/test/test_warehouse_sources_capture_grpc_samples.py Tests for the gRPC sample capture management command.
posthog/temporal/data_imports/sources/common/test/test_job_context.py Tests for the job-context neutralization + HTTP shim behavior.
posthog/temporal/data_imports/sources/common/test/test_http_sampling.py Updates HTTP sampling tests to patch scrubber from the new shared module.
posthog/temporal/data_imports/sources/common/test/test_grpc_transport.py Unit tests for the gRPC interceptors + channel/interceptor factories.
posthog/temporal/data_imports/sources/common/test/test_grpc_observer.py Unit tests for observer logging/metrics/capture behavior and failure isolation.
posthog/temporal/data_imports/sources/common/test/test_grpc_metrics.py Unit tests for metric instruments, caching, and status bucketing.
posthog/temporal/data_imports/sources/common/test/test_grpc_sampling.py Unit tests for proto scrubbing, rule matching, Redis gating, and S3 layout.
.semgrep/rules/data-imports-grpc-transport.yaml Adds Semgrep rules enforcing usage of the tracked gRPC transport.
.semgrep/rules/data-imports-grpc-transport.py Semgrep rule test fixtures.
.agents/skills/implementing-warehouse-sources/SKILL.md Documents the gRPC transport requirement and how to use it in new sources.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread posthog/temporal/data_imports/sources/common/grpc/transport.py
Comment thread posthog/temporal/data_imports/sources/bigquery/bigquery.py
Comment thread posthog/temporal/data_imports/sources/common/grpc/proto_utils.py
Comment thread posthog/temporal/data_imports/sources/common/grpc/proto_utils.py
Comment thread posthog/temporal/data_imports/sources/common/grpc/proto_utils.py Outdated
@veria-ai
Copy link
Copy Markdown

veria-ai Bot commented May 28, 2026

PR overview

All previously flagged issues have been addressed. No open security concerns remain on this pull request.

Security review

No open security issues remain on this pull request.

Fixed/addressed: 1 · PR risk: 0/10

@Gilbert09 Gilbert09 closed this Jun 1, 2026
@Gilbert09 Gilbert09 force-pushed the tom/data-warehouse-grpc-tracked-transport branch from 1887f0d to 490a9ad Compare June 1, 2026 16:44
Adds a tracked gRPC transport so outbound gRPC calls from warehouse-source syncs (Google Ads, BigQuery Storage Read API) get consistent logging, OTel metrics, and opt-in sample capture — bringing gRPC sources to parity with the tracked HTTP transport.

- Tracked gRPC client interceptors + observer/metrics + Redis-gated sample-capture pipeline (scrubbed payloads written to S3).
- Transport-neutral JobContext and shared scrubbing primitives (sample_scrub) reused by both HTTP and gRPC.
- Wires the tracked gRPC transport into google_ads and bigquery sources.
- Semgrep rule preventing raw gRPC channel/client construction under sources/**, plus SOURCES.md and skill docs updates.

Rebased onto master to resolve conflicts in SOURCES.md and http/sampling.py.

Generated-By: PostHog Code
Task-Id: 1a08165a-e60b-4ca8-8dc7-fe6de8e876c8
@Gilbert09 Gilbert09 reopened this Jun 1, 2026
Comment thread posthog/temporal/data_imports/sources/bigquery/bigquery.py
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 1, 2026

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
posthog/temporal/data_imports/sources/common/grpc/transport.py:115-189
**Stream telemetry silently dropped on early consumer termination**

`_TrackedStreamWrapper` only calls `_record` from `__next__` (on `StopIteration`, `RpcError`, or `Exception`). When a consumer exits iteration early — via `break`, an early return, or an explicit `close()` call — Python's generator protocol calls `close()` on the wrapper. Because `close()` is not defined on `_TrackedStreamWrapper`, `__getattr__` delegates it to `self._iter.close()` but `_record` is never triggered.

In practice this means: if a BigQuery or Google Ads streaming read is aborted mid-way (pipeline cancellation, timeout, `break` in the consumer loop), the call produces no log line, no metric increment, and no sample. The wrapper becomes invisible to telemetry. Adding a `close()` method that calls `self._record(code=None, exception=None)` before delegating to `self._iter.close()` would close the gap, guarded by the existing `_recorded` flag so double-recording is impossible.

Reviews (2): Last reviewed commit: "feat(data-warehouse): add tracked gRPC t..." | Re-trigger Greptile

Address review feedback:

- bigquery: move BigQueryReadClient construction inside the try block so transport.close() still runs (and the gRPC channel is freed) when client construction raises.
- grpc transport: add _TrackedStreamWrapper.close() so a stream aborted early (explicit close, cancellation, generator teardown) still records its telemetry once, guarded by the existing _recorded flag, before delegating to the inner iterator's close().
- tests for the new close() behavior (partial-stream record, idempotency after completion, inner-close delegation).

Generated-By: PostHog Code
Task-Id: 1a08165a-e60b-4ca8-8dc7-fe6de8e876c8
Copy link
Copy Markdown
Member Author

Addressed the latest bot review feedback in a930adc:

  • graphite (bigquery resource leak on failed client construction): moved BigQueryReadClient(transport=transport) inside the try so finally: transport.close() always tears down the gRPC channel.
  • greptile (stream telemetry dropped on early consumer termination): added _TrackedStreamWrapper.close() which records the partial stream once (guarded by the existing _recorded flag) before delegating to the inner iterator's close(), so streams aborted via explicit close / cancellation / generator teardown still emit a log line, metric, and sample. Added unit tests for the partial-record, idempotency, and inner-close-delegation paths.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 1, 2026

🎭 Playwright report · View test results →

⚠️ 1 flaky test:

  • Creating a SQL insight with a variable and overriding it on a dashboard (chromium)

These issues are not necessarily caused by your changes.
Annoyed by this comment? Help fix flakies and failures and it'll disappear!

@danielcarletti
Copy link
Copy Markdown
Contributor

Alright, so I could not get a BQ source to test this, so I went into a different direction: try to break it, make it not work, provide evidence that this doesn't actually work as intended. Happy to say me and PH Code failed, although we did find something. Sending the full report below, but this is basically what we need to fix:

Telemetry gap on early-terminated streams

_TrackedStreamWrapper in transport.py (class at line 115) only records via _record() when next hits terminal StopIteration/RpcError (lines 148–159) or when close() is called explicitly (lines 186–194). It's a plain class iterator with no del, so on early termination the call's metric + sample + log are silently lost.

The real trigger: pipeline.py:117 ends a source by calling iterator.close() on the source generator in a finally (cancel/stop). That GeneratorExit unwinds the for … yield loop but does not cascade .close() down to _TrackedStreamWrapper, so _record() never fires. Proven: full-drain records once; break / GC-abandon / source_generator.close() after partial read all record 0 times.

Impact: telemetry-only (no data loss / sync failure), but it blinds metrics+samples exactly on cancelled and failing streaming syncs — the highest-value observability moments. Affects bigquery read_rows (the SDK layers between the dlt generator and our wrapper make full cascade unlikely; not 100% confirmed E2E without creds).

(Note: contextvar propagation is fine — pipeline.py:88 copy_context().run preserves the JobContext; verified.)

Fix — add a del safety net to _TrackedStreamWrapper in transport.py (right after close(), ~line 195). Idempotent via the existing _recorded guard; _record is already try/except-wrapped:

def del(self) -> None:
# Last resort: a stream torn down without reaching StopIteration/close()
# (early break, pipeline's source_generator.close(), GC) still records once.
try:
self._record(code=None, exception=None)
except Exception:
pass
Add a unit test in test_grpc_transport.py: partial-read + del/close → exactly one record_stream call.

@danielcarletti
Copy link
Copy Markdown
Contributor

Full report:

Adversarial test report — tracked gRPC transport (PR #60439)

Goal: try to falsify the PR — find a way it silently fails.
Script: scripts/grpc_transport_adversarial.py
Date: 2026-06-01

Honest headline

I could not show that "the PR doesn't work." Its main paths are sound — the
earlier e2e proved success and stream-error-to-completion both record correctly,
and the control case here (full consumption records exactly once) holds.

But I did find one real, narrow defect: on early stream termination
(consumer stops pulling — cancellation, downstream error, or a row/size limit),
the per-call telemetry is silently lost — no metric, no log, no sample —
even though the PR added a close() method specifically to cover this case.

It is a telemetry-only gap: no data loss, no corruption, no sync failure.

Root cause (one bug, three demonstrations)

_TrackedStreamWrapper (transport.py:115) is a plain class iterator, not a
generator. Its _record() fires in only two ways:

  1. __next__ reaches terminal StopIteration / RpcError (full drain or stream-side error), or
  2. someone explicitly calls wrapper.close().

Python does not call .close() on a non-generator iterator when a for loop
breaks, when it's abandoned to GC, or when an enclosing generator is closed.
There is no __del__ fallback. So any early stop that doesn't drive the wrapper to
its terminal state skips recording entirely.

Attack results

# Attack Verdict Real-path relevance
control full consumption HELD — records once, msg_count=10
1 break after 3/10 messages BROKEN — 0 records mechanism demo
2 abandon partial stream to GC BROKEN — 0 records mechanism demo
4 pipeline-style source_generator.close() after partial read BROKEN — 0 records this is the real early-termination path
3a iterate on a bare thread "BROKEN" artificial — not how the pipeline runs
3b iterate under copy_context().run HELD this is the real pipeline path — safe

Why attack 4 is the one that matters

pipelines/pipeline/pipeline.py:117 terminates a source by calling
iterator.close() on the source generator in a finally (on stop/cancel).
That raises GeneratorExit inside the bigquery generator's
for page in rows_iterator.pages: … yield loop. GeneratorExit unwinds the for
loop but does not call .close() on the underlying iterators — so it never
reaches _TrackedStreamWrapper.close(). Attack 4 reproduces exactly this
(source_generator.close() after a partial read) → 0 records.

Caveat, stated plainly: attack 4 proves the mechanism on a direct
generator→wrapper chain. The real BigQuery chain has google-SDK reader layers
(ReadRowsStream / .rows() / .pages) between the generator and our wrapper
that I could not exercise locally without credentials. Default Python semantics
are against cascade, and the PR provides nothing that forces it — but I have not
observed the full SDK teardown. Treat the bigquery blast radius as
"very likely affected, not 100% confirmed end-to-end."

Why attack 3 is NOT a flaw

The job-context docstring claims contextvars propagate via copy_context().
pipeline.py:88 does exactly that and runs every next() via ctx.run. Attack 3b
confirms context is preserved that way. My 3a (bare threading.Thread) loses it,
but that is not the pipeline's path. No defect here. I include it only to show
the claim was tested, not assumed.

When does this actually bite?

The success path (stream ends naturally → StopIteration → records) is the
common case and works. The gap hits early termination:

  • activity cancellation / heartbeat timeout mid-read,
  • a downstream exception that stops the dlt pipeline pulling,
  • any future row/size cap that stops early.

These are precisely the failure / interruption moments where an operator most
wants the latency/bytes metric and the captured sample. So while infrequent, the
gap removes observability exactly where it's most valuable — and the PR's stated
purpose is observability.

Recommended fix (small)

Add a __del__ safety net to _TrackedStreamWrapper so an abandoned/closed
wrapper still records on collection. The existing _recorded guard keeps it
idempotent, and _record is already fully try/except-wrapped:

def __del__(self) -> None:
    # Last-resort: a stream torn down without reaching StopIteration/close()
    # (early break, generator.close() from the pipeline, GC) still records once.
    try:
        self._record(code=None, exception=None)
    except Exception:
        pass

This closes attacks 1, 2, and 4 (after teardown the wrapper is unreferenced → GC →
__del__ → one idempotent record). Alternatively, make the pipeline's _close
path explicitly cascade .close() to the gRPC stream, but the __del__ net is
self-contained in the transport and needs no pipeline change.

Two known __del__ caveats, both acceptable here: it may run at interpreter
shutdown (the try/except absorbs that), and GC timing means the record may be
slightly delayed — fine for telemetry.

Verdict

  • Not a blocker for correctness — no data/sync impact; the PR works.
  • A real observability gap worth fixing before or shortly after merge, because
    it blinds the metrics/samples on cancelled and failing streaming syncs — the
    cases the PR most wants to see.
  • Recommendation: land the ~6-line __del__ fix (plus a unit test: partial-read +
    del/close → exactly one record_stream).

Copy link
Copy Markdown
Contributor

@danielcarletti danielcarletti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with a small fix, see my comment with the summary of my analysis

@Gilbert09 Gilbert09 merged commit 437e9d2 into master Jun 2, 2026
271 checks passed
@Gilbert09 Gilbert09 deleted the tom/data-warehouse-grpc-tracked-transport branch June 2, 2026 11:02
@deployment-status-posthog
Copy link
Copy Markdown

deployment-status-posthog Bot commented Jun 2, 2026

Deploy status

Environment Status Deployed At Workflow
dev ✅ Deployed 2026-06-02 11:26 UTC Run
prod-us ✅ Deployed 2026-06-02 11:42 UTC Run
prod-eu ✅ Deployed 2026-06-02 11:46 UTC Run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants