feat(cymbal): add remote symbol resolution via gRPC service#60289
Conversation
|
🎭 Playwright didn't run on this PR — your changes touch code that could affect E2E behavior, but Playwright is opt-in via label now to keep CI cost down. Add the Most PRs don't need this. Real regressions still get caught on master and fix-forward. |
|
Size Change: 0 B Total Size: 80.8 MB ℹ️ View Unchanged
|
320ea2e to
1c171d2
Compare
Introduces a standalone `cymbal-resolution` service that owns exception symbol resolution behind a `cymbal.resolution.v1` gRPC contract, and routes the cymbal pipeline's resolution stage through it via a caller-owned endpoint pool when enabled. Local resolution remains the fallback when the flag is off; there is no silent local fallback when the remote pool is enabled — failures surface as batch-level errors so upstream handling stays explicit. New crates / services - `cymbal-proto`: build-time proto contract crate with a `Resolve` + `Subscribe` stream split and contract tests - `cymbal-resolution`: standalone gRPC service with admission gating (`MAX_ITEM_CONCURRENCY`), load-event bus (`Subscribe` stream reports `LoadEvent.in_flight / max_in_flight / degraded / draining`), and per-item Done/Error/Retry outcomes per `Resolve` stream Client integration in cymbal - `EndpointPool` with DNS-driven endpoint reconciliation, rendezvous-hashing per team for warm-cache affinity, exclusion of degraded/draining endpoints, and round-robin tie-breaking - per-endpoint `Subscribe` task maintains a fresh load snapshot, with exponential reconnect backoff + jitter - request grouping by team routing key, chunking by item count + byte budget, and per-attempt endpoint rotation for retries - caller-side retries with exponential backoff + jitter; pool-empty consumes the retry budget rather than failing on the first attempt - event-level deterministic sampling for graduated rollout Data integrity (all-or-nothing rollout policy) - clean stream end without `BatchSummary` triggers retry (guards against server spawn-task panic / mid-stream `tx.send` failure) - per-item `Error` outcomes fail the batch loudly rather than silently downgrading affected exceptions to unresolved passthrough - `RemoteEventSlot` uses `Option<>` + `take()` instead of a `mem::replace` placeholder so a regression can't leak a sentinel - final `output[batch_index].ok_or_else(...)` fails the batch loud if any slot is left empty Observability - per-endpoint in-flight gauge correctly increments on select and decrements on handle Drop; emitted only for the chosen endpoint - latency histogram drops the high-cardinality `endpoint` label - routing fallback / no-summary / pool-empty / items-failed counters - `last_error` preserved across all retryable cause classes Local dev wiring - `bin/mprocs.yaml` + `bin/start-rust-service` so cymbal-resolution comes up alongside cymbal in the local stack - dev RUST_LOG scoping for cymbal crates Tests - parity suite (`remote_resolution_parity.rs`): local vs remote produce identical resolved `exception_list` for the same input - hardening suite (`remote_resolution_hardening.rs`): connection refused, mid-stream interruption, missing items, no-summary truncation, per-item Error all-or-nothing, draining-only pool, cancellation releases pool slot - core integration (`remote_resolution.rs`): grouping, chunking, retries, sampling, deadline cancellation, ordering, per-team affinity - subscription routing (`remote_resolution_subscribe.rs`): pool routes to endpoint with lower server-reported load - service tests (`cymbal-resolution/tests/service_tests.rs`): end-to-end Resolve + Subscribe behavior, accounting invariants, limiter pressure surfacing Rollout - `CYMBAL_REMOTE_RESOLUTION_ENABLED` gates the client side; all knobs documented in `rust/cymbal-resolution/README.md` including retry, deadline, batching, sampling, and admission/degraded thresholds Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1c171d2 to
bcc7a49
Compare
Prompt To Fix All With AIFix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
rust/cymbal/src/stages/resolution/remote/client.rs:92-102
The stream-drain loop applies `tokio::time::timeout(deadline, stream.next())` per message, not as a single total wrapper. Each call to `stream.next()` gets a fresh `deadline`-length budget, so a server sending N items could hold this function for up to N × `deadline` (e.g. 64 items × 15 s = 960 s). The doc-comment above says "tokio `timeout` wrapping the entire stream drain", which is not what the code does. The only end-to-end budget is the `grpc-timeout` header sent to the server — if a proxy strips that header or the server has a slow-but-not-stalled item loop, there is no client-side cap on total wall time.
```suggestion
let mut stream = response.into_inner();
let mut outcomes: Vec<Outcome> = Vec::new();
let drain_deadline = tokio::time::Instant::now() + deadline;
loop {
let remaining = drain_deadline.saturating_duration_since(tokio::time::Instant::now());
if remaining.is_zero() {
return Err(RemoteCallError::Deadline(deadline));
}
match tokio::time::timeout(remaining, stream.next()).await {
Ok(Some(Ok(outcome))) => outcomes.push(outcome),
Ok(Some(Err(status))) => return Err(classify_status(status)),
Ok(None) => break,
Err(_) => return Err(RemoteCallError::Deadline(deadline)),
}
}
Ok(outcomes)
```
### Issue 2 of 3
rust/cymbal/src/stages/resolution/remote/pool.rs:417-433
`Notify::notified()` is constructed after `now().await`, so a `notify_waiters()` call that fires in between the two is silently dropped (tokio `Notify` does not store missed notifications from `notify_waiters`). A caller waiting on pool readiness at startup can miss the first `ready` signal and sit out the full `timeout` even though the pool already has healthy endpoints. The standard fix is to pin the `notified()` future before the condition check.
```suggestion
pub async fn wait_ready(&self, timeout: Duration) -> Result<(), EndpointPoolError> {
// Create the notified() future FIRST so we don't miss a notify_waiters()
// that fires between the condition check and the await below.
let notified = self.ready.notified();
tokio::pin!(notified);
if self
.inner
.lock()
.await
.endpoints
.values()
.any(|s| !s.draining)
{
return Ok(());
}
match tokio::time::timeout(timeout, notified).await {
Ok(()) => Ok(()),
Err(_) => Err(EndpointPoolError::Empty),
}
}
```
### Issue 3 of 3
rust/cymbal/src/app_context.rs:303
**Background task handle not stored** — `spawn_refresh_task` returns a `JoinHandle<()>` that is immediately dropped. The task holds its own `Arc<EndpointPool>` clone, so it keeps running and keeps the pool alive even after `AppContext` is dropped. Storing the handle on `AppContext` or `RemoteResolutionContext` would enable graceful cancellation on teardown.
Reviews (1): Last reviewed commit: "feat(cymbal): add remote symbol resoluti..." | Re-trigger Greptile |
PR overviewThis PR adds remote symbol resolution to cymbal using a gRPC-backed resolver path for processing sampled exception data. It extends the resolution stage so work can be delegated to a remote service rather than only handled through the existing local path. There is one remaining security concern: the remote resolution path can launch work for every exception in a submitted event concurrently, instead of preserving the bounded concurrency used elsewhere. A crafted event with a very large exception list could therefore cause excessive remote work and retries, creating a resource-exhaustion risk. One prior issue has already been addressed, so the remaining posture is narrowed to this concurrency-control gap. Open issues (1)
Fixed/addressed: 1 · PR risk: 6/10 |
|
🎭 Playwright report · View test results →
These issues are not necessarily caused by your changes. |
Prompt To Fix All With AIFix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
rust/cymbal/src/stages/resolution/remote/config.rs:87-92
`normalized_sample_rate` maps any non-finite float (including `f64::INFINITY` and `f64::NEG_INFINITY`) to `1.0`, which would silently enable full-rate remote routing. Notably, `f64::from_str("inf")` succeeds in Rust, so setting `CYMBAL_REMOTE_RESOLUTION_SAMPLE_RATE=inf` in an env file would parse without error and produce full remote traffic — bypassing the safety of the `0.0` default. A non-finite value more naturally signals a misconfiguration, so mapping it to `0.0` (fail-closed) is safer.
```suggestion
fn normalized_sample_rate(sample_rate: f64) -> f64 {
if !sample_rate.is_finite() {
return 0.0;
}
sample_rate.clamp(0.0, 1.0)
}
```
Reviews (2): Last reviewed commit: "chore: remove pi settings override" | Re-trigger Greptile |
cat-ph
left a comment
There was a problem hiding this comment.
very solid & cool approach! :amaze:
this looks great, let's start testing it ![]()
ablaszkiewicz
left a comment
There was a problem hiding this comment.
Looks extremely solid. Left small nit and a question. Overall I think PR description could be more informative. Some diagrams explaining what's going on would be cool. Reviewing 8k lines of raw code with that description is extremely complicated but I did a lot of back and forth with Claude so I feel like I have a general understanding.
Small thing that worries me is that right now sometimes we have a hard time figuring out why something is not working as it should in cymbal and here we are adding new service, new communication layer and a whole 2-way communication. I'd be interested in seeing observability side of things for this new approach
Drop in_flight/max_in_flight from the LoadEvent protocol in favor of a degraded flag plus suggested_batch_size (renamed from suggested_max_batch_items). The server now flips degraded at an absolute in-flight threshold (DEGRADED_IN_FLIGHT_THRESHOLD) decoupled from concurrency, and suggests remaining headroom as the batch size. Also splits the cymbal-resolution service module into config/codes/ resolve/subscribe submodules and converts the Subscribe RPC to a lazy async_stream, dropping its mpsc channel and background task. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rnegron
left a comment
There was a problem hiding this comment.
reviewed .github and bin changes
23809db to
bbceb91
Compare
| .checked_add(ctx.config.request_deadline) | ||
| .unwrap_or_else(Instant::now); | ||
|
|
||
| let resolved = join_all( |
There was a problem hiding this comment.
Medium: Unbounded remote exception fan-out
join_all starts one resolve_work_item future for every exception in the sampled remote batch. An attacker who can submit an error event with a large $exception_list can make cymbal create and retry all of those remote work items concurrently, bypassing the existing BATCH_APPLY_CONCURRENCY limit used by the local exception/frame resolution path. Please run these work items through a bounded stream, or another configurable semaphore, so remote mode preserves the same per-request concurrency ceiling.
#32) Upstream split symbol/exception-frame resolution out of cymbal into a separate `cymbal-resolution` rust service (PostHog/posthog#60289, `bin/start-rust-service cymbal-resolution`). Add it to the chart so self-hosters have it ready when the rollout ramps. - New `cymbalResolution` component (Deployment + ClusterIP Service on gRPC 50061 / metrics 9106), reusing the cymbal image in resolution mode. It needs only Postgres + S3, both supplied via the shared posthog env + the rustfs bucket init. Includes HPA / PDB / ServiceMonitor parity with the personhog gRPC services and a NetworkPolicy allow-list entry. - New `cymbal.remoteResolution` knobs that wire CYMBAL_REMOTE_RESOLUTION_* onto the existing cymbal pods. These render ONLY when remoteResolution.enabled=true, so cymbal is byte-for-byte unchanged by default. SAMPLE_RATE defaults to 0.0 to match the upstream production default (flipping ENABLED is a no-op until ramped). Disabled by default -> zero change to the rendered happy-path manifests (verified via helm template). Resolves #30. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
Cymbal currently runs symbol resolution inline on every event. That couples ingestion throughput to symbol-store I/O, makes warm caches pod-local rather than team-local, and prevents scaling resolution independently of the rest of the pipeline.
Changes
Introduces a standalone
cymbal-resolutiongRPC service and wirescymbalto call it. Five things:cymbal-resolutionservice +cymbal.resolution.v1proto contract —Resolve(server-streaming items + terminalBatchSummary) andSubscribe(long-lived load-event bus). Reuses cymbal's resolver stack via a new narrow constructor that connects only to Postgres and S3 — no Kafka, Redis, or signalspool_emptyconsumes the retry budget so transient all-pods-degraded windows don't kill the batchSubscribestream — a pod is routable only when its server-reportedLoadEventis fresh AND not degraded/draining. No caller-side load-guess fallback. Freshness window derived from the subscription tick (twice the cadence)LoadEvent.suggested_max_batch_itemslets the server tell callers how to chunk; client takesmin(server suggestion, config ceiling). Chunking is event-atomic (an event's exceptions are never split across chunks)Erroroutcomes, missingBatchSummary, or any client-side parse failure fail the batch loudly (no silent passthrough of unresolved exceptions). Rollout gated byCYMBAL_REMOTE_RESOLUTION_ENABLED(falsedefault) andSAMPLE_RATE(0.0default), so flipping the enable flag alone is a no-opMinimal observability surface (10 metrics, deliberately scoped) and a
READMEper crate.How did you test this code?
I'm an agent. Only automated tests below were run by me; no manual production testing.
cymballib + 5cymbal-resolutionlib + 5cymbal-protocontract tests passremote_resolution.rs(20),remote_resolution_hardening.rs(9 — connection refused, mid-stream interruption, missing items, no-summary truncation, per-item-error all-or-nothing, draining pool, cancellation),remote_resolution_parity.rs(2 — local vs remote produce identical output),remote_resolution_subscribe.rs(1 — load-driven routing),service_tests.rs(19 — end-to-end Resolve/Subscribe, accounting invariants)cargo clippyclean on touched codePublish to changelog?
no
🤖 Agent context
Authored with Claude Code across several iterative passes — initial assembly, then multi-agent code-quality/efficiency review, retry hardening, silent-truncation fix, data-flow refactor (typed
RemoteChunkowning its events; replaced index bookkeeping +Option<>/mem::replacemutation), startup-time scoping (cymbal-resolution no longer connects to Kafka/Redis).Decisions worth flagging:
SAMPLE_RATE=0.0default makesENABLED=truea no-op until rollout is ramped explicitly; dev script overrides to1.0so local exercises the pathRequires human review before merging.