feat(error-tracking): add per-issue pre-cymbal rate limit#59611
Conversation
Prompt To Fix All With AIFix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
nodejs/src/ingestion/error-tracking/load-error-tracking-settings-step.ts:18-25
Debug `console.log` left in production code — this fires for every event processed by the error-tracking pipeline. Under sustained load that will generate enormous log volume and expose settings data (including rate-limit values) on every invocation.
```suggestion
const errorTrackingSettings = manager ? await manager.getSettings(input.team.id) : null
return ok({ ...input, errorTrackingSettings })
```
### Issue 2 of 3
nodejs/src/ingestion/error-tracking/error-tracking-consumer.ts:312-356
Two `console.log` debug statements are left in production code. `getBucketConfig` is called once per unique rate-limit key per batch, so these fire frequently under real traffic. The same issue also appears in the per-issue block immediately below (`[ET-RL] per-issue bucket config`).
```suggestion
const config = {
bucketSize: value,
refillRate: value / (minutes * 60),
}
return config
},
},
// Per-stack cap: independent opt-in from the team-global limit.
{
rateLimiter: this.rateLimiter,
appMetricsAggregator: this.rateLimiterAppMetricsAggregator,
appSource: 'exceptions',
getKey: (input) => {
if (input.errorTrackingSettings?.perIssueRateLimitValue == null) {
return null
}
const sig = preCymbalGroupKey(input.event)
return sig ? `${input.team.id}:exceptions:stack:${sig}` : null
},
getTeamId: (input) => input.team.id,
reportingMode: this.config.rateLimiterReportingMode,
dropReason: 'rate_limited:per_stack',
getBucketConfig: (input) => {
const settings = input.errorTrackingSettings!
const value = settings.perIssueRateLimitValue!
const minutes = settings.perIssueRateLimitBucketSizeMinutes ?? 60
const config = {
bucketSize: value,
refillRate: value / (minutes * 60),
}
return config
```
### Issue 3 of 3
nodejs/src/ingestion/error-tracking/pre-cymbal-group-key.ts:32-35
The `|` separator in the message-only path can silently merge distinct exceptions into the same rate-limit bucket. For example, `type="A|B", value="C"` and `type="A", value="B|C"` both produce the string `"A|B|C"` and therefore the same hash. Using a null byte (as the per-frame fields already do) removes the ambiguity.
```suggestion
return createHash('sha1')
.update(String(exc.type ?? ''))
.update('\0')
.update(value)
.digest('hex')
.slice(0, 16)
```
Reviews (1): Last reviewed commit: "feat(error-tracking): add per-issue pre-..." | Re-trigger Greptile |
8037387 to
387bb41
Compare
PR overviewThe PR has worked down most findings, with 2 issues already addressed and 1 remaining open. The remaining concern is that the new per-issue pre-cymbal rate limiter can create high-cardinality Redis keys from attacker-controlled exception payloads before a team-wide cap is enforced, allowing a sender with a project token to drive unbounded Redis writes by varying stack traces. This should be gated by a bounded team/global admission check or another cap on per-team issue-key creation before writing per-issue buckets. Open issues (1)
Fixed/addressed: 2 · PR risk: 6/10 |
| return null | ||
| } | ||
|
|
||
| const frames = exc.stacktrace?.frames |
There was a problem hiding this comment.
for 'extra safety' should we handle invalid frames? something like frames being [null], maybe we can validate the frames are valid objects too, otherwise I think this fails the entire batch?
|
|
||
| /** Pre-Cymbal stack signature for per-issue rate limit buckets. */ | ||
| export function preCymbalGroupKey(event: PluginEvent): string | null { | ||
| const exc = event.properties?.$exception_list?.[0] |
There was a problem hiding this comment.
thinking out loud, would chained exceptions with the first one acting as a sort of "wrapper", swallow more than wanted, and throttle issues that are not part of it? thinking that we go through all exceptions in cymbal
There was a problem hiding this comment.
Very good point. Fixing that
387bb41 to
3aa7234
Compare
3aa7234 to
03cbace
Compare
There was a problem hiding this comment.
Consolidated into existing tests
Adds a per-issue token-bucket rate limiter that fires before Cymbal so a single hot issue can't dominate the consumer. Keyed by the pre-cymbal group key (team + exception fingerprint), wired through the shared KeyedRateLimiterService. - New step: pre-cymbal-group-key — derives the issue key for rate-limit decisions ahead of full Cymbal grouping. - Refactored keyed-rate-limiter step from per-key cost summation to per-input fan-out so partial batches can pass through (some inputs in a key group allowed, others dropped) rather than all-or-nothing. - Reporting mode flag so the limiter can be shadowed in prod before enforcement. Builds on the V3 lua denial-persistence fix so per-issue limits with sub-2 fillRate (e.g. 100/15min ≈ 0.11/s) can recover instead of wedging.
03cbace to
20132f6
Compare
Prompt To Fix All With AIFix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
nodejs/src/ingestion/error-tracking/error-tracking-consumer.ts:289-308
**Per-issue limiter runs after project-global, so hot issues still burn the project budget**
The two specs are applied in array order — project-global first, per-issue second. When both limits are configured an event from a hot issue that passes the project-global check (consuming a project token) can still be dropped at the per-issue step. Those consumed tokens are then unavailable for other issues, which is exactly the crowd-out problem the per-issue limiter is meant to solve.
Reversing the order so per-issue runs first means a hot issue's overflow is dropped before it can consume any project tokens, and the remaining project budget is fully available to other issues. When only one limit is configured the other step's `getKey` returns `null` for all inputs (no-op), so the swap is transparent in single-limit deployments.
Reviews (2): Last reviewed commit: "feat(error-tracking): add per-issue pre-..." | Re-trigger Greptile |
There was a problem hiding this comment.
Pull request overview
Adds a per-issue (pre-Cymbal) rate limit to the error-tracking ingestion path to prevent a single noisy fingerprint from monopolizing consumer capacity, and reorders limiters so per-issue throttling happens before the team-global cap.
Changes:
- Extend error-tracking settings loading to include per-issue rate limit value and bucket window.
- Introduce a new
preCymbalGroupKeyhashing step and wire a keyed per-issue rate limiter into the consumer ahead of Cymbal grouping. - Update/expand consumer tests for both project-wide and per-issue rate limiting; remove an older ET rate-limiter E2E test.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| nodejs/src/utils/error-tracking-settings-manager.ts | Loads per-issue rate limit settings from posthog_errortrackingsettings. |
| nodejs/src/utils/error-tracking-settings-manager.test.ts | Updates settings upsert helper + expectations to include per-issue fields. |
| nodejs/src/ingestion/error-tracking/rate-limiter-e2e.test.ts | Removes prior Redis-backed E2E test for sustained traffic rate limiting. |
| nodejs/src/ingestion/error-tracking/pre-cymbal-group-key.ts | Adds pre-Cymbal stable signature hashing for exceptions. |
| nodejs/src/ingestion/error-tracking/pre-cymbal-group-key.test.ts | Adds unit tests covering hashing behavior and edge cases. |
| nodejs/src/ingestion/error-tracking/load-error-tracking-settings-step.test.ts | Updates test typings/fixtures to include per-issue settings. |
| nodejs/src/ingestion/error-tracking/error-tracking-consumer.ts | Adds per-issue keyed limiter before team-global limiter using the pre-Cymbal signature. |
| nodejs/src/ingestion/error-tracking/error-tracking-consumer.test.ts | Adds integration-ish tests verifying project and per-issue limiter behavior with Redis-backed limiter. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
curious why assign-reviewers-posthog[bot] is auto assigning client libs approvers as reviewers quite often |
|
@marandaneto I think it's probably that one of the base branch changes caused the diff to actually contain a js version change (also noticed that in another PR), maybe there's a way to compare current PR's package.json against master in /cc @PostHog/team-devex for thoughts |
yep i figured, thx |
| const hash = createHash('sha1') | ||
| let hasContent = false | ||
|
|
||
| for (const exc of excList) { |
There was a problem hiding this comment.
there's probably a very limited edge case here where some SDKs sending captureException("") would bypass it, but IMO not something we need to explicitly handle here, just noting
| if (input.errorTrackingSettings?.perIssueRateLimitValue == null) { | ||
| return null | ||
| } | ||
| const sig = preCymbalGroupKey(input.event) |
There was a problem hiding this comment.
Medium: High-cardinality rate-limit keys
preCymbalGroupKey(input.event) is derived from public exception payloads, and this per-issue limiter runs before the team-global limiter. A sender with the project token can vary stack frames to create one Redis bucket per synthetic issue, so globally rate-limited traffic still causes unbounded Redis writes; put a bounded team/global admission check before this per-issue bucket, or otherwise cap per-team issue-key creation before writing per-issue keys.
There was a problem hiding this comment.
Ehhh that's true. I need to think about this
There was a problem hiding this comment.
a thought: one benefit to having the team limiter run first is that we could actually move it further up the stack (using just the parsed headers), and drop events before we parse them (saving some compute cost) - we could then have the separate per group check after we've parsed the events.
There was a problem hiding this comment.
IMO the fact that this would exhaust team-limit is a deal breaker.
A customer we are mainly designing this for is research gate (amongst top 100 most visited websites on the internet). Their exploding issue (issue which has uncontrollably high amount of exceptions) will basically eat team-limit in seconds probably effectively stopping ingestion of all other issues.
I was thinking about either:
- hashing the exception hash to some limited space (like let's say even 10k hashes) - dead simple to implement but prone to clashes. We are sure that there is no way for potential attacker to use more than 10k slots in redis,
- implementing something actually smart based on Redis and keeping track of how many distinct hashes we have recorded for a given team in a given time. If we are over this limit, we return null in issue-limit and then team-limit applies. This should NEVER happen when customer has a proper implementation. This is purely to prevent possible attacks. I see there are quite some options how to implement it - will research it tomorrow.
If you have any thoughts on above - please drop them here ❤️
There was a problem hiding this comment.
i think option 2 makes sense there! option 1 can be tricky, and can possibly lead to false positives. starting with 2 (claude says to use redis hll) makes sense, we can also break it up further over time and have heavy traffic buckets vs a tail end bucket if needed.
z0br0wn
left a comment
There was a problem hiding this comment.
all in all looks good to me - have a thought about ordering the limiters but not a blocker. we should have some per step latency metrics so i'd be keen to see the impact there
Problem
A single noisy exception fingerprint can dominate the error-tracking consumer and crowd out the rest. We need per-issue throttling before Cymbal grouping so hot issues don't burn capacity downstream.
Stacked on #59610 (V3 lua denial-persistence fix) — without that, per-issue limits with sub-2 fillRate (100/15min ≈ 0.11/s) would wedge at
-1forever under sustained 1 req/s traffic.Also I discovered an unwanted behavior - per team rate limiter runs before per issue rate limiter so it is possible that we will be consuming team budget and then rejecting that exception on the per-issue level. I swapped them so per-issue runs first. This means worse performance because we have to compute hashes for every exception that comes in but let's observe metrics - this is the proper way.
Changes
pre-cymbal-group-keystep — derives a stable issue key (team + exception fingerprint) for rate-limit decisions ahead of full Cymbal grouping.KeyedRateLimiterServiceinto the error-tracking consumer, keyed by that issue key.keyed-rate-limiter-stepfrom per-key cost summation to per-input fan-out → partial pass-through within a key group (some inputs allowed, the rest dropped) instead of all-or-nothing.ERROR_TRACKING_RATE_LIMITER_REPORTING_MODE) so the limiter can be shadowed in prod before enforcement.How did you test this code?
Publish to changelog?
no
🤖 Agent context
Branched off #59610 and built up from an earlier WIP branch by the agent. PR base will switch to
masterautomatically once #59610 merges.