Skip to content

feat(error-tracking): add per-issue pre-cymbal rate limit#59611

Merged
ablaszkiewicz merged 6 commits into
masterfrom
feat/error-tracking-per-issue-rate-limit-v2
May 27, 2026
Merged

feat(error-tracking): add per-issue pre-cymbal rate limit#59611
ablaszkiewicz merged 6 commits into
masterfrom
feat/error-tracking-per-issue-rate-limit-v2

Conversation

@ablaszkiewicz
Copy link
Copy Markdown
Contributor

@ablaszkiewicz ablaszkiewicz commented May 22, 2026

Problem

A single noisy exception fingerprint can dominate the error-tracking consumer and crowd out the rest. We need per-issue throttling before Cymbal grouping so hot issues don't burn capacity downstream.

Stacked on #59610 (V3 lua denial-persistence fix) — without that, per-issue limits with sub-2 fillRate (100/15min ≈ 0.11/s) would wedge at -1 forever under sustained 1 req/s traffic.

Also I discovered an unwanted behavior - per team rate limiter runs before per issue rate limiter so it is possible that we will be consuming team budget and then rejecting that exception on the per-issue level. I swapped them so per-issue runs first. This means worse performance because we have to compute hashes for every exception that comes in but let's observe metrics - this is the proper way.

Changes

  • New pre-cymbal-group-key step — derives a stable issue key (team + exception fingerprint) for rate-limit decisions ahead of full Cymbal grouping.
  • Wires KeyedRateLimiterService into the error-tracking consumer, keyed by that issue key.
  • Refactors the shared keyed-rate-limiter-step from per-key cost summation to per-input fan-out → partial pass-through within a key group (some inputs allowed, the rest dropped) instead of all-or-nothing.
  • Reporting-mode flag (ERROR_TRACKING_RATE_LIMITER_REPORTING_MODE) so the limiter can be shadowed in prod before enforcement.
  • Swapped order of per-team and per-issue rate limiters so per-issue runs first.

How did you test this code?

  • tests pass,
  • I manually tested this and it works - I set per issue rate limit to 100 per 15 min and I am sending 1 exception per second. It quickly reaches ~105 (105 instead of 100 because bucket has a refill rate) and then traffic slows down drastically leaving us with ~1 exception per 10s (100 per 15 min is 0.11 per s). Other exceptions have their own limit
image image

Publish to changelog?

no

🤖 Agent context

Branched off #59610 and built up from an earlier WIP branch by the agent. PR base will switch to master automatically once #59610 merges.

@assign-reviewers-posthog assign-reviewers-posthog Bot requested a review from a team May 22, 2026 12:40
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 22, 2026

Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
nodejs/src/ingestion/error-tracking/load-error-tracking-settings-step.ts:18-25
Debug `console.log` left in production code — this fires for every event processed by the error-tracking pipeline. Under sustained load that will generate enormous log volume and expose settings data (including rate-limit values) on every invocation.

```suggestion
        const errorTrackingSettings = manager ? await manager.getSettings(input.team.id) : null

        return ok({ ...input, errorTrackingSettings })
```

### Issue 2 of 3
nodejs/src/ingestion/error-tracking/error-tracking-consumer.ts:312-356
Two `console.log` debug statements are left in production code. `getBucketConfig` is called once per unique rate-limit key per batch, so these fire frequently under real traffic. The same issue also appears in the per-issue block immediately below (`[ET-RL] per-issue bucket config`).

```suggestion
                    const config = {
                        bucketSize: value,
                        refillRate: value / (minutes * 60),
                    }

                    return config
                },
            },
            // Per-stack cap: independent opt-in from the team-global limit.
            {
                rateLimiter: this.rateLimiter,
                appMetricsAggregator: this.rateLimiterAppMetricsAggregator,
                appSource: 'exceptions',
                getKey: (input) => {
                    if (input.errorTrackingSettings?.perIssueRateLimitValue == null) {
                        return null
                    }
                    const sig = preCymbalGroupKey(input.event)
                    return sig ? `${input.team.id}:exceptions:stack:${sig}` : null
                },
                getTeamId: (input) => input.team.id,
                reportingMode: this.config.rateLimiterReportingMode,
                dropReason: 'rate_limited:per_stack',
                getBucketConfig: (input) => {
                    const settings = input.errorTrackingSettings!
                    const value = settings.perIssueRateLimitValue!
                    const minutes = settings.perIssueRateLimitBucketSizeMinutes ?? 60
                    const config = {
                        bucketSize: value,
                        refillRate: value / (minutes * 60),
                    }

                    return config
```

### Issue 3 of 3
nodejs/src/ingestion/error-tracking/pre-cymbal-group-key.ts:32-35
The `|` separator in the message-only path can silently merge distinct exceptions into the same rate-limit bucket. For example, `type="A|B", value="C"` and `type="A", value="B|C"` both produce the string `"A|B|C"` and therefore the same hash. Using a null byte (as the per-frame fields already do) removes the ambiguity.

```suggestion
        return createHash('sha1')
            .update(String(exc.type ?? ''))
            .update('\0')
            .update(value)
            .digest('hex')
            .slice(0, 16)
```

Reviews (1): Last reviewed commit: "feat(error-tracking): add per-issue pre-..." | Re-trigger Greptile

Comment thread nodejs/src/ingestion/error-tracking/error-tracking-consumer.ts Outdated
Comment thread nodejs/src/ingestion/error-tracking/pre-cymbal-group-key.ts Outdated
@ablaszkiewicz ablaszkiewicz force-pushed the feat/error-tracking-per-issue-rate-limit-v2 branch 3 times, most recently from 8037387 to 387bb41 Compare May 22, 2026 12:59
@veria-ai
Copy link
Copy Markdown

veria-ai Bot commented May 22, 2026

PR overview

The PR has worked down most findings, with 2 issues already addressed and 1 remaining open. The remaining concern is that the new per-issue pre-cymbal rate limiter can create high-cardinality Redis keys from attacker-controlled exception payloads before a team-wide cap is enforced, allowing a sender with a project token to drive unbounded Redis writes by varying stack traces. This should be gated by a bounded team/global admission check or another cap on per-team issue-key creation before writing per-issue buckets.

Open issues (1)

Fixed/addressed: 2 · PR risk: 6/10

Comment thread nodejs/src/ingestion/error-tracking/keyed-rate-limiter-step.ts Outdated
return null
}

const frames = exc.stacktrace?.frames
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for 'extra safety' should we handle invalid frames? something like frames being [null], maybe we can validate the frames are valid objects too, otherwise I think this fails the entire batch?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed


/** Pre-Cymbal stack signature for per-issue rate limit buckets. */
export function preCymbalGroupKey(event: PluginEvent): string | null {
const exc = event.properties?.$exception_list?.[0]
Copy link
Copy Markdown
Contributor

@cat-ph cat-ph May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thinking out loud, would chained exceptions with the first one acting as a sort of "wrapper", swallow more than wanted, and throttle issues that are not part of it? thinking that we go through all exceptions in cymbal

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point. Fixing that

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@ablaszkiewicz ablaszkiewicz marked this pull request as draft May 26, 2026 09:22
@ablaszkiewicz ablaszkiewicz force-pushed the feat/error-tracking-per-issue-rate-limit-v2 branch from 387bb41 to 3aa7234 Compare May 26, 2026 09:23
@assign-reviewers-posthog assign-reviewers-posthog Bot requested a review from a team May 26, 2026 09:23
@ablaszkiewicz ablaszkiewicz reopened this May 26, 2026
@ablaszkiewicz ablaszkiewicz changed the base branch from fix/redis-token-bucket-v3-persist-balance-on-denial to master May 26, 2026 09:30
@ablaszkiewicz ablaszkiewicz force-pushed the feat/error-tracking-per-issue-rate-limit-v2 branch from 3aa7234 to 03cbace Compare May 26, 2026 09:48
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consolidated into existing tests

Adds a per-issue token-bucket rate limiter that fires before Cymbal so a
single hot issue can't dominate the consumer. Keyed by the pre-cymbal
group key (team + exception fingerprint), wired through the shared
KeyedRateLimiterService.

- New step: pre-cymbal-group-key — derives the issue key for rate-limit
  decisions ahead of full Cymbal grouping.
- Refactored keyed-rate-limiter step from per-key cost summation to
  per-input fan-out so partial batches can pass through (some inputs in
  a key group allowed, others dropped) rather than all-or-nothing.
- Reporting mode flag so the limiter can be shadowed in prod before
  enforcement.

Builds on the V3 lua denial-persistence fix so per-issue limits with
sub-2 fillRate (e.g. 100/15min ≈ 0.11/s) can recover instead of wedging.
@ablaszkiewicz ablaszkiewicz force-pushed the feat/error-tracking-per-issue-rate-limit-v2 branch from 03cbace to 20132f6 Compare May 26, 2026 10:11
@ablaszkiewicz ablaszkiewicz marked this pull request as ready for review May 26, 2026 11:40
@ablaszkiewicz ablaszkiewicz requested review from cat-ph and Copilot May 26, 2026 11:40
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 26, 2026

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
nodejs/src/ingestion/error-tracking/error-tracking-consumer.ts:289-308
**Per-issue limiter runs after project-global, so hot issues still burn the project budget**

The two specs are applied in array order — project-global first, per-issue second. When both limits are configured an event from a hot issue that passes the project-global check (consuming a project token) can still be dropped at the per-issue step. Those consumed tokens are then unavailable for other issues, which is exactly the crowd-out problem the per-issue limiter is meant to solve.

Reversing the order so per-issue runs first means a hot issue's overflow is dropped before it can consume any project tokens, and the remaining project budget is fully available to other issues. When only one limit is configured the other step's `getKey` returns `null` for all inputs (no-op), so the swap is transparent in single-limit deployments.

Reviews (2): Last reviewed commit: "feat(error-tracking): add per-issue pre-..." | Re-trigger Greptile

Comment thread nodejs/src/ingestion/error-tracking/pre-cymbal-group-key.ts
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a per-issue (pre-Cymbal) rate limit to the error-tracking ingestion path to prevent a single noisy fingerprint from monopolizing consumer capacity, and reorders limiters so per-issue throttling happens before the team-global cap.

Changes:

  • Extend error-tracking settings loading to include per-issue rate limit value and bucket window.
  • Introduce a new preCymbalGroupKey hashing step and wire a keyed per-issue rate limiter into the consumer ahead of Cymbal grouping.
  • Update/expand consumer tests for both project-wide and per-issue rate limiting; remove an older ET rate-limiter E2E test.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
nodejs/src/utils/error-tracking-settings-manager.ts Loads per-issue rate limit settings from posthog_errortrackingsettings.
nodejs/src/utils/error-tracking-settings-manager.test.ts Updates settings upsert helper + expectations to include per-issue fields.
nodejs/src/ingestion/error-tracking/rate-limiter-e2e.test.ts Removes prior Redis-backed E2E test for sustained traffic rate limiting.
nodejs/src/ingestion/error-tracking/pre-cymbal-group-key.ts Adds pre-Cymbal stable signature hashing for exceptions.
nodejs/src/ingestion/error-tracking/pre-cymbal-group-key.test.ts Adds unit tests covering hashing behavior and edge cases.
nodejs/src/ingestion/error-tracking/load-error-tracking-settings-step.test.ts Updates test typings/fixtures to include per-issue settings.
nodejs/src/ingestion/error-tracking/error-tracking-consumer.ts Adds per-issue keyed limiter before team-global limiter using the pre-Cymbal signature.
nodejs/src/ingestion/error-tracking/error-tracking-consumer.test.ts Adds integration-ish tests verifying project and per-issue limiter behavior with Redis-backed limiter.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread nodejs/src/ingestion/error-tracking/error-tracking-consumer.test.ts Outdated
Comment thread nodejs/src/ingestion/error-tracking/error-tracking-consumer.ts
@marandaneto
Copy link
Copy Markdown
Member

curious why assign-reviewers-posthog[bot] is auto assigning client libs approvers as reviewers quite often

@cat-ph
Copy link
Copy Markdown
Contributor

cat-ph commented May 26, 2026

@marandaneto I think it's probably that one of the base branch changes caused the diff to actually contain a js version change (also noticed that in another PR), maybe there's a way to compare current PR's package.json against master in

DIFF=$(gh api "repos/${REPO}/pulls/${PR_NUMBER}/files" --paginate --jq '.[] | select(.filename | endswith("package.json")) | .patch // empty')
/cc @PostHog/team-devex for thoughts

@marandaneto
Copy link
Copy Markdown
Member

@marandaneto I think it's probably that one of the base branch changes caused the diff to actually contain a js version change (also noticed that in another PR), maybe there's a way to compare current PR's package.json against master in

DIFF=$(gh api "repos/${REPO}/pulls/${PR_NUMBER}/files" --paginate --jq '.[] | select(.filename | endswith("package.json")) | .patch // empty')

/cc @PostHog/team-devex for thoughts

yep i figured, thx
#60076 so its a bit better now

const hash = createHash('sha1')
let hasContent = false

for (const exc of excList) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's probably a very limited edge case here where some SDKs sending captureException("") would bypass it, but IMO not something we need to explicitly handle here, just noting

if (input.errorTrackingSettings?.perIssueRateLimitValue == null) {
return null
}
const sig = preCymbalGroupKey(input.event)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium: High-cardinality rate-limit keys

preCymbalGroupKey(input.event) is derived from public exception payloads, and this per-issue limiter runs before the team-global limiter. A sender with the project token can vary stack frames to create one Redis bucket per synthetic issue, so globally rate-limited traffic still causes unbounded Redis writes; put a bounded team/global admission check before this per-issue bucket, or otherwise cap per-team issue-key creation before writing per-issue keys.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ehhh that's true. I need to think about this

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a thought: one benefit to having the team limiter run first is that we could actually move it further up the stack (using just the parsed headers), and drop events before we parse them (saving some compute cost) - we could then have the separate per group check after we've parsed the events.

Copy link
Copy Markdown
Contributor Author

@ablaszkiewicz ablaszkiewicz May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the fact that this would exhaust team-limit is a deal breaker.

A customer we are mainly designing this for is research gate (amongst top 100 most visited websites on the internet). Their exploding issue (issue which has uncontrollably high amount of exceptions) will basically eat team-limit in seconds probably effectively stopping ingestion of all other issues.

I was thinking about either:

  • hashing the exception hash to some limited space (like let's say even 10k hashes) - dead simple to implement but prone to clashes. We are sure that there is no way for potential attacker to use more than 10k slots in redis,
  • implementing something actually smart based on Redis and keeping track of how many distinct hashes we have recorded for a given team in a given time. If we are over this limit, we return null in issue-limit and then team-limit applies. This should NEVER happen when customer has a proper implementation. This is purely to prevent possible attacks. I see there are quite some options how to implement it - will research it tomorrow.

If you have any thoughts on above - please drop them here ❤️

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think option 2 makes sense there! option 1 can be tricky, and can possibly lead to false positives. starting with 2 (claude says to use redis hll) makes sense, we can also break it up further over time and have heavy traffic buckets vs a tail end bucket if needed.

Copy link
Copy Markdown
Contributor

@z0br0wn z0br0wn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all in all looks good to me - have a thought about ordering the limiters but not a blocker. we should have some per step latency metrics so i'd be keen to see the impact there

@ablaszkiewicz ablaszkiewicz merged commit 3c3d43b into master May 27, 2026
175 checks passed
@ablaszkiewicz ablaszkiewicz deleted the feat/error-tracking-per-issue-rate-limit-v2 branch May 27, 2026 13:15
@deployment-status-posthog
Copy link
Copy Markdown

deployment-status-posthog Bot commented May 27, 2026

Deploy status

Environment Status Deployed At Workflow
dev ✅ Deployed 2026-05-27 13:34 UTC Run
prod-us ✅ Deployed 2026-05-27 13:54 UTC Run
prod-eu ✅ Deployed 2026-05-27 13:58 UTC Run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants