feat: error tracking per issue guard by ablaszkiewicz · Pull Request #60325 · PostHog/posthog

ablaszkiewicz · 2026-05-27T19:25:54Z

There is a possible angle of attack where an attacker starts generating unique exceptions and sends a lot of them. Our per issue rate limiter creates one key per unique exception

This PR introduces a new lua script which is basically a v3 but with a secret sauce. That secret sauce monitors how many keys we are currently holding per team.

That new script apart from tokensBefore and tokensAfter returns statusCode. So new script can return:

allowed - all good,
limited - this request drained the bucket. Consumer can decide what to do with this based on tokens before and tokens after,
tripped - this request exhausted number of unique keys limit,
cooldown - this team is currently on cooldown after its status getting tripped

Some specifics of this change:

introduced an interface for the old keyed-rate-limiter.service.ts and new one per-issue-guarded-rate-limiter.service.ts also implements it. Thanks to that we can just plug it to existing pipeline without any changes,
KeyedRateLimitRequest now optionally accepts teamId. Old service just doesn't use that. New
whenever team has too many keys, we store this in Redis with 5m TTL. Thanks to that next requests short circuit the script,
if per issue rate limiter is currently tripped, we fall back to per-team rate limiter,

How do I know that this works

we have great new tests, all of them pass, all previous tests also pass

Big local test

Set per issue rate limit to 10 per 15m. Set number of unique tracked issues to 10 at most. Then started bombarding ingestion with 20 different issues once per second.

In redis only 10 issues are tracked as expected:

In Redis I can see that we have recorded 11 unique issues for this team hence the switch tripped:

In Redis I can see that this teams switch is tripped:

In PH I can see that issues keep flowing as intended (per issue rate limiter got ignored at this point and it is using team's rate limit):

… limiter Replaces the per-issue token-bucket Lua with a guarded variant that counts new bucket-key creations per team per window. When the count crosses a configurable threshold, a per-team fallback flag short- circuits new-key creation until the cooldown elapses, capping the Redis memory an attacker can consume by fuzzing \$exception fields. The team-global limiter downstream still enforces, so attacker events aren't free — they just stop minting new bucket keys. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…iter spec list Drops the dedicated guarded step, pipeline slot, and consumer plumbing by having PerIssueGuardedRateLimiterService implement the same rateLimitGrouped signature as KeyedRateLimiterService. Both now slot into the existing KeyedRateLimiterStepOptions array via a structural KeyedRateLimiterLike type — the per-issue spec uses the guarded service, the team-global spec keeps using the base one. teamId/sig are parsed out of the rate-limit id format ({teamId}:exceptions:issue:{sig}) on the guarded service; a runtime assert makes a future rename fail loudly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…d parsing - Move the guarded token-bucket Lua to common/redis next to v2/v3 and register it unconditionally in createRedisV2PoolFromConfig — same shape as the other two scripts. - Add a KeyedRateLimiter interface to keyed-rate-limiter.service.ts; both KeyedRateLimiterService and PerIssueGuardedRateLimiterService now explicitly implement it. The step's rateLimiter field uses the interface directly. - Add optional teamId to KeyedRateLimitRequest. The step always populates it from its existing getTeamId. The guarded service reads it directly and throws a clear error if missing — no more regex-parsing the id string to recover the team. - Rename the local "sig" terminology to "scopeKey" so other limiters can reuse the contract if needed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…own' The flag and TTL describe a temporary state where per-issue limiting is disabled for a team that tripped the new-key threshold. "Cooldown" reads clearer and avoids overloading "fallback" (which already means several other things in this codebase, e.g. fallbackRedisUrl). Touches: Lua keys/args/return, service config + methods + Prom labels, Redis prefix (sigfb -> sigcooldown), env var, consumer plumbing, tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…d data Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…Grouped Five inputs share one scopeKey against a bucket of 3; verifies the first 3 pass and the last 2 are rate-limited (client-side fan-out). Mirrors the equivalent base-service test. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… limiter Single batch exercises four teams in different states: - team 100: already in cooldown → events pass through, no buckets minted - team 200: trips during this batch on third new scopeKey - team 300: 4 inputs share one scopeKey against bucket=2 → partial pass-through - team 400: clean pass under threshold Asserts per-input rate-limit outcomes, plus the resulting bucket-key existence (or absence) and per-team cooldown state. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

This PR hardens error-tracking per-issue rate limiting against an attack pattern that generates many unique exception keys, by introducing a guarded token-bucket Redis Lua script that caps new bucket key creation per team and temporarily bypasses per-issue limiting (falling back to the team-global limiter) when a team trips the guard.

Changes:

Added a new guarded Redis Lua script + Node service that tracks per-team “new issue key” creation and enforces a cooldown once a threshold is exceeded.
Updated the error-tracking ingestion pipeline to use the guarded limiter for per-issue limits while keeping the existing limiter for team-global limits.
Introduced new config knobs (threshold/window/cooldown TTL) and comprehensive tests for the guarded limiter behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
nodejs/src/servers/error-tracking-server.ts	Wires new per-issue guard config through to the consumer.
nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.ts	Implements guarded per-issue limiter (status-aware Lua call + client-side fan-out).
nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.test.ts	Adds tests covering allow/limited/trip/cooldown flows, window rotation, and fail-open.
nodejs/src/ingestion/error-tracking/keyed-rate-limiter-step.ts	Generalizes the step to accept a `KeyedRateLimiter` interface and passes `teamId` into requests.
nodejs/src/ingestion/error-tracking/error-tracking-consumer.ts	Instantiates and uses the guarded limiter for per-issue spec; keeps base limiter for team-global.
nodejs/src/ingestion/error-tracking/error-tracking-consumer.test.ts	Adds new guard config fields to consumer test setup.
nodejs/src/ingestion/error-tracking/config.ts	Adds defaults and typing for the new guard configuration env vars.
nodejs/src/common/services/keyed-rate-limiter.service.ts	Adds `teamId` to requests and introduces the `KeyedRateLimiter` interface contract.
nodejs/src/common/redis/redis-v2.ts	Registers the new guarded Lua command during Redis client initialization.
nodejs/src/common/redis/redis-token-bucket-guarded.lua.ts	New guarded Lua script: v3 token bucket + per-team new-key counter + cooldown flag.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

greptile-apps · 2026-05-28T13:45:41Z

Comments Outside Diff (1)

nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.test.ts, line 371-374 (link)

Redis pool not closed in afterEach

The redis pool created in beforeEach is never closed — only hub is torn down. With jest.retryTimes(3) each retry spins up a new pool without closing the previous one, which can exhaust connections and cause flaky hangs in CI. Add await redis.disconnect() (or the equivalent RedisV2 teardown method) to afterEach.

Prompt To Fix With AI

This is a comment left during a code review.
Path: nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.test.ts
Line: 371-374

Comment:
**Redis pool not closed in `afterEach`**

The `redis` pool created in `beforeEach` is never closed — only `hub` is torn down. With `jest.retryTimes(3)` each retry spins up a new pool without closing the previous one, which can exhaust connections and cause flaky hangs in CI. Add `await redis.disconnect()` (or the equivalent `RedisV2` teardown method) to `afterEach`.

How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.test.ts:371-374
**Redis pool not closed in `afterEach`**

The `redis` pool created in `beforeEach` is never closed — only `hub` is torn down. With `jest.retryTimes(3)` each retry spins up a new pool without closing the previous one, which can exhaust connections and cause flaky hangs in CI. Add `await redis.disconnect()` (or the equivalent `RedisV2` teardown method) to `afterEach`.

### Issue 2 of 2
nodejs/src/common/redis/redis-token-bucket-guarded.lua.ts:10-37
**Multi-key Lua script may break on Redis Cluster**

The existing v2/v3 token-bucket scripts each touch a single key per call. This new script accesses three keys with unrelated name prefixes — `sigcooldown/<teamId>`, `sigcount/<teamId>/<window>`, and `tokens/<teamId>:…` — so they are very likely to hash to different slots. On Redis Cluster this would fail immediately with `CROSSSLOT Keys in request don't hash to the same slot`. If a Redis Cluster is ever used for the rate-limiter instance, all three key names need to share a hash tag (e.g., wrapping the `teamId` in `{}`). Worth adding a comment noting the single-node constraint so the assumption isn't lost.

_{Reviews (1): Last reviewed commit: "fix: opy" | Re-trigger Greptile}

veria-ai · 2026-05-28T13:49:36Z

PR overview

All previously flagged issues have been addressed. No open security concerns remain on this pull request.

Security review

No open security issues remain on this pull request.

Fixed/addressed: 4 · PR risk: 0/10

cat-ph

nice!!

z0br0wn · 2026-05-28T20:22:42Z

+    /** Token-bucket TTL (seconds). Mirrors the existing rate-limiter's bucket TTL. */
+    bucketTtlSeconds: number
+    /** When false, throw if the Redis pipeline returns null instead of fail-open allowing all. Default: true. */
+    failOpen?: boolean


i'm a bit confused by the layers of fail open, i had claude help break it down for me

The same word failOpen is being used for three policies at three different layers. A reader sees failOpen: true and thinks "OK, we fail open" — but that only covers one of three failure modes. The matrix is: ┌───────────────────────────────────────────────────────────┬─────────────────────────────────────────────────────┬─────────────────────────────────┐ │ Failure mode │ Behavior │ Controlled by │ ├───────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────┤ │ Redis connection down → entire pipeline fails │ Allow all (or throw if config.failOpen: false) │ config.failOpen (post-hoc only) │ ├───────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────┤ │ Pipeline succeeds, one Lua call errored │ Allow that one event, counted as allowed │ Nothing — hardcoded silent │ ├───────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────┤ │ Lua returned an unexpected shape (e.g., wrong tuple size) │ Number(undefined) = NaN, then ?? 'allowed' kicks in │ Nothing — silently swallowed │ └───────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────┴─────────────────────────────────┘

It mimics behavior of the base keyed-rate-limiter.service.ts. failOpen only guards general Redis connection error. If something is malformed and can't be processed, it's silently dropped and the pipeline moves on.

So right now (we have failOpen turned on) we are not tracking such errors in neither of pipelines. I see that logging them is not really a pattern in this service. Also we will probably never switch the failOpen - I honestly don't know what would be the ideal way to start tracking this. We could have a separate prom metric which tracks errors while failOpen is false with a label reason or something like that but the cardinality might be too high.

I saw that you don't really log inside a step but here I would log it since we expect it to never fail so if it does, it should be loud. And if it fails, I think it should fall back to being open. IMO this shouldn't be affecting ingestion. We can't drop someones events because something failed on our side. And if we start to log it, we can easily track this in logs and see what exactly is wrong

that makes sense! i think warn logging makes sense here, and maybe adding an outcome to error_tracking_per_issue_guard_outcome_total that is fail_open (or even broken down just a little further by redis being down or our lua script having an issue) could let us alert on that proactively (because i think today we'll just mark it as allowed, even though it's only allowed by virtue of error).

z0br0wn

all in all lgtm! i think i'd be useful to have a fail open state in the metrics, aside from that any other gaps / needs in observability we can add if needed over time.

deployment-status-posthog · 2026-05-31T10:18:01Z

Deploy status

Environment	Status	Deployed At	Workflow
dev	✅ Deployed	2026-05-31 10:18 UTC	Run
prod-us	✅ Deployed	2026-05-31 10:30 UTC	Run
prod-eu	✅ Deployed	2026-05-31 10:31 UTC	Run

ablaszkiewicz and others added 9 commits May 27, 2026 21:17

fix: unify

bb2e227

chore(error-tracking): tighten comments added by the guard work

d6e284c

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

chore(error-tracking): finish sig -> scopeKey rename in test names an…

8c7d738

…d data Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ablaszkiewicz marked this pull request as ready for review May 28, 2026 13:37

ablaszkiewicz requested review from Copilot and z0br0wn May 28, 2026 13:37

Copilot started reviewing on behalf of ablaszkiewicz May 28, 2026 13:38 View session

assign-reviewers-posthog Bot requested a review from a team May 28, 2026 13:38

ablaszkiewicz requested review from a team, cat-ph and hpouillot May 28, 2026 13:39

fix: opy

2fd7ec3

Copilot AI reviewed May 28, 2026

View reviewed changes

Comment thread nodejs/src/common/services/keyed-rate-limiter.service.ts

Comment thread nodejs/src/common/redis/redis-v2.ts

Comment thread nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.ts

Comment thread nodejs/src/common/redis/redis-token-bucket-guarded.lua.ts

graphite-app Bot reviewed May 28, 2026

View reviewed changes

Comment thread nodejs/src/common/redis/redis-token-bucket-guarded.lua.ts

greptile-apps Bot reviewed May 28, 2026

View reviewed changes

Comment thread nodejs/src/common/redis/redis-token-bucket-guarded.lua.ts

veria-ai Bot reviewed May 28, 2026

View reviewed changes

Comment thread nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.ts

cat-ph approved these changes May 28, 2026

View reviewed changes

fix: type safety

ebafc73

veria-ai Bot reviewed May 28, 2026

View reviewed changes

Comment thread nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.ts

fix: one pass

d1d0f36