feat: error tracking per issue guard#60325
Conversation
… limiter Replaces the per-issue token-bucket Lua with a guarded variant that counts new bucket-key creations per team per window. When the count crosses a configurable threshold, a per-team fallback flag short- circuits new-key creation until the cooldown elapses, capping the Redis memory an attacker can consume by fuzzing \$exception fields. The team-global limiter downstream still enforces, so attacker events aren't free — they just stop minting new bucket keys. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…iter spec list
Drops the dedicated guarded step, pipeline slot, and consumer plumbing
by having PerIssueGuardedRateLimiterService implement the same
rateLimitGrouped signature as KeyedRateLimiterService. Both now slot
into the existing KeyedRateLimiterStepOptions array via a structural
KeyedRateLimiterLike type — the per-issue spec uses the guarded
service, the team-global spec keeps using the base one.
teamId/sig are parsed out of the rate-limit id format
({teamId}:exceptions:issue:{sig}) on the guarded service; a runtime
assert makes a future rename fail loudly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…d parsing - Move the guarded token-bucket Lua to common/redis next to v2/v3 and register it unconditionally in createRedisV2PoolFromConfig — same shape as the other two scripts. - Add a KeyedRateLimiter interface to keyed-rate-limiter.service.ts; both KeyedRateLimiterService and PerIssueGuardedRateLimiterService now explicitly implement it. The step's rateLimiter field uses the interface directly. - Add optional teamId to KeyedRateLimitRequest. The step always populates it from its existing getTeamId. The guarded service reads it directly and throws a clear error if missing — no more regex-parsing the id string to recover the team. - Rename the local "sig" terminology to "scopeKey" so other limiters can reuse the contract if needed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…own' The flag and TTL describe a temporary state where per-issue limiting is disabled for a team that tripped the new-key threshold. "Cooldown" reads clearer and avoids overloading "fallback" (which already means several other things in this codebase, e.g. fallbackRedisUrl). Touches: Lua keys/args/return, service config + methods + Prom labels, Redis prefix (sigfb -> sigcooldown), env var, consumer plumbing, tests. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…d data Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Grouped Five inputs share one scopeKey against a bucket of 3; verifies the first 3 pass and the last 2 are rate-limited (client-side fan-out). Mirrors the equivalent base-service test. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… limiter Single batch exercises four teams in different states: - team 100: already in cooldown → events pass through, no buckets minted - team 200: trips during this batch on third new scopeKey - team 300: 4 inputs share one scopeKey against bucket=2 → partial pass-through - team 400: clean pass under threshold Asserts per-input rate-limit outcomes, plus the resulting bucket-key existence (or absence) and per-team cooldown state. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR hardens error-tracking per-issue rate limiting against an attack pattern that generates many unique exception keys, by introducing a guarded token-bucket Redis Lua script that caps new bucket key creation per team and temporarily bypasses per-issue limiting (falling back to the team-global limiter) when a team trips the guard.
Changes:
- Added a new guarded Redis Lua script + Node service that tracks per-team “new issue key” creation and enforces a cooldown once a threshold is exceeded.
- Updated the error-tracking ingestion pipeline to use the guarded limiter for per-issue limits while keeping the existing limiter for team-global limits.
- Introduced new config knobs (threshold/window/cooldown TTL) and comprehensive tests for the guarded limiter behavior.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| nodejs/src/servers/error-tracking-server.ts | Wires new per-issue guard config through to the consumer. |
| nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.ts | Implements guarded per-issue limiter (status-aware Lua call + client-side fan-out). |
| nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.test.ts | Adds tests covering allow/limited/trip/cooldown flows, window rotation, and fail-open. |
| nodejs/src/ingestion/error-tracking/keyed-rate-limiter-step.ts | Generalizes the step to accept a KeyedRateLimiter interface and passes teamId into requests. |
| nodejs/src/ingestion/error-tracking/error-tracking-consumer.ts | Instantiates and uses the guarded limiter for per-issue spec; keeps base limiter for team-global. |
| nodejs/src/ingestion/error-tracking/error-tracking-consumer.test.ts | Adds new guard config fields to consumer test setup. |
| nodejs/src/ingestion/error-tracking/config.ts | Adds defaults and typing for the new guard configuration env vars. |
| nodejs/src/common/services/keyed-rate-limiter.service.ts | Adds teamId to requests and introduces the KeyedRateLimiter interface contract. |
| nodejs/src/common/redis/redis-v2.ts | Registers the new guarded Lua command during Redis client initialization. |
| nodejs/src/common/redis/redis-token-bucket-guarded.lua.ts | New guarded Lua script: v3 token bucket + per-team new-key counter + cooldown flag. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
PR overviewAll previously flagged issues have been addressed. No open security concerns remain on this pull request. Security reviewNo open security issues remain on this pull request. Fixed/addressed: 4 · PR risk: 0/10 |
| /** Token-bucket TTL (seconds). Mirrors the existing rate-limiter's bucket TTL. */ | ||
| bucketTtlSeconds: number | ||
| /** When false, throw if the Redis pipeline returns null instead of fail-open allowing all. Default: true. */ | ||
| failOpen?: boolean |
There was a problem hiding this comment.
i'm a bit confused by the layers of fail open, i had claude help break it down for me
The same word failOpen is being used for three policies at three different layers. A reader sees failOpen: true and thinks "OK, we fail open" — but that only covers one of three failure modes. The matrix is:
┌───────────────────────────────────────────────────────────┬─────────────────────────────────────────────────────┬─────────────────────────────────┐
│ Failure mode │ Behavior │ Controlled by │
├───────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────┤
│ Redis connection down → entire pipeline fails │ Allow all (or throw if config.failOpen: false) │ config.failOpen (post-hoc only) │
├───────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────┤
│ Pipeline succeeds, one Lua call errored │ Allow that one event, counted as allowed │ Nothing — hardcoded silent │
├───────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────┤
│ Lua returned an unexpected shape (e.g., wrong tuple size) │ Number(undefined) = NaN, then ?? 'allowed' kicks in │ Nothing — silently swallowed │
└───────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────┴─────────────────────────────────┘
There was a problem hiding this comment.
It mimics behavior of the base keyed-rate-limiter.service.ts. failOpen only guards general Redis connection error. If something is malformed and can't be processed, it's silently dropped and the pipeline moves on.
So right now (we have failOpen turned on) we are not tracking such errors in neither of pipelines. I see that logging them is not really a pattern in this service. Also we will probably never switch the failOpen - I honestly don't know what would be the ideal way to start tracking this. We could have a separate prom metric which tracks errors while failOpen is false with a label reason or something like that but the cardinality might be too high.
There was a problem hiding this comment.
I saw that you don't really log inside a step but here I would log it since we expect it to never fail so if it does, it should be loud. And if it fails, I think it should fall back to being open. IMO this shouldn't be affecting ingestion. We can't drop someones events because something failed on our side. And if we start to log it, we can easily track this in logs and see what exactly is wrong
There was a problem hiding this comment.
that makes sense! i think warn logging makes sense here, and maybe adding an outcome to error_tracking_per_issue_guard_outcome_total that is fail_open (or even broken down just a little further by redis being down or our lua script having an issue) could let us alert on that proactively (because i think today we'll just mark it as allowed, even though it's only allowed by virtue of error).
z0br0wn
left a comment
There was a problem hiding this comment.
all in all lgtm! i think i'd be useful to have a fail open state in the metrics, aside from that any other gaps / needs in observability we can add if needed over time.
There is a possible angle of attack where an attacker starts generating unique exceptions and sends a lot of them. Our per issue rate limiter creates one key per unique exception
This PR introduces a new lua script which is basically a v3 but with a secret sauce. That secret sauce monitors how many keys we are currently holding per team.
That new script apart from
tokensBeforeandtokensAfterreturnsstatusCode. So new script can return:allowed- all good,limited- this request drained the bucket. Consumer can decide what to do with this based on tokens before and tokens after,tripped- this request exhausted number of unique keys limit,cooldown- this team is currently on cooldown after its status getting trippedSome specifics of this change:
keyed-rate-limiter.service.tsand new oneper-issue-guarded-rate-limiter.service.tsalso implements it. Thanks to that we can just plug it to existing pipeline without any changes,KeyedRateLimitRequestnow optionally acceptsteamId. Old service just doesn't use that. NewHow do I know that this works
Big local test
Set per issue rate limit to 10 per 15m. Set number of unique tracked issues to 10 at most. Then started bombarding ingestion with 20 different issues once per second.
In redis only 10 issues are tracked as expected:

In Redis I can see that we have recorded 11 unique issues for this team hence the switch tripped:

In Redis I can see that this teams switch is tripped:

In PH I can see that issues keep flowing as intended (per issue rate limiter got ignored at this point and it is using team's rate limit):
