Skip to content

feat: error tracking per issue guard#60325

Merged
ablaszkiewicz merged 16 commits into
masterfrom
feat/error-tracking-per-issue-guard
May 31, 2026
Merged

feat: error tracking per issue guard#60325
ablaszkiewicz merged 16 commits into
masterfrom
feat/error-tracking-per-issue-guard

Conversation

@ablaszkiewicz
Copy link
Copy Markdown
Contributor

@ablaszkiewicz ablaszkiewicz commented May 27, 2026

There is a possible angle of attack where an attacker starts generating unique exceptions and sends a lot of them. Our per issue rate limiter creates one key per unique exception

This PR introduces a new lua script which is basically a v3 but with a secret sauce. That secret sauce monitors how many keys we are currently holding per team.

That new script apart from tokensBefore and tokensAfter returns statusCode. So new script can return:

  • allowed - all good,
  • limited - this request drained the bucket. Consumer can decide what to do with this based on tokens before and tokens after,
  • tripped - this request exhausted number of unique keys limit,
  • cooldown - this team is currently on cooldown after its status getting tripped

Some specifics of this change:

  • introduced an interface for the old keyed-rate-limiter.service.ts and new one per-issue-guarded-rate-limiter.service.ts also implements it. Thanks to that we can just plug it to existing pipeline without any changes,
  • KeyedRateLimitRequest now optionally accepts teamId. Old service just doesn't use that. New
  • whenever team has too many keys, we store this in Redis with 5m TTL. Thanks to that next requests short circuit the script,
  • if per issue rate limiter is currently tripped, we fall back to per-team rate limiter,

How do I know that this works

  • we have great new tests, all of them pass, all previous tests also pass

Big local test

Set per issue rate limit to 10 per 15m. Set number of unique tracked issues to 10 at most. Then started bombarding ingestion with 20 different issues once per second.

In redis only 10 issues are tracked as expected:
image

In Redis I can see that we have recorded 11 unique issues for this team hence the switch tripped:
image

In Redis I can see that this teams switch is tripped:
image

In PH I can see that issues keep flowing as intended (per issue rate limiter got ignored at this point and it is using team's rate limit):
image

ablaszkiewicz and others added 9 commits May 27, 2026 21:17
… limiter

Replaces the per-issue token-bucket Lua with a guarded variant that
counts new bucket-key creations per team per window. When the count
crosses a configurable threshold, a per-team fallback flag short-
circuits new-key creation until the cooldown elapses, capping the
Redis memory an attacker can consume by fuzzing \$exception fields.
The team-global limiter downstream still enforces, so attacker events
aren't free — they just stop minting new bucket keys.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…iter spec list

Drops the dedicated guarded step, pipeline slot, and consumer plumbing
by having PerIssueGuardedRateLimiterService implement the same
rateLimitGrouped signature as KeyedRateLimiterService. Both now slot
into the existing KeyedRateLimiterStepOptions array via a structural
KeyedRateLimiterLike type — the per-issue spec uses the guarded
service, the team-global spec keeps using the base one.

teamId/sig are parsed out of the rate-limit id format
({teamId}:exceptions:issue:{sig}) on the guarded service; a runtime
assert makes a future rename fail loudly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…d parsing

- Move the guarded token-bucket Lua to common/redis next to v2/v3 and
  register it unconditionally in createRedisV2PoolFromConfig — same shape
  as the other two scripts.
- Add a KeyedRateLimiter interface to keyed-rate-limiter.service.ts; both
  KeyedRateLimiterService and PerIssueGuardedRateLimiterService now
  explicitly implement it. The step's rateLimiter field uses the
  interface directly.
- Add optional teamId to KeyedRateLimitRequest. The step always populates
  it from its existing getTeamId. The guarded service reads it directly
  and throws a clear error if missing — no more regex-parsing the id
  string to recover the team.
- Rename the local "sig" terminology to "scopeKey" so other limiters can
  reuse the contract if needed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…own'

The flag and TTL describe a temporary state where per-issue limiting is
disabled for a team that tripped the new-key threshold. "Cooldown" reads
clearer and avoids overloading "fallback" (which already means several
other things in this codebase, e.g. fallbackRedisUrl).

Touches: Lua keys/args/return, service config + methods + Prom labels,
Redis prefix (sigfb -> sigcooldown), env var, consumer plumbing, tests.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…d data

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Grouped

Five inputs share one scopeKey against a bucket of 3; verifies the
first 3 pass and the last 2 are rate-limited (client-side fan-out).
Mirrors the equivalent base-service test.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… limiter

Single batch exercises four teams in different states:
  - team 100: already in cooldown → events pass through, no buckets minted
  - team 200: trips during this batch on third new scopeKey
  - team 300: 4 inputs share one scopeKey against bucket=2 → partial pass-through
  - team 400: clean pass under threshold

Asserts per-input rate-limit outcomes, plus the resulting bucket-key
existence (or absence) and per-team cooldown state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ablaszkiewicz ablaszkiewicz marked this pull request as ready for review May 28, 2026 13:37
@ablaszkiewicz ablaszkiewicz requested review from Copilot and z0br0wn May 28, 2026 13:37
@assign-reviewers-posthog assign-reviewers-posthog Bot requested a review from a team May 28, 2026 13:38
@ablaszkiewicz ablaszkiewicz requested review from a team, cat-ph and hpouillot May 28, 2026 13:39
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens error-tracking per-issue rate limiting against an attack pattern that generates many unique exception keys, by introducing a guarded token-bucket Redis Lua script that caps new bucket key creation per team and temporarily bypasses per-issue limiting (falling back to the team-global limiter) when a team trips the guard.

Changes:

  • Added a new guarded Redis Lua script + Node service that tracks per-team “new issue key” creation and enforces a cooldown once a threshold is exceeded.
  • Updated the error-tracking ingestion pipeline to use the guarded limiter for per-issue limits while keeping the existing limiter for team-global limits.
  • Introduced new config knobs (threshold/window/cooldown TTL) and comprehensive tests for the guarded limiter behavior.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
nodejs/src/servers/error-tracking-server.ts Wires new per-issue guard config through to the consumer.
nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.ts Implements guarded per-issue limiter (status-aware Lua call + client-side fan-out).
nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.test.ts Adds tests covering allow/limited/trip/cooldown flows, window rotation, and fail-open.
nodejs/src/ingestion/error-tracking/keyed-rate-limiter-step.ts Generalizes the step to accept a KeyedRateLimiter interface and passes teamId into requests.
nodejs/src/ingestion/error-tracking/error-tracking-consumer.ts Instantiates and uses the guarded limiter for per-issue spec; keeps base limiter for team-global.
nodejs/src/ingestion/error-tracking/error-tracking-consumer.test.ts Adds new guard config fields to consumer test setup.
nodejs/src/ingestion/error-tracking/config.ts Adds defaults and typing for the new guard configuration env vars.
nodejs/src/common/services/keyed-rate-limiter.service.ts Adds teamId to requests and introduces the KeyedRateLimiter interface contract.
nodejs/src/common/redis/redis-v2.ts Registers the new guarded Lua command during Redis client initialization.
nodejs/src/common/redis/redis-token-bucket-guarded.lua.ts New guarded Lua script: v3 token bucket + per-team new-key counter + cooldown flag.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread nodejs/src/common/services/keyed-rate-limiter.service.ts
Comment thread nodejs/src/common/redis/redis-v2.ts
Comment thread nodejs/src/common/redis/redis-token-bucket-guarded.lua.ts
Comment thread nodejs/src/common/redis/redis-token-bucket-guarded.lua.ts
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 28, 2026

Comments Outside Diff (1)

  1. nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.test.ts, line 371-374 (link)

    P2 Redis pool not closed in afterEach

    The redis pool created in beforeEach is never closed — only hub is torn down. With jest.retryTimes(3) each retry spins up a new pool without closing the previous one, which can exhaust connections and cause flaky hangs in CI. Add await redis.disconnect() (or the equivalent RedisV2 teardown method) to afterEach.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.test.ts
    Line: 371-374
    
    Comment:
    **Redis pool not closed in `afterEach`**
    
    The `redis` pool created in `beforeEach` is never closed — only `hub` is torn down. With `jest.retryTimes(3)` each retry spins up a new pool without closing the previous one, which can exhaust connections and cause flaky hangs in CI. Add `await redis.disconnect()` (or the equivalent `RedisV2` teardown method) to `afterEach`.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.test.ts:371-374
**Redis pool not closed in `afterEach`**

The `redis` pool created in `beforeEach` is never closed — only `hub` is torn down. With `jest.retryTimes(3)` each retry spins up a new pool without closing the previous one, which can exhaust connections and cause flaky hangs in CI. Add `await redis.disconnect()` (or the equivalent `RedisV2` teardown method) to `afterEach`.

### Issue 2 of 2
nodejs/src/common/redis/redis-token-bucket-guarded.lua.ts:10-37
**Multi-key Lua script may break on Redis Cluster**

The existing v2/v3 token-bucket scripts each touch a single key per call. This new script accesses three keys with unrelated name prefixes — `sigcooldown/<teamId>`, `sigcount/<teamId>/<window>`, and `tokens/<teamId>:…` — so they are very likely to hash to different slots. On Redis Cluster this would fail immediately with `CROSSSLOT Keys in request don't hash to the same slot`. If a Redis Cluster is ever used for the rate-limiter instance, all three key names need to share a hash tag (e.g., wrapping the `teamId` in `{}`). Worth adding a comment noting the single-node constraint so the assumption isn't lost.

Reviews (1): Last reviewed commit: "fix: opy" | Re-trigger Greptile

Comment thread nodejs/src/common/redis/redis-token-bucket-guarded.lua.ts
@veria-ai
Copy link
Copy Markdown

veria-ai Bot commented May 28, 2026

PR overview

All previously flagged issues have been addressed. No open security concerns remain on this pull request.

Security review

No open security issues remain on this pull request.

Fixed/addressed: 4 · PR risk: 0/10

Copy link
Copy Markdown
Contributor

@cat-ph cat-ph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!! :shipit:

Comment thread nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.ts Outdated
Comment thread nodejs/src/common/redis/redis-token-bucket-guarded.lua.ts
Comment thread nodejs/src/common/redis/redis-token-bucket-guarded.lua.ts
Comment thread nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.ts Outdated
Comment thread nodejs/src/ingestion/error-tracking/per-issue-guarded-rate-limiter.service.ts Outdated
/** Token-bucket TTL (seconds). Mirrors the existing rate-limiter's bucket TTL. */
bucketTtlSeconds: number
/** When false, throw if the Redis pipeline returns null instead of fail-open allowing all. Default: true. */
failOpen?: boolean
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm a bit confused by the layers of fail open, i had claude help break it down for me

  The same word failOpen is being used for three policies at three different layers. A reader sees failOpen: true and thinks "OK, we fail open" — but that only covers one of three failure modes. The matrix is:                                                            
                                                             
  ┌───────────────────────────────────────────────────────────┬─────────────────────────────────────────────────────┬─────────────────────────────────┐                                                                                                                      
  │                       Failure mode                        │                      Behavior                       │          Controlled by          │
  ├───────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────┤                                                                                                                      
  │ Redis connection down → entire pipeline fails             │ Allow all (or throw if config.failOpen: false)      │ config.failOpen (post-hoc only) │                                                                                                                      
  ├───────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────┤                                                                                                                      
  │ Pipeline succeeds, one Lua call errored                   │ Allow that one event, counted as allowed            │ Nothing — hardcoded silent      │                                                                                                                      
  ├───────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────┼─────────────────────────────────┤                                                                                                                      
  │ Lua returned an unexpected shape (e.g., wrong tuple size) │ Number(undefined) = NaN, then ?? 'allowed' kicks in │ Nothing — silently swallowed    │
  └───────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────┴─────────────────────────────────┘            

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It mimics behavior of the base keyed-rate-limiter.service.ts. failOpen only guards general Redis connection error. If something is malformed and can't be processed, it's silently dropped and the pipeline moves on.

So right now (we have failOpen turned on) we are not tracking such errors in neither of pipelines. I see that logging them is not really a pattern in this service. Also we will probably never switch the failOpen - I honestly don't know what would be the ideal way to start tracking this. We could have a separate prom metric which tracks errors while failOpen is false with a label reason or something like that but the cardinality might be too high.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw that you don't really log inside a step but here I would log it since we expect it to never fail so if it does, it should be loud. And if it fails, I think it should fall back to being open. IMO this shouldn't be affecting ingestion. We can't drop someones events because something failed on our side. And if we start to log it, we can easily track this in logs and see what exactly is wrong

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense! i think warn logging makes sense here, and maybe adding an outcome to error_tracking_per_issue_guard_outcome_total that is fail_open (or even broken down just a little further by redis being down or our lua script having an issue) could let us alert on that proactively (because i think today we'll just mark it as allowed, even though it's only allowed by virtue of error).

Comment thread nodejs/src/common/redis/redis-token-bucket-guarded.lua.ts
Copy link
Copy Markdown
Contributor

@z0br0wn z0br0wn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all in all lgtm! i think i'd be useful to have a fail open state in the metrics, aside from that any other gaps / needs in observability we can add if needed over time.

Comment thread nodejs/src/common/redis/redis-token-bucket-guarded.lua.ts
@ablaszkiewicz ablaszkiewicz merged commit a116997 into master May 31, 2026
160 of 162 checks passed
@ablaszkiewicz ablaszkiewicz deleted the feat/error-tracking-per-issue-guard branch May 31, 2026 09:56
@deployment-status-posthog
Copy link
Copy Markdown

deployment-status-posthog Bot commented May 31, 2026

Deploy status

Environment Status Deployed At Workflow
dev ✅ Deployed 2026-05-31 10:18 UTC Run
prod-us ✅ Deployed 2026-05-31 10:30 UTC Run
prod-eu ✅ Deployed 2026-05-31 10:31 UTC Run

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants