
[pull] main from triggerdotdev:main#165

Merged
pull[bot] merged 6 commits into
Dustin4444:main from
triggerdotdev:main
May 27, 2026


@pull pull Bot commented May 27, 2026


See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

0ski and others added 6 commits May 27, 2026 16:18
…nge integers (#3759)

## Summary

Second class of poisoned-row failure in the runs replication path. PR
#3708 plugged lone UTF-16 surrogates; this one handles bare JSON integer
literals outside ClickHouse's `Int64`..`UInt64` range. Recovery stays
purely reactive — the existing `sanitizeRows` walker just gains an extra
branch, so the hot replication path pays nothing on healthy rows.

Fixes the still-firing customer-facing symptom from
[TRI-9755](https://linear.app/triggerdotdev/issue/TRI-9755):
`scan-social-profiles` runs continued to be stranded in `EXECUTING` on
the Tasks page after #3708 deployed. CloudWatch showed `Dropped batch —
ClickHouse JSON parse error but sanitizer found nothing to fix` firing
**8/8 times** since the previous deploy (zero successful sanitizations).
Root cause: upstream JS Number precision loss on a 21-digit Google Plus
ID (`117039831458782873093` → `117039831458782870000`) — the
precision-lossy value still serialises as a bare integer that exceeds
`UInt64.MAX`, which ClickHouse rejects with `INCORRECT_DATA`.

## How the bug ships

The customer task emits an output containing a Poshmark profile's
`spec_format`:

```json
{"key":"gp_id","proper_key":"Gp Id","value":117039831458782870000,"type":"int"}
```

That value is roughly `1.17e20`: above `UInt64.MAX` (about `1.84e19`) but
still below `1e21`. `Number.prototype.toString` only switches to
exponential form at `|value| >= 1e21`, so `JSON.stringify` emits the
bare token `117039831458782870000` and the ClickHouse
`JSON(max_dynamic_paths)` column fails with:

```
Code: 117. DB::Exception: Cannot parse JSON object here: {…}: (while reading the value of key output): (at row 1)
: While executing ParallelParsingBlockInputFormat. (INCORRECT_DATA) (version 25.12.x)
```

Same error verbatim as prod. The same number quoted
(`"117039831458782870000"`) inserts fine — ClickHouse's dynamic JSON
column accepts a `String` subtype on the same path.
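
A minimal sketch of the serialisation behaviour described above, runnable on its own in Node. It is illustrative only and is not code from the PR; the variable names are made up for the example.

```typescript
// Illustrative only: how a 21-digit ID degrades once it passes through a JS Number.
const original = "117039831458782873093"; // the ID as a decimal string
const asNumber = Number(original); // nearest IEEE-754 double

// JSON.stringify relies on Number.prototype.toString, which keeps
// integer form for magnitudes below 1e21.
const token = JSON.stringify(asNumber);
console.log(token); // 117039831458782870000

// The emitted token is a bare integer literal above UInt64.MAX (2^64 - 1).
const UINT64_MAX = BigInt("18446744073709551615");
console.log(BigInt(token) > UINT64_MAX); // true

// From 1e21 upwards the serialised form switches to exponent notation.
console.log(JSON.stringify(1e21)); // 1e+21

// Quoting the value turns it into a JSON string instead of an integer literal.
const quoted = JSON.stringify({ key: "gp_id", value: String(asNumber) });
console.log(quoted); // {"key":"gp_id","value":"117039831458782870000"}
```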

## What changed

`apps/webapp/app/v3/eventRepository/sanitizeRowsOnParseError.server.ts`:

- New private `isUnsafeJsonInteger(value)` helper — true iff `value` is
a finite integer-valued JS Number where `|value| < 1e21` (so
`JSON.stringify` emits integer form, not exponent) **and** `value` falls
outside `[Int64.MIN, UInt64.MAX]`.
- `sanitizeUnknownInPlace` gains a number branch: when the predicate
holds, it replaces the Number with `String(value)`. The downstream JSON
column dynamically types the path as String for that row, which is
acceptable because the value had already lost precision upstream (JS
Numbers above 2^53 cannot represent every integer exactly).
- Float-valued numbers, large values (>= 1e21), NaN and Infinity are
left alone: `JSON.stringify` emits them in decimal or exponent form, or
as `null`, all of which ClickHouse accepts.
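
The predicate and the number branch described in the list above could look roughly like the sketch below. This is not the code in `sanitizeRowsOnParseError.server.ts`: the walker name `stringifyUnsafeIntegersInPlace` is hypothetical, and comparing the emitted digits via `BigInt` is an assumption about how the range check might be done.

```typescript
// Hypothetical sketch; the real helpers may be written differently.
const INT64_MIN = BigInt("-9223372036854775808");
const UINT64_MAX = BigInt("18446744073709551615");

function isUnsafeJsonInteger(value: unknown): value is number {
  if (typeof value !== "number") return false;
  // NaN, Infinity and fractional values all fail Number.isInteger.
  if (!Number.isInteger(value)) return false;
  // At |value| >= 1e21 JSON.stringify emits exponent form, not a bare integer.
  if (Math.abs(value) >= 1e21) return false;
  // Assumption: compare the digits JSON.stringify would emit against the range.
  const token = BigInt(String(value));
  return token < INT64_MIN || token > UINT64_MAX;
}

// Walks arrays and plain objects, replacing unsafe integers with strings.
// Returns the number of replacements. Assumes acyclic, JSON-shaped data.
function stringifyUnsafeIntegersInPlace(node: unknown): number {
  if (node === null || typeof node !== "object") return 0;
  const container = node as Record<string, unknown>;
  let fixes = 0;
  for (const key of Object.keys(container)) {
    const child = container[key];
    if (isUnsafeJsonInteger(child)) {
      container[key] = String(child); // primitives must be replaced via the parent
      fixes += 1;
    } else {
      fixes += stringifyUnsafeIntegersInPlace(child);
    }
  }
  return fixes;
}

// Usage on a shape similar to the one in the summary.
const variable: { key: string; value: unknown; type: string } = {
  key: "gp_id",
  value: Number("117039831458782870000"),
  type: "int",
};
const row = { output: { data: { profiles: [{ spec_format: [{ platform_variables: [variable] }] }] } } };
const fixes = stringifyUnsafeIntegersInPlace(row);
console.log(fixes); // 1
console.log(variable.value); // 117039831458782870000 (now a string)
```

Returning a fix count lets the caller distinguish "sanitizer found nothing to fix" from a successful repair.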

`apps/webapp/test/sanitizeRowsOnParseError.test.ts`: four new unit tests
+ an extension to `sanitizeRows` covering surrogate + integer fixes
counted together across rows. The unit suite now covers:

- Positive value above `UInt64.MAX` (`117039831458782870000` — the
actual prod value)
- Negative value below `Int64.MIN`
- Boundary values pass through (`42`, `Number.MAX_SAFE_INTEGER`, `2^63`)
- Non-integer numbers untouched (floats, `1e25`, NaN, Infinity)
- The actual `scan-social-profiles` nested shape — finds the offending
`gp_id` deep inside
`output.data.profiles[].spec_format[].platform_variables[].value`

`.server-changes/runs-replication-bigint-recovery.md` — release notes
entry.

## Why reactive, not pre-flight

`#prepareJson` runs millions of times per day on the replication hot
path. Walking every JSON tree to look for oversized integers would add
bounded-but-real CPU on every healthy row. `sanitizeRows` only fires
after a ClickHouse parse-error rejection, which is a few times a day
platform-wide. Extending it costs effectively zero on healthy traffic
and gains us recovery on the rare poisoned row.
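
The reactive shape argued for above can be sketched as follows. Everything here is hypothetical (`insertWithRecovery`, `RecoveryDeps`, and the outcome labels are invented for illustration); it only shows why healthy batches never pay for a tree walk.

```typescript
// Hypothetical sketch of reactive recovery; not the replication service's code.
type Row = Record<string, unknown>;
type Outcome = "inserted" | "sanitized" | "dropped";

interface RecoveryDeps {
  insert: (rows: Row[]) => Promise<void>; // e.g. a ClickHouse batch insert
  isJsonParseError: (err: unknown) => boolean; // recognises the parse rejection
  sanitizeRows: (rows: Row[]) => number; // mutates rows, returns number of fixes
}

async function insertWithRecovery(rows: Row[], deps: RecoveryDeps): Promise<Outcome> {
  try {
    // Healthy path: a single insert, no walk over the JSON trees.
    await deps.insert(rows);
    return "inserted";
  } catch (err) {
    // Anything other than a JSON parse rejection is not ours to repair.
    if (!deps.isJsonParseError(err)) throw err;
  }
  // Only now, after a rejection, pay for the sanitizing walk.
  const fixes = deps.sanitizeRows(rows);
  if (fixes === 0) return "dropped"; // nothing to fix, so a retry would fail again
  await deps.insert(rows);
  return "sanitized";
}
```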

## Verification

- Reproduced 1:1 in a throwaway Docker
`clickhouse/clickhouse-server:25.12.11.4` (closest available to the prod
`25.12.1.1579` build). Pre-sanitize JSON fails with the exact prod
error; post-sanitize JSON inserts cleanly and the row is readable with
`gp_id` stored as a String subtype.
- `pnpm --filter webapp exec vitest run
test/sanitizeRowsOnParseError.test.ts` — 22/22 passing (18 existing + 4
new).
- `pnpm run typecheck --filter webapp` — clean.

## Test plan

- [x] `pnpm run typecheck --filter webapp`
- [x] Unit tests pass against new + existing cases
- [x] End-to-end Docker ClickHouse repro confirms recovery
- [ ] Post-deploy: confirm `Sanitizing batch after ClickHouse JSON parse
error` warnings fire instead of `Dropped batch …` errors when
`scan-social-profiles` outputs trip ClickHouse again
- [ ] Post-deploy: confirm `permanentlyDroppedBatches` counter stops
climbing in
`/stp/trigger-app-prod/ecs/replication/service-container/process-logs`

## What this does NOT do

- Doesn't backfill the ~120k+ existing stranded `EXECUTING` rows in
production. Same as #3708 — that needs a reconciliation/backfill sweep
(separate ticket — TRI-9755 fix #3).
- Doesn't address the upstream root cause (the customer task emitting a
JS-Number-precision-lossy big int). That's a customer-task concern; our
replication path needs to be robust to whatever shape arrives.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…3676)

Retries `TASK_MIDDLEWARE_ERROR` under the task's retry policy.
`shouldRetryError` already classed it as retryable, but
`shouldLookupRetrySettings` did not, so the run fell through to
`fail_run` on attempt 1 instead of using the task's `retry` config.
Fixes #3231.
@pull pull Bot locked and limited conversation to collaborators May 27, 2026
@pull pull Bot added the ⤵️ pull label May 27, 2026
@pull pull Bot merged commit 5083d16 into Dustin4444:main May 27, 2026
0 of 4 checks passed

3 participants