fix: prevent OOM in embedding backfill with long texts #289

Merged
BYK merged 1 commit into main from fix/embedding-oom on May 13, 2026

BYK (Owner) commented on May 13, 2026

Summary

  • Add truncation: true to the transformers.js pipeline call, capping individual texts at the model's max token length
  • Reduce BACKFILL_CHUNK_SIZE from 32 to 8, limiting peak tensor size when long texts are batched
  • Detect OOM errors (opaque ONNX numeric codes like 287180544) and return human-friendly messages with batch diagnostics
  • Upgrade backfill error logging from log.info to log.error so failures are captured by Sentry

Root Cause

Nomic v1.5 pads ALL texts in a batch to the longest sequence length. A batch of 24 texts where one has 1383 tokens creates a [24, 1383] input tensor — the resulting intermediate activations cause a ~287 MB allocation failure in onnxruntime. The error was reported as an opaque numeric code "287180544" with no human-readable context.
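
To make the padding effect concrete, here is a minimal sketch of how the input tensor shape follows from the longest text in a batch. The helper names `paddedShape` and `paddedElements` are hypothetical, and the 50-token length used for the 23 short texts is an assumption chosen only for illustration; the sketch does not model the intermediate activations that actually trigger the allocation failure.

```typescript
// Hypothetical helpers, not part of the PR: they only illustrate how
// padding to the longest sequence determines the input tensor shape.
function paddedShape(tokenCounts: number[]): [number, number] {
  // Every row is padded to the length of the longest text in the batch.
  const longest = tokenCounts.reduce((max, n) => Math.max(max, n), 0);
  return [tokenCounts.length, longest];
}

function paddedElements(tokenCounts: number[]): number {
  const [rows, cols] = paddedShape(tokenCounts);
  return rows * cols;
}

// 23 short texts (assumed 50 tokens each) plus one 1383-token text.
const exampleCounts: number[] = [...new Array<number>(23).fill(50), 1383];
const realTokens = exampleCounts.reduce((sum, n) => sum + n, 0);

console.log(paddedShape(exampleCounts)); // [24, 1383]
console.log(paddedElements(exampleCounts), "padded vs", realTokens, "real tokens");
```

Under these assumptions, one long text makes the padded input about 13 times larger than the real token count, which is why the batch size and the longest sequence have to be bounded together.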

Fix Details

| Change | File | Effect |
|---|---|---|
| `truncation: true` | `embedding-worker.ts:170` | Caps each text at the model max (8192 tokens), preventing any single oversized input |
| `BACKFILL_CHUNK_SIZE = 8` | `embedding.ts:1097` | Limits the batch×sequence tensor size; worst case `[8, 8192]` vs. previous `[32, 8192]` |
| `isOomError()` helper | `embedding-worker.ts:166` | Detects numeric-only error codes (≥6 digits) and OOM patterns, and wraps them in a descriptive message |
| `log.error` in catch blocks | `embedding.ts:1140,1197` | Sends to Sentry via `captureException` (was `log.info`, which skips Sentry) |
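
As a rough illustration of the detection rule, the sketch below treats a message as OOM when it is a numeric-only code of at least six digits or matches a textual OOM pattern. This is not the code from `embedding-worker.ts`: the specific text patterns and the `describeOom` helper are assumptions, and the message format is modelled on the diagnostic quoted in the follow-up commit.

```typescript
// Hypothetical sketch; the real isOomError() helper may differ.
function isOomError(message: string): boolean {
  const trimmed = message.trim();
  // Opaque ONNX failures surface as bare numbers such as "287180544".
  if (/^\d{6,}$/.test(trimmed)) return true;
  // Assumed textual patterns for allocation failures.
  return /out of memory|bad_alloc|failed to allocate/i.test(trimmed);
}

// Assumed helper: wraps the failure in a message with batch diagnostics.
function describeOom(batchSize: number, longestChars: number): string {
  return `ONNX runtime out of memory (batch=${batchSize}, longest≈${longestChars} chars)`;
}

const raw = "287180544";
if (isOomError(raw)) {
  console.error(describeOom(8, 4476));
}
```

The six-digit threshold keeps short numeric strings, such as ordinary status codes, from being misclassified as allocation failures.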

Nomic v1.5 pads all texts in a batch to the longest sequence length.
A batch of 24 texts with one 1383-token text creates a [24, 1383]
tensor, causing a ~287 MB allocation failure in onnxruntime.

Three fixes:
- Add truncation: true to pipeline call, capping individual texts at
  the model's max length (8192 tokens)
- Reduce BACKFILL_CHUNK_SIZE from 32 to 8, limiting peak tensor size
- Detect OOM errors (opaque numeric codes like '287180544') and return
  a human-friendly message; upgrade backfill catch blocks to log.error
  so failures are captured by Sentry

BYK merged commit fee878b into main on May 13, 2026
7 checks passed
BYK deleted the fix/embedding-oom branch on May 13, 2026, 15:45
BYK added a commit that referenced this pull request on May 13, 2026
## Summary

Follow-up to PR #289 — fixed batch size of 8 still OOM'd on long
distillation observations (4476+ chars, ~1119 tokens). Replaces fixed
`BACKFILL_CHUNK_SIZE` with adaptive `nextBatch()` that caps total tensor
area.

## Root Cause

ONNX runtime pads all texts in a batch to the longest sequence. With
`BACKFILL_CHUNK_SIZE=8`, a batch containing one 4476-char distillation
observation (~1119 tokens) creates a `[8, 1119]` tensor — still too
large. The error from PR #289's OOM detection confirmed it: `ONNX
runtime out of memory (batch=8, longest≈4476 chars)`.

## Fix

Replace fixed chunk size with **token-budget batching** via
`nextBatch()`:
- Estimates token count per text (~4 chars/token for WordPiece)
- Caps total batch "area" (`batch_size × max_tokens_in_batch`) at
`MAX_BATCH_TOKEN_AREA = 4096`
- Still respects `MAX_BACKFILL_CHUNK = 8` for priority queue
interleaving

| Input | Batch size | Area | Fits? |
|---|---|---|---|
| 8 × 300-token knowledge entries | 8 | 2400 | yes |
| 1 × 1119-token distillation + 7 short | 1-3 | ≤4096 | yes |
| 1 × 2000-token long distillation | 1 | 2000 | yes (solo) |

Both knowledge and distillation backfill loops updated to use
`nextBatch()`.
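
The token-budget rule described above can be sketched as follows. This is a minimal illustration, not the implementation in `embedding.ts`: the signature of `nextBatch()`, the `estimateTokens` helper, and the choice to always accept the first text even when it alone exceeds the budget are assumptions; only the two constants and the ~4 chars/token estimate come from the description.

```typescript
const MAX_BATCH_TOKEN_AREA = 4096; // cap on batch_size × max_tokens_in_batch
const MAX_BACKFILL_CHUNK = 8;      // hard cap on texts per batch
const CHARS_PER_TOKEN = 4;         // rough estimate from the description

// Assumed helper: estimates the token count from the character length.
function estimateTokens(text: string): number {
  return Math.max(1, Math.ceil(text.length / CHARS_PER_TOKEN));
}

// Assumed signature: returns the next batch starting at index `start`.
function nextBatch(texts: string[], start: number): string[] {
  const batch: string[] = [];
  let longest = 0;
  for (const text of texts.slice(start)) {
    if (batch.length >= MAX_BACKFILL_CHUNK) break;
    const candidateLongest = Math.max(longest, estimateTokens(text));
    const area = (batch.length + 1) * candidateLongest;
    // Always take at least one text so an oversized input cannot stall the loop.
    if (batch.length > 0 && area > MAX_BATCH_TOKEN_AREA) break;
    batch.push(text);
    longest = candidateLongest;
  }
  return batch;
}

// One 4476-char observation followed by seven short texts.
const queue = ["x".repeat(4476), ...new Array<string>(7).fill("y".repeat(400))];
console.log(nextBatch(queue, 0).length); // 3, since 4 × 1119 would exceed 4096
```

A caller would advance `start` by the length of each returned batch until the queue is exhausted, so long texts shrink the batch while short texts still fill it up to `MAX_BACKFILL_CHUNK`.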
This was referenced May 13, 2026