fix: prevent OOM in embedding backfill with long texts #289
Merged
Nomic v1.5 pads all texts in a batch to the longest sequence length. A batch of 24 texts with one 1383-token text creates a [24, 1383] tensor, causing a ~287 MB allocation failure in onnxruntime.

Three fixes:

- Add truncation: true to the pipeline call, capping individual texts at the model's max length (8192 tokens)
- Reduce BACKFILL_CHUNK_SIZE from 32 to 8, limiting peak tensor size
- Detect OOM errors (opaque numeric codes like '287180544') and return a human-friendly message; upgrade backfill catch blocks to log.error so failures are captured by Sentry
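The padding arithmetic described above can be sketched as follows. This is only an illustration of how one long text sets the width of the whole batch; `paddedShape` and `paddingWaste` are hypothetical helpers, the 50-token length for the short texts is an assumption, and the sketch does not model the actual onnxruntime allocation.

```typescript
// Illustrative only: models the padding arithmetic, not onnxruntime itself.
// paddedShape and paddingWaste are hypothetical names, not part of the PR.

/** Shape of the padded input tensor: [batch size, longest sequence]. */
function paddedShape(tokenCounts: number[]): [number, number] {
  const longest = tokenCounts.reduce((max, n) => Math.max(max, n), 0);
  return [tokenCounts.length, longest];
}

/** Number of token positions in the padded tensor that are pure padding. */
function paddingWaste(tokenCounts: number[]): number {
  const [rows, cols] = paddedShape(tokenCounts);
  const real = tokenCounts.reduce((sum, n) => sum + n, 0);
  return rows * cols - real;
}

// Assumption: 23 short texts of 50 tokens each, plus the one 1383-token text.
const exampleCounts: number[] = [...new Array<number>(23).fill(50), 1383];

console.log(paddedShape(exampleCounts)); // [24, 1383]
console.log(paddingWaste(exampleCounts)); // 30659 of 33192 positions are padding
```

Under these assumed lengths, most of the tensor is padding, which is why a single long text dominates peak memory for the whole batch.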
BYK added a commit that referenced this pull request May 13, 2026
## Summary

Follow-up to PR #289: the fixed batch size of 8 still OOM'd on long distillation observations (4476+ chars, ~1119 tokens). Replaces the fixed `BACKFILL_CHUNK_SIZE` with an adaptive `nextBatch()` that caps total tensor area.

## Root Cause

ONNX runtime pads all texts in a batch to the longest sequence. With `BACKFILL_CHUNK_SIZE=8`, a batch containing one 4476-char distillation observation (~1119 tokens) creates a `[8, 1119]` tensor, which is still too large. The error from PR #289's OOM detection confirmed it: `ONNX runtime out of memory (batch=8, longest≈4476 chars)`.

## Fix

Replace the fixed chunk size with **token-budget batching** via `nextBatch()`:

- Estimates token count per text (~4 chars/token for WordPiece)
- Caps total batch "area" (`batch_size × max_tokens_in_batch`) at `MAX_BATCH_TOKEN_AREA = 4096`
- Still respects `MAX_BACKFILL_CHUNK = 8` for priority queue interleaving

| Input | Batch size | Area | Fits? |
|---|---|---|---|
| 8 × 300-token knowledge entries | 8 | 2400 | yes |
| 1 × 1119-token distillation + 7 short | 1-3 | ≤4096 | yes |
| 1 × 2000-token long distillation | 1 | 2000 | yes (solo) |

Both knowledge and distillation backfill loops updated to use `nextBatch()`.
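A minimal sketch of the token-budget batching described above, assuming a simple greedy scan over pending texts. The constant names and values come from the commit description; the `nextBatch()` signature, the `estimateTokens()` helper, and the rule that a single oversized text is still sent on its own are assumptions, and the real implementation may differ.

```typescript
// Hypothetical sketch: only the constants are taken from the commit description.
const MAX_BATCH_TOKEN_AREA = 4096; // cap on batch_size × max_tokens_in_batch
const MAX_BACKFILL_CHUNK = 8;      // hard cap on texts per batch
const CHARS_PER_TOKEN = 4;         // rough estimate quoted in the commit

/** Rough token estimate from character count (assumed helper). */
function estimateTokens(text: string): number {
  return Math.max(1, Math.ceil(text.length / CHARS_PER_TOKEN));
}

/**
 * Take texts from `pending`, beginning at `start`, until adding one more
 * would push the padded area over budget or the chunk cap is reached.
 * The first text is always accepted so an oversized text cannot stall the loop.
 */
function nextBatch(pending: string[], start = 0): string[] {
  const batch: string[] = [];
  let longest = 0;
  for (const text of pending.slice(start)) {
    if (batch.length >= MAX_BACKFILL_CHUNK) break;
    const candidateLongest = Math.max(longest, estimateTokens(text));
    const area = (batch.length + 1) * candidateLongest;
    if (batch.length > 0 && area > MAX_BATCH_TOKEN_AREA) break;
    batch.push(text);
    longest = candidateLongest;
  }
  return batch;
}

// One 4476-char observation (~1119 tokens) followed by seven short texts.
const pendingTexts = ["x".repeat(4476), ...new Array<string>(7).fill("short text")];
console.log(nextBatch(pendingTexts).length); // 3, since 3 × 1119 = 3357 but 4 × 1119 = 4476
```

A caller would advance `start` by the returned batch length and call `nextBatch()` again until the queue is drained. Because the area depends on the longest text in the batch, sorting or grouping texts by length before batching would pack batches more tightly, at the cost of changing processing order.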
This was referenced May 13, 2026
## Summary

- Add `truncation: true` to the transformers.js pipeline call, capping individual texts at the model's max token length
- Reduce `BACKFILL_CHUNK_SIZE` from 32 to 8, limiting peak tensor size when long texts are batched
- Detect OOM errors (opaque numeric codes like `287180544`) and return human-friendly messages with batch diagnostics
- Upgrade backfill catch blocks from `log.info` to `log.error` so failures are captured by Sentry

## Root Cause
Nomic v1.5 pads ALL texts in a batch to the longest sequence length. A batch of 24 texts where one has 1383 tokens creates a `[24, 1383]` input tensor; the resulting intermediate activations cause a ~287 MB allocation failure in onnxruntime. The error was reported as an opaque numeric code `"287180544"` with no human-readable context.

## Fix Details
| Change | Location | Notes |
|---|---|---|
| `truncation: true` | `embedding-worker.ts:170` | |
| `BACKFILL_CHUNK_SIZE = 8` | `embedding.ts:1097` | `[8, 8192]` vs previous `[32, 8192]` |
| `isOomError()` helper | `embedding-worker.ts:166` | |
| `log.error` in catch blocks | `embedding.ts:1140,1197` | `captureException` (was `log.info` which skips Sentry) |
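The OOM detection could look roughly like the sketch below. The PR names an `isOomError()` helper but its body is not shown here, so the all-digits heuristic is an assumption based on the opaque code `287180544`, and `describeEmbeddingError()` is a hypothetical name. The message format mirrors the one quoted in the follow-up commit.

```typescript
// Hypothetical sketch: the real isOomError() in embedding-worker.ts may use
// different heuristics. describeEmbeddingError is an invented name.

/** Extract a message string from an unknown thrown value. */
function errorMessage(err: unknown): string {
  return err instanceof Error ? err.message : String(err);
}

/**
 * Assumption: an error whose message consists only of digits is the opaque
 * allocation-failure code surfaced by the ONNX runtime.
 */
function isOomError(err: unknown): boolean {
  return /^\d+$/.test(errorMessage(err).trim());
}

/** Turn an opaque OOM code into a readable message with batch diagnostics. */
function describeEmbeddingError(err: unknown, texts: string[]): string {
  if (!isOomError(err)) return errorMessage(err);
  const longest = texts.reduce((max, t) => Math.max(max, t.length), 0);
  return `ONNX runtime out of memory (batch=${texts.length}, longest≈${longest} chars)`;
}

// Example: a batch of 8 texts where the longest has 4476 characters.
const failedBatch = [...new Array<string>(7).fill("short"), "x".repeat(4476)];
console.log(describeEmbeddingError(new Error("287180544"), failedBatch));
// ONNX runtime out of memory (batch=8, longest≈4476 chars)
```

Including the batch size and longest text length in the message is what made the follow-up diagnosis possible, since the raw numeric code carries no context.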