Conversation
`classifyBatch()` previously allocated a single tensor for all sentences, causing O(N × seqLen²) native memory in ONNX attention layers. For large payloads (e.g. 100-item list responses), this could reach several GB and crash Lambda environments. It now processes at most 32 sentences per chunk, capping native memory at ~50MB per inference call regardless of input size.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
This PR updates the ONNX-based prompt-injection classifier to avoid Lambda OOMs by bounding native (non-V8) memory usage during batch inference, especially when classifying hundreds of sentences extracted from large tool/list responses.
Changes:
- Split `OnnxClassifier.classifyBatch()` into fixed-size chunks (max 32 texts per `session.run()`).
- Added `classifyBatchChunk()` to run a single bounded ONNX inference call and concatenate results.
- Added a test that classifies 40 texts to exercise cross-chunk behavior.
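For illustration, the chunked driver loop might look roughly like the sketch below. In the real code these are methods on `OnnxClassifier`; the free-function form and the numeric per-text result type are assumptions made only to keep the sketch self-contained.

```ts
// Sketch only: the per-text result type (a numeric score) is an assumption.
const MAX_BATCH_CHUNK = 32;

// Placeholder for the bounded single-chunk inference described in the PR.
declare function classifyBatchChunk(texts: string[]): Promise<number[]>;

async function classifyBatch(texts: string[]): Promise<number[]> {
  const results: number[] = [];
  // One bounded session.run() per chunk keeps native attention buffers capped,
  // no matter how many sentences the caller passes in.
  for (let i = 0; i < texts.length; i += MAX_BATCH_CHUNK) {
    const chunk = texts.slice(i, i + MAX_BATCH_CHUNK);
    results.push(...(await classifyBatchChunk(chunk))); // preserves input order
  }
  return results;
}
```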
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/classifiers/onnx-classifier.ts` | Implements chunked batch inference to cap ONNX attention-matrix memory usage. |
| `specs/onnx-classifier.spec.ts` | Adds a larger batch test to validate correctness across multiple chunks. |
The reviewed hunk in `specs/onnx-classifier.spec.ts`:

```ts
});

it('should handle batches larger than chunk size', async () => {
  // arrange — 40 texts forces multiple chunks (MAX_BATCH_CHUNK = 32)
```
The test comment hard-codes MAX_BATCH_CHUNK = 32, but that constant is private to OnnxClassifier and could change later; the comment would then become inaccurate even if the test still passes. Consider making the comment value-agnostic (e.g., “40 texts exceeds the default chunk size”) or deriving the threshold from the implementation if you want to guarantee multi-chunk behavior.
Suggested change:

```diff
-  // arrange — 40 texts forces multiple chunks (MAX_BATCH_CHUNK = 32)
+  // arrange — use enough texts to exceed the default chunk size
```
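If the goal is to guarantee multi-chunk behavior rather than just avoid a stale comment, one option is to expose the chunk size and derive the test input from it. The fragments below are only an illustrative sketch: `MAX_BATCH_CHUNK` is currently private, and the classifier stub, result type, and spec setup are assumptions.

```ts
// src/classifiers/onnx-classifier.ts (illustrative fragment): promote the
// private literal to a readable static constant.
export class OnnxClassifier {
  static readonly MAX_BATCH_CHUNK = 32;

  async classifyBatch(texts: string[]): Promise<number[]> {
    // ...chunked implementation from this PR...
    return texts.map(() => 0);
  }
}

// specs/onnx-classifier.spec.ts (illustrative fragment): derive the batch size
// instead of hard-coding it in a comment.
const classifier = new OnnxClassifier();

it('should handle batches larger than chunk size', async () => {
  // arrange — use enough texts to exceed the default chunk size
  const texts = Array.from(
    { length: OnnxClassifier.MAX_BATCH_CHUNK + 8 },
    (_, i) => `harmless sentence number ${i}`,
  );

  // act
  const results = await classifier.classifyBatch(texts);

  // assert: one result per input
  expect(results).toHaveLength(texts.length);
});
```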
Summary

`classifyBatch()` now processes sentences in chunks of 32 instead of all at once.

Problem

When a list response (e.g. `ats_list_notes` with 100 items) is passed to `defendToolResult()`, `extractStrings()` collects all text and `classifyBySentence()` splits it into hundreds of sentences. `classifyBatch()` then allocated a single ONNX tensor for all sentences; the attention matrices scale as O(batch × seqLen²) in native memory (outside the V8 heap), reaching several GB for large batches and crashing Lambda.
Fix

Split `classifyBatch()` into chunks of at most 32 sentences. Each chunk runs a separate `session.run()` call with bounded memory (~50MB). Results are concatenated.
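A single bounded chunk inference might look roughly like the sketch below, assuming the classifier runs on onnxruntime-node. The feed and output names (`input_ids`, `attention_mask`, `logits`), the two-class softmax, and the `tokenize()` helper are assumptions about the model, not details taken from the PR.

```ts
import { InferenceSession, Tensor } from 'onnxruntime-node';

// Sketch of a bounded per-chunk inference; in the real code the session is
// presumably held on the OnnxClassifier instance.
async function classifyBatchChunk(
  session: InferenceSession,
  texts: string[],
): Promise<number[]> {
  // tokenize() is assumed to pad/truncate the chunk to a shared seqLen.
  const { inputIds, attentionMask, seqLen } = tokenize(texts);

  const feeds = {
    input_ids: new Tensor('int64', inputIds, [texts.length, seqLen]),
    attention_mask: new Tensor('int64', attentionMask, [texts.length, seqLen]),
  };

  // Native attention buffers grow with texts.length × seqLen², so keeping
  // texts.length ≤ 32 bounds the memory used by this call.
  const outputs = await session.run(feeds);
  const logits = outputs.logits.data as Float32Array;

  // Assume two logits per text (benign vs. injection); return the injection score.
  const scores: number[] = [];
  for (let i = 0; i < texts.length; i++) {
    const benign = logits[i * 2];
    const injection = logits[i * 2 + 1];
    const max = Math.max(benign, injection);
    const expB = Math.exp(benign - max);
    const expI = Math.exp(injection - max);
    scores.push(expI / (expB + expI));
  }
  return scores;
}

// Hypothetical tokenizer signature used only to keep the sketch self-contained.
declare function tokenize(texts: string[]): {
  inputIds: BigInt64Array;
  attentionMask: BigInt64Array;
  seqLen: number;
};
```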
Changes

- `src/classifiers/onnx-classifier.ts`: `classifyBatch()` now loops in chunks via `classifyBatchChunk()`; added `MAX_BATCH_CHUNK = 32` constant
- `specs/onnx-classifier.spec.ts`: added test with 40 texts to verify cross-chunk correctness

Test plan
- `classifyBatch` test passes (3 items, single chunk)

🤖 Generated with Claude Code
Summary by cubic
Chunked batch classification to bound ONNX native memory and prevent Lambda OOM on large payloads. Addresses ENG-12604 by processing texts in chunks of 32 per inference without changing outputs.
Written for commit 86e4a60. Summary will update on new commits.