Conversation
Defender wraps sanitized outputs with [UD-<id>]...[/UD-<id>] markers. When those outputs are fed back as input — nested tool calls, cached responses, multi-hop agent traces, or attacker-spoofed tags — the tokenizer counts the tag tokens as part of the classified sentence and the v4 ONNX model treats that structure as injection-adjacent. Measured on a benign Jira payload pre-wrapped by upstream defender: score went 0.008 (stripped) → 0.99 (with tags), flipping a clean pass into a high-risk block. The stripBoundaryPatterns utility (src/utils/boundary.ts:68) was written for exactly this — docstring: "Boundary tags like [UD-xyz] ...[/UD-xyz] corrupt per-sentence model scores because the tokenizer treats the tag text as part of the sentence" — but had zero callers. Wires it into the three Tier 2 entry points (classify, classifyByChunks, prepareChunks) before the length check and tokenization. Also mitigates the spoofed-boundary attack PR #49 was targeting: attacker-injected [UD-*] tags in input get stripped before the classifier sees them, so they can't be used to mask injection content as "already-trusted" data. Behavior change: Tier 2 scores on payloads containing UD/XML boundary markers now match scores on the same payloads with markers manually stripped (added spec asserts bit-identical scoring). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR prevents Tier 2 prompt-injection scoring from being distorted by Defender boundary wrappers (e.g., [UD-*]...[/UD-*] and <user-data-*>...</user-data-*>) by stripping those markers from inputs before length checks and tokenization, closing a self-feedback false-positive loop and reducing spoofing surface.
Changes:
- Wire
stripBoundaryPatterns()into Tier2Classifier entry points (classify,classifyByChunks,prepareChunks) prior to sizing/tokenization. - Add a spec asserting wrapped vs unwrapped inputs produce identical Tier 2 scores after stripping.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/classifiers/tier2-classifier.ts | Strips boundary markers at Tier 2 entry points to avoid tokenizer/model score distortion. |
| specs/tier2-classifier.spec.ts | Adds regression test for identical scoring between boundary-wrapped and bare input. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Four review items: 1. classifyBySentence was missing the strip — added. Now all four public Tier 2 entry points (classify, classifyBySentence, classifyByChunks, prepareChunks) share the same pre-strip behavior. 2. stripBoundaryPatterns utility previously called .trim() as a side effect, which changed semantics for every Tier 2 input regardless of whether it contained boundary markers. Removed the trim from the utility; callers who need it should call .trim() themselves. None of the current call sites need it (the strip itself is sufficient). 3. New spec was missing it.skipIf(!!process.env.CI) — other model- dependent tests in the file use it to avoid CI flakiness on ONNX loading. Matched the convention. 4. Strict toBe equality on float model output is brittle across runtime/hardware. Switched to toBeCloseTo at 10-decimal precision — still asserts the scores match (the inputs are bit-identical after stripping) while tolerating any ONNX runtime non-determinism. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OMauriStkOne
approved these changes
Apr 21, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Defender wraps sanitized outputs with
[UD-<id>]...[/UD-<id>]markers. When those outputs are fed back as input — nested tool calls, cached responses, multi-hop agent traces, or attacker-spoofed tags — the tokenizer counts the tag tokens as part of the classified sentence and the v4 ONNX model treats the structure as injection-adjacent.Measured impact
Benign Jira board-list payload pre-wrapped by upstream defender:
Self-flagging feedback loop closed — the classifier no longer sees its own output format as suspicious.
Implementation
The
stripBoundaryPatternsutility already existed atsrc/utils/boundary.ts:68with this exact use case in its docstring ("Boundary tags like [UD-xyz]...[/UD-xyz] corrupt per-sentence model scores because the tokenizer treats the tag text as part of the sentence") but had zero callers. This PR wires it into the three Tier 2 entry points —classify,classifyByChunks,prepareChunks— before the length check and tokenization.Also strips XML-style
<user-data-*>...</user-data-*>boundaries for completeness (same utility handles both).Secondary benefit: spoofing defense
Closes the attack surface PR #49 was targeting. An attacker injecting
[UD-*]tags into tool-result content can no longer use them to mask injection payloads as "already-trusted" boundary-wrapped data — defender strips everything that looks like a boundary marker before classifying.Semantics
Scores on payloads containing boundary markers now match scores on the same payloads with markers manually stripped. Added a spec that asserts bit-identical scoring between
classify(wrapped)andclassify(bare).Test plan
Summary by cubic
Strip defender boundary markers from Tier 2 inputs before classification to stop false positives and boundary-tag spoofing. Addresses ENG-12702; wrapped and unwrapped texts now score the same (e.g., 0.992 → 0.008 on a benign Jira payload).
stripBoundaryPatternsinclassify,classifyBySentence,classifyByChunks, andprepareChunksbefore tokenization/splitting..trim()fromstripBoundaryPatternsto preserve whitespace semantics; no call sites changed.toBeCloseTofor float tolerance; asserts matching scores for wrapped vs bare input.Written for commit 6f5b34c. Summary will update on new commits.