fix(ENG-12702): strip boundary markers from input before classification #55

Merged
hiskudin merged 2 commits into main from fix/strip-boundary-tags-before-classification
Apr 21, 2026

Conversation

@hiskudin (Collaborator) commented Apr 21, 2026

Summary

Defender wraps sanitized outputs with [UD-<id>]...[/UD-<id>] markers. When those outputs are fed back as input — nested tool calls, cached responses, multi-hop agent traces, or attacker-spoofed tags — the tokenizer counts the tag tokens as part of the classified sentence and the v4 ONNX model treats the structure as injection-adjacent.

Measured impact

Benign Jira board-list payload pre-wrapped by upstream defender:

| Input | Tier 2 score | Risk |
| --- | --- | --- |
| With UD tags in values | 0.992 | high ✗ blocked |
| UD tags manually stripped | 0.008 | medium ✓ allowed |
| With this fix (tags auto-stripped at entry) | 0.008 | medium ✓ allowed |

Self-flagging feedback loop closed — the classifier no longer sees its own output format as suspicious.

Implementation

The stripBoundaryPatterns utility already existed at src/utils/boundary.ts:68 with this exact use case in its docstring ("Boundary tags like [UD-xyz]...[/UD-xyz] corrupt per-sentence model scores because the tokenizer treats the tag text as part of the sentence") but had zero callers. This PR wires it into the three Tier 2 entry points — classify, classifyByChunks, prepareChunks — before the length check and tokenization.

Also strips XML-style <user-data-*>...</user-data-*> boundaries for completeness (same utility handles both).
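The PR doesn't show the utility's body; as a rough illustration, a marker-stripping function handling both forms could look like the sketch below. The regexes and behavior here are assumptions for illustration, not the actual src/utils/boundary.ts code.

```typescript
// Hypothetical sketch of a boundary-stripping utility; the real
// implementation in src/utils/boundary.ts may differ.

// Matches [UD-<id>] and [/UD-<id>] markers (tags only, not the content
// between them).
const UD_TAG = /\[\/?UD-[A-Za-z0-9_-]+\]/g;

// Matches XML-style <user-data-*> and </user-data-*> boundary tags.
const XML_TAG = /<\/?user-data-[A-Za-z0-9_-]+>/g;

function stripBoundaryPatterns(input: string): string {
  // Remove only the markers; the wrapped content is left untouched, so
  // the tokenizer sees the same text an unwrapped payload would produce.
  return input.replace(UD_TAG, "").replace(XML_TAG, "");
}

// Example: a payload pre-wrapped by an upstream defender.
const wrapped = "[UD-a1b2]List the boards in project ENG.[/UD-a1b2]";
console.log(stripBoundaryPatterns(wrapped));
// → List the boards in project ENG.
```

Note that only the tags themselves are removed; inner content and surrounding whitespace pass through unchanged, which keeps the stripped text identical to what a never-wrapped payload would produce.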

Secondary benefit: spoofing defense

Closes the attack surface PR #49 was targeting. An attacker injecting [UD-*] tags into tool-result content can no longer use them to mask injection payloads as "already-trusted" boundary-wrapped data — defender strips everything that looks like a boundary marker before classifying.

Semantics

Scores on payloads containing boundary markers now match scores on the same payloads with markers manually stripped. Added a spec that asserts bit-identical scoring between classify(wrapped) and classify(bare).
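The invariant the spec asserts can be illustrated with a stand-in scorer. The real spec runs the v4 ONNX model; the `classify` mock and all names below are hypothetical.

```typescript
// Stand-in demonstration of the wrapped-vs-bare scoring invariant.
// The regex mirrors the boundary forms described in the PR.
const BOUNDARY = /\[\/?UD-[A-Za-z0-9_-]+\]|<\/?user-data-[A-Za-z0-9_-]+>/g;
const strip = (s: string): string => s.replace(BOUNDARY, "");

// Mock classifier: any deterministic function of the input text works
// for the demonstration. Because scoring runs on strip(input), the
// wrapper cannot change the score.
function classify(input: string): number {
  const text = strip(input);
  let h = 0;
  for (const ch of text) h = (h * 31 + ch.codePointAt(0)!) % 1_000_000;
  return h / 1_000_000; // pseudo-score in [0, 1)
}

const bare = "Show me the Jira boards for project ENG.";
const wrapped = `[UD-42]${bare}[/UD-42]`;
console.log(classify(wrapped) === classify(bare)); // → true
```

Since stripping happens before tokenization, the two inputs are byte-identical by the time the model sees them, which is what makes the equal-score assertion safe.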

Test plan

  • New spec: identical scores for wrapped vs bare input
  • 201/202 existing tests pass (1 pre-existing ONNX calibration flake, unrelated)
  • End-to-end verified against test.json: 0.992 → 0.008
  • Re-run Gmail fixture comparison to confirm no regression on real email payloads

Summary by cubic

Strip defender boundary markers from Tier 2 inputs before classification to stop false positives and boundary-tag spoofing. Addresses ENG-12702; wrapped and unwrapped texts now score the same (e.g., 0.992 → 0.008 on a benign Jira payload).

  • Bug Fixes
    • Apply stripBoundaryPatterns in classify, classifyBySentence, classifyByChunks, and prepareChunks before tokenization/splitting.
    • Remove implicit .trim() from stripBoundaryPatterns to preserve whitespace semantics; no call sites changed.
    • Update spec to skip in CI and use toBeCloseTo for float tolerance; asserts matching scores for wrapped vs bare input.

Written for commit 6f5b34c.

Defender wraps sanitized outputs with [UD-<id>]...[/UD-<id>] markers.
When those outputs are fed back as input — nested tool calls, cached
responses, multi-hop agent traces, or attacker-spoofed tags — the
tokenizer counts the tag tokens as part of the classified sentence
and the v4 ONNX model treats that structure as injection-adjacent.
Measured on a benign Jira payload pre-wrapped by upstream defender:
score went 0.008 (stripped) → 0.99 (with tags), flipping a clean
pass into a high-risk block.

The stripBoundaryPatterns utility (src/utils/boundary.ts:68) was
written for exactly this — docstring: "Boundary tags like [UD-xyz]
...[/UD-xyz] corrupt per-sentence model scores because the tokenizer
treats the tag text as part of the sentence" — but had zero callers.
Wires it into the three Tier 2 entry points (classify, classifyByChunks,
prepareChunks) before the length check and tokenization.

Also mitigates the spoofed-boundary attack PR #49 was targeting:
attacker-injected [UD-*] tags in input get stripped before the
classifier sees them, so they can't be used to mask injection content
as "already-trusted" data.

Behavior change: Tier 2 scores on payloads containing UD/XML boundary
markers now match scores on the same payloads with markers manually
stripped (added spec asserts bit-identical scoring).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 21, 2026 15:55
@hiskudin hiskudin requested a review from a team as a code owner April 21, 2026 15:55
@hiskudin hiskudin changed the title fix(tier2): strip boundary markers from input before classification fix(ENG-12702): strip boundary markers from input before classification Apr 21, 2026
cubic-dev-ai[bot]
cubic-dev-ai Bot previously approved these changes Apr 21, 2026
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 2 files

Auto-approved: This PR fixes false positives in the Tier 2 classifier by correctly stripping internal boundary markers before processing. The logic uses an existing utility and includes verification tests.

Copilot AI left a comment

Pull request overview

This PR prevents Tier 2 prompt-injection scoring from being distorted by Defender boundary wrappers (e.g., [UD-*]...[/UD-*] and <user-data-*>...</user-data-*>) by stripping those markers from inputs before length checks and tokenization, closing a self-feedback false-positive loop and reducing spoofing surface.

Changes:

  • Wire stripBoundaryPatterns() into Tier2Classifier entry points (classify, classifyByChunks, prepareChunks) prior to sizing/tokenization.
  • Add a spec asserting wrapped vs unwrapped inputs produce identical Tier 2 scores after stripping.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
src/classifiers/tier2-classifier.ts Strips boundary markers at Tier 2 entry points to avoid tokenizer/model score distortion.
specs/tier2-classifier.spec.ts Adds regression test for identical scoring between boundary-wrapped and bare input.

Comment threads:
  • src/classifiers/tier2-classifier.ts (2)
  • specs/tier2-classifier.spec.ts (2, outdated)
Four review items:

1. classifyBySentence was missing the strip — added. Now all four
   public Tier 2 entry points (classify, classifyBySentence,
   classifyByChunks, prepareChunks) share the same pre-strip behavior.

2. stripBoundaryPatterns utility previously called .trim() as a side
   effect, which changed semantics for every Tier 2 input regardless
   of whether it contained boundary markers. Removed the trim from the
   utility; callers who need it should call .trim() themselves. None
   of the current call sites need it (the strip itself is sufficient).

3. New spec was missing it.skipIf(!!process.env.CI) — other model-
   dependent tests in the file use it to avoid CI flakiness on ONNX
   loading. Matched the convention.

4. Strict toBe equality on float model output is brittle across
   runtime/hardware. Switched to toBeCloseTo at 10-decimal precision —
   still asserts the scores match (the inputs are bit-identical after
   stripping) while tolerating any ONNX runtime non-determinism.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hiskudin hiskudin merged commit 0fdd9d4 into main Apr 21, 2026
3 checks passed
@hiskudin hiskudin deleted the fix/strip-boundary-tags-before-classification branch April 21, 2026 16:03