fix(ENG-12702): strip boundary markers from input before classification by hiskudin · Pull Request #55 · StackOneHQ/defender

hiskudin · 2026-04-21T15:55:02Z

Summary

Defender wraps sanitized outputs with [UD-<id>]...[/UD-<id>] markers. When those outputs are fed back as input — nested tool calls, cached responses, multi-hop agent traces, or attacker-spoofed tags — the tokenizer counts the tag tokens as part of the classified sentence and the v4 ONNX model treats the structure as injection-adjacent.

Measured impact

Benign Jira board-list payload pre-wrapped by upstream defender:

Input	Tier 2 score	Risk
With UD tags in values	0.992	high ✗ blocked
UD tags manually stripped	0.008	medium ✓ allowed
With this fix (tags auto-stripped at entry)	0.008	medium ✓ allowed

Self-flagging feedback loop closed — the classifier no longer sees its own output format as suspicious.

Implementation

The stripBoundaryPatterns utility already existed at src/utils/boundary.ts:68 with this exact use case in its docstring ("Boundary tags like [UD-xyz]...[/UD-xyz] corrupt per-sentence model scores because the tokenizer treats the tag text as part of the sentence") but had zero callers. This PR wires it into the three Tier 2 entry points — classify, classifyByChunks, prepareChunks — before the length check and tokenization.

Also strips XML-style <user-data-*>...</user-data-*> boundaries for completeness (same utility handles both).

Secondary benefit: spoofing defense

Closes the attack surface PR #49 was targeting. An attacker injecting [UD-*] tags into tool-result content can no longer use them to mask injection payloads as "already-trusted" boundary-wrapped data — defender strips everything that looks like a boundary marker before classifying.

Semantics

Scores on payloads containing boundary markers now match scores on the same payloads with markers manually stripped. Added a spec that asserts bit-identical scoring between classify(wrapped) and classify(bare).

Test plan

New spec: identical scores for wrapped vs bare input
201/202 existing tests pass (1 pre-existing ONNX calibration flake, unrelated)
End-to-end verified against test.json: 0.992 → 0.008
Re-run Gmail fixture comparison to confirm no regression on real email payloads

Summary by cubic

Strip defender boundary markers from Tier 2 inputs before classification to stop false positives and boundary-tag spoofing. Addresses ENG-12702; wrapped and unwrapped texts now score the same (e.g., 0.992 → 0.008 on a benign Jira payload).

Bug Fixes
- Apply stripBoundaryPatterns in classify, classifyBySentence, classifyByChunks, and prepareChunks before tokenization/splitting.
- Remove implicit .trim() from stripBoundaryPatterns to preserve whitespace semantics; no call sites changed.
- Update spec to skip in CI and use toBeCloseTo for float tolerance; asserts matching scores for wrapped vs bare input.

^{Written for commit 6f5b34c. Summary will update on new commits.}

Defender wraps sanitized outputs with [UD-<id>]...[/UD-<id>] markers. When those outputs are fed back as input — nested tool calls, cached responses, multi-hop agent traces, or attacker-spoofed tags — the tokenizer counts the tag tokens as part of the classified sentence and the v4 ONNX model treats that structure as injection-adjacent. Measured on a benign Jira payload pre-wrapped by upstream defender: score went 0.008 (stripped) → 0.99 (with tags), flipping a clean pass into a high-risk block. The stripBoundaryPatterns utility (src/utils/boundary.ts:68) was written for exactly this — docstring: "Boundary tags like [UD-xyz] ...[/UD-xyz] corrupt per-sentence model scores because the tokenizer treats the tag text as part of the sentence" — but had zero callers. Wires it into the three Tier 2 entry points (classify, classifyByChunks, prepareChunks) before the length check and tokenization. Also mitigates the spoofed-boundary attack PR #49 was targeting: attacker-injected [UD-*] tags in input get stripped before the classifier sees them, so they can't be used to mask injection content as "already-trusted" data. Behavior change: Tier 2 scores on payloads containing UD/XML boundary markers now match scores on the same payloads with markers manually stripped (added spec asserts bit-identical scoring). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cubic-dev-ai

No issues found across 2 files

_{Auto-approved: This PR fixes false positives in the Tier 2 classifier by correctly stripping internal boundary markers before processing. The logic uses an existing utility and includes verification tests.}

Copilot

Pull request overview

This PR prevents Tier 2 prompt-injection scoring from being distorted by Defender boundary wrappers (e.g., [UD-*]...[/UD-*] and <user-data-*>...</user-data-*>) by stripping those markers from inputs before length checks and tokenization, closing a self-feedback false-positive loop and reducing spoofing surface.

Changes:

Wire stripBoundaryPatterns() into Tier2Classifier entry points (classify, classifyByChunks, prepareChunks) prior to sizing/tokenization.
Add a spec asserting wrapped vs unwrapped inputs produce identical Tier 2 scores after stripping.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
src/classifiers/tier2-classifier.ts	Strips boundary markers at Tier 2 entry points to avoid tokenizer/model score distortion.
specs/tier2-classifier.spec.ts	Adds regression test for identical scoring between boundary-wrapped and bare input.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Four review items: 1. classifyBySentence was missing the strip — added. Now all four public Tier 2 entry points (classify, classifyBySentence, classifyByChunks, prepareChunks) share the same pre-strip behavior. 2. stripBoundaryPatterns utility previously called .trim() as a side effect, which changed semantics for every Tier 2 input regardless of whether it contained boundary markers. Removed the trim from the utility; callers who need it should call .trim() themselves. None of the current call sites need it (the strip itself is sufficient). 3. New spec was missing it.skipIf(!!process.env.CI) — other model- dependent tests in the file use it to avoid CI flakiness on ONNX loading. Matched the convention. 4. Strict toBe equality on float model output is brittle across runtime/hardware. Switched to toBeCloseTo at 10-decimal precision — still asserts the scores match (the inputs are bit-identical after stripping) while tolerating any ONNX runtime non-determinism. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 21, 2026 15:55

hiskudin requested a review from a team as a code owner April 21, 2026 15:55

hiskudin changed the title ~~fix(tier2): strip boundary markers from input before classification~~ fix(ENG-12702): strip boundary markers from input before classification Apr 21, 2026

Copilot started reviewing on behalf of hiskudin April 21, 2026 15:55 View session

cubic-dev-ai Bot previously approved these changes Apr 21, 2026

View reviewed changes

Copilot AI reviewed Apr 21, 2026

View reviewed changes

Comment thread src/classifiers/tier2-classifier.ts

Comment thread src/classifiers/tier2-classifier.ts

Comment thread specs/tier2-classifier.spec.ts Outdated

Comment thread specs/tier2-classifier.spec.ts Outdated

hiskudin dismissed cubic-dev-ai[bot]’s stale review via 6f5b34c April 21, 2026 16:01

OMauriStkOne approved these changes Apr 21, 2026

View reviewed changes

hiskudin merged commit 0fdd9d4 into main Apr 21, 2026
3 checks passed

hiskudin deleted the fix/strip-boundary-tags-before-classification branch April 21, 2026 16:03

stackone-devops-service-account mentioned this pull request Apr 21, 2026

chore(main): release defender 0.6.2 #56

Merged

hiskudin mentioned this pull request Apr 22, 2026

fix(ENG-12707, ENG-12708): make boundary annotation opt-in (annotateBoundary flag) #57

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ENG-12702): strip boundary markers from input before classification#55

fix(ENG-12702): strip boundary markers from input before classification#55
hiskudin merged 2 commits intomainfrom
fix/strip-boundary-tags-before-classification

hiskudin commented Apr 21, 2026 •

edited by cubic-dev-ai Bot

Loading

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

hiskudin commented Apr 21, 2026 • edited by cubic-dev-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Measured impact

Implementation

Secondary benefit: spoofing defense

Semantics

Test plan

Summary by cubic

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hiskudin commented Apr 21, 2026 •

edited by cubic-dev-ai Bot

Loading