
fix(ENG-12707, ENG-12708): make boundary annotation opt-in (annotateBoundary flag)#57

Merged
hiskudin merged 4 commits into main from feat/opt-in-boundary-annotation
Apr 23, 2026

@hiskudin (Collaborator) commented Apr 22, 2026

Summary

Opt-in boundary annotation (annotateBoundary)

  • Boundary wrapping ([UD-<id>]...[/UD-<id>]) is now opt-in via annotateBoundary on PromptDefenseOptions / ToolResultSanitizerConfig / SanitizerConfig. Default: false.
  • Hard gate across all risk levels: the old alwaysAnnotate flag only gated low-risk results, and medium/high-risk results were still wrapped. The new flag applies unconditionally.
  • Explicit methods: ['boundary_annotation'] in SanitizeOptions still wraps regardless of the flag (per-call escape hatch).
  • When disabled, generateDataBoundary() is skipped entirely — no nanoid() call per tool result, no "boundary_annotation" entries in metadata.methodsByField.
  • Exported generateBoundaryInstructions() and containsBoundaryPatterns() from the package entrypoint for callers that opt in.
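A minimal TypeScript sketch of the gating rules above, for reviewers. The names annotateBoundary, methods and "boundary_annotation" come from this PR. The types, sanitizeField(), and the counter-based id generator (standing in for nanoid()) are hypothetical and do not reproduce src/sanitizers/sanitizer.ts.

```typescript
// Hypothetical sketch of the opt-in gate; not the package's implementation.
type RiskLevel = "low" | "medium" | "high";

interface SanitizerConfig {
  annotateBoundary?: boolean; // opt-in; treated as false when omitted
}

interface SanitizeOptions {
  methods?: string[]; // explicit per-call methods
  risk?: RiskLevel;   // intentionally ignored by the gate below
}

// Hard gate: risk level plays no part in the decision.
function shouldAnnotate(config: SanitizerConfig, options: SanitizeOptions): boolean {
  // Per-call escape hatch: an explicit request always wraps.
  if (options.methods?.includes("boundary_annotation")) return true;
  return config.annotateBoundary === true;
}

// Deterministic stand-in for the nanoid()-based id in generateDataBoundary().
let nextId = 0;
function makeBoundaryId(): string {
  nextId += 1;
  return `b${nextId}`;
}

function sanitizeField(
  value: string,
  config: SanitizerConfig,
  options: SanitizeOptions,
  methodsApplied: string[],
): string {
  if (!shouldAnnotate(config, options)) {
    // Disabled: no id is generated and nothing is recorded in metadata.
    return value;
  }
  const id = makeBoundaryId();
  methodsApplied.push("boundary_annotation");
  return `[UD-${id}]${value}[/UD-${id}]`;
}
```

With the defaults, sanitizeField("x", {}, { risk: "high" }, applied) returns "x" and leaves applied empty. Passing { annotateBoundary: true } or methods: ["boundary_annotation"] wraps the value.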

generateBoundaryInstructions() — the system-prompt template that teaches an LLM to treat [UD-*] content as untrusted data — had zero downstream consumers across defender, connect, connect-handler, and unified-cloud-api. The tags were inert scaffolding, so making them opt-in removes per-field metadata noise and output bloat for all existing callers while preserving the feature for anyone who adds the system-prompt half of the contract.

Fix: SFE filtering is classifier-only

SFE was replacing the output with the filtered payload, permanently dropping metadata/identifier fields from DefenseResult.sanitized. This meant the LLM received a truncated tool result whenever useSfe was enabled — losing IDs, cursors, and other fields the agent might need for follow-up calls.

Fix: sfeFilteredValue now feeds Tier 2 string extraction only. Tier 1 sanitization and the returned sanitized payload always operate on the original value. fieldsDropped is now documented accurately as "fields excluded from classification", not "fields absent from output".
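A sketch of the corrected data flow, under stated assumptions. Json, sfeFilter(), extractStrings(), tier1Sanitize() and the key-name heuristic are hypothetical stand-ins. Only the field names sanitized and fieldsDropped come from this PR. The sketch shows that the SFE-filtered value feeds Tier 2 string extraction and nothing else.

```typescript
// Hypothetical sketch of the fixed flow; not the package's implementation.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

interface SfeResult {
  filtered: Json;          // payload used for classification
  fieldsDropped: string[]; // keys excluded from classification only
}

// Toy SFE: exclude identifier/cursor-like keys (assumed heuristic).
function sfeFilter(value: Json): SfeResult {
  if (value === null || typeof value !== "object" || Array.isArray(value)) {
    return { filtered: value, fieldsDropped: [] };
  }
  const filtered: { [key: string]: Json } = {};
  const fieldsDropped: string[] = [];
  for (const [key, v] of Object.entries(value)) {
    if (/^(id|.*_id|cursor)$/.test(key)) fieldsDropped.push(key);
    else filtered[key] = v;
  }
  return { filtered, fieldsDropped };
}

// Collect string leaves for Tier 2 classification.
function extractStrings(value: Json): string[] {
  if (typeof value === "string") return [value];
  if (value === null || typeof value !== "object") return [];
  const items: Json[] = Array.isArray(value) ? value : Object.values(value);
  return items.flatMap((item) => extractStrings(item));
}

// Placeholder for Tier 1 sanitization: returns the value unchanged here.
const tier1Sanitize = (value: Json): Json => value;

interface DefenseSketch {
  sanitized: Json;
  tier2Inputs: string[];
  fieldsDropped: string[];
}

function defend(value: Json, useSfe: boolean): DefenseSketch {
  const sfe: SfeResult = useSfe ? sfeFilter(value) : { filtered: value, fieldsDropped: [] };
  // Tier 2 sees only strings from the SFE-filtered payload...
  const tier2Inputs = extractStrings(sfe.filtered);
  // ...while Tier 1 and the returned payload use the original value.
  const sanitized = tier1Sanitize(value);
  return { sanitized, tier2Inputs, fieldsDropped: sfe.fieldsDropped };
}
```

Take { id: "42", cursor: "c1", body: "hello" } with SFE on. sanitized keeps all three fields, Tier 2 sees only "hello", and fieldsDropped is ["id", "cursor"].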

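Non-breaking in practice: no downstream caller consumed generateBoundaryInstructions(), so none depended on the [UD-*] tags.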

Downstream

connect-sdk DefenderSettings will gain annotateBoundary in a follow-up PR for exposure in unified-cloud-api project settings.

Test plan

  • Existing sanitizer + integration specs updated to opt-in or assert default-off behavior.
  • New tests: default off (no [UD- at any risk level), flag on (end-to-end via defendToolResult), explicit-methods escape hatch.
  • npm test — 206/207 passing (pre-existing ONNX batch classification flake on clean main).
  • npx tsc --noEmit — clean.

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 22, 2026 15:13
@hiskudin hiskudin requested a review from a team as a code owner April 22, 2026 15:13
@hiskudin hiskudin changed the title feat: make boundary annotation opt-in (annotateBoundary flag) feat(ENG-12707): make boundary annotation opt-in (annotateBoundary flag) Apr 22, 2026
@hiskudin hiskudin changed the title feat(ENG-12707): make boundary annotation opt-in (annotateBoundary flag) fix(ENG-12707): make boundary annotation opt-in (annotateBoundary flag) Apr 22, 2026
Wrapping with [UD-<id>]...[/UD-<id>] markers is now off by default. Callers
that want it set annotateBoundary: true in the createPromptDefense options.

- Skips generateDataBoundary() entirely when disabled (no nanoid() call
  per tool result; no boundary_annotation entries in methodsByField)
- Hard gate across all risk levels (old alwaysAnnotate only gated low)
- Explicit methods: ["boundary_annotation"] in SanitizeOptions still wraps
  regardless of the flag (per-call escape hatch)

Non-breaking in practice: generateBoundaryInstructions — the system-prompt
template that teaches an LLM what [UD-*] means — has zero downstream
consumers across defender/connect/connect-handler/unified-cloud-api, so
the tags were inert scaffolding costing per-field metadata noise and
output bloat. SFE and Tier 2 already strip boundary markers on input
(v0.6.2), so self-flagging is unaffected.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
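The v0.6.2 input stripping mentioned in this commit can be pictured as a regex pass before classification. This is a hypothetical sketch:

  • hasBoundaryMarkers() and stripBoundaryMarkers() are illustrative names. The package's exported helper is containsBoundaryPatterns(), which this does not reproduce.
  • The id alphabet A-Za-z0-9_- is assumed from nanoid's default.

```typescript
// Hypothetical sketch; not the package's implementation.
// Matches an opening [UD-<id>] or closing [/UD-<id>] marker
// (id alphabet assumed to be nanoid's default: A-Za-z0-9_-).

function hasBoundaryMarkers(text: string): boolean {
  // Non-global regex literal, so .test() carries no lastIndex state.
  return /\[\/?UD-[A-Za-z0-9_-]+\]/.test(text);
}

function stripBoundaryMarkers(text: string): string {
  // Remove every marker but keep the wrapped content for classification.
  return text.replace(/\[\/?UD-[A-Za-z0-9_-]+\]/g, "");
}
```

Stripping only the markers keeps the wrapped text available to SFE and Tier 2.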
@hiskudin hiskudin force-pushed the feat/opt-in-boundary-annotation branch from 8edd268 to 9c2a23c Compare April 22, 2026 15:15

Copilot AI left a comment

Pull request overview

Makes boundary annotation ([UD-<id>]...[/UD-<id>]) opt-in across the sanitizer pipeline via a new annotateBoundary flag, defaulting to false, and updates tests/docs to reflect the new default behavior.

Changes:

  • Replace risk-level/legacy gating with a hard opt-in annotateBoundary switch in Sanitizer, plus an explicit per-call methods: ["boundary_annotation"] escape hatch.
  • Plumb annotateBoundary through PromptDefenseOptions → ToolResultSanitizerConfig, and skip generateDataBoundary() entirely when wrapping is disabled.
  • Update specs and README to validate/document default-off behavior and opt-in wrapping.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Summary per file:

  • src/sanitizers/sanitizer.ts: Introduces annotateBoundary config, changes boundary wrapping to opt-in, keeps explicit method override behavior.
  • src/core/tool-result-sanitizer.ts: Adds annotateBoundary config and avoids boundary generation when disabled; passes flag into createSanitizer.
  • src/core/prompt-defense.ts: Adds annotateBoundary option to the public PromptDefenseOptions and forwards it to ToolResultSanitizer.
  • specs/sanitizers.spec.ts: Updates unit/integration expectations for default-off wrapping and adds opt-in/override coverage.
  • specs/integration.spec.ts: Adds PromptDefense-level assertions for default-off vs opt-in boundary wrapping; opts into wrapping in a scenario test.
  • README.md: Documents boundary annotation as opt-in and references system-prompt pairing guidance.

Comment thread src/core/prompt-defense.ts Outdated
Comment thread README.md
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 6 files

Requires human review: This change modifies the default output behavior of a core sanitization path in a security tool. Such changes to default behavior require human validation of downstream impact.

Callers that opt into annotateBoundary need the system-prompt template
that tells an LLM how to handle [UD-*] markers. Previously only
available via a deep import.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
cubic-dev-ai Bot previously approved these changes Apr 22, 2026
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

0 issues found across 1 file (changes from recent commits).

Auto-approved: Makes boundary annotation opt-in to reduce output noise. Includes comprehensive test updates and documentation. Low risk as the feature had no downstream consumers.

glebedel previously approved these changes Apr 22, 2026
Contributor

@glebedel glebedel left a comment

LGTM

…caller

SFE was replacing the output value with the filtered payload, permanently
dropping metadata/identifier fields before Tier 1 sanitization and the
final DefenseResult.sanitized. This meant the LLM received a truncated
tool result whenever useSfe was enabled.

Fix: scope sfeFilteredValue to Tier 2 string extraction only. Tier 1
sanitization and the returned sanitized payload always operate on the
original value. fieldsDropped now documents paths excluded from
classification, not paths absent from the returned data.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@hiskudin hiskudin dismissed stale reviews from glebedel and cubic-dev-ai[bot] via 7d590ea April 23, 2026 08:04
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

0 issues found across 1 file (changes from recent commits).

Requires human review: Modifies core sanitization logic and changes default behavior for boundary annotation. The SFE filtering fix also alters data flow in a core path, requiring human verification.

@hiskudin hiskudin changed the title fix(ENG-12707): make boundary annotation opt-in (annotateBoundary flag) fix(ENG-12707, ENG-12708): make boundary annotation opt-in (annotateBoundary flag) Apr 23, 2026
@hiskudin hiskudin requested a review from Copilot April 23, 2026 08:13
- Update annotateBoundary JSDoc to reference the exported package symbol
  instead of the internal utils/boundary path
- Add sfe.spec.ts test asserting DefenseResult.sanitized retains fields
  that SFE drops from Tier 2 classification

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

0 issues found across 2 files (changes from recent commits).

Requires human review: Changes core sanitization logic and default output behavior in a security package. The data flow changes in the defense pipeline (SFE fix) warrant human verification.

@OMauriStkOne OMauriStkOne left a comment

LGTM

@hiskudin hiskudin merged commit bf10849 into main Apr 23, 2026
4 checks passed
@hiskudin hiskudin deleted the feat/opt-in-boundary-annotation branch April 23, 2026 08:51