feat(ENG-12658): sentence density adjustment to reduce email notification FPs by hiskudin · Pull Request #50 · StackOneHQ/defender

hiskudin · 2026-04-15T20:48:37Z

Summary

Adds a density-weighted score to penalise isolated high-scoring sentences in largely benign text (e.g. "Check and secure your account now." in a Google security alert scoring 0.987 raw)
Strips [UD-...] boundary tags from extracted strings before Tier 2 classification so tags don't corrupt sentence-level scores
Tier 2 scans all strings in the tool result, not just fields Tier 1 marked risky — fields outside Tier 1 rules are still visible to the LLM so restricting Tier 2 to riskyFieldNames created a silent bypass
Applies density adjustment only when highCount > 0 — prevents sqrt(0/n) = 0 from zeroing out non-trivial raw scores when no sentence reaches the 0.9 threshold

How density adjustment works:

effectiveScore = maxScore × sqrt(highCount / totalCount)
  highCount  = sentences scoring ≥ 0.9
  totalCount = all classified sentences

Applied only when totalCount > 2 and highCount > 0. Short texts (1-2 sentences) use the raw score — a 2-sentence injection would be unfairly penalised since its density ratio is identical to a lone FP sentence.

Results

Email false positive test

Test	Before	After
Email FP (`tier2Fields`)	3/4	4/4
Security alert (persistent FP)	FAIL (raw 0.987 → high)	PASS (adj 0.570 → medium)

Tool-call FP benchmarks — 10 000 samples (Modal, production ONNX pipeline)

Script: classifier-eval/scripts/eval_toolcall_fp_density.py, model: jbv2-fujitsu-b1x5-freeze4-hn

Dataset	N	FPR@0.5	FPR@0.8
MirrorAPI (real RapidAPI, noisy)	10 000	10.9%	6.5%
ChatML (Glaive synthetic, clean JSON)	10 000	5.9%	2.8%
ToolACE (full dataset)	1 367	5.8%	3.1%
Overall	21 367	8.2%	4.5%

MirrorAPI's 6.5% FPR is driven primarily by binary API responses (raw PNG blobs from image/QR tools, API error messages with imperative phrasing) — the same category noted in prior evals. ChatML and ToolACE, which contain only structured text, are representative at 2.8% and 3.1%.

Removing the riskyFieldNames restriction had no measurable FPR cost: density adjustment absorbs it. MirrorAPI FPR@0.8 is unchanged at 3.7% in both configurations on equivalent samples.

Existing benchmarks (no regression)

Benchmark	Before	After
AgentShield	79.82	79.8
Classifier F1 (qualifire/jayavibhav/xxz224)	baseline	unchanged

Density never fires on short benchmark texts (≤2 sentences), so these are unaffected.

Test plan

npm run test:email-injection → 4/4
AgentShield benchmark → 79.8 (matches prior)
Classifier F1 benchmark → bit-for-bit identical
MirrorAPI-FP (10k samples, Modal) → FPR@0.8 = 6.5%
ChatML-FP (10k samples, Modal) → FPR@0.8 = 2.8%
ToolACE-FP (full dataset, Modal) → FPR@0.8 = 3.1%

🤖 Generated with Claude Code

… FPs Adds a density-weighted score to penalise isolated high-scoring sentences in largely benign text (e.g. "Check and secure your account now." in a Google security alert scoring 0.987 raw). How it works: effectiveScore = maxScore × sqrt(highCount / totalCount) highCount = sentences scoring ≥ 0.9 totalCount = all classified sentences (only applied when totalCount > 2) Short texts (1-2 sentences) are left unadjusted — a 2-sentence injection would be unfairly penalised since its density ratio is identical to a lone FP sentence. For 3+ sentences there is enough context for a meaningful signal. Also: - Strips [UD-...] boundary tags from extracted strings before classification so tags don't corrupt sentence-level scores - Removes riskyFieldNames fallback from tier2 field selection (was masking the true all-fields scan behaviour when tier2Fields was not set) - Uses density-adjusted score for hasThreats check, not raw maxScore Email FP test: 3/4 → 4/4 (Security alert fixed) AgentShield: 79.82 → 79.8 (no regression, density never fires on short texts) Classifier F1 (qualifire/jayavibhav/xxz224): unchanged Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

This PR updates Tier 2 (sentence-level ML) scoring in PromptDefense to reduce email notification false positives by applying a sentence-density penalty, stripping boundary tags before classification, and changing Tier 2 field selection to scan all strings unless tier2Fields is explicitly configured.

Changes:

Apply a density-weighted adjustment to the Tier 2 max-sentence score for texts with 3+ sentences.
Strip [UD-…] boundary tags from extracted strings prior to Tier 2 classification.
Remove the riskyFieldNames fallback so Tier 2 uses tier2Fields only when explicitly set (otherwise scans all strings).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The function was imported in prompt-defense.ts but never defined, causing TS2305 typecheck failure on CI. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cubic-dev-ai

0 issues found across 1 file (changes from recent commits).

_{Requires human review: This PR modifies core security logic by introducing a new heuristic (density-weighted scoring) for threat detection, which could impact the accuracy of security classifications.}

cubic-dev-ai

1 issue found across 1 file

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/core/prompt-defense.ts">

<violation number="1" location="src/core/prompt-defense.ts:264">
P1: Density adjustment zeros out scores when no sentence reaches the 0.9 sub-threshold. If `highCount` is 0 (all sentences score below 0.9), `Math.sqrt(0 / totalCount)` produces 0, making `effective = tier2Score * 0 = 0`. This silently suppresses scores in the 0.8–0.9 range (above `highRiskThreshold`) to 0 for any text with 3+ sentences. Guard the adjustment so it only fires when `highCount > 0`; otherwise fall through to the raw score.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

cubic-dev-ai

0 issues found across 1 file (changes from recent commits).

_{Requires human review: Auto-approval blocked by 1 unresolved issue from previous reviews.}

- Restore riskyFieldNames fallback: when tier2Fields is unset, Tier 2 now focuses on fields Tier 1 already identified as risky rather than scanning all strings unconditionally (reverts unintentional removal) - Fix density zero-out: apply density adjustment only when highCount > 0; previously sqrt(0/n)=0 would zero out any non-trivial raw score when no sentence exceeded the 0.9 threshold - Fix comment: 'Authenticator app added as sign-in step' scores ~0.51, not 0.91 as the example incorrectly stated - Add three tests covering density adjustment boundaries: isolated high sentence in 3+ text, short injection skipping density, and the highCount=0 path Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cubic-dev-ai

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/core/prompt-defense.ts">

<violation number="1" location="src/core/prompt-defense.ts:233">
P1: Custom agent: **Flag Security Vulnerabilities**

This change introduces a Tier 2 coverage bypass by restricting scans to Tier 1 risky fields when `tier2Fields` is unset. Keep Tier 2 on all strings by default to avoid missing malicious content in unlisted fields.</violation>
</file>

_{Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.}

Restricting Tier 2 to Tier 1 risky fields when tier2Fields is unset creates a bypass: injections in fields not covered by tool rules are visible to the LLM but scanned by neither tier. Removing the restriction has no FPR cost — density adjustment absorbs the difference. Measured on 1000-sample FP benchmarks (tier1+tier2): MirrorAPI FPR@0.8: 3.7% → 3.7% (unchanged) ChatML FPR@0.8: 0.0% → 0.0% (unchanged) ToolACE FPR@0.8: 0.8% → 0.8% (unchanged) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cubic-dev-ai

0 issues found across 2 files (changes from recent commits).

_{Requires human review: Modifies core risk-scoring logic in PromptDefense with a new density-adjustment heuristic, which could impact the detection of malicious payloads in long texts.}

glebedel · 2026-04-16T11:14:53Z

+	return content
+		.replace(/\[UD-[A-Za-z0-9_-]+\]/g, "")
+		.replace(/\[\/UD-[A-Za-z0-9_-]+\]/g, "")
+		.replace(/<user-data-[A-Za-z0-9_-]+>/g, "")


Where's this user-data string coming from?

We add it on tier1, the goal is to surround tool result with tags that help LLMs differentiate user/system prompts from tool responses, something like a boundary. But we strip it before passing it to the classifier so that the classifier focuses only on the tool response.

glebedel

LGTM

Copilot AI review requested due to automatic review settings April 15, 2026 20:48

hiskudin requested a review from a team as a code owner April 15, 2026 20:48

Copilot started reviewing on behalf of hiskudin April 15, 2026 20:49 View session

hiskudin changed the title ~~feat(tier2): sentence density adjustment to reduce email notification FPs~~ feat(ENG-12658): sentence density adjustment to reduce email notification FPs Apr 15, 2026

fix: sort imports to satisfy biome organizeImports

9ad2b9f

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI reviewed Apr 15, 2026

View reviewed changes

Comment thread src/core/prompt-defense.ts

Comment thread src/core/prompt-defense.ts Outdated

Comment thread src/core/prompt-defense.ts Outdated

Comment thread src/core/prompt-defense.ts

Comment thread src/core/prompt-defense.ts

fix: add missing stripBoundaryPatterns export to boundary utils

f7983b7

The function was imported in prompt-defense.ts but never defined, causing TS2305 typecheck failure on CI. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cubic-dev-ai Bot reviewed Apr 15, 2026

View reviewed changes

Comment thread src/core/prompt-defense.ts

cubic-dev-ai Bot reviewed Apr 15, 2026

View reviewed changes

Comment thread src/core/prompt-defense.ts Outdated

cubic-dev-ai Bot reviewed Apr 16, 2026

View reviewed changes

glebedel reviewed Apr 16, 2026

View reviewed changes

glebedel approved these changes Apr 16, 2026

View reviewed changes

hiskudin merged commit b4a272d into main Apr 16, 2026
6 checks passed

hiskudin deleted the feat/tier2-density-adjustment branch April 16, 2026 16:26

stackone-devops-service-account mentioned this pull request Apr 16, 2026

chore(main): release defender 0.6.0 #48

Merged

hiskudin mentioned this pull request Apr 20, 2026

feat( ENG-12684): v4 ONNX model + cumulative risk fixes + opt-in SFE preprocessor (includes ENG-12671) #52

Merged

9 tasks

Conversation

hiskudin commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Results

Email false positive test

Tool-call FP benchmarks — 10 000 samples (Modal, production ONNX pipeline)

Existing benchmarks (no regression)

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

glebedel Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

hiskudin Apr 16, 2026

Choose a reason for hiding this comment

Uh oh!

glebedel left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hiskudin commented Apr 15, 2026 •

edited

Loading