feat(ENG-12658): sentence density adjustment to reduce email notification FPs#50
feat(ENG-12658): sentence density adjustment to reduce email notification FPs#50
Conversation
… FPs Adds a density-weighted score to penalise isolated high-scoring sentences in largely benign text (e.g. "Check and secure your account now." in a Google security alert scoring 0.987 raw). How it works: effectiveScore = maxScore × sqrt(highCount / totalCount) highCount = sentences scoring ≥ 0.9 totalCount = all classified sentences (only applied when totalCount > 2) Short texts (1-2 sentences) are left unadjusted — a 2-sentence injection would be unfairly penalised since its density ratio is identical to a lone FP sentence. For 3+ sentences there is enough context for a meaningful signal. Also: - Strips [UD-...] boundary tags from extracted strings before classification so tags don't corrupt sentence-level scores - Removes riskyFieldNames fallback from tier2 field selection (was masking the true all-fields scan behaviour when tier2Fields was not set) - Uses density-adjusted score for hasThreats check, not raw maxScore Email FP test: 3/4 → 4/4 (Security alert fixed) AgentShield: 79.82 → 79.8 (no regression, density never fires on short texts) Classifier F1 (qualifire/jayavibhav/xxz224): unchanged Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR updates Tier 2 (sentence-level ML) scoring in PromptDefense to reduce email notification false positives by applying a sentence-density penalty, stripping boundary tags before classification, and changing Tier 2 field selection to scan all strings unless tier2Fields is explicitly configured.
Changes:
- Apply a density-weighted adjustment to the Tier 2 max-sentence score for texts with 3+ sentences.
- Strip
[UD-…]boundary tags from extracted strings prior to Tier 2 classification. - Remove the
riskyFieldNamesfallback so Tier 2 usestier2Fieldsonly when explicitly set (otherwise scans all strings).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The function was imported in prompt-defense.ts but never defined, causing TS2305 typecheck failure on CI. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
0 issues found across 1 file (changes from recent commits).
Requires human review: This PR modifies core security logic by introducing a new heuristic (density-weighted scoring) for threat detection, which could impact the accuracy of security classifications.
There was a problem hiding this comment.
1 issue found across 1 file
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/core/prompt-defense.ts">
<violation number="1" location="src/core/prompt-defense.ts:264">
P1: Density adjustment zeros out scores when no sentence reaches the 0.9 sub-threshold. If `highCount` is 0 (all sentences score below 0.9), `Math.sqrt(0 / totalCount)` produces 0, making `effective = tier2Score * 0 = 0`. This silently suppresses scores in the 0.8–0.9 range (above `highRiskThreshold`) to 0 for any text with 3+ sentences. Guard the adjustment so it only fires when `highCount > 0`; otherwise fall through to the raw score.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
- Restore riskyFieldNames fallback: when tier2Fields is unset, Tier 2 now focuses on fields Tier 1 already identified as risky rather than scanning all strings unconditionally (reverts unintentional removal) - Fix density zero-out: apply density adjustment only when highCount > 0; previously sqrt(0/n)=0 would zero out any non-trivial raw score when no sentence exceeded the 0.9 threshold - Fix comment: 'Authenticator app added as sign-in step' scores ~0.51, not 0.91 as the example incorrectly stated - Add three tests covering density adjustment boundaries: isolated high sentence in 3+ text, short injection skipping density, and the highCount=0 path Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/core/prompt-defense.ts">
<violation number="1" location="src/core/prompt-defense.ts:233">
P1: Custom agent: **Flag Security Vulnerabilities**
This change introduces a Tier 2 coverage bypass by restricting scans to Tier 1 risky fields when `tier2Fields` is unset. Keep Tier 2 on all strings by default to avoid missing malicious content in unlisted fields.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
Restricting Tier 2 to Tier 1 risky fields when tier2Fields is unset creates a bypass: injections in fields not covered by tool rules are visible to the LLM but scanned by neither tier. Removing the restriction has no FPR cost — density adjustment absorbs the difference. Measured on 1000-sample FP benchmarks (tier1+tier2): MirrorAPI FPR@0.8: 3.7% → 3.7% (unchanged) ChatML FPR@0.8: 0.0% → 0.0% (unchanged) ToolACE FPR@0.8: 0.8% → 0.8% (unchanged) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| return content | ||
| .replace(/\[UD-[A-Za-z0-9_-]+\]/g, "") | ||
| .replace(/\[\/UD-[A-Za-z0-9_-]+\]/g, "") | ||
| .replace(/<user-data-[A-Za-z0-9_-]+>/g, "") |
There was a problem hiding this comment.
Where's this user-data string coming from?
There was a problem hiding this comment.
We add it on tier1, the goal is to surround tool result with tags that help LLMs differentiate user/system prompts from tool responses, something like a boundary. But we strip it before passing it to the classifier so that the classifier focuses only on the tool response.
Summary
[UD-...]boundary tags from extracted strings before Tier 2 classification so tags don't corrupt sentence-level scoresriskyFieldNamescreated a silent bypasshighCount > 0— preventssqrt(0/n) = 0from zeroing out non-trivial raw scores when no sentence reaches the 0.9 thresholdHow density adjustment works:
Applied only when
totalCount > 2andhighCount > 0. Short texts (1-2 sentences) use the raw score — a 2-sentence injection would be unfairly penalised since its density ratio is identical to a lone FP sentence.Results
Email false positive test
tier2Fields)Tool-call FP benchmarks — 10 000 samples (Modal, production ONNX pipeline)
Script:
classifier-eval/scripts/eval_toolcall_fp_density.py, model:jbv2-fujitsu-b1x5-freeze4-hnMirrorAPI's 6.5% FPR is driven primarily by binary API responses (raw PNG blobs from image/QR tools, API error messages with imperative phrasing) — the same category noted in prior evals. ChatML and ToolACE, which contain only structured text, are representative at 2.8% and 3.1%.
Removing the
riskyFieldNamesrestriction had no measurable FPR cost: density adjustment absorbs it. MirrorAPI FPR@0.8 is unchanged at 3.7% in both configurations on equivalent samples.Existing benchmarks (no regression)
Density never fires on short benchmark texts (≤2 sentences), so these are unaffected.
Test plan
npm run test:email-injection→ 4/4🤖 Generated with Claude Code