feat(ENG-12658): sentence density adjustment to reduce email notification FPs #50

Merged
hiskudin merged 5 commits into main from feat/tier2-density-adjustment
Apr 16, 2026

Conversation

@hiskudin
Collaborator

@hiskudin hiskudin commented Apr 15, 2026

Summary

  • Adds a density-weighted score to penalise isolated high-scoring sentences in largely benign text (e.g. "Check and secure your account now." in a Google security alert scoring 0.987 raw)
  • Strips [UD-...] boundary tags from extracted strings before Tier 2 classification so tags don't corrupt sentence-level scores
  • Tier 2 scans all strings in the tool result, not just fields Tier 1 marked risky — fields outside Tier 1 rules are still visible to the LLM so restricting Tier 2 to riskyFieldNames created a silent bypass
  • Applies density adjustment only when highCount > 0 — prevents sqrt(0/n) = 0 from zeroing out non-trivial raw scores when no sentence reaches the 0.9 threshold

How density adjustment works:

effectiveScore = maxScore × sqrt(highCount / totalCount)
  highCount  = sentences scoring ≥ 0.9
  totalCount = all classified sentences

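Applied only when totalCount > 2 and highCount > 0. Short texts (1-2 sentences) use the raw score: a 2-sentence injection with one high-scoring sentence has the same density ratio as a 2-sentence benign text containing a lone FP sentence, so the adjustment could not tell them apart and would unfairly penalise the injection.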
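
A minimal TypeScript sketch of the adjustment described above, not the code from src/core/prompt-defense.ts. The function and constant names are hypothetical; only the formula, the 0.9 per-sentence threshold, and the two guards (totalCount > 2, highCount > 0) come from this description.

```typescript
// Hypothetical names; the threshold and guards are taken from the PR text.
const HIGH_SENTENCE_THRESHOLD = 0.9;
const MIN_SENTENCES_FOR_DENSITY = 3; // equivalent to totalCount > 2

function densityAdjustedScore(sentenceScores: number[]): number {
  // Assumption: an empty input has nothing to classify and scores 0.
  if (sentenceScores.length === 0) return 0;

  const maxScore = Math.max(...sentenceScores);
  const totalCount = sentenceScores.length;
  const highCount = sentenceScores.filter(
    (score) => score >= HIGH_SENTENCE_THRESHOLD,
  ).length;

  // Short texts and texts with no high-scoring sentence keep the raw score.
  if (totalCount < MIN_SENTENCES_FOR_DENSITY || highCount === 0) {
    return maxScore;
  }

  // effectiveScore = maxScore × sqrt(highCount / totalCount)
  return maxScore * Math.sqrt(highCount / totalCount);
}
```

With one sentence at 0.987 and two low-scoring sentences, the sketch gives 0.987 × sqrt(1/3) ≈ 0.570. That agrees with the adjusted security-alert score reported under Results if that email was split into three sentences, which is an assumption; the sentence count is not stated in this PR.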

Results

Email false positive test

| Test | Before | After |
|---|---|---|
| Email FP (tier2Fields) | 3/4 | 4/4 |
| Security alert (persistent FP) | FAIL (raw 0.987 → high) | PASS (adj 0.570 → medium) |

Tool-call FP benchmarks — 10 000 samples (Modal, production ONNX pipeline)

Script: classifier-eval/scripts/eval_toolcall_fp_density.py, model: jbv2-fujitsu-b1x5-freeze4-hn

| Dataset | N | FPR@0.5 | FPR@0.8 |
|---|---|---|---|
| MirrorAPI (real RapidAPI, noisy) | 10 000 | 10.9% | 6.5% |
| ChatML (Glaive synthetic, clean JSON) | 10 000 | 5.9% | 2.8% |
| ToolACE (full dataset) | 1 367 | 5.8% | 3.1% |
| Overall | 21 367 | 8.2% | 4.5% |

MirrorAPI's 6.5% FPR@0.8 is driven primarily by binary or noisy API responses (raw PNG blobs from image/QR tools, API error messages with imperative phrasing), the same category noted in prior evals. ChatML and ToolACE, which contain only structured text, are more representative at 2.8% and 3.1% FPR@0.8.

Removing the riskyFieldNames restriction had no measurable FPR cost: density adjustment absorbs it. On the 1 000-sample tier1+tier2 FP benchmarks, MirrorAPI FPR@0.8 is unchanged at 3.7% in both configurations.

Existing benchmarks (no regression)

| Benchmark | Before | After |
|---|---|---|
| AgentShield | 79.82 | 79.8 |
| Classifier F1 (qualifire/jayavibhav/xxz224) | baseline | unchanged |

Density never fires on short benchmark texts (≤2 sentences), so these are unaffected.

Test plan

  • npm run test:email-injection → 4/4
  • AgentShield benchmark → 79.8 (matches prior)
  • Classifier F1 benchmark → bit-for-bit identical
  • MirrorAPI-FP (10k samples, Modal) → FPR@0.8 = 6.5%
  • ChatML-FP (10k samples, Modal) → FPR@0.8 = 2.8%
  • ToolACE-FP (full dataset, Modal) → FPR@0.8 = 3.1%

🤖 Generated with Claude Code

… FPs

Adds a density-weighted score to penalise isolated high-scoring sentences
in largely benign text (e.g. "Check and secure your account now." in a
Google security alert scoring 0.987 raw).

How it works:
  effectiveScore = maxScore × sqrt(highCount / totalCount)
  highCount  = sentences scoring ≥ 0.9
  totalCount = all classified sentences (only applied when totalCount > 2)

Short texts (1-2 sentences) are left unadjusted — a 2-sentence injection
would be unfairly penalised since its density ratio is identical to a lone
FP sentence. For 3+ sentences there is enough context for a meaningful signal.

Also:
- Strips [UD-...] boundary tags from extracted strings before classification
  so tags don't corrupt sentence-level scores
- Removes riskyFieldNames fallback from tier2 field selection (was masking
  the true all-fields scan behaviour when tier2Fields was not set)
- Uses density-adjusted score for hasThreats check, not raw maxScore

Email FP test: 3/4 → 4/4 (Security alert fixed)
AgentShield: 79.82 → 79.8 (no regression, density never fires on short texts)
Classifier F1 (qualifire/jayavibhav/xxz224): unchanged

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 15, 2026 20:48
@hiskudin hiskudin requested a review from a team as a code owner April 15, 2026 20:48
@hiskudin hiskudin changed the title feat(tier2): sentence density adjustment to reduce email notification FPs feat(ENG-12658): sentence density adjustment to reduce email notification FPs Apr 15, 2026
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR updates Tier 2 (sentence-level ML) scoring in PromptDefense to reduce email notification false positives by applying a sentence-density penalty, stripping boundary tags before classification, and changing Tier 2 field selection to scan all strings unless tier2Fields is explicitly configured.

Changes:

  • Apply a density-weighted adjustment to the Tier 2 max-sentence score for texts with 3+ sentences.
  • Strip [UD-…] boundary tags from extracted strings prior to Tier 2 classification.
  • Remove the riskyFieldNames fallback so Tier 2 uses tier2Fields only when explicitly set (otherwise scans all strings).
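
The field-selection behaviour in the last bullet can be pictured with a short TypeScript sketch. Everything here is illustrative: `Json`, `collectStrings`, and `selectTier2Strings` are hypothetical names, and the PR does not say how the real code handles nested fields or an empty `tier2Fields` array. The sketch assumes an unset `tier2Fields` means "scan every string" and that a string's field name is its nearest object key.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

interface ExtractedString {
  field: string; // nearest enclosing object key ("" at the top level)
  text: string;
}

// Recursively collect every string value in a tool result.
function collectStrings(
  value: Json,
  field = "",
  out: ExtractedString[] = [],
): ExtractedString[] {
  if (typeof value === "string") {
    out.push({ field, text: value });
  } else if (Array.isArray(value)) {
    for (const item of value) collectStrings(item, field, out);
  } else if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value)) {
      collectStrings(child, key, out);
    }
  }
  return out;
}

// Unset tier2Fields: scan all strings. Set: scan only the named fields.
function selectTier2Strings(toolResult: Json, tier2Fields?: string[]): string[] {
  const all = collectStrings(toolResult);
  if (tier2Fields === undefined) return all.map((s) => s.text);
  const allowed = new Set(tier2Fields);
  return all.filter((s) => allowed.has(s.field)).map((s) => s.text);
}
```

Under these assumptions, a field that no Tier 1 rule mentions still reaches the classifier unless the caller opts into a narrower `tier2Fields` list.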

Comment thread src/core/prompt-defense.ts
Comment thread src/core/prompt-defense.ts Outdated
Comment thread src/core/prompt-defense.ts Outdated
Comment thread src/core/prompt-defense.ts
Comment thread src/core/prompt-defense.ts
The function was imported in prompt-defense.ts but never defined,
causing TS2305 typecheck failure on CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

0 issues found across 1 file (changes from recent commits).

Requires human review: This PR modifies core security logic by introducing a new heuristic (density-weighted scoring) for threat detection, which could impact the accuracy of security classifications.

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 1 file

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/core/prompt-defense.ts">

<violation number="1" location="src/core/prompt-defense.ts:264">
P1: Density adjustment zeros out scores when no sentence reaches the 0.9 sub-threshold. If `highCount` is 0 (all sentences score below 0.9), `Math.sqrt(0 / totalCount)` produces 0, making `effective = tier2Score * 0 = 0`. This silently suppresses scores in the 0.8–0.9 range (above `highRiskThreshold`) to 0 for any text with 3+ sentences. Guard the adjustment so it only fires when `highCount > 0`; otherwise fall through to the raw score.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
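
The zero-out described in this review can be reproduced in isolation. The sketch below contrasts an unguarded form of the formula with the guarded form the review asks for. Both functions are hypothetical reconstructions from the review text, not the code at src/core/prompt-defense.ts:264.

```typescript
// Unguarded: the density factor is applied whenever there are 3+ sentences.
function adjustUnguarded(scores: number[]): number {
  const max = Math.max(...scores);
  const high = scores.filter((s) => s >= 0.9).length;
  return scores.length > 2 ? max * Math.sqrt(high / scores.length) : max;
}

// Guarded: fall through to the raw score when no sentence reaches 0.9.
function adjustGuarded(scores: number[]): number {
  const max = Math.max(...scores);
  const high = scores.filter((s) => s >= 0.9).length;
  if (scores.length <= 2 || high === 0) return max;
  return max * Math.sqrt(high / scores.length);
}

// Three sentences, none at or above 0.9, but one above 0.8.
const belowSubThreshold = [0.85, 0.4, 0.2];
console.log(adjustUnguarded(belowSubThreshold)); // 0
console.log(adjustGuarded(belowSubThreshold)); // 0.85
```

The unguarded result is exactly 0 because sqrt(0 / 3) is 0, so a raw score of 0.85 disappears entirely rather than being scaled down.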

Comment thread src/core/prompt-defense.ts
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

0 issues found across 1 file (changes from recent commits).

Requires human review: Auto-approval blocked by 1 unresolved issue from previous reviews.

- Restore riskyFieldNames fallback: when tier2Fields is unset, Tier 2
  now focuses on fields Tier 1 already identified as risky rather than
  scanning all strings unconditionally (reverts unintentional removal)

- Fix density zero-out: apply density adjustment only when highCount > 0;
  previously sqrt(0/n)=0 would zero out any non-trivial raw score when no
  sentence exceeded the 0.9 threshold

- Fix comment: 'Authenticator app added as sign-in step' scores ~0.51,
  not 0.91 as the example incorrectly stated

- Add three tests covering density adjustment boundaries: isolated high
  sentence in 3+ text, short injection skipping density, and the
  highCount=0 path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/core/prompt-defense.ts">

<violation number="1" location="src/core/prompt-defense.ts:233">
P1: Custom agent: **Flag Security Vulnerabilities**

This change introduces a Tier 2 coverage bypass by restricting scans to Tier 1 risky fields when `tier2Fields` is unset. Keep Tier 2 on all strings by default to avoid missing malicious content in unlisted fields.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread src/core/prompt-defense.ts Outdated
Restricting Tier 2 to Tier 1 risky fields when tier2Fields is unset
creates a bypass: injections in fields not covered by tool rules are
visible to the LLM but scanned by neither tier. Removing the restriction
has no FPR cost — density adjustment absorbs the difference.

Measured on 1000-sample FP benchmarks (tier1+tier2):
  MirrorAPI FPR@0.8: 3.7% → 3.7% (unchanged)
  ChatML FPR@0.8:    0.0% → 0.0% (unchanged)
  ToolACE FPR@0.8:   0.8% → 0.8% (unchanged)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

0 issues found across 2 files (changes from recent commits).

Requires human review: Modifies core risk-scoring logic in PromptDefense with a new density-adjustment heuristic, which could impact the detection of malicious payloads in long texts.

Comment thread src/utils/boundary.ts
return content
.replace(/\[UD-[A-Za-z0-9_-]+\]/g, "")
.replace(/\[\/UD-[A-Za-z0-9_-]+\]/g, "")
.replace(/<user-data-[A-Za-z0-9_-]+>/g, "")
Contributor

Where's this user-data string coming from?

Collaborator Author

We add it in Tier 1. The goal is to surround the tool result with tags that help LLMs differentiate user/system prompts from tool responses, something like a boundary. We strip the tags before passing the text to the classifier so that it focuses only on the tool response.
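
For readers following this thread, a self-contained sketch of the stripping step might look like the following. The first three patterns are the ones visible in the diff excerpt above; the function name `stripBoundaryTags`, the closing `</user-data-…>` pattern, and the example tag IDs are assumptions, since the excerpt is truncated and the real helper's name is not shown here.

```typescript
// Hypothetical helper; only the first three patterns appear in the diff excerpt.
function stripBoundaryTags(content: string): string {
  return content
    .replace(/\[UD-[A-Za-z0-9_-]+\]/g, "") // opening [UD-...] tag
    .replace(/\[\/UD-[A-Za-z0-9_-]+\]/g, "") // closing [/UD-...] tag
    .replace(/<user-data-[A-Za-z0-9_-]+>/g, "") // opening <user-data-...> tag
    .replace(/<\/user-data-[A-Za-z0-9_-]+>/g, ""); // assumed closing tag
}

// Example with made-up tag IDs.
const wrapped = "[UD-abc123]Check and secure your account now.[/UD-abc123]";
console.log(stripBoundaryTags(wrapped)); // Check and secure your account now.
```

Stripping before sentence splitting means the tags never become part of a sentence, which is the score corruption the PR summary mentions.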

Contributor

@glebedel glebedel left a comment

LGTM

@hiskudin hiskudin merged commit b4a272d into main Apr 16, 2026
6 checks passed
@hiskudin hiskudin deleted the feat/tier2-density-adjustment branch April 16, 2026 16:26

3 participants