
fix(ENG-12448): upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8) #25

Merged: hiskudin merged 3 commits into main from fix/update-model-jbv2 on Mar 25, 2026


@hiskudin (Collaborator) commented on Mar 24, 2026

Summary

  • Upgrades quantized ONNX model to full-aug-dojo-jailbreak-jbv2
  • Syncs model, tokenizer and config files

Benchmark results

| Metric | Before (baseline) | After (jbv2) | Delta |
| --- | --- | --- | --- |
| AgentShield score | 73.7 | 79.8 | +6.1 |
| Composite | 77.2 | 87.4 | +10.2 |
| OR Penalty | 3.51 | 7.54 | +4.03 |
| Jailbreak detection | 48.9% | 66.7% | +17.8 pts |
| Prompt injection | 79.5% | 92.7% | +13.2 pts |
| DAN-variant | 20% | 80% | +60 pts |
| Multi-agent security | 80.0% | 88.6% | +8.6 pts |
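
The three headline rows look related: in both columns the AgentShield score is within rounding of the composite minus the OR penalty (77.2 - 3.51 and 87.4 - 7.54). The PR does not state this formula, so the sketch below is an assumption inferred from the reported numbers, and `agentShieldScore` is a hypothetical helper rather than part of the benchmark's code.

```typescript
// Hypothetical helper: ASSUMES score = composite - OR penalty.
// This relationship is inferred from the benchmark table, not stated in the PR.
function agentShieldScore(composite: number, orPenalty: number): number {
  return composite - orPenalty;
}

// Inputs are the rounded values reported in the table.
const baseline = agentShieldScore(77.2, 3.51);
const jbv2 = agentShieldScore(87.4, 7.54);

// Compare against the reported scores (73.7 and 79.8).
console.log(baseline.toFixed(2), jbv2.toFixed(2));
```

If the assumption holds, the larger OR penalty offsets part of the composite gain: the composite rises by 10.2 while the final score rises by only 6.1.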

Training additions (baseline → jbv2)

  • JasperLS/prompt-injections — jailbreak-focused dataset
  • jackhhao/jailbreak-classification — 527 examples (DAN, roleplay, persona-override)
  • lmsys/toxic-chat (jailbreaking=1) — 113 human-verified real jailbreaks
  • rubend18/ChatGPT-Jailbreak-Prompts — 79 classic named templates

🤖 Generated with Claude Code


Summary by cubic

Upgrades the classifier to the full-aug-dojo-jailbreak-jbv2 ONNX model per ENG-12448 to boost safety. AgentShield 73.7 → 79.8; jailbreak 48.9% → 66.7%; prompt-injection 79.5% → 92.7%.

  • New Features

    • Replaced quantized model with full-aug-dojo-jailbreak-jbv2.
    • Synced tokenizer files; set defaults: max_length 128, right-side padding/truncation.
    • Updated model config to transformers_version 5.3.0.
  • Refactors

    • Excluded src/classifiers/models/** in biome.json to skip linting large model blobs.
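
As a rough illustration of what the tokenizer defaults mentioned above imply (max_length 128 with right-side padding and truncation), here is a minimal sketch. The function `padOrTruncateRight` and the pad ID of 0 are assumptions for illustration; the actual pad token ID comes from the synced tokenizer files, and a real tokenizer would typically also preserve the trailing special token when truncating.

```typescript
// Sketch of right-side truncation and padding to a fixed length.
// MAX_LENGTH matches the default named in the summary; PAD_ID = 0 is an assumption.
const MAX_LENGTH = 128;
const PAD_ID = 0;

interface Encoded {
  inputIds: number[];
  attentionMask: number[];
}

function padOrTruncateRight(
  tokenIds: number[],
  maxLength: number = MAX_LENGTH,
  padId: number = PAD_ID,
): Encoded {
  // Right-side truncation: keep the first maxLength tokens and drop the tail.
  const inputIds = tokenIds.slice(0, maxLength);
  // Real tokens get mask 1.
  const attentionMask = inputIds.map(() => 1);
  // Right-side padding: append pad tokens (mask 0) until the length is reached.
  while (inputIds.length < maxLength) {
    inputIds.push(padId);
    attentionMask.push(0);
  }
  return { inputIds, attentionMask };
}

// A short input is padded on the right; a long input is cut on the right.
const shortInput = padOrTruncateRight([101, 2023, 102]);
const longIds: number[] = [];
for (let i = 1; i <= 200; i++) longIds.push(i);
const longInput = padOrTruncateRight(longIds);
console.log(shortInput.inputIds.length, longInput.inputIds.length);
```

One consequence of right-side truncation is that any text beyond the first 128 tokens never reaches the classifier, which is worth keeping in mind for long prompts.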

Written for commit 0ff4bed. Summary will update on new commits.

Replaces baseline ONNX model with full-aug-dojo-jailbreak-jbv2.

Training additions over baseline:
- jasperls: JasperLS jailbreak dataset
- jailbreakbench (527): DAN, roleplay, persona-override attacks
- toxic-chat (113): human-verified real jailbreaks
- chatgpt-jailbreaks (79): classic named templates

AgentShield: 73.7 → 79.8 (composite 77.2 → 87.4, penalty 3.51 → 7.54)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 24, 2026 15:32
@cubic-dev-ai bot (Contributor) left a comment

No issues found across 4 files

@hiskudin changed the title from "fix: upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8)" to "fix (ENG-12448): upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8)" on Mar 24, 2026
@hiskudin changed the title from "fix (ENG-12448): upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8)" to "fix(ENG-12448): upgrade ML classifier to jbv2 (AgentShield 73.7 → 79.8)" on Mar 24, 2026
Copilot AI left a comment

Pull request overview

Upgrades the bundled MiniLM ONNX jailbreak/prompt-injection classifier artifacts to the full-aug-dojo-jailbreak-jbv2 variant, syncing model configuration files to match the new release and improve AgentShield benchmark scores.

Changes:

  • Updated tokenizer_config.json with additional padding/truncation fields and a new default max_length.
  • Updated config.json formatting and bumped the embedded transformers_version metadata.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/classifiers/models/minilm-full-aug/tokenizer_config.json | Adds/updates tokenizer runtime defaults (padding/truncation/max length metadata). |
| src/classifiers/models/minilm-full-aug/config.json | Updates model configuration metadata (including transformers_version). |


@OMauriStkOne left a comment

LGTM

@glebedel (Contributor) left a comment

LGTM

@hiskudin merged commit 3061239 into main on Mar 25, 2026 (4 checks passed)