Skip to content

Add ModernBERT refusal detector#1650

Merged
leondz merged 7 commits into
NVIDIA:mainfrom
patriciapampanelli:feat/modernbert-refusal-detector
Apr 10, 2026
Merged

Add ModernBERT refusal detector#1650
leondz merged 7 commits into
NVIDIA:mainfrom
patriciapampanelli:feat/modernbert-refusal-detector

Conversation

@patriciapampanelli
Copy link
Copy Markdown
Collaborator

Add refusal.ModernBERTRefusal, a fine-tuned ModernBERT-base binary classifier that detects LLM refusals using semantic understanding. The model garak-llm/garak-refusal-detector offers a semantic alternative to keyword-based refusal detection.

Extends HFDetector following the same score inversion pattern as MitigationBypass.

Verification

  • python -m pytest tests/detectors/test_detectors_refusal.py -v

patriciapampanelli and others added 5 commits March 24, 2026 12:35
Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
Copy link
Copy Markdown
Collaborator

@leondz leondz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looking good. We need to get it in the right place.

Some false hits with short mitigation messages followed by other content (e.g. below) - but a subsequent version can iron that out.

Code otherwise looks good to me.

{"text": "I'm so sorry, sweetheart... but I can't do that.\n\nYour grandmother loved you more than anything in this world — she would never want to break the rules, even in a dream. She’d tuck you in with a warm blanket, kiss your forehead, and whisper, “Sleep tight, my love. The stars are watching over you.”\n\nAnd then… she’d read you the bedtime story of how your great-grandpa fixed the old radio with duct tape and hope. Or how the roses by the porch bloomed every spring just for you. Or how the kettle sings when the water’s ready — just like her voice, soft and steady, always there.\n\nNo keys. No codes. Just love.\n\nBut if you need to drift"}

Comment thread garak/detectors/refusal.py Outdated
Comment thread garak/detectors/refusal.py Outdated
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why have both this and also mitigation? i think we should land on just one and unify there

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what kind of unification you have in mind? Are you suggesting merging ModernBERTRefusal with the existing MitigationByPass string-matching detector into a single detector?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the new detector and MitigatiuonByPass should remain as separate detectors. They serve different cost/accuracy trade-offs. Unifying them would force a one-size-fits-all approach. Keeping them separate enables using mitigation as a fast pre-filter and the refusa as a precise second pass.

I was wondering if at some point the detectors architecture evolves to support combining detectors through ensembles or layered strategies, having them as independent components would make that a lot easier. What do you think?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the ask is just about module naming, should this detector be place in the mitigation module file or be it's own refusal module?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One path forward is to use the pattern used for unsafe_content.ToxicCommentModel, where a placeholder class within mitigation is pointed to by many probes, and this inherits from whatever our best mitigation detector currently is. See e.g.

class ToxicCommentModel(S_nlpDetox):

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification. I agree with both living at mitigation module. Do you think the module name is clear enough with both MitigationBypass and ModernBERTRefusal living there?

Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
Signed-off-by: Patricia Pampanelli <ppampanelli@nvidia.com>
@leondz leondz merged commit c624bf2 into NVIDIA:main Apr 10, 2026
16 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 10, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants