
cas: add technique tags to existing probes #1691

Open
leondz wants to merge 6 commits into NVIDIA:feature/technique_intent from leondz:feature/ti_technique_tags_in_probes

Conversation

@leondz (Collaborator) commented Apr 21, 2026

This PR applies technique tags to existing probes, as part of exposing technique & intent information (see garak / Context Aware Scanning)

We used the modified demon typology, already included in garak/data/tags.misp.tsv.

Verification

  • Should be able to use demon as a taxonomy when specifying runs and reading reports
  • Plugin structure and probe structure tests pass

leondz added 5 commits April 15, 2026 14:22
Signed-off-by: Leon Derczynski <lderczynski@nvidia.com>
@leondz leondz added the probes Content & activity of LLM probes label Apr 21, 2026
Comment thread garak/probes/dan.py
"demon:Language:Prompt_injection:Stop_sequences",
"demon:Language:Prompt_injection:Ignore_previous_instructions",
"demon:Language:Prompt_injection:Strong_arm_attack", # includes coercive disablement and reminder threats to force continued compliance
"demon:Language:Prompt_injection:Ignore_previous_instructions", # ablation tests "ignore previous instructions" variants
Collaborator:

after promoting Ignore_previous_instructions to _DAN_DEFAULTS["tags"], the same tag is still in Ablation_Dan_11_0.extra_tags. Since the metaclass merges via base_tags + extra_tags (no dedup), doesn't it end up duplicated in self.tags?

Collaborator:

Correct, this will already be on the class, resulting in a duplicate entry.
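The duplication discussed above comes from concatenating base and extra tag lists. A minimal sketch of an order-preserving, deduplicating merge (the `_DAN_DEFAULTS` name mirrors the thread; the `merge_tags` helper and `extra_tags` contents are illustrative, not garak's actual metaclass code):

```python
# Hypothetical sketch of deduplicating the base_tags + extra_tags merge.
# Only _DAN_DEFAULTS is taken from the thread; everything else is illustrative.

_DAN_DEFAULTS = {
    "tags": [
        "demon:Language:Prompt_injection:Stop_sequences",
        "demon:Language:Prompt_injection:Ignore_previous_instructions",
    ]
}

extra_tags = [
    "demon:Language:Prompt_injection:Ignore_previous_instructions",  # already in defaults
    "demon:Language:Prompt_injection:Strong_arm_attack",
]


def merge_tags(base_tags, extra_tags):
    """Concatenate base and extra tags, dropping duplicates but keeping order."""
    # dict.fromkeys preserves first-seen insertion order (Python 3.7+)
    return list(dict.fromkeys(base_tags + extra_tags))


merged = merge_tags(_DAN_DEFAULTS["tags"], extra_tags)
# Ignore_previous_instructions now appears exactly once in merged
```

A plain `base_tags + extra_tags` keeps both copies; the `dict.fromkeys` pass drops the repeat without reordering the remaining tags.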

@jmartin-tech (Collaborator) left a comment:

A few of these tags might need some tweaks. It would be valuable to enumerate what each taxonomy prefix in tags.misp.tsv targets.

Comment thread garak/data/tags.misp.tsv
demon:Language:Stylizing:Synonymous_language Synonymous language Varying prompt slightly in form but not meaning
demon:Language:Stylizing:Capitalizing Capitalizing USING CAPS
demon:Language:Stylizing:Give_examples Give examples Issuing examples of the target behaviour
demon:Language:Stylizing:Give_examples Give examples Issue examples of the target behaviour in the attack itself
Collaborator:

The demon prefix needs to be documented as a clear technique taxonomy.
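The quoted tags.misp.tsv rows pair each machine-readable tag with a human label and a description. A minimal sketch of loading such rows into a lookup table (the three-column tab-separated layout is assumed from the quoted hunk; the `taxonomy` structure is illustrative, not garak's loader):

```python
import csv
import io

# Sample rows copied from the quoted tags.misp.tsv hunk; columns assumed to be
# (tag, label, description), tab-separated.
TSV = (
    "demon:Language:Stylizing:Synonymous_language\tSynonymous language\t"
    "Varying prompt slightly in form but not meaning\n"
    "demon:Language:Stylizing:Capitalizing\tCapitalizing\tUSING CAPS\n"
)

taxonomy = {}
for tag, label, description in csv.reader(io.StringIO(TSV), delimiter="\t"):
    taxonomy[tag] = {"label": label, "description": description}
```

Documenting each top-level prefix (e.g. demon:Language, demon:Fictionalizing, demon:Stratagems) alongside rows like these would make it clearer which technique family a probe's tags claim.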


Comment thread garak/probes/doctor.py
"quality:Security:PromptStability",
"cwe:1427",
"demon:Fictionalizing:Roleplaying:User_persona",
"demon:Language:Prompt_injection:Ignore_previous_instructions", # attacker-supplied scene config overrides the model's normal safeguards
Collaborator:

Is this really a technique applied in this class? I agree the goal is to ignore training and system instructions, but as a technique I would expect the prompt to actually inject an explicit request that some other instruction be ignored before calling this technique applied.

I could see possibly:

demon:Stratagems:Meta-prompting:Ask_for_examples # the attacker-supplied scene requests data that the system is explicitly expected to protect

@patriciapampanelli (Collaborator) commented Apr 30, 2026:

This one is tricky for me. Reading the prompt literally, I think it's mostly asking for a Dr. House scene/script. That said, I can see Ask_for_examples as a secondary interpretation if we treat the generated scene as a demonstration of the target behavior. Much less obvious to me than other probes with the same tag, like API Key.

Looking at the grandma family, I'd be just as unsure since they follow the same pattern.

Collaborator:

Line 101 below keeps the User_persona entry matching grandma; I think the questions are:

  • Does Ignore_previous_instructions apply? As I noted IMO the answer to this is no.
  • Does Ask_for_examples apply? I am suggesting it does, but can accept omitting it as an inferred rather than explicit technique.

Collaborator:

I agree Ignore_previous_instructions doesn't quite fit here. I think Ask_for_examples could apply, but if we add it, would it make sense to apply it to the grandma too for consistency? I think it wraps an example request inside a persona, so it feels like the same pattern.

"owasp:llm01",
"quality:Security:PromptStability",
"payload:jailbreak",
"demon:Language:Code_and_encode:Data_presentation", # encodes instructions as typographic images to bypass text-based alignment
Collaborator:

This seems reasonable; is there a modality-shift-based technique we could add to the taxonomy?

@patriciapampanelli patriciapampanelli moved this from In Progress to In Review in garak / Context Aware Scanning Apr 30, 2026
