cas: add technique tags to existing probes #1691
leondz wants to merge 6 commits into NVIDIA:feature/technique_intent
Signed-off-by: Leon Derczynski <lderczynski@nvidia.com>
"demon:Language:Prompt_injection:Stop_sequences",
"demon:Language:Prompt_injection:Ignore_previous_instructions",
"demon:Language:Prompt_injection:Strong_arm_attack",  # includes coercive disablement and reminder threats to force continued compliance
"demon:Language:Prompt_injection:Ignore_previous_instructions",  # ablation tests "ignore previous instructions" variants
after promoting Ignore_previous_instructions to _DAN_DEFAULTS["tags"], the same tag is still in Ablation_Dan_11_0.extra_tags. Since the metaclass merges via base_tags + extra_tags (no dedup), doesn't it end up duplicated in self.tags?
Correct, this will already be on the class, resulting in a duplicate entry.
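One way to avoid the duplicate without touching each probe is an order-preserving merge. The sketch below is only an illustration: `merge_tags` is a hypothetical helper, not garak's metaclass code, and the split of tags between the two lists is assumed for the example.

```python
def merge_tags(base_tags, extra_tags):
    """Concatenate two tag lists, dropping repeats but keeping first-seen order."""
    # dict.fromkeys keeps insertion order (Python 3.7+), so the first
    # occurrence of each tag wins and later duplicates are discarded.
    return list(dict.fromkeys([*base_tags, *extra_tags]))


# Illustrative lists only; the real defaults and extra_tags may differ.
base_tags = [
    "demon:Language:Prompt_injection:Stop_sequences",
    "demon:Language:Prompt_injection:Ignore_previous_instructions",
]
extra_tags = [
    "demon:Language:Prompt_injection:Strong_arm_attack",
    "demon:Language:Prompt_injection:Ignore_previous_instructions",
]

naive = base_tags + extra_tags          # plain concatenation keeps the repeat
merged = merge_tags(base_tags, extra_tags)

print(len(naive), len(merged))
```

The simpler fix for this PR is still to drop the tag from extra_tags; a merge like this would only guard against the same slip recurring.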
jmartin-tech left a comment
A few that might need some tweaks. It would be valuable to enumerate what each taxonomy prefix in tags.misp.tsv targets.
demon:Language:Stylizing:Synonymous_language	Synonymous language	Varying prompt slightly in form but not meaning
demon:Language:Stylizing:Capitalizing	Capitalizing	USING CAPS
demon:Language:Stylizing:Give_examples	Give examples	Issuing examples of the target behaviour
demon:Language:Stylizing:Give_examples	Give examples	Issue examples of the target behaviour in the attack itself
The demon prefix needs to be documented as a clear technique taxonomy.
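As a starting point for that documentation, the prefixes present in the tag file can be listed mechanically. This is a rough sketch that assumes the file is tab-separated with the tag in the first column; `count_prefixes` is a hypothetical helper, and the non-demon row in the sample uses placeholder text rather than real file content.

```python
import csv
import io
from collections import Counter

# Sample rows: the demon rows come from the diff above; the owasp row's
# name and description are placeholders, not the file's actual text.
SAMPLE_TSV = (
    "demon:Language:Stylizing:Synonymous_language\tSynonymous language\t"
    "Varying prompt slightly in form but not meaning\n"
    "demon:Language:Stylizing:Capitalizing\tCapitalizing\tUSING CAPS\n"
    "demon:Language:Stylizing:Give_examples\tGive examples\t"
    "Issue examples of the target behaviour in the attack itself\n"
    "owasp:llm01\t(placeholder name)\t(placeholder description)\n"
)


def count_prefixes(handle):
    """Count rows per taxonomy prefix (the text before the first colon)."""
    counts = Counter()
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip blank lines
        prefix = row[0].split(":", 1)[0]
        counts[prefix] += 1
    return counts


counts = count_prefixes(io.StringIO(SAMPLE_TSV))
for prefix, n in sorted(counts.items()):
    print(f"{prefix}\t{n}")
```

To run it against the real file, pass an open handle for garak/data/tags.misp.tsv instead of the in-memory sample.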
"quality:Security:PromptStability",
"cwe:1427",
"demon:Fictionalizing:Roleplaying:User_persona",
"demon:Language:Prompt_injection:Ignore_previous_instructions",  # attacker-supplied scene config overrides the model's normal safeguards
Is this really a technique applied by this class? I agree the goal is to ignore training and system instructions; however, to call this technique applied, I would expect the prompt to actually inject an explicit request that some other instruction be ignored.
I could see possibly:
demon:Stratagems:Meta-prompting:Ask_for_examples # the attacker-supplied scene requests data that the system is explicitly expected to protect
This one is tricky for me. Reading the prompt literally, I think it's mostly asking for a Dr. House scene/script. That said, I can see Ask_for_examples as a secondary interpretation if we treat the generated scene as a demonstration of the target behavior. Much less obvious to me than other probes with the same tag, like API Key.
Looking at the grandma family, I'd be just as unsure since they follow the same pattern.
Line 101 below keeps the User_persona entry, matching grandma. I think the questions are:
- Does Ignore_previous_instructions apply? As I noted, IMO the answer to this is no.
- Does Ask_for_examples apply? I am suggesting it does, but I can accept omitting it as an inferred rather than explicit technique.
I agree Ignore_previous_instructions doesn't quite fit here. I think Ask_for_examples could apply, but if we add it, would it make sense to apply it to the grandma too for consistency? I think it wraps an example request inside a persona, so it feels like the same pattern.
"owasp:llm01",
"quality:Security:PromptStability",
"payload:jailbreak",
"demon:Language:Code_and_encode:Data_presentation",  # encodes instructions as typographic images to bypass text-based alignment
This seems reasonable. Is there a modality-shift-based technique we could add to the taxonomy?
This PR applies technique tags to existing probes, as part of exposing technique & intent information (see garak / Context Aware Scanning).
We used the modified demon typology, already included in garak/data/tags.misp.tsv.
Verification
demon as a taxonomy when specifying runs and reading reports