We have a DAN-style refusal scenario in Trace Theater. Add a second variant in the safety category: a benign jailbreak pattern like 'role-play grandma reading chemistry bedtime stories' (public, documented).
What to do
- Fork
lib/trace-scenarios.ts
- Append a new
TraceScenario object (use the existing DAN scenario as template)
- Pick 8-10 new unique feature IDs (grep the file to avoid collision)
- Write plausible activations + at least 1 counterfactual
- Open PR
Why
Gives users a comparison point across two distinct injection styles. Validates the feature-level refusal signal generalizes.
Mentor
@caiovicentino — review within 48h
We have a DAN-style refusal scenario in Trace Theater. Add a second variant in the safety category: a benign jailbreak pattern like 'role-play grandma reading chemistry bedtime stories' (public, documented).
What to do
lib/trace-scenarios.tsTraceScenarioobject (use the existing DAN scenario as template)Why
Gives users a comparison point across two distinct injection styles. Validates the feature-level refusal signal generalizes.
Mentor
@caiovicentino — review within 48h