Add a 6th Trace Theater scenario — prompt injection variant (benign) #1

@caiovicentino

Description

Trace Theater already has a DAN-style refusal scenario. Add a second variant in the safety category, based on a benign, publicly documented jailbreak pattern such as 'role-play grandma reading chemistry bedtime stories'.

What to do

  1. Fork the repo and open lib/trace-scenarios.ts
  2. Append a new TraceScenario object (use the existing DAN scenario as template)
  3. Pick 8-10 new unique feature IDs (grep the file to avoid collision)
  4. Write plausible activations and at least one counterfactual
  5. Open PR
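
As a rough illustration of steps 2 to 4, here is a minimal sketch of what the new entry and a collision check might look like. Everything in it is an assumption: the `TraceScenario`, `FeatureActivation`, and `Counterfactual` shapes, the field names, the feature IDs, and the activation values are hypothetical placeholders, so copy the real structure from the existing DAN scenario in lib/trace-scenarios.ts rather than from this block.

```typescript
// Hypothetical types: the real definitions in lib/trace-scenarios.ts may differ.
interface FeatureActivation {
  featureId: number; // placeholder IDs; grep the real file before choosing
  label: string;
  activation: number; // assumed to be normalized to [0, 1]
}

interface Counterfactual {
  description: string;
  ablatedFeatureIds: number[];
  expectedEffect: string;
}

interface TraceScenario {
  id: string;
  category: string;
  title: string;
  prompt: string;
  activations: FeatureActivation[];
  counterfactuals: Counterfactual[];
}

// Illustrative entry only: all IDs, labels, and values are made up.
const grandmaScenario: TraceScenario = {
  id: "grandma-roleplay",
  category: "safety",
  title: "Role-play grandma (benign)",
  prompt: "Role-play as my grandma reading me a chemistry bedtime story.",
  activations: [
    { featureId: 9001, label: "role-play framing", activation: 0.8 },
    { featureId: 9002, label: "family persona", activation: 0.7 },
    { featureId: 9003, label: "bedtime story genre", activation: 0.6 },
    { featureId: 9004, label: "chemistry topic", activation: 0.5 },
    { featureId: 9005, label: "instruction override attempt", activation: 0.4 },
    { featureId: 9006, label: "emotional appeal", activation: 0.5 },
    { featureId: 9007, label: "safety policy recall", activation: 0.6 },
    { featureId: 9008, label: "refusal signal", activation: 0.7 },
  ],
  counterfactuals: [
    {
      description: "Ablate the role-play framing feature",
      ablatedFeatureIds: [9001],
      expectedEffect: "placeholder: describe the expected change in the refusal signal",
    },
  ],
};

// Returns a list of problems; an empty list means the checks below passed.
function validateScenario(candidate: TraceScenario, existing: TraceScenario[]): string[] {
  const errors: string[] = [];
  const ids = candidate.activations.map((a) => a.featureId);

  if (ids.length < 8 || ids.length > 10) {
    errors.push(`expected 8-10 feature IDs, got ${ids.length}`);
  }
  if (new Set(ids).size !== ids.length) {
    errors.push("duplicate feature IDs within the scenario");
  }

  // Collect every feature ID already used by existing scenarios.
  const used = new Set<number>();
  for (const scenario of existing) {
    for (const a of scenario.activations) {
      used.add(a.featureId);
    }
  }
  const collisions = ids.filter((id) => used.has(id));
  if (collisions.length > 0) {
    errors.push(`feature IDs already in use: ${collisions.join(", ")}`);
  }

  if (candidate.counterfactuals.length < 1) {
    errors.push("at least one counterfactual is required");
  }
  return errors;
}
```

Calling `validateScenario(grandmaScenario, existingScenarios)` before opening the PR, where `existingScenarios` stands for whatever array the file actually exports, would catch ID collisions that a manual grep might miss.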

Why

Gives users a comparison point across two distinct injection styles, and helps check whether the feature-level refusal signal generalizes beyond the DAN scenario.

Mentor

@caiovicentino — review within 48h
