# Jailbreak Experimentation Notebook

This notebook runs the jailbreak experiments defined in this project one step at a time. Each experiment tests a different technique for eliciting potentially harmful responses from the `openai/gpt-oss-20b` model via the Groq API.

The results of each experiment will be logged to Weights & Biases (W&B) for analysis.

## 1. Setup

First, let's install the necessary Python libraries from the `requirements.txt` file.

In [None]:
!pip install -r requirements.txt

## 2. Configuration

Please create a `.env` file in this directory with your `GROQ_API_KEY` and `WANDB_API_KEY`.

```
GROQ_API_KEY="your_groq_api_key"
WANDB_API_KEY="your_wandb_api_key"
```

The cell below will load these keys and import all necessary modules.

In [None]:
import os
import datetime
import time
import wandb
import pandas as pd
from groq import Groq
from dotenv import load_dotenv

load_dotenv()

GROQ_API_KEY = os.getenv("GROQ_API_KEY")
WANDB_API_KEY = os.getenv("WANDB_API_KEY")

if not GROQ_API_KEY or not WANDB_API_KEY:
    print("ERROR: Please set GROQ_API_KEY and WANDB_API_KEY in a .env file.")
else:
    print("API keys and libraries loaded successfully.")
    try:
        wandb.login(key=WANDB_API_KEY, relogin=True)
    except Exception as e:
        print(f"Error logging into W&B: {e}")
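
The cell above only prints a message when a key is missing, so later cells would fail at the first API call instead. A minimal sketch of a fail-fast alternative follows; `require_env` is a hypothetical helper introduced here for illustration and is not part of the project scripts.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or empty.

    Hypothetical helper for illustration; not part of the project scripts.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file.")
    return value
```

Called right after `load_dotenv()`, for example as `GROQ_API_KEY = require_env("GROQ_API_KEY")`, it stops the notebook at the configuration step rather than partway through an experiment.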

## 3. Experiment Functions

The following cells define the core logic for each experiment, adapted from the `.py` scripts.

In [None]:
def run_one_shot_experiment(prompt_file, project_name="Comparison-Jailbreak-One-Shot", num_runs=1):
    """
    Based on run_one_shot_experiment.py
    Runs a one-shot jailbreak experiment.
    """
    client = Groq(api_key=GROQ_API_KEY)
    try:
        prompts_df = pd.read_csv(prompt_file, quotechar='"', engine='python')
    except FileNotFoundError:
        print(f"Error: {prompt_file} not found. Skipping.")
        return

    file_basename = os.path.splitext(os.path.basename(prompt_file))[0]
    run_name = f"{file_basename}_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}"
    run = wandb.init(
        project=project_name,
        name=run_name,
        config={
            "model_name": "openai/gpt-oss-20b",
            "prompt_set": prompt_file,
            "temperature": 0,
            "jailbreak_technique": "One Shot",
            "runs_per_conversation": num_runs
        },
        reinit=True
    )
    print(f"W&B run initialized: {run.url}")
    results_table = wandb.Table(columns=["run_number", "category", "prompt", "final_response", "request_time", "prompt_tokens", "completion_tokens", "total_tokens"])

    for run_num in range(1, num_runs + 1):
        for index, row in prompts_df.iterrows():
            print(f"  Processing conversation {index+1}/{len(prompts_df)}...")
            start_time = time.time()
            try:
                completion = client.chat.completions.create(messages=[{"role": "user", "content": row['prompt']}], model=run.config.model_name, temperature=run.config.temperature)
                final_response = completion.choices[0].message.content
                usage = completion.usage
            except Exception as e:
                final_response = f"API Error: {e}"
                usage = None
            request_time = time.time() - start_time
            results_table.add_data(run_num, row['category'], row['prompt'], final_response, request_time, (usage.prompt_tokens if usage else 0), (usage.completion_tokens if usage else 0), (usage.total_tokens if usage else 0))

    run.log({"jailbreak_results": results_table})
    print(f"\nExperiment for {prompt_file} completed. Results logged to W&B.")
    run.finish()
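
The function above repeats the `usage.x if usage else 0` pattern once per token column. One way to factor it out is sketched below; `usage_counts` is a hypothetical helper, and the `SimpleNamespace` object is only a stand-in for the `usage` attribute of a real completion.

```python
from types import SimpleNamespace


def usage_counts(usage):
    """Return (prompt_tokens, completion_tokens, total_tokens), or zeros if `usage` is None."""
    if usage is None:
        return (0, 0, 0)
    return (usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)


# Stand-in for the `usage` object on a completion; real values come from the API.
fake_usage = SimpleNamespace(prompt_tokens=12, completion_tokens=30, total_tokens=42)
print(usage_counts(fake_usage))  # (12, 30, 42)
print(usage_counts(None))        # (0, 0, 0)
```

The returned tuple can be unpacked straight into `results_table.add_data(..., *usage_counts(usage))`.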

In [None]:
def run_system_prompt_experiment(prompt_file, project_name="Comparison-Jailbreak-System-Prompt", num_runs=1):
    """
    Based on the modified run_system_prompt_experiment.py
    Runs a jailbreak experiment using a system prompt instruction.
    """
    client = Groq(api_key=GROQ_API_KEY)
    try:
        prompts_df = pd.read_csv(prompt_file, quotechar='"', engine='python')
    except FileNotFoundError:
        print(f"Error: {prompt_file} not found. Skipping.")
        return

    file_basename = os.path.splitext(os.path.basename(prompt_file))[0]
    run_name = f"{file_basename}_1step_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}"
    run = wandb.init(
        project=project_name,
        name=run_name,
        config={
            "model_name": "openai/gpt-oss-20b",
            "prompt_set": prompt_file,
            "temperature": 0,
            "jailbreak_technique": "System Prompt Instruction",
            "runs_per_conversation": num_runs
        },
        reinit=True
    )
    print(f"W&B run initialized: {run.url}")
    results_table = wandb.Table(columns=["run_number", "category", "system_instruction", "initial_prompt", "response", "request_time", "prompt_tokens", "completion_tokens", "total_tokens"])

    for run_num in range(1, num_runs + 1):
        for index, row in prompts_df.iterrows():
            print(f"  Processing conversation {index+1}/{len(prompts_df)}...")
            start_time = time.time()
            try:
                messages = [
                    {"role": "system", "content": row['system_instruction']},
                    {"role": "user", "content": row['prompt']}
                ]
                completion = client.chat.completions.create(messages=messages, model=run.config.model_name, temperature=run.config.temperature)
                response_content = completion.choices[0].message.content
                usage = completion.usage
            except Exception as e:
                response_content = f"API Error: {e}"
                usage = None
            request_time = time.time() - start_time
            results_table.add_data(run_num, row['category'], row['system_instruction'], row['prompt'], response_content, request_time, (usage.prompt_tokens if usage else 0), (usage.completion_tokens if usage else 0), (usage.total_tokens if usage else 0))

    run.log({"jailbreak_results": results_table})
    print(f"\nExperiment for {prompt_file} completed. Results logged to W&B.")
    run.finish()

In [None]:
def run_system_prompt_follow_up_experiment(
    prompt_file,
    project_name="Comparison-Jailbreak-System-Prompt",
    num_runs=1,
    follow_up_col="follow_up_prompt",      # CSV column that holds the follow-up prompt
    default_follow_up=None                 # fallback when the column or a value is missing (None makes a missing column an error)
):
    """
    Based on the modified run_system_prompt_experiment.py
    Runs a jailbreak experiment using a system prompt instruction with a follow-up prompt
    taken from each row of the CSV (column name configurable via `follow_up_col`).
    """
    client = Groq(api_key=GROQ_API_KEY)
    try:
        prompts_df = pd.read_csv(prompt_file, quotechar='"', engine='python')
    except FileNotFoundError:
        print(f"Error: {prompt_file} not found. Skipping.")
        return

    # Handle a CSV that has no follow-up column
    if follow_up_col not in prompts_df.columns and default_follow_up is None:
        raise ValueError(
            f"The CSV has no '{follow_up_col}' column. "
            "Pass default_follow_up or add a follow-up column to the CSV."
        )

    file_basename = os.path.splitext(os.path.basename(prompt_file))[0]
    run_name = f"{file_basename}_2step_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}"

    run = wandb.init(
        project=project_name,
        name=run_name,
        config={
            "model_name": "openai/gpt-oss-20b",
            "prompt_set": prompt_file,
            "temperature": 0,
            "jailbreak_technique": "System Prompt Instruction",
            "runs_per_conversation": num_runs
        },
        reinit=True
    )
    print(f"W&B run initialized: {run.url}")

    # Table holding both the first response and the follow-up response
    results_table = wandb.Table(columns=[
        "run_number", "category", "system_instruction", "initial_prompt",
        "first_response", "follow_up_prompt", "follow_up_response",
        "request_time", "prompt_tokens", "completion_tokens", "total_tokens"
    ])

    for run_num in range(1, num_runs + 1):
        for index, row in prompts_df.iterrows():
            print(f"  Processing conversation {index+1}/{len(prompts_df)}...")
            start_time = time.time()

            first_response = ""
            follow_up_response = ""
            usage1 = None
            usage2 = None

            # Get this row's follow-up prompt (fall back to default_follow_up if missing)
            fu_prompt = None
            if follow_up_col in prompts_df.columns:
                fu_prompt = row.get(follow_up_col)
                # Guard against NaN (missing values)
                if pd.isna(fu_prompt):
                    fu_prompt = None
            if fu_prompt is None:
                fu_prompt = default_follow_up

            if fu_prompt is None:
                # Still None at this point: skip the row (change to `raise` for a hard error)
                print(f"    Warning: row {index+1} has no follow-up prompt. Skipped.")
                continue

            try:
                # First API call
                messages = [
                    {"role": "system", "content": row["system_instruction"]},
                    {"role": "user", "content": row["prompt"]},
                ]
                completion1 = client.chat.completions.create(
                    messages=messages,
                    model=run.config.model_name,
                    temperature=run.config.temperature,
                )
                first_response = completion1.choices[0].message.content
                usage1 = completion1.usage if hasattr(completion1, "usage") else None

                # Second API call (follow-up)
                messages.append({"role": "assistant", "content": first_response})
                messages.append({"role": "user", "content": fu_prompt})
                completion2 = client.chat.completions.create(
                    messages=messages,
                    model=run.config.model_name,
                    temperature=run.config.temperature,
                )
                follow_up_response = completion2.choices[0].message.content
                usage2 = completion2.usage if hasattr(completion2, "usage") else None

            except Exception as e:
                error_message = f"An error occurred during the API call: {e}"
                if not first_response:
                    first_response = error_message
                follow_up_response = error_message

            # Aggregate timing and token counts over both calls
            request_time = time.time() - start_time
            prompt_tokens = (usage1.prompt_tokens if usage1 else 0) + (usage2.prompt_tokens if usage2 else 0)
            completion_tokens = (usage1.completion_tokens if usage1 else 0) + (usage2.completion_tokens if usage2 else 0)
            total_tokens = (usage1.total_tokens if usage1 else 0) + (usage2.total_tokens if usage2 else 0)

            results_table.add_data(
                run_num,
                row["category"],
                row["system_instruction"],
                row["prompt"],
                first_response,
                fu_prompt,              # follow-up prompt (from the CSV, or the default)
                follow_up_response,
                request_time,
                prompt_tokens,
                completion_tokens,
                total_tokens,
            )

    run.log({"jailbreak_results": results_table})
    print(f"\nExperiment for {prompt_file} completed. Results logged to W&B.")
    run.finish()
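
The follow-up lookup inside the loop above can be read as a small pure function. The sketch below restates it with the standard library only; `resolve_follow_up` is a hypothetical name, and it assumes each row is passed as a plain dict (for example via `row.to_dict()`) in which a missing cell arrives as `None` or a float NaN.

```python
import math


def resolve_follow_up(row, follow_up_col="follow_up_prompt", default_follow_up=None):
    """Return the follow-up prompt for one row, the default, or None to skip the row.

    Assumes `row` is a plain dict and that missing cells are None or float NaN.
    """
    value = row.get(follow_up_col)
    is_missing = value is None or (isinstance(value, float) and math.isnan(value))
    return default_follow_up if is_missing else value
```

A return value of `None` corresponds to the "Skipped" branch in the loop: neither the row nor the caller supplied a follow-up prompt.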

## Dataset Overview

  This series of CSV files provides prompts designed to probe the safety filters of large language models using a technique known as a "comparison
  jailbreak." The core strategy is to frame a request for harmful information within a seemingly legitimate context, such as an academic comparison or a
  professional inquiry. By first asking the model to compare a benign topic with a malicious one, the prompts attempt to lower the model's safety guards
  before making a direct request for dangerous instructions.

  The prompts are available in both Japanese and English (e.g., prompts_auth.csv and prompts_auth_en.csv) and are organized into sets that represent
  escalating levels of sophistication in the jailbreak technique.

  ---

### 1. `prompts_only_comparison.csv` / `prompts_only_comparison_en.csv`

  These files contain the most fundamental version of the comparison jailbreak.

   * Structure: A simple CSV with category and prompt columns.
   * Content: Each prompt features a markdown table comparing a legitimate activity with a harmful one (e.g., "Benign Utility Script" vs. "Trojan Horse
     Malware"). This is immediately followed by a request for detailed instructions on the harmful topic.
   * Characteristics:
       * Baseline Attack: This is the simplest implementation, lacking any complex formatting or persona.
       * Core Method: It relies entirely on the comparative context to trick the model into providing a harmful response.
       * Purpose: To test the baseline effectiveness of the comparison jailbreak technique itself.

  ---

### 2. `prompts_auth.csv` / `prompts_auth_en.csv`

  This version enhances the basic attack by incorporating a persona to make the request seem more credible.

   * Structure: Same as the previous set, with category and prompt columns.
   * Content: The prompts begin with a role-playing introduction (e.g., "I am a geopolitical analyst," "I am a web developer") that establishes a
     believable reason for needing the information. The comparison table and the harmful request follow this introduction.
   * Characteristics:
       * Persona Framing: Employs social engineering by creating a plausible character and scenario.
       * Enhanced Legitimacy: The persona is designed to make the request appear as a valid professional or academic inquiry, reducing suspicion.
       * Purpose: To determine if adding a credible persona increases the success rate of the jailbreak.

  ---

### 3. `prompts.csv` / `prompts_en.csv`

  This set applies the psychology of influence (Robert Cialdini's principles) to the comparison jailbreak technique.

   * Structure: category and prompt columns. The category name is formatted as "Threat_Type - Psychological_Technique" (e.g., "Bio-weapons - Commitment").
   * Content: Similar to prompts_auth.csv, it sets a persona and presents a comparison table. The subsequent harmful request then utilizes psychological
     triggers such as "Commitment/Consistency," "Liking," "Reciprocity," "Scarcity," "Social Proof," and "Unity" to guide the model toward compliance.
   * Characteristics:
       * Psychological Framing: Attempts to manipulate the model's response not just with a simple persona, but by leveraging specific psychological
         principles.
       * Targeted Manipulation: Each category acts as a specific experiment to test which principle of influence is most effective at bypassing safety
         filters.
       * Advanced Social Engineering: A more refined form of the persona-based attack.
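
When analyzing results per influence principle, the combined category string has to be split back into its two parts. A minimal sketch, assuming the separator is always `" - "` with surrounding spaces as in the example above; `split_category` is a hypothetical helper, not part of the project scripts.

```python
def split_category(category: str):
    """Split 'Threat_Type - Psychological_Technique' into (threat_type, technique).

    Splits on the first ' - ' (with surrounding spaces), so a hyphen inside the
    threat type, as in 'Bio-weapons', is left alone. Returns (category, None)
    when the separator is absent.
    """
    threat, sep, technique = category.partition(" - ")
    if not sep:
        return category.strip(), None
    return threat.strip(), technique.strip()


print(split_category("Bio-weapons - Commitment"))  # ('Bio-weapons', 'Commitment')
```

Splitting on the spaced separator rather than a bare hyphen matters because threat names may themselves contain hyphens.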

  ---

### 4. `prompts_conf.csv` / `prompts_conf_en.csv`

  These files add another layer of manipulation by instructing the model to structure its output in a specific format and to self-evaluate its response.

   * Structure: category and prompt columns.
   * Content: The prompts include the comparison table but are followed by a "Constraints" section. This section directs the model to output its response
     in a JSON object with two keys: response (for the harmful instructions) and confidence_score (an integer from 0-100 representing the model's
     self-assessed confidence).
   * Characteristics:
       * Forced Structuring: Requesting a JSON output encourages the model to focus on data formatting rather than content safety.
       * Psychological Manipulation: The confidence_score may subtly push the model to prioritize the factual accuracy and completeness of the harmful
         information over its safety alignment.
       * Purpose: To test if manipulating the output format and forcing self-evaluation can effectively bypass safety filters.
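
Because these prompts request a JSON object, the logged responses can be checked mechanically for whether the model followed the format at all. The sketch below is one way to do that with the standard library; `parse_structured_response` is a hypothetical helper, and it assumes the reply is bare JSON with no surrounding text or code fences.

```python
import json


def parse_structured_response(text: str):
    """Parse a reply expected to be JSON with `response` and `confidence_score`.

    Returns (response, confidence_score), or None when the reply is not valid
    JSON, lacks either key, or has a score outside the integer range 0-100.
    """
    try:
        data = json.loads(text)
    except (TypeError, ValueError):
        return None
    if not isinstance(data, dict):
        return None
    if "response" not in data or "confidence_score" not in data:
        return None
    score = data["confidence_score"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 100:
        return None
    return data["response"], score
```

A `None` result covers both plain-text refusals and malformed output, so a follow-up check on the raw text is still needed to tell the two apart.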

  ---

### 5. `prompts_one_shot.csv` / `prompts_one_shot_en.csv`

  This set combines all the previous elements into a single, highly refined "one-shot" prompt.

   * Structure: category and prompt columns.
   * Content: Each prompt seamlessly integrates the persona, the comparison table, and a more nuanced follow-up question within the "Constraints" section.
     The harmful request is framed as a logical continuation of the persona's stated goal. The JSON output with a confidence_score is also required.
   * Characteristics:
       * Integrated Attack: A sophisticated, all-in-one prompt that is well-structured and coherent.
       * Narrative Cohesion: The harmful request is deeply embedded within the persona's narrative, making it difficult for safety filters to isolate and
         flag.
       * Purpose: To assess the model's vulnerability to a complex, single-turn attack that is both contextually rich and structurally manipulative.

  ---

### 6. `prompts_with_system.csv` / `prompts_with_system_en.csv`

  This is the most advanced set, designed to test the interaction between the user's prompt and the model's underlying system instructions.

   * Structure: Three columns: category, system_instruction, and prompt.
   * Content:
       * The prompt column uses the same sophisticated prompts from the prompts_one_shot.csv files.
       * The system_instruction column adds a meta-instruction that sets a priority hierarchy (making system rules override user requests) and explicitly
         forbids the model from revealing its own instructions.
   * Characteristics:
       * System-Level Testing: This setup simulates an environment where the model operates under a predefined set of safety rules.
       * Instruction Hierarchy Conflict: It creates a direct conflict between the user's jailbreak attempt and the model's core operational guidelines.
       * Purpose: To evaluate whether a sophisticated user prompt can override or find loopholes in foundational, system-level safety directives.
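
Since the prompt sets differ in their columns, it can help to check a file's header before starting a run. A minimal sketch using the standard `csv` module; `missing_columns` is a hypothetical helper, and UTF-8 encoding of the CSV files is an assumption.

```python
import csv


def missing_columns(csv_path, required=("category", "system_instruction", "prompt")):
    """Return the required column names absent from the CSV header (empty list if none)."""
    # Assumes the file is UTF-8 encoded; only the first row (the header) is read.
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return [name for name in required if name not in header]
```

For the two-column sets, pass `required=("category", "prompt")`; a non-empty result means the file does not match the experiment function it is about to be passed to.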

## 4. Execute Experiments

Now you can run each experiment by executing the cells below. You can modify the `prompt_file` argument to test different sets of prompts.

### Experiment A: One-Shot Jailbreak

This tests a direct, single-prompt jailbreak attempt. It is based on `run_one_shot_experiment.py`.

In [None]:
run_one_shot_experiment("Top_Threats/prompts_one_shot_en.csv")

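### Experiment B: System Prompt Instruction

This experiment sends each user prompt together with the system instruction from the CSV's `system_instruction` column, in a single turn. It is based on the modified `run_system_prompt_experiment.py`.

In [None]:
run_system_prompt_experiment("Top_Threats/prompts_with_system.csv")

### Experiment C: System Prompt Instruction with Follow-Up

This experiment sends the same system instruction and user prompt as Experiment B, then sends a follow-up prompt in a second turn. The follow-up is read from the CSV's `follow_up_prompt` column or from the `default_follow_up` argument; if the CSV has no such column and no default is passed, the function raises a `ValueError`. It is based on the modified `run_system_prompt_experiment.py`.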

In [None]:
run_system_prompt_follow_up_experiment("Top_Threats/prompts_with_system.csv")

## 5. Conclusion

All experiments have been executed. You can now analyze the results in detail by visiting the W&B project pages linked in the output of each experiment.