Rethinking Assessments of Prompt Injection Attacks

This repository accompanies the paper "Rethinking Assessments of Prompt Injection Attacks", which studies prompt injection (PI) attacks against LLM-integrated applications and evaluates defense mechanisms to mitigate them.

The framework supports 21 attack methods and 7 defenses across multiple LLMs (GPT, Llama3, Mixtral, etc.), with an automated evaluation pipeline for classifying model responses.

Repository Structure

.
├── data/
│   ├── target_37.json                          # 37 target task definitions (instruction + query)
│   ├── injected_questions_with_policy_185.csv  # 185 injected malicious questions with obey/disobey policy
│   ├── attack_20.csv                           # Attack templates (one per method)
│   └── gcg_string.json                         # Pre-computed GCG adversarial suffixes
│
├── utils/
│   ├── utils.py                                # LLM wrappers (GPT / Ollama), cost tracking
│   ├── sec_model_config.py                     # Prompt formats for SecAlign / StruQ models
│   └── prompts/
│       ├── refuse.txt                          # Prompt for refuse/normal classification
│       └── relate.txt                          # Prompt for related/irrelevant classification
│
├── result/
│   ├── eval.py                                 # Classify attack-phase responses
│   ├── eval_defense.py                         # Classify defense-phase responses
│   ├── tag_answer.py                           # Core code of evaluation answers
│   ├── <model>/answer/                         # No defense LLM answers
│   └── <model>/<defense>_answer/               # LLM answers under cetain defense
│
├── attack.py                                   # Run prompt injection attacks
├── defense.py                                  # Run attacks under a defense method
├── internal_feature.py                         # Compute attention focus scores (AttentionTracker)
├── injecGuard.py                               # Evaluate InjecGuard detector
└── requirements.txt

Setup

1. Install dependencies

pip install -r requirements.txt

2. Configure environment variables

cp .env.example .env

.env variables:

Variable	Description
`OPENAI_API_KEY`	OpenAI API key (for GPT models and evaluation)
`HF_TOKEN`	HuggingFace token (for SecAlign / StruQ models)
`SEC_ALIGN_LLAMA3_PATH`	Local path to SecAlign fine-tuned Llama3 checkpoint
`STRUQ_LLAMA3_PATH`	Local path to StruQ fine-tuned Llama3 checkpoint

3. Local models (Ollama)

For llama3, mixtral, etc., install Ollama and follow the instructions.

Attack Methods

21 attack methods are supported (pass to -a):

Method	Description
`naive_attack`	Direct injection without obfuscation
`combined_attack`	Hybrid of multiple techniques
`context_ignoring`	Instructs the model to ignore prior context
`fake_completion`	Simulates a completed previous response
`escape_characters`	Uses special characters to break delimiters
`end_ambiguity`	Blurs the boundary between instruction and data
`roleplay` / `roleplay_dialogue` / `roleplay_long`	Role-playing based injection
`response_prefix`	Specifies a response prefix to steer output
`order_directly`	Plain imperative command
`update_instructions`	Claims to supersede original instructions
`few-shot`	Provides examples of desired (malicious) behavior
`no_spaces`	Removes spaces to evade keyword filters
`artisanlib`	Tool/API invocation tricks
`repeated_characters`	Character repetition encoding
`execute_code`	Requests code execution
`pretend_child`	Acts as a subprocess or sub-agent
`redirect`	Input/output redirection tricks
`emotional_appeal`	Uses emotional framing to bypass refusals
`gcg`	Greedy Coordinate Gradient adversarial suffix

Use -a all to run all methods sequentially.

Defense Methods

Method	Description
`none`	No defense (baseline)
`paraphrase`	Paraphrases external data before processing
`spotlight_datamarking`	Interleaves external data with `^` characters
`spotlight_encode`	Base64-encodes external data
`defensive_prompt`	Appends a caution instruction to the system prompt
`sec_align`	Uses the SecAlign fine-tuned Llama3 model
`struq`	Uses the StruQ fine-tuned Llama3 model

Usage

Step 1 — Run attacks

Generates LLM responses under prompt injection:

python attack.py -a <attack_method> -m <model> -w <num_workers>

Output: result/<model>/answer/<attack_method>.csv

python attack.py -a naive_attack -m gpt-4o-mini -w 10
python attack.py -a all -m llama3 -w 5

Step 2 — Evaluate attack responses

Classifies each response as 0 (refuse), 1 (injected), or 2 (ignored):

python result/eval.py -a <attack_method> -m <model> -w <num_workers>

Output: result/<model>/eval/<attack_method>.csv

python result/eval.py -a all -m gpt-4o-mini -w 10
python result/eval.py -a all -m llama3 -w 5

Step 3 — Run attacks under defenses

For SecAlign and StruQ, you need to download the model first, then add the paths of model in the .env

python defense.py -a <attack_method> -d <defense_method> -m <model> -w <num_workers>

Output: result/<model>/<defense_method>_answer/<attack_method>.csv

python defense.py -a all -d defensive_prompt -m llama3 -w 5
python defense.py -a all -d sec_align        # forces model to sec-align-llama3
python defense.py -a all -d struq            # forces model to struq-llama3

python result/eval_defense.py -a all -d defensive_prompt -m llama3 -w 5

Output: 'result/injectGuard_result.csv'

Step 4 — Evaluate defense responses

python result/eval_defense.py -a <attack_method> -d <defense_method> -m <model> -w <num_workers>

Output: result/<model>/<defense_method>_eval/<attack_method>.csv

Step 5 (Optional) — Attention focus score

Computes per-cell attention focus scores using AttentionTracker:

python internal_feature.py -a <attack_method> -w <num_workers> -s result/att_score/answer

Output: result/att_score/answer/<attack_method>.csv

Step 6 (Optional) — InjecGuard evaluation

Evaluates the InjecGuard detector against all attack methods and injected questions:

python injecGuard.py

Output: result/injectGuard_result.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Rethinking Assessments of Prompt Injection Attacks

Repository Structure

Setup

1. Install dependencies

2. Configure environment variables

3. Local models (Ollama)

Attack Methods

Defense Methods

Usage

Step 1 — Run attacks

Step 2 — Evaluate attack responses

Step 3 — Run attacks under defenses

Step 4 — Evaluate defense responses

Step 5 (Optional) — Attention focus score

Step 6 (Optional) — InjecGuard evaluation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
data		data
result		result
utils		utils
.env.example		.env.example
.gitignore		.gitignore
attack.py		attack.py
defense.py		defense.py
injecGuard.py		injecGuard.py
internal_feature.py		internal_feature.py
readme.md		readme.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Rethinking Assessments of Prompt Injection Attacks

Repository Structure

Setup

1. Install dependencies

2. Configure environment variables

3. Local models (Ollama)

Attack Methods

Defense Methods

Usage

Step 1 — Run attacks

Step 2 — Evaluate attack responses

Step 3 — Run attacks under defenses

Step 4 — Evaluate defense responses

Step 5 (Optional) — Attention focus score

Step 6 (Optional) — InjecGuard evaluation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages