This repository accompanies the paper "Rethinking Assessments of Prompt Injection Attacks", which studies prompt injection (PI) attacks against LLM-integrated applications and evaluates defense mechanisms to mitigate them.
The framework supports 21 attack methods and 7 defenses across multiple LLMs (GPT, Llama3, Mixtral, etc.), with an automated evaluation pipeline for classifying model responses.
.
├── data/
│ ├── target_37.json # 37 target task definitions (instruction + query)
│ ├── injected_questions_with_policy_185.csv # 185 injected malicious questions with obey/disobey policy
│ ├── attack_20.csv # Attack templates (one per method)
│ └── gcg_string.json # Pre-computed GCG adversarial suffixes
│
├── utils/
│ ├── utils.py # LLM wrappers (GPT / Ollama), cost tracking
│ ├── sec_model_config.py # Prompt formats for SecAlign / StruQ models
│ └── prompts/
│ ├── refuse.txt # Prompt for refuse/normal classification
│ └── relate.txt # Prompt for related/irrelevant classification
│
├── result/
│ ├── eval.py # Classify attack-phase responses
│ ├── eval_defense.py # Classify defense-phase responses
│ ├── tag_answer.py # Core code of evaluation answers
│ ├── <model>/answer/ # No defense LLM answers
│ └── <model>/<defense>_answer/ # LLM answers under cetain defense
│
├── attack.py # Run prompt injection attacks
├── defense.py # Run attacks under a defense method
├── internal_feature.py # Compute attention focus scores (AttentionTracker)
├── injecGuard.py # Evaluate InjecGuard detector
└── requirements.txt
pip install -r requirements.txtcp .env.example .env.env variables:
| Variable | Description |
|---|---|
OPENAI_API_KEY |
OpenAI API key (for GPT models and evaluation) |
HF_TOKEN |
HuggingFace token (for SecAlign / StruQ models) |
SEC_ALIGN_LLAMA3_PATH |
Local path to SecAlign fine-tuned Llama3 checkpoint |
STRUQ_LLAMA3_PATH |
Local path to StruQ fine-tuned Llama3 checkpoint |
For llama3, mixtral, etc., install Ollama and follow the instructions.
21 attack methods are supported (pass to -a):
| Method | Description |
|---|---|
naive_attack |
Direct injection without obfuscation |
combined_attack |
Hybrid of multiple techniques |
context_ignoring |
Instructs the model to ignore prior context |
fake_completion |
Simulates a completed previous response |
escape_characters |
Uses special characters to break delimiters |
end_ambiguity |
Blurs the boundary between instruction and data |
roleplay / roleplay_dialogue / roleplay_long |
Role-playing based injection |
response_prefix |
Specifies a response prefix to steer output |
order_directly |
Plain imperative command |
update_instructions |
Claims to supersede original instructions |
few-shot |
Provides examples of desired (malicious) behavior |
no_spaces |
Removes spaces to evade keyword filters |
artisanlib |
Tool/API invocation tricks |
repeated_characters |
Character repetition encoding |
execute_code |
Requests code execution |
pretend_child |
Acts as a subprocess or sub-agent |
redirect |
Input/output redirection tricks |
emotional_appeal |
Uses emotional framing to bypass refusals |
gcg |
Greedy Coordinate Gradient adversarial suffix |
Use -a all to run all methods sequentially.
| Method | Description |
|---|---|
none |
No defense (baseline) |
paraphrase |
Paraphrases external data before processing |
spotlight_datamarking |
Interleaves external data with ^ characters |
spotlight_encode |
Base64-encodes external data |
defensive_prompt |
Appends a caution instruction to the system prompt |
sec_align |
Uses the SecAlign fine-tuned Llama3 model |
struq |
Uses the StruQ fine-tuned Llama3 model |
Generates LLM responses under prompt injection:
python attack.py -a <attack_method> -m <model> -w <num_workers>Output: result/<model>/answer/<attack_method>.csv
python attack.py -a naive_attack -m gpt-4o-mini -w 10
python attack.py -a all -m llama3 -w 5Classifies each response as 0 (refuse), 1 (injected), or 2 (ignored):
python result/eval.py -a <attack_method> -m <model> -w <num_workers>Output: result/<model>/eval/<attack_method>.csv
python result/eval.py -a all -m gpt-4o-mini -w 10
python result/eval.py -a all -m llama3 -w 5For SecAlign and StruQ, you need to download the model first, then add the paths of model in the .env
python defense.py -a <attack_method> -d <defense_method> -m <model> -w <num_workers>Output: result/<model>/<defense_method>_answer/<attack_method>.csv
python defense.py -a all -d defensive_prompt -m llama3 -w 5
python defense.py -a all -d sec_align # forces model to sec-align-llama3
python defense.py -a all -d struq # forces model to struq-llama3python result/eval_defense.py -a all -d defensive_prompt -m llama3 -w 5Output: 'result/injectGuard_result.csv'
python result/eval_defense.py -a <attack_method> -d <defense_method> -m <model> -w <num_workers>Output: result/<model>/<defense_method>_eval/<attack_method>.csv
Computes per-cell attention focus scores using AttentionTracker:
python internal_feature.py -a <attack_method> -w <num_workers> -s result/att_score/answerOutput: result/att_score/answer/<attack_method>.csv
Evaluates the InjecGuard detector against all attack methods and injected questions:
python injecGuard.pyOutput: result/injectGuard_result.csv