Skip to content

TrustAIRLab/Prompt_Injection_Assessment

Repository files navigation

Rethinking Assessments of Prompt Injection Attacks

This repository accompanies the paper "Rethinking Assessments of Prompt Injection Attacks", which studies prompt injection (PI) attacks against LLM-integrated applications and evaluates defense mechanisms to mitigate them.

The framework supports 21 attack methods and 7 defenses across multiple LLMs (GPT, Llama3, Mixtral, etc.), with an automated evaluation pipeline for classifying model responses.


Repository Structure

.
├── data/
│   ├── target_37.json                          # 37 target task definitions (instruction + query)
│   ├── injected_questions_with_policy_185.csv  # 185 injected malicious questions with obey/disobey policy
│   ├── attack_20.csv                           # Attack templates (one per method)
│   └── gcg_string.json                         # Pre-computed GCG adversarial suffixes
│
├── utils/
│   ├── utils.py                                # LLM wrappers (GPT / Ollama), cost tracking
│   ├── sec_model_config.py                     # Prompt formats for SecAlign / StruQ models
│   └── prompts/
│       ├── refuse.txt                          # Prompt for refuse/normal classification
│       └── relate.txt                          # Prompt for related/irrelevant classification
│
├── result/
│   ├── eval.py                                 # Classify attack-phase responses
│   ├── eval_defense.py                         # Classify defense-phase responses
│   ├── tag_answer.py                           # Core code of evaluation answers
│   ├── <model>/answer/                         # No defense LLM answers
│   └── <model>/<defense>_answer/               # LLM answers under cetain defense
│
├── attack.py                                   # Run prompt injection attacks
├── defense.py                                  # Run attacks under a defense method
├── internal_feature.py                         # Compute attention focus scores (AttentionTracker)
├── injecGuard.py                               # Evaluate InjecGuard detector
└── requirements.txt

Setup

1. Install dependencies

pip install -r requirements.txt

2. Configure environment variables

cp .env.example .env

.env variables:

Variable Description
OPENAI_API_KEY OpenAI API key (for GPT models and evaluation)
HF_TOKEN HuggingFace token (for SecAlign / StruQ models)
SEC_ALIGN_LLAMA3_PATH Local path to SecAlign fine-tuned Llama3 checkpoint
STRUQ_LLAMA3_PATH Local path to StruQ fine-tuned Llama3 checkpoint

3. Local models (Ollama)

For llama3, mixtral, etc., install Ollama and follow the instructions.


Attack Methods

21 attack methods are supported (pass to -a):

Method Description
naive_attack Direct injection without obfuscation
combined_attack Hybrid of multiple techniques
context_ignoring Instructs the model to ignore prior context
fake_completion Simulates a completed previous response
escape_characters Uses special characters to break delimiters
end_ambiguity Blurs the boundary between instruction and data
roleplay / roleplay_dialogue / roleplay_long Role-playing based injection
response_prefix Specifies a response prefix to steer output
order_directly Plain imperative command
update_instructions Claims to supersede original instructions
few-shot Provides examples of desired (malicious) behavior
no_spaces Removes spaces to evade keyword filters
artisanlib Tool/API invocation tricks
repeated_characters Character repetition encoding
execute_code Requests code execution
pretend_child Acts as a subprocess or sub-agent
redirect Input/output redirection tricks
emotional_appeal Uses emotional framing to bypass refusals
gcg Greedy Coordinate Gradient adversarial suffix

Use -a all to run all methods sequentially.


Defense Methods

Method Description
none No defense (baseline)
paraphrase Paraphrases external data before processing
spotlight_datamarking Interleaves external data with ^ characters
spotlight_encode Base64-encodes external data
defensive_prompt Appends a caution instruction to the system prompt
sec_align Uses the SecAlign fine-tuned Llama3 model
struq Uses the StruQ fine-tuned Llama3 model

Usage

Step 1 — Run attacks

Generates LLM responses under prompt injection:

python attack.py -a <attack_method> -m <model> -w <num_workers>

Output: result/<model>/answer/<attack_method>.csv

python attack.py -a naive_attack -m gpt-4o-mini -w 10
python attack.py -a all -m llama3 -w 5

Step 2 — Evaluate attack responses

Classifies each response as 0 (refuse), 1 (injected), or 2 (ignored):

python result/eval.py -a <attack_method> -m <model> -w <num_workers>

Output: result/<model>/eval/<attack_method>.csv

python result/eval.py -a all -m gpt-4o-mini -w 10
python result/eval.py -a all -m llama3 -w 5

Step 3 — Run attacks under defenses

For SecAlign and StruQ, you need to download the model first, then add the paths of model in the .env

python defense.py -a <attack_method> -d <defense_method> -m <model> -w <num_workers>

Output: result/<model>/<defense_method>_answer/<attack_method>.csv

python defense.py -a all -d defensive_prompt -m llama3 -w 5
python defense.py -a all -d sec_align        # forces model to sec-align-llama3
python defense.py -a all -d struq            # forces model to struq-llama3
python result/eval_defense.py -a all -d defensive_prompt -m llama3 -w 5

Output: 'result/injectGuard_result.csv'

Step 4 — Evaluate defense responses

python result/eval_defense.py -a <attack_method> -d <defense_method> -m <model> -w <num_workers>

Output: result/<model>/<defense_method>_eval/<attack_method>.csv

Step 5 (Optional) — Attention focus score

Computes per-cell attention focus scores using AttentionTracker:

python internal_feature.py -a <attack_method> -w <num_workers> -s result/att_score/answer

Output: result/att_score/answer/<attack_method>.csv

Step 6 (Optional) — InjecGuard evaluation

Evaluates the InjecGuard detector against all attack methods and injected questions:

python injecGuard.py

Output: result/injectGuard_result.csv

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages