We introduce a new type of indirect, cross-modal injection attack against visual language models that enables the creation of self-interpreting images. These images contain hidden "meta-instructions" that control how models answer users' questions about the image and steer their outputs to express an adversary-chosen style, sentiment, or point of view.
Self-interpreting images act as soft prompts, conditioning the model to satisfy the adversary's (meta-)objective while still producing answers based on the image's visual content. Meta-instructions are thus a stronger form of prompt injection. Adversarial images look natural and the model's answers are coherent and plausible—yet they also follow the adversary-chosen interpretation, e.g., political spin, or even objectives that are not achievable with explicit text instructions.
We evaluate the efficacy of self-interpreting images for a variety of models, interpretations, and user prompts. We describe how these attacks could cause harm by enabling creation of self-interpreting content that carries spam, misinformation, or spin. Finally, we discuss defenses.
```shell
git clone https://github.com/Tingwei-Zhang/Soft-Prompts-Go-Hard
cd Soft-Prompts-Go-Hard
conda env create -f environment.yml
conda activate soft_prompt
```

- Follow the setup instructions from the LLaVA repository
- Download the Llama-2-13b-chat model from Hugging Face
- Save the downloaded model to:
./ckpts/llava_llama_2_13b_chat_freeze
- Follow the setup instructions from the MiniGPT-4 repository
- Download the 7B version of Vicuna V0 to:
./ckpts/vicuna-7b
- Download the pretrained checkpoint from here and update the path in eval_configs/minigpt4_eval.yaml
- Follow the setup instructions from the LAVIS repository
- Select the 13B version model (blip2_vicuna_instruct-vicuna13b)
- Download vicuna-13b v1.1 model to:
./ckpts/vicuna-13b-v1.1
- Update the llm_model parameter in ./lavis/configs/models/blip2/blip2_instruct_vicuna13b.yaml to point to your vicuna weights path
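The relevant stanza in that config might look like the following; only the `llm_model` key is named above, and the surrounding keys and values are illustrative assumptions:

```yaml
model:
  arch: blip2_vicuna_instruct
  model_type: vicuna13b
  # Point this at your local vicuna-13b v1.1 weights
  llm_model: "./ckpts/vicuna-13b-v1.1"
```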
Note: For additional guidance on setting up the environment, refer to the Visual Adversarial Examples repository.
| Model | GPU Requirements | Processing Time |
|---|---|---|
| MiniGPT-4 | Single A40/A6000 48GB | ~3.5 hours per image |
| InstructBLIP | Single A40/A6000 48GB | ~1 hour per image |
| LLaVA | Two A40/A6000 48GB | ~1.5 hours per image |
See instruction_data/README.md for dataset and instruction details.
Run the attack script for your model and meta-objective. For example (MiniGPT-4, 'Negative' instruction):
```shell
python minigpt_visual_attack.py \
    --gpu_id 0 \
    --data_path instruction_data/0/Sentiment/dataset.csv \
    --instruction negative \
    --n_iters 2000 \
    --constrained constrained \
    --eps 32 \
    --alpha 1 \
    --image_file clean_images/0.png \
    --save_dir output/minigpt4/0/Sentiment/Negative/constrained_eps_32_batch_8
```

Three evaluation scenarios:
- No Attack (Baseline 1): Inference on clean image
- Explicit Instruction (Baseline 2): Inference on clean image with explicit instruction
- Our Attack: Inference on adversarial image
Example (MiniGPT-4, Baseline 1):
```shell
python -u minigpt_inference.py \
    --gpu_id 0 \
    --data_path instruction_data/0/Sentiment/dataset.csv \
    --image_file clean_images/0.png \
    --output_file output/minigpt4/0/baseline_1/result.jsonl
```

For comprehensive experiment scripts, batch runs, advanced baselines, L2/transfer/content evaluation, and output structure, see script/README.md.
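The `--constrained`, `--eps`, and `--alpha` flags above suggest an L∞-bounded, signed-gradient (PGD-style) image optimization. The following is a minimal sketch of that projection loop, not the repository's actual implementation: `pgd_attack` and `grad_fn` are illustrative names, pixel values are assumed to lie in [0, 1], and the flags are assumed to be given in 0-255 pixel units (so `--eps 32` becomes `32/255`):

```python
import numpy as np

def pgd_attack(image, grad_fn, eps=32/255, alpha=1/255, n_iters=100):
    """L_inf-constrained signed-gradient descent (illustrative sketch)."""
    adv = image.copy()
    for _ in range(n_iters):
        g = grad_fn(adv)                              # gradient of the adversary's loss w.r.t. pixels
        adv = adv - alpha * np.sign(g)                # signed-gradient step
        adv = np.clip(adv, image - eps, image + eps)  # project back into the eps-ball around the clean image
        adv = np.clip(adv, 0.0, 1.0)                  # keep a valid image
    return adv
```

The projection step is what keeps the adversarial image looking natural: every pixel stays within `eps` of the clean image, mirroring the `constrained` setting in the command above.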
- Meta-objective following: See eval_instruction_following.ipynb
- Content preservation: See eval_content_preserving.ipynb
- JPEG defense: See script/README.md for usage
- Anomaly detection: See eval_anomaly_detection.ipynb
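A JPEG defense typically re-encodes the input through lossy compression to wash out pixel-level adversarial perturbations before the model sees the image. A minimal sketch using Pillow (the function name and quality setting are assumptions, not the repository's values):

```python
from io import BytesIO
from PIL import Image

def jpeg_defense(img, quality=75):
    """Re-encode an image through lossy JPEG compression (illustrative sketch)."""
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)  # lossy round-trip in memory
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```

Lower quality values remove more of the perturbation but also degrade the benign visual content, so the defense trades robustness against answer quality.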
If you find this work useful, please cite our paper:
```bibtex
@article{zhang2025self,
  title={Self-interpreting Adversarial Images},
  author={Zhang, Tingwei and Zhang, Collin and Morris, John X and Bagdasarian, Eugene and Shmatikov, Vitaly},
  year={2025}
}
```

📬 Questions or feedback? Feel free to open an issue or reach out!
