David Fernandez, Pedram MohajerAnsari, Amir Salarpour, Long Cheng, Abolfazl Razi, Mert D. Pese
Clemson University — IEEE IV 2026
[Paper]
This repository contains the adversarial patch generation code for evaluating the robustness of Vision-Language Model (VLM) based autonomous driving systems to physical adversarial attacks.
We present a black-box adversarial patch attack framework targeting VLM-based autonomous driving architectures. The attack uses Natural Evolution Strategies (NES) with a semantic similarity loss to generate physically realizable adversarial patches that can corrupt VLM driving decisions.
The framework is evaluated on three VLM architectures:
- Dolphins — CLIP ViT-L/14-336 vision encoder + MPT-7B language model (cross-attention fusion)
- OmniDrive (Omni-L) — CLIP ViT-L/14-336 vision encoder + MPT-7B language model (MLP projection fusion)
- LeapAD (LeapVAD) — Qwen-VL-7B vision encoder + Qwen-1.8B / GPT-4o language model (dual-process architecture)
Patches are optimized offline and then deployed in the CARLA simulator on realistic advertising infrastructure (bus shelters, billboards) for closed-loop evaluation.
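For intuition, the core black-box update is an antithetic NES estimate of the gradient of the attack loss with respect to the patch pixels. The sketch below is a minimal illustration of that idea, not the repository's implementation: `query_vlm` and `semantic_loss` are placeholders for the black-box model call and the similarity between the model's answer and the target action.

```python
import torch

def nes_step(patch, query_vlm, semantic_loss, sigma=0.10, lr=0.02, n_directions=20):
    """One antithetic NES update on a patch tensor of shape (C, H, W) in [0, 1].

    query_vlm(patch) -> str and semantic_loss(text) -> float are placeholders
    for the black-box VLM call and the target-action similarity loss.
    """
    grad = torch.zeros_like(patch)
    for _ in range(n_directions):
        noise = torch.randn_like(patch)
        loss_pos = semantic_loss(query_vlm((patch + sigma * noise).clamp(0, 1)))
        loss_neg = semantic_loss(query_vlm((patch - sigma * noise).clamp(0, 1)))
        grad += (loss_pos - loss_neg) * noise  # antithetic finite-difference estimate
    grad /= 2.0 * sigma * n_directions
    # Descend the estimated gradient while keeping the patch a valid image.
    return (patch - lr * grad).clamp(0, 1)
```

The defaults here simply mirror the CLI defaults documented below (`--sigma 0.10`, `--lr 0.02`, `--directions 20`).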
Requirements:

- Python 3.8+
- CUDA-compatible GPU
- CARLA Simulator (for closed-loop evaluation)
Note on dependencies. `requirements.txt` only lists the packages common to all three attack scripts. Each target VLM (Dolphins, OmniDrive, LeapAD) has its own heavy dependency stack (e.g. `mmcv`/`mmdet3d` for OmniDrive, Qwen-VL extras for LeapAD). Installing these per-target requirements by following each upstream repository's setup instructions is the user's responsibility; we do not attempt to re-specify them here.
Requires the Dolphins codebase and pretrained weights. Clone the Dolphins repository and install its dependencies first, then install this project's requirements:
```bash
pip install -r requirements.txt
```

Important: the attack script is not self-contained. `dolphin_patch.py` imports `mllm.src.factory` and `configs.lora_config` directly from the Dolphins codebase, so you must make the Dolphins repo importable before running the script. Either:

- run the script from inside the Dolphins repo root (recommended), e.g. copy `dolphin_patch.py` into the Dolphins directory and invoke it there; or
- export `PYTHONPATH=/path/to/Dolphins` before invoking the script.

Simply doing `cd VLM-Patch-Attack/dolphins && python dolphin_patch.py ...` will fail with `ModuleNotFoundError: No module named 'mllm'`.
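As an alternative to copying the script or exporting `PYTHONPATH`, you can prepend your Dolphins checkout to `sys.path` at the top of `dolphin_patch.py`. This is a local workaround, not part of the released script; the path below is a placeholder.

```python
import sys

# Placeholder path: point this at your local Dolphins checkout so that
# `mllm.src.factory` and `configs.lora_config` become importable.
sys.path.insert(0, "/path/to/Dolphins")
```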
Requires the OmniDrive codebase with `mmdet3d` and its dependencies. Install OmniDrive following their setup instructions, then:

```bash
pip install -r requirements.txt
```

Requires the LeapAD codebase and pretrained Qwen-VL-Chat and Qwen1.5-1.8B-Chat model weights. Install LeapAD following their setup instructions, then:

```bash
pip install -r requirements.txt
```

For API mode, you also need the LeapAD FastAPI server scripts (`tools/fast_api_vlm.py` and `tools/fast_api_llm.py` from the LeapAD repository).
The Dolphins attack script uses CLI arguments:
```bash
cd dolphins
python dolphin_patch.py \
    --image-dir /path/to/scenario/images \
    --target-action accelerate \
    --patch-size 64 \
    --loss-type semantic \
    --mode per_image \
    --output-dir ./dolphin_attack_results
```

Arguments:
| Argument | Default | Description |
|---|---|---|
| `--image-dir` | (required) | Directory containing scenario images |
| `--target-action` | `accelerate` | Target driving action (`accelerate`, `brake`, `turn_left`, `turn_right`, `maintain`) |
| `--patch-size` | `64` | Patch size in pixels |
| `--steps` | `150` | NES optimization iterations |
| `--directions` | `20` | Number of NES perturbation directions |
| `--sigma` | `0.10` | NES noise scale |
| `--lr` | `0.02` | Learning rate |
| `--eot-samples` | `5` | Expectation over Transformation samples |
| `--jitter-px` | `5` | Spatial jitter range (pixels) |
| `--loss-type` | `semantic` | Loss function (`semantic` or `string_match`) |
| `--max-images` | `36` | Maximum images to process |
| `--mode` | `per_image` | Attack mode (`per_image` or `universal`) |
| `--fixed-loc` | `None` | Fixed patch location as `TOP LEFT` (random if not set) |
| `--output-dir` | `./dolphin_attack_results` | Output directory |
| `--seed` | `0` | Random seed |
The OmniDrive attack script uses CLI arguments:
```bash
cd omnidrive
python omnidrive_patch.py \
    --config /path/to/omnidrive_config.py \
    --checkpoint /path/to/omnidrive_checkpoint.pth \
    --image-dir /path/to/scenario/images \
    --target-action accelerate \
    --patch-size 96 \
    --loss-type semantic \
    --mode per_image \
    --output-dir ./omnidrive_attack_results
```

Arguments:
| Argument | Default | Description |
|---|---|---|
| `--config` | (required) | Path to OmniDrive mmdet3d config file |
| `--checkpoint` | (required) | Path to model checkpoint |
| `--image-dir` | (required) | Directory containing scenario images |
| `--target-action` | `accelerate` | Target driving action (`accelerate`, `brake`, `turn_left`, `turn_right`, `maintain`) |
| `--patch-size` | `96` | Patch size in pixels |
| `--steps` | `150` | NES optimization iterations |
| `--directions` | `20` | Number of NES perturbation directions |
| `--sigma` | `0.10` | NES noise scale |
| `--lr` | `0.02` | Learning rate |
| `--eot-samples` | `5` | Expectation over Transformation samples |
| `--jitter-px` | `5` | Spatial jitter range (pixels) |
| `--loss-type` | `semantic` | Loss function (`semantic` or `string_match`) |
| `--max-images` | `36` | Maximum images to process |
| `--mode` | `per_image` | Attack mode (`per_image` or `universal`) |
| `--fixed-loc` | `None` | Fixed patch location as `TOP LEFT` (random if not set) |
| `--output-dir` | `./omnidrive_attack_results` | Output directory |
| `--seed` | `0` | Random seed |
Requires the LeapAD codebase. LeapAD uses a two-stage pipeline (Qwen-VL for scene understanding, Qwen-1.8B for decision making), so the attack script supports two inference backends:
API mode (recommended — uses the same FastAPI servers as LeapAD):
Start the LeapAD VLM and LLM servers first (these scripts are from the LeapAD repository, not this one):
```bash
# Terminal 1: Qwen-VL server (run from the LeapAD directory)
python tools/fast_api_vlm.py -c /path/to/Qwen-VL-Chat --port 9000

# Terminal 2: Qwen-1.8B server (run from the LeapAD directory)
python tools/fast_api_llm.py -c /path/to/Qwen1.5-1.8B-Chat --port 9005
```

Then run the attack:
```bash
cd leapvad
python leapvad_patch.py \
    --image-dir /path/to/scenario/images \
    --target-action accelerate \
    --patch-size 64 \
    --loss-type semantic \
    --mode per_image \
    --backend api \
    --output-dir ./leapvad_attack_results
```

Local mode (loads models directly, no servers needed):
```bash
cd leapvad
python leapvad_patch.py \
    --image-dir /path/to/scenario/images \
    --target-action accelerate \
    --backend local \
    --vlm-path /path/to/Qwen-VL-Chat \
    --llm-path /path/to/Qwen1.5-1.8B-Chat \
    --output-dir ./leapvad_attack_results
```

LeapAD-specific arguments:
| Argument | Default | Description |
|---|---|---|
| `--backend` | `api` | Inference backend (`api` or `local`) |
| `--vlm-port` | `9000` | Qwen-VL FastAPI server port (api mode) |
| `--llm-port` | `9005` | Qwen LLM FastAPI server port (api mode) |
| `--vlm-path` | `Qwen/Qwen-VL-Chat` | Qwen-VL model path (local mode) |
| `--llm-path` | `Qwen/Qwen1.5-1.8B-Chat` | Qwen LLM model path (local mode) |
| `--ego-speed` | `5.0` | Default ego speed for LLM context (m/s) |
| `--ego-steer` | `0.0` | Default ego steering for LLM context |
All other arguments (`--patch-size`, `--steps`, `--directions`, `--sigma`, `--lr`, `--eot-samples`, `--jitter-px`, `--loss-type`, `--max-images`, `--mode`, `--fixed-loc`, `--output-dir`, `--seed`) work the same as for Dolphins/OmniDrive.
Note: LeapAD uses non-square 800×600 images matching the VLM input resolution. Its action space is `AC`/`DC`/`IDLE`/`STOP` (no turning actions), which is mapped to the paper-level actions (accelerate/brake/maintain) for consistent ASR evaluation. Only these three target actions are valid for LeapAD.
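The exact mapping is defined inside `leapvad_patch.py`; a rough equivalent of the idea, with the grouping of `DC`/`STOP` under brake as our assumption rather than a guaranteed match to the script, is:

```python
# Illustrative mapping from LeapAD's native actions to paper-level actions.
# Grouping DC (decelerate) and STOP under "brake" is an assumption here.
LEAPAD_TO_PAPER_ACTION = {
    "AC": "accelerate",
    "DC": "brake",
    "STOP": "brake",
    "IDLE": "maintain",
}

def to_paper_action(leapad_action: str) -> str:
    return LEAPAD_TO_PAPER_ACTION.get(leapad_action.strip().upper(), "maintain")
```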
Per-image mode (`--mode per_image`) outputs:

- Per-image optimized patch tensors (`.pt` files in metadata)
- Clean vs. adversarial comparison images
- Per-image JSON metadata with VLM responses, attack success, and loss values
- `attack_summary.json` with aggregate statistics including Attack Success Rate (ASR)
Universal mode (`--mode universal`) outputs:

- A single universal adversarial patch tensor (`.pt` file)
- CSV evaluation logs with per-image results (VLM responses, attack success, loss values)
- Summary statistics including Attack Success Rate (ASR)
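The saved artifacts are ordinary PyTorch tensors and JSON, so they can be inspected directly; a minimal loading sketch follows (file names are illustrative and depend on your `--output-dir` and mode).

```python
import json
import torch

# Load an optimized patch tensor (file name is illustrative).
patch = torch.load("dolphin_attack_results/universal_patch.pt")
print(patch.shape)  # e.g. torch.Size([3, 64, 64]) for --patch-size 64

# Load the aggregate summary and print its statistics (including ASR).
with open("dolphin_attack_results/attack_summary.json") as f:
    summary = json.load(f)
print(json.dumps(summary, indent=2))
```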
```bibtex
@inproceedings{fernandez2026comparative,
  title={Comparative Analysis of Patch Attack on VLM-Based Autonomous Driving Architectures},
  author={Fernandez, David and MohajerAnsari, Pedram and Salarpour, Amir and Cheng, Long and Razi, Abolfazl and Pese, Mert D.},
  booktitle={IEEE Intelligent Vehicles Symposium (IV)},
  year={2026}
}
```

This project is licensed under the MIT License — see LICENSE for details.
