This is the official PyTorch implementation of the paper "Towards Adversarial Attack on Vision-Language Pre-training Models" (ACM Multimedia 2022).
To compute the attack success rate (ASR), first run with `--adv 0` to obtain the clean accuracy, then run with `--adv 4` to obtain the adversarial accuracy; ASR = clean accuracy - adversarial accuracy.
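As a minimal sketch of that arithmetic (the accuracy values below are placeholders, not results from the paper):

```python
# Hypothetical accuracies read from the two evaluation runs.
clean_accuracy = 95.1        # reported by the run with --adv 0
adversarial_accuracy = 31.7  # reported by the run with --adv 4

# The attack success rate is simply the accuracy drop.
asr = clean_accuracy - adversarial_accuracy
print(f"ASR = {asr:.1f}")    # -> ASR = 63.4
```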
We release the fine-tuned checkpoints (Baidu, password: iqvp) for the VE task on ALBEF and TCL. They serve as the attacked models in this paper and may also be useful for other studies.
- pytorch 1.10.2
- transformers 4.8.1
- timm 0.4.9
- bert_score 0.3.11
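Assuming a standard pip environment, these can typically be installed with `pip install torch==1.10.2 transformers==4.8.1 timm==0.4.9 bert_score==0.3.11` (the exact PyTorch wheel may depend on your CUDA version).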
- Dataset json files for downstream tasks [ALBEF github]
- Finetuned checkpoint for ALBEF [ALBEF github]
- Finetuned checkpoint for TCL [TCL github]
| Adv | Attack setting |
|---|---|
| 0 | No Attack |
| 1 | Attack Text |
| 2 | Attack Image |
| 3 | Attack Both (vanilla) |
| 4 | Co-Attack |
When attacking the unimodal embeddings, using `--adv 4` without `--cls` will raise an expected error, because the image embedding and the text embedding have different sequence lengths.
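The sketch below (not code from this repo; the shapes are made up) illustrates the mismatch and why restricting the comparison to the [CLS] embeddings avoids it:

```python
import torch

# Hypothetical per-token embeddings from the two encoders.
image_embed = torch.randn(1, 577, 256)  # e.g. 576 patches + [CLS]
text_embed  = torch.randn(1, 30, 256)   # e.g. 30 subword tokens

# Without --cls: any token-wise comparison needs matching sequence lengths,
# so e.g. (image_embed - text_embed) raises a broadcasting RuntimeError.

# With --cls: only the [CLS] embeddings are compared, so the lengths no longer matter.
sim = torch.cosine_similarity(image_embed[:, 0], text_embed[:, 0], dim=-1)
print(sim.shape)  # torch.Size([1])
```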
Download the MSCOCO or Flickr30k dataset from the original website.
# Attack Unimodal Embedding
python RetrievalEval.py --adv 4 --gpu 0 --cls \
--config configs/Retrieval_flickr.yaml \
--output_dir output/Retrieval_flickr \
--checkpoint [Finetuned checkpoint]
# Attack Multimodal Embedding
python RetrievalFusionEval.py ...
# Attack Clip Model
python RetrievalCLIPEval.py --adv 4 --gpu 0 --image_encoder ViT-B/16 ...
Download the SNLI-VE dataset from the original website.
# Attack Unimodal Embedding
python VEEval.py --adv 4 --gpu 0 --cls \
--config configs/VE.yaml \
--output_dir output/VE \
--checkpoint [Finetuned checkpoint]
# Attack Multimodal Embedding
python VEFusionEval.py ...
Download the MSCOCO dataset from the original website.
# Attack Unimodal Embedding
python GroundingEval.py --adv 4 --gpu 0 --cls \
--config configs/Grounding.yaml \
--output_dir output/Grounding \
--checkpoint [Finetuned checkpoint]
# Attack Multimodal Embedding
python GroundingFusionEval.py ...
python visualization.py --adv 4 --gpu 0
If you find this code useful for your research, please consider citing:
@inproceedings{zhang2022towards,
title={Towards Adversarial Attack on Vision-Language Pre-training Models},
author={Zhang, Jiaming and Yi, Qi and Sang, Jitao},
booktitle="Proceedings of the 30th ACM International Conference on Multimedia",
year={2022}
}