Code and data setup for the paper "Are Diffusion Models Vision-and-language Reasoners?"

McGill-NLP/diffusion-itm

DiffusionITM

Code and data setup for our paper Are Diffusion Models Vision-and-language Reasoners?

We introduce a method for applying Stable Diffusion zero-shot to image-text matching tasks (DiffusionITM), and for finetuning it with hard negatives in a transfer setting (HardNeg-DiffusionITM). Additionally, we introduce GDBench, designed to test various phenomena in discriminative and generative models.

Setup

IMPORTANT: Clone the repository with the submodules option:

git clone --recurse-submodules git@github.com:McGill-NLP/diffusion-itm.git

Run the following in the diffusers folder:

python3 setup.py install

Create a new Python environment and install the libraries in requirements.txt:

pip install -r requirements.txt

If you want to use the ARO benchmark, also run:

python -m spacy download en_core_web_sm

Dataset Setup for GDBench

Run setup.sh to download images for several of the datasets (CLEVR, SVO, ImageCoDe, Pets). If you only want to try a subset of tasks, comment out the corresponding lines. For example, downloading the SVO images can take several hours, so only run that part if you want to evaluate on SVO. The remaining datasets need a few small manual steps:

Flickr30K

Download the images from Kaggle and save them under data/flickr30k/images.

Winoground

Fill in AUTH_TOKEN on line 259 of dataset_loading.py.
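If you would rather not hard-code the token, one option is to read it from an environment variable. The sketch below is only an illustration: the variable name WINOGROUND_AUTH_TOKEN and the helper get_auth_token are hypothetical and not part of this repository, and you would still need to assign the result to AUTH_TOKEN in dataset_loading.py yourself.

```python
import os


def get_auth_token(env_var="WINOGROUND_AUTH_TOKEN"):
    """Return the token stored in `env_var` (a hypothetical variable name).

    Raises RuntimeError if the variable is unset or empty, so a missing
    token fails early instead of during dataset loading.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your access token."
        )
    return token


# Possible use inside dataset_loading.py (assumption, not the current code):
# AUTH_TOKEN = get_auth_token()
```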

Bias

Instructions for these datasets will be added soon.

Experiments

Zero-shot Image-Text-Matching (DiffusionITM)

We propose a simple method to apply Stable Diffusion to image-text-matching tasks (like Winoground).
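Roughly, the idea is to score an image-text pair by how well the text-conditioned model denoises a noised version of the image, averaged over several noise/timestep samples: a lower denoising error means a better match. The toy sketch below illustrates only that selection rule. It is a simplification and not the code in diffusion_itm.py: the error values are made up, the function names are hypothetical, and the optional subtraction of an unconditional error is shown as one possible normalization.

```python
from statistics import mean


def itm_score(cond_errors, uncond_errors=None):
    """Score one image-text pair; higher means a better match.

    cond_errors: denoising errors with text conditioning, one per
        noise/timestep sample.
    uncond_errors: optional errors without text conditioning for the same
        samples; if given, they are subtracted sample-wise first.
    """
    if uncond_errors is None:
        return -mean(cond_errors)
    return -mean(c - u for c, u in zip(cond_errors, uncond_errors))


def best_match(errors_per_candidate):
    """Return the index of the candidate with the highest score."""
    scores = [itm_score(errors) for errors in errors_per_candidate]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up errors for two candidate captions, three noise samples each.
errors = [[0.52, 0.48, 0.50], [0.61, 0.59, 0.63]]
picked = best_match(errors)  # the first caption has the lower mean error
```

In the real script, the number of noise/timestep samples per example is what --sampling_steps controls.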

The simplest command to get things started:

python3 diffusion_itm.py --task winoground

You can replace the task name with any other task from: 'winoground', 'mmbias', 'genderbias', 'imagecode', 'imagecode_video', 'flickr30k', 'flickr30k_text', 'flickr30k_neg', 'lora_flickr30k', 'imagenet', 'svo_verb', 'svo_subj', 'svo_obj', 'clevr', 'pets', 'vg_relation', 'vg_attribution', 'coco_order', 'flickr30k_order', 'mscoco', 'mscoco_val'.

To change the default options, here is the complete call:

python diffusion_itm.py --task TASK --seed SEED --cuda_device DEVICE --batchsize SIZE --subset --sampling_steps AMOUNT_OF_NOISE_TIMESTEP_SAMPLES_PER_EXAMPLE --img_retrieval --version VERSION_1.5_OR_2.1 --lora_dir ONLY_IF_YOU_LOAD_FINETUNED_WEIGHTS --guidance_scale SCALE
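When evaluating several tasks, it can be convenient to assemble these calls programmatically. The following is a minimal sketch that uses only flags from the call above; the helper build_itm_command is hypothetical, and it assumes that --subset and --img_retrieval are plain on/off switches, as their appearance without values suggests.

```python
import shlex


def build_itm_command(task, seed=None, batchsize=None, sampling_steps=None,
                      version=None, lora_dir=None, subset=False,
                      img_retrieval=False):
    """Build the argument list for one diffusion_itm.py run.

    Options left at None/False are omitted so the script's defaults apply.
    """
    cmd = ["python3", "diffusion_itm.py", "--task", task]
    valued_flags = {
        "--seed": seed,
        "--batchsize": batchsize,
        "--sampling_steps": sampling_steps,
        "--version": version,
        "--lora_dir": lora_dir,
    }
    for flag, value in valued_flags.items():
        if value is not None:
            cmd += [flag, str(value)]
    # Assumed to be boolean switches (they take no value in the README call).
    if subset:
        cmd.append("--subset")
    if img_retrieval:
        cmd.append("--img_retrieval")
    return cmd


# Print one command per task; nothing is executed here.
for task in ("winoground", "clevr", "pets"):
    print(shlex.join(build_itm_command(task, seed=0, subset=True)))
```

To actually launch a run, the returned list can be passed to subprocess.run(cmd, check=True) from the repository root.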

Fine-tuned ITM

As explained in the next section, we also finetuned our model with in-batch negatives on MSCOCO. The checkpoint directories can be found under checkpoints and are named as in the paper: hardneg (last row of Table 2) and noneg (second row of Table 2).

You can simply specify --lora_dir as shown in the above section when running diffusion_itm.py.

Hard Negative Finetuning (HardNeg-DiffusionITM)

On a single GPU:

python3 hard_neg_finetuning.py --pretrained_model_name_or_path DIR_OF_SD2.1_IN_CACHE --train_batch_size 4 --gradient_accumulation_steps 4 --output_dir DIR_TO_SAVE_CKPT_AND_LOGS --checkpointing_steps 500 --num_train_epochs NUM_EPOCHS --mixed_neg

With the --mixed_neg option, this trains with both hard negative images and hard negative texts for a given image-text pair. The weights for Stable Diffusion 2.1 should already be in the cache if you have run the zero-shot script, for example.

If you have multiple GPUs:

accelerate launch hard_neg_finetuning.py ...
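When moving between the single-GPU and multi-GPU setups, keep in mind how the settings combine into an effective batch size per optimizer step. The sketch below shows the usual arithmetic; it assumes that --train_batch_size is a per-device value and that the launch is data-parallel, which is the common convention for Accelerate-based training scripts but is not stated here. The helper name is hypothetical.

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps,
                         num_gpus=1):
    """Examples contributing to one optimizer step.

    Assumes train_batch_size is per device and training is data-parallel.
    """
    return train_batch_size * gradient_accumulation_steps * num_gpus


# Values from the single-GPU command: 4 per step, accumulated over 4 steps.
single_gpu = effective_batch_size(4, 4)

# Same flags on 2 GPUs (hypothetical setup) would double the effective size.
two_gpus = effective_batch_size(4, 4, num_gpus=2)
```

Under these assumptions, you may want to lower --gradient_accumulation_steps when adding GPUs to keep the effective batch size comparable to the single-GPU run.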

Citation and Contact

If you have questions, don't hesitate to reach out to benno.krojer@mila.quebec or submit an issue!

We are grateful for the previous code and data that we build on, especially the Diffusers library and the great repository for the ARO benchmark.

If you use our work, please cite us as:

@article{krojer2023diffusion,
  title={Are Diffusion Models Vision-And-Language Reasoners?},
  author={Krojer, Benno and Poole-Dayan, Elinor and Voleti, Vikram and Pal, Christopher and Reddy, Siva},
  journal={arXiv preprint arXiv:2305.16397},
  year={2023}
}
