DPT-T2I

Discriminative Probing and Tuning for Text-to-Image Generation

Leigang Qu, Wenjie Wang, Yongqi Li, Hanwang Zhang, Liqiang Nie, and Tat-Seng Chua

This repository contains the code and model links for DPT, a method that improves text-image alignment in text-to-image (T2I) generation. We show that pre-trained T2I models have latent discriminative abilities, and that discriminative tuning on image-text matching (ITM) and referring expression comprehension (REC) tasks yields significant gains in text-image alignment.

Introduction

Schematic illustration of the proposed discriminative probing and tuning (DPT) framework. We first extract semantic representations from the frozen Stable Diffusion (SD) model and then introduce a discriminative adapter for discriminative probing, which investigates the global matching and local grounding abilities of SD. Afterward, we perform parameter-efficient discriminative tuning by introducing LoRA parameters. During inference, a self-correction mechanism guides the denoising-based text-to-image generation.

Release

Release the training code.
Release the inference code and checkpoint (v2.1) for text-to-image synthesis.
Release the paper of DPT on arXiv.

Installation

The requirements.txt file lists the dependencies needed by DPT-T2I. Install them as follows.

First, clone the repository locally:

git clone https://github.com/LgQu/DPT-T2I.git

Make a new conda env and activate it:

conda create -n dpt_t2i python=3.8
conda activate dpt_t2i

Install the packages listed in requirements.txt:

pip install -r requirements.txt

Text-to-Image Synthesis

First, download the LoRA checkpoint (v2.1), which is based on Stable Diffusion v2.1, and put it in the directory ./ckpt/dpt-v2.1.
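As a minimal sketch of preparing that directory, the snippet below creates ./ckpt/dpt-v2.1 and reports whether anything has been placed in it yet. The variable name CKPT_DIR is only illustrative, and the file names inside the checkpoint are not assumed here.

```shell
# Directory where the instructions above expect the LoRA checkpoint.
CKPT_DIR=./ckpt/dpt-v2.1
mkdir -p "$CKPT_DIR"

# After copying the downloaded files in, confirm the directory is not empty.
if [ -z "$(ls -A "$CKPT_DIR")" ]; then
  echo "No checkpoint files found in $CKPT_DIR yet"
else
  echo "Checkpoint files found in $CKPT_DIR"
fi
```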

Run txt2img.py:

python txt2img.py --gpuid 0 --prompt "a painting of a virus monster playing guitar"

The generated images can be seen in the ./outputs directory.

Training

In stage 1, we freeze the UNet of Stable Diffusion and train the discriminative adapter:

accelerate launch --mixed_precision="fp16" --multi_gpu --main_process_port=25548 train_stage1.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --enable_xformers_memory_efficient_attention \
    --gradient_checkpointing \
    --unet_feature='up1' \
    --dataloader_num_workers=4 \
    --center_crop --random_flip \
    --lr_scheduler="constant"  \
    --checkpointing_steps=500 \
    --train_batch_size=48 \
    --val_batch_size=16 \
    --gradient_accumulation_steps=1 \
    --max_train_steps=60000 \
    --learning_rate=1e-4  \
    --run_name='dpt_stage1' \
    --report_to=wandb

where MODEL_NAME can be "stabilityai/stable-diffusion-2-1-base" or "CompVis/stable-diffusion-v1-4".
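The training commands read MODEL_NAME from the shell environment, so it has to be set before launching. One way to do that, shown here only as a sketch, is to export it in the same shell session:

```shell
# Pick one of the two base models named above.
export MODEL_NAME="stabilityai/stable-diffusion-2-1-base"
# export MODEL_NAME="CompVis/stable-diffusion-v1-4"

echo "Training from: $MODEL_NAME"
```

The same variable is reused by the stage 2 command, so both stages should normally be run against the same base model.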

In stage 2, we train the whole model, including the UNet of Stable Diffusion (with LoRA) and the discriminative adapter:

accelerate launch --mixed_precision="fp16" --multi_gpu --main_process_port=25548 train_stage2.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --enable_xformers_memory_efficient_attention \
    --gradient_checkpointing \
    --unet_feature='up1' \
    --lora_rank=4 \
    --mse_loss_coef=1 \
    --dataloader_num_workers=8 \
    --center_crop --random_flip \
    --lr_scheduler="cosine"  \
    --checkpointing_steps=200 \
    --train_batch_size=8 \
    --val_batch_size=16 \
    --gradient_accumulation_steps=8 \
    --max_train_steps=5000 \
    --learning_rate=1e-4 \
    --qformer_ckpt $CKPT_STAGE1 \
    --run_name='dpt_stage2' \
    --report_to=wandb

where $CKPT_STAGE1 denotes the directory of the checkpoint obtained in stage 1.
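The two stages use different batch settings, so it can help to work out how many samples contribute to each optimizer step. The sketch below assumes that --train_batch_size is a per-device value, as in the Hugging Face diffusers example scripts these flags resemble, and that NUM_GPUS (a hypothetical variable, not a script flag) matches the number of processes started by accelerate.

```shell
# NUM_GPUS is an assumption; set it to the number of processes accelerate launches.
NUM_GPUS=2

# Stage 1: --train_batch_size=48, --gradient_accumulation_steps=1
STAGE1_EFFECTIVE=$((48 * 1 * NUM_GPUS))
# Stage 2: --train_batch_size=8, --gradient_accumulation_steps=8
STAGE2_EFFECTIVE=$((8 * 8 * NUM_GPUS))

echo "Stage 1 samples per optimizer step: $STAGE1_EFFECTIVE"
echo "Stage 2 samples per optimizer step: $STAGE2_EFFECTIVE"
```

If you train on a different number of GPUs, adjusting --gradient_accumulation_steps is one way to keep these totals close to the values used here.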

Citation

If you find our work useful in your research, please consider citing DPT:

@inproceedings{qu2024discriminative,
  title={Discriminative Probing and Tuning for Text-to-Image Generation},
  author={Qu, Leigang and Wang, Wenjie and Li, Yongqi and Zhang, Hanwang and Nie, Liqiang and Chua, Tat-Seng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={7434--7444},
  year={2024}
}

Acknowledgement

We thank the authors of DETR, MDETR, and DiffusionITM for making their code available.

