DALL-3

try it out on colab

DALL-3 is a mashup of DALLE-pytorch, VQGAN and Clip-Guided Diffusion. The basic idea is to use a diffusion model instead of VAE for the decoder stage, which allows us to use 16x16 tokens instead of 32x32 while maintaining comparable image quality.

This DALLE model is meant to be used with https://github.com/Jack000/guided-diffusion

minor modifications to DALLE-pytorch:

hardcoded 128px image size in dalle (in order to use mismatched VAE/DALLE image sizes)
added top-p filtering

Cherry picked sample images:

Before diffusion	After diffusion

_{Prompt: A cube made of cloud, a cube with the texture of cloud}

_{Prompt: A cube made of water, a cube with the texture of water}

_{Prompt: A cube made of porcupine, a cube with the texture of porcupine}

_{Prompt: An armchair shaped like an avocado, an avocado armchair}

_{Prompt: A girl with thick glasses, a girl wearing glasses}

_{Prompt: A machine learning researcher smashes his computer in a fit of rage}

Non-Cherry picked images (clip re-ranked best 8 out of 1024):



A cube made of cloud. A cube with the texture of cloud



An armchair shaped like an avocado. An avocado armchair



A girl with thick glasses. A girl wearing glasses

Usage

cd DALLE-pytorch
pip install -e .

# download GumbelVQ VAE model
mkdir -p vqgan_gumbel_f8
wget 'https://heibox.uni-heidelberg.de/f/b24d14998a8d4f19a34f/?dl=1' -O 'vqgan_gumbel_f8/model.yaml' 
wget 'https://heibox.uni-heidelberg.de/f/34a747d5765840b5a99d/?dl=1' -O 'vqgan_gumbel_f8/last.ckpt'

# download DALL-E models
wget https://dall-3.com/models/dalle/bpe.model
wget https://dall-3.com/models/dalle/dalle-latest.pt

# generate (optionally install OpenAI clip for --clip_sort)
python generate.py --top_p 0.85 --temperature 1.0 --clip_sort --output_npy --dalle_path ./dalle-latest.pt --bpe_path bpe.model --taming --vqgan_model_path vqgan_gumbel_f8/last.ckpt --vqgan_config_path vqgan_gumbel_f8/model.yaml --text 'a girl with thick glasses. a girl wearing glasses'

# post process
# use the npy file as input to clip guided diffusion https://github.com/Jack000/guided-diffusion

Citations

@misc{ramesh2021zeroshot,
    title   = {Zero-Shot Text-to-Image Generation}, 
    author  = {Aditya Ramesh and Mikhail Pavlov and Gabriel Goh and Scott Gray and Chelsea Voss and Alec Radford and Mark Chen and Ilya Sutskever},
    year    = {2021},
    eprint  = {2102.12092},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

@misc{unpublished2021clip,
    title  = {CLIP: Connecting Text and Images},
    author = {Alec Radford, Ilya Sutskever, Jong Wook Kim, Gretchen Krueger, Sandhini Agarwal},
    year   = {2021}
}

@misc{kitaev2020reformer,
    title   = {Reformer: The Efficient Transformer},
    author  = {Nikita Kitaev and Łukasz Kaiser and Anselm Levskaya},
    year    = {2020},
    eprint  = {2001.04451},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}

@misc{esser2021taming,
    title   = {Taming Transformers for High-Resolution Image Synthesis},
    author  = {Patrick Esser and Robin Rombach and Björn Ommer},
    year    = {2021},
    eprint  = {2012.09841},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

@misc{ding2021cogview,
    title   = {CogView: Mastering Text-to-Image Generation via Transformers},
    author  = {Ming Ding and Zhuoyi Yang and Wenyi Hong and Wendi Zheng and Chang Zhou and Da Yin and Junyang Lin and Xu Zou and Zhou Shao and Hongxia Yang and Jie Tang},
    year    = {2021},
    eprint  = {2105.13290},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

@software{peng_bo_2021_5196578,
    author       = {PENG Bo},
    title        = {BlinkDL/RWKV-LM: 0.01},
    month        = {aug},
    year         = {2021},
    publisher    = {Zenodo},
    version      = {0.01},
    doi          = {10.5281/zenodo.5196578},
    url          = {https://doi.org/10.5281/zenodo.5196578}
}

@misc{su2021roformer,
    title   = {RoFormer: Enhanced Transformer with Rotary Position Embedding},
    author  = {Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu},
    year    = {2021},
    eprint  = {2104.09864},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}

Those who do not want to imitate anything, produce nothing. - Dali

Name		Name	Last commit message	Last commit date
Latest commit History 524 Commits
.github/workflows		.github/workflows
dalle_pytorch		dalle_pytorch
docker		docker
examples		examples
images		images
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
generate.py		generate.py
install_apex.sh		install_apex.sh
install_deepspeed.sh		install_deepspeed.sh
setup.py		setup.py
train_dalle.py		train_dalle.py
train_vae.py		train_vae.py

License

Jack000/DALLE-pytorch

Folders and files

Latest commit

History

Repository files navigation

DALL-3

Usage

Citations

About

Resources

License

Stars

Watchers

Forks

Languages