VL-GPT

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

¹ Xi'an Jiaotong University   ² Tencent AI Lab   ³ The University of Hong Kong
* Equal Contribution

License: Apache 2.0

  • VL-GPT is a generative pre-trained transformer for vision and language understanding and generation that can perceive and generate visual and linguistic data concurrently. By employing a straightforward auto-regressive objective, VL-GPT achieves unified pre-training over both image and text modalities.

  • We also propose an image tokenizer-detokenizer framework that converts between raw images and continuous visual embeddings, analogous to the role of BPE tokenization in language models (a minimal conceptual sketch of both ideas follows below).
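
Below is a minimal conceptual sketch of the two ideas above, not the released VL-GPT code: the class names (ImageTokenizer, UnifiedARModel), layer choices, and all hyper-parameters are illustrative assumptions. It only shows the shape of the approach: an image is mapped to a short sequence of continuous visual embeddings, which are interleaved with text tokens and modeled by a single causal transformer that predicts the next text token at text positions and the next visual embedding at image positions. The detokenizer, which would reconstruct an image from predicted embeddings, is omitted here.

```python
import torch
import torch.nn as nn


class ImageTokenizer(nn.Module):
    """Map a raw image to a short sequence of continuous visual embeddings
    (the "tokenizer" half of the tokenizer-detokenizer framework)."""

    def __init__(self, d_model: int = 768, n_visual_tokens: int = 32):
        super().__init__()
        self.patchify = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.pool = nn.AdaptiveAvgPool2d((1, n_visual_tokens))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) -> (B, n_visual_tokens, d_model)
        feats = self.pool(self.patchify(images))   # (B, d_model, 1, T)
        return feats.flatten(2).transpose(1, 2)


class UnifiedARModel(nn.Module):
    """One causal transformer over interleaved visual embeddings and text
    tokens: a language-modeling head supervises text positions and a
    regression head predicts the next continuous visual embedding."""

    def __init__(self, vocab_size: int = 32000, d_model: int = 768,
                 n_layers: int = 4, n_heads: int = 12):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.text_head = nn.Linear(d_model, vocab_size)  # next-token logits
        self.visual_head = nn.Linear(d_model, d_model)   # next-embedding prediction

    def forward(self, text_ids: torch.Tensor, visual_embeds: torch.Tensor):
        # Image embeddings are placed before the caption tokens, then the
        # interleaved sequence is processed with a causal (upper-triangular) mask.
        seq = torch.cat([visual_embeds, self.text_embed(text_ids)], dim=1)
        n = seq.size(1)
        causal = torch.full((n, n), float("-inf")).triu(diagonal=1)
        hidden = self.backbone(seq, mask=causal)
        return self.text_head(hidden), self.visual_head(hidden)


# Toy forward pass: two images, each paired with a 16-token caption.
tokenizer = ImageTokenizer()
model = UnifiedARModel()
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 32000, (2, 16))
text_logits, visual_preds = model(captions, tokenizer(images))
```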

TODOs

  • Training and evaluation code
  • Pretrained and instruction-tuned model weights

License

This project is released under the Apache 2.0 license. Please see the LICENSE file for more information.
