
Recent Advances in Vision and Language PreTrained Models (VL-PTMs)

Maintained by WANG Yue (wangyue2714@gmail.com). Last update on 2021/06/14.

Table of Contents

Image-based VL-PTMs
  Representation Learning
  Task-specific
  Other Analysis
Video-based VL-PTMs
Speech-based VL-PTMs
Other Transformer-based multimodal networks
Other Resources

Image-based VL-PTMs

Representation Learning

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019 [code]

LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019 [code]

VL-BERT: Pre-training of Generic Visual-Linguistic Representations, ICLR 2020 [code]

VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019/08, ACL 2020 [code]

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, AAAI 2020

Unified Vision-Language Pre-Training for Image Captioning and VQA, AAAI 2020, [code], (VLP)

UNITER: Learning Universal Image-text Representations, ECCV 2020, [code]

Weak Supervision helps Emergence of Word-Object Alignment and improves Vision-Language Tasks, arXiv 2019/12

InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining, arXiv 2020/03

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, arXiv 2020/04, ECCV 2020

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, arXiv 2020/04

ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs, arXiv 2020/06

DeVLBert: Learning Deconfounded Visio-Linguistic Representations, ACM MM 2020, [code]

SemVLP: Vision-Language Pre-training by Aligning Semantics at Multiple Levels, ICLR 2021 submission

CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations, arXiv 2020/10

Multimodal Pretraining Unmasked: Unifying the Vision and Language BERTs, arXiv 2020/11

LAMP: Label Augmented Multimodal Pretraining, arXiv 2020/12

Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network, AAAI 2021

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, arXiv 2021

UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning, ACL 2021 [code]

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers, EMNLP 2020

VinVL: Revisiting Visual Representations in Vision-Language Models, CVPR 2021

Kaleido-BERT: Vision-Language Pre-training on Fashion Domain, CVPR 2021

Learning Transferable Visual Models From Natural Language Supervision, arXiv 2021/02, (CLIP)

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, NeurIPS 2021 Spotlight [code], (ALBEF)

Florence: A New Foundation Model for Computer Vision, arXiv 2021/11

Task-specific

VCR: Fusion of Detected Objects in Text for Visual Question Answering, EMNLP 2019, [code], (B2T2)

TextVQA: Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA, CVPR 2020, [code], (M4C)

VisDial: VD-BERT: A Unified Vision and Dialog Transformer with BERT, EMNLP 2020 [code], (VD-BERT)

VisDial: Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline, ECCV 2020 [code], (VisDial-BERT)

VLN: Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training, CVPR 2020, [code], (PREVALENT)

Text-image retrieval: ImageBERT: Cross-Modal Pre-training with Large-scale Weak-supervised Image-text Data, arXiv 2020/01

Image captioning: XGPT: Cross-modal Generative Pre-Training for Image Captioning, arXiv 2020/03

Visual Question Generation: BERT Can See Out of the Box: On the Cross-modal Transferability of Text Representations, arXiv 2020/02

Text-image retrieval: Cross-Probe BERT for Efficient and Effective Cross-Modal Search, ICLR 2021 submission

Chart VQA: STL-CQA: Structure-based Transformers with Localization and Encoding for Chart Question Answering, EMNLP 2020.

VisualMRC: VisualMRC: Machine Reading Comprehension on Document Images, AAAI 2021, (LayoutT5, LayoutBART)

Visual Relationship Detection: Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations, IEEE Access 2021

Other Analysis

Multi-task Learning, 12-in-1: Multi-Task Vision and Language Representation Learning, CVPR 2020, [code]

Multi-task Learning, Unifying Vision-and-Language Tasks via Text Generation, arXiv 2021/02

Social Bias in VL Embedding, Measuring Social Biases in Grounded Vision and Language Embeddings, arXiv 2020/02, [code]

In-depth Analysis, Are we pretraining it right? Digging deeper into visio-linguistic pretraining, arXiv 2020/04

In-depth Analysis, Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models, ECCV 2020 Spotlight

In-depth Analysis, A Closer Look at the Robustness of Vision-and-Language Pre-trained Models, arXiv 2020/12

Adversarial Training, Large-Scale Adversarial Training for Vision-and-Language Representation Learning, NeurIPS 2020 Spotlight

Adaptive Analysis, Adaptive Transformers for Learning Multimodal Representations, ACL SRW 2020

Neural Architecture Search, Deep Multimodal Neural Architecture Search, arXiv 2020/04

Dataset perspective, Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, arXiv 2021/02, (ALIGN)

Video-based VL-PTMs

VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019

Learning Video Representations Using Contrastive Bidirectional Transformers, arXiv 2019/06, (CBT)

M-BERT: Injecting Multimodal Information in the BERT Structure, arXiv 2019/08

BERT for Large-scale Video Segment Classification with Test-time Augmentation, ICCV 2019 YouTube8M workshop, [code]

Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog, AAAI 2020 DSTC8 workshop

Learning Spatiotemporal Features via Video and Text Pair Discrimination, arXiv 2020/01, (CPD), [code]

UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation, arXiv 2020/02

ActBERT: Learning Global-Local Video-Text Representations, CVPR 2020

HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training, EMNLP 2020

Video-Grounded Dialogues with Pretrained Generation Language Models, ACL 2020

Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training, arXiv 2020/07

Multimodal Pretraining for Dense Video Captioning, arXiv 2020/11

Parameter Efficient Multimodal Transformers for Video Representation Learning, arXiv 2020/12

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling, CVPR 2021

Speech-based VL-PTMs

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models, arXiv 2019/06

Understanding Semantics from Speech Through Pre-training, arXiv 2019/09

SpeechBERT: Cross-Modal Pre-trained Language Model for End-to-end Spoken Question Answering, arXiv 2019/10

vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations, arXiv 2019/10

Effectiveness of self-supervised pre-training for speech recognition, arXiv 2019/11

Other Transformer-based multimodal networks

Multi-Modality Cross Attention Network for Image and Sentence Matching, ICCV 2020

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning, ACL 2020

History for Visual Dialog: Do we really need it?, ACL 2020

Cross-Modality Relevance for Reasoning on Language and Vision, ACL 2020

Other Resources
