This repository contains a collection of resources and papers on Vision-Language Models.
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
arXiv 2022. [Paper]
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
arXiv 2021. [Paper][Github]
RegionCLIP: Region-based Language-Image Pretraining
CVPR 2022. [Paper][Github]
Grounded Language-Image Pre-training
CVPR 2022. [Paper][Github]
Detecting Twenty-thousand Classes using Image-level Supervision
ECCV 2022. [Paper][Github]
PromptDet: Towards Open-vocabulary Detection using Uncurated Images
ECCV 2022. [Paper][Github]
Simple Open-Vocabulary Object Detection with Vision Transformers
ECCV 2022. [Paper][Github]
Open-Vocabulary DETR with Conditional Matching
ECCV 2022. [Paper][Github]
X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks
ECCV 2022. [Paper]
Pix2seq: A Language Modeling Framework for Object Detection
ICLR 2022. [Paper][Github]
F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models
ICLR 2023. [Paper]
Learning Object-Language Alignments for Open-Vocabulary Object Detection
ICLR 2023. [Paper][Github]
ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues
CVPR 2022. [Paper]
CLIP the Gap: A Single Domain Generalization Approach for Object Detection
arXiv 2023. [Paper]
K-LITE: Learning Transferable Visual Models with External Knowledge
NeurIPS 2022. [Paper]
Visual Classification via Description From Large Language Models
ICLR 2023. [Paper]
Learning to Compose Soft Prompts for Compositional Zero-Shot Learning
ICLR 2023. [Paper][Github]
Masked Unsupervised Self-training for Zero-shot Image Classification
ICLR 2023. [Paper][Github]
CLIPood: Generalizing CLIP to Out-of-Distributions
arXiv 2023. [Paper]
On-the-fly Object Detection using StyleGAN with CLIP Guidance
arXiv 2022. [Paper]
ISS: Image as Stepping Stone for Text-Guided 3D Shape Generation
ICLR 2023. [Paper]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
ICLR 2023. [Paper][Github]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
ICLR 2023. [Paper][Github]
Learning Input-Agnostic Manipulation Directions in StyleGAN with Text Guidance
ICLR 2023. [Paper]
CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation
CVPR 2022. [Paper][Github]
MotionCLIP: Exposing Human Motion Generation to CLIP Space
ECCV 2022. [Paper][Github]
VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance
ECCV 2022. [Paper][Github]
CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders
NeurIPS 2022. [Paper]
GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis
arXiv 2023. [Paper]
CRIS: CLIP-Driven Referring Image Segmentation
CVPR 2022. [Paper][Github]
Extract Free Dense Labels from CLIP
ECCV 2022. [Paper][Github]
Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding
ECCV 2022. [Paper]
A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model
ECCV 2022. [Paper]
CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation
arXiv 2022. [Paper]
Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
arXiv 2022. [Paper]
Image Segmentation Using Text and Image Prompts
CVPR 2022. [Paper][Github]
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
arXiv 2022. [Paper]
DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
CVPR 2022. [Paper]
GroupViT: Semantic Segmentation Emerges from Text Supervision
CVPR 2022. [Paper]
PointCLIP: Point Cloud Understanding by CLIP
CVPR 2022. [Paper][Github]
CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP
arXiv 2023. [Paper]
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning
arXiv 2022. [Paper]
CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory
arXiv 2022. [Paper]
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
ICLR 2023. [Paper]
Frozen CLIP Models are Efficient Video Learners
ECCV 2022. [Paper][Github]
Zero-Shot Temporal Action Detection via Vision-Language Prompting
ECCV 2022. [Paper]
FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks
BMVC 2022. [Paper]
Fine-tuned CLIP Models are Efficient Video Learners
arXiv 2022. [Paper]
Face Recognition in the Age of CLIP and Billion-Image Datasets
arXiv 2023. [Paper]
Conditional Prompt Learning for Vision-Language Models
CVPR 2022. [Paper][Github]
Prompt Learning with Optimal Transport for Vision-Language Models
ICLR 2023. [Paper]
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
ICLR 2023. [Paper][Github]
"This is my unicorn, Fluffy": Personalizing Frozen Vision-Language Representations
ECCV 2022. [Paper]
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
NeurIPS 2022. [Paper]
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
ICML 2022. [Paper]
Attentive Mask CLIP
arXiv 2022. [Paper]
CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet
arXiv 2022. [Paper]
Frozen CLIP Model is An Efficient Point Cloud Backbone
arXiv 2022. [Paper]
Linearly Mapping from Image to Text Space
ICLR 2023. [Paper]
DECAP: Decoding CLIP Latents for Zero-shot Captioning
ICLR 2023. [Paper]
Weakly Supervised Grounding for VQA in Vision-Language Transformers
ECCV 2022. [Paper]
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
ECCV 2022. [Paper]
When and why vision-language models behave like bags-of-words, and what to do about it?
ICLR 2023. [Paper][Code Coming Soon]
Generative Negative Text Replay for Continual Vision-Language Pretraining
ECCV 2022. [Paper]
Does CLIP Bind Concepts? Probing Compositionality in Large Image Models
arXiv 2022. [Paper]
When are Lemons Purple? The Concept Association Bias of CLIP
arXiv 2022. [Paper]
CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection
arXiv 2023. [Paper]