
Up-to-date collection of Vision-Language Models, mainly focused on computer vision.


This repository contains a collection of resources and papers on Vision-Language Models.




Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
arXiv 2022.[Paper]


Object Detection

Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
arXiv 2021. [Paper][Github]

RegionCLIP: Region-based Language-Image Pretraining
CVPR 2022. [Paper][Github]

Grounded Language-Image Pre-training
CVPR 2022.[Paper][Github]

Detecting Twenty-thousand Classes using Image-level Supervision
ECCV 2022.[Paper][Github]

PromptDet: Towards Open-vocabulary Detection using Uncurated Images
ECCV 2022.[Paper][Github]

Simple Open-Vocabulary Object Detection with Vision Transformers
ECCV 2022.[Paper][Github]

Open-Vocabulary DETR with Conditional Matching
ECCV 2022.[Paper][Github]

X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks
ECCV 2022.[Paper]

Pix2seq: A Language Modeling Framework for Object Detection
ICLR 2022.[Paper][Github]

F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models
ICLR 2023.[Paper]

Learning Object-Language Alignments for Open-Vocabulary Object Detection
ICLR 2023.[Paper][Github]

ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues
CVPR 2022.[Paper]

CLIP the Gap: A Single Domain Generalization Approach for Object Detection
arXiv 2023.[Paper]


K-LITE: Learning Transferable Visual Models with External Knowledge
NeurIPS 2022.[Paper]

Visual Classification via Description From Large Language Models
ICLR 2023.[Paper]

Learning to Compose Soft Prompts for Compositional Zero-Shot Learning
ICLR 2023. [Paper][Github]

Masked Unsupervised Self-training for Zero-shot Image Classification
ICLR 2023.[Paper][Github]

CLIPood: Generalizing CLIP to Out-of-Distributions
arXiv 2023.[Paper]

arXiv 2022.[Paper]

Generation & Manipulation

ISS: Image as Stepping Stone for Text-Guided 3D Shape Generation
ICLR 2023.[Paper]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
ICLR 2023. [Paper][Github]

CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
ICLR 2023.[Paper][Github]

Learning Input-Agnostic Manipulation Directions in StyleGAN with Text Guidance
ICLR 2023.[Paper]

CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation
CVPR 2022.[Paper][Github]

MotionCLIP: Exposing Human Motion Generation to CLIP Space
ECCV 2022.[Paper][Github]

VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance
ECCV 2022.[Paper][Github]

CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders
NeurIPS 2022.[Paper]

GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis
arXiv 2023.[Paper]


CRIS: CLIP-Driven Referring Image Segmentation
CVPR 2022.[Paper][Github]

Extract Free Dense Labels from CLIP
ECCV 2022.[Paper][Github]

Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding
ECCV 2022.[Paper]

A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model
ECCV 2022.[Paper]

CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation
arXiv 2022.[Paper]

Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
arXiv 2022.[Paper]

Image Segmentation Using Text and Image Prompts
CVPR 2022.[Paper][Github]

MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
arXiv 2022.[Paper]

DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
CVPR 2022.[Paper]

GroupViT: Semantic Segmentation Emerges from Text Supervision
CVPR 2022.[Paper]


PointCLIP: Point Cloud Understanding by CLIP
CVPR 2022.[Paper][Github]

CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP
arXiv 2023.[Paper]

PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning
arXiv 2022.[Paper]

CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory
arXiv 2022.[Paper]


CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
ICLR 2023.[Paper]

Frozen CLIP Models are Efficient Video Learners
ECCV 2022.[Paper][Github]

Zero-Shot Temporal Action Detection via Vision-Language Prompting
ECCV 2022.[Paper]

FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks
BMVC 2022.[Paper]

Fine-tuned CLIP Models are Efficient Video Learners
arXiv 2022.[Paper]


Face Recognition in the age of CLIP & Billion image datasets
arXiv 2023.[Paper]


Conditional Prompt Learning for Vision-Language Models
CVPR 2022.[Paper][Github]

Prompt Learning with Optimal Transport for Vision-Language Models
ICLR 2023.[Paper]

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
ICLR 2023.[Paper][Github]

"This is my unicorn, Fluffy": Personalizing frozen vision-language representations
ECCV 2022.[Paper]

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
NeurIPS 2022.[Paper]

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
ICML 2022.[Paper]

Attentive Mask CLIP
arXiv 2022.[Paper]

CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet
arXiv 2022.[Paper]

Frozen CLIP Model is An Efficient Point Cloud Backbone
arXiv 2022.[Paper]

Natural Language

Linearly Mapping from Image to Text Space
ICLR 2023.[Paper]

DECAP: Decoding CLIP Latents for Zero-shot Captioning
ICLR 2023.[Paper]

Weakly Supervised Grounding for VQA in Vision-Language Transformers
ECCV 2022.[Paper]

UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
ECCV 2022.[Paper]


When and why vision-language models behave like bags-of-words, and what to do about it?
ICLR 2023. [Paper][Code Coming Soon]

Generative Negative Text Replay for Continual Vision-Language Pretraining
ECCV 2022.[Paper]

Does CLIP Bind Concepts? Probing Compositionality in Large Image Models
arXiv 2022.[Paper]

When are Lemons Purple? The Concept Association Bias of CLIP
arXiv 2022.[Paper]

Medical Image

CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection
arXiv 2023.[Paper]

