This repository contains a collection of resources and papers on Vision-Language Models.
Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
arXiv 2022. [Paper]
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
arXiv 2021. [Paper][Github]
RegionCLIP: Region-based Language-Image Pretraining
CVPR 2022. [Paper][Github]
Grounded Language-Image Pre-training
CVPR 2022. [Paper][Github]
Detecting Twenty-thousand Classes using Image-level Supervision
ECCV 2022. [Paper][Github]
PromptDet: Towards Open-vocabulary Detection using Uncurated Images
ECCV 2022. [Paper][Github]
Simple Open-Vocabulary Object Detection with Vision Transformers
ECCV 2022. [Paper][Github]
Open-Vocabulary DETR with Conditional Matching
ECCV 2022. [Paper][Github]
X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks
ECCV 2022. [Paper]
Pix2seq: A Language Modeling Framework for Object Detection
ICLR 2022. [Paper][Github]
F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models
ICLR 2023. [Paper]
Learning Object-Language Alignments for Open-Vocabulary Object Detection
ICLR 2023. [Paper][Github]
ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues
CVPR 2022. [Paper]
CLIP the Gap: A Single Domain Generalization Approach for Object Detection
arXiv 2023. [Paper]
K-LITE: Learning Transferable Visual Models with External Knowledge
NeurIPS 2022. [Paper]
Visual Classification via Description From Large Language Models
ICLR 2023. [Paper]
Learning to Compose Soft Prompts for Compositional Zero-Shot Learning
ICLR 2023. [Paper][Github]
Masked Unsupervised Self-training for Zero-shot Image Classification
ICLR 2023. [Paper][Github]
CLIPood: Generalizing CLIP to Out-of-Distributions
arXiv 2023. [Paper]
On-the-fly Object Detection using StyleGAN with CLIP Guidance
arXiv 2022. [Paper]
ISS: Image as Stepping Stone for Text-Guided 3D Shape Generation
ICLR 2023. [Paper]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
ICLR 2023. [Paper][Github]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
ICLR 2023. [Paper][Github]
Learning Input-Agnostic Manipulation Directions in StyleGAN with Text Guidance
ICLR 2023. [Paper]
CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation
CVPR 2022. [Paper][Github]
MotionCLIP: Exposing Human Motion Generation to CLIP Space
ECCV 2022. [Paper][Github]
VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance
ECCV 2022. [Paper][Github]
CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders
NeurIPS 2022. [Paper]
GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis
arXiv 2023. [Paper]
CRIS: CLIP-Driven Referring Image Segmentation
CVPR 2022. [Paper][Github]
Extract Free Dense Labels from CLIP
ECCV 2022. [Paper][Github]
Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding
ECCV 2022. [Paper]
A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model
ECCV 2022. [Paper]
CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation
arXiv 2022. [Paper]
Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
arXiv 2022. [Paper]
Image Segmentation Using Text and Image Prompts
CVPR 2022. [Paper][Github]
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
arXiv 2022. [Paper]
DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
CVPR 2022. [Paper]
GroupViT: Semantic Segmentation Emerges from Text Supervision
CVPR 2022. [Paper]
PointCLIP: Point Cloud Understanding by CLIP
CVPR 2022. [Paper][Github]
CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP
arXiv 2023. [Paper]
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning
arXiv 2022. [Paper]
CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory
arXiv 2022. [Paper]
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
ICLR 2023. [Paper]
Frozen CLIP Models are Efficient Video Learners
ECCV 2022. [Paper][Github]
Zero-Shot Temporal Action Detection via Vision-Language Prompting
ECCV 2022. [Paper]
FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks
BMVC 2022. [Paper]
Fine-tuned CLIP Models are Efficient Video Learners
arXiv 2022. [Paper]
Face Recognition in the Age of CLIP and Billion-Image Datasets
arXiv 2023. [Paper]
Conditional Prompt Learning for Vision-Language Models
CVPR 2022. [Paper][Github]
Prompt Learning with Optimal Transport for Vision-Language Models
ICLR 2023. [Paper]
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
ICLR 2023. [Paper][Github]
"This is my unicorn, Fluffy": Personalizing Frozen Vision-Language Representations
ECCV 2022. [Paper]
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
NeurIPS 2022. [Paper]
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
ICML 2022. [Paper]
Attentive Mask CLIP
arXiv 2022. [Paper]
CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet
arXiv 2022. [Paper]
Frozen CLIP Model is An Efficient Point Cloud Backbone
arXiv 2022. [Paper]
Linearly Mapping from Image to Text Space
ICLR 2023. [Paper]
DECAP: Decoding CLIP Latents for Zero-shot Captioning
ICLR 2023. [Paper]
Weakly Supervised Grounding for VQA in Vision-Language Transformers
ECCV 2022. [Paper]
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
ECCV 2022. [Paper]
When and why vision-language models behave like bags-of-words, and what to do about it?
ICLR 2023. [Paper][Code Coming Soon]
Generative Negative Text Replay for Continual Vision-Language Pretraining
ECCV 2022. [Paper]
Does CLIP Bind Concepts? Probing Compositionality in Large Image Models
arXiv 2022. [Paper]
When are Lemons Purple? The Concept Association Bias of CLIP
arXiv 2022. [Paper]
CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection
arXiv 2023. [Paper]