Skip to content

This repository is a curated collection of the most exciting and influential CVPR 2024 papers. πŸ”₯ [Paper + Code + Demo]

License

Notifications You must be signed in to change notification settings

SkalskiP/top-cvpr-2024-papers

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

56 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

visitor badge

top CVPR 2024 papers

2023

vancouver

πŸ‘‹ hello

Computer Vision and Pattern Recognition is a massive conference. In 2024 alone, 11,532 papers were submitted, and 2,719 were accepted. I created this repository to help you search for crème de la crème of CVPR publications. If the paper you are looking for is not on my short list, take a peek at the full list of accepted papers.

πŸ—žοΈ papers and posters

πŸ”₯ - highlighted papers

3d from multi-view and sensors

SpatialTracker: Tracking Any 2D Pixels in 3D Space πŸ”₯ SpatialTracker: Tracking Any 2D Pixels in 3D Space
Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, Xiaowei Zhou
[paper] [code]
Topic: 3D from multi-view and sensors
Session: Fri 21 Jun 1:30 p.m. EDT β€” 3 p.m. EDT #84



ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models
Lukas Hâllein, Aljaž Božič, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhâfer, Matthias Nießner
[paper] [code] [video]
Topic: 3D from multi-view and sensors
Session: Wed 19 Jun 8 p.m. EDT β€” 9:30 p.m. EDT #20



OmniGlue: Generalizable Feature Matching with Foundation Model Guidance
Hanwen Jiang, Arjun Karpur, Bingyi Cao, Qixing Huang, Andre Araujo
[paper] [code] [demo]
Topic: 3D from multi-view and sensors
Session: Fri 21 Jun 1:30 p.m. EDT β€” 3 p.m. EDT #32


deep learning architectures and techniques

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks πŸ”₯ Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan
[paper] [video] [demo] [colab]
Topic: Deep learning architectures and techniques
Session: Wed 19 Jun 8 p.m. EDT β€” 9:30 p.m. EDT #102



document analysis and understanding

DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks
Jiaxin Zhang, Dezhi Peng, Chongyu Liu, Peirong Zhang, Lianwen Jin
[paper] [code] [demo]
Topic: Document analysis and understanding
Session: Thu 20 Jun 8 p.m. EDT β€” 9:30 p.m. EDT #101


efficient and scalable vision

EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything πŸ”₯ EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
Yunyang Xiong, Bala Varadarajan, Lemeng Wu, Xiaoyu Xiang, Fanyi Xiao, Chenchen Zhu, Xiaoliang Dai, Dilin Wang, Fei Sun, Forrest Iandola, Raghuraman Krishnamoorthi, Vikas Chandra
[paper] [code] [demo]
Topic: Efficient and scalable vision
Session: Thu 20 Jun 8 p.m. EDT β€” 9:30 p.m. EDT #144



MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel
[paper] [code] [demo]
Topic: Efficient and scalable vision
Session: Thu 20 Jun 8 p.m. EDT β€” 9:30 p.m. EDT #130



explainable computer vision

Describing Differences in Image Sets with Natural Language πŸ”₯ Describing Differences in Image Sets with Natural Language
Lisa Dunlap, Yuhui Zhang, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez, Serena Yeung-Levy
[paper] [code]
Topic: Explainable computer vision
Session: Fri 21 Jun 8 p.m. EDT β€” 9:30 p.m. EDT #115



image and video synthesis and generation

DemoFusion: Democratising High-Resolution Image Generation With No $$$
Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, Zhanyu Ma
[paper] [code] [demo] [colab]
Topic: Image and video synthesis and generation
Session: Wed 19 Jun 8 p.m. EDT β€” 9:30 p.m. EDT #132


DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing πŸ”₯ DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
Yujun Shi, Chuhui Xue, Jun Hao Liew, Jiachun Pan, Hanshu Yan, Wenqing Zhang, Vincent Y. F. Tan, Song Bai
[paper] [code] [video]
Topic: Image and video synthesis and generation
Session: Wed 19 Jun 8 p.m. EDT β€” 9:30 p.m. EDT #392



Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models πŸ”₯ Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models
Daniel Geng, Inbum Park, Andrew Owens
[paper] [code] [colab]
Topic: Image and video synthesis and generation
Session: Fri 21 Jun 8 p.m. EDT β€” 9:30 p.m. EDT #118



low-level vision

XFeat: Accelerated Features for Lightweight Image Matching XFeat: Accelerated Features for Lightweight Image Matching
Guilherme Potje, Felipe Cadar, Andre Araujo, Renato Martins, Erickson R. Nascimento
[paper] [code] [video] [demo] [colab]
Topic: Low-level vision
Session: Wed 19 Jun 1:30 p.m. EDT β€” 3 p.m. EDT #245



Robust Image Denoising through Adversarial Frequency Mixup Robust Image Denoising through Adversarial Frequency Mixup
Donghun Ryou, Inju Ha, Hyewon Yoo, Dongwan Kim, Bohyung Han
[paper] [code] [video]
Topic: Low-level vision
Session: Wed 19 Jun 1:30 p.m. EDT β€” 3 p.m. EDT #250



multi-modal learning

πŸ”₯ Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee
[paper] [code]
Topic: Multi-modal learning
Session: Fri 21 Jun 8 p.m. EDT β€” 9:30 p.m. EDT #209


recognition: categorization, detection, retrieval

DETRs Beat YOLOs on Real-time Object Detection DETRs Beat YOLOs on Real-time Object Detection
Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, Jie Chen
[paper] [code] [video]
Topic: Recognition: Categorization, detection, retrieval
Session: Thu 20 Jun 8 p.m. EDT β€” 9:30 p.m. EDT #229



YOLO-World: Real-Time Open-Vocabulary Object Detection YOLO-World: Real-Time Open-Vocabulary Object Detection
Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan
[paper] [code] [video] [demo] [colab]
Topic: Recognition: Categorization, detection, retrieval
Session: Thu 20 Jun 8 p.m. EDT β€” 9:30 p.m. EDT #223



Object Recognition as Next Token Prediction πŸ”₯ Object Recognition as Next Token Prediction
Kaiyu Yue, Bor-Chun Chen, Jonas Geiping, Hengduo Li, Tom Goldstein, Ser-Nam Lim
[paper] [code] [video] [colab]
Topic: Recognition: Categorization, detection, retrieval
Session: Thu 20 Jun 8 p.m. EDT β€” 9:30 p.m. EDT #199



segmentation, grouping and shape analysis

RobustSAM: Segment Anything Robustly on Degraded Images πŸ”₯ RobustSAM: Segment Anything Robustly on Degraded Images
Wei-Ting Chen, Yu-Jiet Vong, Sy-Yen Kuo, Sizhou Ma, Jian Wang
[paper] [video]
Topic: Segmentation, grouping and shape analysis
Session: Wed 19 Jun 1:30 p.m. EDT β€” 3 p.m. EDT #378



Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation πŸ”₯ Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation
Bingfeng Zhang, Siyue Yu, Yunchao Wei, Yao Zhao, Jimin Xiao
[paper] [code] [video]
Topic: Segmentation, grouping and shape analysis
Session: Wed 19 Jun 1:30 p.m. EDT β€” 3 p.m. EDT #351



Semantic-aware SAM for Point-Prompted Instance Segmentation πŸ”₯ Semantic-aware SAM for Point-Prompted Instance Segmentation
Zhaoyang Wei, Pengfei Chen, Xuehui Yu, Guorong Li, Jianbin Jiao, Zhenjun Han
[paper] [code] [video]
Topic: Segmentation, grouping and shape analysis
Session: Wed 19 Jun 1:30 p.m. EDT β€” 3 p.m. EDT #331



πŸ”₯ In-Context Matting
He Guo, Zixuan Ye, Zhiguo Cao, Hao Lu
[paper] [code]
Topic: Segmentation, grouping and shape analysis
Session: Wed 19 Jun 1:30 p.m. EDT β€” 3 p.m. EDT #343


General Object Foundation Model for Images and Videos at Scale πŸ”₯ General Object Foundation Model for Images and Videos at Scale
Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai
[paper] [code] [video]
Topic: Segmentation, grouping and shape analysis
Session: Wed 19 Jun 1:30 p.m. EDT β€” 3 p.m. EDT #350



self-supervised or unsupervised representation learning

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks πŸ”₯ InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai
[paper] [code] [demo]
Topic: Self-supervised or unsupervised representation learning
Session: Fri 21 Jun 8 p.m. EDT β€” 9:30 p.m. EDT #412



video: low-level analysis, motion, and tracking

Matching Anything by Segmenting Anything πŸ”₯ Matching Anything by Segmenting Anything
Siyuan Li, Lei Ke, Martin Danelljan, Luigi Piccinelli, Mattia Segu, Luc Van Gool, Fisher Yu
[paper] [code] [video]
Topic: Video: Low-level analysis, motion, and tracking
Session: Thu 20 Jun 8 p.m. EDT β€” 9:30 p.m. EDT #421



DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction
Weiyi Lv, Yuhang Huang, Ning Zhang, Ruei-Sung Lin, Mei Han, Dan Zeng
[paper] [code]
Topic: Video: Low-level analysis, motion, and tracking
Session: Thu 20 Jun 8 p.m. EDT β€” 9:30 p.m. EDT #455



vision, language, and reasoning

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
[paper] [code] [video] [demo]
Topic: Vision, language, and reasoning
Session: Thu 20 Jun 1:30 p.m. EDT β€” 3 p.m. EDT #327



πŸ”₯ Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie
[paper] [code]
Topic: Vision, language, and reasoning
Session: Thu 20 Jun 1:30 p.m. EDT β€” 3 p.m. EDT #390


LISA: Reasoning Segmentation via Large Language Model πŸ”₯ LISA: Reasoning Segmentation via Large Language Model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia
[paper] [code] [demo]
Topic: Vision, language, and reasoning
Session: Thu 20 Jun 1:30 p.m. EDT β€” 3 p.m. EDT #413



ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Mu Cai, Haotian Liu, Dennis Park, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Yong Jae Lee
[paper] [code] [video] [demo]
Topic: Vision, language, and reasoning
Session: Thu 20 Jun 1:30 p.m. EDT β€” 3 p.m. EDT #317



MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI πŸ”₯ MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen
[paper]
Topic: Vision, language, and reasoning
Session: Thu 20 Jun 1:30 p.m. EDT β€” 3 p.m. EDT #382



🦸 contribution

We would love your help in making this repository even better! If you know of an amazing paper that isn't listed here, or if you have any suggestions for improvement, feel free to open an issue or submit a pull request.

About

This repository is a curated collection of the most exciting and influential CVPR 2024 papers. πŸ”₯ [Paper + Code + Demo]

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages