ICCV-2023-Papers

Vision and Language

Title Repo Paper Video
SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-Training thecvf
arXiv
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model GitHub thecvf
arXiv
Explore and Tell: Embodied Visual Captioning in 3D Environments GitHub Page
GitHub
thecvf
arXiv
Distilling Large Vision-Language Model with Out-of-Distribution Generalizability GitHub thecvf
arXiv
Learning Trajectory-Word Alignments for Video-Language Tasks thecvf
arXiv
Variational Causal Inference Network for Explanatory Visual Question Answering thecvf
TextManiA: Enriching Visual Feature by Text-Driven Manifold Augmentation GitHub Page
GitHub
thecvf
arXiv
Segment Every Reference Object in Spatial and Temporal Spaces thecvf
Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models thecvf
arXiv
Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pre-Training thecvf
Toward Multi-Granularity Decision-Making: Explicit Visual Reasoning with Hierarchical Knowledge GitHub thecvf
VL-Match: Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching thecvf
Moment Detection in Long Tutorial Videos GitHub thecvf
Not All Features Matter: Enhancing Few-Shot CLIP with Adaptive Prior Refinement GitHub thecvf
arXiv
Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images GitHub Page thecvf
arXiv
Advancing Referring Expression Segmentation Beyond Single Image GitHub thecvf
arXiv
PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-World Learning GitHub thecvf
arXiv
Unsupervised Prompt Tuning for Text-Driven Object Detection thecvf
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding GitHub thecvf
arXiv
I Can't Believe There's No Images! Learning Visual Tasks using Only Language Supervision WEB Page
GitHub
thecvf
arXiv
Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples GitHub thecvf
arXiv
MeViS: A Large-Scale Benchmark for Video Segmentation with Motion Expressions GitHub Page
GitHub
thecvf
arXiv
Diverse Data Augmentation with Diffusions for Effective Test-Time Prompt Tuning GitHub thecvf
arXiv
ShapeScaffolder: Structure-Aware 3D Shape Generation from Text thecvf
Pdf
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models GitHub Page
GitHub
thecvf
arXiv
X-Mesh: Towards Fast and Accurate Text-Driven 3D Stylization via Dynamic Textual Guidance GitHub Page
GitHub
thecvf
arXiv
OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation GitHub thecvf
arXiv
Attentive Mask CLIP thecvf
arXiv
Knowledge Proxy Intervention for Deconfounded Video Question Answering thecvf
UniVTG: Towards Unified Video-Language Temporal Grounding GitHub thecvf
arXiv
Self-Supervised Cross-View Representation Reconstruction for Change Captioning GitHub thecvf
Unified Coarse-to-Fine Alignment for Video-Text Retrieval GitHub thecvf
arXiv
Confidence-Aware Pseudo-Label Learning for Weakly Supervised Visual Grounding GitHub thecvf
TextPSG: Panoptic Scene Graph Generation from Textual Descriptions WEB Page
GitHub
thecvf
arXiv
YouTube
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge GitHub Page
GitHub
thecvf
arXiv
Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation thecvf
arXiv
CLIPTrans: Transferring Visual Knowledge with Pre-Trained Models for Multimodal Machine Translation GitHub Page
GitHub
thecvf
arXiv
Learning Human-Human Interactions in Images from Weak Textual Supervision GitHub Page
GitHub
thecvf
arXiv
BUS: Efficient and Effective Vision-Language Pre-Training with Bottom-Up Patch Summarization thecvf
arXiv
3D-VisTA: Pre-Trained Transformer for 3D Vision and Text Alignment GitHub Page
GitHub
thecvf
arXiv
YouTube
ALIP: Adaptive Language-Image Pre-Training with Synthetic Caption GitHub thecvf
arXiv
LoGoPrompt: Synthetic Text Images can be Good Visual Prompts for Vision-Language Models GitHub Page thecvf
arXiv
Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning GitHub thecvf
arXiv
Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question Answering thecvf
PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3 GitHub Page
GitHub
thecvf
arXiv
Grounded Image Text Matching with Mismatched Relation Reasoning GitHub Page
GitHub
thecvf
arXiv
YouTube
GePSAn: Generative Procedure Step Anticipation in Cooking Videos thecvf
LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models GitHub Page
GitHub
thecvf
arXiv
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control GitHub thecvf
arXiv
With a Little Help from Your own Past: Prototypical Memory Networks for Image Captioning GitHub thecvf
arXiv
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models GitHub thecvf
arXiv
Learning Navigational Visual Representations with Semantic Map Supervision GitHub thecvf
arXiv
CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection GitHub Page thecvf
arXiv
Open Set Video HOI Detection from Action-Centric Chain-of-Look Prompting GitHub thecvf
Learning Concise and Descriptive Attributes for Visual Recognition thecvf
arXiv
Open-Vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models GitHub thecvf
arXiv
Encyclopedic VQA: Visual Questions About Detailed Properties of Fine-Grained Categories GitHub Page thecvf
arXiv
Story Visualization by Online Text Augmentation with Context Memory GitHub Page
GitHub
thecvf
arXiv
Transferable Decoding with Visual Entities for Zero-Shot Image Captioning GitHub thecvf
arXiv
Too Large; Data Reduction for Vision-Language Pre-Training GitHub thecvf
arXiv
ViLTA: Enhancing Vision-Language Pre-Training through Textual Augmentation thecvf
arXiv
Zero-Shot Composed Image Retrieval with Textual Inversion WEB Page
GitHub
thecvf
arXiv
SATR: Zero-Shot Semantic Segmentation of 3D Shapes GitHub Page
GitHub
thecvf
arXiv
CiT: Curation in Training for Effective Vision-Language Data GitHub thecvf
arXiv
Self-Regulating Prompts: Foundational Model Adaptation without Forgetting GitHub Page
GitHub
thecvf
arXiv
YouTube
Learning to Ground Instructional Articles in Videos through Narrations thecvf
arXiv
RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D thecvf
arXiv
Multi3DRefer: Grounding Text Description to Multiple 3D Objects GitHub Page
GitHub
thecvf
arXiv
Bayesian Prompt Learning for Image-Language Model Generalization GitHub thecvf
arXiv
Who are You Referring to? Coreference Resolution in Image Narrations thecvf
arXiv
Guiding Image Captioning Models Toward more Specific Captions thecvf
arXiv
PreSTU: Pre-Training for Scene-Text Understanding thecvf
arXiv
Exploring Group Video Captioning with Efficient Relational Approximation thecvf
VLSlice: Interactive Vision-and-Language Slice Discovery GitHub thecvf
arXiv
Google Drive
Pretrained Language Models as Visual Planners for Human Assistance GitHub thecvf
arXiv
VQA Therapy: Exploring Answer Differences by Visually Grounding Answers WEB Page thecvf
arXiv
Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation using only Images thecvf
arXiv
Zero-Shot Composed Image Retrieval with Textual Inversion GitHub thecvf
arXiv
YouTube
PatchCT: Aligning Patch Set and Label Set with Conditional Transport for Multi-Label Image Classification thecvf
arXiv
Lip Reading for Low-Resource Languages by Learning and Combining General Speech Knowledge and Language-Specific Knowledge thecvf
arXiv
ViewRefer: Grasp the Multi-View Knowledge for 3D Visual Grounding thecvf
arXiv
AerialVLN: Vision-and-Language Navigation for UAVs GitHub thecvf
arXiv
Linear Spaces of Meanings: Compositional Structures in Vision-Language Models thecvf
arXiv
HiTeA: Hierarchical Temporal-Aware Video-Language Pre-Training thecvf
arXiv
EgoTV: Egocentric Task Verification from Natural Language Task Descriptions GitHub Page
GitHub
thecvf
arXiv
SINC: Self-Supervised in-Context Learning for Vision-Language Tasks GitHub thecvf
arXiv
VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation GitHub thecvf
arXiv
Going Denser with Open-Vocabulary Part Segmentation GitHub thecvf
arXiv
Temporal Collection and Distribution for Referring Video Object Segmentation GitHub Page thecvf
arXiv
Inverse Compositional Learning for Weakly-Supervised Relation Grounding thecvf
Why is Prompt Tuning for Vision-Language Models Robust to Noisy Labels? GitHub thecvf
arXiv
CHAMPAGNE: Learning Real-World Conversation from Large-Scale Web Videos WEB Page
GitHub
thecvf
arXiv
RCA-NOC: Relative Contrastive Alignment for Novel Object Captioning thecvf
DIME-FM: DIstilling Multimodal and Efficient Foundation Models WEB Page
GitHub
thecvf
arXiv
Black Box Few-Shot Adaptation for Vision-Language Models GitHub thecvf
arXiv
Shatter and Gather: Learning Referring Image Segmentation with Text Supervision GitHub Page
GitHub
thecvf
arXiv
Accurate and Fast Compressed Video Captioning GitHub thecvf
arXiv
Exploring Temporal Concurrency for Video-Language Representation Learning GitHub thecvf
Verbs in Action: Improving Verb Understanding in Video-Language Models GitHub Page
GitHub
thecvf
arXiv
Sign Language Translation with Iterative Prototype thecvf
arXiv
Contrastive Feature Masking Open-Vocabulary Vision Transformer thecvf
arXiv
YouTube
Toward Unsupervised Realistic Visual Question Answering GitHub thecvf
arXiv
YouTube
GridMM: Grid Memory Map for Vision-and-Language Navigation GitHub thecvf
arXiv
Video Background Music Generation: Dataset, Method and Evaluation GitHub thecvf
arXiv
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval GitHub thecvf
arXiv
Prompt-Aligned Gradient for Prompt Tuning GitHub thecvf
arXiv
Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models thecvf
arXiv
Order-Prompted Tag Sequence Generation for Video Tagging thecvf
What does a Platypus Look Like? Generating Customized Prompts for Zero-Shot Image Classification GitHub thecvf
arXiv
PromptStyler: Prompt-Driven Style Generation for Source-Free Domain Generalization GitHub Page thecvf
arXiv
YouTube
DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability thecvf
arXiv
EdaDet: Open-Vocabulary Object Detection using Early Dense Alignment GitHub Page thecvf
arXiv
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition GitHub thecvf
arXiv
Waffling Around for Performance: Visual Classification with Random Words and Broad Concepts GitHub thecvf
arXiv
March in Chat: Interactive Prompting for Remote Embodied Referring Expression GitHub thecvf
arXiv
Chinese Text Recognition with a Pre-Trained CLIP-Like Model through Image-IDS Aligning GitHub Page
GitHub
thecvf
arXiv
OmniLabel: A Challenging Benchmark for Language-based Object Detection WEB Page thecvf
arXiv
IntentQA: Context-Aware Video Intent Reasoning GitHub thecvf
Sigmoid Loss for Language Image Pre-Training GitHub thecvf
arXiv
YouTube
What does CLIP Know About a Red Circle? Visual Prompt Engineering for VLMs GitHub thecvf
arXiv
Equivariant Similarity for Vision-Language Foundation Models GitHub thecvf
arXiv
Scaling Data Generation in Vision-and-Language Navigation GitHub thecvf
arXiv
YouTube
Name Your Colour for the Task: Artificially Discover Colour Naming via Colour Quantisation Transformer GitHub thecvf
arXiv
G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory thecvf
arXiv
Grounded Entity-Landmark Adaptive Pre-Training for Vision-and-Language Navigation GitHub thecvf
arXiv
Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment thecvf
arXiv
Open-Domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities GitHub Page
GitHub
thecvf
arXiv