- SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-Training
- DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
- Explore and Tell: Embodied Visual Captioning in 3D Environments
- Distilling Large Vision-Language Model with Out-of-Distribution Generalizability
- Learning Trajectory-Word Alignments for Video-Language Tasks
- Variational Causal Inference Network for Explanatory Visual Question Answering
- TextManiA: Enriching Visual Feature by Text-Driven Manifold Augmentation
- Segment Every Reference Object in Spatial and Temporal Spaces
- Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models
- Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pre-Training
- Toward Multi-Granularity Decision-Making: Explicit Visual Reasoning with Hierarchical Knowledge
- VL-Match: Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching
- Moment Detection in Long Tutorial Videos
- Not All Features Matter: Enhancing Few-Shot CLIP with Adaptive Prior Refinement
- Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images
- Advancing Referring Expression Segmentation Beyond Single Image
- PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-World Learning
- Unsupervised Prompt Tuning for Text-Driven Object Detection
- Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding
- I can't Believe there's no Images! Learning Visual Tasks using Only Language Supervision
- Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples
- MeViS: A Large-Scale Benchmark for Video Segmentation with Motion Expressions
- Diverse Data Augmentation with Diffusions for Effective Test-Time Prompt Tuning
- ShapeScaffolder: Structure-Aware 3D Shape Generation from Text
- SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
- X-Mesh: Towards Fast and Accurate Text-Driven 3D Stylization via Dynamic Textual Guidance
- OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation
- Attentive Mask CLIP
- Knowledge Proxy Intervention for Deconfounded Video Question Answering
- UniVTG: Towards Unified Video-Language Temporal Grounding
- Self-Supervised Cross-View Representation Reconstruction for Change Captioning
- Unified Coarse-to-Fine Alignment for Video-Text Retrieval
- Confidence-Aware Pseudo-Label Learning for Weakly Supervised Visual Grounding
- TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
- MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge
- Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation
- CLIPTrans: Transferring Visual Knowledge with Pre-Trained Models for Multimodal Machine Translation
- Learning Human-Human Interactions in Images from Weak Textual Supervision
- BUS: Efficient and Effective Vision-Language Pre-Training with Bottom-Up Patch Summarization
- 3D-VisTA: Pre-Trained Transformer for 3D Vision and Text Alignment
- ALIP: Adaptive Language-Image Pre-Training with Synthetic Caption
- LoGoPrompt: Synthetic Text Images can be Good Visual Prompts for Vision-Language Models
- Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning
- Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question Answering
- PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3
- Grounded Image Text Matching with Mismatched Relation Reasoning
- GePSAn: Generative Procedure Step Anticipation in Cooking Videos
- LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models
- VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control
- With a Little Help from Your own Past: Prototypical Memory Networks for Image Captioning
- DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models
- Learning Navigational Visual Representations with Semantic Map Supervision
- CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection
- Open Set Video HOI detection from Action-Centric Chain-of-Look Prompting
- Learning Concise and Descriptive Attributes for Visual Recognition
- Open-Vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models
- Encyclopedic VQA: Visual Questions About Detailed Properties of Fine-Grained Categories
- Story Visualization by Online Text Augmentation with Context Memory
- Transferable Decoding with Visual Entities for Zero-Shot Image Captioning
- Too Large; Data Reduction for Vision-Language Pre-Training
- ViLTA: Enhancing Vision-Language Pre-Training through Textual Augmentation
- Zero-Shot Composed Image Retrieval with Textual Inversion
- SATR: Zero-Shot Semantic Segmentation of 3D Shapes
- CiT: Curation in Training for Effective Vision-Language Data
- Self-Regulating Prompts: Foundational Model Adaptation without Forgetting
- Learning to Ground Instructional Articles in Videos through Narrations
- RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D
- Multi3DRefer: Grounding Text Description to Multiple 3D Objects
- Bayesian Prompt Learning for Image-Language Model Generalization
- Who are You Referring to? Coreference Resolution in Image Narrations
- Guiding Image Captioning Models Toward more Specific Captions
- PreSTU: Pre-Training for Scene-Text Understanding
- Exploring Group Video Captioning with Efficient Relational Approximation
- VLSlice: Interactive Vision-and-Language Slice Discovery
- Pretrained Language Models as Visual Planners for Human Assistance
- VQA Therapy: Exploring Answer Differences by Visually Grounding Answers
- Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation using only Images
- PatchCT: Aligning Patch Set and Label Set with Conditional Transport for Multi-Label Image Classification
- Lip Reading for Low-Resource Languages by Learning and Combining General Speech Knowledge and Language-Specific Knowledge
- ViewRefer: Grasp the Multi-View Knowledge for 3D Visual Grounding
- AerialVLN: Vision-and-Language Navigation for UAVs
- Linear Spaces of Meanings: Compositional Structures in Vision-Language Models
- HiTeA: Hierarchical Temporal-Aware Video-Language Pre-Training
- EgoTV: Egocentric Task Verification from Natural Language Task Descriptions
- SINC: Self-Supervised in-Context Learning for Vision-Language Tasks
- VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation
- Going Denser with Open-Vocabulary Part Segmentation
- Temporal Collection and Distribution for Referring Video Object Segmentation
- Inverse Compositional Learning for Weakly-Supervised Relation Grounding
- Why is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?
- CHAMPAGNE: Learning Real-World Conversation from Large-Scale Web Videos
- RCA-NOC: Relative Contrastive Alignment for Novel Object Captioning
- DIME-FM: DIstilling Multimodal and Efficient Foundation Models
- Black Box Few-Shot Adaptation for Vision-Language Models
- Shatter and Gather: Learning Referring Image Segmentation with Text Supervision
- Accurate and Fast Compressed Video Captioning
- Exploring Temporal Concurrency for Video-Language Representation Learning
- Verbs in Action: Improving Verb Understanding in Video-Language Models
- Sign Language Translation with Iterative Prototype
- Contrastive Feature Masking Open-Vocabulary Vision Transformer
- Toward Unsupervised Realistic Visual Question Answering
- GridMM: Grid Memory Map for Vision-and-Language Navigation
- Video Background Music Generation: Dataset, Method and Evaluation
- Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval
- Prompt-Aligned Gradient for Prompt Tuning
- Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models
- Order-Prompted Tag Sequence Generation for Video Tagging
- What does a Platypus Look Like? Generating Customized Prompts for Zero-Shot Image Classification
- PromptStyler: Prompt-Driven Style Generation for Source-Free Domain Generalization
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability
- EdaDet: Open-Vocabulary Object Detection using Early Dense Alignment
- MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
- Waffling Around for Performance: Visual Classification with Random Words and Broad Concepts
- March in Chat: Interactive Prompting for Remote Embodied Referring Expression
- Chinese Text Recognition with a Pre-Trained CLIP-Like Model through Image-IDS Aligning
- OmniLabel: A Challenging Benchmark for Language-based Object Detection
- IntentQA: Context-Aware Video Intent Reasoning
- Sigmoid Loss for Language Image Pre-Training
- What does CLIP Know About a Red Circle? Visual Prompt Engineering for VLMs
- Equivariant Similarity for Vision-Language Foundation Models
- Scaling Data Generation in Vision-and-Language Navigation
- Name Your Colour for the Task: Artificially Discover Colour Naming via Colour Quantisation Transformer
- G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory
- Grounded Entity-Landmark Adaptive Pre-Training for Vision-and-Language Navigation
- Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment
- Open-Domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities