# Awesome-Multimodal-Large-Language-Models

## Our MLLM works

🔥🔥🔥 A Survey on Multimodal Large Language Models
Project Page [This Page] | Paper

The first comprehensive survey of Multimodal Large Language Models (MLLMs). ✨

You are welcome to add our WeChat ID (wmd_ustc) to join the MLLM discussion group! 🌟


🔥🔥🔥 VITA: Towards Open-Source Interactive Omni Multimodal LLM

We are excited to introduce VITA-1.5, a more powerful and lower-latency version. ✨

All code for VITA-1.5 has been released! 🌟

You can try the Basic Demo directly on ModelScope. The Real-Time Interactive Demo must be set up by following the instructions in the repository.


🔥🔥🔥 MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

Jointly introduced by the MME, MMBench, and LLaVA teams. ✨


🔥🔥🔥 Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Project Page | Paper | GitHub | Dataset | Leaderboard

We are very proud to launch Video-MME, the first-ever comprehensive evaluation benchmark of MLLMs in Video Analysis! 🌟

It includes short (< 2 min), medium (4–15 min), and long (30–60 min) videos, ranging from 11 seconds to 1 hour. All data are newly collected and annotated by humans rather than drawn from any existing video dataset. ✨
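
If you want to reproduce the duration split when filtering your own videos, here is a minimal sketch. The bucket boundaries follow the description above; the function name and return values are our own illustrative choices, not part of the Video-MME toolkit.

```python
# Hypothetical helper reproducing Video-MME's duration buckets:
# short (< 2 min), medium (4-15 min), long (30-60 min).

def video_mme_bucket(duration_seconds: float) -> str | None:
    """Map a video duration (in seconds) to a Video-MME category."""
    minutes = duration_seconds / 60
    if minutes < 2:
        return "short"
    if 4 <= minutes <= 15:
        return "medium"
    if 30 <= minutes <= 60:
        return "long"
    return None  # gaps between buckets are not covered by the benchmark

# An 11-second clip is "short"; a 1-hour video is "long".
assert video_mme_bucket(11) == "short"
assert video_mme_bucket(3600) == "long"
```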


🔥🔥🔥 MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Paper | Download | Eval Tool | ✒️ Citation

A representative evaluation benchmark for MLLMs. ✨
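
As context for how such a benchmark is scored, below is a sketch of MME-style metrics, assuming the benchmark's setup of two yes/no questions per image: "acc" is per-question accuracy, while the stricter "acc+" credits an image only when both of its questions are answered correctly. The data layout and function name here are our own illustrative choices.

```python
# Sketch of MME-style scoring (assumed setup: two yes/no questions
# per image; a subtask is reported as (acc + acc+) * 100).

def mme_score(results: list[tuple[bool, bool]]) -> float:
    """results[i] = (q1_correct, q2_correct) for image i."""
    n_images = len(results)
    acc = sum(q1 + q2 for q1, q2 in results) / (2 * n_images)
    acc_plus = sum(q1 and q2 for q1, q2 in results) / n_images
    return (acc + acc_plus) * 100

print(mme_score([(True, True), (True, False), (False, False)]))  # ~83.33
```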


🔥🔥🔥 Woodpecker: Hallucination Correction for Multimodal Large Language Models
Paper | GitHub

This is the first work to correct hallucinations in multimodal large language models. ✨
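
For orientation, a high-level sketch of a Woodpecker-style, training-free correction pipeline is shown below. The five stages follow the paper's description; every function is a hypothetical stub standing in for calls to an LLM, a VQA model, and an open-set detector, not Woodpecker's actual code.

```python
# Hypothetical stubs for a Woodpecker-style correction pipeline.
def extract_key_concepts(answer: str) -> list[str]: ...
def formulate_questions(concepts: list[str]) -> list[str]: ...
def validate_visual_knowledge(image, questions: list[str]) -> dict: ...
def generate_claims(evidence: dict) -> list[str]: ...
def correct_answer(answer: str, claims: list[str]) -> str: ...

def woodpecker_correct(image, answer: str) -> str:
    concepts = extract_key_concepts(answer)                  # stage 1
    questions = formulate_questions(concepts)                # stage 2
    evidence = validate_visual_knowledge(image, questions)   # stage 3
    claims = generate_claims(evidence)                       # stage 4
    return correct_answer(answer, claims)                    # stage 5
```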


🔥🔥🔥 Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
Project Page | Paper | GitHub

A speech-to-speech dialogue model that achieves both low latency and high intelligence while keeping the LLM frozen throughout training. ✨
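
The "frozen LLM" idea can be illustrated with a minimal PyTorch sketch: the language model's weights stay fixed while only the newly attached speech modules train. Module and class names here are illustrative, not Freeze-Omni's actual code.

```python
import torch.nn as nn

class SpeechDialogueModel(nn.Module):
    """Hypothetical wrapper: trainable speech modules around a frozen LLM."""

    def __init__(self, llm: nn.Module, speech_encoder: nn.Module,
                 speech_decoder: nn.Module):
        super().__init__()
        self.llm = llm
        self.speech_encoder = speech_encoder
        self.speech_decoder = speech_decoder
        # Freeze every LLM parameter so training only updates
        # the speech-side modules attached around it.
        for p in self.llm.parameters():
            p.requires_grad = False

    def forward(self, speech_features):
        hidden = self.speech_encoder(speech_features)
        hidden = self.llm(hidden)           # frozen backbone
        return self.speech_decoder(hidden)  # trainable output path
```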


## Awesome Papers

### Multimodal Instruction Tuning

| Title | Venue | Date | Code | Demo |
| :--- | :---: | :---: | :---: | :---: |
| Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy | arXiv | 2025-02-07 | GitHub | - |
| Qwen2.5-VL | Qwen | 2025-01-26 | GitHub | Demo |
| Baichuan-Omni-1.5 Technical Report | Tech Report | 2025-01-26 | GitHub | Local Demo |
| LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs | arXiv | 2025-01-10 | GitHub | - |
| VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction | arXiv | 2025-01-03 | GitHub | - |
| QVQ: To See the World with Wisdom | Qwen | 2024-12-25 | GitHub | Demo |
| DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | arXiv | 2024-12-13 | GitHub | - |
| Apollo: An Exploration of Video Understanding in Large Multimodal Models | arXiv | 2024-12-13 | - | - |
| InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions | arXiv | 2024-12-12 | GitHub | Local Demo |
| StreamChat: Chatting with Streaming Video | arXiv | 2024-12-11 | Coming soon | - |
| CompCap: Improving Multimodal Large Language Models with Composite Captions | arXiv | 2024-12-06 | - | - |
| LinVT: Empower Your Image-level Large Language Model to Understand Videos | arXiv | 2024-12-06 | GitHub | - |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | arXiv | 2024-12-06 | GitHub | Demo |
| NVILA: Efficient Frontier Visual Language Models | arXiv | 2024-12-05 | GitHub | Demo |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | arXiv | 2024-12-04 | GitHub | - |
| T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs | arXiv | 2024-11-29 | GitHub | - |
| TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability | arXiv | 2024-11-27 | GitHub | - |
| ChatRex: Taming Multimodal LLM for Joint Perception and Understanding | arXiv | 2024-11-27 | GitHub | Local Demo |
| LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding | arXiv | 2024-10-22 | GitHub | Demo |
| Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate | arXiv | 2024-10-09 | GitHub | - |
| AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark | arXiv | 2024-10-04 | GitHub | Local Demo |
| Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models | arXiv | 2024-09-25 | Hugging Face | Demo |
| Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | arXiv | 2024-09-18 | GitHub | Demo |
| ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding | ICLR | 2024-09-05 | GitHub | Local Demo |
| LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture | arXiv | 2024-09-04 | GitHub | - |
| EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders | arXiv | 2024-08-28 | GitHub | Demo |
| LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation | arXiv | 2024-08-28 | GitHub | - |
| mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | arXiv | 2024-08-09 | GitHub | - |
| VITA: Towards Open-Source Interactive Omni Multimodal LLM | arXiv | 2024-08-09 | GitHub | - |
| LLaVA-OneVision: Easy Visual Task Transfer | arXiv | 2024-08-06 | GitHub | Demo |
| MiniCPM-V: A GPT-4V Level MLLM on Your Phone | arXiv | 2024-08-03 | GitHub | Demo |
| VILA^2: VILA Augmented VILA | arXiv | 2024-07-24 | - | - |
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | arXiv | 2024-07-22 | - | - |
| EVLM: An Efficient Vision-Language Model for Visual Understanding | arXiv | 2024-07-19 | - | - |
| IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model | arXiv | 2024-07-10 | GitHub | - |
| InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output | arXiv | 2024-07-03 | GitHub | Demo |
| OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding | arXiv | 2024-06-27 | GitHub | Local Demo |
| DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming | AAAI | 2024-06-27 | GitHub | - |
| Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs | arXiv | 2024-06-24 | GitHub | Local Demo |
| Long Context Transfer from Language to Vision | arXiv | 2024-06-24 | GitHub | Local Demo |
| video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | ICML | 2024-06-22 | GitHub | - |
| TroL: Traversal of Layers for Large Language and Vision Models | EMNLP | 2024-06-18 | GitHub | Local Demo |
| Unveiling Encoder-Free Vision-Language Models | arXiv | 2024-06-17 | GitHub | Local Demo |
| VideoLLM-online: Online Video Large Language Model for Streaming Video | CVPR | 2024-06-17 | GitHub | Local Demo |
| RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics | CoRL | 2024-06-15 | GitHub | Demo |
| Comparison Visual Instruction Tuning | arXiv | 2024-06-13 | GitHub | Local Demo |
| Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models | arXiv | 2024-06-12 | GitHub | - |
| VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | arXiv | 2024-06-11 | GitHub | Local Demo |
| Parrot: Multilingual Visual Instruction Tuning | arXiv | 2024-06-04 | GitHub | - |
| Ovis: Structural Embedding Alignment for Multimodal Large Language Model | arXiv | 2024-05-31 | GitHub | - |
| Matryoshka Query Transformer for Large Vision-Language Models | arXiv | 2024-05-29 | GitHub | Demo |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | arXiv | 2024-05-24 | GitHub | - |
| Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models | arXiv | 2024-05-24 | GitHub | Demo |
| Libra: Building Decoupled Vision System on Large Language Models | ICML | 2024-05-16 | GitHub | Local Demo |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | arXiv | 2024-05-09 | GitHub | Local Demo |
| How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites | arXiv | 2024-04-25 | GitHub | Demo |
| Graphic Design with Large Multimodal Model | arXiv | 2024-04-22 | GitHub | - |
| BRAVE: Broadening the visual encoding of vision-language models | ECCV | 2024-04-10 | - | - |
| InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD | arXiv | 2024-04-09 | GitHub | Demo |
| Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs | arXiv | 2024-04-08 | - | - |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | CVPR | 2024-04-08 | GitHub | - |
| VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing | NeurIPS | 2024-04-04 | GitHub | Local Demo |
| TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model | ACM TKDD | 2024-03-28 | - | - |
| LITA: Language Instructed Temporal-Localization Assistant | arXiv | 2024-03-27 | GitHub | Local Demo |
| Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models | arXiv | 2024-03-27 | GitHub | Demo |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | arXiv | 2024-03-14 | - | - |
| MoAI: Mixture of All Intelligence for Large Language and Vision Models | arXiv | 2024-03-12 | GitHub | Local Demo |
| DeepSeek-VL: Towards Real-World Vision-Language Understanding | arXiv | 2024-03-08 | GitHub | Demo |
| TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document | arXiv | 2024-03-07 | GitHub | Demo |
| The All-Seeing Project V2: Towards General Relation Comprehension of the Open World | arXiv | 2024-02-29 | GitHub | - |
| GROUNDHOG: Grounding Large Language Models to Holistic Segmentation | CVPR | 2024-02-26 | Coming soon | Coming soon |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | 2024-02-19 | GitHub | - |
| Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning | arXiv | 2024-02-18 | GitHub | - |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model | arXiv | 2024-02-18 | GitHub | Demo |
| CoLLaVO: Crayon Large Language and Vision mOdel | arXiv | 2024-02-17 | GitHub | - |
| Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models | ICML | 2024-02-12 | GitHub | - |
| CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations | arXiv | 2024-02-06 | GitHub | - |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | arXiv | 2024-02-06 | GitHub | - |
| GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning | NeurIPS | 2024-02-03 | GitHub | - |
| Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study | arXiv | 2024-01-31 | Coming soon | - |
| LLaVA-NeXT: Improved reasoning, OCR, and world knowledge | Blog | 2024-01-30 | GitHub | Demo |
| MoE-LLaVA: Mixture of Experts for Large Vision-Language Models | arXiv | 2024-01-29 | GitHub | Demo |
| InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model | arXiv | 2024-01-29 | GitHub | Demo |
| Yi-VL | - | 2024-01-23 | GitHub | Local Demo |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | arXiv | 2024-01-22 | - | - |
| ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning | ACL | 2024-01-04 | GitHub | Local Demo |
| MobileVLM: A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices | arXiv | 2023-12-28 | GitHub | - |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | CVPR | 2023-12-21 | GitHub | Demo |
| Osprey: Pixel Understanding with Visual Instruction Tuning | CVPR | 2023-12-15 | GitHub | Demo |
| CogAgent: A Visual Language Model for GUI Agents | arXiv | 2023-12-14 | GitHub | Coming soon |
| Pixel Aligned Language Models | arXiv | 2023-12-14 | Coming soon | - |
| VILA: On Pre-training for Visual Language Models | CVPR | 2023-12-13 | GitHub | Local Demo |
| See, Say, and Segment: Teaching LMMs to Overcome False Premises | arXiv | 2023-12-13 | Coming soon | - |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | ECCV | 2023-12-11 | GitHub | Demo |
| Honeybee: Locality-enhanced Projector for Multimodal LLM | CVPR | 2023-12-11 | GitHub | - |
| Gemini: A Family of Highly Capable Multimodal Models | Google | 2023-12-06 | - | - |
| OneLLM: One Framework to Align All Modalities with Language | arXiv | 2023-12-06 | GitHub | Demo |
| Lenna: Language Enhanced Reasoning Detection Assistant | arXiv | 2023-12-05 | GitHub | - |
| VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding | arXiv | 2023-12-04 | - | - |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | arXiv | 2023-12-04 | GitHub | Local Demo |
| Making Large Multimodal Models Understand Arbitrary Visual Prompts | CVPR | 2023-12-01 | GitHub | Demo |
| Dolphins: Multimodal Language Model for Driving | arXiv | 2023-12-01 | GitHub | - |
| LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning | arXiv | 2023-11-30 | GitHub | Coming soon |
| VTimeLLM: Empower LLM to Grasp Video Moments | arXiv | 2023-11-30 | GitHub | Local Demo |
| mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model | arXiv | 2023-11-30 | GitHub | - |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arXiv | 2023-11-28 | GitHub | Coming soon |
| LLMGA: Multimodal Large Language Model based Generation Assistant | arXiv | 2023-11-27 | GitHub | Demo |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | arXiv | 2023-11-27 | GitHub | - |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | arXiv | 2023-11-21 | GitHub | Demo |
| LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | arXiv | 2023-11-20 | GitHub | - |
| An Embodied Generalist Agent in 3D World | arXiv | 2023-11-18 | GitHub | Demo |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | arXiv | 2023-11-16 | GitHub | Demo |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | CVPR | 2023-11-14 | GitHub | - |
| To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning | arXiv | 2023-11-13 | GitHub | - |
| SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models | arXiv | 2023-11-13 | GitHub | Demo |
| Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models | CVPR | 2023-11-11 | GitHub | Demo |
| LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents | arXiv | 2023-11-09 | GitHub | Demo |
| NExT-Chat: An LMM for Chat, Detection and Segmentation | arXiv | 2023-11-08 | GitHub | Local Demo |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | arXiv | 2023-11-07 | GitHub | Demo |
| OtterHD: A High-Resolution Multi-modality Model | arXiv | 2023-11-07 | GitHub | - |
| CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding | arXiv | 2023-11-06 | Coming soon | - |
| GLaMM: Pixel Grounding Large Multimodal Model | CVPR | 2023-11-06 | GitHub | Demo |
| What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning | arXiv | 2023-11-02 | GitHub | - |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models | ICLR | 2023-10-20 | GitHub | - |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | arXiv | 2023-10-14 | GitHub | Local Demo |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity | arXiv | 2023-10-11 | GitHub | - |
| CogVLM: Visual Expert For Large Language Models | arXiv | 2023-10-09 | GitHub | Demo |
| Improved Baselines with Visual Instruction Tuning | arXiv | 2023-10-05 | GitHub | Demo |
| LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment | ICLR | 2023-10-03 | GitHub | Demo |
| Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs | arXiv | 2023-10-01 | GitHub | - |
| Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants | arXiv | 2023-10-01 | GitHub | Local Demo |
| AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | arXiv | 2023-09-27 | - | - |
| InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition | arXiv | 2023-09-26 | GitHub | Local Demo |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | ICLR | 2023-09-20 | GitHub | Coming soon |
| An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models | arXiv | 2023-09-18 | Coming soon | - |
| TextBind: Multi-turn Interleaved Multimodal Instruction-following | arXiv | 2023-09-14 | GitHub | Demo |
| Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics | arXiv | 2023-09-13 | GitHub | - |
| NExT-GPT: Any-to-Any Multimodal LLM | arXiv | 2023-09-11 | GitHub | Demo |
| ImageBind-LLM: Multi-modality Instruction Tuning | arXiv | 2023-09-07 | GitHub | Demo |
| Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | arXiv | 2023-09-05 | - | - |
| PointLLM: Empowering Large Language Models to Understand Point Clouds | arXiv | 2023-08-31 | GitHub | Demo |
| ✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models | arXiv | 2023-08-31 | GitHub | Local Demo |
| MLLM-DataEngine: An Iterative Refinement Approach for MLLM | arXiv | 2023-08-25 | GitHub | - |
| Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models | arXiv | 2023-08-25 | GitHub | Demo |
| Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities | arXiv | 2023-08-24 | GitHub | Demo |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages | ICLR | 2023-08-23 | GitHub | Demo |
| StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data | arXiv | 2023-08-20 | GitHub | - |
| BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions | arXiv | 2023-08-19 | GitHub | Demo |
| Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions | arXiv | 2023-08-08 | GitHub | - |
| The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World | ICLR | 2023-08-03 | GitHub | Demo |
| LISA: Reasoning Segmentation via Large Language Model | arXiv | 2023-08-01 | GitHub | Demo |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | arXiv | 2023-07-31 | GitHub | Local Demo |
| 3D-LLM: Injecting the 3D World into Large Language Models | arXiv | 2023-07-24 | GitHub | - |
| ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning | arXiv | 2023-07-18 | - | Demo |
| BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs | arXiv | 2023-07-17 | GitHub | Demo |
| SVIT: Scaling up Visual Instruction Tuning | arXiv | 2023-07-09 | GitHub | - |
| GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest | arXiv | 2023-07-07 | GitHub | Demo |
| What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? | arXiv | 2023-07-05 | GitHub | - |
| mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | arXiv | 2023-07-04 | GitHub | Demo |
| Visual Instruction Tuning with Polite Flamingo | arXiv | 2023-07-03 | GitHub | Demo |
| LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding | arXiv | 2023-06-29 | GitHub | Demo |
| Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic | arXiv | 2023-06-27 | GitHub | Demo |
| MotionGPT: Human Motion as a Foreign Language | arXiv | 2023-06-26 | GitHub | - |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | arXiv | 2023-06-15 | GitHub | Coming soon |
| LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | arXiv | 2023-06-11 | GitHub | Demo |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | arXiv | 2023-06-08 | GitHub | Demo |
| MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | GitHub | Demo |
| M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | arXiv | 2023-06-07 | - | - |
| Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | arXiv | 2023-06-05 | GitHub | Demo |
| LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day | arXiv | 2023-06-01 | GitHub | - |
| GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction | arXiv | 2023-05-30 | GitHub | Demo |
| PandaGPT: One Model To Instruction-Follow Them All | arXiv | 2023-05-25 | GitHub | Demo |
| ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst | arXiv | 2023-05-25 | GitHub | - |
| Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models | arXiv | 2023-05-24 | GitHub | Local Demo |
| DetGPT: Detect What You Need via Reasoning | arXiv | 2023-05-23 | GitHub | Demo |
| Pengi: An Audio Language Model for Audio Tasks | NeurIPS | 2023-05-19 | GitHub | - |
| VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks | arXiv | 2023-05-18 | GitHub | - |
| Listen, Think, and Understand | arXiv | 2023-05-18 | GitHub | Demo |
| VisualGLM-6B | - | 2023-05-17 | GitHub | Local Demo |
| PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering | arXiv | 2023-05-17 | GitHub | - |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | arXiv | 2023-05-11 | GitHub | Local Demo |
| VideoChat: Chat-Centric Video Understanding | arXiv | 2023-05-10 | GitHub | Demo |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | arXiv | 2023-05-08 | GitHub | Demo |
| X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages | arXiv | 2023-05-07 | GitHub | - |
| LMEye: An Interactive Perception Network for Large Language Models | arXiv | 2023-05-05 | GitHub | Local Demo |
| LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | arXiv | 2023-04-28 | GitHub | Demo |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2023-04-27 | GitHub | Demo |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | arXiv | 2023-04-20 | GitHub | - |
| Visual Instruction Tuning | NeurIPS | 2023-04-17 | GitHub | Demo |
| LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention | ICLR | 2023-03-28 | GitHub | Demo |
| MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning | ACL | 2022-12-21 | GitHub | - |

### Multimodal Hallucination

| Title | Venue | Date | Code | Demo |
| :--- | :---: | :---: | :---: | :---: |
| Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models | arXiv | 2024-10-04 | GitHub | - |
| Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations | arXiv | 2024-10-03 | GitHub | - |
| FIHA: Autonomous Hallucination Evaluation in Vision-Language Models with Davidson Scene Graphs | arXiv | 2024-09-20 | Link | - |
| Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation | arXiv | 2024-08-01 | - | - |
| Paying More Attention to Image: A Training-Free Method for Alleviating Hallucination in LVLMs | ECCV | 2024-07-31 | GitHub | - |
| Evaluating and Analyzing Relationship Hallucinations in LVLMs | ICML | 2024-06-24 | GitHub | - |