LLaVA-NeXT: Open Large Multimodal Models

Release

[2024/05/10] 🔥 LLaVA-NeXT (Stronger) models are released, with support of stronger LMM inlcuding LLama-3 (8B) and Qwen-1.5 (72B/110B) Check out [blog] and [checkpoints] to see improved performance!
[2024/05/10] 🔥 LLaVA-NeXT (Video) is released. The image-only-trained LLaVA-NeXT model is surprisingly strong on video tasks with zero-shot modality transfer. DPO training with AI feedback on videos can yield significant improvement. [Blog] and [checkpoints]
[2024/01/30] 🔥 LLaVA-NeXT is out! With additional scaling to LLaVA-1.5, LLaVA-NeXT-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks/applications than before. Check out the blog post, and explore the demo! Models are available in Model Zoo. Training/eval data and scripts coming soon.

More

[2024/03/10] 🔥 Releasing LMMs-Eval, a highly efficient evaluation pipeline we used when developing LLaVA-NeXT. It supports the evaluation of LMMs on dozens of public datasets and allows new dataset onboarding, making the dev of new LMMs much faster. [Blog] [Codebase]
[2023/11/10] LLaVA-Plus is released: Learning to Use Tools for Creating Multimodal Agents, with LLaVA-Plus (LLaVA that Plug and Learn to Use Skills). [Project Page] [Demo] [Code] [Paper]
[2023/11/02] LLaVA-Interactive is released: Experience the future of human-AI multimodal interaction with an all-in-one demo for Image Chat, Segmentation, Generation and Editing. [Project Page] [Demo] [Code] [Paper]
[2023/10/26] 🔥 LLaVA-1.5 with LoRA achieves comparable performance as full-model finetuning, with a reduced GPU RAM requirement (ckpts, script). We also provide a doc on how to finetune LLaVA-1.5 on your own dataset with LoRA.
[2023/10/12] Check out the Korean LLaVA (Ko-LLaVA), created by ETRI, who has generously supported our research! [🤗 Demo]
[2023/10/05] 🔥 LLaVA-1.5 is out! Achieving SoTA on 11 benchmarks, with just simple modifications to the original LLaVA, utilizes all public data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the technical report, and explore the demo! Models are available in Model Zoo. The training data and scripts of LLaVA-1.5 are released here, and evaluation scripts are released here!
[2023/09/26] LLaVA is improved with reinforcement learning from human feedback (RLHF) to improve fact grounding and reduce hallucination. Check out the new SFT and RLHF checkpoints at project [LLavA-RLHF]
[2023/09/22] LLaVA is accepted by NeurIPS 2023 as oral presentation, and LLaVA-Med is accepted by NeurIPS 2023 Datasets and Benchmarks Track as spotlight presentation.
[2023/11/06] Support Intel dGPU and CPU platforms. More details here.
[2023/10/12] LLaVA is now supported in llama.cpp with 4-bit / 5-bit quantization support!
[2023/10/11] The training data and scripts of LLaVA-1.5 are released here, and evaluation scripts are released here!
[2023/10/10] Roboflow Deep Dive: First Impressions with LLaVA-1.5.
[2023/09/20] We summarize our empirical study of training 33B and 65B LLaVA models in a note. Further, if you are interested in the comprehensive review, evolution and trend of multimodal foundation models, please check out our recent survey paper ``Multimodal Foundation Models: From Specialists to General-Purpose Assistants''.

[2023/07/19] 🔥 We release a major upgrade, including support for LLaMA-2, LoRA training, 4-/8-bit inference, higher resolution (336x336), and a lot more. We release LLaVA Bench for benchmarking open-ended visual chat with results from Bard and Bing-Chat. We also support and verify training with RTX 3090 and RTX A6000. Check out LLaVA-from-LLaMA-2, and our model zoo!
[2023/06/26] CVPR 2023 Tutorial on Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4! Please check out [Slides] [Notes] [YouTube] [Bilibli].
[2023/06/11] We released the preview for the most requested feature: DeepSpeed and LoRA support! Please see documentations here.
[2023/06/01] We released LLaVA-Med: Large Language and Vision Assistant for Biomedicine, a step towards building biomedical domain large language and vision models with GPT-4 level capabilities. Checkout the paper and page.
[2023/05/06] We are releasing LLaVA-Lighting-MPT-7B-preview, based on MPT-7B-Chat! See here for more details.
[2023/05/02] 🔥 We are releasing LLaVA-Lighting! Train a lite, multimodal GPT-4 with just $40 in 3 hours! See here for more details.
[2023/04/27] Thanks to the community effort, LLaVA-13B with 4-bit quantization allows you to run on a GPU with as few as 12GB VRAM! Try it out here.
[2023/04/17] 🔥 We released LLaVA: Large Language and Vision Assistant. We propose visual instruction tuning, towards building large language and vision models with GPT-4 level capabilities. Checkout the paper and demo.

Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama-1/2 community license for LLaMA-2 and Vicuna-v1.5, Tongyi Qianwen RESEARCH LICENSE AGREEMENT and Llama-3 Research License). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.

Models & Scripts

Installation

1. Clone this repository and navigate to the LLaVA folder:

git clone https://github.com/LLaVA-VL/LLaVA-NeXT
cd LLaVA-NeXT

2. Install the inference package:

conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"

Project Navigation

Please checkout the following page for more inference & evaluation details.

- LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild

LLaVA-NeXT-Image: for image demo inference and evaluation of stronger LMMs using lmms-eval.

- LLaVA-NeXT: A Strong Zero-shot Video Understanding Model

LLaVA-NeXT-Video: for video inference and evaluation scripts.

Citation

If you find it useful for your research and applications, please cite related papers/blogs using this BibTeX:

@misc{li2024llavanext-strong,
    title={LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild},
    url={https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/},
    author={Li, Bo and Zhang, Kaichen and Zhang, Hao and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Yuanhan and Liu, Ziwei and Li, Chunyuan},
    month={May},
    year={2024}
}

@misc{zhang2024llavanextvideo,
  title={LLaVA-NeXT: A Strong Zero-shot Video Understanding Model},
  url={https://llava-vl.github.io/blog/2024-04-30-llava-next-video/},
  author={Zhang, Yuanhan and Li, Bo and Liu, haotian and Lee, Yong jae and Gui, Liangke and Fu, Di and Feng, Jiashi and Liu, Ziwei and Li, Chunyuan},
  month={April},
  year={2024}
}

@misc{liu2024llavanext,
    title={LLaVA-NeXT: Improved reasoning, OCR, and world knowledge},
    url={https://llava-vl.github.io/blog/2024-01-30-llava-next/},
    author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Li, Bo and Zhang, Yuanhan and Shen, Sheng and Lee, Yong Jae},
    month={January},
    year={2024}
}

@misc{liu2023improvedllava,
      title={Improved Baselines with Visual Instruction Tuning}, 
      author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
      publisher={arXiv:2310.03744},
      year={2023},
}

@misc{liu2023llava,
      title={Visual Instruction Tuning}, 
      author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
      publisher={NeurIPS},
      year={2023},
}

Acknowledgement

Vicuna: the codebase we built upon, and our base model Vicuna-13B that has the amazing language capabilities!
The LLaVA-NeXT project is currently maintained by the team along with our contributors (listed alphabetically by the first names): Bo Li, Dong Guo, Feng Li, Hao Zhang, Kaichen Zhang, Renrui Zhang, Yuanhan Zhang, led by Chunyuan Li and with the guidance and help from Haotian Liu.
The lmms-eval framework and its core contributors, including Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, and Kairui Hu, for their support on the evaluation side.

Related Projects

For future project ideas, please check out:

SEEM: Segment Everything Everywhere All at Once
Grounded-Segment-Anything to detect, segment, and generate anything by marrying Grounding DINO and Segment-Anything.

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
docs		docs
llava		llava
llavavid		llavavid
playground/demo		playground/demo
scripts/video		scripts/video
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs

docs

llava

llava

llavavid

llavavid

playground/demo

playground/demo

scripts/video

scripts/video

.gitignore

.gitignore

README.md

README.md

pyproject.toml

pyproject.toml

Repository files navigation

LLaVA-NeXT: Open Large Multimodal Models

Release

Models & Scripts

Installation

1. Clone this repository and navigate to the LLaVA folder:

2. Install the inference package:

Project Navigation

- LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild

- LLaVA-NeXT: A Strong Zero-shot Video Understanding Model

Citation

Acknowledgement

Related Projects

About

Releases

Packages

Contributors 4

Languages

LLaVA-VL/LLaVA-NeXT

Folders and files

Latest commit

History

Repository files navigation

LLaVA-NeXT: Open Large Multimodal Models

Release

Models & Scripts

Installation

1. Clone this repository and navigate to the LLaVA folder:

2. Install the inference package:

Project Navigation

- LLaVA-NeXT: Stronger LLMs Supercharge Multimodal Capabilities in the Wild

- LLaVA-NeXT: A Strong Zero-shot Video Understanding Model

Citation

Acknowledgement

Related Projects

About

Resources

Stars

Watchers

Forks

Languages