
LLaVA-Mistral

Released by @kshitijkg on 01 Dec 14:37 · commit 7f2ea17

A LLaVA fork enabling the Mistral-7B and OpenHermes-2.5 language models to process images

This release and the associated models were created in collaboration between the Robin team at AGI-Collective and Simon Ramstedt, with computing resources from Hessian-AI and OLCF.

As part of this first milestone and release, we study pretrained LLMs (Vicuna, Mistral, and OpenHermes 2.5) and vision models (CLIP and SigLIP), and further improve capabilities by finetuning the vision encoder.

Available models

We use the following components:

- Base LLM: We explore Vicuna, Mistral, and OpenHermes-2.5.
- Base vision model: We use the SigLIP model, since it has shown stronger performance on vision benchmarks than CLIP.
- Vision encoder finetuning: We finetune the vision encoder, hoping the next-token prediction loss further improves the vision capabilities of the pretrained encoder (see the sketch after this list).
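
As a rough illustration of the last point: in a LLaVA-style codebase, unfreezing the vision tower amounts to enabling gradients on its parameters before training. The helper below is a minimal sketch, assuming the fork keeps upstream LLaVA's `get_vision_tower()` accessor; it is not the exact training code in this release.

```python
def set_vision_tower_trainable(model, trainable: bool = True) -> None:
    """Freeze or unfreeze the vision encoder of a LLaVA-style model.

    When unfrozen, the next-token prediction loss also backpropagates
    into the SigLIP/CLIP weights rather than only updating the
    multimodal projector and the language model.
    """
    # Assumption: get_vision_tower() is upstream LLaVA's accessor for
    # the vision encoder; the fork is assumed to keep it.
    for param in model.get_vision_tower().parameters():
        param.requires_grad = trainable
```

The `frozen-ve` checkpoint in the table below presumably marks the variant trained with the vision encoder left frozen, while the other LoRA checkpoints finetune it.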
| Model | Base LLM | GQA | SQA Text | SQA Image |
| --- | --- | --- | --- | --- |
| liuhaotian/llava-v1.5-7b | lmsys/vicuna-7b-v1.5 | 62 | 70.43 | 66.8 |
| liuhaotian/llava-v1.5-13b | lmsys/vicuna-13b-v1.5 | 63.3 | | 71.6 |
| agi-collective/vicuna-7b-clip-finetune-lora | lmsys/vicuna-7b-v1.5 | **62.04** | 70.86 | 68.72 |
| agi-collective/vicuna-7b-siglip-so400m-finetune-lora | lmsys/vicuna-7b-v1.5 | 56.79 | 68.76 | 67.48 |
| agi-collective/mistral-7b-siglip-so400m-finetune-lora | mistralai/Mistral-7B-v0.1 | 49.44 | 73.66 | 68.57 |
| agi-collective/mistral-7b-oh-siglip-so400m-frozen-ve-finetune-lora | teknium/OpenHermes-2.5-Mistral-7B | 53.59 | 78.17 | 72.73 |
| agi-collective/mistral-7b-oh-siglip-so400m-finetune-lora | teknium/OpenHermes-2.5-Mistral-7B | 54.48 | **79.56** | **74.22** |

(best 7B model results in bold)
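
To try one of the released LoRA checkpoints, an invocation along the lines of the following should work, assuming the fork keeps upstream LLaVA's `eval_model` entry point; the query and image path are placeholders, and a LoRA adapter must be paired with its base LLM via `model_base`.

```python
from llava.eval.run_llava import eval_model

# Sketch only: argument names follow upstream LLaVA's README example;
# the fork may differ. The model/base pairing comes from the table above.
args = type("Args", (), {
    "model_path": "agi-collective/mistral-7b-oh-siglip-so400m-finetune-lora",
    "model_base": "teknium/OpenHermes-2.5-Mistral-7B",
    "model_name": "mistral-7b-oh-siglip-so400m-finetune-lora",
    "query": "What is shown in this image?",  # placeholder prompt
    "image_file": "example.jpg",              # placeholder image path
    "conv_mode": None,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
})()

eval_model(args)
```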

Authors

Daniel Z Kaplan¹, Kshitij Gupta¹, Simon Ramstedt¹, Alexis Roger², Edwin Fennell², George Adamopoulos², Quentin Anthony², Sun Qi², Andrew R Williams³, Prateek Humane³, Rishika Bhagwatkar³, Yuchen Lu³, Irina Rish⁴

¹first authors, ²second authors, ³third authors, ⁴PI

Citation

```bibtex
@misc{RobinV1,
  author = {Daniel Z Kaplan and Kshitij Gupta and Simon Ramstedt and Alexis Roger and Edwin Fennell and George Adamopoulos and Quentin Anthony and Sun Qi and Andrew R Williams and Prateek Humane and Rishika Bhagwatkar and Yuchen Lu and Irina Rish},
  title = {Robin - Visual Language Models},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/AGI-Collective/Robin/releases/tag/v1.0.0}},
  commit = {tags/v1.0.0}
}
```

Acknowledgements

We would like to thank Hessian-AI for providing free access to 8-16 A100 GPUs for several weeks, and Florian and Patrick at Hessian-AI for their support. We would also like to thank the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science User Facility. Preliminary experiments were conducted on the Summit supercomputer under an INCITE compute grant, supported under Contract DE-AC05-00OR22725. This grant was awarded to the AAI CERC lab for its Scalable Foundation Models for Transferrable Generalist AI project. This work was done in collaboration with representatives from EleutherAI. The code in this repo is based on github.com/haotian-liu/LLaVA.