
LLaVA-Mistral

Released by @kshitijkg on 01 Dec 14:37 · commit 7f2ea17

A LLaVA fork enabling the Mistral-7B and OpenHermes-2.5 language models to process images

This release and the associated models were created in collaboration between the Robin team at AGI-Collective and Simon Ramstedt, with computing resources from Hessian-AI and OLCF.

As part of this first milestone and release, we study pretrained LLMs (Vicuna, Mistral, and OpenHermes 2.5) and vision models (CLIP and SigLIP), and further improve capabilities by finetuning the vision encoder.

Available models

We use the following components:

- Base LLM: We explore Vicuna, Mistral, and OpenHermes-2.5.
- Base vision model: We use the SigLIP model, since it has shown stronger performance on vision benchmarks than CLIP.
- Vision encoder finetuning: We finetune the vision encoder, hoping the next-token prediction loss further improves the vision capabilities of the pretrained encoder (see the sketch after this list).
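
As a rough illustration of the last point: in a LLaVA-style codebase, unfreezing the vision tower amounts to enabling gradients on its parameters before training. The helper below is a minimal sketch, assuming the fork keeps upstream LLaVA's `get_vision_tower()` accessor; it is not the exact training code in this release.

```python
def set_vision_tower_trainable(model, trainable: bool = True) -> None:
    """Freeze or unfreeze the vision encoder of a LLaVA-style model.

    When unfrozen, the next-token prediction loss also backpropagates
    into the SigLIP/CLIP weights rather than only updating the
    multimodal projector and the language model.
    """
    # Assumption: get_vision_tower() is upstream LLaVA's accessor for
    # the vision encoder; the fork is assumed to keep it.
    for param in model.get_vision_tower().parameters():
        param.requires_grad = trainable
```

The `frozen-ve` checkpoint in the table below presumably marks the variant trained with the vision encoder left frozen, while the other LoRA checkpoints finetune it.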
| Model | Base LLM | GQA | SQA Text | SQA Image |
| --- | --- | --- | --- | --- |
| liuhaotian/llava-v1.5-7b | lmsys/vicuna-7b-v1.5 | 62 | 70.43 | 66.8 |
| liuhaotian/llava-v1.5-13b | lmsys/vicuna-13b-v1.5 | 63.3 | | 71.6 |
| agi-collective/vicuna-7b-clip-finetune-lora | lmsys/vicuna-7b-v1.5 | **62.04** | 70.86 | 68.72 |
| agi-collective/vicuna-7b-siglip-so400m-finetune-lora | lmsys/vicuna-7b-v1.5 | 56.79 | 68.76 | 67.48 |
| agi-collective/mistral-7b-siglip-so400m-finetune-lora | mistralai/Mistral-7B-v0.1 | 49.44 | 73.66 | 68.57 |
| agi-collective/mistral-7b-oh-siglip-so400m-frozen-ve-finetune-lora | teknium/OpenHermes-2.5-Mistral-7B | 53.59 | 78.17 | 72.73 |
| agi-collective/mistral-7b-oh-siglip-so400m-finetune-lora | teknium/OpenHermes-2.5-Mistral-7B | 54.48 | **79.56** | **74.22** |

(best 7B model results in bold)
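
To try one of the released LoRA checkpoints, an invocation along the lines of the following should work, assuming the fork keeps upstream LLaVA's `eval_model` entry point; the query and image path are placeholders, and a LoRA adapter must be paired with its base LLM via `model_base`.

```python
from llava.eval.run_llava import eval_model

# Sketch only: argument names follow upstream LLaVA's README example;
# the fork may differ. The model/base pairing comes from the table above.
args = type("Args", (), {
    "model_path": "agi-collective/mistral-7b-oh-siglip-so400m-finetune-lora",
    "model_base": "teknium/OpenHermes-2.5-Mistral-7B",
    "model_name": "mistral-7b-oh-siglip-so400m-finetune-lora",
    "query": "What is shown in this image?",  # placeholder prompt
    "image_file": "example.jpg",              # placeholder image path
    "conv_mode": None,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512,
})()

eval_model(args)
```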

Authors

Daniel Z Kaplan¹, Kshitij Gupta¹, Simon Ramstedt¹, Alexis Roger², Edwin Fennell², George Adamopoulos², Quentin Anthony², Sun Qi², Andrew R Williams³, Prateek Humane³, Rishika Bhagwatkar³, Yuchen Lu³, Irina Rish⁴

¹first authors, ²second authors, ³third authors, ⁴PI

Citation

```bibtex
@misc{RobinV1,
  author = {Daniel Z Kaplan and Kshitij Gupta and Simon Ramstedt and Alexis Roger and Edwin Fennell and George Adamopoulos and Quentin Anthony and Sun Qi and Andrew R Williams and Prateek Humane and Rishika Bhagwatkar and Yuchen Lu and Irina Rish},
  title = {Robin - Visual Language Models},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/AGI-Collective/Robin/releases/tag/v1.0.0}},
  commit = {tags/v1.0.0}
}
```

Acknowledgements

We would like to thank Hessian-AI for providing free access to 8-16 A100 GPUs for several weeks, and Florian and Patrick at Hessian-AI for their support. We would also like to thank the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science User Facility. Preliminary experiments were conducted on the Summit supercomputer under an INCITE compute grant, supported under Contract DE-AC05-00OR22725. This grant was awarded to the AAI CERC lab for its Scalable Foundation Models for Transferrable Generalist AI project. This work was done in collaboration with representatives from EleutherAI. The code in this repo is based on github.com/haotian-liu/LLaVA.