TL;DR: No hyperparameter modification and no extra data annotation required; LOVA3 is a new training paradigm that advances multimodal training by incorporating two new capabilities: asking questions and assessing VQA triplets.
If you are using the LLaVA codebase, simply replace the --data_path argument with Mixed_VQA_GenQA_EvalQA_1.5M.jsonl to enjoy the performance improvement.
```bash
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path checkpoints/vicuna-7b-v1.5 \
    --version v1 \
    --data_path ./data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl \
    ...
```
If you already have a Python environment for LLaVA, please skip this step.
```bash
conda create -n LOVA python=3.10
conda activate LOVA
pip install --upgrade pip
pip install -e .
```
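As an optional sanity check (assuming the editable install above pulls in PyTorch, as in the upstream LLaVA setup), you can confirm that the package imports and that a GPU is visible before moving on:

```bash
# Verify that the llava package installed by `pip install -e .` is importable
# and that PyTorch can see a CUDA device (training and evaluation expect one).
python -c "import llava, torch; print(torch.__version__, torch.cuda.is_available())"
```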
| Model Name | Size | Checkpoint | EvalQA Data Generated By |
|---|---|---|---|
| LOVA3-llava-v1.5-7b | 7B | checkpoint | Fuyu-8B |
| LOVA3-llava-v1.5-7b-gemini | 7B | checkpoint | Gemini-1.5-Flash |
| LOVA3-llava-v1.5-phi1.5-baseline | 1.5B | checkpoint | - |
| LOVA3-llava-v1.5-phi1.5-fuyu | 1.5B | checkpoint | Fuyu-8B |
| LOVA3-llava-v1.5-phi1.5-gemini | 1.5B | checkpoint | Gemini-1.5-Flash |
Download from Hugging Face:

```bash
git clone https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b
```
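Since the evaluation commands later in this README reference the weights as checkpoints/$modelname, it is convenient to clone directly into that folder. A minimal sketch, assuming git-lfs is installed so the large weight files are actually fetched:

```bash
# Place the weights where the evaluation scripts expect them:
# checkpoints/LOVA3-llava-v1.5-7b
mkdir -p checkpoints
git lfs install
git clone https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b checkpoints/LOVA3-llava-v1.5-7b
```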
- Training Data: Mixed_VQA_GenQA_EvalQA_1.5M.jsonl
- EvalQABench Data: EvalQABench
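After downloading, a quick look at the training file helps confirm it is in place; it is a JSONL file with one training sample per line (the exact fields are assumed here to follow LLaVA's conversation format, so inspect the output to confirm):

```bash
# Count training samples (one JSON object per line) and pretty-print the first one.
wc -l data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl
head -n 1 data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl | python -m json.tool
```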
Please download the images from the constituent datasets (example download commands follow the list):
- COCO: train2014
- GQA: images
- OCR-VQA: download script; we save all files as .jpg
- AOKVQA: download script
- TextVQA: train_val_images
- VisualGenome: part1, part2
- LLaVA-Instruct: huggingface
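As an illustration, the COCO, GQA, and TextVQA images can be fetched directly; the URLs below are the ones used in the upstream LLaVA data preparation and may change, so verify them against each dataset's page (OCR-VQA, AOKVQA, Visual Genome, and LLaVA-Instruct provide their own scripts or links and are omitted here):

```bash
# Example image downloads; verify each URL against the dataset's official page.
wget http://images.cocodataset.org/zips/train2014.zip                    # COCO train2014
wget https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip           # GQA images
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip  # TextVQA train_val_images
```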
- Download LOVA3-llava-v1.5-7b under the folder `checkpoints`.
- Download the CLIP vision encoder clip-vit-large-patch14-336 under the folder `checkpoints`.
- Run the evaluation scripts under the folder `scripts/v1_5/eval`. There are 12 multimodal datasets and benchmarks awaiting evaluation.
Take VizWiz as an example; the command is as follows:
```bash
modelname=LOVA3-llava-v1.5-7b

python -m llava.eval.model_vqa_loader \
    --model-path checkpoints/$modelname \
    --question-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --image-folder /yourpath/vizwiz/test/ \
    --answers-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1

python scripts/convert_vizwiz_for_submission.py \
    --annotation-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --result-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --result-upload-file ./playground/data/eval/vizwiz/answers_upload/$modelname.json
```
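Before uploading, a quick check that every test question received an answer can save a failed submission; the paths below are the same ones used in the commands above:

```bash
# Both files should contain the same number of lines (one per test question).
wc -l ./playground/data/eval/vizwiz/llava_test.jsonl \
      ./playground/data/eval/vizwiz/answers/$modelname.jsonl
```

The JSON written to answers_upload/ is the file to submit to the VizWiz evaluation server.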
- Download the pretrained MLP adapter weights llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 and put them under the folder `checkpoints`.
- Download the vision encoder clip-vit-large-patch14-336 under the folder `checkpoints`.
- Download the model weights vicuna-7b-v1.5 under the folder `checkpoints`.
- Download the training data Mixed_VQA_GenQA_EvalQA_1.5M.jsonl under the folder `data`.
- Run the training script:

```bash
bash scripts/v1_5/finetune.sh
```
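Before launching, it is worth confirming that the paths inside scripts/v1_5/finetune.sh match your local layout. The flag names below follow the upstream LLaVA training script and are an assumption about this repository's version; the values shown are the ones used elsewhere in this README:

```bash
# Print the path-related flags from the training script so they can be checked
# against the assets downloaded above (adjust to your own folder layout), e.g.:
#   --model_name_or_path checkpoints/vicuna-7b-v1.5
#   --vision_tower checkpoints/clip-vit-large-patch14-336
#   --data_path ./data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl
grep -nE "model_name_or_path|pretrain_mm_mlp_adapter|vision_tower|data_path" scripts/v1_5/finetune.sh
```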
If you find LOVA3 useful, please cite using this BibTeX:
```bibtex
@misc{zhao2024lova3learningvisualquestion,
      title={LOVA3: Learning to Visual Question Answering, Asking and Assessment},
      author={Henry Hengyuan Zhao and Pan Zhou and Difei Gao and Zechen Bai and Mike Zheng Shou},
      year={2024},
      eprint={2405.14974},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2405.14974},
}
```