
LOVA3: Learning to Visual Question Answering, Asking and Assessment


[Paper PDF] [Project Page] [Models] [EvalQABench Dataset]
TL;DR: No hyperparameter changes and no extra data annotation are required; LOVA3 is a new training paradigm that advances multimodal training by adding two capabilities: asking questions and assessing VQA triplets.

Overall Performance Improvements

πŸš€ Quick Start

If you are using the LLaVA codebase, simply point --data_path to Mixed_VQA_GenQA_EvalQA_1.5M.jsonl to get the performance improvement.

deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path checkpoints/vicuna-7b-v1.5 \
    --version v1 \
    --data_path ./data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl \
    ...
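
Before launching training, you can optionally sanity-check the mixed data file with a minimal Python sketch. The field names below (image, conversations) are an assumption based on LLaVA's standard conversation format, not a guarantee about this specific file:

import json

# Minimal sketch: peek at the first record of the mixed training file.
# Field names are assumed to follow LLaVA's standard conversation format.
with open("./data/Mixed_VQA_GenQA_EvalQA_1.5M.jsonl", "r") as f:
    first = json.loads(f.readline())

print(first.keys())                        # expected to include e.g. 'image' and 'conversations'
print(first.get("conversations", [])[:2])  # first human/assistant turns, if present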

βš’οΈ Install (Optional)

If you already have a Python environment set up for LLaVA, you can skip this step.

conda create -n LOVA python=3.10
conda activate LOVA
pip install --upgrade pip
pip install -e .

Model weights

| Model Name | Size | Checkpoint | EvalQA Data Generated By |
| --- | --- | --- | --- |
| LOVA3-llava-v1.5-7b | 7B | checkpoint | Fuyu-8B |
| LOVA3-llava-v1.5-7b-gemini | 7B | checkpoint | Gemini-1.5-Flash |
| LOVA3-llava-v1.5-phi1.5-baseline | 1.5B | checkpoint | - |
| LOVA3-llava-v1.5-phi1.5-fuyu | 1.5B | checkpoint | Fuyu-8B |
| LOVA3-llava-v1.5-phi1.5-gemini | 1.5B | checkpoint | Gemini-1.5-Flash |

Download from Hugging Face:

git clone https://huggingface.co/hhenryz/LOVA3-llava-v1.5-7b
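
Alternatively, a minimal Python sketch using huggingface_hub (assuming the package is installed; the repo ID is the one shown above):

from huggingface_hub import snapshot_download

# Download the full model repository into the local checkpoints folder.
snapshot_download(
    repo_id="hhenryz/LOVA3-llava-v1.5-7b",
    local_dir="checkpoints/LOVA3-llava-v1.5-7b",
)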

Data

Data JSON

Image Datasets

Please download the images from the constituent datasets:

πŸ’ƒ Evaluation

  1. Download LOVA3-llava-v1.5-7b under the folder checkpoints.

  2. Download the CLIP vision encoder clip-vit-large-patch14-336 under the folder checkpoints.

  3. Run the evaluation scripts under the folder scripts/v1_5/eval. Scripts are provided for 12 multimodal datasets and benchmarks.

Taking VizWiz as an example, the commands are as follows:

modelname=LOVA3-llava-v1.5-7b

python -m llava.eval.model_vqa_loader \
    --model-path checkpoints/$modelname \
    --question-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --image-folder /yourpath/vizwiz/test/ \
    --answers-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1

python scripts/convert_vizwiz_for_submission.py \
    --annotation-file ./playground/data/eval/vizwiz/llava_test.jsonl \
    --result-file ./playground/data/eval/vizwiz/answers/$modelname.jsonl \
    --result-upload-file ./playground/data/eval/vizwiz/answers_upload/$modelname.json
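
As an optional sanity check (a sketch, not part of the official pipeline), you can verify that the answers file contains one prediction per question before converting it for submission:

# Optional check: one prediction per VizWiz question before submission.
modelname = "LOVA3-llava-v1.5-7b"

with open("./playground/data/eval/vizwiz/llava_test.jsonl") as f:
    num_questions = sum(1 for _ in f)
with open(f"./playground/data/eval/vizwiz/answers/{modelname}.jsonl") as f:
    num_answers = sum(1 for _ in f)

assert num_answers == num_questions, f"{num_answers} answers vs {num_questions} questions"
print(f"OK: {num_answers} predictions for {num_questions} questions")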

Training

  1. Download the pretrained MLP adapter weights llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 and put them under the folder checkpoints (see the download sketch after the training command).

  2. Download the model weights clip-vit-large-patch14-336 under the folder checkpoints.

  3. Download the model weights vicuna-7b-v1.5 under the folder checkpoints.

  4. Download the training data Mixed_VQA_GenQA_EvalQA_1.5M.jsonl under the folder data.

  5. Run the training script.

bash scripts/v1_5/finetune.sh
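
The downloads in steps 1-3 can also be scripted. Below is a minimal sketch using huggingface_hub; the repo IDs (liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5, openai/clip-vit-large-patch14-336, lmsys/vicuna-7b-v1.5) are the commonly used Hugging Face sources and are assumptions, not links taken from this repository. The training data in step 4 is provided separately in the Data section above.

from huggingface_hub import snapshot_download

# Assumed Hugging Face repo IDs for the prerequisites in steps 1-3.
prerequisites = [
    "liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5",  # pretrained MLP adapter
    "openai/clip-vit-large-patch14-336",                          # CLIP vision encoder
    "lmsys/vicuna-7b-v1.5",                                       # language model
]

for repo_id in prerequisites:
    snapshot_download(repo_id=repo_id, local_dir=f"checkpoints/{repo_id.split('/')[-1]}")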

πŸ™ Acknowledgement

  • LLaVA: The codebase we built upon.
  • LAVIS: We download some datasets using its scripts.

πŸŽ“ Citation

If you find LOVA3 useful, please cite using this BibTeX:

@misc{zhao2024lova3learningvisualquestion,
      title={LOVA3: Learning to Visual Question Answering, Asking and Assessment}, 
      author={Henry Hengyuan Zhao and Pan Zhou and Difei Gao and Zechen Bai and Mike Zheng Shou},
      year={2024},
      eprint={2405.14974},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2405.14974}, 
}