
Uncertainty-Aware Evaluation for Vision-Language Models

Introduction

Datasets

Evaluation

Accuracy Results

| model_name             | MMB   | OOD   | SQA   | SB    | AI2D  | Avg.  |
|------------------------|-------|-------|-------|-------|-------|-------|
| LLaVA-v1.6-Vicuna-13B  | 76.75 | 72.93 | 70.56 | 70.37 | 73.67 | 72.85 |
| Monkey-Chat            | 76.98 | 70.60 | 74.66 | 66.10 | 67.95 | 71.26 |
| LLaVA-v1.6-Vicuna-7B   | 75.56 | 73.70 | 65.86 | 69.06 | 69.75 | 70.78 |
| InternLM-XComposer2-VL | 71.77 | 70.04 | 77.95 | 64.44 | 66.13 | 70.07 |
| Yi-VL-6B               | 75.24 | 73.91 | 66.72 | 66.25 | 58.84 | 68.19 |
| CogAgent-VQA           | 74.78 | 68.57 | 67.12 | 68.01 | 58.20 | 67.34 |
| MobileVLM_V2-7B        | 75.97 | 66.53 | 72.33 | 66.71 | 53.55 | 67.02 |
| MoE-LLaVA-Phi2-2.7B    | 73.73 | 74.82 | 64.04 | 66.42 | 55.76 | 66.95 |
| mPLUG-Owl2             | 73.05 | 73.28 | 65.71 | 61.49 | 54.38 | 65.58 |
| Qwen-VL-Chat           | 71.40 | 54.22 | 63.23 | 59.79 | 65.09 | 62.74 |

Set Sizes Results

| model_name             | MMB  | OOD  | SQA  | SB   | AI2D | Avg. |
|------------------------|------|------|------|------|------|------|
| LLaVA-v1.6-Vicuna-13B  | 2.34 | 2.18 | 2.45 | 2.49 | 2.33 | 2.36 |
| Monkey-Chat            | 2.70 | 2.92 | 2.56 | 3.26 | 3.19 | 2.93 |
| LLaVA-v1.6-Vicuna-7B   | 2.37 | 2.34 | 2.45 | 2.53 | 2.37 | 2.41 |
| InternLM-XComposer2-VL | 2.72 | 2.20 | 2.41 | 3.08 | 3.02 | 2.69 |
| Yi-VL-6B               | 2.47 | 2.02 | 2.76 | 2.61 | 3.00 | 2.57 |
| CogAgent-VQA           | 2.33 | 2.46 | 2.36 | 2.49 | 2.94 | 2.52 |
| MobileVLM_V2-7B        | 2.53 | 2.61 | 2.62 | 2.80 | 3.40 | 2.79 |
| MoE-LLaVA-Phi2-2.7B    | 2.54 | 1.89 | 2.70 | 2.69 | 2.92 | 2.55 |
| mPLUG-Owl2             | 2.55 | 2.09 | 2.71 | 2.93 | 3.00 | 2.65 |
| Qwen-VL-Chat           | 2.70 | 3.32 | 2.90 | 3.32 | 3.10 | 3.07 |

Uncertainty-Aware Accuracy Results

| model_name             | MMB   | OOD    | SQA   | SB    | AI2D  | Avg.  |
|------------------------|-------|--------|-------|-------|-------|-------|
| LLaVA-v1.6-Vicuna-13B  | 90.41 | 86.29  | 78.04 | 73.87 | 84.58 | 82.64 |
| Monkey-Chat            | 83.41 | 63.22  | 81.70 | 52.49 | 56.08 | 67.38 |
| LLaVA-v1.6-Vicuna-7B   | 87.87 | 82.81  | 69.77 | 70.69 | 77.20 | 77.67 |
| InternLM-XComposer2-VL | 69.98 | 80.49  | 94.37 | 52.60 | 56.40 | 70.77 |
| Yi-VL-6B               | 84.56 | 95.05  | 64.01 | 64.56 | 49.31 | 71.50 |
| CogAgent-VQA           | 85.56 | 71.14  | 72.40 | 69.96 | 49.00 | 69.61 |
| MobileVLM_V2-7B        | 84.19 | 64.35  | 79.07 | 62.18 | 39.39 | 65.84 |
| MoE-LLaVA-Phi2-2.7B    | 82.83 | 100.73 | 61.18 | 64.67 | 47.87 | 71.46 |
| mPLUG-Owl2             | 78.40 | 89.24  | 62.92 | 52.91 | 45.09 | 65.71 |
| Qwen-VL-Chat           | 69.58 | 40.28  | 54.71 | 44.70 | 54.30 | 52.71 |

Getting started

Six groups of models can be launched from one environment: LLaVA, CogVLM, Yi-VL, Qwen-VL, InternLM-XComposer, and MoE-LLaVA. This environment can be created as follows:

python3 -m venv venv
source venv/bin/activate
pip install git+https://github.com/haotian-liu/LLaVA.git 
pip install git+https://github.com/PKU-YuanGroup/MoE-LLaVA.git --no-deps
pip install deepspeed==0.9.5
pip install -r requirements.txt
pip install xformers==0.0.23 --no-deps
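
As an optional sanity check that the shared environment resolved correctly, you can try importing the key packages (the llava module name is an assumption based on how the haotian-liu/LLaVA package usually installs):

python -c "import llava, deepspeed; print('environment OK')"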

The mPLUG-Owl model can be launched from the following environment:

python3 -m venv venv_mplug
source venv_mplug/bin/activate
git clone https://github.com/X-PLUG/mPLUG-Owl.git
cd mPLUG-Owl/mPLUG-Owl2
git checkout 74f6be9f0b8d42f4c0ff9142a405481e0f859e5c
pip install -e .
pip install git+https://github.com/haotian-liu/LLaVA.git --no-deps
cd ../../
pip install -r requirements.txt

Monkey models can be launched from the following environment:

python3 -m venv venv_monkey
source venv_monkey/bin/activate
git clone https://github.com/Yuliang-Liu/Monkey.git
cd ./Monkey
pip install -r requirements.txt
pip install git+https://github.com/haotian-liu/LLaVA.git --no-deps
cd ../
pip install -r requirements.txt

To check all models, you can run scripts/test_model_logits.sh.
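
For example, with the shared environment active (running the script via bash from the repository root is an assumption about how it is meant to be invoked):

source venv/bin/activate
bash scripts/test_model_logits.sh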

To work with Yi-VL:

apt-get install git-lfs
git lfs install
cd ../
git clone https://huggingface.co/01-ai/Yi-VL-6B

Model logits

To get model logits on all benchmarks, run the command from scripts/run.sh.

To quantify uncertainty via conformal prediction over the logits, run:

python -m uncertainty_quantification_via_cp --result_data_path 'output' --file_to_write 'full_result.json'
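
For intuition, the module name indicates conformal prediction: a nonconformity score is computed for each calibration question, a quantile threshold is derived from those scores, and each test question then gets a prediction set of answer options; the average size of such sets is the kind of quantity the Set Sizes table above reports. Below is a minimal, self-contained sketch of split conformal prediction with a LAC-style score. All names, shapes, and the alpha value are illustrative assumptions, not this repository's actual API.

import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Compute the LAC-style threshold on a calibration split.

    cal_probs:  (n, k) softmax probabilities over k answer options
    cal_labels: (n,)   indices of the true options
    """
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true option.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile level for coverage 1 - alpha.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, q_level, method="higher")

def prediction_set(test_probs, q_hat):
    """All options whose nonconformity score falls below the threshold."""
    return np.where(1.0 - test_probs <= q_hat)[0]

# Toy usage: 500 calibration questions with 6 answer options each.
rng = np.random.default_rng(0)
logits = rng.normal(size=(500, 6))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, 6, size=500)
q_hat = conformal_threshold(probs, labels, alpha=0.1)
print(prediction_set(probs[0], q_hat))  # e.g. a set of option indices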

To build the result tables from the uncertainty results, run:

python -m make_tables --result_path 'full_result.json' --dir_to_write 'tables'
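
End to end, the pipeline above looks like this (assuming scripts/run.sh writes its logits under output/, the path the quantifier reads in the command above):

bash scripts/run.sh
python -m uncertainty_quantification_via_cp --result_data_path 'output' --file_to_write 'full_result.json'
python -m make_tables --result_path 'full_result.json' --dir_to_write 'tables'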

Citation

@article{kostumov2024uncertainty,
  title={Uncertainty-Aware Evaluation for Vision-Language Models},
  author={Kostumov, Vasily and Nutfullin, Bulat and Pilipenko, Oleg and Ilyushin, Eugene},
  journal={arXiv preprint arXiv:2402.14418},
  year={2024}
}

Acknowledgement

LLM-Uncertainty-Bench: conformal prediction applied to LLMs. Thanks to the authors for providing the framework.

Contact

We welcome suggestions to help us improve the benchmark. For any query, please contact us at v.kostumov@ensec.ai. If you find something interesting, feel free to share it with us via email or open an issue. Thanks!
