# Evaluation

## InternLM-XComposer2-4KHD Evaluation

We support the evaluation of InternLM-XComposer2-4KHD in VLMEvalKit.

## InternLM-XComposer2-VL Evaluation

We evaluate InternLM-XComposer2-VL on a diverse set of 13 benchmarks with the scripts below. The evaluation is also supported in VLMEvalKit (the results may differ slightly).

### MathVista

  1. Run the notebook MathVista.ipynb.
MathVista results

| test | testmini |
| :---: | :---: |
| 57.93 | 57.6 |

### MMMU

  1. Run the notebook MMMU/MMMU_Validation.ipynb.
MMMU results

| test | val |
| :---: | :---: |
| 38.2 | 42.0 |

### MME

  1. Download the data following the official instructions here.
  2. Download the images to MME_Benchmark_release_version.
  3. Put the official eval_tool and MME_Benchmark_release_version under ./data/.
  4. Single-GPU inference.
     ```bash
     cd MME
     CUDA_VISIBLE_DEVICES=0 python -u eval.py
     ```
MME results

```
=========== Perception ===========
total score: 1711.9952981192478

         existence  score: 195.0
         count  score: 160.0
         position  score: 163.33333333333334
         color  score: 195.0
         posters  score: 171.08843537414964
         celebrity  score: 153.8235294117647
         scene  score: 164.75
         landmark  score: 176.0
         artwork  score: 185.5
         OCR  score: 147.5

=========== Cognition ===========
total score: 530.7142857142858

         commonsense_reasoning  score: 145.71428571428572
         numerical_calculation  score: 137.5
         text_translation  score: 147.5
         code_reasoning  score: 100.0
```
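
As a quick consistency check, each MME category total is the sum of its subtask scores. A minimal sketch using the numbers above:

```python
# Each MME category total is the sum of its subtask scores (values copied from above).
perception = {
    "existence": 195.0, "count": 160.0, "position": 163.33333333333334,
    "color": 195.0, "posters": 171.08843537414964, "celebrity": 153.8235294117647,
    "scene": 164.75, "landmark": 176.0, "artwork": 185.5, "OCR": 147.5,
}
cognition = {
    "commonsense_reasoning": 145.71428571428572, "numerical_calculation": 137.5,
    "text_translation": 147.5, "code_reasoning": 100.0,
}

assert abs(sum(perception.values()) - 1711.9952981192478) < 1e-6
assert abs(sum(cognition.values()) - 530.7142857142858) < 1e-6
print(sum(perception.values()), sum(cognition.values()))
```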

### MMBench

  1. Download mmbench_test_20230712.tsv and put it under ./data/.
  2. Single-GPU inference.
     ```bash
     cd MMBench
     CUDA_VISIBLE_DEVICES=0 python -u eval.py
     ```
  3. Submit the results to the evaluation server: Output/submit_test.xlsx.
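
Before uploading, it can help to open the generated submission file and confirm that it is non-empty and that every row carries a prediction. A minimal sketch, assuming pandas is installed (the column names are whatever eval.py wrote, so just inspect them):

```python
import pandas as pd

# Inspect the submission file produced by eval.py before uploading it.
df = pd.read_excel("Output/submit_test.xlsx")
print(df.shape)               # rows = answered questions
print(df.columns.tolist())    # see which fields eval.py wrote
print(df.head())              # eyeball a few predictions
```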
MMBench Testset results

| Overall | AR | CP | FP-C | FP-S | LR | RR |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 79.64 | 82.35 | 83.82 | 72 | 85.75 | 66.47 | 75.11 |

### MMBench-CN

  1. Download mmbench_dev_cn_20231003.tsv and put it under ./data/.
  2. Single-GPU inference.
     ```bash
     cd MMBench
     CUDA_VISIBLE_DEVICES=0 python -u eval_cn.py
     ```
  3. Submit the results to the evaluation server: Output/submit_dev_cn.xlsx.
MMBench-CN Testset results

| Overall | AR | CP | FP-C | FP-S | LR | RR |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 77.57 | 84.37 | 83.29 | 69.23 | 83.16 | 60.69 | 68.72 |

### SEED-Bench

  1. Follow the official instructions to download the images and the json file. Put the images under ./data/SEED-Bench-image and the json file at ./data/SEED-Bench.json.
  2. Single-GPU inference.
     ```bash
     cd SEED
     CUDA_VISIBLE_DEVICES=0 python -u eval.py
     ```
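
A quick way to confirm the layout before running is to count the downloaded images and load the annotation file. The sketch below only touches the two paths named in step 1 and makes no assumption about the json schema beyond it being valid JSON:

```python
import json
from pathlib import Path

# Confirm the data layout expected by eval.py (paths taken from step 1 above).
image_dir = Path("./data/SEED-Bench-image")
anno_path = Path("./data/SEED-Bench.json")

num_images = sum(1 for _ in image_dir.iterdir()) if image_dir.is_dir() else 0
print("images found:", num_images)

with open(anno_path) as f:
    anno = json.load(f)
print("top-level structure:", list(anno) if isinstance(anno, dict) else type(anno).__name__)
```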
Seed-Bench Image Set results

| Overall | Instance Attributes | Instance Identity | Instance Interaction | Instance Location | Instances Counting | Scene Understanding | Spatial Relation | Text Understanding | Visual Reasoning |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 75.87 | 77.84 | 78.37 | 79.38 | 72.69 | 69.96 | 79.32 | 63.47 | 67.85 | 80.06 |

### AI2D

  1. Download the processed images here and unzip them to ./data/ai2d/.
  2. Run the notebook AI2D.ipynb.
AI2D results

| Overall |
| :---: |
| 78.73 |

### ChartQA

  1. Download the processed images from the official website and unzip them to ./data/chartqa/.
  2. Run the notebook ChartQA.ipynb.
ChartQA results

| Overall | Human | Augmented |
| :---: | :---: | :---: |
| 72.68 | 63.52 | 81.84 |
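
ChartQA has a human-written question split and a machine-augmented split, and the overall number above is consistent with being the unweighted mean of the two:

```python
# The overall ChartQA score matches the unweighted mean of the two splits.
human, augmented = 63.52, 81.84
print((human + augmented) / 2)  # 72.68
```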

### LLaVA-Bench-in-the-Wild

  1. Extract contents of llava-bench-in-the-wild to ./data/llava-bench-in-the-wild.
  2. Run the notebook LLaVA_Wild_Eval.ipynb.
LLaVA Wild results

| | Answer/GPT4 | GPT4 score | Answer score |
| :--- | :---: | :---: | :---: |
| llava_bench_complex | 92.3 | 83.9 | 77.5 |
| llava_bench_conv | 67.6 | 87.1 | 58.8 |
| llava_bench_detail | 78.8 | 83.3 | 65.7 |
| all | 81.8 | 84.7 | 69.2 |
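
In this benchmark a GPT-4 judge rates both the model's answer and a GPT-4 reference answer, and the headline Answer/GPT4 column is the ratio of the two in percent. Recomputing it from the (rounded) columns above reproduces the table to within about 0.1:

```python
# Answer/GPT4 = 100 * (model answer score) / (GPT-4 reference score).
rows = {
    "llava_bench_complex": (77.5, 83.9),
    "llava_bench_conv":    (58.8, 87.1),
    "llava_bench_detail":  (65.7, 83.3),
    "all":                 (69.2, 84.7),
}
for name, (answer, gpt4) in rows.items():
    print(name, round(100 * answer / gpt4, 1))
# Small (~0.1) differences from the table come from the per-column rounding above.
```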

### MM-Vet

  1. Download mm-vet.zip, unzip it, and move the contents to ./data/mm-vet/.
  2. Run the notebook MMVet_Eval.ipynb.
  3. Run the notebook MMVet_evaluator.ipynb.
MM-Vet results

| rec | ocr | know | gen | spat | math | total |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 50.8 | 50.5 | 35.0 | 38.5 | 52.8 | 41.9 | 51.2 |

### Q-Bench

  1. Download llvisionqa_dev.json (for dev-subset) and llvisionqa_test.json (for test-subset). Put them under ./data/qbench.
  2. Download and extract images and put all the images directly under ./data/qbench/llv_dev.
  3. Run the notebook QBench.ipynb.
  4. For the testset results, set the split to test and submit the results following the instructions here: Output/QBench_test_en_InternLM_XComposer_VL.json.pth.
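
Before submitting, it can be worth loading the prediction file to make sure it is complete. A minimal sketch, assuming the .pth file was written with torch.save (if it turns out to be plain JSON despite the suffix, json.load works instead):

```python
import torch

# Peek at the prediction file before uploading it (assumes it was saved via torch.save).
preds = torch.load("Output/QBench_test_en_InternLM_XComposer_VL.json.pth")
print(type(preds).__name__)
print(len(preds) if hasattr(preds, "__len__") else preds)
```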
Q-Bench results

| test-en | dev-en |
| :---: | :---: |
| 72.52 | 70.70 |

### Chinese-Q-Bench

  1. Download 质衡-问答-验证集.json (for dev-subset) and 质衡-问答-测试集.json (for test-subset). Put them under ./data/qbench.
  2. Download and extract images and put all the images directly under ./data/qbench/llv_dev.
  3. Run the notebook QBench.ipynb.
  4. For the testset results, set the split to test and submit the results following the instructions here: Output/QBench_test_cn_InternLM_XComposer_VL.json.pth.
Chinese-Q-Bench results

| test-cn | dev-cn |
| :---: | :---: |
| 70.32 | 72.11 |

### POPE

  1. Download the coco json files from POPE and put the three json files under ./data/json_files/.
  2. Run the notebook POPE.ipynb.
POPE results

Average F1-Score: 0.8773077717611343

| Subset | TP | FP | TN | FN | Accuracy | Precision | Recall | F1 score | Yes ratio |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Adversarial | 1217 | 91 | 1409 | 283 | 0.8753 | 0.9304 | 0.8113 | 0.8668 | 0.4360 |
| Popular | 1217 | 58 | 1442 | 283 | 0.8863 | 0.9545 | 0.8113 | 0.8771 | 0.4250 |
| Random | 1217 | 24 | 1386 | 283 | 0.8945 | 0.9807 | 0.8113 | 0.8880 | 0.4265 |
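
All of the per-subset metrics above follow directly from the confusion counts, and the reported average F1 is the mean over the three subsets. A minimal re-derivation from the numbers in the table:

```python
# Recompute the POPE metrics from the confusion counts reported above.
subsets = {
    "Adversarial": (1217, 91, 1409, 283),   # (TP, FP, TN, FN)
    "Popular":     (1217, 58, 1442, 283),
    "Random":      (1217, 24, 1386, 283),
}

f1_scores = []
for name, (tp, fp, tn, fn) in subsets.items():
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    yes_ratio = (tp + fp) / total
    f1_scores.append(f1)
    print(f"{name}: acc={accuracy:.4f} precision={precision:.4f} "
          f"recall={recall:.4f} f1={f1:.4f} yes_ratio={yes_ratio:.4f}")

print("Average F1-Score:", sum(f1_scores) / len(f1_scores))  # ~0.8773
```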

### HallusionBench

  1. Download the questions and images and move them to ./data/hallu/.
  2. Run the notebook AI2D.ipynb.
HallusionBench Image Part results

| aAcc | fAcc | qAcc |
| :---: | :---: | :---: |
| 60.3 | 30.01 | 32.97 |