📖 ArXiv │ 📊 VKnowU │ 📀 VKnowQA │ 🤗 Video-Know+
While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack an intuitive, human-like understanding of the world's underlying physical and social principles. This high-level, vision-grounded semantics, which we term visual knowledge, is what the VKnowU benchmark is designed to evaluate.
✅ Release the training and evaluation code of VideoKnow+
⏳ Release the benchmark: 📊VKnowU
⏳ Release the model weights of 🤗VideoKnow+
⏳ Release the training datasets: 📀VKnowQA-CS-12K and 📀VKnowQA-30K
- Python >= 3.11
- PyTorch >= 2.5.1
- transformers == 4.51.3
- vLLM == 0.7.3
- trl == 0.16.0
```bash
git clone https://github.com/OpenGVLab/VKnowU
cd VKnowU

# Create and activate environment
conda create -n VKnowU python=3.11
conda activate VKnowU

bash setup.sh
```

We begin with supervised fine-tuning on the 📀VKnowQA-CS-12K dataset for one epoch:
```bash
bash ./src/scripts/run_sft_video.sh
```

Next, perform reinforcement learning on the 📀VKnowQA-30K dataset (vLLM acceleration is used to speed up training):
- Employ an external verifier MLLM to compute the visual knowledge reward, and modify the corresponding API here (a minimal sketch of such a verifier call follows the run command below).
- Run the RL scripts:
```bash
bash ./src/scripts/run_grpo_vllm_qwen25vl.sh
```
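For reference, the snippet below is a minimal, hypothetical sketch of how an external verifier MLLM served behind an OpenAI-compatible endpoint could be asked for a visual knowledge reward. The endpoint URL, model name, prompt, and score parsing are placeholder assumptions, not the project's actual implementation.

```python
# Hypothetical sketch only: querying an external verifier MLLM behind an
# OpenAI-compatible endpoint for a visual knowledge reward.
# The URL, model name, and prompt below are placeholders, not the repo's settings.
import re

import requests

VERIFIER_URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint
VERIFIER_MODEL = "verifier-mllm"                            # placeholder model name


def visual_knowledge_reward(question: str, reference: str, rollout: str) -> float:
    """Ask the verifier to score whether the rollout conveys the same visual
    knowledge as the reference answer, and clamp the score to [0, 1]."""
    prompt = (
        "You are a strict grader of visual knowledge.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {rollout}\n"
        "Reply with a single number between 0 and 1."
    )
    payload = {
        "model": VERIFIER_MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
        "max_tokens": 8,
    }
    resp = requests.post(VERIFIER_URL, json=payload, timeout=60)
    resp.raise_for_status()
    text = resp.json()["choices"][0]["message"]["content"]
    match = re.search(r"\d*\.?\d+", text)  # pull the first numeric score out of the reply
    return min(max(float(match.group()), 0.0), 1.0) if match else 0.0
```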
Note: During training, we adopt the following settings for efficiency:

- VIDEO PIXELS: 128 × 28 × 28
- FPS FRAMES: 16

All frame-related configurations can be adjusted in `src/qwen-vl-utils`.
During inference, we increase the maximum frame resolution and length to boost performance:
- VIDEO PIXELS: 256 × 28 × 28
- FPS FRAMES: 32
You can configure these parameters in `src/qwen-vl-utils`.
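To illustrate where these limits apply, here is a minimal sketch of preparing a video input with qwen-vl-utils under the inference-time settings above. The video path and question are placeholders, and the key names follow the upstream qwen-vl-utils API, which may differ slightly from the vendored copy in `src/qwen-vl-utils`.

```python
# Minimal sketch: apply the inference-time frame settings when preparing a video input.
# Placeholder video path and question; key names follow upstream qwen-vl-utils.
from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "path/to/video.mp4",   # placeholder video path
                "max_pixels": 256 * 28 * 28,    # per-frame resolution cap at inference
                "nframes": 32,                  # frame budget at inference
            },
            {"type": "text", "text": "What physical principle explains this event?"},
        ],
    }
]

# Returns image and video tensors ready to pass to the Qwen2.5-VL processor.
image_inputs, video_inputs = process_vision_info(messages)
```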
- Download the video and JSON data from VKnowU and organize them.
- Run the evaluation on VKnowU:

```bash
bash ./src/eval_vknowu.sh
```

- Calculate the overall accuracy:

```bash
python ./src/eval/calculate_vknowu.py
```
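The accuracy step simply compares predictions against ground-truth answers. The sketch below is not the repo's script; the results file and its field names are assumed for illustration only.

```python
# Not the repo script: a minimal sketch of what the overall-accuracy step computes,
# assuming a results file that stores one dict per question with a predicted option
# and a ground-truth answer (field names are assumptions).
import json


def overall_accuracy(results_path: str) -> float:
    with open(results_path) as f:
        records = json.load(f)  # hypothetical format: a list of result dicts
    correct = sum(r["prediction"] == r["answer"] for r in records)
    return correct / max(len(records), 1)


if __name__ == "__main__":
    print(f"Overall accuracy: {overall_accuracy('vknowu_results.json'):.2%}")
```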
- Download the video data from the official site of each benchmark and organize it as specified by the JSON files in `eval_data`.
- Run the evaluation across other video benchmarks:

```bash
bash ./src/eval_bench.sh
```

- Calculate the overall accuracy:

```bash
python ./src/eval/calculate_bench.py
```
We gratefully acknowledge the contributions of the open-source community, particularly R1-V and VideoRFT.
If you find this work helpful, please consider citing:
coming soon

