
📊 VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs

  📖 ArXiv    │   📊 VKnowU    │   📀 VKnowQA    │   🤗 VideoKnow+

While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. This high-level, vision-grounded semantics, which we term $\textbf{\textit{visual knowledge}}$, forms a bridge between perception and reasoning, yet remains underexplored in current MLLMs. To systematically evaluate this capability, we present 📊VKnowU, a comprehensive benchmark featuring 1,680 questions across 1,249 videos, covering 8 core types of visual knowledge spanning both $\textit{world-centric}$ (e.g., intuitive physics) and $\textit{human-centric}$ (e.g., subjective intentions) knowledge. Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps on world-centric knowledge. To bridge this gap, we introduce a new dataset, 📀VKnowQA, and 🤗VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured $\textit{See–Think–Answer}$ paradigm and adopts reinforcement learning with a visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.

✅ Release the training and evaluation code of VideoKnow+

⏳ Release the benchmark: 📊VKnowU

⏳ Release the model weights of 🤗VideoKnow+

⏳ Release the training datasets: 📀VKnowQA-CS-12K and 📀VKnowQA-30K

Requirements

  • Python >= 3.11
  • PyTorch >= 2.5.1
  • transformers == 4.51.3
  • vLLM == 0.7.3
  • trl == 0.16.0

Installation

git clone https://github.com/OpenGVLab/VKnowU
cd VKnowU

# Create and activate environment
conda create -n VKnowU python=3.11 
conda activate VKnowU
bash setup.sh

🚀 Training

Supervised Fine-Tuning (SFT)

We begin with supervised fine-tuning on the 📀VKnowQA-CS-12K dataset for one epoch:

bash ./src/scripts/run_sft_video.sh

Reinforcement Learning (RL)

Next, perform reinforcement learning on the 📀VKnowQA-30K dataset (with vLLM acceleration for faster training):

  1. Employ an external verifier MLLM to calculate the visual knowledge reward and modify the corresponding API endpoint here (a minimal sketch of such a reward is shown after these steps).

  2. Run the RL scripts:

bash ./src/scripts/run_grpo_vllm_qwen25vl.sh
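
For reference, below is a minimal sketch of how an external verifier MLLM could be queried to score the visual knowledge in a rollout (see step 1). The endpoint, model name, grading prompt, and score scale are illustrative assumptions, not the repository's exact reward implementation.

# Minimal sketch, not the repository's exact reward: query an external
# verifier MLLM through an OpenAI-compatible endpoint and map its judgment
# to a scalar reward in [0, 1]. Endpoint, model name, and prompt are
# illustrative placeholders.
import re
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

def visual_knowledge_reward(question: str, reference: str, rollout: str) -> float:
    """Ask the verifier to rate how well the rollout reflects the visual
    knowledge in the reference answer, then normalize the score to [0, 1]."""
    prompt = (
        "You are a strict grader. Given a question, a reference answer, and a "
        "model response, rate from 0 to 10 how well the response reflects the "
        "visual knowledge required by the reference. Reply with the number only.\n\n"
        f"Question: {question}\nReference: {reference}\nResponse: {rollout}"
    )
    completion = client.chat.completions.create(
        model="verifier-mllm",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    text = completion.choices[0].message.content or ""
    match = re.search(r"\d+(?:\.\d+)?", text)
    score = float(match.group()) if match else 0.0
    return max(0.0, min(score / 10.0, 1.0))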

Note: During training, we adopt the following settings for efficiency:

  • VIDEO PIXELS: 128 × 28 × 28
  • FPS FRAMES: 16

All frame-related configurations can be adjusted in src/qwen-vl-utils.
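
As an alternative to editing the vendored utilities directly, per-video limits can also be passed at the call site if src/qwen-vl-utils follows the upstream qwen_vl_utils interface; the snippet below is a sketch under that assumption, with a placeholder video path and the training settings above.

# Sketch assuming src/qwen-vl-utils mirrors the upstream qwen_vl_utils API,
# which accepts per-video overrides inside the message content. The video
# path is a placeholder; mapping FPS FRAMES to "nframes" is an assumption.
from qwen_vl_utils import process_vision_info

messages = [{
    "role": "user",
    "content": [
        {
            "type": "video",
            "video": "file:///path/to/example.mp4",  # placeholder path
            "max_pixels": 128 * 28 * 28,             # VIDEO PIXELS used in training
            "nframes": 16,                           # FPS FRAMES used in training (assumed key)
        },
        {"type": "text", "text": "What happens in this video?"},
    ],
}]

# Returns processed image and video inputs ready for the Qwen2.5-VL processor.
image_inputs, video_inputs = process_vision_info(messages)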

📈 Evaluation

During inference, we increase the maximum frame resolution and the number of sampled frames to boost performance:

  • VIDEO PIXELS: 256 × 28 × 28
  • FPS FRAMES: 32

You can configure these parameters in src/qwen-vl-utils.

Evaluation Procedure

📊 VKnowU

  1. Download the video and JSON data from VKnowU and organize them.

  2. Run the evaluation on VKnowU:

bash ./src/eval_vknowu.sh

  3. Calculate overall accuracy:

python ./src/eval/calculate_vknowu.py
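
The script above aggregates per-question results into overall accuracy; the sketch below illustrates the idea under an assumed results format (a JSON list of records with answer, prediction, and visual knowledge type fields), which may differ from the files actually produced by eval_vknowu.sh.

# Illustrative only: aggregate per-type and overall accuracy from an assumed
# results format, a JSON list of {"answer", "prediction", "type"} records.
# The real field names and paths are defined by src/eval/calculate_vknowu.py.
import json
from collections import defaultdict

with open("results/vknowu_results.json") as f:  # placeholder path
    records = json.load(f)

per_type = defaultdict(lambda: [0, 0])  # type -> [correct, total]
for r in records:
    correct = r["prediction"].strip().upper() == r["answer"].strip().upper()
    per_type[r["type"]][0] += int(correct)
    per_type[r["type"]][1] += 1

for name, (c, t) in sorted(per_type.items()):
    print(f"{name}: {c / t:.2%} ({c}/{t})")

overall_correct = sum(c for c, _ in per_type.values())
overall_total = sum(t for _, t in per_type.values())
print(f"Overall: {overall_correct / overall_total:.2%}")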

📊 Other Video Benchmarks

  1. Download the video data from the official site of each benchmark and organize it as specified in the JSON files under eval_data.

  2. Run the evaluation across other video benchmarks:

bash ./src/eval_bench.sh

  3. Calculate overall accuracy:

python ./src/eval/calculate_bench.py

🙏 Acknowledgements

We gratefully acknowledge the contributions of the open-source community, particularly R1-V and VideoRFT.

📚 Citations

If you find this work helpful, please consider citing:

coming soon
