📖 ArXiv │ 📊 VKnowU │ 📀 VKnowQA │ 🤗 Video-Know+
While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack an intuitive, human-like understanding of the world's underlying physical and social principles. This high-level, vision-grounded semantics, which we term visual knowledge, is what the VKnowU benchmark is designed to evaluate.
✅ Release the training and evaluation code of VideoKnow+
⏳ Release the benchmark: 📊VKnowU
⏳ Release the model weights of 🤗VideoKnow+
⏳ Release the training datasets: 📀VKnowQA-CS-12K and 📀VKnowQA-30K
- Python >= 3.11
- PyTorch >= 2.5.1
- transformers == 4.51.3
- vLLM == 0.7.3
- trl == 0.16.0
```bash
git clone https://github.com/OpenGVLab/VKnowU
cd VKnowU

# Create and activate environment
conda create -n VKnowU python=3.11
conda activate VKnowU

bash setup.sh
```

We begin with supervised fine-tuning on the 📀VKnowQA-CS-12K dataset for one epoch:
```bash
bash ./src/scripts/run_sft_video.sh
```

Next, perform reinforcement learning on the 📀VKnowQA-30K dataset (vLLM acceleration is used to speed up training):
- Employ an external verifier MLLM to compute the visual knowledge reward, and modify the corresponding API here (a minimal sketch of such a verifier call follows the run command below).
- Run the RL scripts:
```bash
bash ./src/scripts/run_grpo_vllm_qwen25vl.sh
```
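For reference, the snippet below is a minimal, hypothetical sketch of how an external verifier MLLM served behind an OpenAI-compatible endpoint could be asked for a visual knowledge reward. The endpoint URL, model name, prompt, and score parsing are placeholder assumptions, not the project's actual implementation.

```python
# Hypothetical sketch only: querying an external verifier MLLM behind an
# OpenAI-compatible endpoint for a visual knowledge reward.
# The URL, model name, and prompt below are placeholders, not the repo's settings.
import re

import requests

VERIFIER_URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint
VERIFIER_MODEL = "verifier-mllm"                            # placeholder model name


def visual_knowledge_reward(question: str, reference: str, rollout: str) -> float:
    """Ask the verifier to score whether the rollout conveys the same visual
    knowledge as the reference answer, and clamp the score to [0, 1]."""
    prompt = (
        "You are a strict grader of visual knowledge.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {rollout}\n"
        "Reply with a single number between 0 and 1."
    )
    payload = {
        "model": VERIFIER_MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
        "max_tokens": 8,
    }
    resp = requests.post(VERIFIER_URL, json=payload, timeout=60)
    resp.raise_for_status()
    text = resp.json()["choices"][0]["message"]["content"]
    match = re.search(r"\d*\.?\d+", text)  # pull the first numeric score out of the reply
    return min(max(float(match.group()), 0.0), 1.0) if match else 0.0
```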
Note: During training, we adopt the following settings for efficiency:

- VIDEO PIXELS: 128 × 28 × 28
- FPS FRAMES: 16

All frame-related configurations can be adjusted in `src/qwen-vl-utils`.
During inference, we increase the maximum frame resolution and length to boost performance:
- VIDEO PIXELS: 256 × 28 × 28
- FPS FRAMES: 32
You can configure these parameters in `src/qwen-vl-utils`.
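To illustrate where these limits apply, here is a minimal sketch of preparing a video input with qwen-vl-utils under the inference-time settings above. The video path and question are placeholders, and the key names follow the upstream qwen-vl-utils API, which may differ slightly from the vendored copy in `src/qwen-vl-utils`.

```python
# Minimal sketch: apply the inference-time frame settings when preparing a video input.
# Placeholder video path and question; key names follow upstream qwen-vl-utils.
from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "path/to/video.mp4",   # placeholder video path
                "max_pixels": 256 * 28 * 28,    # per-frame resolution cap at inference
                "nframes": 32,                  # frame budget at inference
            },
            {"type": "text", "text": "What physical principle explains this event?"},
        ],
    }
]

# Returns image and video tensors ready to pass to the Qwen2.5-VL processor.
image_inputs, video_inputs = process_vision_info(messages)
```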
- Download the video and JSON data from VKnowU and organize them.
- Run the evaluation on VKnowU:

```bash
bash ./src/eval_vknowu.sh
```

- Calculate the overall accuracy:

```bash
python ./src/eval/calculate_vknowu.py
```
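The accuracy step simply compares predictions against ground-truth answers. The sketch below is not the repo's script; the results file and its field names are assumed for illustration only.

```python
# Not the repo script: a minimal sketch of what the overall-accuracy step computes,
# assuming a results file that stores one dict per question with a predicted option
# and a ground-truth answer (field names are assumptions).
import json


def overall_accuracy(results_path: str) -> float:
    with open(results_path) as f:
        records = json.load(f)  # hypothetical format: a list of result dicts
    correct = sum(r["prediction"] == r["answer"] for r in records)
    return correct / max(len(records), 1)


if __name__ == "__main__":
    print(f"Overall accuracy: {overall_accuracy('vknowu_results.json'):.2%}")
```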
- Download the video data from the official site of each benchmark and organize it as specified by the JSON files in `eval_data`.
- Run the evaluation across other video benchmarks:

```bash
bash ./src/eval_bench.sh
```

- Calculate the overall accuracy:

```bash
python ./src/eval/calculate_bench.py
```
We gratefully acknowledge the contributions of the open-source community, particularly R1-V and VideoRFT.
If you find this work helpful, please consider citing:
coming soon

