This is the official repository for the paper "OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains".
We provide an automated data synthesis engine, the large-scale instruction-tuning dataset OmniVideo-100K, and the human-verified test set OmniVideo-Test.
Current "video-caption-QA" pipelines often suffer from modality bias, temporal misalignment, and narrative incoherence. To address these, we propose an automated data engine featuring two core mechanisms:
- Entity-Anchored Video Scripting: Transforms raw videos into structured, script-like text (summaries, main entity lists, and segment-wise audio-visual descriptions with timestamps), ensuring cross-segment referential consistency and sound-source associations.
- Clue-Guided QA Generation: Prompts MLLMs to first mine cross-segment and cross-modal clues from the script, and subsequently generate complex QA pairs featuring long-term temporal spans and deep cross-modal dependencies.
Models fine-tuned on OmniVideo-100K achieve consistent performance gains across external audio-visual benchmarks (e.g., Daily-Omni, JointAVBench) while preserving their original capabilities on general video benchmarks like Video-MME.
Our datasets are designed based on a comprehensive cognitive framework covering 10 tasks across three levels:
- Alignment: Fine-Grained Perception (FGP), Scene Transformation Detection (STD).
- Understanding: Context Understanding (CU), Comparison (CP), Sentiment Analysis (SA), Event Sequence Ordering (ESO), Summarization (SM).
- Reasoning: Causal Reasoning (CR), Future Prediction (FP), Hypothetical Reasoning (HR).
<root_path>/
βββ videos/
β βββ ori/ # Raw videos (.mp4)
βββ pre_files/
β βββ final_videos_list.jsonl # Filtered video list with duration & paths
βββ script_files/ # Intermediate generated scripts
βββ script.jsonl # Final consolidated structured scripts
βββ qa_files/ # Generated Open-ended & MCQ QA pairs
- OS: Linux recommended
- Python: 3.12+
Create a conda environment and install dependencies:
conda create -n omnivideo python=3.12 -y
conda activate omnivideo
# Install system dependency for audio/video processing
conda install -c conda-forge ffmpeg -y
# Install Python packages
pip install tqdm aiofiles google-genai
# Install LLaMA-Factory for fine-tuning
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e .[metrics,deepspeed]You can run our pipeline to synthesize structured scripts and QA pairs for your own videos.
cd omni_train_data_pipeline/gen_script
# 0. Separate Audio/Video & Downsample (Compress video to reduce API load)
python 0_seprate_av.py --root_path <root_path> --num_processes 4 --target_fps 1 --target_max_dim 480
# 1. Extract Multimodal Information (requires API keys)
export API_KEY="your_api_key"
export BASEURL_POOL="https://your-api-gateway.example"
python 1_1_main_entities.py --root_path <root_path>
python 1_2_non_speech.py --root_path <root_path>
python 1_3_transcribe.py --root_path <root_path>
# 2. Add Speaker Labels and Global Summary
python 2_1_label_speaker.py --root_path <root_path>
python 2_2_video_summary.py --root_path <root_path>
# 3. Integrate Segments & Generate Visual Descriptions
python 3_inte_seg.py --root_path <root_path> --max_seg_length 15
python 4_seg_visual.py --root_path <root_path> --num_chunks 6
# 4. Final Formatting Check -> Outputs `<root_path>/script.jsonl`
python 5_check_script.py --root_path <root_path>cd ../gen_qa
# Generate Open-ended QA for specific tasks (e.g., causal reasoning)
python generate_qa.py --root_path <root_path> --task causal_reasoning
python parse_qa.py --root_path <root_path>
# Generate Multiple-Choice QA (MCQ)
python generate_mcq_2.py --root_path <root_path> --task comparison
python parse_mcq.py --root_path <root_path>We perform full-parameter fine-tuning on VITA-1.5, Qwen2.5-Omni-7B, and Qwen3-Omni-30B-A3B-Instruct.
| Model | Max Pixels | FPS | Max Frames | Epochs | Batch Size | Learning Rate | Warmup Ratio |
|---|---|---|---|---|---|---|---|
| VITA-1.5 | default | default | default | 1 | 32 | 1e-5 | 0.03 |
| Qwen2.5-Omni-7B | 448 Γ 448 | 1.0 | 180 | 1 | 32 | 1e-5 | 0.1 |
| Qwen3-Omni-30B | 256 Γ 256 | 1.0 | 150 | 1 | 32 | 5e-6 | 0.1 |
Example script for fine-tuning Qwen2.5-Omni-7B using DeepSpeed Zero-3:
cd LLaMA-FactoryTo evaluate your fine-tuned models on OmniVideo-Test, use our evaluation script:
python evaluation.py \
--model_path your_model \
--eval_dataset omnivideo_test \
--video_dir <root_path>/videos/ori \Fine-tuning on OmniVideo-100K yields significant improvements across Alignment, Understanding, and Reasoning tasks.
If you find our data or pipeline useful for your research, please consider citing our paper:
@article{cai2026omnivideo100k,
title={OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains},
author={Cai, Xinyue and Fu, Chaoyou and Zhang, Yi-Fan and He, Ran and Shan, Caifeng},
journal={arXiv preprint arXiv:2606.14702},
year={2026}
}For any questions or feedback, please reach out to:
- Xinyue Cai: yzlmhzz@smail.nju.edu
- Or feel free to open an issue in this repository!



