Ziyang Wang1,2*, Honglu Zhou1, Shijie Wang1, Junnan Li1, Caiming Xiong1, Silvio Savarese1, Mohit Bansal2, Michael S. Ryoo1, Juan Carlos Niebles1
1 Salesforce AI Research
2 UNC Chapel Hill
* Work done during internship at Salesforce
Active Video Perception (AVP) is an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels.
Key ideas:
- Treat long videos as interactive environments
- Iteratively plan → observe → reflect to seek evidence
- Allocate computation adaptively to informative regions
- Improve grounding, efficiency, and reasoning faithfulness
AVP consistently improves over strong MLLM backbones and prior agentic frameworks across multiple long video understanding benchmarks.
Create and activate a fresh conda environment with the required Python version:
conda create -n avp python=3.10 -y
conda activate avp
conda install -c conda-forge ffmpeg
ffmpeg -version
pip install -r requirements.txt
Before running evaluation, download the videos from the original benchmarks on Hugging Face, update the video paths in the annotation files under avp/eval_anno/, and fill in the Gemini API information in avp/config.example.json.
For API Keys:
Vertex AI (default): Set project and location in config for GCP Vertex AI.
API key (Google AI Studio): Set the GEMINI_API_KEY environment variable (or optional api_key in config).
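For reference, a minimal shell sketch of the two options. The copy-and-edit step for the Vertex AI config is an assumption about the workflow, so check config.example.json for the exact field names, and replace the placeholder key with your own:
# Option 1 (Vertex AI, default): copy the example config and fill in your
# GCP project and location (exact field names live in config.example.json).
cp avp/config.example.json avp/config.json

# Option 2 (Google AI Studio): export your Gemini API key before running.
# The value below is a placeholder.
export GEMINI_API_KEY="your-api-key-here"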
Set these in avp/parrelel_run.sh before running (a sketch with example values follows the list):
- ANNOTATION_FILE – Path to your annotation JSON
- OUTPUT_DIR – Directory where results will be written
- CONFIG_FILE – Path to your config JSON (e.g. config.example.json)
Optional (with defaults):
- LIMIT – Max number of samples (omit for no limit)
- MAX_TURNS – Max plan–execute cycles per sample (default: 3)
- NUM_WORKERS – Number of parallel workers (default: 4)
- TIMEOUT – Timeout per sample in seconds, to guard against hanging API calls (omit for no timeout; recommended)
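A sketch of what these settings might look like at the top of avp/parrelel_run.sh; all paths and the TIMEOUT value below are placeholders, while the optional values mirror the documented defaults:
# Required (paths are placeholders; point them at your own files).
ANNOTATION_FILE="avp/eval_anno/your_benchmark.json"   # annotation JSON
OUTPUT_DIR="results/avp_run"                          # where results are written
CONFIG_FILE="avp/config.example.json"                 # model/API config

# Optional, shown with the documented defaults (LIMIT omitted = no limit).
MAX_TURNS=3        # max plan-execute cycles per sample
NUM_WORKERS=4      # number of parallel workers
TIMEOUT=600        # per-sample timeout in seconds (placeholder value)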
Example Script:
bash avp/parrelel_run.sh
If you find our work useful, please cite:
@misc{wang2025activevideoperceptioniterative,
title={Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding},
author={Ziyang Wang and Honglu Zhou and Shijie Wang and Junnan Li and Caiming Xiong and Silvio Savarese and Mohit Bansal and Michael S. Ryoo and Juan Carlos Niebles},
year={2025},
eprint={2512.05774},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.05774},
}

