Ziyang Wang1,2*, Honglu Zhou1, Shijie Wang1, Junnan Li1, Caiming Xiong1, Silvio Savarese1, Mohit Bansal2, Michael S. Ryoo1, Juan Carlos Niebles1
1 Salesforce AI Research
2 UNC Chapel Hill
* Work done during internship at Salesforce
Active Video Perception (AVP) is an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels.
Key ideas:
- Treat long videos as interactive environments
- Iteratively plan → observe → reflect to seek evidence
- Allocate computation adaptively to informative regions
- Improve grounding, efficiency, and reasoning faithfulness
AVP consistently improves over strong MLLM backbones and prior agentic frameworks across multiple long video understanding benchmarks.
Create and activate a fresh conda environment with the required Python version:
conda create -n avp python=3.10 -y
conda activate avp
conda install -c conda-forge ffmpeg
ffmpeg -version
pip install -r requirements.txt
Before running evaluation, download the videos from the original benchmarks on Hugging Face, update the video paths in the annotation files under avp/eval_anno/, and fill in the Gemini API information in avp/config.example.json.
For API Keys:
Vertex AI (default): Set project and location in config for GCP Vertex AI.
API key (Google AI Studio): Set the GEMINI_API_KEY environment variable (or optional api_key in config).
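For reference, a minimal shell sketch of the two options. The copy-and-edit step for the Vertex AI config is an assumption about the workflow, so check config.example.json for the exact field names, and replace the placeholder key with your own:
# Option 1 (Vertex AI, default): copy the example config and fill in your
# GCP project and location (exact field names live in config.example.json).
cp avp/config.example.json avp/config.json

# Option 2 (Google AI Studio): export your Gemini API key before running.
# The value below is a placeholder.
export GEMINI_API_KEY="your-api-key-here"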
Set these in avp/parrelel_run.sh before running (a sketch with example values follows the list):
- ANNOTATION_FILE – Path to your annotation JSON
- OUTPUT_DIR – Directory where results will be written
- CONFIG_FILE – Path to your config JSON (e.g. config.example.json)
Optional (with defaults):
- LIMIT – Max number of samples (omit for no limit)
- MAX_TURNS – Max plan–execute cycles per sample (default: 3)
- NUM_WORKERS – Number of parallel workers (default: 4)
- TIMEOUT – Timeout per sample in seconds, to guard against hanging API calls (omit for no timeout; recommended)
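A sketch of what these settings might look like at the top of avp/parrelel_run.sh; all paths and the TIMEOUT value below are placeholders, while the optional values mirror the documented defaults:
# Required (paths are placeholders; point them at your own files).
ANNOTATION_FILE="avp/eval_anno/your_benchmark.json"   # annotation JSON
OUTPUT_DIR="results/avp_run"                          # where results are written
CONFIG_FILE="avp/config.example.json"                 # model/API config

# Optional, shown with the documented defaults (LIMIT omitted = no limit).
MAX_TURNS=3        # max plan-execute cycles per sample
NUM_WORKERS=4      # number of parallel workers
TIMEOUT=600        # per-sample timeout in seconds (placeholder value)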
Example Script:
bash avp/parrelel_run.sh
If you find our work useful, please cite:
@misc{wang2025activevideoperceptioniterative,
title={Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding},
author={Ziyang Wang and Honglu Zhou and Shijie Wang and Junnan Li and Caiming Xiong and Silvio Savarese and Mohit Bansal and Michael S. Ryoo and Juan Carlos Niebles},
year={2025},
eprint={2512.05774},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.05774},
}

