TL;DR: DIRECT-Claw is a hierarchical multi-agent framework for automated video mashup creation. Given a user instruction, raw video footage, and a background music track, it autonomously edits a visually and rhythmically engaging video mashup (e.g., a movie montage or anime music video) with cinematic visual continuity and auditory alignment.
Example outputs: `MI1_ex.mp4` | `JK1_ex.mp4` | `FF1_ex.mp4`
- Hierarchical Multi-Agent Collaboration: A three-tier architecture (Screenwriter, Director, Editor) that bridges high-level editing intents with frame-level editing precision.
- Seamless Visual Transitions: Explicit modeling and optimization for visual-motion continuity and framing consistency to prevent erratic focal jumps and ensure fluid motion flow.
- Precise Auditory Alignment: Achieves professional-grade beat-cut synchronization through adaptive editing pace and energy correspondence between musical intensity and visual dynamics.
- In-depth Multimodal Reasoning: Hierarchical planning for Global Structural Alignment and Local Segment Cohesion with deep multimodal synergy between shot semantics, visual editing styles, and musical progression.
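The beat-cut synchronization idea above can be illustrated with a small sketch: given beat timestamps (e.g., from a beat tracker) and candidate cut points, snap each cut to its nearest beat so cuts land on the rhythm. This is an illustrative toy, not the project's actual implementation.

```python
# Toy illustration of beat-cut alignment: snap candidate cut points to the
# nearest musical beat. Hypothetical helper, not the project's actual code.
import bisect

def snap_cuts_to_beats(cuts, beats):
    """Return each cut time moved to the closest beat timestamp."""
    snapped = []
    for t in cuts:
        i = bisect.bisect_left(beats, t)
        # the nearest beat is either the one just before or just after t
        candidates = beats[max(0, i - 1):i + 1]
        snapped.append(min(candidates, key=lambda b: abs(b - t)))
    return snapped

beats = [0.0, 0.5, 1.0, 1.5, 2.0]   # beat grid from a beat tracker
cuts = [0.48, 1.12, 1.9]            # candidate cut points from shot analysis
print(snap_cuts_to_beats(cuts, beats))  # -> [0.5, 1.0, 2.0]
```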
System overview. The framework operates through three collaborative modules: the Screenwriter anchors the global structure; the Director instantiates segment-level editing guidance; and the Editor executes fine-grained shot retrieval and orchestration.
Visualization of the hierarchical planning workflow. The Screenwriter leverages multimodal source analysis to generate a section-wise global structural plan (keywords matching), and the Director expands it into segment-level adaptive editing guidance with precise constraints (semantic query, editing heuristic, rhythmic pacing).
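The segment-level guidance produced by the Director bundles three constraints (semantic query, editing heuristic, rhythmic pacing). A minimal sketch of how such guidance might be represented as a data structure; the field names and pacing unit are assumptions for illustration, not the project's actual schema:

```python
# Sketch of the Director's segment-level editing guidance as a dataclass.
# Field names are illustrative assumptions, not the project's real schema.
from dataclasses import dataclass

@dataclass
class SegmentGuidance:
    semantic_query: str      # what the shots in this segment should depict
    editing_heuristic: str   # e.g., match motion direction across cuts
    cuts_per_beat: float     # rhythmic pacing relative to the beat grid

guidance = SegmentGuidance(
    semantic_query="hero sprinting through a collapsing corridor",
    editing_heuristic="match motion direction across cuts",
    cuts_per_beat=1.0,
)
print(guidance.semantic_query)
```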
Intent-Guided Shot Sequence Editing. The Editor leverages a tailored beam search algorithm with frame-level dynamic sliding-window trimming to find optimal shot sequences that satisfy both visual continuity and auditory alignment. The validator then detects editing failures caused by rigid constraints and prompts query adjustment to ensure sequence coherence.
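The Editor's beam search can be sketched at a high level: keep the top-k partial shot sequences, extend each with candidate shots, and score extensions. The scoring function below is a stand-in assumption; the real system scores visual continuity and beat alignment from extracted features.

```python
# High-level sketch of beam search over shot sequences. The toy scoring
# function is a placeholder, not the project's actual objective.
def beam_search(shots, score_fn, seq_len, beam_width=3):
    beams = [([], 0.0)]  # (partial sequence, cumulative score)
    for _ in range(seq_len):
        candidates = []
        for seq, score in beams:
            for shot in shots:
                if shot in seq:  # use each shot at most once
                    continue
                candidates.append((seq + [shot], score + score_fn(seq, shot)))
        # keep the top-k highest-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

# Toy scoring: prefer shots visually close to the previous one.
features = {"a": 0.1, "b": 0.2, "c": 0.9}
def toy_score(seq, shot):
    if not seq:
        return 0.0
    return -abs(features[shot] - features[seq[-1]])

print(beam_search(list(features), toy_score, seq_len=3))  # -> ['a', 'b', 'c']
```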
```bash
git clone https://github.com/AK-Dream/DIRECT-Claw.git
cd DIRECT-Claw
conda create -n direct-claw python=3.12
conda activate direct-claw
conda install -c conda-forge ffmpeg
pip install -r requirements.txt
```
- U-2-Net: Clone U-2-Net into the repository root and download the weights following their guide.
- MLLM Backend: Configure your endpoint/local path in `configs/agent.yaml`. For instance, we deployed Qwen3-VL-8B-Instruct as our MLLM backend. (Note that the interface in `src/agent/llm_interface.py` might need modification to support video input.)
Organize your source data as follows:
```
data/
├── raw_videos/
│   ├── video_1.mp4
│   ├── video_2.mp4
│   └── ...
├── music_tracks/
│   └── bgm.mp3
└── source_videos.csv
```
Your `source_videos.csv` should follow this format:

```csv
video_id,filepath
video_1,raw_videos/video_1.mp4
video_2,raw_videos/video_2.mp4
```

Note: `filepath` must be a relative path rooted at the `data/` directory.
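A quick stdlib sanity check for the manifest can catch path mistakes before preprocessing. This helper is hypothetical (not part of the repo); it only verifies that each `filepath` is relative and resolves under the data directory.

```python
# Hypothetical sanity check (not shipped with the repo): verify that every
# filepath in source_videos.csv is relative and exists under data/.
import csv
from pathlib import Path

def validate_manifest(csv_path, data_dir="data"):
    problems = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            p = Path(row["filepath"])
            if p.is_absolute():
                problems.append(f"{row['video_id']}: path must be relative")
            elif not (Path(data_dir) / p).exists():
                problems.append(f"{row['video_id']}: missing {p}")
    return problems  # empty list means the manifest is consistent
```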
Step 1: Preprocess video assets (extract features & segment shots):

```bash
python -m src.main_preprocess --csv path/to/your/source_videos.csv
```

Step 2: Create a `task.yaml` to configure your generation task:
```yaml
video_csv: "path/to/your/source_videos.csv"  # Relative to data/
video_fps: 24  # Please make sure all source videos have identical frames-per-second!
music_path: "music_tracks/bgm.mp3"  # Relative to data/
user_prompt: "A high-octane movie montage with fast transitions"
```

Step 3: Start generation!

```bash
python -m src.main_agent --yaml_path path/to/your/task.yaml
```

We would like to express our gratitude to the researchers and developers of the following open-source projects, which were instrumental in the development of DIRECT-Claw:
We also thank the developers of Qwen3-VL and the vLLM framework for providing the high-performance MLLM backend that powers our hierarchical agents.
```bibtex
@article{li2026direct,
  title={DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing},
  author={Li, Ke and Li, Maoliang and Chen, Jialiang and Chen, Jiayu and Zheng, Zihao and Wang, Shaoqi and Chen, Xiang},
  journal={arXiv preprint arXiv:2604.04875},
  year={2026}
}
```