TL;DR: DIRECT-Claw is a hierarchical multi-agent framework for automated video mashup creation. Given a user instruction, raw video footage, and a background music track, it autonomously edits a visually and rhythmically engaging video mashup (e.g., a movie montage or anime music video) with cinematic visual continuity and auditory alignment.
Example outputs: `MI1_ex.mp4` | `JK1_ex.mp4` | `FF1_ex.mp4`
- Hierarchical Multi-Agent Collaboration: A three-tier architecture (Screenwriter, Director, Editor) that bridges high-level editing intents with frame-level editing precision.
- Seamless Visual Transitions: Explicit modeling and optimization for visual-motion continuity and framing consistency to prevent erratic focal jumps and ensure fluid motion flow.
- Precise Auditory Alignment: Achieves professional-grade beat-cut synchronization through adaptive editing pace and energy correspondence between musical intensity and visual dynamics.
- In-depth Multimodal Reasoning: Hierarchical planning for Global Structural Alignment and Local Segment Cohesion with deep multimodal synergy between shot semantics, visual editing styles, and musical progression.
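The beat-cut synchronization idea above can be illustrated with a small sketch: given beat timestamps (e.g., from a beat tracker) and candidate cut points, snap each cut to its nearest beat so cuts land on the rhythm. This is an illustrative toy, not the project's actual implementation.

```python
# Toy illustration of beat-cut alignment: snap candidate cut points to the
# nearest musical beat. Hypothetical helper, not the project's actual code.
import bisect

def snap_cuts_to_beats(cuts, beats):
    """Return each cut time moved to the closest beat timestamp."""
    snapped = []
    for t in cuts:
        i = bisect.bisect_left(beats, t)
        # the nearest beat is either the one just before or just after t
        candidates = beats[max(0, i - 1):i + 1]
        snapped.append(min(candidates, key=lambda b: abs(b - t)))
    return snapped

beats = [0.0, 0.5, 1.0, 1.5, 2.0]   # beat grid from a beat tracker
cuts = [0.48, 1.12, 1.9]            # candidate cut points from shot analysis
print(snap_cuts_to_beats(cuts, beats))  # -> [0.5, 1.0, 2.0]
```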
System overview. The framework operates through three collaborative modules: the Screenwriter anchors the global structure; the Director instantiates segment-level editing guidance; and the Editor executes fine-grained shot retrieval and orchestration.
Visualization of the hierarchical planning workflow. The Screenwriter leverages multimodal source analysis to generate a section-wise global structural plan (keywords matching), and the Director expands it into segment-level adaptive editing guidance with precise constraints (semantic query, editing heuristic, rhythmic pacing).
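The segment-level guidance produced by the Director bundles three constraints (semantic query, editing heuristic, rhythmic pacing). A minimal sketch of how such guidance might be represented as a data structure; the field names and pacing unit are assumptions for illustration, not the project's actual schema:

```python
# Sketch of the Director's segment-level editing guidance as a dataclass.
# Field names are illustrative assumptions, not the project's real schema.
from dataclasses import dataclass

@dataclass
class SegmentGuidance:
    semantic_query: str      # what the shots in this segment should depict
    editing_heuristic: str   # e.g., match motion direction across cuts
    cuts_per_beat: float     # rhythmic pacing relative to the beat grid

guidance = SegmentGuidance(
    semantic_query="hero sprinting through a collapsing corridor",
    editing_heuristic="match motion direction across cuts",
    cuts_per_beat=1.0,
)
print(guidance.semantic_query)
```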
Intent-Guided Shot Sequence Editing. The Editor leverages a tailored beam search algorithm with frame-level dynamic sliding-window trimming to find optimal shot sequences that satisfy both visual continuity and auditory alignment. The validator then detects editing failures caused by rigid constraints and prompts query adjustment to ensure sequence coherence.
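The Editor's beam search can be sketched at a high level: keep the top-k partial shot sequences, extend each with candidate shots, and score extensions. The scoring function below is a stand-in assumption; the real system scores visual continuity and beat alignment from extracted features.

```python
# High-level sketch of beam search over shot sequences. The toy scoring
# function is a placeholder, not the project's actual objective.
def beam_search(shots, score_fn, seq_len, beam_width=3):
    beams = [([], 0.0)]  # (partial sequence, cumulative score)
    for _ in range(seq_len):
        candidates = []
        for seq, score in beams:
            for shot in shots:
                if shot in seq:  # use each shot at most once
                    continue
                candidates.append((seq + [shot], score + score_fn(seq, shot)))
        # keep the top-k highest-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

# Toy scoring: prefer shots visually close to the previous one.
features = {"a": 0.1, "b": 0.2, "c": 0.9}
def toy_score(seq, shot):
    if not seq:
        return 0.0
    return -abs(features[shot] - features[seq[-1]])

print(beam_search(list(features), toy_score, seq_len=3))  # -> ['a', 'b', 'c']
```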
```bash
git clone https://github.com/AK-Dream/DIRECT-Claw.git
cd DIRECT-Claw
conda create -n direct-claw python=3.12
conda activate direct-claw
conda install -c conda-forge ffmpeg
pip install -r requirements.txt
```
- U-2-Net: Clone U-2-Net into the repository root and download the weights following their guide.
- MLLM Backend: Configure your endpoint/local path in `configs/agent.yaml`. For instance, we deployed Qwen3-VL-8B-Instruct as our MLLM backend. (Note that the interface in `src/agent/llm_interface.py` might need modification to support video input.)
Organize your source data as follows:
```
data/
├── raw_videos/
│   ├── video_1.mp4
│   ├── video_2.mp4
│   └── ...
├── music_tracks/
│   └── bgm.mp3
└── source_videos.csv
```
Your `source_videos.csv` should follow this format:

```csv
video_id,filepath
video_1,raw_videos/video_1.mp4
video_2,raw_videos/video_2.mp4
```

Note: `filepath` must be a relative path rooted at the `data/` directory.
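A quick stdlib sanity check for the manifest can catch path mistakes before preprocessing. This helper is hypothetical (not part of the repo); it only verifies that each `filepath` is relative and resolves under the data directory.

```python
# Hypothetical sanity check (not shipped with the repo): verify that every
# filepath in source_videos.csv is relative and exists under data/.
import csv
from pathlib import Path

def validate_manifest(csv_path, data_dir="data"):
    problems = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            p = Path(row["filepath"])
            if p.is_absolute():
                problems.append(f"{row['video_id']}: path must be relative")
            elif not (Path(data_dir) / p).exists():
                problems.append(f"{row['video_id']}: missing {p}")
    return problems  # empty list means the manifest is consistent
```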
Step 1: Preprocess video assets (extract features & segment shots):

```bash
python -m src.main_preprocess --csv path/to/your/source_videos.csv
```

Step 2: Create a `task.yaml` to configure your generation task:
```yaml
video_csv: "path/to/your/source_videos.csv"  # Relative to data/
video_fps: 24  # Please make sure all source videos have identical frames-per-second!
music_path: "music_tracks/bgm.mp3"  # Relative to data/
user_prompt: "A high-octane movie montage with fast transitions"
```

Step 3: Start generation!

```bash
python -m src.main_agent --yaml_path path/to/your/task.yaml
```

We would like to express our gratitude to the researchers and developers of the following open-source projects, which were instrumental in the development of DIRECT-Claw:
We also thank the developers of Qwen3-VL and the vLLM framework for providing the high-performance MLLM backend that powers our hierarchical agents.
```bibtex
@article{li2026direct,
  title={DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing},
  author={Li, Ke and Li, Maoliang and Chen, Jialiang and Chen, Jiayu and Zheng, Zihao and Wang, Shaoqi and Chen, Xiang},
  journal={arXiv preprint arXiv:2604.04875},
  year={2026}
}
```