
DIRECT-Claw: Dynamic Intent for Retrieval & Editing for Cinematic Transitions

Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing

TL;DR: DIRECT-Claw is a hierarchical multi-agent framework for automated video mashup creation. Given a user instruction, raw video footage, and a background music track, it autonomously edits a visually and rhythmically engaging video mashup (e.g., a movie montage or an anime music video) with cinematic visual continuity and auditory alignment.

arXiv · GitHub Repo · License: MIT

🎬 Showcases

Generated Video Clips (Seamless Visual Transitions🪄)

Showcase 1 Showcase 2 Showcase 3

Full-length Demo Videos (With Music🎵)

MI1_ex.mp4
JK1_ex.mp4
FF1_ex.mp4

🔥 Key Features

  • Hierarchical Multi-Agent Collaboration: A three-tier architecture (Screenwriter, Director, Editor) that bridges high-level editing intents with frame-level editing precision.
  • Seamless Visual Transitions: Explicit modeling and optimization for visual-motion continuity and framing consistency to prevent erratic focal jumps and ensure fluid motion flow.
  • Precise Auditory Alignment: Achieves professional-grade beat-cut synchronization through adaptive editing pace and energy correspondence between musical intensity and visual dynamics.
  • In-depth Multimodal Reasoning: Hierarchical planning for Global Structural Alignment and Local Segment Cohesion with deep multimodal synergy between shot semantics, visual editing styles, and musical progression.
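As a rough illustration of the beat-cut synchronization idea above, the sketch below snaps candidate cut times to their nearest beats. The function name and inputs are assumptions for illustration only; in a real pipeline, beat times would come from a beat tracker rather than being hard-coded.

```python
# Illustrative beat-cut snapping (NOT DIRECT-Claw's actual interface):
# move each candidate cut time to the nearest beat time.
import numpy as np

def snap_cuts_to_beats(cut_times, beat_times):
    """Return each cut time replaced by its nearest beat time (seconds)."""
    cut_times = np.asarray(cut_times, dtype=float)
    beat_times = np.asarray(beat_times, dtype=float)
    # Pairwise |cut - beat| distances; pick the closest beat per cut.
    idx = np.abs(beat_times[None, :] - cut_times[:, None]).argmin(axis=1)
    return beat_times[idx]

beats = [0.0, 0.5, 1.0, 1.5, 2.0]   # hypothetical beat grid at 120 BPM
cuts = [0.42, 1.1, 1.9]             # candidate cut points from the editor
print(snap_cuts_to_beats(cuts, beats).tolist())  # → [0.5, 1.0, 2.0]
```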

⚙️ System Overview

Overview

System overview. The framework operates through three collaborative modules: the Screenwriter anchors the global structure; the Director instantiates segment-level editing guidance; and the Editor executes fine-grained shot retrieval and orchestration.

Planning

Visualization of the hierarchical planning workflow. The Screenwriter leverages multimodal source analysis to generate a section-wise global structural plan (keyword matching), and the Director expands it into segment-level adaptive editing guidance with precise constraints (semantic query, editing heuristic, rhythmic pacing).

Editing Validation

Intent-Guided Shot Sequence Editing. The Editor leverages a tailored beam search algorithm with frame-level dynamic sliding-window trimming to find optimal shot sequences that satisfy both visual continuity and auditory alignment. The validator then detects editing failures caused by overly rigid constraints and prompts query adjustment to ensure sequence coherence.
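The beam search over shot sequences can be sketched roughly as follows. The `Shot` fields, scoring terms, and weights here are illustrative assumptions, not the repository's actual API; frame-level sliding-window trimming and the validator loop are omitted for brevity.

```python
# Toy sketch of intent-guided beam search over candidate shots.
# Balances semantic relevance against motion discontinuity at each cut.
from dataclasses import dataclass

@dataclass(frozen=True)
class Shot:
    sid: str
    motion: float   # scalar stand-in for motion-continuity features
    score: float    # semantic relevance to the Director's query

def beam_search(shots, seq_len, beam_width=3, motion_weight=1.0):
    """Return (best sequence, score) maximizing relevance minus cut penalties."""
    beams = [([], 0.0)]  # each beam: (partial sequence, cumulative score)
    for _ in range(seq_len):
        candidates = []
        for seq, total in beams:
            for shot in shots:
                if shot in seq:      # no shot reuse within one segment
                    continue
                penalty = 0.0
                if seq:              # penalize abrupt motion change at the cut
                    penalty = motion_weight * abs(shot.motion - seq[-1].motion)
                candidates.append((seq + [shot], total + shot.score - penalty))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]   # keep only the top-k partial sequences
    return beams[0]

shots = [Shot("a", 0.2, 1.0), Shot("b", 0.9, 1.2), Shot("c", 0.3, 0.9)]
best_seq, best_score = beam_search(shots, seq_len=2)
print([s.sid for s in best_seq])  # → ['a', 'c']: smoother cut beats raw relevance
```

Note that shot "b" has the highest individual relevance, but its motion mismatch with the others makes the smoother "a" → "c" pair win once the continuity penalty is applied.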

🚀 Quick Start

1. Environment Setup

git clone https://github.com/AK-Dream/DIRECT-Claw.git
cd DIRECT-Claw
conda create -n direct-claw python=3.12
conda activate direct-claw
conda install -c conda-forge ffmpeg
pip install -r requirements.txt

2. External Dependencies

  • U-2-Net: Clone U-2-Net to the root and download weights following their guide.

  • MLLM Backend: Configure your endpoint/local path in configs/agent.yaml. For instance, we deployed Qwen3-VL-8B-Instruct as our MLLM backend. (Note that the interface in src/agent/llm_interface.py might need modification to support video input)

3. Data Preparation

Organize your source data as follows:

data/
├── raw_videos/
│   ├── video_1.mp4
│   ├── video_2.mp4
│   └── ...
├── music_tracks/
│   └── bgm.mp3
└── source_videos.csv

Your source_videos.csv should follow this format:

video_id,filepath
video_1,raw_videos/video_1.mp4
video_2,raw_videos/video_2.mp4

Note: filepath must be a relative path rooted at the data/ directory.
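As an illustrative sanity check (not part of the repository), the snippet below verifies that every `filepath` in `source_videos.csv` resolves to an existing file under `data/`; the helper name is hypothetical.

```python
# Hypothetical helper: confirm each `filepath` in source_videos.csv
# resolves to an existing file relative to the data/ directory.
import csv
from pathlib import Path

def check_manifest(data_dir="data"):
    data = Path(data_dir)
    missing = []
    with open(data / "source_videos.csv", newline="") as f:
        for row in csv.DictReader(f):
            if not (data / row["filepath"]).is_file():
                missing.append(row["video_id"])
    return missing  # video_ids whose files could not be found
```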

4. Running DIRECT-Claw

Step 1: Preprocessing video assets (extract features & segment shots)

python -m src.main_preprocess --csv path/to/your/source_videos.csv

Step 2: Create a task.yaml to configure your generation task:

video_csv: "path/to/your/source_videos.csv"    # Relative to data/
video_fps: 24                                  # Make sure all source videos share the same frame rate!
music_path: "music_tracks/bgm.mp3"             # Relative to data/
user_prompt: "A high-octane movie montage with fast transitions"

Step 3: Start Generation!

python -m src.main_agent --yaml_path path/to/your/task.yaml

💖 Acknowledgments

We would like to express our gratitude to the researchers and developers of the open-source projects that were instrumental in the development of DIRECT-Claw, including U-2-Net.

We also thank the developers of Qwen3-VL and the vLLM framework for providing the high-performance MLLM backend that powers our hierarchical agents.

📝 Citation

@article{li2026direct,
  title={DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing}, 
  author={Li, Ke and Li, Maoliang and Chen, Jialiang and Chen, Jiayu and Zheng, Zihao and Wang, Shaoqi and Chen, Xiang},
  journal={arXiv preprint arXiv:2604.04875},
  year={2026}
}
