mutualforcing.mp4
MutualForcing video demo (1 min 30 s). Please turn on the sound to hear the audio. The video is heavily compressed to fit within GitHub's 10 MB upload limit.
- Fast auto-regressive audio-video joint generation with only 4–8 inference steps
- Supports streaming generation and long-duration audio-video synchronization
- A two-stage training strategy for stable multimodal optimization
- A unified dual-mode self-evolution framework for few-step and multi-step generation
- No need for an extra bidirectional teacher model
- Lower memory cost and more flexible training on long sequences
- Matches or outperforms prior methods that require 50 steps
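
The few-step, streaming, auto-regressive generation described in the list above can be pictured with a toy loop. This is only an illustrative sketch, not the MutualForcing code or API: `stream_generate`, `toy_denoiser`, and the list-of-floats "chunk" are hypothetical stand-ins, and the straight-line update rule is a generic few-step sampler chosen as an assumption for the example.

```python
import random
from typing import Callable, Iterator, List

Chunk = List[float]  # hypothetical stand-in for one chunk of joint audio-video latents


def toy_denoiser(noisy: Chunk, context: List[Chunk], t: float) -> Chunk:
    """Hypothetical stand-in for a learned model: predicts the clean chunk.

    The prediction is a constant: 0.0 for the first chunk, otherwise the mean
    of the previous chunk plus 1.0. No trained weights are involved.
    """
    target = sum(context[-1]) / len(context[-1]) + 1.0 if context else 0.0
    return [target for _ in noisy]


def stream_generate(
    num_chunks: int,
    chunk_size: int,
    num_steps: int = 4,
    denoiser: Callable[[Chunk, List[Chunk], float], Chunk] = toy_denoiser,
    seed: int = 0,
) -> Iterator[Chunk]:
    """Yield chunks one at a time, each refined in `num_steps` denoising steps."""
    rng = random.Random(seed)
    context: List[Chunk] = []  # previously generated chunks (the causal history)
    for _ in range(num_chunks):
        # Every chunk starts from pure Gaussian noise.
        x = [rng.gauss(0.0, 1.0) for _ in range(chunk_size)]
        for step in range(num_steps):
            t = 1.0 - step / num_steps            # current noise level
            t_next = 1.0 - (step + 1) / num_steps  # next noise level (0 at the end)
            x0 = denoiser(x, context, t)           # predicted clean chunk
            # Move along the straight line from x toward x0 (assumed update rule).
            x = [p + (t_next / t) * (v - p) for v, p in zip(x, x0)]
        context.append(x)
        yield x  # the chunk is available before later chunks are computed


if __name__ == "__main__":
    for i, chunk in enumerate(stream_generate(num_chunks=3, chunk_size=4)):
        print(i, chunk)
```

Because each chunk is yielded as soon as its few denoising steps finish, a consumer can start playback while later chunks are still being generated, which is the sense of "streaming" this sketch illustrates.
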
- [🎬 Multi-Domain Generalization Result]
- [🎤 Singing]
- [🎼 BGM Music]
- [🗣️ Multi-Person Speaking]
- [🐾 Animal & 🍽️ Eating]
- [1min Long Video Generation]
- [Open-source TODO]
| sing1.mp4 | sing2.mp4 | play_musical_instrument.mp4 |
| --- | --- | --- |
| Blonde young woman wearing gold earrings and necklace, sits at white piano, singing into microphone; eyes sometimes closed, sometimes looking slightly forward; warm indoor background, mid-close low-angle shot. Clear female voice with soft piano, lyrics: 'said his mind was made up, but we both know that he lies', gentle slightly sad emotion. | Dark-brown long-haired woman in black sheer-sleeve top sings at microphone, eyes slightly to right, dark stage background; static mid-close shot. Clear female voice singing, lyrics: '...found me back home', focus on vocal performance. | Southeast Asian elderly man, bare-chested, plays bamboo flute on porch, cheeks puffed, fingers moving skillfully, body slightly swaying with rhythm; ultra-wide fisheye shot slowly moving right and panning left. Calm and evolving flute solo melody. |
| video_with_bgm1.mp4 | video_with_bgm2.mp4 | video_with_bgm3.mp4 |
| --- | --- | --- |
| Brown long-curled-haired woman in red top, wearing a straw hat, staring at the camera by the seaside, head slightly tilted, hair blown by wind, waves behind; camera slowly moves right and pans left. Background music: female pop singing, lyrics include 'oh why, oh why' and 'feeling drunk and high, so high, so high...', creating a pensive atmosphere. | Short-haired woman in gray fluffy coat walking in a park, looking thoughtful; camera slowly pans left tracking, composition shifts from right-heavy to left-heavy. Background music: slow melancholic cello, creating a calm and sad atmosphere. | Blonde woman in pink short sleeve, hair blown by wind, slowly turns head and upper body from right to front, looking alert; camera slowly moves back, composition shifts from right-heavy to center. Background music: tense suspenseful, with clear wind sounds. |
| multi_person1.mp4 | multi_person2.mp4 | multi_person3.mp4 |
| --- | --- | --- |
| Two men in camouflage in mossy forest; foreground man talks to camera selfie-style holding rifle, background man wearing full-face mask uses binoculars; mostly static shot. Low-voice English dialogue: 'Ah. I think Claire or something on the other side of the river. So if we get up onto this knoll just in front of us, we might...', tense restrained atmosphere. | Glasses-wearing braided schoolgirl and boy in suit uniform sit on beige sofa; girl turns to talk to boy, ornate living room background, static mid-shot bright lighting. Light English conversation: Girl: 'Oh, Brighton, do you have a date for the seventh-grade dance? Yeah' Boy: 'I got a couple of irons in the fire, put out a few feelers.'; youthful relaxed atmosphere. | Bald gray-bearded man knitting, white-haired woman holding folder and pen sitting on brown leather sofa in living room with Christmas tree; static mid-shot, warm indoor light. Calm English dialogue: Male: 'Let me and your mom handle it.' Female: 'I didn't realize we'd come'; quiet homely atmosphere. |
| animal.mp4 | eating.mp4 |
| --- | --- |
| longvid1.mp4 | longvid2.mp4 | longvid3.mp4 |
| --- | --- | --- |
- Project page
- Paper release
- Inference code
- Training code
- Checkpoints
- Data preprocessing pipeline
- Evaluation scripts
- Streaming generation support
- Long-duration generation examples
- Reproducibility instructions
- Hugging Face / demo integration
- Full documentation