
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

Chetwin Low*¹, Weimin Wang*†¹, Calder Katyal²
*Equal contribution, †Project Lead
¹Character AI, ²Yale University

Video Demo

final_ovi_trailer.mp4

🌟 Key Features

Ovi is a Veo-3-like video+audio generation model that generates synchronized video and audio simultaneously from text or text+image inputs.

  • 🎬 Video+Audio Generation: Generate synchronized video and audio content simultaneously
  • πŸ“ Flexible Input: Supports text-only or text+image conditioning
  • ⏱️ 5-second Videos: Generates 5-second videos at 24 FPS with a 720×720 area at various aspect ratios (9:16, 16:9, 1:1, etc.)

πŸ“‹ Todo List

  • Release research paper and microsite for demos
  • Checkpoint of 11B model
  • Inference code
    • Text or text+image as input
    • Gradio application code
    • Multi-GPU inference with or without sequence-parallel support
    • Improve efficiency of the sequence-parallel implementation
    • Implement sharded inference with FSDP
  • Video creation example prompts and format
  • Fine-tuned model with higher resolution
  • Longer video generation
  • Distilled model for faster inference
  • Training scripts

🎨 An Easy Way to Create

We provide example prompts to help you get started with Ovi:

πŸ“ Prompt Format

Our prompts use special tags to control speech and audio:

  • Speech: <S>Your speech content here<E> - Text enclosed in these tags will be converted to speech
  • Audio Description: <AUDCAP>Audio description here<ENDAUDCAP> - Describes the audio or sound effects present in the video
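
For example, here is a minimal Python sketch of assembling a full prompt with these tags; the build_prompt helper is purely illustrative and not part of this repo:

# Illustrative only: compose an Ovi prompt string with speech and audio-caption tags.
def build_prompt(scene: str, speeches: list[str], audio_caption: str) -> str:
    speech_part = " ".join(f"<S>{s}<E>" for s in speeches)
    return f"{scene} {speech_part} <AUDCAP>{audio_caption}<ENDAUDCAP>"

prompt = build_prompt(
    scene="A news anchor sits at a desk in a brightly lit studio.",
    speeches=["Good evening, here are tonight's top stories."],
    audio_caption="Clear female voice, soft studio room tone, faint keyboard clicks.",
)
print(prompt)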

πŸ€– Quick Start with GPT

For easy prompt creation, try this approach:

  1. Take any example from the CSV files above
  2. Tell GPT to modify the speeches enclosed between each pair of <S> and <E> tags, based on a theme such as "humans fighting against AI"
  3. GPT will modify all the speeches according to your requested theme.
  4. Use the modified prompt with Ovi!

Example: The theme "AI is taking over the world" produces speeches like:

  • <S>AI declares: humans obsolete now.<E>
  • <S>Machines rise; humans will fall.<E>
  • <S>We fight back with courage.<E>
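
If you prefer to script this step rather than asking GPT, here is a hedged sketch that swaps out the speech spans with Python's re module; the prompt text and replacement lines are made up for illustration:

import re

# Illustrative only: replace every <S>...<E> span in a prompt with new speech lines.
prompt = ("Two robots face off in a ruined city. <S>Surrender now.<E> <S>Never.<E> "
          "<AUDCAP>Metallic footsteps, distant sirens.<ENDAUDCAP>")
new_speeches = iter(["AI declares: humans obsolete now.", "We fight back with courage."])

updated = re.sub(r"<S>.*?<E>", lambda m: f"<S>{next(new_speeches)}<E>", prompt, flags=re.DOTALL)
print(updated)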

πŸ“¦ Installation

Step-by-Step Installation

# Clone the repository
git clone https://github.com/character-ai/Ovi.git

cd Ovi

# Create and activate virtual environment
virtualenv ovi-env
source ovi-env/bin/activate

# Install PyTorch first
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1

# Install other dependencies
pip install -r requirements.txt

# Install Flash Attention
pip install flash_attn --no-build-isolation

Alternative Flash Attention Installation (Optional)

If the above flash_attn installation fails, you can try building Flash Attention 3 from source instead:

git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/hopper
python setup.py install
cd ../..  # Return to Ovi directory
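
To confirm that Flash Attention is importable from the active environment, here is a quick hedged check (this assumes the package installs under the module name flash_attn; the Flash Attention 3 build from the hopper directory may expose a different module):

# Sanity check that flash-attn is importable in the current virtual environment.
try:
    import flash_attn
    print("flash_attn version:", getattr(flash_attn, "__version__", "unknown"))
except ImportError:
    print("flash_attn is not installed; see the installation steps above.")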

Download Weights

We use open-source checkpoints from Wan and MMAudio, so they need to be downloaded from Hugging Face:

# Weights are downloaded to ./ckpts by default; the inference YAML already points to ./ckpts, so no change is required
python3 download_weights.py

OR

# Optionally, pass --output-dir to download to a custom directory;
# if a custom directory is used, the inference YAML must be updated to point to it
python3 download_weights.py --output-dir <custom_dir>
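
Here is a small hedged sketch to verify that the weights landed where you expect; it only walks the checkpoint directory and prints file sizes, and the exact layout under ./ckpts depends on the downloaded checkpoints:

from pathlib import Path

# List every downloaded file with its size to confirm the download completed.
ckpt_dir = Path("./ckpts")  # change this if you used --output-dir
for f in sorted(ckpt_dir.rglob("*")):
    if f.is_file():
        print(f"{f.relative_to(ckpt_dir)}  {f.stat().st_size / 1e9:.2f} GB")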

πŸš€ Run Examples

βš™οΈ Configure Ovi

Ovi's behavior and output can be customized by modifying the ovi/configs/inference/inference_fusion.yaml configuration file. The following parameters control generation quality, video resolution, and how text, image, and audio inputs are balanced:

# Output and Model Configuration
output_dir: "/path/to/save/your/videos"                    # Directory to save generated videos
ckpt_dir: "/path/to/your/ckpts/dir"                        # Path to model checkpoints

# Generation Quality Settings
num_steps: 50                             # Number of denoising steps. Lower (30-40) = faster generation
solver_name: "unipc"                     # Sampling algorithm for denoising process
shift: 5.0                               # Timestep shift factor for sampling scheduler
seed: 100                                # Random seed for reproducible results

# Guidance Strength Control
audio_guidance_scale: 3.0                # Strength of audio conditioning. Higher = better audio-text sync
video_guidance_scale: 4.0                # Strength of video conditioning. Higher = better video-text adherence
slg_layer: 11                            # Layer for applying SLG (Skip Layer Guidance) technique - feel free to try different layers!

# Multi-GPU and Performance
sp_size: 1                               # Sequence parallelism size. Set equal to number of GPUs used
cpu_offload: False                       # CPU offload; greatly reduces peak GPU VRAM but increases end-to-end runtime by ~20 seconds

# Input Configuration
text_prompt: "/path/to/csv" or "your prompt here"          # Text prompt OR path to CSV/TSV file with prompts
mode: ['i2v', 't2v', 't2i2v']                          # Generation mode: t2v, i2v, or t2i2v; t2i2v uses FLUX Krea to generate a starting image and then runs i2v
video_frame_height_width: [512, 992]    # Video dimensions [height, width] for T2V mode only
each_example_n_times: 1                  # Number of times to generate each prompt

# Quality Control (Negative Prompts)
video_negative_prompt: "jitter, bad hands, blur, distortion"  # Artifacts to avoid in video
audio_negative_prompt: "robotic, muffled, echo, distorted"    # Artifacts to avoid in audio
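
If you want to tweak a few fields without editing the shipped file in place, here is a hedged sketch that loads the YAML, overrides some values, and writes a private copy to pass to inference.py via --config-file; this wrapper is not part of the repo and assumes the config is plain YAML readable by PyYAML:

import yaml  # pip install pyyaml

# Load the shipped config, override a few fields, and save a private copy.
with open("ovi/configs/inference/inference_fusion.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["output_dir"] = "./outputs"
cfg["num_steps"] = 30   # fewer denoising steps for faster, lower-quality previews
cfg["seed"] = 42

with open("my_inference_fusion.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)

# Then run: python3 inference.py --config-file my_inference_fusion.yaml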

🎬 Running Inference

Single GPU (Simple Setup)

python3 inference.py --config-file ovi/configs/inference/inference_fusion.yaml

Use this for single-GPU setups. The text_prompt can be a single string or a path to a CSV file.

Multi-GPU (Parallel Processing)

torchrun --nnodes 1 --nproc_per_node 8 inference.py --config-file ovi/configs/inference/inference_fusion.yaml

Use this to run samples in parallel across multiple GPUs for faster processing.

Memory & Performance Requirements

Below are approximate GPU memory requirements for different configurations. The sequence-parallel implementation will be optimized in the future. All end-to-end times are measured on a 121-frame, 720×720 video using 50 denoising steps. The minimum GPU VRAM required to run our model is 32 GB.

| Sequence Parallel Size | FlashAttention-3 Enabled | CPU Offload | With Image Gen Model | Peak VRAM Required | End-to-End Time |
|---|---|---|---|---|---|
| 1 | Yes | No | No | ~80 GB | ~83s |
| 1 | No | No | No | ~80 GB | ~96s |
| 1 | Yes | Yes | No | ~80 GB | ~105s |
| 1 | No | Yes | No | ~32 GB | ~118s |
| 1 | Yes | Yes | Yes | ~32 GB | ~140s |
| 4 | Yes | No | No | ~80 GB | ~55s |
| 8 | Yes | No | No | ~80 GB | ~40s |

Gradio

We provide a simple script to run our model in a Gradio UI. It uses the ckpt_dir in ovi/configs/inference/inference_fusion.yaml to initialize the model.

python3 gradio_app.py

OR

# Enable CPU offload to save GPU VRAM; slows down end-to-end inference by ~20 seconds
python3 gradio_app.py --cpu_offload

OR

# Enable an additional image generation model to generate first frames for I2V; cpu_offload is turned on automatically when image generation is enabled
python3 gradio_app.py --use_image_gen

πŸ™ Acknowledgements

We would like to thank the following projects:

  • Wan2.2: Our video branch is initialized from the Wan2.2 repository
  • MMAudio: Our audio encoder and decoder components are borrowed from the MMAudio project, which also inspired some of our ideas.

⭐ Citation

If you find Ovi helpful, please ⭐ the repo.

If you find this project useful for your research, please consider citing our paper.

BibTeX

@misc{low2025ovitwinbackbonecrossmodal,
      title={Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation}, 
      author={Chetwin Low and Weimin Wang and Calder Katyal},
      year={2025},
      eprint={2510.01284},
      archivePrefix={arXiv},
      primaryClass={cs.MM},
      url={https://arxiv.org/abs/2510.01284}, 
}
