
Wav2Lip - Installation and Setup Guide

Overview

Wav2Lip is a state-of-the-art lip-sync generation system that accurately synchronizes lip movements with audio in videos. This guide provides comprehensive installation and setup instructions for both the open-source version and the commercial API.

System Requirements

Hardware Requirements

  • GPU: NVIDIA GPU with 4GB+ VRAM (CUDA compatible)
  • RAM: Minimum 8GB, recommended 16GB+
  • Storage: At least 5GB free space for models and dependencies

Software Requirements

  • Python: 3.6-3.8 (3.6 recommended for compatibility)
  • CUDA: 10.1+ (for GPU acceleration)
  • FFmpeg: Essential for video/audio processing
  • Git: For cloning the repository

Installation Options

Option 1: Commercial API (Recommended for Production)

The commercial version offers higher quality and easier setup through Sync.so API.

Step 1: Get API Key

  1. Visit Sync.so Dashboard
  2. Create an account and generate your API key
  3. Note your API key for later use

Step 2: Install SDK

Python SDK:

pip install syncsdk

TypeScript SDK:

npm i @sync.so/sdk

Step 3: Quick Start Example

Python Example:

# quickstart.py
import time
from sync import Sync
from sync.common import Audio, GenerationOptions, Video
from sync.core.api_error import ApiError

# Replace with your API key
api_key = "YOUR_API_KEY_HERE"

# Input URLs (or use local files)
video_url = "https://assets.sync.so/docs/example-video.mp4"
audio_url = "https://assets.sync.so/docs/example-audio.wav"

client = Sync(
    base_url="https://api.sync.so", 
    api_key=api_key
).generations

print("Starting lip sync generation...")

try:
    response = client.create(
        input=[Video(url=video_url), Audio(url=audio_url)],
        model="lipsync-2",
        options=GenerationOptions(sync_mode="cut_off"),
        outputFileName="quickstart"
    )
except ApiError as e:
    print(f'Generation failed: {e.status_code} - {e.body}')
    raise SystemExit(1)

job_id = response.id
print(f"Job submitted: {job_id}")

# Poll for completion
generation = client.get(job_id)
while generation.status not in ['COMPLETED', 'FAILED']:
    print('Polling status...')
    time.sleep(10)
    generation = client.get(job_id)

if generation.status == 'COMPLETED':
    print(f'Success! Output: {generation.output_url}')
else:
    print('Generation failed')

Run the example:

python quickstart.py
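The polling loop in the quickstart runs indefinitely if a job never reaches a terminal state. For anything beyond a quick test, a timeout is worth adding. Below is a minimal, SDK-agnostic sketch: `poll_until` and its parameters are hypothetical helpers, and the `get_status` callable is whatever returns the current status string (e.g. `lambda: client.get(job_id).status`).

```python
import time

def poll_until(get_status, done_states=("COMPLETED", "FAILED"),
               interval=10, timeout=600):
    """Poll get_status() until it returns a terminal state or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status in done_states:
            return status
        time.sleep(interval)
    raise TimeoutError(f"no terminal state within {timeout}s")
```

With the quickstart's client, the polling section becomes `status = poll_until(lambda: client.get(job_id).status)`.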

Option 2: Open-Source Version (Research/Personal Use)

Step 1: Environment Setup

# Clone the repository
git clone https://github.com/Rudrabha/Wav2Lip.git
cd Wav2Lip

# Create conda environment
conda create -n wav2lip python=3.6 -y
conda activate wav2lip

# Install PyTorch (adjust CUDA version as needed)
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 --extra-index-url https://download.pytorch.org/whl/cu101

# Install FFmpeg (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install ffmpeg

# Install Python dependencies
pip install -r requirements.txt

Step 2: Download Required Models

Face Detection Model:

# Download face detection model
mkdir -p face_detection/detection/sfd/
wget -O face_detection/detection/sfd/s3fd.pth https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth

# Alternative link if above fails
wget -O face_detection/detection/sfd/s3fd.pth https://iiitaphyd-my.sharepoint.com/:u:/g/personal/prajwal_k_research_iiit_ac_in/EZsy6qWuivtDnANIG73iHjIBjMSoojcIV0NULXV-yiuiIg?e=qTasa8

Wav2Lip Models:

Choose one of the following:

  1. Wav2Lip (High Accuracy)

    # Download from Google Drive
    # Link: https://drive.google.com/drive/folders/153HLrqlBNxzZcHi17PEvP09kkAfzRshM?usp=share_link
    
    # Manual download and place in checkpoints/
    mkdir -p checkpoints
    # Download wav2lip.pth to checkpoints/
  2. Wav2Lip + GAN (Better Visual Quality)

    # Download from Google Drive
    # Link: https://drive.google.com/file/d/15G3U08c8xsCkOqQxE38Z2XXDnPcOptNk/view?usp=share_link
    
    # Manual download and place in checkpoints/
    mkdir -p checkpoints
    # Download wav2lip_gan.pth to checkpoints/

Step 3: Verify Installation

Create a test script:

# test_wav2lip.py
import os
import torch
import cv2

def test_installation():
    try:
        # Check PyTorch
        print(f"PyTorch version: {torch.__version__}")
        print(f"CUDA available: {torch.cuda.is_available()}")
        
        # Check OpenCV
        print(f"OpenCV version: {cv2.__version__}")
        
        # Check required files
        required_files = [
            'face_detection/detection/sfd/s3fd.pth',
            'checkpoints/wav2lip_gan.pth'  # or wav2lip.pth
        ]
        
        for file in required_files:
            if os.path.exists(file):
                print(f"✓ {file} exists")
            else:
                print(f"✗ {file} missing")
        
        # Check inference script is present (importing inference.py directly
        # would fail, since it parses command-line arguments at import time)
        if os.path.exists('inference.py'):
            print("✓ inference.py found")
        else:
            print("✗ inference.py missing")
        
        print("Installation test completed!")
        
    except Exception as e:
        print(f"Error during test: {e}")

if __name__ == "__main__":
    test_installation()

Run the test:

python test_wav2lip.py

Usage Instructions

Open-Source Version Usage

Basic Inference

python inference.py \
    --checkpoint_path checkpoints/wav2lip_gan.pth \
    --face input_video.mp4 \
    --audio input_audio.wav \
    --outfile result.mp4

Advanced Parameters

python inference.py \
    --checkpoint_path checkpoints/wav2lip_gan.pth \
    --face input_video.mp4 \
    --audio input_audio.wav \
    --outfile result.mp4 \
    --pads 0 20 0 0 \
    --resize_factor 1 \
    --nosmooth \
    --wav2lip_batch_size 128

Parameter Descriptions

Parameter              Description                               Default
---------------------  ----------------------------------------  ------------------------
--checkpoint_path      Path to model checkpoint                  Required
--face                 Input video path                          Required
--audio                Input audio path                          Required
--outfile              Output video path                         results/result_voice.mp4
--pads                 Face padding (top bottom left right)      0 10 0 0
--resize_factor        Downsample factor                         1
--nosmooth             Disable face detection smoothing          False
--wav2lip_batch_size   Batch size for processing                 128
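If you invoke `inference.py` from other Python code, assembling the argument list programmatically avoids shell-quoting mistakes. The helper below is a hypothetical convenience wrapper (not part of the repository); the flag names and defaults are taken from the table above.

```python
def build_inference_cmd(checkpoint, face, audio,
                        outfile="results/result_voice.mp4",
                        pads=(0, 10, 0, 0), resize_factor=1,
                        nosmooth=False, batch_size=128):
    # Assemble the inference.py command line as a list suitable for
    # subprocess.run (no shell quoting needed).
    cmd = [
        "python", "inference.py",
        "--checkpoint_path", checkpoint,
        "--face", face,
        "--audio", audio,
        "--outfile", outfile,
        "--pads", *[str(p) for p in pads],
        "--resize_factor", str(resize_factor),
        "--wav2lip_batch_size", str(batch_size),
    ]
    if nosmooth:
        cmd.append("--nosmooth")
    return cmd
```

Run it with `subprocess.run(build_inference_cmd(...), check=True)` from the repository root.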

Commercial API Usage

Local File Processing

from sync import Sync
from sync.common import Audio, GenerationOptions, Video

client = Sync(api_key="YOUR_API_KEY").generations

# Upload local files
response = client.create(
    input=[
        Video(file_path="local_video.mp4"),
        Audio(file_path="local_audio.wav")
    ],
    model="lipsync-2",
    options=GenerationOptions(sync_mode="cut_off")
)

Batch Processing

import asyncio
from sync import Sync

async def batch_process(video_audio_pairs):
    client = Sync(api_key="YOUR_API_KEY").generations
    jobs = []
    
    for video, audio in video_audio_pairs:
        job = await client.create_async(
            input=[Video(file_path=video), Audio(file_path=audio)],
            model="lipsync-2"
        )
        jobs.append(job)
    
    # Wait for all jobs to complete
    results = await asyncio.gather(*[client.get_async(job.id) for job in jobs])
    return results
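The batch example above submits every job at once; with large batches you may want to cap how many requests are in flight. The following is a generic, SDK-agnostic sketch of that pattern using `asyncio.Semaphore` — `bounded_gather` and `fake_job` are illustrative names, with `fake_job` standing in for a real `client.create_async(...)` call.

```python
import asyncio

async def bounded_gather(coro_factories, limit=4):
    # Run at most `limit` jobs concurrently. Each item is a zero-argument
    # callable that creates a fresh coroutine once a slot is free.
    sem = asyncio.Semaphore(limit)

    async def run(factory):
        async with sem:
            return await factory()

    return await asyncio.gather(*(run(f) for f in coro_factories))

# Stub standing in for a real API submission:
async def fake_job(i):
    await asyncio.sleep(0.01)
    return f"job-{i}"

results = asyncio.run(
    bounded_gather([lambda i=i: fake_job(i) for i in range(10)], limit=3))
```

`asyncio.gather` preserves input order, so `results` lines up with the submitted pairs regardless of completion order.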

Configuration and Optimization

GPU Optimization

# Select which GPU to use
export CUDA_VISIBLE_DEVICES=0

# Use mixed precision for faster processing
# (modify inference script if needed)

Quality Improvement Tips

  1. Face Padding Adjustment

    # Increase bottom padding to include chin
    python inference.py --pads 0 20 0 0 ...
  2. Disable Smoothing for Artifacts

    # Use if you see multiple mouths or artifacts
    python inference.py --nosmooth ...
  3. Resize Factor for Performance

    # Lower resolution for faster processing
    python inference.py --resize_factor 2 ...

Audio Processing

# Convert audio to required format
ffmpeg -i input.mp3 -ar 16000 -ac 1 input.wav

# Extract audio from video
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav
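When preprocessing many files from Python, it is convenient to wrap the ffmpeg invocations above in a small helper. This is an illustrative sketch (`ffmpeg_to_wav16k` is not part of the repository); it only builds the argument list, mirroring the commands shown above, so the conversion itself still requires ffmpeg on your PATH.

```python
import subprocess

def ffmpeg_to_wav16k(src, dst):
    # Build the ffmpeg command that converts any audio file (or extracts
    # the audio track of a video) to 16 kHz mono PCM WAV.
    return ["ffmpeg", "-y", "-i", src, "-vn",
            "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", dst]

# To actually run a conversion:
# subprocess.run(ffmpeg_to_wav16k("input.mp3", "input.wav"), check=True)
```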

Troubleshooting

Common Issues

  1. CUDA Out of Memory

    # Solution: Reduce batch size
    python inference.py --wav2lip_batch_size 64 ...
    
    # Solution: Use CPU for some steps
    # Solution: Reduce input video resolution
  2. Face Detection Issues

    # Solution: Adjust padding
    python inference.py --pads 10 30 10 10 ...
    
    # Solution: Disable smoothing
    python inference.py --nosmooth ...
  3. Audio Sync Problems

    # Solution: Check audio format
    ffmpeg -i audio.wav
    
    # Solution: Ensure correct sample rate
    ffmpeg -i input.mp3 -ar 16000 output.wav
  4. Model Loading Errors

    # Solution: Verify model paths
    ls -la checkpoints/
    ls -la face_detection/detection/sfd/
    
    # Solution: Check file permissions
    chmod 644 checkpoints/*.pth
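A truncated or interrupted download is a frequent cause of checkpoint loading errors, and `ls -la` output is easy to misread. The sketch below (a hypothetical helper, not part of the repository) checks that a checkpoint exists and is plausibly large before you attempt to load it; the 50 MB threshold is an assumption, chosen well below the few-hundred-MB size of the `wav2lip*.pth` files.

```python
import os

def check_checkpoint(path, min_mb=50):
    # Verify a checkpoint file exists and is plausibly complete;
    # a partial download typically fails later inside torch.load().
    if not os.path.isfile(path):
        return f"missing: {path}"
    size_mb = os.path.getsize(path) / 1e6
    if size_mb < min_mb:
        return f"suspiciously small ({size_mb:.1f} MB), re-download: {path}"
    return "ok"
```

For example, `check_checkpoint("checkpoints/wav2lip_gan.pth")` returns `"ok"` only when the file is present and larger than the threshold.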

Performance Optimization

# Optional: if you run a newer PyTorch (2.0+) than the 1.7.1 pinned above,
# torch.compile can speed up inference. Here `model` is the Wav2Lip model
# after the checkpoint has been loaded in the inference script.
import torch
if hasattr(torch, 'compile'):
    model = torch.compile(model)

File Formats and Requirements

Input Video Requirements

  • Formats: MP4, AVI, MOV
  • Codecs: H.264 recommended
  • Resolution: Any resolution (will be resized internally)
  • Frame Rate: 25-30 FPS recommended

Input Audio Requirements

  • Formats: WAV, MP3, M4A
  • Sample Rate: 16kHz recommended
  • Channels: Mono preferred
  • Duration: Any length supported
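The audio recommendations above can be checked programmatically for WAV inputs with Python's standard-library `wave` module. `check_wav` is an illustrative helper (not part of the repository):

```python
import wave

def check_wav(path):
    # Report deviations from the recommended input format:
    # 16 kHz sample rate, mono. An empty list means the file conforms.
    with wave.open(path, "rb") as w:
        rate, channels = w.getframerate(), w.getnchannels()
    issues = []
    if rate != 16000:
        issues.append(f"sample rate {rate} Hz (16000 recommended)")
    if channels != 1:
        issues.append(f"{channels} channels (mono preferred)")
    return issues
```

If `check_wav` reports issues, the ffmpeg conversion command from the Audio Processing section will produce a conforming file.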

Output Specifications

  • Format: MP4
  • Codec: H.264
  • Resolution: Matches input (or resized based on model)
  • Frame Rate: Matches input

Directory Structure

Wav2Lip/
├── checkpoints/              # Model weights
│   ├── wav2lip.pth         # High accuracy model
│   └── wav2lip_gan.pth     # GAN enhanced model
├── face_detection/          # Face detection module
│   └── detection/
│       └── sfd/
│           └── s3fd.pth     # Face detection model
├── models/                 # Model architecture definitions
├── evaluation/             # Evaluation scripts
├── filelists/             # Dataset filelists
├── inference.py           # Main inference script
├── requirements.txt       # Dependencies
└── README.md             # Project documentation

Advanced Setup

Docker Installation

# Dockerfile
FROM pytorch/pytorch:1.7.1-cuda10.1-cudnn7-runtime

RUN apt-get update && apt-get install -y ffmpeg git
RUN git clone https://github.com/Rudrabha/Wav2Lip.git
WORKDIR /Wav2Lip
RUN pip install -r requirements.txt
# Download models here...

CMD ["python", "inference.py", "--help"]

Training Setup (Advanced)

For training on custom datasets:

# Preprocess dataset
python preprocess.py --data_root /path/to/data --preprocessed_root /path/to/preprocessed

# Train expert discriminator
python color_syncnet_train.py --data_root /path/to/preprocessed --checkpoint_dir /path/to/checkpoints

# Train Wav2Lip model
python wav2lip_train.py --data_root /path/to/preprocessed --checkpoint_dir /path/to/checkpoints --syncnet_checkpoint_path /path/to/expert/checkpoint

Next Steps

  1. Test with Examples: Try provided example videos and audio
  2. Quality Tuning: Experiment with parameters for best results
  3. Batch Processing: Set up automated workflows
  4. API Integration: Use commercial API for production applications

Support and Resources

License Information

  • Open Source Version: Research/Personal use only
  • Commercial Version: Full commercial license available
  • Attribution: Cite the original paper if used in research
