Wav2Lip is a state-of-the-art lip-sync generation system that accurately synchronizes lip movements with audio in videos. This guide provides comprehensive installation and setup instructions for both the open-source version and the commercial API.
- GPU: NVIDIA GPU with 4GB+ VRAM (CUDA compatible)
- RAM: Minimum 8GB, recommended 16GB+
- Storage: At least 5GB free space for models and dependencies
- Python: 3.6-3.8 (3.6 recommended for compatibility)
- CUDA: 10.1+ (for GPU acceleration)
- FFmpeg: Essential for video/audio processing
- Git: For cloning the repository
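You can quickly confirm the command-line prerequisites are on your PATH before installing anything. A minimal sketch (the script name is illustrative):

# check_prereqs.py
import shutil
import sys

# FFmpeg and Git must be reachable on PATH
for tool in ("ffmpeg", "git"):
    print(f"{tool}: {shutil.which(tool) or 'NOT FOUND'}")

# Wav2Lip targets Python 3.6-3.8
print(f"python: {sys.version.split()[0]}")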
The commercial version offers higher quality and easier setup through the Sync.so API.
- Visit Sync.so Dashboard
- Create an account and generate your API key
- Note your API key for later use
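Rather than hard-coding the key into scripts, you can read it from an environment variable; a minimal sketch (the variable name SYNC_API_KEY is an arbitrary choice, not mandated by the SDK):

# Read the API key from the environment (run `export SYNC_API_KEY=...` first)
import os

api_key = os.environ.get("SYNC_API_KEY")
if not api_key:
    raise SystemExit("SYNC_API_KEY is not set")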
Python SDK:
pip install syncsdk

TypeScript SDK:
npm i @sync.so/sdk

Python Example:
# quickstart.py
import sys
import time

from sync import Sync
from sync.common import Audio, GenerationOptions, Video
from sync.core.api_error import ApiError

# Replace with your API key
api_key = "YOUR_API_KEY_HERE"

# Input URLs (or use local files)
video_url = "https://assets.sync.so/docs/example-video.mp4"
audio_url = "https://assets.sync.so/docs/example-audio.wav"

client = Sync(
    base_url="https://api.sync.so",
    api_key=api_key
).generations

print("Starting lip sync generation...")
try:
    response = client.create(
        input=[Video(url=video_url), Audio(url=audio_url)],
        model="lipsync-2",
        options=GenerationOptions(sync_mode="cut_off"),
        outputFileName="quickstart"
    )
except ApiError as e:
    print(f'Generation failed: {e.status_code} - {e.body}')
    sys.exit(1)

job_id = response.id
print(f"Job submitted: {job_id}")

# Poll for completion
generation = client.get(job_id)
while generation.status not in ['COMPLETED', 'FAILED']:
    print('Polling status...')
    time.sleep(10)
    generation = client.get(job_id)

if generation.status == 'COMPLETED':
    print(f'Success! Output: {generation.output_url}')
else:
    print('Generation failed')

Run the example:
python quickstart.py

# Clone the repository
git clone https://github.com/Rudrabha/Wav2Lip.git
cd Wav2Lip
# Create conda environment
conda create -n wav2lip python=3.6 -y
conda activate wav2lip
# Install PyTorch (adjust CUDA version as needed)
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 --extra-index-url https://download.pytorch.org/whl/cu101
# Install FFmpeg (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install ffmpeg
# Install Python dependencies
pip install -r requirements.txt
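Once the dependencies are installed, a quick import check catches broken packages before you download any models. A minimal sketch (module names taken from the install steps above):

# post_install_check.py
import importlib

for mod in ("torch", "torchvision", "cv2", "numpy", "librosa"):
    try:
        importlib.import_module(mod)
        print(f"{mod}: OK")
    except ImportError as err:
        print(f"{mod}: MISSING ({err})")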
Face Detection Model:

# Download face detection model
mkdir -p face_detection/detection/sfd/
wget -O face_detection/detection/sfd/s3fd.pth https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth
# Alternative link if above fails
wget -O face_detection/detection/sfd/s3fd.pth https://iiitaphyd-my.sharepoint.com/:u:/g/personal/prajwal_k_research_iiit_ac_in/EZsy6qWuivtDnANIG73iHjIBjMSoojcIV0NULXV-yiuiIg?e=qTasa8

Wav2Lip Models:
Choose one of the following:

- Wav2Lip (High Accuracy)

  # Download from Google Drive
  # Link: https://drive.google.com/drive/folders/153HLrqlBNxzZcHi17PEvP09kkAfzRshM?usp=share_link
  # Manual download and place in checkpoints/
  mkdir -p checkpoints
  # Download wav2lip.pth to checkpoints/

- Wav2Lip + GAN (Better Visual Quality)

  # Download from Google Drive
  # Link: https://drive.google.com/file/d/15G3U08c8xsCkOqQxE38Z2XXDnPcOptNk/view?usp=share_link
  # Manual download and place in checkpoints/
  mkdir -p checkpoints
  # Download wav2lip_gan.pth to checkpoints/
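With the weights in place, it is worth confirming they actually deserialize, since a truncated download is a common failure mode. A minimal sketch:

# check_checkpoints.py
import torch

for path in ("checkpoints/wav2lip_gan.pth",  # or checkpoints/wav2lip.pth
             "face_detection/detection/sfd/s3fd.pth"):
    ckpt = torch.load(path, map_location="cpu")
    print(f"{path}: loaded ({type(ckpt).__name__}, {len(ckpt)} top-level entries)")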
Create a test script:
# test_wav2lip.py
import os

import cv2
import torch

def test_installation():
    try:
        # Check PyTorch
        print(f"PyTorch version: {torch.__version__}")
        print(f"CUDA available: {torch.cuda.is_available()}")

        # Check OpenCV
        print(f"OpenCV version: {cv2.__version__}")

        # Check required files
        required_files = [
            'face_detection/detection/sfd/s3fd.pth',
            'checkpoints/wav2lip_gan.pth'  # or wav2lip.pth
        ]
        for file in required_files:
            if os.path.exists(file):
                print(f"✓ {file} exists")
            else:
                print(f"✗ {file} missing")

        # Test imports (repo-root modules; run this script from the Wav2Lip directory)
        import audio
        import face_detection
        print("✓ Wav2Lip imports successful")

        print("Installation test completed!")
    except Exception as e:
        print(f"Error during test: {e}")

if __name__ == "__main__":
    test_installation()

Run the test:
python test_wav2lip.py

python inference.py \
--checkpoint_path checkpoints/wav2lip_gan.pth \
--face input_video.mp4 \
--audio input_audio.wav \
--outfile result.mp4

python inference.py \
--checkpoint_path checkpoints/wav2lip_gan.pth \
--face input_video.mp4 \
--audio input_audio.wav \
--outfile result.mp4 \
--pads 0 20 0 0 \
--resize_factor 1 \
--nosmooth \
--wav2lip_batch_size 128

| Parameter | Description | Default |
|---|---|---|
| --checkpoint_path | Path to model checkpoint | Required |
| --face | Input video path | Required |
| --audio | Input audio path | Required |
| --outfile | Output video path | results/result_voice.mp4 |
| --pads | Face padding [top, bottom, left, right] | 0 10 0 0 |
| --resize_factor | Downsample factor | 1 |
| --nosmooth | Disable face detection smoothing | False |
| --wav2lip_batch_size | Batch size for processing | 128 |
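If you drive inference.py from other Python code, a thin subprocess wrapper keeps these parameters in one place. A sketch of a hypothetical helper (not part of the repository):

# run_wav2lip.py
import subprocess

def run_wav2lip(face, audio, outfile="results/result_voice.mp4",
                checkpoint="checkpoints/wav2lip_gan.pth",
                pads=(0, 10, 0, 0), nosmooth=False, batch_size=128):
    """Build the inference.py command line from the table above and run it."""
    cmd = ["python", "inference.py",
           "--checkpoint_path", checkpoint,
           "--face", face,
           "--audio", audio,
           "--outfile", outfile,
           "--pads", *map(str, pads),
           "--wav2lip_batch_size", str(batch_size)]
    if nosmooth:
        cmd.append("--nosmooth")
    subprocess.run(cmd, check=True)

run_wav2lip("input_video.mp4", "input_audio.wav")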
from sync import Sync
from sync.common import Audio, GenerationOptions, Video
client = Sync(api_key="YOUR_API_KEY").generations
# Upload local files
response = client.create(
    input=[
        Video(file_path="local_video.mp4"),
        Audio(file_path="local_audio.wav")
    ],
    model="lipsync-2",
    options=GenerationOptions(sync_mode="cut_off")
)

import asyncio
from sync import Sync
from sync.common import Audio, Video
async def batch_process(video_audio_pairs):
    client = Sync(api_key="YOUR_API_KEY").generations
    jobs = []
    for video, audio in video_audio_pairs:
        job = await client.create_async(
            input=[Video(file_path=video), Audio(file_path=audio)],
            model="lipsync-2"
        )
        jobs.append(job)

    # Fetch the latest status of every job (repeat until all reach COMPLETED)
    results = await asyncio.gather(*[client.get_async(job.id) for job in jobs])
    return results
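The batch helper can then be driven with asyncio.run; a usage sketch (the file paths are placeholders):

# Drive batch_process() defined above
import asyncio

pairs = [("clip1.mp4", "take1.wav"), ("clip2.mp4", "take2.wav")]
results = asyncio.run(batch_process(pairs))
for generation in results:
    print(generation.id, generation.status)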
# Select which GPU is visible to the process
export CUDA_VISIBLE_DEVICES=0
# Use mixed precision for faster processing
# (modify inference script if needed)

- Face Padding Adjustment

  # Increase bottom padding to include chin
  python inference.py --pads 0 20 0 0 ...

- Disable Smoothing for Artifacts

  # Use if you see multiple mouths or artifacts
  python inference.py --nosmooth ...

- Resize Factor for Performance

  # Lower resolution for faster processing
  python inference.py --resize_factor 2 ...
# Convert audio to required format
ffmpeg -i input.mp3 -ar 16000 -ac 1 input.wav
# Extract audio from video
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav
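To confirm a converted file really is 16 kHz mono 16-bit PCM, the standard-library wave module is enough; a minimal sketch:

# verify_audio.py
import wave

with wave.open("input.wav", "rb") as wav:
    print("sample rate:", wav.getframerate())   # expect 16000
    print("channels:", wav.getnchannels())      # expect 1 (mono)
    print("sample width:", wav.getsampwidth())  # expect 2 bytes (pcm_s16le)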
- CUDA Out of Memory

  # Solution: Reduce batch size
  python inference.py --wav2lip_batch_size 64 ...
  # Solution: Use CPU for some steps
  # Solution: Reduce input video resolution

- Face Detection Issues

  # Solution: Adjust padding
  python inference.py --pads 10 30 10 10 ...
  # Solution: Disable smoothing
  python inference.py --nosmooth ...

- Audio Sync Problems

  # Solution: Check audio format
  ffmpeg -i audio.wav
  # Solution: Ensure correct sample rate
  ffmpeg -i input.mp3 -ar 16000 output.wav

- Model Loading Errors

  # Solution: Verify model paths
  ls -la checkpoints/
  ls -la face_detection/detection/sfd/
  # Solution: Check file permissions
  chmod 644 checkpoints/*.pth
# Enable torch.compile for PyTorch 2.0+ (assumes `model` is already loaded)
import torch
if hasattr(torch, 'compile'):
    model = torch.compile(model)

Video Input:
- Formats: MP4, AVI, MOV
- Codecs: H.264 recommended
- Resolution: Any resolution (will be resized internally)
- Frame Rate: 25-30 FPS recommended
Audio Input:
- Formats: WAV, MP3, M4A
- Sample Rate: 16kHz recommended
- Channels: Mono preferred
- Duration: Any length supported
Output:
- Format: MP4
- Codec: H.264
- Resolution: Matches input (or resized based on model)
- Frame Rate: Matches input
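To check an input file against these requirements, ffprobe (installed alongside FFmpeg) can dump stream metadata; a minimal sketch:

# probe_media.py
import json
import subprocess

result = subprocess.run(
    ["ffprobe", "-v", "quiet", "-print_format", "json",
     "-show_streams", "input_video.mp4"],
    capture_output=True, text=True, check=True)

for stream in json.loads(result.stdout)["streams"]:
    print(stream["codec_type"], stream.get("codec_name"), stream.get("r_frame_rate"))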
Wav2Lip/
├── checkpoints/ # Model weights
│ ├── wav2lip.pth # High accuracy model
│ └── wav2lip_gan.pth # GAN enhanced model
├── face_detection/ # Face detection module
│ └── detection/
│ └── sfd/
│ └── s3fd.pth # Face detection model
├── models/ # Model architecture definitions
├── evaluation/ # Evaluation scripts
├── filelists/ # Dataset filelists
├── inference.py # Main inference script
├── requirements.txt # Dependencies
└── README.md # Project documentation
# Dockerfile
FROM pytorch/pytorch:1.7.1-cuda10.1-cudnn7-runtime
RUN apt-get update && apt-get install -y ffmpeg git
RUN git clone https://github.com/Rudrabha/Wav2Lip.git
WORKDIR /Wav2Lip
RUN pip install -r requirements.txt
# Download models here...
CMD ["python", "inference.py", "--help"]For training on custom datasets:
# Preprocess dataset
python preprocess.py --data_root /path/to/data --preprocessed_root /path/to/preprocessed
# Train expert discriminator
python color_syncnet_train.py --data_root /path/to/preprocessed --checkpoint_dir /path/to/checkpoints
# Train Wav2Lip model
python wav2lip_train.py --data_root /path/to/preprocessed --checkpoint_dir /path/to/checkpoints --syncnet_checkpoint_path /path/to/expert/checkpoint

- Test with Examples: Try the provided example videos and audio
- Quality Tuning: Experiment with parameters for best results
- Batch Processing: Set up automated workflows
- API Integration: Use commercial API for production applications
- Commercial Support: contact@sync.so
- Open Source Issues: GitHub Issues
- Documentation: Repository Wiki
- Research Paper: ACM Multimedia 2020
- Open Source Version: Research/Personal use only
- Commercial Version: Full commercial license available
- Attribution: Cite the original paper if used in research