# Wav2Lip-HQ inference

This notebook is a tutorial on running the Wav2Lip-HQ model to lip-sync high-quality videos. You can find more details in [our GitHub repository](https://github.com/Markfryazino/wav2lip-hq).

This notebook does not cover training. To fine-tune the super-resolution model on your own videos, please refer to [the other notebook](https://colab.research.google.com/drive/1IUGYn-fMRbjH2IyYoAn5VKSzEkaXyP2s?usp=sharing).

## First, clone the repository and download all required models.

In [1]:
!git clone https://github.com/Markfryazino/wav2lip-hq.git
%cd wav2lip-hq
!pip3 install gdown
!pip3 install -r requirements.txt

!wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "face_detection/detection/sfd/s3fd.pth"

Cloning into 'wav2lip-hq'...
remote: Enumerating objects: 442, done.
remote: Counting objects: 100% (53/53), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 442 (delta 16), reused 12 (delta 12), pack-reused 389
Receiving objects: 100% (442/442), 4.06 MiB | 14.81 MiB/s, done.
Resolving deltas: 100% (123/123), done.
/content/wav2lip-hq
Collecting addict (from -r requirements.txt (line 1))
  Downloading addict-2.4.0-py3-none-any.whl (3.8 kB)
Collecting librosa==0.7.0 (from -r requirements.txt (line 3))
  Downloading librosa-0.7.0.tar.gz (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 23.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting lmdb (from -r requirements.txt (line 4))
  Downloading lmdb-1.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (299 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m299.2/299.2 kB[0m [31m34.9 MB/s

In [2]:
import gdown

urls = {
    "wav2lip_gan.pth": "10Iu05Modfti3pDbxCFPnofmfVlbkvrCm",
    "face_segmentation.pth": "154JgKpzCPW82qINcVieuPH3fZ2e0P812",
    "esrgan_max.pth": "1e5LT83YckB5wFKXWV4cWOPkVRnCDmvwQ"
}

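# Download each checkpoint from Google Drive into the checkpoints/ directory.
for name, file_id in urls.items():  # file_id avoids shadowing the built-in id()
    url = f"https://drive.google.com/uc?id={file_id}"
    output = f"checkpoints/{name}"
    gdown.download(url, output, quiet=False)
    print(f"Loaded {name}")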

Downloading...
From: https://drive.google.com/uc?id=10Iu05Modfti3pDbxCFPnofmfVlbkvrCm
To: /content/wav2lip-hq/checkpoints/wav2lip_gan.pth
100%|██████████| 436M/436M [00:15<00:00, 28.8MB/s]


Loaded wav2lip_gan.pth


Downloading...
From: https://drive.google.com/uc?id=154JgKpzCPW82qINcVieuPH3fZ2e0P812
To: /content/wav2lip-hq/checkpoints/face_segmentation.pth
100%|██████████| 53.3M/53.3M [00:01<00:00, 28.1MB/s]


Loaded face_segmentation.pth


Downloading...
From: https://drive.google.com/uc?id=1e5LT83YckB5wFKXWV4cWOPkVRnCDmvwQ
To: /content/wav2lip-hq/checkpoints/esrgan_max.pth
100%|██████████| 67.0M/67.0M [00:02<00:00, 25.7MB/s]

Loaded esrgan_max.pth
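
Before moving on, it can be worth confirming that all three checkpoints actually landed on disk, since an interrupted download may leave a missing or truncated file. The following is a minimal sketch of such a check. The helper name `missing_checkpoints` and the 1 MB size threshold are assumptions made for illustration; they are not part of the repository.

```python
from pathlib import Path

# File names match the download cell above.
EXPECTED_CHECKPOINTS = ["wav2lip_gan.pth", "face_segmentation.pth", "esrgan_max.pth"]


def missing_checkpoints(directory="checkpoints", names=EXPECTED_CHECKPOINTS,
                        min_bytes=1_000_000):
    """Return the names of checkpoints that are absent or smaller than min_bytes.

    min_bytes is an arbitrary sanity threshold (an assumption), chosen only
    because the real files are tens to hundreds of megabytes.
    """
    root = Path(directory)
    problems = []
    for name in names:
        path = root / name
        if not path.is_file() or path.stat().st_size < min_bytes:
            problems.append(name)
    return problems


# An empty list means every expected file is present and plausibly complete.
print("Missing or truncated checkpoints:", missing_checkpoints())
```

If the list is not empty, re-running the download cell for the listed files is usually enough.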

## Now upload target audio and video.

You can upload the files through the Google Colab interface or download them from Google Drive, which can be faster.
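
The Drive cell below expects bare file IDs rather than full sharing links. As a convenience, here is a small sketch that pulls the ID out of a pasted link. The helper `drive_file_id` is hypothetical, and it only handles two common link shapes (`/file/d/<ID>/...` and `...?id=<ID>`); other formats are not covered.

```python
import re

# Two common Google Drive link shapes (an assumption; other shapes exist).
_DRIVE_ID_PATTERNS = [
    re.compile(r"/file/d/([A-Za-z0-9_-]+)"),
    re.compile(r"[?&]id=([A-Za-z0-9_-]+)"),
]


def drive_file_id(link):
    """Extract the file ID from a Google Drive sharing link."""
    for pattern in _DRIVE_ID_PATTERNS:
        match = pattern.search(link)
        if match:
            return match.group(1)
    raise ValueError(f"No Google Drive file ID found in: {link}")


# Example with the video ID used in this notebook.
link = "https://drive.google.com/file/d/1apIZseM49erefJLL6pxh_y4LpWu5toHm/view?usp=sharing"
print(drive_file_id(link))
```

The returned string can be pasted directly as a value in the `urls` dictionary.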

In [13]:
# If you load files from Drive, run this cell

# Paste your filenames and Google Drive IDs below.
urls = {
    "output10.wav": "1tR2OqP05wt8s4epZDraqTGRrHQ2BXKWr",
    "video.mp4": "1apIZseM49erefJLL6pxh_y4LpWu5toHm",
}

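# Download each file from Google Drive into the videos/ directory.
for name, file_id in urls.items():  # file_id avoids shadowing the built-in id()
    url = f"https://drive.google.com/uc?id={file_id}"
    output = f"videos/{name}"
    gdown.download(url, output, quiet=False)
    print(f"Loaded {name}")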

Downloading...
From: https://drive.google.com/uc?id=1tR2OqP05wt8s4epZDraqTGRrHQ2BXKWr
To: /content/wav2lip-hq/videos/output10.wav
100%|██████████| 3.51M/3.51M [00:00<00:00, 178MB/s]


Loaded output10.wav


Downloading...
From: https://drive.google.com/uc?id=1apIZseM49erefJLL6pxh_y4LpWu5toHm
To: /content/wav2lip-hq/videos/video.mp4
100%|██████████| 3.82M/3.82M [00:00<00:00, 169MB/s]

Loaded video.mp4

## Finally, run the model!

Replace the `--face`, `--audio` and `--outfile` arguments with your own paths. You may also want to change `--sr_path` if you have fine-tuned your own super-resolution model.
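
If you prefer to assemble the command in Python (for example, to process several videos in a loop), the sketch below builds the same argument list that the shell cell in this section passes to `inference.py`. The function names `build_inference_command` and `run_inference` are hypothetical helpers, not part of the repository; the flags and default checkpoint paths are the ones used in this notebook.

```python
import subprocess
import sys
from pathlib import Path


def build_inference_command(face, audio, outfile,
                            checkpoint_path="checkpoints/wav2lip_gan.pth",
                            segmentation_path="checkpoints/face_segmentation.pth",
                            sr_path="checkpoints/esrgan_max.pth"):
    """Return the inference.py invocation as a list of arguments."""
    return [
        sys.executable, "inference.py",
        "--checkpoint_path", str(checkpoint_path),
        "--segmentation_path", str(segmentation_path),
        "--sr_path", str(sr_path),
        "--face", str(face),
        "--audio", str(audio),
        "--outfile", str(outfile),
    ]


def run_inference(face, audio, outfile, **checkpoints):
    """Check that the inputs exist, then run inference from the repository root."""
    for path in (face, audio):
        if not Path(path).is_file():
            raise FileNotFoundError(f"Input file not found: {path}")
    # Create the output directory in case it does not exist yet.
    Path(outfile).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_inference_command(face, audio, outfile, **checkpoints),
                   check=True)


# Show the command that would be executed for the files used in this notebook.
print(" ".join(build_inference_command("videos/video.mp4",
                                       "videos/output10.wav",
                                       "results/finalresult.mp4")))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.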

In [14]:
!pip install librosa==0.8.0



In [15]:
!python inference.py \
        --checkpoint_path "checkpoints/wav2lip_gan.pth" \
        --segmentation_path "checkpoints/face_segmentation.pth" \
        --sr_path "checkpoints/esrgan_max.pth" \
        --face "videos/video.mp4" \
        --audio "videos/output10.wav" \
        --outfile "results/finalresult.mp4"

Using cuda for inference.
(80, 3186)
Length of mel chunks: 1191
  0% 0/10 [00:00<?, ?it/s]Reading video frames from start...
Loading segmentation network...
Loading super resolution model...
Load checkpoint from: checkpoints/wav2lip_gan.pth
Model loaded
Reading video frames from start...
 50% 5/10 [05:03<04:55, 59.10s/it]Reading video frames from start...
100% 10/10 [09:12<00:00, 55.22s/it]
ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --