# **🎙️ AI Voice Cloning & Lip-Sync System using Tortoise TTS & Wav2Lip**

**Project Description:**

This notebook demonstrates how to generate realistic speech from text using Tortoise TTS and sync it to a video using Wav2Lip. As a sample, we are using Angelina Jolie's voice and video clips. This project is intended solely for educational and research purposes.

### **🔊 Step 1: Introduction to Tortoise TTS**
Tortoise TTS is a high-quality, multi-voice text-to-speech system that can produce realistic speech with intonation and emotion. It supports custom voices by conditioning the model on a reference voice sample.

In [None]:
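!pip3 install -U scipy
!git clone https://github.com/neonbjb/tortoise-tts
# Enter the cloned repo before installing. Run from /content, pip cannot find
# requirements.txt and fails with "Could not open requirements file".
%cd tortoise-tts
!pip install -r requirements.txt
!pip install -e .
!pip3 install transformers==4.26.1 einops==0.5.0 rotary_embedding_torch==0.1.5 unidecode==1.3.5
!python3 setup.py install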

Collecting scipy
  Downloading scipy-1.15.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading scipy-1.15.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (37.7 MB)
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.15.2
    Uninstalling scipy-1.15.2:
      Successfully uninstalled scipy-1.15.2
Successfully installed scipy-1.15.3
Cloning into 'tortoise-tts'...
remote: Enumerating objects: 2001, done.
remote: Total 2001 (delta 0), reused 0 (delta 0), pack-reused 2001 (from 1)
Receiving objects: 100% (2001/2001), 54.20 MiB | 38.51 MiB/s, done.
Resolving deltas: 100% (914/914), done.
ERROR: Could not open requirements file: [Errno 2]

In [None]:
%cd /content/tortoise-tts/

/content/tortoise-tts


In [None]:
# Imports used through the rest of the notebook.
import torch
import torchaudio
import torch.nn as nn
import torch.nn.functional as F

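import IPython

# Remove any installed TorToiSe package, presumably so that the imports below
# resolve to the local checkout in /content/tortoise-tts.
!pip uninstall -y TorToiSe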


from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_audio, load_voice, load_voices

# This will download all the models used by Tortoise from the HuggingFace hub.
tts = TextToSpeech()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/2.11k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.26G [00:00<?, ?B/s]

  WeightNorm.apply(module, name, dim)


preprocessor_config.json:   0%|          | 0.00/159 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.61k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

autoregressive.pth:   0%|          | 0.00/1.72G [00:00<?, ?B/s]

diffusion_decoder.pth:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

clvp2.pth:   0%|          | 0.00/976M [00:00<?, ?B/s]

vocoder.pth:   0%|          | 0.00/391M [00:00<?, ?B/s]

With Tortoise TTS set up, we next upload the reference voice sample (Angelina's in our case) that the model will be conditioned on when converting text into natural-sounding speech.

In [None]:
import os
from google.colab import files

CUSTOM_VOICE_NAME = "angelina"

# Store each uploaded clip as tortoise/voices/<name>/<index>.wav.
custom_voice_folder = f"tortoise/voices/{CUSTOM_VOICE_NAME}"
os.makedirs(custom_voice_folder, exist_ok=True)  # exist_ok lets the cell be re-run
for i, file_data in enumerate(files.upload().values()):
    with open(os.path.join(custom_voice_folder, f'{i}.wav'), 'wb') as f:
        f.write(file_data)

Saving AngelinaJolie.wav to AngelinaJolie.wav
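
Before conditioning on an uploaded clip, it can help to confirm its basic properties. The following is a minimal sketch using only Python's standard `wave` module; `describe_wav` is a hypothetical helper, not part of Tortoise, and it only handles uncompressed PCM WAV files.

```python
import wave


def describe_wav(path):
    """Return channel count, sample rate, sample width and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wf.getsampwidth(),
            # Guard against a malformed header reporting a zero sample rate.
            "duration_s": frames / rate if rate else 0.0,
        }
```

Calling `describe_wav("AngelinaJolie.wav")` in the notebook would report the clip's duration and sample rate, which is a quick way to spot an empty or unexpectedly short upload.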


### **📝 Step 2: Convert Text to Audio**
Here, we'll input some text and use Tortoise TTS to generate an audio clip in Angelina's voice. The model uses the uploaded voice sample for conditioning and writes the result as a .wav file at 24 kHz.

In [None]:
from tortoise.utils.text import split_and_recombine_text
from tortoise.utils.audio import load_audio
import torchaudio
import torch
import os
import IPython
from time import time

# Prepare output directory
outpath = "results/longform"
os.makedirs(outpath, exist_ok=True)

input_text = (
    "The ongoing tensions between Pakistan and India are deeply concerning. As someone who has witnessed the consequences of conflict around the world, I urge both nations to prioritize dialogue and diplomacy over hostility."
)

# Save to text.txt
with open("text.txt", "w", encoding="utf-8") as f:
    f.write(input_text)

# Step 2: Load text
with open("text.txt", "r", encoding="utf-8") as f:
    text = f.read().strip()

# Step 3: Split long text if needed
texts = split_and_recombine_text(text)

# Step 4: Load your custom voice reference from AngelinaJolie.wav
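# load_audio resamples the clip to the requested 22,050 Hz and keeps a single channel,
# so the file does not need to be pre-converted.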
ref_audio_path = "AngelinaJolie.wav"  # Ensure it's uploaded in this path
voice_samples = [load_audio(ref_audio_path, 22050)]
conditioning_latents = tts.get_conditioning_latents(voice_samples)

# Step 5: Generate speech, reusing one seed for every chunk
seed = int(time())
all_parts = []

for j, part in enumerate(texts):
    gen = tts.tts_with_preset(part,
                               voice_samples=voice_samples,
                               conditioning_latents=conditioning_latents,
                               preset="fast",
                               k=1,
                               use_deterministic_seed=seed)
    gen = gen.squeeze(0).cpu()
    torchaudio.save(os.path.join(outpath, f'{j}.wav'), gen, 24000)
    all_parts.append(gen)

# Step 6: Combine all parts
full_audio = torch.cat(all_parts, dim=-1)
final_path = os.path.join(outpath, "AngelinaJolie_TTS.wav")
torchaudio.save(final_path, full_audio, 24000)

# Step 7: Play the result
IPython.display.Audio(final_path)

Generating autoregressive samples..


100%|██████████| 6/6 [45:29<00:00, 454.97s/it]


Computing best candidates using CLVP


100%|██████████| 6/6 [00:06<00:00,  1.07s/it]


Transforming autoregressive outputs into audio..


100%|██████████| 80/80 [01:01<00:00,  1.30it/s]
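
The cell above splits the input with `split_and_recombine_text` before synthesizing each part. As a rough illustration of the idea only (this is not Tortoise's implementation), the sketch below greedily packs whole sentences into chunks under a character budget. The helper name `split_into_chunks` and the `max_chars` default are assumptions made for this example.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Splitting on sentence boundaries keeps each synthesized part a complete sentence, so the parts join cleanly when concatenated.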


### **🎥 Step 3: Introduction to Wav2Lip**
Wav2Lip is a deep learning model that generates accurate lip movements in an image or video, synced to any given speech audio. It enables the creation of talking-face videos even when the original audio is unavailable or differs from the target speech.

In [None]:
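# Clone from inside /content/Wav2Lip so the repo lands at /content/Wav2Lip/Wav2Lip, the path the later cells use.
!mkdir -p /content/Wav2Lip && cd /content/Wav2Lip && git clone https://github.com/justinjohn0306/Wav2Lip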

Cloning into 'Wav2Lip'...
remote: Enumerating objects: 534, done.[K
remote: Total 534 (delta 0), reused 0 (delta 0), pack-reused 534 (from 1)[K
Receiving objects: 100% (534/534), 29.78 MiB | 39.25 MiB/s, done.
Resolving deltas: 100% (272/272), done.


In [None]:
%cd /content/Wav2Lip/Wav2Lip/

/content/Wav2Lip/Wav2Lip


In [None]:
!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip.pth' -O 'checkpoints/wav2lip.pth'

--2025-05-11 21:44:36--  https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip.pth
Resolving github.com (github.com)... 140.82.116.4
Connecting to github.com (github.com)|140.82.116.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/615543729/e18ec62e-10ae-4c65-9862-1c7a0fafe228?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20250511%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250511T214423Z&X-Amz-Expires=300&X-Amz-Signature=8a4646dede4add01d18f604b15b5b80cfb2a8673180088d3062a50418665dd7e&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3Dwav2lip.pth&response-content-type=application%2Foctet-stream [following]
--2025-05-11 21:44:36--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/615543729/e18ec62e-10ae-4c65-9862-1c7a0fafe228?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credent

In [None]:
!pip install batch-face==1.5.0

Collecting batch-face==1.5.0
  Using cached batch_face-1.5.0-py3-none-any.whl.metadata (7.5 kB)
Using cached batch_face-1.5.0-py3-none-any.whl (30.6 MB)
Installing collected packages: batch-face
  Attempting uninstall: batch-face
    Found existing installation: batch-face 1.5.1
    Uninstalling batch-face-1.5.1:
      Successfully uninstalled batch-face-1.5.1
Successfully installed batch-face-1.5.0


In [None]:
!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip_gan.pth' -O 'checkpoints/wav2lip_gan.pth'
!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/resnet50.pth' -O 'checkpoints/resnet50.pth'
!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/mobilenet.pth' -O 'checkpoints/mobilenet.pth'
a = !pip install https://raw.githubusercontent.com/AwaleSajil/ghc/master/ghc-1.0-py3-none-any.whl
!pip install git+https://github.com/elliottzheng/batch-face.git@master

!pip install ffmpeg-python mediapipe==0.10.18

--2025-05-11 21:45:02--  https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip_gan.pth
Resolving github.com (github.com)... 140.82.116.3
Connecting to github.com (github.com)|140.82.116.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/615543729/76281b9f-48b8-4cbf-9a05-edf61d847109?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20250511%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250511T214502Z&X-Amz-Expires=300&X-Amz-Signature=589804437bc721c25d1acff4b93820690b965bc189df3f5dee1e347426c101d4&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3Dwav2lip_gan.pth&response-content-type=application%2Foctet-stream [following]
--2025-05-11 21:45:02--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/615543729/76281b9f-48b8-4cbf-9a05-edf61d847109?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz

In [None]:
%cd /content/Wav2Lip/Wav2Lip/

/content/Wav2Lip/Wav2Lip


In [None]:
import torch
torch.cuda.empty_cache()

In this step, we’ll prepare a muted video of Angelina and use Wav2Lip to synchronize it with the audio generated in Step 2.

In [None]:
!python inference.py --checkpoint_path /content/Wav2Lip/Wav2Lip/checkpoints/wav2lip.pth --face /content/Angelina.mp4 --audio /content/tortoise-tts/results/longform/AngelinaJolie_TTS.wav

Using cuda for inference.
Load checkpoint from: /content/Wav2Lip/Wav2Lip/checkpoints/wav2lip.pth
Models loaded
Reading video frames...
Number of frames available for inference: 508
(80, 1259)
Length of mel chunks: 391
  0% 0/4 [00:00<?, ?it/s]face detect time: 5.689358949661255
100% 4/4 [00:08<00:00,  2.06s/it]
wav2lip prediction time: 8.262280941009521
ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enabl
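
The inference log above reports a mel spectrogram of shape (80, 1259) and a count of mel chunks. As a hedged sketch of how a spectrogram can be sliced into one fixed-width window per video frame: the constants below (80 mel frames per second of audio, a 16-frame window) reflect our reading of the upstream Wav2Lip inference script and should be treated as assumptions, and `mel_chunk_starts` is a hypothetical helper rather than part of Wav2Lip.

```python
def mel_chunk_starts(mel_len, fps, mel_frames_per_sec=80.0, window=16):
    """Return the start index of one mel window per video frame.

    mel_len is the number of time steps in the spectrogram and fps is the
    video frame rate. The last window is clamped to end at the final step.
    """
    step = mel_frames_per_sec / fps  # mel time steps advanced per video frame
    starts = []
    i = 0
    while True:
        start = int(i * step)
        if start + window > mel_len:
            # Not enough steps left for a full window: reuse the final one and stop.
            starts.append(max(mel_len - window, 0))
            break
        starts.append(start)
        i += 1
    return starts
```

Under these assumptions the number of windows depends on both the audio length and the video frame rate, which is why the chunk count in the log differs from the number of video frames.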

In [None]:
from google.colab import files
files.download('/content/Wav2Lip/Wav2Lip/results/result_voice.mp4')

In [None]:
!ls -lh /content/Wav2Lip/Wav2Lip/results/result_voice.mp4

-rw-r--r-- 1 root root 880K May 11 21:54 /content/Wav2Lip/Wav2Lip/results/result_voice.mp4


In [None]:
!ffmpeg -y -i /content/Wav2Lip/Wav2Lip/results/result_voice.mp4 \
  -vcodec libx264 -pix_fmt yuv420p -acodec aac -strict experimental \
  /content/fixed_output.mp4


ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enable-libvpx --enab
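
The same re-encode can be driven from Python, which makes the paths easier to parameterize. This is a minimal sketch that only builds the argument list with the flags used in the cell above; `build_reencode_cmd` is a hypothetical helper, and actually running the command assumes `ffmpeg` is on the PATH.

```python
import shlex


def build_reencode_cmd(src, dst):
    """Build an ffmpeg argument list that re-encodes src to H.264 video and AAC audio."""
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-i", src,
        "-vcodec", "libx264",
        "-pix_fmt", "yuv420p",     # pixel format widely supported by players
        "-acodec", "aac",
        "-strict", "experimental",
        dst,
    ]


cmd = build_reencode_cmd(
    "/content/Wav2Lip/Wav2Lip/results/result_voice.mp4",
    "/content/fixed_output.mp4",
)
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a path contains spaces.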

We've combined the generated speech audio with the video using Wav2Lip. The output is a new video in which the lip movements match the speech, creating the illusion that the person is speaking the generated text.



In [None]:
from IPython.display import HTML
from base64 import b64encode

mp4_path = "/content/Wav2Lip/Wav2Lip/results/result_voice.mp4"  # Replace with your actual video path

# Encode video
with open(mp4_path, 'rb') as f:
    mp4 = f.read()
data_url = "data:video/mp4;base64," + b64encode(mp4).decode()

# Display video
HTML(f'''
<video width="540" height="380" controls>
  <source src="{data_url}" type="video/mp4">
</video>
''')
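
Embedding the video as a base64 data URL, as the cell above does, makes the payload roughly a third larger than the file itself, since every 3 bytes become 4 characters. A small sketch for estimating the encoded size before embedding; `base64_length` is a hypothetical helper.

```python
from base64 import b64encode


def base64_length(num_bytes):
    """Number of characters in the padded base64 encoding of num_bytes bytes."""
    # Every group of up to 3 input bytes becomes exactly 4 output characters.
    return (num_bytes + 2) // 3 * 4


# Sanity check against the real encoder on a small payload.
payload = b"\x00" * 1000
assert base64_length(len(payload)) == len(b64encode(payload))
```

For large videos, this estimate can help decide whether to embed the file inline or to download it instead.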
