{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  Wav2Lip on CPU - Google Colab (Free Tier)\n",
    "\n",
    "This notebook allows you to run Wav2Lip inference on a CPU, making it compatible with Google Colab's free tier. It includes text-to-speech using gTTS to generate audio from text, which is then used to animate a static face image."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Setup Environment and Install Dependencies\n",
    "\n",
    "This cell will:\n",
    "- Clone the Wav2Lip repository.\n",
    "- Install specific versions of PyTorch (CPU-compatible) and Librosa (for compatibility).\n",
    "- Install other necessary Python packages (gTTS for text-to-speech, opencv-python for image/video processing, etc.).\n",
    "- Download the pre-trained Wav2Lip model checkpoint.\n",
    "- Download a sample avatar image."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!git clone https://github.com/Rudrabha/Wav2Lip.git\n",
    "\n",
    "# Install CPU-compatible PyTorch and other dependencies\n",
    "!pip install torch==1.13.1+cpu torchvision==0.14.1+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html\n",
    "!pip install librosa==0.9.2 numba==0.58.1\n",
    "!pip install gTTS==2.3.2 opencv-python==4.8.0.76 scipy==1.11.4 tqdm==4.66.1\n",
    "\n",
    "# Download Wav2Lip pre-trained model\n",
    "!wget 'https://iiitaphyd-my.sharepoint.com/personal/radrabha_m_research_iiit_ac_in/_layouts/15/download.aspx?share=EAbENTSj11FFp0Q55_iAIVMBcQx28VpVmTuF4h7RnO00rQ' -O '/content/Wav2Lip/checkpoints/wav2lip_gan.pth'\n",
    "\n",
    "# Download a sample avatar image\n",
    "!wget 'https://img.freepik.com/free-photo/young-bearded-man-with-striped-shirt_273609-5677.jpg' -O '/content/avatar.jpg'\n",
    "\n",
    "print(\"Setup Complete! Wav2Lip repository cloned, dependencies installed, model and sample image downloaded.\")"
   ]
  },
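  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A download like the `wget` calls above can finish without a shell error and still save an HTML error page instead of the real file, for instance when a share link has expired. The following is a minimal sketch of a sanity check, not part of Wav2Lip: `looks_like_binary_download` and its `min_bytes` threshold are hypothetical names and values chosen for illustration.\n",
    "\n",
    "```python\n",
    "import os\n",
    "\n",
    "def looks_like_binary_download(path, min_bytes=1_000_000):\n",
    "    # Hypothetical helper: True if the file exists, is at least min_bytes\n",
    "    # long, and does not start like an HTML page.\n",
    "    if not os.path.isfile(path):\n",
    "        return False\n",
    "    if os.path.getsize(path) < min_bytes:\n",
    "        return False\n",
    "    with open(path, 'rb') as f:\n",
    "        head = f.read(64).lstrip().lower()\n",
    "    return not (head.startswith(b'<!doctype') or head.startswith(b'<html'))\n",
    "```\n",
    "\n",
    "If `looks_like_binary_download('/content/Wav2Lip/checkpoints/wav2lip_gan.pth')` returns `False`, the checkpoint probably needs to be downloaded again before running inference."
   ]
  },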
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Import Libraries\n",
    "\n",
    "Import all the Python libraries that will be used throughout the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from gtts import gTTS\n",
    "from IPython.display import Audio, HTML, clear_output\n",
    "from base64 import b64encode\n",
    "import cv2\n",
    "import numpy as np\n",
    "import subprocess\n",
    "import torch\n",
    "import librosa\n",
    "\n",
    "# Clear output after imports for a cleaner notebook\n",
    "# clear_output()\n",
    "\n",
    "print(\"Libraries imported successfully.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Generate Speech from Text using gTTS\n",
    "\n",
    "This cell defines a function to convert your input text into an audio file (`.wav` format) using Google Text-to-Speech (gTTS). \n",
    "The audio will be saved to `/content/generated_tts.wav`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def text_to_speech(text, output_filename=\"/content/generated_tts.wav\"):\n",
    "    tts = gTTS(text=text, lang='en')\n",
    "    tts.save(output_filename)\n",
    "    print(f\"Text converted to speech and saved as {output_filename}\")\n",
    "    return output_filename\n",
    "\n",
    "# --- Test the TTS --- \n",
    "input_text = \"Hello, this is a test of the text to speech system for Wav2Lip.\" # You can change this text\n",
    "audio_file = text_to_speech(input_text)\n",
    "\n",
    "# Display audio player in Colab\n",
    "Audio(audio_file)"
   ]
  },
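  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because gTTS emits MP3-encoded audio, it can be worth confirming that a file with a `.wav` extension really is a RIFF/WAVE file before handing it to Wav2Lip. The sketch below uses only the standard-library `wave` module; `wav_info` is a hypothetical helper name, not part of gTTS or Wav2Lip.\n",
    "\n",
    "```python\n",
    "import wave\n",
    "\n",
    "def wav_info(path):\n",
    "    # Hypothetical helper: return (sample_rate, channels) for a PCM WAV file,\n",
    "    # or None if the file is not a readable RIFF/WAVE container.\n",
    "    try:\n",
    "        with wave.open(path, 'rb') as w:\n",
    "            return w.getframerate(), w.getnchannels()\n",
    "    except (wave.Error, EOFError):\n",
    "        return None\n",
    "```\n",
    "\n",
    "A result of `None` for `wav_info('/content/generated_tts.wav')` would indicate that the file holds something other than WAV data, such as MP3 bytes saved under a `.wav` name."
   ]
  },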
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Patch Wav2Lip for CPU and Librosa 0.9.2 Compatibility\n",
    "\n",
    "The original Wav2Lip code requires some adjustments to run on CPU and with `librosa==0.9.2`.\n",
    "This cell will overwrite the `Wav2Lip/audio.py` file with a version compatible with our setup."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile /content/Wav2Lip/audio.py\n",
    "import librosa\n",
    "import numpy as np\n",
    "from scipy.io import wavfile\n",
    "import scipy.signal as sps\n",
    "\n",
    "hparams = {\n",
    "    'sample_rate': 16000,\n",
    "    'preemphasis': 0.97,\n",
    "    'n_fft': 800,\n",
    "    'hop_length': 200,\n",
    "    'win_length': 800,\n",
    "    'num_mels': 80,\n",
    "    'fmin': 55,\n",
    "    'fmax': 7600,\n",
    "    'ref_db': 20,\n",
    "    'min_level_db': -100,\n",
    "    'rescale': True,\n",
    "    'rescaling_max': 0.999,\n",
    "    # Mel filters are scaled to be energy-preserving\n",
    "    'mel_weight_normalize': True, # Use librosa's Slaney an Tromp normalization\n",
    "}\n",
    "\n",
    "def load_wav(path, sr):\n",
    "    return librosa.load(path, sr=sr)[0]\n",
    "\n",
    "def save_wav(wav, path, sr):\n",
    "    wav *= 32767 / max(0.01, np.max(np.abs(wav)))\n",
    "    #proposed by @dsmiller\n",
    "    wavfile.write(path, sr, wav.astype(np.int16))\n",
    "\n",
    "def preemphasis(wav, k, preemphasize=True):\n",
    "    if preemphasize:\n",
    "        return sps.lfilter([1, -k], [1], wav)\n",
    "    return wav\n",
    "\n",
    "def inv_preemphasis(wav, k, inv_preemphasize=True):\n",
    "    if inv_preemphasize:\n",
    "        return sps.lfilter([1], [1, -k], wav)\n",
    "    return wav\n",
    "\n",
    "def melspectrogram(wav):\n",
    "    D = _stft(preemphasis(wav, hparams['preemphasis']))\n",
    "    S = _amp_to_db(_linear_to_mel(np.abs(D))) - hparams['ref_db']\n",
    "    if hparams['rescale']:\n",
    "        S = _normalize(S)\n",
    "    return S\n",
    "\n",
    "def _stft(y):\n",
    "    return librosa.stft(y=y, n_fft=hparams['n_fft'], hop_length=hparams['hop_length'], win_length=hparams['win_length'])\n",
    "\n",
    "def _linear_to_mel(spectrogram):\n",
    "    _mel_basis = _build_mel_basis()\n",
    "    return np.dot(_mel_basis, spectrogram)\n",
    "\n",
    "def _build_mel_basis():\n",
    "    # Use htk=True for Slaney-style MEL weights\n",
    "    return librosa.filters.mel(sr=hparams['sample_rate'], n_fft=hparams['n_fft'], n_mels=hparams['num_mels'],\n",
    "                               fmin=hparams['fmin'], fmax=hparams['fmax'], htk=True,\n",
    "                               norm='slaney' if hparams['mel_weight_normalize'] else None)\n",
    "\n",
    "def _amp_to_db(x):\n",
    "    min_level = np.exp(hparams['min_level_db'] / 20 * np.log(10))\n",
    "    return 20 * np.log10(np.maximum(min_level, x))\n",
    "\n",
    "def _db_to_amp(x):\n",
    "    return np.power(10.0, (x * 0.05))\n",
    "\n",
    "def _normalize(S):\n",
    "    return np.clip((S - hparams['min_level_db']) / -hparams['min_level_db'], 0, 1)\n",
    "\n",
    "def _denormalize(S):\n",
    "    return (np.clip(S, 0, 1) * -hparams['min_level_db']) + hparams['min_level_db']\n",
    "\n",
    "print(\"Patched Wav2Lip/audio.py created successfully.\")\n",
    "\n",
    "# Further patch: Ensure face_detect command uses python and handles paths correctly for subprocess\n",
    "# The inference script might also need small tweaks for CPU, handled in the next step's command line call.\n",
    "def patch_inference_file():\n",
    "    inference_path = '/content/Wav2Lip/inference.py'\n",
    "    with open(inference_path, 'r') as f:\n",
    "        content = f.read()\n",
    "    \n",
    "    # Ensure device is CPU\n",
    "    content = content.replace('device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")',\n",
    "                              'device = torch.device(\"cpu\")')\n",
    "    content = content.replace('model = model.to(device)', 'model = model.to(torch.device(\"cpu\"))')\n",
    "    \n",
    "    # Make sure subprocess calls for face detection use full python path if necessary and quote paths\n",
    "    # This is more of a safeguard, the main call will be python inference.py ...\n",
    "    content = content.replace(\"subprocess.call([args.face_detection_script,\",\n",
    "                              \"subprocess.call(['python', args.face_detection_script,\")\n",
    "\n",
    "    with open(inference_path, 'w') as f:\n",
    "        f.write(content)\n",
    "    print(f\"Patched {inference_path} for CPU usage and subprocess calls.\")\n",
    "\n",
    "patch_inference_file()\n"
   ]
  },
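  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For orientation, the `sample_rate` and `hop_length` values in `hparams` determine how many mel frames describe each second of audio. The sketch below works through that arithmetic. The function names are hypothetical, and the frame-count formula assumes a centred STFT (`center=True`, librosa's default, which `_stft` does not override).\n",
    "\n",
    "```python\n",
    "SAMPLE_RATE = 16000  # matches hparams['sample_rate']\n",
    "HOP_LENGTH = 200     # matches hparams['hop_length']\n",
    "\n",
    "def mel_frames_per_second(sample_rate=SAMPLE_RATE, hop_length=HOP_LENGTH):\n",
    "    # One mel frame is produced every hop_length samples.\n",
    "    return sample_rate / hop_length\n",
    "\n",
    "def expected_mel_frames(n_samples, hop_length=HOP_LENGTH):\n",
    "    # Assumes a centred STFT, which yields one extra frame at the edge.\n",
    "    return 1 + n_samples // hop_length\n",
    "```\n",
    "\n",
    "With these settings `mel_frames_per_second()` evaluates to 80.0, and a one-second clip of 16000 samples gives `expected_mel_frames(16000) == 81`."
   ]
  },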
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5. Run Wav2Lip Inference\n",
    "\n",
    "Now, we'll run the Wav2Lip inference script. \n",
    "This will take the static image (`/content/avatar.jpg`) and the generated audio (`/content/generated_tts.wav`) to produce a lip-synced video.\n",
    "\n",
    "**Important Notes:**\n",
    "- Ensure the `checkpoint_path` points to the downloaded model.\n",
    "- `face` is the path to your input image.\n",
    "- `audio` is the path to your input audio.\n",
    "- `outfile` is where the result video will be saved.\n",
    "- We add `--device cpu` to explicitly use the CPU. While we patched `inference.py`, this is an additional safeguard.\n",
    "- `--pads` and `--face_det_batch_size` are adjusted for potentially slower CPU processing.\n",
    "- If you see errors related to `ffmpeg`, it might not be installed or found in Colab's default environment. The script usually handles this, but it's a common point of failure if misconfigured."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define file paths\n",
    "face_image_path = \"/content/avatar.jpg\"\n",
    "audio_input_path = \"/content/generated_tts.wav\"\n",
    "output_video_path = \"/content/Wav2Lip/results/result.mp4\" # Save it within the Wav2Lip results folder first\n",
    "final_output_path = \"/content/result.mp4\" # Final path for easy access\n",
    "\n",
    "# Wav2Lip Inference Command\n",
    "# Using !python instead of !cd Wav2Lip && python ... to simplify path management for input/output\n",
    "wav2lip_command = (\n",
    "    f\"python /content/Wav2Lip/inference.py \"\n",
    "    f\"--checkpoint_path /content/Wav2Lip/checkpoints/wav2lip_gan.pth \"\n",
    "    f\"--face {face_image_path} \"\n",
    "    f\"--audio {audio_input_path} \"\n",
    "    f\"--outfile {output_video_path} \"\n",
    "    f\"--device cpu \" # Explicitly set device to CPU\n",
    "    f\"--pads 0 10 0 0 \" # Adjust padding as needed\n",
    "    f\"--face_det_batch_size 4 \" # Lower batch size for CPU\n",
    "    f\"--wav2lip_batch_size 32\" # Adjust based on CPU capability\n",
    ")\n",
    "\n",
    "print(f\"Running Wav2Lip command: {wav2lip_command}\")\n",
    "# subprocess.run(wav2lip_command, shell=True, check=True)\n",
    "\n",
    "try:\n",
    "    process = subprocess.run(wav2lip_command, shell=True, check=True, capture_output=True, text=True)\n",
    "    print(\"Wav2Lip process completed successfully.\")\n",
    "    print(\"STDOUT:\")\n",
    "    print(process.stdout)\n",
    "    if process.stderr:\n",
    "        print(\"STDERR: (should be empty on success)\")\n",
    "        print(process.stderr)\n",
    "except subprocess.CalledProcessError as e:\n",
    "    print(f\"Wav2Lip process failed with exit status {e.returncode}.\")\n",
    "    print(\"STDOUT:\")\n",
    "    print(e.stdout)\n",
    "    print(\"STDERR:\")\n",
    "    print(e.stderr)\n",
    "\n",
    "# Move the result to /content for easier access if needed, and to match the plan's output path\n",
    "if os.path.exists(output_video_path):\n",
    "    os.rename(output_video_path, final_output_path)\n",
    "    print(f\"Output video saved as {final_output_path}\")\n",
    "else:\n",
    "    print(f\"Error: Output video not found at {output_video_path}\")"
   ]
  },
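  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The command above is assembled as a single shell string, so a path containing spaces or shell metacharacters would be split or misinterpreted. One alternative, sketched here under the assumption that only the flags already used in this notebook are needed, is to build an argument list and pass it to `subprocess.run` without `shell=True`. `build_wav2lip_command` is a hypothetical helper, not part of Wav2Lip.\n",
    "\n",
    "```python\n",
    "import shlex\n",
    "\n",
    "def build_wav2lip_command(face, audio, outfile, checkpoint,\n",
    "                          script='/content/Wav2Lip/inference.py',\n",
    "                          pads=(0, 10, 0, 0), face_det_batch_size=4,\n",
    "                          wav2lip_batch_size=32):\n",
    "    # Hypothetical helper: returns an argv list for subprocess.run(cmd).\n",
    "    # Each path stays a single argument, so no shell quoting is needed.\n",
    "    return ['python', script,\n",
    "            '--checkpoint_path', checkpoint,\n",
    "            '--face', face,\n",
    "            '--audio', audio,\n",
    "            '--outfile', outfile,\n",
    "            '--pads', *[str(p) for p in pads],\n",
    "            '--face_det_batch_size', str(face_det_batch_size),\n",
    "            '--wav2lip_batch_size', str(wav2lip_batch_size)]\n",
    "```\n",
    "\n",
    "The list can be passed directly as `subprocess.run(cmd, check=True)`, and `shlex.join(cmd)` gives a correctly quoted string for logging."
   ]
  },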
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6. Display the Output Video\n",
    "\n",
    "This cell will display the generated lip-synced video directly in the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import HTML\n",
    "from base64 import b64encode\n",
    "\n",
    "video_path = \"/content/result.mp4\"\n",
    "\n",
    "if os.path.exists(video_path):\n",
    "    mp4 = open(video_path,'rb').read()\n",
    "    data_url = \"data:video/mp4;base64,\" + b64encode(mp4).decode()\n",
    "    display(HTML(f'''\n",
    "    <video width=400 controls>\n",
    "          <source src=\"{data_url}\" type=\"video/mp4\">\n",
    "    </video>'''))\n",
    "    print(f\"Displaying video from {video_path}\")\n",
    "else:\n",
    "    print(f\"Video file not found at {video_path}. Please ensure the Wav2Lip script ran successfully.\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}