# Alethea AI – Track C (Problem C) · Colab Notebook

This notebook builds a **config-driven video generation pipeline**:

**Text → Segments → TTS → Loudness Normalization → Wav2Lip (per segment) → Validate/Conform → XFade Stitch**

> Works end-to-end from a single face image. The TTS model, FPS, frame size, and crossfade length are set in `CFG`; the Wav2Lip checkpoint is picked up from `/content/Wav2Lip`.
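
Because every stage reads its settings from `CFG`, a variant run only needs a few keys overridden. The following is a minimal sketch of that idea: `deep_update` is a hypothetical helper, not part of this notebook, and the sample dict mirrors only a few of the keys defined in the configuration cell.

```python
import copy

def deep_update(base: dict, overrides: dict) -> dict:
    """Return a copy of `base` with nested keys from `overrides` applied."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical subset of the notebook's CFG, for illustration only.
base_cfg = {"video": {"fps": 25, "width": 1280, "height": 720, "xfade_seconds": 0.6}}
fast_cfg = deep_update(base_cfg, {"video": {"width": 640, "height": 360}})
print(fast_cfg["video"])
```

Returning a copy rather than mutating in place keeps the default configuration available for comparison between runs.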

In [1]:
#@title 0) Runtime check (GPU recommended)
!nvidia-smi || true
import sys, platform, os, subprocess, json, time, pathlib, textwrap, math, shutil, uuid
from pathlib import Path
print("Python:", sys.version)
print("Platform:", platform.platform())

Wed Aug 13 13:14:58 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   58C    P8             13W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
#@title 1) Install dependencies
# FFMPEG + Python deps + Wav2Lip repo + Coqui TTS
!apt-get -y update -qq
!apt-get -y install -qq ffmpeg git
!pip -q install --upgrade pip
!pip -q install numpy==2.* soundfile librosa pyyaml tqdm moviepy==1.0.3
!pip -q install TTS==0.22.0
!pip -q install gdown
# Clone Wav2Lip
if not Path("Wav2Lip").exists():
  !git clone -q https://github.com/Rudrabha/Wav2Lip.git
# Download pretrained weights (try GAN first, fallback to non-GAN)
%cd /content/Wav2Lip
!gdown -q --id 1RpC3p0B8wAAGo2m9qgZ1s2F4QWLWf2Pz -O wav2lip_gan.pth || true
!gdown -q --id 1G8wQ4o3mI0b6p_5nD1i3m8v3qIvGk5ZU -O wav2lip.pth || true
%cd /content
print("Setup complete.")

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
  Preparing metadata (setup.py) ... done
  DEPRECATION: Building 'gruut' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the 

In [None]:
!pip -q install opencv-python-headless==4.8.1.78 face-alignment==1.3.5


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
albucore 0.0.24 requires opencv-python-headless>=4.9.0.80, but you have opencv-python-headless 4.8.1.78 which is incompatible.
albumentations 2.0.8 requires opencv-python-headless>=4.9.0.80, but you have opencv-python-headless 4.8.1.78 which is incompatible.

In [7]:
# Core wheels that play nice together (NumPy 1.26 + OpenCV + Colab)
!pip -q install "numpy==1.26.4" "scipy>=1.13.0" "opencv-python-headless==4.10.0.84" "pandas==2.2.2"

# PyTorch CUDA 12.1 stack (you already installed, but safe to reassert)
!pip -q install torch==2.3.1+cu121 torchvision==0.18.1+cu121 torchaudio==2.3.1+cu121 -i https://download.pytorch.org/whl/cu121

# Coqui TTS last (so it doesn’t downgrade NumPy)
!pip -q install -U TTS==0.22.0

# Sanity check
import numpy as np, cv2, torch, pandas as pd
print("NumPy", np.__version__, "| OpenCV", cv2.__version__, "| pandas", pd.__version__, "| CUDA?", torch.cuda.is_available())


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tts 0.22.0 requires pandas<2.0,>=1.4, but you have pandas 2.2.2 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 1.5.3 which is incompatible.
dask-cudf-cu12 25.6.0 requires pandas<2.2.4dev0,>=2.0, but you have pandas 1.5.3 which is incompatible.
plotnine 0.14.5 requires pandas>=2.2.0, but you have pandas 1.5.3 which is incompatible.
mizani 0.13.5 requires pandas>=2.2.0, but you have pandas 1.5.3 which is incompatible.
arviz 0.22.0 requires pandas>=2.1.0, but you have pandas 1.5.3 which is incompatible.
cudf-cu12 25.6.0 requires pandas<2.2.4dev0,>=2.0, but you have pandas 1.5.3 which 
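
The conflicts above come from packages pinning incompatible pandas and NumPy ranges. One way to see what actually ended up installed after the runtime restart is to query package metadata. This is a sketch using only the standard library; `report_versions` is an illustrative helper, and the package list is simply taken from the install cells.

```python
from importlib import metadata

def report_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found

# Distribution names taken from the install cells above.
versions = report_versions(["numpy", "pandas", "torch", "TTS", "opencv-python-headless"])
for name, ver in versions.items():
    print(f"{name:24s} {ver or 'not installed'}")
```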

In [None]:
# make sure PyTorch stack is consistent on Colab CUDA 12.1
!pip -q install torch==2.3.1+cu121 torchvision==0.18.1+cu121 torchaudio==2.3.1+cu121 -i https://download.pytorch.org/whl/cu121

# Reinstall TTS after the above, so it doesn't downgrade NumPy
!pip -q install -U TTS==0.22.0

import numpy as np; print("NumPy:", np.__version__)
import os; os.kill(os.getpid(), 9)  # force Colab runtime restart (this is normal)

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 25.6.0 requires pandas<2.2.4dev0,>=2.0, but you have pandas 1.5.3 which is incompatible.

In [1]:

from pathlib import Path

ROOT = Path("/content/")
ROOT.mkdir(parents=True, exist_ok=True)
print("ROOT =", ROOT)

ROOT = /content


In [2]:
# 3) Configuration (edit as needed)
import yaml
CFG = {
    "io": {
        "face_image": str((ROOT / "face.jpg").resolve()),
        "out_dir": str((ROOT / "outputs").resolve()),
        "serve": "none"  # "hls" or "none" (Flask not needed inside Colab)
    },
    "segment": {
        "target_seconds": 12.0,
        "min_overlap": 0.25
    },
    "tts": {
        "engine": "coqui",
        "sample_rate": 22050,
        "voice": "en_default",
        "model_name": "tts_models/en/ljspeech/tacotron2-DDC"  # change if you like
    },
    "video": {
        "fps": 25,
        "width": 1280,
        "height": 720,
        "sar": 1,
        "pix_fmt": "yuv420p",
        "xfade_seconds": 0.6
    },
    "audio": {
        "target_lufs": -16,
        "true_peak": -1.5,
        "lra": 11
    },
    "stitch": {
        "crf": 18,
        "preset": "medium"
    },
    "serve": {
        "hls_time": 4
    }
}
(Path(ROOT) / "config.yaml").write_text(yaml.safe_dump(CFG))
print(yaml.safe_dump(CFG))

audio:
  lra: 11
  target_lufs: -16
  true_peak: -1.5
io:
  face_image: /content/face.jpg
  out_dir: /content/outputs
  serve: none
segment:
  min_overlap: 0.25
  target_seconds: 12.0
serve:
  hls_time: 4
stitch:
  crf: 18
  preset: medium
tts:
  engine: coqui
  model_name: tts_models/en/ljspeech/tacotron2-DDC
  sample_rate: 22050
  voice: en_default
video:
  fps: 25
  height: 720
  pix_fmt: yuv420p
  sar: 1
  width: 1280
  xfade_seconds: 0.6
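
A few of these values constrain each other: the crossfade has to be shorter than a segment, and `yuv420p` needs even frame dimensions. Below is a hedged sketch of a sanity check. `check_cfg` and its rules are assumptions for illustration; the notebook itself does not enforce them.

```python
def check_cfg(cfg: dict) -> list:
    """Return a list of human-readable problems; empty means the config looks sane."""
    problems = []
    video, segment, audio = cfg["video"], cfg["segment"], cfg["audio"]
    if video["xfade_seconds"] >= segment["target_seconds"]:
        problems.append("xfade_seconds must be shorter than target_seconds")
    if video["pix_fmt"] == "yuv420p" and (video["width"] % 2 or video["height"] % 2):
        problems.append("yuv420p needs even width and height")
    if video["fps"] <= 0:
        problems.append("fps must be positive")
    if audio["true_peak"] > 0:
        problems.append("true_peak should be at or below 0 dBTP")
    return problems

# Values copied from the configuration printed above.
cfg = {
    "video": {"fps": 25, "width": 1280, "height": 720, "pix_fmt": "yuv420p", "xfade_seconds": 0.6},
    "segment": {"target_seconds": 12.0},
    "audio": {"target_lufs": -16, "true_peak": -1.5, "lra": 11},
}
print(check_cfg(cfg) or "config OK")
```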



In [3]:
# 4) Download a sample face image (or upload your own to ROOT/face.jpg)
import urllib.request, shutil
target = Path(CFG["io"]["face_image"])
if not target.exists():
    url = "https://images.unsplash.com/photo-1507003211169-0a1dd7228f2d?w=800"  # a neutral portrait
    with urllib.request.urlopen(url) as r, open(target, "wb") as f:
        shutil.copyfileobj(r, f)
print("Face image at:", target, "exists:", target.exists())

Face image at: /content/face.jpg exists: True


In [24]:
# 5) Helpers: ffprobe/ffmpeg wrappers, segmentation, TTS, Wav2Lip, stitch
import subprocess, shlex, re, json, uuid, math
from pathlib import Path

OUT = Path(CFG["io"]["out_dir"]); OUT.mkdir(parents=True, exist_ok=True)
TMP = (OUT / f"tmp_{uuid.uuid4().hex[:8]}"); TMP.mkdir(parents=True, exist_ok=True)

def run(cmd, check=True):
    print(">>", " ".join(cmd) if isinstance(cmd,list) else cmd)
    return subprocess.run(cmd, check=check)

def ffprobe_json(path):
    cmd = ["ffprobe","-v","error","-print_format","json","-show_streams","-show_format",str(path)]
    j = subprocess.check_output(cmd)
    return json.loads(j)

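def dur_s(path):
    # Container duration in seconds; 0.0 if ffprobe fails or prints no number
    try:
        d = float(subprocess.check_output(["ffprobe","-v","error","-show_entries","format=duration","-of","default=nk=1:nw=1",str(path)]).decode().strip())
        return d
    except (subprocess.CalledProcessError, ValueError):
        return 0.0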

def auto_segment(text, target_seconds=12.0):
    # Simple by words with ~160 wpm (~2.67 words/s): chunk ~ target_seconds * 2.67 words
    words = text.strip().split()
    chunk_size = max(12, int(target_seconds * 2.67))
    segs = []
    i = 0
    while i < len(words):
        seg = " ".join(words[i:i+chunk_size])
        segs.append(seg)
        i += chunk_size
    return segs

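def tts_coqui(text_list, sr=22050, model="tts_models/en/ljspeech/tacotron2-DDC"):
    # Note: `sr` is informational only; the output rate is fixed by the model (22050 Hz for LJSpeech).
    import torch
    from TTS.api import TTS
    # Fall back to CPU when no GPU is attached instead of failing on gpu=True
    tts = TTS(model_name=model, progress_bar=False, gpu=torch.cuda.is_available())
    wavs = []
    for i, t in enumerate(text_list):
        out = TMP / f"seg_{i:03d}.wav"
        tts.tts_to_file(text=t, file_path=str(out), speaker=None, language=None)
        wavs.append(out)
    return wavs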

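def loudnorm_batch(wavs, I=-16, LRA=11, TP=-1.5, sr=22050):
    # Single-pass EBU R128 loudness normalization, one file per segment
    outs = []
    for w in wavs:
        o = TMP / (w.stem + "_ln.wav")
        cmd = [
          "ffmpeg","-y","-i",str(w),
          "-filter:a", f"loudnorm=I={I}:LRA={LRA}:TP={TP}:print_format=summary",
          "-ar", str(sr),  # loudnorm upsamples to 192 kHz internally; restore the source rate
          str(o)
        ]
        run(cmd)
        outs.append(o)
    return outs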

def wav2lip_batch(face_img, wavs):
    outs = []
    w2l = Path("/content/Wav2Lip")
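    # The install cell saves weights to the repo root; checkpoints/ is also accepted.
    # Prefer the GAN checkpoint, then fall back to the non-GAN one.
    cands = [w2l/"checkpoints/wav2lip_gan.pth", w2l/"wav2lip_gan.pth",
             w2l/"checkpoints/wav2lip.pth", w2l/"wav2lip.pth"]
    weight = next((c for c in cands if c.exists()), None)
    if weight is None: raise RuntimeError("Wav2Lip weights not found.")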
    for i, w in enumerate(wavs):
        out = TMP / f"seg_{i:03d}.mp4"
        cmd = [
          "python", str(w2l/"inference.py"),
          "--checkpoint_path", str(weight),
          "--face", str(face_img),
          "--audio", str(w),
          "--static","True",
          "--pads", "0","10","0","0",
          "--nosmooth",
          "--outfile", str(out)
        ]
        run(cmd)
        outs.append(out)
    return outs

def validate_conform(mp4, fps=25, w=1280, h=720, sar=1, pix_fmt="yuv420p"):
    j = ffprobe_json(mp4)
    v = [s for s in j["streams"] if s["codec_type"]=="video"][0]
    # fps
    r = v.get("r_frame_rate","25/1").split("/")
    vfps = float(r[0])/float(r[1])
    sar_s = v.get("sample_aspect_ratio","1:1")
    need = (abs(vfps-fps)>0.01) or (sar_s not in ("1:1", f"{sar}:1")) or (int(v["width"])!=w) or (int(v["height"])!=h) or (v.get("pix_fmt")!=pix_fmt)
    if not need:
        return mp4
    out = TMP / (Path(mp4).stem + "_conf.mp4")
    cmd = [
      "ffmpeg","-y","-i",str(mp4),
      "-vf", f"scale={w}:{h},setsar={sar},format={pix_fmt}",
      "-r", str(fps),
      "-c:v","libx264","-preset","medium","-crf","18",
      "-c:a","aac","-ar","44100","-ac","2",
      "-movflags","+faststart", str(out)
    ]
    run(cmd)
    return out

def xfade_stitch(mp4s, xfade_s=0.6, fps=25, crf=18, preset="medium", pix_fmt="yuv420p"):
    # Build dynamic filter_complex with per-pair durations
    n = len(mp4s)
    if n==1:
        final = OUT / "final.mp4"
        run(["ffmpeg","-y","-i",str(mp4s[0]),"-c","copy",str(final)])
        return final

    inputs = []
    for p in mp4s:
        inputs += ["-i", str(p)]
    # Precompute durations
    durs = [dur_s(p) for p in mp4s]

    vf_parts = []
    af_parts = []
    v_labels = []
    a_labels = []

    # normalize streams per input
    for i in range(n):
        vf_parts.append(f"[{i}:v]fps={fps},format={pix_fmt},scale={CFG['video']['width']}:{CFG['video']['height']},setsar={CFG['video']['sar']}[v{i}]")
        af_parts.append(f"[{i}:a]aresample=async=1,aresample=44100[a{i}]")
        v_labels.append(f"v{i}")
        a_labels.append(f"a{i}")

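    v_prev = v_labels[0]
    a_prev = a_labels[0]
    step = []
    acc = durs[0]  # duration of the stitched timeline so far
    for i in range(1, n):
        # xfade offsets are relative to the first input, i.e. the accumulated result,
        # so they must grow with every join (not just durs[i-1] - xfade_s)
        off = max(0.0, acc - xfade_s)
        v_out = f"v{i}o"
        a_out = f"a{i}o"
        step.append(f"[{v_prev}][{v_labels[i]}]xfade=transition=fade:duration={xfade_s}:offset={off:.3f}[{v_out}]")
        step.append(f"[{a_prev}][{a_labels[i]}]acrossfade=d={xfade_s}[{a_out}]")
        v_prev, a_prev = v_out, a_out
        acc += durs[i] - xfade_s  # each join overlaps the clips by xfade_s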

    filter_complex = "; ".join(vf_parts + af_parts + step)
    final = OUT / "final.mp4"
    cmd = ["ffmpeg","-y"] + inputs + [
        "-filter_complex", filter_complex,
        "-map", f"[{v_prev}]", "-map", f"[{a_prev}]",
        "-c:v","libx264","-crf", str(crf), "-preset", preset,
        "-pix_fmt", pix_fmt, "-c:a","aac","-movflags","+faststart",
        str(final)
    ]
    run(cmd)
    return final

def export_hls(mp4, hls_time=4):
    hls_dir = OUT / "hls"; hls_dir.mkdir(exist_ok=True)
    index = hls_dir / "index.m3u8"
    cmd = [
      "ffmpeg","-y","-i",str(mp4),
      "-profile:v","baseline","-level","3.1",  # 1280x720 exceeds the level 3.0 frame-size limit
      "-start_number","0","-hls_time",str(hls_time),"-hls_list_size","0",
      "-f","hls", str(index)
    ]
    run(cmd)
    return index

In [13]:
# 6) Enter your script text
TEXT = "Hello! This is a short demo for Alethea AI Track C. We split the text into segments, synthesize speech, lip-sync using Wav2Lip, stitch with gentle crossfades, and export HLS. You can edit this text to be as long as you want; the pipeline will handle it in chunks."

segs = auto_segment(TEXT, target_seconds=CFG["segment"]["target_seconds"])
print("Segments:", len(segs))
for i, s in enumerate(segs): print(f"{i+1:02d}.", s[:100], "..." if len(s)>100 else "")

Segments: 2
01. Hello! This is a short demo for Alethea AI Track C. We split the text into segments, synthesize spee ...
02. this text to be as long as you want; the pipeline will handle it in chunks. 
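
Note that the second segment above starts mid-sentence ("this text to be ..."), because `auto_segment` cuts purely by word count. Below is a sketch of a sentence-aware alternative under the same ~2.67 words/s assumption. `segment_by_sentence` is a hypothetical replacement, and splitting on `.`, `!`, `?` is a simplification that will mis-handle abbreviations.

```python
import re

def segment_by_sentence(text, target_seconds=12.0, words_per_second=2.67):
    """Greedily pack whole sentences into chunks of about target_seconds."""
    budget = max(12, int(target_seconds * words_per_second))  # words per chunk
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segs, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and count + n > budget:
            segs.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        segs.append(" ".join(current))
    return segs

demo = "Hello! This is a short demo. We split the text into segments. " * 6
for i, seg in enumerate(segment_by_sentence(demo), 1):
    print(f"{i:02d}. ({len(seg.split())} words) {seg[:60]}")
```

A single sentence longer than the budget still becomes its own oversized chunk; splitting those further (for example at commas) is left out of this sketch.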


In [25]:
def wav2lip_batch(face_img, wavs):
    import subprocess, shlex
    from pathlib import Path
    outs = []
    w2l = Path("/content/Wav2Lip")
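    # The install cell saves weights to the repo root; checkpoints/ is also accepted.
    # Prefer the GAN checkpoint, then fall back to the non-GAN one.
    cands = [w2l/"checkpoints/wav2lip_gan.pth", w2l/"wav2lip_gan.pth",
             w2l/"checkpoints/wav2lip.pth", w2l/"wav2lip.pth"]
    weight = next((c for c in cands if c.exists()), None)
    if weight is None: raise RuntimeError("Wav2Lip weights not found.")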

    # # sanity: face & s3fd
    # s3fd = w2l / "face_detection/detection/sfd/s3fd.pth"
    # assert Path(face_img).exists(), f"Face image missing: {face_img}"
    # assert s3fd.exists(), "s3fd.pth missing; download it to face_detection/detection/sfd/"

    for i, w in enumerate(wavs):
        out = TMP / f"seg_{i:03d}.mp4"
        cmd = [
          "python", str(w2l/"inference.py"),
          "--checkpoint_path", str(weight),
          "--face", str(face_img),
          "--audio", str(w),
          "--static","True",                 # <-- REQUIRED for single image
          "--pads", "0","10","0","0",
          "--resize_factor", "2",     # faster on CPU, ok on GPU too
          "--nosmooth",
          "--outfile", str(out)
        ]
        print(">>", " ".join(cmd))
        p = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
        print(p.stdout)               # show Wav2Lip’s own error messages
        if p.returncode != 0:
            raise RuntimeError(f"Wav2Lip failed on {w.name} (see logs above)")
        outs.append(out)
    return outs
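
The commented-out assertions in the cell above hint at a pre-flight check. The sketch below reports everything that is missing before the slow inference loop starts. `preflight` is a hypothetical helper, and the paths follow the ones already used in this notebook (weights in the repo root or `checkpoints/`, the S3FD detector under `face_detection/detection/sfd/`).

```python
from pathlib import Path

def preflight(face_img, w2l_dir="/content/Wav2Lip"):
    """Return a list of missing prerequisites for Wav2Lip inference."""
    w2l = Path(w2l_dir)
    missing = []
    if not Path(face_img).is_file():
        missing.append(f"face image: {face_img}")
    # Accept either checkpoint, in either location.
    weights = [w2l / sub / name
               for name in ("wav2lip_gan.pth", "wav2lip.pth")
               for sub in ("checkpoints", ".")]
    if not any(p.is_file() for p in weights):
        missing.append("Wav2Lip weights (wav2lip_gan.pth or wav2lip.pth)")
    s3fd = w2l / "face_detection" / "detection" / "sfd" / "s3fd.pth"
    if not s3fd.is_file():
        missing.append(f"face detector: {s3fd}")
    return missing

problems = preflight("/content/face.jpg")
print("ready" if not problems else "missing:\n- " + "\n- ".join(problems))
```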


In [30]:
# 7) Run TTS → Loudness → Wav2Lip per segment
wavs = tts_coqui(segs, sr=CFG["tts"]["sample_rate"], model=CFG["tts"]["model_name"])
print("TTS wavs:", wavs)

wavs_ln = loudnorm_batch(wavs, I=CFG["audio"]["target_lufs"], LRA=CFG["audio"]["lra"], TP=CFG["audio"]["true_peak"])
print("Loudnorm wavs:", wavs_ln)


mp4s = wav2lip_batch(CFG["io"]["face_image"], wavs_ln)
print("Wav2Lip segments:", [str(p) for p in mp4s])




 > tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
 > vocoder_models/en/ljspeech/hifigan_v2 is already downloaded.
 > Using model: Tacotron2
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > pitch_fmin:1.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > Model's reduction rate `r` is set to: 1
 > Vocoder Model: hifigan
 > Setting up Audio P
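
Since segment length is only estimated from word count, it can help to measure what the TTS step actually produced. Below is a sketch using the standard-library `wave` module, which reads PCM WAV only; a float-format file would need `soundfile` instead. `wav_seconds` is a hypothetical helper, and the demo file is synthetic rather than real TTS output.

```python
import os
import tempfile
import wave

def wav_seconds(path):
    """Duration of a PCM WAV file in seconds (frames / sample rate)."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()

# Synthetic stand-in for a TTS segment: 1.5 s of 16-bit mono silence at 22050 Hz.
demo = os.path.join(tempfile.mkdtemp(), "seg_000.wav")
with wave.open(demo, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(22050)
    wf.writeframes(b"\x00\x00" * int(22050 * 1.5))

print(f"{os.path.basename(demo)}: {wav_seconds(demo):.2f} s")
```

In the notebook this could be applied as `[wav_seconds(w) for w in wavs]` and compared against `CFG["segment"]["target_seconds"]`.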

In [29]:
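# Wav2Lip's inference.py writes its intermediate files to ./temp (relative to the
# working directory), so create the folder before running inference
import os
os.makedirs("/content/temp", exist_ok=True)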


In [32]:
# 8) Validate/conform each segment, then crossfade-stitch
valid = [validate_conform(m,
                          fps=CFG["video"]["fps"],
                          w=CFG["video"]["width"],
                          h=CFG["video"]["height"],
                          sar=CFG["video"]["sar"],
                          pix_fmt=CFG["video"]["pix_fmt"]) for m in mp4s]

final_mp4 = xfade_stitch(valid,
                         xfade_s=CFG["video"]["xfade_seconds"],
                         fps=CFG["video"]["fps"],
                         crf=CFG["stitch"]["crf"],
                         preset=CFG["stitch"]["preset"],
                         pix_fmt=CFG["video"]["pix_fmt"])
print("Final MP4:", final_mp4)


>> ffmpeg -y -i /content/outputs/tmp_05304231/seg_000.mp4 -vf scale=1280:720,setsar=1,format=yuv420p -r 25 -c:v libx264 -preset medium -crf 18 -c:a aac -ar 44100 -ac 2 -movflags +faststart /content/outputs/tmp_05304231/seg_000_conf.mp4
>> ffmpeg -y -i /content/outputs/tmp_05304231/seg_001.mp4 -vf scale=1280:720,setsar=1,format=yuv420p -r 25 -c:v libx264 -preset medium -crf 18 -c:a aac -ar 44100 -ac 2 -movflags +faststart /content/outputs/tmp_05304231/seg_001_conf.mp4
>> ffmpeg -y -i /content/outputs/tmp_05304231/seg_000_conf.mp4 -i /content/outputs/tmp_05304231/seg_001_conf.mp4 -filter_complex [0:v]fps=25,format=yuv420p,scale=1280:720,setsar=1[v0]; [1:v]fps=25,format=yuv420p,scale=1280:720,setsar=1[v1]; [0:a]aresample=async=1,aresample=44100[a0]; [1:a]aresample=async=1,aresample=44100[a1]; [v0][v1]xfade=transition=fade:duration=0.6:offset=16.457[v1o]; [a0][a1]acrossfade=d=0.6[a1o] -map [v1o] -map [a1o] -c:v libx264 -crf 18 -preset medium -pix_fmt yuv420p -c:a aac -movflags +faststart /
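
Each `xfade` offset is measured on the timeline of its first input, which after the first join is the already-stitched result rather than the previous clip alone. With more than two segments the offsets therefore accumulate. The sketch below shows that arithmetic; `xfade_offsets` is a hypothetical helper and the durations are made up.

```python
def xfade_offsets(durations, xfade_s):
    """Offsets for chained xfade filters and the resulting total duration."""
    offsets = []
    total = durations[0]          # length of the stitched timeline so far
    for d in durations[1:]:
        offsets.append(max(0.0, total - xfade_s))
        total += d - xfade_s      # each join overlaps the clips by xfade_s
    return offsets, total

offsets, total = xfade_offsets([10.0, 8.0, 6.0], xfade_s=0.5)
print(offsets, total)  # [9.5, 17.0] 23.0
```

The final duration is the sum of the clip lengths minus one crossfade per join, which gives a quick check against `dur_s(final_mp4)`.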

In [33]:
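# 9) Preview in Colab
# The browser cannot read a /content/... path directly, so embed the file as a base64 data URL
from base64 import b64encode
from IPython.display import HTML
data = b64encode(Path(final_mp4).read_bytes()).decode()
html = f'''
<video src="data:video/mp4;base64,{data}" controls playsinline style="max-width:100%;height:auto"></video>
'''
HTML(html)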

In [None]:
# Enable HLS export
CFG.setdefault("io", {})
CFG.setdefault("serve", {})
CFG["io"]["serve"] = "hls"          
CFG["serve"]["hls_time"] = 4        # 4–6s is typical segment size

# 10) Export HLS (re-run this cell after changing the settings above)
if CFG["io"]["serve"] == "hls":
    assert 'final_mp4' in globals(), "final_mp4 not found — run the stitch step first."
    m3u8 = export_hls(final_mp4, hls_time=CFG["serve"]["hls_time"])
else:
    print("CFG['io']['serve'] is 'none'; skipping HLS. Set it to 'hls' to export.")
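
After the export, the playlist can be checked without a player. An HLS media playlist carries one `#EXTINF:<seconds>,` line per segment, so the segment count and total duration are easy to recover. Below is a sketch; `playlist_summary` is a hypothetical helper, and the sample playlist text is hand-written, not output from this notebook.

```python
def playlist_summary(m3u8_text):
    """Count segments and sum their #EXTINF durations in an HLS media playlist."""
    durations = []
    for line in m3u8_text.splitlines():
        line = line.strip()
        if line.startswith("#EXTINF:"):
            # Format is "#EXTINF:<duration>,[title]"
            durations.append(float(line[len("#EXTINF:"):].split(",")[0]))
    return len(durations), sum(durations)

sample = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:4
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:4.000000,
index0.ts
#EXTINF:4.000000,
index1.ts
#EXTINF:2.500000,
index2.ts
#EXT-X-ENDLIST
"""
count, seconds = playlist_summary(sample)
print(count, seconds)  # 3 10.5
```

In the notebook, the text would come from `Path(m3u8).read_text()`, and the summed duration should be close to that of `final_mp4`.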


### Notes & Tips

- **Black screen prevention:** we always re-encode during crossfades and force `format=yuv420p`, constant FPS, and `setsar=1` inside the filtergraph.
- **Change model/voice:** update `CFG["tts"]["model_name"]`; Coqui provides many multilingual voices.
- **Long scripts:** the simple word-based segmenter targets ~12s chunks. Adjust `target_seconds` or replace with a punctuation-aware splitter.
- **Weights:** If `gdown` fails to fetch the weights, re-run the install cell (cell 1), or upload them to `ROOT` and move them where `wav2lip_batch` looks: `wav2lip_gan.pth` in `/content/Wav2Lip/checkpoints/` or `wav2lip.pth` in `/content/Wav2Lip/`.
- **Speed-ups:** keep the output at 720p or lower and the FPS at 25. Synthesize all TTS segments first, then run Wav2Lip on them one at a time on a single GPU.
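
The black-screen note comes down to every segment sharing the same stream parameters before `xfade` sees it. Below is a sketch of a consistency check over per-segment properties. `mismatched_fields` is hypothetical, and the dicts stand in for the video stream entries that `ffprobe_json` returns (field names as in ffprobe's JSON output).

```python
def mismatched_fields(streams, fields=("width", "height", "pix_fmt",
                                       "r_frame_rate", "sample_aspect_ratio")):
    """Return {field: set_of_values} for every field that differs across segments."""
    diffs = {}
    for field in fields:
        values = {s.get(field) for s in streams}
        if len(values) > 1:
            diffs[field] = values
    return diffs

# Made-up stream properties: the second segment has a different pixel format.
segments = [
    {"width": 1280, "height": 720, "pix_fmt": "yuv420p",
     "r_frame_rate": "25/1", "sample_aspect_ratio": "1:1"},
    {"width": 1280, "height": 720, "pix_fmt": "yuvj420p",
     "r_frame_rate": "25/1", "sample_aspect_ratio": "1:1"},
]
print(mismatched_fields(segments))
```

An empty result means the segments already agree on these fields; anything else names the property that the conform step or the filtergraph has to normalize.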