# End-to-End Solution

This notebook assumes that a GPU environment is available.
It is only a Jupyter demo, but CUDA should be enabled.

If you use a free Jupyter notebook environment, pick a T4 GPU runtime. You can even [open a terminal there](https://blog.infuseai.io/run-a-full-tty-terminal-in-google-colab-without-colab-pro-2759b9f8a74a).

## Dependency management

In [1]:
# Pick a dependency solver.
# Here I use Saturn Cloud (my Google Colab GPU quota ran out), where mamba is preinstalled.
# I usually pick mamba, poetry or uv.
! which mamba

/opt/saturncloud/bin/mamba


In [2]:
# install dependencies (the version spec is quoted so the shell does not treat ">" as a redirection)
! mamba install -y tensorflow-gpu ffmpeg ffmpeg-python srt pytorch torchvision torchaudio "pytorch-cuda>=12" pyaudio -c pytorch -c nvidia -c conda-forge

In [3]:
# Check that a cuda environment exists now
! nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
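
The same check can be done from Python, which is handy when a later cell should fail early on a machine without the CUDA toolkit. The sketch below is a minimal illustration using only the standard library; the helper names `parse_cuda_release` and `installed_cuda_release` are hypothetical and not part of any package used in this notebook.

```python
import re
import shutil
import subprocess


def parse_cuda_release(nvcc_output):
    """Return (major, minor) parsed from `nvcc --version` text, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_cuda_release():
    """Run nvcc if it is on PATH; return None when the toolkit is missing."""
    if shutil.which("nvcc") is None:
        return None
    completed = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=False
    )
    return parse_cuda_release(completed.stdout)


# One line copied from the output above
sample = "Cuda compilation tools, release 12.3, V12.3.107"
print(parse_cuda_release(sample))
print(installed_cuda_release())
```

A result of `None` from `installed_cuda_release()` only means that `nvcc` is not on the `PATH`; it does not prove that no GPU is present.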


In [4]:
# some dependencies are harder to find. whisper install only worked through git for me
! pip install git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-lu5nt0yo
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-lu5nt0yo
  Resolved https://github.com/openai/whisper.git to commit ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


## Audio File Transcription

As a first stage, let us use Whisper together with an external VAD (voice activity detection) model, Silero, to transcribe an audio file.
This is based on this [tutorial](https://colab.research.google.com/github/ANonEntity/WhisperWithVAD/blob/main/WhisperWithVAD.ipynb#scrollTo=sos9vsxPkIN7), which also uses DeepL to support multiple languages. For now we assume English for simplicity.

The next stage would be to reproduce this result with streaming audio.

In [7]:
audio_path = "transcription_test.mp3"
model_size = "medium"  # ["medium", "large"]
language = "english"
translation_mode = "End-to-end Whisper (default)"  # ["End-to-end Whisper (default)", "Whisper -> DeepL", "No translation"]

source_separation = False
vad_threshold = 0.4
chunk_threshold = 3.0
deepl_target_lang = "EN-US"
max_attempts = 1
initial_prompt = ""


import datetime
import json
import os
import urllib.request

import ffmpeg
import srt
import tensorflow as tf
import torch
import whisper
from tqdm import tqdm

assert max_attempts >= 1
assert vad_threshold >= 0.01
assert chunk_threshold >= 0.1
assert audio_path != ""
assert language != ""


task = "transcribe"

out_path = os.path.splitext(audio_path)[0] + ".srt"
out_path_pre = os.path.splitext(audio_path)[0] + "_Untranslated.srt"

# if source_separation:
#     print("Separating vocals...")
#     !ffprobe -i "{audio_path}" -show_entries format=duration -v quiet -of csv="p=0" > input_length
#     with open("input_length") as f:
#         input_length = int(float(f.read())) + 1
#     !spleeter separate -d {input_length} -p spleeter:2stems -o output "{audio_path}"
#     spleeter_dir = os.path.basename(os.path.splitext(audio_path)[0])
#     audio_path = "output/" + spleeter_dir + "/vocals.wav"

print("Encoding audio...")
if not os.path.exists("vad_chunks"):
    os.mkdir("vad_chunks")
ffmpeg.input(audio_path).output(
    "vad_chunks/silero_temp.wav",
    ar="16000",
    ac="1",
    acodec="pcm_s16le",
    map_metadata="-1",
    fflags="+bitexact",
).overwrite_output().run(quiet=True)

print("Running VAD...")
model, utils = torch.hub.load(
    repo_or_dir="snakers4/silero-vad", model="silero_vad", onnx=False
)

(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

# Generate VAD timestamps
VAD_SR = 16000
wav = read_audio("vad_chunks/silero_temp.wav", sampling_rate=VAD_SR)
t = get_speech_timestamps(wav, model, sampling_rate=VAD_SR, threshold=vad_threshold)

# Add a bit of padding, and remove small gaps
for i in range(len(t)):
    t[i]["start"] = max(0, t[i]["start"] - 3200)  # 0.2s head
    t[i]["end"] = min(wav.shape[0] - 16, t[i]["end"] + 20800)  # 1.3s tail
    if i > 0 and t[i]["start"] < t[i - 1]["end"]:
        t[i]["start"] = t[i - 1]["end"]  # Remove overlap

# If breaks are longer than chunk_threshold seconds, split into a new audio file
# This'll effectively turn long transcriptions into many shorter ones
u = [[]]
for i in range(len(t)):
    if i > 0 and t[i]["start"] > t[i - 1]["end"] + (chunk_threshold * VAD_SR):
        u.append([])
    u[-1].append(t[i])

# Merge speech chunks
for i in range(len(u)):
    save_audio(
        "vad_chunks/" + str(i) + ".wav",
        collect_chunks(u[i], wav),
        sampling_rate=VAD_SR,
    )

os.remove("vad_chunks/silero_temp.wav")

# Convert timestamps to seconds
for i in range(len(u)):
    time = 0.0
    offset = 0.0
    for j in range(len(u[i])):
        u[i][j]["start"] /= VAD_SR
        u[i][j]["end"] /= VAD_SR
        u[i][j]["chunk_start"] = time
        time += u[i][j]["end"] - u[i][j]["start"]
        u[i][j]["chunk_end"] = time
        if j == 0:
            offset += u[i][j]["start"]
        else:
            offset += u[i][j]["start"] - u[i][j - 1]["end"]
        u[i][j]["offset"] = offset

# Run Whisper on each audio chunk
print("Running Whisper...")
model = whisper.load_model(model_size)
subs = []
segment_info = []
sub_index = 1
suppress_low = []  # phrases that slightly lower a segment's log-probability (-0.15)
suppress_high = []  # phrases that strongly lower a segment's log-probability (-0.35)
for i in tqdm(range(len(u))):
    line_buffer = []  # Used for DeepL
    for x in range(max_attempts):
        result = model.transcribe(
            "vad_chunks/" + str(i) + ".wav",
            task=task,
            language=language,
            initial_prompt=initial_prompt,
        )
        # Break if result doesn't end with severe hallucinations
        if len(result["segments"]) == 0:
            break
        elif result["segments"][-1]["end"] < u[i][-1]["chunk_end"] + 10.0:
            break
        elif x + 1 < max_attempts:
            print("Retrying chunk", i)
    for r in result["segments"]:
        # Skip audio timestamped after the chunk has ended
        if r["start"] > u[i][-1]["chunk_end"]:
            continue
        # Reduce log probability for certain words/phrases
        for s in suppress_low:
            if s in r["text"]:
                r["avg_logprob"] -= 0.15
        for s in suppress_high:
            if s in r["text"]:
                r["avg_logprob"] -= 0.35
        # Keep segment info for debugging
        del r["tokens"]
        segment_info.append(r)
        # Skip if log prob is low or no speech prob is high
        if r["avg_logprob"] < -1.0 or r["no_speech_prob"] > 0.7:
            continue
        # Set start timestamp
        start = r["start"] + u[i][0]["offset"]
        for j in range(len(u[i])):
            if (
                r["start"] >= u[i][j]["chunk_start"]
                and r["start"] <= u[i][j]["chunk_end"]
            ):
                start = r["start"] + u[i][j]["offset"]
                break
        # Prevent overlapping subs
        if len(subs) > 0:
            last_end = datetime.timedelta.total_seconds(subs[-1].end)
            if last_end > start:
                subs[-1].end = datetime.timedelta(seconds=start)
        # Set end timestamp
        end = u[i][-1]["end"] + 0.5
        for j in range(len(u[i])):
            if r["end"] >= u[i][j]["chunk_start"] and r["end"] <= u[i][j]["chunk_end"]:
                end = r["end"] + u[i][j]["offset"]
                break
        # Add to SRT list
        subs.append(
            srt.Subtitle(
                index=sub_index,
                start=datetime.timedelta(seconds=start),
                end=datetime.timedelta(seconds=end),
                content=r["text"].strip(),
            )
        )
        sub_index += 1

with open("segment_info.json", "w", encoding="utf8") as f:
    json.dump(segment_info, f, indent=4)

# Write SRT file
# Removal of garbage lines
garbage_list = []
need_context_lines = []
clean_subs = list()
last_line_garbage = False
for i in range(len(subs)):
    c = subs[i].content
    c = (
        c.replace(".", "")
        .replace(",", "")
        .replace(":", "")
        .replace(";", "")
        .replace("!", "")
        .replace("?", "")
        .replace("-", " ")
        .replace("  ", " ")
        .replace("  ", " ")
        .replace("  ", " ")
        .lower()
    )
    is_garbage = True
    for w in c.split(" "):
        if w.strip() == "":
            continue
        if w.strip() in garbage_list:
            continue
        elif w.strip() in need_context_lines and last_line_garbage:
            continue
        else:
            is_garbage = False
            break
    if not is_garbage:
        clean_subs.append(subs[i])
    last_line_garbage = is_garbage
with open(out_path, "w", encoding="utf8") as f:
    f.write(srt.compose(clean_subs))
print("\nDone! Subs written to", out_path)

2024-02-28 23:27:18.641731: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-28 23:27:18.686061: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-28 23:27:18.686091: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-28 23:27:18.687030: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-28 23:27:18.693985: I tensorflow/core/platform/cpu_feature_guar

Encoding audio...
Running VAD...


Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to /home/jovyan/.cache/torch/hub/master.zip


Running Whisper...


100%|█████████████████████████████████████| 1.42G/1.42G [00:20<00:00, 74.1MiB/s]
100%|██████████| 1/1 [00:02<00:00,  2.83s/it]


Done! Subs written to transcription_test.srt
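
The padding and chunk-grouping steps in the cell above are easier to follow on toy data. The sketch below restates that logic as one pure-Python function; `pad_and_group` is a hypothetical helper name, and the toy timestamps are made-up sample indices, not values from the recording.

```python
VAD_SR = 16000  # samples per second, matching the cell above


def pad_and_group(timestamps, total_samples, chunk_threshold=3.0,
                  head=3200, tail=20800):
    """Pad each speech span, trim overlaps, and group spans split by long gaps."""
    padded = []
    for span in timestamps:
        start = max(0, span["start"] - head)  # 0.2 s of lead-in
        end = min(total_samples - 16, span["end"] + tail)  # 1.3 s of tail
        if padded and start < padded[-1]["end"]:
            start = padded[-1]["end"]  # padding made two spans overlap
        padded.append({"start": start, "end": end})

    gap_limit = chunk_threshold * VAD_SR  # silence length that starts a new chunk
    groups = [[]]
    for i, span in enumerate(padded):
        if i > 0 and span["start"] > padded[i - 1]["end"] + gap_limit:
            groups.append([])
        groups[-1].append(span)
    return groups


# Toy timestamps in samples (hypothetical values)
toy = [
    {"start": 16000, "end": 32000},
    {"start": 40000, "end": 48000},
    {"start": 200000, "end": 216000},
]
groups = pad_and_group(toy, total_samples=320000)
for group in groups:
    print(group)
```

The first two spans end up in the same chunk because the padded gap between them is shorter than `chunk_threshold` seconds, while the third span starts a new chunk.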





In [10]:
! cat transcription_test.srt

1
00:00:01,114 --> 00:00:05,114
This is a live recording and a test for live transcription.

2
00:00:05,114 --> 00:00:10,114
My name is Benjamin and I'm talking to you, the avatar.

3
00:00:10,114 --> 00:00:16,114
What I want to know is how many people live in Paris in 2023.
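
For reference, the layout of an SRT entry is simple enough to sketch by hand. The notebook relies on the `srt` package for this; the functions below are only a hypothetical standard-library illustration of the `HH:MM:SS,mmm` timestamp format seen in the file above.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def srt_block(index, start, end, text):
    """Build one subtitle entry: index, time range, then the text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"


print(srt_block(1, 1.114, 5.114, "This is a live recording and a test for live transcription."))
```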



This is a clear success!

Now let us try a similar technique, but from an audio stream.

## Transcription of a live stream


In [11]:
import time

import numpy as np
import pyaudio
import whisper

# Define audio stream parameters
FORMAT = pyaudio.paInt16
CHANNELS = 1  # mono: left and right channels are not needed here
RATE = 16000  # sampling rate (number of audio samples per second)
CHUNK_TIME = 5  # measured in seconds
CHUNK = RATE * CHUNK_TIME  # number of samples per chunk

# Create PyAudio object
p = pyaudio.PyAudio()

# Open audio stream
stream = p.open(
    format=FORMAT, channels=CHANNELS, rate=RATE, input=True, frames_per_buffer=CHUNK
)

# Initialize Whisper model
model = whisper.load_model("base")

try:
    print("Start speaking...")

    while True:
        data = stream.read(CHUNK)

        # Whisper expects a float32 waveform in [-1, 1], not raw int16 bytes
        audio = np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0

        # Transcribe audio chunk
        result = model.transcribe(audio=audio)

        # Extract text from result and print it immediately
        print(result["text"])

        # Exit on user input (optional); note that input() blocks until the user answers
        if input("Press 'q' to quit: ") == "q":
            break

except KeyboardInterrupt:
    print("\nExiting...")

finally:
    # Stop and close the stream
    stream.stop_stream()
    stream.close()

    # Close PyAudio
    p.terminate()

ALSA lib confmisc.c:855:(parse_card) cannot find card '0'
ALSA lib conf.c:5204:(_snd_config_evaluate) function snd_func_card_inum returned error: No such file or directory
ALSA lib confmisc.c:422:(snd_func_concat) error evaluating strings
ALSA lib conf.c:5204:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1342:(snd_func_refer) error evaluating name
ALSA lib conf.c:5204:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5727:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2675:(snd_pcm_open_noupdate) Unknown PCM sysdefault
ALSA lib confmisc.c:855:(parse_card) cannot find card '0'
ALSA lib conf.c:5204:(_snd_config_evaluate) function snd_func_card_inum returned error: No such file or directory
ALSA lib confmisc.c:422:(snd_func_concat) error evaluating strings
ALSA lib conf.c:5204:(_snd_config_evaluate) function snd_func_concat returned error: No

OSError: [Errno -9996] Invalid input device (no default output device)

Since I am running this notebook in the cloud, my own machine's microphone is not available.
Let us skip this part for now.
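
One way to keep experimenting without a microphone is to simulate the stream by reading a WAV file in fixed-size blocks. This is a sketch of that idea, not something the notebook does: `iter_wav_chunks` and `write_silence` are hypothetical helpers, and the demo file contains only silence.

```python
import os
import tempfile
import wave


def iter_wav_chunks(path, chunk_seconds=5):
    """Yield (sample_rate, raw_pcm_bytes) blocks of up to chunk_seconds each."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        frames_per_chunk = rate * chunk_seconds
        while True:
            frames = wav_file.readframes(frames_per_chunk)
            if not frames:
                break
            yield rate, frames


def write_silence(path, seconds, rate=16000):
    """Write a mono 16-bit WAV file of silence, used here as stand-in input."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * rate * seconds)


demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
write_silence(demo_path, seconds=12)
chunk_sizes = [len(raw) for _, raw in iter_wav_chunks(demo_path)]
print(chunk_sizes)
```

Each yielded block has the same shape as what `stream.read(CHUNK)` would return for 16-bit mono audio, so the transcription loop could consume it in place of the microphone data.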

## Getting an avatar

We use an LLM to reply to the user, then a text-to-speech step, and finally the approach from the [MakeItTalk paper](https://github.com/yzhou359/MakeItTalk/blob/main/quick_demo.ipynb) to animate the avatar.

### First we need to generate an answer
We can use any LLM here. Using an API can be fast and avoids any infrastructure cost for this specific need.
Yet a self-hosted model such as Mistral or Llama could be much faster.

Here I will generate the answer using Hugging Face's API. [Check it out](https://huggingface.co/mistralai/Mistral-7B-v0.1?text=This+is+a+live+recording+and+a+test+for+live+transcription.%0D%0A+My+name+is+Benjamin+and+Im+talking+to+you%2C+the+avatar.%0D%0A+What+I+want+to+know+is+how+many+people+live+in+Paris+in+2023.)

For a commercial product, hosting the model would be possible, especially since this exercise targets a "select few numbers of users".

In [5]:
# %cd ../
import json

with open("segment_info.json", "r+") as f:
    j = json.load(f)
question = "\n".join([e["text"].replace('"', "").replace("'", "") for e in j])
print(question)

 This is a live recording and a test for live transcription.
 My name is Benjamin and Im talking to you, the avatar.
 What I want to know is how many people live in Paris in 2023.
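
A base model such as Mistral-7B-v0.1 tends to continue the text it is given rather than answer it, so it can help to wrap the transcription in a simple question/answer frame before sending it. The sketch below shows one such frame; `build_prompt` and the `Question:`/`Answer:` template are assumptions for illustration, not something the notebook or the model requires.

```python
def build_prompt(segments):
    """Join transcribed segments into one question and add a Q/A frame."""
    parts = [seg["text"].strip() for seg in segments]
    question = " ".join(part for part in parts if part)
    return f"Question: {question}\nAnswer:"


# Same structure as the entries stored in segment_info.json
segments = [
    {"text": " My name is Benjamin and Im talking to you, the avatar."},
    {"text": " What I want to know is how many people live in Paris in 2023."},
]
prompt = build_prompt(segments)
print(prompt)
```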


In [None]:
# import openai

# # Replace "YOUR_API_KEY" with your actual OpenAI API key
# openai.api_key = "YOUR_API_KEY"

# response = openai.Completion.create(
#     engine="text-davinci-003",  # Choose the appropriate model
#     prompt=question,
#     max_tokens=100,  # Limit response length (optional)
#     temperature=0.5,  # Control creativity (optional)
# )

## Alternatively
from transformers import AutoModelForCausalLM, AutoTokenizer

# Install the required libraries:
# pip install transformers

# Load the Mistral model and tokenizer
# model_name = "mistralai/Mistral-7B-v0.1"
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForCausalLM.from_pretrained(model_name)

# # Encode the prompt into a format the model understands
# input_ids = tokenizer.encode(question, return_tensors="pt")

# # Generate text using the model (beam search with 3 beams)
# output = model.generate(
#     input_ids=input_ids,
#     max_length=50,  # Adjust the maximum generated text length
#     num_beams=3,
# )

# # Decode the generated tokens back into text
# response = tokenizer.decode(output[0], skip_special_tokens=True)

In [59]:
# assumed response
response = """
Based on the information I have access to. As of January 1, 2023, the estimated population of Paris is:

2,102,650 residents (source: statista.com)
It's important to note that population data can change over time, so it's always recommended to refer to reliable sources for the most up-to-date information.

I hope this information is helpful!
"""

### Then we need to create a voice sample based on the text we got out

We could use [voice cloning](https://www.adrianbulat.com/downloads/python-fan), but will stick to Google's TTS (through the `gtts` package) for now.

In [58]:
! mamba install -y gtts


Looking for: ['gtts']

pkgs/main/noarch                                              No change
pkgs/main/linux-64                                            No change
pkgs/r/linux-64                                               No change
pkgs/r/noarch                                                 No change

In [60]:
from gtts import gTTS

# Define language and speed (optional)
tts = gTTS(text=response, lang="en", slow=False)

# Save audio file
tts.save("reply.mp3")

### Then we need to create the animation and reproduce it
Here we follow the [MakeItTalk tutorial](https://github.com/yzhou359/MakeItTalk/blob/main/quick_demo.ipynb).

In [6]:
! pip install opencv-python face_alignment scikit-learn pydub soundfile librosa==0.9.1 pysptk pyworld resemblyzer tensorboardX pynormalize



In [8]:
! git clone https://github.com/yzhou359/MakeItTalk
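# Note: `! export` runs in a throwaway subshell and does not change this kernel's environment.
# The MakeItTalk imports work later thanks to `%cd MakeItTalk/` and `sys.path.append`.
# ! export PYTHONPATH=MakeItTalk:$PYTHONPATH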

In [9]:
! mkdir MakeItTalk/examples/dump
! mkdir MakeItTalk/examples/ckpt
! pip install gdown
! gdown -O MakeItTalk/examples/ckpt/ckpt_autovc.pth https://drive.google.com/uc?id=1ZiwPp_h62LtjU0DwpelLUoodKPR85K7x
! gdown -O MakeItTalk/examples/ckpt/ckpt_content_branch.pth https://drive.google.com/uc?id=1r3bfEvTVl6pCNw5xwUhEglwDHjWtAqQp
! gdown -O MakeItTalk/examples/ckpt/ckpt_speaker_branch.pth https://drive.google.com/uc?id=1rV0jkyDqPW-aDJcj7xSO6Zt1zSXqn1mu
! gdown -O MakeItTalk/examples/ckpt/ckpt_116_i2i_comb.pth https://drive.google.com/uc?id=1i2LJXKp-yWKIEEgJ7C6cE3_2NirfY_0a
! gdown -O MakeItTalk/examples/dump/emb.pickle https://drive.google.com/uc?id=18-0CYl5E6ungS3H4rRSHjfYvvm-WwjTI

mkdir: cannot create directory ‘MakeItTalk/examples/dump’: File exists
mkdir: cannot create directory ‘MakeItTalk/examples/ckpt’: File exists
Downloading...
From (original): https://drive.google.com/uc?id=1ZiwPp_h62LtjU0DwpelLUoodKPR85K7x
From (redirected): https://drive.google.com/uc?id=1ZiwPp_h62LtjU0DwpelLUoodKPR85K7x&confirm=t&uuid=ce64cde6-b090-486e-8436-fced2f9bc42d
To: /home/jovyan/workspace/MakeItTalk/examples/ckpt/ckpt_autovc.pth
100%|████████████████████████████████████████| 172M/172M [00:01<00:00, 87.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=1r3bfEvTVl6pCNw5xwUhEglwDHjWtAqQp
To: /home/jovyan/workspace/MakeItTalk/examples/ckpt/ckpt_content_branch.pth
100%|██████████████████████████████████████| 7.88M/7.88M [00:00<00:00, 45.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1rV0jkyDqPW-aDJcj7xSO6Zt1zSXqn1mu
To: /home/jovyan/workspace/MakeItTalk/examples/ckpt/ckpt_speaker_branch.pth
100%|██████████████████████████████████████| 15.4M/15.4M [00:00<00:00, 

# IMPORTANT
Edit `MakeItTalk/thirdparty/AdaptiveWingLoss/core/models.py`:
line 5 should read `from ..core.coord_conv import CoordConvTh`.
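
If you prefer not to edit the file by hand, the change can be scripted. The sketch below is a generic one-line patch helper built on `pathlib`; `replace_line` is a hypothetical name, and the call against the real MakeItTalk file is left commented out so the block runs without the repository.

```python
import os
import tempfile
from pathlib import Path


def replace_line(path, line_number, new_text):
    """Replace one 1-indexed line of a text file and return the old line."""
    file_path = Path(path)
    lines = file_path.read_text(encoding="utf8").splitlines(keepends=True)
    if not 1 <= line_number <= len(lines):
        raise IndexError(f"{path} has no line {line_number}")
    old = lines[line_number - 1]
    newline = "\n" if old.endswith("\n") else ""  # keep the original line ending
    lines[line_number - 1] = new_text + newline
    file_path.write_text("".join(lines), encoding="utf8")
    return old.rstrip("\n")


# Demo on a temporary file
demo_file = os.path.join(tempfile.mkdtemp(), "models.py")
Path(demo_file).write_text("a\nb\nc\nd\ne\nf\n", encoding="utf8")
previous = replace_line(demo_file, 5, "from ..core.coord_conv import CoordConvTh")
print("replaced:", previous)

# For the real file (run from the directory that contains MakeItTalk/):
# replace_line(
#     "MakeItTalk/thirdparty/AdaptiveWingLoss/core/models.py",
#     5,
#     "from ..core.coord_conv import CoordConvTh",
# )
```

Printing the returned old line before trusting the patch is a cheap way to confirm that line 5 is still the import you expect, since the upstream file may change.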

In [48]:
# You can use any picture you like, but it has to be 256x256 pixels. Here I need to resize my "macron" picture.
from PIL import Image

# Define input and output image paths (replace with your actual paths)
input_path = "../macron_square.jpg"
output_path = "../macron_square_resized.jpg"

# Define the desired size for the resized image
new_size = 256  # Adjust the size as needed

# Open the image
image = Image.open(input_path)
print(image.height, image.width)
resized_image = image.resize((new_size, new_size))
resized_image.save(output_path)

633 633
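
The picture used here is already square (633 x 633), so a plain resize keeps its proportions. For a non-square picture, resizing straight to 256 x 256 would stretch the face, and a center crop first is one way around that. The sketch below only computes the crop box; `center_square_box` is a hypothetical helper.

```python
def center_square_box(width, height):
    """Return (left, upper, right, lower) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return left, upper, left + side, upper + side


print(center_square_box(633, 633))
print(center_square_box(800, 600))
```

The returned tuple uses the same (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so it could be applied before the `resize` call in the cell above.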


In [10]:
%cd MakeItTalk/
import sys

sys.path.append("thirdparty/AdaptiveWingLoss")
import argparse
import glob
import os
import pickle
import shutil
import time

import cv2
import face_alignment
import numpy as np
import torch
import util.utils as util
from scipy.signal import savgol_filter
from src.approaches.train_audio2landmark import Audio2landmark_model
from src.approaches.train_image_translation import Image_translation_block
from src.autovc.AutoVC_mel_Convertor_retrain_version import AutoVC_mel_Convertor

/home/jovyan/workspace/MakeItTalk


In [49]:
default_head_name = "macron_square_resized"  # image name (without .jpg) to animate, e.g. "paint_boy"
ADD_NAIVE_EYE = True  # whether add naive eye blink
CLOSE_INPUT_FACE_MOUTH = (
    False  # if your image has an opened mouth, put this as True, else False
)
AMP_LIP_SHAPE_X = 2.0  # amplify the lip motion in horizontal direction
AMP_LIP_SHAPE_Y = 2.0  # amplify the lip motion in vertical direction
AMP_HEAD_POSE_MOTION = 0.7  # amplify the head pose motion (usually smaller than 1.0, put it to 0. for a static head pose)

In [50]:
parser = argparse.ArgumentParser()
parser.add_argument("--jpg", type=str, default="../{}.jpg".format(default_head_name))
parser.add_argument(
    "--close_input_face_mouth", default=CLOSE_INPUT_FACE_MOUTH, action="store_true"
)

parser.add_argument(
    "--load_AUTOVC_name", type=str, default="examples/ckpt/ckpt_autovc.pth"
)
parser.add_argument(
    "--load_a2l_G_name", type=str, default="examples/ckpt/ckpt_speaker_branch.pth"
)
parser.add_argument(
    "--load_a2l_C_name", type=str, default="examples/ckpt/ckpt_content_branch.pth"
)  # ckpt_audio2landmark_c.pth')
parser.add_argument(
    "--load_G_name", type=str, default="examples/ckpt/ckpt_116_i2i_comb.pth"
)  # ckpt_image2image.pth') #ckpt_i2i_finetune_150.pth') #c

parser.add_argument("--amp_lip_x", type=float, default=AMP_LIP_SHAPE_X)
parser.add_argument("--amp_lip_y", type=float, default=AMP_LIP_SHAPE_Y)
parser.add_argument("--amp_pos", type=float, default=AMP_HEAD_POSE_MOTION)
parser.add_argument(
    "--reuse_train_emb_list", type=str, nargs="+", default=[]
)  #  ['iWeklsXc0H8']) #['45hn7-LXDX8']) #['E_kmpT-EfOg']) #'iWeklsXc0H8', '29k8RtSUjE0', '45hn7-LXDX8',
parser.add_argument("--add_audio_in", default=False, action="store_true")
parser.add_argument("--comb_fan_awing", default=False, action="store_true")
parser.add_argument("--output_folder", type=str, default="examples")

parser.add_argument("--test_end2end", default=True, action="store_true")
parser.add_argument("--dump_dir", type=str, default="", help="")
parser.add_argument("--pos_dim", default=7, type=int)
parser.add_argument("--use_prior_net", default=True, action="store_true")
parser.add_argument("--transformer_d_model", default=32, type=int)
parser.add_argument("--transformer_N", default=2, type=int)
parser.add_argument("--transformer_heads", default=2, type=int)
parser.add_argument("--spk_emb_enc_size", default=16, type=int)
parser.add_argument("--init_content_encoder", type=str, default="")
parser.add_argument("--lr", type=float, default=1e-3, help="learning rate")
parser.add_argument("--reg_lr", type=float, default=1e-6, help="weight decay")
parser.add_argument("--write", default=False, action="store_true")
parser.add_argument("--segment_batch_size", type=int, default=1, help="batch size")
parser.add_argument("--emb_coef", default=3.0, type=float)
parser.add_argument("--lambda_laplacian_smooth_loss", default=1.0, type=float)
parser.add_argument("--use_11spk_only", default=False, action="store_true")
parser.add_argument("-f")

opt_parser = parser.parse_args()
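
A note on `parser.add_argument("-f")`: a Jupyter kernel is started with a `-f <connection file>` argument, so a bare `parse_args()` would otherwise reject it. Two alternatives are sketched below on a reduced, hypothetical parser (only two of the options above are reproduced, and `kernel.json` is a placeholder name).

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--amp_lip_x", type=float, default=2.0)
parser.add_argument("--add_audio_in", default=False, action="store_true")

# Option 1: ignore sys.argv entirely and take the defaults
opts = parser.parse_args(args=[])
print(opts.amp_lip_x, opts.add_audio_in)

# Option 2: parse what is known and collect the rest instead of failing
known, unknown = parser.parse_known_args(
    ["-f", "kernel.json", "--amp_lip_x", "1.5"]
)
print(known.amp_lip_x, unknown)
```

Also note that an option declared with `default=True, action="store_true"`, as `--test_end2end` and `--use_prior_net` are above, is always `True`: passing the flag sets it to `True`, and omitting it falls back to the same default.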

In [39]:
# In case of network errors in the next cell,
# manually copy file "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" to /home/jovyan/.cache/torch/hub/checkpoints/s3fd-619a316812.pth
# "https://www.adrianbulat.com/downloads/python-fan/3DFAN4-4a694010b9.zip" to /home/jovyan/.cache/torch/hub/checkpoints/3DFAN4-4a694010b9.zip
# "https://www.adrianbulat.com/downloads/python-fan/depth-6c4283c0e0.zip" to /home/jovyan/.cache/torch/hub/checkpoints/depth-6c4283c0e0.zip
# ! mv ../s3fd-619a316812.pth /home/jovyan/.cache/torch/hub/checkpoints/
# ! mv ../3DFAN4-4a694010b9.zip /home/jovyan/.cache/torch/hub/checkpoints/
# ! mv ../depth-6c4283c0e0.zip /home/jovyan/.cache/torch/hub/checkpoints/

In [51]:
img = cv2.imread(opt_parser.jpg)
predictor = face_alignment.FaceAlignment(
    face_alignment.LandmarksType.THREE_D, device="cuda", flip_input=True
)
shapes = predictor.get_landmarks(img)
if not shapes or len(shapes) != 1:
    print("Cannot detect face landmarks. Exit.")
    exit(-1)
shape_3d = shapes[0]

if opt_parser.close_input_face_mouth:
    util.close_input_face_mouth(shape_3d)

In [33]:
# shape_3d[48:, 0] = (shape_3d[48:, 0] - np.mean(shape_3d[48:, 0])) * 1.05 + np.mean(shape_3d[48:, 0]) # wider lips
# shape_3d[49:54, 1] += 0.           # thinner upper lip
# shape_3d[55:60, 1] -= 1.           # thinner lower lip
# shape_3d[[37,38,43,44], 1] -=2.    # larger eyes
# shape_3d[[40,41,46,47], 1] +=2.    # larger eyes

In [52]:
shape_3d, scale, shift = util.norm_input_face(shape_3d)

In [25]:
# ! cp ../reply.mp3 examples
# ! ffmpeg -i examples/reply.mp3 examples/reply.wav

ffmpeg version 6.1.1 Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 12.3.0 (conda-forge gcc 12.3.0-3)
  configuration: --prefix=/home/conda/feedstock_root/build_artifacts/ffmpeg_1705436738391/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac --cc=/home/conda/feedstock_root/build_artifacts/ffmpeg_1705436738391/_build_env/bin/x86_64-conda-linux-gnu-cc --cxx=/home/conda/feedstock_root/build_artifacts/ffmpeg_1705436738391/_build_env/bin/x86_64-conda-linux-gnu-c++ --nm=/home/conda/feedstock_root/build_artifacts/ffmpeg_1705436738391/_build_env/bin/x86_64-conda-linux-gnu-nm --ar=/home/conda/feedstock_root/build_artifacts/ffmpeg_1705436738391/_build_env/bin/x86_64-conda-linux-gnu-ar --disable-doc --disable-openssl --enable-demuxer=dash --enable-hardcoded-tables --enable-libfreetype --enable-libharfbuzz --enable-libfontconfig --enable-libo

In [53]:
au_data = []
au_emb = []
ains = glob.glob1("examples", "*.wav")
ains = [item for item in ains if item != "tmp.wav"]
ains.sort()
for ain in ains:
    os.system(
        "ffmpeg -y -loglevel error -i examples/{} -ar 16000 examples/tmp.wav".format(
            ain
        )
    )
    shutil.copyfile("examples/tmp.wav", "examples/{}".format(ain))

    # au embedding
    from thirdparty.resemblyer_util.speaker_emb import get_spk_emb

    me, ae = get_spk_emb("examples/{}".format(ain))
    au_emb.append(me.reshape(-1))

    print("Processing audio file", ain)
    c = AutoVC_mel_Convertor("examples")

    au_data_i = c.convert_single_wav_to_autovc_input(
        audio_filename=os.path.join("examples", ain),
        autovc_model_path=opt_parser.load_AUTOVC_name,
    )
    au_data += au_data_i
if os.path.isfile("examples/tmp.wav"):
    os.remove("examples/tmp.wav")

# landmark fake placeholder
fl_data = []
rot_tran, rot_quat, anchor_t_shape = [], [], []
for au, info in au_data:
    au_length = au.shape[0]
    fl = np.zeros(shape=(au_length, 68 * 3))
    fl_data.append((fl, info))
    rot_tran.append(np.zeros(shape=(au_length, 3, 4)))
    rot_quat.append(np.zeros(shape=(au_length, 4)))
    anchor_t_shape.append(np.zeros(shape=(au_length, 68 * 3)))

if os.path.exists(os.path.join("examples", "dump", "random_val_fl.pickle")):
    os.remove(os.path.join("examples", "dump", "random_val_fl.pickle"))
if os.path.exists(os.path.join("examples", "dump", "random_val_fl_interp.pickle")):
    os.remove(os.path.join("examples", "dump", "random_val_fl_interp.pickle"))
if os.path.exists(os.path.join("examples", "dump", "random_val_au.pickle")):
    os.remove(os.path.join("examples", "dump", "random_val_au.pickle"))
if os.path.exists(os.path.join("examples", "dump", "random_val_gaze.pickle")):
    os.remove(os.path.join("examples", "dump", "random_val_gaze.pickle"))

with open(os.path.join("examples", "dump", "random_val_fl.pickle"), "wb") as fp:
    pickle.dump(fl_data, fp)
with open(os.path.join("examples", "dump", "random_val_au.pickle"), "wb") as fp:
    pickle.dump(au_data, fp)
with open(os.path.join("examples", "dump", "random_val_gaze.pickle"), "wb") as fp:
    gaze = {
        "rot_trans": rot_tran,
        "rot_quat": rot_quat,
        "anchor_t_shape": anchor_t_shape,
    }
    pickle.dump(gaze, fp)

Loaded the voice encoder model on cuda in 0.01 seconds.
Processing audio file reply.wav
0 out of 0 are in this portion
Loaded the voice encoder model on cuda in 0.01 seconds.
source shape: torch.Size([1, 1920, 80]) torch.Size([1, 256]) torch.Size([1, 256]) torch.Size([1, 1920, 257])
converted shape: torch.Size([1, 1920, 80]) torch.Size([1, 3840])
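
The four "remove the pickle if it exists" blocks in the cell above follow one pattern and could be folded into a loop. The sketch below is a hypothetical helper, demonstrated in a temporary directory rather than in `examples/dump`.

```python
import os
import tempfile


def remove_if_present(directory, filenames):
    """Delete each listed file that exists and return the names removed."""
    removed = []
    for name in filenames:
        path = os.path.join(directory, name)
        if os.path.exists(path):
            os.remove(path)
            removed.append(name)
    return removed


stale = [
    "random_val_fl.pickle",
    "random_val_fl_interp.pickle",
    "random_val_au.pickle",
    "random_val_gaze.pickle",
]

# Demo: only two of the four files exist in the temporary directory
demo_dir = tempfile.mkdtemp()
for name in stale[:2]:
    open(os.path.join(demo_dir, name), "wb").close()
removed = remove_if_present(demo_dir, stale)
print(removed)
```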


In [54]:
!pwd
model = Audio2landmark_model(opt_parser, jpg_shape=shape_3d)
if(len(opt_parser.reuse_train_emb_list) == 0):
    model.test(au_emb=au_emb)
else:
    model.test(au_emb=None)

/home/jovyan/workspace/MakeItTalk
Run on device: cuda
Loading Data random_val
EVAL num videos: 1
G: Running on cuda, total num params = 3.00M
48uYS3bHIA8
YAZuSHvwVC0
0yaLdVk_UyQ
E_kmpT-EfOg
fQR31F7L3ww
JPMZAOGGHh8
W6uRNCJmdtI
2KL8PfQPmBg
p575B7k07a8
iUoAe2gXKE4
HH-iOC056aQ
S8fiWqrZEew
ROWN2ssXek8
irx71tYyI-Q
me6cdZCM2FY
OkqHtWOFliM
OfPKHc6w2vw
1lh57VnuaKE
_ldiVrXgZKc
H1Xnb_rtgqY
45hn7-LXDX8
bs7ZWVqAGCU
UElg0R7fmlk
bCs5SoifsiY
1Lx_ZqrK1bM
RrnL6Pcjjbw
sRbWv2R2hxE
wJmdE0G4sEg
hE-4e1vEiT8
XXbxe3fCQqg
02HOKnTjBlQ
wAAMEC1OsRc
7Sk--XzX8b0
I5Lm0Qce5kg
qLxfiUMYgQg
_VpqWkdcaqM
ljIkW4uVVQY
5m5iPZNJS6c
J-NPsvtQ8lE
gOrQyrbptGo
43BiUVlNy58
swLghyvhoqA
X3FCAoFnmdA
2NiCRAmwoc4
KVUf0J2LAaA
YtZS9hH1j24
5fZj9Fzi5K0
wbWKG26ebMw
QgNlXur0wrs
qek_5m1MRik
rmFsUV5ICKk
bEdGv1wixF4
ljh5PB6Utsc
izudwWTXuUk
B08yOvYMF7Y
UEmI4r5G-5Y
Scujgl9GbHA
sxCbrYjBsGA
qvQC0w3y_Fo
bXpavyiCu10
iWeklsXc0H8
H00oAfd_GsM
Z7WRt--g-h4
29k8RtSUjE0
E0zgrhQ0QDw
9KhvSxKE6Mc
qLNvRwMkhik


OpenCV: FFMPEG: tag 0x47504a4d/'MJPG' is not supported with codec id 7 and format 'mp4 / MP4 (MPEG-4 Part 14)'
OpenCV: FFMPEG: fallback to use tag 0x7634706d/'mp4v'


examples/reply.wav


ffmpeg version 6.1.1 Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 12.3.0 (conda-forge gcc 12.3.0-3)
  configuration: --prefix=/home/conda/feedstock_root/build_artifacts/ffmpeg_1705436738391/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac --cc=/home/conda/feedstock_root/build_artifacts/ffmpeg_1705436738391/_build_env/bin/x86_64-conda-linux-gnu-cc --cxx=/home/conda/feedstock_root/build_artifacts/ffmpeg_1705436738391/_build_env/bin/x86_64-conda-linux-gnu-c++ --nm=/home/conda/feedstock_root/build_artifacts/ffmpeg_1705436738391/_build_env/bin/x86_64-conda-linux-gnu-nm --ar=/home/conda/feedstock_root/build_artifacts/ffmpeg_1705436738391/_build_env/bin/x86_64-conda-linux-gnu-ar --disable-doc --disable-openssl --enable-demuxer=dash --enable-hardcoded-tables --enable-libfreetype --enable-libharfbuzz --enable-libfontconfig --enable-libo

In [55]:
fls = glob.glob1('examples', 'pred_fls_*.txt')
fls.sort()

for i in range(0,len(fls)):
    fl = np.loadtxt(os.path.join('examples', fls[i])).reshape((-1, 68,3))
    fl[:, :, 0:2] = -fl[:, :, 0:2]
    fl[:, :, 0:2] = fl[:, :, 0:2] / scale - shift

    if (ADD_NAIVE_EYE):
        fl = util.add_naive_eye(fl)

    # additional smooth
    fl = fl.reshape((-1, 204))
    fl[:, :48 * 3] = savgol_filter(fl[:, :48 * 3], 15, 3, axis=0)
    fl[:, 48*3:] = savgol_filter(fl[:, 48*3:], 5, 3, axis=0)
    fl = fl.reshape((-1, 68, 3))

    ''' STEP 6: Image2image translation '''
    model = Image_translation_block(opt_parser, single_test=True)
    with torch.no_grad():
        model.single_test(jpg=img, fls=fl, filename=fls[i], prefix=opt_parser.jpg.split('.')[0])
        print('finish image2image gen')
    os.remove(os.path.join('examples', fls[i]))

Run on device cuda


OpenCV: FFMPEG: tag 0x67706a6d/'mjpg' is not supported with codec id 7 and format 'mp4 / MP4 (MPEG-4 Part 14)'
OpenCV: FFMPEG: fallback to use tag 0x7634706d/'mp4v'


Time - only video: 43.29017472267151
Time - ffmpeg add audio: 49.150564432144165
finish image2image gen
