# End-to-End Solution

This notebook assumes a GPU environment is available.
It is only a Jupyter demo, but CUDA should be enabled.

If you are using a free Jupyter notebook environment, pick a T4 GPU runtime. You can even [open a terminal](https://blog.infuseai.io/run-a-full-tty-terminal-in-google-colab-without-colab-pro-2759b9f8a74a).

## Dependency management

In [2]:
# Pick a dependency solver.
# Here I use Saturn Cloud (my Google Colab GPU quota ran out), where mamba is preinstalled.
# I usually pick one of mamba, poetry or uv.
! which mamba

/opt/saturncloud/bin/mamba
/bin/bash: line 1: nvcc: command not found


In [4]:
# install dependencies
# quote the version spec so the shell does not treat ">" as an output redirect
! mamba install -y tensorflow-gpu ffmpeg ffmpeg-python srt pytorch torchvision torchaudio "pytorch-cuda>=12" pyaudio -c pytorch -c nvidia -c conda-forge

In [9]:
# Check that a cuda environment exists now
! nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
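
The same check can be scripted so that later cells fail early when the toolkit is missing. The sketch below is a minimal example, not part of the original notebook: the helper names `parse_nvcc_release` and `installed_cuda_release` are hypothetical, and it assumes the `release X.Y` wording seen in the output above.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_nvcc_release(text: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_cuda_release() -> Optional[Tuple[int, int]]:
    """Run nvcc if it is on PATH; return None when it cannot be found."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    out = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(out.stdout)


# Line copied from the nvcc output shown above
sample = "Cuda compilation tools, release 12.3, V12.3.107"
print(parse_nvcc_release(sample))  # (12, 3)
```

Calling `installed_cuda_release()` before and after the install cell would return `None` and then a version tuple, matching the two `nvcc` outputs seen in this notebook.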


In [6]:
# Some dependencies are harder to find: installing whisper only worked through git for me.
! pip install git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-oraczcbm
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-oraczcbm
  Resolved https://github.com/openai/whisper.git to commit ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting more-itertools (from openai-whisper==20231117)
  Downloading more_itertools-10.2.0-py3-none-any.whl.metadata (34 kB)
Collecting tiktoken (from openai-whisper==20231117)
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading more_itertools-10.2.0-py3-none-any.whl (57 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.0/57.0 kB 3.3 MB/s eta 0:00:00
Downloading t

## Audio File Transcription

As a first stage, let us transcribe an audio file with Whisper, using an external VAD (voice activity detection) model, Silero, to isolate the speech segments.
This is based on this [tutorial](https://colab.research.google.com/github/ANonEntity/WhisperWithVAD/blob/main/WhisperWithVAD.ipynb#scrollTo=sos9vsxPkIN7), which also uses DeepL to support multiple languages. For now we assume English for simplicity.

The next stage is to reproduce this result on an audio stream.

In [7]:
audio_path = "transcription_test.mp3"
model_size = "medium"  # ["medium", "large"]
language = "english"
translation_mode = "End-to-end Whisper (default)"  # ["End-to-end Whisper (default)", "Whisper -> DeepL", "No translation"]

source_separation = False
vad_threshold = 0.4
chunk_threshold = 3.0
deepl_target_lang = "EN-US"
max_attempts = 1
initial_prompt = ""


import datetime
import json
import os
import urllib.request

import ffmpeg
import srt
import tensorflow as tf
import torch
import whisper
from tqdm import tqdm

assert max_attempts >= 1
assert vad_threshold >= 0.01
assert chunk_threshold >= 0.1
assert audio_path != ""
assert language != ""


task = "transcribe"

out_path = os.path.splitext(audio_path)[0] + ".srt"
out_path_pre = os.path.splitext(audio_path)[0] + "_Untranslated.srt"

# if source_separation:
#     print("Separating vocals...")
#     !ffprobe -i "{audio_path}" -show_entries format=duration -v quiet -of csv="p=0" > input_length
#     with open("input_length") as f:
#         input_length = int(float(f.read())) + 1
#     !spleeter separate -d {input_length} -p spleeter:2stems -o output "{audio_path}"
#     spleeter_dir = os.path.basename(os.path.splitext(audio_path)[0])
#     audio_path = "output/" + spleeter_dir + "/vocals.wav"

print("Encoding audio...")
os.makedirs("vad_chunks", exist_ok=True)
ffmpeg.input(audio_path).output(
    "vad_chunks/silero_temp.wav",
    ar="16000",
    ac="1",
    acodec="pcm_s16le",
    map_metadata="-1",
    fflags="+bitexact",
).overwrite_output().run(quiet=True)

print("Running VAD...")
model, utils = torch.hub.load(
    repo_or_dir="snakers4/silero-vad", model="silero_vad", onnx=False
)

(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

# Generate VAD timestamps
VAD_SR = 16000
wav = read_audio("vad_chunks/silero_temp.wav", sampling_rate=VAD_SR)
t = get_speech_timestamps(wav, model, sampling_rate=VAD_SR, threshold=vad_threshold)

# Add a bit of padding, and remove small gaps
for i in range(len(t)):
    t[i]["start"] = max(0, t[i]["start"] - 3200)  # 0.2s head
    t[i]["end"] = min(wav.shape[0] - 16, t[i]["end"] + 20800)  # 1.3s tail
    if i > 0 and t[i]["start"] < t[i - 1]["end"]:
        t[i]["start"] = t[i - 1]["end"]  # Remove overlap

# If breaks are longer than chunk_threshold seconds, split into a new audio file
# This'll effectively turn long transcriptions into many shorter ones
u = [[]]
for i in range(len(t)):
    if i > 0 and t[i]["start"] > t[i - 1]["end"] + (chunk_threshold * VAD_SR):
        u.append([])
    u[-1].append(t[i])

# Merge speech chunks
for i in range(len(u)):
    save_audio(
        "vad_chunks/" + str(i) + ".wav",
        collect_chunks(u[i], wav),
        sampling_rate=VAD_SR,
    )

os.remove("vad_chunks/silero_temp.wav")

# Convert timestamps to seconds
for i in range(len(u)):
    time = 0.0
    offset = 0.0
    for j in range(len(u[i])):
        u[i][j]["start"] /= VAD_SR
        u[i][j]["end"] /= VAD_SR
        u[i][j]["chunk_start"] = time
        time += u[i][j]["end"] - u[i][j]["start"]
        u[i][j]["chunk_end"] = time
        if j == 0:
            offset += u[i][j]["start"]
        else:
            offset += u[i][j]["start"] - u[i][j - 1]["end"]
        u[i][j]["offset"] = offset

# Run Whisper on each audio chunk
print("Running Whisper...")
model = whisper.load_model(model_size)
subs = []
segment_info = []
sub_index = 1
suppress_low = []  # phrases that get a small log-prob penalty (-0.15)
suppress_high = []  # phrases that get a large log-prob penalty (-0.35)
for i in tqdm(range(len(u))):
    line_buffer = []  # Used for DeepL
    for x in range(max_attempts):
        result = model.transcribe(
            "vad_chunks/" + str(i) + ".wav",
            task=task,
            language=language,
            initial_prompt=initial_prompt,
        )
        # Break if result doesn't end with severe hallucinations
        if len(result["segments"]) == 0:
            break
        elif result["segments"][-1]["end"] < u[i][-1]["chunk_end"] + 10.0:
            break
        elif x + 1 < max_attempts:
            print("Retrying chunk", i)
    for r in result["segments"]:
        # Skip audio timestamped after the chunk has ended
        if r["start"] > u[i][-1]["chunk_end"]:
            continue
        # Reduce log probability for certain words/phrases
        for s in suppress_low:
            if s in r["text"]:
                r["avg_logprob"] -= 0.15
        for s in suppress_high:
            if s in r["text"]:
                r["avg_logprob"] -= 0.35
        # Keep segment info for debugging
        del r["tokens"]
        segment_info.append(r)
        # Skip if log prob is low or no speech prob is high
        if r["avg_logprob"] < -1.0 or r["no_speech_prob"] > 0.7:
            continue
        # Set start timestamp
        start = r["start"] + u[i][0]["offset"]
        for j in range(len(u[i])):
            if (
                r["start"] >= u[i][j]["chunk_start"]
                and r["start"] <= u[i][j]["chunk_end"]
            ):
                start = r["start"] + u[i][j]["offset"]
                break
        # Prevent overlapping subs
        if len(subs) > 0:
            last_end = subs[-1].end.total_seconds()
            if last_end > start:
                subs[-1].end = datetime.timedelta(seconds=start)
        # Set end timestamp
        end = u[i][-1]["end"] + 0.5
        for j in range(len(u[i])):
            if r["end"] >= u[i][j]["chunk_start"] and r["end"] <= u[i][j]["chunk_end"]:
                end = r["end"] + u[i][j]["offset"]
                break
        # Add to SRT list
        subs.append(
            srt.Subtitle(
                index=sub_index,
                start=datetime.timedelta(seconds=start),
                end=datetime.timedelta(seconds=end),
                content=r["text"].strip(),
            )
        )
        sub_index += 1

with open("segment_info.json", "w", encoding="utf8") as f:
    json.dump(segment_info, f, indent=4)

# Write SRT file
# Removal of garbage lines
garbage_list = []
need_context_lines = []
clean_subs = list()
last_line_garbage = False
for i in range(len(subs)):
    c = subs[i].content
    c = (
        c.replace(".", "")
        .replace(",", "")
        .replace(":", "")
        .replace(";", "")
        .replace("!", "")
        .replace("?", "")
        .replace("-", " ")
        .replace("  ", " ")
        .replace("  ", " ")
        .replace("  ", " ")
        .lower()
    )
    is_garbage = True
    for w in c.split(" "):
        if w.strip() == "":
            continue
        if w.strip() in garbage_list:
            continue
        elif w.strip() in need_context_lines and last_line_garbage:
            continue
        else:
            is_garbage = False
            break
    if not is_garbage:
        clean_subs.append(subs[i])
    last_line_garbage = is_garbage
with open(out_path, "w", encoding="utf8") as f:
    f.write(srt.compose(clean_subs))
print("\nDone! Subs written to", out_path)

2024-02-28 23:27:18.641731: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-28 23:27:18.686061: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-28 23:27:18.686091: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-28 23:27:18.687030: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-28 23:27:18.693985: I tensorflow/core/platform/cpu_feature_guar

Encoding audio...
Running VAD...


Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to /home/jovyan/.cache/torch/hub/master.zip


Running Whisper...


100%|█████████████████████████████████████| 1.42G/1.42G [00:20<00:00, 74.1MiB/s]
100%|██████████| 1/1 [00:02<00:00,  2.83s/it]


Done! Subs written to transcription_test.srt
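
The padding and grouping steps in the cell above are easier to reason about in isolation. The following is a minimal sketch that restates them as two pure functions operating on plain dictionaries, so they can be tried without Silero or any audio. The function names `pad_spans` and `group_spans` and the sample timestamps are made up for illustration; the constants (3200-sample head, 20800-sample tail, 16-sample margin) are the ones used in the cell.

```python
from typing import Dict, List

Span = Dict[str, int]  # {"start": sample_index, "end": sample_index}


def pad_spans(spans: List[Span], total_samples: int,
              head: int = 3200, tail: int = 20800) -> List[Span]:
    """Pad each speech span and clip it so it never overlaps the previous one."""
    padded: List[Span] = []
    for span in spans:
        start = max(0, span["start"] - head)
        end = min(total_samples - 16, span["end"] + tail)
        if padded and start < padded[-1]["end"]:
            start = padded[-1]["end"]  # remove overlap with the previous span
        padded.append({"start": start, "end": end})
    return padded


def group_spans(spans: List[Span], max_gap_samples: float) -> List[List[Span]]:
    """Start a new group whenever the silence between spans exceeds the gap."""
    groups: List[List[Span]] = [[]]
    for i, span in enumerate(spans):
        if i > 0 and span["start"] > spans[i - 1]["end"] + max_gap_samples:
            groups.append([])
        groups[-1].append(span)
    return groups


VAD_SR = 16000
# Hypothetical VAD output for a 20 s recording (values are sample indices)
raw = [
    {"start": 16000, "end": 32000},
    {"start": 40000, "end": 48000},
    {"start": 160000, "end": 176000},
]
padded = pad_spans(raw, total_samples=20 * VAD_SR)
groups = group_spans(padded, max_gap_samples=3.0 * VAD_SR)
print(padded)
print([len(g) for g in groups])  # [2, 1]
```

In this example the first two spans touch once padded and end up in the same group, while the third one starts more than 3 seconds later and therefore opens a second audio file.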





In [10]:
! cat transcription_test.srt

1
00:00:01,114 --> 00:00:05,114
This is a live recording and a test for live transcription.

2
00:00:05,114 --> 00:00:10,114
My name is Benjamin and I'm talking to you, the avatar.

3
00:00:10,114 --> 00:00:16,114
What I want to know is how many people live in Paris in 2023.



This is a clear success!
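
The cue times in the SRT file are in the time base of the original recording, even though Whisper only saw the concatenated speech with the silences removed. The sketch below illustrates, under simplified assumptions, how the per-span `offset` bookkeeping of the transcription cell maps a time inside a chunk back to the original audio. The names `annotate` and `to_original_time` and the sample spans are hypothetical, and the fallback mirrors only the one the cell uses for start timestamps.

```python
from typing import Dict, List


def annotate(group: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Add chunk_start, chunk_end and offset to spans given in seconds."""
    out = []
    elapsed = 0.0   # position inside the concatenated chunk
    offset = 0.0    # total silence removed before the current span
    prev_end = 0.0  # end of the previous span in the original audio
    for span in group:
        offset += span["start"] - prev_end
        length = span["end"] - span["start"]
        out.append({
            **span,
            "chunk_start": elapsed,
            "chunk_end": elapsed + length,
            "offset": offset,
        })
        elapsed += length
        prev_end = span["end"]
    return out


def to_original_time(t: float, spans: List[Dict[str, float]]) -> float:
    """Map a time inside the concatenated chunk to the original recording."""
    for span in spans:
        if span["chunk_start"] <= t <= span["chunk_end"]:
            return t + span["offset"]
    # Fallback: use the first span's offset when no span contains t
    return t + spans[0]["offset"]


# Two speech spans (seconds in the original audio) merged into one chunk
spans = annotate([{"start": 1.0, "end": 3.0}, {"start": 5.0, "end": 6.0}])
print(to_original_time(0.5, spans))  # 1.5
print(to_original_time(2.5, spans))  # 5.5
```

The second call shows the effect of the removed silence: 2.5 s into the chunk is 0.5 s into the second span, which starts at 5.0 s in the recording.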

Now let us try a similar technique on an audio stream.

## Transcription of a live stream


In [11]:
import numpy as np
import pyaudio
import whisper

# Define audio stream parameters
FORMAT = pyaudio.paInt16
CHANNELS = 1  # mono: no need for left and right here
RATE = 16000  # sampling rate (number of audio samples per second)
CHUNK_TIME = 5  # chunk length in seconds
CHUNK = RATE * CHUNK_TIME  # number of samples per chunk (80000)

# Create PyAudio object
p = pyaudio.PyAudio()

# Open audio stream
stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK)

# Initialize Whisper model
model = whisper.load_model("base")

try:
    print("Start speaking... (interrupt the kernel to stop)")

    while True:
        # Read raw 16-bit PCM bytes; do not raise if samples were dropped
        # while the previous chunk was being transcribed
        data = stream.read(CHUNK, exception_on_overflow=False)

        # Whisper expects float32 samples in [-1, 1], not raw bytes
        audio = np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0

        # Transcribe audio chunk
        result = model.transcribe(audio)

        # Print the text of this chunk immediately
        print(result["text"])

except KeyboardInterrupt:
    print("\nExiting...")

finally:
    # Stop and close the stream
    stream.stop_stream()
    stream.close()

    # Close PyAudio
    p.terminate()

ALSA lib confmisc.c:855:(parse_card) cannot find card '0'
ALSA lib conf.c:5204:(_snd_config_evaluate) function snd_func_card_inum returned error: No such file or directory
ALSA lib confmisc.c:422:(snd_func_concat) error evaluating strings
ALSA lib conf.c:5204:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1342:(snd_func_refer) error evaluating name
ALSA lib conf.c:5204:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5727:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2675:(snd_pcm_open_noupdate) Unknown PCM sysdefault
ALSA lib confmisc.c:855:(parse_card) cannot find card '0'
ALSA lib conf.c:5204:(_snd_config_evaluate) function snd_func_card_inum returned error: No such file or directory
ALSA lib confmisc.c:422:(snd_func_concat) error evaluating strings
ALSA lib conf.c:5204:(_snd_config_evaluate) function snd_func_concat returned error: No

OSError: [Errno -9996] Invalid input device (no default output device)

Since I am running this notebook in the cloud, my own machine's microphone is not available.
Let us skip this part for now.
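
One way to exercise chunked processing without a microphone is to replay a WAV file in fixed-size pieces. This is only a sketch of that idea using the standard-library `wave` module, not something the notebook actually ran: the helper names `iter_wav_chunks` and `write_silence` are hypothetical, and the demo uses a generated silent file rather than real speech. Each yielded chunk holds raw PCM bytes in the file's sample format.

```python
import os
import tempfile
import wave
from typing import Iterator


def iter_wav_chunks(path: str, chunk_seconds: float) -> Iterator[bytes]:
    """Yield successive chunks of raw PCM bytes from a WAV file."""
    with wave.open(path, "rb") as wav:
        frames_per_chunk = int(wav.getframerate() * chunk_seconds)
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data


def write_silence(path: str, seconds: float, rate: int = 16000) -> None:
    """Write a mono 16-bit WAV file containing only silence (demo input)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    write_silence(demo_path, seconds=12)
    # 5 s chunks of a 12 s file: two full chunks and a 2 s remainder
    sizes = [len(chunk) for chunk in iter_wav_chunks(demo_path, 5)]

print(sizes)  # [160000, 160000, 64000]
```

With a real recording converted to 16 kHz mono 16-bit PCM, each chunk could then be turned into a float32 array and passed to the model in place of the microphone data.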

## Getting an avatar

We use the method from the MakeItTalk paper here.

In [16]:
! pip install opencv-python face_alignment scikit-learn pydub soundfile librosa pysptk pyworld resemblyzer tensorboardX

Collecting face_alignment
  Using cached face_alignment-1.4.1-py2.py3-none-any.whl.metadata (7.4 kB)
Collecting pydub
  Using cached pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting soundfile
  Using cached soundfile-0.12.1-py2.py3-none-manylinux_2_31_x86_64.whl.metadata (14 kB)
Collecting librosa
  Using cached librosa-0.10.1-py3-none-any.whl.metadata (8.3 kB)
Collecting pysptk
  Using cached pysptk-0.2.2-cp310-cp310-linux_x86_64.whl
Collecting pyworld
  Using cached pyworld-0.3.4-cp310-cp310-linux_x86_64.whl
Collecting resemblyzer
  Using cached Resemblyzer-0.1.4-py3-none-any.whl.metadata (5.8 kB)
Collecting scikit-image (from face_alignment)
  Using cached scikit_image-0.22.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)
Collecting audioread>=2.1.9 (from librosa)
  Using cached audioread-3.0.1-py3-none-any.whl.metadata (8.4 kB)
Collecting pooch>=1.0 (from librosa)
  Using cached pooch-1.8.1-py3-none-any.whl.metadata (9.5 kB)
Collecting sox

In [18]:
! git clone https://github.com/yzhou359/MakeItTalk
# ! export PYTHONPATH=/content/MakeItTalk:$PYTHONPATH

fatal: destination path 'MakeItTalk' already exists and is not an empty directory.


In [None]:
! mkdir -p examples/dump
! mkdir -p examples/ckpt
! pip install gdown
! gdown -O examples/ckpt/ckpt_autovc.pth "https://drive.google.com/uc?id=1ZiwPp_h62LtjU0DwpelLUoodKPR85K7x"
! gdown -O examples/ckpt/ckpt_content_branch.pth "https://drive.google.com/uc?id=1r3bfEvTVl6pCNw5xwUhEglwDHjWtAqQp"
! gdown -O examples/ckpt/ckpt_speaker_branch.pth "https://drive.google.com/uc?id=1rV0jkyDqPW-aDJcj7xSO6Zt1zSXqn1mu"
! gdown -O examples/ckpt/ckpt_116_i2i_comb.pth "https://drive.google.com/uc?id=1i2LJXKp-yWKIEEgJ7C6cE3_2NirfY_0a"
! gdown -O examples/dump/emb.pickle "https://drive.google.com/uc?id=18-0CYl5E6ungS3H4rRSHjfYvvm-WwjTI"
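
Downloads from shared drives can fail silently, so it may be worth confirming that every file arrived before moving on. The sketch below is a minimal check under the assumption that the paths are relative to the directory in which the download cell ran; the helper name `missing_files` is hypothetical, and the file list is copied from the commands above.

```python
from pathlib import Path
from typing import Iterable, List

# File names taken from the gdown commands above
EXPECTED_FILES = [
    "examples/ckpt/ckpt_autovc.pth",
    "examples/ckpt/ckpt_content_branch.pth",
    "examples/ckpt/ckpt_speaker_branch.pth",
    "examples/ckpt/ckpt_116_i2i_comb.pth",
    "examples/dump/emb.pickle",
]


def missing_files(root: str, expected: Iterable[str] = EXPECTED_FILES) -> List[str]:
    """Return the expected files that are absent or empty under `root`."""
    base = Path(root)
    missing = []
    for name in expected:
        path = base / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


# Assumption: the download cell was run from the current working directory
for name in missing_files("."):
    print("missing or empty:", name)
```

An empty result means all five files exist and are non-empty; it does not verify that their contents are valid checkpoints.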