[![notebook shield](https://img.shields.io/static/v1?label=&message=Notebook&color=blue&style=for-the-badge&logo=googlecolab&link=https://colab.research.google.com/github/ArthurFDLR/whisper-youtube/blob/main/whisper_youtube.ipynb)](https://colab.research.google.com/github/Majdoddin/nlp/blob/main/Pyannote_plays_and_Whisper_rhymes_v_2_0.ipynb)
[![repository shield](https://img.shields.io/static/v1?label=&message=Repository&color=blue&style=for-the-badge&logo=github&link=https://github.com/openai/whisper)](https://github.com/majdoddin/nlp)

# Whisper's transcription plus Pyannote's Diarization

**Update** - [@johnwyles](https://github.com/johnwyles) added HTML output for audio/video files from Google Drive, along with some fixes.

Using Whisper's new word-level timestamps, the transcribed words are highlighted as the video plays, with optional autoscroll. The layout on small screens is also improved.

Moreover, the model is now loaded only once, so the whole pipeline runs much faster. You can also hardcode your Hugging Face token.

---
Andrej Karpathy [suggested](https://twitter.com/karpathy/status/1574476200801538048?s=20&t=s5IMMXOYjBI6-91dib6w8g) training a classifier on top of the OpenAI [Whisper](https://openai.com/blog/whisper/) model features to identify the speaker, so we can visualize the speaker in the transcript. But, as [pointed out](https://twitter.com/tarantulae/status/1574493613362388992?s=20&t=s5IMMXOYjBI6-91dib6w8g) by Christian Perone, features from Whisper are unlikely to work well for speaker recognition, since the model's main objective is essentially to ignore speaker differences.

In the following, I use [**`pyannote-audio`**](https://github.com/pyannote/pyannote-audio), a speaker diarization toolkit by Hervé Bredin, to identify the speakers, and then match the result with Whisper's transcriptions, linked to the video. The input can be a YouTube video or a video/audio file (also on Google Drive). I try it on a [Customer Support Call](https://youtu.be/hpZFJctBUHQ). Check the result [**here**](https://majdoddin.github.io/dyson.html).

To make it easier to match the transcriptions to diarizations by speaker change, Sarah Kaiser [suggested](https://github.com/openai/whisper/discussions/264#discussioncomment-3825375) running pyannote.audio first and then just running Whisper on the split-by-speaker chunks.
For the sake of performance (and transcription quality?), we join the audio segments into a single audio file with a silent spacer as a separator, and run Whisper on it. Enjoy it!

(For the sake of performance, I also tried joining the audio segments into a single audio file with a silent (or beep) spacer as a separator and running Whisper on it; see it on [colab](https://colab.research.google.com/drive/1HuvcY4tkTHPDzcwyVH77LCh_m8tP-Qet?usp=sharing). It [works](https://majdoddin.github.io/lexicap.html) on some audio and fails on other audio (Dyson's interview). The problem is that Whisper does not reliably produce a timestamp at a spacer. See the discussions [#139](https://github.com/openai/whisper/discussions/139) and [#29](https://github.com/openai/whisper/discussions/29).)
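
To make the spacer idea concrete, here is a minimal sketch of the bookkeeping it requires. It is not the notebook's code: `concat_offsets` and `locate_segment` are hypothetical helper names, and the layout (one spacer before every segment) is an assumption. The sketch only computes where each segment would sit inside the joined file, so that a timestamp returned for the joined audio can be traced back to its segment.

```python
def concat_offsets(durations_ms, spacer_ms=2000):
    """Return the start offset (ms) of each segment inside the joined audio.

    Assumed layout: spacer, segment 0, spacer, segment 1, ...
    """
    offsets = []
    cursor = 0
    for duration in durations_ms:
        cursor += spacer_ms        # silent spacer before every segment
        offsets.append(cursor)
        cursor += duration
    return offsets


def locate_segment(t_ms, offsets, durations_ms):
    """Map a timestamp in the joined audio to a segment index.

    Returns None when the timestamp falls inside a spacer.
    """
    for idx, (start, duration) in enumerate(zip(offsets, durations_ms)):
        if start <= t_ms < start + duration:
            return idx
    return None


durations = [9420, 7389, 512]          # made-up segment lengths in ms
offsets = concat_offsets(durations)
print(offsets)
print(locate_segment(14000, offsets, durations))
```

A timestamp that maps to `None` lies in a spacer, which is exactly the case where Whisper's output cannot be attributed to a speaker.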

The Markdown form used below is from [@ArthurFDLR](https://github.com/ArthurFDLR/whisper-youtube/).   

## Preparing the audio file

In [3]:
from pathlib import Path
import pandas as pd
import os

%cd ..
setup = True
videos = {}
urls_path = "urls"
audio_path = "audio"
transcriptions_path = "transcribed_audio"
audio_title = "joerogan"
access_token = ""

with open("credentials.txt", "r") as file:
    for line in file:
        # Split on the first colon only and drop the trailing newline.
        key, value = line.strip().split(":", 1)

        if key == "huggingface":
            access_token = value.strip()

for file in os.listdir(urls_path):
    if os.path.isfile(os.path.join(urls_path, file)):
        with open(os.path.join(urls_path, file)) as f:
            for line in f:
                # Each line has the form `name;url`.
                name, url = line.strip().split(";")
                videos[name] = url

/home/guayo/Desktop/programming/python/NLP/YT-WHISPER_SUMMARIZATION


In [4]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [5]:
import sys
# sys.path.append("..")

In [6]:
print("Current path -->", os.getcwd())
print("Other:", sys.path[0])

Current path --> /home/guayo/Desktop/programming/python/NLP/YT-WHISPER_SUMMARIZATION
Other: /home/guayo/Desktop/programming/python/NLP/YT-WHISPER_SUMMARIZATION/whisper-speaker-recognition


`Careful!` -- the following cell removes previous work to make way for the new one

In [None]:
# !rm -r "{audio_path}/*"

### Setup

Custom build of `ffmpeg` as [recommended](https://github.com/yt-dlp/yt-dlp#strongly-recommended) by `yt-dlp`.

In [17]:
if setup:
  !wget -O - -q  https://github.com/yt-dlp/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-gpl.tar.xz | xz -qdc | tar -x

### Downloading the video from youtube

In [44]:
# The code below should work, but returns this error:
# CouldntDecodeError: Decoding failed. ffmpeg returned error code: 1
# ...
#   libswresample   3.  9.100 /  3.  9.100
#   libpostproc    55.  9.100 / 55.  9.100
# [wav @ 0x5594e30956c0] invalid start code [0][0][0][24] in RIFF header
# input.wav: Invalid data found when processing input
#
# We believe it is because the format that the file is downloaded in is .mp4,
# and then it is set to .wav with a simple rename. 

# from yt_downloader import GetFromYoutube

# url_kanyewest = "https://www.youtube.com/watch?v=qxOeWuAHOiw"
# GetFromYoutube(url_kanyewest).get_audio_and_rename("input", ".wav")

In [25]:
# Use this to download files instead:
from yt_dlp import YoutubeDL

for video_name, video_url in videos.items():

    !yt-dlp -xv --ffmpeg-location ffmpeg-master-latest-linux64-gpl/bin --audio-format wav  -o "{str(audio_path) + '/' + video_name + '/'}input.wav" -- {video_url}

[debug] Command-line config: ['-xv', '--ffmpeg-location', 'ffmpeg-master-latest-linux64-gpl/bin', '--audio-format', 'wav', '-o', '../diarization_results/JRE-mileyCirus-02092020/input.wav', '--', 'https://www.youtube.com/watch?v=D7WUMXKV-FE']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.10.13 [b634ba742] (pip)
[debug] Python 3.10.13 (CPython x86_64 64bit) - Linux-5.19.0-32-generic-x86_64-with-glibc2.35 (OpenSSL 3.0.11 19 Sep 2023, glibc 2.35)
[debug] exe versions: ffmpeg N-112724-g96d2a40b9e-20231109 (setts), ffprobe N-112724-g96d2a40b9e-20231109
[debug] Optional libraries: Cryptodome-3.19.0, brotli-1.1.0, certifi-2023.07.22, mutagen-1.47.0, sqlite3-3.41.2, websockets-12.0
[debug] Proxy map: {}
[debug] Loaded 1890 extractors
[youtube] Extracting URL: https://www.youtube.com/watch?v=D7WUMXKV-FE
[youtube] D7WUMXKV-FE: Downloading webpage
[youtube] D7WUMXKV-FE: Downloading ios player API JSON
[youtube] D7WUMX
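
The same download can also be driven from Python, which is what the otherwise unused `YoutubeDL` import hints at. The following is a sketch under assumptions, not the notebook's method: `build_ydl_opts` and `download_audio` are hypothetical helpers, and the options are meant to be roughly equivalent to `yt-dlp -x --audio-format wav`.

```python
import os


def build_ydl_opts(audio_dir, video_name, ffmpeg_dir):
    """Build yt-dlp options that extract the audio track as WAV."""
    return {
        "format": "bestaudio/best",
        # The extension is replaced by the audio postprocessor.
        "outtmpl": os.path.join(audio_dir, video_name, "input.%(ext)s"),
        "ffmpeg_location": ffmpeg_dir,
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "wav"},
        ],
    }


def download_audio(url, opts):
    """Download one URL with the given options."""
    # Imported lazily so that building the options does not need yt-dlp.
    from yt_dlp import YoutubeDL

    with YoutubeDL(opts) as ydl:
        ydl.download([url])


opts = build_ydl_opts("audio", "joerogan", "ffmpeg-master-latest-linux64-gpl/bin")
# download_audio(video_url, opts)  # needs network access and ffmpeg
```

Keeping the options in a function makes it easy to reuse them for every entry in `videos`.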

## Prepending a spacer

`pyannote.audio` seems to miss the first 0.5 seconds of the audio; therefore, we prepend a spacer.

In [40]:
if setup:
    !pip install pydub



In [32]:
from pydub import AudioSegment

video_names = list(videos.keys())
for idx, directory in enumerate(video_names):
    print(f"Running the spacer on video {idx + 1} out of {len(video_names)}...")
    path = os.path.join(audio_path, directory)
    input_file = os.path.join(path, "input.wav")
    output_file = os.path.join(path, "input_prep.wav")

    spacermilli = 2000
    spacer = AudioSegment.silent(duration=spacermilli)

    audio = AudioSegment.from_wav(input_file)

    audio = spacer.append(audio, crossfade=0)

    audio.export(output_file, format='wav')
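
Because of the prepended silence, every timestamp measured on `input_prep.wav` is later than the same moment in the original recording. A minimal sketch of the correction, assuming the 2000 ms spacer used above (`to_original_ms` is a hypothetical helper name):

```python
SPACER_MS = 2000  # same value as `spacermilli` in the cell above


def to_original_ms(t_prepped_ms, spacer_ms=SPACER_MS):
    """Map a timestamp on the spacer-prefixed audio back to the original."""
    # Anything that falls inside the spacer is clamped to the very start.
    return max(t_prepped_ms - spacer_ms, 0)


print(to_original_ms(2363))
```

The HTML generation near the end of the notebook applies the same subtraction and clamping when it computes `shift`.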

# Whisper's Transcriptions

In [None]:
from whisper_analisis import *

model = "base.en"
transcriber = TranscribeAudio(model)

video_names = list(videos.keys())
for idx, directory in enumerate(video_names):
    print(f"Transcribing video {idx + 1} out of {len(video_names)}...", )

    file = "input_prep.wav"   #file shall be in .wav format preferably
    audio_file = os.path.join(os.path.join(audio_path, directory), file)

    output_name = str(directory + "_transcribed.csv")  #The name of the directory is the name of the podcast
    transcriber.transcribe(audio_file, verbose=False)    #put False as a parameter to turn off verbose
    transcriber.save_DF_as_CSV(transcriptions_path, output_name)

# Getting split point for the files

The diarization process is very demanding. A video longer than 15 minutes is very likely to exhaust the memory of the host computer.  
Therefore, the audio needs to be split into chunks shorter than 10-15 minutes each in order to be processed.  
This comes with a problem: we don't want to cut the audio while a speaker is talking, since this would mess up the diarization. Moreover, speakers must be recognized consistently across audio chunks: if we have speakers 1, 2 and 3 in one chunk, they should be identified as such in the next one, so that their dialogues are not mixed up and the diarization stays consistent.

The code below is an approach to tackle these issues and prepare the audios for diarization.

In [65]:
def find_split_points(audio_df, avg_split_duration=10):

    """
    Find split points in a transcription without cutting a speaker's dialogue.
    The split cannot be performed at exactly x minutes because we could cut a speaker's words, therefore
    we look for the closest timestamp to our split duration where a speaker stops talking.

    Parameters
    ----------
    audio_df: pd.DataFrame
        dataframe containing the transcribed audio, with an "end" column (assumed to be in seconds)
    avg_split_duration: int
        the average duration of a split, in minutes

    Returns
    -------
    list
        split points, starting at 0 and ending at the last timestamp
    """

    start = 0   # start begins at 0
    end = start + ((avg_split_duration + 5) * 60)     # end begins at 15 min
    split_points = [0]

    while end < audio_df["end"].max():

        # earliest segment end within the last 5 minutes of the current window
        min_timestamp = audio_df.query("end < @end & end > (@end - 5 * 60)")["end"].min()
        if pd.isna(min_timestamp):
            min_timestamp = end     # no segment ends in the window: fall back to a hard cut
        start = min_timestamp       # start should be the newly found split point
        end = start + ((avg_split_duration + 5) * 60)     # add (split_size + 5) minutes to the latest split point
        split_points.append(min_timestamp)  # save the split point

    split_points.append(audio_df["end"].max())

    return split_points
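
To see the windowing logic in isolation, here is a dependency-free sketch of the same idea on a plain list of segment end times. `find_split_points_plain` and `slack_min` are hypothetical names, and the hard-cut fallback for a window in which no segment ends is an assumption of this sketch.

```python
def find_split_points_plain(segment_ends, avg_split_min=10, slack_min=5):
    """Pick split points (seconds) that coincide with segment ends.

    segment_ends: sorted end times, in seconds, of the transcribed segments.
    """
    if not segment_ends:
        return [0]
    last = segment_ends[-1]
    step = (avg_split_min + slack_min) * 60
    points = [0]
    window_end = step
    while window_end < last:
        window_start = window_end - slack_min * 60
        candidates = [e for e in segment_ends if window_start < e < window_end]
        # Earliest segment end in the window; hard cut if there is none.
        cut = min(candidates) if candidates else window_end
        points.append(cut)
        window_end = cut + step
    points.append(last)
    return points


ends = list(range(30, 2401, 30))   # toy data: a segment ends every 30 s
print(find_split_points_plain(ends))
```

With the toy data, each cut lands on the first segment end after the 10-minute mark of its chunk, so the chunks stay close to the requested average length.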

In [71]:
from pydub import AudioSegment

def split_wav_pydub(file, split_points, export_path):
    # Load the audio file
    audio = AudioSegment.from_wav(file)

    # Split the file and save each part
    num_splits = len(split_points) - 1
    for i in range(num_splits):
        # Split points come from the transcription (assumed to be in seconds),
        # while pydub slices in milliseconds.
        start = int(split_points[i] * 1000)
        end = int(split_points[i+1] * 1000)
        # Extract the split audio
        split_audio = audio[start:end]
        # Export the split audio to a new .wav file
        split_audio.export(os.path.join(export_path, f'split_{i+1}.wav'), format='wav')

    return num_splits    # This is the number of splits performed

In [72]:
def audio_splitter(file_path, transcription_path, avg_split_duration, export_path):
    audio_df = pd.read_csv(transcription_path)
    split_points = find_split_points(audio_df, avg_split_duration)
    split_wav_pydub(file_path, split_points, export_path)

In [73]:
audio_splitter(
    os.path.join(audio_path, "flagrant-andrewHubermam-17102022/input_prep.wav"), 
    os.path.join(transcriptions_path, "flagrant-andrewHubermam-17102022_transcribed.csv"),
    10,
    audio_path + "/tests"
)

# Pyannote's Diarization

[`pyannote.audio`](https://github.com/pyannote/pyannote-audio) is an open-source toolkit written in Python for **speaker diarization**.

Based on [`PyTorch`](https://pytorch.org) machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines.

`pyannote.audio` also comes with pretrained [models](https://huggingface.co/models?other=pyannote-audio-model) and [pipelines](https://huggingface.co/models?other=pyannote-audio-pipeline) covering a wide range of domains for voice activity detection, speaker segmentation, overlapped speech detection, and speaker embedding, reaching state-of-the-art performance for most of them.

Installing `pyannote.audio`.

In [9]:
if setup: 
    !pip install light-the-torch

In [12]:
if setup: 
    !ltt install torch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1


In [13]:
if setup:
    !pip install  git+https://github.com/hmmlearn/hmmlearn.git
    !pip install  git+https://github.com/pyannote/pyannote-audio.git@develop

Collecting git+https://github.com/hmmlearn/hmmlearn.git
  Cloning https://github.com/hmmlearn/hmmlearn.git to /tmp/pip-req-build-iupifet1
  Running command git clone --filter=blob:none --quiet https://github.com/hmmlearn/hmmlearn.git /tmp/pip-req-build-iupifet1
  Resolved https://github.com/hmmlearn/hmmlearn.git to commit 23c0f132bd66280b46b7f898dc51812629c8bdb7
  Preparing metadata (setup.py) ... done
Collecting git+https://github.com/pyannote/pyannote-audio.git@develop
  Cloning https://github.com/pyannote/pyannote-audio.git (to revision develop) to /tmp/pip-req-build-804tl2dl
  Running command git clone --filter=blob:none --quiet https://github.com/pyannote/pyannote-audio.git /tmp/pip-req-build-804tl2dl
  Resolved https://github.com/pyannote/pyannote-audio.git to commit e0544b8ce16481001beac195275de55e4525521a
  Running command git submodule update --init --recursive -q
  Preparing metadata (setup.py) ... done
Collecting torch>=2.0.0 (from pyannote.audio==3.0.1)
  Ob

**Important:** To load the pyannote speaker diarization pipeline,

* accept the user conditions on both [hf.co/pyannote/speaker-diarization](https://hf.co/pyannote/speaker-diarization) and [hf.co/pyannote/segmentation](https://huggingface.co/pyannote/segmentation).
* paste your access_token or login using `notebook_login` below
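
The token lookup can be sketched as follows. This is an illustration, not the notebook's code: `parse_credentials` and `resolve_token` are hypothetical helpers, the `key:value` file format mirrors `credentials.txt` from the first cell, and falling back to an `HF_TOKEN` environment variable is an assumption of this sketch.

```python
import os


def parse_credentials(text):
    """Parse `key:value` lines, e.g. `huggingface:hf_xxx`."""
    creds = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:                      # skip lines without a colon
            creds[key.strip()] = value.strip()
    return creds


def resolve_token(creds, env=os.environ):
    """Pick the token to pass as `use_auth_token`."""
    # Prefer the credentials file, then the environment. `True` asks the
    # library to use the token stored by a previous login.
    return creds.get("huggingface") or env.get("HF_TOKEN") or True


creds = parse_credentials("huggingface: hf_example\n\nnot a credential line\n")
print(resolve_token(creds, env={}))
```

Stripping whitespace matters here: a token read with a trailing newline is sent as is and may be rejected.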

In [14]:
if setup:
    !pip install pyannote.audio


In [15]:
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization', use_auth_token= (access_token) or True )

  warn(f"Failed to load image Python extension: {e}")
  torchaudio.set_audio_backend("soundfile")
  from .autonotebook import tqdm as notebook_tqdm
  torchaudio.set_audio_backend("soundfile")
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.1.1. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../../../../../.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.0.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.1.0+cu121. Bad things might happen unless you revert torch to 1.x.


In [16]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline.to(device)

RuntimeError: cuDNN version incompatibility: PyTorch was compiled  against (8, 9, 2) but found runtime version (8, 5, 0). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN.

Running pyannote.audio to generate the diarizations.

In [None]:
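for directory in os.listdir(audio_path):
    base_dir = os.path.join(audio_path, directory)
    # Split at speaker pauses found in the transcription (about every 10 minutes)
    audio_df = pd.read_csv(os.path.join(transcriptions_path, directory + "_transcribed.csv"))
    split_points = find_split_points(audio_df, 10)
    num_splits = split_wav_pydub(os.path.join(base_dir, 'input_prep.wav'), split_points, base_dir)

    for idx in range(num_splits):
        DEMO_FILE = {'uri': 'blabla', 'audio': os.path.join(base_dir, f'split_{idx+1}.wav')}
        dz = pipeline(DEMO_FILE)

        # Store each diarization next to the audio it belongs to
        with open(os.path.join(base_dir, f"diarization_{idx+1}.txt"), "w") as text_file:
            text_file.write(str(dz))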

    #print(*list(dz.itertracks(yield_label = True))[:10], sep="\n")

# Preparing audio files according to the diarization

In [58]:
def millisec(timeStr):
  # Convert an 'HH:MM:SS.mmm' timestamp to milliseconds.
  spl = timeStr.split(":")
  s = int((int(spl[0]) * 60 * 60 + int(spl[1]) * 60 + float(spl[2])) * 1000)
  return s
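
Since `millisec` goes through floating point before truncating, a result can end up one millisecond short: in the group listing further down, the end time `00:02:57.431` appears as `177430`. Where exact values matter, an integer-only variant can be used instead. The following is a sketch; `millisec_exact` is a hypothetical name and not part of the notebook.

```python
def millisec_exact(time_str):
    """Convert 'HH:MM:SS.mmm' to integer milliseconds without float rounding."""
    hours, minutes, seconds = time_str.strip().split(":")
    whole, _, frac = seconds.partition(".")
    millis = int((frac + "000")[:3])   # pad or cut the fraction to 3 digits
    total_seconds = (int(hours) * 60 + int(minutes)) * 60 + int(whole)
    return total_seconds * 1000 + millis


print(millisec_exact("00:02:57.431"))
```

For slicing audio the one-millisecond difference is harmless; it only matters when timestamps are compared for equality.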

Grouping the diarization segments according to the speaker.

In [59]:
import re

for directory in os.listdir(audio_path):   # the per-video folders live under audio_path
  base_dir = os.path.join(audio_path, directory)
  dzs = open(os.path.join(base_dir, 'diarization_1.txt')).read().splitlines()

  groups = []
  g = []
  lastend = 0

  for d in dzs:
    if g and (g[0].split()[-1] != d.split()[-1]):      #speaker changed
      groups.append(g)
      g = []

    g.append(d)

    end = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=d)[1]
    end = millisec(end)
    if (lastend > end):       #segment engulfed by a previous segment
      groups.append(g)
      g = []
    else:
      lastend = end
  if g:
    groups.append(g)
  print(*groups, sep='\n')

['[ 00:00:02.363 -->  00:00:05.093] Q SPEAKER_01', '[ 00:00:06.015 -->  00:00:07.960] R SPEAKER_01', '[ 00:00:10.503 -->  00:00:11.783] S SPEAKER_01']
['[ 00:00:15.605 -->  00:00:22.994] A SPEAKER_00']
['[ 00:00:20.998 -->  00:00:21.510] T SPEAKER_01']
['[ 00:00:23.626 -->  00:00:23.882] B SPEAKER_00', '[ 00:00:24.496 -->  00:00:37.807] C SPEAKER_00']
['[ 00:00:37.807 -->  00:02:57.431] U SPEAKER_01']
['[ 00:02:54.087 -->  00:02:55.179] D SPEAKER_00']
['[ 00:02:56.220 -->  00:03:23.745] E SPEAKER_00']
['[ 00:03:23.745 -->  00:03:36.493] V SPEAKER_01']
['[ 00:03:35.912 -->  00:03:42.005] F SPEAKER_00']
['[ 00:03:40.878 -->  00:03:41.117] W SPEAKER_01']
['[ 00:03:42.005 -->  00:03:45.827] X SPEAKER_01']
['[ 00:03:45.827 -->  00:03:47.005] G SPEAKER_00']
['[ 00:03:47.005 -->  00:03:49.001] Y SPEAKER_01']
['[ 00:03:48.506 -->  00:03:50.486] H SPEAKER_00', '[ 00:03:51.237 -->  00:03:57.295] I SPEAKER_00']
['[ 00:03:57.295 -->  00:04:18.284] Z SPEAKER_01']
['[ 00:04:17.141 -->  00:04:42.312]
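
The grouping rule is easier to follow on parsed tuples than on raw text lines. The sketch below restates it under that assumption: `group_turns` is a hypothetical helper, and the sample data is the first few segments from the output above, converted to milliseconds.

```python
def group_turns(segments):
    """Group consecutive segments of one speaker.

    segments: list of (start_ms, end_ms, speaker), sorted by start time.
    """
    groups, current, last_end = [], [], 0
    for seg in segments:
        _, end, speaker = seg
        if current and current[0][2] != speaker:   # speaker changed
            groups.append(current)
            current = []
        current.append(seg)
        if last_end > end:      # segment engulfed by an earlier segment
            groups.append(current)
            current = []
        else:
            last_end = end
    if current:
        groups.append(current)
    return groups


segs = [
    (2363, 5093, "SPEAKER_01"), (6015, 7960, "SPEAKER_01"),
    (10503, 11783, "SPEAKER_01"), (15605, 22994, "SPEAKER_00"),
    (20998, 21510, "SPEAKER_01"), (23626, 23882, "SPEAKER_00"),
    (24496, 37807, "SPEAKER_00"),
]
print([len(g) for g in group_turns(segs)])
```

An engulfed segment, such as the short interjection by `SPEAKER_01`, gets a group of its own, so the following turn of the other speaker starts a fresh group.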

Save the audio part corresponding to each diarization group.

In [60]:
audio = AudioSegment.from_wav("output_1.wav")
gidx = -1
for g in groups:
  start = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=g[0])[0]
  end = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=g[-1])[1]
  start = millisec(start) #- spacermilli
  end = millisec(end)  #- spacermilli
  gidx += 1
  audio[start:end].export(str(gidx) + '.wav', format='wav')
  print(f"group {gidx}: {start}--{end}")

group 0: 2363--11783
group 1: 15605--22994
group 2: 20998--21510
group 3: 23626--37807


group 4: 37807--177430
group 5: 174087--175179
group 6: 176220--203745
group 7: 203745--216493
group 8: 215912--222005
group 9: 220878--221117
group 10: 222005--225827
group 11: 225827--227005
group 12: 227005--229001
group 13: 228506--237295
group 14: 237295--258284
group 15: 257141--282312
group 16: 269325--269889
group 17: 282943--287943
group 18: 288933--388967
group 19: 388370--406800
group 20: 406800--488216
group 21: 422141--422517
group 22: 423609--425725
group 23: 427670--428131
group 24: 469496--470332
group 25: 489581--599991


Freeing up some memory

In [61]:
del DEMO_FILE, pipeline, spacer, audio, dz

# Whisper's Transcriptions

Installing OpenAI Whisper.

In [64]:
if setup: 
    !pip install git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-o4tc6ghr
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-o4tc6ghr
  Resolved https://github.com/openai/whisper.git to commit fcfeaf1b61994c071bba62da47d7846933576ac9
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting triton==2.0.0 (from openai-whisper==20231106)
  Using cached triton-2.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (63.3 MB)
INFO: pip is looking at multiple versions of torch to determine which version is compatible with other requirements. This could take a while.
Collecting torch (from openai-whisper==20231106)
  Using cached torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl (619.9 MB)
Collecting nvidia-cuda-nvrtc-cu11==11.7.99 (from torch->o

Run whisper on all audio files. Whisper generates the transcription and writes it to a file.

In [65]:
import whisper
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = whisper.load_model('base.en', device = device)


In [66]:
import json
for i in range(len(groups)):
  audiof = str(i) + '.wav'
  result = model.transcribe(audio=audiof, language='en', word_timestamps=True)#, initial_prompt=result.get('text', ""))
  with open(str(i)+'.json', "w") as outfile:
    json.dump(result, outfile, indent=4)
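
With `word_timestamps=True`, each segment in the result carries a `words` list with per-word `start` and `end` times in seconds. A minimal sketch of reading that structure back follows; `iter_words` is a hypothetical helper and the sample dictionary is made up for illustration, it is not real Whisper output.

```python
def iter_words(result, shift_s=0.0):
    """Yield (start_s, end_s, word) for every word in a Whisper result dict."""
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            yield (w["start"] + shift_s, w["end"] + shift_s, w["word"])


# Made-up sample in the shape written to the .json files above.
sample = {
    "text": " Hello there",
    "segments": [
        {
            "start": 0.5,
            "end": 1.25,
            "text": " Hello there",
            "words": [
                {"word": " Hello", "start": 0.5, "end": 0.75},
                {"word": " there", "start": 0.75, "end": 1.25},
            ],
        }
    ],
}

for start, end, word in iter_words(sample, shift_s=2.0):
    print(f"{start:.2f} {end:.2f}{word}")
```

The `shift_s` argument plays the role of the group's start time, which has to be added because every group is transcribed from its own audio file starting at zero.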

# Generating the HTML and/or txt file from the Transcriptions and the Diarization

Change or add to the speaker names and colors below as you wish `(speaker, textbox color, speaker color)`.

In [67]:
speakers = {'SPEAKER_00':('Customer', '#e1ffc7', 'darkgreen'), 'SPEAKER_01':('Call Center', 'white', 'darkorange') }
def_boxclr = 'white'
def_spkrclr = 'orange'

In the generated HTML, the transcriptions for each diarization group are written in a box, with the speaker name on top. Clicking a transcription makes the embedded video jump to the corresponding time.
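
Word highlighting relies on each word's element `id` being its start time snapped to a 0.2 second grid, which the page script then looks up from the player's current time. The sketch below shows that snapping. The helper names are hypothetical, and the second function is an assumption about how to mirror JavaScript's `Math.round`, which rounds halves up, whereas Python's `round` rounds halves to the nearest even number.

```python
import math


def word_id(t_seconds):
    """Snap a time to the 0.2 s grid, formatted like the element ids."""
    return "{:.1f}".format(round(t_seconds * 5) / 5)


def word_id_js_style(t_seconds):
    """Same grid, but rounding halves up as JavaScript's Math.round does."""
    return "{:.1f}".format(math.floor(t_seconds * 5 + 0.5) / 5)


print(word_id(13.61), word_id(0.5), word_id_js_style(0.5))
```

The two variants differ only for times that sit exactly on a half step of the grid, so a mismatch between id and lookup should be rare, but it can explain an occasional word that is never highlighted.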

In [68]:

preS = '<!DOCTYPE html>\n<html lang="en">\n\n<head>\n\t<meta charset="UTF-8">\n\t<meta name="viewport" content="width=device-width, initial-scale=1.0">\n\t<meta http-equiv="X-UA-Compatible" content="ie=edge">\n\t<title>' + \
video_title+ \
'</title>\n\t<style>\n\t\tbody {\n\t\t\tfont-family: sans-serif;\n\t\t\tfont-size: 14px;\n\t\t\tcolor: #111;\n\t\t\tpadding: 0 0 1em 0;\n\t\t\tbackground-color: #efe7dd;\n\t\t}\n\n\t\ttable {\n\t\t\tborder-spacing: 10px;\n\t\t}\n\n\t\tth {\n\t\t\ttext-align: left;\n\t\t}\n\n\t\t.lt {\n\t\t\tcolor: inherit;\n\t\t\ttext-decoration: inherit;\n\t\t}\n\n\t\t.l {\n\t\t\tcolor: #050;\n\t\t}\n\n\t\t.s {\n\t\t\tdisplay: inline-block;\n\t\t}\n\n\t\t.c {\n\t\t\tdisplay: inline-block;\n\t\t}\n\n\t\t.e {\n\t\t\t/*background-color: white; Changing background color */\n\t\t\tborder-radius: 10px;\n\t\t\t/* Making border radius */\n\t\t\twidth: 50%;\n\t\t\t/* Making auto-sizable width */\n\t\t\tpadding: 0 0 0 0;\n\t\t\t/* Making space around letters */\n\t\t\tfont-size: 14px;\n\t\t\t/* Changing font size */\n\t\t\tmargin-bottom: 0;\n\t\t}\n\n\t\t.t {\n\t\t\tdisplay: inline-block;\n\t\t}\n\n\t\t#player-div {\n\t\t\tposition: sticky;\n\t\t\ttop: 20px;\n\t\t\tfloat: right;\n\t\t\twidth: 40%\n\t\t}\n\n\t\t#player {\n\t\t\taspect-ratio: 16 / 9;\n\t\t\twidth: 100%;\n\t\t\theight: auto;\n\n\t\t}\n\n\t\ta {\n\t\t\tdisplay: inline;\n\t\t}\n\t</style>\n\t<script>\n\t\tvar tag = document.createElement(\'script\');\n\t\ttag.src = "https://www.youtube.com/iframe_api";\n\t\tvar firstScriptTag = document.getElementsByTagName(\'script\')[0];\n\t\tfirstScriptTag.parentNode.insertBefore(tag, firstScriptTag);\n\t\tvar player;\n\t\tfunction onYouTubeIframeAPIReady() {\n\t\t\tplayer = new YT.Player(\'player\', {\n\t\t\t\t//height: \'210\',\n\t\t\t\t//width: \'340\',\n\t\t\t\tvideoId: \''+ \
video_id + \
'\',\n\t\t\t});\n\n\n\n\t\t\t// This is the source "window" that will emit the events.\n\t\t\tvar iframeWindow = player.getIframe().contentWindow;\n\t\t\tvar lastword = null;\n\n\t\t\t// So we can compare against new updates.\n\t\t\tvar lastTimeUpdate = "-1";\n\n\t\t\t// Listen to events triggered by postMessage,\n\t\t\t// this is how different windows in a browser\n\t\t\t// (such as a popup or iFrame) can communicate.\n\t\t\t// See: https://developer.mozilla.org/en-US/docs/Web/API/Window/postMessage\n\t\t\twindow.addEventListener("message", function (event) {\n\t\t\t\t// Check that the event was sent from the YouTube IFrame.\n\t\t\t\tif (event.source === iframeWindow) {\n\t\t\t\t\tvar data = JSON.parse(event.data);\n\n\t\t\t\t\t// The "infoDelivery" event is used by YT to transmit any\n\t\t\t\t\t// kind of information change in the player,\n\t\t\t\t\t// such as the current time or a playback quality change.\n\t\t\t\t\tif (\n\t\t\t\t\t\tdata.event === "infoDelivery" &&\n\t\t\t\t\t\tdata.info &&\n\t\t\t\t\t\tdata.info.currentTime\n\t\t\t\t\t) {\n\t\t\t\t\t\t// currentTime is emitted very frequently (milliseconds),\n\t\t\t\t\t\t// but we only care about whole second changes.\n\t\t\t\t\t\tvar ts = (data.info.currentTime).toFixed(1).toString();\n\t\t\t\t\t\tts = (Math.round((data.info.currentTime) * 5) / 5).toFixed(1);\n\t\t\t\t\t\tts = ts.toString();\n\t\t\t\t\t\tconsole.log(ts)\n\t\t\t\t\t\tif (ts !== lastTimeUpdate) {\n\t\t\t\t\t\t\tlastTimeUpdate = ts;\n\n\t\t\t\t\t\t\t// It\'s now up to you to format the time.\n\t\t\t\t\t\t\t//document.getElementById("time2").innerHTML = time;\n\t\t\t\t\t\t\tword = document.getElementById(ts)\n\t\t\t\t\t\t\tif (word) {\n\t\t\t\t\t\t\t\tif (lastword) {\n\t\t\t\t\t\t\t\t\tlastword.style.fontWeight = \'normal\';\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\tlastword = word;\n\t\t\t\t\t\t\t\t//word.style.textDecoration = \'underline\';\n\t\t\t\t\t\t\t\tword.style.fontWeight = \'bold\';\n\n\t\t\t\t\t\t\t\tlet toggle = document.getElementById("autoscroll");\n\t\t\t\t\t\t\t\tif (toggle.checked) {\n\t\t\t\t\t\t\t\t\tlet position = word.offsetTop - 20;\n\t\t\t\t\t\t\t\t\twindow.scrollTo({\n\t\t\t\t\t\t\t\t\t\ttop: position,\n\t\t\t\t\t\t\t\t\t\tbehavior: \'smooth\'\n\t\t\t\t\t\t\t\t\t});\n\t\t\t\t\t\t\t\t}\n\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t})\n\t\t}\n\t\tfunction jumptoTime(timepoint, id) {\n\t\t\tevent.preventDefault();\n\t\t\thistory.pushState(null, null, "#" + id);\n\t\t\tplayer.seekTo(timepoint);\n\t\t\tplayer.playVideo();\n\t\t}\n\t</script>\n</head>\n\n<body>\n\t<h2>'  + \
video_title + \
'</h2>\n\t<i>Click on a part of the transcription, to jump to its video, and get an anchor to it in the address\n\t\tbar<br><br></i>\n\t<div id="player-div">\n\t\t<div id="player"></div>\n\t\t<div><label for="autoscroll">auto-scroll: </label>\n\t\t\t<input type="checkbox" id="autoscroll" checked>\n\t\t</div>\n\t</div>\n  '


postS = '\t</body>\n</html>'

In [69]:
#import webvtt
import json
from datetime import timedelta

def timeStr(t):
  return '{0:02d}:{1:02d}:{2:06.2f}'.format(round(t // 3600),
                                                round(t % 3600 // 60),
                                                t % 60)

html = [preS]   # collect the HTML fragments; joined into one string at the end
txt = []
gidx = -1
for g in groups:
  shift = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=g[0])[0]
  shift = millisec(shift) - spacermilli #the start time in the original video
  shift=max(shift, 0)

  gidx += 1

  captions = json.load(open(str(gidx) + '.json'))['segments']

  if captions:
    speaker = g[0].split()[-1]
    boxclr = def_boxclr
    spkrclr = def_spkrclr
    if speaker in speakers:
      speaker, boxclr, spkrclr = speakers[speaker]

    html.append(f'<div class="e" style="background-color: {boxclr}">\n')
    html.append('<p  style="margin:0;padding: 5px 10px 10px 10px;word-wrap:normal;white-space:normal;">\n')
    html.append(f'<span style="color:{spkrclr};font-weight: bold;">{speaker}</span><br>\n\t\t\t\t')

    for c in captions:
      start = shift + c['start'] * 1000.0
      start = start / 1000.0   #the time resolution of YouTube seeking is one second.
      end = (shift + c['end'] * 1000.0) / 1000.0
      txt.append(f'[{timeStr(start)} --> {timeStr(end)}] [{speaker}] {c["text"]}\n')

      for i, w in enumerate(c['words']):
        if w["word"] == "":    #skip empty words
           continue
        start = (shift + w['start']*1000.0) / 1000.0
        #end = (shift + w['end']) / 1000.0   #the time resolution of YouTube seeking is one second.
        html.append(f'<a href="#{timeStr(start)}" id="{"{:.1f}".format(round(start*5)/5)}" class="lt" onclick="jumptoTime({int(start)}, this.id)">{w["word"]}</a><!--\n\t\t\t\t-->')
    #html.append('\n')
    html.append('</p>\n')
    html.append(f'</div>\n')

html.append(postS)


with open(f"capspeaker.txt", "w", encoding='utf-8') as file:
  s = "".join(txt)
  file.write(s)
  print('captions saved to capspeaker.txt:')
  print(s+'\n')

with open(f"capspeaker.html", "w", encoding='utf-8') as file:    #TODO: proper html embed tag when video/audio from file
  s = "".join(html)
  file.write(s)
  print('captions saved to capspeaker.html:')
  print(s+'\n')

captions saved to capspeaker.txt:
[00:00:001.60 --> 00:00:003.60] [Call Center]  Joe Rogan, why can't I check it out?
[00:00:004.02 --> 00:00:005.56] [Call Center]  The Joe Rogan, experience.
[00:00:006.06 --> 00:00:009.44] [Call Center]  Train by day Joe Rogan podcast by night, all day!
[00:00:013.61 --> 00:00:014.19] [Customer]  Hello, Mr. Rest.
[00:00:014.66 --> 00:00:015.06] [Customer]  What's up?
[00:00:015.27 --> 00:00:015.88] [Customer]  What's going on, man?
[00:00:016.12 --> 00:00:016.55] [Customer]  Good to see you.
[00:00:016.73 --> 00:00:017.29] [Customer]  Good to see you too.
[00:00:017.55 --> 00:00:018.25] [Customer]  We finally did it.
[00:00:018.32 --> 00:00:018.59] [Customer]  We're here.
[00:00:018.84 --> 00:00:019.50] [Customer]  We made it happen.
[00:00:019.77 --> 00:00:020.25] [Customer]  We're in the building.
[00:00:020.62 --> 00:00:020.84] [Customer]  Yes, sir.
[00:00:019.00 --> 00:00:019.36] [Call Center]  We made it happen.
[00:00:022.31 --> 00:00:025.19] [C