[![notebook shield](https://img.shields.io/static/v1?label=&message=Notebook&color=blue&style=for-the-badge&logo=googlecolab&link=https://colab.research.google.com/github/ArthurFDLR/whisper-youtube/blob/main/whisper_youtube.ipynb)](https://colab.research.google.com/github/Majdoddin/nlp/blob/main/Pyannote_plays_and_Whisper_rhymes_v_2_0.ipynb)
[![repository shield](https://img.shields.io/static/v1?label=&message=Repository&color=blue&style=for-the-badge&logo=github&link=https://github.com/openai/whisper)](https://github.com/majdoddin/nlp)

# Whisper's transcription plus Pyannote's Diarization 

Andrej Karpathy [suggested](https://twitter.com/karpathy/status/1574476200801538048?s=20&t=s5IMMXOYjBI6-91dib6w8g) training a classifier on top of OpenAI [Whisper](https://openai.com/blog/whisper/) model features to identify the speaker, so we can visualize the speaker in the transcript. But, as [pointed out](https://twitter.com/tarantulae/status/1574493613362388992?s=20&t=s5IMMXOYjBI6-91dib6w8g) by Christian Perone, features from Whisper are unlikely to be that great for speaker recognition, since the model's main objective is basically to ignore speaker differences.

In the following, I use [**`pyannote-audio`**](https://github.com/pyannote/pyannote-audio), a speaker diarization toolkit by Hervé Bredin, to identify the speakers, and then match the result with the transcriptions of Whisper, linked to the video. The input can be a YouTube video or a video/audio file (also on Google Drive). I try it on a part of an [interview](https://youtu.be/NSp2fEQ6wyA) with Freeman Dyson. Check the result [**here**](https://majdoddin.github.io/dyson.html).

To make it easier to match the transcriptions to diarizations by speaker change, Sarah Kaiser [suggested](https://github.com/openai/whisper/discussions/264#discussioncomment-3825375) running pyannote.audio first and then just running Whisper on the split-by-speaker chunks. 
For the sake of performance (and transcription quality?), we attach the audio segments into a single audio file with a silent spacer as a separator, and run Whisper on it. Enjoy it!

(For the sake of performance, I also tried attaching the audio segments into a single audio file with a silent (or beep) spacer as a separator and running Whisper on it; see it on [colab](https://colab.research.google.com/drive/1HuvcY4tkTHPDzcwyVH77LCh_m8tP-Qet?usp=sharing). It [works](https://majdoddin.github.io/lexicap.html) on some audio and fails on other audio (Dyson's interview). The problem is that Whisper does not reliably make a timestamp on a spacer. See the discussions [#139](https://github.com/openai/whisper/discussions/139) and [#29](https://github.com/openai/whisper/discussions/29).)

The Markdown form used below is from [@ArthurFDLR](https://github.com/ArthurFDLR/whisper-youtube/).   

# Preparing the audio file

**Optional:** Mount Google Drive



In [1]:
from google.colab import drive
from pathlib import Path

drive_mount_path = Path("/content/drive")
drive.mount(str(drive_mount_path))
drive_mount_path /= "MyDrive"

Mounted at /content/drive


In [2]:
#@markdown Enter the URL of the YouTube video, or the path to the video/audio file you want to transcribe, give the output path, and run the cell. HTML file is generated only for YouTube videos

Source = 'Youtube' #@param ['Youtube', 'File (Google Drive)']
#@markdown ---
#@markdown #### **Youtube video**
video_url = "https://youtu.be/NSp2fEQ6wyA" #@param {type:"string"}
#store_audio = True #@param {type:"boolean"}
#@markdown ---
#@markdown #### **Google Drive video or audio (mp4, wav)**
video_path = "/content/drive/MyDrive/Colab Notebooks/PyannoteWhisper/dyson.mp4" #@param {type:"string"}
#@markdown ---
output_path = "/content/" #@param {type:"string"}
output_path = str(Path(output_path))
#@markdown ---
#@markdown **Run this cell again if you change the video.**


In [3]:
Path(output_path).mkdir(parents=True, exist_ok=True)
%cd {output_path}
video_title = ""
video_id = ""

/content


## From YouTube

Installing [`yt-dlp`](https://github.com/yt-dlp/yt-dlp) and downloading the [video](https://youtu.be/NSp2fEQ6wyA) from YouTube.

In [4]:
if Source == "Youtube":
  !pip install -U yt-dlp

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting yt-dlp
  Downloading yt_dlp-2023.3.4-py2.py3-none-any.whl (2.9 MB)
Collecting pycryptodomex
  Downloading pycryptodomex-3.17-cp35-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
Collecting brotli
  Downloading Brotli-1.0.9-cp39-cp39-manylinux1_x86_64.whl (357 kB)
Collecting websockets
  Downloading websockets-10.4-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (106 kB)

Custom build of `ffmpeg` as [recommended](https://github.com/yt-dlp/yt-dlp#strongly-recommended) by `yt-dlp`.

In [5]:
if Source == "Youtube":
  !wget -O - -q  https://github.com/yt-dlp/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-gpl.tar.xz | xz -qdc| tar -x

In [6]:
#Getting video info
if Source == "Youtube":
  from yt_dlp import YoutubeDL
  with YoutubeDL() as ydl: 
    info_dict = ydl.extract_info(video_url, download=False)
    video_title = info_dict.get('title', None)
    video_id = info_dict.get('id', None)
    print("Title: " + video_title) # <= Here, you got the video title


[youtube] Extracting URL: https://youtu.be/NSp2fEQ6wyA
[youtube] NSp2fEQ6wyA: Downloading webpage
[youtube] NSp2fEQ6wyA: Downloading android player API JSON
Title: Freeman Dyson - Pure mathematics at Cambridge: the influence of Besicovitch (23/157)


Downloading the audio from YouTube.

In [7]:
if Source == "Youtube":
  !yt-dlp -xv --ffmpeg-location ffmpeg-master-latest-linux64-gpl/bin --audio-format wav  -o "{str(output_path) + '/'}input.wav" -- {video_url}

[debug] Command-line config: ['-xv', '--ffmpeg-location', 'ffmpeg-master-latest-linux64-gpl/bin', '--audio-format', 'wav', '-o', '/content/input.wav', '--', 'https://youtu.be/NSp2fEQ6wyA']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version stable@2023.03.04 [392389b7d] (pip)
[debug] Python 3.9.16 (CPython x86_64 64bit) - Linux-5.10.147+-x86_64-with-glibc2.31 (OpenSSL 1.1.1f  31 Mar 2020, glibc 2.31)
[debug] exe versions: ffmpeg N-110022-gc3a7999099-20230316 (setts), ffprobe N-110022-gc3a7999099-20230316
[debug] Optional libraries: Cryptodome-3.17, brotli-1.0.9, certifi-2022.12.07, mutagen-1.46.0, sqlite3-2.6.0, websockets-10.4
[debug] Proxy map: {}
[debug] Loaded 1786 extractors
[youtube] Extracting URL: https://youtu.be/NSp2fEQ6wyA
[youtube] NSp2fEQ6wyA: Downloading webpage
[youtube] NSp2fEQ6wyA: Downloading android player API JSON
[debug] Sort order given by extractor: quality, res, fps, hdr:12, source, vcodec:vp9.2, cha

## or from File (Google Drive)

In [8]:
if Source == 'File (Google Drive)':
    !ffmpeg -i {repr(video_path)} -vn -acodec pcm_s16le -ar 16000 -ac 1 input.wav  

## Prepending a spacer

`pyannote.audio` seems to miss the first 0.5 seconds of the audio, so we prepend a silent spacer.

In [9]:
!pip install pydub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1


In [10]:
from pydub import AudioSegment

spacermilli = 2000
spacer = AudioSegment.silent(duration=spacermilli)


audio = AudioSegment.from_wav("input.wav") 

audio = spacer.append(audio, crossfade=0)

audio.export('input_prep.wav', format='wav')

<_io.BufferedRandom name='input_prep.wav'>
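Because of the prepended spacer, every time measured in `input_prep.wav` is later than the corresponding time in the original audio. A minimal sketch of the mapping back, assuming the same 2000 ms spacer as `spacermilli` above (`to_original_ms` is a hypothetical helper name, not part of the notebook):

```python
SPACER_MS = 2000  # assumed to equal spacermilli from the cell above


def to_original_ms(prep_ms, spacer_ms=SPACER_MS):
    """Map a time in input_prep.wav back to the original audio (hypothetical helper)."""
    # Times that fall inside the spacer are clamped to the start of the original audio.
    return max(prep_ms - spacer_ms, 0)


print(to_original_ms(1982))    # inside the spacer, clamped to 0
print(to_original_ms(117947))  # 115947
```

The HTML generation cell further down applies the same subtraction and clamping when it computes the start time of each group in the original video.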

# Pyannote's Diarization

[`pyannote.audio`](https://github.com/pyannote/pyannote-audio) is an open-source toolkit written in Python for **speaker diarization**. 

Based on [`PyTorch`](https://pytorch.org) machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. 

`pyannote.audio` also comes with pretrained [models](https://huggingface.co/models?other=pyannote-audio-model) and [pipelines](https://huggingface.co/models?other=pyannote-audio-pipeline) covering a wide range of domains for voice activity detection, speaker segmentation, overlapped speech detection, and speaker embedding, reaching state-of-the-art performance for most of them. 

Installing `pyannote.audio`.

In [11]:
!pip install   pyannote.audio

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyannote.audio
  Downloading pyannote.audio-2.1.1-py2.py3-none-any.whl (390 kB)
Collecting pytorch-metric-learning<2.0,>=1.0.0
  Downloading pytorch_metric_learning-1.7.3-py3-none-any.whl (112 kB)
Collecting pyannote.metrics<4.0,>=3.2
  Downloading pyannote.metrics-3.2.1-py3-none-any.whl (51 kB)
Collecting speechbrain<0.6,>=0.5.12
  Downloading speechbrain-0.5.13-py3-none-any.whl (498 kB)

**Important:** To load the pyannote speaker diarization pipeline, 

* accept the user conditions on both [hf.co/pyannote/speaker-diarization](https://hf.co/pyannote/speaker-diarization) and [hf.co/pyannote/segmentation](https://huggingface.co/pyannote/segmentation).
* paste your access token, or log in using `notebook_login` below

In [12]:
access_token = "" #copy your huggingface access token here
if not(access_token):
  from huggingface_hub import notebook_login
  notebook_login()

In [14]:
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization', use_auth_token= (access_token) or True )

Running pyannote.audio to generate the diarizations.

In [15]:
DEMO_FILE = {'uri': 'blabla', 'audio': 'input_prep.wav'}
dz = pipeline(DEMO_FILE)  

with open("diarization.txt", "w") as text_file:
    text_file.write(str(dz))

In [16]:
print(*list(dz.itertracks(yield_label = True))[:10], sep="\n")

(<Segment(1.98281, 20.0391)>, 'F', 'SPEAKER_01')
(<Segment(20.8659, 34.1803)>, 'G', 'SPEAKER_01')
(<Segment(34.8553, 37.8759)>, 'H', 'SPEAKER_01')
(<Segment(38.8209, 40.4747)>, 'I', 'SPEAKER_01')
(<Segment(41.7741, 46.0772)>, 'J', 'SPEAKER_01')
(<Segment(46.9547, 51.6122)>, 'K', 'SPEAKER_01')
(<Segment(52.2534, 53.5528)>, 'L', 'SPEAKER_01')
(<Segment(55.2234, 56.3709)>, 'M', 'SPEAKER_01')
(<Segment(57.9741, 60.3703)>, 'N', 'SPEAKER_01')
(<Segment(61.2984, 70.4953)>, 'O', 'SPEAKER_01')


# Preparing audio files according to the diarization

In [17]:
def millisec(timeStr):
  # Convert an "HH:MM:SS.mmm" timestamp string to integer milliseconds.
  spl = timeStr.split(":")
  s = int((int(spl[0]) * 60 * 60 + int(spl[1]) * 60 + float(spl[2])) * 1000)
  return s
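The conversion above goes through a float multiplication and then truncates with `int`, which in principle can land one millisecond low for some inputs. As a sketch of an alternative that stays in integer arithmetic (the name `to_millis` and the regular expression are my own assumptions, not part of the notebook), one could parse the fields separately:

```python
import re

# Matches timestamps such as "00:01:57.947" (hours:minutes:seconds.fraction).
TIMESTAMP = re.compile(r"(\d+):(\d+):(\d+)\.(\d+)")


def to_millis(stamp):
    """Convert "HH:MM:SS.fff" to integer milliseconds without float rounding."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not a timestamp: {stamp!r}")
    hours, minutes, seconds, fraction = match.groups()
    # Pad or cut the fractional part to exactly three digits (milliseconds).
    millis = int(fraction.ljust(3, "0")[:3])
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + millis


print(to_millis("00:01:57.947"))  # 117947
print(to_millis("00:00:01.982"))  # 1982
```

For the timestamps in this notebook both versions agree, so this is only a more defensive variant.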

Grouping the diarization segments according to the speaker.

In [18]:
import re
with open('diarization.txt') as f:
  dzs = f.read().splitlines()

groups = []
g = []
lastend = 0

for d in dzs:   
  if g and (g[0].split()[-1] != d.split()[-1]):      #speaker changed: close the current group
    groups.append(g)
    g = []
  
  g.append(d)
  
  end = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=d)[1]
  end = millisec(end)
  if (lastend > end):       #segment engulfed by a previous segment
    groups.append(g)
    g = [] 
  else:
    lastend = end
if g:
  groups.append(g)
print(*groups, sep='\n')

['[ 00:00:01.982 -->  00:00:20.039] F SPEAKER_01', '[ 00:00:20.865 -->  00:00:34.180] G SPEAKER_01', '[ 00:00:34.855 -->  00:00:37.875] H SPEAKER_01', '[ 00:00:38.820 -->  00:00:40.474] I SPEAKER_01', '[ 00:00:41.774 -->  00:00:46.077] J SPEAKER_01', '[ 00:00:46.954 -->  00:00:51.612] K SPEAKER_01', '[ 00:00:52.253 -->  00:00:53.552] L SPEAKER_01', '[ 00:00:55.223 -->  00:00:56.370] M SPEAKER_01', '[ 00:00:57.974 -->  00:01:00.370] N SPEAKER_01', '[ 00:01:01.298 -->  00:01:10.495] O SPEAKER_01', '[ 00:01:11.237 -->  00:01:15.692] P SPEAKER_01', '[ 00:01:16.536 -->  00:01:23.657] Q SPEAKER_01', '[ 00:01:24.265 -->  00:01:31.386] R SPEAKER_01', '[ 00:01:32.618 -->  00:01:33.192] S SPEAKER_01', '[ 00:01:36.060 -->  00:01:37.056] T SPEAKER_01', '[ 00:01:39.402 -->  00:01:40.819] U SPEAKER_01', '[ 00:01:42.490 -->  00:01:45.510] V SPEAKER_01', '[ 00:01:47.114 -->  00:01:53.172] W SPEAKER_01', '[ 00:01:54.539 -->  00:01:56.918] X SPEAKER_01']
['[ 00:01:57.947 -->  00:02:04.242] A SPEAKER_00'
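The grouping rule may be easier to follow on plain tuples than on the text lines of `diarization.txt`. The following is a sketch of the same logic under that assumption; `group_turns` and the toy `turns` list are made up for illustration and are not produced by pyannote:

```python
def group_turns(turns):
    """Group consecutive (start_s, end_s, speaker) turns the way the cell above does."""
    groups, current, last_end = [], [], 0.0
    for start, end, speaker in turns:
        # A change of speaker closes the current group.
        if current and current[0][2] != speaker:
            groups.append(current)
            current = []
        current.append((start, end, speaker))
        if last_end > end:
            # The turn ends before an earlier turn did (it is engulfed),
            # so it is closed off as a group of its own.
            groups.append(current)
            current = []
        else:
            last_end = end
    if current:
        groups.append(current)
    return groups


# Toy data, invented for this example.
turns = [
    (2.0, 20.0, "SPEAKER_01"),
    (20.9, 34.2, "SPEAKER_01"),
    (117.9, 124.2, "SPEAKER_00"),
    (123.7, 130.0, "SPEAKER_01"),
    (125.0, 126.0, "SPEAKER_00"),  # engulfed by the previous turn
]
for group in group_turns(turns):
    print(group)
```

The first two turns share a speaker and end up in one group; each of the remaining turns becomes its own group.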

Save the audio part corresponding to each diarization group.

In [19]:
audio = AudioSegment.from_wav("input_prep.wav")
for gidx, g in enumerate(groups):
  # A group spans from the start of its first segment to the end of its last one.
  start = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=g[0])[0]
  end = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=g[-1])[1]
  start = millisec(start) #- spacermilli
  end = millisec(end)  #- spacermilli
  audio[start:end].export(str(gidx) + '.wav', format='wav')
  print(f"group {gidx}: {start}--{end}")

group 0: 1982--116918
group 1: 117947--124242
group 2: 123719--224445
group 3: 224445--233001
group 4: 232697--239549
group 5: 236241--237152
group 6: 241928--248897
group 7: 248509--258449


Freeing up some memory

In [20]:
#del   DEMO_FILE, pipeline, spacer,  audio, dz

# Whisper's Transcriptions

Installing OpenAI Whisper.

In [21]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [22]:
!pip install git+https://github.com/openai/whisper.git 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-4xws90u2
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-4xws90u2
  Resolved https://github.com/openai/whisper.git to commit 6dea21fd7f7253bfe450f1e2512a0fe47ee2d258
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken==0.3.1
  Downloading tiktoken-0.3.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting triton==2.0.0
  Downloading triton-2.0.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (63.3 MB)

Run whisper on all audio files. Whisper generates the transcription and writes it to a file.

In [23]:
import whisper, torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = whisper.load_model('large', device = device)


100%|█████████████████████████████████████| 2.87G/2.87G [00:34<00:00, 88.5MiB/s]


In [24]:
import json
for i in range(len(groups)):
  audiof = str(i) + '.wav'
  result = model.transcribe(audio=audiof, language='en', word_timestamps=True)#, initial_prompt=result.get('text', ""))
  with open(str(i)+'.json', "w") as outfile:
    json.dump(result, outfile, indent=4)  
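Each saved JSON holds a list of segments, and with `word_timestamps=True` each segment carries a `words` list whose times are relative to the start of that group's audio file. As a rough sketch of how those times can be shifted back to video time (the helper name `words_in_video_time` and the toy result are invented here; the real result has more fields):

```python
def words_in_video_time(result, group_start_ms, spacer_ms=2000):
    """Return (word, start_in_video_seconds) pairs for one group's transcription."""
    # The group start was measured in the spacer-prefixed audio, so remove the spacer.
    shift_s = max(group_start_ms - spacer_ms, 0) / 1000.0
    shifted = []
    for segment in result["segments"]:
        for w in segment.get("words", []):
            shifted.append((w["word"], round(shift_s + w["start"], 2)))
    return shifted


# Toy stand-in for the dict returned by model.transcribe (values are made up).
toy_result = {
    "text": " Hello there.",
    "segments": [
        {
            "start": 0.0,
            "end": 1.2,
            "text": " Hello there.",
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.5},
                {"word": " there.", "start": 0.6, "end": 1.2},
            ],
        }
    ],
}

for word, start in words_in_video_time(toy_result, group_start_ms=117947):
    print(f"{start:8.2f} {word}")
```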

# Generating the HTML and/or txt file from the Transcriptions and the Diarization

Change or add to the speaker names and colors below as you wish `(speaker, textbox color, speaker color)`.

In [25]:
speakers = {'SPEAKER_00':('Interviewer', '#e1ffc7', 'darkgreen'), 'SPEAKER_01':('Dyson', 'white', 'darkorange') }
def_boxclr = 'white'
def_spkrclr = 'orange'

In the generated HTML, the transcriptions for each diarization group are written in a box, with the speaker name on top. Clicking a word of the transcription makes the embedded video jump to the corresponding time.
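The highlighting of the current word relies on each word's `id` being its start time rounded to the nearest 0.2 s, which is the same rounding the page's script applies to the player's current time. A small sketch of that id computation, with `bucket_id` and `word_link` as hypothetical names and a simplified anchor tag:

```python
def bucket_id(seconds):
    """Round a time to the nearest 0.2 s and format it with one decimal."""
    return "{:.1f}".format(round(seconds * 5) / 5)


def word_link(word, start_s):
    """Build a simplified clickable anchor for one word (illustrative markup only)."""
    return (
        f'<a id="{bucket_id(start_s)}" class="lt" '
        f'onclick="jumptoTime({int(start_s)}, this.id)">{word}</a>'
    )


print(bucket_id(12.34))            # 12.4
print(word_link(" hello", 12.34))
```

Note that the click handler receives whole seconds, so a click seeks to the start of the second in which the word begins.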

In [26]:
preS = '<!DOCTYPE html>\n<html lang="en">\n\n<head>\n\t<meta charset="UTF-8">\n\t<meta name="viewport" content="width=device-width, initial-scale=1.0">\n\t<meta http-equiv="X-UA-Compatible" content="ie=edge">\n\t<title>' + \
video_title+ \
'</title>\n\t<style>\n\t\tbody {\n\t\t\tfont-family: sans-serif;\n\t\t\tfont-size: 14px;\n\t\t\tcolor: #111;\n\t\t\tpadding: 0 0 1em 0;\n\t\t\tbackground-color: #efe7dd;\n\t\t}\n\n\t\ttable {\n\t\t\tborder-spacing: 10px;\n\t\t}\n\n\t\tth {\n\t\t\ttext-align: left;\n\t\t}\n\n\t\t.lt {\n\t\t\tcolor: inherit;\n\t\t\ttext-decoration: inherit;\n\t\t}\n\n\t\t.l {\n\t\t\tcolor: #050;\n\t\t}\n\n\t\t.s {\n\t\t\tdisplay: inline-block;\n\t\t}\n\n\t\t.c {\n\t\t\tdisplay: inline-block;\n\t\t}\n\n\t\t.e {\n\t\t\t/*background-color: white; Changing background color */\n\t\t\tborder-radius: 10px;\n\t\t\t/* Making border radius */\n\t\t\twidth: 50%;\n\t\t\t/* Making auto-sizable width */\n\t\t\tpadding: 0 0 0 0;\n\t\t\t/* Making space around letters */\n\t\t\tfont-size: 14px;\n\t\t\t/* Changing font size */\n\t\t\tmargin-bottom: 0;\n\t\t}\n\n\t\t.t {\n\t\t\tdisplay: inline-block;\n\t\t}\n\n\t\t#player-div {\n\t\t\tposition: sticky;\n\t\t\ttop: 20px;\n\t\t\tfloat: right;\n\t\t\twidth: 40%\n\t\t}\n\n\t\t#player {\n\t\t\taspect-ratio: 16 / 9;\n\t\t\twidth: 100%;\n\t\t\theight: auto;\n\n\t\t}\n\n\t\ta {\n\t\t\tdisplay: inline;\n\t\t}\n\t</style>\n\t<script>\n\t\tvar tag = document.createElement(\'script\');\n\t\ttag.src = "https://www.youtube.com/iframe_api";\n\t\tvar firstScriptTag = document.getElementsByTagName(\'script\')[0];\n\t\tfirstScriptTag.parentNode.insertBefore(tag, firstScriptTag);\n\t\tvar player;\n\t\tfunction onYouTubeIframeAPIReady() {\n\t\t\tplayer = new YT.Player(\'player\', {\n\t\t\t\t//height: \'210\',\n\t\t\t\t//width: \'340\',\n\t\t\t\tvideoId: \''+ \
video_id + \
'\',\n\t\t\t});\n\n\n\n\t\t\t// This is the source "window" that will emit the events.\n\t\t\tvar iframeWindow = player.getIframe().contentWindow;\n\t\t\tvar lastword = null;\n\n\t\t\t// So we can compare against new updates.\n\t\t\tvar lastTimeUpdate = "-1";\n\n\t\t\t// Listen to events triggered by postMessage,\n\t\t\t// this is how different windows in a browser\n\t\t\t// (such as a popup or iFrame) can communicate.\n\t\t\t// See: https://developer.mozilla.org/en-US/docs/Web/API/Window/postMessage\n\t\t\twindow.addEventListener("message", function (event) {\n\t\t\t\t// Check that the event was sent from the YouTube IFrame.\n\t\t\t\tif (event.source === iframeWindow) {\n\t\t\t\t\tvar data = JSON.parse(event.data);\n\n\t\t\t\t\t// The "infoDelivery" event is used by YT to transmit any\n\t\t\t\t\t// kind of information change in the player,\n\t\t\t\t\t// such as the current time or a playback quality change.\n\t\t\t\t\tif (\n\t\t\t\t\t\tdata.event === "infoDelivery" &&\n\t\t\t\t\t\tdata.info &&\n\t\t\t\t\t\tdata.info.currentTime\n\t\t\t\t\t) {\n\t\t\t\t\t\t// currentTime is emitted very frequently (milliseconds),\n\t\t\t\t\t\t// but we only care about whole second changes.\n\t\t\t\t\t\tvar ts = (data.info.currentTime).toFixed(1).toString();\n\t\t\t\t\t\tts = (Math.round((data.info.currentTime) * 5) / 5).toFixed(1);\n\t\t\t\t\t\tts = ts.toString();\n\t\t\t\t\t\tconsole.log(ts)\n\t\t\t\t\t\tif (ts !== lastTimeUpdate) {\n\t\t\t\t\t\t\tlastTimeUpdate = ts;\n\n\t\t\t\t\t\t\t// It\'s now up to you to format the time.\n\t\t\t\t\t\t\t//document.getElementById("time2").innerHTML = time;\n\t\t\t\t\t\t\tword = document.getElementById(ts)\n\t\t\t\t\t\t\tif (word) {\n\t\t\t\t\t\t\t\tif (lastword) {\n\t\t\t\t\t\t\t\t\tlastword.style.fontWeight = \'normal\';\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\tlastword = word;\n\t\t\t\t\t\t\t\t//word.style.textDecoration = \'underline\';\n\t\t\t\t\t\t\t\tword.style.fontWeight = \'bold\';\n\n\t\t\t\t\t\t\t\tlet toggle = 
document.getElementById("autoscroll");\n\t\t\t\t\t\t\t\tif (toggle.checked) {\n\t\t\t\t\t\t\t\t\tlet position = word.offsetTop - 10;\n\t\t\t\t\t\t\t\t\twindow.scrollTo({\n\t\t\t\t\t\t\t\t\t\ttop: position,\n\t\t\t\t\t\t\t\t\t\tbehavior: \'smooth\'\n\t\t\t\t\t\t\t\t\t});\n\t\t\t\t\t\t\t\t}\n\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t})\n\t\t}\n\t\tfunction jumptoTime(timepoint, id) {\n\t\t\tevent.preventDefault();\n\t\t\thistory.pushState(null, null, "#" + id);\n\t\t\tplayer.seekTo(timepoint);\n\t\t\tplayer.playVideo();\n\t\t}\n\t</script>\n</head>\n\n<body>\n\t<h2>'  + \
video_title + \
'</h2>\n\t<i>Click on a part of the transcription, to jump to its video, and get an anchor to it in the address\n\t\tbar<br><br></i>\n\t<div id="player-div">\n\t\t<div id="player"></div>\n\t\t<div><label for="autoscroll">auto-scroll: </label>\n\t\t\t<input type="checkbox" id="autoscroll" checked>\n\t\t</div>\n\t</div>\n  '

#preS = '<!DOCTYPE html>\n<html lang="en">\n\n<head>\n\t<meta charset="UTF-8">\n\t<meta name="viewport" content="width=device-width, initial-scale=1.0">\n\t<meta http-equiv="X-UA-Compatible" content="ie=edge">\n\t<title>' + \
#      video_title + \
#      '</title>\n    <style>\n        body {\n            font-family: sans-serif;\n            font-size: 14px;\n            color: #111;\n            padding: 0 0 1em 0;\n\t        background-color: #efe7dd;\n        }\n        table {\n             border-spacing: 10px;\n        }\n        th { text-align: left;}\n        .lt {\n          color: inherit;\n          text-decoration: inherit;\n        }\n        .l {\n          color: #050;\n        }\n        .s {\n            display: inline-block;\n        }\n        .c {\n            display: inline-block;\n        }\n        .e {\n            /*background-color: white; Changing background color */\n            border-radius: 10px; /* Making border radius */\n            width: 50%; /* Making auto-sizable width */\n            padding: 0 0 0 0; /* Making space around letters */\n            font-size: 14px; /* Changing font size */\n            margin-bottom: 0;\n        }\n\n        .t {\n            display: inline-block;\n        }\n        #player {\n            position: sticky;\n            top: 20px;\n            float: right;\naspect-ratio: 16 / 9;width:40%;height: auto;        }\n        a {\n            display: inline;\n        }\n</style>\n\t<script>\n      var tag = document.createElement(\'script\');\n      tag.src = "https://www.youtube.com/iframe_api";\n      var firstScriptTag = document.getElementsByTagName(\'script\')[0];\n      firstScriptTag.parentNode.insertBefore(tag, firstScriptTag);\n      var player;\n      function onYouTubeIframeAPIReady() {\n        player = new YT.Player(\'player\', {\n          //height: \'210\',\n          //width: \'340\',\n          videoId: \'' + \
#      video_id + \
#      '\',\n        });\n      }\n      function jumptoTime(timepoint, id) {\n        event.preventDefault();\n        history.pushState(null, null, "#"+id);\n        player.seekTo(timepoint);\n        player.playVideo();\n      }\n    </script>\n  </head>\n  <body>\n    <h2>' + \
#      video_title + \
#      '</h2>\n  <i>Click on a part of the transcription, to jump to its video, and get an anchor to it in the address bar<br><br></i>\n<div  id="player"></div>\n'
postS = '\t</body>\n</html>'

In [27]:
#import webvtt
import json
from datetime import timedelta

def timeStr(t):
  # Format a time in seconds as HH:MM:SS.ss (the seconds field is padded to width 5, e.g. "07.50").
  return '{0:02d}:{1:02d}:{2:05.2f}'.format(int(t // 3600), 
                                                int(t % 3600 // 60), 
                                                t % 60)

html = [preS]
txt = []
gidx = -1
for g in groups:  
  shift = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=g[0])[0]
  shift = millisec(shift) - spacermilli #the start time in the original video
  shift=max(shift, 0)
  
  gidx += 1
  
  captions = json.load(open(str(gidx) + '.json'))['segments']

  if captions:
    speaker = g[0].split()[-1]
    boxclr = def_boxclr
    spkrclr = def_spkrclr
    if speaker in speakers:
      speaker, boxclr, spkrclr = speakers[speaker] 
    
    html.append(f'<div class="e" style="background-color: {boxclr}">\n');
    html.append('<p  style="margin:0;padding: 5px 10px 10px 10px;word-wrap:normal;white-space:normal;">\n')
    html.append(f'<span style="color:{spkrclr};font-weight: bold;">{speaker}</span><br>\n\t\t\t\t')
      
    for c in captions:
      start = shift + c['start'] * 1000.0 
      start = start / 1000.0   #back to seconds; the YouTube player seeks in seconds.
      end = (shift + c['end'] * 1000.0) / 1000.0      
      txt.append(f'[{timeStr(start)} --> {timeStr(end)}] [{speaker}] {c["text"]}\n')

      for i, w in enumerate(c['words']):
        if w == "":
           continue
        start = (shift + w['start']*1000.0) / 1000.0        
        #end = (shift + w['end']) / 1000.0   #the YouTube player seeks in seconds.
        html.append(f'<a href="#{timeStr(start)}" id="{"{:.1f}".format(round(start*5)/5)}" class="lt" onclick="jumptoTime({int(start)}, this.id)">{w["word"]}</a><!--\n\t\t\t\t-->')
    #html.append('\n')      
    html.append('</p>\n')
    html.append(f'</div>\n')

html.append(postS)

with open("capspeaker.txt", "w") as file:
  s = "".join(txt)
  file.write(s)
if Source == 'File (Google Drive)':
  print(s)
elif Source == 'Youtube':
  with open("capspeaker.html", "w") as file:    #TODO: proper html embed tag when video/audio from file
    s = "".join(html)
    file.write(s)
    print(s)

<!DOCTYPE html>
<html lang="en">

<head>
	<meta charset="UTF-8">
	<meta name="viewport" content="width=device-width, initial-scale=1.0">
	<meta http-equiv="X-UA-Compatible" content="ie=edge">
	<title>Freeman Dyson - Pure mathematics at Cambridge: the influence of Besicovitch (23/157)Freeman Dyson - Pure mathematics at Cambridge: the influence of Besicovitch (23/157)</title>
	<style>
		body {
			font-family: sans-serif;
			font-size: 14px;
			color: #111;
			padding: 0 0 1em 0;
			background-color: #efe7dd;
		}

		table {
			border-spacing: 10px;
		}

		th {
			text-align: left;
		}

		.lt {
			color: inherit;
			text-decoration: inherit;
		}

		.l {
			color: #050;
		}

		.s {
			display: inline-block;
		}

		.c {
			display: inline-block;
		}

		.e {
			/*background-color: white; Changing background color */
			border-radius: 10px;
			/* Making border radius */
			width: 50%;
			/* Making auto-sizable width */
			padding: 0 0 0 0;
			/* Making space around letters */
			font-size: 14p