# Whisper's transcription plus Pyannote's Diarization 

Andrej Karpathy [suggested](https://twitter.com/karpathy/status/1574476200801538048?s=20&t=s5IMMXOYjBI6-91dib6w8g) training a classifier on top of the features of OpenAI's [Whisper](https://openai.com/blog/whisper/) model to identify the speaker, so that the speaker can be shown in the transcript. But, as Christian Perone [pointed out](https://twitter.com/tarantulae/status/1574493613362388992?s=20&t=s5IMMXOYjBI6-91dib6w8g), Whisper's features are unlikely to work well for speaker recognition, because the model's main objective is essentially to ignore speaker differences.

In the following, I use [**`pyannote-audio`**](https://github.com/pyannote/pyannote-audio), a speaker diarization toolkit by Hervé Bredin, to identify the speakers, and then match them with Whisper's transcriptions. I try it on part of an [interview](https://youtu.be/NSp2fEQ6wyA) with Freeman Dyson. Check the result [**here**](https://majdoddin.github.io/dyson.html).

To make it easier to match the transcriptions to the diarization by speaker change, Sarah Kaiser [suggested](https://github.com/openai/whisper/discussions/264#discussioncomment-3825375) running `pyannote.audio` first and then running Whisper on the split-by-speaker chunks.
For the sake of performance (and transcription quality?), we attach the audio segments into a single audio file with a silent spacer as a separator, and run Whisper on it. Enjoy it!

(For the sake of performance, I also tried attaching the audio segments into a single audio file with a silent (or beep) spacer as a separator and running Whisper on it; see it on [colab](https://colab.research.google.com/drive/1HuvcY4tkTHPDzcwyVH77LCh_m8tP-Qet?usp=sharing). It [works](https://majdoddin.github.io/lexicap.html) on some audio and fails on others. The problem is that Whisper does not reliably produce a timestamp at a spacer. See the discussions [#139](https://github.com/openai/whisper/discussions/139) and [#29](https://github.com/openai/whisper/discussions/29).)
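To make the spacer idea concrete, here is a minimal sketch of the bookkeeping it needs: given the duration of each speaker chunk and the spacer length, it computes where each chunk lies in the concatenated file, so that a timestamp in the combined audio can be mapped back to its chunk. The function names, the sample durations, and the layout (a spacer before every chunk) are assumptions for illustration and are not taken from the linked Colab notebook.

```python
from bisect import bisect_right


def chunk_offsets(durations_ms, spacer_ms):
    """Return (start, end) in ms of each chunk inside the concatenated audio.

    Assumed layout: spacer, chunk 0, spacer, chunk 1, ...
    """
    offsets = []
    cursor = 0
    for duration in durations_ms:
        cursor += spacer_ms  # every chunk is preceded by one spacer
        offsets.append((cursor, cursor + duration))
        cursor += duration
    return offsets


def chunk_index_at(offsets, t_ms):
    """Index of the chunk containing t_ms, or None if t_ms falls in a spacer."""
    starts = [start for start, _ in offsets]
    i = bisect_right(starts, t_ms) - 1
    if i >= 0 and t_ms < offsets[i][1]:
        return i
    return None


# Hypothetical chunk durations in milliseconds.
offsets = chunk_offsets([5000, 3000, 7000], spacer_ms=2000)
print(offsets)
print(chunk_index_at(offsets, 9500))
```

A lookup like this only helps if the transcription timestamps line up with the spacers, which is exactly what proved unreliable in the experiment described above.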

# Preparing the audio file

Installing [`yt-dlp`](https://github.com/yt-dlp/yt-dlp) and downloading the video from YouTube.

In [None]:
!pip install -U yt-dlp

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting yt-dlp
  Downloading yt_dlp-2022.10.4-py2.py3-none-any.whl (2.7 MB)
Collecting mutagen
  Downloading mutagen-1.46.0-py3-none-any.whl (193 kB)
Collecting brotli
  Downloading Brotli-1.0.9-cp37-cp37m-manylinux1_x86_64.whl (357 kB)
Collecting websockets
  Downloading websockets-10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (112 kB)
Collecting pycryptodomex
  Downloading pycryptodomex-3.15.0-cp35-abi3-manylinux2010_x86_64.whl (2.3 MB)
Installing collected packages: websockets, pycryptodomex, mutagen, bro

Downloading the custom build of `ffmpeg` [recommended](https://github.com/yt-dlp/yt-dlp#strongly-recommended) by `yt-dlp`.

In [None]:
!wget -O - -q  https://github.com/yt-dlp/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-gpl.tar.xz | xz -qdc| tar -x

Downloading the audio from YouTube.

In [None]:
!yt-dlp -xv --ffmpeg-location ffmpeg-master-latest-linux64-gpl/bin --audio-format wav  -o youtube.wav -- https://youtu.be/NSp2fEQ6wyA

[debug] Command-line config: ['-xv', '--ffmpeg-location', 'ffmpeg-master-latest-linux64-gpl/bin', '--audio-format', 'wav', '-o', 'youtube.wav', '--', 'https://youtu.be/NSp2fEQ6wyA']
[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out UTF-8, error UTF-8, screen UTF-8
[debug] yt-dlp version 2022.10.04 [4e0511f] (pip) API
[debug] Python 3.7.15 (CPython 64bit) - Linux-5.10.133+-x86_64-with-Ubuntu-18.04-bionic (glibc 2.26)
[debug] Checking exe version: ffmpeg-master-latest-linux64-gpl/bin/ffmpeg -bsfs
[debug] Checking exe version: ffmpeg-master-latest-linux64-gpl/bin/ffprobe -bsfs
[debug] exe versions: ffmpeg N-108818-gd79c240196-20221024 (setts), ffprobe N-108818-gd79c240196-20221024
[debug] Optional libraries: Cryptodome-3.15.0, brotli-1.0.9, certifi-2022.09.24, mutagen-1.46.0, sqlite3-2.6.0, websockets-10.3
[debug] Proxy map: {}
[debug] Loaded 1690 extractors
[debug] [youtube] Extracting URL: https://youtu.be/NSp2fEQ6wyA
[youtube] NSp2fEQ6wyA: Downloading webpage
[youtube] NSp2fEQ

`pyannote.audio` seems to miss the first 0.5 seconds of the audio, so we prepend a spacer.

In [None]:
!pip install pydub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1


In [None]:
from pydub import AudioSegment

spacermilli = 2000  # length of the silent spacer in milliseconds
spacer = AudioSegment.silent(duration=spacermilli)

audio = AudioSegment.from_wav("youtube.wav")

# Prepend the spacer so that pyannote does not miss the start of the speech.
audio = spacer.append(audio, crossfade=0)

audio.export('audio.wav', format='wav')

<_io.BufferedRandom name='audio.wav'>

# Pyannote's Diarization

[`pyannote.audio`](https://github.com/pyannote/pyannote-audio) is an open-source toolkit written in Python for **speaker diarization**. 

Based on [`PyTorch`](https://pytorch.org) machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. 

`pyannote.audio` also comes with pretrained [models](https://huggingface.co/models?other=pyannote-audio-model) and [pipelines](https://huggingface.co/models?other=pyannote-audio-pipeline) covering a wide range of domains for voice activity detection, speaker segmentation, overlapped speech detection, speaker embedding reaching state-of-the-art performance for most of them. 

Installing `pyannote.audio`.

In [None]:
!pip install pyannote.audio

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyannote.audio
  Downloading pyannote.audio-2.0.1-py2.py3-none-any.whl (385 kB)
Collecting asteroid-filterbanks<0.5,>=0.4
  Downloading asteroid_filterbanks-0.4.0-py3-none-any.whl (29 kB)
Collecting einops<0.4.0,>=0.3
  Downloading einops-0.3.2-py3-none-any.whl (25 kB)
Collecting pyannote.pipeline<3.0,>=2.3
  Downloading pyannote.pipeline-2.3-py3-none-any.whl (30 kB)
Collecting soundfile<0.11,>=0.10.2
  Downloading SoundFile-0.10.3.post1-py2.py3-none-any.whl (21 kB)
Collecting backports.cached-property
  Downloading backports.cached_property-1.0.2-py3-none-any.whl (6.1 kB)
Collecting pytorch-metric-learning<2.0,>=1.0.0
  Downloading pytorch_metric_learning-1.6.2-py3-none-any.whl (111 kB)
Collecting singledispatchmethod
  Downloading singledispatchmetho

In [None]:
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization')


Running pyannote.audio to generate the diarizations.

In [None]:
DEMO_FILE = {'uri': 'blabla', 'audio': 'audio.wav'}
dz = pipeline(DEMO_FILE)  

with open("diarization.txt", "w") as text_file:
    text_file.write(str(dz))

In [None]:
print(*list(dz.itertracks(yield_label = True))[:10], sep="\n")

(<Segment(1.99969, 5.61094)>, 0, 'SPEAKER_00')
(<Segment(6.45469, 6.69094)>, 0, 'SPEAKER_00')
(<Segment(6.74156, 19.9209)>, 0, 'SPEAKER_00')
(<Segment(20.8659, 24.0553)>, 0, 'SPEAKER_00')
(<Segment(25.0847, 30.4847)>, 0, 'SPEAKER_00')
(<Segment(30.7041, 34.0284)>, 0, 'SPEAKER_00')
(<Segment(34.8722, 37.8084)>, 0, 'SPEAKER_00')
(<Segment(38.8378, 40.4578)>, 0, 'SPEAKER_00')
(<Segment(41.7741, 46.0097)>, 0, 'SPEAKER_00')
(<Segment(47.0897, 47.1066)>, 0, 'SPEAKER_00')


# Preparing audio files according to the diarization

In [None]:
def millisec(timeStr):
  """Convert an 'HH:MM:SS.mmm' timestamp string to integer milliseconds."""
  spl = timeStr.split(":")
  s = int((int(spl[0]) * 60 * 60 + int(spl[1]) * 60 + float(spl[2])) * 1000)
  return s
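Because `millisec` multiplies a float by 1000 and then truncates, it can come out one millisecond low for some inputs (for example, `int(1.001 * 1000)` evaluates to `1000`). As a hedged alternative, the sketch below parses one line of `diarization.txt` using integer arithmetic only. The names `LINE_RE`, `to_ms` and `parse_line` are hypothetical, and the pattern assumes exactly three fractional digits, as in the output shown in this notebook.

```python
import re

# Matches e.g. "[ 00:00:01.999 -->  00:00:05.610] 0 SPEAKER_00"
LINE_RE = re.compile(
    r"\[\s*(\d+):(\d+):(\d+)\.(\d{3})\s*-->\s*"
    r"(\d+):(\d+):(\d+)\.(\d{3})\]\s+\S+\s+(\S+)"
)


def to_ms(hours, minutes, seconds, millis):
    """Combine the four string fields of a timestamp into integer milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_line(line):
    """Return (start_ms, end_ms, speaker) for one diarization line."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"not a diarization line: {line!r}")
    parts = match.groups()
    return to_ms(*parts[0:4]), to_ms(*parts[4:8]), parts[8]


print(parse_line("[ 00:00:01.999 -->  00:00:05.610] 0 SPEAKER_00"))
```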

Grouping the diarization segments according to the speaker.

In [None]:
import re

with open('diarization.txt') as f:
  dzs = f.read().splitlines()

groups = []
g = []
lastend = 0

for d in dzs:
  if g and (g[0].split()[-1] != d.split()[-1]):      # speaker changed: close the current group
    groups.append(g)
    g = []

  g.append(d)

  # The second timestamp on the line is the end of the segment.
  end = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=d)[1]
  end = millisec(end)
  if (lastend > end):       # segment engulfed by a previous segment
    groups.append(g)
    g = []
  else:
    lastend = end
if g:
  groups.append(g)
print(*groups, sep='\n')

['[ 00:00:01.999 -->  00:00:05.610] 0 SPEAKER_00', '[ 00:00:06.454 -->  00:00:06.690] 0 SPEAKER_00', '[ 00:00:06.741 -->  00:00:19.920] 0 SPEAKER_00', '[ 00:00:20.865 -->  00:00:24.055] 0 SPEAKER_00', '[ 00:00:25.084 -->  00:00:30.484] 0 SPEAKER_00', '[ 00:00:30.704 -->  00:00:34.028] 0 SPEAKER_00', '[ 00:00:34.872 -->  00:00:37.808] 0 SPEAKER_00', '[ 00:00:38.837 -->  00:00:40.457] 0 SPEAKER_00', '[ 00:00:41.774 -->  00:00:46.009] 0 SPEAKER_00', '[ 00:00:47.089 -->  00:00:47.106] 0 SPEAKER_00', '[ 00:00:47.224 -->  00:00:49.671] 0 SPEAKER_00', '[ 00:00:50.245 -->  00:00:51.595] 0 SPEAKER_00', '[ 00:00:52.287 -->  00:00:53.535] 0 SPEAKER_00', '[ 00:00:55.240 -->  00:00:56.354] 0 SPEAKER_00', '[ 00:00:58.007 -->  00:01:00.201] 0 SPEAKER_00', '[ 00:01:01.315 -->  00:01:10.394] 0 SPEAKER_00', '[ 00:01:11.406 -->  00:01:15.675] 0 SPEAKER_00', '[ 00:01:16.553 -->  00:01:23.556] 0 SPEAKER_00', '[ 00:01:24.282 -->  00:01:26.627] 0 SPEAKER_00', '[ 00:01:26.948 -->  00:01:31.352] 0 SPEAKER_00',
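To see the two split conditions in isolation, here is a small sketch of the same grouping rule applied to in-memory `(start_ms, end_ms, speaker)` tuples instead of text lines. The function name and the toy segments are made up for illustration; the third segment lies inside the second one, so it triggers the "engulfed" branch.

```python
def group_segments(segments):
    """Group (start_ms, end_ms, speaker) tuples by speaker change and engulfed segments."""
    groups, current, last_end = [], [], 0
    for seg in segments:
        _, end, speaker = seg
        if current and current[0][2] != speaker:  # speaker changed
            groups.append(current)
            current = []
        current.append(seg)
        if last_end > end:  # segment engulfed by an earlier segment
            groups.append(current)
            current = []
        else:
            last_end = end
    if current:
        groups.append(current)
    return groups


# Hypothetical segments, not taken from the interview.
toy = [
    (0, 5000, "SPEAKER_00"),
    (5200, 9000, "SPEAKER_00"),
    (6000, 7000, "SPEAKER_01"),  # lies inside the previous segment
    (9500, 12000, "SPEAKER_00"),
    (12500, 13000, "SPEAKER_01"),
]
for group in group_segments(toy):
    print(group[0][2], group[0][0], group[-1][1])
```

With this input, the engulfed interjection ends up in a group of its own, and the group before it still spans the audio in which the interjection occurs.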

Save the audio part corresponding to each diarization group.

In [None]:
audio = AudioSegment.from_wav("audio.wav")
gidx = -1
for g in groups:
  # A group spans from the start of its first segment to the end of its last one.
  start = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=g[0])[0]
  end = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=g[-1])[1]
  start = millisec(start) #- spacermilli
  end = millisec(end)  #- spacermilli
  print(start, end)
  gidx += 1
  # Write the group to '<gidx>.wav'; pydub slices are in milliseconds.
  audio[start:end].export(str(gidx) + '.wav', format='wav')

1999 116866
118268 124124
123735 224429
224648 232832
232714 239531
236275 236612
236815 237051
241928 248830
248526 258449


Freeing up some memory

In [None]:
del DEMO_FILE, pipeline, spacer, audio, dz

# Whisper's Transcriptions

Installing OpenAI Whisper.

**Important:** There is a version conflict with `pyannote.audio` that results in an error (see this [PR](https://github.com/pyannote/pyannote-audio/pull/1098)). Our workaround is to run pyannote first and Whisper afterwards. You can safely ignore the error.

In [None]:
!pip install git+https://github.com/openai/whisper.git 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-m6yb6yfz
  Running command git clone -q https://github.com/openai/whisper.git /tmp/pip-req-build-m6yb6yfz
Collecting transformers>=4.19.0
  Downloading transformers-4.23.1-py3-none-any.whl (5.3 MB)
Collecting ffmpeg-python==0.2.0
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Building wheels for collected packages: whisper
  Bui

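Running Whisper on all audio files. For each input file, Whisper generates the transcription and writes it to a file; the `.vtt` output (for example `0.wav.vtt`) is the one used for the HTML generation.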

In [None]:
for i in range(gidx+1):
  !whisper {str(i) + '.wav'} --language en --model large

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
Moving 0 files to the new cache system
0it [00:00, ?it/s]
100%|█████████████████████████████████████| 2.87G/2.87G [00:46<00:00, 66.6MiB/s]
tcmalloc: large alloc 3087007744 bytes == 0x6a72000 @  0x7fa8efda61e7 0x4b2590 0x5ad01c 0x5dcfef 0x58f92b 0x590c33 0x5e48ac 0x4d20fa 0x51041f 0x58fd37 0x50c4fc 0x5b4ee6 0x58ff2e 0x50d482 0x58fd37 0x50c4fc 0x5b4ee6 0x6005a3 0x607796 0x60785c 0x60a436 0x64db82 0x64dd2e 0x7fa8ef9a3c87 0x5b636a
[00:00.000 --> 00:10.940]  So then I come to Cambridge in 1941 as a 17-year-old, and I'd always been interested in physics
[00:10.940 --> 00:15.960]  and applied mathematics of all sorts, and one of the textbooks that I bought as a prize
[00:15.960 --> 00:24.380]  was a textbook in aerodynamics, which I think was because of James Light

# Generating the HTML file from the Transcriptions and the Diarization

Installing `webvtt-py` to read the transcription files.

In [None]:
!pip install -U webvtt-py

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting webvtt-py
  Downloading webvtt_py-0.4.6-py3-none-any.whl (16 kB)
Collecting docopt
  Downloading docopt-0.6.2.tar.gz (25 kB)
Building wheels for collected packages: docopt
  Building wheel for docopt (setup.py) ... done
  Created wheel for docopt: filename=docopt-0.6.2-py2.py3-none-any.whl size=13723 sha256=b2d86d8b4a158d15011470b7fab6dcf9b4eb57eb93ec122dfe0b0049fab2c7ae
  Stored in directory: /root/.cache/pip/wheels/72/b0/3f/1d95f96ff986c7dfffe46ce2be4062f38ebd04b506c77c81b9
Successfully built docopt
Installing collected packages: docopt, webvtt-py
Successfully installed docopt-0.6.2 webvtt-py-0.4.6


In the generated HTML, the transcriptions for each diarization group are written in a box, with the speaker name at the top. Clicking a transcription makes the embedded video jump to the corresponding time.
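The time a click should seek to is the start of the group in the original video (its start in `audio.wav` minus the prepended spacer) plus the caption's start within the group's file. The sketch below works this out in integer milliseconds; `video_time_ms` and `format_hms` are hypothetical helper names, the default of 2000 ms mirrors `spacermilli`, and the caption offset in the example is made up.

```python
def video_time_ms(group_start_ms, caption_start_ms, spacer_ms=2000):
    """Map a caption start inside a group file to a time in the original video."""
    shift = max(group_start_ms - spacer_ms, 0)  # undo the prepended spacer
    return shift + caption_start_ms


def format_hms(total_ms):
    """Format milliseconds as HH:MM:SS.cc, truncating to centiseconds."""
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, centis = divmod(rem // 10, 100)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{centis:02d}"


# A group starting at 118268 ms in audio.wav, caption 10940 ms into the group.
t = video_time_ms(118268, 10940)
print(t, format_hms(t), t // 1000)
```

The last printed value is the whole-second position, which is the kind of value the generated page passes to `setCurrentTime`.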

In [None]:
preS = '<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="UTF-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <meta http-equiv="X-UA-Compatible" content="ie=edge">\n    <title>Lexicap</title>\n    <style>\n        body {\n            font-family: sans-serif;\n            font-size: 18px;\n            color: #111;\n            padding: 0 0 1em 0;\n\t    background-color: #efe7dd;\n\n        }\n        table {\n             border-spacing: 10px;\n        }\n        th { text-align: left;}\n        .l {\n          color: #050;\n        }\n        .s {\n            display: inline-block;\n        }\n        .c {\n            display: inline-block;\n        }\n        .e1 {\n                background-color: white;/* Changing background color */\n            border-radius: 20px; /* Making border radius */\n            width: fit-content; /* Making auto-sizable width */\n            height: fit-content; /* Making auto-sizable height */\n            padding: 5px 30px 5px 30px; /* Making space around letters */\n            font-size: 18px; /* Changing font size */\n            display: flex;\n            flex-direction: column;\n            margin-bottom: 10px;\n            }\n        .e2 {\n                background-color: #e1ffc7;/* Changing background color */\n            border-radius: 20px; /* Making border radius */\n            width: fit-content; /* Making auto-sizable width */\n            height: fit-content; /* Making auto-sizable height */\n            padding: 5px 30px 5px 30px; /* Making space around letters */\n            font-size: 18px; /* Changing font size */\n            display: flex;\n            flex-direction: column;\n                margin-bottom: 10px;\n            }\n            .t {\n                display: inline-block;\n            }\n            #player {\n            position: sticky;\n            top: 20px;\n            float: right;\n        }\n    </style>\n\t<script>\n      var tag = document.createElement(\'script\');\n      tag.src = "https://www.youtube.com/iframe_api";\n      var firstScriptTag = document.getElementsByTagName(\'script\')[0];\n      firstScriptTag.parentNode.insertBefore(tag, firstScriptTag);\n      var player;\n      function onYouTubeIframeAPIReady() {\n        player = new YT.Player(\'player\', {\n          //height: \'210\',\n          //width: \'340\',\n          videoId: \'NSp2fEQ6wyA\',\n        });\n      }\n      function setCurrentTime(timepoint) {\n        player.seekTo(timepoint);\n       player.playVideo();\n      }\n    </script>\n  </head>\n  <body>\n    <h2>Freeman Dyson - Pure mathematics at Cambridge: the influence of Besicovitch </h2>\n  <div  id="player"></div>'
postS = '\t</body>\n</html>'

In [None]:
import webvtt

html = [preS]
gidx = -1
# Display name and CSS class of the box for each diarization label.
speakers = {'SPEAKER_00': ('Dyson', 'e1'), 'SPEAKER_01': ('Interviewer', 'e2')}
for g in groups:
  speaker = g[0].split()[-1]
  shift = re.findall(r'[0-9]+:[0-9]+:[0-9]+\.[0-9]+', string=g[0])[0]
  shift = millisec(shift) - spacermilli  # the start time in the original video
  shift = max(shift, 0)

  gidx += 1
  # Whisper wrote the captions of group gidx to '<gidx>.wav.vtt'.
  captions = [[millisec(caption.start), millisec(caption.end), caption.text]
              for caption in webvtt.read(str(gidx) + '.wav.vtt')]

  if captions:
    html.append(f'<div class="{speakers[speaker][1]}">\n')
    html.append(f'{speakers[speaker][0]}<br>\n')

    for c in captions:
      start = shift + c[0]

      start = start / 1000.0   # the time resolution of YouTube is one second
      # 05.2f zero-pads the seconds to two digits, e.g. 00:00:05.61
      startStr = '{0:02d}:{1:02d}:{2:05.2f}'.format(int(start // 3600),
                                                    int(start % 3600 // 60),
                                                    start % 60)
      html.append('<div class="c">')
      html.append(f'\t\t\t\t<a class="l" href="#{startStr}" id="{startStr}">#</a> \n')
      html.append(f'\t\t\t\t<div class="s"><a href="javascript:void(0);" onclick=setCurrentTime({int(start)})>{startStr}</a></div>\n')
      html.append(f'\t\t\t\t<div class="t"> {c[2]}</div><br>\n')
      html.append('</div>')

    html.append('</div>\n')

html.append(postS)
s = "".join(html)

with open("lexicap.html", "w") as text_file:
    text_file.write(s)
print(s)

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>Lexicap</title>
    <style>
        body {
            font-family: sans-serif;
            font-size: 18px;
            color: #111;
            padding: 0 0 1em 0;
	    background-color: #efe7dd;

        }
        table {
             border-spacing: 10px;
        }
        th { text-align: left;}
        .l {
          color: #050;
        }
        .s {
            display: inline-block;
        }
        .c {
            display: inline-block;
        }
        .e1 {
                background-color: white;/* Changing background color */
            border-radius: 20px; /* Making border radius */
            width: fit-content; /* Making auto-sizable width */
            height: fit-content; /* Making auto-sizable height */
            padding: 5px 30px 5px 30px; /* Maki