<a href="https://colab.research.google.com/github/pyannote/pyannote-audio/blob/develop/tutorials/intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Whisper's transcription plus Pyannote's Diarization
OpenAI's [**Whisper**](https://openai.com/blog/whisper/) does a great job of transcribing audio files; see how it "[beats it](https://colab.research.google.com/drive/1T5iOKDbyv9_8cCI1J0hSfOG3oBMX49Zx?usp=sharing)"!

Andrej Karpathy's [Lexicap](https://karpathy.ai/lexicap/index.html) uses Whisper to transcribe all of Lex Fridman's podcasts.

Andrej [suggests](https://twitter.com/karpathy/status/1574476200801538048?s=20&t=s5IMMXOYjBI6-91dib6w8g) training a classifier on top of Whisper model features to identify Lex, so that the speaker can be shown in the transcript. But, as Christian Perone [pointed out](https://twitter.com/tarantulae/status/1574493613362388992?s=20&t=s5IMMXOYjBI6-91dib6w8g), Whisper's features are unlikely to work well for speaker recognition, because its main objective is essentially to ignore speaker differences.

In the following, I use [**`pyannote-audio`**](https://github.com/pyannote/pyannote-audio), a speaker diarization toolkit by Hervé Bredin, to identify the speakers, and then match its output with Whisper's transcription. I do this on the first 30 minutes of Lex's second [interview](https://youtu.be/SGzMElJ11Cc) with Yann LeCun. Check the result [**here**](https://majdoddin.github.io/lexicap.html).

The result is decent, although it is tricky to match transcript lines to the diarization at the moments when the speaker changes; see, for example, [this part](https://majdoddin.github.io/lexicap.html#00:03:09.520).

I think we should come up with a clever way to combine the two neural networks to generate transcriptions with diarization.

Installing `yt-dlp` and downloading the [video](https://).

In [None]:
!pip install -U yt-dlp

In [None]:
!wget -O - -q  https://github.com/yt-dlp/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-gpl.tar.xz | xz -qdc| tar -x

In [None]:
!yt-dlp -xv --ffmpeg-location ffmpeg-master-latest-linux64-gpl/bin --audio-format wav  -o lecun.wav -- https://youtu.be/SGzMElJ11Cc


Cutting the first 30 minutes of the audio for further processing.


In [None]:
!pip install pydub

In [None]:
from pydub import AudioSegment

t1 = 0 * 1000  # start of the clip; pydub slices in milliseconds
t2 = 30 * 60 * 1000  # end of the clip: 30 minutes

newAudio = AudioSegment.from_wav("lecun.wav")
a = newAudio[t1:t2]
a.export("lecun1.wav", format="wav")  # writes the clip to the current directory
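
If installing `pydub` is not an option, the same cut can be sketched with Python's standard-library `wave` module. This is only an illustration, not what this notebook runs: `trim_wav` and its parameters are names I made up, and it assumes the input is an uncompressed PCM WAV file, which is all that `wave` supports.

```python
import wave


def trim_wav(src_path, dst_path, start_s, end_s, chunk_frames=65536):
    """Copy the [start_s, end_s) part of a PCM WAV file into dst_path.

    Hypothetical helper for illustration. Returns the clip length in seconds.
    """
    with wave.open(src_path, "rb") as src, wave.open(dst_path, "wb") as dst:
        rate = src.getframerate()
        total = src.getnframes()
        # Clamp the requested range to the frames that actually exist.
        start = min(max(int(start_s * rate), 0), total)
        end = min(max(int(end_s * rate), start), total)
        frame_size = src.getsampwidth() * src.getnchannels()

        # Same channels, sample width and rate; the frame count in the
        # header is corrected automatically when dst is closed.
        dst.setparams(src.getparams())
        src.setpos(start)

        remaining = end - start
        while remaining > 0:
            # Copy in chunks so a long recording is never fully in memory.
            chunk = src.readframes(min(remaining, chunk_frames))
            if not chunk:
                break
            dst.writeframes(chunk)
            remaining -= len(chunk) // frame_size
    return (end - start) / rate


# Assumed usage, mirroring the pydub cell above:
# trim_wav("lecun.wav", "lecun1.wav", 0, 30 * 60)
```

Copying in chunks keeps memory use flat, whereas slicing an `AudioSegment` loads the whole recording first.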


Installing OpenAI's Whisper and running it on the audio. It writes the transcription to a file.

In [None]:
!pip install git+https://github.com/openai/whisper.git 

In [None]:
transcription = !whisper lecun1.wav --language en --model large

Reading the transcription file.

In [14]:
!pip install -U webvtt-py

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting webvtt-py
  Downloading webvtt_py-0.4.6-py3-none-any.whl (16 kB)
Installing collected packages: webvtt-py
Successfully installed webvtt-py-0.4.6


In [12]:
def time(timeStr):
  """Convert an 'HH:MM:SS.mmm' timestamp string to seconds (float)."""
  hours, minutes, seconds = timeStr.split(":")
  return int(hours) * 60 * 60 + int(minutes) * 60 + float(seconds)

In [15]:
import webvtt

captions = []
for caption in webvtt.read('lecun1.wav.vtt'):
  captions.append([time(caption.start), time(caption.end), caption.start, caption.text])

In [80]:
for i in range(10):
  print(captions[i])

[0.0, 2.72, '00:00:00.000', 'The following is a conversation with Yann LeCun,']
[2.72, 4.56, '00:00:02.720', 'his second time on the podcast.']
[4.56, 9.18, '00:00:04.560', 'He is the chief AI scientist at Meta, formerly Facebook,']
[9.18, 13.08, '00:00:09.180', 'professor at NYU, touring award winner,']
[13.08, 15.64, '00:00:13.080', 'one of the seminal figures in the history']
[15.64, 18.48, '00:00:15.640', 'of machine learning and artificial intelligence,']
[18.48, 21.96, '00:00:18.480', 'and someone who is brilliant and opinionated']
[21.96, 23.44, '00:00:21.960', 'in the best kind of way,']
[23.44, 26.0, '00:00:23.440', 'and so it was always fun to talk to him.']
[26.0, 28.0, '00:00:26.000', 'This is the Lex Friedman podcast.']
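
The `time` helper above expects timestamps with three fields, which is what `webvtt-py` returns in the output shown here. The WebVTT format also allows the hours field to be omitted (`MM:SS.mmm`), so if you ever parse raw timestamps yourself, a slightly more tolerant version might look like the sketch below. `vtt_to_seconds` is a hypothetical name, not part of `webvtt-py`.

```python
def vtt_to_seconds(stamp):
    """Convert 'HH:MM:SS.mmm' or 'MM:SS.mmm' to seconds (float).

    Hypothetical helper for illustration only.
    """
    parts = stamp.strip().split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected timestamp format: {stamp!r}")
    seconds = float(parts[-1])
    minutes = int(parts[-2])
    # The hours field is optional in WebVTT timestamps.
    hours = int(parts[0]) if len(parts) == 3 else 0
    return hours * 3600 + minutes * 60 + seconds


print(vtt_to_seconds("00:03:09.520"))
print(vtt_to_seconds("03:09.520"))
```

Both calls refer to the same moment, three minutes and 9.52 seconds into the audio.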


[`pyannote.audio`](https://github.com/pyannote/pyannote-audio) is an open-source toolkit written in Python for **speaker diarization**. 

Based on the [`PyTorch`](https://pytorch.org) machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines.

`pyannote.audio` also comes with pretrained [models](https://huggingface.co/models?other=pyannote-audio-model) and [pipelines](https://huggingface.co/models?other=pyannote-audio-pipeline) covering a wide range of domains for voice activity detection, speaker segmentation, overlapped speech detection, and speaker embedding, reaching state-of-the-art performance for most of them.

**This notebook will teach you how to apply those pretrained pipelines on your own data.**

Make sure you run it on a GPU; otherwise it might be slow.

Installing `pyannote.audio` and running it on the audio to generate the diarization.

In [1]:
!pip install -U  pyannote.audio

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyannote.audio
  Downloading pyannote.audio-2.0.1-py2.py3-none-any.whl (385 kB)
Collecting hmmlearn<0.3,>=0.2.7
  Downloading hmmlearn-0.2.8-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (217 kB)
Collecting torch-audiomentations>=0.11.0
  Downloading torch_audiomentations-0.11.0-py3-none-any.whl (47 kB)
Collecting pyannote.metrics<4.0,>=3.2
  Downloading pyannote.metrics-3.2.1-py3-none-any.whl (51 kB)
Collecting pytorch-lightning<1.7,>=1.5.4
  Downloading pytorch_lightning-1.6.5-py3-none-any.whl (585 kB)
Collecting torchmetrics<1.0,>=0.6
  Downloading torchm

In [2]:
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization')


In [6]:
DEMO_FILE = {'uri': 'blabal', 'audio': 'lecun1.wav'}
dz = pipeline(DEMO_FILE)  

In [38]:
dzList= list(dz.itertracks(yield_label=True))

In [90]:
for i in range(10):
  print(dzList[i])

(<Segment(0.497812, 25.5066)>, 1, 'SPEAKER_01')
(<Segment(25.9622, 34.8216)>, 1, 'SPEAKER_01')
(<Segment(36.1041, 49.3678)>, 1, 'SPEAKER_01')
(<Segment(49.8572, 87.5728)>, 0, 'SPEAKER_00')
(<Segment(87.6909, 88.1972)>, 0, 'SPEAKER_00')
(<Segment(89.2941, 90.9478)>, 0, 'SPEAKER_00')
(<Segment(92.8884, 114.438)>, 1, 'SPEAKER_01')
(<Segment(114.438, 149.656)>, 0, 'SPEAKER_00')
(<Segment(150.027, 177.112)>, 0, 'SPEAKER_00')
(<Segment(178.31, 192.046)>, 0, 'SPEAKER_00')
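
The output above contains several consecutive segments from the same speaker, separated by short pauses. One optional preprocessing step, which this notebook does not do, is to merge such neighbours before matching them to captions. Below is a minimal sketch, assuming the segments were first converted to plain `(start, end, label)` tuples; `merge_turns` and the `max_gap` threshold are my own illustrative choices.

```python
def merge_turns(turns, max_gap=1.0):
    """Merge consecutive turns of one speaker separated by at most max_gap seconds.

    turns: iterable of (start, end, label) tuples. Hypothetical helper.
    """
    merged = []
    for start, end, label in sorted(turns):
        same_speaker = bool(merged) and merged[-1][2] == label
        if same_speaker and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, _ = merged[-1]
            # Extend the previous turn instead of starting a new one.
            merged[-1] = (prev_start, max(prev_end, end), label)
        else:
            merged.append((start, end, label))
    return merged


# The first five segments from the output above, as plain tuples.
turns = [
    (0.497812, 25.5066, "SPEAKER_01"),
    (25.9622, 34.8216, "SPEAKER_01"),
    (36.1041, 49.3678, "SPEAKER_01"),
    (49.8572, 87.5728, "SPEAKER_00"),
    (87.6909, 88.1972, "SPEAKER_00"),
]
for turn in merge_turns(turns):
    print(turn)
```

With the real pipeline output, the tuples could be built as `[(seg.start, seg.end, label) for seg, _, label in dzList]`.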


Matching each transcription line to a diarization segment, and generating HTML tags.
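
The cell in this notebook walks through the captions and diarization segments in order. An alternative that may behave better around speaker changes is to give each caption the speaker whose segments overlap it the longest. The sketch below illustrates that idea on plain tuples; it is not the method used to produce the linked result, and `overlap` and `assign_speakers` are hypothetical names.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(captions, turns):
    """Label each caption with the speaker that overlaps it the longest.

    captions: list of (start, end, text)
    turns:    list of (start, end, label)
    Returns a list of (label or None, text).
    """
    result = []
    for c_start, c_end, text in captions:
        totals = {}
        for t_start, t_end, label in turns:
            shared = overlap(c_start, c_end, t_start, t_end)
            if shared > 0:
                totals[label] = totals.get(label, 0.0) + shared
        # None means that no diarization segment covers this caption.
        speaker = max(totals, key=totals.get) if totals else None
        result.append((speaker, text))
    return result


# Segment times are taken from the diarization output above;
# the captions are made-up examples.
turns = [
    (36.1041, 49.3678, "SPEAKER_01"),
    (49.8572, 87.5728, "SPEAKER_00"),
]
captions = [
    (40.0, 44.0, "entirely inside the first turn"),
    (48.0, 52.0, "straddles the speaker change"),
    (90.0, 91.0, "outside every turn"),
]
for speaker, text in assign_speakers(captions, turns):
    print(speaker, text)
```

This compares every caption with every segment, which is fine for a 30-minute clip; for longer recordings a sweep over both sorted lists would avoid the quadratic cost.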

In [18]:
preS = '<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="UTF-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <meta http-equiv="X-UA-Compatible" content="ie=edge">\n    <title>Lexicap</title>\n    <style>\n        body {\n            font-family: sans-serif;\n            font-size: 18px;\n            color: #111;\n            padding: 0 0 1em 0;\n        }\n        .l {\n          color: #050;\n        }\n        .s {\n            display: inline-block;\n        }\n        .e {\n            display: inline-block;\n        }\n        .t {\n            display: inline-block;\n        }\n        #player {\n\t\tposition: sticky;\n\t\ttop: 20px;\n\t\tfloat: right;\n\t}\n    </style>\n  </head>\n  <body>\n    <h2>Yann LeCun: Dark Matter of Intelligence and Self-Supervised Learning | Lex Fridman Podcast #258</h2>\n  <div  id="player"></div>\n    <script>\n      var tag = document.createElement(\'script\');\n      tag.src = "https://www.youtube.com/iframe_api";\n      var firstScriptTag = document.getElementsByTagName(\'script\')[0];\n      firstScriptTag.parentNode.insertBefore(tag, firstScriptTag);\n      var player;\n      function onYouTubeIframeAPIReady() {\n        player = new YT.Player(\'player\', {\n          height: \'210\',\n          width: \'340\',\n          videoId: \'SGzMElJ11Cc\',\n        });\n      }\n      function setCurrentTime(timepoint) {\n        player.seekTo(timepoint);\n      }\n    </script><br>\n    <a href="0258-large.html">large model</a><br></div>\n'
postS = '\t</body>\n</html>'

In [74]:
html = [preS]  # HTML fragments, joined into one string at the end
idx = 0  # index of the diarization segment currently being matched
for c in captions:  # c = [start_sec, end_sec, start_timestamp, text]
  if c[0] >= 30 * 60:
    break
  if idx < len(dzList):
    # Advance while the caption starts after the current segment ends,
    # or ends more than 0.8 s after it.
    while (c[0] >= dzList[idx][0].end) or (c[1] > dzList[idx][0].end + 0.8):
      idx += 1
      if idx == len(dzList):
        idx = len(dzList) - 1  # ran out of segments: stay on the last one
        break
  if idx < len(dzList):
    # SPEAKER_01 is labeled as Lex; any other label is shown as LeCun.
    html.append('\t\t\t<div class="c">\n')
    html.append(f'\t\t\t\t<a class="l" href="#{c[2]}" id="{c[2]}">link</a> |\n')
    html.append(f'\t\t\t\t<div class="s"><a href="javascript:void(0);" onclick=setCurrentTime({int(c[0])})>{c[2]}</a></div>\n')
    html.append(f'\t\t\t\t<div class="t">{"[Lex]" if dzList[idx][2]=="SPEAKER_01" else "[LeCun]"} {c[3]}</div>\n')
    html.append('\t\t\t</div>\n\n')
html.append(postS)
s = "".join(html)
print(s)
with open("lexicap.html", "w") as text_file:
    text_file.write(s)



<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta http-equiv="X-UA-Compatible" content="ie=edge">
    <title>Lexicap</title>
    <style>
        body {
            font-family: sans-serif;
            font-size: 18px;
            color: #111;
            padding: 0 0 1em 0;
        }
        .l {
          color: #050;
        }
        .s {
            display: inline-block;
        }
        .e {
            display: inline-block;
        }
        .t {
            display: inline-block;
        }
        #player {
		position: sticky;
		top: 20px;
		float: right;
	}
    </style>
  </head>
  <body>
    <h2>Yann LeCun: Dark Matter of Intelligence and Self-Supervised Learning | Lex Fridman Podcast #258</h2>
  <div  id="player"></div>
    <script>
      var tag = document.createElement('script');
      tag.src = "https://www.youtube.com/iframe_api";
      var firstScriptTag = document.