Notes on usage:

- Make sure to [change runtime to GPU](https://www.tutorialspoint.com/google_colab/google_colab_using_free_gpu.htm). 
- The transcript will be saved in Files, which you can find in the menu on the left.
- Change the number of speakers below if different from two.
- Pick a bigger model if you want more accuracy and a smaller model if you want the program to run faster ([more info](https://github.com/openai/whisper#available-models-and-languages)).
- If you know the language being spoken is English, then change language to 'English' as this improves performance.


High level overview of what's happening here:


1.   I'm using Open AI's Whisper model to seperate audio into segments and generate transcripts.
2.   I'm then generating speaker embeddings for each segments.
3.   Then I'm using agglomerative clustering on the embeddings to identify the speaker for each segment.   

Let me know if I can make it better!


In [1]:
# upload audio file
from google.colab import files
uploaded = files.upload()
path = next(iter(uploaded))

Saving audio.mp3 to audio.mp3


In [10]:
path = 'audio/audio.mp3'
num_speakers = 2 #@param {type:"integer"}

language = 'English' #@param ['any', 'English']

model_size = 'tiny' #@param ['tiny', 'base', 'small', 'medium', 'large']


model_name = model_size
if language == 'English' and model_size != 'large':
  model_name += '.en'


In [11]:
# !pip install -q git+https://github.com/openai/whisper.git > /dev/null
# !pip install -q git+https://github.com/pyannote/pyannote-audio > /dev/null

import whisper
import datetime

import subprocess

import torch
import pyannote.audio
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
embedding_model = PretrainedSpeakerEmbedding( 
    "speechbrain/spkrec-ecapa-voxceleb")

from pyannote.audio import Audio
from pyannote.core import Segment

import wave
import contextlib

from sklearn.cluster import AgglomerativeClustering
import numpy as np

In [12]:
if path[-3:] != 'wav':
  subprocess.call(['ffmpeg', '-i', path, 'audio.wav', '-y'])
  path = 'audio.wav'

In [13]:
model = whisper.load_model(model_size)

100%|█████████████████████████████████████| 72.1M/72.1M [00:02<00:00, 36.0MiB/s]


In [14]:
result = model.transcribe(path)
segments = result["segments"]

In [15]:
with contextlib.closing(wave.open(path,'r')) as f:
  frames = f.getnframes()
  rate = f.getframerate()
  duration = frames / float(rate)

In [16]:
audio = Audio()

def segment_embedding(segment):
  start = segment["start"]
  # Whisper overshoots the end timestamp in the last segment
  end = min(duration, segment["end"])
  clip = Segment(start, end)
  waveform, sample_rate = audio.crop(path, clip)
  return embedding_model(waveform[None])

In [17]:
embeddings = np.zeros(shape=(len(segments), 192))
for i, segment in enumerate(segments):
  embeddings[i] = segment_embedding(segment)

embeddings = np.nan_to_num(embeddings)

In [18]:
clustering = AgglomerativeClustering(num_speakers).fit(embeddings)
labels = clustering.labels_
for i in range(len(segments)):
  segments[i]["speaker"] = 'SPEAKER ' + str(labels[i] + 1)

In [19]:
def time(secs):
  return datetime.timedelta(seconds=round(secs))

f = open("transcripts/transcript.txt", "w")

for (i, segment) in enumerate(segments):
  if i == 0 or segments[i - 1]["speaker"] != segment["speaker"]:
    f.write("\n" + segment["speaker"] + ' ' + str(time(segment["start"])) + '\n')
  f.write(segment["text"][1:] + ' ')
f.close()

In [21]:
with open('transcripts/transcript.txt','r') as f:
  text = f.read()

In [22]:
text

"\nSPEAKER 1 0:00:00\nI want to do it here, you want to do some more things. \nSPEAKER 2 0:00:02\nNo, I think this is good because it's a cool environment. \nSPEAKER 1 0:00:04\nSure. It might be that if that comes on, it's going to be a problem. \nSPEAKER 2 0:00:07\nWell, let's try it. Okay. So how long will we be lived in Carthage? \nSPEAKER 1 0:00:11\nMy whole life. \nSPEAKER 2 0:00:12\nAnd I need to look at it. \nSPEAKER 1 0:00:13\nYeah. \nSPEAKER 2 0:00:15\nHave you ever thought of leaving Carthage? \nSPEAKER 1 0:00:18\nUm, not really. No. Never had a good reason to leave. \nSPEAKER 2 0:00:26\nAnd why did you stay? What's your reasons for staying? \nSPEAKER 1 0:00:29\nI was home. You know, and, you know, my wife from here just makes, you know, her family's here, my family's here for the most part. So I guess it just made good sense. Had a good opportunity for the career. That wasn't any reason to move. \nSPEAKER 2 0:00:45\nAnd what is your career? \nSPEAKER 1 0:00:47\nWe sell, like

In [23]:
named_text = text.replace('SPEAKER 1', 'Brandon').replace('SPEAKER 2', 'Stephanie')

In [24]:
print( named_text)


Brandon 0:00:00
I want to do it here, you want to do some more things. 
Stephanie 0:00:02
No, I think this is good because it's a cool environment. 
Brandon 0:00:04
Sure. It might be that if that comes on, it's going to be a problem. 
Stephanie 0:00:07
Well, let's try it. Okay. So how long will we be lived in Carthage? 
Brandon 0:00:11
My whole life. 
Stephanie 0:00:12
And I need to look at it. 
Brandon 0:00:13
Yeah. 
Stephanie 0:00:15
Have you ever thought of leaving Carthage? 
Brandon 0:00:18
Um, not really. No. Never had a good reason to leave. 
Stephanie 0:00:26
And why did you stay? What's your reasons for staying? 
Brandon 0:00:29
I was home. You know, and, you know, my wife from here just makes, you know, her family's here, my family's here for the most part. So I guess it just made good sense. Had a good opportunity for the career. That wasn't any reason to move. 
Stephanie 0:00:45
And what is your career? 
Brandon 0:00:47
We sell, like, keeping our commission equipment primar