# **"Speaker Diarization with WhisperX"**

### The purpose of this notebook is to use the [WhisperX](https://github.com/m-bain/whisperx) library to perform speaker diarization on a given audio file.

#### **Libraries/Repos Used:**
- [WhisperX](https://github.com/m-bain/whisperx)
- [Pyannote](https://github.com/pyannote/pyannote-audio)
- [Hugging Face](https://huggingface.co/)
- [Hugging Face Hub](https://huggingface.co/)

#### **Workflow:**
- Load audio
- Transcribe using the Whisper model
- Chunk-level transcriptions
- Word-level transcriptions (aligning the words)
- Speaker diarization



### 1. Install WhisperX

In [None]:
!pip install -q git+https://github.com/m-bain/whisperx.git

### 2. Import Libraries

In [None]:
import whisperx
import gc

### 3. Set Parameters
##### compute_type: "float16" or "int8". "float16" is the usual choice on a GPU; "int8" uses less memory and can help when GPU memory is tight, but it may reduce accuracy.
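One way to make this choice explicit is a small helper that picks the device and compute type together. This is only a sketch: `pick_settings` is a hypothetical helper, not part of WhisperX, and its rules (CPU falls back to "int8", GPU prefers "float16" unless memory is tight) are assumptions rather than library defaults.

```python
def pick_settings(has_cuda: bool, low_memory: bool = False) -> tuple:
    """Return a (device, compute_type) pair to pass to whisperx.load_model.

    Hypothetical helper: the rules below are assumptions, not WhisperX defaults.
    """
    if not has_cuda:
        # float16 generally needs a GPU, so fall back to int8 on CPU
        return ("cpu", "int8")
    # On a GPU, prefer float16 unless memory is tight
    return ("cuda", "int8" if low_memory else "float16")


chosen_device, chosen_compute_type = pick_settings(has_cuda=True)
print(chosen_device, chosen_compute_type)
```

If PyTorch is installed, `torch.cuda.is_available()` can supply the `has_cuda` argument.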

In [3]:
device = "cuda"
batch_size = 4
compute_type = "float16"

## **4. Speech to text Transcription:**

### Load Audio

In [4]:
audio_file = "/content/talk.mp3"

### 1) Transcribe using Whisper model:

In [5]:
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

No language specified, language will be first be detected for each audio file (increases inference time).


100%|█████████████████████████████████████| 16.9M/16.9M [00:02<00:00, 7.94MiB/s]
INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.4.0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.cache/torch/whisperx-vad-segmentation.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.3.1+cu121. Bad things might happen unless you revert torch to 1.x.


### Transcribe:

In [6]:
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment

Detected language: en (0.98) in first 30s of audio...
[{'text': " What's that? I don't know where the question... I don't know where the song is. Well, can I ask you a question? The first one? Right. What is the worst thing about being young? Well, you get lots of homework. It's also pretty... It's... They're like in the middle. Like in school, like in the middle of bad and good.", 'start': 5.247, 'end': 32.944}, {'text': " What is the worst thing about being old? Not being able to do things that you could do when you were young. Like, you can't bend down and get stuff on the floor? Well, I can still do that. But the problem is your body gets a bit stiff. I know it hurts a lot when you try to get down when you are old. That's right, yes. You might get sick more often. Hopefully I don't. But that's the problem. Yeah, that's pretty bad. It is pretty bad.", 'start': 33.217, 'end': 63.097}, {'text': ' The only time I went to hospital is my mum to like get me born. Do you wish you were old?
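The raw list is hard to read when printed in one go. As a minimal sketch, the chunk-level segments can be rendered one per line with their timestamps. `format_segments` is a hypothetical helper, and the sample below is a shortened stand-in for `result["segments"]` that reuses the first two timestamp pairs from the output above.

```python
def format_segments(segments):
    """Render chunk-level segments as '[start - end] text' lines."""
    lines = []
    for seg in segments:
        text = seg["text"].strip()  # Whisper output usually starts with a space
        lines.append(f"[{seg['start']:.3f} - {seg['end']:.3f}] {text}")
    return lines


# Stand-in data: timestamps from the output above, text shortened for brevity
sample_segments = [
    {"text": " What's that?", "start": 5.247, "end": 32.944},
    {"text": " What is the worst thing about being old?", "start": 33.217, "end": 63.097},
]

for line in format_segments(sample_segments):
    print(line)
```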

### Chunk level transcriptions:

In [7]:
result

{'segments': [{'text': " What's that? I don't know where the question... I don't know where the song is. Well, can I ask you a question? The first one? Right. What is the worst thing about being young? Well, you get lots of homework. It's also pretty... It's... They're like in the middle. Like in school, like in the middle of bad and good.",
   'start': 5.247,
   'end': 32.944},
  {'text': " What is the worst thing about being old? Not being able to do things that you could do when you were young. Like, you can't bend down and get stuff on the floor? Well, I can still do that. But the problem is your body gets a bit stiff. I know it hurts a lot when you try to get down when you are old. That's right, yes. You might get sick more often. Hopefully I don't. But that's the problem. Yeah, that's pretty bad. It is pretty bad.",
   'start': 33.217,
   'end': 63.097},
  {'text': ' The only time I went to hospital is my mum to like get me born. Do you wish you were old? Maybe, like so I was old

### 2) Word-level transcriptions (aligning the words)

##### This is a two-step process. First, the audio is transcribed with the Whisper model. Then `whisperx.load_align_model` takes the detected language and returns an alignment model with its metadata, which `whisperx.align` uses to attach word-level timestamps to the transcribed segments.

In [8]:
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)

result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
# word-level alignment is enough here, so per-character alignments are not requested (return_char_alignments=False)

print(result["segments"]) # after alignment

Downloading: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ls960.pth" to /root/.cache/torch/hub/checkpoints/wav2vec2_fairseq_base_ls960_asr_ls960.pth
100%|██████████| 360M/360M [00:01<00:00, 223MB/s]


[{'start': 7.168, 'end': 7.568, 'text': " What's that?", 'words': [{'word': "What's", 'start': 7.168, 'end': 7.468, 'score': 0.296}, {'word': 'that?', 'start': 7.488, 'end': 7.568, 'score': 0.122}]}, {'start': 7.588, 'end': 10.69, 'text': "I don't know where the question... I don't know where the song is.", 'words': [{'word': 'I', 'start': 7.588, 'end': 7.608, 'score': 0.002}, {'word': "don't", 'start': 7.628, 'end': 7.769, 'score': 0.771}, {'word': 'know', 'start': 7.789, 'end': 7.929, 'score': 0.795}, {'word': 'where', 'start': 7.949, 'end': 8.149, 'score': 0.576}, {'word': 'the', 'start': 8.189, 'end': 8.349, 'score': 0.486}, {'word': 'question...', 'start': 8.369, 'end': 9.049, 'score': 0.75}, {'word': 'I', 'start': 9.51, 'end': 9.57, 'score': 0.383}, {'word': "don't", 'start': 9.61, 'end': 9.73, 'score': 0.271}, {'word': 'know', 'start': 9.75, 'end': 9.85, 'score': 0.162}, {'word': 'where', 'start': 9.89, 'end': 10.07, 'score': 0.531}, {'word': 'the', 'start': 10.09, 'end': 10.21,
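Each aligned word carries a `score`, and several of the values above are quite low. A minimal sketch for flagging such words follows. `low_confidence_words` and the 0.2 threshold are assumptions chosen for illustration, and words without a `score` key are simply skipped.

```python
def low_confidence_words(segments, threshold=0.2):
    """Return (word, score) pairs whose alignment score is below the threshold."""
    flagged = []
    for seg in segments:
        for word in seg.get("words", []):
            score = word.get("score")
            # Skip words that carry no score at all
            if score is not None and score < threshold:
                flagged.append((word["word"], score))
    return flagged


# First aligned segment, copied from the output above
aligned_sample = [
    {
        "start": 7.168,
        "end": 7.568,
        "text": " What's that?",
        "words": [
            {"word": "What's", "start": 7.168, "end": 7.468, "score": 0.296},
            {"word": "that?", "start": 7.488, "end": 7.568, "score": 0.122},
        ],
    },
]

print(low_confidence_words(aligned_sample))
```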

## **5. Identify Multiple Speakers:**

##### Identify the speakers so that each one is assigned a distinct ID in the transcript

In [9]:
audio_file = "/content/talk.mp3"

In [10]:
audio = whisperx.load_audio(audio_file)

### Diarization model of WhisperX


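##### Running this line for the first time typically fails with a pyannote access error. The pyannote diarization models are gated on Hugging Face, so you need to accept their user conditions there and pass your own Hugging Face access token. See: [pyannote_link](https://github.com/pyannote/pyannote-audio)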
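To avoid pasting the token into the notebook, it can be read from an environment variable. This is a sketch under assumptions: `get_hf_token` is a hypothetical helper, and the variable name `HF_TOKEN` is just a convention chosen here.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read the Hugging Face access token from an environment variable."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Hugging Face access token."
        )
    return token


# Intended use with the call from this notebook (not executed here):
# diarize_model = whisperx.DiarizationPipeline(use_auth_token=get_hf_token(), device=device)
```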


In [20]:
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HUGGING_FACE_TOKEN", device=device)  # use your hugging face token key

pytorch_model.bin:   0%|          | 0.00/26.6M [00:00<?, ?B/s]

config.yaml:   0%|          | 0.00/221 [00:00<?, ?B/s]

In [21]:
diarize_segments = diarize_model(audio, min_speakers=2, max_speakers=10)

In [22]:
diarize_segments

| | segment | label | speaker | start | end |
|---|---|---|---|---|---|
| 0 | [ 00:00:02.538 --> 00:00:03.726] | A | SPEAKER_00 | 2.538200 | 3.726655 |
| 1 | [ 00:00:03.726 --> 00:00:04.711] | B | SPEAKER_01 | 3.726655 | 4.711375 |
| 2 | [ 00:00:04.711 --> 00:00:04.949] | C | SPEAKER_00 | 4.711375 | 4.949066 |
| 3 | [ 00:00:04.949 --> 00:00:05.000] | D | SPEAKER_01 | 4.949066 | 5.000000 |
| 4 | [ 00:00:05.152 --> 00:00:05.865] | E | SPEAKER_01 | 5.152801 | 5.865874 |
| ... | ... | ... | ... | ... | ... |
| 101 | [ 00:04:10.534 --> 00:04:14.032] | CX | SPEAKER_01 | 250.534805 | 254.032258 |
| 102 | [ 00:04:14.405 --> 00:04:15.560] | CY | SPEAKER_01 | 254.405772 | 255.560272 |
| 103 | [ 00:04:16.273 --> 00:04:18.005] | CZ | SPEAKER_01 | 256.273345 | 258.005093 |
| 104 | [ 00:04:18.412 --> 00:04:19.753] | DA | SPEAKER_01 | 258.412564 | 259.753820 |
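A quick sanity check on the diarization output is to total the speaking time per speaker. The sketch below works on plain dictionaries and uses only the first five rows shown above. `talk_time` is a hypothetical helper. Assuming `diarize_segments` is a pandas DataFrame, as the output suggests, `diarize_segments.to_dict("records")` would produce rows in this shape.

```python
from collections import defaultdict


def talk_time(rows):
    """Sum the duration in seconds of all diarization turns per speaker."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["speaker"]] += row["end"] - row["start"]
    return dict(totals)


# First five rows of the diarization output above
rows = [
    {"speaker": "SPEAKER_00", "start": 2.538200, "end": 3.726655},
    {"speaker": "SPEAKER_01", "start": 3.726655, "end": 4.711375},
    {"speaker": "SPEAKER_00", "start": 4.711375, "end": 4.949066},
    {"speaker": "SPEAKER_01", "start": 4.949066, "end": 5.000000},
    {"speaker": "SPEAKER_01", "start": 5.152801, "end": 5.865874},
]

for speaker, seconds in sorted(talk_time(rows).items()):
    print(f"{speaker}: {seconds:.2f}s")
```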


## **6. Combine the output of Speaker Identification and Speech Transcription**

In [23]:
result = whisperx.assign_word_speakers(diarize_segments, result)
print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs

                               segment label     speaker       start  \
0    [ 00:00:02.538 -->  00:00:03.726]     A  SPEAKER_00    2.538200   
1    [ 00:00:03.726 -->  00:00:04.711]     B  SPEAKER_01    3.726655   
2    [ 00:00:04.711 -->  00:00:04.949]     C  SPEAKER_00    4.711375   
3    [ 00:00:04.949 -->  00:00:05.000]     D  SPEAKER_01    4.949066   
4    [ 00:00:05.152 -->  00:00:05.865]     E  SPEAKER_01    5.152801   
..                                 ...   ...         ...         ...   
101  [ 00:04:10.534 -->  00:04:14.032]    CX  SPEAKER_01  250.534805   
102  [ 00:04:14.405 -->  00:04:15.560]    CY  SPEAKER_01  254.405772   
103  [ 00:04:16.273 -->  00:04:18.005]    CZ  SPEAKER_01  256.273345   
104  [ 00:04:18.412 -->  00:04:19.753]    DA  SPEAKER_01  258.412564   
105  [ 00:04:20.280 -->  00:04:20.704]    DB  SPEAKER_01  260.280136   

            end  intersection       union  
0      3.726655   -256.804345  258.052800  
1      4.711375   -255.819625  256.864345  
2  
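The `intersection` and `union` columns in the printed frame suggest that speakers are matched to the transcript by temporal overlap. The sketch below illustrates that idea in simplified form. It is not WhisperX's actual implementation, `pick_speaker` is a hypothetical helper, and the 3.0 s to 4.0 s query interval is made up for the example.

```python
def pick_speaker(start, end, diar_rows):
    """Return the speaker whose turns overlap [start, end] the most, or None."""
    overlap = {}
    for row in diar_rows:
        # Length of the shared interval; zero or negative means no overlap
        shared = min(end, row["end"]) - max(start, row["start"])
        if shared > 0:
            overlap[row["speaker"]] = overlap.get(row["speaker"], 0.0) + shared
    if not overlap:
        return None
    return max(overlap, key=overlap.get)


# First two diarization turns from the output above
diar_rows = [
    {"speaker": "SPEAKER_00", "start": 2.538200, "end": 3.726655},
    {"speaker": "SPEAKER_01", "start": 3.726655, "end": 4.711375},
]

# Hypothetical word spanning 3.0 s to 4.0 s
print(pick_speaker(3.0, 4.0, diar_rows))
```

A word that straddles two turns goes to the speaker with the longer shared interval, which is one plausible reason the two words of the first segment end up with different speaker IDs.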

## **7. Results**

### Timestamp Format

In [24]:
result

{'segments': [{'start': 7.168,
   'end': 7.568,
   'text': " What's that?",
   'words': [{'word': "What's",
     'start': 7.168,
     'end': 7.468,
     'score': 0.296,
     'speaker': 'SPEAKER_01'},
    {'word': 'that?',
     'start': 7.488,
     'end': 7.568,
     'score': 0.122,
     'speaker': 'SPEAKER_00'}],
   'speaker': 'SPEAKER_01'},
  {'start': 7.588,
   'end': 10.69,
   'text': "I don't know where the question... I don't know where the song is.",
   'words': [{'word': 'I',
     'start': 7.588,
     'end': 7.608,
     'score': 0.002,
     'speaker': 'SPEAKER_00'},
    {'word': "don't",
     'start': 7.628,
     'end': 7.769,
     'score': 0.771,
     'speaker': 'SPEAKER_00'},
    {'word': 'know',
     'start': 7.789,
     'end': 7.929,
     'score': 0.795,
     'speaker': 'SPEAKER_00'},
    {'word': 'where',
     'start': 7.949,
     'end': 8.149,
     'score': 0.576,
     'speaker': 'SPEAKER_00'},
    {'word': 'the',
     'start': 8.189,
     'end': 8.349,
     'score': 0.486
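The `start` and `end` values here are plain seconds. For display they can be converted to the `HH:MM:SS.mmm` style used in the diarization `segment` column. A minimal sketch, with `format_timestamp` as a hypothetical helper that rounds to the nearest millisecond:

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to an 'HH:MM:SS.mmm' string."""
    total_ms = round(seconds * 1000)  # nearest millisecond
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


# Start and end of the first aligned segment above
print(format_timestamp(7.168), "-->", format_timestamp(7.568))
```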

### Conversational Format

In [29]:
import re

# Format the output
conversation = []  # List to hold the formatted conversation
for segment in result["segments"]:
    speaker_id_str = segment.get("speaker", "")  # Get the speaker ID; empty if no speaker was assigned to this segment
    text = segment["text"].strip()        # Get the transcribed text and strip whitespace

    # Extract the numeric part of the speaker ID (e.g., from 'SPEAKER_01' to '1')
    speaker_id_match = re.search(r'(\d+)', speaker_id_str)
    if speaker_id_match:
        speaker_id = int(speaker_id_match.group(1))  # Convert the extracted number to an integer
    else:
        continue  # Skip this segment if no valid speaker ID is found

    # Create a speaker label (e.g., Speaker A, Speaker B)
    speaker_label = f"Speaker {chr(65 + speaker_id)}"  # Converts 0 -> 'A', 1 -> 'B', etc.

    # Append the formatted text to the conversation list
    conversation.append(f"{speaker_label}: {text}")

# Print the formatted conversation
for line in conversation:
    print(line)

Speaker B: What's that?
Speaker A: I don't know where the question... I don't know where the song is.
Speaker B: Well, can I ask you a question?
Speaker B: The first one?
Speaker B: Right.
Speaker B: What is the worst thing about being young?
Speaker A: Well, you get lots of homework.
Speaker A: It's also pretty... It's... They're like in the middle.
Speaker A: Like in school, like in the middle of bad and good.
Speaker A: What is the worst thing about being old?
Speaker B: Not being able to do things that you could do when you were young.
Speaker A: Like, you can't bend down and get stuff on the floor?
Speaker B: Well, I can still do that.
Speaker B: But the problem is your body gets a bit stiff.
Speaker A: I know it hurts a lot when you try to get down when you are old.
Speaker B: That's right, yes.
Speaker B: You might get sick more often.
Speaker B: Hopefully I don't.
Speaker B: But that's the problem.
Speaker A: Yeah, that's pretty bad.
Speaker B: It is pretty bad.
Speaker A: The 
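The output above often repeats the same speaker on consecutive lines. As a minimal sketch, those lines can be merged into single turns. `merge_turns` is a hypothetical helper that expects `(speaker, text)` pairs, and the sample reuses the first five lines of the output above.

```python
from itertools import groupby


def merge_turns(lines):
    """Merge consecutive (speaker, text) pairs from the same speaker into one turn."""
    merged = []
    # groupby only groups adjacent items, which is exactly what a turn is
    for speaker, group in groupby(lines, key=lambda pair: pair[0]):
        merged.append((speaker, " ".join(text for _, text in group)))
    return merged


# First five lines of the conversation above
lines = [
    ("Speaker B", "What's that?"),
    ("Speaker A", "I don't know where the question... I don't know where the song is."),
    ("Speaker B", "Well, can I ask you a question?"),
    ("Speaker B", "The first one?"),
    ("Speaker B", "Right."),
]

for speaker, text in merge_turns(lines):
    print(f"{speaker}: {text}")
```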