# Speaker Diarization
In this project we perform speaker diarization, i.e. determining who spoke when in a recording. Specifically, we build custom pipelines to answer the following questions:

- Which clustering algorithms are best for speaker diarization?

- Do clustering algorithms applied to deep neural network embeddings outperform traditional clustering algorithms?

- Can end-to-end deep neural network models outperform traditional clustering algorithms?

The ground truth is provided as RTTM files, which have the following format:
```
SPEAKER <NA> 1 0.00 0.39 <NA> <NA> spk_0 <NA>
SPEAKER <NA> 1 0.39 0.01 <NA> <NA> spk_1 <NA>
```
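Each line is a space-separated `SPEAKER` record. The fields used here are the fourth (start time in seconds), the fifth (duration in seconds), and the eighth (speaker ID, e.g. `spk_0`). The `1` in the third field is the channel, not a timestamp.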
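
The field layout described above can be illustrated with a small parser. This is a minimal sketch, not the notebook's own loader: the names `Segment` and `parse_rttm` are hypothetical, and it assumes whitespace-separated `SPEAKER` records like the two sample lines.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    """One speaker turn (hypothetical helper type, not part of the notebook)."""
    start: float     # onset in seconds (4th RTTM field)
    duration: float  # length in seconds (5th RTTM field)
    speaker: str     # speaker label (8th RTTM field), e.g. "spk_0"

    @property
    def end(self) -> float:
        return self.start + self.duration


def parse_rttm(text: str) -> List[Segment]:
    """Parse SPEAKER records from RTTM text, skipping blank or other lines."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        segments.append(Segment(float(fields[3]), float(fields[4]), fields[7]))
    return segments


example = """
SPEAKER <NA> 1 0.00 0.39 <NA> <NA> spk_0 <NA>
SPEAKER <NA> 1 0.39 0.01 <NA> <NA> spk_1 <NA>
"""
segments = parse_rttm(example)
for seg in segments:
    print(f"{seg.speaker}: {seg.start:.2f}s -> {seg.end:.2f}s")
```

Storing the duration rather than the end time mirrors the file format; the `end` property is derived so the two cannot drift apart.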



    

### All imports 

In [1]:
import librosa
import torch
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Local helper module; assumed to define get_speech_timestamps
from utils_vad import get_speech_timestamps
from preprocess import Wav2Mel
import torchaudio

import numpy as np
from scipy.spatial.distance import pdist, squareform

### Loading the training data and its ground truth
- The cell below is a playground that loads the training audio and its ground truth for a single file. Later, the same steps are applied to all files.

In [3]:
# Global variables used for preprocessing
SAMPLE_RATE = 16000
NORM_DB = -3
FFT_WINDOW_MS = 25
FFT_HOP_MS = 10
FRAME_SIZE = 40  # Adjust frame size if needed
BLOCK_SIZE = 50  # MFCC frames to stack together for embedding

# Path to audio file
AUDIO_PATH = "../Dataset/Audio/Test/aggyz.wav"
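
To make the preprocessing constants above concrete, the following sketch converts the millisecond settings into sample counts and estimates how much audio one block of stacked frames covers. The helper `ms_to_samples` is hypothetical, and the block-span figure assumes the `BLOCK_SIZE` frames are consecutive frames spaced one hop apart; the notebook may use these constants differently.

```python
# Values copied from the configuration cell above
SAMPLE_RATE = 16000   # Hz
FFT_WINDOW_MS = 25
FFT_HOP_MS = 10
BLOCK_SIZE = 50       # MFCC frames stacked per embedding


def ms_to_samples(ms: int, sample_rate: int) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return sample_rate * ms // 1000


win_length = ms_to_samples(FFT_WINDOW_MS, SAMPLE_RATE)  # samples per analysis window
hop_length = ms_to_samples(FFT_HOP_MS, SAMPLE_RATE)     # samples between window starts

# Assumption: a block is BLOCK_SIZE consecutive frames, so it spans
# (BLOCK_SIZE - 1) hops plus one full window.
block_span_ms = (BLOCK_SIZE - 1) * FFT_HOP_MS + FFT_WINDOW_MS

print(win_length, hop_length, block_span_ms)
```

Under these assumptions each embedding block summarizes roughly half a second of audio, which bounds how fine-grained the resulting speaker boundaries can be.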

Lightning automatically upgraded your loaded checkpoint from v1.1.3 to v2.2.5. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint C:\Users\rakin\.cache\torch\pyannote\models--pyannote--segmentation\snapshots\059e96f964841d40f1a5e755bb7223f76666bba4\pytorch_model.bin`


Model was trained with pyannote.audio 0.0.1, yours is 3.2.0. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.7.1, yours is 2.3.0+cu121. Bad things might happen unless you revert torch to 1.x.


[ 00:00:00.030 -->  00:00:19.420] _ speech
[ 00:00:20.685 -->  00:02:34.082] _ speech
[ 00:01:07.530 -->  00:01:08.020] _ speech
[ 00:01:08.796 -->  00:01:10.838] _ speech
[ 00:01:30.261 -->  00:01:31.982] _ speech
[ 00:01:43.575 -->  00:01:44.318] _ speech
[ 00:02:20.059 -->  00:02:21.224] _ speech
[ 00:02:27.045 -->  00:02:27.265] _ speech
[ 00:02:34.352 -->  00:04:12.109] _ speech
[ 00:02:52.088 -->  00:02:53.759] _ speech
[ 00:02:58.197 -->  00:03:00.053] _ speech
[ 00:03:02.601 -->  00:03:03.006] _ speech
[ 00:03:17.249 -->  00:03:17.805] _ speech
[ 00:03:20.674 -->  00:03:21.450] _ speech
[ 00:03:32.402 -->  00:03:35.018] _ speech
[ 00:03:38.663 -->  00:03:39.034] _ speech
[ 00:03:40.249 -->  00:03:40.755] _ speech
[ 00:03:42.224 -->  00:03:42.