<a href="https://colab.research.google.com/github/PaulSZH95/audio_processing/blob/main/vad_notebooks/pyannote.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Voice Activity Detection

Voice Activity Detection (VAD) is a crucial task within the field of audio processing. Traditionally, this task has been addressed using classical methods, such as bandpass filters. However, classical approaches often demand manual tuning of frequency bands and algorithms rooted in signal processing concepts, which can be time-consuming.

In recent times, Machine Learning (ML) has made significant inroads into signal processing, reducing the need for extensive manual adjustments. Before diving into this notebook, it's essential to have a fundamental understanding of audio processing. I recommend taking the [Huggingface audio course](https://huggingface.co/learn/audio-course/chapter0/introduction) for a comprehensive introduction.

**In this notebook, we will immediately delve into the implementation of sincnet functions as feature extractors for audio processing.**


# Dataset Setup Instructions

Follow these steps to set up the required dataset for your project:

**Note:** Make sure to uncomment the lines to run the cells as needed.

1. Set up the database using a CPU runtime.

2. Our motivation is to download pyannote-setup configs and the AMI dataset prepared by the library creator.

3. Mount your drive and move the downloaded files into your drive for ease of reuse.

4. Edit the `database.yml` file located at `/content/drive/MyDrive/AMI-diarization-setup-main/pyannote/database.yml`. Change the following lines:

   ```yaml
   AMI: amicorpus/{uri}/audio/{uri}.Mix-Headset.wav
   AMI-SDM: amicorpus/{uri}/audio/{uri}.Array1-01.wav
   to
   AMI: ../../amicorpus/{uri}/audio/{uri}.Mix-Headset.wav
   AMI-SDM: ../../amicorpus/{uri}/audio/{uri}.Array1-01.wav


In [154]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
# !unzip main.zip

In [4]:
# import os
# ami_setup = [_ for _ in os.listdir() if "ami" in _.lower()][0]
# ami_setup

'AMI-diarization-setup-main'

In [None]:
# !sh AMI-diarization-setup-main/pyannote/download_ami.sh

In [6]:
# !mv amicorpus drive/MyDrive/

In [7]:
# !mv AMI-diarization-setup-main drive/MyDrive/

# Training Stage

Follow these steps to begin the training stage for your project:

1. **Switch to GPU Runtime:** Please change to a GPU runtime now to accelerate the training process.

2. **Follow Cell Execution:** Once on GPU runtime, follow each cell closely and execute them in sequence.

## What is SincNet?

[SincNet](https://arxiv.org/pdf/1811.09725.pdf) is an implementation of sinc algorithms. Instead of using it as a bandpass filter, it is employed as a convolutional layer. The Pyannote SincNet block is initialized with mel spectrogram frequencies, which is similar to using MFCC (Mel-frequency cepstral coefficients). The rationale behind this approach is not just MFCC modeling but also capturing audio patterns similar to the pitch that the human ear detects.

Refer to the [SincNet paper](https://arxiv.org/pdf/1811.09725.pdf) for detailed coverage of this technique.

Now, proceed with the training stage, making the most of the GPU runtime and the SincNet approach to enhance your project's performance.


In [153]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Colab is not exactly compatible with pyannote, at least for this vad model.
#### 1) remember to reinstall torch to ennsure all torch related libraries are compatible
#### 2) After all the installation and cp of files, remember to select restart under the top left runtime tab
#### 3) Don't be afraid of all the red errors, it is a part of life :)

In [None]:
!pip install pyannote-audio==2.1.1
!pip install pyannote-core==4.5
!pip install pyannote-database==5.0.1
# the recommended seems to be
!pip install torchmetrics==0.10.3
!pip install pyannote-pipeline==2.3
!pip install mlflow
# !pip install pytorch_metric_learning

In [None]:
!pip3 uninstall torch torchvision torchaudio -y
!pip3 install torch torchvision torchaudio

In [4]:
!cp -r /content/drive/MyDrive/AMI-diarization-setup-main /content/AMI-diarization-setup-main
!cp -r /content/drive/MyDrive/amicorpus /content/amicorpus

# SPECIAL INSTRUCTION
**Restart Runtime now before running below**

In [None]:
import os
os.environ["PYANNOTE_DATABASE_CONFIG"] = '/content/AMI-diarization-setup-main/pyannote/database.yml'

# used to automatically find paths to wav files
from pyannote.database import FileFinder
preprocessors = {'audio': FileFinder()}

# initialize 'only_words' experimental protocol
from pyannote.database import get_protocol
ami = get_protocol('AMI.SpeakerDiarization.only_words', preprocessors=preprocessors)

In [2]:
# RUN THIS, THERE SHOULD BE NO ERROR. IF ERROR, PLEASE FIX BEFORE PROCEEDING
for file in ami.train():
    meeting = file['uri']
    reference = file['annotation']
    path = file['audio']
    break

In [3]:
from pyannote.audio.tasks import VoiceActivityDetection
vad = VoiceActivityDetection(ami, duration=2., batch_size=512)
from pyannote.audio.models.segmentation import PyanNet
model = PyanNet(sincnet={'stride': 10}, task=vad)

Protocol AMI.SpeakerDiarization.only_words does not precompute the output of torchaudio.info(): adding a 'torchaudio.info' preprocessor for you to speed up dataloaders. See pyannote.database documentation on how to do that yourself.


In [4]:
from pytorch_lightning.callbacks import (
    EarlyStopping,
    ModelCheckpoint
)
monitor, direction = vad.val_monitor
print(monitor, direction)
# don't waste space, just save top3 is good enough
checkpoint = ModelCheckpoint(
    monitor=monitor,
    mode=direction,
    save_top_k=3,
    every_n_epochs=1,
    save_last=False,
    save_weights_only=False,
    verbose=False,
)
# stop early if there's not a lot of diff
early_stopping = EarlyStopping(
    monitor=monitor,
    mode=direction,
    min_delta=0.1,
    patience=3,
    strict=True,
    verbose=False,
)
callbacks = [checkpoint, early_stopping]

VoiceActivityDetection-AMISpeakerDiarizationonly_words-AUROC max


In [13]:
import pytorch_lightning as pl
# import mlflow
# mlflow.set_experiment("abc")
# mlflow.start_run()
call_trainer = pl.Trainer(devices=1, callbacks=callbacks,accelerator="auto", max_epochs=15)
call_trainer.fit(model)
# mlflow.end_run()

INFO:pytorch_lightning.utilities.rank_zero:Trainer already configured with model summary callbacks: [<class 'pytorch_lightning.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True, used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.gpu:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name              | Type             | Params | In sizes        | Out sizes                                        
-----------------------------------------------------------------------------------------------------------------------------
0 | sincnet           | SincNet          | 42.6 K | [512, 1, 32000] | [512, 60, 115]   

Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

In [None]:
!mv lightning_logs /content/drive/MyDrive/lightning_logs

# Inferencing Instructions

Follow these steps for inferencing in your project:

1. **Switch to CPU Runtime:** To save your GPU allocation, end your GPU runtime and switch to a CPU runtime.

2. **Utilize Pyannote Segment Objects:** Pyannote segment objects come prebuilt with a graphical interface, but they haven't been used in this project. Instead, all segment objects have been converted into numerical values. You can choose to create plots with these numerical values if desired.

By switching to a CPU runtime and working with numerical values for segment objects, you can efficiently perform inferencing for your project.

Feel free to proceed with inferencing according to these instructions.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install pyannote-audio==2.1.1
!pip install pyannote-core==4.5
!pip install pyannote-database==5.0.1
# the recommended seems to be
!pip install torchmetrics==0.10.3
!pip install pyannote-pipeline==2.3
!pip install mlflow
# !pip install pytorch_metric_learning

In [None]:
!pip3 uninstall torch torchvision torchaudio -y
!pip3 install torch torchvision torchaudio

In [1]:
import os
os.environ["PYANNOTE_DATABASE_CONFIG"] = '/content/drive/MyDrive/AMI-diarization-setup-main/pyannote/database.yml'

# used to automatically find paths to wav files
from pyannote.database import FileFinder
preprocessors = {'audio': FileFinder()}

# initialize 'only_words' experimental protocol
from pyannote.database import get_protocol
ami = get_protocol('AMI.SpeakerDiarization.only_words', preprocessors=preprocessors)

'AMI-SDM.SpeakerDiarization.only_words' found in /content/drive/MyDrive/AMI-diarization-setup-main/pyannote/database.yml does not define the 'scope' of speaker labels (file, database, or global). Setting it to 'file'.
'AMI-SDM.SpeakerDiarization.mini' found in /content/drive/MyDrive/AMI-diarization-setup-main/pyannote/database.yml does not define the 'scope' of speaker labels (file, database, or global). Setting it to 'file'.
'AMI.SpeakerDiarization.only_words' found in /content/drive/MyDrive/AMI-diarization-setup-main/pyannote/database.yml does not define the 'scope' of speaker labels (file, database, or global). Setting it to 'file'.
'AMI.SpeakerDiarization.mini' found in /content/drive/MyDrive/AMI-diarization-setup-main/pyannote/database.yml does not define the 'scope' of speaker labels (file, database, or global). Setting it to 'file'.
'AMI.SpeakerDiarization.word_and_vocalsounds' found in /content/drive/MyDrive/AMI-diarization-setup-main/pyannote/database.yml does not define the '

In [2]:
from pyannote.audio import Model, Inference
mod_path = "/content/drive/MyDrive/lightning_logs/version_0/checkpoints/epoch3.ckpt"
model = Model.from_pretrained(mod_path)
vad_infer = Inference(model)

In [3]:
from pyannote.audio.pipelines import VoiceActivityDetection
pipeline = VoiceActivityDetection(segmentation=model)
initial_params = {"onset": 0.5, "offset": 0.5,
                  "min_duration_on": 0.0, "min_duration_off": 0.0}
pipeline.instantiate(initial_params)

<pyannote.audio.pipelines.voice_activity_detection.VoiceActivityDetection at 0x7dbe1c8b4190>

In [6]:
# convert ami.train into key:value format:
formated_ami = {data['uri']:data for data in ami.train()}

In [10]:
# lets just look at one of the file 'IS1000a'
file = formated_ami["IS1000a"]
ground = list(file["annotation"].get_timeline().support())
restruct_ground = [(seg.start,seg.end) for seg in ground]
pred = list(pipeline(file).get_timeline().support())
restruct_pred = [(seg.start,seg.end) for seg in pred]
proba = list(vad_infer(file))
restruct_proba = [[(seg[0].start, seg[0].end), round(seg[1].item(),4)] for seg in proba]

In [49]:
import torchaudio
import numpy as np
waveform = torchaudio.load(file["audio"])
audio_form = waveform[0].numpy().flatten()
print(type(audio_form))
zeros_like_aud = np.zeros_like(audio_form)
print(zeros_like_aud.shape)

<class 'numpy.ndarray'>
(25322667,)


In [55]:
import copy
def annotate_where_activity(zeros_like_audio, array_of_seg):
  dummy_use = copy.deepcopy(zeros_like_audio)
  for seg in array_of_seg:
    idx_left,idx_right = 16000 * round(seg[0],4),16000 * round(seg[1],4)
    dummy_use[int(idx_left):int(idx_right+1)] = 1
  return dummy_use

In [56]:
seg_arr = [item[0] for item in restruct_proba]
pred_0_1 = annotate_where_activity(zeros_like_aud,restruct_pred)
grd_0_1 = annotate_where_activity(zeros_like_aud,restruct_ground)
proba_0_1 = annotate_where_activity(zeros_like_aud,seg_arr)

In [57]:
# e.g. ground = [1,2,3] but pred = [2,2,2]. so mislabel indexes are [0,1]
diff = pred_0_1 != grd_0_1
diff_idx = np.where(diff)[0] # mislabel indx

In [79]:
# we know this audio array is super long.
# so then how can we differentiate mislabel seg from seg
# e.g. grd = [1,2,3,4,5,6], prd = [1,0,0,4,0,0] . we know idx 1:2 and 4:5 are 2 diff seg.
# one easy way to pinpoubt the seg is by [2,4,5] - [1,2,4] - -> [1,2,1]
# we see mislabel of a seg have diff of 1 and other seg have > 1
leap_in_index = np.where(diff_idx[1:]- diff_idx[:-1]  > 1)[0]

In [113]:
# since the calculation only find where > 0, we need to manually include index of 0 and last_idx
full = np.concatenate(([0], leap_in_index,[len(diff_idx)-2]))

In [114]:
# so what we want is to structure full array into a [start,stop] style so we can index diff_idx
logic_implement = np.hstack((full[:-1, np.newaxis],(full[1:, np.newaxis] + 1)))

In [147]:
def get_topk_mislabel(diff_idx,logic_implement,k):
  rightful_segment_of_mislabel_periods = diff_idx[logic_implement]
  duration = np.diff(rightful_segment_of_mislabel_periods,axis=1)
  top3_indices = np.where(np.argsort(duration,axis=0) >= duration.shape[0] - 3)[0]
  return rightful_segment_of_mislabel_periods[top3_indices]
  # return top3_indices

In [148]:
k =3
topk_mislab = get_topk_mislabel(diff_idx,logic_implement,k)

In [149]:
topk_mislab

array([[  343844,   510164],
       [20709945, 20725875],
       [24181919, 24243705]])

In [150]:
# if 1 means predicted speech else nonspeech
# by right should be pred_0_1[343844:510164].mean()
#but as long as you understand the entire seg is the same, what we do here is also correct
pred_0_1[topk_mislab]

array([[1., 1.],
       [1., 1.],
       [1., 1.]], dtype=float32)

In [152]:
# if 1 means predicted speech else nonspeech
grd_0_1[topk_mislab]

array([[0., 0.],
       [0., 0.],
       [0., 0.]], dtype=float32)

In [146]:
from pyannote.audio.utils.preview import listen
from pyannote.core import Segment
mislabel = Segment(21.49 ,   31.885)
listen(file, mislabel)

Many thanks to the Pyannote library for providing such an awesome tool! 🙌

Please check out the Pyannote library on [GitHub](https://github.com/pyannote/pyannote-audio).

For more in-depth information on Voice Activity Detection (VAD), you can refer to the reference paper available [arvix](https://arxiv.org/pdf/2104.04045.pdf).

Happy coding! 🚀
```
@inproceedings{Bredin2021,
  Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
  Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
  Booktitle = {Proc. Interspeech 2021},
  Year = {2021},
}