<a href="https://colab.research.google.com/github/SupradeepDanturti/ConvAIProject/blob/main/ConvAI_Project_submission_template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- [Project 7: Speaker Counter and Overlap Detector](#scrollTo=TZzv25vDL9LQ)
  - [Abstract](#scrollTo=MP7fynuIMGlg)
  - [Introduction](#scrollTo=wBGDBctaMYYQ)
  - [Methodology](#scrollTo=XtzOOJDnMpWp)
    - [Preparation & Preprocessing](#scrollTo=XtzOOJDnMpWp)
    - [Data Augmentation](#scrollTo=XtzOOJDnMpWp)
    - [Model Development and Optimization](#scrollTo=XtzOOJDnMpWp)
      - [ECAPA TDNN](#scrollTo=XtzOOJDnMpWp)
      - [XVector](#scrollTo=XtzOOJDnMpWp)
      - [Selfsupervised - MLP (Multi-Layer Perceptron)](#scrollTo=XtzOOJDnMpWp)
      - [Selfsupervised XVector](#scrollTo=XtzOOJDnMpWp)
  - [Experimental Setup](#scrollTo=YIQOcLeaPq3v)
    - [Hyperparameters Used](#scrollTo=YIQOcLeaPq3v)
  - [Model Performance Analysis - Results](#scrollTo=BRL5KR20QWKu)
    - [X-Vector Model](#scrollTo=BRL5KR20QWKu)
    - [ECAPA-TDNN Model](#scrollTo=BRL5KR20QWKu)
    - [Self-Supervised MLP Model](#scrollTo=BRL5KR20QWKu)
    - [Self-Supervised X-Vector Model](#scrollTo=BRL5KR20QWKu)
    - [Classwise Error Rate](#scrollTo=BRL5KR20QWKu)
  - [Setup and Training Instructions](#scrollTo=OTaD2SZ2cvsR)
  - [Inference Interface](#scrollTo=vpsZKEbeko19)
  - [Conclusions](#scrollTo=b4Jyn3BcQDpf)
  - [References](#scrollTo=yaxqlm6kRcmb)

# **Project 7: Speaker Counter and Overlap Detector**

## **Abstract**

This project addresses the challenge of accurately counting speakers in meeting recordings where speech may overlap. This is essential for improving the accuracy of automated meeting transcriptions. To generate realistic training data, a simulator was developed that combines clean speech (LibriSpeech-clean-100) with noise and reverberation effects (Open-RIR dataset).

Two established speaker recognition models (x-vector and ECAPA-TDNN) were tested alongside a novel approach. This new method integrated a pretrained Wav2Vec 2.0 model with a linear classifier and XVector. The system analyzes short audio segments, providing timestamps and the detected number of speakers.

The Wav2Vec 2.0-based models achieved the lowest overall error rates of the approaches tested (0.20 and 0.21, compared with 0.23 for x-vector and 0.24 for ECAPA-TDNN), which suggests that self-supervised features are well suited to complex meeting environments. The resulting speaker counter is offered as a tool for the SpeechBrain project and for the wide range of speech-related applications built on it.

## **Introduction**

Speaker diarization, a critical component in speech processing, identifies "who spoke when" within audio recordings. One of the primary challenges in speaker diarization arises from overlapping speech, where multiple individuals speak simultaneously, often leading to errors in speaker identification. Traditional clustering-based diarization methods typically falter under these conditions, resulting in mislabelled speakers.

To mitigate these issues, contemporary research has focused on integrating a speaker counting module into the diarization framework. By estimating the number of speakers active in any audio segment, this module improves the diarization system's ability to segment and label overlapping speech accurately.

The current project advances this methodology, aiming to engineer a resilient system for the detection and enumeration of speakers in conditions of overlapping dialogue. Drawing on the methodologies presented in "Overlapped Speech Detection and Speaker Counting using Distant Microphone Arrays", this project utilizes a data simulation process that amalgamates clean speech data from the LibriSpeech-Clean-100 corpus with artificial noise and reverberation from the Open-RIR dataset, thereby generating realistic scenarios for training.

The investigation explores various models for counting speakers, including established techniques such as x-vectors and ECAPA-TDNN, and introduces an innovative method that employs a pre-trained self-supervised Wav2Vec 2.0 model. This proposed system processes audio recordings in brief segments, ascertains the number of speakers, and records the findings with corresponding timestamps.

Preliminary results indicate that this hybrid approach, which combines the strengths of self-supervised learning with traditional speaker embedding techniques, shows promise in surpassing traditional speaker embedding models. This finding is consistent with the research presented in "Count And Separate: Incorporating Speaker Counting For Continuous Speaker Separation", which highlights the advantages of integrating speaker counting into complex speech processing tasks.

## **Methodology**

The project methodology comprised several steps designed to address the complexities of speaker counting within overlapping speech segments. This section focuses on the aspects of the approach that contributed most to the project's results.

#### **Preparation & Preprocessing**
- Used the [LibriSpeech-Clean-100](https://www.openslr.org/12) dataset for clean speech samples and the [Open-RIR](https://www.openslr.org/28/) dataset for realistic noise and reverberation, creating a challenging environment for the models. [LibriSpeech-Dev-Clean](http://www.openslr.org/resources/12/dev-clean.tar.gz) was used for validation and [LibriSpeech-test-Clean](http://www.openslr.org/resources/12/test-clean.tar.gz) for testing. Each audio file was segmented into clips of 1 to 2 seconds, which were then annotated with the number of speakers present, ranging from 0 to 4.

The metadata of each file is first divided into time-based segments and stored as JSON, as in the example below. These JSON metadata files are then combined to create mixtures, and the mixtures are rendered to .wav files, giving the final data structure used for training.


```
{
    "session_0_spk_0": {
        "0": [
            {
                "start": 0,
                "stop": 120,
                "words": [],
                "file": "generate_silence"
            }
        ]
    },
    "session_0_spk_1": {
        "7447": [
            {
                "start": 0,
                "stop": 8.88,
                "words": "ALTHOUGH HIS MODE OF EXPRESSION WAS PECULIARLY HIS OWN HE HAD RECEIVED A STRONG IMPULSE FROM THE POPULAR MUSIC OF POLAND",
                "file": "train-clean-100\\7447\\91187\\7447-91187-0015.flac"
            },
            {
                "start": 8.88,
                "stop": 25.160000000000004,
                "words": "HAVE EXTOLLED HIM FOR THE BEAUTY OF HIS MELODIES AND HARMONIES THE EXPRESSIVENESS OF HIS MODULATIONS THE WEALTH SPONTANEITY AND LOGICAL CLEARNESS OF HIS IDEAS AND THE SUPERB ARCHITECTURE OF HIS PRODUCTIONS",
                "file": "train-clean-100\\7447\\91186\\7447-91186-0036.flac"
            }
        ]
    },
  .
  .
  .
          
```
<center>
<img src="https://drive.google.com/uc?export=view&id=1Zuz4uSL1w0f1cYsQ5CnLN3WUNqt6r8Cl" alt="Image description" width="500"/>
<figcaption>Fig.1 - Data Structure</figcaption>
</center>

Before training, the final metadata looks like this:

```
{
    "session_0_spk_0_mixture_000": {
        "wav_path": "../data/train/session_0_spk_0/session_0_spk_0_mixture_000_segment.wav",
        "num_speakers": "0",
        "length": 2.0
    },
    "session_0_spk_0_mixture_001": {
        "wav_path": "../data/train/session_0_spk_0/session_0_spk_0_mixture_001_segment.wav",
        "num_speakers": "0",
        "length": 2.0
    },
    "session_0_spk_0_mixture_002": {
        "wav_path": "../data/train/session_0_spk_0/session_0_spk_0_mixture_002_segment.wav",
        "num_speakers": "0",
        "length": 2.0
    },
    "session_0_spk_0_mixture_003": {
        "wav_path": "../data/train/session_0_spk_0/session_0_spk_0_mixture_003_segment.wav",
        "num_speakers": "0",
        "length": 2.0
    },
```
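The `num_speakers` label of each clip can be derived from the per-speaker utterance intervals in the session metadata. The following is a minimal sketch of one way to do that, assuming a speaker counts as active when any of their utterances overlaps the clip window; the function name, the toy speaker id `1234`, and the file names are hypothetical, and the project's own labelling code may differ.

```python
def count_active_speakers(session, start, stop):
    """Count speakers with at least one utterance overlapping [start, stop)."""
    active = set()
    for speaker_id, utterances in session.items():
        for utt in utterances:
            if utt["file"] == "generate_silence":
                continue  # silence placeholder, not a real speaker
            # Two intervals overlap when each starts before the other ends.
            if utt["start"] < stop and utt["stop"] > start:
                active.add(speaker_id)
                break
    return len(active)


# Hypothetical two-speaker session, shaped like one session entry above.
session = {
    "7447": [{"start": 0.0, "stop": 8.88, "words": "...", "file": "a.flac"}],
    "1234": [{"start": 6.0, "stop": 10.0, "words": "...", "file": "b.flac"}],
}

for window_start in (0.0, 2.0, 4.0, 6.0, 8.0):
    n = count_active_speakers(session, window_start, window_start + 2.0)
    print(f"{window_start:.1f}-{window_start + 2.0:.1f}s: {n} speaker(s)")
```

A session that contains only the `generate_silence` placeholder yields a count of 0, which corresponds to the "no speakers" class.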

#### **Data Augmentation**
- Data augmentation played a critical role in making the models robust. Noise injection, speed variation, and pitch modification were applied so that the models generalize well across different acoustic environments and speaker variations.
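
To make two of these augmentations concrete, here is a minimal standard-library sketch of noise injection at a target signal-to-noise ratio and of speed variation by resampling. It is an illustration only and not the augmentation pipeline used in the project; the function names are hypothetical and signals are plain lists of float samples.

```python
import math
import random


def add_noise(signal, noise, snr_db):
    """Mix `noise` into `signal` so the mixture has the requested SNR in dB."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Choose gain so that signal_power / (gain**2 * noise_power) == 10**(snr_db / 10).
    gain = math.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return [x + gain * n for x, n in zip(signal, noise)]


def change_speed(signal, factor):
    """Resample by linear interpolation; factor > 1 shortens (speeds up) the signal."""
    out_len = int(len(signal) / factor)
    last = len(signal) - 1
    out = []
    for i in range(out_len):
        pos = i * factor
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out


# Toy example: 0.1 s of a 440 Hz tone at 16 kHz plus Gaussian noise.
rng = random.Random(0)
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]
noisy = add_noise(tone, noise, snr_db=10.0)
faster = change_speed(tone, 1.1)
print(len(tone), len(noisy), len(faster))
```

Note that resampling in this naive way changes pitch together with speed; modifying pitch independently of duration requires a dedicated pitch-shifting method.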

#### **Model Development and Optimization**
- Four models were developed and tested: X-Vector, ECAPA-TDNN, pretrained Wav2Vec 2.0 with an MLP classifier, and pretrained Wav2Vec 2.0 with an X-Vector back end. Each was chosen for its proven efficacy in related tasks such as speaker verification and identification.

#### 1. ECAPA TDNN
(Emphasized Channel Attention, Propagation and Aggregation Time Delay Neural Network)

Leverages a special type of neural network layer called a Time Delay Neural Network (TDNN) to capture temporal information in speech audio.

Incorporates an "attention" mechanism that emphasizes informative channels within the network, potentially improving speaker differentiation, especially in overlapping scenarios.

Well-suited for speaker identification and verification tasks, making it a strong candidate for speaker counting as well.
<center>
<img src="https://drive.google.com/uc?export=view&id=1IZHqF86vornEw9Ib-N46JMVyhMZ7cctZ" alt="Image description" width="600"/>
<figcaption>Fig.2 - ECAPA-TDNN</figcaption>
</center>

#### 2. XVector
Employs a time delay neural network (TDNN), a stack of 1-D dilated convolutions over time, followed by statistics pooling to learn speaker embeddings, which are compact representations that encode speaker identity.
It was designed for speaker identification and verification, and its strength lies in capturing speaker-specific characteristics even in noisy or varying environments.

This makes it a viable model for speaker counting, where identification of individual speakers often precedes counting.
<center>
<img src="https://drive.google.com/uc?export=view&id=1n6mRBqFzJNfzbmzQjcKQmDjZBTxr0hbN" alt="Image description" width="600"/>
<figcaption>Fig.3 - XVector</figcaption>
</center>
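
A key step in x-vector style models is statistics pooling, which turns a variable number of frame-level features into one fixed-length vector. The sketch below illustrates the idea with the standard library only; the function name is hypothetical, and using the population standard deviation is an assumption made for the example.

```python
import statistics


def statistics_pooling(frames):
    """Map T frame vectors of size D to one vector of size 2 * D.

    The result concatenates the per-dimension mean and the per-dimension
    (population) standard deviation computed over time.
    """
    dims = list(zip(*frames))  # transpose: one tuple of T values per dimension
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return means + stds


# Toy example: 3 frames with 2 features each give a 4-dimensional vector.
frames = [[1.0, 2.0], [3.0, 6.0], [2.0, 4.0]]
print(statistics_pooling(frames))
```

Because the output size depends only on the feature dimension, clips of different lengths can be fed to the same classifier layers.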

#### 3. Selfsupervised - MLP(Multi-Layer Perceptron)

Wav2Vec 2.0 is a powerful self-supervised model, meaning it learns representations from vast amounts of unlabeled speech data.
These learned representations are highly effective at capturing intricate speech features.

The MLP acts as a classifier, taking the Wav2Vec 2.0 output and mapping it to the predicted number of speakers.

This approach leverages the strengths of self-supervised learning for feature extraction and a traditional MLP for classification, potentially offering robustness in speaker counting.
<center>
<img src="https://drive.google.com/uc?export=view&id=1enhfLUxl3v-FWZ3bLNa0tFmGLvF6VAvq" alt="Image description" width="600"/>
<figcaption>Fig.4 - wav2vec2 with linear classifier</figcaption>
</center>
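
The classification head can be pictured as pooling the frame-level features, applying a linear layer, and taking the most probable class. The following sketch assumes mean pooling over time and a single linear layer with 5 outputs (0 to 4 speakers); these choices, the toy 3-dimensional features, and all function names are assumptions made for illustration and are not taken from the project code.

```python
import math


def mean_pool(frames):
    """Average frame-level features over time into one utterance-level vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_speaker_count(frames, weights, bias):
    """Return (predicted class index, class probabilities)."""
    probs = softmax(linear(mean_pool(frames), weights, bias))
    return max(range(len(probs)), key=probs.__getitem__), probs


# Toy example: 3-dimensional features instead of the 768-dimensional encoder output.
frames = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
weights = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0],
           [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bias = [0.0] * 5
count, probs = predict_speaker_count(frames, weights, bias)
print(count, [round(p, 3) for p in probs])
```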

#### 4. Selfsupervised XVector
Combines the power of Wav2Vec 2.0's self-supervised feature learning with X-vector's speaker identification capability.

Wav2Vec 2.0 extracts rich features, and X-vectors can potentially isolate speaker-specific information within those features.

This combined approach may lead to more accurate speaker counting, especially in challenging overlapping speech conditions.
<center>
<img src="https://drive.google.com/uc?export=view&id=1en7t4gevL-3QqA5ru6Mz7VlfpGx7cby3" alt="Image description" width="600"/>
<figcaption>Fig.5 - wav2vec2 with Xvector</figcaption>
</center>

## **Experimental Setup**

The LibriSpeech-Clean-100 dataset was used for training, the LibriSpeech-Dev-Clean set for validation, and the LibriSpeech-Test-Clean set for final testing. This split ensures that generalization is evaluated on unseen data. All models were trained with the Adam optimizer, using the learning rates and weight decay listed in the table below. Experiments were conducted on a system with an NVIDIA RTX 4050 GPU with 6 GB of memory.

### Hyperparameters Used

<center>
<table>
  <tr>
    <th>Model</th>
    <th>Hyperparams</th>
    <th>GitHub Link</th>
    <th>Model</th>
    <th>Hyperparams</th>
    <th>GitHub Link</th>
  </tr>
  <tr>
    <td>ECAPA-TDNN</td>
    <td><pre><code>
    sample_rate: 16000
    number_of_epochs: 20
    batch_size: 64
    lr_start: 0.001
    lr_final: 0.0001
    weight_decay: 0.00002
    num_workers: 0 # For windows or 4 for linux
    n_classes: 5
    dim: 192
    num_attention_channels: 128
    n_mels: 80
    channels: [256, 256, 256, 256, 768]
    kernel_sizes: [5, 3, 3, 3, 1]
    dilations: [1, 2, 3, 4, 1]
    </code></pre></td>
    <td><a href="https://github.com/SupradeepDanturti/ConvAIProject/blob/main/ecapa_tdnn/hparams_ecapa_tdnn_augmentation.yaml">View Complete file</a></td>
    <td>SelfSupervised <br>
    Linear Classifier</td>
    <td><pre><code>
    number_of_epochs: 5
    batch_size: 64
    lr: 0.001
    lr_ssl: 0.0001
    freeze_ssl: False
    freeze_ssl_conv: True
    encoder_dim: 768
    out_n_neurons: 5
    </code></pre></td>
    <td><a href="https://github.com/SupradeepDanturti/ConvAIProject/blob/main/selfsupervised/hparams_selfsupervised_mlp.yaml">View Complete file</a></td>
  </tr>
  <tr>
    <td>XVector</td>
    <td><pre><code>
    sample_rate: 16000
    number_of_epochs: 50
    batch_size: 64
    lr_start: 0.001
    lr_final: 0.0001
    weight_decay: 0.00002
    num_workers: 0 # For windows or 4 for linux
    n_mels: 4
    n_classes: 5
    emb_dim: 128
    tdnn_channels: 64
    tdnn_channels_out: 128
    tdnn_kernel_sizes: [5, 3, 3, 1, 1]
    tdnn_dilations: [1, 2, 3, 1, 1]
    </code></pre></td>
    <td><a href="https://github.com/SupradeepDanturti/ConvAIProject/blob/main/xvector/hparams_xvector_augmentation.yaml">View Complete file</a></td>
    <td>SelfSupervised <br>
    XVector</td>
    <td><pre><code>
    number_of_epochs: 15
    batch_size: 128
    lr: 0.001
    lr_final: 0.0001
    lr_ssl: 0.00001
    freeze_ssl: False
    freeze_ssl_conv: True
    encoder_dim: 768
    emb_dim: 128
    out_n_neurons: 5
    tdnn_channels: [ 64, 64, 64 ]
    tdnn_kernel_sizes: [ 5, 2, 3 ]
    tdnn_dilations: [ 1, 2, 3 ]
    </code></pre></td>
    <td><a href="https://github.com/SupradeepDanturti/ConvAIProject/blob/main/selfsupervised/hparams_selfsupervised_xvector.yaml">View Complete file</a></td>
  </tr>
</table></center>
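
Several configurations list both a starting and a final learning rate. The table does not state how the rate moves between them, so the sketch below simply assumes a linear per-epoch decay from `lr_start` to `lr_final`; the function name is hypothetical and the actual scheduler is defined in the linked YAML files.

```python
def linear_lr_schedule(lr_start, lr_final, n_epochs):
    """Return one learning rate per epoch, interpolated linearly."""
    if n_epochs == 1:
        return [lr_start]
    step = (lr_final - lr_start) / (n_epochs - 1)
    return [lr_start + step * epoch for epoch in range(n_epochs)]


# Values taken from the ECAPA-TDNN column of the table above.
schedule = linear_lr_schedule(lr_start=0.001, lr_final=0.0001, n_epochs=20)
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```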

## **Model Performance Analysis - Results**

Across all models tested, the Self-supervised approaches (Wav2Vec2.0 + MLP and Wav2Vec2.0 + X-vector) demonstrated superior performance in the speaker counting task, achieving lower overall error rates compared to the ECAPA-TDNN and X-vector models.

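The overall error rates of the two self-supervised models were close (0.200 for Wav2Vec2.0 + MLP and 0.210 for Wav2Vec2.0 + X-vector). The Wav2Vec2.0 + X-vector hybrid offered an advantage on 3-speaker segments (0.334 versus 0.408), while the MLP variant had the lower error on 1-, 2-, and 4-speaker segments. Fig. 7, 9, and 12 illustrate the error rates for each model family.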

An interesting and positive finding from the analysis was the perfect accuracy achieved by all models in identifying segments with no speakers present. This capability is crucial for any diarization system, as it ensures the accurate detection of silent intervals, preventing unnecessary processing and improving overall system efficiency.

A common trend across all models was that error rates were highest at the largest speaker counts. This difficulty with many overlapping speakers highlights a core challenge for speaker counting systems. The class-wise error rates (see the Classwise Error Rate table below) illustrate this: every model has its highest error rate on 4-speaker segments (0.449 to 0.541). The XVector and ECAPA-TDNN models additionally show error rates of about 0.41 on single-speaker segments, where the self-supervised models stay below 0.02.

During inference, however, the XVector and ECAPA-TDNN models are much faster than the self-supervised models.

#### **X-Vector Model**
<center>
<table>
  <tr>
    <td>
      <img src="https://drive.google.com/uc?export=view&id=1t1VZhDyG8awIU0keoyrw54yRLETa7r9B" alt="XVector Train and Valid Loss" width="300"/>
      <figcaption>Fig.6 - XVector Train and Valid Loss</figcaption>
    </td>
    <td>
      <img src="https://drive.google.com/uc?export=view&id=1qVfJx5yzxa_UKE3XVAveicqboy69l2kV" alt="img" width="300"/>
      <figcaption>Fig.7 - Error rate of Both XVector Models</figcaption>
    </td>
  </tr>
</table>
</center>

#### **ECAPA-TDNN Model**
<center>
<table>
  <tr>
    <td>
      <img src="https://drive.google.com/uc?export=view&id=1VeKXeO3-aXcTyBz6Ztq-WH3M40njXEkq" alt="ECAPA-TDNN Train and Valid Loss" width="300"/>
      <figcaption>Fig.8 - ECAPA-TDNN Train and Valid Loss</figcaption>
    </td>
    <td>
      <img src="https://drive.google.com/uc?export=view&id=1TQ3qK56wQr3lxTEh9-NUh5s-NtRkqtyB" alt="img" width="300"/>
      <figcaption>Fig.9 - Error rate of Both ECAPA-TDNN Models</figcaption>
    </td>
  </tr>
</table>
</center>

#### **Self-Supervised MLP Model**

<center>
<img src="https://drive.google.com/uc?export=view&id=1IgCQqGb1HLbyrhYwUMPuV8fjY-Xy4aUy" alt="Selfsupervised MLP Train and Valid Loss" width="300"/>
<figcaption>Fig.10 - Self-Supervised MLP Train and Valid Loss</figcaption>
</center>

#### **Self-Supervised X-Vector Model**
<center>
<table>
  <tr>
    <td>
      <img src="https://drive.google.com/uc?export=view&id=1K6vvfvfoIZShz8Y1lV-epE2waoPgdMIs" alt="Selfsupervised XVector Train and Valid Loss" width="300"/>
      <figcaption>Fig.11 - Self-Supervised X-Vector Train and Valid Loss</figcaption>
    </td>
    <td>
      <img src="https://drive.google.com/uc?export=view&id=1cZDKGp0VBPOE9TKhWkUXCnd7bgrXV4IQ" alt="img" width="300"/>
      <figcaption>Fig.12 - Error rate of Both Selfsupervised Models</figcaption>
    </td>
  </tr>
</table>
</center>
<br>

#### **Classwise Error Rate**
<center>
<table>
 <tr>
    <th>Model</th>
    <th>Class</th>
    <th>Error Rate</th>
    <th>Model</th>
    <th>Class</th>
    <th>Error Rate</th>
  </tr>
  <tr>
    <td rowspan="6">XVector</td>
    <td>Overall</td>
    <td>2.29e-01</td>
    <td rowspan="6">ECAPA-TDNN</td>
    <td>Overall</td>
    <td>2.40e-01</td>
  </tr>
  <tr>
    <td>No Speakers</td>
    <td>0.00e+00</td>
    <td>No Speakers</td>
    <td>0.00e+00</td>
  </tr>
  <tr>
    <td>1 Speaker</td>
    <td>4.06e-01</td>
    <td>1 Speaker</td>
    <td>4.12e-01</td>
  </tr>
  <tr>
    <td>2 Speakers</td>
    <td>3.20e-02</td>
    <td>2 Speakers</td>
    <td>4.10e-01</td>
  </tr>
  <tr>
    <td>3 Speakers</td>
    <td>2.15e-01</td>
    <td>3 Speakers</td>
    <td>3.26e-01</td>
  </tr>
  <tr>
    <td>4 Speakers</td>
    <td>4.67e-01</td>
    <td>4 Speakers</td>
    <td>5.23e-01</td>
  </tr>
  <tr><td colspan="6"></td></tr>
  <tr>
    <td rowspan="6">Selfsupervised MLP</td>
    <td>Overall</td>
    <td>2.00e-01</td>
    <td rowspan="6">Selfsupervised XVector</td>
    <td>Overall</td>
    <td>2.10e-01</td>
  </tr>
   <tr>
    <td>No Speakers</td>
    <td>0.00e+00</td>
    <td>No Speakers</td>
    <td>0.00e+00</td>
  </tr>
  <tr>
    <td>1 Speaker</td>
    <td>8.67e-03</td>
    <td>1 Speaker</td>
    <td>1.64e-02</td>
  </tr>
  <tr>
    <td>2 Speakers</td>
    <td>1.14e-01</td>
    <td>2 Speakers</td>
    <td>1.39e-01</td>
  </tr>
  <tr>
    <td>3 Speakers</td>
    <td>4.08e-01</td>
    <td>3 Speakers</td>
    <td>3.34e-01</td>
  </tr>
  <tr>
    <td>4 Speakers</td>
    <td>4.49e-01</td>
    <td>4 Speakers</td>
    <td>5.41e-01</td>
  </tr>
</table>
</center>
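
For reference, error rates of this kind can be computed from true and predicted speaker counts as the fraction of misclassified segments, overall and per true class. The sketch below uses hypothetical toy labels and a hypothetical function name; it is not the evaluation code that produced the table.

```python
from collections import Counter


def classwise_error_rates(labels, predictions):
    """Return (overall error rate, {class: error rate}) for integer class labels."""
    totals = Counter(labels)
    errors = Counter(y for y, p in zip(labels, predictions) if y != p)
    per_class = {c: errors[c] / totals[c] for c in sorted(totals)}
    overall = sum(errors.values()) / len(labels)
    return overall, per_class


# Toy example with three classes.
labels = [0, 0, 1, 1, 2, 2, 2, 2]
predictions = [0, 0, 1, 2, 2, 2, 3, 1]
overall, per_class = classwise_error_rates(labels, predictions)
print(f"overall: {overall:.2e}")
for cls, rate in per_class.items():
    print(f"{cls} speaker(s): {rate:.2e}")
```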

## **Setup and Training Instructions**

To reproduce the results of this project, follow these steps.

```
!git clone https://github.com/SupradeepDanturti/ConvAIProject
%cd ConvAIProject
```
Download the project code from the GitHub repository and navigate into the project directory.


```
!python prepare_dataset/download_required_data.py --output_folder <destination_folder_path>
```
Download the necessary datasets (LibriSpeech etc.), specifying the desired destination folder.

```
!python prepare_dataset/create_custom_dataset.py prepare_dataset/dataset.yaml
```
Create the custom dataset from the parameters set in `dataset.yaml`, as in the sample below.

Sample of dataset.yaml:
```
n_sessions:
  train: 1000 # Creates 1000 sessions per class
  dev: 200 # Creates 200 sessions per class
  eval: 200 # Creates 200 sessions per class
n_speakers: 4 # max number of speakers. In this case the total classes will be 5 (0-4 speakers)
max_length: 120 # max length in seconds for each session/utterance.
```
<center>Sample of dataset.yaml</center>
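
As a rough sanity check on dataset size, the sample configuration can be turned into session counts. The sketch below reads the comments in the sample literally (sessions are created per class, and there are `n_speakers + 1` classes); the dictionary mirrors the YAML by hand, and the function name is hypothetical.

```python
# Hand-written mirror of the sample dataset.yaml above.
config = {
    "n_sessions": {"train": 1000, "dev": 200, "eval": 200},
    "n_speakers": 4,
    "max_length": 120,
}


def total_sessions(config):
    """Total sessions per split, assuming n_sessions is counted per class."""
    n_classes = config["n_speakers"] + 1  # classes 0..n_speakers
    return {split: n * n_classes for split, n in config["n_sessions"].items()}


totals = total_sessions(config)
for split, n in totals.items():
    # max_length is an upper bound, so this is the maximum possible audio duration.
    max_hours = n * config["max_length"] / 3600
    print(f"{split}: {n} sessions, at most {max_hours:.1f} hours of audio")
```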

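To train the XVector model, run the following command from the repository root. Each training block below changes into a model directory, so return to the repository root (for example with `%cd ..`) before running the next one.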
```
%cd xvector
!python train_xvector_augmentation.py hparams_xvector_augmentation.yaml
```

To train the ECAPA-TDNN model run the following command.

```
%cd ecapa_tdnn
!python train_ecapa_tdnn.py hparams_ecapa_tdnn_augmentation.yaml
```

To train the SelfSupervised MLP model run the following command.
```
%cd selfsupervised
!python selfsupervised_mlp.py hparams_selfsupervised_mlp.yaml
```

To train the SelfSupervised XVector model run the following command.
```
%cd selfsupervised
!python selfsupervised_xvector.py hparams_selfsupervised_xvector.yaml
```

## **Inference Interface**

To run inference with any of the models, first pull the interface directory as shown in the cells below.


Sample result
```
0.00-1.06 has 1 speaker
1.06-2.65 has 2 speakers
2.65-5.30 has 1 speaker
5.30-6.36 has 3 speakers
6.36-7.42 has 1 speaker
7.42-9.01 has 4 speakers
9.01-10.07 has 1 speaker
10.07-12.19 has 3 speakers
```
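
Output of this shape can be produced by merging consecutive windows that share the same predicted count. The boundaries in the sample are all multiples of 0.53 s, so the sketch below assumes per-window predictions with a 0.53 s hop; the function names and the merging logic are illustrative assumptions, not the code inside the `interface` package.

```python
from itertools import groupby


def merge_windows(window_counts, hop):
    """Merge runs of equal per-window counts into (start, stop, count) segments."""
    segments = []
    index = 0
    for count, run in groupby(window_counts):
        length = len(list(run))
        segments.append((index * hop, (index + length) * hop, count))
        index += length
    return segments


def format_segment(start, stop, count):
    """Render one segment in the style of the sample result above."""
    noun = "speaker" if count == 1 else "speakers"
    return f"{start:.2f}-{stop:.2f} has {count} {noun}"


# Hypothetical per-window predictions for the first 5.3 seconds.
window_counts = [1, 1, 2, 2, 2, 1, 1, 1, 1, 1]
for segment in merge_windows(window_counts, hop=0.53):
    print(format_segment(*segment))
```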



```
%%capture
!pip install speechbrain
```

```
%%capture
# Cloning from GitHub might not work because the Git LFS limit has been reached;
# please use the Google Drive download below instead.
# !git clone --filter=blob:none --no-checkout https://github.com/SupradeepDanturti/ConvAIProject
# %cd ConvAIProject
# !git sparse-checkout init --cone
# !git sparse-checkout set interface
# !git checkout

# Download from Google Drive
!pip install --upgrade --no-cache-dir gdown
!gdown 1oVQuzHXNNPNxQ6WqR0mUEvMptx15SZQG
!unzip interface.zip
```

To run the inference for XVector or ECAPA-TDNN Models run the run_inference_xvector_ecapa_tdnn.py file or run the cell below.

```
!pwd
```

```
/content
```

```
# This cell is used for both XVector and ECAPA-TDNN

from interface.SpeakerCounter import SpeakerCounter

wav_path = "interface/sample_audio1.wav"  # Path to your audio file
save_dir = "interface/sample_inference_run2/"  # Where to save results
model_path = "interface/xvector"  # Path of the trained model; use "interface/ecapa_tdnn" for ECAPA-TDNN

# Create classifier object
audio_classifier = SpeakerCounter.from_hparams(source=model_path, savedir=save_dir)

# Run inference on the audio file
audio_classifier.classify_file(wav_path)
```

```
0.00-0.53 has 1 speaker
0.53-4.77 has 2 speakers
4.77-65.72 has 1 speaker
65.72-66.78 has 2 speakers
66.78-85.86 has 1 speaker
85.86-86.92 has 2 speakers
86.92-87.45 has 1 speaker
87.45-91.16 has 2 speakers
91.16-94.34 has 1 speaker
94.34-94.87 has 4 speakers
94.87-95.40 has 2 speakers
95.40-103.35 has 1 speaker
103.35-104.94 has 4 speakers
104.94-106.53 has 1 speaker
106.53-107.06 has 4 speakers
107.06-107.59 has 3 speakers
107.59-108.65 has 1 speaker
108.65-109.18 has 2 speakers
109.18-110.24 has 4 speakers
110.24-115.54 has 1 speaker
115.54-117.13 has 4 speakers
117.13-123.49 has 1 speaker
123.49-124.02 has 2 speakers
124.02-128.79 has 1 speaker
128.79-129.32 has 3 speakers
129.32-129.85 has 1 speaker
129.85-130.38 has 3 speakers
130.38-130.91 has 4 speakers
130.91-131.97 has 3 speakers
131.97-133.03 has 1 speaker
133.03-133.56 has 3 speakers
133.56-134.62 has 4 speakers
134.62-136.21 has 1 speaker
136.21-136.74 has 4 speakers
136.74-137.27 has 2 speakers
137.27-146.81 has 1 speaker
```

To run inference with the self-supervised models, run the run_inference_selfsupervised.py file or run the cell below.

Note: This usually takes longer than the XVector or ECAPA-TDNN models, which is its only disadvantage, but it gives more accurate results.

```
from interface.SpeakerCounterSelfsupervisedMLP import SpeakerCounter

wav_path = "interface/sample_audio1.wav" # Path to your audio file
save_dir = "interface/sample_inference_run" # Where to save results
model_path = "interface/selfsupervised_mlp" # Path of the trained model

# Create classifier object
audio_classifier = SpeakerCounter.from_hparams(source=model_path, savedir=save_dir)

# Run inference on the audio file
audio_classifier.classify_file(wav_path)
```

```
0.00-0.53 has 1 speaker
0.53-4.77 has 2 speakers
4.77-65.72 has 1 speaker
65.72-67.31 has 2 speakers
67.31-85.33 has 1 speaker
85.33-90.63 has 2 speakers
90.63-146.81 has 1 speaker
146.81-151.58 has 2 speakers
151.58-217.83 has 1 speaker
217.83-224.72 has 2 speakers
224.72-299.33 has 1 speaker
```
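
Because the output is plain text, downstream steps such as measuring how much of a recording contains overlapped speech need to parse it first. The sketch below is one way to do that with a regular expression; the function names are hypothetical, and treating any segment with two or more speakers as overlap is an assumption.

```python
import re

# Matches lines such as "0.53-4.77 has 2 speakers".
LINE_RE = re.compile(r"^(\d+\.\d+)-(\d+\.\d+) has (\d+) speakers?$")


def parse_segments(text):
    """Parse inference output into a list of (start, stop, count) tuples."""
    segments = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            start, stop, count = match.groups()
            segments.append((float(start), float(stop), int(count)))
    return segments


def overlap_duration(segments):
    """Total seconds in which two or more speakers were detected."""
    return sum(stop - start for start, stop, count in segments if count >= 2)


# First lines of the self-supervised output shown above.
output = """\
0.00-0.53 has 1 speaker
0.53-4.77 has 2 speakers
4.77-65.72 has 1 speaker
65.72-67.31 has 2 speakers
"""
segments = parse_segments(output)
print(f"{len(segments)} segments, {overlap_duration(segments):.2f} s of overlap")
```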


## **Conclusions**

The project's investigation into a speaker counting and overlap detection system has led to several conclusions and insights. The modeling approaches explored, namely ECAPA-TDNN, X-Vector, and self-supervised techniques based on Wav2Vec 2.0, each contributed differently to the system's overall performance and efficiency.

**Model Performance**: The self-supervised models based on Wav2Vec 2.0 achieved lower overall error rates (0.20 and 0.21) than the traditional models (0.24 for ECAPA-TDNN and 0.23 for X-Vector), with the clearest gains on single-speaker segments. These findings support the use of self-supervised learning frameworks for feature extraction in speaker counting and diarization tasks.

**Error Rates and Challenges**: Despite advancements, the project revealed increasing challenges with higher speaker counts. Error rates escalated as the number of overlapping speakers increased, indicating a persistent difficulty in distinguishing multiple simultaneous speakers. This suggests a need for further refinement in model architectures or training strategies to handle such scenarios more effectively.

**Inference Speed**: While self-supervised models showed superior accuracy, they lagged in inference speed compared to traditional models like ECAPA-TDNN and X-Vector. This trade-off between accuracy and speed is crucial for real-time applications, implying that depending on the use case, a balance might need to be struck between the two metrics.

**Data Augmentation**: The use of a data augmentation strategy involving noise injection, speed variation, and pitch modification proved beneficial in enhancing model robustness. This approach allowed the models to generalize better across diverse acoustic settings and speaker variations, which is vital for practical deployments.

**Limitations**: The project encountered limitations in handling extreme noise conditions and highly dynamic speech activities, which occasionally led to misclassifications. These aspects underscore the necessity for ongoing research and adaptation of the models to accommodate a broader spectrum of real-world conditions.

## **References**

[1] Cornell, S., Omologo, M., Squartini, S., & Vincent, E. (2020). Overlapped Speech Detection and Speaker Counting using Distant Microphone Arrays.

[2] Duong, T. T. H., Nguyen, P. L., Nguyen, H. S., & Duong, N. Q. K. (2023). Investigating the Role of Speaker Counter in Handling Overlapping Speeches in Speaker Diarization Systems. Authorea.

[3] Wang, Z. Q., & Wang, D. (2022). Count And Separate: Incorporating Speaker Counting For Continuous Speaker Separation.

[4] Andrei, V., Cucu, H., & Burileanu, C. (2020). Overlapped Speech Detection and Competing Speaker Counting Humans Versus Deep Learning. IEEE.

[5] Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., Subakan, C., Dawalatabad, N., Heba, A., Zhong, J., Chou, J., Yeh, S., Fu, S., Bengio, Y., & Mohamed, C. (2021). SpeechBrain: A General-Purpose Speech Toolkit. arXiv:2106.04624.