*3 Nov 2024 : 21BAI1133 - Mukundh J*
# Speech and Natural Language Processing Lab 12
- Create a speaker recognition system.
- Corpus:
  - Use any ready-made corpus, or create your own by collecting 5 minutes of data from your friends.
  - Use at least 5 speakers.
- Features: Use MFCC or LPCC, or anything else.
- Model:
  - Gaussian mixture model.
  - There would be one model for each speaker.
  - Vary the number of mixtures in the Gaussian mixture model.
  - You can use Sklearn's GMM class to do this.
- Comment on the difference in Precision, Recall, and F1 measures as the model parameters are varied.
- Training:
  - Each model will be fed feature vectors of all time frames of all utterances corresponding to the speaker, to create the model for the speaker.

Feature Matrix:
- Suppose each feature vector is 13-dimensional.
- Suppose a particular speaker has 5 utterances, with 100 frames in each utterance.
- The feature matrix for that speaker is then of size 500 × 13 (5 × 100 frames, one row per frame).
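
The arithmetic above can be checked with a small sketch. The random arrays stand in for real MFCC output, and the names `utterances` and `feature_matrix` are illustrative rather than part of the lab code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Five stand-in utterances, each 100 frames x 13 coefficients.
utterances = [rng.normal(size=(100, 13)) for _ in range(5)]

# Stack along the frame axis: one row per frame, one column per coefficient.
feature_matrix = np.vstack(utterances)
print(feature_matrix.shape)  # (500, 13)
```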

In [2]:
!pip install datasets



In [3]:
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture
from sklearn.metrics import precision_score, recall_score, f1_score
from datasets import load_dataset
from sklearn.model_selection import train_test_split

# Loading and Preprocessing

In [4]:
dataset = load_dataset("owahltinez/speaker-recognition-american-rhetoric")



In [5]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['audio', 'label'],
        num_rows: 7507
    })
})


In [6]:
unique_speakers = set(dataset['train']['label'])
print("Unique speakers identified:", unique_speakers)

Unique speakers identified: {0, 1, 2, 3, 4, 5, 6}


- 0 ==> Benjamin Netanyahu
- 1 ==> Jens Stoltenberg
- 2 ==> Julia Gillard
- 3 ==> Margaret Thatcher
- 4 ==> Nelson Mandela
- 5 ==> Background noise
- 6 ==> other

In [7]:
speaker_mapping = {
    0: "Benjamin Netanyahu",
    1: "Jens Stoltenberg",
    2: "Julia Gillard",
    3: "Margaret Thatcher",
    4: "Nelson Mandela",
    5: "Background noise",
    6: "other"
}

label = 2
speaker_name = speaker_mapping.get(label, "Unknown speaker")
print(f"Label {label} corresponds to: {speaker_name}")



Label 2 corresponds to: Julia Gillard


# MFCC

In [8]:
def extract_mfcc(audio, sample_rate, n_mfcc=13):
    # librosa returns shape (n_mfcc, n_frames); transpose to get one row per frame
    mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.T
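
The number of rows that `extract_mfcc` returns depends on the clip length. Assuming librosa's default settings (`hop_length=512` with centered frames), the frame count can be estimated as sketched below. The helper `expected_mfcc_frames` is hypothetical and does not call librosa, and the one-second 16 kHz clip is an assumed example, not a property of the dataset.

```python
def expected_mfcc_frames(n_samples: int, hop_length: int = 512) -> int:
    """Frame count for a centered STFT: one frame per full hop, plus the first frame."""
    return 1 + n_samples // hop_length

# Assumed example: a one-second clip sampled at 16 kHz.
n_frames = expected_mfcc_frames(16_000)
print((n_frames, 13))  # (32, 13): rows are frames, columns are coefficients
```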

In [9]:
features = []
labels = []

In [10]:
for sample in dataset['train']:
    audio = sample['audio']['array']  # Get the audio array
    sample_rate = sample['audio']['sampling_rate']  # Get the sampling rate
    label = sample['label']  # Get the label

    # Extract MFCC features
    mfcc = extract_mfcc(audio, sample_rate)

    # Append features and labels
    features.append(mfcc)
    labels.append(label)

In [11]:
all_features = np.vstack(features)  # Stack all features into one array
all_labels = np.concatenate([[label] * len(mfcc) for mfcc, label in zip(features, labels)])
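
The label expansion above repeats each utterance's label once per frame so that rows of `all_features` and entries of `all_labels` stay aligned. An equivalent formulation with `np.repeat` is sketched here on stand-in data; the toy frame counts are made up for illustration.

```python
import numpy as np

# Stand-in features: three utterances with different frame counts.
features = [np.zeros((4, 13)), np.zeros((2, 13)), np.zeros((3, 13))]
labels = [0, 1, 0]

# Repeat each utterance label once per frame of that utterance.
frames_per_utterance = [len(mfcc) for mfcc in features]
all_labels = np.repeat(labels, frames_per_utterance)
all_features = np.vstack(features)

print(all_labels)          # [0 0 0 0 1 1 0 0 0]
print(all_features.shape)  # (9, 13)
```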

# GMM

In [12]:
gmm_models = {}
n_components = 8  # Number of mixture components

for speaker_id in np.unique(all_labels):
    # Filter features for the current speaker
    speaker_features = all_features[all_labels == speaker_id]

    # Train GMM model for the speaker
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag', random_state=42)
    gmm.fit(speaker_features)

    # Store the model
    gmm_models[speaker_id] = gmm

In [13]:
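def predict_speaker(test_features, gmm_models):
    # model.score returns the mean per-frame log-likelihood of test_features under that speaker's GMM
    scores = {speaker_id: model.score(test_features) for speaker_id, model in gmm_models.items()}
    # The speaker whose model assigns the highest mean log-likelihood wins
    return max(scores, key=scores.get)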
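
Because `score` averages over all rows it is given, the same decision rule can be applied to a whole utterance as well as to a single frame. The sketch below illustrates both on synthetic data: `predict_utterance` mirrors `predict_speaker`, and `predict_frames` is a hypothetical vectorized variant that uses `score_samples` to classify every frame in one call. The two well-separated "speakers" are made up for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two synthetic "speakers" with well-separated feature distributions.
train = {
    0: rng.normal(loc=-3.0, size=(400, 13)),
    1: rng.normal(loc=3.0, size=(400, 13)),
}
models = {
    spk: GaussianMixture(n_components=2, covariance_type="diag", random_state=42).fit(X)
    for spk, X in train.items()
}

def predict_utterance(utterance_frames, models):
    """Pick the speaker whose GMM gives the highest mean log-likelihood per frame."""
    scores = {spk: m.score(utterance_frames) for spk, m in models.items()}
    return max(scores, key=scores.get)

def predict_frames(frames, models):
    """Vectorized frame-level decision: one predicted speaker per row."""
    speaker_ids = np.array(list(models))
    log_likelihoods = np.column_stack([m.score_samples(frames) for m in models.values()])
    return speaker_ids[np.argmax(log_likelihoods, axis=1)]

utterance = rng.normal(loc=3.0, size=(50, 13))  # 50 frames drawn like "speaker 1"
print(predict_utterance(utterance, models))
print(predict_frames(utterance[:5], models))
```

Scoring a full utterance pools evidence from all of its frames, which is usually more reliable than deciding on each 13-dimensional frame in isolation.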

In [14]:
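# Frame-level train-test split for evaluation (random_state fixed for reproducibility).
# Note: gmm_models was fit on all_features, so the frames in X_test were also seen during training.
X_train, X_test, y_train, y_test = train_test_split(all_features, all_labels, test_size=0.2, stratify=all_labels, random_state=42)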
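
This split operates on individual frames, so frames from one utterance can land on both sides of it. A possible alternative, sketched here on stand-in data, splits utterance indices first and stacks frames afterwards. The names `utt_features` and `utt_labels` and the corpus sizes are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in corpus: 20 utterances, 2 speakers, 30 frames x 13 coefficients each.
utt_features = [rng.normal(size=(30, 13)) for _ in range(20)]
utt_labels = np.array([0, 1] * 10)

# Split utterance indices, keeping speaker proportions equal in both parts.
train_idx, test_idx = train_test_split(
    np.arange(len(utt_features)), test_size=0.2, stratify=utt_labels, random_state=42
)

# Stack frames only after the split, so no utterance contributes to both sets.
X_train = np.vstack([utt_features[i] for i in train_idx])
y_train = np.repeat(utt_labels[train_idx], [len(utt_features[i]) for i in train_idx])
test_utterances = [(utt_features[i], utt_labels[i]) for i in test_idx]

print(X_train.shape, y_train.shape, len(test_utterances))  # (480, 13) (480,) 4
```

Fitting the per-speaker GMMs on `X_train` only and scoring the held-out utterances would then give an estimate on speech the models have not seen.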

In [16]:
# Classify each test frame independently (one 13-dimensional MFCC vector at a time)
predictions = [predict_speaker(feature.reshape(1, -1), gmm_models) for feature in X_test]

In [17]:
precision = precision_score(y_test, predictions, average='weighted')
recall = recall_score(y_test, predictions, average='weighted')
f1 = f1_score(y_test, predictions, average='weighted')

print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

Precision: 0.90
Recall: 0.90
F1 Score: 0.90
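
The weighted averages above fold all speakers into one number. To see which speakers are confused with each other, per-class metrics and a confusion matrix can be computed as in this sketch. The toy arrays `y_true` and `y_pred` stand in for `y_test` and `predictions` and are not results from the lab.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Toy labels standing in for y_test and predictions.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2])

# average=None returns one value per class instead of a single weighted score.
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred, average=None)
for speaker_id, (p, r, f, n) in enumerate(zip(precision, recall, f1, support)):
    print(f"speaker {speaker_id}: precision={p:.2f} recall={r:.2f} f1={f:.2f} n={n}")

# Rows are true speakers, columns are predicted speakers.
cm = confusion_matrix(y_true, y_pred)
print(cm)
```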


# Training

In [18]:
def train_gmm_models(n_components, all_features, all_labels):
    models = {}
    for speaker_id in np.unique(all_labels):
        speaker_features = all_features[all_labels == speaker_id]
        gmm = GaussianMixture(n_components=n_components, covariance_type='diag', random_state=42)
        gmm.fit(speaker_features)
        models[speaker_id] = gmm
    return models

In [19]:
results = []

In [21]:
for n_components in [4, 8, 16]:
    gmm_models = train_gmm_models(n_components, all_features, all_labels)

    # Reshape the features to be 2D (1 sample, multiple features)
    predictions = [predict_speaker(feature.reshape(1, -1), gmm_models) for feature in X_test]

    precision = precision_score(y_test, predictions, average='weighted')
    recall = recall_score(y_test, predictions, average='weighted')
    f1 = f1_score(y_test, predictions, average='weighted')

    results.append((n_components, precision, recall, f1))


In [22]:
for n_components, precision, recall, f1 in results:
    print(f"GMM Components: {n_components} -> Precision: {precision:.2f}, Recall: {recall:.2f}, F1 Score: {f1:.2f}")

GMM Components: 4 -> Precision: 0.86, Recall: 0.86, F1 Score: 0.86
GMM Components: 8 -> Precision: 0.90, Recall: 0.90, F1 Score: 0.90
GMM Components: 16 -> Precision: 0.92, Recall: 0.92, F1 Score: 0.92


As the number of GMM components increases from 4 to 16, the weighted Precision, Recall, and F1 score all rise from 0.86 to 0.92. This suggests that a model with more components captures the distribution of each speaker's MFCC features more closely, which improves frame-level speaker recognition. The gain shrinks as components are added: +0.04 from 4 to 8 components, but only +0.02 from 8 to 16.


* **GMM Components = 4:** The 4-component model is the baseline, with precision, recall, and F1 score all at 0.86.
* **GMM Components = 8:** Doubling the components to 8 raises all three metrics to 0.90. This suggests that the extra components capture more detail of each speaker's voice, leading to better discrimination between speakers.
* **GMM Components = 16:** The 16-component model performs best, at 0.92 on all three metrics, although the gain over 8 components is half the gain from 4 to 8.


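There is a trade-off between model complexity and computational cost: more components mean more parameters, so training and scoring take longer, and a model that is too complex can overfit. The results above cannot reveal overfitting, because the GMMs are fit on all frames, including those in the test split; a held-out set of utterances would be needed to check it.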
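
One common way to weigh fit against complexity is the Bayesian information criterion, which `GaussianMixture` exposes as `bic`. The sketch below is an assumption-laden illustration rather than part of the lab: it fits several component counts to synthetic three-cluster data and picks the count with the lowest BIC.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in features for one speaker: three clusters in 13 dimensions.
X = np.vstack([rng.normal(loc=c, size=(300, 13)) for c in (-4.0, 0.0, 4.0)])

bic_by_components = {}
for n_components in [1, 2, 3, 4, 8]:
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag", random_state=42)
    gmm.fit(X)
    # BIC penalizes the log-likelihood by the number of parameters; lower is better.
    bic_by_components[n_components] = gmm.bic(X)

best = min(bic_by_components, key=bic_by_components.get)
print(f"Lowest BIC at n_components = {best}")
```

Applied per speaker to real MFCC features, the same loop would indicate where adding components stops paying for the extra parameters.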

In [25]:
import numpy as np
from sklearn.mixture import GaussianMixture

# Example of creating a feature matrix for a single speaker
speaker_features = []  # List to store features for a specific speaker

for sample in dataset['train']:
    audio = sample['audio']['array']
    sample_rate = sample['audio']['sampling_rate']
    label = sample['label']

    if label == 1:  # Only extract features for the target speaker (Jens Stoltenberg)
        mfcc = extract_mfcc(audio, sample_rate)  # Extract MFCC features
        speaker_features.append(mfcc)

# Stack the per-utterance MFCC matrices (each n_frames x 13) into one 2D feature matrix.
# With 5 utterances of 100 frames each this would be 500x13; here the row count is the
# speaker's total number of frames. np.vstack also works when utterances differ in length.
speaker_features_matrix = np.vstack(speaker_features)

# Train GMM for this speaker
gmm = GaussianMixture(n_components=8)
gmm.fit(speaker_features_matrix)


In [26]:
# Predict the speaker for every test frame; gmm_models currently holds the
# models from the last loop iteration above (16 components)
predictions = [predict_speaker(feature.reshape(1, -1), gmm_models) for feature in X_test]

In [27]:
precision = precision_score(y_test, predictions, average='weighted')
recall = recall_score(y_test, predictions, average='weighted')
f1 = f1_score(y_test, predictions, average='weighted')

In [28]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.92
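
The accuracy matches the weighted recall reported for 16 components (0.92). That is expected: weighting each class's recall by its support adds up the correct predictions over all samples, which is the definition of accuracy. The sketch below checks the identity on toy labels that stand in for `y_test` and `predictions`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy labels standing in for y_test and predictions.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2])

accuracy = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average="weighted")

# Both are 7 correct predictions out of 9 samples.
print(f"{accuracy:.4f} {weighted_recall:.4f}")  # 0.7778 0.7778
```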
