# Bias and Fairness in Speaker Verification 



## 1. Introduction to Speaker Verification

### What is Speaker Verification?

Speaker verification is a form of biometric authentication that determines whether a given speech sample matches a claimed identity. This technology is widely used in applications like voice-based device unlocking, personalized virtual assistants, and security systems. The core idea is to extract unique features from a person's voice — much like a fingerprint — and use them to verify their identity. The process typically involves converting speech into a numerical representation called an embedding and then comparing this embedding to previously stored ones. If the similarity score exceeds a certain threshold, the system confirms the identity; otherwise, it rejects it. This makes speaker verification powerful but also sensitive to variations in audio data and speaker characteristics.
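The comparison step described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular system: it assumes cosine similarity as the scoring function, and the function names, the toy 4-dimensional embeddings, and the threshold of 0.5 are all hypothetical choices made for the example.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (range -1 to 1)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(test_emb: np.ndarray, enrolled_emb: np.ndarray, threshold: float = 0.5) -> bool:
    """Accept the claimed identity if the similarity reaches the threshold."""
    return cosine_similarity(test_emb, enrolled_emb) >= threshold

# Toy embeddings (hypothetical values); real embeddings have far more dimensions.
enrolled = np.array([0.9, 0.1, 0.3, 0.2])
same_speaker = np.array([0.8, 0.2, 0.35, 0.15])
other_speaker = np.array([-0.2, 0.9, -0.4, 0.1])

print(verify(same_speaker, enrolled))   # True: the vectors point in nearly the same direction
print(verify(other_speaker, enrolled))  # False: the similarity is negative
```

In a real system the embeddings would come from a trained model, and the threshold would be tuned on held-out trials rather than fixed by hand.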

### Datasets for Speaker Verification

- **VoxCeleb:** One of the most widely used datasets for speaker verification, with speech samples from thousands of speakers across different nationalities and genders. Its metadata includes speaker ID, gender, and nationality.


### Load and Explore VoxCeleb Metadata

Let's start by loading and exploring the VoxCeleb metadata to get a sense of the distribution of speakers.

In [None]:
import pandas as pd

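# Load metadata (example CSV).
# Note: if your copy is tab-separated or uses differently cased column names
# (the official vox1_meta.csv appears to), pass sep="\t" and rename the
# columns so that 'gender' and 'nationality' exist for the cells below.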
metadata_path = "/home/santhwanat1029@alabsad.fau.de/data-governance-seminar/data-governance-seminar/dataHDD/voxceleb/voxceleb_trainer/data/vox1_meta.csv"
metadata = pd.read_csv(metadata_path)

# Inspect data
metadata.head()

### 📝 **Exercise 1:** Explore Gender and Nationality Distribution

In this exercise, you will visualize the distribution of speakers by gender and nationality. This will help you understand the dataset composition and identify potential representation biases.

In [None]:
import matplotlib.pyplot as plt

# Plot gender distribution
metadata['gender'].value_counts().plot(kind='bar')
plt.title("Gender Distribution in VoxCeleb")
plt.show()

# Plot nationality distribution
metadata['nationality'].value_counts().head(10).plot(kind='bar')
plt.title("Top 10 Nationalities in VoxCeleb")
plt.show()


## 2. Understanding Data Biases

Bias can inadvertently creep into machine learning systems, including speaker verification models. Biases often arise from imbalances or limitations in the training data, which can cause models to perform poorly for certain groups of people. Let’s break this down into different types of bias to understand how they manifest in audio data.

Types of Bias:

- **Historical Bias:** This occurs when societal inequalities are reflected in the dataset itself. For instance, if VoxCeleb contains more male speakers than female speakers, the model may become biased toward male voices, leading to higher error rates for women.
- **Representation Bias:** When some groups are underrepresented in the dataset, the model may struggle to generalize to those groups. For example, if most speakers in the dataset are from a handful of nationalities, the system may perform poorly for speakers with less-represented accents.
- **Measurement Bias:** Measurement bias arises when the way data is collected or labeled distorts reality. In speaker verification, gender labels can be misleading: for example, a high-pitched male voice might be misclassified as female, and vice versa.

### 📝 **Exercise 2:** Analyze Bias in VoxCeleb

You can explore potential biases by looking at the distribution of gender and speech durations.

In [None]:
# Representation bias: fraction of speakers per gender

gender_counts = metadata['gender'].value_counts(normalize=True)
print("Gender Representation:\n", gender_counts)
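
Speech duration can be explored in a similar way. The sketch below is one possible approach, assuming the recordings are uncompressed PCM WAV files that Python's standard `wave` module can read. The helper names `wav_duration_seconds` and `write_silence` are hypothetical, and the demo measures a synthetic file rather than real VoxCeleb audio.

```python
import os
import tempfile
import wave

def wav_duration_seconds(path: str) -> float:
    """Duration of an uncompressed PCM WAV file, in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()

def write_silence(path: str, seconds: float, sample_rate: int = 16000) -> None:
    """Write a mono 16-bit WAV file of silence (used only for this demo)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))

# Demo on a synthetic 1.5-second file instead of a real recording
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.wav")
    write_silence(demo_path, seconds=1.5)
    demo_duration = wav_duration_seconds(demo_path)

print(f"Duration: {demo_duration:.2f} s")
```

To compare groups, you could sum these durations per speaker and join the totals with the metadata table. How files map to speakers depends on your local directory layout, so that step is left out here.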



To understand the historical bias in detail, perform the following task.

### 📝 **Task 1:** Analyze Gender and Nationality Distribution

In this task, you’ll compute the percentage distribution for gender and nationality, and explore their intersections. This will help you identify potential imbalances in the dataset. 

In [None]:
# Compute gender distribution percentages
gender_counts = metadata['gender'].value_counts(normalize=True) * 100

# Compute nationality distribution percentages
nationality_counts = metadata['nationality'].value_counts(normalize=True) * 100

# Compute intersection of gender and nationality
intersection_counts = metadata.groupby(['gender', 'nationality']).size()
intersection_percentages = (intersection_counts / len(metadata)) * 100

# Display results
print("Gender Distribution (%):")
print(gender_counts)

print("\nTop 10 Nationality Distribution (%):")
print(nationality_counts.head(10))

print("\nIntersection of Gender and Nationality (%):")
print(intersection_percentages.sort_values(ascending=False).head(10)) 

### 🎯 **Your Goal:**
- **Understand Gender Distribution:** See how speakers are distributed across gender categories.
- **Examine Nationality Representation:** Check which nationalities dominate the dataset.
- **Explore Intersectional Bias:** Discover which gender-nationality combinations are most or least represented.


After running the code, reflect on the following:
- Are certain gender or nationality groups overrepresented or underrepresented?
- How might these imbalances affect the performance of a speaker verification model?

### 📝 **Task 2:**

Now listen to the following audio file.


In [None]:
from IPython.display import Audio, display

# Provide the full path to your audio file
audio_file_path = '/home/santhwanat1029@alabsad.fau.de/data-governance-seminar/data-governance-seminar/dataHDD/voxceleb/voxceleb_trainer/data/voxceleb1/id10035/9VR7wckrdP8/00003.wav'

# Embed the audio data directly in the notebook; a file:// URL inside an
# <audio> tag is usually blocked by the browser
display(Audio(filename=audio_file_path))


Now listen to a second audio file.

In [None]:
from IPython.display import Audio, display

# Provide the full path to your audio file
audio_file_path = '/home/santhwanat1029@alabsad.fau.de/data-governance-seminar/data-governance-seminar/dataHDD/voxceleb/voxceleb_trainer/data/voxceleb1/id10066/1Kr6tGO56H8/00001.wav'

# Embed the audio data directly in the notebook; a file:// URL inside an
# <audio> tag is usually blocked by the browser
display(Audio(filename=audio_file_path))


#### Question 1: Are binary gender labels truly useful for speaker verification, given the natural variation in pitch and voice characteristics? Explain your reasoning.

## 3. Training a Speaker Verification Model

To experiment with speaker verification, we can train a model based on a ResNet architecture. ResNet models are commonly used for speaker verification tasks due to their ability to learn rich audio features.

In [None]:
# Using a ResNet-based model
import torch
from models import ResNetSpeakerModel  # Hypothetical model

# Initialize model
model = ResNetSpeakerModel()

After training the model, we need to evaluate how well it distinguishes between speakers and whether its performance varies across different demographic groups.

#### False Positives and False Negatives

- **False Positive (FP):** The model incorrectly predicts the positive class for an instance that actually belongs to the negative class. For example, a disease detection model that predicts a healthy person has the disease produces a false positive.
- **False Negative (FN):** The model incorrectly predicts the negative class for an instance that actually belongs to the positive class. In the disease detection example, predicting that a sick person is healthy is a false negative.

In speaker verification, a false positive means accepting an impostor as the claimed speaker, and a false negative means rejecting the genuine speaker.


### Understanding the Threshold

The threshold is a critical component of speaker verification. It determines the score above which a speaker is considered verified. Choosing the right threshold is a trade-off: lowering it reduces false negatives but increases false positives, and vice versa. The threshold can also have unequal effects on different groups, contributing to fairness issues.
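
To make the last point concrete, the sketch below applies one shared threshold to two groups and compares their error rates. Everything in it is an assumption for illustration: the scores are synthetic draws from normal distributions, the group names are placeholders, and the lower mean score for group B's genuine trials simply stands in for a group the model handles less well.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_scores(n_trials, target_mean, rng):
    """Synthetic similarity scores for genuine (label 1) and impostor (label 0) trials."""
    labels = rng.integers(0, 2, size=n_trials)
    means = np.where(labels == 1, target_mean, 0.2)
    scores = rng.normal(loc=means, scale=0.15)
    return labels, scores

def error_rates(labels, scores, threshold):
    """False positive rate and false negative rate at a given threshold."""
    accepted = scores >= threshold
    fpr = np.mean(accepted[labels == 0])   # impostors that were accepted
    fnr = np.mean(~accepted[labels == 1])  # genuine speakers that were rejected
    return float(fpr), float(fnr)

# Assumption: genuine trials of group B score lower on average than those of group A
groups = {
    "group A": simulate_scores(2000, target_mean=0.70, rng=rng),
    "group B": simulate_scores(2000, target_mean=0.55, rng=rng),
}

threshold = 0.45  # one shared threshold for everyone
for name, (labels, scores) in groups.items():
    fpr, fnr = error_rates(labels, scores, threshold)
    print(f"{name}: FPR={fpr:.3f}  FNR={fnr:.3f}")
```

With these synthetic settings, both groups see similar false positive rates, but group B is rejected noticeably more often. A single threshold can therefore look reasonable on average and still treat the groups unequally.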

#### How Threshold Affects FP and FN

When a model outputs probabilities for classes (for example, the probability that a patient has a disease), we can set a threshold as the cutoff point for classification:

- If the model’s probability output for an instance is at or above the threshold, we classify it as positive (e.g., the person has the disease).
- If the model’s probability is below the threshold, we classify it as negative (e.g., the person does not have the disease).

As the threshold moves, the numbers of False Positives and False Negatives change:

- **Lower threshold (closer to 0):** More instances are classified as positive, so False Positives (FP) may increase and False Negatives (FN) may decrease.
- **Higher threshold (closer to 1):** More instances are classified as negative, so False Positives (FP) may decrease and False Negatives (FN) may increase.

### Visualizing the Effect of Thresholds on FP and FN
 

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interactive

# Generate example data: 100 samples with random probabilities for class 1
np.random.seed(42)
y_true = np.random.choice([0, 1], size=100)  # True labels (0 = Negative, 1 = Positive)
y_prob = np.random.rand(100)  # Predicted probabilities for class 1

# Function to compute FP and FN given a threshold
def calculate_fp_fn(threshold):
    # Predicted labels based on the threshold
    y_pred = (y_prob >= threshold).astype(int)
    
    # Calculate False Positives and False Negatives
    FP = np.sum((y_pred == 1) & (y_true == 0))  # Predicted positive, actual negative
    FN = np.sum((y_pred == 0) & (y_true == 1))  # Predicted negative, actual positive
    
    return FP, FN

# Function to update the plot and display FP/FN based on threshold
def update_plot(threshold):
    FP, FN = calculate_fp_fn(threshold)
    
    # Plotting the results
    plt.figure(figsize=(6, 4))
    plt.bar(['False Positives', 'False Negatives'], [FP, FN], color=['red', 'blue'])
    plt.title(f'FP and FN at Threshold {threshold:.2f}')
    plt.ylabel('Count')
    plt.show()

    # Display FP and FN
    print(f'False Positives (FP): {FP}')
    print(f'False Negatives (FN): {FN}')

# Create an interactive widget to adjust threshold
interactive_plot = interactive(update_plot, threshold=(0.0, 1.0, 0.01))
interactive_plot


## 4. Evaluating Model Performance

Once the model is trained, it outputs similarity scores for speaker pairs. We can use these scores to visualize the system's performance through Detection Error Tradeoff (DET) curves, which plot the false negative rate against the false positive rate as the decision threshold varies.

In [None]:
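import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import det_curve

# Load trial scores. Assumed columns: 'true_label' (1 = same speaker,
# 0 = different speaker) and 'score' (the model's similarity score)
scores_df = pd.read_csv("./scores.csv")

# Compute error rates across thresholds and plot the DET curve
fpr, fnr, thresholds = det_curve(scores_df['true_label'], scores_df['score'])

plt.plot(fpr, fnr)
plt.xlabel('False Positive Rate')
plt.ylabel('False Negative Rate')
plt.title('DET Curve')
plt.grid()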
plt.show()
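
A DET curve is often summarized by the equal error rate (EER), the operating point where the false positive rate equals the false negative rate. The sketch below is a simple approximation that picks the observed threshold where the two rates are closest. The function name `equal_error_rate` and the eight toy trials are hypothetical; in practice you would pass the label and score columns of your own scores file.

```python
import numpy as np

def equal_error_rate(labels, scores):
    """Approximate EER: the threshold where FPR and FNR are closest."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    thresholds = np.unique(scores)  # sorted candidate thresholds
    fpr = np.array([np.mean(scores[labels == 0] >= t) for t in thresholds])
    fnr = np.array([np.mean(scores[labels == 1] < t) for t in thresholds])
    idx = np.argmin(np.abs(fpr - fnr))
    return float((fpr[idx] + fnr[idx]) / 2), float(thresholds[idx])

# Toy trials: 1 = same speaker, 0 = different speaker
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

eer, eer_threshold = equal_error_rate(toy_labels, toy_scores)
print(f"EER = {eer:.2f} at threshold {eer_threshold:.2f}")  # EER = 0.25 at threshold 0.60
```

Computing this value separately for each demographic group, rather than once for the whole test set, is one way to check whether performance differs between groups.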