# Speaker Recognition using Frequency-Domain Speech Features

## 1. Introduction
Speaker recognition is a fundamental problem in speech processing, where the objective is to identify the speaker from an audio recording. This project focuses on developing a deep learning model for speaker classification using frequency-domain features extracted from speech samples. The approach leverages convolutional neural networks (CNNs) for learning speaker-specific patterns in the transformed audio data.

The primary goals of this project are:
* To preprocess raw speech recordings and extract meaningful frequency-domain representations using Fast Fourier Transform (FFT).

* To construct a 1D convolutional neural network (CNN) with residual connections to classify different speakers.

* To improve model generalization through data augmentation techniques, particularly by introducing background noise.

* To evaluate the model's performance on speaker classification tasks and compare it with conventional approaches.

This project is implemented using TensorFlow and Keras and is designed to run on TensorFlow 2.3 or higher. Additionally, to ensure consistency in audio sampling, the ffmpeg library is required for resampling all noise samples to 16,000 Hz before preprocessing.

## 2. Methodology

### 2.1 Data Preparation

The dataset consists of speech recordings from multiple speakers. Each audio sample is labeled with the corresponding speaker identity. To improve robustness and ensure that the model generalizes well, background noise is added to the audio samples, mimicking real-world scenarios where speech signals often contain environmental noise.

#### 2.1.1 Preprocessing Steps

- **Resampling:** All audio recordings are resampled to a standard frequency of 16,000 Hz to maintain consistency across different sources.

- **Noise Augmentation:** Background noise samples are added to the speech data to increase variability and enhance model robustness.

- **Feature Extraction:** Each audio sample is transformed into the frequency domain using the Fast Fourier Transform (FFT), converting time-series data into spectral components.
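
The feature-extraction step can be illustrated with a small NumPy sketch. It is a stand-in for the TensorFlow pipeline used later, and the 440 Hz test tone is an assumption chosen for the example. Only the magnitude of the first half of the FFT (the positive frequencies) is kept, so a one-second clip at 16,000 Hz yields 8,000 bins spaced 1 Hz apart.

```python
import numpy as np

SAMPLING_RATE = 16000  # samples per second, as in the resampling step

# Hypothetical test signal: one second of a 440 Hz sine wave
t = np.arange(SAMPLING_RATE) / SAMPLING_RATE
audio = np.sin(2 * np.pi * 440 * t)

# Magnitude spectrum of the positive-frequency half
spectrum = np.abs(np.fft.fft(audio))[: SAMPLING_RATE // 2]

# Index of the strongest frequency bin
peak_bin = int(np.argmax(spectrum))
```

Because the clip is exactly one second long, the bin index equals the frequency in Hz, so the strongest bin sits at the tone's frequency.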

### 2.2 Model Architecture

A 1D Convolutional Neural Network (CNN) is designed for speaker classification. CNNs are effective for audio classification as they can capture local temporal patterns and spectral relationships within frequency-domain representations.

#### 2.2.1 CNN Model Design

- **Input Layer:** The input to the model is the FFT-transformed speech sample.

- **Convolutional Layers:** Several 1D convolutional layers are used to capture frequency patterns specific to different speakers.

- **Residual Connections:** To improve training stability and gradient flow, residual connections are introduced.

- **Batch Normalization & Dropout:** These techniques are commonly used to stabilize training and reduce overfitting; the model built in this notebook does not include them.

- **Fully Connected Layers:** The extracted features are fed into dense layers for final speaker classification.

In [1]:
import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import shutil
import numpy as np

import tensorflow as tf
import keras

from pathlib import Path
from IPython.display import display, Audio

- Data Source: https://www.kaggle.com/kongaevans/speaker-recognition-dataset/

In [2]:
!nvidia-smi

Mon Mar 17 16:09:58 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   40C    P8             11W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [3]:
import tensorflow as tf
print("Num GPUs Available:", len(tf.config.experimental.list_physical_devices('GPU')))


Num GPUs Available: 1


In [4]:
!kaggle datasets download -d kongaevans/speaker-recognition-dataset

Dataset URL: https://www.kaggle.com/datasets/kongaevans/speaker-recognition-dataset
License(s): unknown
Downloading speaker-recognition-dataset.zip to /content
100% 230M/231M [00:11<00:00, 23.5MB/s]
100% 231M/231M [00:11<00:00, 21.7MB/s]


In [5]:
!unzip -qq speaker-recognition-dataset.zip

In [6]:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Set memory growth to avoid TensorFlow using all GPU memory
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        print("GPU is activated and ready to use!")
    except RuntimeError as e:
        print(e)


GPU is activated and ready to use!


In [8]:
# DATASET_ROOT = "/content/drive/MyDrive/Colab Notebooks/16000_pcm_speeches"
DATASET_ROOT = "/content/16000_pcm_speeches"

# The folders in which we will put the audio samples and the noise samples
AUDIO_SUBFOLDER = "audio"
NOISE_SUBFOLDER = "noise"

DATASET_AUDIO_PATH = os.path.join(DATASET_ROOT, AUDIO_SUBFOLDER)
DATASET_NOISE_PATH = os.path.join(DATASET_ROOT, NOISE_SUBFOLDER)

# Fraction of samples to use for validation
VALID_SPLIT = 0.1

# Seed to use when shuffling the dataset and the noise
SHUFFLE_SEED = 43

# The sampling rate to use.
SAMPLING_RATE = 16000

# The factor to multiply the noise with according to:
#   noisy_sample = sample + noise * prop * scale
#      where prop = sample_amplitude / noise_amplitude
SCALE = 0.5

BATCH_SIZE = 128
EPOCHS = 1

In [9]:
DATASET_AUDIO_PATH

'/content/16000_pcm_speeches/audio'

In [10]:
DATASET_NOISE_PATH

'/content/16000_pcm_speeches/noise'

## Data Preparation

The dataset consists of speech samples from five different speakers and background noise samples. To ensure an organized dataset structure for preprocessing and model training, we first categorize the data into two main groups:

**Speech Samples**

- Each of the five speakers has a dedicated folder containing 1,500 audio files, each lasting 1 second and sampled at 16,000 Hz.

**Background Noise Samples**
- The dataset includes two folders for noise samples, containing a total of six long audio files. These noise files need to be resampled to 16,000 Hz and segmented into 354 one-second noise samples for augmentation during training.

In [11]:
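# Make sure the destination folders exist before moving anything into them
os.makedirs(DATASET_AUDIO_PATH, exist_ok=True)
os.makedirs(DATASET_NOISE_PATH, exist_ok=True)

for folder in os.listdir(DATASET_ROOT):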
    if os.path.isdir(os.path.join(DATASET_ROOT, folder)):
        if folder in [AUDIO_SUBFOLDER, NOISE_SUBFOLDER]:
            # If folder is `audio` or `noise`, do nothing
            continue
        elif folder in ["other", "_background_noise_"]:
            # If folder is one of the folders that contains noise samples,
            # move it to the `noise` folder
            shutil.move(
                os.path.join(DATASET_ROOT, folder),
                os.path.join(DATASET_NOISE_PATH, folder),
            )
        else:
            # Otherwise, it should be a speaker folder, then move it to
            # `audio` folder
            shutil.move(
                os.path.join(DATASET_ROOT, folder),
                os.path.join(DATASET_AUDIO_PATH, folder),
            )

## Noise Preparation
To ensure the robustness of our speaker recognition model, we introduce background noise augmentation. This step enhances the model’s generalization by simulating real-world conditions where speech signals are often corrupted by environmental noise.

**Objective**
- **Load all noise samples:** These samples should already be resampled to 16,000 Hz to maintain consistency with the speech data.
- **Segment the noise files into 1-second chunks:** Since our speech samples are 1 second long, we divide the noise files into equal 1-second segments (16,000 samples per chunk).
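
The segmentation step can be sketched in NumPy as follows. This is an illustration, not the TensorFlow code used in the notebook: the helper name `split_into_seconds` and the 3.5-second random signal are assumptions made for the example.

```python
import numpy as np

SAMPLING_RATE = 16000

def split_into_seconds(signal, rate=SAMPLING_RATE):
    """Split a 1-D signal into whole one-second chunks, dropping the remainder."""
    n_chunks = len(signal) // rate
    return signal[: n_chunks * rate].reshape(n_chunks, rate)

# Hypothetical 3.5-second noise recording
noise = np.random.default_rng(0).normal(size=int(3.5 * SAMPLING_RATE))
chunks = split_into_seconds(noise)
```

The trailing half second is discarded, so every chunk has the same length as a speech sample.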

In [12]:
# Get the list of all noise files
noise_paths = []
for subdir in os.listdir(DATASET_NOISE_PATH):
    subdir_path = Path(DATASET_NOISE_PATH) / subdir
    if os.path.isdir(subdir_path):
        noise_paths += [
            os.path.join(subdir_path, filepath)
            for filepath in os.listdir(subdir_path)
            if filepath.endswith(".wav")
        ]
if not noise_paths:
    raise RuntimeError(f"Could not find any files at {DATASET_NOISE_PATH}")
print(
    "Found {} files belonging to {} directories".format(
        len(noise_paths), len(os.listdir(DATASET_NOISE_PATH))
    )
)

Found 6 files belonging to 3 directories


In [13]:
# Resample all noise files to 16,000 Hz
command = (
    "for dir in `ls -1 " + DATASET_NOISE_PATH + "`; do "
    "for file in `ls -1 " + DATASET_NOISE_PATH + "/$dir/*.wav`; do "
    "sample_rate=`ffprobe -hide_banner -loglevel panic -show_streams "
    "$file | grep sample_rate | cut -f2 -d=`; "
    "if [ $sample_rate -ne 16000 ]; then "
    "ffmpeg -hide_banner -loglevel panic -y "
    "-i $file -ar 16000 temp.wav; "
    "mv temp.wav $file; "
    "fi; done; done"
)
os.system(command)


# Split noise into chunks of 16,000 steps each
def load_noise_sample(path):
    sample, sampling_rate = tf.audio.decode_wav(
        tf.io.read_file(path), desired_channels=1
    )
    if sampling_rate == SAMPLING_RATE:
        # Number of slices of 16000 each that can be generated from the noise sample
        slices = int(sample.shape[0] / SAMPLING_RATE)
        sample = tf.split(sample[: slices * SAMPLING_RATE], slices)
        return sample
    else:
        print("Sampling rate for {} is incorrect. Ignoring it".format(path))
        return None


noises = []
for path in noise_paths:
    sample = load_noise_sample(path)
    if sample:
        noises.extend(sample)
noises = tf.stack(noises)

print(
    "{} noise files were split into {} noise samples where each is {} sec. long".format(
        len(noise_paths), noises.shape[0], noises.shape[1] // SAMPLING_RATE
    )
)

6 noise files were split into 354 noise samples where each is 1 sec. long


In [14]:
def paths_and_labels_to_dataset(audio_paths, labels):
    """Constructs a dataset of audios and labels."""
    path_ds = tf.data.Dataset.from_tensor_slices(audio_paths)
    audio_ds = path_ds.map(
        lambda x: path_to_audio(x), num_parallel_calls=tf.data.AUTOTUNE
    )
    label_ds = tf.data.Dataset.from_tensor_slices(labels)
    return tf.data.Dataset.zip((audio_ds, label_ds))


def path_to_audio(path):
    """Reads and decodes an audio file."""
    audio = tf.io.read_file(path)
    audio, _ = tf.audio.decode_wav(audio, 1, SAMPLING_RATE)
    return audio




In [15]:
def add_noise(audio, noises=None, scale=0.5):
    if noises is not None:
        # Create a random tensor of the same size as audio ranging from
        # 0 to the number of noise stream samples that we have.
        tf_rnd = tf.random.uniform(
            (tf.shape(audio)[0],), 0, noises.shape[0], dtype=tf.int32
        )
        noise = tf.gather(noises, tf_rnd, axis=0)

        # Get the amplitude proportion between the audio and the noise
        prop = tf.math.reduce_max(audio, axis=1) / tf.math.reduce_max(noise, axis=1)
        prop = tf.repeat(tf.expand_dims(prop, axis=1), tf.shape(audio)[1], axis=1)

        # Adding the rescaled noise to audio
        audio = audio + noise * prop * scale

    return audio
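

The scaling rule in `add_noise` can be checked on a few hand-picked numbers. The NumPy sketch below is an illustration for a single clip (the helper name `mix_noise` and the sample values are assumptions); like the function above, it uses the signed maximum rather than the absolute peak.

```python
import numpy as np

def mix_noise(audio, noise, scale=0.5):
    """Add noise rescaled so that its maximum is `scale` times the audio maximum."""
    prop = audio.max() / noise.max()  # amplitude proportion between audio and noise
    return audio + noise * prop * scale

audio = np.array([0.0, 0.8, -0.4, 0.2])   # hypothetical speech samples
noise = np.array([0.1, -0.2, 0.2, 0.0])   # hypothetical noise samples
noisy = mix_noise(audio, noise)
```

Here `prop` is 0.8 / 0.2 = 4, so with `scale=0.5` the noise is doubled before being added.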




In [16]:
def audio_to_fft(audio):
    # Since tf.signal.fft applies FFT on the innermost dimension,
    # we need to squeeze the dimensions and then expand them again
    # after FFT
    audio = tf.squeeze(audio, axis=-1)
    fft = tf.signal.fft(
        tf.cast(tf.complex(real=audio, imag=tf.zeros_like(audio)), tf.complex64)
    )
    fft = tf.expand_dims(fft, axis=-1)

    # Return the absolute value of the first half of the FFT
    # which represents the positive frequencies
    return tf.math.abs(fft[:, : (audio.shape[1] // 2), :])




In [17]:
# Get the list of audio file paths along with their corresponding labels

class_names = os.listdir(DATASET_AUDIO_PATH)
print(
    "Our class names: {}".format(
        class_names,
    )
)

audio_paths = []
labels = []
for label, name in enumerate(class_names):
    print(
        "Processing speaker {}".format(
            name,
        )
    )
    dir_path = Path(DATASET_AUDIO_PATH) / name
    speaker_sample_paths = [
        os.path.join(dir_path, filepath)
        for filepath in os.listdir(dir_path)
        if filepath.endswith(".wav")
    ]
    audio_paths += speaker_sample_paths
    labels += [label] * len(speaker_sample_paths)

print(
    "Found {} files belonging to {} classes.".format(len(audio_paths), len(class_names))
)



Our class names: ['Nelson_Mandela', 'Benjamin_Netanyau', 'Jens_Stoltenberg', 'Julia_Gillard', 'Magaret_Tarcher', '.ipynb_checkpoints']
Processing speaker Nelson_Mandela
Processing speaker Benjamin_Netanyau
Processing speaker Jens_Stoltenberg
Processing speaker Julia_Gillard
Processing speaker Magaret_Tarcher
Processing speaker .ipynb_checkpoints
Found 7501 files belonging to 6 classes.
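
The listing above includes `.ipynb_checkpoints`, a folder created by Jupyter, as a sixth class. Since the output layer is sized by `len(class_names)`, it also adds a sixth output unit to the model. A minimal sketch of one way to restrict the class list to visible speaker directories, assuming hidden entries start with a dot (the helper name `list_speaker_dirs` is hypothetical):

```python
import os
import tempfile

def list_speaker_dirs(audio_root):
    """Return sorted names of non-hidden subdirectories of audio_root."""
    return sorted(
        name
        for name in os.listdir(audio_root)
        if not name.startswith(".")
        and os.path.isdir(os.path.join(audio_root, name))
    )

# Demonstration on a throwaway directory tree
with tempfile.TemporaryDirectory() as root:
    for name in ["Julia_Gillard", "Nelson_Mandela", ".ipynb_checkpoints"]:
        os.makedirs(os.path.join(root, name))
    speakers = list_speaker_dirs(root)
```

Sorting also makes the label-to-name mapping independent of the order in which `os.listdir` returns entries.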


In [18]:
# Shuffle
rng = np.random.RandomState(SHUFFLE_SEED)
rng.shuffle(audio_paths)
rng = np.random.RandomState(SHUFFLE_SEED)
rng.shuffle(labels)

# Split into training and validation
num_val_samples = int(VALID_SPLIT * len(audio_paths))
print("Using {} files for training.".format(len(audio_paths) - num_val_samples))
train_audio_paths = audio_paths[:-num_val_samples]
train_labels = labels[:-num_val_samples]

print("Using {} files for validation.".format(num_val_samples))
valid_audio_paths = audio_paths[-num_val_samples:]
valid_labels = labels[-num_val_samples:]

# Create 2 datasets, one for training and the other for validation
train_ds = paths_and_labels_to_dataset(train_audio_paths, train_labels)
train_ds = train_ds.shuffle(buffer_size=BATCH_SIZE * 8, seed=SHUFFLE_SEED).batch(
    BATCH_SIZE
)

valid_ds = paths_and_labels_to_dataset(valid_audio_paths, valid_labels)
valid_ds = valid_ds.shuffle(buffer_size=32 * 8, seed=SHUFFLE_SEED).batch(32)


# Add noise to the training set
train_ds = train_ds.map(
    lambda x, y: (add_noise(x, noises, scale=SCALE), y),
    num_parallel_calls=tf.data.AUTOTUNE,
)

# Transform audio wave to the frequency domain using `audio_to_fft`
train_ds = train_ds.map(
    lambda x, y: (audio_to_fft(x), y), num_parallel_calls=tf.data.AUTOTUNE
)
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)

valid_ds = valid_ds.map(
    lambda x, y: (audio_to_fft(x), y), num_parallel_calls=tf.data.AUTOTUNE
)
valid_ds = valid_ds.prefetch(tf.data.AUTOTUNE)

Using 6751 files for training.
Using 750 files for validation.
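
The shuffle relies on two identically seeded generators producing the same permutation for lists of equal length, which keeps each path paired with its label. A small sketch with made-up file names (illustrative only) shows the alignment and reproduces the split sizes reported above:

```python
import numpy as np

SHUFFLE_SEED = 43
VALID_SPLIT = 0.1

# Hypothetical stand-ins for the real paths and labels
paths = [f"clip_{i}.wav" for i in range(7501)]
labels = list(range(7501))  # label i belongs to clip_i

# Shuffle both lists with generators seeded identically
np.random.RandomState(SHUFFLE_SEED).shuffle(paths)
np.random.RandomState(SHUFFLE_SEED).shuffle(labels)

# Every path is still aligned with its label after shuffling
aligned = all(p == f"clip_{l}.wav" for p, l in zip(paths, labels))

num_val = int(VALID_SPLIT * len(paths))
num_train = len(paths) - num_val
```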


## Model Definition
### Residual Block Design
- The core component of the model is the residual block, which consists of multiple 1D convolutional layers with shortcut connections. The shortcut connections enable the network to learn residual mappings, which help mitigate the vanishing gradient problem and allow for deeper architectures.

#### Residual Block Implementation
Each residual block follows this structure:

* **Shortcut Connection:** A 1×1 convolution is applied to match the dimensions of the input and output.
* **Convolutional Layers:** Two or more 1D convolutional layers (with kernel size = 3) are used to extract local patterns from speech signals.
* **Activation Function:** ReLU is applied after each convolution except the last, whose activation follows the residual addition.
* **Residual Addition:** The shortcut connection is added to the main branch, reinforcing identity mappings.
* **Max Pooling:** A downsampling operation (pool size = 2) is applied to reduce temporal dimensions while preserving essential features.
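
The data flow through one block can be traced with a single-channel NumPy toy. This is a sketch under simplifying assumptions, not the Keras layers: one convolution instead of a stack, hand-picked weights (an identity kernel and a shortcut weight of 1), and no bias terms.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, -4.0, 5.0, -6.0, 7.0, -8.0])

kernel = np.array([0.0, 1.0, 0.0])  # hand-picked kernel of size 3
shortcut_weight = 1.0               # a 1x1 convolution on one channel is a scalar multiply

main = np.convolve(x, kernel, mode="same")  # main branch with "same" padding
shortcut = shortcut_weight * x              # shortcut branch
added = np.maximum(main + shortcut, 0.0)    # residual addition followed by ReLU
pooled = added.reshape(-1, 2).max(axis=1)   # max pooling with pool size 2 and stride 2
```

The pooled output has half the length of the input, which is the downsampling each block contributes.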

In [19]:
def residual_block(x, filters, conv_num=3, activation="relu"):
    # Shortcut connection using 1x1 convolution
    s = keras.layers.Conv1D(filters, 1, padding="same")(x)

    # Stacked convolutional layers
    for i in range(conv_num - 1):
        x = keras.layers.Conv1D(filters, 3, padding="same")(x)
        x = keras.layers.Activation(activation)(x)

    # Final convolutional layer
    x = keras.layers.Conv1D(filters, 3, padding="same")(x)

    # Adding the shortcut connection
    x = keras.layers.Add()([x, s])
    x = keras.layers.Activation(activation)(x)

    # Downsampling with MaxPooling
    return keras.layers.MaxPool1D(pool_size=2, strides=2)(x)


### Model Architecture and Construction
The model is constructed using stacked residual blocks followed by fully connected layers.

The architecture follows a hierarchical feature extraction approach, where:

- Initial layers capture low-level spectral features.
- Deeper layers capture high-level temporal patterns in the speech signal.

In [20]:
def build_model(input_shape, num_classes):
    inputs = keras.layers.Input(shape=input_shape, name="input")

    # Feature extraction using Residual CNN blocks
    x = residual_block(inputs, 16, 2)   # First residual block
    x = residual_block(x, 32, 2)        # Second residual block
    x = residual_block(x, 64, 3)        # Third residual block
    x = residual_block(x, 128, 3)       # Fourth residual block
    x = residual_block(x, 128, 3)       # Fifth residual block

    # Global Feature Aggregation
    x = keras.layers.AveragePooling1D(pool_size=3, strides=3)(x)
    x = keras.layers.Flatten()(x)

    # Fully Connected Layers
    x = keras.layers.Dense(256, activation="relu")(x)
    x = keras.layers.Dense(128, activation="relu")(x)

    # Output Layer with Softmax Activation
    outputs = keras.layers.Dense(num_classes, activation="softmax", name="output")(x)

    return keras.models.Model(inputs=inputs, outputs=outputs)


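### Model Summary

The model is built for inputs of `SAMPLING_RATE // 2` frequency bins with a single channel and one output unit per class. It is compiled in the next step with sparse categorical cross-entropy loss for multi-class classification.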
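
As a cross-check on the layer shapes, the sequence length after each stage of `build_model` can be computed by hand. The sketch below assumes the default `valid` padding of the pooling layers, so each output length is `(n - pool) // stride + 1`; the helper name `pooled_length` is illustrative.

```python
def pooled_length(n, pool, stride):
    """Output length of a 1-D pooling layer with 'valid' padding."""
    return (n - pool) // stride + 1

length = 16000 // 2  # FFT bins fed to the model
for _ in range(5):   # five residual blocks, each ending in MaxPool1D(2, 2)
    length = pooled_length(length, pool=2, stride=2)
after_blocks = length

after_avg = pooled_length(after_blocks, pool=3, stride=3)  # AveragePooling1D(3, 3)
flat_features = after_avg * 128                            # 128 filters in the last block
dense_params = flat_features * 256 + 256                   # weights and biases of Dense(256)
```

Under these assumptions most of the trainable parameters sit in the first dense layer rather than in the convolutional blocks.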

In [21]:
# Initialize Model
model = build_model((SAMPLING_RATE // 2, 1), len(class_names))

# Print Model Summary
model.summary()


### Model Compilation & Training Strategy
The model is compiled with the Adam optimizer, which adapts the learning rate per parameter. An EarlyStopping callback halts training when the monitored validation metric stops improving, and a ModelCheckpoint callback saves the best model seen so far.

In [22]:
# Compile the model using Adam optimizer
model.compile(
    optimizer="Adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Callbacks for optimized training
model_save_filename = "model.keras"
earlystopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
mdlcheckpoint_cb = keras.callbacks.ModelCheckpoint(
    model_save_filename, monitor="val_accuracy", save_best_only=True
)


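- Early Stopping: Monitors validation loss (the Keras default, since no `monitor` argument is passed) and stops training once it has not improved for 10 consecutive epochs, restoring the best weights.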
- Checkpointing: Saves the best-performing model based on validation accuracy.
- Sparse Categorical Crossentropy: Used since labels are integer-encoded instead of one-hot encoded.
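
The patience rule behind early stopping can be sketched in plain Python. This is a simplified model of the behavior, not the Keras implementation: it assumes the default `val_loss` monitor and counts an epoch as an improvement only if its loss is strictly lower than the best so far. The function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch); stop_epoch is None if training runs to the end."""
    best, best_epoch, wait = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember it and reset the patience counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation losses, one per epoch (epochs are 0-indexed)
stop, best = early_stopping_epoch([1.0, 0.8, 0.9, 0.85, 0.95], patience=2)

# A single-epoch run can never exhaust a patience of 10
single = early_stopping_epoch([0.3092], patience=10)
```

With `EPOCHS = 1`, as configured earlier, the patience of 10 cannot be reached, so the callback only takes effect on longer runs.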

## Training


In [23]:
history = model.fit(
    train_ds,
    epochs=EPOCHS,
    validation_data=valid_ds,
    callbacks=[earlystopping_cb, mdlcheckpoint_cb],
)

53/53 ━━━━━━━━━━━━━━━━━━━━ 61s 765ms/step - accuracy: 0.4732 - loss: 1.9176 - val_accuracy: 0.8853 - val_loss: 0.3092


## Evaluation

In [24]:
print(model.evaluate(valid_ds))

24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 14ms/step - accuracy: 0.8940 - loss: 0.2751
[0.3092075288295746, 0.8853333592414856]


## Model Demonstration
To assess the model's effectiveness in recognizing speakers from audio samples, the following demonstration is carried out:

- Predicting the speaker from a given noisy audio sample.
- Comparing the predicted speaker with the actual label.
- Listening to the audio sample to verify the model's robustness in noisy environments.

### Steps
#### Preparing the Dataset
- Convert audio paths and their labels into a TensorFlow dataset.
- Shuffle the dataset to ensure randomness in sample selection.
- Batch the dataset for efficient processing.
- Introduce noise to simulate real-world scenarios where audio might be distorted.


In [25]:
SAMPLES_TO_DISPLAY = 10

# Convert test audio paths and labels into a dataset
test_ds = paths_and_labels_to_dataset(valid_audio_paths, valid_labels)

# Shuffle and batch the test dataset
test_ds = test_ds.shuffle(buffer_size=BATCH_SIZE * 8, seed=SHUFFLE_SEED).batch(BATCH_SIZE)

# Apply noise augmentation to simulate real-world scenarios
test_ds = test_ds.map(
    lambda x, y: (add_noise(x, noises, scale=SCALE), y),
    num_parallel_calls=tf.data.AUTOTUNE,
)


### Running Predictions on Sampled Audio
- Extract a subset of samples randomly from the test batch.
- Convert the audio signal to its frequency representation (FFT) for input to the model.
- Obtain predictions from the trained model.
- Compare predictions with actual speaker labels and visually highlight correct or incorrect classifications.
- Play the audio sample to verify the accuracy of the model even in noisy conditions.
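
The prediction and comparison steps reduce to an `argmax` over the softmax outputs. A minimal NumPy sketch, using made-up probabilities and a shortened class list as assumptions for illustration:

```python
import numpy as np

class_names = ["Nelson_Mandela", "Benjamin_Netanyau", "Jens_Stoltenberg"]

# Hypothetical softmax outputs for two clips (each row sums to 1)
probs = np.array([
    [0.10, 0.70, 0.20],
    [0.80, 0.15, 0.05],
])
true_labels = np.array([1, 2])

# The predicted class is the index of the largest probability in each row
pred_labels = np.argmax(probs, axis=-1)
predicted = [class_names[i] for i in pred_labels]
correct = (pred_labels == true_labels).tolist()
```

In this toy case the first clip is classified correctly and the second is not, which is the distinction the color coding in the demonstration highlights.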

In [27]:
for audios, labels in test_ds.take(1):
    # Convert the audio signal into its frequency representation
    ffts = audio_to_fft(audios)

    # Predict speaker identity
    y_pred = model.predict(ffts)

    # Select random samples for demonstration
    rnd = np.random.randint(0, BATCH_SIZE, SAMPLES_TO_DISPLAY)
    audios = audios.numpy()[rnd, :, :]
    labels = labels.numpy()[rnd]
    y_pred = np.argmax(y_pred, axis=-1)[rnd]

    for index in range(SAMPLES_TO_DISPLAY):
        # Compare the predicted and actual speaker labels
        correct_prediction = labels[index] == y_pred[index]

        # Print the speaker comparison with color-coded results
        print(
            "Speaker:\33{} {}\33[0m\tPredicted:\33{} {}\33[0m".format(
                "[92m" if correct_prediction else "[91m",  # Green for correct, Red for incorrect
                class_names[labels[index]],
                "[92m" if correct_prediction else "[91m",
                class_names[y_pred[index]],
            )
        )

        # Play the audio sample
        display(Audio(audios[index, :, :].squeeze(), rate=SAMPLING_RATE))


4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step
Speaker: Magaret_Tarcher	Predicted: Magaret_Tarcher


Speaker: Magaret_Tarcher	Predicted: Magaret_Tarcher


Speaker: Nelson_Mandela	Predicted: Nelson_Mandela


Speaker: Nelson_Mandela	Predicted: Nelson_Mandela


Speaker: Magaret_Tarcher	Predicted: Magaret_Tarcher


Speaker: Nelson_Mandela	Predicted: Nelson_Mandela


Speaker: Magaret_Tarcher	Predicted: Magaret_Tarcher


Speaker: Julia_Gillard	Predicted: Julia_Gillard


Speaker: Julia_Gillard	Predicted: Julia_Gillard


Speaker: Benjamin_Netanyau	Predicted: Benjamin_Netanyau


## Key Takeaways
- Real-world robustness: The demonstration runs the model on noise-augmented validation samples to probe its behavior under challenging conditions.
- Random sampling: Ten examples are drawn at random from a shuffled batch, so the displayed clips are not hand-picked.
- FFT-based input transformation: Instead of using raw waveforms, we leverage frequency representations for better classification.
- Color-coded result display: Improves interpretability by highlighting correct vs. incorrect predictions.