# Distance-based Speech Recognition

This project implements a **distance-based speech recognition** system using **Librosa**, **NumPy**, and **Matplotlib**. The system recognizes speech commands from the **Mini Speech Commands Dataset** by performing feature extraction and using **Euclidean** and **Cosine distance** to classify input audio signals.

## Table of Contents

- [Project Description](#project-description)
- [Dependencies](#dependencies)
- [Methodology](#methodology)
  - [1. Divide the Signal into Frames](#1-divide-the-signal-into-frames)
  - [2. Compute the Spectrum Using FFT](#2-compute-the-spectrum-using-fft)
  - [3. Compute Class Prototypes](#3-compute-class-prototypes)
  - [4. Testing](#4-testing)
    - [Euclidean Distance Formula](#euclidean-distance-formula)
    - [Cosine Distance Formula](#cosine-distance-formula)
  - [5. Predict Class](#5-predict-class)
  - [6. Compute Accuracy](#6-compute-accuracy)
- [Conclusion](#conclusion)

## Project Description

This project aims to build a simple speech recognition system using the **Mini Speech Commands Dataset**. The model is based on the **Euclidean distance** and **Cosine distance** between the test features and class prototypes computed during training. The system extracts **spectral features** from the audio signal, computes **prototypes** for each class, and classifies test signals by calculating the distance to each class prototype.

### Key Features:
- **FFT (Fast Fourier Transform)** is used for frequency domain analysis.
- **Euclidean distance** and **Cosine distance** are used for classification.
- The model is trained and evaluated on a set of predefined speech commands.

---

## Dependencies

This project requires the following Python libraries:
- **Librosa**: For audio processing and feature extraction.
- **NumPy**: For numerical operations.
- **Matplotlib**: For plotting and visualizations.
- **SciPy**: For distance metrics (`scipy.spatial.distance`).

---

## Methodology

### 1. Divide the Signal into Frames

We begin by dividing the audio signal into overlapping frames. This allows us to analyze smaller segments of the signal to capture both **temporal** and **spectral** information.

- **Frame length**: 400 samples (25 ms at a 16 kHz sampling rate)
- **Hop length**: 160 samples (10 ms at a 16 kHz sampling rate), so consecutive frames overlap by 240 samples

A 25 ms window is short enough that speech is approximately stationary within it, yet long enough to resolve its spectrum. The 10 ms hop then tracks how that spectrum changes over time.
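A minimal NumPy-only sketch of the framing step, assuming a 1 s clip at 16 kHz. The name `frame_signal_np` is hypothetical; the notebook itself uses `librosa.util.frame`.

```python
import numpy as np

def frame_signal_np(signal, frame_length=400, hop_length=160):
    """Return overlapping frames as rows, shape (num_frames, frame_length)."""
    # Build every window with stride 1, then keep every hop_length-th one.
    windows = np.lib.stride_tricks.sliding_window_view(signal, frame_length)
    return windows[::hop_length]

signal = np.arange(16000, dtype=np.float32)  # stand-in for 1 s of audio at 16 kHz
frames = frame_signal_np(signal)
print(frames.shape)   # (98, 400)
print(frames[1, 0])   # 160.0: the second frame starts one hop later
```

Because the stand-in signal is just a ramp of sample indices, the first value of each row shows where that frame starts.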

#### Equation:
If the audio signal $x(t)$ has $N$ samples, the number of complete frames $F$ is:

$$
F = \left\lfloor \frac{N - \text{frame\_length}}{\text{hop\_length}} \right\rfloor + 1
$$
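The frame count is an integer only when the division is rounded down, so a small sketch in code uses integer division. The function name `num_frames` is an assumption for illustration and does not appear in the notebook.

```python
def num_frames(n_samples, frame_length, hop_length):
    """Number of complete frames that fit in a signal of n_samples samples."""
    if n_samples < frame_length:
        return 0  # signal too short for even one frame
    return (n_samples - frame_length) // hop_length + 1

print(num_frames(16000, 400, 160))  # 98
print(num_frames(16000, 640, 160))  # 97
```

A 1 s clip at 16 kHz therefore gives 98 frames of 400 samples, or 97 frames of 640 samples, with a 160-sample hop.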

---

### 2. Compute the Spectrum Using FFT

Next, we compute the **Fast Fourier Transform (FFT)** for each frame. The FFT transforms the time-domain signal into the frequency domain, allowing us to analyze the spectral content.

- **FFT size**: 512 points

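The choice of $n_{\text{fft}} = 512$ is the smallest power of two that covers a 400-sample frame. At 16 kHz it gives a bin spacing of $16000 / 512 = 31.25$ Hz, a balance between **frequency resolution** and **computational efficiency**. Note that `compute_fft` in the notebook below calls `np.fft.rfft` without an `n` argument, so there the transform length equals the frame length.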
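As a hedged illustration of what a 512-point FFT on 400-sample frames looks like in NumPy, the sketch below zero-pads each frame by passing `n=512`. The random `frames` array is a placeholder, not data from the project.

```python
import numpy as np

sr = 16000       # sampling rate in Hz
n_fft = 512      # FFT size from the text above
frames = np.random.default_rng(0).standard_normal((98, 400))  # placeholder frames

# Passing n=n_fft zero-pads each 400-sample frame to 512 samples before the FFT.
spectra = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
freqs = np.fft.rfftfreq(n_fft, d=1 / sr)  # centre frequency of each bin in Hz
bin_spacing = sr / n_fft

print(spectra.shape)  # (98, 257): n_fft // 2 + 1 bins per frame
print(bin_spacing)    # 31.25
```

Without the `n` argument, `np.fft.rfft` uses the frame length itself, which for 400-sample frames yields 201 bins instead of 257.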

#### Formula:
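The FFT is an efficient algorithm for the discrete Fourier transform (DFT). For a frame $x(n)$ of length $N$, the DFT is defined as:

$$
X(k) = \sum_{n=0}^{N-1} x(n) \cdot e^{-j 2\pi k n / N}, \quad k = 0, 1, \dots, N-1
$$

Where:
- $x(n)$ is the signal in the time domain
- $X(k)$ is the signal in the frequency domain at bin $k$, which corresponds to the frequency $k \cdot f_s / N$ for a sampling rate $f_s$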
- $N$ is the length of the signal or FFT size
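To make the definition concrete, here is a direct (slow) evaluation of the sum using an integer bin index `k`. It is a sketch for checking the formula on a tiny signal; the notebook relies on `np.fft.rfft` instead.

```python
import cmath
import math

def dft(x):
    """Direct O(N^2) evaluation of X(k) = sum_n x(n) * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]

# A cosine that completes exactly 3 cycles over N = 8 samples.
N = 8
x = [math.cos(2 * math.pi * 3 * n / N) for n in range(N)]
magnitudes = [round(abs(X_k), 6) for X_k in dft(x)]
print(magnitudes)  # energy only in bins 3 and 5 (= N - 3), each with magnitude N / 2
```

For a real-valued signal, bins `k` and `N - k` mirror each other, which is why the real-input FFT keeps only the first `N // 2 + 1` bins.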

---

### 3. Compute Class Prototypes

During training, we compute a **prototype** for each speech class. Each training sample's FFT magnitude frames are first averaged over time into a single spectrum vector, and these vectors are then averaged across all samples of the class.

- **Training set**: We compute the **mean** of the FFT features for each class.
- **Prototype for class $c$**: The mean of the time-averaged spectra of all samples of class $c$.

#### Equation:
If the time-averaged magnitude spectra of the training samples in class $c$ are $X_{c1}, X_{c2}, \dots, X_{cN_c}$, then the prototype for class $c$ is:

$$
\mu_c = \frac{1}{N_c} \sum_{i=1}^{N_c} X_{ci}
$$

Where:
- $N_c$ is the number of samples in class $c$
- $X_{ci}$ is the magnitude spectrum of the $i$-th sample in class $c$, averaged over its frames
- $\mu_c$ is the prototype (average spectrum) for class $c$
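The averaging can be sketched on toy data as follows. The helper name `class_prototype` and the numbers are made up for illustration; the two-stage mean mirrors what `compute_class_prototypes` does in the notebook.

```python
import numpy as np

def class_prototype(spectra_per_sample):
    """Average each sample over time, then average those vectors over the class."""
    per_sample_means = np.stack([s.mean(axis=0) for s in spectra_per_sample])
    return per_sample_means.mean(axis=0)

# Two toy "recordings" of one class with different frame counts but 3 bins each.
sample_a = np.array([[1.0, 2.0, 3.0],
                     [3.0, 4.0, 5.0]])        # 2 frames -> mean [2, 3, 4]
sample_b = np.array([[4.0, 5.0, 6.0]])        # 1 frame  -> mean [4, 5, 6]

prototype = class_prototype([sample_a, sample_b])
print(prototype)  # [3. 4. 5.]
```

Averaging per sample first gives every recording the same weight regardless of how many frames it has. Pooling all frames and averaging once would instead weight longer recordings more heavily.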

---

### 4. Testing

For testing, we compute the FFT features of the test signal and calculate the **Euclidean distance** and **Cosine distance** between the test features and each class prototype.

#### Euclidean Distance Formula:
The Euclidean distance between two vectors $x$ and $y$ is given by:

$$
D(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
$$

Where:
- $x_i$ and $y_i$ are the elements of the vectors $x$ and $y$, respectively
- $n$ is the length of the vectors

We compute this distance for each class and select the class with the smallest distance.
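A short sketch of this nearest-prototype rule with Euclidean distance, using made-up labels and three-dimensional vectors in place of real spectra:

```python
import numpy as np

# Hypothetical prototypes: one row per class, one column per frequency bin.
labels = ["yes", "no", "stop"]
prototypes = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
test_vector = np.array([0.9, 0.2, 0.1])

# Broadcasting subtracts the test vector from every prototype row at once.
distances = np.linalg.norm(prototypes - test_vector, axis=1)
predicted = labels[int(np.argmin(distances))]
print(predicted)  # yes
```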

#### Cosine Distance Formula:
The Cosine distance between two vectors $x$ and $y$ is given by:

$$
D_{\text{cos}}(x, y) = 1 - \frac{x \cdot y}{\|x\| \|y\|}
$$

Where:
- $x \cdot y$ is the dot product of $x$ and $y$
- $\|x\|$ and $\|y\|$ are the magnitudes (norms) of the vectors $x$ and $y$, respectively
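One practical difference between the two metrics is sensitivity to amplitude. The sketch below compares a vector with a scaled copy of itself. The `eps` guard against all-zero vectors and the name `safe_cosine_distance` are assumptions added here, not part of the notebook's `cosine_distance`.

```python
import numpy as np

def safe_cosine_distance(x, y, eps=1e-12):
    """1 - cos(angle between x and y); eps avoids division by zero."""
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return 1.0 - float(np.dot(x, y)) / max(denom, eps)

quiet = np.array([1.0, 2.0, 3.0])
loud = 10.0 * quiet  # same spectral shape, ten times the amplitude

euclid = np.linalg.norm(quiet - loud)
cos_d = safe_cosine_distance(quiet, loud)
print(euclid)                  # about 33.67: Euclidean sees a large gap
print(np.isclose(cos_d, 0.0))  # True: cosine ignores the scaling
```

Cosine distance compares only the direction of the two vectors, so a louder recording of the same spectral shape stays close to its prototype.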

---

### 5. Predict Class

After computing the distances for all classes, we select the class with the smallest distance to the test sample. This is the predicted class.

```python
def predict_class(distances):
    return min(distances, key=distances.get)
```
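A brief usage example with hypothetical distances, repeating the function so the snippet runs on its own:

```python
def predict_class(distances):
    return min(distances, key=distances.get)

# Hypothetical distances from one test clip to three class prototypes.
distances = {"go": 0.42, "stop": 0.17, "yes": 0.35}
print(predict_class(distances))  # stop

# On an exact tie, min() returns the first key in insertion order.
print(predict_class({"left": 0.2, "right": 0.2}))  # left
```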

---

### 6. Compute Accuracy

Finally, the model’s performance is evaluated using **accuracy**, which is the proportion of correctly predicted classes compared to the total number of test samples.

#### Accuracy Formula:
The accuracy is computed as:

$$
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}
$$

Where:
- A correct prediction means the predicted class matches the true class label.

```python
def compute_accuracy(predictions, true_labels):
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)
```
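Overall accuracy hides which classes are confused with each other. A confusion matrix is one way to look closer, sketched here with invented labels. The helper `confusion_matrix` is hypothetical and assumes integer labels in the range `0..num_classes-1`.

```python
import numpy as np

def confusion_matrix(true_labels, predictions, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_labels, predictions):
        matrix[t, p] += 1
    return matrix

true_labels = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(true_labels, predictions, num_classes=3)
print(cm)
per_class = cm.diagonal() / cm.sum(axis=1)  # fraction correct within each true class
print(per_class)                            # [0.5 1.  0.5]
print(cm.trace() / cm.sum())                # overall accuracy: 4 / 6
```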

---

## Conclusion

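In this project, we built a **distance-based speech recognition** system that uses **FFT** magnitude spectra as features and **Euclidean** or **Cosine distance** to class prototypes for classification. In the notebook run below, the cosine-distance classifier reaches 37.50% accuracy on the test split, compared with 12.5% for random guessing among eight classes. The approach is simple, and it serves as a baseline for more advanced speech recognition techniques.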

---


In [1]:
import os
import random
import librosa
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import euclidean, cosine, cdist

# Split the data

In [2]:
def load_data_and_split(data_dir, test_size=0.2, seed=20):
    # Initialize lists for data and labels
    train_signals, train_labels = [], []
    test_signals, test_labels = [], []

    # Get class folders
    classes = sorted(os.listdir(data_dir))  # Classes are folder names
    class_to_label = {cls: idx for idx, cls in enumerate(classes)}

    for cls in classes:
        class_folder = os.path.join(data_dir, cls)
        wav_files = [os.path.join(class_folder, f) for f in os.listdir(class_folder) if f.endswith('.wav')]

        # Shuffle and split data into train and test
        random.seed(seed)
        random.shuffle(wav_files)
        split_idx = int(len(wav_files) * (1 - test_size))

        train_files = wav_files[:split_idx]
        test_files = wav_files[split_idx:]

        # Load WAV files and labels
        for wav_file in train_files:
            signal, _ = librosa.load(wav_file, sr=None)  # Load audio file
            train_signals.append(signal)
            train_labels.append(class_to_label[cls])

        for wav_file in test_files:
            signal, _ = librosa.load(wav_file, sr=None)
            test_signals.append(signal)
            test_labels.append(class_to_label[cls])

    return (np.array(train_signals, dtype=object), np.array(train_labels)), \
           (np.array(test_signals, dtype=object), np.array(test_labels))


def frame_signal(signal, frame_size, hop_size):
    """
    Divide a signal into overlapping frames.

    Args:
        signal: The input audio signal (1D NumPy array).
        frame_size: Length of each frame in samples.
        hop_size: Step size between frames in samples.

    Returns:
        A 2D NumPy array where each row is a frame.
    """
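    # librosa returns (frame_size, num_frames); transpose so that each row is one frame
    frames = librosa.util.frame(signal, frame_length=frame_size, hop_length=hop_size).T
    return frames  # shape (num_frames, frame_size), e.g. (98, 400) for 1 s at 16 kHz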

def compute_fft(frames):
    """
    Compute the FFT spectrum for each frame.

    Args:
        frames: A 2D NumPy array where each row is a frame.

    Returns:
        A 2D NumPy array of spectra (magnitude values) for each frame.
    """
    return np.abs(np.fft.rfft(frames, axis=1))

def compute_class_prototypes(train_data, labels):
    """
    Compute the average (prototype) for each speech class based on the FFT frames.
    
    Args:
        train_data: A list or 1D NumPy array where each element is a 2D array (e.g., FFT features of shape `(98, 201)`).
        labels: A 1D NumPy array of labels corresponding to each element in train_data.

    Returns:
        class_prototypes: A dictionary where keys are labels and values are the mean of all frames for that class.
    """
    class_prototypes = {}

    # Iterate over the unique class labels
    for label in np.unique(labels):
        # Indices of the training samples that belong to the current label
        indices = np.where(labels == label)[0]

        # Collect the FFT feature arrays (one per sample) for the current label
        class_frames = [train_data[i] for i in indices]

        # Average each sample over its frames (time axis) to get one vector per sample
        class_frames_mean = np.array([np.mean(frame, axis=0) for frame in class_frames])

        # Average the per-sample vectors to obtain the class prototype
        class_prototypes[label] = np.mean(class_frames_mean, axis=0)
    
    return class_prototypes

def cosine_distance(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return 1 - (dot_product / (norm_vec1 * norm_vec2))

def compute_distances(test_features, class_prototypes, metric="cosine"):
    """
    Compute distances between test features and class prototypes.

    Args:
        test_features: A 2D array of shape `(num_frames, num_features)` for a single test sample.
        class_prototypes: A dictionary where keys are labels and values are 1D prototype vectors of shape `(num_features,)`.
        metric: Any metric name accepted by `scipy.spatial.distance.cdist` (e.g. 'euclidean', 'cosine').

    Returns:
        distances: A dictionary where keys are class labels and values are distances.
    """
    # Average the test sample over its frames once; the result is compared to every prototype
    test_features_mean = np.mean(test_features, axis=0).reshape(1, -1)

    distances = {}
    for label, prototype in class_prototypes.items():
        # cdist expects 2D inputs, so each vector is passed as a single-row matrix
        pairwise = cdist(test_features_mean, prototype.reshape(1, -1), metric=metric)

        # pairwise has shape (1, 1); extract the scalar distance
        distances[label] = pairwise[0, 0]

    return distances


def predict_class(distances):
    return min(distances, key=distances.get)

def compute_accuracy(predictions, true_labels):
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)



In [3]:
!ls -1A mini_speech_commands/mini_speech_commands/down | wc -l

1000


In [4]:
# Define dataset path
data_dir = "mini_speech_commands/mini_speech_commands"

# Load data and split
(train_signal, train_labels), (test_signal, test_labels) = load_data_and_split(data_dir, test_size=0.2)

print(f"Training samples: {len(train_signal)}, Testing samples: {len(test_signal)}")


Training samples: 5600, Testing samples: 2400


In [5]:
# Test_sample
tst_signal, sr = librosa.load('mini_speech_commands/mini_speech_commands/go/0132a06d_nohash_2.wav', sr=None)
len(tst_signal), tst_signal 

(16000,
 array([ 0.00036621,  0.00057983, -0.00036621, ..., -0.00030518,
        -0.00021362, -0.0005188 ], dtype=float32))

In [27]:
# Divide the signals into frames and compute the FFT of each frame
sr = 16000
frame_size = int(0.025 * sr)  # 25ms frame (convert to samples)
hop_size = int(0.010 * sr) # 10ms hop (convert to samples)

train_signal_frames_fft = []
for signal in train_signal:
    frames = frame_signal(signal, frame_size, hop_size)
    fft_frames = compute_fft(frames) 
    train_signal_frames_fft.append(fft_frames) # Max, Min = (98, 201), (41, 201)

In [28]:
frames.shape, fft_frames.shape, train_signal_frames_fft[0].shape, len(train_signal_frames_fft)  

((97, 640), (97, 321), (97, 321), 5600)

In [29]:
max_shape = tuple(np.max([frame.shape for frame in train_signal_frames_fft], axis=0))
min_shape = tuple(np.min([frame.shape for frame in train_signal_frames_fft], axis=0))
max_shape, min_shape

((97, 321), (39, 321))

In [30]:
# compute the average of the frames for each speech class
class_prototypes = compute_class_prototypes(train_signal_frames_fft, train_labels) 

In [31]:
class_prototypes[1].shape

(321,)

# Testing

In [32]:
test_signal_frames_fft = []
for signal in test_signal:
    frames = frame_signal(signal, frame_size, hop_size)
    fft_frames = compute_fft(frames)
    test_signal_frames_fft.append(fft_frames)


In [33]:
# Predictions with cosine distance
# predictions = []
# for test_frame in test_signal_frames_fft:
#     distances = {}
#     for label, prototype in class_prototypes.items():
#         # Compute distances between test_features and the prototype (frame-wise)
#         test_frame_mean = np.mean(test_frame, axis=0)  # Average the frame-wise features for the test frame
        
#         # Compute cosine distance between the test frame and the prototype
#         frame_distances = cosine_distance(test_frame_mean, prototype)
        
#         # Since frame_distances is a float, just assign it directly
#         distances[label] = frame_distances
#     predictions.append(predict_class(distances))


In [34]:
# Step 5: Predict the class of each test sample
predictions = []
for test_frame in test_signal_frames_fft:
    distances = compute_distances(test_frame, class_prototypes, metric='cosine') # euclidean, hamming, cosine
    predictions.append(predict_class(distances))


In [35]:
test_signal_frames_fft[0].shape, class_prototypes[7].shape, max_shape

((84, 321), (321,), (97, 321))

In [36]:
# Step 6: Compute accuracy
accuracy = compute_accuracy(predictions, test_labels)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 37.50%


In [16]:
np.unique(predictions)

array([1, 2, 3, 4, 5, 6, 7, 8])