
---

# Voice-Based Door Access Recognition System

## Project Overview

This project is part of the course **Introduction to Machine Learning (2024/2025)** by **Dr. Agnieszka Jastrzębska**. The aim of this project is to develop a **voice recognition system** for an automated intercom, capable of distinguishing between allowed and disallowed persons using voice recordings. The core idea is to convert voice recordings into spectrograms and apply **Convolutional Neural Networks (CNNs)** for classification.

This is a binary classification problem, where:
- **Class 1 (allowed)**: People allowed to open the door.
- **Class 0 (disallowed)**: People not allowed to open the door.

The data used for the project is derived from the **DAPS (Device and Produced Speech) Dataset**.

---

## Table of Contents
1. [Project Structure](#project-structure)
2. [Data Preprocessing](#data-preprocessing)
3. [Model Architecture](#model-architecture)
4. [Model Training](#model-training)
5. [Model Evaluation](#model-evaluation)
6. [Future Improvements](#future-improvements)
7. [Conclusion](#conclusion)

---

## Project Structure

The project is organized as follows:

```bash
.
├── data                     # Contains raw audio and generated spectrograms
│   ├── allowed              # Audio files for Class 1 (allowed)
│   ├── disallowed           # Audio files for Class 0 (disallowed)
│   ├── spectrograms         # Generated spectrograms for training the model
│   └── test_voices          # Test audio files
├── models                   # Stores trained CNN models
├── notebooks                # Jupyter notebooks for analysis and interaction
├── src                      # Python scripts for data processing, training, and evaluation
│   ├── data_preprocessing.py # Script for loading and generating spectrograms
│   ├── model.py             # CNN model definition
│   ├── train.py             # Script to train the CNN model
│   └── evaluate.py          # Script to evaluate the model
├── tests                    # Unit tests for different components
└── requirements.txt         # Python dependencies
```

---

## Data Preprocessing
The input to our machine learning model is not the raw audio file but its **Mel-spectrogram**, a time-frequency representation of the signal with frequencies mapped onto the mel scale. The steps involved in preprocessing are:

1. **Load Audio**: Each `.mp3` or `.wav` file is loaded using the `librosa` library. Optionally, a random gain (volume change) can be applied to the audio as augmentation:

#### Code example:

```python
import os
import librosa
import librosa.display  # Needed for librosa.display.specshow used when plotting
import numpy as np
import matplotlib
matplotlib.use('Agg')  # For non-GUI rendering
import matplotlib.pyplot as plt
from tqdm import tqdm
from concurrent.futures import ProcessPoolExecutor, as_completed

def augment_audio(y, sr):
    if np.random.rand() > 0.5: # 50% chance to apply gain augmentation
        gain = np.random.uniform(0.7, 1.3) # Random gain factor between 0.7 and 1.3
        y = y * gain # Apply gain
    return y
```


Silent sections are then removed, and if the remaining audio is longer than `max_duration` seconds, it is split into chunks of at most that length.

#### Code example:

```python
def load_audio(audio_path, sr=22050, max_duration=120, top_db=30, augment=False):

    try:
        y, sr = librosa.load(audio_path, sr=sr)
    except Exception as e:
        print(f"Error loading {audio_path}: {e}")
        return [], sr

    # Remove silent parts of the audio
    non_silent_intervals = librosa.effects.split(y, top_db=top_db)
    if len(non_silent_intervals) == 0:
        print(f"No non-silent intervals found in {audio_path}. Skipping.")
        return [], sr
    y_nonsilent = np.concatenate([y[start:end] for start, end in non_silent_intervals])

    # Augment the audio if enabled
    if augment:
        y_nonsilent = augment_audio(y_nonsilent, sr)

    total_duration = len(y_nonsilent) / sr
    if total_duration > max_duration:
        # Split audio into chunks
        max_samples = int(max_duration * sr)
        audio_chunks = [y_nonsilent[i:i + max_samples] for i in range(0, len(y_nonsilent), max_samples)]
    else:
        # Return entire audio as a single chunk if within max_duration
        audio_chunks = [y_nonsilent]
    
    return audio_chunks, sr

```
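
As a rough illustration of the chunking step, the sketch below splits a synthetic signal using the same rule as `load_audio`, but with plain Python lists so it runs without `librosa`. The helper name `split_into_chunks` and the toy sample rate are assumptions made for this example only.

```python
def split_into_chunks(samples, sr, max_duration):
    """Split a sequence of samples into chunks of at most max_duration seconds."""
    total_duration = len(samples) / sr
    if total_duration <= max_duration:
        # Short enough: keep the whole signal as a single chunk
        return [samples]
    max_samples = int(max_duration * sr)
    return [samples[i:i + max_samples] for i in range(0, len(samples), max_samples)]

# Toy example: 250 seconds of "audio" at 10 samples per second (hypothetical values)
sr = 10
signal = [0.0] * (250 * sr)
chunks = split_into_chunks(signal, sr, max_duration=120)
chunk_seconds = [len(c) / sr for c in chunks]
print(chunk_seconds)  # [120.0, 120.0, 10.0]
```

Note that the last chunk can be much shorter than the others, so downstream code may want to decide explicitly whether very short segments are kept.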

2. **Generate Spectrograms**: Mel-spectrograms are generated using `librosa.feature.melspectrogram`. These spectrograms are saved as images (`.png`) for both classes (allowed and disallowed).

#### Code example:

```python 
def generate_spectrogram(audio_data, sr, output_dir, file_name, segment_idx):
    try:
        # Create mel-spectrogram and convert to decibel scale
        S = librosa.feature.melspectrogram(y=audio_data, sr=sr, n_mels=128)
        log_S = librosa.power_to_db(S, ref=np.max)
    except Exception as e:
        print(f"Error generating spectrogram for {file_name} segment {segment_idx}: {e}")
        return None

    # Create the output directory if needed (safe when several workers do this at once)
    os.makedirs(output_dir, exist_ok=True)
    output_file = os.path.join(output_dir, f"{file_name}_segment_{segment_idx}_spectrogram.png")
    try:
        # Plot and save the spectrogram
        plt.figure(figsize=(10, 4))
        librosa.display.specshow(log_S, sr=sr, x_axis='time', y_axis='mel', fmax=8000)
        plt.colorbar(format='%+2.0f dB')
        plt.title('Mel-frequency spectrogram')
        plt.tight_layout()
        plt.savefig(output_file)
        plt.close()
    except Exception as e:
        print(f"Error saving spectrogram for {file_name} segment {segment_idx}: {e}")
        return None
    return output_file
```
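
To make the decibel conversion more concrete, here is a minimal plain-Python sketch of what `librosa.power_to_db(S, ref=np.max)` computes according to its documentation: power values are expressed in dB relative to the maximum and clipped to an 80 dB range. The function below is an approximation written for illustration, not the library's implementation, and the input values are made up.

```python
import math

def power_to_db(power_values, amin=1e-10, top_db=80.0):
    """Convert power values to dB relative to their maximum, clipped to top_db."""
    ref = max(power_values)
    ref_db = 10.0 * math.log10(max(amin, ref))
    # amin guards against log10(0) for silent bins
    db = [10.0 * math.log10(max(amin, p)) - ref_db for p in power_values]
    floor = max(db) - top_db
    return [max(v, floor) for v in db]

values = power_to_db([1.0, 0.1, 1e-12])
print([round(v, 1) for v in values])  # [0.0, -10.0, -80.0]
```

The loudest bin always maps to 0 dB, so spectrograms from quiet and loud recordings end up on a comparable scale.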

3. **Output Structure**: The spectrograms are saved in directories labeled according to their respective classes (`allowed`, `disallowed`).
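
The per-class layout can be sketched with a small path helper. This is only an illustration: the function name `spectrogram_output_path`, the root directory, and the speaker name are hypothetical, while the file-name pattern follows `generate_spectrogram`.

```python
import os

def spectrogram_output_path(root_dir, class_name, file_name, segment_idx):
    """Build the per-class output path for one spectrogram segment."""
    if class_name not in ("allowed", "disallowed"):
        raise ValueError(f"Unknown class: {class_name}")
    output_dir = os.path.join(root_dir, class_name)
    return os.path.join(output_dir, f"{file_name}_segment_{segment_idx}_spectrogram.png")

# Hypothetical root directory and speaker name
path = spectrogram_output_path("data/spectrograms", "allowed", "speaker01", 0)
print(path)
```

Keeping the class name in the directory rather than in the file name lets the loading code derive labels from the folder structure alone.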

---

## Model Architecture

We use a **Convolutional Neural Network (CNN)** to classify the spectrogram images into two categories: allowed or disallowed. The architecture is a compact residual network consisting of:

1. **Initial Convolutional Layer**: A convolutional layer with $64$ filters of size $3 \times 3$, followed by batch normalization and the ReLU activation function.
2. **Three Residual Blocks**: Each block consists of two convolutional layers with a skip connection, and the number of filters increases from block to block (64, 128, and 256). These blocks extract progressively more abstract features from the input spectrogram.
3. **Max Pooling**: Halves the spatial dimensions while keeping the strongest activations. It is applied after the initial convolution and after each residual block.
4. **Global Average Pooling**: Reduces each feature map to a single scalar by averaging over all spatial locations.
5. **Dense Layer**: A fully connected layer with 256 units, followed by batch normalization.
6. **Dropout Layer**: Helps reduce overfitting by randomly dropping units during training (dropout rate 0.7).
7. **Output Layer**: A single neuron activated by the sigmoid function to predict whether a person is allowed access.
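
To see how the pooling stages shrink the input, the following sketch traces the feature-map shape through the network. It assumes the $128 \times 128$ grayscale input used in the training script, `'same'` padding in every convolution, and four $2 \times 2$ max-pooling stages; the helper name `trace_feature_map_shapes` is made up for this example.

```python
def trace_feature_map_shapes(height, width, stage_filters=(64, 64, 128, 256)):
    """Return (height, width, channels) after each conv stage and its 2x2 max pooling."""
    shapes = []
    for filters in stage_filters:
        # 'same' padding keeps height and width; 2x2 pooling halves them (floor division)
        height, width = height // 2, width // 2
        shapes.append((height, width, filters))
    return shapes

shapes = trace_feature_map_shapes(128, 128)
print(shapes)         # [(64, 64, 64), (32, 32, 64), (16, 16, 128), (8, 8, 256)]
print(shapes[-1][2])  # 256 values remain after global average pooling
```

Because global average pooling collapses the final $8 \times 8$ maps into 256 scalars, the dense layer's size does not depend on the input resolution.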

### Convolutional Layers:

```python
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, BatchNormalization, Dense, Input, Add, GlobalAveragePooling2D, Activation
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

def residual_block(x, filters, kernel_size=(3, 3)):
    """Residual block with two convolutional layers."""
    shortcut = x  # Save the input tensor

    # First convolutional layer
    x = Conv2D(filters, kernel_size, padding='same', kernel_regularizer=l2(0.001))(x)
    x = BatchNormalization()(x) # Apply batch normalization to improve training stability
    x = Activation('relu')(x) # Use ReLU activation for non-linearity

    # Second convolutional layer
    x = Conv2D(filters, kernel_size, padding='same', kernel_regularizer=l2(0.001))(x)
    x = BatchNormalization()(x)  # Apply batch normalization again

    # Adjust shortcut if needed
    if shortcut.shape[-1] != filters:
        shortcut = Conv2D(filters, (1, 1), padding='same')(shortcut)  # Match the dimensions

    # Add the shortcut to the output
    x = Add()([x, shortcut])
    x = Activation('relu')(x) # Apply ReLU activation to the combined output
    
    return x # Return the output of the residual block
```

### Model creation: 

```python

def create_cnn_model(input_shape):
    """Create a CNN model with residual blocks."""
    inputs = Input(shape=input_shape)

    # Initial convolutional block
    x = Conv2D(64, (3, 3), padding='same', kernel_regularizer=l2(0.001))(inputs)  # Initial convolution with 64 filters
    x = BatchNormalization()(x)  # Normalize the output
    x = Activation('relu')(x)  # Apply ReLU activation
    x = MaxPooling2D(pool_size=(2, 2))(x) # Downsample the feature maps

    # Residual block 1
    x = residual_block(x, 64) # Apply the first residual block with 64 filters
    x = MaxPooling2D(pool_size=(2, 2))(x) # Downsample again

    # Residual block 2
    x = residual_block(x, 128) # Apply the second residual block with 128 filters
    x = MaxPooling2D(pool_size=(2, 2))(x) # Downsample again

    # Residual block 3
    x = residual_block(x, 256)  # Apply the third residual block with 256 filters
    x = MaxPooling2D(pool_size=(2, 2))(x)  # Downsample again

    # Add more residual blocks if the dataset is large and training time allows it
    # x = residual_block(x, 512)
    # x = MaxPooling2D(pool_size=(2, 2))(x)

    # Global Average Pooling (averages each feature map to a single value,
    # which keeps the parameter count of the following dense layer small)
    x = GlobalAveragePooling2D()(x)

    # Fully connected layer
    x = Dense(256, activation='relu', kernel_regularizer=l2(0.001))(x)
    x = BatchNormalization()(x) # Normalize the output
    x = Dropout(0.7)(x) # Apply dropout for regularization to prevent overfitting

    # Output layer for binary classification
    outputs = Dense(1, activation='sigmoid')(x)

    # Define model
    model = Model(inputs=inputs, outputs=outputs)
    
    return model
```

---

## Model Training

The training process involves:
- **Data Augmentation**: We use techniques such as rotation, zoom and horizontal flip to artificially increase the size of our dataset and introduce variability.
- **Class balancing**: There are fewer allowed speakers than disallowed, which is why class weights $w_c$ are applied during training to adjust the loss for each class.
- **Optimization using the Adam algorithm**: The Adaptive Moment Estimation optimizer is used to minimize the loss function.
- **Train/Test Split**: The data is first split into training and test sets, and the training portion is then split again into 80% training and 20% validation using `train_test_split`.
- **Early stopping**: The model's performance is monitored on the validation set, and training is stopped if the validation loss does not improve for $p = 15$ consecutive epochs.
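
The class weights $w_c$ can be illustrated with the "balanced" heuristic documented for scikit-learn's `compute_class_weight`, namely $w_c = n_{\text{samples}} / (n_{\text{classes}} \cdot n_c)$. The sketch below reimplements that formula in plain Python; the helper name and the label counts are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Compute w_c = n_samples / (n_classes * n_c) for each class c."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n_c) for c, n_c in sorted(counts.items())}

# Hypothetical label distribution: 6 disallowed (0) and 2 allowed (1) samples
labels = [0] * 6 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)  # class 1 gets a weight of 2.0, class 0 a weight of about 0.67
```

The rarer class receives the larger weight, so misclassifying an allowed speaker contributes more to the loss than misclassifying a disallowed one.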

### Loading data for training: 
```python
import os #for file and directory operations
import sys #to handle system-specific parameters and functions
import numpy as np #for efficient numerical computations
import tensorflow as tf #a deep learning framework
from tensorflow.keras.preprocessing.image import ImageDataGenerator #for data augmentation
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, LearningRateScheduler, TensorBoard #for training management
from tensorflow.keras.optimizers import Adam #imports the Adam optimizer
from model import create_cnn_model #Imports the CNN model architecture from another script/module
from sklearn.utils.class_weight import compute_class_weight #to compute class weights for imbalanced datasets
from sklearn.model_selection import train_test_split #to split data into training and testing sets
from datetime import datetime  # For timestamping TensorBoard logs

def load_data(spectrogram_dir, target_size=(128, 128)):
    """Load spectrogram images from the directory and assign labels based on subdirectories."""
    X, y = [], [] # Initialize empty lists to store images and labels
    for class_name in ['allowed', 'disallowed']:  # Loop over classes
        class_dir = os.path.join(spectrogram_dir, class_name)  # Path to each class folder
        for file_name in os.listdir(class_dir): # Loop over files in the class folder
            if file_name.endswith('_spectrogram.png'): # Only process spectrogram images
                image_path = os.path.join(class_dir, file_name) # Full path to the image
                # Load the image as grayscale with target dimensions and normalize it
                image = tf.keras.preprocessing.image.load_img(image_path, color_mode='grayscale', target_size=target_size)
                image_array = tf.keras.preprocessing.image.img_to_array(image) / 255.0 # Convert to array and normalize
                X.append(image_array) # Append image data to X
                y.append(1 if class_name == 'allowed' else 0) # Assign label 1 for "allowed", 0 for "disallowed"
    return np.array(X), np.array(y) # Return images and labels as numpy arrays

```

### Training Code:

```python
def main():
    # Load the data from the train directory (no separate test directory)
    train_dir = './data/spectrograms/train'
    X, y = load_data(train_dir) # Load training data and labels

    # Convert labels to float32 type to avoid data type issues
    y = y.astype(np.float32)

    # Split data into 70% training and 30% testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Further split training data into training and validation (80% train, 20% validation from the 70%)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

    # Compute class weights to handle class imbalance
    class_weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_train)
    class_weight_dict = dict(enumerate(class_weights))

    # Create the CNN model
    model = create_cnn_model(input_shape=(128, 128, 1))
    model.compile(optimizer=Adam(0.0001), loss='binary_crossentropy', metrics=['accuracy'])

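    # Define ImageDataGenerator for augmentation
    # NOTE: datagen is configured here but is not connected to the tf.data pipeline
    # below, so these augmentations are not applied by model.fit as written.
    datagen = ImageDataGenerator(
        rotation_range=20, zoom_range=0.2, horizontal_flip=True, fill_mode='nearest')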

    # Ensure the data and labels are in correct format and shape
    print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
    print(f"X_val shape: {X_val.shape}, y_val shape: {y_val.shape}")

    # Use tf.data.Dataset to control the input pipeline structure explicitly
    def data_generator(X, y, batch_size):
        """Build a shuffled, batched tf.data.Dataset that yields float32 tensors."""
        dataset = tf.data.Dataset.from_tensor_slices((X, y))
        dataset = dataset.shuffle(len(y)).batch(batch_size)
        dataset = dataset.map(lambda X, y: (tf.cast(X, tf.float32), tf.cast(y, tf.float32)))
        return dataset

    # Create train and validation datasets
    train_dataset = data_generator(X_train, y_train, batch_size=32)
    val_dataset = data_generator(X_val, y_val, batch_size=32)

    # Callbacks
    early_stopping = EarlyStopping(monitor='val_loss', patience=15, restore_best_weights=True)  # Stops training if no improvement
    checkpoint = ModelCheckpoint('best_model.keras', monitor='val_loss', save_best_only=True) # Saves the best model
    lr_scheduler = LearningRateScheduler(lambda epoch, lr: lr * 0.9 if epoch > 10 else lr)# Decreases learning rate after epoch 10

    # TensorBoard logging callback
    log_dir = "./logs/fit/" + datetime.now().strftime("%Y%m%d-%H%M%S")
    tensorboard_callback = TensorBoard(log_dir=log_dir, histogram_freq=1) # Logs training metrics to TensorBoard

    # Train the model using the training and validation datasets
    model.fit(
        train_dataset,  # Training dataset
        validation_data=val_dataset,  # Validation dataset
        epochs=50, #number of training epochs
        class_weight=class_weight_dict,  # Class weights to handle imbalance
        callbacks=[early_stopping, checkpoint, lr_scheduler, tensorboard_callback]  # Added TensorBoard callback
    )

    model.save('final_model_2.keras') #save the final model

    # Evaluate the model on the test set
    test_dataset = data_generator(X_test, y_test, batch_size=32)
    test_loss, test_acc = model.evaluate(test_dataset) #Calculates test loss and accuracy
    print(f"Test accuracy: {test_acc}, Test loss: {test_loss}") # Prints evaluation results

# Main entry point to run the main function if the script is executed directly
if __name__ == "__main__": 
    main()

```
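
The effect of the learning-rate schedule can be sketched without TensorFlow. The simulation below applies the same rule as the `LearningRateScheduler` lambda (multiply by 0.9 for every epoch index greater than 10), assuming the initial rate of `0.0001`, 0-based epoch indices, and that training runs for all 50 epochs without early stopping. The function name is made up for this example.

```python
def simulate_learning_rates(initial_lr=1e-4, epochs=50):
    """Apply lr * 0.9 for every epoch index greater than 10, as in the scheduler lambda."""
    lr = initial_lr
    history = []
    for epoch in range(epochs):  # 0-based epoch indices
        lr = lr * 0.9 if epoch > 10 else lr
        history.append(lr)
    return history

rates = simulate_learning_rates()
print(f"epoch 10: {rates[10]:.2e}")  # still the initial rate
print(f"epoch 11: {rates[11]:.2e}")  # first decay step
print(f"epoch 49: {rates[49]:.2e}")  # after 39 compounded decay steps
```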
---

## Model Evaluation

### Key Metrics

1. **False Acceptance Ratio (FAR)**: Measures how often a disallowed person is incorrectly accepted.
2. **False Rejection Ratio (FRR)**: Measures how often an allowed person is incorrectly rejected.
3. **General Efficiency Coefficient (GEC)**: A combined metric that weights the model's accuracy, FAR, and FRR. The GEC is defined as:

    $\text{GEC} = w_1 \cdot \text{Accuracy} + w_2 \cdot (1 - \text{FAR}) + w_3 \cdot (1 - \text{FRR})$

    where:
    - $\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$
    - $w_1 = 0.4$, $w_2 = 0.3$, and $w_3 = 0.3$ are the weights assigned to accuracy, $(1 - \text{FAR})$, and $(1 - \text{FRR})$ respectively, reflecting their relative importance in the overall evaluation.
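
As a worked example, the snippet below evaluates the three metrics for one set of confusion-matrix counts, using $\text{FAR} = \frac{\text{FP}}{\text{FP} + \text{TN}}$ and $\text{FRR} = \frac{\text{FN}}{\text{FN} + \text{TP}}$. The counts are hypothetical and are not results from this project.

```python
# Hypothetical confusion-matrix counts (not results from this project)
tp, tn, fp, fn = 45, 140, 10, 5

accuracy = (tp + tn) / (tp + tn + fp + fn)
far = fp / (fp + tn)  # share of disallowed speakers that were accepted
frr = fn / (fn + tp)  # share of allowed speakers that were rejected

w1, w2, w3 = 0.4, 0.3, 0.3
gec = w1 * accuracy + w2 * (1 - far) + w3 * (1 - frr)

print(f"accuracy={accuracy:.3f}, FAR={far:.3f}, FRR={frr:.3f}, GEC={gec:.3f}")
# accuracy=0.925, FAR=0.067, FRR=0.100, GEC=0.920
```

Since the weights sum to 1, the GEC stays in $[0, 1]$ and equals 1 only for a classifier with no false acceptances and no false rejections.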

### FAR/FRR Calculation Code:

```python
from sklearn.metrics import confusion_matrix

def calculate_far_frr_binary(y_true, y_pred):
    """
    Calculate False Acceptance Ratio (FAR) and False Rejection Ratio (FRR) for binary classification.
    
    Parameters:
    - y_true: Ground truth labels (1 for allowed, 0 for disallowed)
    - y_pred: Model predictions (1 for allowed, 0 for disallowed)
    
    Returns:
    - FAR: False Acceptance Ratio (disallowed incorrectly classified as allowed)
    - FRR: False Rejection Ratio (allowed incorrectly classified as disallowed)
    - tn: True negatives count
    - fp: False positives count
    - fn: False negatives count
    - tp: True positives count
    """
    y_true = y_true.astype(int)
    y_pred = y_pred.astype(int)

    # Fix the label order so a 2x2 matrix is returned even if one class is missing
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    far = fp / (fp + tn) if (fp + tn) != 0 else 0  # False Acceptance Ratio
    frr = fn / (fn + tp) if (fn + tp) != 0 else 0  # False Rejection Ratio

    return far, frr, tn, fp, fn, tp
```
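
Because `calculate_far_frr_binary` casts `y_pred` with `astype(int)`, it expects hard 0/1 labels rather than raw sigmoid outputs. A minimal sketch of the thresholding step is shown below; the helper name, the threshold of 0.5, and the probabilities are assumptions for illustration.

```python
def threshold_predictions(probabilities, threshold=0.5):
    """Map sigmoid outputs in [0, 1] to hard labels: 1 (allowed) or 0 (disallowed)."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical sigmoid outputs for five recordings
probs = [0.91, 0.08, 0.55, 0.49, 0.73]
hard_labels = threshold_predictions(probs)
print(hard_labels)  # [1, 0, 1, 0, 1]

# Casting raw probabilities to int truncates every value below 1.0 to 0
print([int(p) for p in probs])  # [0, 0, 0, 0, 0]
```

Raising the threshold accepts fewer speakers, which can only lower or keep the FAR while it tends to raise the FRR, so the threshold is a natural knob for a door-access setting.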

### GEC Calculation Code:
```python
def calculate_general_efficiency_coefficient(accuracy, far, frr, w1=0.4, w2=0.3, w3=0.3):
    """
    Calculate the General Efficiency Coefficient (GEC) as a weighted average of accuracy, (1 - FAR), and (1 - FRR).

    Parameters:
    - accuracy: Overall accuracy of the model
    - far: False Acceptance Ratio
    - frr: False Rejection Ratio
    - w1, w2, w3: Weights for accuracy, (1 - FAR), and (1 - FRR), respectively

    Returns:
    - gec: Calculated General Efficiency Coefficient
    """
    gec = w1 * accuracy + w2 * (1 - far) + w3 * (1 - frr)
    return gec
```
    

---

## Future Improvements

Some ideas for further improvements:

1. **Enhance Data Augmentation**: Apply more sophisticated augmentations to increase model robustness.
2. **Advanced CNN Architectures**: Try deeper or more advanced architectures, such as a full ResNet or VGG, beyond the three residual blocks used here.
3. **Noise Handling**: Explore techniques for handling background noise more effectively, such as noise reduction or filtering.
4. **Hyperparameter Tuning**: Experiment with different optimizers, learning rates, and batch sizes to improve model accuracy.

---

## Conclusion

This project demonstrates the application of **Convolutional Neural Networks** for voice recognition in a door-access system. By converting voice recordings into spectrograms, we leverage image-based recognition techniques to classify users based on their voice. The system shows promise and can be further improved with more sophisticated models and techniques.

---
