
---

# Voice-Based Door Access Recognition System

## Project Overview

This project is part of the course **Introduction to Machine Learning (2024/2025)** by **Dr. Agnieszka Jastrzębska**. The aim of this project is to develop a **voice recognition system** for an automated intercom, capable of distinguishing between allowed and disallowed persons using voice recordings. The core idea is to convert voice recordings into spectrograms and apply **Convolutional Neural Networks (CNNs)** for classification.

This is a binary classification problem, where:
- **Class 1 (allowed)**: People allowed to open the door.
- **Class 0 (disallowed)**: People not allowed to open the door.

The data used for the project is derived from the **DAPS (Device and Produced Speech) Dataset**.




## Data Preprocessing
The input to our machine learning model is not the raw audio file but its **Mel-spectrogram**, an image-like representation of how the signal's energy is distributed over time and mel-scaled frequency. The steps involved in preprocessing are:

1. **Load Audio**: Each `.mp3` or `.wav` file is loaded using the `librosa` library. Random gain (a volume change) can optionally be applied to the audio. Silent sections are removed, and a recording longer than `max_duration` is split into chunks.



2. **Generate Spectrograms**: Mel-spectrograms are generated using `librosa.feature.melspectrogram`. These spectrograms are saved as images (`.png`) for both classes (allowed and disallowed).

#### Code example:

```python
import os

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np


def generate_spectrogram(audio_data, sr, output_dir, file_name, segment_idx):
    """Save a mel-spectrogram PNG for one audio segment; return its path, or None on failure."""
    try:
        # Create mel-spectrogram and convert to decibel scale
        S = librosa.feature.melspectrogram(y=audio_data, sr=sr, n_mels=128)
        log_S = librosa.power_to_db(S, ref=np.max)
    except Exception as e:
        print(f"Error generating spectrogram for {file_name} segment {segment_idx}: {e}")
        return None

    # Create the output directory if needed (no error if it already exists)
    os.makedirs(output_dir, exist_ok=True)
    output_file = os.path.join(output_dir, f"{file_name}_segment_{segment_idx}_spectrogram.png")
    try:
        # Plot and save the spectrogram
        plt.figure(figsize=(10, 4))
        librosa.display.specshow(log_S, sr=sr, x_axis='time', y_axis='mel', fmax=8000)
        plt.colorbar(format='%+2.0f dB')
        plt.title('Mel-frequency spectrogram')
        plt.tight_layout()
        plt.savefig(output_file)
    except Exception as e:
        print(f"Error saving spectrogram for {file_name} segment {segment_idx}: {e}")
        return None
    finally:
        # Always release the figure, even if plotting or saving failed
        plt.close()
    return output_file
```

3. **Output Structure**: The spectrograms are saved in directories labeled according to their respective classes (`allowed`, `disallowed`).
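
The random gain and chunking mentioned in step 1 can be sketched without any audio library. This is a minimal illustration rather than the project's own code: the function names, the ±6 dB gain range, and the example sample rate are assumptions, while `max_duration` is the parameter named above. Silence removal is left out of the sketch; `librosa.effects.split` is one way to locate non-silent intervals.

```python
import random


def random_gain_factor(rng, min_db=-6.0, max_db=6.0):
    """Draw a gain in decibels and return the equivalent linear amplitude factor.

    The +/-6 dB default range is an assumption made for this sketch.
    """
    gain_db = rng.uniform(min_db, max_db)
    return 10.0 ** (gain_db / 20.0)


def chunk_bounds(n_samples, sr, max_duration):
    """Return (start, end) sample indices that cover the signal in chunks
    of at most max_duration seconds; the last chunk may be shorter."""
    chunk_len = int(sr * max_duration)
    if chunk_len <= 0:
        raise ValueError("sr * max_duration must be at least one sample")
    return [(start, min(start + chunk_len, n_samples))
            for start in range(0, n_samples, chunk_len)]


rng = random.Random(0)                 # seeded so the example is reproducible
factor = random_gain_factor(rng)       # multiply the waveform by this value
bounds = chunk_bounds(n_samples=50_000, sr=16_000, max_duration=1.0)
print(bounds)
```

With a NumPy waveform `y`, the gain would be applied as `y * factor`, and each chunk taken as `y[start:end]` before being passed to `generate_spectrogram`.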

---

## Model Architecture

We use a **Convolutional Neural Network (CNN)** to classify the spectrogram images into two categories: allowed or disallowed. The architecture is deliberately simple and consists of:

1. **Initial Convolutional Layer**: A convolutional layer with $64$ filters of size $3 \times 3$ is applied, followed by batch normalization and the ReLU activation function.
2. **Three Sets of Convolutional Layers**: Each set consists of two convolutional operations, with the number of filters increasing from set to set (64, 128, and 256). These layers extract progressively more abstract features from the input spectrogram.
3. **Max Pooling**: Applied between consecutive sets of convolutions, it reduces the spatial dimensions while keeping the strongest activations.
4. **Global Average Pooling**: Reduces each feature map to a single scalar by averaging over all spatial locations.
5. **Dense Layer**: Fully connected layer for classification.
6. **Dropout Layer**: Helps reduce overfitting by randomly dropping units during training.
7. **Output Layer**: A single neuron activated by the sigmoid function to predict whether a person is allowed access.
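
As a framework-independent sanity check, the sketch below traces how the tensor shape might change through the layers listed above. Several details are assumptions that the description does not state: `same` padding for the $3 \times 3$ convolutions, $2 \times 2$ max pooling with stride 2, a $128 \times 128 \times 3$ input image, and a hypothetical dense width of 128 units.

```python
def trace_shapes(height, width, channels, dense_units=128):
    """Follow an (H, W, C) input through the layer list and record each shape.

    Assumes 'same'-padded 3x3 convolutions and 2x2 max pooling with stride 2;
    dense_units is a hypothetical value chosen for illustration.
    """
    shapes = [("input", (height, width, channels))]

    def conv(name, filters):
        # A 'same'-padded convolution keeps H and W and sets C to the filter count
        nonlocal channels
        channels = filters
        shapes.append((name, (height, width, channels)))

    def pool(name):
        # 2x2 max pooling with stride 2 halves H and W (floor division)
        nonlocal height, width
        height, width = height // 2, width // 2
        shapes.append((name, (height, width, channels)))

    conv("initial_conv", 64)
    for i, filters in enumerate((64, 128, 256), start=1):
        conv(f"set{i}_conv_a", filters)
        conv(f"set{i}_conv_b", filters)
        if i < 3:                       # pooling sits between sets only
            pool(f"pool{i}")
    shapes.append(("global_avg_pool", (channels,)))   # one scalar per feature map
    shapes.append(("dense", (dense_units,)))
    shapes.append(("dropout", (dense_units,)))        # dropout keeps the shape
    shapes.append(("output_sigmoid", (1,)))
    return shapes


shapes = trace_shapes(128, 128, 3)
for name, shape in shapes:
    print(f"{name:16s} {shape}")
```

Under these assumptions, global average pooling hands a 256-element vector to the dense layer regardless of the input image size, which is why the classifier head stays small.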

---

## Model Training

The training process involves:
- **Data Augmentation**: We use techniques such as rotation, zoom and horizontal flip to artificially increase the size of our dataset and introduce variability.
- **Class balancing**: There are fewer allowed speakers than disallowed, which is why class weights $w_c$ are applied during training to adjust the loss for each class.
- **Optimization using the Adam algorithm**: The Adaptive Moment Estimation optimizer is used to minimize the loss function.
- **Train/Validation Split**: The dataset is split into 80% training and 20% validation using `train_test_split`.
- **Early stopping**: The model's performance is monitored on the validation set, and training is stopped if the validation loss does not improve for $p = 15$ consecutive epochs.
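
Two of the steps above, class weighting and early stopping, can be illustrated in plain Python. This is a sketch under stated assumptions, not the training code itself: the weights use the common "balanced" heuristic $w_c = N / (K \cdot n_c)$, which may differ from the formula actually used in the project, and both function names are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: w_c = N / (K * n_c).

    N is the number of samples, K the number of classes and n_c the number
    of samples in class c (an assumed heuristic, not confirmed by the source).
    """
    counts = Counter(labels)
    n_total, n_classes = len(labels), len(counts)
    return {c: n_total / (n_classes * n) for c, n in counts.items()}


def early_stopping_epoch(val_losses, patience=15):
    """Return the 1-based epoch at which training would stop, or None
    if the validation loss never stalls for `patience` consecutive epochs."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up label counts and losses, for illustration only
weights = balanced_class_weights([0] * 75 + [1] * 25)
stop = early_stopping_epoch([1.0, 0.9, 0.95, 0.95, 0.95, 0.8], patience=3)
print(weights, stop)
```

In this toy example the minority class (allowed) receives the larger weight, so each of its samples contributes more to the loss.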

---

## Model Evaluation

### Key Metrics

1. **False Acceptance Ratio (FAR)**: Measures how often a disallowed person is incorrectly accepted.
2. **False Rejection Ratio (FRR)**: Measures how often an allowed person is incorrectly rejected.
3. **General Efficiency Coefficient (GEC)**: A combined metric that weights the model's accuracy, FAR, and FRR. The GEC is defined as:

    $\text{GEC} = w_1 \cdot \text{Accuracy} + w_2 \cdot (1 - \text{FAR}) + w_3 \cdot (1 - \text{FRR})$

    where:
    - $\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$
    - $w_1 = 0.4$, $w_2 = 0.3$, and $w_3 = 0.3$ are the weights assigned to accuracy, $(1 - \text{FAR})$, and $(1 - \text{FRR})$ respectively, reflecting their relative importance in the overall evaluation.
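
For reference, FAR and FRR follow the usual confusion-matrix definitions with "allowed" as the positive class, which is how the calculation code in this section computes them:

$\text{FAR} = \frac{\text{FP}}{\text{FP} + \text{TN}}, \qquad \text{FRR} = \frac{\text{FN}}{\text{FN} + \text{TP}}$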

### FAR/FRR Calculation Code:

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def calculate_far_frr_binary(y_true, y_pred):
    """
    Calculate False Acceptance Ratio (FAR) and False Rejection Ratio (FRR) for binary classification.
    
    Parameters:
    - y_true: Ground truth labels (1 for allowed, 0 for disallowed)
    - y_pred: Model predictions, already thresholded (1 for allowed, 0 for disallowed)
    
    Returns:
    - FAR: False Acceptance Ratio (disallowed incorrectly classified as allowed)
    - FRR: False Rejection Ratio (allowed incorrectly classified as disallowed)
    - tn: True negatives count
    - fp: False positives count
    - fn: False negatives count
    - tp: True positives count
    """
    # Accept lists as well as arrays
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)

    # Fix the label order so the matrix is always 2x2, even if one class is absent
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    far = fp / (fp + tn) if (fp + tn) != 0 else 0  # False Acceptance Ratio
    frr = fn / (fn + tp) if (fn + tp) != 0 else 0  # False Rejection Ratio

    return far, frr, tn, fp, fn, tp
```

### GEC Calculation Code:
```python
def calculate_general_efficiency_coefficient(accuracy, far, frr, w1=0.4, w2=0.3, w3=0.3):
    """
    Calculate the General Efficiency Coefficient (GEC) as a weighted average of accuracy, (1 - FAR), and (1 - FRR).

    Parameters:
    - accuracy: Overall accuracy of the model
    - far: False Acceptance Ratio
    - frr: False Rejection Ratio
    - w1, w2, w3: Weights for accuracy, (1 - FAR), and (1 - FRR), respectively

    Returns:
    - gec: Calculated General Efficiency Coefficient
    """
    gec = w1 * accuracy + w2 * (1 - far) + w3 * (1 - frr)
    return gec
```
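
To see how the metrics fit together, here is a small worked example that computes accuracy, FAR, FRR, and GEC directly from confusion-matrix counts. The helper name `metrics_from_counts` is hypothetical and the counts are invented for illustration; they are not results from this project.

```python
def metrics_from_counts(tp, tn, fp, fn, w1=0.4, w2=0.3, w3=0.3):
    """Compute accuracy, FAR, FRR and GEC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0   # disallowed accepted
    frr = fn / (fn + tp) if (fn + tp) else 0.0   # allowed rejected
    gec = w1 * accuracy + w2 * (1 - far) + w3 * (1 - frr)
    return {"accuracy": accuracy, "far": far, "frr": frr, "gec": gec}


# Invented counts: 45 allowed samples (40 accepted), 55 disallowed (50 rejected)
result = metrics_from_counts(tp=40, tn=50, fp=5, fn=5)
print({name: round(value, 4) for name, value in result.items()})
```

Because $(1 - \text{FAR})$ and $(1 - \text{FRR})$ each carry a weight of 0.3, a model that reaches high accuracy by rejecting almost everyone is still penalized through its FRR.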
    

---

## Future Improvements

Some ideas for further improvements:

1. **Enhance Data Augmentation**: Apply more sophisticated augmentations to increase model robustness.
2. **Advanced CNN Architectures**: Implement more advanced architectures like ResNet or VGG to improve performance.
3. **Noise Handling**: Explore techniques for handling background noise more effectively, such as noise reduction or filtering.
4. **Hyperparameter Tuning**: Experiment with different optimizers, learning rates, and batch sizes to improve model accuracy.

---

## Conclusion

This project demonstrates the application of **Convolutional Neural Networks** for voice recognition in a door-access system. By converting voice recordings into Mel-spectrograms, we turn the task into image classification and use image-based recognition techniques to decide whether a speaker is allowed access. The system can be further improved with more sophisticated models and techniques, such as those listed under Future Improvements.

---
