# Voice-Based Door Access Recognition System

## Project Overview

This project is part of the course **Introduction to Machine Learning (2024/2025)** by **Dr. Agnieszka Jastrzębska**. The aim of this project is to develop a **voice recognition system** for an automated intercom, capable of distinguishing between allowed and disallowed persons using voice recordings. The core idea is to convert voice recordings into spectrograms and apply **Convolutional Neural Networks (CNNs)** for classification.

This is a binary classification problem, where:
- **Class 1 (allowed)**: People allowed to open the door.
- **Class 0 (disallowed)**: People not allowed to open the door.

The data used for the project is derived from the **DAPS (Device and Produced Speech) Dataset**.

## Data Preprocessing

The input to our machine learning model is not the raw audio file but its **Mel-spectrogram**, an image-like representation of how the signal's frequency content, mapped onto the mel scale, changes over time. The steps involved in preprocessing are:

1. **Load Audio**: Each `.mp3` or `.wav` file is loaded using the `librosa` library.
2. **Generate Spectrograms**: Mel-spectrograms are generated using `librosa.feature.melspectrogram`. These spectrograms are saved as images (`.png`) for both classes (allowed and disallowed).
3. **Output Structure**: The spectrograms are saved in directories labeled according to their respective classes (`allowed`, `disallowed`).
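
The steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual script: the helper names `spectrogram_path` and `save_mel_spectrogram`, the `spectrograms` output root, the `magma` colormap, and `n_mels=128` are all assumptions chosen for the example.

```python
from pathlib import Path


def spectrogram_path(audio_path, label, out_root="spectrograms"):
    """Map an audio file to its PNG path inside the class directory."""
    if label not in ("allowed", "disallowed"):
        raise ValueError(f"unknown class label: {label!r}")
    return Path(out_root) / label / (Path(audio_path).stem + ".png")


def save_mel_spectrogram(audio_path, label, out_root="spectrograms", n_mels=128):
    """Load one recording and save its Mel-spectrogram as a PNG image."""
    # Imported here so the path helper above works without the audio stack.
    import librosa
    import matplotlib.pyplot as plt
    import numpy as np

    # sr=None keeps the file's native sample rate (an assumption; a fixed
    # rate such as 22050 Hz would work just as well).
    y, sr = librosa.load(audio_path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # power -> decibel scale

    png_path = spectrogram_path(audio_path, label, out_root)
    png_path.parent.mkdir(parents=True, exist_ok=True)
    # origin="lower" puts low frequencies at the bottom of the image.
    plt.imsave(png_path, mel_db, cmap="magma", origin="lower")
    return png_path
```

Calling `save_mel_spectrogram("recording.wav", "allowed")` on a hypothetical file would write `spectrograms/allowed/recording.png`.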

---

## Model Architecture

We use a **Convolutional Neural Network (CNN)** to classify the spectrogram images into two categories: allowed or disallowed. The architecture is deliberately simple, consisting of:

1. **Three Convolutional Layers**: Extract spatial features from the spectrograms.
2. **Max Pooling Layers**: Reduce the spatial dimensions of the feature maps, keeping only the strongest activation in each region.
3. **Dense Layer**: Fully connected layer for classification.
4. **Dropout Layer**: Helps reduce overfitting by randomly dropping units during training.
5. **Output Layer**: A single neuron activated by the sigmoid function to predict whether a person is allowed or not.
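
As a rough sketch of the layer stack listed above, the model could be written with Keras as shown below. The framework choice, the filter counts (32/64/128), the 128x128 RGB input size, the dropout rate of 0.5, and the Adam optimizer are assumptions made for illustration; only the overall layer order comes from the description. The hypothetical helper `feature_map_size` shows how each convolution and pooling block shrinks the image.

```python
def feature_map_size(size, n_blocks=3, kernel=3, pool=2):
    """Side length left after n_blocks of (valid conv -> max pooling)."""
    for _ in range(n_blocks):
        size = (size - kernel + 1) // pool  # conv trims the border, pooling downsamples
    return size


def build_cnn(input_shape=(128, 128, 3)):
    """Build a small binary CNN classifier for spectrogram images."""
    # Imported lazily so the size helper works without TensorFlow installed.
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                    # randomly drops units during training
        layers.Dense(1, activation="sigmoid"),  # probability of class 1 (allowed)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

With these assumed settings, a 128-pixel side shrinks to `feature_map_size(128)` pixels before the dense layer, which is why three pooling stages keep the fully connected part small.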

---

## Model Training

The training process involves:

- **Data Augmentation**: We use techniques such as rotation and zoom to artificially increase the size of our dataset and introduce variability.
- **Train/Validation Split**: The dataset is split into 80% training and 20% validation data using `train_test_split`.
- **Epochs**: The model is trained over 20 epochs using augmented data.
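
The training procedure could look roughly like the sketch below. It assumes the spectrogram images are already loaded into a NumPy array `X` with labels `y`, that the model is a compiled Keras model, and that augmentation is done with Keras' `ImageDataGenerator`; the rotation and zoom ranges, batch size, stratified split, and random seed are illustrative values, not the project's settings.

```python
VAL_FRACTION = 0.2  # 80% training / 20% validation
EPOCHS = 20


def train(model, X, y, batch_size=32):
    """Split the data, augment the training part, and fit the model."""
    # Imported lazily so the constants above can be reused without the ML stack.
    from sklearn.model_selection import train_test_split
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Stratifying keeps the allowed/disallowed ratio equal in both parts.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=VAL_FRACTION, stratify=y, random_state=42
    )

    # Small random rotations and zooms, applied to training images only.
    augmenter = ImageDataGenerator(rotation_range=10, zoom_range=0.1)

    history = model.fit(
        augmenter.flow(X_train, y_train, batch_size=batch_size),
        validation_data=(X_val, y_val),  # validation data is not augmented
        epochs=EPOCHS,
    )
    return history
```

Augmenting only the training portion keeps the validation score a fair estimate of performance on unmodified spectrograms.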

---

## Model Evaluation

### Key Metrics

1. **False Acceptance Ratio (FAR)**: The fraction of samples from disallowed persons that are incorrectly accepted, i.e. FP / (FP + TN).
2. **False Rejection Ratio (FRR)**: The fraction of samples from allowed persons that are incorrectly rejected, i.e. FN / (FN + TP).

The evaluation process includes:

- **Predicting on Test Set**: The model's performance is evaluated on unseen test data.
- **Calculating FAR/FRR**: Both metrics are computed from the confusion matrix of the test-set predictions.
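
A minimal sketch of the FAR/FRR calculation is given below. The function name `far_frr` is hypothetical, and the confusion-matrix counts are tallied by hand here for clarity; `sklearn.metrics.confusion_matrix` would yield the same four counts. Labels follow the convention above (1 = allowed, 0 = disallowed).

```python
def far_frr(y_true, y_pred):
    """Return (FAR, FRR) for binary labels where 1 = allowed, 0 = disallowed."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # allowed person accepted
        elif truth == 1 and pred == 0:
            fn += 1  # allowed person rejected
        elif truth == 0 and pred == 1:
            fp += 1  # disallowed person accepted
        else:
            tn += 1  # disallowed person rejected

    # Guard against division by zero when a class is absent from the data.
    far = fp / (fp + tn) if (fp + tn) else 0.0
    frr = fn / (fn + tp) if (fn + tp) else 0.0
    return far, frr


# Toy example: 4 allowed and 4 disallowed samples, one error of each kind.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(far_frr(y_true, y_pred))  # (0.25, 0.25)
```

Because the output layer is a sigmoid, predicted probabilities must first be turned into 0/1 labels with a decision threshold (0.5 is a common default, assumed here); raising the threshold typically lowers FAR at the cost of a higher FRR.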

## Future Improvements

Some ideas for further improvements:

1. **Enhance Data Augmentation**: Apply more sophisticated augmentations to increase model robustness.
2. **Advanced CNN Architectures**: Implement more advanced architectures like ResNet or VGG to improve performance.
3. **Noise Handling**: Explore techniques for handling background noise more effectively, such as noise reduction or filtering.
4. **Hyperparameter Tuning**: Experiment with different optimizers, learning rates, and batch sizes to improve model accuracy.

