
# Voice-Based Door Access Recognition System

## Project Overview

This project is part of the course **Introduction to Machine Learning (2024/2025)** by **Dr. Agnieszka Jastrzębska**. The aim of this project is to develop a **voice recognition system** for an automated intercom, capable of distinguishing between allowed and disallowed persons using voice recordings. The core idea is to convert voice recordings into spectrograms and apply **Convolutional Neural Networks (CNNs)** for classification.

This is a binary classification problem, where:
- **Class 1 (allowed)**: People allowed to open the door.
- **Class 0 (disallowed)**: People not allowed to open the door.

The data used for the project is derived from the **DAPS (Device and Produced Speech) Dataset**.

---

## Table of Contents
1. [Project Structure](#project-structure)
2. [Data Preprocessing](#data-preprocessing)
3. [Model Architecture](#model-architecture)
4. [Model Training](#model-training)
5. [Model Evaluation](#model-evaluation)
6. [Future Improvements](#future-improvements)
7. [Conclusion](#conclusion)

---

## Project Structure

The project is organized as follows:

```text
.
├── data                     # Contains raw audio and generated spectrograms
│   ├── allowed              # Audio files for Class 1 (allowed)
│   ├── disallowed           # Audio files for Class 0 (disallowed)
│   ├── spectrograms         # Generated spectrograms for training the model
│   └── test_voices          # Test audio files
├── models                   # Stores trained CNN models
├── notebooks                # Jupyter notebooks for analysis and interaction
├── src                      # Python scripts for data processing, training, and evaluation
│   ├── data_preprocessing.py # Script for loading and generating spectrograms
│   ├── model.py             # CNN model definition
│   ├── train.py             # Script to train the CNN model
│   └── evaluate.py          # Script to evaluate the model
├── tests                    # Unit tests for different components
└── requirements.txt         # Python dependencies
```

---

## Data Preprocessing

The model does not take raw audio as input. Instead, each recording is converted to a **Mel-spectrogram**: a time-frequency representation whose frequency axis follows the mel scale, rendered as an image. Preprocessing involves the following steps:

1. **Load Audio**: Each `.mp3` or `.wav` file is loaded using the `librosa` library.
2. **Generate Spectrograms**: Mel-spectrograms are computed with `librosa.feature.melspectrogram` and converted to decibels with `librosa.power_to_db`. They are then saved as `.png` images for both classes (allowed and disallowed).
3. **Output Structure**: The spectrograms are saved in directories labeled according to their respective classes (`allowed`, `disallowed`).

### Code Example:

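```python
# Load an audio file and save its log-Mel spectrogram as a PNG image
import os

import librosa
import matplotlib.pyplot as plt  # used only to write the PNG
import numpy as np

def load_audio(audio_path, sr=22050):
    # librosa resamples to `sr` and returns a mono float waveform
    y, sr = librosa.load(audio_path, sr=sr)
    return y, sr

def generate_spectrogram(audio_path, output_dir, sr=22050, n_mels=128):
    y, sr = load_audio(audio_path, sr)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    # Convert power to decibels relative to the loudest bin
    log_S = librosa.power_to_db(S, ref=np.max)

    # Swap the extension with splitext so .wav and .mp3 inputs are both handled
    stem = os.path.splitext(os.path.basename(audio_path))[0]
    os.makedirs(output_dir, exist_ok=True)
    output_file = os.path.join(output_dir, stem + '.png')
    # Write the image; origin='lower' puts low frequencies at the bottom
    plt.imsave(output_file, log_S, cmap='magma', origin='lower')
    return log_S, output_file
```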
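
To run the preprocessing over the whole dataset, the audio files first have to be paired with their class labels. The sketch below shows one way to do that, assuming the `data/allowed` and `data/disallowed` layout from the project structure. The helper name `collect_labeled_audio` and the `CLASS_LABELS` mapping are illustrative and not part of the original scripts.

```python
from pathlib import Path

# Assumed mapping from folder name to class label (1 = allowed, 0 = disallowed)
CLASS_LABELS = {"allowed": 1, "disallowed": 0}
AUDIO_EXTENSIONS = {".mp3", ".wav"}

def collect_labeled_audio(data_dir):
    """Return (audio_path, label) pairs found under data_dir, allowed files first."""
    pairs = []
    for class_name, label in CLASS_LABELS.items():
        class_dir = Path(data_dir) / class_name
        if not class_dir.is_dir():
            continue  # skip classes that have no folder yet
        for path in sorted(class_dir.iterdir()):
            # Keep only audio files; the extension check is case-insensitive
            if path.suffix.lower() in AUDIO_EXTENSIONS:
                pairs.append((str(path), label))
    return pairs
```

Each returned path can then be passed to `generate_spectrogram`, with the output directory chosen from the label (for example a per-class folder under `data/spectrograms`).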

---

## Model Architecture

We use a **Convolutional Neural Network (CNN)** to classify the spectrogram images into two categories: allowed or disallowed. The architecture is deliberately small and consists of:

1. **Three Convolutional Layers**: These layers extract spatial features from the spectrograms.
2. **Max Pooling Layers**: Reduce the spatial dimensions after each convolution by keeping only the strongest activation in every 2×2 window.
3. **Dense Layer**: Fully connected layer for classification.
4. **Dropout Layer**: Helps reduce overfitting by randomly dropping units during training.
5. **Output Layer**: A single neuron activated by the sigmoid function to predict whether a person is allowed or not.

### Model Summary:

```python
from tensorflow.keras import layers, models

def create_cnn_model(input_shape):
    model = models.Sequential()
    # Declare the input explicitly, e.g. (height, width, channels)
    model.add(layers.Input(shape=input_shape))
    # Three convolution + max-pooling stages with a growing number of filters
    model.add(layers.Conv2D(32, (3, 3), activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(128, (3, 3), activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    # Classification head: dense layer, dropout, and a single sigmoid output
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation='relu'))
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(1, activation='sigmoid'))

    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
```
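
When choosing an `input_shape`, it helps to know how large the feature map is by the time it reaches `Flatten`. The sketch below traces the layer stack by hand: each 3×3 convolution with Keras's default `'valid'` padding shrinks height and width by 2, and each 2×2 max pooling halves them, rounding down. The 128×128 single-channel input is an assumed example, and `conv_pool_output` and `flattened_size` are hypothetical helper names.

```python
def conv_pool_output(size, kernel=3, pool=2):
    """Spatial size after one 'valid' convolution followed by max pooling."""
    after_conv = size - kernel + 1   # 'valid' padding, stride 1
    return after_conv // pool        # pooling floors odd sizes

def flattened_size(height, width, filters=(32, 64, 128)):
    """Number of values reaching Flatten for the three-stage CNN above."""
    for _ in filters:
        height = conv_pool_output(height)
        width = conv_pool_output(width)
    return height * width * filters[-1]

# Assumed example input: a 128x128 single-channel spectrogram
n_flat = flattened_size(128, 128)
dense_params = n_flat * 128 + 128    # weights + biases of Dense(128)
print(n_flat, dense_params)          # 25088 3211392
```

For this assumed input size, almost all trainable parameters sit in the first dense layer, so the spectrogram resolution largely determines the size of the model.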

---

## Model Training

The training process involves:

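- **Data Augmentation**: Random rotation, zoom, and horizontal flips are applied on the fly to add variability to the training images. These are generic image augmentations: on a spectrogram, a horizontal flip reverses time and a rotation tilts the time and frequency axes, so their benefit should be checked against the validation metrics.
- **Train/Validation Split**: The dataset is split into 80% training and 20% validation data using `train_test_split`.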
- **Epochs**: The model is trained over 20 epochs using augmented data.

### Training Code:

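```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.model_selection import train_test_split

# X: spectrogram images with shape (n_samples, height, width, channels); y: 0/1 labels.
# `model` is the network returned by create_cnn_model.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Augment only the training images; the validation set stays untouched.
# datagen.fit() is omitted: it is only needed for feature-wise normalization or ZCA whitening.
datagen = ImageDataGenerator(rotation_range=10, zoom_range=0.1, horizontal_flip=True)

model.fit(datagen.flow(X_train, y_train, batch_size=32), epochs=20, validation_data=(X_val, y_val))
```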

---

## Model Evaluation

### Key Metrics

1. **False Acceptance Ratio (FAR)**: The fraction of disallowed persons who are incorrectly accepted, FAR = FP / (FP + TN), with "allowed" as the positive class.
2. **False Rejection Ratio (FRR)**: The fraction of allowed persons who are incorrectly rejected, FRR = FN / (FN + TP).

The evaluation process includes:

- **Predicting on Test Set**: The model's performance is evaluated on unseen test data.
- **Calculating FAR/FRR**: Using a confusion matrix to calculate these critical metrics.

### Evaluation Code:

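```python
from sklearn.metrics import confusion_matrix

def calculate_far_frr(y_true, y_pred):
    # y_pred must hold hard 0/1 labels, e.g. sigmoid outputs thresholded at 0.5.
    # labels=[0, 1] keeps the matrix 2x2 even if one class is missing.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    # Guard against division by zero when a class has no samples
    far = fp / (fp + tn) if (fp + tn) else 0.0  # disallowed persons accepted
    frr = fn / (fn + tp) if (fn + tp) else 0.0  # allowed persons rejected
    return far, frr
```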
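
The sigmoid output is a probability, so it has to be thresholded before FAR and FRR can be computed, and the choice of threshold controls the trade-off between the two. The following is a minimal pure-Python sketch of that step. The labels, scores, and the helper name `far_frr_at_threshold` are made up for illustration and are not results from this project.

```python
def far_frr_at_threshold(y_true, scores, threshold=0.5):
    """FAR and FRR when scores >= threshold are treated as 'allowed' (1)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    # Return 0.0 when a class is absent instead of dividing by zero
    far = fp / negatives if negatives else 0.0
    frr = fn / positives if positives else 0.0
    return far, frr

# Illustrative labels and sigmoid scores (not real model output)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.05]

for threshold in (0.25, 0.5, 0.75):
    far, frr = far_frr_at_threshold(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  FAR={far:.2f}  FRR={frr:.2f}")
```

With these made-up numbers, raising the threshold from 0.25 to 0.75 moves FAR from 0.50 to 0.00 while FRR rises from 0.00 to 0.50. A stricter threshold therefore admits fewer disallowed persons at the cost of rejecting more allowed ones.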

---

## Future Improvements

Some ideas for further improvements:

1. **Enhance Data Augmentation**: Apply more sophisticated augmentations to increase model robustness.
2. **Advanced CNN Architectures**: Implement more advanced architectures like ResNet or VGG to improve performance.
3. **Noise Handling**: Explore techniques for handling background noise more effectively, such as noise reduction or filtering.
4. **Hyperparameter Tuning**: Experiment with different optimizers, learning rates, and batch sizes to improve model accuracy.
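
As one possible direction for the data augmentation idea, time and frequency masking (in the style of SpecAugment) is an augmentation designed for spectrograms rather than natural images. The sketch below is an assumption about how it could look here, not something the project implements. The function name `mask_spectrogram` and its parameters are hypothetical.

```python
import numpy as np

def mask_spectrogram(spec, max_freq_width=8, max_time_width=16, rng=None):
    """Return a copy of spec (n_mels, n_frames) with one frequency band
    and one time span overwritten by the minimum value."""
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()
    fill = spec.min()
    n_mels, n_frames = spec.shape

    # Mask a random band of mel bins (width may be 0, i.e. no mask)
    f_width = int(rng.integers(0, max_freq_width + 1))
    f_start = int(rng.integers(0, max(1, n_mels - f_width + 1)))
    out[f_start:f_start + f_width, :] = fill

    # Mask a random span of time frames
    t_width = int(rng.integers(0, max_time_width + 1))
    t_start = int(rng.integers(0, max(1, n_frames - t_width + 1)))
    out[:, t_start:t_start + t_width] = fill
    return out
```

Unlike rotation or flipping, masking leaves the time and frequency axes intact and only hides part of the signal, which is why it is a common choice for audio models.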

---

## Conclusion

This project demonstrates the application of **Convolutional Neural Networks** to voice recognition in a door-access system. By converting voice recordings into spectrograms, we apply image-based recognition techniques to classify users by their voice. The approach can be extended with the more sophisticated models and techniques listed under Future Improvements.
