Presentation (PowerPoint) • Presentation (PDF) • Demo (Video)
Key Features • Components • Installation • Usage • API Reference • Use Cases • Troubleshooting • Contributing
This project is our submission for the Helbling challenge at START Hack 2025. The challenge tasked us with developing an AI assistant capable of distinguishing multiple voices in a noisy environment, simulating an autonomous waiter in a bustling restaurant setting. Our solution, Adaptis, leverages advanced voice recognition and natural language processing to create a seamless interaction between customers and the AI waiter, showcasing the potential of AI in enhancing customer service in the hospitality industry.
- Multi-Speaker Recognition - Identifies different speakers in a conversation using voice embeddings
- Single-Message Multi-Speaker Support - A single recording may contain speech from several customers; the voices and their content are separated and each utterance is handled individually
- Noise Reduction - Advanced audio processing to isolate main voices from background noise
- Real-Time Transcription - Live speech-to-text conversion as audio is being recorded
- Speaker Diarization - Separates and labels audio by individual speakers
- Contextual Memory - Maintains conversation history and preferences for each identified speaker
- Privacy-Focused - Built with data security and user privacy as core principles
The system architecture consists of multiple specialized modules:
Implements real-time speaker identification using SpeechBrain's ECAPA-VOXCELEB model for generating voice embeddings and FAISS for efficient similarity search. The system:
- Generates 192-dimensional speaker embeddings using SpeechBrain
- Uses FAISS IndexFlatIP for fast similarity-based speaker matching
- Maintains a dynamic speaker database with automatic new speaker registration
- Supports preloading of known speaker profiles
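The matching logic above can be sketched without the heavy dependencies: FAISS's `IndexFlatIP` over L2-normalized embeddings is equivalent to cosine similarity, so plain NumPy stands in for it here, and random vectors stand in for SpeechBrain's 192-dimensional ECAPA embeddings. The class name, threshold value (0.7), and auto-generated speaker labels are illustrative assumptions, not the project's actual code.

```python
import numpy as np

class SpeakerDB:
    """Minimal sketch of the speaker store. Inner product over L2-normalized
    vectors (what FAISS IndexFlatIP computes) equals cosine similarity, so
    NumPy replaces FAISS here; real embeddings would come from SpeechBrain."""

    def __init__(self, dim=192, threshold=0.7):  # threshold is an assumed value
        self.embeddings = np.empty((0, dim), dtype=np.float32)
        self.names = []
        self.threshold = threshold

    @staticmethod
    def _normalize(v):
        return (v / np.linalg.norm(v)).astype(np.float32)

    def identify(self, emb):
        """Return the matching speaker, or register a new one automatically."""
        emb = self._normalize(np.asarray(emb, dtype=np.float32))
        if self.names:
            sims = self.embeddings @ emb          # cosine similarities
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return self.names[best]
        # No close match: dynamic registration of a new speaker
        name = f"speaker_{len(self.names)}"
        self.embeddings = np.vstack([self.embeddings, emb])
        self.names.append(name)
        return name
```

A slightly perturbed copy of a known voice embedding maps back to the same speaker, while an unrelated embedding registers as a new one.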
Raw Audio → DeepFilterNet Noise Reduction → High-pass Filtering (100Hz) → Audio Normalization → Pyannote Speaker Diarization → Whisper Transcription
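The middle stages of this pipeline can be illustrated in isolation. The sketch below implements only the high-pass filtering and peak normalization steps, using a simple one-pole filter as a stand-in for the real 100 Hz filter; DeepFilterNet denoising, pyannote diarization, and Whisper transcription are omitted because they require heavyweight models.

```python
import numpy as np

def highpass_normalize(audio, sr=16000, cutoff=100.0):
    """Sketch of the high-pass (100 Hz) + normalization stages of the pipeline.
    A first-order one-pole high-pass stands in for the production filter."""
    rc = 1.0 / (2 * np.pi * cutoff)
    alpha = rc / (rc + 1.0 / sr)
    out = np.empty_like(audio, dtype=np.float64)
    prev_x = prev_y = 0.0
    for i, x in enumerate(audio):
        # y[n] = alpha * (y[n-1] + x[n] - x[n-1]) attenuates low frequencies
        prev_y = alpha * (prev_y + x - prev_x)
        prev_x = x
        out[i] = prev_y
    peak = np.max(np.abs(out))
    return out / peak if peak > 0 else out   # peak-normalize to [-1, 1]
```

Feeding in a mix of a 440 Hz tone and a 30 Hz rumble, the low-frequency component is attenuated relative to the voice-band tone while the output stays within unit range.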
Implements a persona-based memory system that:
- Maintains conversation history and speaker profiles
- Generates dynamic speaker personas using OpenAI's API
- Groups conversations by speaker for contextual memory retrieval
- Supports real-time updates and persistence of conversation context
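The grouping-by-speaker idea can be sketched as follows. This is a simplified stand-in, not the project's `memoryprocessing.py`: the OpenAI persona-generation call is stubbed out, and class and method names are illustrative.

```python
import json
import time
from collections import defaultdict

class ConversationMemory:
    """Sketch of a persona-based memory store: utterances are grouped per
    speaker so context can be retrieved for each identified voice. The real
    system additionally summarizes each speaker into a persona via OpenAI."""

    def __init__(self):
        self.history = defaultdict(list)  # speaker_id -> list of utterances

    def add_utterance(self, speaker_id, text):
        self.history[speaker_id].append({"text": text, "ts": time.time()})

    def context_for(self, speaker_id, last_n=5):
        """Return the most recent utterances, e.g. for prompt construction."""
        return [u["text"] for u in self.history[speaker_id][-last_n:]]

    def save(self, path):
        """Persist the conversation context as JSON."""
        with open(path, "w") as f:
            json.dump(dict(self.history), f)
```

Each identified speaker accumulates an independent history, so the agent can recall, say, one guest's order without mixing it up with their table companion's.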
Flask-based server providing:
- RESTful endpoints for audio processing and transcription
- WebSocket support for real-time speech-to-text streaming
- Session management for continuous voice recognition
- Memory persistence and retrieval endpoints
└── StartHack25
    ├── frontend                # React frontend for the demo, embedding https://voiceoasis.azurewebsites.net/
└── src
├── speaker_recognition
│ ├── __init__.py
│ ├── audios # Pre-loaded speaker audio samples
│ │ └── *.wav # Sample audio files for known speakers
│ ├── embedder.py # SpeechBrain embedding generation
│ ├── utils.py # Speaker recognition utilities
│ └── vector_database.py # FAISS-based speaker matching
├── memoryprocessing.py # Conversation context and persona management
├── openai_api_example.py # OpenAI API integration examples
├── relay.py # Main Flask API server
├── utils.py # Audio processing and utility functions
├── visualize_noise_filter_results.ipynb # Noise filtering analysis
└── websocket_adapter.py # WebSocket client for real-time streaming
- Python 3.12+
- PyTorch
- CUDA (optional, for GPU acceleration)
1. Clone the repository:

   ```bash
   git clone https://github.com/your-username/StartHack25.git
   cd StartHack25
   ```

2. Create and activate the Conda environment:

   ```bash
   conda env create -f environment.yml
   conda activate starthack25
   ```

3. Install additional dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Set up environment variables:

   ```bash
   cp .env.example .env
   ```

   Edit the `.env` file with your API keys and configuration.

5. Run the application:

   ```bash
   python src/relay.py
   ```
- LLM Hallucinations: The system may occasionally produce inaccurate or irrelevant responses due to limitations in language model understanding.
- Incomplete Dockerization: The current implementation lacks full containerization, which may affect deployment consistency across different environments.
- Performance in Noisy Environments: In scenarios with multiple background voices, the denoising algorithm's effectiveness may be reduced, impacting overall system performance, although it remains reliable in typical conditions.
- Ongoing Denoising Improvements: Further refinement of the noise reduction techniques is needed to enhance audio quality in challenging acoustic environments.
- Frontend Integration: Due to limited access to the frontend application and the actual system prompt of the agent, some features may not be fully optimized for user interaction.
- Memory Management Workarounds: The current implementation of memory management relies on temporary solutions, which may impact long-term scalability and efficiency.
This implementation is based on the following repository: https://github.com/START-Hack/Helbling_STARTHACK25/tree/main

