A complete pipeline for real-time audio ingestion, speaker diarization, transcription, and per-speaker summarization using local services and an LLM. Currently supports the Zoom meeting platform.
- Real-time audio streaming via WebSocket
- Speaker diarization and identification
- Automatic transcription using local Whisper model
- AI-powered per-speaker summaries via LLM
- Reverse proxy for unified endpoint access
- JSON API for transcripts and summaries
┌───────────────────────┐
│ Zoom Meet / Attendee│
│ Bot Audio │
└─────────┬─────────────┘
│ (WebSocket)
▼
┌───────────────────────────┐
│ Reverse Proxy (WebSocket)│
│ reverse_proxy.py │
└─────────┬─────────────────┘
│
▼
┌─────────────────────────────┐
│ Audio Consumer WebSocket │
│ websocket_server.py │
│ │
│ - DiarizedAudioProcessor │
│ → maps audio → speaker │
│ - SpeakerAudioAggregator │
│ → collects per-speaker │
│ audio segments │
│ - process_speaker_segment() │
│ → transcribes using │
│ Whisper model │
│ - TranscriptManager │
│ → stores per-speaker + │
│ complete transcript │
└─────────┬───────────────────┘
│
▼
┌─────────────────────────────┐
│ Transcription Service │
│ transcription_service.py │
│ - WhisperTranscriptionService│
│ → converts audio → text │
└─────────┬───────────────────┘
│
▼
┌─────────────────────────────┐
│ Transcript Microservice │
│ transcript_microservice.py │
│ - Stores raw & per-speaker │
│ transcripts locally │
│ - Fetch loop: polls reverse │
│ proxy for updated transcript│
│ - /transcript endpoint │
│ → returns JSON transcript │
│ - /summary endpoint │
│ → sends per-speaker text │
│ to LLM (Groq ) │
│ → returns summary JSON │
└─────────┬───────────────────┘
│
▼
┌─────────────────────────────┐
│ External LLM API │
│ (Groq / OpenAI / Claude) │
│ - Summarizes per-speaker │
└─────────────────────────────┘
- **Audio Source**: meeting audio streams are sent to the WebSocket audio server.
- **WebSocket Audio Server** (`ws://127.0.0.1:5005`):
  - Receives audio chunks in real-time
  - Groups audio by speaker using `SpeakerAudioAggregator`
  - Maps audio chunks to speakers via `DiarizedAudioProcessor`
  - Sends completed segments to Whisper for transcription
- **Webhook / Local Transcript Server** (`http://127.0.0.1:5006` / `5007`):
  - Receives speaker diarization updates via webhooks
  - Creates transcripts and stores them in `TranscriptManager`
  - Exposes transcripts as JSON at `/transcripts` on port 5007
- **Reverse Proxy** (`http://127.0.0.1:8080`):
  - Consolidates WebSocket and HTTP traffic under one public port
  - Routes `/attendee-websocket*` to the WebSocket server
  - Routes `/webhook*` to the webhook server
  - Supports bidirectional streaming
- **Microservice** (FastAPI on port 8000):
  - Polls local transcripts every 5 seconds
  - Maintains raw transcripts and per-speaker stores
  - Generates per-speaker summaries via the Groq LLM
  - Exposes endpoints for transcripts and summaries
- Speaker diarization and audio chunk mapping (`DiarizedAudioProcessor`)
- Transcript management (`TranscriptManager`)
- WebSocket audio server
- Webhook HTTP server
- Local transcript HTTP server
- Use of `dataclass` for `TranscriptUtterance`
- Asynchronous handling with `asyncio` and `websockets`
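The `TranscriptUtterance` dataclass can be sketched roughly as follows. This is not the actual definition; the field names are assumptions taken from the transcript JSON example shown later in this README:

```python
from dataclasses import dataclass, asdict

# Hypothetical sketch of TranscriptUtterance; field names follow the
# transcript JSON example in this README, not the real source file.
@dataclass
class TranscriptUtterance:
    speaker_uuid: str
    speaker_name: str
    transcription: str
    timestamp_ms: int
    end_timestamp_ms: int
    speaker_is_host: bool = False

    @property
    def duration_ms(self) -> int:
        # Derived field, so it never drifts out of sync with the timestamps.
        return self.end_timestamp_ms - self.timestamp_ms

    def to_dict(self) -> dict:
        d = asdict(self)
        d["duration_ms"] = self.duration_ms
        return d
```

Using a `dataclass` keeps each utterance immutable in shape and trivially serializable for the JSON endpoints.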
- Defines a base interface (`TranscriptionService`) for audio-to-text transcription.
- Implements Whisper-based transcription with `WhisperTranscriptionService`.
- Loads a Whisper model (`tiny`/`base`/`small`/...) on initialization.
- The `transcribe` method converts a NumPy audio array into text, language, and segments.
- `get_transcription_service()` returns the singleton instance.
- `set_transcription_service()` allows replacing the global service (for testing or swapping models).
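A minimal sketch of this interface and the singleton accessors might look like the following. The exact method signatures are assumptions; the Whisper calls use the standard `openai-whisper` API (`whisper.load_model`, `model.transcribe`):

```python
from abc import ABC, abstractmethod

class TranscriptionService(ABC):
    """Base interface for audio-to-text transcription (sketch)."""

    @abstractmethod
    def transcribe(self, audio, sample_rate: int = 16000) -> dict:
        """audio: float32 NumPy array; returns text, language, and segments."""

class WhisperTranscriptionService(TranscriptionService):
    def __init__(self, model_size: str = "base"):
        import whisper  # requires the openai-whisper package
        self.model = whisper.load_model(model_size)  # tiny/base/small/...

    def transcribe(self, audio, sample_rate: int = 16000) -> dict:
        result = self.model.transcribe(audio)
        return {
            "text": result["text"],
            "language": result.get("language"),
            "segments": result.get("segments", []),
        }

_service = None

def get_transcription_service() -> TranscriptionService:
    """Return the process-wide singleton, creating it on first use."""
    global _service
    if _service is None:
        _service = WhisperTranscriptionService()
    return _service

def set_transcription_service(service: TranscriptionService) -> None:
    """Swap the global service (e.g., a fake for tests or a larger model)."""
    global _service
    _service = service
```

The setter makes the heavyweight Whisper dependency easy to stub out in unit tests.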
- `SpeakerAudioSegment`: stores audio, timestamps, and speaker info for a single segment.
- `SpeakerAudioAggregator`: groups incoming audio chunks by speaker into segments.
  - `add_audio_chunk`: adds audio; finalizes a segment if it grows too long or a new speaker appears.
  - `finalize_stale_segments`: ends segments that have been idle for a while.
  - `finalizeAllSegments`: ends all active segments (e.g., at meeting end).
  - Tracks active speakers and provides segment stats for transcription or analysis.
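The aggregation rules above can be sketched like this. The 30-second segment cap, the 2-second idle timeout, and the method bodies are illustrative assumptions, not the real implementation:

```python
import time

class SpeakerAudioSegment:
    """One contiguous run of audio from a single speaker (sketch)."""

    def __init__(self, speaker_id, start_ts):
        self.speaker_id = speaker_id
        self.start_ts = start_ts
        self.chunks = []
        self.last_update = start_ts

class SpeakerAudioAggregator:
    """Groups incoming audio chunks by speaker into segments (sketch)."""

    def __init__(self, max_segment_s=30.0, stale_after_s=2.0):
        self.max_segment_s = max_segment_s
        self.stale_after_s = stale_after_s
        self.active = {}  # speaker_id -> SpeakerAudioSegment

    def add_audio_chunk(self, speaker_id, chunk, now=None):
        """Append audio; return a finalized segment if the active one grew too long."""
        if now is None:
            now = time.time()
        finalized = None
        seg = self.active.get(speaker_id)
        if seg is not None and now - seg.start_ts >= self.max_segment_s:
            finalized = self.active.pop(speaker_id)
            seg = None
        if seg is None:
            seg = SpeakerAudioSegment(speaker_id, now)
            self.active[speaker_id] = seg
        seg.chunks.append(chunk)
        seg.last_update = now
        return finalized

    def finalize_stale_segments(self, now=None):
        """End segments that have been idle longer than the timeout."""
        if now is None:
            now = time.time()
        stale = [sid for sid, seg in self.active.items()
                 if now - seg.last_update >= self.stale_after_s]
        return [self.active.pop(sid) for sid in stale]

    def finalizeAllSegments(self):
        """End every active segment (e.g., when the meeting ends)."""
        done = list(self.active.values())
        self.active.clear()
        return done
```

Finalized segments are what get handed to Whisper; bounding their length keeps transcription latency predictable.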
- **Audio WebSocket Server**:
  - Runs on `ws://127.0.0.1:5005`.
  - Receives real-time audio chunks from meetings or clients.
  - Buffers audio per speaker using `SpeakerAudioAggregator`.
  - Sends completed segments for transcription using `WhisperTranscriptionService`.
- **Webhook / Local Transcript HTTP Server**:
  - Runs on `http://127.0.0.1:5006` (webhooks) and `http://127.0.0.1:5007` (local transcript).
  - Receives speaker diarization updates (`/webhook`).
  - Exposes local transcripts via JSON (`/transcripts`, `/transcripts/per-speaker`).
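As an illustration, the audio server's receive path might look like the sketch below. The incoming message shape (a JSON frame carrying a base64 PCM chunk plus a speaker id) is an assumption made for this example, not Attendee's documented schema:

```python
import base64
import json

def decode_audio_message(raw: str):
    """Parse one WebSocket frame into (speaker_id, raw PCM bytes).

    The JSON layout used here is assumed for illustration.
    """
    msg = json.loads(raw)
    pcm = base64.b64decode(msg["data"]["chunk"])
    return msg["data"].get("speaker_id"), pcm

async def handle_connection(websocket, aggregator):
    """Feed every decoded chunk into the per-speaker aggregator."""
    async for raw in websocket:
        speaker_id, pcm = decode_audio_message(raw)
        aggregator.add_audio_chunk(speaker_id, pcm)

# Serving would then be something like (with the websockets package):
#   websockets.serve(lambda ws: handle_connection(ws, aggregator),
#                    "127.0.0.1", 5005)
```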
- Consolidates multiple local services under one public port.
- WebSocket proxy: `/attendee-websocket*` forwards to `ws://127.0.0.1:5005`.
  - Maintains bidirectional streaming between the client and the local audio server.
- HTTP proxy: `/webhook*` and `/transcripts*` forward to `http://127.0.0.1:5006`.
  - Preserves the request method, headers, and body; returns the backend response.
- Uses `pyngrok` to create a secure tunnel to the reverse proxy on port 8080.
- Provides a public URL for external services to connect to local servers.
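The routing rule is simple enough to state in a few lines; this sketch mirrors the path prefixes and backend targets listed above (the function itself is illustrative, not the actual proxy code):

```python
def route(path: str):
    """Map an incoming request path to its local backend (sketch)."""
    if path.startswith("/attendee-websocket"):
        return ("ws", "ws://127.0.0.1:5005")
    if path.startswith("/webhook") or path.startswith("/transcripts"):
        return ("http", "http://127.0.0.1:5006")
    return None  # unknown paths are rejected
```

Keeping the rule prefix-based means one ngrok tunnel on port 8080 can serve both the bot's audio WebSocket and its webhook callbacks.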
- Purpose: FastAPI microservice to fetch transcripts and generate per-speaker summaries using an LLM.
- Fetch loop: polls `http://127.0.0.1:5007` every 5 seconds to update local stores:
  - `raw_transcripts` → full JSON
  - `per_speaker_store` → transcripts grouped by speaker
- Endpoints:
  - `GET /transcripts` → returns the raw transcript JSON
  - `GET /summary` → returns LLM-generated summaries per speaker
- LLM usage: sends each speaker's transcript to the Groq LLM to generate a concise summary.
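A rough sketch of the per-speaker grouping and summarization step. The transcript field names come from the example JSON later in this README; the Groq model name and prompt wording are placeholder assumptions:

```python
from collections import defaultdict

TRANSCRIPT_URL = "http://127.0.0.1:5007/transcripts"  # polled every 5 seconds

def group_by_speaker(transcript_json: dict) -> dict:
    """Collapse the raw transcript into one text blob per speaker."""
    store = defaultdict(list)
    for utt in transcript_json.get("transcripts", []):
        store[utt["speaker_name"]].append(utt["transcription"])
    return {name: " ".join(parts) for name, parts in store.items()}

def build_summary_prompt(speaker: str, text: str) -> str:
    return f"Concisely summarize what {speaker} said in this meeting:\n\n{text}"

def summarize_all(client, per_speaker: dict,
                  model: str = "llama-3.1-8b-instant") -> dict:
    """client is a groq.Groq instance; the model name is an assumption."""
    out = {}
    for speaker, text in per_speaker.items():
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": build_summary_prompt(speaker, text)}],
        )
        out[speaker] = resp.choices[0].message.content
    return out
```

The resulting dict maps directly onto the `summary_per_speaker` object returned by `GET /summary`.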
- Python 3.8+
- ngrok (for public tunnel)
- Groq API key (for LLM summaries)
```bash
pip install -r requirements.txt
```

Run the following services in separate terminal windows:

```bash
python audio_consumer/websocket_server.py
python reverse_proxy.py
ngrok http 8080
```

Copy the `https://<ngrok-id>.ngrok-free.app` URL for configuration.
```bash
python microservice.py
```

```bash
curl -X POST 'https://app.attendee.dev/api/v1/bots' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Token YOUR_API_TOKEN' \
  -d '{
    "meeting_url": "https://us05web.zoom.us/j/YOUR_MEETING_ID",
    "bot_name": "Meeting Bot",
    "websocket_settings": {
      "audio": {
        "url": "wss://YOUR-NGROK-URL.ngrok-free.app/attendee-websocket",
        "sample_rate": 16000
      }
    },
    "webhooks": [
      {
        "url": "https://YOUR-NGROK-URL.ngrok-free.app/webhook",
        "triggers": ["transcript.update", "bot.state_change"]
      }
    ]
  }'
```

Raw transcripts: `http://127.0.0.1:5007/transcripts`

Per-speaker summaries: `http://127.0.0.1:8000/summary`
```json
{
  "format": "json",
  "meeting_info": {
    "start_time_ms": 1763438437019,
    "end_time_ms": 1763438704648,
    "total_segments": 9,
    "speakers": ["deepit shah"]
  },
  "transcripts": [
    {
      "timestamp_ms": 1763438437019,
      "end_timestamp_ms": 1763438467023,
      "duration_ms": 30004,
      "speaker_uuid": "16778240",
      "speaker_name": "deepit shah",
      "speaker_is_host": true,
      "transcription": "Hi, so let us everything is up and running...",
      "audio_samples": 480160,
      "sample_rate": 16000,
      "processed_at": "2025-11-17T20:03:45.282319"
    }
  ]
}
```

```json
{
  "summary_per_speaker": {
    "deepit shah": "The speaker is setting up and demonstrating a system with multiple components, including a web audio server, webhook server, local transcript server, and a microservice. The microservice generates summaries using an LLM..."
  }
}
```

- 5005: WebSocket Audio Server
- 5006: Webhook Server
- 5007: Local Transcript Server
- 8000: Microservice (FastAPI)
- 8080: Reverse Proxy
```bash
export GROQ_API_KEY="your_groq_api_key"
export WHISPER_MODEL="base"  # Options: tiny, base, small, medium, large
```

- Ensure ngrok is running and the URL is correctly configured
- Check firewall settings for port 8080
- Increase Whisper model size (base → small → medium)
- Verify audio sample rate is 16000 Hz
- Verify Groq API key is set
- Check microservice logs for LLM errors