Feature: Implement Kafka Infrastructure & AI Audio Processing Pipeline
Problem
FluentMeet's core value proposition — real-time voice translation in video calls — requires a low-latency, fault-tolerant pipeline for processing audio as it streams. Orchestrating Speech-to-Text (STT), Translation, and Text-to-Speech (TTS) synchronously within a single request or WebSocket handler would be fragile, unscalable, and far too slow. There is currently no messaging infrastructure to decouple these processing stages from one another.
Proposed Solution
Set up Apache Kafka as the central message bus for the audio processing pipeline. Each stage of the pipeline (ingest → STT → translation → TTS → egress) becomes an independent, async worker that consumes from one Kafka topic and produces to the next. This architecture allows each stage to be scaled, monitored, and replaced independently, while providing natural backpressure and replay capabilities via Kafka's offset management.
Pipeline Architecture
[WebSocket Audio Ingest]
        │
        ▼
audio.raw ──► STTWorker (Deepgram) ──► text.original
        │
        ▼
TranslationWorker (DeepL/GPT) ──► text.translated
        │
        ▼
TTSWorker (Voice.ai) ──► audio.synthesized
        │
        ▼
[WebSocket Audio Egress]
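Every stage in this diagram has the same consume → process → produce shape. The following is a minimal sketch of that shared loop using aiokafka, folding in the retry-with-backoff and per-stage latency logging called for under Acceptance Criteria; the process callable, retry parameters, and bootstrap default are placeholder assumptions, not settled implementation details.

```python
import asyncio
import logging
import time
from typing import Awaitable, Callable

from aiokafka import AIOKafkaConsumer, AIOKafkaProducer

log = logging.getLogger("pipeline")

async def run_stage(
    in_topic: str,
    out_topic: str,
    group_id: str,
    process: Callable[[bytes], Awaitable[bytes]],
    bootstrap: str = "localhost:9092",  # KAFKA_BOOTSTRAP_SERVERS in practice
    max_retries: int = 3,
) -> None:
    """Consume in_topic, apply process(), publish the result to out_topic."""
    consumer = AIOKafkaConsumer(
        in_topic,
        bootstrap_servers=bootstrap,
        group_id=group_id,          # consumer groups enable horizontal scaling
        enable_auto_commit=False,   # commit only after a successful produce
    )
    producer = AIOKafkaProducer(bootstrap_servers=bootstrap)
    await consumer.start()
    await producer.start()
    try:
        async for msg in consumer:
            started = time.monotonic()
            for attempt in range(max_retries):
                try:
                    result = await process(msg.value)  # STT / translation / TTS call
                    break
                except Exception:
                    await asyncio.sleep(2 ** attempt)  # backoff on transient errors
            else:
                # Offset left uncommitted: the message remains in the topic for replay.
                log.error("%s: giving up on offset %d", group_id, msg.offset)
                continue
            # Re-use the incoming key (e.g. room_id) so per-room ordering is preserved.
            await producer.send_and_wait(out_topic, value=result, key=msg.key)
            await consumer.commit()
            log.info("%s latency: %.0f ms", group_id,
                     (time.monotonic() - started) * 1000)
    finally:
        await consumer.stop()
        await producer.stop()
```

Under this shape, an STTWorker is little more than run_stage("audio.raw", "text.original", "stt-worker-group", transcribe), where transcribe wraps the Deepgram call.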
User Stories
- As a meeting participant, I want to hear the speaker's voice translated in near real-time, so I can follow the conversation without language barriers.
- As a developer, I want each processing stage to be an independently deployable worker, so I can scale the bottleneck stages (e.g., STT) without over-provisioning the others.
- As a DevOps engineer, I want all pipeline failures to be logged with the original message retained in the topic, so I can diagnose issues and replay failed messages without data loss.
Acceptance Criteria
- A Kafka cluster is configured and reachable from the FastAPI application (via docker-compose for local development).
- The following Kafka topics are created and documented:
  - audio.raw — raw audio chunks from the WebSocket ingest.
  - text.original — transcribed text output from the STT worker.
  - text.translated — translated text output from the Translation worker.
  - audio.synthesized — synthesized audio output from the TTS worker.
- AudioIngestService is implemented in app/services/audio_bridge.py to accept streaming audio from the WebSocket and publish chunks to audio.raw (see the producer sketch after this list).
- STTWorker consumes from audio.raw, calls the Deepgram API for transcription, and publishes results to text.original.
- TranslationWorker consumes from text.original, calls the DeepL or GPT API for translation, and publishes results to text.translated.
- TTSWorker consumes from text.translated, calls the Voice.ai API for speech synthesis, and publishes the resulting audio to audio.synthesized.
- The WebSocket audio egress handler consumes from audio.synthesized and streams the translated audio back to the correct meeting room participants.
- All workers handle transient errors gracefully (retry with backoff) and log pipeline latency per stage.
- Each worker is independently horizontally scalable via Kafka consumer groups.
- End-to-end pipeline latency is measured and logged for each audio chunk (from audio.raw publish to audio.synthesized consume).
Proposed Technical Details
- Kafka Client: aiokafka for all async producer and consumer operations.
- Kafka Config: KAFKA_BOOTSTRAP_SERVERS already defined in app/core/config.py.
- Topic Schema: Each message carries a room_id, user_id, sequence_number, and the payload (binary audio or UTF-8 text), enabling correct reassembly and routing (sketched after this list).
- Worker lifecycle: All workers are started as background tasks within FastAPI's lifespan context manager, ensuring clean startup and shutdown (see the sketch after the task list).
- Consumer Groups:
  - stt-worker-group
  - translation-worker-group
  - tts-worker-group
- New Service Files:
  - app/services/audio_bridge.py — AudioIngestService (producer) and egress router.
  - app/services/stt_worker.py — STTWorker consumer.
  - app/services/translation_worker.py — TranslationWorker consumer.
  - app/services/tts_worker.py — TTSWorker consumer.
- Infrastructure: infra/docker-compose.yml extended with zookeeper and kafka services.
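A minimal sketch of the topic schema above as a Pydantic v2 model, which would live in app/schemas/pipeline.py per the task list below. Base64-encoding the payload is just one way to keep binary audio inside a JSON envelope; Avro, Protobuf, or Kafka headers would work equally well, and the wrap/unwrap helpers are illustrative.

```python
import base64

from pydantic import BaseModel

class PipelineMessage(BaseModel):
    """Envelope carried on every pipeline topic (see Topic Schema above)."""

    room_id: str
    user_id: str
    sequence_number: int
    payload_b64: str  # binary audio or UTF-8 text, base64-encoded for JSON

    @classmethod
    def wrap(cls, room_id: str, user_id: str, seq: int,
             payload: bytes) -> "PipelineMessage":
        return cls(
            room_id=room_id,
            user_id=user_id,
            sequence_number=seq,
            payload_b64=base64.b64encode(payload).decode(),
        )

    @property
    def payload(self) -> bytes:
        return base64.b64decode(self.payload_b64)

    def to_kafka(self) -> bytes:
        return self.model_dump_json().encode()

    @classmethod
    def from_kafka(cls, raw: bytes) -> "PipelineMessage":
        return cls.model_validate_json(raw)
```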
Tasks
- Extend infra/docker-compose.yml with zookeeper and kafka services.
- Create the pipeline topics (audio.raw, text.original, text.translated, audio.synthesized) with appropriate partition and retention settings.
- Define the pipeline message schema in app/schemas/pipeline.py.
- Implement AudioIngestService in app/services/audio_bridge.py to publish raw audio chunks.
- Implement STTWorker in app/services/stt_worker.py using the Deepgram SDK.
- Implement TranslationWorker in app/services/translation_worker.py using the DeepL or OpenAI API.
- Implement TTSWorker in app/services/tts_worker.py using the Voice.ai API.
- Implement the WebSocket egress handler that streams audio.synthesized back to room participants.
- Start and stop all workers from the lifespan context in app/main.py (sketched below).
- Measure and log end-to-end pipeline latency from audio.raw to audio.synthesized.
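A sketch of the lifespan wiring task, assuming the run_stage() loop sketched under Pipeline Architecture and hypothetical transcribe/translate/synthesize wrappers around the provider SDKs:

```python
import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI

# Hypothetical homes for the pieces sketched earlier in this document.
from app.services.pipeline import run_stage           # generic stage loop
from app.services.stt_worker import transcribe        # Deepgram wrapper
from app.services.translation_worker import translate  # DeepL/GPT wrapper
from app.services.tts_worker import synthesize          # Voice.ai wrapper

@asynccontextmanager
async def lifespan(app: FastAPI):
    tasks = [
        asyncio.create_task(run_stage("audio.raw", "text.original",
                                      "stt-worker-group", transcribe)),
        asyncio.create_task(run_stage("text.original", "text.translated",
                                      "translation-worker-group", translate)),
        asyncio.create_task(run_stage("text.translated", "audio.synthesized",
                                      "tts-worker-group", synthesize)),
    ]
    yield  # the app serves WebSocket traffic while the workers run
    for task in tasks:
        task.cancel()  # run_stage's finally block closes the Kafka clients
    await asyncio.gather(*tasks, return_exceptions=True)

app = FastAPI(lifespan=lifespan)
```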
Open Questions/Considerations
- Which STT provider will be primary — Deepgram or OpenAI Whisper? Do we need to support fallback?
- What is the acceptable end-to-end pipeline latency target (e.g., < 500ms from speech to translated audio)?
- How many Kafka partitions should each topic have to achieve the desired throughput? This depends on the expected number of concurrent meeting rooms.
- Should we implement a dead-letter topic (e.g., pipeline.dlq) for messages that fail processing after N retries?
- How will we handle speaker diarization — do we need to preserve per-speaker audio streams through the pipeline?