AIDA is an AI-driven, multimodal healthcare assistant that combines live audio capture, facial-emotion analysis, retrieval-augmented generation over user EHR-like data, and a chat UI to provide context-aware medical guidance.
- Context-aware chat: Uses OpenAI Assistants with a vector store populated from the user’s historical conversations and metadata in MongoDB.
- Voice-to-text: Periodic local audio capture and transcription via OpenAI Whisper, streamed into the conversation loop.
- Affective signals: Frontend streams webcam frames via WebSocket; backend uses Hume Face model to compute emotion scores.
- Session orchestration: Start/stop capture endpoints; “latest” polling for synchronized, near-real-time chat updates.
- Personalization: User profile and medical history persisted in MongoDB, referenced inside RAG.
```mermaid
flowchart LR
  subgraph Client[Next.js Frontend]
    UI[Chat UI / Pages] --> VC[VideoChat component]
    UI --> CH[ChatHistory component]
    UI --> Setup[Setup Form]
  end
  subgraph Backend[FastAPI Backend]
    API[/REST APIs/]:::api
    WS((WebSocket /ws)):::ws
    Poll[poll_transcript loop]:::svc
    RAG[OpenAI Assistants + Vector Store]:::ext
    STT[OpenAI Whisper]:::ext
    Hume[Hume Face Analysis]:::ext
    DB[(MongoDB)]:::db
  end
  VC -- send frames --> WS
  UI -- POST /get_answer --> API
  UI -- GET /latest --> API
  UI -- GET /get_history --> API
  UI -- GET /start-recording --> API
  UI -- GET /stop-recording --> API
  Setup -- localStorage profile --> UI
  API <--> DB
  API --> RAG
  Poll --> STT
  Poll --> RAG
  WS --> Hume
  classDef api fill:#e8f0fe,stroke:#5c6bc0;
  classDef ws fill:#e0f7fa,stroke:#00acc1;
  classDef db fill:#e8f5e9,stroke:#43a047;
  classDef ext fill:#fff3e0,stroke:#fb8c00;
  classDef svc fill:#f3e5f5,stroke:#8e24aa;
```
```mermaid
sequenceDiagram
  autonumber
  participant User
  participant Frontend as Next.js UI
  participant Backend as FastAPI
  participant Whisper as OpenAI Whisper
  participant Hume as Hume Face API
  participant Assist as OpenAI Assistants
  participant Mongo as MongoDB
  User->>Frontend: Start Session
  Frontend->>Backend: GET /start-recording
  Frontend->>Backend: WS /ws (send frame blobs every 2s)
  Backend->>Hume: analyze(face image)
  Hume-->>Backend: emotion scores
  loop every ~5s
    Frontend->>Backend: mic capture (MediaRecorder) [optional]
    Backend->>Whisper: transcribe(output.wav)
    Whisper-->>Backend: transcript
    Backend->>Assist: ask(question=transcript, tools=file_search)
    Assist->>Mongo: RAG over vector store (prior convos)
    Assist-->>Backend: assistant response
    Frontend->>Backend: GET /latest
    Backend-->>Frontend: {latestUser, latestBot}
  end
  User->>Frontend: Types question
  Frontend->>Backend: POST /get_answer
  Backend->>Assist: ask(question=input)
  Assist-->>Backend: answer
  Backend->>Mongo: push messages[], export vector store
  Backend-->>Frontend: {answer}
  Frontend->>Backend: GET /stop-recording
  Backend->>Assist: summarize(last convo → title)
  Backend->>Mongo: update title
```
- Framework: `FastAPI`
- File: `backend/app.py`
- Responsibilities:
  - CORS, REST endpoints, WebSocket endpoint
  - Conversation creation, Q&A, titling, and RAG updates
  - Background transcript polling loop and "latest" cache
  - Vector store export on updates
  - Emotion analysis via Hume Face API
| Method | Path | Purpose | Input | Output |
|---|---|---|---|---|
| POST | /create_conversation/ | Start a new conversation | form: email, first_name, last_name | unique_id (UUID, used as initial title) |
| POST | /get_answer/ | Get assistant answer for a question and persist to latest convo | form: email, first_name, last_name, question | { answer } |
| POST | /update_conversation_title/ | Summarize latest convo title using Assistant | form: email, first_name, last_name | { status, new_title } |
| GET | /get_history | Fetch list of past conversation dates and titles | query: email, first_name, last_name | [ { date, title } ] |
| GET | /start-recording/ | Begin background audio polling + transcript loop | query: email, first_name, last_name | 200 |
| GET | /stop-recording/ | Stop polling and retitle latest convo | query: email, first_name, last_name | 200 |
| GET | /latest/ | Poll latest user/bot messages produced by audio loop | none | { latestUser, latestBot, newText } |
| WS | /ws | Receive periodic webcam frames for emotion analysis | binary blobs (jpeg/webm) | server-side updates of average sentiments |
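For a quick smoke test, the REST surface above can be exercised from Python. This is a hypothetical client, assuming the backend is running on `localhost:8000` and the demo user from the schema; requests are only sent when the script is run directly.

```python
# Hypothetical client for the endpoints in the table above.
# Network calls run only under __main__ so the module imports cleanly.
import requests

BASE = "http://localhost:8000"
PROFILE = {"email": "johndoe@gmail.com", "first_name": "John", "last_name": "Doe"}

def ask(question: str) -> str:
    """POST /get_answer/ with the demo profile and return the answer text."""
    resp = requests.post(f"{BASE}/get_answer/", data={**PROFILE, "question": question})
    resp.raise_for_status()
    return resp.json()["answer"]

def poll_latest() -> dict:
    """GET /latest/ to pick up messages produced by the voice loop."""
    return requests.get(f"{BASE}/latest/").json()

if __name__ == "__main__":
    requests.get(f"{BASE}/start-recording/", params=PROFILE)
    print(ask("I have a persistent headache, what should I check?"))
    print(poll_latest())
    requests.get(f"{BASE}/stop-recording/", params=PROFILE)
```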
Notes:
- The Hume emotion averages are maintained in-memory (`sentiments`, `sentCount`). The `/avgs` route exists but is commented out in `app.py`.
- The WebSocket in `videoProcess.py` mirrors `app.py` but is not used when running `app.py`. Prefer the single FastAPI app in `app.py`.
- `export_and_upload_to_vector_store()` exports all user documents from MongoDB to a JSON file and re-uploads it to a preconfigured vector store (`vs_r70jSDRJR1LyHTCChmWyKTGd`).
- Each Q&A flow uses an Assistant configured with the `file_search` tool and bound to that vector store to ground answers in user-specific data.
- Retitling (`return_title`) uses the same store to summarize the most recent convo.
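The in-memory emotion averaging can be sketched as a running mean. This is a simplified sketch following the `sentiments`/`sentCount` globals mentioned above; the exact update in `app.py` may differ.

```python
# Running mean of per-frame emotion scores, held in module-level state
# analogous to the `sentiments` / `sentCount` globals.
sentiments: dict[str, float] = {}   # emotion name -> running average score
sent_count = 0                      # number of frames folded in so far

def update_averages(frame_scores: dict[str, float]) -> dict[str, float]:
    """Fold one frame's emotion scores into the running averages."""
    global sent_count
    sent_count += 1
    for emotion, score in frame_scores.items():
        prev = sentiments.get(emotion, 0.0)
        # Incremental mean: new_avg = old_avg + (x - old_avg) / n
        sentiments[emotion] = prev + (score - prev) / sent_count
    return sentiments

update_averages({"Calmness": 0.8, "Anxiety": 0.2})
update_averages({"Calmness": 0.6, "Anxiety": 0.4})
# averages are now Calmness 0.7, Anxiety 0.3
```

The incremental form avoids storing every frame's scores, which matches keeping only two small pieces of state between WebSocket messages.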
- `backend/wav.py` captures 5 s WAV segments using `sounddevice` and writes `output.wav`.
- `backend/whisper.py` calls OpenAI Whisper (`whisper-1`) to get text; short fragments (under 10 characters) are discarded.
- `poll_transcript` appends transcripts until silence, then calls `return_answer(transcript)` and publishes to the `latest` cache for the frontend to poll.
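The accumulate-until-silence control flow can be sketched with the capture and transcription steps injected as parameters, so it runs without `sounddevice` or an API key. Function and parameter names here are illustrative, not the exact `app.py` internals.

```python
# Sketch of the poll_transcript control flow: accumulate transcribed 5 s
# chunks until a silent segment arrives, then hand the buffer to the
# assistant and publish the result to the `latest` cache.
from typing import Callable, Iterable

def poll_transcript(
    segments: Iterable[str],              # transcribed 5 s chunks ("" = silence)
    answer: Callable[[str], str],         # e.g. return_answer() backed by the Assistant
    latest: dict,                         # cache served by GET /latest/
    min_len: int = 10,                    # fragments shorter than this are discarded
) -> None:
    buffer = ""
    for text in segments:
        if len(text.strip()) >= min_len:
            buffer += " " + text.strip()
        elif buffer:
            # Silence after speech: flush the buffer to the assistant.
            question = buffer.strip()
            latest.update(latestUser=question, latestBot=answer(question), newText=True)
            buffer = ""

latest = {"latestUser": "", "latestBot": "", "newText": False}
poll_transcript(
    ["I feel dizzy when I", "stand up quickly", ""],   # "" simulates silence
    answer=lambda q: f"(stub answer to: {q})",
    latest=latest,
)
```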
A single user document contains profile and conversational history. Conversations are appended and the latest is targeted for updates.
```json
{
  "email": "johndoe@gmail.com",
  "first_name": "John",
  "last_name": "Doe",
  "basic_info": { "age": 32, "height": 178, "weight": 75, "sex": "Male" },
  "medical_history": [
    { "disease": "Hypertension", "severity": 2, "probability": 60 }
  ],
  "past_convos": [
    {
      "date": "09/12/25",
      "title": "a6b1-...-uuid-or-summary",
      "messages": [
        { "role": "user", "content": "I feel dizzy..." },
        { "role": "assistant", "content": "Given your history..." }
      ]
    }
  ]
}
```

- Framework: `Next.js` + `React`, styling with `Tailwind CSS`.
- Key pages/components:
  - `pages/index.js`: Landing with feature navigation cards.
  - `pages/chatbot.js`: Main chat; posts to `/get_answer/`, polls `/latest/`, renders `VideoChat` and `ChatHistory`.
  - `components/VideoChat.js`: Initializes `MediaRecorder`, opens a WebSocket to `ws://localhost:8000/ws`, sends frame blobs every 2s; calls the start/stop session endpoints.
  - `components/ChatHistory.js`: Displays chat sessions (static sample; API call present but commented out).
  - `pages/setup.js` + `components/setup.js`: Local-only user profile capture stored in `localStorage`.
  - `components/results.js`: Patient summary view and sentiment chart (static data unless `/avgs` is enabled).
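Appending one Q&A exchange to the most recent conversation in the document above can be expressed as a `pymongo` update. This is a sketch: the collection name `users` and the prior lookup of the conversation index are assumptions about the actual code, since MongoDB update paths need a concrete array position.

```python
# Build the filter/update documents for appending one Q&A exchange to a
# conversation at a known index in past_convos.
def append_exchange(convo_index: int, question: str, answer: str):
    filter_doc = {"email": "johndoe@gmail.com"}  # demo user from the schema above
    path = f"past_convos.{convo_index}.messages"
    update_doc = {
        "$push": {
            path: {
                "$each": [
                    {"role": "user", "content": question},
                    {"role": "assistant", "content": answer},
                ]
            }
        }
    }
    return filter_doc, update_doc

f, u = append_exchange(0, "I feel dizzy...", "Given your history...")
# With pymongo this would be applied as:
#   users.update_one(f, u)  # `users` is the MongoDB collection handle
```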
sequenceDiagram
participant UI as Chat UI
participant BE as FastAPI
UI->>BE: POST /get_answer (question)
BE-->>UI: { answer }
Note over UI,BE: UI also polls GET /latest for voice-loop outputs
UI->>BE: GET /start-recording (begin polling loop)
UI->>BE: WS /ws (send frames)
UI->>BE: GET /stop-recording (retitle latest)
- Unified RAG over user timeline: Exporting the entire user corpus to a vector store on each update ensures the Assistant’s file_search has consistent context across sessions without bespoke embedding code.
- Hybrid interaction loop: Text chat and passive voice capture run concurrently; the frontend reconciles both streams into a single chat timeline.
- Affective context: Real-time emotion inference from facial signals enables future adaptive responses (e.g., escalation on high distress).
- Lightweight orchestration: A background polling loop with `sounddevice` avoids heavy streaming infrastructure while still delivering incremental insights.
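The "export the whole corpus, re-upload" approach can be sketched as follows. The serialization half runs as-is; the upload half is shown commented out because it needs a live OpenAI client, and the exact vector-store calls vary by `openai` SDK version, so treat those lines as an assumed shape to verify against your installed version.

```python
# Sketch of the export half of export_and_upload_to_vector_store():
# serialize every user document into one JSON file that file_search can index.
import json
import os
import tempfile

def export_users(docs, path):
    """Write all user documents to a single JSON file for the vector store."""
    with open(path, "w") as fh:
        # default=str handles non-JSON types such as Mongo ObjectIds
        json.dump(list(docs), fh, default=str, indent=2)
    return path

demo_docs = [{"email": "johndoe@gmail.com", "past_convos": []}]
path = export_users(demo_docs, os.path.join(tempfile.gettempdir(), "user_corpus.json"))

# Upload/re-attach step (assumed SDK shape; check your openai version):
#   client = openai.OpenAI()
#   f = client.files.create(file=open(path, "rb"), purpose="assistants")
#   client.beta.vector_stores.files.create(
#       vector_store_id=os.environ["VECTOR_STORE_ID"], file_id=f.id)
```

Re-exporting everything on each update is simple and keeps `file_search` consistent, at the cost of redundant uploads as the corpus grows.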
- Python 3.10+
- Node.js 18+
- MongoDB Atlas URI (or local MongoDB)
- API keys: OpenAI, Hume
Create a `.env` at `backend/.env` with:

```env
OPENAI_API_KEY=sk-...
WHISPER_KEY=sk-...        # if using a distinct key variable for Whisper
MONGODB_URI=mongodb+srv://...
HUME_API_KEY=...
VECTOR_STORE_ID=vs_...
```

Update `backend/app.py` to read `MONGODB_URI`, `VECTOR_STORE_ID`, and the Hume key from the environment (recommended). The current code includes hardcoded values; replace these with env lookups for security.
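A minimal env-driven loader might look like this (a sketch: variable names follow the `.env` above, and `python-dotenv` is treated as optional):

```python
# Read configuration from the environment instead of hardcoding it.
import os

try:
    from dotenv import load_dotenv  # optional: pulls backend/.env into os.environ
    load_dotenv()
except ImportError:
    pass  # fine if python-dotenv is absent and vars are set another way

def require(name: str) -> str:
    """Fail fast with a clear message when a required variable is missing."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Usage inside app.py (names match the .env sketch above):
# MONGODB_URI = require("MONGODB_URI")
# VECTOR_STORE_ID = require("VECTOR_STORE_ID")
# HUME_API_KEY = require("HUME_API_KEY")
```

Failing at startup with a named variable beats a cryptic connection error deep inside a request handler.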
```bash
cd backend
pip install -r requirements.txt  # create this file if missing; include fastapi, uvicorn, pymongo, python-dotenv, requests, sounddevice, scipy, hume, openai
uvicorn app:app --reload --host 127.0.0.1 --port 8000
```

```bash
cd frontend
npm install
npm run dev
# open http://localhost:3000
```

- Move credentials to environment variables; never commit keys.
- Restrict CORS to known origins (currently `http://localhost:3000`).
- Consider encrypting sensitive user fields at rest and applying field-level validation.
- Add authentication/authorization; the example addresses a fixed demo user.
- Handle PHI according to compliance needs; add consent and data retention policies.
- Add structured logging and request IDs to backend.
- Expose health/readiness probes for the FastAPI app.
- Persist Hume emotion aggregates per session instead of in-memory if analytics are needed post-session.
- Enable `/avgs` and integrate real emotion timelines in `results.js`.
- Replace polling with server push (Server-Sent Events or WebSocket) for `latest` updates.
- Implement real auth + per-user vector stores or namespaces.
- Add rate limiting and input validation on all endpoints.
- Migrate to streaming STT to reduce latency; add VAD.
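The rate-limiting item can start as a simple per-client token bucket before reaching for middleware. This is an illustrative sketch, not tied to any specific library; the client key could be an email or IP address.

```python
# Per-client token bucket: each client may burst up to `capacity` requests,
# refilled continuously at `rate` tokens per second.
import time

class TokenBucket:
    def __init__(self, capacity=10, rate=1.0):
        self.capacity = capacity
        self.rate = rate
        self.buckets = {}  # client -> (tokens, last_timestamp)

    def allow(self, client, now=None):
        """Return True and spend a token if the client is under its limit."""
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets.get(client, (self.capacity, now))
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[client] = (tokens, now)
            return False
        self.buckets[client] = (tokens - 1, now)
        return True

limiter = TokenBucket(capacity=2, rate=1.0)
# Two immediate requests pass; a third in the same instant is throttled
# until the bucket refills.
```

In FastAPI this would typically be called from a dependency that extracts the client identity and returns HTTP 429 when `allow()` is False.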
```
backend/
  app.py          # FastAPI app: REST, WS, RAG, polling
  wav.py          # Audio capture + WAV writer
  whisper.py      # Whisper transcription client
  videoProcess.py # Separate FastAPI+Socket.IO demo (not primary)
frontend/
  pages/*.js      # Next.js pages (index, chatbot, setup, etc.)
  components/*.js # VideoChat, ChatHistory, Results, etc.
```
- This README reflects the current code: hardcoded URIs and IDs exist; replace them with env-driven configuration for production.
- If the vector store ID or Hume key ever changes, update both the code and the environment so the two stay consistent.