A real-time, emotionally aware voice agent demo built on Vision Agents and Stream Video. The agent watches your face on the video track, derives an emotion/gaze/engagement state with MediaPipe, and steers Inworld TTS v2 delivery to match — whispering when you look sad, getting animated when you're engaged.
The backend is a Python Vision Agents service (Inworld TTS + Gemini + Deepgram + MediaPipe + Anam avatar). The frontend is a Next.js call experience that joins the Stream call, renders the avatar, and shows captions and metrics.
- Stream Video — Register — API key + secret for the call edge
- Inworld AI — TTS v2 API key (basic-auth base64 token from the console)
- Deepgram — STT API key
- Google AI Studio — Gemini API key
- Anam — avatar API key and avatar ID
- Vision Agents — the underlying agent framework
backend/README.md — deep dive on the face processor and TTS steering
    Browser (Next.js) ──► Stream Edge ◄── Backend (Vision Agents, Python)
         ▲                                       │
         │                                       ├── Deepgram (STT)
         │                                       ├── Gemini (LLM)
         │                                       ├── Inworld (TTS v2)
         │                                       ├── MediaPipe (face state)
         └────────── Anam (avatar video) ◄───────┘
The frontend calls the backend's HTTP API to create and close agent sessions, then joins the same Stream call as the agent. The backend runs the STT → LLM → TTS pipeline and publishes the agent's audio and the Anam avatar video into the call. A MediaPipeFaceProcessor consumes the user's video track at 8 fps and emits a smoothed emotion/gaze/engagement state. That state is prepended to each LLM turn so the model can pick appropriate Inworld steering tags.
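The face-state path described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's MediaPipeFaceProcessor: `FrameThrottle`, `FaceStateSmoother`, `with_face_context`, the smoothing factor, and the `[face_state ...]` header format are all hypothetical names and values chosen for the example. It assumes per-frame emotion scores arrive as a plain dictionary and uses an exponential moving average as one plausible smoothing method.

```python
from dataclasses import dataclass, field
from typing import Optional


class FrameThrottle:
    """Accept at most `fps` frames per second, based on frame timestamps (hypothetical helper)."""

    def __init__(self, fps: float = 8.0) -> None:
        self.min_interval = 1.0 / fps  # 8 fps -> one frame every 0.125 s
        self._last: Optional[float] = None

    def accept(self, timestamp: float) -> bool:
        if self._last is None or timestamp - self._last >= self.min_interval:
            self._last = timestamp
            return True
        return False


@dataclass
class FaceStateSmoother:
    """Exponential moving average over per-frame emotion scores (assumed smoothing method)."""

    alpha: float = 0.3  # weight of the newest frame; an assumed value
    scores: dict = field(default_factory=dict)

    def update(self, frame_scores: dict) -> dict:
        for label, value in frame_scores.items():
            previous = self.scores.get(label, value)  # the first frame seeds the average
            self.scores[label] = self.alpha * value + (1.0 - self.alpha) * previous
        return dict(self.scores)

    def dominant(self) -> str:
        if not self.scores:
            return "unknown"
        return max(self.scores, key=self.scores.get)


def with_face_context(user_text: str, smoother: FaceStateSmoother,
                      gaze: str, engagement: float) -> str:
    """Prepend a one-line face-state summary to the user's turn (header format is illustrative)."""
    header = (f"[face_state emotion={smoother.dominant()} "
              f"gaze={gaze} engagement={engagement:.2f}]")
    return f"{header}\n{user_text}"
```

Smoothing before prepending keeps a single noisy frame from flipping the state the model sees, which matters when that state drives how the reply is voiced.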
backend/ Python Vision Agents service
frontend/ Next.js demo app
cd backend
cp .env.example .env
uv sync
uv run python scripts/download_face_model.py
uv run python main.py serve --host 127.0.0.1 --port 8000

In another terminal:
cd frontend
cp .env.example .env.local
npm install
npm run dev

Fill both env files with the keys from the providers listed above (Stream credentials must match across the two files). Open http://localhost:3000. If you run the backend on a different host or port, set NEXT_PUBLIC_BASE_URL in frontend/.env.local.
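Because a mismatched Stream key is an easy mistake to make, a small check like the one below may help before starting either process. It is a sketch under assumptions: it expects plain `KEY=VALUE` lines, it assumes `STREAM_API_KEY` lives in backend/.env and `NEXT_PUBLIC_STREAM_API_KEY` in frontend/.env.local, and `parse_env`, `stream_key_problems`, and `check_env_files` are hypothetical helpers rather than part of the repo.

```python
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")  # drop optional quotes
    return values


def stream_key_problems(backend: dict, frontend: dict) -> list:
    """Return human-readable problems with the Stream API key in the two env files."""
    problems = []
    backend_key = backend.get("STREAM_API_KEY", "")
    frontend_key = frontend.get("NEXT_PUBLIC_STREAM_API_KEY", "")
    if not backend_key:
        problems.append("backend: STREAM_API_KEY is empty")
    if not frontend_key:
        problems.append("frontend: NEXT_PUBLIC_STREAM_API_KEY is empty")
    if backend_key and frontend_key and backend_key != frontend_key:
        problems.append("Stream API key differs between the two env files")
    return problems


def check_env_files(backend_path: str = "backend/.env",
                    frontend_path: str = "frontend/.env.local") -> list:
    """Read both files (a missing file counts as empty) and compare the Stream key."""
    def read(path: str) -> dict:
        p = Path(path)
        return parse_env(p.read_text()) if p.exists() else {}
    return stream_key_problems(read(backend_path), read(frontend_path))
```

Calling `check_env_files()` from the repository root returns an empty list when both keys are set and equal, and a list of messages otherwise.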
For a backend-only smoke test that opens a Stream demo room directly:
cd backend
uv run python main.py run

You need an account and API key from every provider below before the demo will run end-to-end.
| Provider | Used for | Sign up | Env vars |
|---|---|---|---|
| Stream Video | Call edge (WebRTC), session tokens | getstream.io/video | STREAM_API_KEY, STREAM_API_SECRET, NEXT_PUBLIC_STREAM_API_KEY |
| Inworld AI | TTS v2 with inline steering tags | inworld.ai | INWORLD_API_KEY |
| Deepgram | Speech-to-text | console.deepgram.com | DEEPGRAM_API_KEY |
| Google AI Studio | Gemini LLM | aistudio.google.com | GOOGLE_API_KEY |
| Anam | Lip-synced avatar video | anam.ai | ANAM_API_KEY, ANAM_AVATAR_ID |
The frontend additionally needs a user JWT (NEXT_PUBLIC_STREAM_TOKEN) and user ID (NEXT_PUBLIC_STREAM_USER_ID). See frontend/.env.example for the full list.
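For local testing, the user JWT can be produced with a few lines of standard-library Python. This sketch assumes Stream accepts HS256-signed user tokens whose payload carries a `user_id` claim and that the signing key is the value of `STREAM_API_SECRET`; `make_user_token` is a hypothetical helper, the token has no expiry, and the official Stream SDKs are the safer route for anything beyond a demo.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_user_token(api_secret: str, user_id: str) -> str:
    """Build an HS256 JWT with a `user_id` claim (assumed Stream token shape)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"},
                                separators=(",", ":")).encode("utf-8"))
    payload = _b64url(json.dumps({"user_id": user_id},
                                 separators=(",", ":")).encode("utf-8"))
    signing_input = f"{header}.{payload}".encode("ascii")
    signature = hmac.new(api_secret.encode("utf-8"), signing_input,
                         hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(signature)}"
```

The `user_id` passed here should be the same value you put in NEXT_PUBLIC_STREAM_USER_ID, and the resulting string goes in NEXT_PUBLIC_STREAM_TOKEN. Keep the API secret on the backend side only.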
Stream is free for most side and hobby projects. To qualify, your project or company must have fewer than 5 team members and less than $10k in monthly revenue. For complete pricing details, see the Video Pricing Page.
MIT — see LICENSE.