Aelios Spark

Make any web app voice-controllable in 37 languages in 5 minutes.

Aelios Spark is an open-source voice control layer for web apps. Drop in a widget, define a few tools, and your users can operate your software by talking to it — creating records, navigating screens, running queries, all hands-free.

flowchart LR
    user(["🎙️ User"])
    subgraph host["Your web app (browser)"]
        widget["Aelios Spark widget<br/>+ your tool defs"]
    end
    subgraph server["Your machine / VPS"]
        agent["Aelios Spark agent server<br/>(Python / Pipecat)"]
        cfg[("aelios-spark.config.yaml<br/>prompt · persona · KB")]
        agent -.reads.-> cfg
    end
    user <-->|voice| widget
    widget <==>|"WebRTC<br/>(audio + tool RPC)"| agent
    agent -->|"LLM, STT, TTS"| providers[("OpenAI · Deepgram<br/>Cartesia · Daily")]

No hosted backend. No SaaS sign-up. You run the agent server yourself, define tools in your app code, and the voice loop runs locally. Bring your own API keys for OpenAI, Daily, Deepgram, Cartesia, and Google AI Studio.

Desktop only today. The widget refuses to mount on viewports narrower than 768px and tears down any live session if the window shrinks below that threshold. Mobile support is on the roadmap.
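The viewport rule can be pictured as a small decision function. This is only a sketch: the 768px threshold comes from the paragraph above, while `decideForViewport`, its parameters, and the decision names are hypothetical and not taken from the widget source.

```typescript
// Hypothetical names throughout; only the 768px threshold comes from the README.
const MIN_VIEWPORT_WIDTH_PX = 768;

type ViewportDecision = "mount" | "refuse-mount" | "teardown-session" | "no-op";

/**
 * Decide what the widget should do for a given viewport width.
 * `mounted` says whether the widget is already on the page,
 * `sessionLive` whether a voice session is in progress.
 */
function decideForViewport(
  widthPx: number,
  mounted: boolean,
  sessionLive: boolean,
): ViewportDecision {
  const wideEnough = widthPx >= MIN_VIEWPORT_WIDTH_PX;
  if (!mounted) {
    // Narrower than the threshold: the widget never mounts.
    return wideEnough ? "mount" : "refuse-mount";
  }
  if (!wideEnough && sessionLive) {
    // The window shrank below the threshold during a session.
    return "teardown-session";
  }
  // The README does not say what happens to an idle, already-mounted widget,
  // so this sketch leaves it alone.
  return "no-op";
}
```

In a browser, a function like this would presumably be fed `window.innerWidth` at load time and again on every `resize` event.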

Looking for production scale? The managed version is Aelios AI — autoscaling, multi-tenant agents, hosted control plane, continuous-learning loops, and a separate video demo agent that learns your software and streams hands-free product demos 24/7. The OSS Aelios Spark widget is the same code path; the managed platform adds the surfaces around it.

Quick start — get it running in 5 minutes
How it works — a session, end to end
Two registration patterns — how the host page wires everything up
Two modes — action and guide — the agent operates your app, OR it narrates it
Languages — the 37 the widget ships with
What you need — bring-your-own-key providers
Repo layout
Deep documentation — the rest of the system, one doc per concern
Contributing
License

Quick start

You need three things running:

Agent server (Python, this repo)
Widget bundle (TypeScript, this repo — build once)
Your web app (where the widget gets embedded)

# Clone
git clone https://github.com/Aelios-AI/aelios-spark
cd aelios-spark

# 1. Agent server
cd packages/agent-server
cp .env.example .env       # paste in OPENAI_API_KEY, DAILY_API_KEY, etc.
uv sync
uv run python server.py    # serves :3002

# 2. Widget bundle (in another terminal, from the repo root)
cd packages/widget
npm install
npm run build              # produces dist/aelios-spark-widget.js

# 3. Try the example app
cd ../../examples/tracker
npm install && npm run copy-widget
npm run dev                # → http://localhost:5180

Open the example, click the launcher, and talk to your tasks app. Try "create a task to ship the release notes by Friday" or "list tasks assigned to Alice".

Full step-by-step with troubleshooting in docs/quickstart.md.

How it works

sequenceDiagram
    autonumber
    participant User
    participant Page as Host page
    participant Widget as Aelios Spark widget
    participant Server as Agent server
    participant LLM as LLM + STT + TTS

    Page->>Widget: AeliosSpark.configure({...}), AeliosSpark.defineTool(...)
    User->>Widget: Click launcher, pick language/mode
    Widget->>Server: POST /start (tools + lang + mode)
    Server->>Server: Load aelios-spark.config.yaml<br/>(prompt + persona + KB)
    Server-->>Widget: Daily room URL + token
    Note over Widget,Server: WebRTC voice loop established

    loop Conversation turn
        User->>Widget: speaks
        Widget->>Server: audio (WebRTC)
        Server->>LLM: STT → reason → TTS
        LLM-->>Server: tool calls + reply
        Server->>Widget: tool_call_batch (RTVI)
        Widget->>Page: invoke registered tool fn
        Page-->>Widget: result
        Widget-->>Server: tool_result
        Server-->>Widget: spoken reply (audio)
        Widget-->>User: speaks
    end

A session has three layers:

Widget runs in your visitor's browser. It captures audio, renders the chrome, holds the tool registry, and talks to the agent server over WebRTC + RTVI.
Agent server runs on your machine (or VPS). It hosts a Pipecat pipeline — STT → LLM → TTS → audio out — plus the InAppAgentProcessor state machine that schedules tool calls, manages demonstrations, requests screenshots, runs idle timers, and applies schema-gated structured output.
Your web app (the host page) registers tools and calls AeliosSpark.configure(...) to point at the agent server and tweak the pill's position + theme colors.

Tool calls flow over the RTVI data channel; audio flows over WebRTC. Everything is one-session-per-process — no shared state.
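Seen from the widget, the tool half of a conversation turn amounts to looking up each requested tool in the registry, running it, and sending one result back per call. The sketch below illustrates that idea under stated assumptions: the README names `tool_call_batch` and `tool_result` but does not spell out their fields, so `ToolCall`, `ToolResult`, `callId`, and `runBatch` are hypothetical, and running the batch in parallel is also an assumption.

```typescript
// Message shapes are hypothetical; the README does not document them.
type ToolFn = (args: Record<string, unknown>) => Promise<unknown>;

type ToolCall = { callId: string; name: string; args: Record<string, unknown> };

type ToolResult =
  | { callId: string; ok: true; result: unknown }
  | { callId: string; ok: false; error: string };

/** Run every call in a batch and return one result per call, in order. */
async function runBatch(
  registry: Map<string, ToolFn>,
  calls: ToolCall[],
): Promise<ToolResult[]> {
  return Promise.all(
    calls.map(async (call): Promise<ToolResult> => {
      const fn = registry.get(call.name);
      if (fn === undefined) {
        return { callId: call.callId, ok: false, error: `unknown tool: ${call.name}` };
      }
      try {
        // The host page's function does the real work (API call, navigation, ...).
        return { callId: call.callId, ok: true, result: await fn(call.args) };
      } catch (err) {
        // A failing tool becomes an error result instead of breaking the batch.
        const message = err instanceof Error ? err.message : String(err);
        return { callId: call.callId, ok: false, error: message };
      }
    }),
  );
}
```

Turning failures into ordinary results keeps one broken tool from stalling the turn, and gives the agent something it can explain to the user.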

For the full architecture (priority queue, five wake modes, demonstrations, screenshot service, tool dispatcher, watchdogs, the RTVI custom-message protocol), read docs/architecture.md.

Two registration patterns

The host page interacts with the widget through two patterns. They serve different concerns and can be called in any order.

Pattern 1 — `AeliosSpark.configure({...})`: agent URL + widget look

Tells the widget where the agent server is and how it should look. The full surface is small — see docs/configuration.md:

AeliosSpark.configure({
    agentUrl: "http://localhost:3002/start",
    branding: {
        position: "bottom-right",      // or "bottom-left"
        themeColors: {                 // optional palette override
            primary: "#F4F5F7",
            bg: "#0A0A0A",
            text: "#F4F5F7",
            muted: "#A0A0A0",
            onPrimary: "#0A0A0A",
        },
    },
});
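Because `themeColors` is an optional override, a partial palette has to be merged over some set of defaults. The sketch below shows one way that merge could work. The five key names come from the example above, but the default values (reused from that example), the `#RRGGBB` validation, and the `resolveTheme` name are assumptions, not the widget's documented behaviour.

```typescript
// The five palette keys come from the README example.
type ThemeColors = {
  primary: string;
  bg: string;
  text: string;
  muted: string;
  onPrimary: string;
};

// Hypothetical defaults, reusing the values shown in the example above.
const DEFAULT_THEME: ThemeColors = {
  primary: "#F4F5F7",
  bg: "#0A0A0A",
  text: "#F4F5F7",
  muted: "#A0A0A0",
  onPrimary: "#0A0A0A",
};

const HEX_COLOR = /^#[0-9a-fA-F]{6}$/;

/** Merge a partial override over the defaults and reject malformed colours. */
function resolveTheme(override: Partial<ThemeColors> = {}): ThemeColors {
  const merged: ThemeColors = { ...DEFAULT_THEME, ...override };
  for (const [key, value] of Object.entries(merged)) {
    if (!HEX_COLOR.test(value)) {
      throw new Error(`themeColors.${key} is not a #RRGGBB colour: ${value}`);
    }
  }
  return merged;
}
```

With this shape, `resolveTheme({ primary: "#FF0000" })` changes only the primary colour and leaves the other four keys at their defaults.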

Pattern 2 — `AeliosSpark.defineTool({...})`: callable functions

Registers a tool the agent can invoke during voice turns. Tools accumulate in an in-memory registry; at session start, the registry is forwarded to the agent server as the session's tool set.

AeliosSpark.defineTool({
    name: "create_contact",
    description: "Add a new contact. Use when the user says 'add' or names a new person.",
    parameters: {
        type: "object",
        properties: {
            name: { type: "string" },
            email: { type: "string" },
        },
        required: ["name"],
    },
    execute: async ({ name, email }) => myApi.createContact({ name, email }),
    requiresConfirmation: false,    // set true for destructive ops
});
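Forwarding the registry at session start implies that only the serialisable part of each tool crosses the wire, since an `execute` function cannot be put into JSON. The sketch below illustrates that split. The `tools`, `language`, and `mode` field names and the `buildStartPayload` helper are assumptions; the README only says the `POST /start` body carries the tools, the language, and the mode.

```typescript
// Payload field names are assumptions, not the documented wire format.
type JsonSchema = {
  type: "object";
  properties: Record<string, { type: string }>;
  required?: string[];
};

type ToolDefinition = {
  name: string;
  description: string;
  parameters: JsonSchema;
  execute: (args: Record<string, unknown>) => Promise<unknown>;
  requiresConfirmation?: boolean;
};

type WireTool = Omit<ToolDefinition, "execute">;

type StartPayload = {
  tools: WireTool[];
  language: string;
  mode: "action" | "guide";
};

/** Build a JSON-safe session-start body from the in-memory tool registry. */
function buildStartPayload(
  tools: ToolDefinition[],
  language: string,
  mode: "action" | "guide",
): StartPayload {
  // `execute` stays in the browser; the server only needs the description.
  const wireTools: WireTool[] = tools.map((tool) => ({
    name: tool.name,
    description: tool.description,
    parameters: tool.parameters,
    requiresConfirmation: tool.requiresConfirmation ?? false,
  }));
  return { tools: wireTools, language, mode };
}

// A destructive tool (hypothetical) opts in to confirmation.
const sampleTools: ToolDefinition[] = [
  {
    name: "delete_contact",
    description: "Permanently delete a contact by name.",
    parameters: { type: "object", properties: { name: { type: "string" } }, required: ["name"] },
    execute: async () => "deleted",
    requiresConfirmation: true,
  },
];
```

Calling `JSON.stringify(buildStartPayload(sampleTools, "en", "action"))` yields a body that describes `delete_contact` without carrying the function itself.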

The `AeliosSparkReady` queue — order-independent setup

Both patterns work through a callback queue so they're safe to call before the widget bundle has finished loading:

<script src="/aelios-spark-widget.js" data-agent-url="http://localhost:3002/start"></script>
<script>
  window.AeliosSparkReady = window.AeliosSparkReady || [];
  window.AeliosSparkReady.push((AeliosSpark) => {
    AeliosSpark.configure({ ... });
    AeliosSpark.defineTool({ ... });
    AeliosSpark.defineTool({ ... });
  });
</script>
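A queue like this is commonly drained by running every callback that was pushed early and then swapping the array for an object whose `push` runs callbacks immediately. The sketch below shows that general pattern. It is not taken from the widget source, and `drainReadyQueue`, `ReadyCallback`, and `ReadyQueue` are hypothetical names.

```typescript
// A generic "push callbacks now, run them once the API exists" pattern.
type ReadyCallback<Api> = (api: Api) => void;

type ReadyQueue<Api> = {
  push: (...callbacks: ReadyCallback<Api>[]) => number;
};

function drainReadyQueue<Api>(
  pending: ReadyCallback<Api>[],
  api: Api,
): ReadyQueue<Api> {
  // 1. Run everything the host page queued before the bundle loaded.
  for (const callback of pending) {
    callback(api);
  }
  // 2. Return a replacement whose push() runs late callbacks right away,
  //    so host code does not need to know whether the bundle has loaded.
  let count = pending.length;
  return {
    push: (...callbacks) => {
      for (const callback of callbacks) {
        callback(api);
        count += 1;
      }
      return count; // mirrors Array.prototype.push, which returns the new length
    },
  };
}
```

Under this pattern the bundle would, on load, replace the global array with the returned object, which is why the host page's `window.AeliosSparkReady.push(...)` call reads the same before and after loading.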

Then, on the server side, tell the agent who it is, what your software is, and what it should know about it. All of this goes in packages/agent-server/aelios-spark.config.yaml. Both the agent's persona and the host software's knowledge base live here, because both get baked into the system prompt the LLM sees every turn:

agent:                          # who the agent is
  name: "Acme Assistant"
  personality: "Friendly and precise."

software:                       # the app the widget is embedded in
  name: "Acme CRM"
  tldr: "A simple CRM for small teams."
  docs_file: "./knowledge.md"   # KB the agent draws on for every reply

additional_instructions: |      # any extra business rules / style notes
  You operate Acme CRM on behalf of the user via voice. Be concise.

Restart the agent server and refresh your app — voice control is live.

Full tool-writing guide in docs/tools.md. Full widget config schema in docs/configuration.md.

Two modes — action and guide

Aelios Spark sessions run in one of two modes. The visitor picks at session start; the choice is frozen for the session.

Capability	`action` (default)	`guide`
Calls your tools	yes	no
Sees the screen	only when the agent decides	every turn
Points to UI	no	yes (ghost cursor)
Best for	operating your app	narrating your app

Action mode is the agent operating your software on the visitor's behalf — voice-driven CRUD, dictation-with-effects, hands-free workflows. The agent only sees the screen when it explicitly requests a screenshot.

Guide mode is read-only narration with on-screen pointing — onboarding, accessibility, sales demos. The agent gets a screenshot every turn and can drop a ghost cursor (an arrow + fixed "Agent" tag) onto any element on the page; what to do there is conveyed by the spoken reply itself. It cannot call tools; the schema literally drops the tool_invocations field.

Both modes run through the same InAppAgentProcessor, but each has its own Jinja system-prompt template: IN_APP_AGENT_TURN_TEMPLATE for action and IN_APP_AGENT_GUIDE_TURN_TEMPLATE for guide. Guide mode has no tools, no demonstrations, and no batches, so a shared template would bury the relevant instructions under sections the LLM has to skip every turn. Schema gating layers on top of the templates: the structured-output schema for guide mode omits tool_invocations, so the model cannot emit a tool call even if the prompt were ignored. The full breakdown (when to use each, the schema differences, the two-trigger rule, the confirmation flow) is in docs/modes.md.
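Schema gating can be pictured as building the structured-output schema per mode and leaving a field out entirely when the mode must not use it. The sketch below illustrates the idea only: `tool_invocations` and the two mode names come from the README, while `spoken_reply`, `pointer_target`, `FieldSpec`, and `responseSchemaFor` are placeholder names, and the real server is Python rather than TypeScript.

```typescript
// Only `tool_invocations` and the mode names are from the README.
type SessionMode = "action" | "guide";

type FieldSpec = { type: string; description: string };

/** Return the fields the LLM is allowed to emit in the given mode. */
function responseSchemaFor(mode: SessionMode): Record<string, FieldSpec> {
  const schema: Record<string, FieldSpec> = {
    // Hypothetical field shared by both modes.
    spoken_reply: { type: "string", description: "Text to speak to the user" },
  };
  if (mode === "action") {
    schema["tool_invocations"] = {
      type: "array",
      description: "Tools to call this turn",
    };
  } else {
    // Hypothetical field: where the ghost cursor should point.
    schema["pointer_target"] = {
      type: "string",
      description: "Element the ghost cursor should point at",
    };
  }
  return schema;
}
```

Omitting the field, rather than asking the model to leave it empty, means a guide-mode reply has no slot in which a tool call could appear.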

Languages

The widget ships a hardcoded 37-language picker that visitors choose from at session start. The chosen language code is sent in the /start body; the agent server runs Deepgram Nova-3 STT for all 37 (configured per-session via the language enum) and Cartesia handles TTS.

🇸🇦 Arabic · 🇧🇬 Bulgarian · 🇨🇳 Chinese · 🇭🇷 Croatian · 🇨🇿 Czech · 🇩🇰 Danish · 🇳🇱 Dutch · 🇺🇸 English · 🇫🇮 Finnish · 🇫🇷 French · 🇩🇪 German · 🇬🇷 Greek · 🇮🇳 Gujarati · 🇮🇱 Hebrew · 🇮🇳 Hindi · 🇭🇺 Hungarian · 🇮🇩 Indonesian · 🇮🇹 Italian · 🇯🇵 Japanese · 🇮🇳 Kannada · 🇰🇷 Korean · 🇲🇾 Malay · 🇮🇳 Marathi · 🇳🇴 Norwegian · 🇵🇱 Polish · 🇵🇹 Portuguese · 🇷🇴 Romanian · 🇷🇺 Russian · 🇸🇰 Slovak · 🇪🇸 Spanish · 🇸🇪 Swedish · 🇵🇭 Tagalog · 🇮🇳 Tamil · 🇮🇳 Telugu · 🇹🇭 Thai · 🇹🇷 Turkish · 🇻🇳 Vietnamese
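Since the chosen code travels in the `/start` body, the receiving side benefits from checking it against the fixed list before configuring STT and TTS. The sketch below shows one way to do that. The README lists language names, not codes, so the ISO 639-1 codes, the fallback to English, and the `resolveLanguage` name are all assumptions.

```typescript
// ISO 639-1 codes are an assumption; the widget may use different identifiers.
const PICKER_LANGUAGES: ReadonlyMap<string, string> = new Map([
  ["ar", "Arabic"], ["bg", "Bulgarian"], ["zh", "Chinese"], ["hr", "Croatian"],
  ["cs", "Czech"], ["da", "Danish"], ["nl", "Dutch"], ["en", "English"],
  ["fi", "Finnish"], ["fr", "French"], ["de", "German"], ["el", "Greek"],
  ["gu", "Gujarati"], ["he", "Hebrew"], ["hi", "Hindi"], ["hu", "Hungarian"],
  ["id", "Indonesian"], ["it", "Italian"], ["ja", "Japanese"], ["kn", "Kannada"],
  ["ko", "Korean"], ["ms", "Malay"], ["mr", "Marathi"], ["no", "Norwegian"],
  ["pl", "Polish"], ["pt", "Portuguese"], ["ro", "Romanian"], ["ru", "Russian"],
  ["sk", "Slovak"], ["es", "Spanish"], ["sv", "Swedish"], ["tl", "Tagalog"],
  ["ta", "Tamil"], ["te", "Telugu"], ["th", "Thai"], ["tr", "Turkish"],
  ["vi", "Vietnamese"],
]);

/**
 * Normalise a requested language code and fall back when it is not in the
 * picker list. Falling back to English is this sketch's choice, not
 * documented behaviour.
 */
function resolveLanguage(requested: string, fallback: string = "en"): string {
  const code = requested.trim().toLowerCase();
  return PICKER_LANGUAGES.has(code) ? code : fallback;
}
```

A server could equally reject unknown codes with an error; which is preferable depends on whether a silently different language is worse than a failed session start.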

All 37 ship with native Cartesia voices out of the box. All bundled voices are female — if you set agent.name in aelios-spark.config.yaml, pick a feminine name so the persona name and the spoken voice match. Operators who want a different voice (different gender, different accent, custom clone) should override per-agent via voice_languages or edit CARTESIA_TTS_VOICES in adapters/languages.py.

The picker list is fixed in Widget.tsx and not host-configurable.

What you need

Bring-your-own-key. None of these are baked in:

Provider	What for	Required
OpenAI	Main LLM	yes
Daily	WebRTC transport	yes (free tier covers dev)
Deepgram	Speech-to-text — Nova-3 covers all 37 languages	yes
Cartesia	Agent's voice (text-to-speech)	yes
Google AI Studio	Gemini — conversation-history summarisation	yes

See packages/agent-server/.env.example.
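Since all five providers are required, a quick pre-flight check for missing keys can save a confusing startup failure. The sketch below is illustrative only: `OPENAI_API_KEY` and `DAILY_API_KEY` appear in the quick start, but the other three variable names are guesses, so check `.env.example` for the real ones. `missingKeys` is a hypothetical helper, not part of the repo.

```typescript
type EnvVars = Record<string, string | undefined>;

// The first two names appear in the README; the remaining three are assumed.
const REQUIRED_KEYS = [
  "OPENAI_API_KEY",
  "DAILY_API_KEY",
  "DEEPGRAM_API_KEY",
  "CARTESIA_API_KEY",
  "GOOGLE_API_KEY",
] as const;

/** Return the required variables that are unset or blank, in table order. */
function missingKeys(env: EnvVars): string[] {
  return REQUIRED_KEYS.filter((key) => (env[key] ?? "").trim() === "");
}
```

In a Node script this would be called as `missingKeys(process.env)`, printing the returned names before the server is launched.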

Want a different LLM? The agent server talks to LLMs through LangChain, so switching providers is a LangChain swap — Anthropic, Google, Mistral, Cohere, local models via Ollama / vLLM, anything LangChain supports. Two call sites: brain/processor.py for the main agent loop (currently ChatOpenAI) and brain/conversation_history.py for the cheap summarizer (currently ChatGoogleGenerativeAI).

Want a different STT/TTS/Transport provider? All voice services and the transport service are drop-in Pipecat adapters — swap them in bot.py and you can run on Whisper, ElevenLabs, Riva, AssemblyAI, SmallWebRTC, etc. See the Pipecat services docs.

Repo layout

aelios-spark/
├── packages/
│   ├── widget/         the embeddable JS — runs in your users' browsers
│   └── agent-server/   the Python voice agent — you run this
├── examples/
│   └── tracker/        full sample app showing how to wire everything up
├── docs/               deep documentation (read these — see below)
├── CONTRIBUTING.md     dev setup, test architecture, PR process
└── LICENSE             Apache 2.0

Deep documentation

One doc per concern. The README is the orientation; these are the manual.

Doc	What it covers
`docs/quickstart.md`	Step-by-step setup with troubleshooting
`docs/architecture.md`	The agent server end-to-end: Pipecat pipeline, processor state machine, priority queue, five wake modes, tool dispatcher, demonstrations, screenshot service, conversation history, watchdogs, RTVI custom-message protocol
`docs/modes.md`	Action vs guide mode — the schema differences, the two-trigger rule, confirmation flow, screenshot behaviour, when to use each
`docs/widget.md`	Widget bundle anatomy, connection state machine, session timing rules (90-min cap, 6-min connecting timeout, etc.), idle protocol, error states, mock mode, theming
`docs/tools.md`	Writing tool definitions — when to call, return values, parallel batches, confirmation flow, common patterns
`docs/configuration.md`	Every config knob — widget-side (`AeliosSpark.configure(...)`) and server-side (`aelios-spark.config.yaml`), env vars, provider swaps
`packages/agent-server/tests/README.md`	Three-layer test architecture (unit / processor / real-LLM-judge), when to add tests at which layer

Read in roughly that order if you want to understand the whole system.

Built on Pipecat

The agent server is built on top of Pipecat, the open-source framework for voice + multimodal conversational AI. All STT/TTS/transport wrappers live in packages/agent-server/adapters/, so you can swap in any of Pipecat's services and Aelios Spark keeps working.

Contributing

PRs welcome — see CONTRIBUTING.md for dev setup, the three-layer test contract, the contributions matrix, and code style.

Aelios Spark is a real OSS project backed by a real production agent loop, so changes that touch the agent state machine get reviewed carefully. The "Reviewed carefully" rows in CONTRIBUTING flag exactly which areas those are.

Managed offering

For production, Aelios AI wraps the OSS agent code path with the surfaces a serious deployment actually needs:

Autoscaling, multi-tenant agents, hosted control plane — no infra to operate.
Observability — per-session traces, transcripts, tool call/result audit, latency breakdowns.
Continuous-learning loops — session analytics feed back into the agent's persona / KB / tool descriptions so the agent gets better at your specific software over time.
Video demo agent — a separate agent product that learns your software's UI from your docs + recorded screen flows, then drives on-screen demo videos hands-free. Runs 24/7 so prospects can watch a live product walk-through any time without sales-team scheduling. Same conversational core as the widget; different delivery surface.

Graduate when you outgrow self-hosting.

License

Apache 2.0.
