A self-hosted AI virtual assistant designed for consumer hardware running a local LLM. Every architectural decision is shaped by two constraints:
- Context discipline. Self-hosted models degrade as their context fills. Yenna caps the effective window (default 16k tokens), delegates work to short-lived subagents, and auto-compacts when the budget is exceeded.
- Determinism beats agent autonomy. Small models loop, forget, and rarely take initiative. The harness — not the model — enforces invariants: consecutive-tool-call detection, memory recall/ingestion at lifecycle boundaries, scheduled check-ins via heartbeats.
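The context-discipline constraint can be pictured with a small budget check. This is a minimal sketch, not Yenna's actual compaction code: the names `estimateTokens`, `needsCompaction`, and `compact` are hypothetical, and the 4-characters-per-token estimate is an assumption standing in for a real tokenizer.

```typescript
// Hypothetical sketch: all names and the 4-chars-per-token heuristic are assumptions.
interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

const DEFAULT_BUDGET_TOKENS = 16_000; // the default effective window mentioned above

// Crude estimate: about 4 characters per token.
function estimateTokens(messages: Message[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}

function needsCompaction(messages: Message[], budget = DEFAULT_BUDGET_TOKENS): boolean {
  return estimateTokens(messages) > budget;
}

// Keep system messages plus the newest messages that still fit the budget.
// Everything older is replaced by one placeholder message.
function compact(messages: Message[], budget = DEFAULT_BUDGET_TOKENS): Message[] {
  if (!needsCompaction(messages, budget)) return messages;
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  const kept: Message[] = [];
  let used = estimateTokens(system);
  for (const m of [...rest].reverse()) {
    const cost = estimateTokens([m]);
    if (used + cost > budget) break;
    kept.unshift(m);
    used += cost;
  }
  const dropped = rest.length - kept.length;
  const summary: Message = {
    role: "system",
    content: `[${dropped} earlier messages compacted]`,
  };
  return [...system, summary, ...kept];
}
```

A real implementation would likely summarize the dropped messages with an LLM call rather than insert a fixed placeholder; the sketch only shows where the budget check sits.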
You need Bun and an OpenAI-compatible LLM endpoint (Ollama, vLLM, llama.cpp server, LM Studio, etc.).
# 1. Install workspace deps.
bun install
# 2. Drop a config in place — copy the skeleton and edit the values.
mkdir -p ~/.yenna
cp config.example.yaml ~/.yenna/config.yaml
$EDITOR ~/.yenna/config.yaml
# 3. Start the daemon.
bun run start
# 4. In another terminal, open the TUI.
bun run tui

If you don't have an LLM endpoint yet, docker compose up ollama brings one
up on localhost:11434; then docker exec ollama ollama pull qwen2.5:7b
(or whichever model you configured).
              Unix socket: ~/.yenna/yenna.sock
               HTTP + WebSocket via Bun.serve
                              │
┌──────────────┐    ┌─────────┴──────────┐    ┌──────────────────────┐
│ TUI (Ink)    │────│ yenna daemon       │────│ Telegram (in-daemon) │
│ packages/tui │    │ packages/core      │    │ channels/telegram    │
└──────────────┘    │                    │    └──────────────────────┘
                    │ - agent loop       │
                    │ - hooks registry   │
                    │ - tool registry    │
                    │ - skill registry   │
                    │ - scheduler        │
                    │ - SQLite store     │
                    │ - LLM adapter      │
                    └──────────┬─────────┘
                               │ OpenAI-compatible /v1/chat /v1/embeddings
                               ▼
     Self-hosted inference (Ollama / vLLM / llama.cpp / …)
yenna/
├── config.example.yaml # copy to ~/.yenna/config.yaml
├── packages/
│ ├── core/ # the daemon
│ │ ├── src/
│ │ │ ├── agent/ # run-turn loop, ContextBuilder, types
│ │ │ ├── channels/ # Channel interface, TUI + Telegram
│ │ │ ├── config/ # YAML loader, env interpolation, Zod schema
│ │ │ ├── hooks/ # registry + built-in hooks
│ │ │ ├── llm/ # Vercel AI SDK adapter + MockChatModel
│ │ │ ├── memory/ # markdown-file memory store + tools
│ │ │ ├── persistence/ # bun:sqlite + DAOs
│ │ │ ├── protocol/ # wire-protocol types (shared with SDK)
│ │ │ ├── rag/ # vector storage + cosine search
│ │ │ ├── scheduled-jobs/ # heartbeat job, signal providers
│ │ │ ├── scheduler/ # interval scheduler
│ │ │ ├── shell-policy/ # tier classifier + composition detection
│ │ │ ├── skills/ # discovery, indexing, run_skill
│ │ │ ├── tools/ # tool registry + fs / shell built-ins
│ │ │ ├── transcript/ # JSONL pretty-printer + CLI
│ │ │ ├── transport/ # Bun.serve setup + ConversationHub
│ │ │ └── web/ # search providers + web_search/fetch
│ │ ├── skills/ # bundled skills (web-research, summarize-page)
│ │ └── test/
│ ├── sdk/ # @yenna/sdk — typed daemon client
│ └── tui/ # @yenna/tui — Ink chat UI
└── docker-compose.yml # ollama service + optional yenna container
All settings live in ~/.yenna/config.yaml (or $YENNA_CONFIG_PATH). The
config.example.yaml at the repo root is a complete
skeleton with every field documented inline.
Secrets reference environment variables: ${env:VAR_NAME} is replaced at
load time. Useful for llm.api_key, channels.telegram.bot_token,
web_search.brave.api_key.
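The interpolation step can be sketched as a single regex replace. This is an illustration under assumptions: the function name `interpolateEnv` is hypothetical, and throwing on a missing variable is one possible policy, not necessarily what the loader does.

```typescript
// Hypothetical sketch of ${env:VAR_NAME} interpolation.
// The function name and the throw-on-missing policy are assumptions.
function interpolateEnv(
  input: string,
  env: Record<string, string | undefined>,
): string {
  return input.replace(
    /\$\{env:([A-Za-z_][A-Za-z0-9_]*)\}/g,
    (_match: string, name: string) => {
      const value = env[name];
      if (value === undefined) {
        throw new Error(`Missing environment variable: ${name}`);
      }
      return value;
    },
  );
}
```

Running the raw YAML text through a function like this before parsing keeps secrets out of the file itself; failing loudly on a missing variable avoids starting the daemon with an empty token.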
Built-in tools available to the agent:
- fs_read, fs_write, fs_list, fs_mkdir: workspace-scoped filesystem
- shell: runs via bash -c with tier-based policy gating
- web_search, web_fetch: pluggable search provider + HTML→markdown
- search_skills, run_skill: skill discovery + sub-agent dispatch
- recall_memory, save_memory: RAG over markdown memory store
The shell tool classifies commands into safe / mutating / dangerous
tiers and prompts for approval when something exceeds auto_approve_tiers.
Chained commands (|, &&, ;, redirections) are always elevated to
dangerous.
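A rough sketch of tier classification with composition detection follows. The command lists, the regex, and the function names are illustrative assumptions; the real classifier in shell-policy/ is not shown in this README and is likely more thorough.

```typescript
type Tier = "safe" | "mutating" | "dangerous";

// Illustrative command lists only; these are assumptions, not Yenna's rules.
const SAFE_COMMANDS = new Set(["ls", "cat", "pwd", "echo", "grep"]);
const MUTATING_COMMANDS = new Set(["mkdir", "touch", "cp", "mv"]);

// Pipes, &&, ||, ;, redirections, backticks, and $( ) all count as composition.
// The check is deliberately conservative: it also flags these characters
// when they appear inside quoted strings.
const COMPOSITION = /[|;&<>`]|\$\(/;

function classify(command: string): Tier {
  if (COMPOSITION.test(command)) return "dangerous";
  const program = command.trim().split(/\s+/)[0] ?? "";
  if (SAFE_COMMANDS.has(program)) return "safe";
  if (MUTATING_COMMANDS.has(program)) return "mutating";
  return "dangerous"; // unknown programs default to the strictest tier
}

// A command needs a prompt when its tier is not in auto_approve_tiers.
function needsApproval(command: string, autoApproveTiers: Tier[]): boolean {
  return !autoApproveTiers.includes(classify(command));
}
```

Elevating every composed command avoids having to reason about what each half of a pipeline does, at the cost of more approval prompts.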
A skill is a directory containing a SKILL.md with YAML frontmatter:
---
name: my-skill
description: One-line description for the agent to decide whether to use this.
tools: [shell, fs_write] # optional; restricts the sub-agent's tools
---
# Skill body
The full markdown content is loaded as the sub-agent's system prompt.

Skills live in:
- packages/core/skills/ (bundled with yenna)
- ~/.agents/skills/ (user-installed, using the skills.sh convention so existing skills work unchanged)
User skills override bundled ones with the same name. The bundled set
currently ships web-research and summarize-page as templates.
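The frontmatter format and the override rule can be sketched as follows. This is a hedged illustration: `parseSkill` and `mergeSkills` are hypothetical names, and the hand-rolled reader only understands the three fields shown above, where a real implementation would presumably use a YAML parser.

```typescript
interface SkillMeta {
  name: string;
  description: string;
  tools?: string[];
}

// Minimal frontmatter reader for name, description, and tools.
// It is not a YAML parser; it exists only to illustrate the file layout.
function parseSkill(markdown: string): { meta: SkillMeta; body: string } {
  const match = /^---\n([\s\S]*?)\n---\n?([\s\S]*)$/.exec(markdown);
  if (!match) throw new Error("SKILL.md is missing YAML frontmatter");
  const fields = new Map<string, string>();
  for (const line of (match[1] ?? "").split("\n")) {
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    fields.set(line.slice(0, idx).trim(), line.slice(idx + 1).trim());
  }
  const name = fields.get("name");
  const description = fields.get("description");
  if (!name || !description) throw new Error("name and description are required");
  const meta: SkillMeta = { name, description };
  const toolsRaw = fields.get("tools");
  if (toolsRaw) {
    const list = /^\[(.*?)\]/.exec(toolsRaw); // ignores a trailing "# comment"
    if (list) {
      meta.tools = (list[1] ?? "")
        .split(",")
        .map((t) => t.trim())
        .filter(Boolean);
    }
  }
  return { meta, body: (match[2] ?? "").trim() };
}

// User skills override bundled ones with the same name.
function mergeSkills(bundled: SkillMeta[], user: SkillMeta[]): Map<string, SkillMeta> {
  const merged = new Map<string, SkillMeta>();
  for (const skill of bundled) merged.set(skill.name, skill);
  for (const skill of user) merged.set(skill.name, skill); // later entries win
  return merged;
}
```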
Memories are markdown files in ~/.yenna/memory/. The agent automatically:
- Recalls relevant memories at the start of each user turn (MemoryRecaller hook).
- Ingests new memorable facts asynchronously at turn end (MemoryIngester hook, a fire-and-forget LLM extraction call).
You can also inspect and hand-edit the files directly — they're not a black box.
Every conversation has a JSONL transcript at
~/.yenna/logs/<conversation-id>.jsonl covering all lifecycle events. To
read one:
bun run --cwd packages/core transcript <conversation-id>

A channel is a delivery target for assistant messages. The TUI channel delivers via the WebSocket stream; Telegram delivers via the Bot API.
To enable Telegram:
channels:
  telegram:
    enabled: true
    bot_token: "${env:YENNA_TELEGRAM_BOT_TOKEN}"
    authorized_chat_id: 12345678 # your Telegram user id; single-user mode

The daemon long-polls the Telegram Bot API in-process. Each chat becomes a
conversation with channel: "telegram", auto-created on first message.
Heartbeats are scheduled proactive check-ins: the agent reaches out to the user, through the configured primary channel, when criteria suggest it should. They are useful for upcoming calendar events, follow-ups, or reminders.
heartbeat:
  enabled: true
  interval_ms: 1800000 # 30 minutes
  primary_channel: "telegram"
  noop_token: "HEARTBEAT_OK"

The agent's response is suppressed (not persisted, not delivered) when it
contains HEARTBEAT_OK; that is how the model says "nothing to say right
now." Keeping these responses out of the history matters because small
models with short contexts pick up patterns very quickly. A couple of
messages ending in HEARTBEAT_OK in the chat history is enough for every
later message to end in HEARTBEAT_OK. The risk is greatest when the agent
is left alone for a while: it rapidly builds its own feedback loop, which
crowds out nearly all pre-existing context and history.
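The suppression rule can be sketched in a few lines. The function name `handleHeartbeatResponse` and the in-memory `history` array are hypothetical stand-ins for the real hook and the messages table.

```typescript
interface StoredMessage {
  role: "assistant";
  content: string;
}

// Hypothetical helper: returns the text to deliver, or null when suppressed.
// A suppressed response is neither stored nor delivered, so the noop token
// never enters the history the model sees on later turns.
function handleHeartbeatResponse(
  response: string,
  history: StoredMessage[],
  noopToken = "HEARTBEAT_OK",
): string | null {
  if (response.includes(noopToken)) {
    return null;
  }
  history.push({ role: "assistant", content: response });
  return response;
}
```

Matching with `includes` rather than strict equality means a response like "Nothing new. HEARTBEAT_OK" is also dropped, which follows the "contains" wording above.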
The agent loop fires lifecycle events at well-known points (turn_start,
pre_llm_call, pre_tool_call, post_tool_call, pre_compaction,
turn_end, …) and a HookRegistry dispatches them in priority order. The
registry gives us three properties that direct callbacks don't: ordering
(memory recall has to run before the prompt is built, audit logging has to
see the final outcome), short-circuit results (pre_tool_call can return
abort or skip_tool to alter control flow — that's how duplicate
detection works), and fire-and-forget observers (the audit logger watches
every event without anyone wiring it in). Middleware was considered and
rejected: it tangles cross-cutting concerns with the linear request/response
shape, and the agent loop isn't request/response — it's a tool-call cycle
with reentrant LLM calls and conditional dispatch. Hooks are now the only
extension point for cross-cutting behavior; the agent loop itself has no
knowledge of memory, duplicate detection, audit logging, or heartbeat
suppression. Adding a new concern means writing a hook, not editing the
loop.
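A minimal sketch of priority-ordered dispatch with short-circuit results, using duplicate detection as the example hook. The class shape, the "lower priority number runs first" rule, and the synchronous handlers are all assumptions made for brevity; the real HookRegistry is likely asynchronous and richer.

```typescript
type HookResult =
  | { action: "continue" }
  | { action: "abort"; reason: string }
  | { action: "skip_tool"; reason: string };

interface Hook<E> {
  name: string;
  priority: number; // assumption: lower numbers run first
  handle(event: E): HookResult | undefined; // observers return undefined
}

class HookRegistry<E> {
  private hooks: Hook<E>[] = [];

  register(hook: Hook<E>): void {
    this.hooks.push(hook);
    this.hooks.sort((a, b) => a.priority - b.priority);
  }

  // Runs hooks in priority order; the first non-continue result wins.
  dispatch(event: E): HookResult {
    for (const hook of this.hooks) {
      const result = hook.handle(event);
      if (result !== undefined && result.action !== "continue") return result;
    }
    return { action: "continue" };
  }
}

interface ToolCallEvent {
  tool: string;
  args: string; // serialized arguments, used as part of the identity key
}

// Skips a tool call once the same call has repeated more than `limit` times in a row.
function duplicateDetector(limit: number): Hook<ToolCallEvent> {
  let last = "";
  let streak = 0;
  return {
    name: "duplicate-detector",
    priority: 10,
    handle(event) {
      const key = `${event.tool}:${event.args}`;
      streak = key === last ? streak + 1 : 1;
      last = key;
      if (streak > limit) {
        return { action: "skip_tool", reason: `repeated ${streak} times` };
      }
      return { action: "continue" };
    },
  };
}
```

The loop only has to call `dispatch` at each lifecycle point and act on the result, which is what keeps it unaware of the individual concerns.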
A single agent with every tool in scope drowns small models. The system
prompt grows linearly with the tool catalog, every tool's JSON schema eats
context tokens before planning even starts, and irrelevant tools become
attractive nuisances. The run_skill tool fans work out to short-lived subagents,
each loaded with one skill's markdown as its system prompt and a
restricted tool set declared in the skill's frontmatter (tools: [shell, fs_write] means only those tools are available). The parent agent
returns to a clean context with just the subagent's final message. Skill
authors can opt into wider tool sets when they need them, but the default
is narrow. run_skill itself is excluded from subagent tool sets by
default, which bounds recursion depth at 1 — a subagent can't spawn
further subagents unless the skill explicitly opts in. That cap is a
deliberate choice over arbitrary nesting: deeper agent trees are hard to
debug and amplify failure modes (one bad turn loops a whole tree).
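One way to picture the tool restriction is a resolver that maps a skill's frontmatter to the subagent's tool list. Everything here is a sketch: the function names are hypothetical, and the contents of the narrow default set are an assumption, since the README does not say what a skill without a tools field receives.

```typescript
interface SkillFrontmatter {
  name: string;
  tools?: string[];
}

// Tool names taken from the built-in list in this README.
const TOOL_CATALOG: readonly string[] = [
  "fs_read", "fs_write", "fs_list", "fs_mkdir", "shell",
  "web_search", "web_fetch", "search_skills", "run_skill",
  "recall_memory", "save_memory",
];

// Assumption: the actual default set is not documented here.
const ASSUMED_DEFAULT_TOOLS: readonly string[] = ["fs_read", "fs_list"];

function resolveSubagentTools(
  skill: SkillFrontmatter,
  catalog: readonly string[] = TOOL_CATALOG,
  defaults: readonly string[] = ASSUMED_DEFAULT_TOOLS,
): string[] {
  const requested = skill.tools ?? defaults;
  // Unknown names are dropped. run_skill is present only if the skill listed it,
  // which is what bounds recursion depth at 1 by default.
  return requested.filter((tool) => catalog.includes(tool));
}

function canSpawnSubagents(tools: string[]): boolean {
  return tools.includes("run_skill");
}
```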
When an LLM response contains multiple tool calls, the loop executes them
one at a time and waits for each to complete before starting the next.
Parallel execution would be measurably faster for independent tool calls,
but three concerns kept it out. First, permission gating: a denied shell
command should inform whether the agent attempts the next one — running
them in parallel races the user's decision against side effects. Second,
shell ordering: fs_write followed by shell running a script that reads
that file has to happen in order, and the agent's request order is the
only ordering signal we have. Third, persistence: the messages table
appends in monotonic order and tool results are interleaved into the
conversation history; parallel execution would force either out-of-order
inserts (which break the compaction-cutoff invariant) or a join barrier
that defeats the latency win. We'd revisit this if tool latency became the
dominant cost in real workloads — at that point a per-tool parallelSafe
flag plus a join-barrier dispatcher would be the natural extension.
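The sequential behaviour amounts to an awaited loop. In this sketch the types and the function name `runSequentially` are hypothetical, and stopping at the first denied call is one possible reading of "a denied shell command should inform whether the agent attempts the next one", not confirmed behaviour.

```typescript
interface ToolCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}

interface ToolResult {
  id: string;
  output: string;
  denied: boolean;
}

type Executor = (call: ToolCall) => Promise<ToolResult>;

// Executes tool calls strictly in request order, one at a time.
// Results come back in the same order, so they can be appended to the
// conversation history monotonically.
async function runSequentially(
  calls: ToolCall[],
  execute: Executor,
): Promise<ToolResult[]> {
  const results: ToolResult[] = [];
  for (const call of calls) {
    const result = await execute(call);
    results.push(result);
    // Assumption: after a denial, hand control back to the model
    // instead of running the remaining calls.
    if (result.denied) break;
  }
  return results;
}
```

A parallel variant would replace the loop with `Promise.all` over the calls flagged as safe, which is where the parallelSafe flag and join barrier mentioned above would come in.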
Memory is single-user, small (dozens to low hundreds of entries in
practice), and rarely the bottleneck. Brute-force cosine similarity over
embeddings stored as SQLite BLOBs returns top-K in well under a
millisecond at that scale, which lets us skip an entire dependency. The
user-facing payoff matters more than the perf one: memories are
human-readable markdown files in ~/.yenna/memory/, not opaque rows in a
vector database. You can grep, cat, hand-edit, version-control,
sync via Syncthing, or just rm a file that's wrong. The index rebuilds
on the next startup. This is a deliberate scale tradeoff — it would not
survive multi-user deployment or memory catalogs in the tens of thousands.
At that point swapping in sqlite-vec, LanceDB, or a hosted vector store
is a contained change because the MemoryEmbeddingsIndex interface is
narrow.
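Brute-force top-K search is short enough to show in full. This is a sketch under assumptions: `MemoryEntry`, `topK`, and the BLOB helpers are hypothetical names, and the actual MemoryEmbeddingsIndex interface is not reproduced here.

```typescript
interface MemoryEntry {
  path: string; // e.g. a markdown file under ~/.yenna/memory/
  embedding: Float32Array;
}

function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Scores every entry and returns the k best; O(n) per query, fine for hundreds of entries.
function topK(
  query: Float32Array,
  entries: MemoryEntry[],
  k: number,
): { path: string; score: number }[] {
  return entries
    .map((e) => ({ path: e.path, score: cosineSimilarity(query, e.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Round-trip helpers for storing a vector as a BLOB column.
function toBlob(vector: Float32Array): Uint8Array {
  return new Uint8Array(vector.buffer, vector.byteOffset, vector.byteLength);
}

function fromBlob(blob: Uint8Array): Float32Array {
  // Copy first so the Float32Array starts on a 4-byte-aligned offset.
  const copy = blob.buffer.slice(blob.byteOffset, blob.byteOffset + blob.byteLength);
  return new Float32Array(copy);
}
```

Swapping in a dedicated vector index later would only need to replace `topK`, since callers see nothing but paths and scores.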
Telegram long-polling, the scheduler, the WebSocket server, the agent
loop, and the SQLite store all live in one Bun process. No message broker,
no inter-service queue, no shared cache. The deployment target is a
single machine — usually the same machine the user is sitting at — and
the operational overhead of a broker isn't earned by anything yenna needs.
Channels are objects in a registry, not network endpoints; the heartbeat
job calls hub.emit() synchronously rather than enqueueing a job for a
worker to drain. This collapses an entire class of failure modes (broker
down, queue backpressure, serialization mismatches) and keeps the mental
model small enough to hold in your head. It would not hold up if the
architecture had to scale horizontally — multiple daemon instances would
need a real coordination layer for conversation locking, scheduler
leadership, and channel routing. That's a bridge to cross if it ever
becomes a problem, not a problem to design around today.
bun test # unit + integration (mocked LLM)
YENNA_E2E=1 bun test # also runs E2E against a real LLM

E2E env vars:
- YENNA_E2E_BASE_URL (default http://localhost:11434/v1)
- YENNA_E2E_CHAT_MODEL (default qwen2.5:7b)
- YENNA_E2E_EMBED_MODEL (default nomic-embed-text)
MIT — see LICENSE.