# archolith-context

Self-hosted context intelligence for LLMs.

Compress context. Extract knowledge. Remember everything.

archolith-context is an OpenAI-compatible proxy that sits in front of /v1/chat/completions, extracts durable knowledge from each turn into a session graph, and uses a tool-calling context manager LLM to build the minimum viable context window for the next turn. The goal is not generic prompt compression. The goal is a coding agent that never re-reads a file it already knows, never loses a decision it already made, and never degrades from context bloat in long sessions.

- Product name: `archolith-context`
- Python package: `archolith_proxy`
- Default public bootstrap: LadybugDB + OpenAI-compatible upstream on port `9800`
- Current focus: proxy-first OSS release, with `archolith-memory` and `archolith-filter` as adjacent products on the broader roadmap

See [ARCHITECTURE.md](./ARCHITECTURE.md) for the system walkthrough and [CONTRIBUTING.md](./CONTRIBUTING.md) for local development.

## What It Does

archolith-context keeps the parts of a conversation that still matter and rewrites the parts that no longer need verbatim replay.

At a high level it:

1. Resolves a stable session from `X-Session-ID` or a prompt fingerprint.
2. Sends the current request upstream through an OpenAI-compatible interface.
3. Extracts facts, decisions, observations, tool results, and goals after each turn.
4. Stores those artifacts in a session graph.
5. Rebuilds future requests by preserving the system prompt and recent coherence tail while replacing older middle history with graph-assembled context.

This makes the proxy useful for:

- long coding sessions where earlier design decisions still matter
- agent conversations with heavy tool output
- OpenAI-compatible clients that already know how to talk to `/v1/chat/completions`
- local or self-hosted workflows where you want to keep session memory under your control

## Quick Start

### Option A: Docker Compose

Copy `.env.example` to `.env` before starting the container, then run:

```bash
# fill in UPSTREAM_API_KEY and, for graph features, EXTRACTOR_API_KEY
docker compose up --build
```

The default compose profile runs the proxy with:

- `GRAPH_BACKEND=ladybug`
- embedded LadybugDB at `/app/data/context.lbug`
- health check on `http://localhost:9800/live`

If you want the Neo4j backend instead:

```bash
docker compose --profile neo4j up --build
```

### Option B: Local Python Environment

Copy `.env.example` to `.env` before running the proxy locally, then run:

```bash
uv sync --extra dev
# fill in UPSTREAM_API_KEY and, for graph features, EXTRACTOR_API_KEY
uv run uvicorn archolith_proxy.main:app --host 0.0.0.0 --port 9800
```

### Point an OpenAI-Compatible Client at the Proxy

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9800/v1",
    api_key="proxy-local",
    default_headers={"X-Session-ID": "demo-session"},
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are helping implement a background task queue."},
        {"role": "user", "content": "Design the task model and enqueue flow."},
    ],
)

print(resp.choices[0].message.content)
```

Notes:

- The proxy forwards requests using `UPSTREAM_API_KEY`; the client-side `api_key` can be any placeholder.
- `X-Session-ID` is the most reliable way to keep turns in the same graph-backed session.
- The root URL redirects to the local dashboard at `/dashboard/dashboard.html`.

## Benchmark Results

The committed benchmark audit is `./.agent/audits/2026-05-21-gpt4omini-16turn-baseline.md`.

That run used a 16-turn coding conversation against gpt-4o-mini with LadybugDB, gpt-4.1-mini extraction, embeddings enabled, and relaxed gating so assembly would activate earlier than the production defaults. It is a useful system audit, but it is not a claim that the default .env.example values will produce the same curve out of the box.

### Aggregate Outcomes

- Total direct input tokens: 138,016
- Total proxy savings: 49,926
- Overall savings ratio: 36.2%
- Steady-state savings range: 45-65% across turns 7-16
- Peak single-turn savings: 65% on turn 7
- Mean response time: 32,041 ms direct vs 18,168 ms through the proxy in that run

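### Representative Turns

| Turn | Direct In | Proxy In | Rewritten | Savings | Ratio | Mode |
|------|-----------|----------|-----------|---------|-------|------|
| 1 | 119 | 0 | 0 | 0 | 0% | `unknown` |
| 4 | 2,743 | 1,258 | 1,106 | 152 | 12% | `graph` |
| 8 | 7,671 | 4,750 | 2,598 | 2,152 | 45% | `graph` |
| 12 | 13,194 | 8,971 | 4,121 | 4,850 | 54% | `graph` |
| 16 | 18,370 | 12,828 | 5,206 | 7,622 | 59% | `graph` |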

The main signal from the audit is not "compression starts immediately." It is that once the conversation becomes long enough to have real replay cost, the graph-assembled form grows much more slowly than raw chat history.
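
The published numbers can be cross-checked with a little arithmetic. The sketch below assumes, purely from the figures shown, that per-turn Savings equals Proxy In minus Rewritten and that the per-turn Ratio is Savings divided by Proxy In, while the aggregate ratio divides total savings by total direct input tokens. These definitions are inferred from the table, not quoted from the audit file.

```python
# Rows copied from the Representative Turns table (turn 1 omitted: all zeros).
# Columns: turn, direct_in, proxy_in, rewritten, savings, reported ratio (%)
ROWS = [
    (4, 2_743, 1_258, 1_106, 152, 12),
    (8, 7_671, 4_750, 2_598, 2_152, 45),
    (12, 13_194, 8_971, 4_121, 4_850, 54),
    (16, 18_370, 12_828, 5_206, 7_622, 59),
]


def per_turn_ratio(proxy_in: int, rewritten: int) -> float:
    """Assumed definition: share of the proxy's input removed by rewriting."""
    return (proxy_in - rewritten) / proxy_in


for turn, _direct_in, proxy_in, rewritten, savings, reported in ROWS:
    # Savings column is consistent with Proxy In - Rewritten.
    assert proxy_in - rewritten == savings, turn
    # Ratio column is consistent with Savings / Proxy In, rounded to a percent.
    assert round(per_turn_ratio(proxy_in, rewritten) * 100) == reported, turn

# Aggregate figure: total proxy savings over total direct input tokens.
overall = 49_926 / 138_016
print(f"{overall:.1%}")  # 36.2%
```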

## How It Works

### 1. Session Resolution

The proxy first resolves a session:

- preferred: `X-Session-ID`
- fallback: a SHA-256 fingerprint of the sanitized system prompt and first user message

This lets clients opt into stable session IDs while still giving single-threaded tools a deterministic fallback.
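
The resolution order can be sketched in a few lines. This is a minimal illustration, not the proxy's actual code: the function name `resolve_session_id`, the `fp-` prefix, and the whitespace-collapsing sanitization step are all assumptions made for the example.

```python
import hashlib
from typing import Mapping, Sequence


def resolve_session_id(headers: Mapping[str, str], messages: Sequence[dict]) -> str:
    """Prefer an explicit X-Session-ID; otherwise fingerprint the prompt."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    explicit = lowered.get("x-session-id", "").strip()
    if explicit:
        return explicit

    system = next((m["content"] for m in messages if m["role"] == "system"), "")
    first_user = next((m["content"] for m in messages if m["role"] == "user"), "")

    # Hypothetical sanitization: collapse whitespace so cosmetic differences
    # do not fork the session.
    parts = [" ".join(str(text).split()) for text in (system, first_user)]
    digest = hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()
    return f"fp-{digest}"


messages = [
    {"role": "system", "content": "You are helping implement a task queue."},
    {"role": "user", "content": "Design the task model."},
]
fallback_id = resolve_session_id({}, messages)
explicit_id = resolve_session_id({"X-Session-ID": "demo-session"}, messages)
```

Because the fallback depends only on the system prompt and the first user message, later turns of the same conversation hash to the same ID as long as the client replays those two messages unchanged.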

### 2. Post-Response Extraction

After the upstream response returns, the proxy asynchronously extracts structured knowledge from the turn:

- `decision`
- `file_state`
- `tool_result`
- `state`
- `goal`
- `observation`
- `error`

Those facts are deduplicated, written to the graph backend, and associated with the session.
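
One way to picture the deduplication step is a content-addressed store keyed by fact type plus normalized text. The sketch below is an assumption-heavy illustration: the `Fact` dataclass, its fields, and the normalization rule are hypothetical, and an in-memory dict stands in for the graph backend.

```python
import hashlib
from dataclasses import dataclass

# Fact types as listed above.
FACT_TYPES = {"decision", "file_state", "tool_result", "state", "goal", "observation", "error"}


@dataclass(frozen=True)
class Fact:
    type: str
    content: str
    turn: int
    confidence: float = 1.0

    def key(self) -> str:
        # Hypothetical normalization: case-fold and collapse whitespace.
        normalized = " ".join(self.content.lower().split())
        return hashlib.sha256(f"{self.type}:{normalized}".encode("utf-8")).hexdigest()


def deduplicate(store: dict, extracted: list) -> list:
    """Record unseen facts in `store` and return only the new ones."""
    fresh = []
    for fact in extracted:
        if fact.type not in FACT_TYPES:
            continue  # ignore anything outside the known fact types
        key = fact.key()
        if key not in store:
            store[key] = fact
            fresh.append(fact)
    return fresh


store = {}
first = deduplicate(store, [Fact("decision", "Use Redis for the queue", turn=2)])
second = deduplicate(store, [Fact("decision", "use  redis for the queue", turn=5)])
```

Here `second` is empty because the turn-5 restatement normalizes to the same key as the turn-2 decision.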

### 3. Context Assembly

On later turns, the assembler builds a replacement context block from graph facts instead of replaying every prior message.

The assembled block has two layers:

- `SESSION OVERVIEW`: current goal, touched files, decisions, fact count
- `RELEVANT CONTEXT`: ranked facts selected by recency, confidence, type, and optional embedding similarity
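
A ranking over recency, confidence, and type could look like the sketch below. Everything numeric here is an assumption for illustration: the type weights, the exponential half-life, and the four-characters-per-token estimate are not taken from the project, and embedding similarity is left out.

```python
from dataclasses import dataclass

# Hypothetical per-type weights; the real heuristics may differ.
TYPE_WEIGHT = {
    "goal": 1.0, "decision": 0.9, "error": 0.8, "file_state": 0.7,
    "state": 0.6, "tool_result": 0.5, "observation": 0.4,
}


@dataclass
class RankedFact:
    type: str
    content: str
    turn: int
    confidence: float


def score(fact: RankedFact, current_turn: int, half_life: float = 8.0) -> float:
    """Combine recency decay, extractor confidence, and a type weight."""
    age = max(current_turn - fact.turn, 0)
    recency = 0.5 ** (age / half_life)
    return recency * fact.confidence * TYPE_WEIGHT.get(fact.type, 0.3)


def select_facts(facts: list, current_turn: int, token_budget: int) -> list:
    """Greedily take the highest-scoring facts that fit the token budget."""
    ranked = sorted(facts, key=lambda f: score(f, current_turn), reverse=True)
    chosen, used = [], 0
    for fact in ranked:
        cost = max(len(fact.content) // 4, 1)  # crude token estimate
        if used + cost > token_budget:
            continue
        chosen.append(fact)
        used += cost
    return chosen


facts = [
    RankedFact("observation", "x" * 40, turn=1, confidence=0.5),
    RankedFact("decision", "y" * 40, turn=9, confidence=0.9),
]
selected = select_facts(facts, current_turn=10, token_budget=10)
```

With a budget of 10 estimated tokens only one of the two 40-character facts fits, and the recent high-confidence decision wins over the older observation.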

The proxy then:

- preserves the original system prompt
- preserves a smart "coherence tail" of recent messages
- removes the replay-heavy middle
- merges the assembled context into the system message so providers that dislike consecutive system messages still work
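
The four rewrite steps above can be sketched as a single function over an OpenAI-style message list. The name `rewrite_messages` is hypothetical, and the rule that trims leading `tool` messages from the tail is only one guess at what makes the tail "smart"; the default of 10 mirrors `COHERENCE_TAIL_SIZE`.

```python
def rewrite_messages(messages: list, assembled_context: str, tail_size: int = 10) -> list:
    """Keep the system prompt and a recent tail; replace the middle with context."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= tail_size:
        return list(messages)  # nothing old enough to drop

    tail = rest[-tail_size:]
    # Assumed coherence rule: a tail must not start with a tool result whose
    # originating assistant tool call was cut off with the middle.
    while tail and tail[0]["role"] == "tool":
        tail = tail[1:]

    base = system[0]["content"] if system else ""
    # Merge into one system message instead of emitting two in a row.
    merged = {"role": "system", "content": f"{base}\n\n{assembled_context}".strip()}
    return [merged, *tail]


history = [{"role": "system", "content": "You are a coding assistant."}]
for i in range(15):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

rewritten = rewrite_messages(history, "=== SESSION OVERVIEW ===\ngoal: build a task queue")
```

The 31-message history shrinks to one merged system message plus the last ten turns, and the final message is unchanged.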

### 4. Context Manager LLM (Curator)

When `CURATOR_ENABLED=true`, a tool-calling context manager LLM replaces the heuristic assembler as the primary context-building path. Instead of scoring and ranking facts with Python heuristics, the curator LLM actively queries the session's knowledge store using seven tools:

| Tool | What it retrieves |
|------|-------------------|
| `list_session_files` | All cached files for the session (path, line count, last turn) |
| `get_file` | Full content of a cached file (≤200 lines) |
| `get_file_lines` | Specific line range from a cached file |
| `search_facts` | Facts matching a keyword substring |
| `get_session_goal` | The session goal string |
| `get_recent_decisions` | Decisions made during the session |
| `get_touched_files` | Files read or modified, with status and turn |

The curator calls 2–4 tools per turn (configurable max), selects only what the agent needs for the current question, and emits a structured context block:

```
=== SESSION GOAL ===
<goal>

=== RELEVANT CODE ===
<path> lines <start>-<end>:

=== KEY FACTS ===

=== DECISIONS ===
```
The curator runs with a 6-second hard latency cap (`CURATOR_LATENCY_BUDGET_MS`). On timeout or failure it falls back to the heuristic assembler, and no errors are surfaced to the client.

File content is stored in a local cache (LadybugDB `FileContent` table) with SHA-256 change detection. The curator re-reads a file only when its content hash changes; previously seen file versions are served from cache.
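
The latency cap with silent fallback maps naturally onto `asyncio.wait_for`. The sketch below is an illustration under assumptions: `build_context`, `slow_curator`, and `heuristic` are hypothetical names, and the 6000 ms default simply mirrors the 6-second cap described above.

```python
import asyncio


async def build_context(curator, heuristic, budget_ms: int = 6000) -> str:
    """Try the curator within the latency budget; fall back on any failure."""
    try:
        return await asyncio.wait_for(curator(), timeout=budget_ms / 1000)
    except Exception:
        # Covers both the timeout raised by wait_for and errors inside the
        # curator. Nothing is re-raised, so the client never sees the failure.
        return await heuristic()


async def slow_curator() -> str:
    await asyncio.sleep(0.2)  # stand-in for a slow LLM tool-calling loop
    return "curated"


async def heuristic() -> str:
    return "heuristic"


# With a 50 ms budget the 200 ms curator is cancelled and the fallback runs.
result = asyncio.run(build_context(slow_curator, heuristic, budget_ms=50))
```

`asyncio.wait_for` cancels the curator task when the budget expires, so a stuck curator call does not keep running in the background.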

Enable the curator with `CURATOR_ENABLED=true` in your `.env`. It requires `FILE_CACHE_ENABLED=true` (the default) and activates only after at least `COLD_START_TURNS` user turns.
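
The file cache's SHA-256 change detection can be pictured with a small in-memory stand-in. This is a sketch only: the `FileCache` class and its methods are hypothetical, and a dict replaces the LadybugDB `FileContent` table.

```python
import hashlib
from typing import Optional


class FileCache:
    """In-memory stand-in for a content-hash-checked file cache."""

    def __init__(self) -> None:
        self._entries = {}  # path -> (sha256 hex digest, content)

    def put(self, path: str, content: str) -> bool:
        """Store content and return True only when it differs from the cache."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        cached = self._entries.get(path)
        if cached is not None and cached[0] == digest:
            return False  # same hash: the cached version is still valid
        self._entries[path] = (digest, content)
        return True

    def get(self, path: str) -> Optional[str]:
        entry = self._entries.get(path)
        return entry[1] if entry else None


cache = FileCache()
changed_first = cache.put("queue/models.py", "class Task: ...\n")
changed_again = cache.put("queue/models.py", "class Task: ...\n")
changed_edit = cache.put("queue/models.py", "class Task:\n    id: int\n")
```

Only the first write and the edited version report a change; the identical second write is served from cache.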

### 5. Optional Recall Tool

If `SESSION_RECALL_TOOL_ENABLED=true`, the proxy injects a hidden `__archolith_recall` tool into the request. If the model calls it, the proxy intercepts the tool call, queries the session graph, returns a bounded tool result, and re-sends the request upstream. The tool never leaks back out as part of the public API response.
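
The inject-and-intercept flow can be sketched against the standard OpenAI `tools` / `tool_calls` payload shape. The helper names and the single `query` parameter are assumptions for illustration; only the tool name `__archolith_recall` comes from the description above.

```python
import copy

RECALL_TOOL_NAME = "__archolith_recall"


def inject_recall_tool(request: dict) -> dict:
    """Return a copy of the request with the hidden recall tool appended."""
    patched = copy.deepcopy(request)
    tools = patched.setdefault("tools", [])
    if not any(t.get("function", {}).get("name") == RECALL_TOOL_NAME for t in tools):
        tools.append({
            "type": "function",
            "function": {
                "name": RECALL_TOOL_NAME,
                "description": "Look up earlier facts from the current session.",
                # Hypothetical parameter schema.
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
        })
    return patched


def split_tool_calls(assistant_message: dict) -> tuple:
    """Separate recall calls (handled by the proxy) from client-visible calls."""
    calls = assistant_message.get("tool_calls") or []
    recall = [c for c in calls if c["function"]["name"] == RECALL_TOOL_NAME]
    public = [c for c in calls if c["function"]["name"] != RECALL_TOOL_NAME]
    return recall, public


request = {"model": "gpt-4o-mini", "messages": []}
patched = inject_recall_tool(request)
reply = {"role": "assistant", "tool_calls": [
    {"id": "call_1", "type": "function",
     "function": {"name": RECALL_TOOL_NAME, "arguments": '{"query": "queue design"}'}},
    {"id": "call_2", "type": "function",
     "function": {"name": "run_tests", "arguments": "{}"}},
]}
recall_calls, public_calls = split_tool_calls(reply)
```

Copying the request before patching keeps the client's original payload untouched, which is what makes it possible to hide the tool from the final response.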

### 6. Optional Long-Term Promotion

If `PROMOTION_ENABLED=true`, high-confidence session facts can be promoted into an external memory backend. The registry currently supports adapters including `archolith_memory`, `mem0`, `zep`, `generic_http`, `basic_memory`, `claude_mem`, `cognee`, `openmemory`, and `nocturne_memory`.

## Session Recall Tool

The recall tool is intentionally conservative:

- it is off by default
- it only searches the current session graph
- it rewrites ambiguous recall queries only when query rewriting is enabled
- it caps returned recall context to an internal token budget
- it is removed from the final response payload after interception

This makes it useful for targeted "what did we decide earlier?" lookups without turning every request into a tool-using workflow.

## Configuration

Start with `.env.example`. Only `UPSTREAM_API_KEY` is required for passthrough proxying, but graph-backed features need more than that.

| Variable | Purpose | Default / Recommended |
|----------|---------|-----------------------|
| `UPSTREAM_BASE_URL` | OpenAI-compatible upstream endpoint | `https://api.openai.com/v1` in `.env.example` |
| `UPSTREAM_API_KEY` | Credential used for proxied upstream calls | required |
| `GRAPH_BACKEND` | Session graph backend | `ladybug` for getting started |
| `LADYBUG_DB_PATH` | Embedded graph file path | `./data/context.lbug` |
| `SESSION_NEO4J_*` | Neo4j connection settings | only needed with `GRAPH_BACKEND=neo4j` |
| `EXTRACTOR_MODEL` | Post-turn fact extraction model | `gpt-4.1-mini` |
| `EMBEDDING_MODEL` | Similarity search model | `text-embedding-3-small` |
| `COHERENCE_TAIL_SIZE` | Recent messages preserved verbatim | `10` |
| `CONTEXT_TOKEN_BUDGET` | Budget for the assembled context block | `15000` |
| `COLD_START_TURNS` | Minimum turns before assembly is considered | `3` |
| `ASSEMBLY_MIN_SAVINGS_RATIO` | Skip rewriting if savings are too small | `0.20` |
| `ASSEMBLY_MIN_INPUT_TOKENS` | Skip rewriting below this input size | `50000` |
| `SESSION_RECALL_TOOL_ENABLED` | Enable hidden recall tool injection | `false` |
| `PROMOTION_ENABLED` | Enable long-term memory promotion | `false` |
| `ADMIN_TOKEN` | Protect operator surfaces like `/trace` and `/sessions` | empty by default |

Current code-path nuance: if you enable extraction, query rewriting, or embeddings, set `EXTRACTOR_API_KEY` and `EMBEDDING_API_KEY` explicitly. The comments in `.env.example` describe fallback behavior, but the active code paths check the explicit values.
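
The three gating variables in the table (`COLD_START_TURNS`, `ASSEMBLY_MIN_SAVINGS_RATIO`, `ASSEMBLY_MIN_INPUT_TOKENS`) can be read as one decision. The sketch below assumes they combine as simple sequential checks; the function name, the order of checks, and the exact comparison operators are hypothetical, and the defaults are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatingConfig:
    cold_start_turns: int = 3         # COLD_START_TURNS
    min_savings_ratio: float = 0.20   # ASSEMBLY_MIN_SAVINGS_RATIO
    min_input_tokens: int = 50_000    # ASSEMBLY_MIN_INPUT_TOKENS


def should_rewrite(user_turns: int, input_tokens: int, assembled_tokens: int,
                   cfg: GatingConfig = GatingConfig()) -> bool:
    """Decide whether to replace raw history with the assembled context."""
    if user_turns < cfg.cold_start_turns:
        return False  # too early in the session
    if input_tokens <= 0 or input_tokens < cfg.min_input_tokens:
        return False  # request is too small to be worth rewriting
    savings_ratio = (input_tokens - assembled_tokens) / input_tokens
    return savings_ratio >= cfg.min_savings_ratio


small_request = should_rewrite(user_turns=5, input_tokens=18_370, assembled_tokens=5_206)
large_request = should_rewrite(user_turns=5, input_tokens=60_000, assembled_tokens=40_000)
```

Under these assumed semantics an 18,370-token request is left untouched by the default thresholds, which fits the benchmark note that gating was relaxed for the audited run.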

## Backends

### LadybugDB

LadybugDB is the default bootstrap backend because it is embedded and file-backed:

- no external database server
- good fit for local development and single-node deployments
- schema created on first connect
- stores sessions, facts, files, and decisions in one local database file

### Neo4j

Neo4j remains the more traditional graph backend:

- external service
- useful if you already operate Neo4j
- selected with `GRAPH_BACKEND=neo4j`
- requires `SESSION_NEO4J_PASSWORD`

### Long-Term Memory

Promotion is separate from the session graph. `archolith-context` can optionally promote facts into an external memory system, but that path is disabled by default.

## Comparison

This is a directional comparison based on public product docs as of `2026-05-22`. Check vendor docs for the latest details.

| Tool | Core idea | Deploy model | Best fit | How `archolith-context` differs |
|------|-----------|--------------|----------|-------------------------------|
| `archolith-context` | Rewrite replay-heavy chat history into a graph-assembled session context | Self-hosted proxy | Long-running coding and agent sessions where continuity matters | It sits directly in front of OpenAI-compatible chat APIs and reconstructs context from extracted facts |
| [Graphiti / Zep](https://help.getzep.com/zep-vs-graphiti) | Graphiti is an open-source temporal graph framework; Zep is a managed context platform built around it | OSS framework plus managed platform | Teams that want a graph foundation or a turnkey hosted context stack | `archolith-context` is narrower and proxy-first: it rewrites chat payloads for existing clients instead of being a full managed context platform |
| [Headroom](https://headroomlabs.ai/) | Compress tool output, logs, files, and RAG payloads before they hit the model | Local/open-source layer | Tool-heavy agents where non-chat payloads dominate token spend | `archolith-context` focuses on session memory and decision recall across turns, not just compressing the current payload |
| [The Token Company](https://thetokencompany.com/docs) | Hosted API that compresses prompt text before inference | Hosted API + SDKs | Teams that want API-based prompt compression without self-hosting | `archolith-context` keeps session state local and rebuilds history from graph facts instead of sending prompts to a third-party compression API |

## Roadmap

- tighten the default production tuning based on longer benchmark runs
- harden trace timing and extraction observability
- ship clearer client examples and end-to-end demos
- stabilize the proxy-first OSS release
- connect the broader `Archolith` family (`archolith-memory`, `archolith-filter`) without collapsing everything into one binary

## Contributing

Contributions are welcome. Start with [CONTRIBUTING.md](./CONTRIBUTING.md) for local setup, test commands, and repo conventions.

## License

Apache 2.0. See [LICENSE](./LICENSE).
