A self-improving system that records how AI models behave, evaluates their output, and trains them to be better — all locally, all yours.
Record. Learn. Rewrite.
Point any app's LLM calls at Cassette. It traces every interaction, builds training datasets from real usage, trains LoRA adapters, and proves whether the trained model is actually better — with numbers, not hope.
your app → Cassette gateway → model server (ollama, llama.cpp, vLLM)
                  |
                  └── traces → dataset → training → better model
Validated with OpenFOIA: 29 government documents extracted, 28 training records curated, LoRA adapter trained in 21 minutes on an M4 Mac, format compliance improved from 0.20 → 0.90 on the entity validation task. Zero code changes in OpenFOIA — just a changed model name.
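Because the gateway speaks an OpenAI-compatible API, the redirect is typically a one-line base-URL change in the calling app. A minimal sketch of that change; the gateway address (localhost:8000) and the /v1/chat/completions path are assumptions here, not documented values:

```python
# Hypothetical sketch: the only thing an app changes is its base URL.
# "http://localhost:8000" is an assumed gateway address; use whatever
# `make dev` actually reports.

def chat_request(base_url: str, model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat payload; any OpenAI-compatible SDK or a
    plain HTTP POST to {base_url}/v1/chat/completions would carry it."""
    return {
        "url": f"{base_url}/v1/chat/completions",
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
    }

# Before: straight to the model server (untraced).
direct = chat_request("http://localhost:11434", "llama3.2:3b", "What is gradient descent?")

# After: through Cassette, so the same call is traced and becomes training data.
traced = chat_request("http://localhost:8000", "llama3.2:3b", "What is gradient descent?")
```

The payload is identical in both cases; only the route changes, which is why the calling app needs no code changes.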
# Install
git clone https://github.com/JordanCoin/Cassette
cd Cassette
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
# Check system
cassette doctor
# Run the full pipeline with a test query
cassette run-loop --query "What is gradient descent?"
# See what was produced
cassette list-snapshots
cassette propose-training

No GPU needed. No model server needed. Mock mode works out of the box.
# Or run the guided demo
cassette demo

# 1. Route your app's LLM calls through Cassette
export CASSETTE_PROVIDER=llama_cpp_http
export CASSETTE_LLAMA_CPP_URL=http://localhost:11434
export CASSETTE_MODEL=llama3.2:3b
make dev
# 2. Use your app normally — every LLM call is traced
# 3. Run the pipeline
cassette run-loop
# 4. Train a LoRA adapter
pip install "cassette[training]"
cassette train
# 5. Export to ollama
cassette export-model --name my-model-v1
# 6. Compare base vs trained
cassette compare --base llama3.2:3b --adapter my-model-v1
# 7. If better, update your app config:
# "model": "my-model-v1"
# 8. Keep using your app → more traces → retrain → repeat

See docs/TRAINING.md for the full training guide.
OpenFOIA is an open-source FOIA investigation toolkit. It uses LLMs to validate entity extraction from government documents.
Integration: One config change — point the LLM base_url at Cassette's gateway. No code changes.
Result: Cassette traced 29 document extractions, built a training dataset, trained a LoRA adapter on Qwen 2.5 1.5B in 21 minutes on an M4 Mac, and the trained model returned clean JSON entity validation instead of code-fenced markdown. Measured improvement: 0.20 → 0.90 format compliance across 10 test records, 0 regressions.
See docs/INTEGRATION.md for how to connect any project.
cassette demo # Guided demo of the full pipeline
cassette doctor # Full system diagnostics
cassette health # Quick provider check
cassette run-loop # observe → evaluate → promote → snapshot → propose
cassette run-loop --query "question" # Seed a query, then run
cassette train # Plan → validate → execute LoRA training
cassette export-model --name my-model # Merge adapter → GGUF → register with ollama
cassette compare --base m1 --adapter m2 # Score base vs adapter across 8 dimensions
cassette validate-training # Check if training can run on this hardware
cassette plan-training # Show the training command without running it
cassette evaluate-dataset # Evaluate + promote dataset records
cassette evaluate-dataset --use-judge # Add LLM-as-judge scoring
cassette extract-dataset # Extract records from traces
cassette snapshot-dataset # Version the promoted dataset
cassette list-snapshots # List versioned datasets
cassette propose-training # Generate training proposal
All commands accept --data-dir <path> to override the data directory.
| Variable | Default | Description |
|---|---|---|
| CASSETTE_PROVIDER | mock | Model provider (mock, llama_cpp_http) |
| CASSETTE_MODEL | default | Model name (ollama tag or HuggingFace ID) |
| CASSETTE_LLAMA_CPP_URL | http://localhost:8080 | Model server URL |
| CASSETTE_SEARCH_URL | http://localhost:8888 | Search API URL (SearXNG) |
| CASSETTE_PROVIDER_TIMEOUT | 60 | Provider HTTP timeout (seconds) |
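The same variables and defaults can be mirrored on the client side. An illustrative sketch only; the CLI's own parsing may add validation this omits:

```python
import os

# Illustrative mirror of the documented defaults; names and defaults come
# from the table above, the parsing logic is an assumption.
def load_config() -> dict:
    return {
        "provider": os.getenv("CASSETTE_PROVIDER", "mock"),
        "model": os.getenv("CASSETTE_MODEL", "default"),
        "llama_cpp_url": os.getenv("CASSETTE_LLAMA_CPP_URL", "http://localhost:8080"),
        "search_url": os.getenv("CASSETTE_SEARCH_URL", "http://localhost:8888"),
        "timeout_s": float(os.getenv("CASSETTE_PROVIDER_TIMEOUT", "60")),
    }

config = load_config()
```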
libs/core/ — contracts, ports, domain logic (no IO)
libs/adapters/ — JSONL store, HTTP providers, writers
services/gateway/ — FastAPI gateway (OpenAI-compatible)
services/orchestrator/ — stage runner and stages
- Gateway — proxies LLM calls, traces everything, serves metrics
- Orchestrator — staged execution with event instrumentation
- Data pipeline — extraction, evaluation, promotion, versioned snapshots
- Training — LoRA via TRL, model comparison, GGUF export to ollama
- Adapters — pluggable backends for storage, providers, web tooling
cassette compare scores responses across 8 dimensions:
| Category | Metric | What it measures |
|---|---|---|
| Format | valid_json | Response is parseable JSON |
| Format | no_code_fences | No markdown wrappers |
| Format | correct_schema | Has expected keys (keep/remove) |
| Data | clean_values | No type prefixes or confidence strings in values |
| Data | numeric_confidence | Confidence values are numbers |
| Data | has_corrections | Corrected field present |
| Completeness | entity_coverage | Output accounts for input entities |
| Completeness | no_phantoms | Doesn't invent entities beyond input |
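As an illustration, the three Format metrics can be sketched in a few lines. This is a hypothetical re-implementation, not the scoring code cassette compare actually runs; the keep/remove schema keys are taken from the table above:

```python
import json
import re

def check_format(response: str) -> dict:
    """Illustrative version of the three Format checks; the real scorer
    in `cassette compare` may be stricter."""
    no_fences = "```" not in response
    # Strip a surrounding markdown fence before parsing, mirroring what a
    # lenient consumer would do with code-fenced output.
    body = re.sub(r"^```(?:json)?\s*|\s*```$", "", response.strip())
    try:
        parsed = json.loads(body)
        valid = True
    except json.JSONDecodeError:
        parsed, valid = None, False
    schema = isinstance(parsed, dict) and {"keep", "remove"} <= parsed.keys()
    return {"valid_json": valid, "no_code_fences": no_fences, "correct_schema": schema}

clean = '{"keep": ["Jane Doe"], "remove": [], "confidence": 0.9}'
fenced = '```json\n{"keep": []}\n```'
print(check_format(clean))   # all three checks pass
print(check_format(fenced))  # fails no_code_fences and correct_schema
```

This is the kind of difference the OpenFOIA result above measures: code-fenced markdown fails no_code_fences even when the JSON inside it parses.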
docker compose up --build # Mock mode
CASSETTE_PROVIDER=llama_cpp_http \
docker compose --profile with-backend up   # With model server

See compose.yaml for full configuration.
- Small training sets produce conservative models — 28 records taught format, not deep entity decisions. More data = broader confidence.
- No DVC integration — dataset versioning is file-based snapshots
- Single provider at a time — no multi-model routing
- No auth — endpoints are open (intended for local/dev use)
- GGUF conversion requires git — the llama.cpp converter is auto-downloaded, but the download itself needs git
| Guide | What |
|---|---|
| docs/TRAINING.md | Full training guide: traces to deployed model |
| docs/INTEGRATION.md | How to connect any project to Cassette |
| examples/WALKTHROUGH.md | Step-by-step walkthrough with examples |
| tests/integration/README.md | Multi-surface integration testing |
| AGENTS.md | Development rules |
| PROGRAM.md | Build loop process |
Cassette treats intelligence as a process, not a model.
Models are temporary. The loop is the product.
The goal is freedom from subscription-based AI: train what you want, on your data, on your hardware, improving from your actual usage.
Early-stage, built for iteration. Issues and PRs welcome.
340+ tests. Strict typing. Full linting. Integration tests validated on M4 Mac with real ollama backend.
MIT