Agent runtime tracing + LLM-mock replay for Python. Pip install. Async-first. Self-contained reports.
⚠️ Pre-release (v0.2). Tracer, replay, instrumentors, cost tracking, and the pytest plugin are implemented end-to-end and tested. APIs are stabilising — don't depend on this in production yet.
TraceForge records every LLM call, tool invocation, error, and state transition your agent makes into a typed span. The output is a replayable run.jsonl artifact plus a self-contained HTML report you can open in any browser — no server, no SaaS, no SDK lock-in. Replay mode re-executes the agent with cached LLM responses (or cached tool outputs) so you can verify the execution path without burning API calls.
pip install "traceforge-llm[anthropic]" # or [openai], [all]
traceforge init && python agent.py

| | TraceForge | LangSmith | Langfuse | OpenLLMetry | print() |
|---|---|---|---|---|---|
| Pip install, no account, no server | ✅ | — | partial (self-host) | ✅ | ✅ |
| Records LLM I/O + tool I/O + state per span | ✅ | ✅ | ✅ | partial | — |
| Replay with cached LLM responses (llm-mock) | ✅ | — | — | — | — |
| Dry-run replay with cached tool outputs | ✅ | — | — | — | — |
| Self-contained HTML report (no CDN, no server) | ✅ | — | — | — | — |
| Auto-cost tracking per-span + per-run | ✅ | ✅ | ✅ | partial | — |
| First-class pytest plugin with snapshot testing | ✅ | — | — | — | — |
| Auto-patches your SDK clients | opt-in | ✅ | ✅ | ✅ | n/a |
| Cloud storage / hosted dashboard | — | ✅ | ✅ | via vendor | — |
Where TraceForge fits: when you need a local, file-based, replayable record of what your agent did — for debugging, CI regression tests, or post-hoc analysis — without sending your traces to anyone else's database. Auto-patching frameworks like LangSmith give you a UI; OpenLLMetry gives you OTel pipes. TraceForge gives you a JSONL you can git diff, an HTML you can email, and a tracer.replay() you can run offline.
1. Install and scaffold.
pip install "traceforge-llm[anthropic]"
traceforge init
traceforge init writes traceforge.yaml, a working agent.py example, and a .gitignore entry.
2. Wrap your agent.
import asyncio

from anthropic import AsyncAnthropic
from traceforge import Tracer
from traceforge.integrations.anthropic import AnthropicInstrumentor

tracer = Tracer()

async def main():
    async with tracer.run() as run:
        # Wrap the SDK client so each messages.create call is recorded as a span.
        client = AnthropicInstrumentor(run).instrument(AsyncAnthropic())
        response = await client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            messages=[{"role": "user", "content": "What is 2 + 2?"}],
        )
        print(response.content[0].text)
    run.trace.print_summary()

asyncio.run(main())
3. Run.
export ANTHROPIC_API_KEY=sk-ant-...
python agent.py
You get a Rich-formatted summary on stdout, plus a directory at .traceforge/runs/<ulid>-<run-name>/ containing manifest.json, run.jsonl, and a self-contained report.html.
Library-only API (no instrumentor)
async with tracer.run() as run:
    run.record_llm_call(
        provider="anthropic",
        model="claude-haiku-4-5",
        messages=[...],
        response="...",
        input_tokens=12, output_tokens=4, latency_ms=180,
    )
    run.record_tool_call("search", tool_input={"q": "..."}, tool_output={"hits": 3})
    run.custom("phase.done", metadata={"step": 1})
run.trace.print_summary()
Manual recording is the lower-level API the instrumentors are built on. It is useful when you don't want TraceForge anywhere near the SDK call.
Decorator sugar
@tracer.trace
async def my_agent(query, _run=None):
    _run.record_tool_call("search", {"q": query}, {"hits": 3})
    return "done"

await my_agent("hello")  # auto-saves trace to .traceforge/runs/
trace = tracer.last()
tracer.run() writes three files per run to .traceforge/runs/<ulid>-<run-name>/:
.traceforge/runs/01KS8E...-true-elk/
├── report.html ← open this
├── run.jsonl ← replayable artifact (one span per line + manifest)
└── manifest.json ← aggregate counts + cost + token totals
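Because run.jsonl is line-delimited JSON, it can be inspected with nothing but the standard library. The sketch below tallies records by span type. The field name `span_type`, the sample records, and the helper `count_span_types` are assumptions made for illustration, not TraceForge's documented schema; check a real run.jsonl for the actual keys before relying on it.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_span_types(jsonl_path, type_field="span_type"):
    """Tally records in a JSONL file by an assumed type field."""
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Records without the assumed field (e.g. a manifest line) are skipped.
            if isinstance(record, dict) and type_field in record:
                counts[record[type_field]] += 1
    return counts


# Made-up sample data standing in for a real run.jsonl.
sample = [
    {"span_type": "llm", "model": "example-model"},
    {"span_type": "tool", "name": "search"},
    {"span_type": "llm", "model": "example-model"},
    {"manifest": True},
]
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "run.jsonl"
    path.write_text("\n".join(json.dumps(r) for r in sample) + "\n", encoding="utf-8")
    span_counts = count_span_types(path)

print(dict(span_counts))
```

The same loop works as a quick check in CI, for instance to fail a build when a run contains more LLM spans than expected.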
Terminal: run.trace.print_summary() prints a Rich panel and a span tree.
HTML: self-contained, dark theme, no CDN. Open report.html in any browser (or via traceforge open <run-name>):
- Stat strip at the top: duration, span count, LLM calls, tool calls, token totals, cost, errors
- Span cards with type-coded left border (indigo LLM, cyan tool, slate custom, red errors)
- Collapsible payloads for system prompts, message arrays, and tool I/O — kept folded by default so the page stays scannable
- Per-span cost rendered inline for every LLM call
A live example sits at docs/example-report.html — open it directly to see the layout.
# llm-mock: LLM responses served from cache, tools execute live
result = await tracer.replay(trace, agent_fn, mode="llm-mock")
# dry-run: both LLM responses AND tool outputs served from cache, no network
result = await tracer.replay(trace, agent_fn, mode="dry-run")
result.print()
# Similarity: 100% · Status: ALIGNED
The replay engine builds two interceptors, one for LLM calls keyed by the SHA-256 of the original messages and one for tool calls keyed by the SHA-256 of the original tool inputs, and hands them to your agent function. Your agent consults the interceptor before calling out:
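async def my_agent(query, _run=None, _mock_llm=None, _mock_tool=None):
    # Illustrative payload; build whatever message list your agent actually sends.
    messages = [{"role": "user", "content": query}]
    # Serve the cached response when replaying; fall through to the live call otherwise.
    cached = _mock_llm.get(messages) if _mock_llm else None
    if cached is not None:
        return cached
    return await client.messages.create(...)
The shipped instrumentors handle this for you; just pass mock_interceptor=_mock_llm when instrumenting: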
client = AnthropicInstrumentor(_run, mock_interceptor=_mock_llm).instrument(AsyncAnthropic())
Similarity scoring. ReplayResult.similarity_score is the ratio of matching span types between original and replayed traces. Below 0.4 the replay is marked DIVERGED. See docs/replay-faq.md for why replays diverge and how to fix it.
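The SHA-256 keying described earlier in this section implies that a replay only hits the cache when the agent sends the same payload it sent during the recorded run. The sketch below illustrates that idea with a hash-keyed lookup. `SketchInterceptor`, `cache_key`, and the canonical-JSON serialisation are assumptions for illustration; only the `.get(messages)` call shape comes from the example above, and TraceForge's real interceptor may serialise payloads differently.

```python
import hashlib
import json


def cache_key(payload):
    """Hash a JSON-serialisable payload into a stable hex digest."""
    # Sorted keys and fixed separators make equal payloads serialise identically.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SketchInterceptor:
    """Hypothetical stand-in for a replay interceptor (not the library class)."""

    def __init__(self):
        self._cache = {}

    def record(self, payload, response):
        # Store the response under the hash of the request payload.
        self._cache[cache_key(payload)] = response

    def get(self, payload):
        # Returns None on a cache miss, so callers can fall back to a live call.
        return self._cache.get(cache_key(payload))


interceptor = SketchInterceptor()
original = [{"role": "user", "content": "What is 2 + 2?"}]
interceptor.record(original, "4")

same_payload = [{"content": "What is 2 + 2?", "role": "user"}]  # key order differs
changed_payload = [{"role": "user", "content": "What is 2 + 3?"}]
hit = interceptor.get(same_payload)
miss = interceptor.get(changed_payload)
print(hit, miss)
```

Under this reading, any change to a prompt between recording and replay produces a different digest and therefore a cache miss, which is one likely source of diverging replays.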
Every LLM span gets a USD cost estimate attached automatically, looked up from a built-in pricing table for the major Anthropic and OpenAI models (with longest-prefix matching, so versioned IDs like claude-haiku-4-5-20251001 resolve correctly). Aggregates flow into manifest.total_cost_usd.
async with tracer.run() as run:
    await client.messages.create(...)  # instrumentor records cost
print(run.trace.manifest.total_cost_usd)  # → 0.0042
Override the table for negotiated contract pricing or private models:
from traceforge import Tracer
from traceforge.pricing import ModelPrice
tracer = Tracer(pricing={
    "my-internal-model": ModelPrice(input_per_million=0.5, output_per_million=1.5),
    "claude-opus-4-7": ModelPrice(input_per_million=12.0, output_per_million=60.0),
})
Unknown models are costed at 0 and emit a one-shot warning, so the trace still saves.
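To make the longest-prefix lookup and the per-million pricing concrete, here is a minimal sketch of how such a cost estimate could be computed. The `Price` dataclass, the helper names, and the prices in `PRICES` are placeholders invented for this example; they are not TraceForge's built-in table or its internal implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Stand-in price record: USD per one million tokens."""
    input_per_million: float
    output_per_million: float


# Placeholder numbers chosen for easy arithmetic, not real vendor pricing.
PRICES = {
    "claude-haiku": Price(1.0, 5.0),
    "claude-haiku-4-5": Price(2.0, 10.0),
}


def lookup_price(model, table):
    """Return the entry whose key is the longest prefix of the model ID."""
    prefixes = [key for key in table if model.startswith(key)]
    if not prefixes:
        return None
    return table[max(prefixes, key=len)]


def estimate_cost(model, input_tokens, output_tokens, table):
    """Estimate USD cost; unknown models fall back to 0.0."""
    price = lookup_price(model, table)
    if price is None:
        return 0.0
    total = input_tokens * price.input_per_million + output_tokens * price.output_per_million
    return total / 1_000_000


versioned = lookup_price("claude-haiku-4-5-20251001", PRICES)
cost = estimate_cost("claude-haiku-4-5-20251001", 1_000_000, 500_000, PRICES)
unknown = estimate_cost("some-private-model", 1_000, 1_000, PRICES)
print(versioned, cost, unknown)
```

Picking the longest matching prefix means the versioned ID resolves to the more specific `claude-haiku-4-5` entry rather than the broader `claude-haiku` one.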
pip install traceforge-llm auto-registers a pytest plugin (via pytest11 entry point). Three fixtures appear in any test suite:
import pytest

@pytest.mark.asyncio
async def test_agent_runs_under_budget(tracer, tf_assert, tf_snapshot):
    async with tracer.run() as run:
        await my_agent("hello", _run=run)
    tf_assert(
        run.trace,
        has_span="search",
        llm_calls=1,
        max_cost_usd=0.01,
        max_tokens=2000,
    )
    # Golden-trace snapshot: fails if the span-type sequence drifts.
    tf_snapshot.assert_match(run.trace, "agent_v1")

| Fixture | What it gives you |
|---|---|
| tracer | A non-auto-saving Tracer per test (no .traceforge/ cruft) |
| tf_assert(trace, ...) | One-line common assertions: has_span, has_span_type, no_errors, llm_calls, tool_calls, max_cost_usd, max_tokens, min_spans |
| tf_snapshot.assert_match(trace, name) | Records the trace on first run, then asserts span-type-sequence similarity ≥ 0.8 on every subsequent run |
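The 0.8 threshold in the tf_snapshot row compares sequences of span types. One plausible way to score such a comparison, shown purely as an illustration, is the standard library's `difflib.SequenceMatcher`. The helper names and the span-type labels below are hypothetical, and TraceForge's actual metric may be computed differently.

```python
from difflib import SequenceMatcher


def sequence_similarity(recorded, current):
    """Similarity in [0, 1] between two sequences of span-type labels."""
    # autojunk=False stops the heuristic from discarding frequent labels.
    return SequenceMatcher(None, recorded, current, autojunk=False).ratio()


def matches_snapshot(recorded, current, threshold=0.8):
    """True when the current sequence is close enough to the recorded one."""
    return sequence_similarity(recorded, current) >= threshold


golden = ["llm", "tool", "llm"]
extra_tool_call = ["llm", "tool", "tool", "llm"]  # small drift
early_failure = ["llm", "error"]                  # large drift

print(matches_snapshot(golden, extra_tool_call))
print(matches_snapshot(golden, early_failure))
```

With a ratio-based score, one extra tool call in a short trace still passes, while a run that stops after an early error falls well below the threshold.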
Snapshots live in tests/__tf_snapshots__/<name>.jsonl by default — override with --tf-snapshot-dir. Commit them like you commit any other golden file. Refresh after intentional changes with:
pytest --tf-update-snapshots

| Provider | Class | Wraps |
|---|---|---|
| Anthropic | traceforge.integrations.anthropic.AnthropicInstrumentor | client.messages.create |
| OpenAI | traceforge.integrations.openai.OpenAIInstrumentor | client.chat.completions.create |
| LangChain | traceforge.integrations.langchain.LangChainInstrumentor | manual record_chain_step / record_llm_step |
LangChain is intentionally manual — auto-patching is fragile across versions, so TraceForge ships a bridge helper you call from your callback handler.
| Command | Purpose |
|---|---|
| traceforge init | Scaffold traceforge.yaml, agent.py example, .gitignore entry |
| traceforge list | Table of local runs (newest first) |
| traceforge open <id> | Open a run's HTML report in your browser |
| traceforge show <id> | Print a run summary to the terminal |
<id> accepts a ULID prefix or the human-readable run name (brave-salmon).
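Since run directories are named `<ulid>-<run-name>`, resolving an `<id>` can be pictured as a match against either half of the directory name. The sketch below is one way to do that; `resolve_run`, its error handling, and the directory names are made up for illustration and are not the CLI's actual implementation.

```python
def resolve_run(identifier, run_dirs):
    """Return the single directory name matching a ULID prefix or a run name."""
    if not identifier:
        raise ValueError("empty run identifier")
    matches = []
    for dirname in run_dirs:
        # Directory names follow <ulid>-<run-name>; a ULID contains no hyphen,
        # so the first hyphen separates the two parts.
        ulid, _, run_name = dirname.partition("-")
        if ulid.startswith(identifier.upper()) or run_name == identifier:
            matches.append(dirname)
    if len(matches) != 1:
        raise LookupError(f"{identifier!r} matched {len(matches)} runs")
    return matches[0]


# Hypothetical directory names for the example.
runs = [
    "01HZX3K7Q2M4N5P6R7S8T9V0WA-brave-salmon",
    "01HZX4A1B2C3D4E5F6G7H8J9KM-true-elk",
]
by_prefix = resolve_run("01HZX3", runs)
by_name = resolve_run("true-elk", runs)
print(by_prefix, by_name)
```

Raising on zero or multiple matches keeps an ambiguous prefix from silently opening the wrong run.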
- No auto-patching by default. Instrumentors are opt-in. Your code stays explicit about what's being traced.
- No time-travel debugging. TraceForge records and replays; it does not pause your agent mid-flight.
- No cloud storage. Traces live in .traceforge/runs/. Bring your own object store if you want central retention.
- No built-in eval scoring. TraceForge captures the run; pair it with evalkit (or your own scorer) for grading.
| Feature | Status |
|---|---|
| Async + sync context manager, @tracer.trace decorator | ✅ shipped |
| Anthropic / OpenAI instrumentors | ✅ shipped |
| LangChain bridge (manual) | ✅ shipped |
| File store: manifest.json + run.jsonl + report.html | ✅ shipped |
| Self-contained HTML report (no CDN) | ✅ shipped |
| LLM-mock replay | ✅ shipped |
| Dry-run replay (tool cache) | ✅ shipped |
| Cost tracking (per-span + manifest total) | ✅ shipped (v0.2) |
| Custom pricing tables | ✅ shipped (v0.2) |
| Pytest plugin (tracer, tf_assert) | ✅ shipped (v0.2) |
| Trace snapshot testing (tf_snapshot) | ✅ shipped (v0.2) |
| Streaming + tool-use in instrumentors | deferred |
| traceforge diff (span-level diff) | deferred |
| Slim mode (--slim) | deferred |
| LangGraph auto-instrumentation | manual only |
| Cloud storage backends | non-goal |
Track progress and propose features via GitHub Issues.
- Replay FAQ — why replays diverge, and how to fix it
- Example HTML report — live, self-contained
