TraceForge

Agent runtime tracing + LLM-mock replay for Python. Pip install. Async-first. Self-contained reports.

⚠️ Pre-release (v0.2). Tracer, replay, instrumentors, cost tracking, and the pytest plugin are implemented end-to-end and tested. APIs are stabilising — don't depend on this in production yet.

TraceForge records every LLM call, tool invocation, error, and state transition your agent makes into a typed span. The output is a replayable run.jsonl artifact plus a self-contained HTML report you can open in any browser: no server, no SaaS, no SDK lock-in. Replay mode re-executes the agent with cached LLM responses (and, in dry-run mode, cached tool outputs) so you can verify the execution path without burning API calls.

pip install "traceforge-llm[anthropic]"   # or [openai], [all]
traceforge init && python agent.py

Why TraceForge

Feature	TraceForge	LangSmith	Langfuse	OpenLLMetry	`print()`
Pip install, no account, no server	✅	—	partial (self-host)	✅	✅
Records LLM I/O + tool I/O + state per span	✅	✅	✅	partial	—
Replay with cached LLM responses (`llm-mock`)	✅	—	—	—	—
Dry-run replay with cached tool outputs	✅	—	—	—	—
Self-contained HTML report (no CDN, no server)	✅	—	—	—	—
Auto-cost tracking per-span + per-run	✅	✅	✅	partial	—
First-class pytest plugin with snapshot testing	✅	—	—	—	—
Auto-patches your SDK clients	opt-in	✅	✅	✅	n/a
Cloud storage / hosted dashboard	—	✅	✅	via vendor	—

Where TraceForge fits: when you need a local, file-based, replayable record of what your agent did — for debugging, CI regression tests, or post-hoc analysis — without sending your traces to anyone else's database. Auto-patching frameworks like LangSmith give you a UI; OpenLLMetry gives you OTel pipes. TraceForge gives you a JSONL you can git diff, an HTML you can email, and a tracer.replay() you can run offline.

60-second quickstart

1. Install and scaffold.

pip install "traceforge-llm[anthropic]"
traceforge init

traceforge init writes traceforge.yaml, a working agent.py example, and a .gitignore entry.

2. Wrap your agent.

import asyncio
from anthropic import AsyncAnthropic
from traceforge import Tracer
from traceforge.integrations.anthropic import AnthropicInstrumentor

tracer = Tracer()

async def main():
    async with tracer.run() as run:
        client = AnthropicInstrumentor(run).instrument(AsyncAnthropic())
        response = await client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            messages=[{"role": "user", "content": "What is 2 + 2?"}],
        )
        print(response.content[0].text)
    run.trace.print_summary()

asyncio.run(main())

3. Run.

export ANTHROPIC_API_KEY=sk-ant-...
python agent.py

You get a Rich-formatted summary on stdout, plus a directory at .traceforge/runs/<ulid>-<run-name>/ containing manifest.json, run.jsonl, and a self-contained report.html.

Library-only API (no instrumentor)

async with tracer.run() as run:
    run.record_llm_call(
        provider="anthropic",
        model="claude-haiku-4-5",
        messages=[...],
        response="...",
        input_tokens=12, output_tokens=4, latency_ms=180,
    )
    run.record_tool_call("search", tool_input={"q": "..."}, tool_output={"hits": 3})
    run.custom("phase.done", metadata={"step": 1})

run.trace.print_summary()

Manual recording is the lower-level API the instrumentors are built on. Useful when you don't want TraceForge anywhere near the SDK call.

Decorator sugar

@tracer.trace
async def my_agent(query, _run=None):
    _run.record_tool_call("search", {"q": query}, {"hits": 3})
    return "done"

await my_agent("hello")        # auto-saves trace to .traceforge/runs/
trace = tracer.last()

Reports

tracer.run() writes three files per run to .traceforge/runs/<ulid>-<run-name>/:

.traceforge/runs/01KS8E...-true-elk/
├── report.html      ← open this
├── run.jsonl        ← replayable artifact (one span per line + manifest)
└── manifest.json    ← aggregate counts + cost + token totals
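
Because run.jsonl is line-delimited JSON, it can be inspected with the standard library alone. The sketch below counts records per span type. The field name "type" and the sample records are assumptions made for illustration, not the documented TraceForge schema; check a real run.jsonl for the actual field names.

```python
import json
from collections import Counter


def count_span_types(jsonl_text: str, type_field: str = "type") -> Counter:
    """Count JSONL records by a (hypothetical) span-type field."""
    counts: Counter = Counter()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Records without the field (for example a manifest line) are grouped together.
        counts[record.get(type_field, "<no type>")] += 1
    return counts


# Illustrative records only; not the real TraceForge schema.
sample = "\n".join([
    '{"type": "llm", "model": "example-model"}',
    '{"type": "tool", "name": "search"}',
    '{"type": "llm", "model": "example-model"}',
])
print(count_span_types(sample))
```

To run it against a real trace, pass the file contents instead of the sample, for example `Path(".traceforge/runs/<ulid>-<run-name>/run.jsonl").read_text()`.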

Terminal: run.trace.print_summary() prints a Rich panel and a span tree.

HTML: self-contained, dark theme, no CDN. Open report.html in any browser (or via traceforge open <run-name>):

Stat strip at the top: duration, span count, LLM calls, tool calls, token totals, cost, errors
Span cards with type-coded left border (indigo LLM, cyan tool, slate custom, red errors)
Collapsible payloads for system prompts, message arrays, and tool I/O — kept folded by default so the page stays scannable
Per-span cost rendered inline for every LLM call

A live example sits at docs/example-report.html — open it directly to see the layout.

Replay

# llm-mock: LLM responses served from cache, tools execute live
result = await tracer.replay(trace, agent_fn, mode="llm-mock")

# dry-run: both LLM responses AND tool outputs served from cache, no network
result = await tracer.replay(trace, agent_fn, mode="dry-run")

result.print()
# Similarity: 100%  ·  Status: ALIGNED

The replay engine builds two interceptors keyed by SHA-256 of the original messages / tool inputs, and hands them to your agent function. Your agent consults the interceptor before calling out:

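async def my_agent(query, _run=None, _mock_llm=None, _mock_tool=None):
    messages = [{"role": "user", "content": query}]
    # During replay, serve the recorded response; None means a cache miss.
    cached = _mock_llm.get(messages) if _mock_llm is not None else None
    if cached is not None:
        return cached
    # Live path: `client` is your own SDK client, created elsewhere.
    return await client.messages.create(
        model="claude-haiku-4-5", max_tokens=256, messages=messages
    )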

The shipped instrumentors handle this for you — just pass mock_interceptor=_mock_llm when instrumenting:

client = AnthropicInstrumentor(_run, mock_interceptor=_mock_llm).instrument(AsyncAnthropic())
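
The SHA-256 keying described above can be sketched with the standard library. This is a minimal illustration, not TraceForge's implementation: the canonical-JSON serialisation and the `ResponseCache` class are assumptions, and the real interceptors may normalise their inputs differently.

```python
import hashlib
import json
from typing import Any


def cache_key(payload: Any) -> str:
    """Hash a JSON-serialisable payload into a stable hex key."""
    # sort_keys makes the key independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ResponseCache:
    """Hypothetical interceptor: maps hashed inputs to recorded outputs."""

    def __init__(self) -> None:
        self._store = {}

    def put(self, payload: Any, response: Any) -> None:
        self._store[cache_key(payload)] = response

    def get(self, payload: Any) -> Any:
        # Returns None on a cache miss, matching the agent example above.
        return self._store.get(cache_key(payload))


cache = ResponseCache()
messages = [{"role": "user", "content": "What is 2 + 2?"}]
cache.put(messages, "4")
print(cache.get(messages))
print(cache.get([{"role": "user", "content": "What is 2 + 3?"}]))
```

Under a scheme like this, any change to the messages (a timestamp in the prompt, reordered list items, extra whitespace) produces a different key and therefore a cache miss, which is one plausible reason a replay can diverge.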

Similarity scoring. ReplayResult.similarity_score is the ratio of matching span types between original and replayed traces. Below 0.4 the replay is marked DIVERGED. See docs/replay-faq.md for why replays diverge and how to fix it.
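
To make the idea of a span-type similarity score concrete, here is one way such a ratio could be computed. It is a sketch under assumptions: it uses `difflib.SequenceMatcher` from the standard library as a stand-in metric, and the exact formula TraceForge uses may differ. Only the 0.4 threshold comes from the text above; the function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Sequence

DIVERGED_BELOW = 0.4  # threshold stated in the README


def span_type_similarity(original: Sequence[str], replayed: Sequence[str]) -> float:
    """Ratio in [0, 1] of matching span types, order-sensitive (assumed metric)."""
    return SequenceMatcher(None, list(original), list(replayed)).ratio()


def is_diverged(score: float) -> bool:
    """True when the score falls below the DIVERGED threshold."""
    return score < DIVERGED_BELOW


original = ["llm", "tool", "llm"]
print(span_type_similarity(original, ["llm", "tool", "llm"]))
print(span_type_similarity(original, ["llm", "llm"]))
print(is_diverged(span_type_similarity(original, ["custom"])))
```

Comparing only span types keeps the score insensitive to wording changes in LLM output while still catching a skipped tool call or an extra LLM round-trip.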

Cost tracking

Every LLM span gets a USD cost estimate attached automatically, looked up from a built-in pricing table for the major Anthropic and OpenAI models (with longest-prefix matching, so versioned IDs like claude-haiku-4-5-20251001 resolve correctly). Aggregates flow into manifest.total_cost_usd.

async with tracer.run() as run:
    await client.messages.create(...)  # instrumentor records cost

print(run.trace.manifest.total_cost_usd)  # → 0.0042

Override the table for negotiated contract pricing or private models:

from traceforge import Tracer
from traceforge.pricing import ModelPrice

tracer = Tracer(pricing={
    "my-internal-model": ModelPrice(input_per_million=0.5, output_per_million=1.5),
    "claude-opus-4-7":   ModelPrice(input_per_million=12.0, output_per_million=60.0),
})

Unknown models cost 0 and emit a one-shot warning so the trace still saves.

Pytest plugin

Installing traceforge-llm auto-registers a pytest plugin (via the pytest11 entry point). Three fixtures become available in any test suite:

import pytest

@pytest.mark.asyncio
async def test_agent_runs_under_budget(tracer, tf_assert, tf_snapshot):
    async with tracer.run() as run:
        await my_agent("hello", _run=run)

    tf_assert(
        run.trace,
        has_span="search",
        llm_calls=1,
        max_cost_usd=0.01,
        max_tokens=2000,
    )

    # Golden-trace snapshot — fails if the span-type sequence drifts.
    tf_snapshot.assert_match(run.trace, "agent_v1")

Fixture	What it gives you
`tracer`	A non-auto-saving `Tracer` per test (no `.traceforge/` cruft)
`tf_assert(trace, ...)`	One-line common assertions: `has_span`, `has_span_type`, `no_errors`, `llm_calls`, `tool_calls`, `max_cost_usd`, `max_tokens`, `min_spans`
`tf_snapshot.assert_match(trace, name)`	Records the trace on first run, then asserts span-type-sequence similarity ≥ 0.8 on every subsequent run

Snapshots live in tests/__tf_snapshots__/<name>.jsonl by default — override with --tf-snapshot-dir. Commit them like you commit any other golden file. Refresh after intentional changes with:

pytest --tf-update-snapshots
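
The record-then-compare workflow behind golden-trace snapshots can be sketched in a few lines. This is not the plugin's implementation: the file layout (one JSON object with a "type" field per line), the `check_snapshot` helper, and the use of `difflib.SequenceMatcher` as the similarity metric are all assumptions. Only the 0.8 threshold and the update-on-request behaviour come from the text above.

```python
import json
import tempfile
from difflib import SequenceMatcher
from pathlib import Path
from typing import Sequence


def check_snapshot(span_types: Sequence[str], path: Path,
                   threshold: float = 0.8, update: bool = False) -> bool:
    """Record on first run (returns True); otherwise assert similarity (returns False)."""
    if update or not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("\n".join(json.dumps({"type": t}) for t in span_types))
        return True
    golden = [json.loads(line)["type"]
              for line in path.read_text().splitlines() if line.strip()]
    score = SequenceMatcher(None, golden, list(span_types)).ratio()
    assert score >= threshold, f"span-type sequence drifted: similarity {score:.2f}"
    return False


with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "agent_v1.jsonl"
    print(check_snapshot(["llm", "tool", "llm"], snapshot))  # first run records
    print(check_snapshot(["llm", "tool", "llm"], snapshot))  # second run compares
```

Passing `update=True` plays the role of the `--tf-update-snapshots` flag: it overwrites the golden file instead of comparing against it.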

Instrumentors

Provider	Class	Wraps
Anthropic	`traceforge.integrations.anthropic.AnthropicInstrumentor`	`client.messages.create`
OpenAI	`traceforge.integrations.openai.OpenAIInstrumentor`	`client.chat.completions.create`
LangChain	`traceforge.integrations.langchain.LangChainInstrumentor`	manual `record_chain_step` / `record_llm_step`

LangChain is intentionally manual — auto-patching is fragile across versions, so TraceForge ships a bridge helper you call from your callback handler.

CLI

Command	Purpose
`traceforge init`	Scaffold `traceforge.yaml`, `agent.py` example, `.gitignore` entry
`traceforge list`	Table of local runs (newest first)
`traceforge open <id>`	Open a run's HTML report in your browser
`traceforge show <id>`	Print a run summary to the terminal

<id> accepts a ULID prefix or the human-readable run name (brave-salmon).
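
Resolving an id against run directories named <ulid>-<run-name> could work roughly as follows. This is a sketch, not the CLI's actual logic: `resolve_run` is a hypothetical helper, the directory names are invented examples, and the handling of ambiguous or missing matches is an assumption.

```python
from typing import Iterable, List


def resolve_run(run_id: str, dir_names: Iterable[str]) -> str:
    """Match a ULID prefix or an exact run name against '<ulid>-<run-name>' directories."""
    matches: List[str] = []
    for name in dir_names:
        # ULIDs contain no hyphen, so the first hyphen separates ULID from run name.
        ulid, _, run_name = name.partition("-")
        # ULIDs are conventionally upper case, so the prefix is compared upper-cased.
        if ulid.startswith(run_id.upper()) or run_name == run_id:
            matches.append(name)
    if len(matches) != 1:
        raise LookupError(f"{run_id!r} matched {len(matches)} runs")
    return matches[0]


# Invented directory names for illustration.
runs = [
    "01HZX3V7K2M4N5P6Q7R8S9T0AB-brave-salmon",
    "01HZX4W8C3D5E6F7G8H9J0K1MN-true-elk",
]
print(resolve_run("brave-salmon", runs))
print(resolve_run("01HZX4", runs))
```

Raising on zero or multiple matches is a design choice in this sketch: a prefix shared by several runs is treated as an error rather than silently picking one.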

Non-goals

No auto-patching by default. Instrumentors are opt-in. Your code stays explicit about what's being traced.
No time-travel debugging. TraceForge records and replays; it does not pause your agent mid-flight.
No cloud storage. Traces live in .traceforge/runs/. Bring your own object store if you want central retention.
No built-in eval scoring. TraceForge captures the run; pair it with evalkit (or your own scorer) for grading.

Status

Feature	Status
Async + sync context manager, `@tracer.trace` decorator	✅ shipped
Anthropic / OpenAI instrumentors	✅ shipped
LangChain bridge (manual)	✅ shipped
File store: `manifest.json` + `run.jsonl` + `report.html`	✅ shipped
Self-contained HTML report (no CDN)	✅ shipped
LLM-mock replay	✅ shipped
Dry-run replay (tool cache)	✅ shipped
Cost tracking (per-span + manifest total)	✅ shipped (v0.2)
Custom pricing tables	✅ shipped (v0.2)
Pytest plugin (`tracer`, `tf_assert`)	✅ shipped (v0.2)
Trace snapshot testing (`tf_snapshot`)	✅ shipped (v0.2)
Streaming + tool-use in instrumentors	deferred
`traceforge diff` (span-level diff)	deferred
Slim mode (`--slim`)	deferred
LangGraph auto-instrumentation	manual only
Cloud storage backends	non-goal

Track progress and propose features via GitHub Issues.

Docs

Replay FAQ (docs/replay-faq.md): why replays diverge, and how to fix it
Example HTML report (docs/example-report.html): live, self-contained

License

Apache 2.0.
