From dbf168600afa3e1e9965030666fa135de8423fe8 Mon Sep 17 00:00:00 2001 From: Mike Halagan Date: Wed, 25 Feb 2026 21:45:18 -0600 Subject: [PATCH 1/2] docs: standardize display name to "TraceCraft" (PascalCase) Replace all instances of "Trace Craft" (two words) with "TraceCraft" across README, docs, CLI, TUI, and config files to align the display name with the package name `tracecraft` and class prefix `TraceCraft`, following Python ecosystem conventions (e.g. FastAPI, LangChain). Co-Authored-By: Claude Opus 4.6 --- CHANGELOG.md | 2 +- CLAUDE.md | 4 +- README.md | 20 +++---- docs/api/core.md | 4 +- docs/api/index.md | 4 +- docs/changelog.md | 6 +- docs/contributing.md | 8 +-- docs/deployment/aws-agentcore.md | 8 +-- docs/deployment/azure-foundry.md | 10 ++-- docs/deployment/gcp-vertex-agent.md | 12 ++-- docs/deployment/high-throughput.md | 10 ++-- docs/deployment/index.md | 8 +-- docs/deployment/kubernetes.md | 4 +- docs/deployment/production.md | 4 +- docs/faq.md | 62 +++++++++---------- docs/getting-help.md | 26 ++++---- docs/getting-started/concepts.md | 20 +++---- docs/getting-started/index.md | 16 ++--- docs/getting-started/installation.md | 18 +++--- docs/getting-started/quickstart.md | 12 ++-- docs/glossary.md | 70 +++++++++++----------- docs/index.md | 20 +++---- docs/integrations/auto-instrumentation.md | 10 ++-- docs/integrations/claude-sdk.md | 18 +++--- docs/integrations/cloud-platforms.md | 4 +- docs/integrations/index.md | 8 +-- docs/integrations/langchain.md | 24 ++++---- docs/integrations/llamaindex.md | 46 +++++++------- docs/integrations/otel-receiver.md | 30 +++++----- docs/integrations/pydantic-ai.md | 28 ++++----- docs/migration/from-langfuse.md | 18 +++--- docs/migration/from-langsmith.md | 14 ++--- docs/migration/from-openllmetry.md | 24 ++++---- docs/migration/index.md | 14 ++--- docs/reference/quick-reference.md | 4 +- docs/troubleshooting.md | 10 ++-- docs/user-guide/configuration.md | 6 +- docs/user-guide/decorators.md | 8 +-- docs/user-guide/exporters.md | 2 +- docs/user-guide/index.md | 2 +- docs/user-guide/multi-tenancy.md | 4 +- docs/user-guide/performance.md | 10 ++-- docs/user-guide/processors.md | 4 +- docs/user-guide/remote-trace-sources.md | 8 +-- docs/user-guide/security.md | 16 ++--- docs/user-guide/tui.md | 40 ++++++------- mkdocs.yml | 2 +- src/tracecraft/cli/main.py | 10 ++-- src/tracecraft/tui/app.py | 6 +- src/tracecraft/tui/screens/setup_wizard.py | 6 +- 50 files changed, 362 insertions(+), 362 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 930347b..6828702 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -35,7 +35,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added -- Initial release of Trace Craft +- Initial release of TraceCraft - Core tracing runtime with decorator-based instrumentation - Framework adapters for LangChain, LlamaIndex, PydanticAI, and Claude SDK - Exporters: console, JSONL, OTLP, MLflow, HTML diff --git a/CLAUDE.md b/CLAUDE.md index f5ba1f8..39b938a 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,4 +1,4 @@ -# Trace Craft +# TraceCraft Vendor-neutral LLM observability SDK - instrument once, observe anywhere. 
@@ -99,6 +99,6 @@ with runtime.run("task_name") as run: - The `init()` function in `core/runtime.py` is the main entry point - it creates a global singleton runtime - Decorators in `instrumentation/decorators.py` create Steps and attach them to the current AgentRun via context vars -- Adapters translate framework-specific callbacks/spans into Trace Craft Steps +- Adapters translate framework-specific callbacks/spans into TraceCraft Steps - Processors run in a pipeline (configurable order: SAFETY or EFFICIENCY) before export - The TUI reads from JSONL/SQLite storage for offline trace exploration diff --git a/README.md b/README.md index 1aa4bc2..adca732 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# Trace Craft +# TraceCraft [![CI](https://github.com/LocalAI/tracecraft/actions/workflows/ci.yml/badge.svg)](https://github.com/LocalAI/tracecraft/actions/workflows/ci.yml) [![Coverage](https://codecov.io/gh/LocalAI/tracecraft/branch/main/graph/badge.svg)](https://codecov.io/gh/LocalAI/tracecraft) @@ -8,13 +8,13 @@ > **Vendor-neutral LLM observability — instrument once, observe anywhere.** -Trace Craft is a Python observability SDK with a built-in **Terminal UI (TUI)** that lets you visually explore, debug, and analyze agent traces right in your terminal — no browser, no cloud dashboard, no waiting. +TraceCraft is a Python observability SDK with a built-in **Terminal UI (TUI)** that lets you visually explore, debug, and analyze agent traces right in your terminal — no browser, no cloud dashboard, no waiting. --- ## The fastest path: zero code changes -If your app already uses OpenAI, Anthropic, LangChain, LlamaIndex, or any OpenTelemetry-compatible framework, Trace Craft can observe it **without touching a single line of application code**. +If your app already uses OpenAI, Anthropic, LangChain, LlamaIndex, or any OpenTelemetry-compatible framework, TraceCraft can observe it **without touching a single line of application code**. **Step 1 — Install and set one environment variable:** @@ -40,11 +40,11 @@ Traces from any OTLP-compatible framework (OpenLLMetry, LangChain, LlamaIndex, D --- -![Trace Craft TUI - Main View](docs/assets/screenshots/tui-main-view.svg) +![TraceCraft TUI - Main View](docs/assets/screenshots/tui-main-view.svg) *All your agent runs at a glance — name, duration, token usage, and status.* -![Trace Craft TUI - Waterfall and Detail View](docs/assets/screenshots/tui-waterfall-view.svg) +![TraceCraft TUI - Waterfall and Detail View](docs/assets/screenshots/tui-waterfall-view.svg) *Hierarchical waterfall view with timing bars. See exactly where your agent spends its time. Navigate to any LLM step and press `i` for the prompt, `o` for the response, or `a` for attributes.* @@ -85,13 +85,13 @@ Or, if you prefer to write traces to a file and open the TUI separately: tracecraft tui ``` -> **Note:** Call `tracecraft.init()` **before** importing any LLM SDK. Trace Craft patches SDKs at import time — importing first means the patch won't apply. +> **Note:** Call `tracecraft.init()` **before** importing any LLM SDK. TraceCraft patches SDKs at import time — importing first means the patch won't apply. --- ## SDK decorators -For fine-grained control — custom span names, explicit inputs/outputs, structured step hierarchies — Trace Craft provides `@trace_agent`, `@trace_tool`, `@trace_llm`, and `@trace_retrieval` decorators, plus a `step()` context manager for inline instrumentation. See the [SDK Guide](https://tracecraft.dev/getting-started/quickstart/) for details. 
+For fine-grained control — custom span names, explicit inputs/outputs, structured step hierarchies — TraceCraft provides `@trace_agent`, `@trace_tool`, `@trace_llm`, and `@trace_retrieval` decorators, plus a `step()` context manager for inline instrumentation. See the [SDK Guide](https://tracecraft.dev/getting-started/quickstart/) for details. --- @@ -112,9 +112,9 @@ For fine-grained control — custom span names, explicit inputs/outputs, structu --- -## Why Trace Craft? +## Why TraceCraft? -| Feature | Trace Craft | LangSmith | Langfuse | Phoenix | +| Feature | TraceCraft | LangSmith | Langfuse | Phoenix | |---------|------------|-----------|----------|---------| | **Terminal UI** | **Yes — built-in** | No | No | No | | **Zero-Code Instrumentation** | Yes | No | No | No | @@ -257,4 +257,4 @@ Apache-2.0 — See [LICENSE](LICENSE) for details. --- -Made with care by the Trace Craft Contributors +Made with care by the TraceCraft Contributors diff --git a/docs/api/core.md b/docs/api/core.md index eae0b8c..c45c002 100644 --- a/docs/api/core.md +++ b/docs/api/core.md @@ -1,6 +1,6 @@ # Core API -Core functionality of Trace Craft. +Core functionality of TraceCraft. ## Module: tracecraft.core @@ -59,7 +59,7 @@ Enumeration of step types: ### init() -Initialize Trace Craft with configuration. Returns the global `TraceCraftRuntime` singleton (subsequent calls return the same instance). +Initialize TraceCraft with configuration. Returns the global `TraceCraftRuntime` singleton (subsequent calls return the same instance). **Signature:** diff --git a/docs/api/index.md b/docs/api/index.md index 1129f5d..e0a8529 100644 --- a/docs/api/index.md +++ b/docs/api/index.md @@ -1,6 +1,6 @@ # API Reference -Complete API documentation for Trace Craft. +Complete API documentation for TraceCraft. ## Modules @@ -179,7 +179,7 @@ The following pages contain auto-generated API documentation from source code do ## Type Hints -Trace Craft is fully typed. Import types for static analysis: +TraceCraft is fully typed. Import types for static analysis: ```python from typing import TYPE_CHECKING diff --git a/docs/changelog.md b/docs/changelog.md index d58c4e3..6e13ead 100644 --- a/docs/changelog.md +++ b/docs/changelog.md @@ -1,6 +1,6 @@ # Changelog -All notable changes to Trace Craft will be documented in this file. +All notable changes to TraceCraft will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). @@ -15,7 +15,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added -- Initial release of Trace Craft +- Initial release of TraceCraft - Core instrumentation decorators (`@trace_agent`, `@trace_tool`, `@trace_llm`, `@trace_retrieval`) - Console and JSONL exporters - OTLP exporter for OpenTelemetry Protocol support @@ -61,7 +61,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### 0.1.0 - Initial Release -Trace Craft is now available as a vendor-neutral LLM observability SDK. Key features: +TraceCraft is now available as a vendor-neutral LLM observability SDK. 
Key features: - **Local-First Development** - Beautiful console output and JSONL files without any backend - **Built on OpenTelemetry** - Industry-standard foundation diff --git a/docs/contributing.md b/docs/contributing.md index caceb04..2a590e9 100644 --- a/docs/contributing.md +++ b/docs/contributing.md @@ -1,6 +1,6 @@ -# Contributing to Trace Craft +# Contributing to TraceCraft -Thank you for your interest in contributing to Trace Craft! This guide will help you get started. +Thank you for your interest in contributing to TraceCraft! This guide will help you get started. ## Code of Conduct @@ -17,7 +17,7 @@ Found a bug? Please report it: - Clear title and description - Steps to reproduce - Expected vs actual behavior - - Your environment (OS, Python version, Trace Craft version) + - Your environment (OS, Python version, TraceCraft version) - Minimal code example if possible ### Suggesting Features @@ -465,4 +465,4 @@ By contributing, you agree that your contributions will be licensed under the Ap Don't hesitate to ask! Create a discussion or reach out to maintainers. -Thank you for contributing to Trace Craft! +Thank you for contributing to TraceCraft! diff --git a/docs/deployment/aws-agentcore.md b/docs/deployment/aws-agentcore.md index 8c62e8a..bbfd031 100644 --- a/docs/deployment/aws-agentcore.md +++ b/docs/deployment/aws-agentcore.md @@ -1,6 +1,6 @@ # AWS Bedrock AgentCore Deployment Guide -Deploy Trace Craft-instrumented applications to AWS with X-Ray and Bedrock AgentCore observability. +Deploy TraceCraft-instrumented applications to AWS with X-Ray and Bedrock AgentCore observability. ## Architecture @@ -10,7 +10,7 @@ Deploy Trace Craft-instrumented applications to AWS with X-Ray and Bedrock Agent │ │ │ ┌─────────────────┐ ┌──────────────────────────────┐ │ │ │ Your Agent │─────▶│ ADOT Collector │ │ -│ │ (Trace Craft │ │ (Sidecar/DaemonSet) │ │ +│ │ (TraceCraft │ │ (Sidecar/DaemonSet) │ │ │ │ enabled) │ └──────────────────────────────┘ │ │ └─────────────────┘ │ │ │ ▼ │ @@ -29,7 +29,7 @@ Deploy Trace Craft-instrumented applications to AWS with X-Ray and Bedrock Agent ## Quick Start -### 1. Install Trace Craft +### 1. Install TraceCraft ```bash pip install tracecraft[aws-agentcore] @@ -56,7 +56,7 @@ exporter = create_agentcore_exporter( use_xray_propagation=True, ) -# Initialize Trace Craft +# Initialize TraceCraft tracecraft.init( exporters=[exporter], console=False, # Disable console in production diff --git a/docs/deployment/azure-foundry.md b/docs/deployment/azure-foundry.md index 1025a6b..c6df5fb 100644 --- a/docs/deployment/azure-foundry.md +++ b/docs/deployment/azure-foundry.md @@ -1,6 +1,6 @@ # Azure AI Foundry Deployment Guide -Deploy Trace Craft-instrumented applications to Azure with AI Foundry observability. +Deploy TraceCraft-instrumented applications to Azure with AI Foundry observability. ## Architecture @@ -10,7 +10,7 @@ Deploy Trace Craft-instrumented applications to Azure with AI Foundry observabil │ │ │ ┌─────────────────┐ ┌──────────────────────────────┐ │ │ │ Your Agent │─────▶│ Application Insights │ │ -│ │ (Trace Craft │ │ (AI Foundry Observability) │ │ +│ │ (TraceCraft │ │ (AI Foundry Observability) │ │ │ │ enabled) │ └──────────────────────────────┘ │ │ └─────────────────┘ │ │ │ ▼ │ @@ -36,7 +36,7 @@ Deploy Trace Craft-instrumented applications to Azure with AI Foundry observabil 3. Click **Overview** > **Connection String** 4. Copy the connection string -### 2. Install Trace Craft +### 2. 
Install TraceCraft ```bash pip install tracecraft[azure-foundry] @@ -66,7 +66,7 @@ exporter = create_foundry_exporter( agent_description="Researches topics and synthesizes information", ) -# Initialize Trace Craft +# Initialize TraceCraft tracecraft.init( exporters=[exporter], console=False, # Disable console in production @@ -157,7 +157,7 @@ tracecraft.init(exporters=[exporter]) ## OTel GenAI Semantic Conventions -Trace Craft exports traces following OTel GenAI semantic conventions: +TraceCraft exports traces following OTel GenAI semantic conventions: ### Agent Spans diff --git a/docs/deployment/gcp-vertex-agent.md b/docs/deployment/gcp-vertex-agent.md index 77b9758..e64f46c 100644 --- a/docs/deployment/gcp-vertex-agent.md +++ b/docs/deployment/gcp-vertex-agent.md @@ -1,6 +1,6 @@ # GCP Vertex AI Agent Builder Deployment Guide -Deploy Trace Craft-instrumented applications to GCP with Cloud Trace and Vertex AI Agent Builder observability. +Deploy TraceCraft-instrumented applications to GCP with Cloud Trace and Vertex AI Agent Builder observability. ## Architecture @@ -10,7 +10,7 @@ Deploy Trace Craft-instrumented applications to GCP with Cloud Trace and Vertex | | | +------------------+ +------------------------------+ | | | Your Agent |----->| Cloud Trace | | -| | (Trace Craft | | (OTel/Cloud Trace API) | | +| | (TraceCraft | | (OTel/Cloud Trace API) | | | | enabled) | +------------------------------+ | | +------------------+ | | | v | @@ -36,7 +36,7 @@ Deploy Trace Craft-instrumented applications to GCP with Cloud Trace and Vertex gcloud services enable cloudtrace.googleapis.com --project=YOUR_PROJECT_ID ``` -### 2. Install Trace Craft +### 2. Install TraceCraft ```bash pip install tracecraft[gcp-vertex-agent] @@ -72,7 +72,7 @@ exporter = create_vertex_agent_exporter( reasoning_engine_id="re-001", ) -# Initialize Trace Craft +# Initialize TraceCraft tracecraft.init( exporters=[exporter], console=False, # Disable console in production @@ -268,7 +268,7 @@ Example: X-Cloud-Trace-Context: 105445aa7843bc8bf206b12000100000/12345678901234567;o=1 ``` -Trace Craft also supports W3C Trace Context (`traceparent` header) which GCP natively understands. +TraceCraft also supports W3C Trace Context (`traceparent` header) which GCP natively understands. ## Session Tracking for Multi-Turn @@ -290,7 +290,7 @@ run = AgentRun( ## OTel GenAI Semantic Conventions -Trace Craft exports traces following OTel GenAI semantic conventions: +TraceCraft exports traces following OTel GenAI semantic conventions: ### Agent Spans diff --git a/docs/deployment/high-throughput.md b/docs/deployment/high-throughput.md index 8659827..e085053 100644 --- a/docs/deployment/high-throughput.md +++ b/docs/deployment/high-throughput.md @@ -1,6 +1,6 @@ # High Throughput Deployment Guide -Configure Trace Craft for high-volume production environments. +Configure TraceCraft for high-volume production environments. 
## Overview @@ -194,10 +194,10 @@ Tested on 8-core, 32GB RAM, Kubernetes cluster: | Configuration | Throughput | Latency (p99) | Drop Rate | |---------------|------------|---------------|-----------| | Baseline (no tracing) | 10,000 req/s | 50ms | 0% | -| Trace Craft (console) | 8,500 req/s | 65ms | 0% | -| Trace Craft (OTLP) | 9,200 req/s | 55ms | 0% | -| Trace Craft (buffered) | 9,800 req/s | 52ms | 0% | -| Trace Craft (sampled 10%) | 9,900 req/s | 51ms | 0% | +| TraceCraft (console) | 8,500 req/s | 65ms | 0% | +| TraceCraft (OTLP) | 9,200 req/s | 55ms | 0% | +| TraceCraft (buffered) | 9,800 req/s | 52ms | 0% | +| TraceCraft (sampled 10%) | 9,900 req/s | 51ms | 0% | ## Monitoring the Tracing System diff --git a/docs/deployment/index.md b/docs/deployment/index.md index ed58e14..085f3fa 100644 --- a/docs/deployment/index.md +++ b/docs/deployment/index.md @@ -1,8 +1,8 @@ # Deployment -Trace Craft is designed to run everywhere - from a developer laptop to high-throughput production +TraceCraft is designed to run everywhere - from a developer laptop to high-throughput production clusters on managed cloud platforms. This section covers how to configure, deploy, and operate -Trace Craft in each environment. +TraceCraft in each environment. ## Deployment Options @@ -21,7 +21,7 @@ Trace Craft in each environment. --- - Deploy Trace Craft alongside AWS Bedrock AgentCore. Covers IAM roles, CloudWatch + Deploy TraceCraft alongside AWS Bedrock AgentCore. Covers IAM roles, CloudWatch integration, and ECS task definitions. [:octicons-arrow-right-24: AWS AgentCore](aws-agentcore.md) @@ -39,7 +39,7 @@ Trace Craft in each environment. --- - Run Trace Craft with Vertex AI Agent Builder. Covers Workload Identity, Cloud Trace + Run TraceCraft with Vertex AI Agent Builder. Covers Workload Identity, Cloud Trace export, and Cloud Run deployment. [:octicons-arrow-right-24: GCP Vertex Agent](gcp-vertex-agent.md) diff --git a/docs/deployment/kubernetes.md b/docs/deployment/kubernetes.md index ba6a872..fd2af2a 100644 --- a/docs/deployment/kubernetes.md +++ b/docs/deployment/kubernetes.md @@ -1,6 +1,6 @@ # Kubernetes Deployment Guide -Deploy Trace Craft-instrumented applications to Kubernetes with OTLP export. +Deploy TraceCraft-instrumented applications to Kubernetes with OTLP export. ## Architecture @@ -10,7 +10,7 @@ Deploy Trace Craft-instrumented applications to Kubernetes with OTLP export. │ │ │ ┌─────────────┐ ┌──────────────────┐ │ │ │ Your App │─────▶│ OTEL Collector │ │ -│ │ (Trace Craft│ │ (DaemonSet) │ │ +│ │ (TraceCraft│ │ (DaemonSet) │ │ │ │ enabled) │ └────────┬─────────┘ │ │ └─────────────┘ │ │ │ ▼ │ diff --git a/docs/deployment/production.md b/docs/deployment/production.md index 713ad67..045437f 100644 --- a/docs/deployment/production.md +++ b/docs/deployment/production.md @@ -1,6 +1,6 @@ # Production Deployment -Best practices for deploying Trace Craft in production environments. +Best practices for deploying TraceCraft in production environments. ## Production Configuration @@ -101,7 +101,7 @@ def health_check(): ### Metrics -Track Trace Craft metrics: +Track TraceCraft metrics: ```python from tracecraft import get_runtime diff --git a/docs/faq.md b/docs/faq.md index a94f28c..0b2e695 100644 --- a/docs/faq.md +++ b/docs/faq.md @@ -1,24 +1,24 @@ # Frequently Asked Questions -Find answers to the most common questions about Trace Craft. If your question is not covered here, +Find answers to the most common questions about TraceCraft. 
If your question is not covered here, open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). --- ## Getting Started -??? question "What is Trace Craft?" - Trace Craft is a vendor-neutral observability SDK for LLM applications. You instrument your - code once with Trace Craft decorators or context managers, and then export traces to any +??? question "What is TraceCraft?" + TraceCraft is a vendor-neutral observability SDK for LLM applications. You instrument your + code once with TraceCraft decorators or context managers, and then export traces to any backend: console, local JSONL files, OTLP-compatible platforms (Jaeger, Honeycomb, Grafana Tempo, etc.), MLflow, or static HTML reports. - Trace Craft captures the full hierarchy of an agent run - agent calls, LLM requests, tool + TraceCraft captures the full hierarchy of an agent run - agent calls, LLM requests, tool invocations, retrieval operations, and more - without locking you into any vendor's proprietary format. -??? question "Which Python version does Trace Craft require?" - Trace Craft requires **Python 3.11 or later**. It uses modern Python features including +??? question "Which Python version does TraceCraft require?" + TraceCraft requires **Python 3.11 or later**. It uses modern Python features including `X | Y` union syntax, built-in generics (`list[str]`, `dict[str, Any]`), and `datetime.now(UTC)`. @@ -27,8 +27,8 @@ open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). pip install tracecraft ``` -??? question "Which LLM frameworks does Trace Craft support?" - Trace Craft ships adapters for the most widely used frameworks: +??? question "Which LLM frameworks does TraceCraft support?" + TraceCraft ships adapters for the most widely used frameworks: | Framework | Package Extra | How to Enable | |-----------|--------------|---------------| @@ -43,7 +43,7 @@ open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). `@trace_llm`, and `@trace_retrieval` decorators directly, or wrap calls in a `with tracecraft.step(...)` context manager. -??? question "Can I use Trace Craft without any LLM framework?" +??? question "Can I use TraceCraft without any LLM framework?" Yes. The decorator API and context manager work with any Python code, regardless of which LLM SDK you use. @@ -64,7 +64,7 @@ open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). ... ``` -??? question "Does Trace Craft require an external service to work?" +??? question "Does TraceCraft require an external service to work?" No. By default `tracecraft.init()` writes traces to the console and optionally to a local JSONL file. No network connection or cloud account is needed for development. @@ -80,7 +80,7 @@ open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). ## Configuration -??? question "How do I configure Trace Craft differently for development and production?" +??? question "How do I configure TraceCraft differently for development and production?" Use environment variables so the same code runs correctly in both environments: ```python title="app.py" @@ -110,8 +110,8 @@ open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). TRACECRAFT_SAMPLING_RATE=0.1 ``` -??? question "What environment variables does Trace Craft read?" - Trace Craft respects the following environment variables: +??? question "What environment variables does TraceCraft read?" 
+ TraceCraft respects the following environment variables: | Variable | Description | Default | |----------|-------------|---------| @@ -202,7 +202,7 @@ open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). # Pass the handler at invocation time result = chain.invoke( - {"query": "What is Trace Craft?"}, + {"query": "What is TraceCraft?"}, config={"callbacks": [handler]}, ) ``` @@ -257,7 +257,7 @@ open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). ??? question "How do I receive traces from an existing OpenTelemetry setup?" Use the OTLP receiver to accept traces from any OTLP-compatible source and convert them - into Trace Craft steps: + into TraceCraft steps: ```bash pip install "tracecraft[receiver]" @@ -273,11 +273,11 @@ open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). ) ``` - This is useful when you have an existing OTel pipeline and want to add Trace Craft's TUI, + This is useful when you have an existing OTel pipeline and want to add TraceCraft's TUI, PII redaction, or JSONL export without changing your instrumentation code. -??? question "Does Trace Craft work with async code?" - Yes. All Trace Craft decorators support both synchronous and asynchronous functions. +??? question "Does TraceCraft work with async code?" + Yes. All TraceCraft decorators support both synchronous and asynchronous functions. Context variables (`contextvars`) propagate correctly across `await` boundaries within a single async task. @@ -300,8 +300,8 @@ open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). ## Exporters -??? question "What export backends does Trace Craft support?" - Trace Craft ships five built-in exporters: +??? question "What export backends does TraceCraft support?" + TraceCraft ships five built-in exporters: | Exporter | Class | Use Case | |----------|-------|----------| @@ -397,7 +397,7 @@ open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). ## Troubleshooting -??? question "I get ModuleNotFoundError when importing a Trace Craft adapter. What is wrong?" +??? question "I get ModuleNotFoundError when importing a TraceCraft adapter. What is wrong?" Most adapters and exporters are optional dependencies grouped into extras. Install the extra that matches the adapter you want: @@ -448,7 +448,7 @@ open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). If the modules are imported at the top of a different file that loads first, move `tracecraft.init(auto_instrument=True)` to the very start of your application entry point. -??? question "How do I enable debug logging for Trace Craft itself?" +??? question "How do I enable debug logging for TraceCraft itself?" Set the `tracecraft` logger to `DEBUG` level: ```python @@ -468,7 +468,7 @@ open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). ## Performance -??? question "How much overhead does Trace Craft add?" +??? question "How much overhead does TraceCraft add?" The overhead depends on your configuration: | Configuration | Typical Overhead | Max Throughput | @@ -478,11 +478,11 @@ open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). | 1% sampling, async batch export | <0.5 ms/trace | ~50 K+ traces/s | For most LLM applications the network latency of the LLM API call (hundreds of - milliseconds) far exceeds Trace Craft's instrumentation overhead. + milliseconds) far exceeds TraceCraft's instrumentation overhead. See [High Throughput](deployment/high-throughput.md) for tuning guidance. -??? 
question "Trace Craft is using a lot of memory. What can I do?" +??? question "TraceCraft is using a lot of memory. What can I do?" High memory usage is almost always caused by a large in-memory export queue. Reduce it by lowering the batch size and queue limit on the async exporter: @@ -504,7 +504,7 @@ open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). Also consider lowering `sampling_rate` to reduce the number of traces being queued. ??? question "What sampling strategies are available?" - Trace Craft supports three sampling strategies via `SamplingProcessor`: + TraceCraft supports three sampling strategies via `SamplingProcessor`: ```python from tracecraft.processors.sampling import SamplingProcessor @@ -535,9 +535,9 @@ open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). ## Production -??? question "How do I deploy Trace Craft on Kubernetes?" +??? question "How do I deploy TraceCraft on Kubernetes?" The recommended pattern is to run an OpenTelemetry Collector as a DaemonSet or sidecar - and point Trace Craft's OTLP exporter at it. This decouples your application from the + and point TraceCraft's OTLP exporter at it. This decouples your application from the observability backend. ```yaml title="tracecraft-config.yaml" @@ -554,8 +554,8 @@ open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). See the complete [Kubernetes Deployment](deployment/kubernetes.md) guide for Helm values, resource limits, and HPA configuration. -??? question "How does Trace Craft integrate with cloud-managed AI platforms?" - Trace Craft ships contrib helpers for AWS, Azure, and GCP that handle credential +??? question "How does TraceCraft integrate with cloud-managed AI platforms?" + TraceCraft ships contrib helpers for AWS, Azure, and GCP that handle credential resolution and platform-specific export targets: - **AWS AgentCore:** See [AWS AgentCore](deployment/aws-agentcore.md) for IAM role setup diff --git a/docs/getting-help.md b/docs/getting-help.md index 6394e27..6d8adf0 100644 --- a/docs/getting-help.md +++ b/docs/getting-help.md @@ -1,6 +1,6 @@ # Getting Help -Trace Craft is maintained by an open-source community. This page describes the best way to get help, report problems, and contribute ideas depending on your situation. +TraceCraft is maintained by an open-source community. This page describes the best way to get help, report problems, and contribute ideas depending on your situation. --- @@ -64,7 +64,7 @@ Before opening an issue or starting a discussion, the answer may already exist: - You have a question about how to use a feature - You want to know the recommended approach for a use case - You have a feature idea you want to explore before opening a formal request -- You want to share a project that uses Trace Craft +- You want to share a project that uses TraceCraft - You are unsure whether something is a bug or intentional behavior ### Discussion Categories @@ -73,7 +73,7 @@ Before opening an issue or starting a discussion, the answer may already exist: |---|---| | **Q&A** | Questions about installation, usage, and configuration | | **Ideas** | Feature proposals, enhancement suggestions | -| **Show and Tell** | Projects, integrations, and examples built with Trace Craft | +| **Show and Tell** | Projects, integrations, and examples built with TraceCraft | | **General** | Everything else | ### How to Ask an Effective Question @@ -83,7 +83,7 @@ A well-formed question gets a faster, more useful answer. Include the following: **1. 
Your environment** ``` -Trace Craft version: 0.x.y (tracecraft --version) +TraceCraft version: 0.x.y (tracecraft --version) Python version: 3.11.x (python --version) OS: macOS 14, Ubuntu 22.04, Windows 11, etc. Installation: pip / uv / conda / source @@ -125,7 +125,7 @@ Paste the full traceback, not just the last line. ## Bug Reports (GitHub Issues) -Use **GitHub Issues** when you have confirmed that Trace Craft behaves incorrectly: crashes, wrong output, silent failures, or behavior that contradicts the documentation. +Use **GitHub Issues** when you have confirmed that TraceCraft behaves incorrectly: crashes, wrong output, silent failures, or behavior that contradicts the documentation. [:octicons-mark-github-16: Open an Issue](https://github.com/LocalAI/tracecraft/issues/new/choose){ .md-button .md-button--primary } @@ -147,7 +147,7 @@ A clear, concise description of what the bug is. ## Environment -- Trace Craft version: (output of `tracecraft --version` or `python -c "import tracecraft; print(tracecraft.__version__)"`) +- TraceCraft version: (output of `tracecraft --version` or `python -c "import tracecraft; print(tracecraft.__version__)"`) - Python version: (output of `python --version`) - OS: (e.g., macOS 14.5, Ubuntu 22.04, Windows 11) - Installation method: (pip, uv, conda, from source) @@ -155,7 +155,7 @@ A clear, concise description of what the bug is. ## Steps to Reproduce -1. Install Trace Craft with `pip install tracecraft` +1. Install TraceCraft with `pip install tracecraft` 2. Run the following code: ```python @@ -185,7 +185,7 @@ Paste the full traceback here. Do not truncate it. ```python # The smallest possible complete script that reproduces the issue. -# It should be runnable as-is (with Trace Craft installed). +# It should be runnable as-is (with TraceCraft installed). import tracecraft tracecraft.init() @@ -208,7 +208,7 @@ After you open an issue, maintainers will apply labels to categorize and priorit | Label | Meaning | |---|---| -| `bug` | Confirmed defect in Trace Craft | +| `bug` | Confirmed defect in TraceCraft | | `needs-reproduction` | Cannot reproduce without more information | | `good first issue` | Suitable for first-time contributors | | `help wanted` | Community contributions welcome | @@ -235,7 +235,7 @@ which specific prompt templates are causing high token costs across runs." ## Proposed Solution -Describe what you would like Trace Craft to do. Be as specific as you can +Describe what you would like TraceCraft to do. Be as specific as you can about the interface, behavior, and configuration. Example: "A `PromptTemplateProcessor` that extracts named template variables @@ -262,7 +262,7 @@ Mockups, links to similar features in other tools, related issues or discussions !!! danger "Do Not Report Security Issues Publicly" If you discover a security vulnerability - including authentication bypasses, data exposure, or dependency vulnerabilities - **do not open a public GitHub Issue or Discussion**. - Public disclosure before a fix is available puts all Trace Craft users at risk. + Public disclosure before a fix is available puts all TraceCraft users at risk. 
### Responsible Disclosure Process @@ -275,7 +275,7 @@ Mockups, links to similar features in other tools, related issues or discussions ### What to Include in a Security Report -- Affected Trace Craft version(s) +- Affected TraceCraft version(s) - Description of the vulnerability and its potential impact - Step-by-step reproduction instructions - Any proof-of-concept code (shared privately) @@ -330,7 +330,7 @@ If you want to fix a bug yourself, add a feature, or improve the documentation, ## Response Time Expectations -Trace Craft is a community-maintained open-source project. Response times are best-effort and depend on maintainer availability. +TraceCraft is a community-maintained open-source project. Response times are best-effort and depend on maintainer availability. | Channel | Typical Response | |---|---| diff --git a/docs/getting-started/concepts.md b/docs/getting-started/concepts.md index 7744f86..0ede0b1 100644 --- a/docs/getting-started/concepts.md +++ b/docs/getting-started/concepts.md @@ -1,10 +1,10 @@ # Core Concepts -Understanding the core concepts of Trace Craft will help you use it effectively. +Understanding the core concepts of TraceCraft will help you use it effectively. ## Architectural Overview -Trace Craft is built on three main layers: +TraceCraft is built on three main layers: ```mermaid graph TB @@ -12,7 +12,7 @@ graph TB A[Application] end - subgraph "Trace Craft SDK" + subgraph "TraceCraft SDK" B[Instrumentation Layer] C[Processing Layer] D[Export Layer] @@ -65,7 +65,7 @@ Sends processed traces to destinations: ### Traces and Spans -Trace Craft follows OpenTelemetry's trace model: +TraceCraft follows OpenTelemetry's trace model: ```mermaid graph TB @@ -110,7 +110,7 @@ Key properties of a run: ### Step Types -Trace Craft categorizes operations into semantic types: +TraceCraft categorizes operations into semantic types: ```python from tracecraft.core.models import StepType @@ -126,7 +126,7 @@ These types help backends understand your trace semantics. ### Decorators -Trace Craft provides specialized decorators for each step type: +TraceCraft provides specialized decorators for each step type: #### @trace_agent @@ -351,7 +351,7 @@ tracecraft.init( ### Schema Support -Trace Craft supports two schema conventions: +TraceCraft supports two schema conventions: #### OTel GenAI Conventions @@ -383,11 +383,11 @@ The Arize AI OpenInference schema: } ``` -Trace Craft automatically emits both schemas, making traces compatible with multiple backends. +TraceCraft automatically emits both schemas, making traces compatible with multiple backends. ### Propagation -Trace Craft propagates trace context across: +TraceCraft propagates trace context across: - **Async boundaries**: Preserves context in async/await - **Process boundaries**: Via W3C Trace Context headers @@ -449,7 +449,7 @@ with step("processing") as s: ### 4. Handle Errors -Let Trace Craft capture errors automatically: +Let TraceCraft capture errors automatically: ```python @trace_agent(name="agent") diff --git a/docs/getting-started/index.md b/docs/getting-started/index.md index d9bf422..4e632c8 100644 --- a/docs/getting-started/index.md +++ b/docs/getting-started/index.md @@ -1,10 +1,10 @@ -# Getting Started with Trace Craft +# Getting Started with TraceCraft -Welcome to Trace Craft! This guide will help you get started with instrumenting your LLM applications for observability. +Welcome to TraceCraft! This guide will help you get started with instrumenting your LLM applications for observability. 
-## What is Trace Craft? +## What is TraceCraft? -Trace Craft is a vendor-neutral observability SDK for LLM applications. It provides: +TraceCraft is a vendor-neutral observability SDK for LLM applications. It provides: - **Unified Instrumentation**: Single API that works across different frameworks - **Flexible Export**: Send traces to multiple backends simultaneously @@ -14,11 +14,11 @@ Trace Craft is a vendor-neutral observability SDK for LLM applications. It provi ## Learning Path -Follow this learning path to master Trace Craft: +Follow this learning path to master TraceCraft: ### 1. Installation -Start by installing Trace Craft with the features you need. +Start by installing TraceCraft with the features you need. [:octicons-arrow-right-24: Installation Guide](installation.md) @@ -30,7 +30,7 @@ Build your first instrumented application in 5 minutes. ### 3. Core Concepts -Understand the key concepts behind Trace Craft. +Understand the key concepts behind TraceCraft. [:octicons-arrow-right-24: Core Concepts](concepts.md) @@ -110,4 +110,4 @@ tracecraft tui Ready to dive deeper? Start with the installation guide: -[:octicons-arrow-right-24: Install Trace Craft](installation.md){ .md-button .md-button--primary } +[:octicons-arrow-right-24: Install TraceCraft](installation.md){ .md-button .md-button--primary } diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md index 1e64e4a..cba2124 100644 --- a/docs/getting-started/installation.md +++ b/docs/getting-started/installation.md @@ -1,6 +1,6 @@ # Installation -This guide covers all the ways to install Trace Craft and its optional dependencies. +This guide covers all the ways to install TraceCraft and its optional dependencies. ## Requirements @@ -27,7 +27,7 @@ This guide covers all the ways to install Trace Craft and its optional dependenc poetry add tracecraft ``` -This installs the core Trace Craft SDK with: +This installs the core TraceCraft SDK with: - OpenTelemetry API and SDK - Rich console output @@ -36,7 +36,7 @@ This installs the core Trace Craft SDK with: ## Optional Dependencies -Trace Craft uses optional dependencies for different features. Install only what you need: +TraceCraft uses optional dependencies for different features. Install only what you need: ### Framework Integrations @@ -168,7 +168,7 @@ No code changes required - just import and initialize! 
### Convenience Bundles -Trace Craft provides convenience bundles for common use cases: +TraceCraft provides convenience bundles for common use cases: === "All Features" @@ -214,16 +214,16 @@ Verify your installation: ```python import tracecraft -print(f"Trace Craft version: {tracecraft.__version__}") +print(f"TraceCraft version: {tracecraft.__version__}") # Test basic functionality tracecraft.init() -print("Trace Craft initialized successfully!") +print("TraceCraft initialized successfully!") ``` ## Development Installation -For contributing to Trace Craft: +For contributing to TraceCraft: ```bash # Clone the repository @@ -289,7 +289,7 @@ If you encounter version conflicts with OpenTelemetry packages, ensure you're us pip install --upgrade opentelemetry-api opentelemetry-sdk ``` -Trace Craft requires: +TraceCraft requires: - opentelemetry-api >= 1.20 - opentelemetry-sdk >= 1.20 @@ -316,6 +316,6 @@ Download from: ## Next Steps -Now that Trace Craft is installed, follow the quick start guide: +Now that TraceCraft is installed, follow the quick start guide: [:octicons-arrow-right-24: Quick Start Guide](quickstart.md){ .md-button .md-button--primary } diff --git a/docs/getting-started/quickstart.md b/docs/getting-started/quickstart.md index c9775cb..f4d28f3 100644 --- a/docs/getting-started/quickstart.md +++ b/docs/getting-started/quickstart.md @@ -1,13 +1,13 @@ # Quick Start -Get Trace Craft running in under 2 minutes. +Get TraceCraft running in under 2 minutes. ## Step 1 — Zero code changes (absolute simplest) !!! success "No decorators. No code changes. Just point and trace." If your app already emits OTLP traces (via OpenLLMetry, LangChain, LlamaIndex, DSPy, - or the standard OTel SDK), Trace Craft can receive them with no modifications to your app. + or the standard OTel SDK), TraceCraft can receive them with no modifications to your app. **Install:** @@ -35,7 +35,7 @@ OpenLLMetry, LangChain, LlamaIndex, DSPy, or the standard OpenTelemetry SDK. ## Step 2 — Config file (one line of code) If your app does not already emit OTLP traces, create a config file and add a single -`tracecraft.init()` call before your LLM imports. Trace Craft's auto-instrumentation +`tracecraft.init()` call before your LLM imports. TraceCraft's auto-instrumentation patches the SDKs at runtime and streams traces to the TUI. **Create `.tracecraft/config.yaml` in your project root:** @@ -109,7 +109,7 @@ Once your app is running and the TUI is open, you will see a live trace list: ``` ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ -┃ Trace Craft TUI traces: 47 ┃ +┃ TraceCraft TUI traces: 47 ┃ ┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ TRACE ID NAME DURATION STATUS TOKENS ┃ ┃ ─────────────────────────────────────────────────────────────────────── ┃ @@ -141,7 +141,7 @@ filters, and comparison views. **One line, for Path 2.** If your app calls OpenAI or Anthropic directly without existing OTel instrumentation, add `tracecraft.init()` before your LLM imports and - Trace Craft auto-instruments the SDKs for you. + TraceCraft auto-instruments the SDKs for you. ??? question "What if I'm using LangChain or LlamaIndex?" @@ -149,7 +149,7 @@ filters, and comparison views. adapters are also available for richer context — see the [Integrations](../integrations/index.md) page. -??? question "Can I use the TUI without Trace Craft instrumentation?" +??? question "Can I use the TUI without TraceCraft instrumentation?" 
**Yes.** The TUI accepts any OpenTelemetry-compatible trace data over OTLP. See the [TUI Guide](../user-guide/tui.md) for details. diff --git a/docs/glossary.md b/docs/glossary.md index da308b2..f2e7596 100644 --- a/docs/glossary.md +++ b/docs/glossary.md @@ -1,14 +1,14 @@ # Glossary -This glossary defines the key terms used in Trace Craft and the broader LLM observability ecosystem. Terms are organized into four categories and sorted alphabetically within each category. +This glossary defines the key terms used in TraceCraft and the broader LLM observability ecosystem. Terms are organized into four categories and sorted alphabetically within each category. --- -## Trace Craft-Specific Terms +## TraceCraft-Specific Terms ### Adapter -A framework-specific integration that translates the tracing callbacks and lifecycle hooks of an external framework into Trace Craft Steps. Adapters exist for LangChain, LlamaIndex, and PydanticAI. Rather than modifying your framework code, you install the adapter once and Trace Craft automatically captures all relevant operations. +A framework-specific integration that translates the tracing callbacks and lifecycle hooks of an external framework into TraceCraft Steps. Adapters exist for LangChain, LlamaIndex, and PydanticAI. Rather than modifying your framework code, you install the adapter once and TraceCraft automatically captures all relevant operations. See also: [Integrations](integrations/index.md), [Step](#step) @@ -18,7 +18,7 @@ See also: [Integrations](integrations/index.md), [Step](#step) The top-level container for a complete agent execution. An `AgentRun` holds metadata (run ID, timestamps, status), the initial input and final output, tags, and the full tree of child Steps. Every traced execution produces exactly one `AgentRun`. -`AgentRun` is the Trace Craft equivalent of a **Trace** in OpenTelemetry. +`AgentRun` is the TraceCraft equivalent of a **Trace** in OpenTelemetry. ```python from tracecraft import TraceCraftRuntime @@ -39,7 +39,7 @@ See also: [Step](#step), [Trace](#trace) ### Auto-Instrumentation -Automatic tracing of third-party SDK calls (such as OpenAI or Anthropic) without modifying your existing code. Trace Craft's auto-instrumentation wraps installed OpenTelemetry instrumentation libraries and patches the SDK at import time, so every API call becomes a traced Step. +Automatic tracing of third-party SDK calls (such as OpenAI or Anthropic) without modifying your existing code. TraceCraft's auto-instrumentation wraps installed OpenTelemetry instrumentation libraries and patches the SDK at import time, so every API call becomes a traced Step. ```python import tracecraft @@ -55,7 +55,7 @@ See also: [Auto-Instrumentation Guide](integrations/auto-instrumentation.md) ### Decorator -A Python function annotation that automatically creates and manages a Step when the decorated function is called. Trace Craft provides four built-in decorators, each corresponding to a semantic operation type: +A Python function annotation that automatically creates and manages a Step when the decorated function is called. TraceCraft provides four built-in decorators, each corresponding to a semantic operation type: | Decorator | Step Type | Typical Use | |---|---|---| @@ -112,7 +112,7 @@ See also: [Exporters Guide](user-guide/exporters.md) ### OTLPReceiverServer -An HTTP server bundled with Trace Craft that listens for incoming OpenTelemetry traces (sent via the OTLP HTTP protocol) and stores them in a Trace Craft storage backend. 
This allows any OTel-instrumented application - regardless of language or framework - to send traces to Trace Craft for analysis in the TUI. +An HTTP server bundled with TraceCraft that listens for incoming OpenTelemetry traces (sent via the OTLP HTTP protocol) and stores them in a TraceCraft storage backend. This allows any OTel-instrumented application - regardless of language or framework - to send traces to TraceCraft for analysis in the TUI. ```python from tracecraft.receiver import OTLPReceiverServer @@ -159,7 +159,7 @@ See also: [Processors Guide](user-guide/processors.md), [PII Redaction](#pii-red A single traced operation within an `AgentRun`. Steps form a tree: a parent Step can have many child Steps, reflecting the actual call hierarchy of your code. Each Step records its type, name, start time, duration, status, inputs, outputs, and arbitrary attributes. -`Step` is the Trace Craft equivalent of a **Span** in OpenTelemetry. +`Step` is the TraceCraft equivalent of a **Span** in OpenTelemetry. ```python from tracecraft import step @@ -209,7 +209,7 @@ See also: [Step](#step), [Decorator](#decorator) ### TraceCraftConfig -A dataclass that holds all configuration options for a Trace Craft runtime instance. Pass it to `TraceCraftRuntime` directly, or use the convenience parameters on `tracecraft.init()`. +A dataclass that holds all configuration options for a TraceCraft runtime instance. Pass it to `TraceCraftRuntime` directly, or use the convenience parameters on `tracecraft.init()`. ```python from tracecraft import TraceCraftConfig, TraceCraftRuntime @@ -257,7 +257,7 @@ See also: [TraceCraftConfig](#tracecraftconfig), [Multi-Tenancy](user-guide/mult ### TUI -The Terminal User Interface included with Trace Craft for browsing and analyzing stored traces offline. Built with [Textual](https://textual.textualize.io/), the TUI reads from JSONL or SQLite storage and provides an interactive, keyboard-driven interface for exploring the trace tree, inspecting inputs/outputs, reading LLM prompts and completions, and comparing runs. +The Terminal User Interface included with TraceCraft for browsing and analyzing stored traces offline. Built with [Textual](https://textual.textualize.io/), the TUI reads from JSONL or SQLite storage and provides an interactive, keyboard-driven interface for exploring the trace tree, inspecting inputs/outputs, reading LLM prompts and completions, and comparing runs. ```bash # Launch from a JSONL file @@ -277,7 +277,7 @@ See also: [Terminal UI Guide](user-guide/tui.md), [OTLPReceiverServer](#otlprece An OpenTelemetry SDK component that queues completed spans in memory and exports them in batches at regular intervals or when the queue reaches a threshold. Preferred over `SimpleSpanProcessor` in production because it reduces the performance impact of exporting on the critical path. -Trace Craft's `setup_exporter()` uses `BatchSpanProcessor` by default (`batch_export=True`). +TraceCraft's `setup_exporter()` uses `BatchSpanProcessor` by default (`batch_export=True`). See also: [SimpleSpanProcessor](#simplespanprocessor), [OpenTelemetry Receiver](integrations/otel-receiver.md) @@ -287,7 +287,7 @@ See also: [SimpleSpanProcessor](#simplespanprocessor), [OpenTelemetry Receiver]( **OpenTelemetry Protocol** - The standard wire protocol for transmitting traces, metrics, and logs between OpenTelemetry-instrumented applications and backends. Comes in two transport variants: HTTP/protobuf (port 4318) and gRPC (port 4317). 
-Trace Craft's `OTLPExporter` and `OTLPReceiverServer` both use OTLP HTTP. +TraceCraft's `OTLPExporter` and `OTLPReceiverServer` both use OTLP HTTP. See also: [Exporters Guide](user-guide/exporters.md), [OTLPReceiverServer](#otlpreceiverserver) @@ -295,7 +295,7 @@ See also: [Exporters Guide](user-guide/exporters.md), [OTLPReceiverServer](#otlp ### Propagation -The mechanism for passing trace context (trace ID, span ID, sampling flags) across process or network boundaries so that distributed operations can be linked into a single coherent trace. Trace Craft uses the **W3C Trace Context** standard (`traceparent` and `tracestate` HTTP headers) for cross-service propagation, and Python `contextvars` for propagation across async boundaries within a single process. +The mechanism for passing trace context (trace ID, span ID, sampling flags) across process or network boundaries so that distributed operations can be linked into a single coherent trace. TraceCraft uses the **W3C Trace Context** standard (`traceparent` and `tracestate` HTTP headers) for cross-service propagation, and Python `contextvars` for propagation across async boundaries within a single process. See also: [W3C Trace Context](#w3c-trace-context), [Trace Context](#trace-context) @@ -303,10 +303,10 @@ See also: [W3C Trace Context](#w3c-trace-context), [Trace Context](#trace-contex ### Resource -In OpenTelemetry, a `Resource` is the set of immutable attributes that describe the entity producing telemetry - typically the service name, version, and deployment environment. Trace Craft automatically creates a `Resource` from your `TraceCraftConfig.service_name` and related fields. +In OpenTelemetry, a `Resource` is the set of immutable attributes that describe the entity producing telemetry - typically the service name, version, and deployment environment. TraceCraft automatically creates a `Resource` from your `TraceCraftConfig.service_name` and related fields. ```python -# Equivalent OTel resource attributes populated by Trace Craft +# Equivalent OTel resource attributes populated by TraceCraft { "service.name": "my-agent", "service.version": "1.0.0", @@ -320,7 +320,7 @@ In OpenTelemetry, a `Resource` is the set of immutable attributes that describe Standardized attribute names and values defined by the OpenTelemetry project for common concepts (HTTP requests, database queries, LLM calls, etc.). Using semantic conventions makes traces portable across different backends and analysis tools. -Trace Craft follows the **OTel GenAI Semantic Conventions** and also emits **OpenInference** attributes for maximum backend compatibility. +TraceCraft follows the **OTel GenAI Semantic Conventions** and also emits **OpenInference** attributes for maximum backend compatibility. See also: [OTel GenAI Conventions](#otel-genai-conventions), [OpenInference](#openinference) @@ -340,7 +340,7 @@ See also: [BatchSpanProcessor](#batchspanprocessor) The fundamental unit of work in OpenTelemetry. A Span represents a single operation with a start time, end time, status, and a set of key-value attributes. Spans are linked by parent-child relationships to form a tree within a Trace. -In Trace Craft terminology, a Span corresponds to a [Step](#step). +In TraceCraft terminology, a Span corresponds to a [Step](#step). 
See also: [Step](#step), [Trace](#trace), [Trace Context](#trace-context) @@ -350,7 +350,7 @@ See also: [Step](#step), [Trace](#trace), [Trace Context](#trace-context) In OpenTelemetry, a Trace is the complete record of a distributed operation: a directed acyclic graph of Spans sharing the same `trace_id`. It represents the full path of a request or task through a system. -In Trace Craft terminology, a Trace corresponds to an [AgentRun](#agentrun). +In TraceCraft terminology, a Trace corresponds to an [AgentRun](#agentrun). See also: [AgentRun](#agentrun), [Span](#span) @@ -358,7 +358,7 @@ See also: [AgentRun](#agentrun), [Span](#span) ### Trace Context -The metadata that links related Spans together into a Trace: primarily the `trace_id` (shared by all Spans in a Trace), the `span_id` (unique per Span), the `parent_span_id` (links child to parent), and trace flags (sampling decision). Trace Craft manages trace context automatically via Python `contextvars`. +The metadata that links related Spans together into a Trace: primarily the `trace_id` (shared by all Spans in a Trace), the `span_id` (unique per Span), the `parent_span_id` (links child to parent), and trace flags (sampling decision). TraceCraft manages trace context automatically via Python `contextvars`. See also: [Propagation](#propagation), [W3C Trace Context](#w3c-trace-context) @@ -376,7 +376,7 @@ See also: [TraceCraftRuntime](#tracecraftruntime) ### Completion -The output text generated by an LLM in response to a prompt. Trace Craft captures completions in `Step.outputs` under the key `completion` (or via OTel GenAI convention attributes such as `gen_ai.completion`). For streaming responses, individual chunks are captured in `streaming_chunks`. +The output text generated by an LLM in response to a prompt. TraceCraft captures completions in `Step.outputs` under the key `completion` (or via OTel GenAI convention attributes such as `gen_ai.completion`). For streaming responses, individual chunks are captured in `streaming_chunks`. See also: [Prompt](#prompt), [Streaming](#streaming), [Token](#token) @@ -384,7 +384,7 @@ See also: [Prompt](#prompt), [Streaming](#streaming), [Token](#token) ### Cost Tracking -Calculating the monetary cost of LLM API calls based on token usage and provider-specific pricing tables. Trace Craft records input and output token counts on LLM Steps and can compute estimated cost in USD when the model and provider are known. +Calculating the monetary cost of LLM API calls based on token usage and provider-specific pricing tables. TraceCraft records input and output token counts on LLM Steps and can compute estimated cost in USD when the model and provider are known. ```python # Attributes recorded on an LLM Step @@ -399,7 +399,7 @@ See also: [Token Counting](#token-counting), [Token](#token) ### Latency -The elapsed wall-clock time for an operation. Trace Craft records `duration_ms` on every Step, making it straightforward to identify slow LLM calls, retrieval bottlenecks, or tool timeouts in the TUI. +The elapsed wall-clock time for an operation. TraceCraft records `duration_ms` on every Step, making it straightforward to identify slow LLM calls, retrieval bottlenecks, or tool timeouts in the TUI. See also: [Step](#step) @@ -407,7 +407,7 @@ See also: [Step](#step) ### PII Redaction -The process of removing or masking personally identifiable information (names, email addresses, phone numbers, credit card numbers, etc.) from trace data before it is stored or exported. 
Trace Craft's `RedactionProcessor` applies configurable regex patterns client-side, so sensitive data never leaves your infrastructure. +The process of removing or masking personally identifiable information (names, email addresses, phone numbers, credit card numbers, etc.) from trace data before it is stored or exported. TraceCraft's `RedactionProcessor` applies configurable regex patterns client-side, so sensitive data never leaves your infrastructure. ```python from tracecraft.processors.redaction import RedactionProcessor, RedactionMode @@ -424,7 +424,7 @@ See also: [Processor](#processor), [RedactionMode] ### Prompt -The input text (or structured messages) sent to an LLM. Trace Craft captures prompts in `Step.inputs` and via OTel GenAI convention attributes such as `gen_ai.prompt`. Prompts are displayed in the TUI's detail panel and can be redacted if they contain PII. +The input text (or structured messages) sent to an LLM. TraceCraft captures prompts in `Step.inputs` and via OTel GenAI convention attributes such as `gen_ai.prompt`. Prompts are displayed in the TUI's detail panel and can be redacted if they contain PII. See also: [Completion](#completion), [PII Redaction](#pii-redaction) @@ -434,7 +434,7 @@ See also: [Completion](#completion), [PII Redaction](#pii-redaction) **Retrieval-Augmented Generation** - An architecture that combines a retrieval system (vector database, search engine) with an LLM. The retrieval step fetches relevant documents based on the user's query; those documents are added to the LLM prompt as context, grounding the response in factual information. -Trace Craft traces RAG pipelines end-to-end using `StepType.RETRIEVAL` for the retrieval step and `StepType.LLM` for the generation step. +TraceCraft traces RAG pipelines end-to-end using `StepType.RETRIEVAL` for the retrieval step and `StepType.LLM` for the generation step. See also: [StepType](#steptype), `@trace_retrieval` in [Decorators Guide](user-guide/decorators.md) @@ -442,7 +442,7 @@ See also: [StepType](#steptype), `@trace_retrieval` in [Decorators Guide](user-g ### Sampling -Selectively recording only a fraction of all traces to reduce storage costs and processing overhead while still providing statistical visibility. Trace Craft's `SamplingProcessor` supports rate-based sampling with configurable overrides to always capture errors and slow traces. +Selectively recording only a fraction of all traces to reduce storage costs and processing overhead while still providing statistical visibility. TraceCraft's `SamplingProcessor` supports rate-based sampling with configurable overrides to always capture errors and slow traces. ```python from tracecraft.processors.sampling import SamplingProcessor @@ -461,7 +461,7 @@ See also: [Processor](#processor) ### Streaming -Receiving LLM output incrementally as it is generated, token by token, rather than waiting for the complete response. Trace Craft captures streaming responses by accumulating `streaming_chunks` on the LLM Step and recording the final assembled completion. +Receiving LLM output incrementally as it is generated, token by token, rather than waiting for the complete response. TraceCraft captures streaming responses by accumulating `streaming_chunks` on the LLM Step and recording the final assembled completion. See also: [Completion](#completion), [Token](#token) @@ -469,7 +469,7 @@ See also: [Completion](#completion), [Token](#token) ### Token -The basic unit of text that LLMs process. 
Roughly equivalent to a word fragment (a typical English word is 1-2 tokens). Token counts directly determine the cost and throughput of LLM API calls. Trace Craft records `input_tokens` and `output_tokens` on every LLM Step. +The basic unit of text that LLMs process. Roughly equivalent to a word fragment (a typical English word is 1-2 tokens). Token counts directly determine the cost and throughput of LLM API calls. TraceCraft records `input_tokens` and `output_tokens` on every LLM Step. See also: [Token Counting](#token-counting), [Cost Tracking](#cost-tracking) @@ -477,7 +477,7 @@ See also: [Token Counting](#token-counting), [Cost Tracking](#cost-tracking) ### Token Counting -Measuring the number of tokens in a prompt (input tokens) and a completion (output tokens). Token counts are reported by the LLM provider in the API response and are captured automatically by Trace Craft on LLM Steps. They are the basis for [Cost Tracking](#cost-tracking). +Measuring the number of tokens in a prompt (input tokens) and a completion (output tokens). Token counts are reported by the LLM provider in the API response and are captured automatically by TraceCraft on LLM Steps. They are the basis for [Cost Tracking](#cost-tracking). See also: [Token](#token), [Cost Tracking](#cost-tracking) @@ -487,7 +487,7 @@ See also: [Token](#token), [Cost Tracking](#cost-tracking) ### JSONL -**JSON Lines** - A text format where each line is a self-contained, valid JSON object. Trace Craft uses JSONL as its default local storage format: each line represents one serialized `AgentRun`. JSONL is human-readable, easy to stream, and directly parseable with standard tools. +**JSON Lines** - A text format where each line is a self-contained, valid JSON object. TraceCraft uses JSONL as its default local storage format: each line represents one serialized `AgentRun`. JSONL is human-readable, easy to stream, and directly parseable with standard tools. ```bash # Each line is one AgentRun @@ -500,7 +500,7 @@ See also: [Exporters Guide](user-guide/exporters.md) ### MLflow -An open-source platform for managing the machine learning lifecycle, including experiment tracking, model registry, and evaluation. Trace Craft can export traces to MLflow as runs, making it possible to correlate agent behavior with ML experiments and model versions. +An open-source platform for managing the machine learning lifecycle, including experiment tracking, model registry, and evaluation. TraceCraft can export traces to MLflow as runs, making it possible to correlate agent behavior with ML experiments and model versions. ```python from tracecraft.exporters import MLflowExporter @@ -514,7 +514,7 @@ See also: [Exporters Guide](user-guide/exporters.md) ### OpenInference -A trace schema standard developed by Arize AI for LLM and agent tracing. It defines attribute names such as `llm.model_name`, `llm.token_count.prompt`, `input.value`, and `output.value`. Trace Craft emits OpenInference attributes alongside OTel GenAI attributes, making traces compatible with Phoenix and other Arize-ecosystem backends. +A trace schema standard developed by Arize AI for LLM and agent tracing. It defines attribute names such as `llm.model_name`, `llm.token_count.prompt`, `input.value`, and `output.value`. TraceCraft emits OpenInference attributes alongside OTel GenAI attributes, making traces compatible with Phoenix and other Arize-ecosystem backends. 
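
For illustration, a Step carrying OpenInference attributes might look like the sketch below; the attribute names are the ones listed above, while the values are invented examples.

```python
# Illustrative OpenInference-style attributes on an LLM Step (example values only)
{
    "llm.model_name": "gpt-4o",
    "llm.token_count.prompt": 120,
    "input.value": "What is the capital of France?",
    "output.value": "The capital of France is Paris.",
}
```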
See also: [OTel GenAI Conventions](#otel-genai-conventions), [Core Concepts](getting-started/concepts.md#schema-support) @@ -525,7 +525,7 @@ See also: [OTel GenAI Conventions](#otel-genai-conventions), [Core Concepts](get The OpenTelemetry Semantic Conventions for Generative AI: a standardized set of attribute names for LLM operations maintained by the OpenTelemetry project. Key attributes include `gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`, and `gen_ai.usage.output_tokens`. ```python -# Attributes emitted by Trace Craft on LLM Steps +# Attributes emitted by TraceCraft on LLM Steps { "gen_ai.system": "openai", "gen_ai.request.model": "gpt-4o", @@ -541,7 +541,7 @@ See also: [OpenInference](#openinference), [Semantic Conventions](#semantic-conv ### W3C Trace Context -An HTTP header standard ([W3C Recommendation](https://www.w3.org/TR/trace-context/)) for propagating trace context across service boundaries. It defines two headers: `traceparent` (carries `trace_id`, `span_id`, and sampling flags) and `tracestate` (vendor-specific context). Trace Craft uses W3C Trace Context for cross-service propagation. +An HTTP header standard ([W3C Recommendation](https://www.w3.org/TR/trace-context/)) for propagating trace context across service boundaries. It defines two headers: `traceparent` (carries `trace_id`, `span_id`, and sampling flags) and `tracestate` (vendor-specific context). TraceCraft uses W3C Trace Context for cross-service propagation. See also: [Propagation](#propagation), [Trace Context](#trace-context) @@ -549,6 +549,6 @@ See also: [Propagation](#propagation), [Trace Context](#trace-context) ## See Also -- [Core Concepts](getting-started/concepts.md) - Conceptual overview of how Trace Craft works +- [Core Concepts](getting-started/concepts.md) - Conceptual overview of how TraceCraft works - [API Reference](api/index.md) - Full API documentation - [User Guide](user-guide/index.md) - Feature documentation and how-to guides diff --git a/docs/index.md b/docs/index.md index 8b1051f..01e707d 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,14 +1,14 @@ -# Trace Craft +# TraceCraft **Vendor-neutral LLM observability — instrument once, observe anywhere.** -Trace Craft is a Python observability SDK with a built-in **Terminal UI (TUI)** that lets you visually explore, debug, and analyze your agent traces right in your terminal — no browser, no cloud dashboard, no waiting. +TraceCraft is a Python observability SDK with a built-in **Terminal UI (TUI)** that lets you visually explore, debug, and analyze your agent traces right in your terminal — no browser, no cloud dashboard, no waiting. --- ## The fastest path: zero code changes -If your app already uses OpenAI, Anthropic, LangChain, LlamaIndex, or any OpenTelemetry-compatible framework, Trace Craft can observe it **without touching a single line of application code**. +If your app already uses OpenAI, Anthropic, LangChain, LlamaIndex, or any OpenTelemetry-compatible framework, TraceCraft can observe it **without touching a single line of application code**. **Step 1 — Install and set one environment variable:** @@ -38,13 +38,13 @@ Traces from any OTLP-compatible framework (OpenLLMetry, LangChain, LlamaIndex, D After traces are flowing in, the TUI gives you complete visibility into every agent run: -![Trace Craft TUI - Main View](assets/screenshots/tui-main-view.svg) +![TraceCraft TUI - Main View](assets/screenshots/tui-main-view.svg) *All your agent runs at a glance — name, duration, token usage, and status. 
Select any trace to drill down.* Select any trace to expand the full call hierarchy with timing bars. Navigate to any LLM step and press `i` for the prompt, `o` for the response, or `a` for all span attributes and metadata. -![Trace Craft TUI - Waterfall and Detail View](assets/screenshots/tui-waterfall-view.svg) +![TraceCraft TUI - Waterfall and Detail View](assets/screenshots/tui-waterfall-view.svg) *Hierarchical waterfall view — agents, tools, and LLM calls with precise timing. See exactly where your agent spends its time.* @@ -87,7 +87,7 @@ tracecraft tui !!! tip "Call `tracecraft.init()` before importing any LLM SDK" - Trace Craft patches SDKs at import time. Import your LLM libraries **after** calling `init()` so the patches apply correctly. + TraceCraft patches SDKs at import time. Import your LLM libraries **after** calling `init()` so the patches apply correctly. [:octicons-arrow-right-24: Full Configuration Reference](user-guide/configuration.md) @@ -95,15 +95,15 @@ tracecraft tui ## Path 3 — SDK decorators and custom tracing -For fine-grained control — custom span names, explicit inputs/outputs, structured step hierarchies — Trace Craft provides `@trace_agent`, `@trace_tool`, `@trace_llm`, and `@trace_retrieval` decorators, plus a `step()` context manager for inline instrumentation. +For fine-grained control — custom span names, explicit inputs/outputs, structured step hierarchies — TraceCraft provides `@trace_agent`, `@trace_tool`, `@trace_llm`, and `@trace_retrieval` decorators, plus a `step()` context manager for inline instrumentation. [:octicons-arrow-right-24: SDK Guide](getting-started/quickstart.md) --- -## Why Trace Craft? +## Why TraceCraft? -| Feature | Trace Craft | LangSmith | Langfuse | Phoenix | +| Feature | TraceCraft | LangSmith | Langfuse | Phoenix | |---------|------------|-----------|----------|---------| | **Terminal UI** | **Yes — built-in** | No | No | No | | **Zero-Code Instrumentation** | Yes | No | No | No | @@ -267,4 +267,4 @@ graph LR ## License -Trace Craft is licensed under the Apache-2.0 License. See [LICENSE](https://github.com/LocalAI/tracecraft/blob/main/LICENSE) for details. +TraceCraft is licensed under the Apache-2.0 License. See [LICENSE](https://github.com/LocalAI/tracecraft/blob/main/LICENSE) for details. diff --git a/docs/integrations/auto-instrumentation.md b/docs/integrations/auto-instrumentation.md index fa1ce8c..4db8432 100644 --- a/docs/integrations/auto-instrumentation.md +++ b/docs/integrations/auto-instrumentation.md @@ -1,6 +1,6 @@ # Auto-Instrumentation -Trace Craft can automatically instrument popular LLM SDKs and agent frameworks **without any code changes** to your application — no decorators, no wrappers, no refactoring. Just initialize Trace Craft before your LLM imports and every call is captured. +TraceCraft can automatically instrument popular LLM SDKs and agent frameworks **without any code changes** to your application — no decorators, no wrappers, no refactoring. Just initialize TraceCraft before your LLM imports and every call is captured. **Supported:** OpenAI · Anthropic · LangChain · LlamaIndex @@ -49,7 +49,7 @@ tracecraft tui !!! warning "Initialize Before Importing SDKs" `tracecraft.init()` must be called **before** importing OpenAI, Anthropic, - LangChain, or LlamaIndex. Trace Craft patches the SDK at import time — if + LangChain, or LlamaIndex. TraceCraft patches the SDK at import time — if you import first, the patch won't apply. 
```python @@ -118,7 +118,7 @@ message = client.messages.create( ### LangChain -Trace Craft patches LangChain's `CallbackManager` to automatically inject its callback handler into **all** chains, agents, and tools — no explicit `callbacks=[...]` required. +TraceCraft patches LangChain's `CallbackManager` to automatically inject its callback handler into **all** chains, agents, and tools — no explicit `callbacks=[...]` required. Automatically traces: @@ -151,7 +151,7 @@ result = chain.invoke({"topic": "bears"}) ### LlamaIndex -Trace Craft registers its span handler with LlamaIndex's global instrumentation dispatcher, automatically capturing all query and retrieval operations. +TraceCraft registers its span handler with LlamaIndex's global instrumentation dispatcher, automatically capturing all query and retrieval operations. Automatically traces: @@ -383,7 +383,7 @@ tracecraft.init(auto_instrument=True) # Won't patch already-imported modules ### Duplicate Spans -If you're seeing duplicate spans, you might be calling `init()` twice. Trace Craft's +If you're seeing duplicate spans, you might be calling `init()` twice. TraceCraft's `init()` is idempotent — the second call is a no-op. If you need to re-instrument, call `disable_auto_instrumentation()` first: diff --git a/docs/integrations/claude-sdk.md b/docs/integrations/claude-sdk.md index 41d5957..e5b58df 100644 --- a/docs/integrations/claude-sdk.md +++ b/docs/integrations/claude-sdk.md @@ -1,8 +1,8 @@ # Claude SDK Integration -Trace Craft integrates with the Claude Agent SDK through the `ClaudeTraceCraftr` adapter, which +TraceCraft integrates with the Claude Agent SDK through the `ClaudeTraceCraftr` adapter, which hooks into the SDK's `PreToolUse`, `PostToolUse`, `Stop`, and `SubagentStop` events to capture -every tool call and subagent invocation as a Trace Craft Step — with no changes to your Claude +every tool call and subagent invocation as a TraceCraft Step — with no changes to your Claude agent prompts or tool definitions. ## Installation @@ -11,7 +11,7 @@ agent prompts or tool definitions. pip install "tracecraft[claude-sdk]" ``` -This installs Trace Craft with `claude-code-sdk` support. The `claude-code-sdk` package must also +This installs TraceCraft with `claude-code-sdk` support. The `claude-code-sdk` package must also be installed: ```bash @@ -26,7 +26,7 @@ import tracecraft from tracecraft.adapters.claude_sdk import ClaudeTraceCraftr from claude_code_sdk import query, ClaudeCodeOptions -# Initialize Trace Craft +# Initialize TraceCraft tracecraft.init(console=True) # Create the tracer @@ -53,7 +53,7 @@ asyncio.run(main()) `ClaudeTraceCraftr` works through Claude SDK's hook system. When you call `tracer.get_options()`, it returns a `ClaudeCodeOptions` object with four hook handlers pre-configured: -| Hook | When called | What Trace Craft does | +| Hook | When called | What TraceCraft does | |---|---|---| | `PreToolUse` | Before any tool runs | Creates a `Step`, records `start_time` and `inputs` | | `PostToolUse` | After the tool returns | Completes the step with `end_time`, `duration_ms`, and `outputs` | @@ -66,7 +66,7 @@ Each hook is identified by a `tool_use_id` that correlates `PreToolUse` with its ## Tool Type Mapping -The adapter maps Claude SDK tool names to Trace Craft `StepType` values: +The adapter maps Claude SDK tool names to TraceCraft `StepType` values: | Tool | StepType | Notes | |---|---|---| @@ -240,7 +240,7 @@ asyncio.run(main()) ## Streaming Support -Claude SDK messages stream as they arrive. 
Trace Craft captures tool-level spans regardless of +Claude SDK messages stream as they arrive. TraceCraft captures tool-level spans regardless of how you consume the message stream: ```python @@ -273,7 +273,7 @@ asyncio.run(main()) ## Subagent Tracing -When Claude uses the `Task` tool to spawn a subagent, Trace Craft creates an `AGENT`-typed step +When Claude uses the `Task` tool to spawn a subagent, TraceCraft creates an `AGENT`-typed step and waits for the `SubagentStop` hook to close it with the subagent's result. ### Task Tool Tracing @@ -548,7 +548,7 @@ tool_use_id str | None — Same ID as the matching PreToolUse ### Stop Called when the Claude agent finishes its session (normal completion or error). `tool_use_id` -is typically `None`. Trace Craft uses this hook to close any steps that were opened but not +is typically `None`. TraceCraft uses this hook to close any steps that were opened but not yet completed. ### SubagentStop diff --git a/docs/integrations/cloud-platforms.md b/docs/integrations/cloud-platforms.md index 69f38ec..26da76a 100644 --- a/docs/integrations/cloud-platforms.md +++ b/docs/integrations/cloud-platforms.md @@ -1,6 +1,6 @@ # Cloud Platform Integrations -Trace Craft integrates with major cloud AI platforms. +TraceCraft integrates with major cloud AI platforms. ## Supported Platforms @@ -149,7 +149,7 @@ See [GCP Vertex Agent Deployment](../deployment/gcp-vertex-agent.md) for details ## Multi-Cloud -Use Trace Craft with multiple cloud platforms: +Use TraceCraft with multiple cloud platforms: ```python config = TraceCraftConfig( diff --git a/docs/integrations/index.md b/docs/integrations/index.md index 5a58db2..f3e7d56 100644 --- a/docs/integrations/index.md +++ b/docs/integrations/index.md @@ -1,10 +1,10 @@ # Integrations -Trace Craft integrates with popular LLM frameworks and cloud platforms. +TraceCraft integrates with popular LLM frameworks and cloud platforms. !!! success "No custom integration code required" - Point any OTLP-instrumented app at Trace Craft and you're done: + Point any OTLP-instrumented app at TraceCraft and you're done: ```bash tracecraft serve --tui @@ -12,7 +12,7 @@ Trace Craft integrates with popular LLM frameworks and cloud platforms. ``` Works with OpenLLMetry, LangChain, LlamaIndex, DSPy, PydanticAI, and any standard - OpenTelemetry SDK — no Trace Craft-specific code needed. + OpenTelemetry SDK — no TraceCraft-specific code needed. ## Framework Integrations @@ -92,7 +92,7 @@ Trace Craft integrates with popular LLM frameworks and cloud platforms. ### LangChain -The simplest path requires no Trace Craft-specific code — just set the OTLP endpoint: +The simplest path requires no TraceCraft-specific code — just set the OTLP endpoint: ```bash OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 python your_langchain_app.py diff --git a/docs/integrations/langchain.md b/docs/integrations/langchain.md index de57921..bf5c24b 100644 --- a/docs/integrations/langchain.md +++ b/docs/integrations/langchain.md @@ -1,6 +1,6 @@ # LangChain Integration -Trace Craft provides native integration with LangChain through the `TraceCraftCallbackHandler`. +TraceCraft provides native integration with LangChain through the `TraceCraftCallbackHandler`. ## Installation @@ -8,7 +8,7 @@ Trace Craft provides native integration with LangChain through the `TraceCraftCa pip install "tracecraft[langchain]" ``` -This installs Trace Craft with LangChain support (`langchain-core>=0.1`). +This installs TraceCraft with LangChain support (`langchain-core>=0.1`). 
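
If the adapter import fails with `ModuleNotFoundError`, the extra was most likely installed into a different environment than the one running your app. A minimal smoke test:

```python
# Smoke test: confirm the LangChain adapter is importable in the active environment
from tracecraft.adapters.langchain import TraceCraftCallbackHandler

print(TraceCraftCallbackHandler.__name__)  # prints the class name if tracecraft[langchain] is installed
```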
## Quick Start @@ -18,7 +18,7 @@ from tracecraft.adapters.langchain import TraceCraftCallbackHandler from langchain_openai import ChatOpenAI from langchain_core.prompts import ChatPromptTemplate -# Initialize Trace Craft +# Initialize TraceCraft tracecraft.init() # Create LangChain components @@ -50,7 +50,7 @@ The `TraceCraftCallbackHandler` implements LangChain's callback interface and au - Token usage - Errors and retries -All events are converted to Trace Craft spans with proper hierarchy. +All events are converted to TraceCraft spans with proper hierarchy. ## Basic Examples @@ -67,7 +67,7 @@ chain = prompt | llm handler = TraceCraftCallbackHandler() result = chain.invoke( - {"topic": "Trace Craft"}, + {"topic": "TraceCraft"}, config={"callbacks": [handler]} ) ``` @@ -101,7 +101,7 @@ chain = ( handler = TraceCraftCallbackHandler() result = chain.invoke( - {"text": "Trace Craft is amazing!"}, + {"text": "TraceCraft is amazing!"}, config={"callbacks": [handler]} ) ``` @@ -210,7 +210,7 @@ chain = ( # Trace RAG pipeline handler = TraceCraftCallbackHandler() result = chain.invoke( - "What is Trace Craft?", + "What is TraceCraft?", config={"callbacks": [handler]} ) ``` @@ -238,14 +238,14 @@ chain = ( handler = TraceCraftCallbackHandler() result = chain.invoke( - "Explain Trace Craft's architecture", + "Explain TraceCraft's architecture", config={"callbacks": [handler]} ) ``` ## Streaming -Trace Craft supports LangChain streaming: +TraceCraft supports LangChain streaming: ```python from langchain_openai import ChatOpenAI @@ -267,7 +267,7 @@ for chunk in chain.stream( ## LangGraph Integration -Trace Craft works with LangGraph: +TraceCraft works with LangGraph: ```python from langgraph.graph import StateGraph @@ -336,7 +336,7 @@ result = chain.invoke( ### Multiple Handlers -Combine Trace Craft with other handlers: +Combine TraceCraft with other handlers: ```python from langchain.callbacks import StdOutCallbackHandler @@ -430,7 +430,7 @@ chain.invoke(input) # Handler not passed ### Missing Spans -Make sure Trace Craft is initialized: +Make sure TraceCraft is initialized: ```python import tracecraft diff --git a/docs/integrations/llamaindex.md b/docs/integrations/llamaindex.md index da3d9e4..347e11d 100644 --- a/docs/integrations/llamaindex.md +++ b/docs/integrations/llamaindex.md @@ -1,7 +1,7 @@ # LlamaIndex Integration -Trace Craft integrates with LlamaIndex through the `TraceCraftSpanHandler`, a native LlamaIndex span -handler that captures every component invocation as a Trace Craft Step with full hierarchy, token +TraceCraft integrates with LlamaIndex through the `TraceCraftSpanHandler`, a native LlamaIndex span +handler that captures every component invocation as a TraceCraft Step with full hierarchy, token counts, and error details — without changing your existing LlamaIndex code. ## Installation @@ -10,7 +10,7 @@ counts, and error details — without changing your existing LlamaIndex code. pip install "tracecraft[llamaindex]" ``` -This installs Trace Craft with LlamaIndex support (`llama-index-core>=0.10`). +This installs TraceCraft with LlamaIndex support (`llama-index-core>=0.10`). 
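
As a small optional sanity check, you can confirm the installed `llama-index-core` version satisfies the pinned minimum:

```python
# Sanity check: the installed llama-index-core should satisfy the >=0.10 requirement above
from importlib.metadata import version

print(version("llama-index-core"))  # expect 0.10.x or newer
```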
## Quick Start @@ -23,7 +23,7 @@ from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader from llama_index.core.callbacks import CallbackManager from datetime import UTC, datetime -# Initialize Trace Craft +# Initialize TraceCraft tracecraft.init(console=True) # Attach the span handler to LlamaIndex settings @@ -36,7 +36,7 @@ with run_context(run): documents = SimpleDirectoryReader("data").load_data() index = VectorStoreIndex.from_documents(documents) query_engine = index.as_query_engine() - response = query_engine.query("What is Trace Craft?") + response = query_engine.query("What is TraceCraft?") print(response) # Free internal tracking state when done @@ -44,7 +44,7 @@ handler.clear() ``` !!! note "run_context is required" - Trace Craft creates steps only when an `AgentRun` is active. Always wrap LlamaIndex calls + TraceCraft creates steps only when an `AgentRun` is active. Always wrap LlamaIndex calls inside `run_context(run)` or `runtime.run("name")`. ## How It Works @@ -131,7 +131,7 @@ run = AgentRun(name="chat", start_time=datetime.now(UTC)) with run_context(run): messages = [ ChatMessage(role="system", content="You are a helpful assistant."), - ChatMessage(role="user", content="What makes Trace Craft different?"), + ChatMessage(role="user", content="What makes TraceCraft different?"), ] response = llm.chat(messages) print(response.message.content) @@ -194,7 +194,7 @@ with run_context(run): print(response) ``` -Trace Craft captures the retriever step (type `RETRIEVAL`) and the synthesizer step (type `LLM`) +TraceCraft captures the retriever step (type `RETRIEVAL`) and the synthesizer step (type `LLM`) as children of the overall query workflow. ## RAG Pipelines @@ -219,7 +219,7 @@ query_engine = index.as_query_engine(similarity_top_k=3) run = AgentRun(name="basic_rag", start_time=datetime.now(UTC)) with run_context(run): response = query_engine.query( - "What are the installation steps for Trace Craft?" + "What are the installation steps for TraceCraft?" ) print(response) for node in response.source_nodes: @@ -282,7 +282,7 @@ with run_context(run): print(response) ``` -Trace Craft captures retrieval, reranking, and synthesis as separate steps in the trace tree. +TraceCraft captures retrieval, reranking, and synthesis as separate steps in the trace tree. ## Chat Engines @@ -300,7 +300,7 @@ chat_engine = index.as_chat_engine(chat_mode="best", verbose=False) run = AgentRun(name="chat_session", start_time=datetime.now(UTC)) with run_context(run): - response = chat_engine.chat("What is Trace Craft?") + response = chat_engine.chat("What is TraceCraft?") print(response) response = chat_engine.chat("How does it compare to LangSmith?") @@ -325,7 +325,7 @@ chat_engine = index.as_chat_engine( chat_mode="context", memory=memory, system_prompt=( - "You are a technical assistant for Trace Craft. " + "You are a technical assistant for TraceCraft. " "Answer questions based on the documentation." ), ) @@ -333,7 +333,7 @@ chat_engine = index.as_chat_engine( run = AgentRun(name="chat_with_memory", start_time=datetime.now(UTC)) with run_context(run): turns = [ - "What is the purpose of Trace Craft?", + "What is the purpose of TraceCraft?", "What step types are available?", "Can you give me an example using the TOOL step type?", ] @@ -361,7 +361,7 @@ def multiply(a: float, b: float) -> float: def search_docs(query: str) -> str: """Search the documentation for information about a topic.""" - return f"Documentation results for '{query}': Trace Craft is an observability SDK for LLMs." 
+ return f"Documentation results for '{query}': TraceCraft is an observability SDK for LLMs." multiply_tool = FunctionTool.from_defaults(fn=multiply) search_tool = FunctionTool.from_defaults(fn=search_docs) @@ -376,7 +376,7 @@ agent = ReActAgent.from_tools( run = AgentRun(name="react_agent", start_time=datetime.now(UTC)) with run_context(run): - response = agent.query("What is 47 times 89? Also, what is Trace Craft?") + response = agent.query("What is 47 times 89? Also, what is TraceCraft?") print(response) ``` @@ -397,14 +397,14 @@ docs_tool = QueryEngineTool( query_engine=index_docs.as_query_engine(), metadata=ToolMetadata( name="documentation", - description="Search the Trace Craft documentation.", + description="Search the TraceCraft documentation.", ), ) code_tool = QueryEngineTool( query_engine=index_code.as_query_engine(), metadata=ToolMetadata( name="source_code", - description="Search the Trace Craft source code.", + description="Search the TraceCraft source code.", ), ) @@ -413,7 +413,7 @@ agent = ReActAgent.from_tools([docs_tool, code_tool], verbose=False) run = AgentRun(name="multi_tool_agent", start_time=datetime.now(UTC)) with run_context(run): response = agent.query( - "How does Trace Craft's LangChain adapter capture token counts? " + "How does TraceCraft's LangChain adapter capture token counts? " "Show me the relevant source code." ) print(response) @@ -455,7 +455,7 @@ with run_context(run): response = agent.query( "What is the weather in Seattle? " "Also, what is 123 plus 456? " - "Tell me about Trace Craft." + "Tell me about TraceCraft." ) print(response) ``` @@ -505,7 +505,7 @@ query_engine = index.as_query_engine(streaming=True) run = AgentRun(name="streaming_rag", start_time=datetime.now(UTC)) with run_context(run): - streaming_response = query_engine.query("Describe the Trace Craft architecture.") + streaming_response = query_engine.query("Describe the TraceCraft architecture.") streaming_response.print_response_stream() handler.clear() @@ -513,7 +513,7 @@ handler.clear() !!! tip "Streaming and token counts" When streaming is enabled, LlamaIndex may not provide final token counts until the stream - completes. Trace Craft captures whatever usage data is available in the response object. + completes. TraceCraft captures whatever usage data is available in the response object. ## Advanced Usage @@ -544,7 +544,7 @@ tracecraft.get_runtime().end_run(run) ### Multiple Indices -Trace Craft handles calls across multiple indices in the same run: +TraceCraft handles calls across multiple indices in the same run: ```python from tracecraft.core.context import run_context @@ -566,7 +566,7 @@ Both queries appear as sibling steps inside the same run. ### Custom Components -When you subclass a LlamaIndex component, Trace Craft infers the step type from the class and module +When you subclass a LlamaIndex component, TraceCraft infers the step type from the class and module name. Ensure your class name contains a descriptive keyword (`llm`, `retriever`, `tool`, `agent`) so the inference logic maps it correctly: diff --git a/docs/integrations/otel-receiver.md b/docs/integrations/otel-receiver.md index 7acc9d8..9b8b8af 100644 --- a/docs/integrations/otel-receiver.md +++ b/docs/integrations/otel-receiver.md @@ -1,16 +1,16 @@ # OpenTelemetry Integration -Trace Craft provides seamless OpenTelemetry (OTel) integration, allowing you to collect traces from any OTel-instrumented application and view them in Trace Craft's powerful TUI. 
+TraceCraft provides seamless OpenTelemetry (OTel) integration, allowing you to collect traces from any OTel-instrumented application and view them in TraceCraft's powerful TUI. !!! tip "When to Use This" Use the OTel integration when you want to: - **Receive traces** from existing OTel-instrumented applications - **Use standard instrumentation** libraries (OpenAI, Anthropic, LangChain, etc.) - - **Send to multiple backends** simultaneously (Trace Craft + DataDog, etc.) + - **Send to multiple backends** simultaneously (TraceCraft + DataDog, etc.) - **Integrate with existing OTel infrastructure** (collectors, pipelines) - For simpler use cases, consider [Auto-Instrumentation](auto-instrumentation.md) or [Trace Craft decorators](../user-guide/decorators.md). + For simpler use cases, consider [Auto-Instrumentation](auto-instrumentation.md) or [TraceCraft decorators](../user-guide/decorators.md). --- @@ -120,7 +120,7 @@ tracecraft tui sqlite://traces/my_traces.db ``` !!! success "That's it!" - Your traces are now being collected and can be viewed in Trace Craft's terminal UI. + Your traces are now being collected and can be viewed in TraceCraft's terminal UI. --- @@ -152,7 +152,7 @@ The `setup_exporter()` function replaces 20+ lines of OpenTelemetry boilerplate tracer = trace.get_tracer("my-agent") ``` -=== "After (With Trace Craft) - 3 lines" +=== "After (With TraceCraft) - 3 lines" ```python from tracecraft.otel import setup_exporter @@ -204,7 +204,7 @@ The `instrument` parameter accepts any combination of these SDK names: pip install opentelemetry-instrumentation-anthropic ``` - Trace Craft will display a warning if a requested package is missing. + TraceCraft will display a warning if a requested package is missing. --- @@ -212,7 +212,7 @@ The `instrument` parameter accepts any combination of these SDK names: `setup_exporter()` respects standard OpenTelemetry environment variables, making it easy to configure in different environments: -| Trace Craft Variable | OTel Fallback | Description | Example | +| TraceCraft Variable | OTel Fallback | Description | Example | |---------------------|---------------|-------------|---------| | `TRACECRAFT_ENDPOINT` | `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP endpoint URL | `http://localhost:4318` | | `TRACECRAFT_SERVICE_NAME` | `OTEL_SERVICE_NAME` | Service name | `my-agent` | @@ -246,13 +246,13 @@ The `instrument` parameter accepts any combination of these SDK names: ## Backend URL Schemes -Trace Craft supports custom URL schemes for different observability backends: +TraceCraft supports custom URL schemes for different observability backends: | Scheme | Converts To | Default Port | Use Case | |--------|-------------|--------------|----------| | `http://` | `http://` | 4318 | Local development, internal collectors | | `https://` | `https://` | 4318 | Production OTLP endpoints | -| `tracecraft://` | `http://` | 4318 | Trace Craft receiver (alias for http) | +| `tracecraft://` | `http://` | 4318 | TraceCraft receiver (alias for http) | | `datadog://` | `https://` | 4318 | DataDog OTLP intake | | `azure://` | `https://` | 443 | Azure Application Insights | | `aws://` | `https://` | 443 | AWS X-Ray | @@ -260,7 +260,7 @@ Trace Craft supports custom URL schemes for different observability backends: ### Examples -=== "Local Trace Craft" +=== "Local TraceCraft" ```python tracer = setup_exporter(endpoint="tracecraft://localhost:4318") @@ -286,9 +286,9 @@ Trace Craft supports custom URL schemes for different observability backends: --- -## Trace Craft Span Attributes 
+## TraceCraft Span Attributes -To ensure your traces display correctly in Trace Craft's TUI, set these attributes on your spans: +To ensure your traces display correctly in TraceCraft's TUI, set these attributes on your spans: ### Step Types @@ -1009,7 +1009,7 @@ signal.signal(signal.SIGINT, graceful_shutdown) ### Empty Input/Output in TUI -??? question "Are you setting Trace Craft attributes?" +??? question "Are you setting TraceCraft attributes?" Ensure you set the required attributes on parent spans: ```python @@ -1093,7 +1093,7 @@ Use `parse_endpoint` to understand how URLs are interpreted: ```python from tracecraft.otel import parse_endpoint -# Parse a Trace Craft URL +# Parse a TraceCraft URL config = parse_endpoint("tracecraft://myhost:4318/custom/path") print(f"Scheme: {config.scheme}") # tracecraft print(f"Host: {config.host}") # myhost @@ -1186,7 +1186,7 @@ print(f"Backend Type: {config.backend_type}") # tracecraft --- - Alternative approach using Trace Craft's native instrumentation + Alternative approach using TraceCraft's native instrumentation [:octicons-arrow-right-24: Auto-Instrumentation](auto-instrumentation.md) diff --git a/docs/integrations/pydantic-ai.md b/docs/integrations/pydantic-ai.md index 5919be7..8de5aca 100644 --- a/docs/integrations/pydantic-ai.md +++ b/docs/integrations/pydantic-ai.md @@ -1,7 +1,7 @@ # PydanticAI Integration -Trace Craft integrates with PydanticAI through the `TraceCraftSpanProcessor`, an OpenTelemetry -`SpanProcessor` that intercepts PydanticAI's Logfire-based spans and converts them to Trace Craft +TraceCraft integrates with PydanticAI through the `TraceCraftSpanProcessor`, an OpenTelemetry +`SpanProcessor` that intercepts PydanticAI's Logfire-based spans and converts them to TraceCraft Steps. You get full traces — LLM calls, tool use, structured output, and retries — with no changes to your PydanticAI agent code. @@ -11,7 +11,7 @@ to your PydanticAI agent code. pip install "tracecraft[pydantic-ai]" ``` -This installs Trace Craft with PydanticAI support, including `opentelemetry-sdk` and +This installs TraceCraft with PydanticAI support, including `opentelemetry-sdk` and `pydantic-ai>=0.0.14`. ## Quick Start @@ -26,7 +26,7 @@ from opentelemetry.sdk.trace import TracerProvider from pydantic_ai import Agent from datetime import UTC, datetime -# Initialize Trace Craft +# Initialize TraceCraft tracecraft.init(console=True) # Wire the span processor into an OTel TracerProvider @@ -38,7 +38,7 @@ trace.set_tracer_provider(provider) # Create a PydanticAI agent agent = Agent("openai:gpt-4o-mini", system_prompt="You are a helpful assistant.") -# Wrap execution in a Trace Craft run +# Wrap execution in a TraceCraft run run = AgentRun(name="pydantic_ai_demo", start_time=datetime.now(UTC)) with run_context(run): result = agent.run_sync("What is the capital of France?") @@ -152,7 +152,7 @@ asyncio.run(main()) ### Pydantic Models as Output -PydanticAI validates agent responses against a Pydantic model. Trace Craft captures the LLM call +PydanticAI validates agent responses against a Pydantic model. 
TraceCraft captures the LLM call and the validation result: ```python @@ -270,7 +270,7 @@ def lookup_fact(topic: str) -> str: """Look up a fact about a topic.""" facts = { "python": "Python was created by Guido van Rossum in 1991.", - "tracecraft": "Trace Craft is a vendor-neutral LLM observability SDK.", + "tracecraft": "TraceCraft is a vendor-neutral LLM observability SDK.", } return facts.get(topic.lower(), f"No fact found for '{topic}'.") @@ -364,7 +364,7 @@ for step in run.steps: ### RunContext Dependencies -PydanticAI uses `RunContext` to inject dependencies into tools. Trace Craft traces the tool +PydanticAI uses `RunContext` to inject dependencies into tools. TraceCraft traces the tool calls transparently regardless of the dependency type: ```python @@ -445,7 +445,7 @@ with run_context(run): ### Automatic Retries -PydanticAI automatically retries failed tool calls and LLM requests. Trace Craft captures each +PydanticAI automatically retries failed tool calls and LLM requests. TraceCraft captures each attempt as a separate step, so you can see exactly where failures occurred: ```python @@ -526,7 +526,7 @@ async def main() -> None: asyncio.run(main()) ``` -Trace Craft marks the LLM step with `is_streaming = True` and accumulates text in +TraceCraft marks the LLM step with `is_streaming = True` and accumulates text in `step.streaming_chunks`. ### Structured Output Streaming @@ -597,7 +597,7 @@ with run_context(run): ### Model Switching -You can run the same agent logic against different models and compare results in Trace Craft: +You can run the same agent logic against different models and compare results in TraceCraft: ```python from pydantic_ai import Agent @@ -661,7 +661,7 @@ agent = Agent("openai:gpt-4o-mini") run = AgentRun(name="custom_prompt", start_time=datetime.now(UTC)) with run_context(run): result = agent.run_sync( - "What is Trace Craft?", + "What is TraceCraft?", system_prompt=build_system_prompt(user_tier="pro", locale="English"), ) print(result.data) @@ -734,7 +734,7 @@ finally: ### 4. Use Descriptive Run Names -Run names appear in the Trace Craft TUI and JSONL exports. Choose names that identify the +Run names appear in the TraceCraft TUI and JSONL exports. Choose names that identify the workflow, user, and session for easy filtering: ```python @@ -747,7 +747,7 @@ run = AgentRun( ### 5. Handle Tool Errors Explicitly When a tool raises an exception that is not `ModelRetry`, PydanticAI propagates it to the caller. -Trace Craft captures the error in the span, but you should still handle it at the application level: +TraceCraft captures the error in the span, but you should still handle it at the application level: ```python from pydantic_ai.exceptions import UnexpectedModelBehavior diff --git a/docs/migration/from-langfuse.md b/docs/migration/from-langfuse.md index c9816e5..733dbae 100644 --- a/docs/migration/from-langfuse.md +++ b/docs/migration/from-langfuse.md @@ -1,10 +1,10 @@ -# Migrating from Langfuse to Trace Craft +# Migrating from Langfuse to TraceCraft -This guide helps you migrate from Langfuse to Trace Craft for LLM observability. +This guide helps you migrate from Langfuse to TraceCraft for LLM observability. ## Key Differences -| Feature | Langfuse | Trace Craft | +| Feature | Langfuse | TraceCraft | |---------|----------|------------| | Architecture | Cloud + self-host | Local-first | | UI | Web dashboard | HTML reports + OTLP backends | @@ -13,7 +13,7 @@ This guide helps you migrate from Langfuse to Trace Craft for LLM observability. 
## Migration Steps -### 1. Install Trace Craft +### 1. Install TraceCraft ```bash pip install tracecraft @@ -36,7 +36,7 @@ def my_agent(query: str) -> str: return process(query) ``` -**After (Trace Craft):** +**After (TraceCraft):** ```python import tracecraft @@ -62,7 +62,7 @@ span.end(output=result) trace.update(output=final_result) ``` -**After (Trace Craft):** +**After (TraceCraft):** ```python from tracecraft.core.context import run_context @@ -97,7 +97,7 @@ handler = CallbackHandler() chain.invoke(input, config={"callbacks": [handler]}) ``` -**After (Trace Craft):** +**After (TraceCraft):** ```python from tracecraft.adapters.langchain import TraceCraftCallbackHandler @@ -116,7 +116,7 @@ os.environ["LANGFUSE_SECRET_KEY"] = "sk-..." os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com" ``` -**After (Trace Craft):** +**After (TraceCraft):** ```python import tracecraft @@ -133,7 +133,7 @@ tracecraft.init(exporters=[ ## Feature Mapping -| Langfuse Feature | Trace Craft Equivalent | +| Langfuse Feature | TraceCraft Equivalent | |------------------|----------------------| | `@observe()` | `@trace_agent`, `@trace_llm`, `@trace_tool` | | `langfuse.trace()` | `AgentRun` + `run_context` | diff --git a/docs/migration/from-langsmith.md b/docs/migration/from-langsmith.md index d647577..7fb35fd 100644 --- a/docs/migration/from-langsmith.md +++ b/docs/migration/from-langsmith.md @@ -1,10 +1,10 @@ -# Migrating from LangSmith to Trace Craft +# Migrating from LangSmith to TraceCraft -This guide helps you migrate from LangSmith to Trace Craft for LLM observability. +This guide helps you migrate from LangSmith to TraceCraft for LLM observability. ## Key Differences -| Feature | LangSmith | Trace Craft | +| Feature | LangSmith | TraceCraft | |---------|-----------|------------| | Vendor Lock-in | LangChain ecosystem | Vendor-neutral | | Export Formats | Proprietary | OTLP, JSONL, HTML | @@ -13,7 +13,7 @@ This guide helps you migrate from LangSmith to Trace Craft for LLM observability ## Migration Steps -### 1. Install Trace Craft +### 1. Install TraceCraft ```bash pip install tracecraft @@ -34,13 +34,13 @@ from langchain_openai import ChatOpenAI llm = ChatOpenAI() ``` -**After (Trace Craft):** +**After (TraceCraft):** ```python import tracecraft from tracecraft.adapters.langchain import TraceCraftCallbackHandler -# Initialize Trace Craft +# Initialize TraceCraft tracecraft.init(console=True, jsonl=True) # Use the callback handler @@ -94,7 +94,7 @@ tracecraft.init(exporters=[otlp]) ## Feature Mapping -| LangSmith Feature | Trace Craft Equivalent | +| LangSmith Feature | TraceCraft Equivalent | |-------------------|----------------------| | `@traceable` decorator | `@tracecraft.trace_agent` | | Run trees | Nested Steps with parent_id | diff --git a/docs/migration/from-openllmetry.md b/docs/migration/from-openllmetry.md index afc0aec..8ed082c 100644 --- a/docs/migration/from-openllmetry.md +++ b/docs/migration/from-openllmetry.md @@ -1,10 +1,10 @@ -# Migrating from OpenLLMetry to Trace Craft +# Migrating from OpenLLMetry to TraceCraft -This guide helps you migrate from OpenLLMetry (Traceloop) to Trace Craft. +This guide helps you migrate from OpenLLMetry (Traceloop) to TraceCraft. 
## Key Differences -| Feature | OpenLLMetry | Trace Craft | +| Feature | OpenLLMetry | TraceCraft | |---------|-------------|------------| | Focus | Auto-instrumentation | Explicit + auto | | Protocol | OpenTelemetry native | OTLP export + local | @@ -13,7 +13,7 @@ This guide helps you migrate from OpenLLMetry (Traceloop) to Trace Craft. ## Migration Steps -### 1. Install Trace Craft +### 1. Install TraceCraft ```bash pip install tracecraft @@ -34,7 +34,7 @@ Traceloop.init( ) ``` -**After (Trace Craft):** +**After (TraceCraft):** ```python import tracecraft @@ -71,7 +71,7 @@ def summarize(docs: list): return summary ``` -**After (Trace Craft):** +**After (TraceCraft):** ```python import tracecraft @@ -103,7 +103,7 @@ with Traceloop.set_association_properties({ result = my_function() ``` -**After (Trace Craft):** +**After (TraceCraft):** ```python from tracecraft.core.context import run_context @@ -134,16 +134,16 @@ Traceloop.set_prompt( ) ``` -**After (Trace Craft):** +**After (TraceCraft):** ```python -# Trace Craft doesn't manage prompts - use your preferred solution +# TraceCraft doesn't manage prompts - use your preferred solution # Prompts are captured in LLM step inputs automatically ``` ## Feature Mapping -| OpenLLMetry Feature | Trace Craft Equivalent | +| OpenLLMetry Feature | TraceCraft Equivalent | |---------------------|----------------------| | `@workflow` | `@trace_agent` | | `@task` | `@trace_tool` | @@ -155,12 +155,12 @@ Traceloop.set_prompt( ## Keeping OpenTelemetry Native -If you want to stay with pure OpenTelemetry but need Trace Craft features: +If you want to stay with pure OpenTelemetry but need TraceCraft features: ```python from tracecraft.exporters.otlp import OTLPExporter -# Trace Craft converts its traces to OTLP spans +# TraceCraft converts its traces to OTLP spans otlp = OTLPExporter( endpoint="http://otel-collector:4317", protocol="grpc" diff --git a/docs/migration/index.md b/docs/migration/index.md index 33a1a4d..3a1974f 100644 --- a/docs/migration/index.md +++ b/docs/migration/index.md @@ -1,6 +1,6 @@ # Migration Guides -Moving from another LLM observability tool to Trace Craft is usually straightforward. Trace Craft +Moving from another LLM observability tool to TraceCraft is usually straightforward. TraceCraft is vendor-neutral and supports all major export formats (OTLP, JSONL, HTML), so you can keep your existing backend while switching the instrumentation layer. @@ -15,7 +15,7 @@ and a step-by-step migration checklist. --- - Replace LangSmith tracing with Trace Craft. Covers `@traceable` to `@trace_agent` + Replace LangSmith tracing with TraceCraft. Covers `@traceable` to `@trace_agent` conversion, callback handler setup, and OTLP export as an alternative to the LangSmith cloud. @@ -25,7 +25,7 @@ and a step-by-step migration checklist. --- - Replace Langfuse SDK calls with Trace Craft decorators and context managers. Covers + Replace Langfuse SDK calls with TraceCraft decorators and context managers. Covers observation mapping, dataset export, and self-hosted alternatives. [:octicons-arrow-right-24: Migrate from Langfuse](from-langfuse.md) @@ -34,16 +34,16 @@ and a step-by-step migration checklist. --- - Replace OpenLLMetry instrumentation with Trace Craft. Covers span mapping, workflow + Replace OpenLLMetry instrumentation with TraceCraft. Covers span mapping, workflow decorators, and reusing your existing OTLP collector configuration. [:octicons-arrow-right-24: Migrate from OpenLLMetry](from-openllmetry.md) -## Why Migrate to Trace Craft? 
+## Why Migrate to TraceCraft? -| Feature | LangSmith | Langfuse | OpenLLMetry | Trace Craft | +| Feature | LangSmith | Langfuse | OpenLLMetry | TraceCraft | |---------|:---------:|:--------:|:-----------:|:----------:| | Vendor-neutral export | No | Partial | Yes | Yes | | Works fully offline | No | Self-host | Yes | Yes | @@ -64,7 +64,7 @@ Regardless of which tool you are migrating from, the process follows the same pa 5. __Remove the old SDK__ once all modules have been migrated and verified. !!! tip "Zero-Downtime Migration" - Because Trace Craft can export to the same OTLP backend as your current tool, you can run + Because TraceCraft can export to the same OTLP backend as your current tool, you can run both side by side during the transition. Simply point both SDKs at the same collector and compare the output before fully cutting over. diff --git a/docs/reference/quick-reference.md b/docs/reference/quick-reference.md index 5ffa5e6..3a309ce 100644 --- a/docs/reference/quick-reference.md +++ b/docs/reference/quick-reference.md @@ -1,6 +1,6 @@ # Quick Reference -This page provides a condensed reference for Trace Craft's most commonly used APIs, configuration options, CLI commands, and patterns. For full documentation, follow the links in each section. +This page provides a condensed reference for TraceCraft's most commonly used APIs, configuration options, CLI commands, and patterns. For full documentation, follow the links in each section. --- @@ -191,7 +191,7 @@ with step("data_preprocessing", type=StepType.WORKFLOW) as s: ## Config File -Trace Craft loads `.tracecraft/config.yaml` from the project root (or `~/.tracecraft/config.yaml`) automatically. Explicit `init()` parameters always win. +TraceCraft loads `.tracecraft/config.yaml` from the project root (or `~/.tracecraft/config.yaml`) automatically. Explicit `init()` parameters always win. **Minimal config:** diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md index f0ad34a..db7848f 100644 --- a/docs/troubleshooting.md +++ b/docs/troubleshooting.md @@ -1,7 +1,7 @@ # Troubleshooting This guide covers the most common problems encountered when installing, configuring, or running -Trace Craft. Each issue includes the root cause and a concrete solution. +TraceCraft. Each issue includes the root cause and a concrete solution. If your problem is not listed here, check the [FAQ](faq.md) or open an issue on [GitHub](https://github.com/LocalAI/tracecraft/issues). @@ -20,7 +20,7 @@ ModuleNotFoundError: No module named 'tracecraft.adapters.langchain' **Cause** -Trace Craft uses optional dependency groups (extras) to keep the base package lightweight. +TraceCraft uses optional dependency groups (extras) to keep the base package lightweight. Adapters and some exporters are not installed unless you request them explicitly. **Solution** @@ -64,7 +64,7 @@ AttributeError: module 'opentelemetry.trace' has no attribute 'use_span' **Cause** -Trace Craft targets the OpenTelemetry Python SDK `>=1.20`. An older version installed by +TraceCraft targets the OpenTelemetry Python SDK `>=1.20`. An older version installed by another package in your environment is taking precedence. **Solution** @@ -227,7 +227,7 @@ blocking the collector port, or the backend process not yet running. nc -zv localhost 4317 ``` -2. Enable debug logging to see what Trace Craft is attempting to send: +2. Enable debug logging to see what TraceCraft is attempting to send: ```python import logging @@ -358,7 +358,7 @@ async def worker(task: str) -> str: ... 
``` -For framework integrations, always use the Trace Craft adapter: +For framework integrations, always use the TraceCraft adapter: ```python from tracecraft.adapters.langchain import TraceCraftCallbackHandler diff --git a/docs/user-guide/configuration.md b/docs/user-guide/configuration.md index f49a49b..13db3f9 100644 --- a/docs/user-guide/configuration.md +++ b/docs/user-guide/configuration.md @@ -1,6 +1,6 @@ # Configuration -Trace Craft can be configured through code, a config file, or environment variables. This guide covers all available configuration options. +TraceCraft can be configured through code, a config file, or environment variables. This guide covers all available configuration options. ## Configuration Precedence @@ -27,7 +27,7 @@ tracecraft.init(service_name="code-service") ## Config File -The easiest way to configure Trace Craft for a project is a config file at `.tracecraft/config.yaml` in your project root (or `~/.tracecraft/config.yaml` globally). The file is loaded automatically — no code changes required. +The easiest way to configure TraceCraft for a project is a config file at `.tracecraft/config.yaml` in your project root (or `~/.tracecraft/config.yaml` globally). The file is loaded automatically — no code changes required. ### Minimal Config @@ -306,7 +306,7 @@ export TRACECRAFT_AUTO_INSTRUMENT=openai,langchain # selective !!! warning "Initialize Before Importing SDKs" `tracecraft.init()` must be called **before** importing OpenAI, Anthropic, - LangChain, or LlamaIndex. Trace Craft patches at import time — importing first + LangChain, or LlamaIndex. TraceCraft patches at import time — importing first means the patch won't apply. ### Console Output diff --git a/docs/user-guide/decorators.md b/docs/user-guide/decorators.md index 0e1dfb9..ac83af0 100644 --- a/docs/user-guide/decorators.md +++ b/docs/user-guide/decorators.md @@ -1,12 +1,12 @@ # Custom Instrumentation (Decorators) -Trace Craft provides decorators for adding custom semantic tracing to your code. +TraceCraft provides decorators for adding custom semantic tracing to your code. !!! info "Decorators are Optional" **For most use cases, you don't need decorators.** - Trace Craft's [auto-instrumentation](../integrations/auto-instrumentation.md) automatically + TraceCraft's [auto-instrumentation](../integrations/auto-instrumentation.md) automatically captures all OpenAI and Anthropic LLM calls without any code changes. Framework adapters for LangChain and LlamaIndex also provide automatic tracing. @@ -24,7 +24,7 @@ Trace Craft provides decorators for adding custom semantic tracing to your code. ## Overview -Trace Craft offers four main decorators: +TraceCraft offers four main decorators: | Decorator | Purpose | Step Type | |-----------|---------|-----------| @@ -598,7 +598,7 @@ async def agent(query: str) -> str: ## Next Steps -- [Configuration](configuration.md) - Configure Trace Craft behavior +- [Configuration](configuration.md) - Configure TraceCraft behavior - [Exporters](exporters.md) - Send traces to different backends - [Processors](processors.md) - Process and transform traces - [API Reference](../api/decorators.md) - Complete decorator API diff --git a/docs/user-guide/exporters.md b/docs/user-guide/exporters.md index e5c5295..e065a46 100644 --- a/docs/user-guide/exporters.md +++ b/docs/user-guide/exporters.md @@ -1,6 +1,6 @@ # Exporters -Exporters send trace data to different backends. Trace Craft supports multiple exporters simultaneously. +Exporters send trace data to different backends. 
TraceCraft supports multiple exporters simultaneously. ## Available Exporters diff --git a/docs/user-guide/index.md b/docs/user-guide/index.md index 7191767..2029940 100644 --- a/docs/user-guide/index.md +++ b/docs/user-guide/index.md @@ -1,6 +1,6 @@ # User Guide -Welcome to the Trace Craft SDK Guide. This section covers Trace Craft's Python SDK in detail — decorators, configuration, exporters, processors, and advanced patterns. +Welcome to the TraceCraft SDK Guide. This section covers TraceCraft's Python SDK in detail — decorators, configuration, exporters, processors, and advanced patterns. !!! tip "Looking for the Terminal UI guide?" The TUI has its own section: [Terminal UI](tui.md) diff --git a/docs/user-guide/multi-tenancy.md b/docs/user-guide/multi-tenancy.md index 96b9118..cb0904a 100644 --- a/docs/user-guide/multi-tenancy.md +++ b/docs/user-guide/multi-tenancy.md @@ -1,10 +1,10 @@ # Multi-Tenancy -Handle multiple tenants with separate configurations using Trace Craft runtimes. +Handle multiple tenants with separate configurations using TraceCraft runtimes. ## Overview -Trace Craft supports multi-tenancy through isolated runtimes, each with its own configuration. +TraceCraft supports multi-tenancy through isolated runtimes, each with its own configuration. ## Basic Multi-Tenancy diff --git a/docs/user-guide/performance.md b/docs/user-guide/performance.md index e754163..467611f 100644 --- a/docs/user-guide/performance.md +++ b/docs/user-guide/performance.md @@ -1,6 +1,6 @@ # Performance Guide -Trace Craft is designed to add minimal overhead to your LLM applications. This guide explains how to tune the SDK for your performance requirements, from development setups where you want full observability to high-throughput production environments where every millisecond counts. +TraceCraft is designed to add minimal overhead to your LLM applications. This guide explains how to tune the SDK for your performance requirements, from development setups where you want full observability to high-throughput production environments where every millisecond counts. --- @@ -8,7 +8,7 @@ Trace Craft is designed to add minimal overhead to your LLM applications. This g ### Why Performance Matters in LLM Observability -LLM calls already carry significant latency (hundreds of milliseconds to multiple seconds). The observability layer should never become the bottleneck. Trace Craft is built with this constraint in mind: +LLM calls already carry significant latency (hundreds of milliseconds to multiple seconds). The observability layer should never become the bottleneck. TraceCraft is built with this constraint in mind: - **Synchronous decorators** add only function-call overhead when processors are bypassed. - **Processor pipelines** run inline but are short-circuited by sampling so most traces never reach redaction or export. @@ -332,7 +332,7 @@ with step("process_batch", type=StepType.WORKFLOW) as s: ### Garbage Collection Considerations -Trace Craft holds `AgentRun` objects in memory from `start_run()` until `end_run()` exports and releases them. For long-running agents (minutes or more), keep the following in mind: +TraceCraft holds `AgentRun` objects in memory from `start_run()` until `end_run()` exports and releases them. For long-running agents (minutes or more), keep the following in mind: - Large `inputs` and `outputs` dicts are held for the full run duration. - Deeply nested step hierarchies (many children) can accumulate significant memory. 
@@ -352,7 +352,7 @@ config = TraceCraftConfig( ### When to Use Async -Trace Craft's decorators transparently support both synchronous and asynchronous functions. The `@trace_agent`, `@trace_tool`, `@trace_llm`, and `@trace_retrieval` decorators detect `asyncio.iscoroutinefunction` at decoration time and wrap accordingly. +TraceCraft's decorators transparently support both synchronous and asynchronous functions. The `@trace_agent`, `@trace_tool`, `@trace_llm`, and `@trace_retrieval` decorators detect `asyncio.iscoroutinefunction` at decoration time and wrap accordingly. ```python # Synchronous — works directly @@ -370,7 +370,7 @@ Use the async path whenever your application is already async. The overhead is i ### Context Propagation Across Async Boundaries -Python's `contextvars` (which Trace Craft uses internally) do **not** automatically propagate across `asyncio.gather()` or `asyncio.create_task()`. Each new task gets an independent copy of the context at creation time. For most simple cases this is fine, but for complex fan-out patterns use the helpers from `tracecraft.contrib.async_helpers`. +Python's `contextvars` (which TraceCraft uses internally) do **not** automatically propagate across `asyncio.gather()` or `asyncio.create_task()`. Each new task gets an independent copy of the context at creation time. For most simple cases this is fine, but for complex fan-out patterns use the helpers from `tracecraft.contrib.async_helpers`. ```python from tracecraft.contrib.async_helpers import gather_with_context diff --git a/docs/user-guide/processors.md b/docs/user-guide/processors.md index 9af23f6..40c276f 100644 --- a/docs/user-guide/processors.md +++ b/docs/user-guide/processors.md @@ -1,6 +1,6 @@ # Processors -Processors transform trace data before export. Trace Craft includes built-in processors for PII redaction, sampling, and enrichment. +Processors transform trace data before export. TraceCraft includes built-in processors for PII redaction, sampling, and enrichment. ## Overview @@ -62,7 +62,7 @@ mode=RedactionMode.HASH ### Built-In Patterns -Trace Craft redacts common PII: +TraceCraft redacts common PII: - Email addresses - Phone numbers diff --git a/docs/user-guide/remote-trace-sources.md b/docs/user-guide/remote-trace-sources.md index e881be7..6e513d9 100644 --- a/docs/user-guide/remote-trace-sources.md +++ b/docs/user-guide/remote-trace-sources.md @@ -1,6 +1,6 @@ # Remote Trace Sources -Trace Craft's TUI can pull traces that already live in your cloud observability platform — no need to copy data locally. Connect directly to AWS X-Ray, GCP Cloud Trace, Azure Monitor, or DataDog and browse those traces with the same interactive interface you use for locally-stored traces. +TraceCraft's TUI can pull traces that already live in your cloud observability platform — no need to copy data locally. Connect directly to AWS X-Ray, GCP Cloud Trace, Azure Monitor, or DataDog and browse those traces with the same interactive interface you use for locally-stored traces. !!! info "Read-only connection" Remote backends are **read-only**. The TUI fetches and displays traces; it never writes, modifies, or deletes records in your platform. `save()` and `delete()` operations always raise `NotImplementedError`. @@ -196,7 +196,7 @@ Requires **two environment variables** set at shell level: | `DD_APP_KEY` | DataDog Application key (from Organization Settings → Application Keys) | !!! 
danger "Never put DataDog credentials in config files" - `DD_API_KEY` and `DD_APP_KEY` must be set as environment variables. Trace Craft validates at startup that both are present and raises `ValueError` with a clear error message if either is missing. + `DD_API_KEY` and `DD_APP_KEY` must be set as environment variables. TraceCraft validates at startup that both are present and raises `ValueError` with a clear error message if either is missing. ### Supported Sites @@ -239,7 +239,7 @@ All remote backends use an in-memory TTL cache (default: **60 seconds**) to prev ### What Gets Populated -Remote backends map platform-specific spans to Trace Craft's canonical `AgentRun` and `Step` models. The accuracy of LLM-specific fields depends on whether your instrumentation wrote them: +Remote backends map platform-specific spans to TraceCraft's canonical `AgentRun` and `Step` models. The accuracy of LLM-specific fields depends on whether your instrumentation wrote them: | Field | Populated When | |-------|---------------| @@ -251,7 +251,7 @@ Remote backends map platform-specific spans to Trace Craft's canonical `AgentRun | `cloud_trace_id` | Always set to the platform's native trace ID | !!! tip "Best field coverage" - If your agents are instrumented with Trace Craft's own decorators or the OpenTelemetry GenAI semantic conventions, all LLM fields will be populated when viewing traces from any platform. + If your agents are instrumented with TraceCraft's own decorators or the OpenTelemetry GenAI semantic conventions, all LLM fields will be populated when viewing traces from any platform. ### Step Type Inference diff --git a/docs/user-guide/security.md b/docs/user-guide/security.md index 55ed7f6..798da55 100644 --- a/docs/user-guide/security.md +++ b/docs/user-guide/security.md @@ -1,6 +1,6 @@ # Security Guide -Trace Craft follows a security-first design philosophy: sensitive data is protected by default, not as an afterthought. This guide covers PII redaction, credential handling, compliance considerations, and best practices for operating Trace Craft in security-conscious environments. +TraceCraft follows a security-first design philosophy: sensitive data is protected by default, not as an afterthought. This guide covers PII redaction, credential handling, compliance considerations, and best practices for operating TraceCraft in security-conscious environments. --- @@ -8,7 +8,7 @@ Trace Craft follows a security-first design philosophy: sensitive data is protec ### Security-First Design Philosophy -Trace Craft makes safe behavior the default: +TraceCraft makes safe behavior the default: - **PII redaction is enabled by default.** You must explicitly disable it. - **17 built-in patterns** cover the most common credential and PII types out of the box. @@ -17,7 +17,7 @@ Trace Craft makes safe behavior the default: ### What Is Protected by Default -Out of the box, with no configuration changes, Trace Craft redacts: +Out of the box, with no configuration changes, TraceCraft redacts: - Email addresses - Phone numbers @@ -168,7 +168,7 @@ config = TraceCraftConfig( Patterns are compiled as Python `re` patterns and applied to all string values in the trace dict. !!! warning "ReDoS Safety" - Avoid patterns with nested quantifiers or catastrophic backtracking characteristics. Trace Craft's built-in patterns are designed to be ReDoS-safe, and custom patterns should be too. Test patterns with tools like `regexr.com` or Python's `re` module before deploying. 
+ Avoid patterns with nested quantifiers or catastrophic backtracking characteristics. TraceCraft's built-in patterns are designed to be ReDoS-safe, and custom patterns should be too. Test patterns with tools like `regexr.com` or Python's `re` module before deploying. ### Field-Based Rules @@ -268,7 +268,7 @@ def authenticate(username: str, password: str) -> bool: ## Compliance Considerations -Trace Craft provides tools to help meet compliance requirements, but compliance is the responsibility of your organization. This section outlines common patterns. +TraceCraft provides tools to help meet compliance requirements, but compliance is the responsibility of your organization. This section outlines common patterns. ### GDPR @@ -302,7 +302,7 @@ The Health Insurance Portability and Accountability Act requires protection of P **PHI Handling:** -PHI includes names, dates, phone numbers, geographic data, social security numbers, and other identifiers. Trace Craft's built-in patterns cover several of these (phone, SSN, email), but you should add custom rules for domain-specific identifiers: +PHI includes names, dates, phone numbers, geographic data, social security numbers, and other identifiers. TraceCraft's built-in patterns cover several of these (phone, SSN, email), but you should add custom rules for domain-specific identifiers: ```python from tracecraft.processors.redaction import RedactionRule @@ -476,7 +476,7 @@ with tenant_b.trace_context(): ### Role-Based Access -Trace Craft does not implement RBAC directly; access control is enforced at the storage layer: +TraceCraft does not implement RBAC directly; access control is enforced at the storage layer: - **JSONL files:** Use filesystem ACLs to restrict read access. - **SQLite:** Use filesystem permissions per database file. @@ -511,7 +511,7 @@ def with_audit_context(user_id: str, session_id: str): ## Security Checklist for Production -Use this checklist before deploying Trace Craft to a production environment: +Use this checklist before deploying TraceCraft to a production environment: - [ ] **Redaction enabled.** `RedactionConfig(enabled=True)` is the default; verify it has not been disabled. - [ ] **Sensitive parameters excluded.** All functions accepting API keys, passwords, or tokens use `exclude_inputs` or `capture_inputs=False`. diff --git a/docs/user-guide/tui.md b/docs/user-guide/tui.md index da60a60..66886ba 100644 --- a/docs/user-guide/tui.md +++ b/docs/user-guide/tui.md @@ -1,6 +1,6 @@ # Terminal UI -The Trace Craft Terminal UI (TUI) is the flagship feature of Trace Craft — a powerful, interactive trace explorer that runs right in your terminal. Analyze LLM traces, debug agent behavior, inspect prompts and responses, compare performance across runs, and understand exactly what your AI application did at every step. +The TraceCraft Terminal UI (TUI) is the flagship feature of TraceCraft — a powerful, interactive trace explorer that runs right in your terminal. Analyze LLM traces, debug agent behavior, inspect prompts and responses, compare performance across runs, and understand exactly what your AI application did at every step. !!! 
tip "Two Ways to Get Traces into the TUI" @@ -23,7 +23,7 @@ The Trace Craft Terminal UI (TUI) is the flagship feature of Trace Craft — a p --- - **Path A — Trace Craft instrumentation** (you control the code): + **Path A — TraceCraft instrumentation** (you control the code): The fastest option is `receiver=True` — one `init()` call, then start the TUI: @@ -73,7 +73,7 @@ The Trace Craft Terminal UI (TUI) is the flagship feature of Trace Craft — a p ### Main View — All Your Agent Runs -![Trace Craft TUI - Main View](../assets/screenshots/tui-main-view.svg) +![TraceCraft TUI - Main View](../assets/screenshots/tui-main-view.svg) *The main view shows all captured agent runs with name, duration, token usage, and status. Navigate with arrow keys. The filter bar at the top lets you search and filter by project or session.* @@ -83,7 +83,7 @@ The Trace Craft Terminal UI (TUI) is the flagship feature of Trace Craft — a p Select any trace and press `Enter` — the waterfall expands to show the complete call tree: -![Trace Craft TUI - Waterfall View](../assets/screenshots/tui-waterfall-view.svg) +![TraceCraft TUI - Waterfall View](../assets/screenshots/tui-waterfall-view.svg) *The waterfall shows agent → tool → LLM call hierarchy with timing bars. See exactly where your agent spends its time.* @@ -93,7 +93,7 @@ Select any trace and press `Enter` — the waterfall expands to show the complet Press `i` while viewing any span to see the exact prompt, system message, and context: -![Trace Craft TUI - Input View](../assets/screenshots/tui-input-view.svg) +![TraceCraft TUI - Input View](../assets/screenshots/tui-input-view.svg) *Full prompt inspection — every system message, user message, and context sent to the model.* @@ -103,7 +103,7 @@ Press `i` while viewing any span to see the exact prompt, system message, and co Press `o` to view the model's complete response: -![Trace Craft TUI - Output View](../assets/screenshots/tui-output-view.svg) +![TraceCraft TUI - Output View](../assets/screenshots/tui-output-view.svg) *Model response with token counts and cost estimates displayed inline.* @@ -113,7 +113,7 @@ Press `o` to view the model's complete response: Press `a` to view all span attributes — model name, temperature, token counts, costs, and custom metadata: -![Trace Craft TUI - Attributes View](../assets/screenshots/tui-attributes-view.svg) +![TraceCraft TUI - Attributes View](../assets/screenshots/tui-attributes-view.svg) *All span attributes: model, provider, temperature, token usage, cost, custom metadata, and timing.* @@ -125,7 +125,7 @@ There are two independent paths. Choose the one that fits your setup — or use --- -### Path A: Trace Craft Instrumentation +### Path A: TraceCraft Instrumentation Use this when you own the code. You have three options — auto-instrumentation is the fastest to set up. @@ -302,7 +302,7 @@ Traces appear live in the TUI as they are received (with `--watch` mode enabled ### Installation -=== "Path A — Trace Craft instrumentation" +=== "Path A — TraceCraft instrumentation" ```bash pip install "tracecraft[tui]" @@ -318,7 +318,7 @@ Traces appear live in the TUI as they are received (with `--watch` mode enabled ### Launch -**Path A — after running your Trace Craft-instrumented agent:** +**Path A — after running your TraceCraft-instrumented agent:** ```bash # Open TUI from config-specified storage (default: traces/tracecraft.db) @@ -354,7 +354,7 @@ Traces appear in the TUI in real-time. The TUI supports both JSONL and SQLite storage backends. 
SQLite enables additional features: -![Trace Craft TUI - SQLite Database View](../assets/screenshots/tui-db-main-view.svg) +![TraceCraft TUI - SQLite Database View](../assets/screenshots/tui-db-main-view.svg) *SQLite view: project and session columns are populated, enabling filtering by project or session. Agent names like "WeatherAgent" and "MathTutor" are shown with full project context.* @@ -370,7 +370,7 @@ tracecraft.init(sqlite=True) # Saves to traces/tracecraft.db All TUI actions are keyboard-driven. Press `?` at any time to show the built-in keyboard shortcut reference: -![Trace Craft TUI - Help Screen](../assets/screenshots/tui-help-screen.svg) +![TraceCraft TUI - Help Screen](../assets/screenshots/tui-help-screen.svg) *The help screen (press `?`) shows all keyboard shortcuts in context.* @@ -450,7 +450,7 @@ Navigate with `↑`/`↓`. Press `Enter` to select a trace and expand the waterf The waterfall shows the complete call hierarchy for the selected trace. Press `Enter` on any trace to expand it: -![Trace Craft TUI - Waterfall View](../assets/screenshots/tui-waterfall-view.svg) +![TraceCraft TUI - Waterfall View](../assets/screenshots/tui-waterfall-view.svg) - **Agent spans** are shown at the top level - **Tool calls** are nested under agents @@ -466,7 +466,7 @@ Navigate the waterfall with `↑`/`↓` to select specific spans. Press `i`, `o` Press `/` to activate the filter bar. Type to filter traces by name or agent: -![Trace Craft TUI - Filter Active](../assets/screenshots/tui-filter-active.svg) +![TraceCraft TUI - Filter Active](../assets/screenshots/tui-filter-active.svg) *Filter bar active with search text typed. The result count updates in real-time.* @@ -476,7 +476,7 @@ Press `Escape` to clear the filter and return to the full trace list. Click the **ERRORS** toggle button or use the filter dropdown to show only traces with errors: -![Trace Craft TUI - Error Traces Only](../assets/screenshots/tui-errors-only.svg) +![TraceCraft TUI - Error Traces Only](../assets/screenshots/tui-errors-only.svg) *Error filter active — shows "9 of 61" matching traces. Error traces are immediately visible for fast debugging.* @@ -488,7 +488,7 @@ This is the fastest way to find and debug failing agent runs. 
Press `i` to view the exact input sent to the selected operation: -![Trace Craft TUI - Input View](../assets/screenshots/tui-input-view.svg) +![TraceCraft TUI - Input View](../assets/screenshots/tui-input-view.svg) - **For LLM spans**: Shows system messages, user messages, and any documents/context - **For tool spans**: Shows the arguments passed to the tool @@ -500,7 +500,7 @@ Press `i` to view the exact input sent to the selected operation: Press `o` to view the output from the selected operation: -![Trace Craft TUI - Output View](../assets/screenshots/tui-output-view.svg) +![TraceCraft TUI - Output View](../assets/screenshots/tui-output-view.svg) - **For LLM spans**: Shows the model's response, with token counts and cost estimates - **For tool spans**: Shows the tool's return value @@ -512,7 +512,7 @@ Press `o` to view the output from the selected operation: Press `a` to see all metadata for the selected span: -![Trace Craft TUI - Attributes View](../assets/screenshots/tui-attributes-view.svg) +![TraceCraft TUI - Attributes View](../assets/screenshots/tui-attributes-view.svg) - Model name and provider - Temperature, max tokens, and other LLM parameters @@ -528,7 +528,7 @@ Press `a` to see all metadata for the selected span: Press `N` to add notes to any trace (requires SQLite storage): -![Trace Craft TUI - Notes Editor](../assets/screenshots/tui-notes-editor.svg) +![TraceCraft TUI - Notes Editor](../assets/screenshots/tui-notes-editor.svg) *Notes editor for the selected trace. Notes persist across TUI sessions in the SQLite database.* @@ -572,7 +572,7 @@ Press `p` on any LLM span to open the **Playground** — an interactive prompt e - See the new response alongside the original - Iterate on prompts without changing your code -The Playground connects to real LLM APIs, so your Trace Craft API key configuration must be set for the chosen model. +The Playground connects to real LLM APIs, so your TraceCraft API key configuration must be set for the chosen model. --- diff --git a/mkdocs.yml b/mkdocs.yml index 77e2051..3971c97 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,4 +1,4 @@ -site_name: Trace Craft Documentation +site_name: TraceCraft Documentation site_url: https://tracecraft.dev site_description: Vendor-neutral LLM observability SDK - instrument once, observe anywhere site_author: Local AI diff --git a/src/tracecraft/cli/main.py b/src/tracecraft/cli/main.py index 462c343..46ed521 100644 --- a/src/tracecraft/cli/main.py +++ b/src/tracecraft/cli/main.py @@ -1,5 +1,5 @@ """ -CLI entry point for Trace Craft commands. +CLI entry point for TraceCraft commands. Provides commands for viewing, validating, and exporting traces. 
""" @@ -27,7 +27,7 @@ # Create the main app app = typer.Typer( name="tracecraft", - help="Trace Craft CLI - View and manage LLM observability traces", + help="TraceCraft CLI - View and manage LLM observability traces", add_completion=False, ) @@ -70,7 +70,7 @@ def main( ), ] = None, ) -> None: - """Trace Craft CLI - View and manage LLM observability traces.""" + """TraceCraft CLI - View and manage LLM observability traces.""" pass @@ -172,8 +172,8 @@ def view( @app.command() def info() -> None: - """Show Trace Craft configuration and info.""" - console.print("[bold]Trace Craft Info[/bold]") + """Show TraceCraft configuration and info.""" + console.print("[bold]TraceCraft Info[/bold]") console.print() table = Table(show_header=False) diff --git a/src/tracecraft/tui/app.py b/src/tracecraft/tui/app.py index f02091b..a5268a0 100644 --- a/src/tracecraft/tui/app.py +++ b/src/tracecraft/tui/app.py @@ -1,5 +1,5 @@ """ -Trace Craft Terminal UI - LangSmith-style trace explorer. +TraceCraft Terminal UI - LangSmith-style trace explorer. A real-time, interactive terminal interface for exploring and debugging LLM/Agent traces with table and waterfall views. @@ -75,7 +75,7 @@ def _get_bindings() -> list[Any]: class TraceCraftApp(App if TEXTUAL_AVAILABLE else object): # type: ignore[misc] """ - Main Trace Craft TUI application. + Main TraceCraft TUI application. Provides a LangSmith-style interface for exploring traces with: - Table view showing all traces on the left @@ -86,7 +86,7 @@ class TraceCraftApp(App if TEXTUAL_AVAILABLE else object): # type: ignore[misc] """ TITLE = "TRACECRAFT" - SUB_TITLE = "Trace Craft" + SUB_TITLE = "TraceCraft" CSS = f""" /* ============================================ diff --git a/src/tracecraft/tui/screens/setup_wizard.py b/src/tracecraft/tui/screens/setup_wizard.py index c65d207..85e1b40 100644 --- a/src/tracecraft/tui/screens/setup_wizard.py +++ b/src/tracecraft/tui/screens/setup_wizard.py @@ -1,7 +1,7 @@ """ -Setup wizard screen for first-time Trace Craft TUI users. +Setup wizard screen for first-time TraceCraft TUI users. -Provides a welcome screen with options to initialize Trace Craft +Provides a welcome screen with options to initialize TraceCraft with a global or local database, or open an existing file. NOIR SIGNAL theme styling. """ @@ -66,7 +66,7 @@ class SetupWizardScreen(Screen if TEXTUAL_AVAILABLE else object): # type: ignor """ Setup wizard screen for first-time users. 
- Displays options for initializing Trace Craft: + Displays options for initializing TraceCraft: - Global database (~/.tracecraft/) - Local database (.tracecraft/ in current directory) - Open existing file From 8ff0177079097540407bd49bd0a1df342ee0c06a Mon Sep 17 00:00:00 2001 From: Mike Halagan Date: Thu, 26 Feb 2026 16:08:21 -0600 Subject: [PATCH 2/2] Removed old documents --- dev/DOCUMENTATION_SITE_PLAN.md | 314 ------ dev/api-reference.md | 385 ------- dev/architecture/DESIGN_REVIEW.md | 806 -------------- dev/architecture/FEATURE_EVALUATION.md | 995 ------------------ dev/architecture/PAIN_POINTS.md | 235 ----- dev/research/GEMINI-RESEARCH.md | 289 ----- dev/research/OpenAI-research1.md | 73 -- dev/research/OpenAI-research2.md | 38 - dev/research/combined-research.md | 262 ----- dev/research/gemini-focused.md | 404 ------- ...m-agent-observability-market-validation.md | 440 -------- dev/research/openai-research-focused.md | 209 ---- 12 files changed, 4450 deletions(-) delete mode 100644 dev/DOCUMENTATION_SITE_PLAN.md delete mode 100644 dev/api-reference.md delete mode 100644 dev/architecture/DESIGN_REVIEW.md delete mode 100644 dev/architecture/FEATURE_EVALUATION.md delete mode 100644 dev/architecture/PAIN_POINTS.md delete mode 100644 dev/research/GEMINI-RESEARCH.md delete mode 100644 dev/research/OpenAI-research1.md delete mode 100644 dev/research/OpenAI-research2.md delete mode 100644 dev/research/combined-research.md delete mode 100644 dev/research/gemini-focused.md delete mode 100644 dev/research/llm-agent-observability-market-validation.md delete mode 100644 dev/research/openai-research-focused.md diff --git a/dev/DOCUMENTATION_SITE_PLAN.md b/dev/DOCUMENTATION_SITE_PLAN.md deleted file mode 100644 index 486aaf7..0000000 --- a/dev/DOCUMENTATION_SITE_PLAN.md +++ /dev/null @@ -1,314 +0,0 @@ -# Documentation Site Plan for TraceCraft - -## Current State Assessment - -### Documentation Quality by Category - -| Aspect | Quality | Notes | -|--------|---------|-------| -| Type hints | ★★★★★ | Comprehensive, strict mypy mode | -| Docstrings | ★★★★☆ | Google-style, consistent across modules | -| Examples | ★★★★☆ | 13+ organized examples with clear structure | -| API reference site | ☆☆☆☆☆ | None exists | -| User guides | ★☆☆☆☆ | Only README and examples | - -### Current Documentation Assets - -- **README.md**: High-level project overview, quick start, core features -- **examples/README.md**: Comprehensive learning path, dependency guides, troubleshooting -- **docs/**: Contains research files and deployment guides (not user-facing) -- **Code docstrings**: Google-style throughout, excellent coverage - -### Critical Gaps - -1. No API reference website -2. No architectural decision records (ADRs) -3. No troubleshooting/FAQ beyond examples -4. No deployment/configuration guides -5. No performance tuning documentation -6. No contribution guidelines -7. No migration guides for version upgrades -8. No integration guides for backends (Langfuse, Datadog, etc.) 
- ---- - -## Recommended Approach: MkDocs + Material Theme - -### Why MkDocs over Sphinx - -| Factor | MkDocs | Sphinx | -|--------|--------|--------| -| Markup language | Markdown (matches existing docs) | reStructuredText (would require conversion) | -| Setup complexity | Single YAML file | Multiple config files | -| Live preview | Built-in dev server | Requires rebuild each change | -| Theme quality | Material theme is modern out-of-box | Requires customization | -| Docstring extraction | mkdocstrings plugin | sphinx-autodoc (native) | -| Learning curve | Low | Medium-High | - -### When to Consider Sphinx Instead - -- Need PDF/ePub output for offline documentation -- Need intersphinx linking to Python stdlib, OpenTelemetry docs -- Require complex cross-referencing between API elements - ---- - -## Required Plugins - -```yaml -plugins: - - search # Built-in search functionality - - mkdocstrings: # Auto-generate API docs from docstrings - handlers: - python: - options: - docstring_style: google - show_source: true - show_root_heading: true - members_order: source - - autorefs # Cross-referencing between pages - - gen-files # Generate pages programmatically - - literate-nav # Navigation from markdown - - mike # Documentation versioning -``` - ---- - -## Proposed Documentation Structure - -``` -docs/ -├── index.md # Landing page (adapted from README) -├── getting-started/ -│ ├── installation.md # pip/uv install, requirements -│ ├── quickstart.md # 5-minute hello world -│ └── configuration.md # Environment variables, config options -├── user-guide/ -│ ├── decorators.md # @trace_agent, @trace_tool, @trace_llm, etc. -│ ├── exporters.md # Console, JSONL, HTML, OTLP -│ ├── processors.md # Enrichment, redaction, sampling -│ ├── adapters.md # LangChain, LlamaIndex, PydanticAI -│ └── storage.md # JSONL, SQLite, MLflow backends -├── tutorials/ -│ ├── basic-tracing.md # Link to examples/01-getting-started -│ ├── framework-integration.md # Link to examples/02-frameworks -│ ├── production-setup.md # Link to examples/04-production -│ └── evaluation.md # Link to examples/06-evaluation -├── reference/ -│ ├── api/ # Auto-generated from docstrings -│ │ ├── core.md # tracecraft.core module -│ │ ├── instrumentation.md # tracecraft.instrumentation module -│ │ ├── exporters.md # tracecraft.exporters module -│ │ ├── processors.md # tracecraft.processors module -│ │ ├── adapters.md # tracecraft.adapters module -│ │ ├── schema.md # tracecraft.schema module -│ │ └── storage.md # tracecraft.storage module -│ ├── cli.md # Command-line interface reference -│ └── configuration.md # All config options with defaults -├── integrations/ -│ ├── langfuse.md # Langfuse backend setup -│ ├── datadog.md # Datadog APM integration -│ ├── phoenix.md # Arize Phoenix setup -│ └── grafana-tempo.md # Tempo + Grafana setup -├── deployment/ -│ ├── production.md # Production best practices -│ ├── kubernetes.md # K8s deployment patterns -│ └── serverless.md # Lambda/Cloud Functions -├── architecture/ -│ ├── overview.md # System design -│ ├── schema-design.md # Dual-dialect schema explanation -│ └── decisions.md # ADRs -└── contributing.md # Development setup, guidelines -``` - ---- - -## Implementation Steps - -### Phase 1: Basic Setup - -1. Install dependencies: - - ```bash - uv add --dev mkdocs-material mkdocstrings[python] mkdocs-autorefs mike - ``` - -2. Create `mkdocs.yml` configuration - -3. Create initial `docs/index.md` from README - -4. Set up API reference auto-generation - -### Phase 2: Content Migration - -1. 
Migrate existing docs/ content to proper structure -2. Create getting-started guides from examples -3. Write user guide pages for each major feature -4. Add integration guides for each backend - -### Phase 3: CI/CD Integration - -1. Add GitHub Actions workflow: - - ```yaml - # .github/workflows/docs.yml - name: Documentation - on: - push: - branches: [main] - pull_request: - - jobs: - build: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - uses: astral-sh/setup-uv@v4 - - run: uv sync --dev - - run: uv run mkdocs build --strict - - deploy: - if: github.ref == 'refs/heads/main' - needs: build - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - uses: astral-sh/setup-uv@v4 - - run: uv sync --dev - - run: uv run mkdocs gh-deploy --force - ``` - -2. Configure GitHub Pages for hosting - -### Phase 4: Versioning - -1. Set up mike for version management -2. Configure version switcher in theme -3. Document release process for docs - ---- - -## Example mkdocs.yml Configuration - -```yaml -site_name: TraceCraft -site_url: https://tracecraft.dev -site_description: Vendor-neutral LLM observability SDK - -repo_name: tracecraft/tracecraft -repo_url: https://github.com/tracecraft/tracecraft - -theme: - name: material - palette: - - scheme: default - primary: indigo - accent: indigo - toggle: - icon: material/brightness-7 - name: Switch to dark mode - - scheme: slate - primary: indigo - accent: indigo - toggle: - icon: material/brightness-4 - name: Switch to light mode - features: - - navigation.tabs - - navigation.sections - - navigation.expand - - navigation.top - - search.suggest - - search.highlight - - content.code.copy - - content.code.annotate - -plugins: - - search - - mkdocstrings: - handlers: - python: - options: - docstring_style: google - show_source: true - show_root_heading: true - members_order: source - separate_signature: true - show_signature_annotations: true - - autorefs - -markdown_extensions: - - pymdownx.highlight: - anchor_linenums: true - - pymdownx.superfences - - pymdownx.tabbed: - alternate_style: true - - admonition - - pymdownx.details - - attr_list - - md_in_html - - toc: - permalink: true - -nav: - - Home: index.md - - Getting Started: - - Installation: getting-started/installation.md - - Quick Start: getting-started/quickstart.md - - Configuration: getting-started/configuration.md - - User Guide: - - Decorators: user-guide/decorators.md - - Exporters: user-guide/exporters.md - - Processors: user-guide/processors.md - - Adapters: user-guide/adapters.md - - API Reference: - - Core: reference/api/core.md - - Instrumentation: reference/api/instrumentation.md - - Exporters: reference/api/exporters.md - - Integrations: - - Langfuse: integrations/langfuse.md - - Datadog: integrations/datadog.md - - Contributing: contributing.md -``` - ---- - -## API Reference Page Example - -For auto-generating API docs from docstrings: - -```markdown -# Core Module - -The core module contains the fundamental data models and runtime for TraceCraft. 
- -## Models - -::: tracecraft.core.models - options: - show_root_heading: false - members_order: source - -## Runtime - -::: tracecraft.core.runtime - options: - show_root_heading: false - members_order: source - -## Configuration - -::: tracecraft.core.config - options: - show_root_heading: false - members_order: source -``` - ---- - -## Resources - -- [MkDocs Documentation](https://www.mkdocs.org/) -- [Material for MkDocs](https://squidfunk.github.io/mkdocs-material/) -- [mkdocstrings](https://mkdocstrings.github.io/) -- [mike - MkDocs Versioning](https://github.com/jimporter/mike) diff --git a/dev/api-reference.md b/dev/api-reference.md deleted file mode 100644 index 1d701e4..0000000 --- a/dev/api-reference.md +++ /dev/null @@ -1,385 +0,0 @@ -# API Reference - -Complete API reference for TraceCraft. - -## Module: tracecraft - -Main entry point with convenience functions and re-exports. - -### Functions - -#### `init(**kwargs) -> TALRuntime` - -Initialize the global TraceCraft runtime. - -```python -import tracecraft - -# Simple initialization -tracecraft.init() - -# With configuration -tracecraft.init( - service_name="my-service", - console_enabled=True, - jsonl_enabled=True, -) -``` - -#### `get_runtime() -> TALRuntime | None` - -Get the global runtime instance. - -```python -runtime = tracecraft.get_runtime() -if runtime: - print(f"Service: {runtime.config.service_name}") -``` - -### Decorators - -#### `@trace_agent(name=None, exclude_inputs=None, capture_inputs=True, runtime=None)` - -Trace an agent function execution. - -**Parameters:** - -- `name: str | None` - Step name (defaults to function name) -- `exclude_inputs: list[str] | None` - Parameter names to exclude from capture -- `capture_inputs: bool` - If False, no inputs are captured -- `runtime: TALRuntime | None` - Explicit runtime for multi-tenant scenarios - -**Example:** - -```python -@tracecraft.trace_agent(name="research_agent") -async def research(query: str) -> str: - return await process(query) - -# Exclude sensitive parameters -@tracecraft.trace_agent(exclude_inputs=["api_key"]) -def auth_agent(user: str, api_key: str) -> bool: - return authenticate(user, api_key) -``` - -#### `@trace_tool(name=None, exclude_inputs=None, capture_inputs=True, runtime=None)` - -Trace a tool function execution. - -**Parameters:** Same as `@trace_agent` - -**Example:** - -```python -@tracecraft.trace_tool(name="web_search") -def search(query: str) -> list[str]: - return fetch_results(query) -``` - -#### `@trace_llm(name=None, model=None, provider=None, exclude_inputs=None, capture_inputs=True, runtime=None)` - -Trace an LLM function call with model metadata. - -**Parameters:** - -- `name: str | None` - Step name (defaults to function name) -- `model: str | None` - Model name (e.g., "gpt-4", "claude-3-opus") -- `provider: str | None` - Model provider (e.g., "openai", "anthropic") -- `exclude_inputs: list[str] | None` - Parameter names to exclude -- `capture_inputs: bool` - If False, no inputs are captured -- `runtime: TALRuntime | None` - Explicit runtime - -**Example:** - -```python -@tracecraft.trace_llm(model="gpt-4", provider="openai") -def call_llm(prompt: str) -> str: - return client.chat.completions.create(...) -``` - -#### `@trace_retrieval(name=None, exclude_inputs=None, capture_inputs=True, runtime=None)` - -Trace a retrieval/search function. 
- -**Parameters:** Same as `@trace_agent` - -**Example:** - -```python -@tracecraft.trace_retrieval(name="vector_search") -def search_docs(query: str) -> list[Document]: - return vector_store.search(query) -``` - -#### `@trace_llm_stream(name=None, model=None, provider=None, exclude_inputs=None, capture_inputs=True, runtime=None)` - -Trace streaming LLM calls that yield tokens. - -**Example:** - -```python -@tracecraft.trace_llm_stream(model="gpt-4o", provider="openai") -async def stream_chat(prompt: str) -> AsyncGenerator[str, None]: - async for chunk in client.chat.completions.create(..., stream=True): - if chunk.choices[0].delta.content: - yield chunk.choices[0].delta.content -``` - -### Context Managers - -#### `step(name, type=StepType.WORKFLOW) -> Generator[Step, None, None]` - -Create a traced step using a context manager. - -**Parameters:** - -- `name: str` - Step name -- `type: StepType` - Step type (default: WORKFLOW) - -**Example:** - -```python -from tracecraft import step -from tracecraft.core.models import StepType - -with step("data_processing", type=StepType.WORKFLOW) as s: - result = process_data() - s.attributes["count"] = 100 - s.outputs["result"] = result -``` - ---- - -## Module: tracecraft.core.models - -Data models for traces and steps. - -### Classes - -#### `AgentRun` - -Represents a complete agent execution trace. - -```python -@dataclass -class AgentRun: - id: UUID - name: str - start_time: datetime - end_time: datetime | None - status: str # "running", "completed", "error" - steps: list[Step] - inputs: dict[str, Any] - outputs: dict[str, Any] - metadata: dict[str, Any] - tags: list[str] - error: str | None -``` - -#### `Step` - -Represents a single step within a trace. - -```python -@dataclass -class Step: - id: UUID - trace_id: UUID - parent_id: UUID | None - type: StepType - name: str - start_time: datetime - end_time: datetime | None - duration_ms: float | None - inputs: dict[str, Any] - outputs: dict[str, Any] - attributes: dict[str, Any] - children: list[Step] - error: str | None - error_type: str | None - model_name: str | None - model_provider: str | None -``` - -#### `StepType` - -Enumeration of step types. - -```python -class StepType(str, Enum): - AGENT = "agent" - TOOL = "tool" - LLM = "llm" - RETRIEVAL = "retrieval" - WORKFLOW = "workflow" - EMBEDDING = "embedding" - RERANK = "rerank" - GUARDRAIL = "guardrail" -``` - ---- - -## Module: tracecraft.core.runtime - -Runtime management for TraceCraft. - -### Classes - -#### `TALRuntime` (alias: `TraceCraftRuntime`) - -The main runtime class that manages tracing infrastructure. - -**Constructor:** - -```python -TALRuntime(config: TraceCraftConfig | None = None) -``` - -**Methods:** - -- `start_run(name, inputs=None, tags=None) -> AgentRun` - Start a new trace -- `end_run(run, outputs=None, error=None)` - End a trace -- `trace_context() -> ContextManager` - Context manager for scoped runtime selection -- `shutdown()` - Clean up resources - -**Example:** - -```python -from tracecraft import TraceCraftRuntime, TraceCraftConfig - -runtime = TraceCraftRuntime( - config=TraceCraftConfig(service_name="my-service") -) - -# Use as context manager -with runtime.trace_context(): - run = runtime.start_run("my_agent", inputs={"query": "hello"}) - # ... do work ... - runtime.end_run(run, outputs={"result": "world"}) -``` - ---- - -## Module: tracecraft.core.config - -Configuration classes. - -### Classes - -See [Configuration Reference](../user/configuration.md) for detailed documentation. 
- -- `TraceCraftConfig` - Main configuration -- `RedactionConfig` - PII redaction settings -- `SamplingConfig` - Trace sampling settings -- `ExporterConfig` - Exporter settings -- `AzureFoundryConfig` - Azure AI Foundry settings -- `AWSAgentCoreConfig` - AWS AgentCore settings -- `GCPVertexAgentConfig` - GCP Vertex Agent settings - -### Functions - -#### `load_config_from_env() -> TraceCraftConfig` - -Load configuration from environment variables. - -#### `load_config(**kwargs) -> TraceCraftConfig` - -Load configuration from environment variables with overrides. - ---- - -## Module: tracecraft.adapters - -Framework adapters for automatic instrumentation. - -### LangChain Adapter - -```python -from tracecraft.adapters.langchain import TraceCraftCallbackHandler - -handler = TraceCraftCallbackHandler() -chain.invoke(input, config={"callbacks": [handler]}) -``` - -### LlamaIndex Adapter - -```python -from tracecraft.adapters.llamaindex import TraceCraftSpanHandler - -import llama_index.core -llama_index.core.global_handler = TraceCraftSpanHandler() -``` - -### Claude SDK Adapter - -```python -from tracecraft import ClaudeTraceCraftr - -traced_agent = ClaudeTraceCraftr(agent) -result = await traced_agent.run(prompt) -``` - -### Pydantic AI Adapter - -```python -from tracecraft.adapters.pydantic_ai import TraceCraftInstrumentor - -instrumentor = TraceCraftInstrumentor() -instrumentor.instrument() -``` - ---- - -## Module: tracecraft.exporters - -Trace exporters for different backends. - -### Available Exporters - -- `ConsoleExporter` - Human-readable console output -- `JSONLExporter` - JSON Lines file format -- `OTLPExporter` - OpenTelemetry Protocol -- `MLflowExporter` - MLflow tracking -- `HTMLExporter` - HTML report generation - -### Example: Custom Exporter - -```python -from tracecraft.exporters.base import BaseExporter -from tracecraft.core.models import AgentRun - -class CustomExporter(BaseExporter): - def export(self, run: AgentRun) -> None: - # Custom export logic - pass - - def shutdown(self) -> None: - # Cleanup - pass -``` - ---- - -## Module: tracecraft.processors - -Trace processors for data transformation. - -### Available Processors - -- `RedactionProcessor` - PII redaction -- `SamplingProcessor` - Trace sampling -- `EnrichmentProcessor` - Metadata enrichment - -### Configuration - -Processor order can be configured via `ProcessorOrder`: - -```python -from tracecraft.core.config import TraceCraftConfig, ProcessorOrder - -config = TraceCraftConfig( - processor_order=ProcessorOrder.SAFETY # Enrich -> Redact -> Sample - # or - processor_order=ProcessorOrder.EFFICIENCY # Sample -> Redact -> Enrich -) -``` diff --git a/dev/architecture/DESIGN_REVIEW.md b/dev/architecture/DESIGN_REVIEW.md deleted file mode 100644 index 807f0f9..0000000 --- a/dev/architecture/DESIGN_REVIEW.md +++ /dev/null @@ -1,806 +0,0 @@ -# Design Review: Addressing Opinionated Decisions - -This document analyzes the 10 most opinionated aspects of TraceCraft and provides recommendations on whether redesign is warranted, along with specific design proposals where applicable. - ---- - -## Decision Framework - -For each concern, I evaluate: - -- **Legitimacy**: Is this a real problem or a theoretical concern? -- **Impact**: How many users would be affected? -- **Redesign Cost**: How much effort to fix? -- **Breaking Change**: Would this break existing users? -- **Verdict**: Keep, Modify, or Redesign - ---- - -## 1. 
Global Singleton Runtime - -### Current Design - -```python -# Only way to use TraceCraft -tracecraft.init(config=config) -runtime = get_runtime() # Returns global singleton -``` - -### Assessment - -| Criterion | Evaluation | -|-----------|------------| -| Legitimacy | **High** - Multi-tenant SaaS, testing, and modular codebases genuinely need isolated instances | -| Impact | **Medium** - Affects platform teams, not solo developers | -| Redesign Cost | **Medium** - Requires adding instance-based API alongside global | -| Breaking Change | **No** - Additive change | - -### Verdict: **REDESIGN** - Add Instance-Based API - -### Proposed Design - -```python -# KEEP: Global convenience API (unchanged) -import tracecraft -tracecraft.init() - -@trace_agent(name="my_agent") -def my_agent(): ... - -# ADD: Explicit instance API for advanced users -from tracecraft import TraceCraftRuntime - -# Create isolated runtime instances -runtime_tenant_a = TraceCraftRuntime( - config=ConfigA(), - exporters=[ExporterA()] -) -runtime_tenant_b = TraceCraftRuntime( - config=ConfigB(), - exporters=[ExporterB()] -) - -# Use with explicit context -with runtime_tenant_a.trace_context() as ctx: - # All traces in this block go to runtime_tenant_a - result = my_agent() - -# Or via dependency injection -@trace_agent(name="my_agent", runtime=runtime_tenant_a) -def my_agent(): ... -``` - -### Implementation Notes - -- Global `init()` creates a default runtime stored in module state -- `TraceCraftRuntime` is a first-class, instantiable class -- Decorators accept optional `runtime` parameter; default to global -- Context managers allow scoped runtime selection -- Enables testing with isolated runtimes per test - ---- - -## 2. Fixed Processor Pipeline Order - -### Current Design - -```python -# Hardcoded in runtime.py -processors = [ - TokenEnrichmentProcessor(), # Always first - RedactionProcessor(), # Always second - SamplingProcessor(), # Always third -] -# No user control over ordering -``` - -### Assessment - -| Criterion | Evaluation | -|-----------|------------| -| Legitimacy | **Medium** - Some users want sampling before redaction for efficiency | -| Impact | **Low** - Most users don't think about processor order | -| Redesign Cost | **Low** - Simple configuration change | -| Breaking Change | **No** - Default behavior unchanged | - -### Verdict: **MODIFY** - Make Order Configurable - -### Proposed Design - -```python -from tracecraft import TraceCraftConfig, ProcessorOrder - -config = TraceCraftConfig() - -# Option 1: Predefined strategies -config.processor_order = ProcessorOrder.EFFICIENCY # Sample → Redact → Enrich -config.processor_order = ProcessorOrder.SAFETY # Enrich → Redact → Sample (default) - -# Option 2: Explicit ordering -config.processor_order = [ - "sampling", # Sample first (skip work on dropped traces) - "redaction", # Redact survivors - "enrichment", # Enrich last -] - -# Option 3: Insert custom processors -config.processors = [ - SamplingProcessor(rate=0.1), - CustomAuditProcessor(), # User's custom processor - RedactionProcessor(), - TokenEnrichmentProcessor(), -] -``` - -### Implementation Notes - -- Default remains current order (safety-first) -- Predefined strategies cover common patterns -- Advanced users can specify exact order -- Custom processors can be inserted at any position -- Validate that required processors are present (warn if redaction missing) - ---- - -## 3. 
Automatic Input Capture via Reflection - -### Current Design - -```python -@trace_agent(name="my_agent") -def my_agent(query: str, api_key: str, config: dict): - # ALL parameters captured automatically - # api_key appears in traces before redaction sees it - ... -``` - -### Assessment - -| Criterion | Evaluation | -|-----------|------------| -| Legitimacy | **High** - Security risk; sensitive params captured before redaction | -| Impact | **High** - Affects anyone with sensitive function parameters | -| Redesign Cost | **Low** - Add exclude parameter to decorators | -| Breaking Change | **No** - Additive; default behavior unchanged | - -### Verdict: **REDESIGN** - Add Input Exclusion - -### Proposed Design - -```python -# Option 1: Exclude specific parameters -@trace_agent( - name="my_agent", - exclude_inputs=["api_key", "credentials", "token"] -) -def my_agent(query: str, api_key: str, credentials: dict): - ... -# Traces show: {"query": "...", "api_key": "[EXCLUDED]", "credentials": "[EXCLUDED]"} - -# Option 2: Include only specific parameters (allowlist) -@trace_agent( - name="my_agent", - include_inputs=["query", "user_id"] # Only these captured -) -def my_agent(query: str, api_key: str, user_id: str): - ... - -# Option 3: Parameter-level annotation (more Pythonic) -from tracecraft import Sensitive - -@trace_agent(name="my_agent") -def my_agent( - query: str, - api_key: Annotated[str, Sensitive()], # Never captured - config: dict -): - ... - -# Option 4: Capture nothing by default, explicit opt-in -@trace_agent(name="my_agent", capture_inputs=False) -def my_agent(query: str, api_key: str): - ... -``` - -### Recommended Approach - -Implement **Option 1** (exclude_inputs) as it's: - -- Backward compatible (default excludes nothing) -- Simple to understand -- Handles the common case (exclude a few sensitive params) - -Also implement **Option 4** (capture_inputs=False) for maximum control. - -### Implementation Notes - -- Add `exclude_inputs: list[str] = []` to all trace decorators -- Add `capture_inputs: bool = True` for complete opt-out -- Excluded params show as `"[EXCLUDED]"` not omitted (preserves arity visibility) -- Consider adding common exclusions to config: `config.default_excluded_inputs = ["api_key", "token", "password"]` - ---- - -## 4. 
Dual-Schema Dialect as Default - -### Current Design - -```python -# Default in schema/canonical.py -class SchemaDialect(Enum): - OTEL_GENAI = "otel_genai" - OPENINFERENCE = "openinference" - BOTH = "both" # DEFAULT - generates both attribute sets -``` - -### Assessment - -| Criterion | Evaluation | -|-----------|------------| -| Legitimacy | **Medium** - Payload bloat is real but rarely critical | -| Impact | **Low** - Most users don't notice attribute duplication | -| Redesign Cost | **Trivial** - Change default enum value | -| Breaking Change | **Minor** - Users relying on both dialects would need to opt-in | - -### Verdict: **MODIFY** - Change Default to Single Dialect - -### Proposed Design - -```python -# New default: OTel GenAI (industry standard) -DEFAULT_SCHEMA_DIALECT = SchemaDialect.OTEL_GENAI - -# Configuration -config = TraceCraftConfig() -config.schema_dialect = SchemaDialect.OTEL_GENAI # Default -config.schema_dialect = SchemaDialect.OPENINFERENCE # For Arize/Phoenix users -config.schema_dialect = SchemaDialect.BOTH # Explicit opt-in for compatibility - -# Auto-detection based on exporter (nice-to-have) -# If exporting to Phoenix → auto-select OpenInference -# If exporting to generic OTLP → auto-select OTel GenAI -``` - -### Implementation Notes - -- Change default from `BOTH` to `OTEL_GENAI` -- Document dialect selection guidance: - - Use `OTEL_GENAI` for: Jaeger, Tempo, Datadog, generic OTLP - - Use `OPENINFERENCE` for: Arize Phoenix, OpenInference-native tools - - Use `BOTH` for: Migration periods, multi-backend setups -- Consider auto-detection based on configured exporters - ---- - -## 5. Console + JSONL Exporters Enabled by Default - -### Current Design - -```python -# In env_config.py -class EnvironmentSettings: - console_enabled: bool = True # Always on - jsonl_enabled: bool = True # Always on -``` - -### Assessment - -| Criterion | Evaluation | -|-----------|------------| -| Legitimacy | **High** - Breaks in read-only filesystems, noisy in production | -| Impact | **Medium** - Affects serverless/container deployments | -| Redesign Cost | **Low** - Environment-aware defaults | -| Breaking Change | **Minor** - Production users get quieter defaults | - -### Verdict: **REDESIGN** - Environment-Aware Defaults - -### Proposed Design - -```python -# Smart defaults based on detected environment -def get_default_exporters() -> dict: - env = detect_environment() - - if env == "development" or env == "test": - return { - "console_enabled": True, - "jsonl_enabled": True, - } - elif env in ("staging", "production"): - return { - "console_enabled": False, # Don't pollute logs - "jsonl_enabled": False, # Don't write to filesystem - } - else: - # Unknown environment: safe defaults - return { - "console_enabled": False, - "jsonl_enabled": False, - } - -def detect_environment() -> str: - """Detect environment from various signals.""" - # 1. Explicit configuration - if os.getenv("TRACECRAFT_ENVIRONMENT"): - return os.getenv("TRACECRAFT_ENVIRONMENT") - - # 2. Common cloud indicators - if os.getenv("AWS_LAMBDA_FUNCTION_NAME"): - return "production" - if os.getenv("KUBERNETES_SERVICE_HOST"): - return "production" - if os.getenv("CLOUD_RUN_JOB"): - return "production" - - # 3. CI indicators - if os.getenv("CI") or os.getenv("GITHUB_ACTIONS"): - return "test" - - # 4. 
Default to development (local machine) - return "development" -``` - -### Alternative: Explicit Mode Selection - -```python -import tracecraft - -# Quick local development (current behavior) -tracecraft.init(mode="local") # Console + JSONL enabled - -# Production mode -tracecraft.init(mode="production") # Only configured exporters, no defaults - -# Explicit (always works) -tracecraft.init( - console=False, - jsonl=False, - exporters=[OTLPExporter(...)] -) -``` - -### Implementation Notes - -- Add environment detection logic -- Default to quiet mode in detected production environments -- Keep `mode="local"` for explicit local debugging -- Document the detection logic so users understand behavior -- Always allow explicit override via config - ---- - -## 6. Redaction Disabled by Default - -### Current Design - -```python -# In config.py -class RedactionConfig: - enabled: bool = False # Off by default -``` - -### Assessment - -| Criterion | Evaluation | -|-----------|------------| -| Legitimacy | **High** - Contradicts "governance built-in" promise; privacy risk | -| Impact | **High** - Users may leak PII without realizing | -| Redesign Cost | **Low** - Change default boolean | -| Breaking Change | **Medium** - Users relying on seeing full data would need to opt-out | - -### Verdict: **REDESIGN** - Enable by Default with Development Bypass - -### Proposed Design - -```python -# New defaults -class RedactionConfig: - enabled: bool = True # ON by default (privacy-first) - mode: RedactionMode = RedactionMode.MASK # "[REDACTED]" replacement - -# Development override for debugging -tracecraft.init( - mode="development", # Implies redaction disabled for debugging -) - -# Or explicit -config = TraceCraftConfig() -config.redaction.enabled = False # Explicit opt-out - -# Production should require explicit disable -if environment == "production" and not config.redaction.enabled: - warnings.warn( - "PII redaction is disabled in production. " - "Set TRACECRAFT_REDACTION_ENABLED=true or acknowledge with " - "TRACECRAFT_ALLOW_UNREDACTED_PRODUCTION=true" - ) -``` - -### Implementation Notes - -- Change default to `enabled=True` -- Add `mode="development"` that disables redaction + enables console/JSONL -- In production, warn if redaction explicitly disabled -- Document migration path for existing users -- Consider a `TRACECRAFT_UNSAFE_DISABLE_REDACTION=true` env var for explicit acknowledgment - ---- - -## 7. 
Strict StepType Enum - -### Current Design - -```python -class StepType(str, Enum): - AGENT = "agent" - LLM = "llm" - TOOL = "tool" - RETRIEVAL = "retrieval" - MEMORY = "memory" - GUARDRAIL = "guardrail" - EVALUATION = "evaluation" - WORKFLOW = "workflow" - ERROR = "error" - # No extensibility -``` - -### Assessment - -| Criterion | Evaluation | -|-----------|------------| -| Legitimacy | **Low-Medium** - Edge case; most patterns fit existing types | -| Impact | **Low** - Advanced users only | -| Redesign Cost | **Medium** - Requires schema changes | -| Breaking Change | **No** - Additive | - -### Verdict: **MINOR MODIFY** - Add CUSTOM Type with Subtype Field - -### Proposed Design - -```python -class StepType(str, Enum): - AGENT = "agent" - LLM = "llm" - TOOL = "tool" - RETRIEVAL = "retrieval" - MEMORY = "memory" - GUARDRAIL = "guardrail" - EVALUATION = "evaluation" - WORKFLOW = "workflow" - ERROR = "error" - CUSTOM = "custom" # NEW: escape hatch - -class Step(BaseModel): - step_type: StepType - custom_type: str | None = None # NEW: subtype for CUSTOM - - @model_validator(mode="after") - def validate_custom_type(self): - if self.step_type == StepType.CUSTOM and not self.custom_type: - raise ValueError("custom_type required when step_type is CUSTOM") - return self - -# Usage -@trace_step(step_type=StepType.CUSTOM, custom_type="planner") -def planning_phase(): ... - -@trace_step(step_type=StepType.CUSTOM, custom_type="router") -def route_request(): ... -``` - -### Alternative: Keep as-is with Documentation - -The current enum covers most use cases. Could also just document: - -- `AGENT` = any autonomous decision-making component -- `WORKFLOW` = orchestration, routing, planning -- `TOOL` = any function/action execution - -### Recommendation - -Add `CUSTOM` type for escape hatch, but don't over-engineer. Most users can map their concepts to existing types. - ---- - -## 8. Thread-Local Context Model - -### Current Design - -```python -# In decorators.py -_current_run: ContextVar[AgentRun | None] = ContextVar("current_run", default=None) -_current_step: ContextVar[Step | None] = ContextVar("current_step", default=None) -_pending_parents: dict[str, str] = {} # Global dict for parent tracking -MAX_STEP_DEPTH = 100 # Arbitrary limit -``` - -### Assessment - -| Criterion | Evaluation | -|-----------|------------| -| Legitimacy | **High** - Async context loss is a real problem | -| Impact | **Medium** - Affects async-heavy applications | -| Redesign Cost | **High** - Fundamental architecture change | -| Breaking Change | **Potentially** - Behavior changes in async code | - -### Verdict: **MODIFY** - Improve Async Support, Don't Rewrite - -### Proposed Design - -Full redesign is too costly. Instead: - -```python -# 1. Document limitations clearly -""" -Note: TraceCraft uses Python ContextVars for trace context propagation. -In async code, context is automatically propagated in most cases, but -may be lost when using: -- asyncio.create_task() without context copying -- run_in_executor() without explicit context -- Third-party async libraries that don't propagate context - -Use the provided helpers for these cases. -""" - -# 2. Provide async-aware helpers (already partially exist) -from tracecraft.contrib.async_helpers import ( - create_task_with_context, - gather_with_context, - run_in_executor_with_context, -) - -# Instead of: -task = asyncio.create_task(my_coroutine()) - -# Use: -task = create_task_with_context(my_coroutine()) - -# 3. 
Add context snapshot/restore for manual cases -from tracecraft import capture_context, restore_context - -ctx = capture_context() # Snapshot current trace context - -async def worker(): - with restore_context(ctx): # Restore in new async context - await do_work() - -# 4. Remove arbitrary MAX_STEP_DEPTH or make configurable -config.max_step_depth = 100 # Default -config.max_step_depth = None # Unlimited (warn about memory) -``` - -### Implementation Notes - -- Don't attempt full async rewrite (too risky, too costly) -- Document known limitations prominently -- Provide helpers that "just work" for common async patterns -- Make `MAX_STEP_DEPTH` configurable with warning for unlimited -- Consider adding async-specific decorators: `@trace_agent_async` - ---- - -## 9. Deep Copy in Redaction Processor - -### Current Design - -```python -# In processors/base.py -def process(self, run: AgentRun) -> AgentRun: - run_copy = run.model_copy(deep=True) # Always deep copy - # ... redact run_copy ... - return run_copy -``` - -### Assessment - -| Criterion | Evaluation | -|-----------|------------| -| Legitimacy | **Medium** - Real memory concern for large traces | -| Impact | **Low** - Only affects high-throughput systems with large traces | -| Redesign Cost | **Medium** - Requires careful mutation handling | -| Breaking Change | **No** - Optimization | - -### Verdict: **MODIFY** - Lazy/Conditional Copying - -### Proposed Design - -```python -class RedactionProcessor(BaseProcessor): - def __init__(self, config: RedactionConfig): - self.config = config - - def process(self, run: AgentRun) -> AgentRun: - # Skip entirely if disabled - if not self.config.enabled: - return run # No copy needed - - # Check if redaction is actually needed - if not self._needs_redaction(run): - return run # No copy needed - - # Only copy if we're actually going to modify - run_copy = run.model_copy(deep=True) - self._redact(run_copy) - return run_copy - - def _needs_redaction(self, run: AgentRun) -> bool: - """Quick scan to check if any redaction patterns match.""" - # Fast string scan without copying - content = self._extract_redactable_content(run) - return any( - pattern.search(content) - for pattern in self.patterns - ) -``` - -### Alternative: Copy-on-Write Wrapper - -```python -class LazyRedactedRun: - """Wrapper that only copies when mutation is needed.""" - def __init__(self, run: AgentRun): - self._original = run - self._copy = None - - def get_mutable(self) -> AgentRun: - if self._copy is None: - self._copy = self._original.model_copy(deep=True) - return self._copy - - def get_result(self) -> AgentRun: - return self._copy if self._copy else self._original -``` - -### Implementation Notes - -- Add fast pre-scan to check if redaction needed -- Skip copy entirely when redaction disabled -- Consider copy-on-write for advanced optimization -- Benchmark to ensure optimization is worthwhile - ---- - -## 10. 
Environment Name Validation - -### Current Design - -```python -# In env_config.py -class EnvironmentType(str, Enum): - DEVELOPMENT = "development" - STAGING = "staging" - PRODUCTION = "production" - TEST = "test" - # Only these 4 allowed -``` - -### Assessment - -| Criterion | Evaluation | -|-----------|------------| -| Legitimacy | **High** - Arbitrary restriction with no benefit | -| Impact | **Low** - Easy workaround (use "staging" for everything) | -| Redesign Cost | **Trivial** - Remove enum restriction | -| Breaking Change | **No** - Additive | - -### Verdict: **REDESIGN** - Allow Arbitrary Strings - -### Proposed Design - -```python -# Remove enum, use string with suggestions -class EnvironmentSettings(BaseSettings): - environment: str = "development" - - @field_validator("environment") - @classmethod - def validate_environment(cls, v: str) -> str: - known_environments = { - "development", "staging", "production", "test", - "local", "ci", "qa", "integration", "canary", - "preview", "sandbox" - } - if v.lower() not in known_environments: - # Warn but allow - warnings.warn( - f"Unknown environment '{v}'. Known environments: {known_environments}. " - "Custom environments are allowed but may not have optimized defaults." - ) - return v.lower() - -# Usage -TRACECRAFT_ENVIRONMENT=canary # Works! -TRACECRAFT_ENVIRONMENT=my-custom-env # Works with warning -``` - -### Implementation Notes - -- Change from Enum to str -- Keep list of "known" environments for smart defaults -- Warn on unknown environments (don't error) -- Document that custom environments get neutral defaults - ---- - -## Summary: Redesign Decisions - -| # | Aspect | Verdict | Priority | Effort | -|---|--------|---------|----------|--------| -| 1 | Global Singleton Runtime | **REDESIGN** | P1 | Medium | -| 2 | Fixed Processor Order | **MODIFY** | P2 | Low | -| 3 | Automatic Input Capture | **REDESIGN** | P1 | Low | -| 4 | Dual-Schema Default | **MODIFY** | P3 | Trivial | -| 5 | Console/JSONL Defaults | **REDESIGN** | P1 | Low | -| 6 | Redaction Disabled | **REDESIGN** | P1 | Low | -| 7 | Strict StepType Enum | **MINOR MODIFY** | P3 | Low | -| 8 | Thread-Local Context | **MODIFY** | P2 | Medium | -| 9 | Deep Copy Redaction | **MODIFY** | P3 | Medium | -| 10 | Environment Validation | **REDESIGN** | P2 | Trivial | - ---- - -## Recommended Implementation Order - -### Phase 1: Quick Wins (1-2 weeks) - -High impact, low effort changes: - -1. **Environment validation** - Remove enum restriction (trivial) -2. **Redaction default** - Enable by default (trivial, high impact) -3. **Input exclusion** - Add `exclude_inputs` parameter (low effort, high impact) -4. **Schema dialect default** - Change to `OTEL_GENAI` (trivial) - -### Phase 2: Smart Defaults (2-3 weeks) - -Environment-aware behavior: - -5. **Console/JSONL defaults** - Environment detection (low-medium effort) -6. **Processor ordering** - Add configuration options (low effort) - -### Phase 3: Architecture Improvements (4-6 weeks) - -Larger changes for advanced users: - -7. **Instance-based runtime** - Add alongside global API (medium effort) -8. **Async context helpers** - Document + provide utilities (medium effort) -9. **Lazy redaction copying** - Optimization (medium effort) - -### Phase 4: Nice-to-Have (As needed) - -Lower priority: - -10. **Custom StepType** - Add CUSTOM type (low effort, low impact) - ---- - -## Migration Notes - -### For Existing Users - -Most changes are additive or change defaults. 
Migration guide: - -```python -# If you relied on seeing full unredacted data: -config.redaction.enabled = False # Explicit opt-out - -# If you relied on dual schema output: -config.schema_dialect = SchemaDialect.BOTH # Explicit opt-in - -# If you relied on console/JSONL in production: -tracecraft.init(console=True, jsonl=True) # Explicit enable -``` - -### Deprecation Strategy - -For breaking changes, use deprecation warnings for one minor version: - -```python -import warnings - -if config.schema_dialect is None: - warnings.warn( - "Default schema dialect changing from BOTH to OTEL_GENAI in v2.0. " - "Set schema_dialect explicitly to suppress this warning.", - DeprecationWarning - ) -``` diff --git a/dev/architecture/FEATURE_EVALUATION.md b/dev/architecture/FEATURE_EVALUATION.md deleted file mode 100644 index b6dafab..0000000 --- a/dev/architecture/FEATURE_EVALUATION.md +++ /dev/null @@ -1,995 +0,0 @@ -# Feature Evaluation: Proposed Enhancements - -This document evaluates proposed features for TraceCraft, analyzing their value proposition, implementation approach, and integration with existing architecture. - ---- - -## Executive Summary - -| Feature | Recommendation | Complexity | Value | -|---------|----------------|------------|-------| -| Central SQLite with Projects/Versioning | **Implement** | Medium-High | High | -| Log Correlation in TUI | **Implement** | Medium | High | -| Evaluation Sets in TUI | **Implement** | Medium | High | -| AI Inspect for Failed Evaluations | **Implement** | Medium | High | - -All four features are worth implementing. They build naturally on the existing architecture and address real workflow gaps. - ---- - -## Feature 1: Enhanced SQLite Storage with Projects, Versioning, and Playground Persistence - -### Current State - -The SQLite backend (`src/tracecraft/storage/sqlite.py`) provides solid trace storage with: - -- Denormalized schema for fast queries (traces, steps, trace_tags tables) -- WAL mode for concurrency -- Rich querying (name, duration, cost, tags, time range filtering) -- Statistics and analytics methods - -However, it lacks: - -- Multi-project/workspace organization -- Trace versioning (original vs. modified) -- Integration with playground changes (currently saved to separate JSON files via `IterationHistory`) - -### Why This is Valuable - -1. **Multi-project support**: Teams working on multiple agents need isolation and organization -2. **Version tracking**: Comparing original traces vs. playground modifications enables systematic prompt engineering -3. **Unified storage**: Playground iterations scattered in JSON files are not queryable or shareable -4. **Audit trail**: Understanding what changed and when is critical for production debugging - -### Implementation Approach - -#### 1. Schema Extensions - -Add new tables to the SQLite schema: - -```sql --- Projects/workspaces for organizing traces -CREATE TABLE projects ( - id TEXT PRIMARY KEY, - name TEXT NOT NULL UNIQUE, - description TEXT, - created_at TEXT NOT NULL, - updated_at TEXT NOT NULL, - settings TEXT -- JSON blob for project-specific config -); - --- Track trace versions (original, playground modifications, etc.) 
-CREATE TABLE trace_versions ( - id TEXT PRIMARY KEY, - trace_id TEXT NOT NULL, -- Original trace ID - version_number INTEGER NOT NULL, - version_type TEXT NOT NULL, -- 'original', 'playground', 'manual' - parent_version_id TEXT, -- For branching history - created_at TEXT NOT NULL, - created_by TEXT, -- user_id if available - notes TEXT, - data TEXT NOT NULL, -- Full AgentRun JSON - FOREIGN KEY (trace_id) REFERENCES traces(id), - UNIQUE(trace_id, version_number) -); - --- Playground iterations linked to trace versions -CREATE TABLE playground_iterations ( - id TEXT PRIMARY KEY, - trace_version_id TEXT NOT NULL, - step_id TEXT NOT NULL, - iteration_number INTEGER NOT NULL, - prompt TEXT NOT NULL, - output TEXT, - input_tokens INTEGER, - output_tokens INTEGER, - duration_ms INTEGER, - notes TEXT, - is_best BOOLEAN DEFAULT FALSE, - created_at TEXT NOT NULL, - FOREIGN KEY (trace_version_id) REFERENCES trace_versions(id) ON DELETE CASCADE -); - --- Add project_id to traces table -ALTER TABLE traces ADD COLUMN project_id TEXT REFERENCES projects(id); -CREATE INDEX idx_traces_project ON traces(project_id); -``` - -#### 2. Storage Layer Changes - -Extend `SQLiteTraceStore` in `src/tracecraft/storage/sqlite.py`: - -```python -class SQLiteTraceStore(BaseTraceStore): - # New methods for project management - def create_project(self, name: str, description: str = "") -> str: ... - def list_projects(self) -> list[dict]: ... - def get_project(self, project_id: str) -> dict | None: ... - def delete_project(self, project_id: str) -> bool: ... - - # Versioning support - def create_version( - self, - trace_id: str, - version_type: str = "playground", - notes: str = "", - modified_run: AgentRun | None = None - ) -> str: ... - - def get_versions(self, trace_id: str) -> list[dict]: ... - def get_version(self, version_id: str) -> AgentRun | None: ... - def compare_versions(self, v1_id: str, v2_id: str) -> dict: ... - - # Playground persistence - def save_playground_iteration( - self, - trace_id: str, - step_id: str, - prompt: str, - output: str, - tokens: dict, - duration_ms: int, - notes: str = "" - ) -> str: ... - - def get_iterations(self, trace_id: str, step_id: str) -> list[dict]: ... - def mark_best_iteration(self, iteration_id: str) -> bool: ... -``` - -#### 3. Playground Integration - -Modify `src/tracecraft/playground/runner.py` to persist iterations: - -```python -async def replay_step( - run: AgentRun, - step_id: str | None = None, - step_name: str | None = None, - modified_prompt: str | None = None, - provider: BaseReplayProvider | None = None, - store: SQLiteTraceStore | None = None, # NEW: optional persistence - save_iteration: bool = True, # NEW: control persistence -) -> ReplayResult: - # ... existing replay logic ... - - # Persist iteration if store provided - if store and save_iteration: - store.save_playground_iteration( - trace_id=run.id, - step_id=step.id, - prompt=modified_prompt or original_prompt, - output=result.output, - tokens={"input": result.input_tokens, "output": result.output_tokens}, - duration_ms=result.duration_ms - ) - - return result -``` - -#### 4. 
Migration Strategy - -Add schema migration support in `_ensure_schema()`: - -```python -def _ensure_schema(self) -> None: - current_version = self._get_schema_version() - - migrations = [ - (2, self._migrate_v2_projects), - (3, self._migrate_v3_versions), - (4, self._migrate_v4_iterations), - ] - - for version, migration_fn in migrations: - if current_version < version: - migration_fn() - self._set_schema_version(version) -``` - -### Files to Modify/Create - -| File | Changes | -|------|---------| -| `src/tracecraft/storage/sqlite.py` | Schema extensions, new methods | -| `src/tracecraft/storage/base.py` | Add versioning protocol methods | -| `src/tracecraft/playground/runner.py` | Persistence integration | -| `src/tracecraft/playground/comparison.py` | Remove file-based `IterationHistory.save/load`, use store | -| `src/tracecraft/tui/app.py` | Project selection, version navigation | - ---- - -## Feature 2: Log Correlation in Terminal UI - -### Current State - -The TUI displays trace hierarchy, inputs/outputs, and metrics but has no visibility into application logs. The `IOViewer` widget shows: - -- Input mode (step inputs) -- Output mode (step outputs) -- Attributes mode (metadata) -- JSON mode (full serialization) -- Error mode (error details) - -There is no log capture or display mechanism. - -### Why This is Valuable - -1. **Debug context**: Logs often contain critical debugging information not in structured trace data -2. **Correlation**: Seeing "what the code was doing" alongside "what the LLM returned" is essential -3. **Industry standard**: Jaeger, Datadog, and other observability tools correlate logs with traces -4. **Searchability**: Finding traces via log content is a common workflow - -### Implementation Approach - -#### 1. Log Capture via Context - -Create a log handler that captures logs during traced execution: - -**New file: `src/tracecraft/instrumentation/log_capture.py`** - -```python -import logging -from contextvars import ContextVar -from dataclasses import dataclass, field -from datetime import datetime - -@dataclass -class CapturedLog: - timestamp: datetime - level: str - logger_name: str - message: str - step_id: str | None = None - trace_id: str | None = None - extra: dict = field(default_factory=dict) - -# Context var for current log buffer -_log_buffer: ContextVar[list[CapturedLog]] = ContextVar("log_buffer", default=[]) - -class TracingLogHandler(logging.Handler): - """Handler that captures logs during traced execution.""" - - def emit(self, record: logging.LogRecord) -> None: - from tracecraft.core.context import get_current_run, get_current_step - - buffer = _log_buffer.get() - run = get_current_run() - step = get_current_step() - - if run is not None: # Only capture during traced execution - log = CapturedLog( - timestamp=datetime.fromtimestamp(record.created), - level=record.levelname, - logger_name=record.name, - message=record.getMessage(), - step_id=step.id if step else None, - trace_id=run.id, - extra=getattr(record, "extra", {}), - ) - buffer.append(log) - -def install_log_capture(level: int = logging.DEBUG) -> TracingLogHandler: - """Install the tracing log handler on the root logger.""" - handler = TracingLogHandler() - handler.setLevel(level) - logging.root.addHandler(handler) - return handler - -def get_captured_logs() -> list[CapturedLog]: - """Get logs captured in current context.""" - return _log_buffer.get().copy() - -def clear_captured_logs() -> None: - """Clear the log buffer.""" - _log_buffer.set([]) -``` - -#### 2. 
Extend Data Models - -Add logs to `AgentRun` in `src/tracecraft/core/models.py`: - -```python -@dataclass -class AgentRun: - # ... existing fields ... - - # Log entries captured during execution - logs: list[dict] = field(default_factory=list) -``` - -#### 3. Storage Schema Extension - -Add logs table to SQLite: - -```sql -CREATE TABLE trace_logs ( - id INTEGER PRIMARY KEY AUTOINCREMENT, - trace_id TEXT NOT NULL, - step_id TEXT, - timestamp TEXT NOT NULL, - level TEXT NOT NULL, - logger_name TEXT, - message TEXT NOT NULL, - extra TEXT, -- JSON blob - FOREIGN KEY (trace_id) REFERENCES traces(id) ON DELETE CASCADE -); - -CREATE INDEX idx_logs_trace ON trace_logs(trace_id); -CREATE INDEX idx_logs_step ON trace_logs(step_id); -CREATE INDEX idx_logs_level ON trace_logs(level); -CREATE INDEX idx_logs_message ON trace_logs(message); -- For search -``` - -#### 4. TUI Log Viewer Widget - -**New file: `src/tracecraft/tui/widgets/log_viewer.py`** - -```python -from textual.widgets import DataTable, Input -from textual.containers import Vertical - -class LogViewer(Vertical): - """Widget for viewing and searching logs associated with a trace.""" - - BINDINGS = [ - ("d", "toggle_debug", "Toggle DEBUG"), - ("i", "toggle_info", "Toggle INFO"), - ("w", "toggle_warning", "Toggle WARN"), - ("e", "toggle_error", "Toggle ERROR"), - ] - - def __init__(self) -> None: - super().__init__() - self.logs: list[dict] = [] - self.filter_text: str = "" - self.show_levels: set[str] = {"DEBUG", "INFO", "WARNING", "ERROR"} - - def compose(self): - yield Input(placeholder="Search logs...", id="log-search") - yield DataTable(id="log-table") - - def on_mount(self) -> None: - table = self.query_one("#log-table", DataTable) - table.add_columns("Time", "Level", "Logger", "Message") - table.cursor_type = "row" - - def set_logs(self, logs: list[dict]) -> None: - self.logs = logs - self._refresh_table() - - def filter_by_step(self, step_id: str | None) -> None: - """Filter logs to show only those for a specific step.""" - # ... implementation - - def _refresh_table(self) -> None: - table = self.query_one("#log-table", DataTable) - table.clear() - - for log in self.logs: - if log["level"] not in self.show_levels: - continue - if self.filter_text and self.filter_text.lower() not in log["message"].lower(): - continue - - table.add_row( - log["timestamp"][:19], # Trim to seconds - log["level"], - log["logger_name"][:20], - log["message"][:100], - ) -``` - -#### 5. TUI Integration - -Add log viewer as a new mode in `app.py`: - -```python -class TraceCraftApp(App): - BINDINGS = [ - # ... existing bindings ... - ("l", "show_logs", "Logs"), - ] - - def compose(self): - # ... existing layout ... 
- yield LogViewer(id="log-viewer", classes="hidden") - - def action_show_logs(self) -> None: - self.query_one("#log-viewer").toggle_class("hidden") - self.query_one("#io-viewer").toggle_class("hidden") -``` - -### Files to Modify/Create - -| File | Changes | -|------|---------| -| `src/tracecraft/instrumentation/log_capture.py` | **New** - Log capture handler | -| `src/tracecraft/core/models.py` | Add `logs` field to `AgentRun` | -| `src/tracecraft/storage/sqlite.py` | Add `trace_logs` table, query methods | -| `src/tracecraft/tui/widgets/log_viewer.py` | **New** - Log display widget | -| `src/tracecraft/tui/app.py` | Add log viewer, keybinding, layout | - ---- - -## Feature 3: Evaluation Set Creation and Results Viewing in TUI - -### Current State - -Evaluation capabilities exist but are code-only: - -- `src/tracecraft/contrib/evaluation.py` - Context managers and decorators -- `src/tracecraft/datasets/converters.py` - Export to CSV, HuggingFace, JSONL, golden datasets -- `src/tracecraft/integrations/ragas.py` - RAGAS integration -- `src/tracecraft/integrations/deepeval.py` - DeepEval integration - -The TUI has no evaluation functionality - users cannot: - -- Select traces/steps to include in an evaluation set -- Define expected outputs interactively -- View evaluation results - -### Why This is Valuable - -1. **Accessibility**: Creating evaluation sets should not require writing code -2. **Iterative curation**: Selecting specific traces, editing expected outputs, refining datasets -3. **Results visibility**: Seeing pass/fail rates, score distributions, failure patterns -4. **Workflow integration**: Natural flow from trace inspection to evaluation to improvement - -### Implementation Approach - -#### 1. Evaluation Set Data Model - -**New file: `src/tracecraft/evaluation/models.py`** - -```python -from dataclasses import dataclass, field -from datetime import datetime -from enum import Enum - -class EvalSetStatus(str, Enum): - DRAFT = "draft" - READY = "ready" - RUNNING = "running" - COMPLETED = "completed" - -@dataclass -class EvalCase: - """Single evaluation case derived from a trace step.""" - id: str - trace_id: str - step_id: str - input: str - expected_output: str | None = None - actual_output: str | None = None - context: list[str] = field(default_factory=list) # For RAG - metadata: dict = field(default_factory=dict) - - # Results (populated after evaluation) - scores: dict = field(default_factory=dict) - passed: bool | None = None - notes: str = "" - -@dataclass -class EvalSet: - """Collection of evaluation cases.""" - id: str - name: str - description: str = "" - project_id: str | None = None - cases: list[EvalCase] = field(default_factory=list) - - # Evaluation config - evaluator_type: str = "custom" # "ragas", "deepeval", "custom" - metrics: list[str] = field(default_factory=list) - thresholds: dict = field(default_factory=dict) - - # Status tracking - status: EvalSetStatus = EvalSetStatus.DRAFT - created_at: datetime = field(default_factory=datetime.now) - last_run_at: datetime | None = None - - # Aggregate results - total_cases: int = 0 - passed_cases: int = 0 - failed_cases: int = 0 - avg_scores: dict = field(default_factory=dict) - -@dataclass -class EvalResult: - """Results from a single evaluation run.""" - id: str - eval_set_id: str - run_at: datetime - duration_ms: int - case_results: list[dict] # Per-case scores - aggregate_scores: dict - passed_count: int - failed_count: int - total_count: int -``` - -#### 2. 
Storage Schema - -```sql -CREATE TABLE eval_sets ( - id TEXT PRIMARY KEY, - name TEXT NOT NULL, - description TEXT, - project_id TEXT, - evaluator_type TEXT DEFAULT 'custom', - metrics TEXT, -- JSON array - thresholds TEXT, -- JSON dict - status TEXT DEFAULT 'draft', - created_at TEXT NOT NULL, - last_run_at TEXT, - FOREIGN KEY (project_id) REFERENCES projects(id) -); - -CREATE TABLE eval_cases ( - id TEXT PRIMARY KEY, - eval_set_id TEXT NOT NULL, - trace_id TEXT, - step_id TEXT, - input TEXT NOT NULL, - expected_output TEXT, - actual_output TEXT, - context TEXT, -- JSON array for RAG - metadata TEXT, -- JSON dict - FOREIGN KEY (eval_set_id) REFERENCES eval_sets(id) ON DELETE CASCADE -); - -CREATE TABLE eval_results ( - id TEXT PRIMARY KEY, - eval_set_id TEXT NOT NULL, - case_id TEXT NOT NULL, - run_id TEXT NOT NULL, -- Groups results from same run - run_at TEXT NOT NULL, - scores TEXT NOT NULL, -- JSON dict - passed BOOLEAN, - notes TEXT, - FOREIGN KEY (eval_set_id) REFERENCES eval_sets(id) ON DELETE CASCADE, - FOREIGN KEY (case_id) REFERENCES eval_cases(id) ON DELETE CASCADE -); -``` - -#### 3. TUI Screens - -**New file: `src/tracecraft/tui/screens/eval_set_builder.py`** - -```python -from textual.screen import Screen -from textual.widgets import DataTable, Button, Input, Select - -class EvalSetBuilderScreen(Screen): - """Screen for creating/editing evaluation sets.""" - - BINDINGS = [ - ("a", "add_selected", "Add to Set"), - ("e", "edit_expected", "Edit Expected"), - ("s", "save_set", "Save"), - ("escape", "go_back", "Back"), - ] - - def __init__(self, store, selected_traces: list[str] = None): - super().__init__() - self.store = store - self.selected_traces = selected_traces or [] - self.eval_set = EvalSet(id=str(uuid4()), name="New Evaluation Set") - - def compose(self): - yield Input(placeholder="Evaluation set name...", id="name-input") - yield Select( - options=[ - ("Custom Evaluator", "custom"), - ("RAGAS (RAG)", "ragas"), - ("DeepEval", "deepeval"), - ], - id="evaluator-select" - ) - yield DataTable(id="cases-table") - yield Button("Run Evaluation", id="run-btn") - - def action_add_selected(self) -> None: - """Add currently selected trace/step as eval case.""" - # Creates EvalCase from selected step - # Opens modal to set expected output - ... - - def action_edit_expected(self) -> None: - """Edit expected output for selected case.""" - # Opens text editor for expected output - ... -``` - -**New file: `src/tracecraft/tui/screens/eval_results.py`** - -```python -from textual.screen import Screen -from textual.widgets import DataTable, Static, ProgressBar - -class EvalResultsScreen(Screen): - """Screen for viewing evaluation results.""" - - BINDINGS = [ - ("f", "filter_failed", "Failed Only"), - ("d", "show_details", "Details"), - ("r", "re_run", "Re-run"), - ("escape", "go_back", "Back"), - ] - - def compose(self): - yield Static(id="summary-panel") # Pass rate, avg scores - yield ProgressBar(id="pass-rate-bar") - yield DataTable(id="results-table") - - def set_results(self, eval_set: EvalSet, results: list[EvalResult]) -> None: - """Populate the screen with evaluation results.""" - # Update summary panel with aggregates - # Populate table with per-case results - # Color-code passed/failed rows - ... -``` - -#### 4. Main App Integration - -Add keybindings and actions to `app.py`: - -```python -class TraceCraftApp(App): - BINDINGS = [ - # ... existing ... 
- ("E", "create_eval_set", "Create Eval Set"), - ("R", "view_eval_results", "Eval Results"), - ] - - def action_create_eval_set(self) -> None: - """Open evaluation set builder with selected traces.""" - selected = self.get_selected_traces() # Multi-select support - self.push_screen(EvalSetBuilderScreen(self.store, selected)) - - def action_view_eval_results(self) -> None: - """Open evaluation results viewer.""" - self.push_screen(EvalResultsListScreen(self.store)) -``` - -### Files to Modify/Create - -| File | Changes | -|------|---------| -| `src/tracecraft/evaluation/models.py` | **New** - EvalSet, EvalCase, EvalResult | -| `src/tracecraft/storage/sqlite.py` | Add eval tables, CRUD methods | -| `src/tracecraft/tui/screens/eval_set_builder.py` | **New** - Creation screen | -| `src/tracecraft/tui/screens/eval_results.py` | **New** - Results screen | -| `src/tracecraft/tui/app.py` | Add keybindings, screen navigation | -| `src/tracecraft/evaluation/runner.py` | **New** - Orchestrate evaluation runs | - ---- - -## Feature 4: AI Inspect for Failed Evaluations - -### Current State - -No AI-assisted analysis exists. When evaluations fail, users must manually: - -1. Read the input/output -2. Compare to expected output -3. Identify why it failed -4. Determine what to fix - -This is time-consuming and requires expertise. - -### Why This is Valuable - -1. **Root cause analysis**: AI can identify patterns humans might miss -2. **Actionable recommendations**: Suggests specific prompt/code changes -3. **Learning acceleration**: Helps users understand failure modes -4. **Batch analysis**: Can analyze multiple failures to find common issues - -### Implementation Approach - -#### 1. Inspector Module - -**New file: `src/tracecraft/evaluation/inspector.py`** - -```python -from dataclasses import dataclass -from enum import Enum - -class FailureCategory(str, Enum): - HALLUCINATION = "hallucination" - INCOMPLETE = "incomplete" - FORMAT_ERROR = "format_error" - WRONG_FACTS = "wrong_facts" - CONTEXT_IGNORED = "context_ignored" - INSTRUCTION_MISSED = "instruction_missed" - TONE_MISMATCH = "tone_mismatch" - OTHER = "other" - -@dataclass -class InspectionResult: - case_id: str - failure_category: FailureCategory - root_cause: str - evidence: list[str] # Specific quotes/examples - suggested_fixes: list[str] - confidence: float # 0-1 - analysis_tokens: int - -@dataclass -class BatchInspectionResult: - results: list[InspectionResult] - common_patterns: list[str] - priority_fixes: list[str] # Most impactful changes - summary: str - -INSPECTION_PROMPT = '''You are an expert at analyzing LLM evaluation failures. - -## Task -Analyze why this evaluation case failed and provide actionable recommendations. - -## Input -{input} - -## Expected Output -{expected_output} - -## Actual Output -{actual_output} - -## Evaluation Scores -{scores} - -## Analysis Required -1. **Failure Category**: Classify as one of: hallucination, incomplete, format_error, wrong_facts, context_ignored, instruction_missed, tone_mismatch, other -2. **Root Cause**: Explain specifically why the output doesn't match expectations -3. **Evidence**: Quote specific parts of the output that demonstrate the issue -4. 
**Suggested Fixes**: List 2-3 concrete changes to the prompt or system that could fix this - -Respond in JSON format: -```json -{ - "failure_category": "...", - "root_cause": "...", - "evidence": ["...", "..."], - "suggested_fixes": ["...", "..."], - "confidence": 0.85 -} -```''' - -class AIInspector: - """AI-powered evaluation failure analysis.""" - - def __init__( - self, - provider: str = "openai", - model: str = "gpt-4o", - api_key: str | None = None, - ): - self.provider = provider - self.model = model - self.api_key = api_key or os.environ.get("OPENAI_API_KEY") - - async def inspect_case(self, case: EvalCase) -> InspectionResult: - """Analyze a single failed evaluation case.""" - prompt = INSPECTION_PROMPT.format( - input=case.input, - expected_output=case.expected_output, - actual_output=case.actual_output, - scores=json.dumps(case.scores, indent=2), - ) - - # Call LLM (OpenAI or Anthropic) - response = await self._call_llm(prompt) - result = json.loads(response) - - return InspectionResult( - case_id=case.id, - failure_category=FailureCategory(result["failure_category"]), - root_cause=result["root_cause"], - evidence=result["evidence"], - suggested_fixes=result["suggested_fixes"], - confidence=result["confidence"], - analysis_tokens=..., - ) - - async def inspect_batch( - self, - cases: list[EvalCase], - find_patterns: bool = True, - ) -> BatchInspectionResult: - """Analyze multiple failures and identify patterns.""" - # Inspect each case - results = await asyncio.gather(*[ - self.inspect_case(case) for case in cases - ]) - - if find_patterns: - # Additional LLM call to find common patterns - patterns = await self._find_patterns(results) - else: - patterns = [] - - return BatchInspectionResult( - results=results, - common_patterns=patterns, - priority_fixes=self._prioritize_fixes(results), - summary=self._generate_summary(results), - ) - - async def _call_llm(self, prompt: str) -> str: - """Call the configured LLM provider.""" - if self.provider == "openai": - from openai import AsyncOpenAI - client = AsyncOpenAI(api_key=self.api_key) - response = await client.chat.completions.create( - model=self.model, - messages=[{"role": "user", "content": prompt}], - response_format={"type": "json_object"}, - ) - return response.choices[0].message.content - elif self.provider == "anthropic": - # Similar for Anthropic - ... -``` - -#### 2. 
TUI Integration - -**New file: `src/tracecraft/tui/screens/ai_inspect.py`** - -```python -from textual.screen import Screen -from textual.widgets import Static, Button, LoadingIndicator - -class AIInspectScreen(Screen): - """Screen for AI-powered failure analysis.""" - - BINDINGS = [ - ("r", "re_inspect", "Re-analyze"), - ("a", "apply_fix", "Apply Fix"), - ("n", "next_failure", "Next"), - ("escape", "go_back", "Back"), - ] - - def __init__(self, case: EvalCase, inspector: AIInspector): - super().__init__() - self.case = case - self.inspector = inspector - self.result: InspectionResult | None = None - - def compose(self): - yield Static(id="case-summary") - yield LoadingIndicator(id="loading") - yield Static(id="analysis-result", classes="hidden") - yield Button("Apply Suggested Fix", id="apply-btn", classes="hidden") - - async def on_mount(self) -> None: - """Run inspection when screen mounts.""" - self.result = await self.inspector.inspect_case(self.case) - self._display_result() - - def _display_result(self) -> None: - """Display the inspection result.""" - self.query_one("#loading").add_class("hidden") - - result_widget = self.query_one("#analysis-result") - result_widget.update(self._format_result()) - result_widget.remove_class("hidden") - - self.query_one("#apply-btn").remove_class("hidden") - - def _format_result(self) -> str: - """Format inspection result for display.""" - r = self.result - return f""" -## Failure Category: {r.failure_category.value} -Confidence: {r.confidence:.0%} - -## Root Cause -{r.root_cause} - -## Evidence -{chr(10).join(f'- "{e}"' for e in r.evidence)} - -## Suggested Fixes -{chr(10).join(f'{i+1}. {fix}' for i, fix in enumerate(r.suggested_fixes))} -""" -``` - -#### 3. Batch Analysis Screen - -```python -class BatchInspectScreen(Screen): - """Analyze multiple failures to find patterns.""" - - def compose(self): - yield Static("Analyzing failures...", id="status") - yield ProgressBar(id="progress") - yield Static(id="patterns", classes="hidden") - yield DataTable(id="fixes-table", classes="hidden") - - async def analyze(self, cases: list[EvalCase]) -> None: - """Run batch analysis.""" - result = await self.inspector.inspect_batch(cases) - - # Display common patterns - self.query_one("#patterns").update( - "## Common Patterns\n" + - "\n".join(f"- {p}" for p in result.common_patterns) - ) - - # Display priority fixes in table - table = self.query_one("#fixes-table") - table.add_columns("Priority", "Fix", "Affected Cases") - for i, fix in enumerate(result.priority_fixes, 1): - table.add_row(str(i), fix, "...") -``` - -#### 4. Integration with Eval Results Screen - -Add AI Inspect action to `eval_results.py`: - -```python -class EvalResultsScreen(Screen): - BINDINGS = [ - # ... existing ... 
- ("I", "ai_inspect", "AI Inspect"), - ("B", "batch_inspect", "Batch Inspect Failed"), - ] - - def action_ai_inspect(self) -> None: - """Inspect selected failed case with AI.""" - case = self.get_selected_case() - if case and not case.passed: - inspector = AIInspector() - self.app.push_screen(AIInspectScreen(case, inspector)) - - def action_batch_inspect(self) -> None: - """Analyze all failed cases for patterns.""" - failed = [c for c in self.cases if not c.passed] - if failed: - inspector = AIInspector() - self.app.push_screen(BatchInspectScreen(failed, inspector)) -``` - -### Files to Modify/Create - -| File | Changes | -|------|---------| -| `src/tracecraft/evaluation/inspector.py` | **New** - AI inspection logic | -| `src/tracecraft/tui/screens/ai_inspect.py` | **New** - Single case inspection | -| `src/tracecraft/tui/screens/batch_inspect.py` | **New** - Batch pattern analysis | -| `src/tracecraft/tui/screens/eval_results.py` | Add AI inspect keybindings | - ---- - -## Implementation Priority - -Based on dependencies and value: - -### Phase 1: Foundation - -1. **Enhanced SQLite with Projects/Versioning** - Required foundation for other features - -### Phase 2: Core Features (can be parallelized) - -2. **Evaluation Sets in TUI** - High user value, builds on storage -3. **Log Correlation** - Independent, high debugging value - -### Phase 3: Advanced - -4. **AI Inspect** - Requires evaluation infrastructure from Phase 2 - ---- - -## Summary - -All four features are recommended for implementation: - -| Feature | Key Value | Key Challenge | -|---------|-----------|---------------| -| SQLite Projects/Versioning | Unified storage, team workflows | Schema migration, backward compatibility | -| Log Correlation | Debug context, searchability | Log capture without performance impact | -| Evaluation Sets in TUI | No-code evaluation, iteration | UI complexity, evaluator integration | -| AI Inspect | Automated root cause analysis | LLM cost, response quality | - -Each feature builds naturally on TraceCraft's existing architecture and addresses genuine workflow gaps for LLM application developers. diff --git a/dev/architecture/PAIN_POINTS.md b/dev/architecture/PAIN_POINTS.md deleted file mode 100644 index a6bfc73..0000000 --- a/dev/architecture/PAIN_POINTS.md +++ /dev/null @@ -1,235 +0,0 @@ -# Pain Points TraceCraft Solves - -This document outlines the top four pain points in LLM observability that TraceCraft uniquely addresses, why existing solutions fail to solve them, and how TraceCraft's architecture provides a differentiated solution. - ---- - -## Pain Point #1: Vendor & Framework Lock-in Trap - -### The Problem - -When you instrument your LLM application for observability today, you make a binding decision: - -- **Use LangSmith** → Your tracing code is coupled to LangChain. Migrate to LlamaIndex or PydanticAI? Re-instrument everything. -- **Use Langfuse** → You're locked to their platform schema. Want to switch to Datadog later? Export/import headaches. -- **Use OpenLLMetry** → Great instrumentation, but it only captures data—you still need to pick a backend and adapt to its schema. 
- -### Why Current Solutions Don't Solve This - -| Solution | Lock-in Type | What Breaks When You Switch | -|----------|--------------|----------------------------| -| LangSmith | Framework + Platform | All decorator-based instrumentation; evaluation pipelines; prompt management | -| Langfuse | Platform | Schema mappings; dashboard configurations; alerting rules | -| Datadog LLM Obs | Platform + Pricing | Custom spans; cost attribution; all widgets | -| Raw OpenTelemetry | None, but... | You write 100+ lines of boilerplate for LLM-specific semantics every time | - -The fundamental issue: **observability platforms treat instrumentation as their moat**, not as a portable asset. - -### Why TraceCraft Is Different - -TraceCraft's "instrument once, observe anywhere" model means: - -```python -# This code NEVER changes regardless of backend -@trace_agent(name="research") -def research(query: str): - return analyze(search(query)) - -# Only configuration changes -tracecraft.init(exporters=[ - ConsoleExporter(), # Local dev - OTLPExporter(endpoint="..."), # Production: Grafana/Jaeger/Tempo - # Or tomorrow: DatadogExporter(), LangfuseExporter(), etc. -]) -``` - -The dual-dialect schema (OTel GenAI + OpenInference) means your traces are compatible with both standards without re-instrumentation. - ---- - -## Pain Point #2: The "Backend Required" Development Tax - -### The Problem - -Every existing LLM observability tool requires infrastructure setup before you see anything useful: - -- **Langfuse**: Docker Compose with Postgres + ClickHouse, or pay for cloud -- **LangSmith**: Create account, configure API keys, set environment variables -- **Jaeger/Tempo**: Deploy collector, configure storage, set up Grafana dashboards -- **Phoenix**: Better (local mode exists), but still requires explicit setup - -The result: developers choose between **printf debugging** (fast but blind) and **infrastructure setup** (powerful but slow). - -### Why This Is Painful - -1. **Day-1 friction**: New team member joins → spends 2 hours setting up observability before writing code -2. **"I'll add tracing later"**: Without instant feedback, instrumentation becomes technical debt -3. **Context switching**: Debug locally → check cloud dashboard → lose flow state -4. **Cost in exploration**: Experimenting with a new approach? That's another trace export bill - -### Why Current Solutions Don't Solve This - -- **LangSmith**: Cloud-first design; local mode is an afterthought -- **Langfuse**: Self-hosted option exists but requires database infrastructure -- **OpenLLMetry**: Instrumentation only—still need to configure a backend -- **Phoenix**: Closest to solving this, but designed for evaluation, not debugging - -### Why TraceCraft Is Different - -```python -import tracecraft -tracecraft.init() # That's it. Zero config. - -# Immediately get: -# 1. Rich console tree showing execution flow -# 2. JSONL file for later analysis -# 3. HTML report generation on demand -``` - -The "local-first DX" philosophy means: - -- **Zero infrastructure** to see your first trace -- **Beautiful console output** with Rich formatting shows agent trees inline -- **JSONL files** that can be version-controlled alongside code -- **HTML reports** for sharing with non-technical stakeholders - -Transition to production is configuration, not re-implementation. 
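-
-As a concrete sketch of that transition, the same instrumented code can move from local-only output to production export by changing one `init()` call. (The `tracecraft.exporters` import path, the `JSONLExporter` name, the environment variable, and the endpoint URL are illustrative assumptions; `ConsoleExporter` and `OTLPExporter` are the classes shown under Pain Point #1.)
-
-```python
-import os
-
-import tracecraft
-# NOTE: the import path and the JSONLExporter name are assumptions for illustration
-from tracecraft.exporters import ConsoleExporter, JSONLExporter, OTLPExporter
-
-# Local development: rich console tree plus a version-controllable JSONL file
-exporters = [ConsoleExporter(), JSONLExporter(path="traces/dev.jsonl")]
-
-# Production: ship the same traces to an OTLP-compatible backend instead
-if os.environ.get("APP_ENV") == "production":
-    exporters = [OTLPExporter(endpoint="https://otel.example.com:4318")]
-
-tracecraft.init(exporters=exporters)
-
-# The @trace_agent / @trace_tool instrumentation itself never changes.
-```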
- ---- - -## Pain Point #3: The Governance Afterthought Problem - -### The Problem - -LLM applications process sensitive data: user queries contain PII, API responses may include confidential information, and prompts often contain proprietary business logic. Yet observability tools treat governance as an add-on: - -- **Redaction**: Most tools don't have it; those that do require post-processing -- **Sampling**: Applied at the backend, not the SDK—you still send sensitive data -- **Compliance**: GDPR requires data minimization, but traces capture everything by default - -### The Real-World Consequence - -A healthcare AI assistant processing patient messages: - -1. Traces contain PHI (Protected Health Information) -2. Traces sent to cloud observability platform (potential HIPAA violation) -3. Security review catches this 6 months into production -4. Team has to retrofit redaction or disable tracing entirely - -### Why Current Solutions Don't Solve This - -| Solution | PII Handling | Where It Happens | -|----------|--------------|------------------| -| LangSmith | Basic masking available | Server-side (data already transmitted) | -| Langfuse | Manual tagging | Server-side | -| Datadog | Sensitive data scanner | Server-side, paid feature | -| OpenLLMetry | None built-in | N/A | - -The pattern: **governance happens after your data leaves your infrastructure**. - -### Why TraceCraft Is Different - -```python -from tracecraft import TraceCraftConfig, RedactionMode - -config = TraceCraftConfig() -config.redaction.enabled = True -config.redaction.mode = RedactionMode.HASH # Preserve linkability for debugging -config.sampling.rate = 0.1 # Only export 10% of traces - -tracecraft.init(config=config) -``` - -Built-in, client-side governance: - -- **15+ PII patterns** (emails, phones, SSNs, credit cards, API keys, Bearer tokens, private keys) -- **ReDoS-safe regex** (production-hardened against denial-of-service) -- **Allowlist support** for false positives -- **Hashing mode** preserves correlation without exposing values -- **Client-side sampling** means sensitive traces never leave your infrastructure - -This isn't an afterthought—it's core architecture. The `RedactionProcessor` runs in the pipeline *before* any exporter sees the data. - ---- - -## Pain Point #4: Multi-Framework Semantic Chaos - -### The Problem - -Real-world AI teams don't use a single framework. A typical production stack might include: - -- **LangChain** for the orchestration layer -- **LlamaIndex** for RAG pipelines -- **Raw OpenAI/Anthropic SDK** for simple completions -- **PydanticAI** for structured outputs -- **Custom code** for business logic - -Each framework has its own tracing semantics: - -| Framework | Trace Concept | Step Naming | Metadata Schema | -|-----------|---------------|-------------|-----------------| -| LangChain | "Runs" with callbacks | `ChatOpenAI`, `RetrievalQA` | `run_id`, `parent_run_id` | -| LlamaIndex | "Events" with handlers | `LLMPredictEvent`, `RetrieveEvent` | `event_id`, `span_id` | -| OpenAI SDK | None (raw API calls) | N/A | N/A | -| PydanticAI | "Spans" | `agent_run`, `tool_call` | Custom schema | - -### Why This Is Painful - -1. **No unified view**: Your Grafana dashboard has 4 different trace formats that don't correlate -2. **Performance comparison impossible**: "Is LangChain or LlamaIndex faster for retrieval?" requires manual data normalization -3. **Debugging across boundaries**: An error in your LlamaIndex retriever called by a LangChain agent? Good luck following that trace -4. 
**Migration paralysis**: Teams avoid framework changes because observability would break - -### Why Current Solutions Don't Solve This - -- **LangSmith**: Only understands LangChain semantics natively -- **Langfuse**: Has integrations but each produces different trace shapes -- **OpenLLMetry**: Instruments each framework separately—no semantic unification -- **Datadog/New Relic**: Generic spans without LLM-specific meaning - -### Why TraceCraft Is Different - -```python -# All frameworks produce the SAME Step model: -from tracecraft.adapters.langchain import track_langchain -from tracecraft.adapters.llamaindex import track_llamaindex -from tracecraft.adapters.pydantic_ai import track_pydantic_ai - -# Unified trace tree regardless of source: -# AgentRun -# ├── Step(type=AGENT, name="orchestrator") # LangChain -# │ ├── Step(type=RETRIEVAL, name="rag") # LlamaIndex -# │ │ └── Step(type=LLM, name="embed") # Raw OpenAI -# │ └── Step(type=TOOL, name="validate") # PydanticAI -``` - -The canonical `Step` model normalizes: - -- **StepType enum**: `AGENT`, `LLM`, `TOOL`, `RETRIEVAL`, `EMBEDDING`, etc. -- **Consistent timing**: `started_at`, `ended_at`, `duration_ms` -- **Unified I/O**: `inputs`, `outputs`, `error` -- **LLM metadata**: `model`, `tokens`, `cost` (regardless of provider) - -One trace format. One dashboard. One debugging experience—even when your stack is heterogeneous. - ---- - -## Summary - -| # | Pain Point | Core Issue | TraceCraft Solution | -|---|------------|------------|---------------------| -| 1 | **Vendor Lock-in** | Instrumentation tied to platforms | Dual-dialect schema + pluggable exporters | -| 2 | **Backend Required** | No zero-config local debugging | Local-first with console/JSONL/HTML defaults | -| 3 | **Governance Afterthought** | PII redaction happens too late | Client-side pipeline with built-in redaction/sampling | -| 4 | **Multi-Framework Chaos** | No unified semantics across frameworks | Canonical Step model normalizes all frameworks | - ---- - -## Sources - -- [Best LLM Observability Tools in 2025 - Firecrawl](https://www.firecrawl.dev/blog/best-llm-observability-tools) -- [LLM Observability Best Practices - Maxim AI](https://www.getmaxim.ai/articles/llm-observability-best-practices-for-2025/) -- [AI Agent Observability - OpenTelemetry](https://opentelemetry.io/blog/2025/ai-agent-observability/) -- [Langfuse vs LangSmith Comparison - ZenML](https://www.zenml.io/blog/langfuse-vs-langsmith) -- [LangSmith Alternatives - Helicone](https://www.helicone.ai/blog/best-langsmith-alternatives) -- [Top LLM Observability Platforms - Agenta](https://agenta.ai/blog/top-llm-observability-platforms) diff --git a/dev/research/GEMINI-RESEARCH.md b/dev/research/GEMINI-RESEARCH.md deleted file mode 100644 index 44f35d1..0000000 --- a/dev/research/GEMINI-RESEARCH.md +++ /dev/null @@ -1,289 +0,0 @@ -The Unified LLM Observability Framework: Market Validation, Technical Architecture, and Strategic Opportunity Analysis -Executive Summary -The rapid proliferation of Large Language Models (LLMs) and the subsequent shift toward agentic workflows have precipitated a crisis in software observability. As engineering teams transition from experimental prototypes to production-grade systems, they encounter a fragmented landscape where debugging requires navigating incompatible telemetry standards, proprietary SDKs, and "walled garden" platforms. This report validates the thesis that a significant, widespread, and developer-painful gap exists in the current LLM/agent observability ecosystem. 
The necessity for a unifying framework—analogous to LiteLLM but for telemetry—is not merely a convenience but a structural requirement for the maturation of AI engineering. -Rigorous analysis of the current ecosystem, spanning proprietary platforms like LangSmith and Datadog to open-source solutions like Arize Phoenix and Langfuse, reveals that no single standard effectively unifies the "three pillars" of agent observability: traceability (execution graphs), evaluation (quality scores), and inference cost (token economics). Developers are currently forced to instrument their code specifically for a chosen vendor, creating significant technical debt and vendor lock-in. While OpenTelemetry (OTel) provides a robust transport standard, its semantic conventions for GenAI are nascent, inconsistently implemented across providers, and lack the high-level abstractions necessary for effective agent debugging. -The Verdict: Yes, a strong opportunity exists. However, the path to success does not lie in building "yet another observability dashboard." The winning opportunity is a Telemetry Abstraction Layer (TAL)—an open-source Python framework that functions as a "universal router" for observability data. This framework must abstract the act of instrumentation itself, allowing developers to write telemetry code once and route it to LangSmith, Datadog, localized debugging tools, or data lakes without code changes. -The recommended "wedge" for adoption is Local-First, Zero-Setup Debugging. The framework should initially position itself as a superior alternative to print statements—a "Logfire for everyone"—that instantly provides rich, structured console output and local visualization for agents, while silently buffering OTel-compliant traces that can be directed to enterprise backends when the project scales. - -1. Introduction: The Observability Crisis in the Agentic Era -The software industry is currently navigating a paradigm shift from deterministic code—where logic flows are explicit, linear, and predictable—to probabilistic AI systems, where "code" includes natural language prompts, stochastic model outputs, and autonomous agentic decisions. This shift has fundamentally broken the traditional paradigms of Application Performance Monitoring (APM). In a deterministic system, a CPU spike or a stack trace usually identifies the root cause of a failure. In an agentic system, failures are often semantic—hallucinations, infinite loops, or poor reasoning—requiring a completely new data model that captures prompts, completions, retrieval contexts, and tool outputs as first-class citizens. -1.1 The Shift from Latency to Semantics -In traditional microservices, the primary questions asked of an observability system are: "Is it up?" and "Is it fast?" The unit of measurement is the millisecond, and the unit of data is the span. In the era of Generative AI (GenAI), these questions remain, but they are superseded by questions of quality and correctness: "Is the answer true?" "Did the agent retrieve the correct document?" "Why did the model choose Tool A instead of Tool B?" -This shift necessitates a move from latency-centric observability to semantic-centric observability. A trace is no longer just a timeline of operations; it is a narrative of thought. It must capture the "internal monologue" of the agent, the specific context retrieved from a vector database, and the structured output returned to the user. 
Existing tools, built primarily for the former, are struggling to adapt to the latter, leading to a proliferation of specialized point solutions that fragment the developer experience. -1.2 The "Tower of Babel" Effect -As the market rushes to fill this gap, a "Tower of Babel" scenario has emerged. We have excellent tools for specific niches—LangSmith for LangChain debugging, Arize Phoenix for RAG evaluation, Datadog for infrastructure monitoring—but they speak different languages. A "Trace" in LangSmith is a different data structure than a "Trace" in Datadog. An "Evaluation Score" in Phoenix has no native home in New Relic. -This fragmentation forces engineering teams into a binary choice: -Vendor Lock-in: Commit to a single platform (e.g., LangSmith) and rewrite all instrumentation to match its proprietary SDKs, accepting that moving to another platform later will require a complete refactor. -Instrumentation Chaos: Attempt to cobble together multiple SDKs—ddtrace for infra, langsmith for debugging, openinference for evals—resulting in bloated codebases, performance overhead, and disjointed data that is impossible to correlate. -This report analyzes this fragmentation in detail, validates the specific pain points developers face, and proposes a unified architectural solution. -2. Landscape Mapping: The Fragmented Ecosystem -To understand the opportunity, we must first map the current terrain. The observability landscape for LLMs can be categorized into three distinct "walled gardens," each optimizing for different stakeholders and creating friction for interoperability. -2.1 The Framework-Native Gardens (e.g., LangSmith) -LangSmith, built by the creators of LangChain, represents the "Framework-Native" approach. It is designed to offer a "batteries-included" experience for developers within the LangChain ecosystem. -Data Model: LangSmith's core unit is the Run. A Run is a highly flexible, hierarchical object that represents an execution block—whether it is an LLM call, a Chain, a Tool, or a Retriever.1 It prioritizes inputs and outputs as key-value pairs, allowing for rich JSON structures to be stored and visualized. It treats the "Prompt" not just as a string, but as a versioned artifact linked to a central Hub. -Integration Path: The primary integration is via LangChain's internal callback system. Developers simply set an environment variable (LANGCHAIN_TRACING_V2=true), and the framework automatically emits rich traces.2 For non-LangChain code, it offers the @traceable decorator, which manually wraps functions to create Runs.3 -The Lock-in: While LangSmith is arguably the best tool for debugging complex agent graphs, its deep integration creates a "works best with LangChain" dynamic. Developers using custom orchestration logic, PydanticAI, or bare OpenAI SDKs find themselves manually reconstructing the rich trace structures that LangSmith gets "for free" with LangChain. The data model is highly specialized for chain-of-thought reasoning, making migration to general-purpose APMs difficult without significant loss of fidelity. -2.2 The Enterprise Infrastructure Giants (e.g., Datadog, New Relic) -Major APM vendors like Datadog, New Relic, and Dynatrace view LLMs as just another component in the distributed system stack. Their primary customer is the Platform Engineer or SRE (Site Reliability Engineer). -Data Model: These platforms rely on the Span, the fundamental atom of distributed tracing. An LLM call is just a Span with specific attributes (tags) attached. 
Datadog's LLM Observability module, for instance, categorizes spans into kinds like LLM or Workflow and captures inputs/outputs as span tags like @input.value or @output.value.4 -Integration Path: Integration is typically achieved via proprietary agents (e.g., the Datadog Agent) or SDKs (e.g., ddtrace). While they increasingly support OpenTelemetry ingestion, their "native" experience often requires using their specific instrumentation libraries to unlock full features like cost tracking and PII scanning.6 -The Mismatch: The data model is often a retrofit. Representing a complex, multi-turn chat session with intermediate reasoning steps using standard spans is awkward. Concepts like "Feedback" (a user clicking thumbs down) or "Evaluation" (an LLM judge scoring a response) are often forced into custom metrics or events that do not sit naturally alongside the trace data.8 This leads to a "flat" view of the world that misses the semantic richness required for debugging AI logic. -2.3 The Open-Source & Evaluation-First Tools (e.g., Arize Phoenix, Langfuse) -Tools like Arize Phoenix, Langfuse, and W&B Weave have emerged to fill the gap between the two groups above. They position themselves as "AI-Native" and often lead with evaluation capabilities. -Data Model: These tools often use a hybrid model. Arize Phoenix, for example, is built natively on OpenInference (an extension of OpenTelemetry) and treats traces as a means to an end: evaluation.9 They focus heavily on "Spans" that represent retrieval steps, capturing embedding vectors and document relevance scores. W&B Weave uses a Call object that emphasizes the versioning of the function definition itself ("Ops"), allowing developers to track how changes in code affect output quality.11 -Integration Path: They champion OpenTelemetry but often require specific "flavors" of it. To get the most out of Phoenix, developers are encouraged to use the openinference-instrumentation libraries.12 Langfuse offers its own SDKs that batch and flush events asynchronously.13 -The Fragmentation: While they offer better interoperability than closed platforms, they still suffer from the "instrumentation gap." A developer using Langfuse's SDK cannot easily switch to Phoenix without rewriting their instrumentation code. Each tool has its own opinion on how to structure a "Session" or how to represent a "Tool Call," leading to subtle but painful incompatibilities. -3. Validating Developer Pain: The Friction of Fragmentation -To justify the development of a new framework, we must move beyond hypothetical architecture and ground the opportunity in validated, recurring developer pain points. Research into developer communities, GitHub issues, and documentation gaps reveals four distinct categories of "hair-on-fire" problems that current solutions fail to adequately address. -3.1 The "Instrumentation Tax" and Vendor Lock-In -The primary friction point for developers is the high cost of manual instrumentation and the resulting vendor lock-in. -The Decorator Problem: To get visibility into custom agent logic, developers must decorate their functions. -LangSmith users must import traceable and wrap functions: @traceable(run_type="tool").3 -W&B Weave users must import weave and wrap functions: @weave.op().11 -Datadog users must use ddtrace: @tracer.wrap().6 -The Lock-in Mechanism: Once a codebase is saturated with vendor-specific decorators to capture inputs, outputs, and metadata, the cost of switching becomes prohibitive. This is "vendor lock-in by API contamination." 
A team that starts with LangSmith for prototyping and later wants to move to Datadog for production compliance faces a complete rewrite of their telemetry layer. -Evidence of Pain: Community discussions highlight the reluctance of teams to adopt specific tools because they "don't want to marry the SDK".14 The "OpenLLMetry" project 16 attempts to solve this via auto-instrumentation, but as we will discuss later, auto-instrumentation is often too brittle for production use. -3.2 The "Black Box" of Agentic Workflows -As agents become more autonomous—looping, self-correcting, and calling tools recursively—"flat" logging becomes useless. -The Loop Problem: An agent might execute a loop: Think -> Tool A -> Observe -> Think -> Tool B -> Observe -> Answer. Standard logs capture this as a sequence of disjointed events. Developers report that without a graph view that visually links these steps into a coherent "Thread," they are flying blind.14 -Cross-Framework Friction: The struggle is acute when mixing frameworks. A common pattern is using LangChain for orchestration and LlamaIndex for retrieval. The LangChain callbacks naturally trace the agent's high-level steps, but when the execution enters the LlamaIndex retriever, the trace often "breaks" or becomes opaque. The LangChain tracer doesn't "see" the internal spans of the LlamaIndex library unless explicitly bridged, leading to disjointed traces where context is lost at the library boundary.17 -3.3 High Cardinality and the Cost of Observability -LLM applications generate high-cardinality data by definition. Prompts and completions are unique strings, and metadata often includes session IDs, user IDs, and complex configuration parameters. -The Token Log Problem: Storing every prompt and completion in a traditional APM (like Datadog or Splunk) is prohibitively expensive. These platforms charge by ingested gigabyte or indexed span. A single RAG trace, containing retrieved documents and a long context window, can be massive.19 -The Sampling Gap: To manage costs, developers need Tail-Based Sampling. They want to say: "Log 100% of traces where an error occurred or where the user gave negative feedback, but only 1% of successful traces." Most current SDKs do not support this logic client-side; they send everything to the backend, incurring ingress costs before the data can be filtered.20 Implementing tail sampling usually requires deploying a separate OTel Collector service, which adds significant infrastructure complexity for an application developer. -3.4 Privacy, Compliance, and PII Nightmares -As GenAI moves into regulated industries (Finance, Healthcare), the presence of Personally Identifiable Information (PII) in prompts becomes a critical blocker. -The "Prompt Leak" Risk: Prompts often contain sensitive data. If a prompt containing a patient's name is sent to a SaaS observability provider without redaction, it constitutes a compliance violation (HIPAA, GDPR).22 -The Redaction Gap: While OpenTelemetry supports processors for redaction, configuring them often requires deep knowledge of the OTel Collector and complex YAML configurations.23 There is a lack of "application-level" controls where a developer can simply say redact_pii=True in the Python SDK and have it happen before the data leaves the process memory. Teams often default to building their own rudimentary logging to files simply to avoid sending sensitive prompts to the cloud.24 -4. 
Technical Deep Dive: The Semantic Gap -To understand why a simple "wrapper" hasn't already solved this problem, we must examine the technical incompatibility between the different "languages" of observability. The industry is currently struggling to reconcile the APM model (optimized for systems) with the GenAI model (optimized for semantics). -4.1 The Data Model Mismatch -At the core of the fragmentation is a disagreement on what a "Trace" actually is. -4.1.1 The APM Model: Spans and Latency -In the OpenTelemetry (OTel) world, the fundamental atom is the Span. -Structure: A Span is a timed operation with a Trace ID, Span ID, Parent ID, Start Time, End Time, and a collection of Attributes (Key-Value pairs).4 -Purpose: It is designed to answer: "Which operation was slow?" -Fit for GenAI: Poor. When an agent engages in a multi-turn conversation, it's not just a "latency" event. It's a stateful interaction. Representing a "Chat History" (a list of message objects) inside a Span Attribute requires serializing the JSON into a string. This makes querying and visualization difficult. You cannot easily ask, "Show me all traces where the 3rd user message contained the word 'refund'." -4.1.2 The Workflow Model: Runs and Objects -In the LangSmith and W&B world, the fundamental atom is a Run or Object. -Structure: A Run is a hierarchical object that treats inputs and outputs as first-class citizens, not just stringified attributes. It supports complex types like lists of documents, images, or function call arguments.1 -Purpose: It is designed to answer: "What did the agent do and say?" -The Conflict: Mapping a LangSmith "Run" to a Datadog "Span" is a "lossy" compression. You typically lose the structural fidelity of the retrieved documents or the precise structure of the function call arguments when forcing them into standard APM span tags. -4.2 The OpenTelemetry Reality Check -OpenTelemetry is the "standard" transport, but it is not yet a "standard" data model for GenAI. -Experimental Status: The OTel Semantic Conventions for GenAI are currently experimental (v1.3x).25 Attribute names like gen_ai.system.model vs. gen_ai.request.model are subject to change. -Implementation Lag: Different vendors implement different versions of the standard. Datadog might support v1.37+ 26, while other tools might rely on older conventions or custom namespaces (e.g., llm.*instead of gen_ai.*). -The "Span" Problem Redux: OTel is designed for operation latency. It is awkward for capturing "Chat History." Trying to stuff a 50-turn conversation context into a single Span Attribute violates best practices for tag cardinality and size, yet it is necessary for debugging agent memory issues.27 -4.3 The Evaluator's Dilemma -A critical component of AI engineering is Evaluation (Evals)—scoring the quality of an output. This introduces a "Feedback" data type that does not exist in traditional APM. -Feedback as Metadata: In LangSmith, a "Feedback" score is a separate entity linked to a Run ID. You can add feedback asynchronously (e.g., a human reviews the chat log 2 days later).28 -The OTel Gap: OpenTelemetry does not have a native concept of "Late-Arriving Feedback." If you want to attach a score to a trace that finished 2 days ago, standard distributed tracing systems struggle. They treat traces as immutable once the window closes. 
-Workarounds: Tools like Arize Phoenix handle this by treating "Evaluations" as a separate data stream that they virtually join with traces in their own backend UI.29 However, this "virtual join" logic is proprietary. There is no standard way to emit an "OTel Metric" that essentially says, "Update the quality score of Trace X to 0.5." -5. Competitive Analysis: Why hasn't this already won? -Identifying the opportunity requires understanding why existing attempts have not yet standardized the market. Several open-source initiatives have aimed to solve aspects of LLM observability, but each has limitations that prevent it from becoming the universal standard. -5.1 OpenLLMetry (Traceloop) -OpenLLMetry is arguably the closest existing solution to the proposed framework. Built on OpenTelemetry, it provides an SDK for Python and Node.js that auto-instruments common LLM libraries.16 -Strengths: -Standards-Based: Strictly adheres to OpenTelemetry semantic conventions. -Broad Support: Instruments OpenAI, Anthropic, LangChain, LlamaIndex, Chroma, Pinecone, etc. -Vendor Neutral: Can export to any OTLP destination (Honeycomb, Datadog, Dynatrace). -Weaknesses (The Gap): -"Magic" over Control: It relies heavily on monkey-patching (auto-instrumentation). While convenient, this is fragile. If a library updates its internal method names, the instrumentation silently fails. Developers building mission-critical agents often prefer explicit decorators to ensure stability.18 -Lack of Local Developer Experience: OpenLLMetry is focused on exporting data. It does not provide a robust local debugging experience (like a rich console logger). It assumes you have a backend set up. -Evaluation Gap: It focuses on tracing, not evaluation. It treats evals as a separate concern, whereas the proposed framework argues that tracing and evaluation must be unified in the instrumentation layer. -5.2 OpenLIT -OpenLIT is another OTel-native wrapper, focusing on GPU metrics and LLM performance.32 -Strengths: -Infrastructure Focus: Great for monitoring self-hosted models (Kubernetes, GPU stats). -Weaknesses (The Gap): -Niche Focus: Less emphasis on the "Agentic" reasoning logic (prompts/tools) and more on the "Infrastructure" layer (latency/throughput). -Adoption: Less community traction compared to Traceloop or native vendor SDKs. -5.3 LangChain / LangSmith (The "Gorilla") -LangChain's built-in tracing is the default for a huge portion of the market. -Strengths: -Ubiquity: If you use LangChain, it's already there. -Richness: The trace data is semantically perfect for agents. -Weaknesses (The Gap): -The "Walled Garden": It is extremely difficult to use LangChain's tracing system outside of LangChain. If you write a custom script using just openai and pydantic, you have to jump through hoops to get that data into LangSmith's format without adopting the whole framework.15 -Cost: LangSmith can become expensive for high-volume applications, driving users to seek open alternatives.35 -6. The Opportunity: A "Telemetry Abstraction Layer" (TAL) -The landscape analysis confirms that the market is stuck in a "local optimum." We have tools that are good at specific things (LangSmith for agents, Datadog for infra, Phoenix for evals), but no tool that unifies them at the instrumentation layer. -The Hypothesis: There is a massive opportunity for a "LiteLLM for Observability." 
-Just as LiteLLM allows developers to swap OpenAI for Anthropic by changing one line of configuration, this new framework should allow developers to swap LangSmith for Datadog (or use both) by changing one environment variable. -6.1 Defining the Category -We define this new category as a Telemetry Abstraction Layer (TAL). -A TAL must: -Decouple Instrumentation from Destination: The code that captures the trace (@trace) should not know or care where the trace is going. -Unify Data Models: It must internally represent traces in a canonical superset format that can be losslessly converted to OTel Spans, LangSmith Runs, or W&B Calls. -Provide "Batteries-Included" Governance: PII redaction, cost calculation, and tail sampling should be built-in features of the SDK, not external infrastructure concerns. -Prioritize Developer Experience (DX): It must offer a local debugging experience that is so good developers use it before they even care about production observability. -7. Proposed Framework Architecture: "TelemetryRouter" -Based on the analysis, we propose the architecture for TelemetryRouter (working title). This framework is designed to be the "Universal Adapter" for AI observability. -7.1 Architecture Blueprint -The framework consists of three decoupled components: -7.1.1 Layer 1: The Unified Instrumentation SDK (The "Input") -This is the only part of the framework the developer touches. It replaces vendor-specific SDKs. -The Universal Decorator: @tr.trace -Automatically captures function name, inputs (args/kwargs), output (return value), and execution time. -Smart Context: Automatically detects if it's running inside a parent trace (using Python contextvars) to construct the execution tree. It handles async context propagation out of the box, solving the common "broken trace" issue in async Python.36 -The "Logfire-style" API: -tr.info("Step completed", metadata={...}) -tr.eval(name="relevance", score=0.9, rationale="...") -> Crucial Differentiator: Unifying Logging and Evaluation. -Framework Adapters: -LangChainCallbackHandler: A drop-in class that funnels LangChain events into TelemetryRouter. -LlamaIndexInstrumentor: Hooks into LlamaIndex's event bus. -7.1.2 Layer 2: The Processing Pipeline (Middleware) -Before data leaves the application, it passes through a chain of processors. This addresses the "Governance" and "Cost" pain points. -PII Guardrail: -Configuration: pii_rules=["email", "credit_card", regex_pattern] -Action: Automatically redacts matching strings in inputs/outputs before they are stored in the buffer. -Value: Solves the compliance blocker client-side. -Tail Sampler (Client-Side): -Configuration: sample_rate=0.1, always_keep_errors=True. -Action: Buffers the trace in memory. When the root span ends, the sampler decides whether to flush it to the exporter or drop it. -Value: Drastically reduces ingestion costs for Datadog/Splunk without missing error traces. -Token Counter: -Uses tiktoken to estimate token usage for every text field, attaching gen_ai.usage.input_tokens attributes automatically, even if the LLM provider didn't return them. -7.1.3 Layer 3: The Routing & Export Layer (The "Output") -The router manages "Exporters." A developer can configure multiple exporters simultaneously (Dual-Writing). -The "Console" Exporter (Default): -Prints a beautiful, color-coded, hierarchical tree of the trace to the terminal. -Replaces the need for print debugging. -Key feature: Collapsible sections (if supported by terminal) or distinct indentation for nested agent steps. 
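To make the console exporter idea tangible, here is a hedged sketch of a tree renderer over a toy step model, assuming the `rich` library (which the later implementation blueprint also proposes); the `Step` fields and kind names are illustrative, not a fixed schema.

```python
from dataclasses import dataclass, field
from typing import Optional

from rich.console import Console
from rich.tree import Tree


@dataclass
class Step:
    # Illustrative canonical step: name, kind, duration, and nested children.
    name: str
    kind: str
    duration_ms: float
    children: list["Step"] = field(default_factory=list)


def render(step: Step, parent: Optional[Tree] = None) -> Tree:
    label = f"[bold]{step.name}[/bold] ({step.kind}, {step.duration_ms:.0f} ms)"
    node = Tree(label) if parent is None else parent.add(label)
    for child in step.children:
        render(child, node)
    return node


run = Step("my_agent", "AGENT", 1840.0, [
    Step("plan", "LLM", 620.0),
    Step("search_docs", "RETRIEVAL", 410.0),
    Step("answer", "LLM", 790.0),
])
Console().print(render(run))
```

The OTLP and vendor-native exporters described next would walk the same internal model and only serialize it differently.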
-The "OTLP" Exporter:
-Maps the internal trace model to standard OpenTelemetry Span and LogRecord objects.
-Sends to any OTLP endpoint (Honeycomb, Phoenix, Jaeger, Datadog Agent).
-The "Vendor-Native" Exporters (Optional):
-LangSmithAdapter: Maps the internal model to Run objects and sends to the LangSmith API (bypassing the need for LangChain).
-WandBExporter: Maps to W&B Trace format.
-7.2 The "Canonical Trace Schema"
-To make this work, TelemetryRouter relies on an internal Canonical Schema that acts as a superset of all others.
-Span Object:
-id, parent_id, name, start/end_time.
-kind: LLM, CHAIN, TOOL, RETRIEVAL, AGENT. (Semantically richer than OTel's CLIENT/INTERNAL).
-attributes: Dictionary of metadata.
-events: List of logs or feedback scores attached to the span.
-inputs/outputs: Preserved as rich objects (JSON), not stringified (until export time).
-7.3 What We Will NOT Build
-To maintain scope and ensure success, the framework must explicitly exclude certain features:
-No Backend UI: We will not build a dashboard. We route to existing dashboards (Phoenix, LangSmith, Datadog).
-No Model Proxy: We are not LiteLLM. We do not proxy the actual API calls (unless using auto-instrumentation). We strictly handle the telemetry side.
-No Training Pipeline: We focus on inference and agents, not model training curves (TensorBoard territory).
-8. Adoption Strategy: The "Wedge"
-The graveyard of open-source projects is full of "better standards" that nobody used. To win, we need an Adoption Wedge—a specific use case that offers 10x value for 1 minute of effort.
-8.1 The Wedge: "Icecream for Agents"
-The initial go-to-market strategy should not focus on "Enterprise Observability" but on "Local Debugging."
-The Hook: "Stop debugging your agents with print()."
-The Solution: A library that, with one line of code, prints a beautiful, collapsible tree view of the agent's execution to the terminal (or a local Streamlit/HTML file).
-The Experience (Python):
-
-    import telemetry_router as tr
-
-    @tr.trace
-    def my_agent(query):
-        ...
-
-Running this script instantly outputs a beautiful, readable log hierarchy.
-Value: Instant gratification. No API keys, no servers, no Docker.
-8.2 The Bridge: "Production Ready in One Env Var"
-Once the developer is hooked on the local debugging experience, they deploy the agent.
-The Pitch: "Don't change your code. Just set export TELEMETRY_BACKEND=langsmith (or Datadog)."
-The Magic: The exact same @tr.trace decorators now silently stream OTel data to the enterprise backend (a configuration sketch follows after 8.3).
-Migration Story: This makes the framework the safest choice for a new project. You aren't choosing a backend yet. You are choosing a neutral instrumentation layer that postpones the vendor decision.
-8.3 The Top 3 Integration Targets
-LangSmith: Because it is the current "gold standard" for visualization. If we can promise "LangSmith-quality traces without LangChain code," we win a huge segment of the market.28
-OpenTelemetry (Generic): Because it unlocks the entire enterprise ecosystem (Datadog, Honeycomb, AWS X-Ray) for free.
-Pydantic / Instructor: These are the rising stars of the "No-Framework" movement. Building a native integration for PydanticAI (similar to logfire) will capture the bleeding-edge developers who are rejecting LangChain.37
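The following sketch shows how the "one environment variable" promise of 8.2 could work mechanically. `telemetry_router`, `TELEMETRY_BACKEND`, and the exporter classes are the working names used in this document rather than an existing package, and the endpoints are placeholders.

```python
import os
from typing import Protocol


class Exporter(Protocol):
    def export(self, run: dict) -> None: ...


class ConsoleExporter:
    def export(self, run: dict) -> None:
        print(f"[console] {run['name']} ({len(run.get('steps', []))} steps)")


class OTLPExporter:
    def __init__(self, endpoint: str) -> None:
        self.endpoint = endpoint

    def export(self, run: dict) -> None:
        # A real implementation would translate the run into OTLP spans
        # and POST them to self.endpoint.
        print(f"[otlp] would send {run['name']} to {self.endpoint}")


def exporter_from_env() -> Exporter:
    """Pick the destination from TELEMETRY_BACKEND; instrumentation code never changes."""
    backend = os.environ.get("TELEMETRY_BACKEND", "console")
    if backend == "console":
        return ConsoleExporter()
    if backend == "langsmith":
        # Placeholder endpoint; the real value comes from the backend's docs.
        return OTLPExporter("https://<langsmith-otlp-endpoint>/v1/traces")
    if backend == "datadog":
        # Placeholder: a local OTel Collector or Datadog Agent OTLP endpoint.
        return OTLPExporter("http://localhost:4318/v1/traces")
    raise ValueError(f"Unknown TELEMETRY_BACKEND: {backend!r}")
```

The decorators from 8.1 never change; only the object returned by `exporter_from_env()` does, which is what postponing the vendor decision means in practice.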
-8.4 Building Community Trust
-To avoid being seen as "just another tool," the project should:
-Be clearly vendor-neutral: Governance should not be owned by a single observability vendor.
-Prioritize Docs: "How to" guides for every major backend (How to send traces to Datadog, How to send to Phoenix, etc.).
-Plugin Architecture: Allow the community to write their own Exporters (e.g., a "Slack Exporter" that posts alerts on errors).
-9. Conclusion & Verdict
-The research confirms a critical gap in the LLM observability stack. The market currently forces a binary choice: ease of use with lock-in (LangSmith) or vendor-neutrality with high friction (raw OpenTelemetry).
-The opportunity for a new open-source framework is high, provided it positions itself as an abstraction layer rather than a destination. By focusing on the "Local Debugging" wedge—giving developers immediate value on their laptop—the framework can gain the adoption density required to become the de facto standard for routing telemetry in the GenAI era.
-Recommendation: Build the "TelemetryRouter" (working title).
-Scope: Inference-time tracing and evaluation logging.
-Differentiation: "Write once, debug locally, route anywhere."
-First Release: A Python SDK that replaces print with structured, colorful CLI traces and includes a simple OTLP exporter.
-This approach solves the "Fragmentation Crisis" by accepting that fragmentation is inevitable and providing the "universal adapter" required to navigate it.
-Evidence Table:
-
-| Pain Point | Description | Affected Personas | Why Current Solutions Fail | Evidence Source |
-| --- | --- | --- | --- | --- |
-| Vendor Lock-in | Rewriting telemetry code to switch backends. | App Developers, Platform Engineers | Vendor SDKs are proprietary; no common interface. | 3 |
-| High Cardinality Cost | Storing full prompts/responses is expensive. | Platform Engineers, FinOps | Lack of client-side tail sampling in standard SDKs. | 19 |
-| Privacy/PII | Risk of leaking sensitive prompts to SaaS. | Security Engineers, App Developers | Redaction config is complex (OTel Collector) or late (Server-side). | 22 |
-| Local Debugging | Lack of good local visualization tools. | App Developers, AI Engineers | Tools require running servers (Phoenix) or SaaS (LangSmith). | 40 |
-| Cross-Framework Tracing | Traces "break" between LangChain and LlamaIndex. | AI Engineers | Frameworks use isolated callback systems. | 17 |
-| Evaluation Disconnect | Scores (Evals) are not linked to Traces in APM. | AI Engineers, Data Scientists | OTel lacks native "Feedback" semantics. | 28 |
-
-Works cited
-Dataset prebuilt JSON schema types - Docs by LangChain, accessed January 19, 2026,
-Introducing OpenTelemetry support for LangSmith - LangChain Blog, accessed January 19, 2026,
-LangSmith Tracing Deep Dive — Beyond the Docs | by aviad rozenhek | Medium, accessed January 19, 2026,
-LLM Observability Terms and Concepts - Datadog Docs, accessed January 19, 2026,
-HTTP API Reference - LLM Observability - Datadog Docs, accessed January 19, 2026,
-LangChain - Datadog Docs, accessed January 19, 2026,
-OpenTelemetry in Datadog, accessed January 19, 2026,
-10 Best LLM Monitoring Tools to Use in 2025 (Ranked & Reviewed) - ZenML Blog, accessed January 19, 2026,
-Best LLM Observability Tools in 2025 - Firecrawl, accessed January 19, 2026,
-OpenTelemetry (OTEL) Concepts: Span, Trace, Session - Arize AI, accessed January 19, 2026,
-Tracing Basics - Weights & Biases Documentation - Wandb, accessed January 19, 2026,
-phoenix/tutorials/evals/evaluate_agent.ipynb at main · Arize-ai/phoenix - GitHub, accessed January 19, 2026,
-Get Started with Tracing - Langfuse, accessed January 19, 2026,
-How are people managing agentic LLM systems in production?
: r/LangChain - Reddit, accessed January 19, 2026, -LangSmith reviews, pricing, and alternatives (December 2025), accessed January 19, 2026, -What is OpenLLMetry? - Dynatrace, accessed January 19, 2026, -LlamaIndex vs LangChain: Which Framework Is Best for Agentic AI Workflows? - ZenML, accessed January 19, 2026, -Feature: re-write Langchain instrumentation to use Langchain Callbacks · Issue #541 · traceloop/openllmetry - GitHub, accessed January 19, 2026, -High Cardinality in Metrics: Challenges, Causes, and Solutions - Sawmills.ai, accessed January 19, 2026, -opentelemetry-collector-contrib/processor/tailsamplingprocessor/README.md at main - GitHub, accessed January 19, 2026, -Sampling - OpenTelemetry, accessed January 19, 2026, -Mastering the OpenTelemetry Redaction Processor - Dash0, accessed January 19, 2026, -opentelemetry-collector-contrib/processor/redactionprocessor/README.md at main - GitHub, accessed January 19, 2026, -Function piiRedactionMiddleware - LangChain Docs, accessed January 19, 2026, -Semantic conventions for generative AI systems - OpenTelemetry, accessed January 19, 2026, -OpenTelemetry Instrumentation - LLM Observability - Datadog Docs, accessed January 19, 2026, -An Introduction to Observability for LLM-based applications using OpenTelemetry, accessed January 19, 2026, -The best LLM evaluation tools of 2026 | by Dave Davies | Online Inference - Medium, accessed January 19, 2026, -Running Evals on Traces - Phoenix - Arize AI, accessed January 19, 2026, -OpenTelemetry for GenAI and the OpenLLMetry project, accessed January 19, 2026, -Manual vs. auto instrumentation OpenTelemetry: Choose what's right - Cribl, accessed January 19, 2026, -Overview - OpenLIT, accessed January 19, 2026, -OpenLit: The Unified Observability Layer for LLM Applications | by vishal acharya | Medium, accessed January 19, 2026, -What are your biggest pain points when debugging LangChain applications in production?, accessed January 19, 2026, -Comparing LLM Evaluation Platforms: Top Frameworks for 2025 - Arize AI, accessed January 19, 2026, -The Hidden Gaps in AI Agents Observability | by Ronen Schaffer | Medium, accessed January 19, 2026, -Pydantic AI - Pydantic AI, accessed January 19, 2026, -How Hyperlint Cut Review Time by 80% with Logfire - Pydantic, accessed January 19, 2026, -How do you balance high cardinality data needs with observability tool costs? - Reddit, accessed January 19, 2026, -What I learned wiring observability (OpenTelemetry) tracing into Vercel AI SDK routes : r/LocalLLaMA - Reddit, accessed January 19, 2026, -Debugging 101: Replace print() with icecream ic() - YouTube, accessed January 19, 2026, diff --git a/dev/research/OpenAI-research1.md b/dev/research/OpenAI-research1.md deleted file mode 100644 index 6f69a37..0000000 --- a/dev/research/OpenAI-research1.md +++ /dev/null @@ -1,73 +0,0 @@ -Bridging the LLM Observability Gap Across Providers - -The Fragmented Landscape of LLM Observability - -Modern LLM applications and agent systems involve complex, multi-step reasoning chains that can be hard to debug without proper visibility. An agent may invoke tools, make API calls, and produce intermediate reasoning that remains hidden from the developer . In such cases, tracing each step – capturing prompts, model responses, tool calls, etc. – is crucial for understanding failures or anomalies  . - -Today’s observability landscape for LLMs is highly fragmented. 
Different platforms and providers offer their own proprietary tracing solutions: - • LangChain/LangSmith (LangChain’s suite): LangSmith is LangChain’s managed observability platform tightly integrated with the LangChain ecosystem. It provides rich dashboards, alerts, and evaluation tools, but it’s LangChain-native and shines primarily if you are “all-in” on the LangChain stack . Its deep integration means LangChain users get automatic tracing, but developers using other frameworks have to instrument separately or cannot easily leverage LangSmith’s features  . - • Cloud Provider Solutions: Azure’s AI Foundry and AWS’s Bedrock AgentCore now include built-in agent observability. Azure Foundry’s observability uses OpenTelemetry (OTel) under the hood with custom semantic conventions for multi-agent systems , exporting traces to Azure Monitor. AWS’s AgentCore has a built-in telemetry service as well. However, these tend to be cloud-specific. For example, AWS AgentCore’s default tracing can be turned off and rerouted via OTel if you prefer an external tool like Langfuse  . This demonstrates that even cloud solutions acknowledge the need to send traces to third-party observability backends. - • Third-Party Observability Platforms: A flurry of LLM observability tools emerged in 2023–2024, including both open-source and commercial offerings. Notable examples are Langfuse, Arize Phoenix, Comet Opik, Helicone, OpenLLMetry, PostHog LLM Analytics, and others  . These platforms often provide a dedicated UI to inspect LLM calls and chain-of-thought traces. For instance, Langfuse (open-source, MIT-licensed) offers detailed hierarchical traces of prompts, tools, and model generations, along with metrics for token usage, latency, and cost  . Langfuse is framework-agnostic and built on open standards (it “comes with strong OpenTelemetry support” ), making it flexible to integrate across many LLM libraries. In contrast, LangSmith (by LangChain) uses a custom “Run Tree” trace model tied to LangChain’s internals . Arize’s Phoenix is another open platform (albeit Elastic License) that similarly provides deep tracing, evaluation, and even automatic instrumentors for popular libraries, built on an OTel foundation . Comet’s Opik (Apache-2.0) likewise focuses on tracing, logging, and evaluating LLM apps, with a simple Python SDK (@opik.track decorator) for quick integration . Many of these tools can be self-hosted for free, since developers have voiced frustration at purely paid SaaS offerings in this space  . - • General APM and ML Ops Tools: Traditional observability and ML tracking services are also part of the picture. Datadog, New Relic, Honeycomb, etc., are APM platforms now adding LLM-specific support. For example, Datadog initially required using their proprietary SDK or manually annotating spans to monitor LLM workloads, which meant extra integration effort for teams already using open standards . (This is changing – more on standardization below.) Meanwhile, ML experiment trackers like Weights & Biases (W&B) and MLflow were not originally built for capturing detailed prompt traces. W&B has introduced a “Prompts” or W&B Traces feature (via the weave library) to log LLM interactions (prompts, responses, token counts, etc.) into the W&B UI . This is powerful if you’re in the W&B ecosystem, but again it’s a separate integration (e.g. calling weave.init() and using W&B’s methods  ). 
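For reference, the W&B path just mentioned looks roughly like this; a hedged sketch assuming the `weave` package, a configured W&B account, and a placeholder project name.

```python
import weave

weave.init("my-llm-project")  # placeholder project; requires W&B credentials


@weave.op()
def summarize(text: str) -> str:
    # Stand-in for a real model call; the decorator records the call's
    # inputs and outputs as a trace in the W&B UI.
    return text[:80]


summarize("Observability for agents should not require a rewrite per backend.")
```
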
MLflow would likewise require custom logging code to record each prompt/response as artifacts or metrics – no built-in LLM trace view exists by default. - -In summary, each observability provider or framework today tends to require a unique setup and instrumentation approach. If you start with one (say, LangSmith) and later want to switch to another (say, Datadog or W&B), you likely must reconfigure your logging and tracing calls in code, or even change libraries entirely. This siloed landscape is cumbersome for developers who want flexibility. As an example, before recent improvements, a team using Datadog for APM but trying out OpenTelemetry had to “maintain parallel instrumentation paths” – essentially duplicate their effort – because Datadog’s LLM monitoring and generic OTel tracing were not natively unified . The lack of a common standard meant vendor lock-in for observability: each tool spoke its own language. - -Known Pain Points and Gaps in the Current Ecosystem - -Given this fragmentation, several pain points have become evident (and are voiced by developers): - • Redundant Integrations & Vendor Lock-In: Perhaps the biggest gripe is having to instrument or configure observability over and over for each platform. Without a unified interface, teams end up writing one set of logging calls for LangChain/LangSmith, another for OpenTelemetry or Datadog, yet another for W&B, etc. This duplication wastes time and introduces inconsistency. Datadog explicitly acknowledged this problem – prior to adopting a standard, teams using OTel alongside Datadog had to do extra work or “bypass” parts of their pipeline . In other words, there was no “write once, run anywhere” for LLM traces. This kind of lock-in also makes it risky to adopt a new observability tool; nobody wants to rip out their tracing code each time they change providers. - • Lack of Standard Schema: Until recently, there was no universal agreement on how to represent LLM interactions in telemetry data. Each tool defined its own span attributes or data model for prompts, token counts, model names, etc. This made it hard to compare or merge data from different systems. As a result, observability data was siloed – one couldn’t easily export traces from LangSmith into another system, or combine W&B logs with Datadog traces, because the formats differed. This gap is now being addressed by industry-wide efforts to define OpenTelemetry semantic conventions for generative AI . In late 2024, a standardized vocabulary for LLM operations (covering prompts, completions, token usage, tool calls, model info, errors, etc.) was introduced in OpenTelemetry v1.37. These GenAI semantic conventions provide a consistent schema for traces/events across any framework or vendor  . The work was driven by a community including contributors from OpenLLMetry (Traceloop) and major cloud providers . The benefit is interoperability: with common attribute names and structure, an LLM trace collected in one library can be understood by another tool . This standardization is very new – meaning most current frameworks are only beginning to adopt it (or still use their proprietary schemas). The gap here was clearly felt and is now being filled, but many existing tools in the wild pre-date the standard. - • Multiple Dashboards & Siloed Analysis: Because each observability solution often comes with its own UI or database, teams sometimes find themselves jumping between multiple dashboards to get a full picture. 
For instance, one might use LangSmith’s UI to debug agent logic, but use Datadog’s APM for overall service metrics, and maybe W&B for tracking prompt experiment results. This separation hinders holistic analysis. An ideal scenario would allow viewing LLM traces alongside other application telemetry. In fact, that’s a selling point of using a common standard – e.g. Datadog’s vision is to let you see “GenAI traces alongside your existing APM traces, logs, metrics” in one place . The pain point here is that without integration, important context can be lost. (Did a spike in latency in the web app correspond to a specific prompt that went haywire? If LLM observability lives in a different silo, correlating that is harder.) - • Onboarding and Configuration Overhead: Each new tool comes with its own SDK, environment variables, and configuration nuances. While many LLM observability SDKs strive to be easy, there’s still a cognitive load to learning each one. For example, LangSmith requires using LangChain’s tracing callbacks or decorators; Opik uses its decorator and Comet integration; W&B Weave auto-patches certain libraries but you must initialize it properly  ; OpenLLMetry suggests using their SDK to instrument code , etc. If a developer is experimenting with multiple observability solutions (not uncommon in this fast-evolving space), this means repeated setup and learning curve. One community complaint has been the profusion of new tools “launching every week” and uncertainty about which ones are actually used in practice . The plethora of options can itself be a pain point – a unified approach could reduce analysis paralysis by providing a common layer that works with many backends. - • Cost and Closed-Source Frustrations: Many early entrants in LLM observability offered limited free tiers or required paid plans for meaningful usage volumes, which frustrated indie developers and startups. A Reddit discussion from 2024 highlighted this, with users complaining: “Why the hell is everything so expensive?”  and citing tools that gate higher event counts behind steep monthly fees . While some tools branded themselves as “open source,” they sometimes required a cloud account or had features gated unless you paid (the so-called “open core” model). This led to pushback – for example, one developer responded by pointing to Opik and Phoenix as truly open solutions that one can self-host without artificial limits  . The broader point is that there’s demand for an open, vendor-neutral solution that doesn’t lock you into a particular platform or pricing model. If an observability library can be used freely and flexibly, it would address this pain. The popularity of Langfuse’s open-source edition (nearly 18k GitHub stars by late 2025 ) and OpenLLMetry’s quick integration into the OTel standard shows the community’s appetite for open solutions. - -In summary, the known gaps in today’s LLM observability are: a lack of unified standards (historically), heavy integration and reconfiguration effort when switching providers, fragmentation of data/views, and frustration with closed or costly tools. These pain points are well-recognized by both users and vendors – indeed, much of the recent movement in the field (like OTel’s new conventions, or Datadog adding native OTel support) is a response to these issues. - -Emerging Standards and Convergence (A Path Forward) - -The good news is that the industry is actively converging on solutions to close these gaps. 
The focal point of this convergence is OpenTelemetry (OTel) – an open standard for telemetry (traces, metrics, logs) that many are extending to cover LLM use cases. Recent developments include: - • OpenTelemetry’s GenAI Semantic Conventions: In late 2024, the OpenTelemetry project officially accepted a set of semantic conventions specifically for generative AI/LLM observability  . These conventions (partly contributed by the team behind OpenLLMetry) define standard attribute names and span structures for things like prompts, model IDs, token counts, tool invocations, etc. For example, attributes like gen_ai.request.prompt, gen_ai.response, gen_ai.usage.input_tokens, gen_ai.provider.name, and span operation names like agent_run or tool_call are standardized  . The establishment of this standard is a key enabler for cross-platform observability – it lets different tools speak the same “language” when recording LLM interactions. As a Dynatrace announcement put it, this brings “consistency across platforms” and a “unified approach to monitoring and troubleshooting, regardless of the platform or environment”, greatly simplifying development and maintenance of LLM apps . In practical terms, an agent trace recorded with these conventions can be exported to any OTel-compatible backend, and it will carry rich, self-describing information (prompt text, model name, latency, error reasons, etc.) that any observer can parse. - • Adoption by Vendors and Tools: We are already seeing major providers embrace these standards. Datadog, for instance, announced native support for OTel’s GenAI semantic schema in its LLM Observability product . This means you can now instrument your application once using the OTel API (with the GenAI conventions), and send that data to Datadog without any Datadog-specific code – Datadog will recognize the standard fields and populate its LLM dashboards accordingly  . In their words, teams can “instrument once with OTel…and analyze anywhere,” eliminating duplicate effort  . Similarly, Honeycomb and other APM vendors are adding support for these conventions, since it allows them to ingest LLM traces just like any other span. -Cloud platforms are also aligning: Microsoft’s Azure Foundry not only uses OTel, but actively contributed new multi-agent observability extensions to OTel (in collaboration with Cisco) . These include conventions for things like agent-to-agent interactions and tool usage within agents   – crucial for standardizing complex agent workflows. Those conventions have been integrated into Azure’s SDKs (Semantic Kernel, Azure packages for LangChain/LangGraph, etc.) so that customers can get “unified observability for agentic systems” across frameworks with minimal fuss . AWS’s Bedrock AgentCore, as noted, is built on OTel too (via AWS Distro for OTel). Although it had its own default trace pipeline, the fact that one can disable it and swap in an external OTel endpoint (e.g. Langfuse’s collector) shows the interoperability OTel enables  . We also see open-source tools embracing the standard: Langfuse’s latest versions aim for compliance with the official OTel GenAI schema , and Arize Phoenix’s instrumentation (called OpenInference) was designed to be OTel-compatible from the start . - • Integration and Extensibility: The move to OTel means that pluggable exporters and receivers can be used for LLM telemetry, just as with any other trace data. 
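As a concrete illustration of that pluggability, the sketch below registers two destinations on one `TracerProvider`, assuming the `opentelemetry-sdk` and `opentelemetry-exporter-otlp` packages; the collector endpoint is a placeholder.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
    SimpleSpanProcessor,
)

provider = TracerProvider()
# Destination 1: human-readable spans on stdout for local debugging.
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
# Destination 2: any OTLP-compatible backend (collector, Datadog Agent, Phoenix, ...).
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("llm-app")
with tracer.start_as_current_span("agent_run"):
    pass  # instrumented application code goes here
```
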
This is powerful: for example, a single instrumented codebase could export trace data to multiple destinations – you might send it to a self-hosted open-source backend and to a vendor cloud service for comparison. OpenLLMetry’s SDK explicitly supports this kind of flexibility, allowing data to be sent to “a range of supported destinations from Traceloop to Datadog to Honeycomb” . In effect, OTel serves as a common pipeline. There are OTel exporters or collectors for dozens of backends (Jaeger, Zipkin, Datadog, Splunk, etc.), so once your LLM traces are in OTel format, you can choose where to ship them with configuration, not code changes. Even hybrid scenarios become easier (e.g., send basic metrics to Datadog, but full detailed traces to a specialized LLM observability DB for deeper analysis). This standardization is breaking the one-to-one coupling between “how you instrument” and “where you send the data.” - • Continued Gaps: Despite this progress, it’s important to note that simply having a standard doesn’t automatically solve all usability issues. OTel’s own APIs can be low-level and verbose for application developers. Setting up an OTel SDK, configuring exporters or an OTel Collector, and making sure you use the correct semantic attributes can be non-trivial, especially if you’re not already familiar with observability frameworks. There can also be subtle idiosyncrasies – e.g., ensuring sensitive data (prompts) are properly redacted, or dealing with performance overhead of tracing each token stream. In short, there is still room for a higher-level framework that makes it dead simple to adopt these standards in the context of LLM apps. We are beginning to see moves in this direction: Arize Phoenix, for example, provides “automatic instrumentors” for many libraries (so it can capture your OpenAI or LangChain calls without you writing manual trace code) . W&B’s Weave auto-patching is another approach to reduce developer effort . But these solutions are each tied to their ecosystem (Phoenix to Arize’s platform, Weave to W&B). The broader opportunity is to create an open-source, framework-agnostic library that simplifies OTel usage for LLMs. - -Opportunity: A Unified Open-Source Observability Framework for LLMs - -Given the pain points and the evolving standards discussed, there is a clear opportunity for a new open-source framework that bridges the observability gap in a way that’s vendor-neutral and developer-friendly. The goal of such a project would be to let users “configure observability once” and seamlessly swap or integrate providers, much like how LiteLLM provides a single interface to many LLM providers. What might this look like in practice, and what specific focus would give it significant value? - -Key Focus Areas for the Framework: - • Standardized Core, Specialized for LLMs: At its heart, the framework should implement the emerging OpenTelemetry GenAI semantic conventions . By using the standardized trace schema (spans for prompts, tools, etc., with the proper attributes), it ensures that any data captured will be understood by other tools supporting the standard. This addresses the schema standardization gap. But importantly, the framework can go beyond raw OTel by adding high-level abstractions for common LLM/agent patterns. For example, it could provide an InstrumentedChain class or decorators for LLM calls that automatically create spans for each step (prompt, model completion, tool call) with all the metadata. 
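A hedged sketch of what such a decorator could look like is shown below; `trace_llm`, the attribute keys, and the expectation that the wrapped function returns a dict with model and token fields are illustrative assumptions, with only the public OpenTelemetry API taken as given.

```python
import functools
import json

from opentelemetry import trace

tracer = trace.get_tracer("unified-llm-sdk")  # illustrative tracer name


def trace_llm(func):
    """Wrap an LLM call in a span carrying GenAI-style attributes."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with tracer.start_as_current_span(func.__name__) as span:
            span.set_attribute("gen_ai.operation.name", "chat")
            # Exceptions escaping the block are recorded on the span by default.
            result = func(*args, **kwargs)
            # Pull standardized fields out of the structured return value.
            span.set_attribute("gen_ai.request.model", result.get("model", "unknown"))
            span.set_attribute("gen_ai.usage.input_tokens", result.get("input_tokens", 0))
            span.set_attribute("gen_ai.usage.output_tokens", result.get("output_tokens", 0))
            span.set_attribute("gen_ai.completion", json.dumps(result.get("output", "")))
            return result

    return wrapper


@trace_llm
def call_model(prompt: str) -> dict:
    # Stand-in for a real provider call.
    return {"model": "demo-model", "output": f"echo: {prompt}",
            "input_tokens": 12, "output_tokens": 8}
```
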
Essentially, it would wrap the complexity of using OTel API directly. The end result: a developer integrates once (e.g., wraps their agent or LLM calls with this library’s functions) and gets rich traces in a consistent format, no matter where they send them. This “instrument once” approach is exactly what teams want  . The framework’s job is to make that one-time instrumentation extremely simple and tailored to LLM scenarios (covering things like nested agent loops, chain-of-thought logging, token counts, etc. out of the box). - • Pluggable Exporters for Multiple Backends: To truly allow switching observability providers without reconfiguration, the framework should support multiple output modes. This could include: - • OTLP Exporter – sending data via OTLP (OpenTelemetry protocol) so it can go to any OTel-compatible collector or backend (Datadog, Dynatrace, Azure Monitor, etc.) . This covers the traditional observability systems. - • Native SDK Hooks – for platforms that aren’t OTel-first. For instance, W&B might not ingest OTLP directly, but the framework could optionally integrate with W&B’s SDK to log the same information to a W&B run (leveraging W&B’s weave or low-level API). Similarly, an integration for MLflow could log key metrics (like total tokens or latency) and artifacts (perhaps prompts and outputs as text artifacts) to MLflow, allowing users of those tools to see the data in their existing tracking setup. These integrations would be optional plugins, but they broaden the appeal to MLOps users who have incumbent platforms. - • Self-hosted UI – The framework could also offer a lightweight web UI or utilize an existing open UI (like Jaeger, or even hooking into Langfuse or PostHog) for those who want to self-host a front-end. This isn’t strictly necessary for MVP, as the focus is on being a framework rather than yet another dashboard. But providing an easy local viewer (even if just reusing Jaeger/Grafana) could help users who don’t want to rely on any external service at all. -The key is that switching the export target should be as simple as changing a config setting or import. For example, one could initialize the observability with myobs.init(exporter="datadog", config={...}) vs exporter="wandb" or exporter="console" (for local debugging), etc., without changing how the code is instrumented. This decoupling is achievable thanks to the common schema – and we see precedent: OpenLLMetry already demonstrated sending one instrumentation’s data to multiple destinations . A new framework can extend that flexibility further. - • Framework-Agnostic and Agent-Agnostic Integration: To gain traction, the library should work with any LLM application, whether you built it with LangChain, Azure’s SDK, AWS’s AgentCore, custom code, or anything else. This means providing integration hooks or patches for popular libraries. For example, it could detect if LangChain is being used and piggyback on LangChain’s callback system to generate standard OTel spans (for those who prefer not to use LangSmith). Or integrate with AWS AgentCore’s hooks (since AgentCore can emit OTel spans, maybe the library can capture/forward them). In Azure’s case, if someone is using Semantic Kernel or Foundry, our framework could ingest the OTel spans they produce and still allow switching backends. Essentially, it acts as an adapter layer across frameworks. 
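A minimal sketch of that LangChain piggyback, assuming `langchain-core`'s public `BaseCallbackHandler` and the OpenTelemetry API; parenting via `parent_run_id`, error callbacks, and retriever events are omitted for brevity.

```python
from uuid import UUID

from langchain_core.callbacks import BaseCallbackHandler
from opentelemetry import trace

tracer = trace.get_tracer("langchain-bridge")  # illustrative name


class OTelCallbackHandler(BaseCallbackHandler):
    """Translate LangChain run events into OpenTelemetry spans.

    Simplified: spans are keyed by run_id; real parent/child linking is omitted.
    """

    def __init__(self) -> None:
        self._spans: dict[UUID, trace.Span] = {}

    def _start(self, name: str, run_id: UUID, kind: str) -> None:
        span = tracer.start_span(name)
        span.set_attribute("llm.span.kind", kind)  # illustrative attribute key
        self._spans[run_id] = span

    def _end(self, run_id: UUID) -> None:
        span = self._spans.pop(run_id, None)
        if span is not None:
            span.end()

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs):
        self._start((serialized or {}).get("name", "llm"), run_id, "LLM")

    def on_llm_end(self, response, *, run_id, **kwargs):
        self._end(run_id)

    def on_tool_start(self, serialized, input_str, *, run_id, **kwargs):
        self._start((serialized or {}).get("name", "tool"), run_id, "TOOL")

    def on_tool_end(self, output, *, run_id, **kwargs):
        self._end(run_id)
```

Passing `callbacks=[OTelCallbackHandler()]` when invoking a chain or agent would then be enough to get standard spans without adopting LangSmith.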
This is reminiscent of Phoenix’s approach of having “dozens of automatic instrumentors” for different libraries  – a new project could centralize these instrumentations under one umbrella, fully open-source. By validating that it can plug into known pain points (LangChain without LangSmith, Foundry outside Azure, AgentCore outside AWS, etc.), it addresses real gaps. For instance, LangSmith is easy if you use LangChain, but if you use a different agent framework, you’re out of luck – our library would step in there. - • Focus on LLM/Agent-Specific Telemetry: The framework should incorporate known pain points in debugging LLMs as first-class considerations. This includes capturing the content of prompts and responses (with options to redact or anonymize as needed for privacy), logging model parameters (which model, which version, temperature, etc.), tracking token usage and cost per call, and recording tool invocations and their results  . These are the data that engineers find invaluable when debugging AI agent failures . For example, if an agent makes a wrong decision, the trace should show the exact prompt that led to it and any intermediate reasoning. The new framework can leverage the standardized attributes for these (e.g., gen_ai.operation.name, gen_ai.request.model, gen_ai.usage.total_tokens, etc.) so that this rich context is captured uniformly  . By targeting these LLM-specific details (rather than just being a generic tracer), the project provides significant added value over using OpenTelemetry alone. It essentially packages best practices and conventions (some gleaned from LangSmith, Langfuse, etc.) into a reusable library. - • Truly Open-Source and Community-Driven: To address the frustration around pseudo-open tools, this framework should be fully open-source (permissive license) with no feature gating. The value proposition is not to lock users into yet another platform, but to free them from that lock-in. It could gain traction similarly to how Langfuse did, by being transparent and community-friendly. If it can demonstrate compatibility with other systems, one could even see collaboration instead of competition – e.g., Langfuse’s backend could ingest traces from this library (since it speaks OTel), or the library could optionally forward to LangSmith for those who want to use LangChain’s UI but also have a copy of data elsewhere. Being Switzerland in the observability war, so to speak, could win mindshare. - -Figure: An example trace visualization from an open-source LLM observability tool (OpenLLMetry/Traceloop). This shows a hierarchical agent trace with steps like prompt calls and tool usage, along with metadata such as model parameters and token counts. A unified framework would capture similar traces in a standard format, allowing them to be viewed on any backend.   - -By focusing on these aspects, the new framework directly tackles known pain points. For instance, no more re-instrumenting code just because you switched from one monitoring service to another – instrument once, then choose your exporter (this aligns exactly with what practitioners have been asking for ). The use of open standards means it would “play nice” with all major providers, letting teams integrate LLM observability into their existing tools and workflows rather than forcing a separate silo. 
And because it’s specific to LLMs, it can incorporate domain-specific conveniences that general tracing libraries lack (making it much easier to adopt than raw OpenTelemetry for someone working on, say, a LangChain app outside of LangSmith). - -Critically, there is evidence that such a unified approach would be welcomed. We see developers already trying to piece together similar capabilities: e.g., the AWS blog showing how to funnel AgentCore traces to Langfuse via OTel endpoints  , or community projects like OpenLLMetry which arose to bridge gaps and got standardized . We also see frustration when this kind of easy interoperability is missing – as one Datadog engineer noted, having to maintain parallel instrumentation for different systems was a pain point . By validating these gaps (through the sources and examples above), we confirm there’s a real need. A framework that makes multi-provider observability seamless – analogous to how LiteLLM made multi-provider model access seamless – could rapidly gain adoption. - -Conclusion: A Clear Path to Impact - -In conclusion, the current state of LLM/agent observability is ripe for improvement. The pain points are well-known in the community: too many one-off integrations, inconsistent schemas, and difficulty combining or switching tools  . Recent progress with OpenTelemetry’s GenAI conventions and open-source observability platforms have laid the groundwork for a more unified future  . What’s still missing is a developer-centric framework that pulls these threads together into an easy-to-use package. By building a Python open-source library focused on standardized LLM tracing with pluggable backends, we can fill this gap. This would let developers configure their observability once – capturing all the rich trace data needed for debugging LLM applications – and then send it wherever it needs to go with minimal effort. The value of such a framework is that it eliminates the trade-off between using best-of-breed observability tools and avoiding vendor lock-in or rework. - -Such a project could gain significant traction by addressing the confirmed pain points (integration fatigue, lock-in, cost) with an open, flexible solution. It aligns with the direction the industry is already heading (standardized, cross-platform telemetry), but offers a practical implementation that developers can adopt today. In short, there is a clear opening for a LiteLLM-like observability framework – one that will make observability configuration for LLM/agent applications as interchangeable and straightforward as swapping an API key. The research above validates the need and opportunity: by learning from existing tools’ limitations and leveraging the new standard, a unified observability framework for LLMs could truly accelerate AI application development and operations. - -Sources: - - 1. Microsoft Azure AI Foundry – “Trace and Observe AI Agents” (preview documentation)   - 2. ZenML Blog – “Langfuse vs LangSmith: Which Observability Platform Fits Your LLM Stack?”   - 3. Reddit – Discussion on LLM observability tools pricing and openness   - 4. PostHog Blog – “7 best free open source LLM observability tools right now”   - 5. Datadog Blog – “Datadog LLM Observability supports OpenTelemetry GenAI Conventions”   - 6. Dynatrace Community – “OpenLLMetry conventions are now part of OpenTelemetry”   - 7. Arize (Phoenix) Blog – “The Role of OpenTelemetry in LLM Observability”   - 8. AWS Machine Learning Blog – “Amazon Bedrock AgentCore Observability with Langfuse”   - 9. 
W&B Medium Article – “W&B: Instrument-Everything Way (LLM tracing with Weave)”   - 10. PostHog (Ian Vanagas) – Open-source LLM observability tools list   (demonstrating multi-destination support) diff --git a/dev/research/OpenAI-research2.md b/dev/research/OpenAI-research2.md deleted file mode 100644 index 595b49f..0000000 --- a/dev/research/OpenAI-research2.md +++ /dev/null @@ -1,38 +0,0 @@ -Existing Solutions for LLM Agent Tracing & Observability - -Open-Source Developer SDKs - -Several open-source projects already tackle LLM/agent observability with a focus on easy code integration and vendor-neutral tracing. Notable examples include: - • OpenLLMetry (by Traceloop) – A popular open-source SDK (6.8k+ ⭐ on GitHub) built on OpenTelemetry to trace LLM applications  . With one-line init (Traceloop.init()), it auto-instruments LLM calls, tools, and vector DB interactions, emitting standard OpenTelemetry GenAI spans. This means you can export traces to any backend (Datadog, Honeycomb, Jaeger, etc.) without code changes  . OpenLLMetry provides high-level decorators for workflows and agent tools to capture nested agent steps (e.g. it supports LangChain agents, Crew, Haystack, etc.) . Essentially, it already delivers “instrument once, debug anywhere” for LLM apps, very similar to the project’s goal. The library captures prompts, model responses, tool calls, errors, token usage, and latency as structured spans , comparable to LangSmith-level detail. - • Langfuse – An open-source LLM observability platform (MIT-licensed, ~21k ⭐) with an SDK for Python/TypeScript . Langfuse records traces of LLM chains and agents: every prompt, completion, tool invocation, token count, latency, etc. (viewable in a web UI) . Its SDK is built on OpenTelemetry as well  . Developers initialize it with one line and an API key, and can use decorators or context managers (e.g. @Langfuse.trace) to instrument functions . Langfuse integrates with popular frameworks (LangChain, LlamaIndex, etc.) to automatically log their internals . It’s vendor-neutral to an extent – you can self-host the Langfuse backend, or even send Langfuse’s OTel spans to other APM tools  . However, it is geared toward using the Langfuse UI/DB for analysis by default (so switching “backends” means setting up their endpoint or an OTel collector). - • Langtrace – Another open-source observability tool (AGPL license, ~1.1k ⭐). Langtrace provides a hosted or self-hosted backend with a lightweight SDK for Python/TS  . It similarly wraps OpenTelemetry under the hood . One langtrace.init() call instruments LLM API calls, tools, vector DB operations, etc., capturing traces and metrics of agent runs  . The trace data (prompts, outputs, errors, timings) is sent either to Langtrace’s service or an OpenTelemetry endpoint. Langtrace emphasizes real-time monitoring, cost & latency analytics, and even evaluation metrics, much like LangSmith’s feature set . The key difference is it’s open-source and can be self-hosted. Still, it aims to be a one-stop tracing solution, rather than a simple “pipe” into any arbitrary backend. - • Helicone – An open-source proxy solution (not an SDK) that logs all OpenAI/LLM calls . By pointing your API base URL to Helicone, it captures prompts, responses, costs, and errors without code changes . This is great for LLM API call logging, but Helicone doesn’t inherently understand multi-step agent flows – it treats each API request separately. 
There’s no built-in concept of an agent workflow or tool usage chain; you’d have to correlate calls yourself. So Helicone addresses part of the problem (prompt/response observability) but not full agent trace semantics. - • Others – The ecosystem is rich. OpenLIT (OpenTelemetry-native LLM tracer) is an emerging project with a Python API for minimal-overhead LLM tracing  . It similarly emits OTLP spans and can integrate with any backend. Lunary (open-source GenAI observability) focuses on RAG use-cases, tracing retrieval calls alongside generation for better insight into context and answer quality  . There are also specialized tools like Hugging Face’s TruLens (for evaluating outputs) and Arize’s Phoenix (monitoring drift/metrics)   – these are adjacent (evaluation/monitoring) but not direct tracing competitors. In summary, multiple open SDKs exist that capture “LangSmith-style” traces of LLM applications, with OpenLLMetry being the clearest match to an “agent-first, vendor-agnostic” runtime. - -Closed-Source / Proprietary Solutions - -In addition to open tools, several proprietary or cloud-native solutions cover this space: - • LangSmith by LangChain – The inspiration for this project, LangSmith is LangChain’s own observability/tracing platform (commercial, with a cloud and self-host option). It records every step of an agent’s execution graph – model calls, tool invocations, inputs/outputs, errors, tokens – very comprehensively . LangSmith integrates seamlessly if you use LangChain (just add a decorator or env var to enable tracing), but it’s framework-agnostic in principle . You can instrument custom code or other frameworks with their SDK (e.g. @trace decorator)  . However, using LangSmith means sending data to LangChain’s backend (or running their server) – effectively vendor lock-in to their platform/UI. It’s not an open library you can repurpose freely, and switching off LangSmith would require re-instrumenting to a different system. This is exactly the lock-in the user’s project wants to avoid. - • Cloud Provider Tracing (Azure / AWS) – Both Microsoft and Amazon have introduced GenAI agent observability tied to their ecosystems. For example, Azure AI “Foundry” (preview) can emit OpenTelemetry traces for agent frameworks, viewable in Azure Application Insights . Similarly, AWS Bedrock Agent services integrate with AWS’s Distro for OTel (ADOT) – you instrument your agent code with AWS’s OTel SDK and send traces to CloudWatch X-Ray or Amazon Managed Grafana  . These solutions provide rich agent traces (steps, tools, etc.) out-of-the-box if you use their agent orchestration. But they are inherently tied to the Azure/AWS platforms (the telemetry goes into their monitoring tools). Adopting them means limited portability – e.g. you couldn’t easily switch to Datadog or self-hosted Jaeger without replacing the instrumentation. - • APM/Observability Vendors – Traditional observability providers are also adding GenAI tracing support. Datadog, for instance, just announced native support for the OpenTelemetry GenAI conventions (v1.37+)  . Previously, you had to use Datadog’s custom SDK to log LLM spans, but now Datadog accepts standard OTLP spans for prompts, completions, tools, etc. . This means if you instrument with an OTel-based library (like OpenLLMetry or your own), Datadog’s LLM Observability product will understand and display those agent traces with no extra code  . Other vendors (New Relic, Splunk, Honeycomb, etc.) 
can likewise ingest OTel GenAI spans – the entire industry is converging on the OTel standard for AI telemetry  . However, these vendors did not originally offer an easy Python SDK specifically for agent tracing – they rely on you bringing your own instrumentation (or using a third-party SDK). Now that standards are in place, vendors are ensuring compatibility, but the developer still needs a library to generate those traces. - -Is There an Open Opportunity? - -Given the landscape above, there are already robust open-source SDKs and tools offering “LangSmith-quality” agent traces without vendor lock-in. In particular, Traceloop’s OpenLLMetry stands out as a direct open solution fulfilling the project’s core goals: it’s framework-agnostic, captures nested agent/tool workflows, and exports to many backends via configuration  . Langfuse and Langtrace similarly use OTel under the hood and focus on agent-style trace semantics (though they often encourage use of their own backends)  . - -The key gap would be in developer experience and scope, rather than raw capability. All the existing open solutions rely on OpenTelemetry concepts to some degree – for example, OpenLLMetry exposes spans/traces and uses decorators like @workflow to mark agent boundaries  . A new project could differentiate by providing a more agent-native API (as described in the prompt) so that developers don’t need to think in terms of spans at all. For instance, an AgentRun object with a tree of Steps might feel more intuitive than decorating functions. Also, many current tools either come with a full backend/UI or require setting up an OTel collector to visualize data. There may be room for a lightweight library that by default logs traces to console/file for quick local debugging (and optionally sends to any backend) – essentially “plug-and-play” for developers who just want to see what their agent is doing right now. - -In summary, the concept of a vendor-neutral, instrument-once tracing runtime is already being pursued by multiple projects. Any new entrant would need to clearly state what it does differently. Based on this research, possible differentiators could be: - • Simplicity and zero-config DX: Even less setup than OpenLLMetry – e.g. auto-detect agent/tool usage without requiring decorators, and a local debug view out-of the box. Current solutions like OpenLLMetry are powerful but still benefit from optional manual annotations for best results . - • Strictly focused scope: Many tools (Langfuse, Langtrace) are evolving into full platforms (with eval, prompt management, etc.). A project that stays narrow – just high-quality tracing to feed other systems – could attract those who don’t want a new dashboard. - • Community governance: OpenLLMetry is Apache-2.0 and widely adopted, but it’s maintained by a company (Traceloop) driving towards a SaaS. An independent OSS project might appeal to contributors who want a purely community-driven alternative. (Though note, OpenLLMetry’s popularity suggests most are content with it for now.) - -Bottom line: The space is crowded – any claim of a “truly open opportunity” must be tempered with the fact that at least one open library (OpenLLMetry) essentially already offers what’s described, and others aren’t far behind. The proposed project would need to clearly articulate why existing solutions don’t fully solve the problem. It could be that by emphasizing an “agent-first” abstraction and ultra-simple integration, the project finds a niche despite the overlap. 
But this deep research reveals that multiple tools (open and closed) are offering very similar capabilities today, so careful positioning will be necessary. - -Sources: - • Traceloop OpenLLMetry README (open-source LLM observability SDK)   - • Langfuse Documentation – tracing LLM agents and tools (open-source observability platform)   - • Langtrace GitHub – open-source end-to-end LLM tracing tool by Scale3 Labs   - • LangSmith vs LangChain – LangSmith overview (LangChain’s observability platform, proprietary)   - • Microsoft & AWS docs – OpenTelemetry tracing in Azure Foundry and AWS Bedrock agents   - • Datadog blog – Native support for OpenTelemetry GenAI trace data (showing industry standardization)   diff --git a/dev/research/combined-research.md b/dev/research/combined-research.md deleted file mode 100644 index 53a75e9..0000000 --- a/dev/research/combined-research.md +++ /dev/null @@ -1,262 +0,0 @@ -What from Gemini is worth incorporating - -Yes, there are several implementation-grade insights in Gemini’s writeup that strengthen (and in a few places, correct) the earlier report: - - 1. Standards reality: OTel GenAI is still moving and is shifting “content” toward events -OpenTelemetry’s GenAI semantic conventions are explicitly in a “Development” phase and include a dedicated GenAI event for operation details that can carry chat history and parameters.  -This matters because you cannot rely on “prompt/completion as span attributes” staying stable, and some backends already feel this pain when semconv versions change.  - 2. Polyglot schema: support both OTel GenAI and OpenInference namespaces -OpenInference defines agent-native kinds and conventions (tool, retriever, etc.) and is what Phoenix and other AI-native systems expect.  -TraceCraft TAL should treat this as a “dual dialect” problem, not a “pick one” problem. - 3. Thread context propagation is a real, recurring breakage point -OpenLLMetry documents that ThreadPoolExecutor can cause broken traces unless you explicitly propagate context.  -Gemini’s suggestion of a context-aware executor utility is very practical. - 4. Framework adapters should use the “real” extension points - - • LangChain: Callback handlers and async callback managers are the stable integration seam.  - • LlamaIndex: Their instrumentation model is Dispatcher + SpanHandler/EventHandler.  -This aligns with how you should implement adapters rather than monkey-patching internals. - - 5. PydanticAI is already OTel-first, so you should intercept providers rather than “add tracing” -PydanticAI/Logfire uses the global OpenTelemetry providers by default and supports providing a tracer_provider.  -So TraceCraft should integrate by providing or wrapping the TracerProvider and enforcing governance there. - 6. Governance via SpanProcessor has limits: mutation at on_end is not uniformly supported -The spec ecosystem has long-running issues about mutating spans at on_end and the immutability model of ReadableSpan in some languages.  -So your redaction design must not assume you can safely mutate ended spans. You should design for “sanitize before export” (export-time transform) and/or “sanitize at capture time” (recommended for strict environments). - 7. Deterministic trace IDs: feasible, but must preserve sampling randomness -OTel Python’s IdGenerator docs state you must keep at least the lower 64 bits uniformly random because samplers depend on it.  -So deterministic session-binding needs careful design or it will break TraceIdRatioBased sampling. - 8. 
Local “Flight Recorder” HTML should be more than “Rich to HTML” -Gemini’s D3 collapsible tree plus a simple timeline view is a compelling wedge. You can still ship a “simple HTML” in v0.1, but architect it so you can evolve into a richer viewer without changing the SDK surface. - -⸻ - -Updated Research Report: TraceCraft TAL Implementation Blueprint - -1) Core dependency stack and why - -OpenTelemetry foundations - -TraceCraft TAL should be built on: - • opentelemetry-api, opentelemetry-sdk - • opentelemetry-exporter-otlp (HTTP and/or gRPC) - • A configurable TracerProvider, IdGenerator, Sampler, and SpanProcessors pipeline - -Why: OTel is the only realistic “route anywhere” substrate, and GenAI conventions are now part of the official spec (though still unstable).  - -Build on OpenLLMetry for library-level instrumentation - -OpenLLMetry already focuses on instrumenting common LLM libraries and documents real-world pitfalls like threading context propagation.  -TraceCraft should treat OpenLLMetry as the “span capture engine” for low-level calls, and focus on: - • agent-native structure - • local-first DX - • governance - • routing - -Schema and serialization - -Use Pydantic v2 for the canonical AgentRun/Step model: - • strong typing - • stable JSON output - • easy redaction via structured traversal - • versioned schema evolution - -Local-first UX - • rich for console tree output (fast adoption wedge) - • HTML artifact generation (v0.1 simple template, v0.2 D3 interactive) - • optional CLI via typer (nice-to-have, not required for v0.1) - -⸻ - -2) Standards strategy: “Polyglot Schema” (OTel GenAI + OpenInference) - -Why you need this - • OTel GenAI conventions are evolving and increasingly push content into events for privacy and correctness.  - • OpenInference is what Phoenix and related tooling expects for AI-native semantics, especially RAG and tool/retriever concepts.  - -Concrete approach - -TraceCraft TAL should maintain a canonical internal schema, then generate: - • OTel GenAI attributes and events (gen_ai.* plus GenAI events) - • OpenInference attributes (input.value, output.value, tool.name, retrieval docs, span kinds) - -Export configuration determines: - • “OTel-strict mode”: content moved to events, minimal attributes, aggressive redaction defaults - • “Debug-rich mode”: include OpenInference content keys for local artifacts, optionally for OTLP if user explicitly opts in - -This dual dialect approach avoids choosing sides and makes your SDK future-proof as OTel GenAI stabilizes.  - -⸻ - -3) Runtime correctness: async and threads - -Async context propagation - -LangChain and agent frameworks commonly use async callback managers, so your adapter must support async callbacks and preserve context.  - -ThreadPoolExecutor propagation - -A recurring pain point: OTel context does not automatically propagate into threads, causing broken traces. OpenLLMetry explicitly documents this and provides guidance.  - -Implementation requirement -Ship a tiny utility module: - • tracecraft.contrib.context.propagating_executor.ThreadPoolExecutor - • tracecraft.contrib.context.wrap_callable(copy_context().run(...)) - -And document it as: - • “Use this executor for parallel tool calls or embedding calls that run in threads.” - -⸻ - -4) Trace identity: optional deterministic binding, safely - -IdGenerator - -OTel Python supports custom IdGenerators, but you must keep lower 64 bits random to preserve sampling correctness.  
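For illustration only, here is a minimal sketch of a session-bound generator that respects this constraint; the class name and `session_id` argument are hypothetical, not an existing TraceCraft or OTel API:

```python
import hashlib
import random

from opentelemetry.sdk.trace.id_generator import IdGenerator, RandomIdGenerator


class SessionBoundIdGenerator(IdGenerator):
    """Hypothetical: derive the upper 64 bits of each trace ID from a session
    identifier, while keeping the lower 64 bits uniformly random so samplers
    such as TraceIdRatioBased keep working."""

    def __init__(self, session_id: str) -> None:
        digest = hashlib.sha256(session_id.encode("utf-8")).digest()
        self._upper = int.from_bytes(digest[:8], "big")  # stable per session
        self._random = RandomIdGenerator()               # delegate for span IDs

    def generate_trace_id(self) -> int:
        lower = random.getrandbits(64)  # samplers rely on these bits being random
        trace_id = (self._upper << 64) | lower
        # 0 is an invalid trace ID; fall back to a fully random one in that edge case.
        return trace_id if trace_id != 0 else self._random.generate_trace_id()

    def generate_span_id(self) -> int:
        return self._random.generate_span_id()
```

Wired in via something like `TracerProvider(id_generator=SessionBoundIdGenerator("session-12345"))`, this keeps the upper bits queryable while leaving sampling behavior untouched.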
- -Recommendation - • v0.1: default RandomIdGenerator - • v0.2: optional “session-bound” trace IDs that encode a stable hash in upper bits and keep random lower bits (and document tradeoffs clearly) - -Do not ship deterministic IDs in v0.1 unless you have tests covering sampling behavior and W3C constraints. - -⸻ - -5) Framework adapters: what “best practice” actually looks like - -LangChain adapter - -Use BaseCallbackHandler and implement both sync and async callbacks because LangChain has distinct flows and callback managers.  - -Must handle: - • on_chain_start / on_chain_end / on_chain_error - • on_llm_start / on_llm_end / on_llm_error - • on_tool_start / on_tool_end / on_tool_error - • on_retriever_start / on_retriever_end / on_retriever_error (critical for RAG)  - -Also: expect callback inconsistencies in the wild (there are issues where tool callbacks do not fire in certain versions). Your adapter should degrade gracefully and still produce a coherent run tree.  - -LlamaIndex adapter - -Use LlamaIndex’s Dispatcher + SpanHandler (and optionally EventHandler) model.  - -This is the right place to: - • capture retrieval nodes/documents - • map Span enter/exit/drop to Step lifecycle - • record exceptions on drop - -Also note: LlamaIndex instrumentation can have version-specific gaps around newer workflow components, so test across at least 2 versions (current stable and one prior minor) and document supported versions.  - -PydanticAI / Logfire integration - -This is interception, not instrumentation: - • Logfire config sets global tracer provider by default and allows setting custom providers.  -So TraceCraft should offer: - • tracecraft.integrations.pydanticai.configure(tracer_provider=...) - • or a “wrap provider” mode that installs TraceCraft processors/exporters while letting Logfire span creation continue - -⸻ - -6) Governance pipeline: redaction, allowlists, hashing, and “kill switches” - -Redaction strategy must acknowledge SpanProcessor mutability limits - -You cannot assume you can safely mutate a span’s attributes at on_end across implementations; this is a known limitation/controversy.  - -Updated recommendation -Implement redaction in one of these safe ways: - - 1. Export-time transformation (recommended default): build a sanitized envelope from the canonical AgentRun/Step model and export that, rather than trying to mutate ended spans. - 2. Capture-time sanitization (opt-in strict mode): drop or redact sensitive fields before they ever become span attributes/events. - -Add these governance capabilities to MVP scope (they are tractable and very valuable) - • Path-based redaction (keys, JSON paths) - • Regex detectors (email, API keys, etc.) - • Allowlist-only mode (“fail closed”) - • Hash-redaction option for correlating repeated sensitive inputs without exposing content - -Optional but high-leverage: Budget “kill switch” - -Gemini’s idea is good but should be explicitly optional: - • track token estimate pre-call - • abort tool or LLM calls if budget exceeded - -This is more invasive (it changes execution), so it should be off by default, but it could become a marquee feature for production agent safety. - -⸻ - -7) Local Flight Recorder: buffered trace + HTML “trace player” - -Buffering processor - -Buffer spans (or Steps) until root run completes, then decide: - • always write artifacts - • on_error only - • sample successful local artifacts - -This aligns with your tail-sampling goals and also keeps local output readable. 
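To make option 1 concrete, here is a minimal sketch of an export-time transform; `StepEnvelope` and `sanitize_for_export` are invented names for this example and assume the canonical Step exposes a plain attribute dict:

```python
import re
from dataclasses import dataclass, field
from typing import Any

# Compiled once; the pattern set mirrors the detectors discussed below.
EMAIL = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")
OPENAI_KEY = re.compile(r"sk-[a-zA-Z0-9]{20,}")

# Only attribute keys known to carry user content are scanned.
SENSITIVE_KEYS = {"input.value", "output.value", "gen_ai.prompt", "gen_ai.completion"}


@dataclass
class StepEnvelope:
    """Hypothetical sanitized copy of a canonical Step, built at export time."""
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)


def sanitize_for_export(name: str, attributes: dict[str, Any]) -> StepEnvelope:
    """Build a redacted envelope instead of mutating the ended span."""
    clean: dict[str, Any] = {}
    for key, value in attributes.items():
        if key in SENSITIVE_KEYS and isinstance(value, str):
            value = EMAIL.sub("<email>", value)
            value = OPENAI_KEY.sub("<api-key>", value)
        clean[key] = value
    return StepEnvelope(name=name, attributes=clean)
```

Because the exporter only ever sees this copied envelope, nothing needs to mutate an ended span, which sidesteps the `on_end` immutability problem entirely.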
- -HTML report evolution path - -v0.1: simple standalone HTML with: - • collapsible tree - • side panel for inputs/outputs - • basic search - -v0.2: add: - • D3 collapsible tree + timeline/Gantt (useful for parallel tool calls) - • exclusive vs inclusive duration calculation - • “open node in place” UX - -This improves the “LangSmith-quality debugging” story without building a backend. - -⸻ - -8) OTLP routing reality check: backend quirks you must account for - -Even if OTLP works, UIs differ in how well they render agent-native constructs. Example: W&B Weave explicitly notes limitations rendering OTEL tool calls in their chat view.  - -Implication -Your “route anywhere” promise must be worded honestly: - • “OTLP everywhere” - • “best-effort UI parity” - • plus optional vendor-native exporters later for higher-fidelity rendering - -⸻ - -9) Testing plan updates (more specific) - -Compatibility matrix tests - -Run the same golden-agent scenario across: - • no framework (decorators only) - • LangChain callback adapter - • LlamaIndex dispatcher adapter - • PydanticAI/Logfire interception mode - -Assert: - • canonical AgentRun tree matches expected structure - • OTLP export contains expected attributes/events in both OTel and OpenInference namespaces (when enabled) - -Thread propagation regression test - -Reproduce the “broken traces in ThreadPoolExecutor” scenario described by OpenLLMetry docs and ensure your context-aware executor fixes it.  - -Semconv drift tests - -Pin a semconv version in CI and also run a “latest semconv” job: - • ensure events-based GenAI details still export correctly even when attributes change - • this guards against the real semconv drift issues seen in the ecosystem.  - -⸻ - -Bottom line changes to the implementation plan - -Compared to the prior report, the updated plan should explicitly include: - • Dual dialect export (OTel GenAI + OpenInference) with configurable strictness  - • A first-class threading context propagation utility  - • PydanticAI/Logfire provider interception as a primary integration mode  - • Redaction architecture that does not depend on mutating ended spans  - • A clear path to a richer local HTML “trace player” beyond Rich-to-HTML - -If you want, next I can turn this into an “Implementation Spec” you can drop into a GitHub repo as /docs/architecture.md, including concrete module boundaries, interfaces (Exporter, Processor, Adapter), and the exact env var configuration surface. diff --git a/dev/research/gemini-focused.md b/dev/research/gemini-focused.md deleted file mode 100644 index 3e6bebb..0000000 --- a/dev/research/gemini-focused.md +++ /dev/null @@ -1,404 +0,0 @@ -Comprehensive Architectural Specification and Implementation Strategy for TraceCraft Telemetry Abstraction Layer (TAL) - -1. Introduction: The Crisis of Observability in Agentic Systems -The emergence of autonomous agentic frameworks—principally LangChain, LlamaIndex, and PydanticAI—has precipitated a fundamental shift in software architecture, transitioning from deterministic, procedural execution to non-deterministic, probabilistic reasoning loops. This paradigm shift has exposed severe deficiencies in traditional Application Performance Monitoring (APM) and observability standards. 
While OpenTelemetry (OTel) has established itself as the ubiquitous transport protocol for distributed tracing, its standard semantic conventions, designed primarily for HTTP-based microservices, fail to capture the cognitive nuances of agentic behaviors such as multi-step reasoning, tool selection, retrieval-augmented generation (RAG), and self-reflection. -The TraceCraft Telemetry Abstraction Layer (TAL) is proposed as a necessary infrastructure component to bridge this gap. It functions not merely as a passive logger but as an active, intelligent middleware that intercepts, normalizes, governs, and semanticizes the execution flow of AI agents. The complexity of this undertaking is compounded by the fragmentation of the ecosystem: distinct frameworks employ radically different internal event models—LangChain utilizes a callback system, LlamaIndex employs a hierarchical event dispatcher, and PydanticAI integrates natively with OpenTelemetry but through an opinionated lens (Logfire). Furthermore, the standardization landscape is in flux, characterized by a schism between the official, albeit nascent, OpenTelemetry Semantic Conventions for Generative AI and the industry-led OpenInference standard, which offers richer, albeit non-standard, attribute schemas. -This report provides an exhaustive, implementation-grade blueprint for the TraceCraft TAL. It dissects the architectural requirements for building a robust runtime capable of handling asynchronous context propagation in Python, specifies the precise adaptation logic required for major agent frameworks, defines a unified semantic schema that satisfies competing standards, and details the construction of governance engines and local debugging tools. The objective is to define a system that provides developers with "Flight Recorder" fidelity—the ability to replay, inspect, and debug the cognitive trajectory of an agent—while ensuring enterprise-grade governance and compatibility with the broader observability ecosystem. -2. The Observability Landscape and Standardization Challenges -Constructing a Telemetry Abstraction Layer requires a rigorous analysis of the underlying standards upon which it builds. The TAL must serve as a Rosetta Stone, translating framework-specific vernacular into a universal observability language. However, the definition of that "universal language" is currently a subject of active contention and development within the industry. -2.1 The Schism: OpenTelemetry GenAI vs. OpenInference -Two primary specifications compete to define how Generative AI operations should be represented in telemetry data: the official OpenTelemetry Semantic Conventions and the OpenInference standard developed by Arize AI. -The OpenTelemetry Semantic Conventions for Generative AI represent the official effort to standardize LLM telemetry within the CNCF ecosystem. As of 2025, these conventions are in a "Development" status, implying that breaking changes are possible and stability guarantees are not yet met.1 The specification focuses heavily on the request-response model typical of API interactions. 
It defines span names such as chat and generate_content, and attributes are strictly namespaced under gen_ai.* (e.g., gen_ai.usage.input_tokens, gen_ai.response.model).2 A critical limitation identified in the research is the deprecation of content-bearing attributes like gen_ai.prompt and gen_ai.completion in favor of event-based logging, which, while technically cleaner for privacy, complicates the immediate debugging experience for developers who expect to see inputs and outputs directly on the span.4 -Conversely, OpenInference has emerged as a pragmatic, implementation-first standard, driven by the needs of AI engineers rather than protocol purists. It extends the semantic richness of traces to include concepts native to agents, such as "Tools" and "Retrievers," rather than treating everything as a generic dependency call. OpenInference uses a flatter, more direct attribute schema, prioritizing developer utility with keys like input.value, output.value, and tool.name.5 Crucially, OpenInference defines specific semantic conventions for RAG, including retrieval.documents (containing content, scores, and metadata IDs), which OpenTelemetry generic conventions currently lack.6 -Implications for TAL Architecture: The TAL cannot simply choose one standard over the other. Choosing OpenTelemetry ensures long-term infrastructure compatibility but sacrifices immediate debugging fidelity and RAG observability. Choosing OpenInference provides rich debugging features but risks vendor lock-in or non-compliance with strict OTel collectors. Therefore, the TAL must implement a Polyglot Schema Strategy, actively populating attributes from both standards simultaneously where non-conflicting, and providing configuration options to prioritize one namespace over the other during export. This "Dual-Dialect" approach ensures that a trace generated by TraceCraft TAL is valid in a strict Jaeger instance while simultaneously lighting up rich features in AI-specific backends like Arize Phoenix or Langfuse. -2.2 Framework Fragmentation -Beyond the output standards, the input sources—the agent frameworks—exhibit profound architectural differences that the TAL must abstract away. -LangChain: Operates on a CallbackHandler architecture. Instrumentation relies on hooks like on_chain_start, on_llm_start, and on_tool_end. This model is event-driven and somewhat disconnected from the execution context, requiring the instrumentation layer to manually maintain the stack of active spans to correctly associate children with parents.7 -LlamaIndex: Utilizes a more sophisticated instrumentation module (introduced in v0.10.x) centered around a Dispatcher pattern. It explicitly distinguishes between Events (point-in-time occurrences) and Spans (durational operations). This architecture is cleaner but requires the TAL to implement distinct SpanHandler and EventHandler classes to capture the full fidelity of the execution.9 -PydanticAI: Represents a modern approach where observability is a first-class citizen, built directly on OpenTelemetry via the Logfire library. Here, the challenge is not adding instrumentation, but intercepting and governing it. The TAL must inject itself as the TracerProvider to capture spans that the framework emits natively, ensuring that governance policies (like redaction) are applied before the data leaves the process.11 -3. Runtime Architecture Implementation -The Runtime is the heart of the TAL. 
It is responsible for managing the lifecycle of traces, generating unique identifiers, propagating context across asynchronous boundaries, and maintaining the integrity of the trace graph. -3.1 Context Propagation in Asynchronous Python -The most formidable technical challenge in instrumenting Python-based agents is maintaining the trace context (the linkage between a parent span and a child span) across asynchronous execution boundaries. Agents frequently utilize asyncio.gather for parallel tool execution or offload heavy tasks (like embedding generation) to thread pools. Without rigorous context management, traces become fragmented, appearing as disjointed root spans rather than a cohesive graph. -3.1.1 The ContextVars Foundation -OpenTelemetry for Python relies on the contextvars module to manage active spans.12 A ContextVar is a mechanism that stores state local to an asynchronous task, similar to how threading.local stores state for a thread. The TAL must define a global ContextVar, typically named current_context, which holds the OTel Context object. -However, standard contextvars usage is insufficient for all agentic patterns. While asyncio.create_task (and by extension gather) automatically copies the context to the new task in Python 3.7+, this automatic propagation breaks down when interacting with concurrent.futures.ThreadPoolExecutor, which is often used by vector database clients to perform blocking I/O without freezing the event loop. -3.1.2 Implementing a Context-Aware Executor -To solve the "broken trace" problem in RAG pipelines, the TAL Runtime must implement and enforce the use of a Context-Aware Executor. This component wraps the standard executor and manually copies the context into the worker thread. - -Python - -import contextvars -import functools -from concurrent.futures import ThreadPoolExecutor - -class TALContextExecutor(ThreadPoolExecutor): - """ - A ThreadPoolExecutor that propagates the OpenTelemetry context - to the worker threads, ensuring traces remain connected. - """ - def submit(self, fn, *args, **kwargs): - # Capture the context from the calling thread - context = contextvars.copy_context() - # Define a wrapper that runs the function inside the captured context - def wrapper(*args,**kwargs): - return context.run(fn, *args, **kwargs) - - return super().submit(wrapper, *args, **kwargs) - -The TAL must provide this executor as a utility and, where possible, patch the agent framework's default executor to use this implementation. This ensures that when LlamaIndex offloads an embedding request to a thread, the resulting HTTP span is correctly linked as a child of the retrieval span.12 -3.2 ID Generation and The Root Span Problem -In distributed systems, a trace ID is typically generated at the edge (e.g., a load balancer or ingress controller). In agentic systems, the "trace" often originates internally within a background worker or a CLI script. -3.2.1 Deterministic vs. Random IDs -While OTel defaults to random 128-bit Trace IDs, the TAL should support Deterministic Session Binding. For debugging purposes, it is highly advantageous to link a trace to a semantic session_id or conversation_id. -The TAL Runtime should allow the injection of a custom IdGenerator.15 This generator can create Trace IDs that are statistically unique but deterministically derived from a session identifier if one is provided. This allows a developer to query a backend for trace_id="session-12345..." rather than hunting for a random hex string. 
-Implementation Detail: -The IdGenerator must subclass opentelemetry.sdk.trace.id_generator.IdGenerator and implement generate_trace_id() and generate_span_id(). The implementation must ensure 64-bit randomness in the lower bits to satisfy W3C Trace Context requirements for sampling, while potentially encoding session metadata in the upper bits.15 -3.2.2 Handling Orphaned Spans -A common pattern in agents is "Fire and Forget" background tasks (e.g., memory reflection steps). These tasks often start after the main HTTP request has returned, causing them to detach from the trace. -The TAL Runtime must implement a Session Context Store. When a user request initiates an agent, the TAL creates a Session object. Even if the main thread completes, the Session remains active. Any background task spawned by the agent must look up this Session and attach itself to the Session's active trace ID. This transforms the concept of a "Trace" from a strict request-response lifecycle to a "Session Lifecycle," which is far more appropriate for long-running agents. -3.3 The Polyglot Schema Engine -To address the standards conflict identified in Section 2.1, the TAL Runtime must implement a Schema Normalization Engine that runs at span creation time. -Mechanism: -When an adapter requests a new span (e.g., runtime.start_span(name="tool_call", inputs=...)), the engine does not simply pass these arguments to OTel. Instead, it expands them into a superset of attributes. -Schema Mapping Table: - -Concept -TAL Internal Representation -OpenTelemetry GenAI Attribute -OpenInference Attribute -Model Name -model.name -gen_ai.request.model -llm.model_name -Input Data -input.payload -Deprecated / Event-based -input.value -Output Data -output.payload -Deprecated / Event-based -output.value -Token Cost -usage.input -gen_ai.usage.input_tokens -llm.token_count.prompt -Tool Name -tool.name -gen_ai.operation.name -tool.name -System -system.provider -gen_ai.system -llm.system - -Conflict Resolution Strategy: -The runtime logic is additive. - -Python - -def start_span(self, name, **kwargs): - attributes = {} - if "model" in kwargs: - attributes["gen_ai.request.model"] = kwargs["model"] - attributes["llm.model_name"] = kwargs["model"] - - if "input" in kwargs: - # OpenInference uses input.value - attributes["input.value"] = str(kwargs["input"]) - # OTel prefers we don't put PII in attributes, but for local debug we must. - # We flag this for the Governance engine to potentially redact later. - attributes["_governance.sensitive"] = True - - return self.tracer.start_as_current_span(name, attributes=attributes) - -This approach ensures that the "Input Value" is present for local debugging (via OpenInference keys) but can be stripped by the Governance engine before upstream export if strict OTel compliance is required. -4. Framework-Specific Adapter Implementation -The TAL must provide specialized adapters for the major frameworks. These adapters act as the translation layer, converting framework-specific events into the TAL's internal Polyglot Schema. -4.1 LlamaIndex Adapter -LlamaIndex's architecture is event-driven and hierarchical, utilizing a Dispatcher system that is highly amenable to instrumentation. -4.1.1 The SpanHandler Implementation -The core integration point is the BaseSpanHandler class.10 The TAL adapter must subclass this to intercept span lifecycle events. -class_name: Must return a unique identifier, e.g., TraceCraftSpanHandler. -new_span: This method is called when a span starts. 
The adapter must: -Extract id_and map it to OTel's Span ID. -Parse bound_args to extract function inputs. Since LlamaIndex passes raw Python objects (like QueryBundle or Node), the adapter must perform Object Serialization. It should extract text content from Node objects and query strings from QueryBundle objects to populate input.value. -prepare_to_exit_span: Called upon successful completion. The adapter captures the return value. If the return value is a Response object, it must extract the response.source_nodes to populate the retrieval.documents attribute, adhering to the OpenInference RAG specification.6 -prepare_to_drop_span: This hook is critical for error tracking. It provides the exception object. The adapter must call span.record_exception(err) and set the span status to StatusCode.ERROR. -4.1.2 Dispatcher Registration -To ensure full coverage, the adapter must attach itself to the root dispatcher. - -Python - -import llama_index.core.instrumentation as instrument -from tracecraft.adapters.llamaindex import TraceCraftSpanHandler - -def instrument_llamaindex(): - dispatcher = instrument.get_dispatcher("llama_index.core.instrumentation.root_dispatcher") - handler = TraceCraftSpanHandler() - dispatcher.add_span_handler(handler) - -This registration should be idempotent to prevent duplicate span handling if the user calls the instrumentation function multiple times.9 -4.2 LangChain Adapter -LangChain's architecture is callback-based. While it offers a Tracer interface, the BaseCallbackHandler is the most reliable integration point for synchronous and asynchronous event capture. -4.2.1 The Sync/Async Dual Implementation -LangChain separates synchronous and asynchronous callbacks (e.g., on_chain_start vs. on_chain_start (async)). The TAL adapter must implement both versions for every event type to ensure no events are missed in mixed-mode applications.7 -Mapping Logic: -on_chain_start / on_agent_start: Initiates a span with kind=INTERNAL. If the chain class name suggests an agent (e.g., AgentExecutor), the span should be tagged as openinference.span.kind=AGENT. -on_llm_start: Initiates a span with kind=CLIENT. This is the point to capture gen_ai.request.model and gen_ai.system (provider). -on_tool_start: Initiates a span with kind=INTERNAL (or TOOL in OpenInference semantics). Captures tool.name and input arguments. -on_retriever_end: This is a critical hook for RAG observability. The payload contains a list of Document objects. The adapter must transform this list into the JSON structure required by retrieval.documents (content, score, metadata).6 -4.2.2 The Stack Management Challenge -Unlike LlamaIndex, LangChain's callback interface does not always explicitly pass the parent run ID in a way that aligns with OTel's context. The adapter must maintain a thread-local (or context-local) Run Stack. -Push: On start, push the new OTel span onto the stack. -Pop: On end or error, pop the span and finish it. -Parenting: When starting a span, peek at the stack to find the parent. This manual management is necessary because LangChain's internal execution graph does not map 1:1 to OTel's ContextVars based propagation in all scenarios. -4.3 PydanticAI Integration -PydanticAI is unique as it instruments itself using OpenTelemetry natively via Logfire.11 The TAL's role here is not adaptation but Interception. -4.3.1 Provider Injection -PydanticAI allows for the configuration of a custom TracerProvider. The TAL must provide a utility to swap the default Logfire provider with the TAL provider. 
- -Python - -from opentelemetry import trace -from tracecraft.runtime import TALTracerProvider - -def instrument_pydantic_ai(): - # Instantiate the TAL provider with our processors (Redaction, Buffering) - provider = TALTracerProvider() - - # Set as the global default, which PydanticAI will pick up - trace.set_tracer_provider(provider) - - # PydanticAI uses the global tracer by default unless configured otherwise - -This approach ensures that PydanticAI's spans flow through the TAL's governance and debugging pipelines.11 -4.3.2 Attribute Enrichment -Since the TAL does not control the creation of PydanticAI spans (the framework does), it cannot force the Polyglot Schema at creation time. Instead, the TAL must use a SpanProcessor to enrich these spans just before export. The processor inspects the PydanticAI-native attributes and injects the corresponding OpenInference/OTel GenAI attributes to ensure consistency with traces from other frameworks. -5. Governance Engine Implementation -The Governance Engine is a differentiator for the TraceCraft TAL. It moves observability from passive monitoring to active policy enforcement. This is implemented via a chain of SpanProcessors. -5.1 PII Redaction Processor -Redaction must occur in-memory, ensuring sensitive data never reaches the network layer or the local debug file. -Architecture: -The RedactionProcessor must implement the SpanProcessor interface and hook into the on_end(span) method. Since ReadableSpan attributes are immutable in some OTel implementations, the processor may need to construct a new dictionary of attributes and replace the span's internal attribute storage.17 -Scrubbing Logic: -Targeted Keys: The processor should only scan attributes known to contain user content: input.value, output.value, gen_ai.prompt, message.content. Scanning system attributes (like latencies) is wasteful. -Regex Patterns: Implement a library of high-performance compiled regexes for SSNs, Credit Cards, API Keys (sk-[a-zA-Z0-9]+), and Email addresses. -Allowlist Mode: For high-security environments, the governance engine should support an "Allowlist Only" mode. If enabled, any attribute key not explicitly in the allowed_keys configuration is dropped entirely. This is a "Fail-Closed" security model.19 -Hash-Based Redaction: Instead of replacing sensitive values with ***, the TAL should offer an option to replace them with Hash(value). This allows analysts to see that the same user input caused an error multiple times without knowing what the input was (Cardinality Analysis). -5.2 The "Kill Switch" and Cost Control -Governance also implies resource control. The TAL must track token usage in real-time. -Implementation: -This cannot be done in a SpanProcessor (which is reactive/async). It requires a Synchronous Hook within the adapters. -Token Counting: On new_span (LlamaIndex) or on_llm_start (LangChain), the adapter extracts the input text. -Estimation: Use a fast, approximate tokenizer (like tiktoken) to estimate cost. -Budget Check: Check a thread-safe SessionBudget counter. -Intervention: If current_usage + estimated_cost > limit, the adapter raises a GovernanceException immediately. -Effect: This aborts the framework's execution flow before the API call is made to the LLM provider, effectively acting as a circuit breaker for run-away agents. -6. Local Debugging and "Flight Recorder" Tools -Developers require immediate feedback loops. 
The "Flight Recorder" feature allows developers to see traces locally without spinning up a complex Docker-based observability stack (Jaeger, Zipkin, Collector). -6.1 The Buffering Span Processor -To show a complete trace locally, the TAL must capture spans until the operation completes. -Implementation: -Data Structure: A thread-safe deque (double-ended queue) is used to buffer completed spans.20 -Trigger: The buffer is not flushed based on time, but on the completion of a Root Span. -Tail Sampling for Local Debug: The processor can be configured to FlushOnCondition. -ALWAYS: Dump every trace to disk (noisy). -ON_ERROR: Only dump traces where root_span.status == ERROR. This is highly effective for debugging flaky agents without wading through successful logs.21 -6.2 Standalone HTML Report Generator -The specific requirement for local debugging is a standalone HTML file that acts as a "Trace Player." -Why D3.js? -While Playwright's trace viewer is powerful, it requires a Node.js runtime to generate traces. A D3.js implementation allows the TAL (Python) to generate the report by simply replacing a string placeholder in an HTML template. D3.js is lightweight, capable of rendering collapsible trees (essential for deep agent traces), and runs entirely in the browser.22 -Technical Specification for Report Generation: -Tree Construction: The buffered flat list of spans must be converted into a hierarchical JSON tree. -Find the span with parent_id == None (Root). -Recursively find all spans where parent_id == current_span.context.span_id. -Compute exclusive_duration (Total duration - duration of children) for each node to highlight bottlenecks. -JSON Embedding: -Python - -# Python Generator Logic - -tree_data = construct_tree(buffer) -json_payload = json.dumps(tree_data) - -with open("templates/report.html", "r") as f: - template = f.read() - -# Injection: The template contains - -html_out = template.replace("{{ DATA_PLACEHOLDER }}", json_payload) - -with open("agent_trace.html", "w") as f: - f.write(html_out) - -Visualization Features: -Timeline View: A Gantt chart visualization using D3 to show parallel execution (common in asyncio.gather tool calls). -Data Inspection: Clicking a node opens a side panel displaying input.value and output.value (attributes populated by the Polyglot Schema). -Search: A Javascript-based client-side search to find spans containing specific text or error codes. -6.3 The CLI Tool -To streamline the developer experience, the TAL should include a CLI entry point. -tracecraft run script.py: Runs the python script with the TAL instrumentation auto-injected. -tracecraft view last: Opens the most recently generated HTML report in the default browser. -7. Deployment Topologies -The TAL is designed to support flexible deployment models ranging from local development to scaled production Kubernetes environments. -7.1 Local / Notebook Topology -Configuration: export TRACECRAFT_MODE=LOCAL -Components: Adapters + Buffering Processor + HTML Exporter. -Output: Local .html files or console output. -Use Case: Rapid prototyping in Jupyter notebooks or local IDE debugging. -7.2 Production / Sidecar Topology -Configuration: export TRACECRAFT_MODE=PRODUCTION -Components: Adapters + Redaction Processor + OTLP Exporter. -Output: Protocol Buffers over gRPC to an OpenTelemetry Collector. -Mechanism: The OTLP exporter is configured to send data to localhost:4317 (Collector running as a sidecar) or a remote endpoint (Langfuse/Arize/Honeycomb). 
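As a rough sketch of that wiring, using the stock OpenTelemetry Python OTLP exporter (the endpoint and `insecure` flag are deployment-specific assumptions):

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
# Batch spans in-process, then ship them over gRPC to the sidecar Collector.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
```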
-Benefit: Offloads the network overhead and retries to the Collector. The TAL process remains lightweight. -7.3 Hybrid Topology (Sampling) -In high-volume production, tracing every agent step is cost-prohibitive. -Head Sampling: The TAL Runtime configures the OTel TraceIdRatioBased sampler to record only 1% of sessions. -Tail Sampling: The TAL sends all traces to a local Collector sidecar, which is configured with a tail_sampling processor to forward only traces that contain errors or high latency, dropping the rest.21 -8. Conclusion and Future Outlook -The TraceCraft Telemetry Abstraction Layer represents a critical piece of infrastructure for the maturing AI engineering stack. By strictly decoupling the application logic from the observability implementation, it solves the fragmentation problem inherent in the current ecosystem. -The proposed architecture specifically addresses the nuances of agentic systems: -Context-Aware Runtime: Solves the broken trace problem in asynchronous RAG pipelines via custom executors and contextvars management. -Polyglot Schema: Bridges the gap between the rigid OTel standard and the rich OpenInference standard, ensuring compatibility without sacrificing fidelity. -Active Governance: Moves beyond logging to enforcement, providing redaction and cost control at the runtime level. -Local "Flight Recorder": Democratizes observability for developers by providing zero-dependency visualization tools. -As the OpenTelemetry Semantic Conventions for GenAI mature and stabilize, the TAL's internal mapping logic can be updated without breaking changes to the application code, effectively "future-proofing" the agent implementation. This architecture transforms observability from a passive debugging utility into a strategic asset for reliability and governance in autonomous systems. -9. 
Appendix: Reference Data Tables

9.1 Semantic Attribute Mapping Reference

| Domain | Attribute Key | Type | Description | OTel Mapping | OpenInference Mapping |
|---|---|---|---|---|---|
| LLM | model.name | String | The name of the model invoked | gen_ai.request.model | llm.model_name |
| LLM | model.provider | String | The vendor (OpenAI, Anthropic) | gen_ai.system | llm.provider |
| LLM | gen_ai.usage.input_tokens | Int | Count of prompt tokens | gen_ai.usage.input_tokens | llm.token_count.prompt |
| LLM | gen_ai.usage.output_tokens | Int | Count of completion tokens | gen_ai.usage.output_tokens | llm.token_count.completion |
| Chain | input.value | String | Input to the chain/agent | Event Payload | input.value |
| Chain | output.value | String | Final output of the chain | Event Payload | output.value |
| RAG | retrieval.documents | JSON | List of retrieved chunks | N/A | retrieval.documents |
| Tool | tool.name | String | Name of the tool function | gen_ai.operation.name | tool.name |
| Tool | tool.parameters | JSON | Arguments passed to tool | N/A | tool.parameters |
| Error | error.type | String | Class name of exception | error.type | error.type |

9.2 Adapter Capabilities Matrix

| Feature | LlamaIndex Adapter | LangChain Adapter | PydanticAI Interceptor |
|---|---|---|---|
| Integration Point | BaseSpanHandler | BaseCallbackHandler | TracerProvider |
| Async Support | Native (Dispatcher) | Requires Dual Handlers | Native (OTel) |
| RAG Observability | High (Nodes) | Medium (Documents) | N/A (Model focused) |
| Input Capture | via bound_args | via on_chain_start | via Span Attributes |
| Governance Hook | Sync handle check | Sync callback check | Pre-export Processor |

9.3 Governance Regex Pattern Examples

| Entity | Pattern (Python Regex) | Action |
|---|---|---|
| Email | [a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+ | Mask e***@***.com |
| API Key (OpenAI) | sk-[a-zA-Z0-9]{48} | Redact `` |
| IPv4 Address | \b(?:\d{1,3}\.){3}\d{1,3}\b | Hash SHA256(IP) |
| Credit Card | \b(?:\d[ -]*?){13,16}\b | Redact `` |

Works cited
Semantic conventions for generative AI systems - OpenTelemetry, accessed January 19, 2026,
Semantic Conventions for GenAI agent and framework spans - OpenTelemetry, accessed January 19, 2026,
Semantic conventions for generative client AI spans - OpenTelemetry, accessed January 19, 2026,
Gen AI | OpenTelemetry, accessed January 19, 2026,
Spans - Arize AX Docs, accessed January 19, 2026,
Semantic Conventions | openinference - GitHub Pages, accessed January 19, 2026,
BaseCallbackHandler - LangChain.js, accessed January 19, 2026,
langchain.callbacks.base.BaseCallbackHandler, accessed January 19, 2026,
Instrumentation | LlamaIndex Python Documentation, accessed January 19, 2026,
Span handlers - LlamaIndex, accessed January 19, 2026,
Debugging & Monitoring with Pydantic Logfire - Pydantic AI, accessed January 19, 2026,
asyncio context propagation in Python 3.5/3.6 · Issue #71 · open-telemetry/opentelemetry-python - GitHub, accessed January 19, 2026,
Is OpenTelemetry in Python safe to use with Async?
- Stack Overflow, accessed January 19, 2026, -Context Propagation in Asynchronous and Multithreaded Backends | Leapcell, accessed January 19, 2026, -opentelemetry.sdk.trace.id_generator, accessed January 19, 2026, -Integrate Logfire, accessed January 19, 2026, -opentelemetry-collector-contrib/processor/attributesprocessor/README.md at main - GitHub, accessed January 19, 2026, -Python : Opentelemetry - Filtering PII Data from Logs - Reddit, accessed January 19, 2026, -Mastering the OpenTelemetry Redaction Processor - Dash0, accessed January 19, 2026, -opentelemetry.sdk.trace package, accessed January 19, 2026, -Tail-based sampling - OpenTelemetry, accessed January 19, 2026, -D3.js tree diagram generated from external (JSON) data - GitHub Gist, accessed January 19, 2026, -Create an interactive tree structure from json using D3 | Javascript - YouTube, accessed January 19, 2026, -Tail Sampling with OpenTelemetry: Why it's useful, how to do it, and what to consider, accessed January 19, 2026, diff --git a/dev/research/llm-agent-observability-market-validation.md b/dev/research/llm-agent-observability-market-validation.md deleted file mode 100644 index f129606..0000000 --- a/dev/research/llm-agent-observability-market-validation.md +++ /dev/null @@ -1,440 +0,0 @@ -# LLM/Agent Observability Unification: Market Validation Report - -**Date:** January 2026 -**Purpose:** Validate the opportunity for a new open-source Python framework that unifies LLM and agent observability across vendors - ---- - -## 1. Executive Summary - -### Strongest Validated Pain Points - -1. **Framework-Observability Lock-in**: LangSmith delivers best-in-class debugging for LangChain applications, but 84% of LangSmith users are LangChain users ([Mirascope](https://mirascope.com/blog/langsmith-alternatives)). Moving to a different framework means losing observability insights. Langfuse explicitly markets against this: "Langfuse solves three major pain points teams hit with LangSmith: lock-in, opaque pricing, and limited production feedback loops" ([Langfuse](https://langfuse.com/faq/all/langsmith-alternative)). - -2. **Fragmented Agent Instrumentation**: "Frameworks like OpenAI's Agent SDK, LangGraph, and CrewAI simplify development but often abstract away control flow, represent agents differently, and require manual instrumentation" ([Medium - Hidden Gaps in AI Agent Observability](https://medium.com/@ronen.schaffer/the-hidden-gaps-in-ai-agents-observability-36ad4decd576)). Per-call LLM spans are often missing in tool loops without manual patching ([Langfuse GitHub Issue #11505](https://github.com/langfuse/langfuse/issues/11505)). - -3. **OpenTelemetry GenAI Conventions Are Incomplete**: Current OTel conventions "address LLM completions but lack coverage for agentic systems. Without standard conventions, observability is fragmented across custom attributes" ([OTel GitHub Issue #2664](https://github.com/open-telemetry/semantic-conventions/issues/2664)). - -4. **PII/Privacy Concerns at the Tracing Layer**: Enterprises face challenges because "LLMs don't automatically distinguish between sensitive and non-sensitive data" and prompt logging creates compliance risks ([Kong](https://konghq.com/blog/enterprise/building-pii-sanitization-for-llms-and-agentic-ai)). 
- -### Clearest Opportunity Statement - -**There is a validated, substantial opportunity** for a framework that provides LiteLLM-style abstraction for observability backends—not another observability platform, but a portable instrumentation layer that: - -- Captures consistent agent/LLM trace semantics across ANY orchestration approach -- Routes traces to ANY backend (Langfuse, Datadog, Phoenix, direct OTel) via pluggable exporters -- Provides first-class PII redaction before data leaves the application - -### Recommended Wedge: "Unified Agent Trace SDK" - -**The sharpest wedge is NOT another OpenTelemetry exporter** (OpenLLMetry exists), **nor another observability platform** (Langfuse/Phoenix exist). The wedge is: - -> A minimal Python SDK providing a canonical agent/LLM trace data model with pluggable exporters—enabling "instrument once, observe anywhere" for teams using heterogeneous frameworks (LangChain + custom agents + direct SDK calls) and multiple observability backends. - -**Why this wins adoption:** - -1. Solves the "I have LangChain AND custom agents AND direct OpenAI calls" problem -2. Enables A/B testing observability backends without re-instrumentation -3. PII redaction as a first-class citizen attracts enterprise adopters -4. OpenTelemetry-compatible but higher-level (doesn't require OTel expertise) - ---- - -## 2. Evidence Table - -| Pain Point | Who Experiences It | Evidence Source | Frequency/Strength | Why Current Solutions Fail | -|------------|-------------------|-----------------|-------------------|---------------------------| -| **Framework lock-in** (LangSmith best for LangChain only) | App devs using mixed frameworks | [Langfuse Comparison](https://langfuse.com/faq/all/langsmith-alternative), [Mirascope](https://mirascope.com/blog/langsmith-alternatives) | Strong - multiple vendor docs acknowledge this | LangSmith requires manual instrumentation for non-LangChain; Langfuse requires setup for each framework | -| **Per-call LLM spans missing in agent loops** | Platform engineers debugging AutoGen/CrewAI | [Langfuse GitHub #11505](https://github.com/langfuse/langfuse/issues/11505) | High - specific bug report with reproduction | Frameworks abstract away control flow; manual patching required | -| **OTel GenAI conventions incomplete for agents** | ML engineers, platform teams | [OTel GitHub #2664](https://github.com/open-telemetry/semantic-conventions/issues/2664), [OTel GitHub #2665](https://github.com/open-telemetry/semantic-conventions/issues/2665) | High - official GitHub issues with RFC | OTel GenAI SIG still "Development" status; agent spans experimental | -| **Inconsistent semantic conventions across vendors** | Platform engineers stitching tools | [Arize OpenInference](https://github.com/Arize-ai/openinference/), [OTel Docs](https://opentelemetry.io/docs/specs/semconv/gen-ai/) | High - multiple competing standards | OpenInference vs OTel gen_ai vs proprietary models require translation | -| **Global tracer provider conflicts** (OTEL auto-instrumentation pollutes traces) | DevOps, platform engineers | [Langfuse GitHub Discussion #9136](https://github.com/orgs/langfuse/discussions/9136) | Medium - Langfuse v3 discussion | No easy isolation; must manually construct TracerProviders | -| **PII in prompts/completions logged without redaction** | Security teams, compliance officers | [Kong Blog](https://konghq.com/blog/enterprise/building-pii-sanitization-for-llms-and-agentic-ai), [LangChain Blog](https://blog.langchain.com/handling-pii-data-in-langchain/) | High - 
multiple vendor docs address it | Redaction is optional/manual; not first-class in most SDKs | -| **Context propagation breaks across async/streaming** | Backend engineers | [OTel Blog](https://opentelemetry.io/blog/2025/ai-agent-observability/), [Dev.to Guide](https://dev.to/kuldeep_paul/a-practical-guide-to-distributed-tracing-for-ai-agents-1669) | Medium - mentioned in best practices docs | Requires careful baggage propagation; many instrumentations fail silently | -| **Cost explosion at scale with APM pricing** | FinOps, platform teams | [Firecrawl](https://www.firecrawl.dev/blog/best-llm-observability-tools), [Datadog Pricing](https://www.datadoghq.com/product/llm-observability/) | Medium - vendor pricing models reveal this | LLM traces are high-cardinality; token bodies are large; APM pricing doesn't fit | -| **Evaluation signals stored differently by each vendor** | ML engineers doing evals | [Medium - LLM Eval Tools 2026](https://medium.com/online-inference/the-best-llm-evaluation-tools-of-2026-40fd9b654dce) | Medium - comparison articles highlight this | Each platform has own feedback/scoring schema; no portability | -| **Multi-cloud agent tracing immature** | Enterprise architects | [InfoWorld](https://www.infoworld.com/article/4085736/google-boosts-vertex-ai-agent-builder-with-new-observability-and-deployment-tools.html), [Xenoss](https://xenoss.io/blog/aws-bedrock-vs-azure-ai-vs-google-vertex-ai) | Medium - vendor comparisons note gaps | Each cloud has proprietary tracing (CloudWatch vs Azure Monitor vs Vertex); no unified view | - ---- - -## 3. Landscape and Gap Analysis - -### 3.1 Vendor-by-Vendor Comparison - -| Vendor/Tool | Trace Model | Integration Approach | Agent Support | OTel Compatibility | Lock-in Risk | -|-------------|-------------|---------------------|---------------|-------------------|--------------| -| **LangSmith** | Hierarchical runs with LangChain semantics | Env vars (LangChain); SDK (others) | LangGraph-native | OTel export supported (but native recommended for perf) | High - best with LangChain | -| **Langfuse** | Traces → Generations/Spans (OTel-based) | Callbacks, decorators, OTel SDK | Framework-agnostic | Native OTel | Low - open source, self-host | -| **Arize Phoenix** | OpenInference spans | SDK decorators, OTel compatible | Via OpenInference conventions | Compatible (with translation) | Low - open source | -| **Datadog LLM Obs** | Extends APM traces with LLM spans | SDK auto-instrumentation; OTel GenAI v1.37+ | Unified model across frameworks | Native support since 2024 | Medium - SaaS lock-in | -| **W&B Weave** | @weave.op decorator-based trees | Decorator-only | Manual instrumentation | No OTel export | Medium - W&B ecosystem | -| **MLflow Tracing** | OTel-compatible spans | fluent API, autolog | LangChain/LlamaIndex integrations | OTel export supported | Low - open source | -| **OpenLLMetry** | OTel spans with gen_ai attributes | Auto-instrumentation libraries | Via OTel GenAI conventions | Native | Low - fully OTel | -| **Langtrace** | OTel spans | SDK init | Instrumentations for 30+ providers | Native | Low - open source | -| **Cloud Providers** | Proprietary (CloudWatch/Azure Monitor/Vertex) | Native SDK integration | Varies - early stage | Limited/custom connectors | High - cloud lock-in | - -### 3.2 Where Interoperability Breaks - -#### Semantic Convention Divergence - -| Standard | Example Attributes | Primary Users | -|----------|-------------------|---------------| -| **OpenInference** | `llm.input_messages..message.role` | Arize 
Phoenix | -| **OTel gen_ai** | `gen_ai.request.model`, `gen_ai.usage.input_tokens` | Datadog, OpenLLMetry | -| **LangSmith** | `langsmith.run_type`, custom metadata schema | LangChain users | - -Translation required between all three ([Arize Translation Docs](https://arize.com/docs/phoenix/tracing/concepts-tracing/translating-conventions)) - -#### Agent/Task Concepts Missing from OTel - -- "Current OTel semantic conventions for LLMs cover completions, but do not capture requester context—who or what initiated the task" ([OTel #2665](https://github.com/open-telemetry/semantic-conventions/issues/2665)) -- No standard for: agent memory, state transitions, human-in-the-loop, multi-agent delegation - -#### Framework Callback Incompatibility - -| Framework | Callback Mechanism | Notes | -|-----------|-------------------|-------| -| LangChain | `BaseCallbackHandler` with typed events | Well-documented | -| LlamaIndex | `CallbackManager` → migrating to `instrumentation` module | In transition | -| OpenAI Agents SDK | Built-in tracing | Not externally pluggable | -| CrewAI | Custom callbacks | Limited documentation | - -Each requires separate integration work. - -#### Evaluation Signal Storage - -| Platform | Feedback Storage | -|----------|-----------------| -| Langfuse | Scores attached to traces | -| LangSmith | Feedback linked to runs | -| W&B Weave | Logged as Weave objects | - -No portable format for human feedback/scores. - -### 3.3 Why OpenTelemetry Alone Doesn't Solve It - -| OTel Limitation | Impact | Evidence | -|-----------------|--------|----------| -| GenAI conventions still "Development" status | Can't rely on stability | [OTel Docs](https://opentelemetry.io/docs/specs/semconv/gen-ai/) | -| Agent conventions in RFC stage only | No standard for tasks/actions/memory | [OTel #2664](https://github.com/open-telemetry/semantic-conventions/issues/2664) | -| High learning curve | Adoption friction | "OpenTelemetry's complexity creates real barriers to adoption" ([OTel Blog](https://opentelemetry.io/blog/2025/stability-proposal-announcement/)) | -| No built-in PII redaction | Must add custom processors | Separate concern not addressed by spec | -| Prompt bodies not designed for OTel | "Nobody designed telemetry for multi-kilobyte prompts" ([Nir Gazit, OpenLLMetry](https://horovits.medium.com/opentelemetry-for-genai-and-the-openllmetry-project-81b9cea6a771)) | - ---- - -## 4. Competitive Analysis: Why Hasn't This Already Won? 
- -### 4.1 OpenLLMetry (Traceloop) - -- **What it does**: OTel-based auto-instrumentation for LLMs (6,600+ stars) -- **Why it hasn't won**: - - Still requires OTel expertise to configure collectors/exporters - - No higher-level abstractions for agent workflows - - No built-in PII handling - - Users must understand OTel ecosystem to benefit - -### 4.2 Langfuse - -- **What it does**: Open-source observability platform (YC W23) -- **Why it hasn't won as "the standard"**: - - It's a platform, not a portable SDK—you instrument FOR Langfuse - - Teams using Datadog/New Relic don't want a second platform - - Doesn't solve "I want traces in BOTH Langfuse AND Datadog" - -### 4.3 OpenInference (Arize) - -- **What it does**: Extended OTel conventions for LLM tracing -- **Why it hasn't won**: - - Primarily benefits Phoenix users - - Requires translation to use with non-Arize tools - - Competing standard vs OTel gen_ai (fragmentation) - -### 4.4 OTel GenAI SIG - -- **What it does**: Official standardization effort -- **Why it hasn't won yet**: - - Started April 2024; agent conventions still RFC - - Moving slowly due to multi-vendor consensus requirements - - Focuses on conventions, not SDK/DX - - "Development" status = breaking changes expected - -### 4.5 Vendor Incentives Misalignment - -| Vendor | Incentive | -|--------|-----------| -| LangChain | Keep LangSmith tightly coupled | -| Datadog | Consolidate customers on their platform | -| Cloud Providers | Lock into their tracing (CloudWatch/Azure Monitor/Vertex) | - -**No vendor is incentivized** to build truly portable instrumentation. - ---- - -## 5. Recommended Framework Concept - -### 5.1 What to Build: "Unified Agent Trace SDK" - -A minimal Python SDK that provides: - -#### A. Canonical Trace Schema for LLM/Agent Workflows - -``` -Trace -├── Span (type: agent_run | llm_call | tool_call | retrieval | memory_access) -│ ├── Attributes (model, provider, tokens, cost, latency) -│ ├── Events (input_messages, output_messages, tool_result, error) -│ └── Children (nested spans) -└── Metadata (session_id, user_id, environment, tags) -``` - -**Required span types** (covering the intersection of all major platforms): - -| Span Type | Description | -|-----------|-------------| -| `agent_run` | Top-level agent execution | -| `llm_call` | Any LLM inference (chat, completion, embedding) | -| `tool_call` | Tool/function invocation | -| `retrieval` | RAG retrieval step | -| `memory_access` | Read/write to agent memory | -| `evaluation` | Human or automated scoring | - -**Attribute mapping** to both OpenInference and OTel gen_ai (translation layer built-in). - -#### B. Plugin Architecture for Exporters/Backends - -```python -from unified_trace import configure_tracing, LangfuseExporter, DatadogExporter, OTelExporter - -configure_tracing( - exporters=[ - LangfuseExporter(public_key="..."), # Primary - DatadogExporter(api_key="..."), # Secondary - OTelExporter(endpoint="...") # OTLP fallback - ], - redaction=PIIRedactor(mode="mask"), # First-class -) -``` - -**Priority exporters to build**: - -| Priority | Backend | Rationale | -|----------|---------|-----------| -| 1 | Langfuse | Open-source, easy to test, large community | -| 2 | Datadog LLM Observability | Enterprise demand, validates OTel bridge | -| 3 | Direct OTLP | Universal compatibility with any OTel backend | - -#### C. 
Instrumentation Strategy (Framework-Agnostic) - -**Layer 1: Auto-instrumentation** (like OpenLLMetry) - -```python -from unified_trace import instrument_openai, instrument_anthropic -instrument_openai() # Monkey-patches OpenAI client -``` - -**Layer 2: Framework callbacks** (for LangChain, LlamaIndex, etc.) - -```python -from unified_trace.langchain import UnifiedTraceCallbackHandler -chain.invoke(..., callbacks=[UnifiedTraceCallbackHandler()]) -``` - -**Layer 3: Manual spans** (for custom agents) - -```python -from unified_trace import trace_agent, trace_tool - -@trace_agent(name="research_agent") -async def research(query: str): - with trace_tool("web_search") as span: - results = await search(query) - span.set_attribute("result_count", len(results)) - return results -``` - -#### D. Redaction/PII Controls as First-Class - -```python -from unified_trace import PIIRedactor, RedactionMode - -redactor = PIIRedactor( - mode=RedactionMode.MASK, # or HASH, REMOVE, SYNTHETIC - patterns=["email", "phone", "ssn", "credit_card"], - custom_patterns=[r"CUSTOMER-\d+"], - allowlist=["support@company.com"], # Don't redact these -) - -configure_tracing(redaction=redactor) -``` - -Redaction happens **before** data is exported—data never leaves the application unredacted. - -#### E. Optional OTel Bridging Layer - -For teams with existing OTel infrastructure: - -```python -from unified_trace.otel import create_otel_tracer_provider - -# Wraps unified traces as OTel spans with gen_ai semantic conventions -tracer_provider = create_otel_tracer_provider( - exporters=["otlp://collector:4317"], - convention="gen_ai_v1_37" # or "openinference" -) -``` - -### 5.2 What NOT to Build (Scope Limits) - -| Explicitly Excluded | Reason | -|---------------------|--------| -| Full observability platform/UI | Langfuse, Phoenix exist; don't compete | -| Prompt management/versioning | Different problem; prompt registries exist | -| Evaluation framework | Deep eval is separate domain; just capture signals | -| Model training telemetry | Out of scope (inference/runtime only) | -| Custom storage backend | Use existing platforms; we're a router | -| Real-time alerting | Backend responsibility | - ---- - -## 6. Adoption Strategy - -### 6.1 Initial "Wow" Integration Path (10 Minutes) - -```bash -pip install unified-trace -``` - -```python -# main.py -from unified_trace import configure_tracing, LangfuseExporter -from openai import OpenAI - -configure_tracing( - exporters=[LangfuseExporter(public_key="pk-...", secret_key="sk-...")], - auto_instrument=["openai"] # One line enables tracing -) - -client = OpenAI() -response = client.chat.completions.create( - model="gpt-4", - messages=[{"role": "user", "content": "Hello!"}] -) -# Trace automatically captured and exported to Langfuse -``` - -**Result**: Zero code changes beyond import + configure. Full trace visible in Langfuse within 10 minutes. 
- -### 6.2 Top 3 Backends to Support First - -| Backend | Rationale | Integration Complexity | -|---------|-----------|----------------------| -| **Langfuse** | Open-source, self-hostable, large community (11k+ GitHub stars), easy testing | Low - REST API, good docs | -| **Datadog LLM Obs** | Enterprise demand, validates OTel GenAI compliance, large customer base | Medium - requires OTel GenAI v1.37+ format | -| **Direct OTLP** | Universal - works with Jaeger, Honeycomb, Grafana Tempo, New Relic, any collector | Medium - must comply with gen_ai conventions | - -### 6.3 How to Win: Adoption Levers - -#### Compatibility Story - -- "Works with your existing LangChain callbacks—just add one line" -- "Export to Langfuse AND Datadog simultaneously" -- "Switch backends without changing instrumentation code" - -#### Migration Path - -- Provide `unified-trace migrate` CLI that reads existing LangSmith/Langfuse traces and shows equivalent unified-trace code -- Document exact translation from LangSmith/Langfuse/Phoenix patterns - -#### Community Strategy - -- Contribute upstream to OTel GenAI SIG (legitimacy) -- Partner with Langfuse (natural ally—they want more users, not lock-in) -- Write integration guides for LangChain, LlamaIndex, CrewAI, AutoGen - -#### Enterprise Hooks - -- PII redaction compliance story (GDPR, HIPAA) -- "Instrument once, send to compliance-approved backend" -- Audit trail for what data was redacted - -### 6.4 Success Metrics (First 6 Months) - -| Metric | Target | Rationale | -|--------|--------|-----------| -| GitHub stars | 1,000+ | Indicates community interest | -| PyPI downloads/month | 10,000+ | Active usage signal | -| Production users | 50+ companies | Validates real-world utility | -| Exporters supported | 5+ | Langfuse, Datadog, OTLP, Phoenix, MLflow | -| Framework integrations | LangChain, LlamaIndex, OpenAI SDK, Anthropic | Covers 80%+ of use cases | - ---- - -## 7. Final Recommendation - -### Decision: **Yes, a strong opportunity exists** - -**Confidence Level**: High (based on multiple corroborating sources) - -**The sharpest wedge**: A portable instrumentation SDK—not a platform—that solves the "instrument once, observe anywhere" problem for teams with: - -- Multiple frameworks (LangChain + custom agents + direct SDK calls) -- Multiple observability needs (dev: Langfuse, prod: Datadog) -- Compliance requirements (PII redaction as first-class) - -**Why this wins**: - -1. **Unmet need**: OpenLLMetry requires OTel expertise; Langfuse is a platform not a router; LangSmith is framework-coupled -2. **Clear value prop**: "LiteLLM for observability"—developers already understand this pattern -3. **Enterprise hook**: PII redaction + compliance story creates budget -4. **Community alignment**: Partners with (doesn't compete against) Langfuse, Phoenix, OpenLLMetry - -**Key risk**: OTel GenAI SIG eventually stabilizes and provides native solution -**Mitigation**: Build as OTel-compatible from day one; position as "higher-level SDK on top of OTel" - ---- - -## 8. 
Sources - -### Primary Documentation - -- [OpenTelemetry GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/) -- [OpenTelemetry GenAI Agent Spans](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/) -- [Langfuse Documentation](https://langfuse.com/docs/observability/overview) -- [LangSmith Documentation](https://docs.langchain.com/langsmith/home) -- [Arize OpenInference](https://github.com/Arize-ai/openinference/) -- [OpenLLMetry (Traceloop)](https://github.com/traceloop/openllmetry) -- [Datadog LLM Observability](https://www.datadoghq.com/product/llm-observability/) -- [Langtrace GitHub](https://github.com/Scale3-Labs/langtrace) - -### GitHub Issues & Discussions - -- [OTel #2664: Semantic Conventions for GenAI Agentic Systems](https://github.com/open-telemetry/semantic-conventions/issues/2664) -- [OTel #2665: Semantic Conventions for GenAI Tasks](https://github.com/open-telemetry/semantic-conventions/issues/2665) -- [Langfuse #11505: Missing per-call LLM spans in AutoGen](https://github.com/langfuse/langfuse/issues/11505) -- [Langfuse Discussion #9136: OTel auto-instrumentation conflicts](https://github.com/orgs/langfuse/discussions/9136) -- [Langfuse Roadmap Discussion 2026](https://github.com/orgs/langfuse/discussions/11391) -- [MLflow #18216: Manual + Automatic Tracing with LangGraph](https://github.com/mlflow/mlflow/issues/18216) - -### Comparison & Analysis Articles - -- [Langfuse vs LangSmith Comparison (ZenML)](https://www.zenml.io/blog/langfuse-vs-langsmith) -- [Best LLM Observability Tools 2025 (Firecrawl)](https://www.firecrawl.dev/blog/best-llm-observability-tools) -- [LangSmith Alternatives (Mirascope)](https://mirascope.com/blog/langsmith-alternatives) -- [AI Agent Observability - OTel Blog](https://opentelemetry.io/blog/2025/ai-agent-observability/) -- [OpenTelemetry for GenAI (OTel Blog 2024)](https://opentelemetry.io/blog/2024/otel-generative-ai/) -- [Hidden Gaps in AI Agent Observability (Medium)](https://medium.com/@ronen.schaffer/the-hidden-gaps-in-ai-agents-observability-36ad4decd576) -- [OpenTelemetry for GenAI and OpenLLMetry (Medium)](https://horovits.medium.com/opentelemetry-for-genai-and-the-openllmetry-project-81b9cea6a771) - -### Vendor/Tool Documentation - -- [LiteLLM Callbacks](https://docs.litellm.ai/docs/observability/callbacks) -- [Datadog OTel GenAI Support](https://www.datadoghq.com/blog/llm-otel-semantic-convention/) -- [Kong PII Sanitization](https://konghq.com/blog/enterprise/building-pii-sanitization-for-llms-and-agentic-ai) -- [LlamaIndex Observability](https://developers.llamaindex.ai/python/framework/module_guides/observability/) -- [LangChain PII Handling](https://blog.langchain.com/handling-pii-data-in-langchain/) - -### Cloud Provider Comparisons - -- [AWS Bedrock vs Azure AI vs Google Vertex (Xenoss)](https://xenoss.io/blog/aws-bedrock-vs-azure-ai-vs-google-vertex-ai) -- [Google Vertex AI Agent Builder Updates (InfoWorld)](https://www.infoworld.com/article/4085736/google-boosts-vertex-ai-agent-builder-with-new-observability-and-deployment-tools.html) diff --git a/dev/research/openai-research-focused.md b/dev/research/openai-research-focused.md deleted file mode 100644 index 7a3a0f1..0000000 --- a/dev/research/openai-research-focused.md +++ /dev/null @@ -1,209 +0,0 @@ -TraceCraft TAL – Implementation Research - -Core Library Dependencies and Telemetry Architecture - -TraceCraft TAL will be built on top of OpenTelemetry’s Python SDK as its foundation for trace collection and context propagation . 
By leveraging OpenTelemetry under the hood, the library can emit standard trace data that is compatible with a wide range of backends (e.g. Datadog, Honeycomb, OpenTelemetry Collector) via OTLP without custom integration code  . In practice, this means TraceCraft can use OpenTelemetry’s TracerProvider, Span API, and context mechanism to create spans for each agent run and step, ensuring propagation across function calls and async tasks. The project will likely depend on OpenTelemetry’s Python packages such as opentelemetry-api and opentelemetry-sdk (for the tracer and span processors) and opentelemetry-exporter-otlp (for OTLP exporting). Using OpenTelemetry provides built-in support for context propagation (via context variables), which will help TraceCraft work seamlessly in asynchronous code by preserving trace context across async/await boundaries. - -Critically, TraceCraft is not reinventing low-level instrumentation of LLM/AI libraries – instead, it will build on existing OpenTelemetry-based instrumentation (such as the OpenLLMetry project by Traceloop). OpenLLMetry provides a suite of provider-specific instrumentors (for OpenAI, Anthropic, various vector DBs, etc.) and extensions that capture model-specific telemetry . Rather than writing our own monkey-patches for each LLM API or database, TraceCraft can depend on OpenLLMetry’s instrumentation packages. For example, the project can use opentelemetry-instrumentation-openai to automatically trace calls made with the OpenAI Python SDK . These instrumentations already record key attributes like model name, prompt, and token counts. By calling the appropriate instrumentor classes (e.g. OpenAIInstrumentor().instrument() for OpenAI) during TraceCraft initialization, we ensure that LLM provider calls are captured as spans with minimal effort . Similarly, OpenLLMetry covers many vector stores and frameworks, so TraceCraft can activate those as needed (OpenLLMetry’s repo notes support for OpenAI/Azure, Anthropic, Chroma, Pinecone, LlamaIndex, LangChain, etc. out-of-the-box  ). - -Besides OpenTelemetry, a few other Python libraries will be important. For configuration and possible CLI functionality, Click or Typer can be used to parse environment variables or command-line options (for example, enabling an tracecraft CLI command to initialize tracing or generate a report). For schema modeling, Pydantic (v2) or Python’s dataclasses will help define the AgentRun and Step data models. Using Pydantic’s BaseModel for the canonical schema ensures type validation and easy serialization to JSON (which is useful for producing JSONL logs or debugging outputs). Pydantic can model nested objects (e.g. a Step containing a list of child Steps) and handle optional fields like exceptions or token counts. If minimizing dependencies is a goal, dataclasses could be used with manual serialization, but Pydantic’s convenience and ecosystem (and perhaps compatibility with things like FastAPI or other tools if needed later) make it a strong choice for this SDK. - -Another likely dependency is Rich, a library for rich text output in the console. Rich provides a Tree construct and styled console logging, which is ideal for printing the hierarchical trace of AgentRuns and Steps in a visually appealing way. By utilizing Rich (or a similar library), TraceCraft can render a tree view of an agent’s steps with indentation, colors, and unicode tree connectors, making it far more readable than plain print() output. 
In fact, delivering a “beautiful hierarchical console output” is a key adoption driver – developers should be able to call their agent with TraceCraft enabled and immediately see a clear tree of the agent’s reasoning steps in the terminal. Rich also supports exporting console content to HTML, which we can leverage for the local HTML report artifact. For instance, Rich’s Console can record output and produce an HTML file with the same formatting and coloring as the terminal output. This means the “local HTML report” could essentially be the same Rich-rendered trace, saved to an HTML file, so that users can open it in a browser for inspection (no server needed). - -Finally, the project may include a small CLI tool (using Typer or Click) for utility commands. While not a primary focus of the SDK, a CLI could allow actions like tracecraft collect (to run a Python script with TraceCraft enabled, similar to how opentelemetry-instrument works) or tracecraft report to open a saved HTML report. This isn’t strictly necessary for v0.1, but using a CLI library would make it easy to add such developer conveniences if desired. - -AgentRun and Step Schema Design (Structured Traces) - -TraceCraft revolves around a canonical schema for traces: an AgentRun (the top-level trace for a single invocation of an agent or chain) composed of nested Step objects forming a tree. Designing this schema is crucial for ensuring all relevant data is captured in a structured way. We will likely define Python classes or Pydantic models for these. - -AgentRun: This represents one complete run of an agent (or chain of LLM calls, tools, etc.). It will include metadata like a unique run ID, start and end timestamps, and possibly a user/session ID for correlation. The AgentRun is essentially a special Step at the root, so it may share many fields with Step. For instance, AgentRun could have an attribute for the root input (the user’s query or initial prompt to the agent) and the final output (the agent’s final answer). It also contains the sequence of top-level steps the agent took (each of which may have further sub-steps). - -Step: Each Step represents a meaningful unit of work in the agent’s execution. Based on the scope, initial Step types (or “kinds”) include: "agent", "llm", "tool", "retrieval", "guardrail", "workflow", and "error". We will likely implement Step with a field for type (enum or string) to tag what kind of step it is. There will also be a name or description (e.g. the tool name or a custom label for the step), and standard telemetry fields like start time, end time or duration, and a place for inputs and outputs. Crucially, inputs/outputs should be structured objects when possible, not just strings – for example, an LLM step’s input might include a list of message objects (with role and content), and output might be a message or text. We want to capture these as data, since the schema is “LangSmith-quality” meaning it preserves rich information for each call. Steps will also have an exception field (populated if the step resulted in an error) and possibly a retry counter or metadata if the step was retried. Each Step can contain a list of children Steps, allowing a hierarchy (for example, an agent step might have child steps for each tool it invoked, or a workflow step might contain sub-task steps). This naturally forms a tree that we can traverse for printing or exporting. - -Using Pydantic for these models is beneficial. 
We can define a BaseStep model and perhaps subclass or parameterize it for different step types if needed (or simply use conditional fields). Pydantic will make it easy to output JSON – for instance, we can call agent_run.json() to get a JSON representation that can be written to the JSONL file. Each Step’s inputs and outputs can be of type Any or more specific Pydantic models (for known structures like LLM messages, we might define a small model). This structured approach means that downstream consumers (or a future UI) can parse the JSONL or in-memory objects easily to display or analyze the trace. - -To implement the API surface on top of these models: we will provide convenience decorators and context managers. The scope calls for @tracecraft.agent and @tracecraft.step decorators, as well as an tracecraft.step() context manager. Internally, these will create and record Step objects. For example, an @tracecraft.agent decorator on a function would, when the function is called, start a new AgentRun (or agent-type Step) before executing the function and automatically close it at the end. This is similar to OpenLLMetry’s approach: OpenLLMetry offers an @agent decorator to trace an autonomous agent as a single unit, and @tool to mark each tool invocation . TraceCraft will follow this pattern – developers can annotate their code so that agent and tool boundaries are captured without heavy lifting. Likewise, @tracecraft.workflow or @tracecraft.task could mark logical groupings of steps in non-agent chains (OpenLLMetry has @workflow and @task for multi-step processes , which we can emulate). The tracecraft.step() context manager would be a way to instrument arbitrary code blocks as a Step – e.g. with tracecraft.step("parsing input"): would create a step of type workflow or custom, record timing and any logged events within, and close on exit. - -One hard requirement is supporting async functions and context propagation. The decorators and context managers must work in asynchronous code just as well as in sync code. We can achieve this by relying on OpenTelemetry’s context mechanism (which uses thread-local or contextvar under the hood) to attach the current span/trace to async task contexts. Additionally, if we maintain our own stack of current Step objects (for building the in-memory tree), we can use Python’s contextvars to ensure the current Step pointer is preserved across async calls. OpenLLMetry’s docs confirm that their decorators work with both sync and async functions, using the same annotation for both . We will mirror this approach – possibly our decorator will detect if the function is a coroutine and handle it accordingly, but likely the same pattern “start step, run function, end step” can be applied generically. - -In summary, the schema modeling will give us a flexible but robust way to represent agent traces. Each Step will carry all the info needed (timestamps, type, inputs, outputs, tokens, error, children, etc.), and the AgentRun will serve as a container for the entire execution. This structured data is the core that feeds all other functionality: printing, processing, and exporting. - -Trace Capturing and Context Propagation - -When tracecraft.init() is called, the library will set up the tracing runtime. This involves configuring OpenTelemetry and enabling instrumentation. Concretely, init() will perform steps such as: - • Initialize OpenTelemetry TracerProvider: Create a tracer provider (likely the OTLP tracer provider or default SDK provider) and set it as global. 
Add span processors (discussed later for batching, sampling, etc.) and exporters as configured. If no explicit backend is configured (the default “local mode”), we might attach a simple console exporter or no exporter at all initially (since by default we focus on local output). If an OTLP endpoint is configured via env vars or parameters, we attach an OTLPSpanExporter to send traces out. The goal is to make switching backends a config change, not a code change. By default, we lean towards local debugging (no remote exporter until enabled). - • Enable relevant instrumentation: Use OpenLLMetry or OpenTelemetry instrumentors for the major libraries. For example, if the user has the OpenAI SDK installed, we call OpenAIInstrumentor().instrument()  which will monkey-patch openai.Completion.create and similar methods to start spans around each API call. Similarly, for LangChain integration, we might detect if LangChain is imported and if so attach a LangChain-specific callback handler (LangChain’s callback system allows hooking into chain and LLM events – we can create a callback that calls TraceCraft to create steps for each chain, tool, etc.). OpenLLMetry actually auto-detects some frameworks: it will “automatically detect the framework and annotate your traces” for LangChain, Haystack, LlamaIndex, etc . TraceCraft can take a simpler approach for v0.1 by providing explicit adapters: e.g. tracecraft.integrations.langchain could provide a LangChain CallbackHandler that a user can plug into their LangChain agent to emit TraceCraft steps. The MVP requires at least one of LangChain or LlamaIndex integration; given their popularity, implementing a LangChain callback is likely. This callback would listen for events like on_chain_start, on_chain_end, on_tool_start, on_tool_end, etc., and within those, call tracecraft APIs to create or finalize steps corresponding to those events. In effect, this translates LangChain’s internal trace events into our AgentRun/Step structure (so a LangChain “AgentExecution” might correspond to an agent Step, a LangChain Tool run to a tool Step, and an LLM call to an llm Step). - • Start a root span (AgentRun): If desired, tracecraft.init() could also start a root tracing span for the entire app or session (though more typically, the tracing begins when an agent function is invoked). In many cases, the user will decorate an entry-point function with @tracecraft.agent, which will trigger creation of an AgentRun when that function is called. Internally, that will call OpenTelemetry’s tracer.start_as_current_span() to create a new span representing the agent run. The span’s name and attributes will follow OTel’s semantic conventions for an agent invocation. According to the new Generative AI semantic standards, an agent invocation span should use an operation name like "invoke_agent" and include attributes like gen_ai.agent.name (human-readable name of the agent, if available) . If the agent has an identifier or description, we attach those as gen_ai.agent.id or description . The span kind would usually be INTERNAL for in-process agents (OTel recommends INTERNAL for in-proc, vs CLIENT for out-of-process agents) . We’ll adhere to these conventions so that if the data is sent out via OTLP, it aligns with what other tools expect. - • Context propagation: As each new span/Step is created, OpenTelemetry will handle linking it to the current context, so child spans automatically know their parent. 
For example, when the OpenAIInstrumentor starts a span for openai.chat.completions.create(), it will do so in the context of whatever current span is active (which would be the agent or tool step if our decorators set it correctly). This yields the proper parent-child relationships forming the trace tree. TraceCraft will also maintain its own in-memory stack of Steps (especially for those steps created via our decorators or context managers). We can use a context-local variable (via Python’s contextvars) to store the current Step, so that nested calls know where to attach. This is similar in concept to how logging libraries propagate context. By using contextvars, the current Step pointer will flow to child coroutines, ensuring that even in async workflows, we know the proper parent. In essence, whenever an tracecraft.step() context manager is entered, we push a new Step on the stack and also start a new OpenTelemetry span; on exit we pop it and end the span. This dual tracking ensures both our high-level structure and the low-level OTel tracing stay in sync. - • Capturing inputs/outputs: We will instrument our decorators such that they automatically capture function arguments as the Step’s input, and function return values (or exceptions) as the Step’s output or error. For example, the @tracecraft.agent decorator can wrap the function call in code that records args/kwargs (perhaps converting them to something JSON-serializable or at least repr) under step.inputs. If the function returns a value, we set step.outputs to that value (or a truncated version if it’s large). If it throws an exception, we mark the step type “error” or set an step.error with exception info. This approach is analogous to how LangSmith or other observability SDKs capture inputs/outputs for each run. OpenTelemetry spans support attaching Events or Logs as well – we might log an event on the span for the prompt and result content if those are large, rather than as attributes (OTel’s GenAI semantic conventions actually suggest recording prompt and response text as span Events (Logs) with semantic field names ). TraceCraft could do similar: e.g. emit a log event on the span with event.name="prompt" and the prompt text. However, since we have our structured Step model, we can store the content directly in the Step object for local use, and only decide what to send externally (maybe omitting full content by default). - -In summary, TraceCraft’s capture mechanism uses a combination of manual instrumentation (via our decorators/callbacks) and automatic instrumentation (via OpenLLMetry’s instrumentors) to comprehensively catch everything that happens in an agent run. The OpenTelemetry SDK will tie these together into a single trace context. The end result is that by the time an agent finishes execution, we have: a populated AgentRun object in memory (with the tree of Steps and all metadata), and a set of OpenTelemetry spans (one per Step) that have been recorded (but not necessarily exported yet). At this point, we can proceed with processing and output. - -Local-First Debugging: Console and Artifact Outputs - -One of TraceCraft’s primary differentiators is its local-first debugging experience. Immediately after instrumenting with tracecraft.init(), developers should get useful trace output without any further setup – essentially a “better print()”. 
Achieving this involves two things: (1) pretty console printing of the trace as it happens (or at least when the run completes), and (2) optional local trace artifact files (JSONL logs and an HTML report). - -Console Tree Output: We will implement a console reporter that prints the hierarchy of steps in real time. A likely strategy is to print incrementally as steps execute (especially if steps are long-running, streaming them could be helpful), but the simplest approach is to print once the run is complete. We can traverse the Step tree and print a nicely formatted tree. Using the rich library’s Tree object, for example, we can create a root node for the AgentRun and add each child step as a branch, nesting further for sub-steps. Each line in the tree can include the step type, name, and perhaps a short summary of result (e.g., for an LLM call step, we might show the first part of the model’s response or the token count). We’ll also include timing information – for instance, we could append the duration in milliseconds for each step, and mark steps that failed with a different color or an “[ERROR]” tag. The output might look something like: - -📦 AgentRun 1234 (User question: "How...") – 2.3s -├── 🔧 tool: SearchDocs – 1.5s -│ └── 📖 retrieval: VectorDB query – 1.4s (3 results) -├── 🤖 llm: OpenAI ChatCompletion – 0.7s (prompt tokens=50, completion tokens=150) -└── ✅ final_answer: "The answer to ... abridged" - -This is just an illustrative format – the idea is to use emojis or prefixes for different step types (agent, tool, llm, etc.), and to indent according to depth. OpenLLMetry itself doesn’t print to console by default (it focuses on exporting), but they do suggest that when running locally you can set disable_batch=True to see traces immediately  – presumably this uses the OpenTelemetry console exporter to dump spans to stdout immediately. However, raw span dumps are not very user-friendly. Our goal is a more human-readable tree, hence using a custom formatted print. Libraries like Rich make it straightforward to output colored text and structured layouts in the terminal. - -We will print the console trace by default, without requiring any flags or keys. This “instant value” is key to adoption. In effect, as soon as a developer wraps their code with TraceCraft, they should see the agent’s execution flow in their terminal (without having to send data to a cloud or open a UI). If they prefer not to see it (for example in production or automated runs), we can allow a config flag to suppress console output, but default is on. - -Local JSONL and HTML Reports: Besides the live console, TraceCraft will produce artifacts that persist the trace for later analysis. The JSONL file (JSON Lines) will contain either the entire run as one JSON object or each step/event as separate lines appended. A convenient approach is one JSON per AgentRun appended to a file, so you get a log of runs. Alternatively, we could log each Step as it completes as a JSON line. Both are viable; logging per run makes it easier to treat each as a unit. The JSON should include everything (except perhaps very large fields if we exclude them for safety by default). This provides a local log of traces which could be ingested by other tools if needed. - -The HTML report will likely be a single-file report capturing the same information in a readable format. We have a few ways to generate this: - • Using Rich’s HTML export: Rich can capture console output and render it as HTML. 
We could simply take the rich Tree we printed and call console.save_html("report.html"), which would produce an HTML file with styled text matching what was in the terminal. This is quick and ensures the nicely formatted tree is preserved. - • Custom HTML with interactivity: If we want something more advanced, we could generate an HTML that includes collapsible sections for steps, or allows viewing full prompt/response on click. This might involve embedding a small JS library or writing a template. However, given MVP scope, a static HTML (possibly with basic CSS styling of hierarchy) is sufficient. Even a plaintext tree saved as .html (in
a <pre> tag with some coloring) would do the job.
-
-Either way, the HTML is self-contained – no server needed. The user can open the file in a browser to view the trace after a run. This complements the console output (which might get truncated or hard to scroll if the trace is very large).
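-
-As an illustration, a minimal sketch of the Rich-based rendering and HTML export (the Step dataclass here is a stand-in for the real model; Console(record=True) and save_html are real Rich APIs):
-
-from dataclasses import dataclass, field
-from rich.console import Console
-from rich.tree import Tree
-
-@dataclass
-class Step:                                        # minimal stand-in for the real Step model
-    type: str
-    name: str
-    duration_ms: float
-    children: list = field(default_factory=list)
-
-def render(node: Tree, step: Step) -> None:
-    branch = node.add(f"{step.type}: {step.name} – {step.duration_ms:.0f} ms")
-    for child in step.children:
-        render(branch, child)
-
-run = Step("agent", "research_agent", 2300, [
-    Step("tool", "SearchDocs", 1500, [Step("retrieval", "VectorDB query", 1400)]),
-    Step("llm", "OpenAI ChatCompletion", 700),
-])
-
-console = Console(record=True)                     # record=True lets us export afterwards
-tree = Tree(f"AgentRun: {run.name} – {run.duration_ms:.0f} ms")
-for child in run.children:
-    render(tree, child)
-console.print(tree)                                # pretty tree in the terminal
-console.save_html("trace_report.html")             # same output saved as a standalone HTML file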
-
-Importantly, all this happens with zero external setup. No API keys or web services are required for these local outputs. This means TraceCraft can be advertised as a “superior alternative to print() for agents” – it’s as easy as adding tracecraft.init() and immediately getting structured insight into agent behavior.
-
-We should also consider real-time streaming vs end-of-run output. For long-running agents, it might be useful to stream the trace as it is built (so the user sees steps appear progressively). This could be done by printing each step when it finishes. But doing so in a nicely nested way (especially with Rich) is complex (since you’d have to dynamically update the tree). A simpler model is to collect everything and then print at the end. Given most agent runs are relatively short (seconds), end-of-run printing is fine for MVP. If needed, we could flush partial info (e.g. print each tool/LLM call as it completes) to give quicker feedback.
-
-Finally, these local outputs should be on by default but configurable. tracecraft.init() can take options like console_log=True/False, jsonl_path="...", html_path="...". If a path is not provided, we use a default (maybe creating a tracecraft_logs/ folder or similar). If the user only wants remote exporting, they can disable the local logs.
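-
-For example, the local-output configuration might look like the following (the keyword names are the ones proposed above, not a finalized API):
-
-import tracecraft
-
-tracecraft.init(
-    console_log=True,                   # print the step tree after each run (default)
-    jsonl_path="traces/runs.jsonl",     # append one JSON document per AgentRun
-    html_path="traces/last_run.html",   # self-contained HTML report, no server needed
-)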
-
-In summary, the local debugging experience for TraceCraft consists of automatic console trace visualization and automatically saved trace artifacts, all without requiring the developer to change their workflow or deploy extra infrastructure. This is a key competitive advantage over other telemetry tools that often require setting up a UI or backend first.
-
-In-Process Governance: Redaction and Tail Sampling
-
-A standout feature of TraceCraft TAL is the in-process processing pipeline for data governance before any telemetry leaves the process. This includes redaction of sensitive content and client-side tail sampling for cost control. These capabilities aim to make the telemetry production-ready (avoiding leaks of sensitive data and controlling volume) without requiring an external collector or complex YAML configs.
-
-Redaction Processor
-
-By default, TraceCraft will operate in a “safe mode” that avoids capturing extremely sensitive data by accident. For example, prompts and completions might be recorded in full in memory for debugging, but we might exclude them from being exported by default unless the user explicitly opts in. This stance is similar to OpenLLMetry’s handling of content: by default, OpenLLMetry logs prompts and outputs to span attributes for visibility, but it provides a flag to disable this for privacy . In fact, their instrumentation allows turning off content logging by setting TRACELOOP_TRACE_CONTENT=false . TraceCraft can implement something analogous – e.g., a configuration like redact_content=True as default, meaning we would either omit or scrub the actual text of user prompts and LLM responses in the data that goes to external backends (and maybe even in local logs, depending on user preference).
-
-Rule-based redaction: We will allow users to specify patterns or use built-in detectors to redact certain content within any captured text. For instance, we can provide a default set of regexes for things like API keys, email addresses, credit card numbers, etc. If any text in inputs/outputs matches these, we replace it with [REDACTED]. Python’s built-in re module covers the pattern matching, and there are libraries for PII detection (such as Microsoft’s Presidio, though that may be too heavy to bundle). A simple approach is fine for MVP: e.g., detect sequences of 16 digits, or allow the user to pass their own regex patterns via config.
-
-Path-based redaction: Because our telemetry is structured, users might want to redact entire fields. For example, an organization may decide that no prompt text should ever leave the process – so the “prompt” field in an LLM Step should be stripped. We can implement this by allowing config like redact_fields = ["prompt", "completion", "messages[].content"] to specify keys to remove. The redaction processor would traverse each Step’s inputs/outputs and null out or remove those fields. By default “safe mode”, we might have a rule to redact the actual content of prompts and completions. Perhaps we only keep metadata like token counts and not the full text. This ensures that if TraceCraft is inadvertently left enabled in a production system, it doesn’t accidentally send user conversation data to an observability backend (which could be a compliance risk). The user can override this: for a development environment, they might disable redaction to see full prompts remotely.
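-
-A rough sketch of such a redaction pass, operating on the in-memory Step tree before export (the Step shape and the field names here are placeholders for the real schema):
-
-import re
-
-DEFAULT_PATTERNS = {
-    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
-    "card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
-}
-REDACTED_FIELDS = {"prompt", "completion"}           # stripped entirely in safe mode
-
-def scrub_text(text: str) -> str:
-    for pattern in DEFAULT_PATTERNS.values():
-        text = pattern.sub("[REDACTED]", text)
-    return text
-
-def redact_step(step) -> None:
-    for bag in (step.inputs, step.outputs):          # treated as dicts in this sketch
-        for key, value in list(bag.items()):
-            if key in REDACTED_FIELDS:
-                bag[key] = "[REDACTED]"
-            elif isinstance(value, str):
-                bag[key] = scrub_text(value)
-    for child in step.children:
-        redact_step(child)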
-
-We’ll implement redaction as a pipeline stage that runs after a run completes (or on each span end). For efficiency, it might operate on the in-memory AgentRun object directly (e.g., go through all Steps and scrub sensitive fields). Alternatively, it could be implemented as an OpenTelemetry SpanProcessor that filters/edits span attributes before export. For instance, we could write a custom span processor that on on_end(span) checks for any disallowed attribute keys (like gen_ai.input.messages or gen_ai.prompt) and removes or sanitizes them. OpenLLMetry’s OpenAI instrumentation logs the actual prompt content in span attributes by default , so our span processor could detect those attributes (which correspond to OTel semantic convention fields like gen_ai.input.messages or an OpenAI-specific field) and redact them. This spans-level approach means the data never even goes out in OTLP. We should still consider the local artifacts though: by default, perhaps local console/JSON can show content (since it’s the developer’s machine), but maybe with a big warning. We could decide to apply the same redaction to local outputs too in safe mode, or at least partially (maybe mask emails, etc.). Given the focus on local debugging, likely we do show most content locally (it’s more useful), but when exporting remotely, we mask it by default. The README will make clear that by default prompts are not exported unless enabled – this transparency is important.
-
-Client-Side Tail Sampling
-
-Tail sampling is the idea of deciding whether to keep or drop a trace after observing it (as opposed to head sampling which decides upfront). In distributed tracing systems, tail sampling is usually done by a collector service because it needs to see the whole trace. However, in our scenario, an AgentRun is a self-contained trace within one process, making it feasible to implement tail sampling in the SDK itself.
-
-The motivation is to reduce the volume of telemetry sent to backends (and the associated costs) while always retaining the most important traces (errors, outliers, etc.). The likely MVP strategy is: keep all error traces, keep traces where the user gave negative feedback, and sample successful traces (e.g. keep 1 out of N). We can also consider sampling based on other criteria like latency or specific step content, but errors and feedback are the obvious triggers to always keep.
-
-A concrete design: TraceCraft will buffer the entire trace in memory (we have it as the AgentRun object and spans). It will not immediately export spans as they occur; instead, it can hold them until the AgentRun is complete. Then, a sampling decision is made. If any Step has an error (or the run took an exceptionally long time, or some other “interesting” property), we mark this trace for export. If it was a normal successful run and we are above our sampling rate, we might drop it (i.e., do not export to remote). This way, we dramatically cut down on noisy, repetitive traces in high-volume situations. The user can configure the sampling rate (for example, “sample 20% of successful runs”). We might implement a simple random sampling for successes (e.g., use random.random() < 0.2 threshold). We’ll also include logic to ensure certain categories always pass through – essentially policy-based sampling. This aligns with standard practices in tail sampling: e.g., “always sample traces that contain an error”  is a known strategy , and applying different rates or conditions is common in tail-based sampling .
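-
-The keep/drop decision itself can be a small policy function (the AgentRun fields used here, error, feedback and all_steps(), are illustrative):
-
-import random
-
-def should_export(run, keep_ratio: float = 0.2) -> bool:
-    # Always keep failures and runs the user flagged as bad.
-    if any(step.error is not None for step in run.all_steps()):
-        return True
-    if run.feedback == "negative":
-        return True
-    # Otherwise keep a random sample of successful runs.
-    return random.random() < keep_ratio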
-
-Implementing this might involve two levels of the OpenTelemetry pipeline:
-
- 1. We could set the OpenTelemetry SDK to record all spans (using the AlwaysOn sampler, so nothing is dropped at creation). Then attach a custom SpanProcessor (or use the built-in BatchSpanProcessor but in manual flush mode). We can subclass the SpanProcessor to override on_end(span) – by default, BatchSpanProcessor will batch and send spans out periodically or at end. Instead, we can store spans in a list (grouped by trace/AgentRun). The tricky part is knowing when a trace is complete; since we know our structure, we can detect when the AgentRun root span ends. We might designate the agent’s root span as the signal – when it’s closed, the trace is done. At that moment, our processor can evaluate the trace (it has all spans now) and decide to export or drop (see the sketch after this list).
 - 2. Alternatively, we don’t rely on OTel’s automatic exporting for remote at all: at end-of-run, if the decision is to keep the trace, we manually invoke an OTLP exporter with the spans (reusing OTel’s exporter is easier than writing our own OTLP client). OTel doesn’t natively support tail sampling in the SDK, but we can simulate it. For MVP, the simpler approach is to not attach the OTLP exporter up front; when the run ends and the decision is “keep”, we programmatically call an OTLP exporter to send the spans. The OpenTelemetry Python exporter can be invoked via an export([finished_spans]) call once we have the finished span objects, which we can obtain from the in-memory AgentRun (since we recorded them) or from the OTel SDK if we held them in a processor.
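-
-A sketch of the first option: a custom span processor that buffers spans by trace ID and makes the export decision when the root span ends (error check and sampling ratio shown inline; not thread-safe as written):
-
-import random
-from collections import defaultdict
-
-from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor
-from opentelemetry.sdk.trace.export import SpanExporter
-from opentelemetry.trace import StatusCode
-
-class TailSamplingProcessor(SpanProcessor):
-    def __init__(self, exporter: SpanExporter, keep_ratio: float = 0.2):
-        self._exporter = exporter
-        self._keep_ratio = keep_ratio
-        self._buffers = defaultdict(list)
-
-    def on_end(self, span: ReadableSpan) -> None:
-        trace_id = span.context.trace_id
-        self._buffers[trace_id].append(span)
-        if span.parent is None:                      # root span ended, so the trace is complete
-            spans = self._buffers.pop(trace_id)
-            has_error = any(s.status.status_code == StatusCode.ERROR for s in spans)
-            if has_error or random.random() < self._keep_ratio:
-                self._exporter.export(spans)         # tail decision made in-process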
-
-Given the complexity, a simpler partial route would be to use OTel’s head sampling for high-level control, but our use case really needs tail logic (we specifically want to catch errors, which head sampling can miss when the error rate is low). So implementing our own tail sampling in-process is a key feature.
-
-We will document clearly that if using an external collector with tail sampling, they might disable this internal one to avoid double decisions. But for most users who don’t run a collector, this feature “just works” to reduce data volume.
-
-In effect, TraceCraft’s tail sampling gives you the benefits of a sophisticated sampling setup without deploying an OTel Collector. For example, suppose you only care about 10% of successful queries for cost reasons, but 100% of failures. In classic OTel, you’d configure a tail-sampler in the collector with rules (as per OTel docs, tail sampling can sample on error presence or attributes ). Here, you’ll simply set those preferences in TraceCraft config, and the SDK will enforce them.
-
-One additional piece is buffering: tail sampling requires holding spans until decision. For an agent run, the number of spans is not huge (maybe tens or a couple hundred in very complex chains), so memory overhead is fine. We just need to accumulate them either in our Step objects or via OTel. We already have Step objects capturing the info, which is enough to reconstruct spans if needed.
-
-Token and Cost Enrichment
-
-TraceCraft will include a processor to enrich traces with token counts and cost estimates. Many LLM APIs (OpenAI, Anthropic, etc.) return token usage in their response payload. For instance, OpenAI’s completion API provides usage: {prompt_tokens: X, completion_tokens: Y, total_tokens: Z}. If our instrumentation captures those, we will map them into our Step’s data. The OpenTelemetry semantic conventions define standard attributes for token usage: gen_ai.usage.input_tokens and gen_ai.usage.output_tokens . OpenLLMetry indeed populates token counts (and even logs prompts) for OpenAI calls . We will ensure that for any LLM call step, we fill in the token counts either from API data or by estimation.
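-
-For example, when the provider response carries usage data, the enrichment is a straightforward mapping (the step object is our own model; the gen_ai.* attribute names follow the OTel GenAI conventions and should be double-checked against the current spec):
-
-def enrich_from_usage(step, span, response) -> None:
-    usage = getattr(response, "usage", None)
-    if usage is None:
-        return                                       # fall back to tokenizer-based estimation
-    step.token_count_input = usage.prompt_tokens
-    step.token_count_output = usage.completion_tokens
-    span.set_attribute("gen_ai.usage.input_tokens", usage.prompt_tokens)
-    span.set_attribute("gen_ai.usage.output_tokens", usage.completion_tokens)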
-
-For providers that don’t return token info (or for arbitrary models), we can estimate using known tokenization libraries. For example, if the user supplies the prompt text and model name, we could use OpenAI’s tiktoken library (if available) or Hugging Face tokenizers to count tokens for that model’s vocabulary. This might require the user to give a hint (e.g., which tokenizer to use if not obvious). But as a baseline, we can implement a simple GPT-3/4 token counter using tiktoken for OpenAI models, since those are common. Similarly, for other models, if they are GPT-2 based, we might approximate with byte-pair encoding from HuggingFace.
-
-Cost computation can be done if the user provides pricing information. We could allow a config like pricing={'gpt-4': {'prompt': 0.03, 'completion': 0.06}, 'gpt-3.5-turbo': {...}, ...} in dollars per 1K tokens. Then the enrichment processor will, after getting token counts, calculate cost = (prompt_tokens / 1000 * prompt_rate) + (completion_tokens / 1000 * completion_rate) and attach that as a field (maybe step.cost in dollars). This is very useful for monitoring spend per call. If no pricing is given, we might omit cost or use some defaults for known models (though pricing can change, safer to require user input for this). The Dynatrace article on OpenLLMetry notes that token usage is basically a proxy for cost tracking – by tracking tokens per request, you can easily compute costs offline or in dashboards. TraceCraft can do that computation on the fly if configured.
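-
-A small sketch of that enrichment step, assuming a user-supplied per-1K-token pricing table and tiktoken as the fallback counter when the API returned no usage:
-
-import tiktoken
-
-PRICING = {  # USD per 1K tokens, supplied by the user in config
-    "gpt-4": {"prompt": 0.03, "completion": 0.06},
-}
-
-def count_tokens(text: str, model: str) -> int:
-    encoding = tiktoken.encoding_for_model(model)
-    return len(encoding.encode(text))
-
-def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int):
-    rates = PRICING.get(model)
-    if rates is None:
-        return None                                  # unknown model: omit cost rather than guess
-    return (prompt_tokens / 1000) * rates["prompt"] + (completion_tokens / 1000) * rates["completion"]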
-
-Additionally, if a provider doesn’t return token usage and we can’t easily count, another strategy is to estimate based on input/output text lengths (but that’s crude given tokens ≠ characters). Ideally, we integrate with a proper tokenizer for accuracy.
-
-These enrichments (token count and cost) will be added before exporting. That means in the processing pipeline, once a step finishes (or at run end), we fill in any missing token info. For example, if using an open-source model locally, we might not have a usage count – we can run the prompt through a tokenizer to count tokens. Then we attach step.token_count_input and step.token_count_output. If the user configured pricing for that model, we compute step.cost.
-
-This enrichment ensures that whether the telemetry is viewed locally or sent to a backend, it contains valuable info about usage. Observability platforms can then aggregate this to monitor costs. Or the developer can see in the console that “this query used 100 tokens prompt, 180 tokens output (cost ~$0.002)”.
-
-Routing Telemetry to Any Backend (Export Layer)
-
-TraceCraft’s export layer is designed to be pluggable and flexible so that the same code can send traces to various backends purely via configuration. The primary mechanism is OTLP (OpenTelemetry Protocol) export, since that is vendor-neutral and widely supported. In the MVP, we will implement:
- • OTLP Exporter: Using OpenTelemetry’s OTLP HTTP/gRPC exporter, TraceCraft can send spans to any endpoint that speaks OTLP. This covers a huge range of backends: open source ones like Jaeger (via an OTLP collector), Tempo, SigNoz, or vendor SaaS like Datadog, New Relic, Honeycomb, etc., which accept OTLP. Even LangSmith has recently added support for ingesting OpenTelemetry traces , and Langfuse provides an OTLP ingestion endpoint  . For example, Langfuse’s guide shows setting OTEL_EXPORTER_OTLP_ENDPOINT to their API and providing auth, then OpenLLMetry can send data there  . TraceCraft will allow similar configuration. Likely via environment variables (honoring standard OTel env like OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_HEADERS) or a config passed to tracecraft.init(). If those are set, we instantiate the OTLP exporter with those values.
-
-Using OTLP means we automatically align with the OpenTelemetry Generative AI semantic conventions. TraceCraft will map its canonical schema to OTel span attributes following that standard. For instance, when we export an LLM call step, we ensure that attributes like gen_ai.model (model name), gen_ai.request.prompt (if not redacted), gen_ai.usage.input_tokens, etc., are set on the span. The result: if sending to an observability backend, these traces will be labeled in a consistent way. OpenLLMetry’s recent contribution to OTel means these semantic field names are recognized and standard  . We want any backend-specific dashboards (like Dynatrace’s or Datadog’s) to automatically pick up things like token counts or model names.
 - • Local file exporter (JSONL): In addition to the JSONL log we already write ourselves for debugging, we can also implement a simple file exporter that writes each span or trace to a file. However, since we are already handling JSONL via our own logic, we might not need to treat it as an “exporter” in the OTel sense. It can be done outside the OTel pipeline (simply writing the AgentRun JSON). Still, conceptually it is an exporter – it exports to a file. This will be the default if no remote is configured (so that data isn’t lost if the console scrolls by).
- • Console exporter: We have our custom console printing. We could also use the OTel ConsoleSpanExporter for debugging, but that outputs each span as a line of JSON or text – our rich output is separate. So we likely won’t use the built-in console exporter except perhaps in verbose debug modes.
-
-TraceCraft’s architecture should allow multiple exporters at once. For instance, a user might want to simultaneously send traces to a local JSON file and to a cloud backend. Or during a migration, send to two backends (dual writing). OpenTelemetry allows adding multiple SpanProcessors (each with its own exporter). We can leverage that by, say, always adding our “LocalFileProcessor” and also adding an OTLP processor if configured. This way, when a span ends, it goes to both. We do have to be careful with our tail-sampling logic: if a trace is dropped for remote, we would still want it in local file possibly. That means the sampling decision might only apply to the OTLP exporter, not the file. Implementation-wise, it could be simplest to handle like: always export locally (no sampling applied to local), but wrap the OTLP exporter behind a sampling check.
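-
-Wiring-wise, this is simply multiple span processors on one TracerProvider; a minimal sketch (the JsonlSpanExporter is a stand-in for our own local writer, and the endpoint is a placeholder):
-
-from opentelemetry import trace
-from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
-from opentelemetry.sdk.trace import TracerProvider
-from opentelemetry.sdk.trace.export import BatchSpanProcessor, SimpleSpanProcessor, SpanExporter, SpanExportResult
-
-class JsonlSpanExporter(SpanExporter):
-    def __init__(self, path: str):
-        self._path = path
-
-    def export(self, spans) -> SpanExportResult:
-        with open(self._path, "a", encoding="utf-8") as fh:
-            for span in spans:
-                fh.write(span.to_json(indent=None) + "\n")   # one span per line
-        return SpanExportResult.SUCCESS
-
-provider = TracerProvider()
-provider.add_span_processor(SimpleSpanProcessor(JsonlSpanExporter("traces.jsonl")))   # local, never sampled
-provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="https://collector.example.com/v1/traces")))
-trace.set_tracer_provider(provider)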
-
-In the future, we plan optional direct integrations:
- • LangSmith: They have an API for logging runs. We could implement an exporter that directly calls LangSmith’s REST API with our run data (mapping AgentRun to a LangSmith run object). This would avoid needing OTLP and could use their native schema (which likely is similar to ours since we’re both inspired by LangChain’s concepts). Not in MVP, but design will keep it open (perhaps via a plugin interface where new exporters can be registered).
- • Langfuse: Similarly, Langfuse has its own SDK and API. However, since Langfuse also supports OTLP ingestion , users can already integrate via the OTLP path. A future specialized exporter might call Langfuse’s REST if there are features not covered by OTLP (like artifact uploading or evaluations).
- • Other backends (Phoenix, Weights & Biases Weave, etc.): These likely require data transformation. For Phoenix (Arize’s OpenInference format), we might output traces in a format their tool can read. This is beyond MVP, but the modular design (with an exporter plugin system) ensures we can add these without core changes. For now, focusing on OTLP covers the most ground.
-
-Configuration: Users will configure export targets through environment variables (to piggyback on existing OTel env conventions) or through a Python config object. For example:
-
-tracecraft.init(
-    exporters=["otlp", "jsonl"],
-    otlp_endpoint="",
-    otlp_headers={"Authorization": "Bearer abc123"},
-    jsonl_path="./traces.jsonl"
-)
-
-If no exporters are specified, we default to local output (console + JSONL). If the OTEL_EXPORTER_OTLP_ENDPOINT env var is set, we assume the user wants OTLP and enable it (with console output still on unless turned off).
-
-We will document how to route to common platforms. Thanks to using OTel, it should be as simple as setting the endpoint and credentials. OpenLLMetry’s README highlights many supported destinations achieved via OTel compatibility (Datadog, Azure, New Relic, etc.)  . TraceCraft will similarly be destination-agnostic. We can say “just set your OTLP endpoint to Datadog and you’ll see traces there,” for instance.
-
-One more aspect: dual-writing. If a user wants to send to two OTLP endpoints (say, to two different vendors), we can potentially instantiate two OTLP exporters. The OTel Python API doesn’t directly support multiple endpoints in one exporter, but we can attach two separate SpanProcessor instances each with its own OTLP exporter (pointing to each endpoint). That should effectively duplicate the data to both. We’d provide a config for that scenario (not a primary use-case, but we note it’s possible for migration scenarios).
-
-Framework Adapters and Integration Points
-
-TraceCraft is meant to be usable with no framework at all (just wrap your code), but its success will grow with tight integrations into popular agentic frameworks. We plan to deliver adapters for at least a couple of key frameworks in v0.1:
- • LangChain: LangChain has a rich callback system that powers its tracing (LangSmith) and logging. We can integrate at this level by implementing a CallbackHandler that maps LangChain events to TraceCraft events. For instance, on llm_start (when an LLM call is invoked by LangChain), we would create a new llm Step (or tell the OpenAI instrumentor to start a span) and on llm_end, finalize that Step with the result. On chain_start, we create a workflow or agent Step representing that chain (LangChain calls everything a “chain” even if it’s an agent). Tools invoked by LangChain (via Tool callbacks) would correspond to our tool steps. LangChain’s docs indicate that LangSmith itself just uses these callbacks under the hood , so it’s feasible for us to do similarly without modifying LangChain itself. The end result is that a developer can use LangChain as usual, and just add our callback to get traces, without needing LangSmith. This is attractive for those who want to avoid vendor lock or run things locally.
- • LlamaIndex (GPT Index): LlamaIndex also has callback managers (CallbackManager in their API) to track events like query start, node retrieval, LLM calls, etc. We can make a similar adapter. Possibly, if LlamaIndex doesn’t have a stable callback interface, we might instrument it in other ways (or rely on OpenLLMetry’s LlamaIndex integration if it auto-detects it – OpenLLMetry claims to support it , likely via patching or context managers that LlamaIndex provides).
- • OpenAI/Anthropic raw clients: For cases where the user is directly calling the OpenAI API (not through a framework), our instrumentation (OpenAIInstrumentor) will already cover those calls. But we might also provide a lightweight wrapper client as a convenience. For example, an tracecraft.openai.ChatCompletion class that wraps openai.ChatCompletion but ensures the calls are traced. This might be unnecessary if the monkey-patch via OTel does the job. Alternatively, if users don’t want monkey patching, they could explicitly call through a traced client.
- • Pydantic AI / Instructor: These emerging frameworks (like Pydantic’s experimental AI functions) will be important soon. They likely don’t have built-in tracing yet. We should investigate their APIs and either instrument them or collaborate on callback hooks. Possibly not in v0.1, but on the radar for v0.2.
-
-One guiding principle: zero or minimal code changes for users. The adapters should make integration as seamless as possible. In ideal cases, just calling tracecraft.init() will detect and instrument frameworks automatically. OpenLLMetry attempts this (for example, if it sees LangChain imported, it might wrap the Chain class to auto-add spans). We could attempt something similar: for instance, monkey-patch LangChain’s AgentExecutor to automatically use our callback. But monkey-patching a large library could be fragile. A safer route is documentation: e.g., “if you are using LangChain, add agent = initialize_agent(..., callbacks=[TraceCraftCallback()])” or use the provided integration function.
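-
-As an illustration of the LangChain adapter shape (the callback method names are LangChain’s; the tracecraft.begin_step / end_step calls are hypothetical placeholders for our internal API):
-
-from langchain_core.callbacks import BaseCallbackHandler   # older releases: langchain.callbacks.base
-
-import tracecraft
-
-class TraceCraftCallback(BaseCallbackHandler):
-    def on_chain_start(self, serialized, inputs, **kwargs):
-        tracecraft.begin_step("workflow", name=(serialized or {}).get("name", "chain"), inputs=inputs)
-
-    def on_chain_end(self, outputs, **kwargs):
-        tracecraft.end_step(outputs=outputs)
-
-    def on_llm_start(self, serialized, prompts, **kwargs):
-        tracecraft.begin_step("llm", name=(serialized or {}).get("name", "llm"), inputs={"prompts": prompts})
-
-    def on_llm_end(self, response, **kwargs):
-        tracecraft.end_step(outputs=response)
-
-    def on_tool_start(self, serialized, input_str, **kwargs):
-        tracecraft.begin_step("tool", name=(serialized or {}).get("name", "tool"), inputs={"input": input_str})
-
-    def on_tool_end(self, output, **kwargs):
-        tracecraft.end_step(outputs=output)
-
-# Usage: chain.invoke(inputs, config={"callbacks": [TraceCraftCallback()]})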
-
-For toolkits that have their own tracing (like LangChain’s LangSmith or LlamaIndex’s built-in logging), we present TraceCraft as an alternative. Because TraceCraft can co-exist with those (since we don’t require their endpoints), some advanced users might even dual-send (though that’s uncommon).
-
-OpenLLMetry’s documentation suggests that for frameworks like LangChain or Haystack, you actually don’t need to add anything – it auto-annotates traces by detecting their usage . We might explore how that detection works (possibly it checks class names or inherits from certain base classes to name spans appropriately). For MVP, we can achieve similar outcome with an explicit adapter, which is simpler to implement reliably.
-
-Testing these adapters will involve running sample chains and verifying that the trace structure matches expectations (e.g., a LangChain agent with a tool yields an AgentRun with an agent step and inside it a tool step and an llm step, etc., corresponding correctly).
-
-Testing Strategy for Implementation
-
-Given the complexity of this telemetry SDK, we need a thorough testing approach:
- • Unit Tests for Core Functionality: We will write unit tests for the Step/AgentRun model (ensuring that nesting works, JSON serialization works, etc.). Tests for the decorators and context managers will validate that they create the correct Steps and push/pop context properly (including in async functions – e.g., decorate an async def and ensure the trace context flows through await). We’ll simulate simple functions and verify the in-memory trace.
 - • OpenTelemetry Integration Tests: Using the OpenTelemetry SDK in testing, we can verify that spans are indeed created and have the expected parent-child relationships and attributes. For example, we can configure an in-memory span exporter (OpenTelemetry ships a test exporter, or we can write a dummy one) to collect spans, run sample agent code through TraceCraft, and then inspect the collected spans (see the sketch after this list). We expect the span hierarchy to mirror the Step hierarchy. We also check that crucial attributes – model name, token counts, etc. – are present on spans where applicable. This ensures our mapping to OTel semantic conventions is correct.
- • Redaction Tests: We will create scenarios with known sensitive data in prompts (say a fake credit card number in the prompt) and configure the redactor with a regex to remove it. Then run the trace and check that in the exported span data or output file, the number is replaced with [REDACTED]. Also test that when the flag to keep content is on, the content remains. If we have default redaction of all prompt text for export, verify that indeed no prompt text leaves (maybe by asserting certain attributes are absent or blank in the exported span).
- • Sampling Tests: Simulate multiple runs and configure sampling rate to, say, 50%. We can use a deterministic approach (like seeding the random generator to make results predictable) and ensure roughly half the traces get marked to export. Specifically test that any trace containing an error is never dropped. For example, cause an exception in an agent step and confirm that trace was exported even if random sampling might have dropped successes. This might involve hooking our internal logic to see what decision was made or checking side effects (like exported spans vs not exported). We can also test that the console and local logs still record all runs regardless of sampling (if that’s our intended behavior).
- • Token Count & Cost Tests: For providers that return usage (we can stub an OpenAI API response), test that the token counts from the response end up in the Step and span attributes. For a provider without usage info, test that our tokenizer function is called and produces a reasonable count. If pricing info is provided, test that cost is computed accurately. For example, configure $0.01 per token and see that 100 tokens yields $1.00 in the output cost field.
- • Adapter Tests: Using a dummy LangChain (or the actual LangChain in a test environment) with a simple chain (e.g., an LLMChain that calls a fake LLM), ensure our callback captures the events. We might monkey-patch an LLM to avoid actual API calls (return a fixed output), but still trigger the callbacks. Verify the TraceCraft Steps created correspond to chain and LLM invocation.
- • Performance and Thread Safety: We should test that using TraceCraft in multi-threaded or multi-async-task scenarios causes no interference. Python’s contextvars give each async task its own context, but we should still verify that two concurrent agent runs (a plausible usage pattern) do not interfere with each other (each should get its own trace). Similarly, we should confirm that the overhead is minimal; the performance tests can be simple – e.g., create 1000 very short spans and check that they complete in a reasonable time – since OpenTelemetry is built to handle high throughput in general.
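-
-As a concrete illustration of the OpenTelemetry integration test, here is a minimal pytest-style sketch. It assumes that init() accepts a span_exporter override and that span names follow the decorated function names – both assumptions about the final API, not a specification of it – while InMemorySpanExporter is the real test exporter that ships with the OpenTelemetry Python SDK:
-
-```python
-# Hedged sketch: the span_exporter keyword, the import path for the decorators,
-# and the span naming are assumptions about the final API, not its definition.
-from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
-
-import tracecraft
-from tracecraft import trace_agent, trace_tool  # assumed top-level exports
-
-
-def test_agent_and_tool_spans_nest_correctly():
-    exporter = InMemorySpanExporter()
-    tracecraft.init(span_exporter=exporter)  # assumed hook for injecting a test exporter
-
-    @trace_tool
-    def lookup(query: str) -> str:
-        return f"result for {query}"
-
-    @trace_agent
-    def agent(task: str) -> str:
-        return lookup(task)
-
-    agent("weather in Berlin")
-
-    spans = {s.name: s for s in exporter.get_finished_spans()}
-    agent_span = spans["agent"]   # assumes spans are named after the functions
-    tool_span = spans["lookup"]
-
-    # The tool span must be a child of the agent span.
-    assert tool_span.parent is not None
-    assert tool_span.parent.span_id == agent_span.context.span_id
-
-    # For LLM steps we would additionally assert that GenAI semantic-convention
-    # attributes (model name, token counts, ...) are present on the span.
-```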
-
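-And a sketch of the sampling check, with should_export() standing in for wherever the real sampling decision ends up living; the seeded RNG and the error-retention invariant are the point here:
-
-```python
-# Hedged sketch: should_export() is a stand-in for the SDK's actual sampling decision point.
-import random
-from dataclasses import dataclass
-
-
-@dataclass
-class FakeRun:
-    has_error: bool = False
-
-
-def should_export(run: FakeRun, rate: float, rng: random.Random) -> bool:
-    # Error traces are always kept; successful runs are sampled at `rate`.
-    return run.has_error or rng.random() < rate
-
-
-def test_sampling_is_deterministic_and_never_drops_errors():
-    rng = random.Random(42)  # seeded so the outcome is reproducible
-    runs = [FakeRun(has_error=(i % 10 == 0)) for i in range(1000)]
-
-    exported = [r for r in runs if should_export(r, rate=0.5, rng=rng)]
-
-    # Every error run must survive sampling.
-    assert sum(r.has_error for r in exported) == sum(r.has_error for r in runs)
-
-    # Successful runs are kept at roughly the configured rate (loose tolerance).
-    kept = sum(not r.has_error for r in exported)
-    total = sum(not r.has_error for r in runs)
-    assert abs(kept / total - 0.5) < 0.1
-```
-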
-Given that OpenTelemetry and OpenLLMetry are active projects, we should also test against their latest versions to ensure compatibility. The OpenLLMetry dependency we rely on (likely traceloop-sdk>=0.50, their latest release as of December 2025) should be constrained to a known-good version range to avoid breaking changes.
-
-In addition to automated tests, we will do some manual end-to-end testing: run a simple agent script with TraceCraft and verify that:
- • The console output renders as expected (visual inspection).
- • The JSONL file is written and contains expected JSON.
- • The HTML report opens and looks correct.
- • If we configure an OTLP endpoint (a local OpenTelemetry Collector or a tool like Jaeger), verify that the trace appears there with the correct structure and data; for a fuller integration test, we could run a local Collector container with its debug exporter enabled to capture the data. A configuration snippet follows this list.
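-
-For the OTLP check specifically, something like the following points the exporter at a local Jaeger or Collector instance. OTEL_EXPORTER_OTLP_ENDPOINT is standard OpenTelemetry configuration; that init() honours it is exactly the assumption this manual test verifies:
-
-```python
-# Manual end-to-end sketch: route traces to a local OTLP backend before init().
-# Assumes tracecraft.init() honours the standard OTLP environment variables.
-import os
-
-os.environ.setdefault("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")
-
-import tracecraft
-
-tracecraft.init()
-# ...run a small agent script here, then check the Jaeger UI (http://localhost:16686)
-# or the Collector's debug exporter output for the expected span tree.
-```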
-
-All these tests will ensure that the package works as intended and that the key promises (local-first ease, correct trace semantics, safe governance, routing flexibility) are actually delivered.
-
-Summary of Differentiators and Best Practices
-
-By researching existing solutions (OpenLLMetry, LangSmith, Langfuse, etc.), we’ve distilled the implementation strategies that allow TraceCraft TAL to stand out:
- • Local-First UX: Unlike OpenLLMetry, which is export-first (traces are typically viewed in a dashboard), TraceCraft prioritizes local debugging. We’ll implement rich console output and offline reports as first-class features, effectively turning the developer’s local environment into a tracing UI. This addresses the immediate need developers have (“I replaced print() with TraceCraft today and it just worked”), which existing tools do not serve by default.
- • Build on OpenTelemetry (don’t reinvent): We are not replacing OpenTelemetry or OpenLLMetry – we’re building on them. This grants us compatibility and avoids reinventing the wheel for instrumentation. The research confirms OpenLLMetry’s approach of extending OTel with GenAI conventions is the right path. We’ll leverage those conventions (model, prompt, token attributes) and ensure our spans conform, so any OTel-compatible backend can understand our traces. In other words, OpenTelemetry is an implementation detail – users don’t need to know it’s there, but they benefit from its power.
- • Agent-Native Schema: Our internal representation is in terms of AgentRuns and Steps (agents, tools, LLM calls, etc.), not low-level spans. This aligns with how developers reason about their code. Existing SDKs like LangSmith also treat runs hierarchically. We enhance this by providing convenient Python APIs (decorators and context managers) to mark up agent code in those terms, much like OpenLLMetry’s @workflow and @agent decorators; a short markup example follows this list. This means developers can instrument logic that isn’t automatically captured, in a semantic way. Each step can carry rich structured data (not just a text message), which is crucial for future analysis and governance.
- • Governance Built-In: Data privacy and volume control are not afterthoughts. We design redaction rules into the pipeline – e.g., not exporting prompt content by default (OpenLLMetry allows disabling content capture for privacy; we may take that further and disable it by default). We also integrate sampling to handle high-volume use cases without forcing the user to deploy extra infrastructure. Tail sampling is typically relegated to backend components, but we implement a basic form in the SDK itself, which is novel and practical for many users. This means that out of the box, TraceCraft is production-conscious: you can leave it on in production for critical visibility, confident it won’t bankrupt you or leak secrets.
- • Routing Flexibility: Switching or adding backends is a config change. Want to try out Langfuse cloud? Just point the OTLP endpoint there (Langfuse’s guide shows how they accept OTLP). Want to send to your OpenTelemetry Collector and then to Jaeger? Set the endpoint to the collector. Because we stick to standard OTLP and semantic conventions, integration is smooth. And if the user later decides to use a specialized vendor SDK (like LangSmith’s), they haven’t locked their instrumentation to a proprietary API – TraceCraft can dual-publish to both during the transition. In essence, we decouple instrumentation from backend – a core principle of OpenTelemetry, carried through in an agent-specific context.
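-
-To make that agent-native markup concrete, here is a short illustrative sketch using the SDK’s decorator names; the name= keyword and the exact signatures shown are assumptions, not the finalized API:
-
-```python
-# Illustrative only: decorator names come from the SDK design; keyword arguments are assumptions.
-from tracecraft import trace_agent, trace_retrieval, trace_tool
-
-
-@trace_retrieval
-def fetch_context(query: str) -> list[str]:
-    return ["doc snippet 1", "doc snippet 2"]
-
-
-@trace_tool
-def word_count(text: str) -> int:
-    return len(text.split())
-
-
-@trace_agent(name="support_agent")  # name= is an assumed keyword argument
-def handle_ticket(ticket: str) -> str:
-    docs = fetch_context(ticket)
-    length = word_count(ticket)
-    return f"answer based on {len(docs)} docs ({length}-word ticket)"
-```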
-
-In implementing TraceCraft TAL, we will take care to follow best practices from the libraries researched (OpenLLMetry’s efficient use of context and instrumentation, LangChain’s callback patterns, OpenTelemetry’s spec for GenAI telemetry, etc.). The outcome will be a small, focused Python SDK that developers can add to any LLM app or agent to get immediate tracing and debugging superpowers, while also laying the groundwork for observability at scale by routing to the tools of their choice.
-
-Sources:
- • Traceloop OpenLLMetry (OpenTelemetry-based LLM observability) – GitHub README
- • OpenLLMetry Documentation – Workflow and Agent Annotations
- • OpenTelemetry Semantic Conventions for GenAI – Agent and LLM span attributes
- • OpenLLMetry OpenAI Instrumentation – Prompt logging and privacy toggle
- • Dynatrace article on OpenLLMetry – Capturing model parameters, token usage, and cost tracking
- • OpenTelemetry Sampling documentation – Tail sampling use cases (error traces, etc.)
- • Langfuse OTLP integration guide – Example of configuring OpenLLMetry to send to Langfuse via OTLP