Merged

28 commits
d5b67cd
feat: OTEL tracing across agent runners and LiteLLM proxy
ShuxinLin Apr 23, 2026
e9b3840
refactor: unify trajectory models and LiteLLM helpers across runners
ShuxinLin Apr 23, 2026
133cdf5
refactor: share system prompt and CLI boilerplate across SDK runners
ShuxinLin Apr 23, 2026
298344c
Merge pull request #274 from IBM/refactor/share-prompt-and-cli
ShuxinLin Apr 23, 2026
0012a61
refactor: collapse SDK CLI main() and cache deep-agent chat model
ShuxinLin Apr 23, 2026
aa81db6
Merge pull request #273 from IBM/refactor/unify-agent-models
ShuxinLin Apr 23, 2026
5a9770e
refactor: hoist server_paths to base + use AsyncExitStack
ShuxinLin Apr 24, 2026
4310e6d
feat: reposition OTEL as evaluation record persistence
ShuxinLin Apr 24, 2026
30ff76d
docs: observability.md + otel-collector.yaml
ShuxinLin Apr 24, 2026
abbd2f1
refactor: drop unused observability scaffolding; fix span loss on exit
ShuxinLin Apr 24, 2026
8951960
feat: in-process OTLP-JSON file exporter; drop Docker requirement
ShuxinLin Apr 24, 2026
bf873f8
feat: separate span metadata from trajectory content
ShuxinLin Apr 24, 2026
e216048
feat: restore GenAI usage + turn/tool_call aggregates on root span
ShuxinLin Apr 24, 2026
473183f
feat: timing metrics on spans and trajectories
ShuxinLin Apr 24, 2026
3e59716
fix(claude-agent): drop PreToolUse hook to restore compatibility
ShuxinLin Apr 24, 2026
5a1819d
docs(observability): correct runner coverage for span attrs and traje…
ShuxinLin Apr 24, 2026
7677828
Update observability.md
ShuxinLin Apr 24, 2026
43694d3
docs(observability): clarify plan-execute has no token tracking
ShuxinLin Apr 24, 2026
3937e2f
feat(plan-execute): report token usage on root span
ShuxinLin Apr 24, 2026
0a2c551
docs(observability): reflect token tracking now covers all runners
ShuxinLin Apr 24, 2026
fdcf72b
docs(instructions): add Observability section linking to docs/observa…
ShuxinLin Apr 27, 2026
ef0169b
docs: split MCP server reference into docs/mcp-servers.md
ShuxinLin Apr 27, 2026
fcb40aa
docs(instructions): trim per-agent boxes from Architecture diagram
ShuxinLin Apr 27, 2026
d88e597
docs(instructions): aggregate four agent sections into one
ShuxinLin Apr 27, 2026
4ddea41
refactor(plan-execute): drop --server NAME=PATH CLI flag
ShuxinLin Apr 27, 2026
b922432
docs(instructions): drop Plan-Execute loop diagram
ShuxinLin Apr 27, 2026
6e206a1
docs(instructions): drop Model selection subsection
ShuxinLin Apr 27, 2026
7dbe3d6
docs(instructions): tighten Running Tests section
ShuxinLin Apr 27, 2026
3 changes: 3 additions & 0 deletions .gitignore
@@ -201,3 +201,6 @@ mcp/couchdb/sample_data/bulk_docs.json
.env
mcp/servers/tsfm/artifacts/tsfm_models/
src/tmp/

# Observability artifacts (OTLP-JSON traces + per-run trajectory JSON).
traces/
455 changes: 82 additions & 373 deletions INSTRUCTIONS.md

Large diffs are not rendered by default.

110 changes: 110 additions & 0 deletions docs/mcp-servers.md
@@ -0,0 +1,110 @@
# MCP Servers

Six FastMCP servers expose the AssetOpsBench domain logic. Each is a standalone stdio process spawned on demand by clients (`plan-execute`, `claude-agent`, `openai-agent`, `deep-agent`, Claude Desktop). Backing services and credentials are listed per server below.

## Contents

- [iot — IoT Sensor Data](#iot--iot-sensor-data)
- [utilities — Utilities](#utilities--utilities)
- [fmsr — Failure Mode and Sensor Relations](#fmsr--failure-mode-and-sensor-relations)
- [wo — Work Order](#wo--work-order)
- [tsfm — Time Series Foundation Model](#tsfm--time-series-foundation-model)
- [vibration — Vibration Diagnostics](#vibration--vibration-diagnostics)

## iot — IoT Sensor Data

**Path:** `src/servers/iot/main.py`
**Requires:** CouchDB (`COUCHDB_URL`, `COUCHDB_USERNAME`, `COUCHDB_PASSWORD`, `IOT_DBNAME`)

| Tool | Arguments | Description |
| --------- | ------------------------------------------ | ----------------------------------------------------------------------- |
| `sites` | — | List all available sites |
| `assets` | `site_name` | List all asset IDs for a site |
| `sensors` | `site_name`, `asset_id` | List sensor names for an asset |
| `history` | `site_name`, `asset_id`, `start`, `final?` | Fetch historical sensor readings for a time range (ISO 8601 timestamps) |

## utilities — Utilities

**Path:** `src/servers/utilities/main.py`
**Requires:** nothing (no external services)

| Tool | Arguments | Description |
| ---------------------- | ----------- | ------------------------------------------------------ |
| `json_reader` | `file_name` | Read and parse a JSON file from disk |
| `current_date_time` | — | Return the current UTC date and time as JSON |
| `current_time_english` | — | Return the current UTC time as a human-readable string |

## fmsr — Failure Mode and Sensor Relations

**Path:** `src/servers/fmsr/main.py`
**Requires:** `WATSONX_APIKEY`, `WATSONX_PROJECT_ID`, `WATSONX_URL` for unknown assets; curated lists for `chiller` and `ahu` work without credentials.
**Failure-mode data:** `src/servers/fmsr/failure_modes.yaml` (edit to add/change asset entries)

| Tool | Arguments | Description |
| --------------------------------- | ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_failure_modes` | `asset_name` | Return known failure modes for an asset. Uses a curated YAML list for chillers and AHUs; falls back to the LLM for other types. |
| `get_failure_mode_sensor_mapping` | `asset_name`, `failure_modes`, `sensors` | For each (failure mode, sensor) pair, determine relevancy via LLM. Returns bidirectional `fm→sensors` and `sensor→fms` maps plus full per-pair details. |
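
The bidirectional maps in the `get_failure_mode_sensor_mapping` result fall straight out of the per-pair relevancy decisions; a minimal sketch of that data structure (illustrative names, not the server's exact schema):

```python
from collections import defaultdict

# Hypothetical per-pair relevancy decisions, as the LLM step might produce them.
pairs = [
    ("Refrigerant Leak", "SuctionPressure", True),
    ("Refrigerant Leak", "SupplyAirTemp", False),
    ("Fouled Condenser", "CondenserPressure", True),
]

fm_to_sensors: dict[str, list[str]] = defaultdict(list)
sensor_to_fms: dict[str, list[str]] = defaultdict(list)
for failure_mode, sensor, relevant in pairs:
    if relevant:
        fm_to_sensors[failure_mode].append(sensor)
        sensor_to_fms[sensor].append(failure_mode)
```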

## wo — Work Order

**Path:** `src/servers/wo/main.py`
**Requires:** CouchDB (`COUCHDB_URL`, `COUCHDB_USERNAME`, `COUCHDB_PASSWORD`, `WO_DBNAME`)
**Data init:** Handled automatically by `docker compose -f src/couchdb/docker-compose.yaml up` (runs `src/couchdb/init_wo.py` inside the CouchDB container on every start — database is dropped and reloaded each time)

| Tool | Arguments | Description |
| ----------------------------- | ----------------------------------------------------- | ---------------------------------------------------------------------------------------- |
| `get_work_orders` | `equipment_id`, `start_date?`, `end_date?` | Retrieve all work orders for an equipment within an optional date range |
| `get_preventive_work_orders` | `equipment_id`, `start_date?`, `end_date?` | Retrieve only preventive (PM) work orders |
| `get_corrective_work_orders` | `equipment_id`, `start_date?`, `end_date?` | Retrieve only corrective (CM) work orders |
| `get_events` | `equipment_id`, `start_date?`, `end_date?` | Retrieve all events (work orders, alerts, anomalies) |
| `get_failure_codes` | — | List all failure codes with categories and descriptions |
| `get_work_order_distribution` | `equipment_id`, `start_date?`, `end_date?` | Count work orders per (primary, secondary) failure code pair, sorted by frequency |
| `predict_next_work_order` | `equipment_id`, `start_date?`, `end_date?` | Predict next work order type via Markov transition matrix built from historical sequence |
| `analyze_alert_to_failure` | `equipment_id`, `rule_id`, `start_date?`, `end_date?` | Probability that an alert rule leads to a work order; average hours to maintenance |
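
The Markov prediction behind `predict_next_work_order` reduces to a first-order transition matrix over the historical sequence of work-order types; a minimal sketch of the idea (not the server's exact implementation):

```python
from collections import Counter, defaultdict

# Hypothetical historical sequence of work-order types for one equipment.
history = ["PM", "PM", "CM", "PM", "CM", "CM", "PM"]

# Count transitions between consecutive work-order types.
transitions: dict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(history, history[1:]):
    transitions[prev][nxt] += 1

# Normalise the row for the last observed type and take the most likely successor.
last = history[-1]
total = sum(transitions[last].values())
probabilities = {wo_type: n / total for wo_type, n in transitions[last].items()}
prediction = max(probabilities, key=probabilities.get)
print(prediction, probabilities)
```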

## tsfm — Time Series Foundation Model

**Path:** `src/servers/tsfm/main.py`
**Requires:** `tsfm_public` (IBM Granite TSFM), `transformers`, `torch` for ML tools — imported lazily; static tools work without them.
**Model checkpoints:** resolved relative to `PATH_TO_MODELS_DIR` (default: `src/servers/tsfm/artifacts/output/tuned_models`)

| Tool | Arguments | Description |
| ---------------------- | --------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| `get_ai_tasks` | — | List supported AI task types for time-series analysis |
| `get_tsfm_models` | — | List available pre-trained TinyTimeMixer (TTM) model checkpoints |
| `run_tsfm_forecasting` | `dataset_path`, `timestamp_column`, `target_columns`, `model_checkpoint?`, `forecast_horizon?`, `frequency_sampling?`, ... | Zero-shot TTM inference; returns path to a JSON predictions file |
| `run_tsfm_finetuning` | `dataset_path`, `timestamp_column`, `target_columns`, `model_checkpoint?`, `save_model_dir?`, `n_finetune?`, `n_test?`, ... | Few-shot fine-tune a TTM model; returns saved checkpoint path and metrics file |
| `run_tsad` | `dataset_path`, `tsfm_output_json`, `timestamp_column`, `target_columns`, `task?`, `false_alarm?`, `ad_model_type?`, ... | Conformal anomaly detection on top of a forecasting output JSON; returns CSV with anomaly labels |
| `run_integrated_tsad` | `dataset_path`, `timestamp_column`, `target_columns`, `model_checkpoint?`, `false_alarm?`, `n_calibration?`, ... | End-to-end forecasting + anomaly detection in one call; returns combined CSV |

## vibration — Vibration Diagnostics

**Path:** `src/servers/vibration/main.py`
**Requires:** CouchDB (`COUCHDB_URL`, `VIBRATION_DBNAME` (default `vibration`), `COUCHDB_USERNAME`, `COUCHDB_PASSWORD`); `numpy`, `scipy`
**DSP core:** `src/servers/vibration/dsp/` — adapted from [vibration-analysis-mcp](https://github.com/LGDiMaggio/claude-stwinbox-diagnostics/tree/main/mcp-servers/vibration-analysis-mcp) (Apache-2.0)

| Tool | Arguments | Description |
|---|---|---|
| `get_vibration_data` | `site_name`, `asset_id`, `sensor_name`, `start`, `final?` | Fetch vibration time-series from CouchDB and load into the analysis store. Returns a `data_id`. |
| `list_vibration_sensors` | `site_name`, `asset_id` | List available sensor fields for an asset. |
| `compute_fft_spectrum` | `data_id`, `window?`, `top_n?` | Compute FFT amplitude spectrum (top-N peaks + statistics). |
| `compute_envelope_spectrum` | `data_id`, `band_low_hz?`, `band_high_hz?`, `top_n?` | Compute envelope spectrum for bearing fault detection (Hilbert transform). |
| `assess_vibration_severity` | `rms_velocity_mm_s`, `machine_group?` | Classify vibration severity per ISO 10816 (Zones A–D). |
| `calculate_bearing_frequencies` | `rpm`, `n_balls`, `ball_diameter_mm`, `pitch_diameter_mm`, `contact_angle_deg?`, `bearing_name?` | Compute bearing characteristic frequencies (BPFO, BPFI, BSF, FTF). |
| `list_known_bearings` | — | List all bearings in the built-in database. |
| `diagnose_vibration` | `data_id`, `rpm?`, `bearing_designation?`, `bearing_*?`, `bpfo_hz?`, `bpfi_hz?`, `bsf_hz?`, `ftf_hz?`, `machine_group?`, `machine_description?` | Full automated diagnosis: FFT + shaft features + bearing envelope + ISO 10816 + fault classification + markdown report. |
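
The characteristic frequencies returned by `calculate_bearing_frequencies` follow the standard bearing-geometry formulas; a sketch using those textbook definitions (the server's implementation may differ in detail):

```python
import math

def bearing_frequencies(rpm: float, n_balls: int, ball_diameter_mm: float,
                        pitch_diameter_mm: float,
                        contact_angle_deg: float = 0.0) -> dict[str, float]:
    """Standard BPFO / BPFI / BSF / FTF formulas from bearing geometry."""
    fr = rpm / 60.0  # shaft rotation frequency in Hz
    ratio = (ball_diameter_mm / pitch_diameter_mm) * math.cos(math.radians(contact_angle_deg))
    return {
        "shaft_hz": fr,
        "BPFO": (n_balls / 2.0) * fr * (1 - ratio),   # outer-race defect
        "BPFI": (n_balls / 2.0) * fr * (1 + ratio),   # inner-race defect
        "BSF": (pitch_diameter_mm / (2.0 * ball_diameter_mm)) * fr * (1 - ratio ** 2),  # ball spin
        "FTF": (fr / 2.0) * (1 - ratio),              # cage / fundamental train
    }

# Roughly 6205-sized geometry (9 balls, ~7.94 mm balls, ~39 mm pitch circle) at 1800 RPM.
print(bearing_frequencies(rpm=1800, n_balls=9, ball_diameter_mm=7.94, pitch_diameter_mm=39.04))
```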

## Running a server manually

Servers are normally spawned on demand by an agent client. To launch one directly for testing:

```bash
uv run iot-mcp-server
uv run utilities-mcp-server
uv run fmsr-mcp-server
uv run wo-mcp-server
uv run tsfm-mcp-server
uv run vibration-mcp-server
```

They speak MCP over stdio, so they're idle until a client connects on stdin.
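
For a programmatic smoke test, the MCP Python SDK's stdio client can spawn and drive a server the same way the agent runners do. A minimal sketch (tool names taken from the iot table above; the server still needs its CouchDB env vars):

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Spawn the iot server as a stdio subprocess, equivalent to `uv run iot-mcp-server`.
    params = StdioServerParameters(command="uv", args=["run", "iot-mcp-server"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])      # sites, assets, sensors, history
            result = await session.call_tool("sites", {})   # no-argument tool
            print(result.content)

asyncio.run(main())
```
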
217 changes: 217 additions & 0 deletions docs/observability.md
@@ -0,0 +1,217 @@
# Observability

Each agent run produces two artifacts, joined by ``run_id``:

1. **Trace** — an OpenTelemetry span with *metadata* and *aggregate
metrics* (runner, model, IDs, latency via span duration, token
totals, turn and tool-call counts). Written as canonical OTLP-JSON
and recognised by every OTEL-aware backend (Jaeger, Tempo, Langfuse,
Grafana Cloud AI, Honeycomb).
2. **Trajectory** — a per-run JSON file with *per-turn content*: turn
text, tool call inputs / outputs, and (for SDK runners) per-turn
token usage. Written directly by the agent runner alongside the
trace.

Spans and trajectories complement each other without duplicating
content: the span holds everything an observability UI needs to
summarise or bill a run, while the trajectory holds the raw per-turn data
needed for offline evaluation. Aggregate numbers (totals) live on the
span; per-turn numbers (from which the totals are derived) live on the
trajectory. Nothing is repeated.

## Root span attributes

Metadata + aggregate metrics — always written when tracing is enabled.

"SDK runners" below means claude-agent, openai-agent, and deep-agent (which all
expose turn/tool-call bookkeeping); plan-execute's loop is step-shaped and
surfaces different attributes.

| Attribute | Runner coverage | Notes |
| ----------------------------- | ----------------- | -------------------------------------- |
| `agent.runner` | all | `plan-execute` / `claude-agent` / … |
| `gen_ai.system` | all | Provider family (anthropic, openai…) |
| `gen_ai.request.model` | all | Full model ID |
| `gen_ai.usage.input_tokens` | all | Sum across the run |
| `gen_ai.usage.output_tokens` | all | Sum across the run |
| `agent.question.length` | all | Character length of the question |
| `agent.answer.length` | all | Character length of the final answer |
| `agent.duration_ms` | all | Wall-clock of `run()` |
| `agent.run_id` | all | `--run-id` or auto-generated UUID4 |
| `agent.scenario_id` | all | `--scenario-id` (omitted if unset) |
| `agent.turns` | SDK runners | Number of turns |
| `agent.tool_calls` | SDK runners | Total tool calls |
| `agent.llm_time_ms` | plan-execute | Planning + summarisation LLM time |
| `agent.planning_time_ms` | plan-execute | `Planner.generate_plan` wall-clock |
| `agent.summarization_time_ms` | plan-execute | Final summarise-LLM wall-clock |
| `agent.plan.steps` | plan-execute | Number of generated plan steps |

For plan-execute, ``gen_ai.usage.*`` is the run-wide sum across planning,
per-step arg-resolution, and summarisation LLM calls (provided the backend
reports usage — ``LiteLLMBackend`` does; mocks return zero). Turn and
tool-call counts have no clean mapping to the step-shaped loop and are
not surfaced; per-step wall-clock lives on each ``StepResult.duration_ms``
in the trajectory.

Per-tool timing is not captured for the three SDK runners — the
`PreToolUse` hook that claude-agent needed broke compatibility with
some `@anthropic-ai/claude-code` CLI versions, and the openai / deep
SDKs do not expose clean per-tool callback surfaces either. This is
left as a follow-up when needed.

The `HTTPXClientInstrumentor` also adds automatic child spans — one per
outbound HTTP request to the LiteLLM proxy (URL, status, latency). The
root span's own duration is the agent wall-clock, so `agent.duration_ms` is
redundant for OTEL UIs but convenient for jq on the JSONL file.
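
This is the standard httpx auto-instrumentation; a minimal sketch of how it is typically switched on (the repo's observability setup already does this, shown here only for orientation):

```python
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

# After this call, every outbound httpx request emits a child span
# carrying the URL, status code, and latency under the current trace.
HTTPXClientInstrumentor().instrument()
```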

## Trajectory file layout

When ``AGENT_TRAJECTORY_DIR`` is set, each runner writes
``{AGENT_TRAJECTORY_DIR}/{run_id}.json``. The `trajectory` field's shape
depends on the runner.

| Field | claude-agent | openai-agent | deep-agent | plan-execute |
| ----------------------------- | ------------ | ------------ | ---------- | ------------ |
| `Trajectory.started_at` | ✓ | ✓ | ✓ | (n/a) |
| `TurnRecord.duration_ms` | ✓ | ✗ | ✗ | (n/a) |
| `ToolCall.duration_ms` | ✗ | ✗ | ✗ | (n/a) |
| `StepResult.duration_ms` | (n/a) | (n/a) | (n/a) | ✓ |

plan-execute's trajectory is a list of ``StepResult`` records instead
of turns, each carrying its own ``duration_ms`` populated by the executor.

## Enabling persistence

Install the optional tracing dependencies (trajectories need none):

```bash
uv sync --group otel
```

Each artifact has its own env var; set either, both, or neither:

| Env var | Effect |
| --------------------------------- | --------------------------------------------------- |
| `AGENT_TRAJECTORY_DIR` | Directory for ``{run_id}.json`` trajectory records. |
| `OTEL_TRACES_FILE` | Append OTLP-JSON lines to this path (in-process). |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Ship spans over HTTP to a live collector endpoint. |

When none are set, runs work normally with zero persistence overhead.

## Recommended: save both traces and trajectories

```bash
AGENT_TRAJECTORY_DIR=./traces/trajectories \
OTEL_TRACES_FILE=./traces/traces.jsonl \
uv run deep-agent --run-id bench-001 --scenario-id 304 \
"Calculate bearing characteristic frequencies for a 6205 bearing at 1800 RPM."
```

Each span batch appends one JSON line to `./traces/traces.jsonl` in
canonical OTLP-JSON format — the same format the OpenTelemetry Collector's
`file` exporter produces, and ingestible by the Collector's
`otlpjsonfile` receiver later if you want to replay into a live backend.

### Query with `jq`

For metadata + aggregate metrics (run_id, runner, model, token totals,
latency) read the trace alone — token totals are on the span:

```bash
jq -c '.resourceSpans[].scopeSpans[].spans[]
  | select(.name | startswith("agent.run"))
  | {
      run_id: (.attributes[] | select(.key == "agent.run_id") | .value.stringValue),
      runner: (.attributes[] | select(.key == "agent.runner") | .value.stringValue),
      model: (.attributes[] | select(.key == "gen_ai.request.model") | .value.stringValue),
      input_tokens: (.attributes[] | select(.key == "gen_ai.usage.input_tokens") | .value.intValue),
      output_tokens: (.attributes[] | select(.key == "gen_ai.usage.output_tokens") | .value.intValue),
      turns: (.attributes[] | select(.key == "agent.turns") | .value.intValue)
    }' traces/traces.jsonl
```

For per-turn content (text, tool call inputs/outputs, per-turn tokens)
read the matching trajectory file:

```bash
jq '.trajectory.turns[] | {index, input_tokens, tool_calls: [.tool_calls[].name]}' \
traces/trajectories/bench-001.json
```
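
To join the two artifacts without jq, match the root span's `agent.run_id` attribute to the trajectory filename. A minimal Python sketch, assuming the default paths from the example above and using only the attribute and field names documented here:

```python
import json
from pathlib import Path

TRACES = Path("traces/traces.jsonl")
TRAJECTORY_DIR = Path("traces/trajectories")

def span_attrs(span: dict) -> dict:
    """Flatten OTLP-JSON attributes ({key, value: {stringValue|intValue}}) into a dict."""
    return {a["key"]: a["value"].get("stringValue", a["value"].get("intValue"))
            for a in span.get("attributes", [])}

for line in TRACES.read_text().splitlines():
    for resource_spans in json.loads(line).get("resourceSpans", []):
        for scope_spans in resource_spans.get("scopeSpans", []):
            for span in scope_spans.get("spans", []):
                if not span["name"].startswith("agent.run"):
                    continue
                attrs = span_attrs(span)
                run_id = attrs.get("agent.run_id")
                trajectory_path = TRAJECTORY_DIR / f"{run_id}.json"
                if not trajectory_path.exists():
                    continue
                record = json.loads(trajectory_path.read_text())
                trajectory = record.get("trajectory")
                # SDK runners store a dict with "turns"; plan-execute stores a list of steps.
                units = trajectory.get("turns", []) if isinstance(trajectory, dict) else (trajectory or [])
                print(run_id, attrs.get("gen_ai.usage.input_tokens"), len(units))
```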

### Rotation

The built-in file exporter appends indefinitely — each span batch adds only
one small JSON line, but the file can grow during long benchmark runs. For
rotation, hand the path to `logrotate`, or split runs across dated files:

```bash
OTEL_TRACES_FILE="./traces/$(date +%F).jsonl" uv run deep-agent "..."
```

## Replaying saved traces into a live backend (optional)

If you later want to visualize persisted traces, point any
OpenTelemetry Collector at the file with its `otlpjsonfile` receiver:

```yaml
receivers:
  otlpjsonfile:
    include: ["traces/traces.jsonl"]
exporters:
  otlp:
    endpoint: jaeger:4317
    tls: {insecure: true}
service:
  pipelines:
    traces:
      receivers: [otlpjsonfile]
      exporters: [otlp]
```

## Live debugging with Jaeger (optional, Docker)

When network access to Docker Hub is available, Jaeger all-in-one is the
quickest way to inspect traces in a UI:

```bash
docker run -d --rm --name jaeger \
-p 16686:16686 -p 4318:4318 \
jaegertracing/all-in-one

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
OTEL_TRACES_FILE=./traces/traces.jsonl \
uv run deep-agent --run-id demo "$query"

open http://localhost:16686 # macOS
```

With both env vars set, spans go to disk *and* to Jaeger simultaneously.
Jaeger all-in-one is in-memory only; the file stays on disk when the
container exits.

## Troubleshooting

**"OTEL SDK not installed; tracing disabled"** — run `uv sync --group otel`.

**No output file on disk** — tracing is lazy; at least one runner has to
complete a `run()` call before the `BatchSpanProcessor` flushes. For small
smoke tests, make sure the CLI exits cleanly (the `atexit` hook flushes
any buffered spans).
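
If a runner is embedded in a long-lived process that never exits, the `atexit` hook never fires. In that case, a hedged sketch of flushing buffered spans yourself with the standard OTEL SDK API:

```python
from opentelemetry import trace

provider = trace.get_tracer_provider()
# The SDK TracerProvider exposes force_flush(); the no-op default provider does not.
if hasattr(provider, "force_flush"):
    provider.force_flush(timeout_millis=5_000)
```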

**Spans exist but `agent.run_id` is missing** — you called `runner.run()`
programmatically without going through a CLI. Seed it yourself:

```python
from observability import init_tracing, set_run_context
init_tracing("my-harness")
set_run_context(run_id="...", scenario_id="...")
await runner.run(question)
```

**No trajectory file in `AGENT_TRAJECTORY_DIR`** — the runner skips
persistence when no `run_id` is set. Use the CLI (which seeds a UUID4
automatically), or call `set_run_context(run_id=...)` before invoking
the runner programmatically.

**Exporter silently failing** — set `OTEL_LOG_LEVEL=debug` for the SDK's
internal logs.