Merged

28 commits
d5b67cd
feat: OTEL tracing across agent runners and LiteLLM proxy
ShuxinLin Apr 23, 2026
e9b3840
refactor: unify trajectory models and LiteLLM helpers across runners
ShuxinLin Apr 23, 2026
133cdf5
refactor: share system prompt and CLI boilerplate across SDK runners
ShuxinLin Apr 23, 2026
298344c
Merge pull request #274 from IBM/refactor/share-prompt-and-cli
ShuxinLin Apr 23, 2026
0012a61
refactor: collapse SDK CLI main() and cache deep-agent chat model
ShuxinLin Apr 23, 2026
aa81db6
Merge pull request #273 from IBM/refactor/unify-agent-models
ShuxinLin Apr 23, 2026
5a9770e
refactor: hoist server_paths to base + use AsyncExitStack
ShuxinLin Apr 24, 2026
4310e6d
feat: reposition OTEL as evaluation record persistence
ShuxinLin Apr 24, 2026
30ff76d
docs: observability.md + otel-collector.yaml
ShuxinLin Apr 24, 2026
abbd2f1
refactor: drop unused observability scaffolding; fix span loss on exit
ShuxinLin Apr 24, 2026
8951960
feat: in-process OTLP-JSON file exporter; drop Docker requirement
ShuxinLin Apr 24, 2026
bf873f8
feat: separate span metadata from trajectory content
ShuxinLin Apr 24, 2026
e216048
feat: restore GenAI usage + turn/tool_call aggregates on root span
ShuxinLin Apr 24, 2026
473183f
feat: timing metrics on spans and trajectories
ShuxinLin Apr 24, 2026
3e59716
fix(claude-agent): drop PreToolUse hook to restore compatibility
ShuxinLin Apr 24, 2026
5a1819d
docs(observability): correct runner coverage for span attrs and traje…
ShuxinLin Apr 24, 2026
7677828
Update observability.md
ShuxinLin Apr 24, 2026
43694d3
docs(observability): clarify plan-execute has no token tracking
ShuxinLin Apr 24, 2026
3937e2f
feat(plan-execute): report token usage on root span
ShuxinLin Apr 24, 2026
0a2c551
docs(observability): reflect token tracking now covers all runners
ShuxinLin Apr 24, 2026
fdcf72b
docs(instructions): add Observability section linking to docs/observa…
ShuxinLin Apr 27, 2026
ef0169b
docs: split MCP server reference into docs/mcp-servers.md
ShuxinLin Apr 27, 2026
fcb40aa
docs(instructions): trim per-agent boxes from Architecture diagram
ShuxinLin Apr 27, 2026
d88e597
docs(instructions): aggregate four agent sections into one
ShuxinLin Apr 27, 2026
4ddea41
refactor(plan-execute): drop --server NAME=PATH CLI flag
ShuxinLin Apr 27, 2026
b922432
docs(instructions): drop Plan-Execute loop diagram
ShuxinLin Apr 27, 2026
6e206a1
docs(instructions): drop Model selection subsection
ShuxinLin Apr 27, 2026
7dbe3d6
docs(instructions): tighten Running Tests section
ShuxinLin Apr 27, 2026
3 changes: 3 additions & 0 deletions .gitignore
@@ -201,3 +201,6 @@ mcp/couchdb/sample_data/bulk_docs.json
.env
mcp/servers/tsfm/artifacts/tsfm_models/
src/tmp/

# Observability artifacts (OTLP-JSON traces + per-run trajectory JSON).
traces/
455 changes: 82 additions & 373 deletions INSTRUCTIONS.md

Large diffs are not rendered by default.

110 changes: 110 additions & 0 deletions docs/mcp-servers.md
@@ -0,0 +1,110 @@
# MCP Servers

Six FastMCP servers expose the AssetOpsBench domain logic. Each is a standalone stdio process spawned on demand by clients (`plan-execute`, `claude-agent`, `openai-agent`, `deep-agent`, Claude Desktop). Backing services and credentials are listed per server below.

## Contents

- [iot — IoT Sensor Data](#iot--iot-sensor-data)
- [utilities — Utilities](#utilities--utilities)
- [fmsr — Failure Mode and Sensor Relations](#fmsr--failure-mode-and-sensor-relations)
- [wo — Work Order](#wo--work-order)
- [tsfm — Time Series Foundation Model](#tsfm--time-series-foundation-model)
- [vibration — Vibration Diagnostics](#vibration--vibration-diagnostics)

## iot — IoT Sensor Data

**Path:** `src/servers/iot/main.py`
**Requires:** CouchDB (`COUCHDB_URL`, `COUCHDB_USERNAME`, `COUCHDB_PASSWORD`, `IOT_DBNAME`)

| Tool | Arguments | Description |
| --------- | ------------------------------------------ | ----------------------------------------------------------------------- |
| `sites` | — | List all available sites |
| `assets` | `site_name` | List all asset IDs for a site |
| `sensors` | `site_name`, `asset_id` | List sensor names for an asset |
| `history` | `site_name`, `asset_id`, `start`, `final?` | Fetch historical sensor readings for a time range (ISO 8601 timestamps) |

## utilities — Utilities

**Path:** `src/servers/utilities/main.py`
**Requires:** nothing (no external services)

| Tool | Arguments | Description |
| ---------------------- | ----------- | ------------------------------------------------------ |
| `json_reader` | `file_name` | Read and parse a JSON file from disk |
| `current_date_time` | — | Return the current UTC date and time as JSON |
| `current_time_english` | — | Return the current UTC time as a human-readable string |

## fmsr — Failure Mode and Sensor Relations

**Path:** `src/servers/fmsr/main.py`
**Requires:** `WATSONX_APIKEY`, `WATSONX_PROJECT_ID`, `WATSONX_URL` for unknown assets; curated lists for `chiller` and `ahu` work without credentials.
**Failure-mode data:** `src/servers/fmsr/failure_modes.yaml` (edit to add/change asset entries)

| Tool | Arguments | Description |
| --------------------------------- | ---------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `get_failure_modes` | `asset_name` | Return known failure modes for an asset. Uses a curated YAML list for chillers and AHUs; falls back to the LLM for other types. |
| `get_failure_mode_sensor_mapping` | `asset_name`, `failure_modes`, `sensors` | For each (failure mode, sensor) pair, determine relevancy via LLM. Returns bidirectional `fm→sensors` and `sensor→fms` maps plus full per-pair details. |
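
The bidirectional maps in the `get_failure_mode_sensor_mapping` result fall straight out of the per-pair relevancy decisions; a minimal sketch of that data structure (illustrative names, not the server's exact schema):

```python
from collections import defaultdict

# Hypothetical per-pair relevancy decisions, as the LLM step might produce them.
pairs = [
    ("Refrigerant Leak", "SuctionPressure", True),
    ("Refrigerant Leak", "SupplyAirTemp", False),
    ("Fouled Condenser", "CondenserPressure", True),
]

fm_to_sensors: dict[str, list[str]] = defaultdict(list)
sensor_to_fms: dict[str, list[str]] = defaultdict(list)
for failure_mode, sensor, relevant in pairs:
    if relevant:
        fm_to_sensors[failure_mode].append(sensor)
        sensor_to_fms[sensor].append(failure_mode)
```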

## wo — Work Order

**Path:** `src/servers/wo/main.py`
**Requires:** CouchDB (`COUCHDB_URL`, `COUCHDB_USERNAME`, `COUCHDB_PASSWORD`, `WO_DBNAME`)
**Data init:** Handled automatically by `docker compose -f src/couchdb/docker-compose.yaml up` (runs `src/couchdb/init_wo.py` inside the CouchDB container on every start — database is dropped and reloaded each time)

| Tool | Arguments | Description |
| ----------------------------- | ----------------------------------------------------- | ---------------------------------------------------------------------------------------- |
| `get_work_orders` | `equipment_id`, `start_date?`, `end_date?` | Retrieve all work orders for an equipment within an optional date range |
| `get_preventive_work_orders` | `equipment_id`, `start_date?`, `end_date?` | Retrieve only preventive (PM) work orders |
| `get_corrective_work_orders` | `equipment_id`, `start_date?`, `end_date?` | Retrieve only corrective (CM) work orders |
| `get_events` | `equipment_id`, `start_date?`, `end_date?` | Retrieve all events (work orders, alerts, anomalies) |
| `get_failure_codes` | — | List all failure codes with categories and descriptions |
| `get_work_order_distribution` | `equipment_id`, `start_date?`, `end_date?` | Count work orders per (primary, secondary) failure code pair, sorted by frequency |
| `predict_next_work_order` | `equipment_id`, `start_date?`, `end_date?` | Predict next work order type via Markov transition matrix built from historical sequence |
| `analyze_alert_to_failure` | `equipment_id`, `rule_id`, `start_date?`, `end_date?` | Probability that an alert rule leads to a work order; average hours to maintenance |
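
The Markov prediction behind `predict_next_work_order` reduces to a first-order transition matrix over the historical sequence of work-order types; a minimal sketch of the idea (not the server's exact implementation):

```python
from collections import Counter, defaultdict

# Hypothetical historical sequence of work-order types for one equipment.
history = ["PM", "PM", "CM", "PM", "CM", "CM", "PM"]

# Count transitions between consecutive work-order types.
transitions: dict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(history, history[1:]):
    transitions[prev][nxt] += 1

# Normalise the row for the last observed type and take the most likely successor.
last = history[-1]
total = sum(transitions[last].values())
probabilities = {wo_type: n / total for wo_type, n in transitions[last].items()}
prediction = max(probabilities, key=probabilities.get)
print(prediction, probabilities)
```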

## tsfm — Time Series Foundation Model

**Path:** `src/servers/tsfm/main.py`
**Requires:** `tsfm_public` (IBM Granite TSFM), `transformers`, `torch` for ML tools — imported lazily; static tools work without them.
**Model checkpoints:** resolved relative to `PATH_TO_MODELS_DIR` (default: `src/servers/tsfm/artifacts/output/tuned_models`)

| Tool | Arguments | Description |
| ---------------------- | --------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| `get_ai_tasks` | — | List supported AI task types for time-series analysis |
| `get_tsfm_models` | — | List available pre-trained TinyTimeMixer (TTM) model checkpoints |
| `run_tsfm_forecasting` | `dataset_path`, `timestamp_column`, `target_columns`, `model_checkpoint?`, `forecast_horizon?`, `frequency_sampling?`, ... | Zero-shot TTM inference; returns path to a JSON predictions file |
| `run_tsfm_finetuning` | `dataset_path`, `timestamp_column`, `target_columns`, `model_checkpoint?`, `save_model_dir?`, `n_finetune?`, `n_test?`, ... | Few-shot fine-tune a TTM model; returns saved checkpoint path and metrics file |
| `run_tsad` | `dataset_path`, `tsfm_output_json`, `timestamp_column`, `target_columns`, `task?`, `false_alarm?`, `ad_model_type?`, ... | Conformal anomaly detection on top of a forecasting output JSON; returns CSV with anomaly labels |
| `run_integrated_tsad` | `dataset_path`, `timestamp_column`, `target_columns`, `model_checkpoint?`, `false_alarm?`, `n_calibration?`, ... | End-to-end forecasting + anomaly detection in one call; returns combined CSV |

## vibration — Vibration Diagnostics

**Path:** `src/servers/vibration/main.py`
**Requires:** CouchDB (`COUCHDB_URL`, `VIBRATION_DBNAME` (default `vibration`), `COUCHDB_USERNAME`, `COUCHDB_PASSWORD`); `numpy`, `scipy`
**DSP core:** `src/servers/vibration/dsp/` — adapted from [vibration-analysis-mcp](https://github.com/LGDiMaggio/claude-stwinbox-diagnostics/tree/main/mcp-servers/vibration-analysis-mcp) (Apache-2.0)

| Tool | Arguments | Description |
|---|---|---|
| `get_vibration_data` | `site_name`, `asset_id`, `sensor_name`, `start`, `final?` | Fetch vibration time-series from CouchDB and load into the analysis store. Returns a `data_id`. |
| `list_vibration_sensors` | `site_name`, `asset_id` | List available sensor fields for an asset. |
| `compute_fft_spectrum` | `data_id`, `window?`, `top_n?` | Compute FFT amplitude spectrum (top-N peaks + statistics). |
| `compute_envelope_spectrum` | `data_id`, `band_low_hz?`, `band_high_hz?`, `top_n?` | Compute envelope spectrum for bearing fault detection (Hilbert transform). |
| `assess_vibration_severity` | `rms_velocity_mm_s`, `machine_group?` | Classify vibration severity per ISO 10816 (Zones A–D). |
| `calculate_bearing_frequencies` | `rpm`, `n_balls`, `ball_diameter_mm`, `pitch_diameter_mm`, `contact_angle_deg?`, `bearing_name?` | Compute bearing characteristic frequencies (BPFO, BPFI, BSF, FTF). |
| `list_known_bearings` | — | List all bearings in the built-in database. |
| `diagnose_vibration` | `data_id`, `rpm?`, `bearing_designation?`, `bearing_*?`, `bpfo_hz?`, `bpfi_hz?`, `bsf_hz?`, `ftf_hz?`, `machine_group?`, `machine_description?` | Full automated diagnosis: FFT + shaft features + bearing envelope + ISO 10816 + fault classification + markdown report. |
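
The characteristic frequencies returned by `calculate_bearing_frequencies` follow the standard bearing-geometry formulas; a sketch using those textbook definitions (the server's implementation may differ in detail):

```python
import math

def bearing_frequencies(rpm: float, n_balls: int, ball_diameter_mm: float,
                        pitch_diameter_mm: float,
                        contact_angle_deg: float = 0.0) -> dict[str, float]:
    """Standard BPFO / BPFI / BSF / FTF formulas from bearing geometry."""
    fr = rpm / 60.0  # shaft rotation frequency in Hz
    ratio = (ball_diameter_mm / pitch_diameter_mm) * math.cos(math.radians(contact_angle_deg))
    return {
        "shaft_hz": fr,
        "BPFO": (n_balls / 2.0) * fr * (1 - ratio),   # outer-race defect
        "BPFI": (n_balls / 2.0) * fr * (1 + ratio),   # inner-race defect
        "BSF": (pitch_diameter_mm / (2.0 * ball_diameter_mm)) * fr * (1 - ratio ** 2),  # ball spin
        "FTF": (fr / 2.0) * (1 - ratio),              # cage / fundamental train
    }

# Roughly 6205-sized geometry (9 balls, ~7.94 mm balls, ~39 mm pitch circle) at 1800 RPM.
print(bearing_frequencies(rpm=1800, n_balls=9, ball_diameter_mm=7.94, pitch_diameter_mm=39.04))
```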

## Running a server manually

Servers are normally spawned on demand by an agent client. To launch one directly for testing:

```bash
uv run iot-mcp-server
uv run utilities-mcp-server
uv run fmsr-mcp-server
uv run wo-mcp-server
uv run tsfm-mcp-server
uv run vibration-mcp-server
```

They speak MCP over stdio, so they're idle until a client connects on stdin.
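
For a programmatic smoke test, the MCP Python SDK's stdio client can spawn and drive a server the same way the agent runners do. A minimal sketch (tool names taken from the iot table above; the server still needs its CouchDB env vars):

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Spawn the iot server as a stdio subprocess, equivalent to `uv run iot-mcp-server`.
    params = StdioServerParameters(command="uv", args=["run", "iot-mcp-server"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])      # sites, assets, sensors, history
            result = await session.call_tool("sites", {})   # no-argument tool
            print(result.content)

asyncio.run(main())
```
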
217 changes: 217 additions & 0 deletions docs/observability.md
@@ -0,0 +1,217 @@
# Observability

Each agent run produces two artifacts, joined by ``run_id``:

1. **Trace** — an OpenTelemetry span with *metadata* and *aggregate
metrics* (runner, model, IDs, latency via span duration, token
totals, turn and tool-call counts). Written as canonical OTLP-JSON
and recognised by every OTEL-aware backend (Jaeger, Tempo, Langfuse,
Grafana Cloud AI, Honeycomb).
2. **Trajectory** — a per-run JSON file with *per-turn content*: turn
text, tool call inputs / outputs, and (for SDK runners) per-turn
token usage. Written directly by the agent runner alongside the
trace.

Spans and trajectories complement each other without duplicating
content: the span holds everything an observability UI needs to
summarise or bill a run, while the trajectory holds the raw per-turn data
needed for offline evaluation. Aggregate numbers (totals) live on the
span; per-turn numbers (from which the totals are derived) live on the
trajectory. Nothing is repeated.

## Root span attributes

Metadata + aggregate metrics — always written when tracing is enabled.

"SDK runners" below means claude-agent, openai-agent, and deep-agent (which all
expose turn/tool-call bookkeeping); plan-execute's loop is step-shaped and
surfaces different attributes.

| Attribute | Runner coverage | Notes |
| ----------------------------- | ----------------- | -------------------------------------- |
| `agent.runner` | all | `plan-execute` / `claude-agent` / … |
| `gen_ai.system` | all | Provider family (anthropic, openai…) |
| `gen_ai.request.model` | all | Full model ID |
| `gen_ai.usage.input_tokens` | all | Sum across the run |
| `gen_ai.usage.output_tokens` | all | Sum across the run |
| `agent.question.length` | all | Character length of the question |
| `agent.answer.length` | all | Character length of the final answer |
| `agent.duration_ms` | all | Wall-clock of `run()` |
| `agent.run_id` | all | `--run-id` or auto-generated UUID4 |
| `agent.scenario_id` | all | `--scenario-id` (omitted if unset) |
| `agent.turns` | SDK runners | Number of turns |
| `agent.tool_calls` | SDK runners | Total tool calls |
| `agent.llm_time_ms` | plan-execute | Planning + summarisation LLM time |
| `agent.planning_time_ms` | plan-execute | `Planner.generate_plan` wall-clock |
| `agent.summarization_time_ms` | plan-execute | Final summarise-LLM wall-clock |
| `agent.plan.steps` | plan-execute | Number of generated plan steps |

For plan-execute, ``gen_ai.usage.*`` is the run-wide sum across planning,
per-step arg-resolution, and summarisation LLM calls (provided the backend
reports usage — ``LiteLLMBackend`` does; mocks return zero). Turn and
tool-call counts have no clean mapping to the step-shaped loop and are
not surfaced; per-step wall-clock lives on each ``StepResult.duration_ms``
in the trajectory.

Per-tool timing is not captured for the three SDK runners — the
`PreToolUse` hook that claude-agent needed broke compatibility with
some `@anthropic-ai/claude-code` CLI versions, and the openai / deep
SDKs do not expose clean per-tool callback surfaces either. This is
left as a follow-up when needed.

The `HTTPXClientInstrumentor` also adds automatic child spans — one per
outbound HTTP request to the LiteLLM proxy (URL, status, latency). The
root span's own duration is the agent wall-clock, so `agent.duration_ms` is
redundant for OTEL UIs but convenient for jq on the JSONL file.
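
This is the standard httpx auto-instrumentation; a minimal sketch of how it is typically switched on (the repo's observability setup already does this, shown here only for orientation):

```python
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

# After this call, every outbound httpx request emits a child span
# carrying the URL, status code, and latency under the current trace.
HTTPXClientInstrumentor().instrument()
```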

## Trajectory file layout

When ``AGENT_TRAJECTORY_DIR`` is set, each runner writes
``{AGENT_TRAJECTORY_DIR}/{run_id}.json``. The `trajectory` field's shape
depends on the runner.

| Field | claude-agent | openai-agent | deep-agent | plan-execute |
| ----------------------------- | ------------ | ------------ | ---------- | ------------ |
| `Trajectory.started_at` | ✓ | ✓ | ✓ | (n/a) |
| `TurnRecord.duration_ms` | ✓ | ✗ | ✗ | (n/a) |
| `ToolCall.duration_ms` | ✗ | ✗ | ✗ | (n/a) |
| `StepResult.duration_ms` | (n/a) | (n/a) | (n/a) | ✓ |

plan-execute's trajectory is a list of ``StepResult`` records instead
of turns, each carrying its own ``duration_ms`` populated by the executor.

## Enabling persistence

Install the optional tracing dependencies (trajectories need none):

```bash
uv sync --group otel
```

Each artifact has its own env var; set either, both, or neither:

| Env var | Effect |
| --------------------------------- | --------------------------------------------------- |
| `AGENT_TRAJECTORY_DIR` | Directory for ``{run_id}.json`` trajectory records. |
| `OTEL_TRACES_FILE` | Append OTLP-JSON lines to this path (in-process). |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | Ship spans over HTTP to a live collector endpoint. |

When none are set, runs work normally with zero persistence overhead.

## Recommended: save both traces and trajectories

```bash
AGENT_TRAJECTORY_DIR=./traces/trajectories \
OTEL_TRACES_FILE=./traces/traces.jsonl \
uv run deep-agent --run-id bench-001 --scenario-id 304 \
"Calculate bearing characteristic frequencies for a 6205 bearing at 1800 RPM."
```

Each span batch appends one JSON line to `./traces/traces.jsonl` in
canonical OTLP-JSON format — the same format the OpenTelemetry Collector's
`file` exporter produces, and ingestible by the Collector's
`otlpjsonfile` receiver later if you want to replay into a live backend.

### Query with `jq`

For metadata + aggregate metrics (run_id, runner, model, token totals,
latency) read the trace alone — token totals are on the span:

```bash
jq -c '.resourceSpans[].scopeSpans[].spans[]
  | select(.name | startswith("agent.run"))
  | {
      run_id: (.attributes[] | select(.key == "agent.run_id") | .value.stringValue),
      runner: (.attributes[] | select(.key == "agent.runner") | .value.stringValue),
      model: (.attributes[] | select(.key == "gen_ai.request.model") | .value.stringValue),
      input_tokens: (.attributes[] | select(.key == "gen_ai.usage.input_tokens") | .value.intValue),
      output_tokens: (.attributes[] | select(.key == "gen_ai.usage.output_tokens") | .value.intValue),
      turns: (.attributes[] | select(.key == "agent.turns") | .value.intValue)
    }' traces/traces.jsonl
```

For per-turn content (text, tool call inputs/outputs, per-turn tokens)
read the matching trajectory file:

```bash
jq '.trajectory.turns[] | {index, input_tokens, tool_calls: [.tool_calls[].name]}' \
traces/trajectories/bench-001.json
```
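
To join the two artifacts without jq, match the root span's `agent.run_id` attribute to the trajectory filename. A minimal Python sketch, assuming the default paths from the example above and using only the attribute and field names documented here:

```python
import json
from pathlib import Path

TRACES = Path("traces/traces.jsonl")
TRAJECTORY_DIR = Path("traces/trajectories")

def span_attrs(span: dict) -> dict:
    """Flatten OTLP-JSON attributes ({key, value: {stringValue|intValue}}) into a dict."""
    return {a["key"]: a["value"].get("stringValue", a["value"].get("intValue"))
            for a in span.get("attributes", [])}

for line in TRACES.read_text().splitlines():
    for resource_spans in json.loads(line).get("resourceSpans", []):
        for scope_spans in resource_spans.get("scopeSpans", []):
            for span in scope_spans.get("spans", []):
                if not span["name"].startswith("agent.run"):
                    continue
                attrs = span_attrs(span)
                run_id = attrs.get("agent.run_id")
                trajectory_path = TRAJECTORY_DIR / f"{run_id}.json"
                if not trajectory_path.exists():
                    continue
                record = json.loads(trajectory_path.read_text())
                trajectory = record.get("trajectory")
                # SDK runners store a dict with "turns"; plan-execute stores a list of steps.
                units = trajectory.get("turns", []) if isinstance(trajectory, dict) else (trajectory or [])
                print(run_id, attrs.get("gen_ai.usage.input_tokens"), len(units))
```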

### Rotation

The built-in file exporter appends indefinitely — each span batch adds only
one small JSON line, but the file can grow during long benchmark runs. For
rotation, hand the path to `logrotate`, or split runs across dated files:

```bash
OTEL_TRACES_FILE="./traces/$(date +%F).jsonl" uv run deep-agent "..."
```

## Replaying saved traces into a live backend (optional)

If you later want to visualize persisted traces, point any
OpenTelemetry Collector at the file with its `otlpjsonfile` receiver:

```yaml
receivers:
  otlpjsonfile:
    include: ["traces/traces.jsonl"]
exporters:
  otlp:
    endpoint: jaeger:4317
    tls: {insecure: true}
service:
  pipelines:
    traces:
      receivers: [otlpjsonfile]
      exporters: [otlp]
```

## Live debugging with Jaeger (optional, Docker)

When network access to Docker Hub is available, Jaeger all-in-one is the
quickest way to inspect traces in a UI:

```bash
docker run -d --rm --name jaeger \
-p 16686:16686 -p 4318:4318 \
jaegertracing/all-in-one

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 \
OTEL_TRACES_FILE=./traces/traces.jsonl \
uv run deep-agent --run-id demo "$query"

open http://localhost:16686 # macOS
```

With both env vars set, spans go to disk *and* to Jaeger simultaneously.
Jaeger all-in-one is in-memory only; the file stays on disk when the
container exits.

## Troubleshooting

**"OTEL SDK not installed; tracing disabled"** — run `uv sync --group otel`.

**No output file on disk** — tracing is lazy; at least one runner has to
complete a `run()` call before the `BatchSpanProcessor` flushes. For small
smoke tests, make sure the CLI exits cleanly (the `atexit` hook flushes
any buffered spans).
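
If a runner is embedded in a long-lived process that never exits, the `atexit` hook never fires. In that case, a hedged sketch of flushing buffered spans yourself with the standard OTEL SDK API:

```python
from opentelemetry import trace

provider = trace.get_tracer_provider()
# The SDK TracerProvider exposes force_flush(); the no-op default provider does not.
if hasattr(provider, "force_flush"):
    provider.force_flush(timeout_millis=5_000)
```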

**Spans exist but `agent.run_id` is missing** — you called `runner.run()`
programmatically without going through a CLI. Seed it yourself:

```python
from observability import init_tracing, set_run_context
init_tracing("my-harness")
set_run_context(run_id="...", scenario_id="...")
await runner.run(question)
```

**No trajectory file in `AGENT_TRAJECTORY_DIR`** — the runner skips
persistence when no `run_id` is set. Use the CLI (which seeds a UUID4
automatically), or call `set_run_context(run_id=...)` before invoking
the runner programmatically.

**Exporter silently failing** — set `OTEL_LOG_LEVEL=debug` for the SDK's
internal logs.