4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -6,6 +6,10 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The

## [Unreleased]

### Changed

- **Docs sweep: stale references and em-dash normalization.** Fixed three definite stale references (`spec_version='0.26.0'` in the Langfuse example output now reads `'0.38.0'`; the dangling `v0.16.1` qualifier dropped from the parallel-branches concept page; `compiled.attach_observer` corrected to `graph.attach_observer` in `non-obvious-shapes.md` for variable-name consistency with the rest of the docs). Swept em dashes out of the user-facing docs (130 instances across 17 files) per the convention set during the patterns expansion. mkdocs strict build clean; no broken intra-docs links.

### Added

- **vLLM production deployment notes.** `docs/model-providers/vllm.md` grows a "Production deployment" section covering the `VLLM_HTTP_TIMEOUT_KEEP_ALIVE` gotcha (vLLM's stock 5s uvicorn keep-alive lapses pooled OA-side httpx connections and surfaces as `ProviderUnavailable`; widen to roughly 300s), a systemd unit skeleton, and the three throughput knobs that interact with OA's shared connection pool (`--max-model-len`, `--max-num-seqs`, `--gpu-memory-utilization`). The existing "Tool calling" section grows a `--tool-call-parser` family table verified against vLLM's docs (Llama 3.x / Llama 4 / Mistral / Hermes / Qwen3 / DeepSeek V3 / GPT-OSS), plus explicit "not supported here" callouts for Anthropic / Gemini (proprietary cloud) and mainstream Gemma (no vLLM parser).
62 changes: 31 additions & 31 deletions docs/agent/non-obvious-shapes.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/agent/tldr.md
@@ -1,3 +1,3 @@
OpenArmature is a workflow framework for LLM pipelines and tool-calling agents typed state, compile-time topology checks, observability, and crash-safe checkpoints baked into a graph engine. The graph layer has no concept of LLMs or tools; the same primitives drive deterministic ETL pipelines and tool-calling agents alike. Nodes return partial updates; the engine merges into a frozen state snapshot. Behavior is defined by [openarmature-spec](https://openarmature.org/capabilities/) and verified by conformance fixtures; this package is the reference Python implementation.
OpenArmature is a workflow framework for LLM pipelines and tool-calling agents: typed state, compile-time topology checks, observability, and crash-safe checkpoints baked into a graph engine. The graph layer has no concept of LLMs or tools; the same primitives drive deterministic ETL pipelines and tool-calling agents alike. Nodes return partial updates; the engine merges into a frozen state snapshot. Behavior is defined by [openarmature-spec](https://openarmature.org/capabilities/) and verified by conformance fixtures; this package is the reference Python implementation.

**What OpenArmature is NOT:** not a chat framework (no built-in messages channel), not an LLM SDK (Provider is the abstraction layer; OpenAIProvider is the canonical impl), not a state-management library (state is per-invocation, not application-wide), not an evaluation framework (deferred to `openarmature-eval`).
4 changes: 2 additions & 2 deletions docs/concepts/checkpointing.md
@@ -117,7 +117,7 @@ Field framing worth getting right:
per-instance entry carries an explicit `result_is_error` boolean
that discriminates success contributions (roll forward into
`target_field`) from `collect`-mode error contributions (roll
forward into `errors_field`) — the engine reads the explicit field
forward into `errors_field`). The engine reads the explicit field
on resume rather than inferring routing from the shape of `result`.
Empty tuple when no fan-outs are in flight. See
[Resume semantics](fan-out.md#resume-semantics) on the fan-out
@@ -222,7 +222,7 @@ deserializes the result into your current state class.

**Canonical source for `schema_version`.** The framework reads
`schema_version` from the state class declared at graph construction
time the class passed to `GraphBuilder(...)`. If you pass a State
time: the class passed to `GraphBuilder(...)`. If you pass a State
subclass instance at runtime whose `schema_version` shadows the
declared class's value, the saved record still carries the declared
class's value. This rule keeps every save site within an invocation
4 changes: 2 additions & 2 deletions docs/concepts/fan-out.md
@@ -111,15 +111,15 @@ into prior state:
declare `append` on `Annotated[list[X], append]`. Each instance's
value is already an `X`; `append` concatenates cleanly.
- Each instance emits a `list[X]` (0..N records per instance) → the
engine lands `list[list[X]]`. Declare `concat_flatten` instead
engine lands `list[list[X]]`. Declare `concat_flatten` instead;
it flattens one level so the parent field stays `list[X]`. Plain
`append` would leave the nesting and fail Pydantic validation.
- Each instance emits a `dict[str, X]` → the engine lands
`list[dict]`. Declare `merge_all`, which folds the mappings into
the parent dict with last-write-wins per key. Plain `merge` can't
consume a `list[dict]`.

`concat_flatten` and `merge_all` are strict they raise
`concat_flatten` and `merge_all` are strict: they raise
`ReducerError` if an update element isn't the expected list/mapping
shape. See [state and reducers](state-and-reducers.md#five-built-in-reducers).
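
To make the three shapes concrete, here is a minimal stand-alone sketch of what one-level flattening and last-write-wins merging amount to. Everything in it is a hypothetical stand-in written for illustration: the local `ReducerError` class, the `(current, update)` signatures, and the function bodies are assumptions, not OpenArmature's actual reducer implementations.

```python
from collections.abc import Mapping
from typing import Any


class ReducerError(Exception):
    """Local stand-in for the framework's error type (illustration only)."""


def concat_flatten(current: list[Any], update: list[Any]) -> list[Any]:
    """Append every item of every inner list in `update` to `current`.

    Flattens exactly one level, so a list[list[X]] update keeps the
    parent field a flat list[X].
    """
    out = list(current)
    for element in update:
        if not isinstance(element, list):
            raise ReducerError(f"expected list, got {type(element).__name__}")
        out.extend(element)
    return out


def merge_all(current: Mapping[str, Any], update: list[Any]) -> dict[str, Any]:
    """Fold each mapping in `update` into `current`, last write wins per key."""
    out = dict(current)
    for element in update:
        if not isinstance(element, Mapping):
            raise ReducerError(f"expected mapping, got {type(element).__name__}")
        out.update(element)
    return out
```

Under these assumptions, `concat_flatten([1], [[2, 3], [], [4]])` yields a flat list, and a later instance's value for a key replaces an earlier one in `merge_all`.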

26 changes: 13 additions & 13 deletions docs/concepts/llms.md
@@ -99,42 +99,42 @@ async def startup() -> None:
try:
await provider.ready()
except ProviderAuthentication:
# Bad API key fail fast at boot.
# Bad API key: fail fast at boot.
raise
except ProviderInvalidModel:
# Bound model isn't served by this endpoint same.
# Bound model isn't served by this endpoint: same.
raise
except ProviderUnavailable:
# Endpoint is down or unreachable fail fast too.
# Endpoint is down or unreachable: fail fast too.
raise
```

`OpenAIProvider` ships three probe shapes selected via the
`readiness_probe` constructor kwarg:

- **`"chat_completions"`** (default) issues `POST /v1/chat/completions`
- **`"chat_completions"`** (default): issues `POST /v1/chat/completions`
with a `max_tokens=1` body. Actually exercises the inference wire
path. Strongest signal at the cost of one prompt's worth of tokens
on cloud endpoints.
- **`"models"`** issues `GET /v1/models` and verifies the bound
- **`"models"`**: issues `GET /v1/models` and verifies the bound
model appears in the catalog. Cheaper (no completion billing) but
blind to proxy wire-mismatch cases: some OpenAI-compatible proxies
(Bifrost is the motivating example) serve `/v1/models` correctly
while 405'ing the completions endpoint, so a green catalog probe
doesn't prove `complete()` will work.
- **`"both"`** runs the catalog probe first (cheap fail-fast on
- **`"both"`**: runs the catalog probe first (cheap fail-fast on
model-not-in-catalog with the cleaner `seen_ids` diagnostic), then
the chat probe. Strongest signal at double the round-trip cost.

```python
# Local server (LM Studio, vLLM, llama.cpp) chat probe is free.
# Local server (LM Studio, vLLM, llama.cpp): chat probe is free.
provider = OpenAIProvider(
base_url="http://localhost:8000",
model="qwen2.5-coder",
readiness_probe="chat_completions", # default
)

# Cloud endpoint, cost-sensitive opt back into the catalog-only probe.
# Cloud endpoint, cost-sensitive: opt back into the catalog-only probe.
provider = OpenAIProvider(
base_url="https://api.openai.com",
model="gpt-4o-mini",
@@ -342,14 +342,14 @@ shape.
By default the model decides whether and which tools to call.
`tool_choice` constrains that decision per call. Four modes:

- `"auto"` the model decides. Equivalent to omitting the parameter
- `"auto"`: the model decides. Equivalent to omitting the parameter
when `tools` is non-empty.
- `"required"` the model MUST call at least one tool. Useful for
- `"required"`: the model MUST call at least one tool. Useful for
routing nodes that branch on tool selection.
- `"none"` the model MUST NOT call tools, even if `tools` is
- `"none"`: the model MUST NOT call tools, even if `tools` is
supplied. Useful for guarded LLM calls or for explicitly disabling
tool-calling without rebuilding a tools-less request.
- `ForceTool(name=...)` the model MUST call the named tool exactly.
- `ForceTool(name=...)`: the model MUST call the named tool exactly.

Pre-send validation catches the three failure modes (`required` with
empty tools, `ForceTool` with empty tools, `ForceTool.name` not in
@@ -371,7 +371,7 @@ response = await provider.complete(
)
```

Not all providers honor `tool_choice` confirm with your provider's
Not all providers honor `tool_choice`; confirm with your provider's
documentation. The `OpenAIProvider` maps the spec shape onto OpenAI's
wire shape per the §8.1.1 mapping table. Whether the model actually
honored the constraint is observable from the returned
58 changes: 29 additions & 29 deletions docs/concepts/observability.md
@@ -244,16 +244,16 @@ or missing log entries.
### Bounded drain (optional timeout)

`drain()` accepts an optional `timeout` parameter (non-negative
seconds) `await compiled.drain(timeout=5.0)` bounds the wait at five
seconds): `await compiled.drain(timeout=5.0)` bounds the wait at five
seconds. When the deadline fires, in-flight workers are cancelled
cleanly so the compiled graph stays usable for subsequent invocations
partial delivery state from one drain does NOT leak into the next.
cleanly so the compiled graph stays usable for subsequent invocations;
partial delivery state from one drain does NOT leak into the next.

The returned `DrainSummary` carries:

- `timeout_reached: bool` `True` only when the timeout actually
- `timeout_reached: bool`: `True` only when the timeout actually
fired. A drain that finishes before the deadline reports `False`.
- `undelivered_count: int` events dispatched but not fully delivered
- `undelivered_count: int`: events dispatched but not fully delivered
to every subscribed observer before the deadline. Always `0` when
`timeout_reached is False`.
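
The timeout behavior described above can be approximated with plain `asyncio`. This is a sketch under stated assumptions, not the engine's code: `bounded_drain` is a hypothetical helper, the `DrainSummary` dataclass here is a local stand-in carrying only the two documented fields, and it treats each pending task as one undelivered event.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrainSummary:
    """Local stand-in with the two documented fields (illustration only)."""

    timeout_reached: bool
    undelivered_count: int


async def bounded_drain(
    tasks: set[asyncio.Task], timeout: Optional[float] = None
) -> DrainSummary:
    """Wait for delivery tasks; cancel whatever is still running at the deadline."""
    if not tasks:
        return DrainSummary(timeout_reached=False, undelivered_count=0)
    _done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()
    if pending:
        # Let the cancellations settle so nothing leaks into a later drain.
        await asyncio.gather(*pending, return_exceptions=True)
    return DrainSummary(
        timeout_reached=bool(pending), undelivered_count=len(pending)
    )
```

In this sketch a drain that finishes before the deadline reports `timeout_reached=False` with a count of `0`, matching the invariant stated in the list above.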

@@ -305,8 +305,8 @@ the IDs explicitly.
## Caller-supplied invocation metadata

`correlation_id` is one string; if you also need to attach
business-domain identifiers tenant IDs, request IDs, feature
flags, A/B cohort labels pass them as a structured mapping at
business-domain identifiers (tenant IDs, request IDs, feature
flags, A/B cohort labels), pass them as a structured mapping at
`invoke()` time:

```python
@@ -324,7 +324,7 @@ await compiled.invoke(
Every observability backend picks the entries up:

- **OTel** emits each entry as an `openarmature.user.<key>`
cross-cutting span attribute on every span invocation, node,
cross-cutting span attribute on every span: invocation, node,
subgraph wrapper, fan-out instance, LLM provider, retry attempt.
Backends that consume OTel attributes (Phoenix / Arize, Honeycomb,
Datadog APM, HyperDX, Grafana Tempo, custom collectors) see them
@@ -345,7 +345,7 @@ Two rules:
`int`, `float`, `bool`) or homogeneous arrays of those types.
`None`, nested objects, and mixed-type arrays are rejected.

Violations raise `ValueError` synchronously no spans emitted, no
Violations raise `ValueError` synchronously: no spans emitted, no
work runs.
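
A rough sketch of the value rule, for readers who want to pre-check metadata before calling `invoke()`. The helper name `validate_metadata_value` is hypothetical, and so are two details the truncated hunk does not show: that `str` is among the accepted scalar types, and that an empty array is accepted.

```python
from typing import Any

# `str` is an assumption; the visible hunk lists only int, float, bool.
_SCALAR_TYPES = (str, int, float, bool)


def validate_metadata_value(key: str, value: Any) -> None:
    """Raise ValueError unless `value` is a scalar or a homogeneous array of scalars."""
    if isinstance(value, _SCALAR_TYPES):
        return
    if isinstance(value, (list, tuple)):
        if not all(isinstance(item, _SCALAR_TYPES) for item in value):
            raise ValueError(f"{key!r}: arrays may only contain scalars")
        # Compare exact types so [True, 1] counts as mixed (bool subclasses int).
        if len({type(item) for item in value}) > 1:
            raise ValueError(f"{key!r}: arrays must be homogeneous")
        return
    # None, dicts, and any other object land here.
    raise ValueError(f"{key!r}: unsupported value type {type(value).__name__}")
```

Running a check like this up front mirrors the documented behavior of failing synchronously, before any span is emitted.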

### Adding entries mid-invocation
@@ -371,7 +371,7 @@ node's `started`, any LLM call inside) pick up the new entries.
**Per-async-context scoping.** The metadata mapping lives in a
`ContextVar`, which Python copies on async-task creation. Fan-out
instances and parallel-branches each receive their own copy at
dispatch time an instance that calls `set_invocation_metadata`
dispatch time; an instance that calls `set_invocation_metadata`
does NOT leak its augmentation to sibling instances. This is the
canonical pattern for per-instance identifiers:

@@ -472,7 +472,7 @@ Cross-vendor attribute names every LLM-aware backend reads
(Langfuse, Phoenix, Honeycomb's LLM lens, OpenInference-aware
tools). Emitted alongside the OA namespace:

- `gen_ai.system` `"openai"` by default; override per provider
- `gen_ai.system`: `"openai"` by default; override per provider
instance to `"vllm"` / `"lm_studio"` / `"llama_cpp"` / etc. when
the OpenAI Chat Completions wire format is hitting a non-OpenAI
endpoint:
@@ -485,16 +485,16 @@ tools). Emitted alongside the OA namespace:
)
```

- `gen_ai.request.model` / `gen_ai.response.model` the bound
- `gen_ai.request.model` / `gen_ai.response.model`: the bound
model and (when the provider returns one) the more-specific
identifier in the response body.
- `gen_ai.request.temperature` / `max_tokens` / `top_p` / `seed` /
`frequency_penalty` / `presence_penalty` / `stop_sequences`
`frequency_penalty` / `presence_penalty` / `stop_sequences`:
only emitted for fields the caller actually set; absence on
the span means "not supplied," distinct from a zero value.
- `gen_ai.usage.input_tokens` / `output_tokens` token counts.
- `gen_ai.response.finish_reasons` single-element string array.
- `gen_ai.response.id` when the provider returns one.
- `gen_ai.usage.input_tokens` / `output_tokens`: token counts.
- `gen_ai.response.finish_reasons`: single-element string array.
- `gen_ai.response.id`: when the provider returns one.

Disable the GenAI semconv set with `OTelObserver(disable_genai_semconv=True)`
when an external auto-instrumentation library (OpenInference,
@@ -515,12 +515,12 @@ observer = OTelObserver(

This surfaces three attributes:

- `openarmature.llm.input.messages` JSON-encoded message array
- `openarmature.llm.input.messages`: JSON-encoded message array
(the spec §3 message shape: `{role, content, tool_calls?, …}`).
- `openarmature.llm.output.content` the assistant's response
- `openarmature.llm.output.content`: the assistant's response
content string verbatim. Omitted for tool-call-only responses
with empty content.
- `openarmature.llm.request.extras` JSON-encoded `RuntimeConfig`
- `openarmature.llm.request.extras`: JSON-encoded `RuntimeConfig`
extras bag (provider-specific pass-through fields like
`repetition_penalty` for vLLM, or `top_k` for HuggingFace
endpoints). Omitted when empty.
@@ -543,7 +543,7 @@ that fits within `cap - len(marker)` bytes followed by the marker:
```

where M is the pre-truncation byte length. The marker is appended
outside any JSON encoding a truncated attribute is *not* parseable
outside any JSON encoding, so a truncated attribute is *not* parseable
JSON, which is the clean signal backend code can use to detect
truncation without a separate flag.
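
The byte-budget arithmetic can be sketched as follows. The marker text is not visible in this hunk, so the `marker_template` below is a placeholder invented for illustration, as is the helper name `truncate_utf8`; only the `cap - len(marker)` budget and the pre-truncation byte length M come from the text above.

```python
def truncate_utf8(
    text: str,
    cap: int,
    marker_template: str = "...[truncated {m} bytes]",  # placeholder marker
) -> str:
    """Return `text` unchanged if it fits in `cap` bytes, else prefix + marker."""
    raw = text.encode("utf-8")
    if len(raw) <= cap:
        return text
    # M is the pre-truncation byte length.
    marker = marker_template.format(m=len(raw))
    budget = max(cap - len(marker.encode("utf-8")), 0)
    # Slicing bytes can split a multi-byte character; errors="ignore" drops
    # the incomplete tail so the prefix stays valid UTF-8.
    prefix = raw[:budget].decode("utf-8", errors="ignore")
    return prefix + marker
```

Because the marker sits outside any JSON encoding, feeding a truncated JSON attribute to `json.loads` raises, which is the detection signal the paragraph above describes.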

@@ -563,7 +563,7 @@ provider, *before* the payload reaches the observer:

The `media_type` and `detail` fields are preserved at the image-block
level (per llm-provider §3.1.2); only `source` is replaced. URL-form
images pass through unchanged the URL is a short string and is
images pass through unchanged: the URL is a short string and is
informative for trace readers.

Redaction is **not** gated by `disable_llm_payload` and is **not**
@@ -626,7 +626,7 @@ observer = OTelObserver(
```

Each enricher receives the live `Span` plus the `NodeEvent` that
triggered the close (or `None` on synthetic close sites subgraph
triggered the close (or `None` on synthetic close sites: subgraph
dispatch, detached root, fan-out instance, invocation span,
shutdown drain). Setting attributes inside this hook works
correctly; doing it from a `SpanProcessor.on_end` callback does
@@ -668,9 +668,9 @@ full pattern.

`OTelObserver.shutdown()` calls `provider.shutdown()` on the private
`TracerProvider`, which per OTel SDK contract flushes every
registered span processor. Under unusual teardown orderings for
registered span processor. Under unusual teardown orderings (for
example, FastAPI's `TestClient` teardown that closes the event loop
before a `BatchSpanProcessor`'s export thread finishes spans can
before a `BatchSpanProcessor`'s export thread finishes), spans can
appear dropped. Two workarounds:

- Call `observer._provider.force_flush(timeout_millis=...)`
@@ -682,7 +682,7 @@ appear dropped. Two workarounds:
## Langfuse mapping (opt-in)

A second sibling observer maps the same `NodeEvent` stream onto
Langfuse's native Trace + Observation data model Traces at the
Langfuse's native Trace + Observation data model: Traces at the
top, Span observations for graph nodes, Generation observations for
LLM calls. Use it instead of (or alongside) the OTel observer when
your trace UI is Langfuse and you want first-class Generation
@@ -699,7 +699,7 @@ observer = LangfuseObserver(client=client)
graph.attach_observer(observer)
```

The `client` is anything matching the `LangfuseClient` Protocol
The `client` is anything matching the `LangfuseClient` Protocol:
the bundled `InMemoryLangfuseClient` (used by the conformance
harness, useful for unit tests), or a real `langfuse.Langfuse()`
instance wrapped in `LangfuseSDKAdapter` for production. Install
@@ -749,7 +749,7 @@ for a runnable demo.
matching the `LangfuseClient` Protocol's four methods.

A runtime `isinstance(adapter, LangfuseClient)` check ships in
the unit suite if a future v4 patch breaks the Protocol's
the unit suite, so if a future v4 patch breaks the Protocol's
surface, the test fails loudly.

### What Langfuse sees
@@ -772,7 +772,7 @@ for a runnable demo.

### Payload + truncation

`disable_llm_payload` mirrors the OTel observer's flag defaults
`disable_llm_payload` mirrors the OTel observer's flag and defaults
to `True` for the same privacy reason. Flip to `False` to populate
`generation.input` / `output` / `metadata.request_extras` from the
LLM event payload.
@@ -804,7 +804,7 @@ the Generation observation links to that entity natively (spec

The two observers are independent §6 event consumers and can be
attached together. They share the `correlation_id` as the
cross-backend join key find a slow Generation in Langfuse, search
cross-backend join key: find a slow Generation in Langfuse, search
for its `correlation_id` in OTel logs, see the surrounding
infrastructure activity.

2 changes: 1 addition & 1 deletion docs/concepts/parallel-branches.md
@@ -105,7 +105,7 @@ wrap that branch's whole subgraph invocation as a unit. Retry
middleware on a branch retries the **whole branch**: a fresh
subgraph invocation each time, fresh inner-node execution. The
wrapping retry's attempt counter propagates to events emitted from
inner nodes (per graph-engine §6 v0.16.1), so observer events
inner nodes (per graph-engine §6), so observer events
inside the branch correctly show `attempt_index` ticking across
retries.
