Merged
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,12 @@ All notable changes to Agents.KT are documented here. The format follows [Keep a

## [Unreleased]

### Added — Multimodal foundation (#2465 epic, Stage 1)

- **Typed `Content` hierarchy (#2466)** — `sealed interface Content` with variants `Text`, `Image`, `Audio`, `Video`, `Document` in package `agents_engine.content`. Each non-text variant carries a `ContentRef` plus a typed mime (`ImageMime`, `AudioMime`, `VideoMime`, `DocMime`). Mime types are closed sealed interfaces with `wireMime: String` accessors — no `String` mime in any public API. Extension property `Content.modality: String` is the audit-stable per-variant name. Stage 1 wires Image + Document end-to-end (the modalities the 0.8 spec → product loop consumes); Audio + Video are modelled now and exercised through provider adapters in Stage 2 (#2470, deferred).
- **`ContentRef` + `BlobStore` (#2467)** — content-addressed reference (`hash: String` SHA-256 hex, `sizeBytes: Long`, `wireMime: String`). `BlobStore` interface with `InMemoryBlobStore` (defensive byte-array copies on put + get) and `FileBlobStore(dir)` (one file per blob, filename = hash, atomic tmp + rename, survives process restart, idempotent put). Hash family matches the manifest hash (#1912) and snapshot filename hash (#2753) — single algorithm across the audit surface. Public top-level `computeContentHash(bytes): String` for byte-level comparison without a store.
- **`ToolResult` (#2469)** — `data class ToolResult(parts: List<Content>)` for tools that return mixed content (a screenshot tool returns text + image; OCR returns extracted text + the source PDF ref). Just another `Any?` the tool executor returns — no `ToolDef` signature change; existing tools that return strings keep working byte-for-byte. AgenticLoop renders multipart returns as text + `[modality: <wireMime>] (<hash-prefix>, <size>B)` placeholders for the LLM tool-result message; provider-specific multipart rendering (vision-capable Claude/OpenAI/Gemini) is sibling #2470 (deferred). JSONL audit exporter gains an `outputParts: List<String>?` column on audit rows — for `ToolResult` returns it emits one entry per part as `<modality>:<hash-prefix>:<sizeBytes>:<wireMime>` (text parts as `text:inline:<charCount>:text/plain`); blob bytes never enter the audit row. Field is null for non-multimodal returns — legacy audit rows unchanged. `EXPECTED_FIELDS` schema-pin updated to include the new column. Composes with snapshot/resume (refs serialise, blobs stay external) and `untrustedOutput` (the text-summary rendering goes through the existing JSON envelope). See [docs/multimodal.md](docs/multimodal.md).
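
A minimal sketch of the content-addressing idea behind `ContentRef` and `InMemoryBlobStore`, not the library's implementation. `SketchInMemoryBlobStore`, the `get` signature, and the lowercase-hex rendering are assumptions made for illustration; only the SHA-256 hashing, the defensive copies, and the idempotent put come from the entries above.

```kotlin
import java.security.MessageDigest

// Simplified stand-in for the ContentRef described above.
data class ContentRef(val hash: String, val sizeBytes: Long, val wireMime: String)

// SHA-256 of the bytes as hex (lowercase is an assumption of this sketch).
fun computeContentHash(bytes: ByteArray): String =
    MessageDigest.getInstance("SHA-256")
        .digest(bytes)
        .joinToString("") { "%02x".format(it) }

// Hypothetical store: the class name and the get() signature are illustrative.
class SketchInMemoryBlobStore {
    private val blobs = HashMap<String, ByteArray>()

    // Idempotent: storing the same bytes twice yields an equal ref and one blob.
    fun put(bytes: ByteArray, wireMime: String): ContentRef {
        val hash = computeContentHash(bytes)
        blobs.putIfAbsent(hash, bytes.copyOf()) // defensive copy on put
        return ContentRef(hash, bytes.size.toLong(), wireMime)
    }

    // Defensive copy on get, so callers cannot mutate the stored bytes.
    fun get(ref: ContentRef): ByteArray? = blobs[ref.hash]?.copyOf()
}
```

Because the key is derived from the bytes, two tools that produce the same screenshot share one blob, and an audit row can reference it by hash prefix without carrying the bytes.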

### Added — Eval harness (#2491 epic, feature-complete)

- **`DeterministicModelClient` (#2492)** — `agents_engine.testing.DeterministicModelClient(scripted: List<LlmResponse>)` (or vararg ctor) hands back pre-scripted responses one per `chat` call. No network, byte-deterministic. `requests` records every message list the agent built up; `remaining()` reports unconsumed responses. Exhaustion throws `DeterministicScriptExhausted(callIndex, scriptSize, lastMessages)`. Streaming uses the default `ModelClient.chatStream` wrap. Out of scope for v1: record-from-live HTTP capture (mentioned in the ticket — needs an HTTP-fixture story we'll write when there's demand) and per-token chunk replay.
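
The scripted-replay behaviour described above can be sketched roughly as follows. Every type here (`LlmResponse`, `ScriptedClient`, `ScriptExhausted`, and the `List<String>` message shape) is a simplified, hypothetical stand-in, not the `agents_engine.testing` API.

```kotlin
// Hypothetical, simplified response type; the real LlmResponse is richer.
sealed interface LlmResponse {
    data class Text(val text: String) : LlmResponse
    data class ToolCalls(val names: List<String>) : LlmResponse
}

// Illustrative counterpart of the exhaustion exception named above.
class ScriptExhausted(val callIndex: Int, val scriptSize: Int) :
    IllegalStateException("script exhausted at call $callIndex (script size $scriptSize)")

class ScriptedClient(private val scripted: List<LlmResponse>) {
    constructor(vararg scripted: LlmResponse) : this(scripted.toList())

    private val recorded = mutableListOf<List<String>>()
    private var next = 0

    // Every message list passed to chat(), in call order.
    val requests: List<List<String>> get() = recorded

    // Number of scripted responses not yet handed out.
    fun remaining(): Int = scripted.size - next

    // Returns the next scripted response; no network, no randomness.
    fun chat(messages: List<String>): LlmResponse {
        recorded.add(messages.toList())
        if (next >= scripted.size) throw ScriptExhausted(next, scripted.size)
        return scripted[next++]
    }
}
```

Recording the request before checking for exhaustion means a failing test can still inspect the message list that triggered the extra call.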
1 change: 1 addition & 0 deletions README.md
@@ -155,6 +155,7 @@ These APIs work in `main`, are unit-tested, and are exercised by integration tes
- **Budget controls** — `budget { maxTurns; maxToolCalls; maxDuration; perToolTimeout; maxTokens; maxConsecutiveSameTool }` (`perToolTimeout` covers regular and session-aware tools; token counts cumulative across turns when the provider reports usage; `maxConsecutiveSameTool` catches LLM retry loops on a broken tool) (#637, #963, #969, #1903). `onBudgetExceeded { reason, currentLimit -> BudgetDecision.Extend(newLimit) }` raises a cap and continues instead of throwing — a long-running agent can grant itself more tool calls mid-run rather than failing (#2412). `BudgetDecision.Checkpoint` (#2749) is the third variant — pause at the cap, deliver a `SessionSnapshot` via the registered `onTurnCheckpoint` hook, throw a recoverable `BudgetCheckpointException`, and resume later via `agent.invokeSuspendResuming(input, resumeFrom = snapshot)` once the human approves a raise (no history replay).
- **Public snapshot / resume** — `agent.invokeSuspendResuming(input, resumeFrom = null, onTurnCheckpoint = null)` (#2749) is the public seam over the internal `executeAgentic(resumeFrom, onTurnCheckpoint)` primitives from #2416. With defaults it matches `invokeSuspend(input)` byte-for-byte; with `onTurnCheckpoint` set it captures a `SessionSnapshot` at every turn boundary; with `resumeFrom = snapshot` it continues an in-flight invocation without replaying history. On resume the loop honors `max(snapshot.toolCallLimit, agent.budget.maxToolCalls)` so a rebuilt agent with a raised cap actually picks it up.
- **Eval harness** — `DeterministicModelClient(LlmResponse.Text("..."), LlmResponse.ToolCalls(...))` (#2492) scripts model responses for reproducible eval without a live provider; the streaming flow folds into the same Started → ArgsDelta → Finished → End chunk sequence a native streaming provider would emit. Typed assertion DSL `eval<IN, OUT>("name") { input(...); expect { ... }; expectSnapshot(...) }` (#2493) runs against the parsed `OUT` — not regex on the wire. Snapshot mode pins `toLlmInput(output)` JSON for structural diffs; `evalSuite { + case; + case }` bundles cases. Optional `judge("tone", rubric)` (#2494) runs an advisory LLM-as-judge scorer with a typed `@Generable` `JudgeVerdict` — explicitly separate from the deterministic pass/fail contract (judges never gate). See [docs/eval.md](docs/eval.md).
- **Multimodal foundation** — `sealed Content { Text, Image(ref, ImageMime), Audio, Video, Document(ref, DocMime) }` (#2466) with closed mime types per modality (no `String` mime). Content-addressed `ContentRef(hash, sizeBytes, wireMime)` + `BlobStore` interface, `InMemoryBlobStore` / `FileBlobStore` impls — SHA-256 keys match the manifest-hash family, atomic tmp+rename, process-restart-safe, idempotent put (#2467). Tools can return `ToolResult(parts: List<Content>)` for mixed text + image + document outputs; JSONL audit exporter records `outputParts` per-part summary (`<modality>:<hash-prefix>:<size>:<mime>`) with no blob bytes in the audit row (#2469). Stage 1 wires Image + Document end-to-end; Audio + Video modelled now and exercised through provider adapters in Stage 2 (#2470, deferred). See [docs/multimodal.md](docs/multimodal.md).
- **Prompt caching across providers** — `agent { caching { enabled = true; cacheSystemPrompt = true; cacheToolDefs = true; cacheConversation = Rolling; ttl = 1.hours; cacheable("doc-id") { ... } } }`. Vendor-neutral DSL drives Anthropic's explicit `cache_control` breakpoints (#2658), OpenAI / DeepSeek automatic prefix caching with a stable `prompt_cache_key` routing hint (#2659 / #2661), Ollama / vLLM / SGLang engine-level KV-cache reuse (no-op hints, #2662), and surfaces cache reads + writes + hit-rate on `TokenUsage` (#2663). A prefix-stability guard (#2657) detects silent cache-busters — timestamps, UUIDs, non-deterministic ordering inside cacheable segments — and warns before you pay for a single non-cached run. Off by default; non-breaking. See [docs/caching.md](docs/caching.md).
- **JSONL audit exporter** — `:agents-kt-observability` writes append-only, one-line-per-event audit rows with `requestId`, `sessionId`, `manifestHash`, agent/skill/tool ids, event type, provider, and model; raw arguments/results are omitted by default (#1914). See [docs/observability.md](docs/observability.md).
- **ObservabilityBridge adapters** — `.observe(OtelBridge(tracer))` maps runtime events to OTel spans (#1908), `.observe(LangSmithBridge(apiKey, project))` maps the same events to LangSmith run trees (#1909), and `.observe(LangfuseBridge(publicKey, secretKey))` maps them to Langfuse traces, generations, spans, and events (#1910), while keeping core vendor-free. See [docs/observability.md](docs/observability.md).
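
For consumers of the audit log, the `outputParts` entry shape from the multimodal bullet above (`<modality>:<hash-prefix>:<size>:<mime>`, with `inline` in the hash position for text parts) can be split back into fields. This is a sketch under the assumption that the first three fields contain no colon; `PartSummary` and `parsePartSummary` are hypothetical names, not part of the library.

```kotlin
// Hypothetical consumer-side view of one outputParts entry.
data class PartSummary(
    val modality: String,
    val hashPrefix: String?, // null for inline text parts
    val size: Long,          // bytes for blobs, character count for text
    val wireMime: String,
)

// Returns null when the entry does not match the four-field shape.
fun parsePartSummary(entry: String): PartSummary? {
    // limit = 4 keeps any colon inside the mime field intact.
    val fields = entry.split(":", limit = 4)
    if (fields.size != 4) return null
    val size = fields[2].toLongOrNull() ?: return null
    val hashPrefix = fields[1].takeUnless { it == "inline" }
    return PartSummary(fields[0], hashPrefix, size, fields[3])
}
```

For example, `parsePartSummary("text:inline:8:text/plain")` yields a summary with a null `hashPrefix` and a `size` of 8.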
@@ -1,5 +1,6 @@
package agents_engine.observability

import agents_engine.content.modality
import agents_engine.core.Agent
import agents_engine.core.AgentRuntimeContext
import agents_engine.core.PipelineEvent
@@ -214,6 +215,14 @@ class JsonlAuditExporter(
                is PipelineEvent.ToolCalled -> typeName(event.result)
                else -> null
            },
            // #2469 — multimodal tool results record one summary per part:
            // "<modality>:<hash-prefix>:<size>:<mime>". No bytes; the
            // ContentRef + modality is the auditable surface. Null on
            // non-ToolResult returns so legacy audit rows are unchanged.
            outputParts = when (event) {
                is PipelineEvent.ToolCalled -> partsSummary(event.result)
                else -> null
            },
            // #2395 — record blocked tool calls in the audit log via the
            // guardrailDecision column. Only the decision *type* is written:
            // the free-text reason can embed offending arg values (e.g. a
@@ -265,6 +274,11 @@ class JsonlAuditExporter(
                is AgentEvent.ToolCallFinished -> typeName(event.result)
                else -> null
            },
            outputParts = when (event) {
                is AgentEvent.Completed<*> -> partsSummary(event.output)
                is AgentEvent.ToolCallFinished -> partsSummary(event.result)
                else -> null
            },
            toolPolicyRisk = null,
            usedDeclaredCapability = null,
            usage = usage,
@@ -291,6 +305,7 @@ class JsonlAuditExporter(
        usedDeclaredCapability: Boolean?,
        usage: TokenUsage?,
        guardrailDecision: String? = null,
        outputParts: List<String>? = null,
    ): Map<String, Any?> =
        linkedMapOf(
            "requestId" to context.requestId,
@@ -303,6 +318,7 @@ class JsonlAuditExporter(
            "timestamp" to timestamp,
            "inputType" to inputType,
            "outputType" to outputType,
            "outputParts" to outputParts,
            "budgetState" to null,
            "guardrailDecision" to guardrailDecision,
            "mcpClientId" to null,
@@ -317,6 +333,35 @@ class JsonlAuditExporter(
    private fun typeName(value: Any?): String? =
        value?.javaClass?.name

    /**
     * #2469 — for [agents_engine.content.ToolResult] return values,
     * render one summary string per part:
     * `"<modality>:<hash-prefix>:<sizeBytes>:<wireMime>"`. Hash prefix is
     * the first 12 hex chars (enough to disambiguate in audit grep, short
     * enough to read). Returns `null` when [value] is not a `ToolResult` —
     * keeps legacy audit rows byte-identical for non-multimodal returns.
     *
     * Crucially: no blob bytes enter the audit row. Modality + ref is
     * the auditable surface.
     */
    private fun partsSummary(value: Any?): List<String>? {
        val toolResult = value as? agents_engine.content.ToolResult ?: return null
        return toolResult.parts.map { part ->
            when (part) {
                is agents_engine.content.Content.Text ->
                    "${part.modality}:inline:${part.text.length}:text/plain"
                is agents_engine.content.Content.Image ->
                    "${part.modality}:${part.ref.hash.take(12)}:${part.ref.sizeBytes}:${part.mime.wireMime}"
                is agents_engine.content.Content.Audio ->
                    "${part.modality}:${part.ref.hash.take(12)}:${part.ref.sizeBytes}:${part.mime.wireMime}"
                is agents_engine.content.Content.Video ->
                    "${part.modality}:${part.ref.hash.take(12)}:${part.ref.sizeBytes}:${part.mime.wireMime}"
                is agents_engine.content.Content.Document ->
                    "${part.modality}:${part.ref.hash.take(12)}:${part.ref.sizeBytes}:${part.mime.wireMime}"
            }
        }
    }

    private fun encodeJson(value: Any?): String =
        when (value) {
            null -> "null"
@@ -115,6 +115,47 @@ class JsonlAuditExporterTest {
        assertEquals(false, row["usedDeclaredCapability"])
    }

    @Test
    fun `multimodal ToolResult writes outputParts with modality plus hash plus size — no bytes`() {
        // #2469 — audit-row support for typed multimodal tool returns.
        val store = agents_engine.content.InMemoryBlobStore()
        val pngBytes = byteArrayOf(0x89.toByte(), 0x50, 0x4E, 0x47, 1, 2, 3)
        val imgRef = store.put(pngBytes, agents_engine.content.ImageMime.Png.wireMime)

        val dir = Files.createTempDirectory("agents-jsonl-audit-multimodal")
        val auditFile = dir.resolve("audit.jsonl")
        val exporter = JsonlAuditExporter(auditFile, clock = fixedClock)
        exporter.write(
            PipelineEvent.ToolCalled(
                agentName = "agent",
                timestamp = Instant.EPOCH,
                toolName = "screenshot",
                arguments = mapOf("url" to "https://example.com"),
                result = agents_engine.content.ToolResult(
                    agents_engine.content.Content.Text("captured"),
                    agents_engine.content.Content.Image(imgRef, agents_engine.content.ImageMime.Png),
                ),
                runtimeContext = AgentRuntimeContext(requestId = "req-multimodal"),
            ),
        )
        exporter.close()

        val line = Files.readAllLines(auditFile).single()
        // No bytes anywhere — neither the URL arg nor the image bytes
        assertFalse(line.contains("example.com"), "argument values must not be serialized: $line")
        // No PNG magic — image bytes definitely never enter the audit row
        assertFalse(line.contains("0x89"), "image bytes must not be serialized: $line")

        // Substring assertions on the rendered JSON (the test parser doesn't
        // model arrays, and the column contents are stable enough to grep).
        assertTrue("\"outputParts\":[" in line, "outputParts is emitted as an array: $line")
        assertTrue("\"text:inline:8:text/plain\"" in line, "text-part shape in array: $line")
        assertTrue(
            "\"image:${imgRef.hash.take(12)}:${pngBytes.size}:image/png\"" in line,
            "image-part shape modality:hashPrefix:size:mime — $line",
        )
        assertTrue(
            "\"agents_engine.content.ToolResult\"" in line,
            "outputType still names the wrapper type: $line",
        )
    }

    @Test
    fun `denied tool calls are recorded as ToolDenied rows without leaking the reason text`() {
        // #2395 — blocked calls must appear in the audit log. The PII-safe
@@ -309,6 +350,10 @@ class JsonlAuditExporterTest {
            "timestamp",
            "inputType",
            "outputType",
            // #2469 — per-part summary for multimodal ToolResult returns.
            // Null on non-multimodal rows; field is always present so
            // schema-pinning consumers see a stable column set.
            "outputParts",
            "budgetState",
            "guardrailDecision",
            "mcpClientId",