Tool Error Recovery

skobeltsyn edited this page Mar 28, 2026 · 3 revisions

LLMs produce malformed tool calls. Agents.KT lets you fix them -- with code, with another agent, or with both.


The Problem

Large language models are probabilistic text generators. When they produce tool calls, things go wrong in predictable ways:

  • Trailing commas in JSON: {"a": 1, "b": 2,}
  • Markdown fencing around arguments: ```json\n{"a": 1}\n```
  • Wrong types: a number sent as a string "42" instead of 42
  • Missing required fields: the model forgets a parameter
  • Runtime failures: the tool itself throws because of bad input or transient errors

Without error recovery, any of these failures kills the agentic loop. The agent stops, the user gets nothing.


The Agents.KT Answer

Most frameworks handle malformed tool calls with special parser classes, retry middleware, or string-cleaning helpers buried in utility packages.

Agents.KT takes a different approach: the fixer is an agent. The Agent<IN, OUT> interface you use to build your application is the same one you use to repair broken tool calls. No new abstraction. No special machinery.

This means repair logic gets the full power of the framework: it can be deterministic (a pure function), LLM-driven (an agent with its own model), or a composition of both.


Error Taxonomy

Tool errors form a sealed hierarchy with four variants:

sealed interface ToolError {
    data class InvalidArgs(
        val rawArgs: String,
        val parseError: String,
        val expectedSchema: JsonSchema
    ) : ToolError

    data class DeserializationError(
        val rawValue: String,
        val targetType: KType,
        val cause: Throwable
    ) : ToolError

    data class ExecutionError(
        val args: ToolArgs,
        val cause: Throwable
    ) : ToolError

    data class EscalationError(
        val source: AgentRef,
        val reason: String,
        val severity: Severity,
        val originalError: ToolError,
        val attempts: Int
    ) : ToolError
}

| Error Type | When It Fires | Typical Cause |
|---|---|---|
| InvalidArgs | JSON parsing fails | Trailing commas, markdown fencing, truncated output |
| DeserializationError | JSON parses but cannot map to expected types | "42" instead of 42, missing keys |
| ExecutionError | Tool executor throws | Bad input values, transient I/O failures, business logic errors |
| EscalationError | Repair itself fails and escalates up | Exhausted retries, unrecoverable state |

Because the hierarchy is sealed, a when expression over ToolError is checked for exhaustiveness -- the compiler tells you if you miss a case.
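A minimal sketch of what that exhaustiveness check looks like in practice. It uses a simplified stand-in for the hierarchy above: field types are reduced to String and Throwable so the snippet compiles on its own, and SimpleToolError and describe are illustrative names, not part of Agents.KT.

```kotlin
// Simplified stand-in for ToolError; not the framework's actual declaration.
sealed interface SimpleToolError {
    data class InvalidArgs(val rawArgs: String, val parseError: String) : SimpleToolError
    data class DeserializationError(val rawValue: String, val cause: Throwable) : SimpleToolError
    data class ExecutionError(val cause: Throwable) : SimpleToolError
    data class EscalationError(val reason: String, val attempts: Int) : SimpleToolError
}

// No `else` branch: the compiler rejects this `when` if a variant is missing.
fun describe(error: SimpleToolError): String = when (error) {
    is SimpleToolError.InvalidArgs -> "unparseable arguments: ${error.parseError}"
    is SimpleToolError.DeserializationError -> "type mismatch: ${error.cause.message}"
    is SimpleToolError.ExecutionError -> "tool threw: ${error.cause.message}"
    is SimpleToolError.EscalationError -> "escalated after ${error.attempts} attempts: ${error.reason}"
}
```

Deleting any one branch turns the function into a compile error, which is the safety net the sealed hierarchy provides when a new error variant is added later.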


The onError DSL

Each tool can declare error handlers using the onError {} block. Inside, three verbs match the three non-escalation error types:

tool("write_file", "Write content to a file") { args ->
    val path = args["path"] as String
    val content = args["content"] as String
    fileSystem.write(path, content)
}
onError {
    invalidArgs { args, error ->
        fix { args.trimMarkdownFencing() }
    }
    deserializationError { raw, error ->
        sanitize { raw.normalizePathSeparators() }
    }
    executionError { e ->
        retry(maxAttempts = 3, backoff = exponential())
    }
}

| Verb | Error Type | Purpose |
|---|---|---|
| invalidArgs { } | InvalidArgs | Fix unparseable JSON |
| deserializationError { } | DeserializationError | Fix type mismatches |
| executionError { } | ExecutionError | Handle runtime failures |

Deterministic Repair

The simplest recovery strategy is a pure function. No LLM, no network call -- just string manipulation.

fix { } -- Repair Invalid Arguments

onError {
    invalidArgs { args, error ->
        fix {
            args
                .trimMarkdownFencing()       // strip ```json ... ```
                .replace(Regex(",\\s*}"), "}") // remove trailing commas
                .replace(Regex(",\\s*]"), "]") // remove trailing commas in arrays
        }
    }
}

The lambda receives the raw argument string and returns a cleaned version. The framework re-parses the cleaned string and retries the tool call.
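The snippet above calls trimMarkdownFencing() without showing it. The following is one possible way to write that helper and the trailing-comma cleanup as standalone extension functions. The implementations, and the names dropTrailingCommas and cleanArgs, are assumptions made for illustration; the framework's own helper may behave differently.

```kotlin
// Hypothetical implementation of the trimMarkdownFencing() helper used above.
// Returns the text between an opening and closing triple-backtick fence,
// or the trimmed input when no fence wraps the whole string.
fun String.trimMarkdownFencing(): String {
    val fenced = Regex("(?s)^\\s*```[a-zA-Z]*\\s*(.*?)\\s*```\\s*$")
    return fenced.find(this)?.groupValues?.get(1) ?: trim()
}

// Removes a comma that directly precedes a closing brace or bracket.
// Caveat: it does not understand JSON strings, so a ",}" inside a value is rewritten too.
fun String.dropTrailingCommas(): String =
    replace(Regex(",\\s*([}\\]])"), "$1")

// The combined cleanup a fix { } lambda could return.
fun cleanArgs(raw: String): String = raw.trimMarkdownFencing().dropTrailingCommas()
```

Because both steps are pure string functions, they can be unit-tested against captured model output without running an agent.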

sanitize { } -- Repair Deserialization Errors

onError {
    deserializationError { raw, error ->
        sanitize {
            raw.normalizePathSeparators()   // backslash to forward slash
        }
    }
}

Same idea: transform the raw value so it deserializes correctly.
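As with the fencing helper, normalizePathSeparators() is used here but not defined on this page. A minimal sketch of what it might do, assuming the raw value is the path text itself rather than a whole JSON document:

```kotlin
// Hypothetical implementation of the normalizePathSeparators() helper used above.
// Replaces each run of one or more backslashes with a single forward slash,
// so both "C:\Users" and the JSON-escaped "C:\\Users" become "C:/Users".
// Caveat: only safe on values known to be paths; it would also rewrite
// legitimate escapes such as \n or \" in arbitrary text.
fun String.normalizePathSeparators(): String =
    replace(Regex("\\\\+"), "/")
```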

retry() -- Retry on Execution Errors

onError {
    executionError { e ->
        retry(maxAttempts = 3, backoff = exponential())
    }
}

This re-runs the tool executor with the same arguments. The backoff parameter controls the delay between attempts. Use this for transient failures like network timeouts or rate limits.
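To make the retry semantics concrete, here is a rough standalone model of a retry loop with exponential backoff. It is not the framework's implementation: retryWithBackoff, initialDelayMs, factor, and the injectable sleep function are all names invented for this sketch.

```kotlin
// Illustrative retry loop; not the framework's implementation.
// Runs `block` up to `maxAttempts` times, multiplying the wait between attempts by `factor`.
fun <T> retryWithBackoff(
    maxAttempts: Int,
    initialDelayMs: Long = 100,
    factor: Double = 2.0,
    sleep: (Long) -> Unit = { Thread.sleep(it) }, // injectable so tests need not wait
    block: (attempt: Int) -> T
): T {
    require(maxAttempts >= 1) { "maxAttempts must be at least 1" }
    var delayMs = initialDelayMs
    var lastError: Exception? = null
    for (attempt in 1..maxAttempts) {
        try {
            return block(attempt)
        } catch (e: Exception) {
            lastError = e
            if (attempt < maxAttempts) { // no point sleeping after the final attempt
                sleep(delayMs)
                delayMs = (delayMs * factor).toLong()
            }
        }
    }
    throw lastError!! // every attempt failed; surface the last failure
}
```

With the defaults, three attempts wait 100 ms and then 200 ms between tries. Retrying only makes sense when the tool call is safe to repeat with the same arguments.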


LLM-Driven Repair

When deterministic cleanup is not enough -- the JSON is too mangled, the error is too novel -- you can delegate repair to an agent.

Defining a Repair Agent

A repair agent is a regular Agent<String, String>. It takes the broken input as a string and returns a fixed string:

val jsonFixer = agent<String, String>("json-fixer") {
    prompt = """
        You are a JSON repair tool. You receive malformed JSON and return
        valid JSON. Do not add or remove fields. Only fix syntax errors.
        Return ONLY the fixed JSON, no explanation.
    """.trimIndent()

    model {
        ollama("qwen2.5:7b")
        temperature = 0.0   // minimize sampling randomness for repeatable output
    }

    budget { maxTurns = 1 }   // single-shot, no tool loop

    skills {
        skill<String, String>("fix-json", "Repairs broken JSON") {
            implementedBy { input -> input }  // LLM does the work via prompt
        }
    }
}

Using a Repair Agent in onError

tool("create_task", "Create a new task") { args ->
    val title = args["title"] as String
    taskService.create(title)
}
onError {
    invalidArgs { args, error ->
        fix(agent = jsonFixer, retries = 3)
    }
}

The framework sends the broken arguments to jsonFixer, takes the output, re-parses it, and retries the tool call. If the fix fails, it retries up to 3 times before giving up.
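The loop described above can be modeled roughly as follows. This is a sketch under stated assumptions, not the framework's code: repairWithAgent is a made-up name, fixer stands in for the repair agent, parse stands in for the framework's argument parser, and feeding each attempt the previous attempt's output (rather than the original input) is a guess.

```kotlin
// Illustrative model of fix(agent = ..., retries = n); all names are hypothetical.
// `fixer` plays the repair agent; `parse` returns null while the text still does not parse.
fun <T : Any> repairWithAgent(
    brokenArgs: String,
    retries: Int,
    fixer: (String) -> String,
    parse: (String) -> T?
): T? {
    var candidate = brokenArgs
    repeat(retries) {
        // Assumption: each attempt builds on the previous attempt's output.
        candidate = fixer(candidate)
        parse(candidate)?.let { parsed -> return parsed }
    }
    return null // all attempts used up; the caller decides whether to escalate
}
```

Returning null instead of throwing leaves the choice between escalation and a hard failure to the surrounding handler.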


Hybrid Strategies

The most robust approach: try deterministic repair first, and fall back to the LLM only if the deterministic step returns null:

onError {
    invalidArgs { args, error ->
        // Attempt 1: simple cleanup; tryJsonCleanup returns null if it is insufficient
        fix {
            tryJsonCleanup(args)
        } ?: fix(agent = jsonFixer, retries = 3)   // Attempt 2: LLM-driven repair
    }
}

This gives you the speed of string manipulation for common cases (trailing commas, fencing) and the intelligence of an LLM for edge cases.
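The same fallback idea generalizes to any number of stages. A small sketch, with repairWith as a hypothetical helper and each stage reduced to a plain function that returns null when it cannot help:

```kotlin
// Illustrative fallback chain; repairWith is a hypothetical helper, not framework API.
// Stages are tried in order. The sequence is lazy, so a later (more expensive)
// stage is never invoked once an earlier one returns a non-null result.
fun repairWith(raw: String, vararg stages: (String) -> String?): String? =
    stages.asSequence()
        .mapNotNull { stage -> stage(raw) }
        .firstOrNull()
```

Ordering stages from cheapest to most expensive keeps model calls off the common path, which is the point of the hybrid strategy.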

Helper Function Example

fun tryJsonCleanup(raw: String): String? {
    val cleaned = raw
        .trim()
        .removePrefix("```json").removePrefix("```")
        .removeSuffix("```")
        .trim()
        .replace(Regex(",\\s*}"), "}")
        .replace(Regex(",\\s*]"), "]")

    return try {
        // Verify it parses
        JsonParser.parse(cleaned)
        cleaned
    } catch (e: Exception) {
        null   // signal: deterministic fix was not enough
    }
}

Deterministic Agent

A repair agent does not have to use an LLM. You can build a fully deterministic agent using implementedBy:

val regexFixer = agent<String, String>("regex-fixer") {
    skills {
        skill<String, String>("fix", "Fix JSON with regex") {
            implementedBy { input ->
                input
                    .replace(Regex("(?s)```json\\s*(.+?)\\s*```"), "$1")
                    .replace(Regex(",\\s*([}\\]])"), "$1")
                    .replace(Regex("'"), "\"")   // single to double quotes; also rewrites apostrophes inside values
            }
        }
    }
}

Zero LLM calls, no network latency, zero cost -- but it conforms to the Agent<String, String> interface, so it plugs into fix(agent = ...) seamlessly. The framework does not care how the agent produces its output.
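The design point is ordinary interface substitution. The sketch below uses a hypothetical StringRepairer interface as a stand-in for Agent<String, String>, purely to show that a call site written against the interface cannot tell a regex-based implementation from a model-backed one.

```kotlin
// Hypothetical stand-in for Agent<String, String>; not the framework's interface.
fun interface StringRepairer {
    fun repair(input: String): String
}

// Deterministic implementation: regex only, no model involved.
val regexRepairer = StringRepairer { input ->
    input.replace(Regex(",\\s*([}\\]])"), "$1")
}

// Written against the interface, so any implementation can be passed in.
fun repairAll(inputs: List<String>, repairer: StringRepairer): List<String> =
    inputs.map(repairer::repair)
```

Swapping in a model-backed implementation later would not require touching repairAll.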


Tool-Level Defaults

When many tools share the same error handling, declare it once in a defaults {} block alongside the tool declarations:

skills {
    skill<String, String>("data-ops", "Data operations") {
        tools("read_file", "write_file", "delete_file")

        defaults {
            onError {
                invalidArgs { args, error ->
                    fix { tryJsonCleanup(args) } ?: fix(agent = jsonFixer, retries = 2)
                }
                executionError { e ->
                    retry(maxAttempts = 3, backoff = exponential())
                }
            }
        }

        tool("read_file", "Read a file") { args ->
            fileSystem.read(args["path"] as String)
        }

        tool("write_file", "Write a file") { args ->
            fileSystem.write(args["path"] as String, args["content"] as String)
        }

        tool("delete_file", "Delete a file") { args ->
            fileSystem.delete(args["path"] as String)
        }
        // Per-tool override: delete_file has stricter handling
        onError("delete_file") {
            executionError { e ->
                // No retry for destructive operations
                escalate()
            }
        }
    }
}

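The rule: a per-tool onError overrides the defaults for that specific tool, and all other tools inherit the defaults. In the complete example at the end of this page, tools with their own onError still share the default invalidArgs handler, so an override applies per error type rather than replacing the whole block.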
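One way to picture the resolution rule is as a map merge. The sketch assumes overrides are merged per error type, which is how the complete example at the end of this page behaves; ErrorKind and resolveHandlers are hypothetical names, and handlers are reduced to labels so the result is easy to inspect.

```kotlin
// Illustrative model of handler resolution; the enum and function are hypothetical.
enum class ErrorKind { INVALID_ARGS, DESERIALIZATION, EXECUTION }

// Looks up the handlers that apply to `tool`: shared defaults first,
// then any per-tool entries layered on top.
fun resolveHandlers(
    defaults: Map<ErrorKind, String>,
    overrides: Map<String, Map<ErrorKind, String>>,
    tool: String
): Map<ErrorKind, String> =
    defaults + overrides[tool].orEmpty() // right-hand entries win on key clashes
```

Under this model, delete_file keeps the shared invalidArgs handling and replaces only its executionError handler.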


Escalation

When repair fails, the handler has two options:

escalate() -- Soft Failure

executionError { e ->
    escalate()
}

escalate() does not throw. It wraps the error in an EscalationError and walks up the structure {} delegation tree. If the agent is part of a parent agent's structure, the parent can catch the escalation and decide what to do -- retry with a different skill, use a fallback, or escalate further.

throwException() -- Hard Failure

executionError { e ->
    throwException()
}

throwException() throws immediately. The agentic loop stops. Use this for genuinely unrecoverable errors -- file system corruption, invalid credentials, logic bugs you want to surface during development.

Escalation Flow

Tool fails
  |
  v
onError handler runs
  |
  +--> fix/retry succeeds --> tool result returned, loop continues
  |
  +--> fix/retry fails
         |
         +--> escalate() --> EscalationError created
         |       |
         |       v
         |    Parent agent's structure handler (if exists)
         |       |
         |       +--> Parent handles it (fallback, retry, different skill)
         |       |
         |       +--> Parent escalates further (walks up the tree)
         |
         +--> throwException() --> Exception thrown, loop stops
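The upward walk in the diagram can be modeled as a simple parent-pointer traversal. This is a sketch only: Node, its handler field, and this standalone escalate function are hypothetical and say nothing about how Agents.KT represents the structure {} tree internally.

```kotlin
// Illustrative model of escalation; Node and its fields are hypothetical.
// A handler returns a result if it can deal with the failure, or null to pass it on.
class Node(
    val name: String,
    val parent: Node? = null,
    val handler: ((reason: String) -> String?)? = null
)

// Walks from the failing node's parent towards the root and returns the first
// non-null handler result, or null if nobody handled the escalation.
fun escalate(from: Node, reason: String): String? {
    var current = from.parent
    while (current != null) {
        val handled = current.handler?.invoke(reason)
        if (handled != null) return handled
        current = current.parent
    }
    return null
}
```

A null result at the root corresponds to an escalation nobody caught, at which point the caller has to decide whether to fail hard.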

Complete Example

An agent with multiple tools, each with tailored error recovery:

import java.io.File
import java.io.FileNotFoundException
import java.io.IOException

val jsonFixer = agent<String, String>("json-fixer") {
    prompt = "Fix the malformed JSON. Return only valid JSON."
    model { ollama("qwen2.5:7b"); temperature = 0.0 }
    budget { maxTurns = 1 }
    skills {
        skill<String, String>("fix", "Fix JSON") {
            implementedBy { it }
        }
    }
}

val fileAgent = agent<String, String>("file-manager") {
    prompt = "You manage files. Use tools to read, write, and list files."

    model { ollama("qwen2.5:7b") }
    budget { maxTurns = 10 }

    skills {
        skill<String, String>("manage-files", "File management operations") {
            tools("read_file", "write_file", "list_dir")

            // Shared defaults
            defaults {
                onError {
                    invalidArgs { args, error ->
                        fix { tryJsonCleanup(args) } ?: fix(agent = jsonFixer, retries = 2)
                    }
                }
            }

            tool("read_file", "Read file contents by path") { args ->
                val path = args["path"] as String
                File(path).readText()
            }
            onError("read_file") {
                executionError { e ->
                    when (e.cause) {
                        is FileNotFoundException -> escalate()
                        is IOException -> retry(maxAttempts = 3, backoff = exponential())
                        else -> throwException()
                    }
                }
            }

            tool("write_file", "Write content to a file") { args ->
                val path = args["path"] as String
                val content = args["content"] as String
                File(path).writeText(content)
                "Wrote ${content.length} characters to $path"
            }
            onError("write_file") {
                deserializationError { raw, error ->
                    sanitize { raw.normalizePathSeparators() }
                }
                executionError { e ->
                    retry(maxAttempts = 2, backoff = exponential())
                }
            }

            tool("list_dir", "List files in a directory") { args ->
                val path = args["path"] as String
                File(path).listFiles()?.map { it.name } ?: emptyList<String>()
            }
            // list_dir inherits defaults -- no per-tool override needed
        }
    }

    onToolUse { name, args, result ->
        println("[$name] args=$args result=$result")
    }
}

// Usage
val result = fileAgent("Read the contents of /tmp/config.json and summarize it")

In this example:

  • All three tools share the invalidArgs default (deterministic cleanup, then LLM fixer).
  • read_file escalates on missing files, retries on I/O errors, and throws on unexpected failures.
  • write_file sanitizes path separators and retries on execution errors.
  • list_dir relies entirely on the shared defaults.
