Agent Forge

Status: prototype / experimental — exploring agentic patterns, not for production use.

Multi-turn LLM agent with streaming tool use, atomic context management, and gRPC RAG integration.

Designed to pair with go-to-rag, a Go RAG engine that exposes retrieval and generation over gRPC; agent-forge calls it as a first-class tool during conversation.

Purpose

Hands-on exploration of how agentic systems work at the implementation level:

  • How multi-turn conversation state fills a context window and must be trimmed without corrupting tool-call sequences
  • How models at different scales decide when to call a tool, which one to pick, and how to interpret the result
  • How streaming, retries, and context errors compose in a real async loop
  • How local models behave in an agentic context — and where smaller ones break down

The loop, context management, and tool dispatch are all explicit and readable. The project is deliberately built without LangChain or LangGraph; the intent is to understand the mechanics before adopting a framework that abstracts them away.

Benchmarks

A core question this project explores: does tool-use comprehension scale with parameter count? Can a small model decide when a tool is needed, pick the right one, and correctly interpret the result — or does that only emerge at larger scale?

The benchmark is a diagnostic baseline, not a destination. The goal is a feedback loop: identify where models fail, intervene (prompt engineering, fine-tuning, distillation), re-run, and compare. The category breakdown in each result file points at the specific failure modes worth targeting.

11 behavioral evals ask a live model questions about this codebase, each with a ground-truth answer verifiable from the repo. Two pass conditions per question: the model must call a tool (no hallucinating without evidence) and the response must contain the expected string(s).
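
A minimal sketch of that per-question check (field names are illustrative, not the real harness API; see tests/README.md for the actual methodology):

def passes(response_text: str, tool_calls: list, expected: list[str]) -> bool:
    called_tool = bool(tool_calls)  # condition 1: must gather evidence via a tool
    has_answer = all(s in response_text for s in expected)  # condition 2: expected substrings present
    return called_tool and has_answer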

Run on Ollama, NVIDIA RTX 4070 Super, Fedora Linux 43. See tests/README.md for full methodology.

Model        Pass rate   Tool call rate   chat   config   grep   loop   project   tools
qwen3:0.6b    9.1%        27.3%           0/2    0/2      0/2    0/2    0/1       1/2
qwen3:4b     72.7%        90.9%           1/2    2/2      2/2    1/2    0/1       2/2
qwen3:8b     90.9%        90.9%           1/2    2/2      2/2    2/2    1/1       2/2
qwen3:14b     100%         100%           2/2    2/2      2/2    2/2    1/1       2/2

The gap between 0.6b and 4b is stark: tool-calling appears to be an emergent capability that requires a minimum parameter threshold. Above ~8b the differences narrow. Note that pass/fail here is substring matching against known answers, which measures presence of the right answer, not reasoning quality.

Next steps: LLM-as-judge scoring and human review to evaluate whether answers are correct for the right reasons, not just because they contain the right string.

Quick start

Ollama

ollama pull qwen3:0.6b
make run-ollama

vLLM

make serve   # starts vLLM on :8000
make run

With go-to-rag

Start the go-to-rag gRPC server — see go-to-rag serve docs for setup. Default port is 50051.

Add to your config:

rag:
  endpoint: "localhost:50051"

The agent gains two tools: ask_rag (retrieval + LLM generation) and retrieve_chunks (raw scored chunks).
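
As a sketch, the agent side might register them as OpenAI-style function tools roughly like this (the parameter shape is illustrative; the real schema lives in the codebase):

ASK_RAG = {
    "type": "function",
    "function": {
        "name": "ask_rag",
        "description": "Retrieve relevant chunks and generate an answer via go-to-rag",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}
# retrieve_chunks would look the same but return raw scored chunks
# instead of a generated answer.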

Configuration

model:
  name: "qwen3:0.6b"
  endpoint: "http://localhost:11434/v1"
  context_limit: 8192

system:
  prompt: "You are a helpful assistant."
  disable_think: false

rag:              # optional — omit if not using go-to-rag
  endpoint: "localhost:50051"

Priority: CLI flags → environment variables → YAML

Env var                 Effect
AGENT_FORGE_ENDPOINT    Override the model endpoint
OLLAMA_MODEL            Override the Ollama model name
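
A minimal sketch of that precedence for the model endpoint (resolve_endpoint and the flag argument are illustrative names):

import os

def resolve_endpoint(cli_endpoint: str | None, yaml_cfg: dict) -> str:
    # CLI flag wins, then the env var, then the YAML value.
    if cli_endpoint:
        return cli_endpoint
    return os.environ.get("AGENT_FORGE_ENDPOINT") or yaml_cfg["model"]["endpoint"]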

See config.example.yaml for the full schema and config.ollama.yaml for an Ollama-ready config.

Design notes

One round of the agent loop:

flowchart TD
    A([User input]) --> B[Trim context if over budget]
    B --> C[Stream model response]
    C --> D{Tool calls?}
    D -->|Yes| E[Execute tools in parallel]
    E --> F[Inject tool results into chat]
    F --> B
    D -->|No| G([Return final response])
    G -->|Next user message| A
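
A minimal sketch of the same round in Python (names like chat.trim_to_budget and tools.dispatch are illustrative, not the actual API):

import asyncio

async def run_turn(chat, model, tools):
    while True:
        chat.trim_to_budget()                      # drop oldest atomic groups if over budget
        msg = await model.stream(chat.messages)    # stream the assistant response
        chat.append(msg)
        if not msg.tool_calls:
            return msg                             # no tool calls: final response for this turn
        results = await asyncio.gather(            # execute tool calls in parallel
            *(tools.dispatch(call) for call in msg.tool_calls)
        )
        chat.extend(results)                       # inject tool results and loop again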

Atomic context trimming — when the context window fills, turns are dropped oldest-first as atomic groups. An assistant message carrying tool calls is always dropped together with its tool results — splitting them leaves orphaned references that cause model errors or hallucinations.
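
A minimal sketch of the grouping rule, assuming OpenAI-style chat messages with role and tool_calls fields:

def atomic_groups(messages: list[dict]) -> list[list[dict]]:
    # An assistant message carrying tool_calls plus the tool messages that
    # answer it form one indivisible group; everything else stands alone.
    groups, i = [], 0
    while i < len(messages):
        group = [messages[i]]
        if messages[i]["role"] == "assistant" and messages[i].get("tool_calls"):
            i += 1
            while i < len(messages) and messages[i]["role"] == "tool":
                group.append(messages[i])
                i += 1
        else:
            i += 1
        groups.append(group)
    return groups

# Trimming drops whole groups oldest-first until the conversation fits the budget.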

Token counting — uses tiktoken cl100k_base as a fast approximation. Accurate for OpenAI models; undercounts ~10–15% on Qwen/Llama with heavy code content.
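
The count itself is a couple of lines with tiktoken:

import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    # Fast approximation; expect a ~10-15% undercount on Qwen/Llama tokenizers.
    return len(_enc.encode(text))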

Proto regeneration

If the go-to-rag proto changes:

make proto   # fetches from GitHub, regenerates agent_forge/_gen/

Related

  • go-to-rag — Go RAG engine (gRPC + MCP server)

License

Apache 2.0
