Status: prototype / experimental — exploring agentic patterns, not for production use.
Multi-turn LLM agent with streaming tool use, atomic context management, and gRPC RAG integration.
Designed to pair with and explore go-to-rag, a Go RAG engine that exposes retrieval and generation over gRPC; agent-forge calls it as a first-class tool during conversation.
Hands-on exploration of how agentic systems work at the implementation level:
- How multi-turn conversation state fills a context window and must be trimmed without corrupting tool-call sequences
- How models at different scales decide when to call a tool, which one to pick, and how to interpret the result
- How streaming, retries, and context errors compose in a real async loop
- How local models behave in an agentic context — and where smaller ones break down
The loop, context management, and tool dispatch are all explicit and readable. Deliberately built without LangChain or LangGraph — the intent is to understand the mechanics before adopting a framework that abstracts them away.
A core question this project explores: does tool-use comprehension scale with parameter count? Can a small model decide when a tool is needed, pick the right one, and correctly interpret the result — or does that only emerge at larger scale?
The benchmark is a diagnostic baseline, not a destination. The goal is a feedback loop: identify where models fail, intervene (prompt engineering, fine-tuning, distillation), re-run, and compare. The category breakdown in each result file points at the specific failure modes worth targeting.
11 behavioral evals ask a live model questions about this codebase, each with a ground-truth answer verifiable from the repo. Two pass conditions per question: the model must call a tool (no hallucinating without evidence) and the response must contain the expected string(s).
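Conceptually, each question reduces to a two-part check like the sketch below (hypothetical names, not the actual harness; see tests/README.md for the real thing):

```python
def question_passes(tool_calls: list[str], response: str, expected: list[str]) -> bool:
    """Sketch of the two pass conditions (illustrative, not the real harness).

    1. Evidence: the model must have called at least one tool,
       rather than answering from memory.
    2. Accuracy: the final response must contain every expected substring.
    """
    return len(tool_calls) > 0 and all(s in response for s in expected)
```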
Run on Ollama, NVIDIA RTX 4070 Super, Fedora Linux 43. See tests/README.md for full methodology.
| Model | Pass rate | Tool call rate | chat | config | grep | loop | project | tools |
|---|---|---|---|---|---|---|---|---|
| qwen3:0.6b | 9.1% | 27.3% | 0/2 | 0/2 | 0/2 | 0/2 | 0/1 | 1/2 |
| qwen3:4b | 72.7% | 90.9% | 1/2 | 2/2 | 2/2 | 1/2 | 0/1 | 2/2 |
| qwen3:8b | 90.9% | 90.9% | 1/2 | 2/2 | 2/2 | 2/2 | 1/1 | 2/2 |
| qwen3:14b | 100% | 100% | 2/2 | 2/2 | 2/2 | 2/2 | 1/1 | 2/2 |
The gap between 0.6b and 4b is stark: tool-calling appears to be an emergent capability that only shows up above a minimum parameter threshold. Above ~8b the differences narrow. Pass/fail here is substring matching against known answers, which measures presence of the right answer but not reasoning quality.
Next steps: LLM-as-judge scoring and human review to evaluate whether answers are correct for the right reasons, not just because they contain the right string.
```sh
ollama pull qwen3:0.6b
make run-ollama
```

To use vLLM instead:

```sh
make serve   # starts vLLM on :8000
make run
```

Start the go-to-rag gRPC server — see the go-to-rag serve docs for setup. Default port is 50051.
Add to your config:

```yaml
rag:
  endpoint: "localhost:50051"
```

The agent gains two tools: `ask_rag` (retrieval + LLM generation) and `retrieve_chunks` (raw scored chunks).
```yaml
model:
  name: "qwen3:0.6b"
  endpoint: "http://localhost:11434/v1"
  context_limit: 8192
system:
  prompt: "You are a helpful assistant."
  disable_think: false
rag:  # optional — omit if not using go-to-rag
  endpoint: "localhost:50051"
```

Priority: CLI flags → environment variables → YAML.
| Env var | Effect |
|---|---|
| `AGENT_FORGE_ENDPOINT` | Override model endpoint |
| `OLLAMA_MODEL` | Override Ollama model name |
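A minimal sketch of that precedence (illustrative helper, not the repo's actual loader):

```python
import os

def resolve(cli_value: str | None, env_var: str, yaml_value: str | None) -> str | None:
    """Resolve one setting: CLI flag beats env var beats YAML (illustrative)."""
    if cli_value is not None:
        return cli_value
    if env_var in os.environ:
        return os.environ[env_var]
    return yaml_value

# e.g. endpoint = resolve(args.endpoint, "AGENT_FORGE_ENDPOINT", cfg["model"]["endpoint"])
```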
See config.example.yaml for the full schema and config.ollama.yaml for an Ollama-ready config.
One round of the agent loop:
```mermaid
flowchart TD
    A([User input]) --> B[Trim context if over budget]
    B --> C[Stream model response]
    C --> D{Tool calls?}
    D -->|Yes| E[Execute tools in parallel]
    E --> F[Inject tool results into chat]
    F --> B
    D -->|No| G([Return final response])
    G -->|Next user message| A
```
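In code, one round of that loop looks roughly like this condensed sketch. The interfaces (`chat`, `model`, `tools`) are hypothetical stand-ins, not the actual implementation:

```python
import asyncio

async def run_turn(chat, model, tools) -> str:
    """One round of the agent loop diagrammed above (hypothetical interfaces)."""
    while True:
        chat.trim_to_budget()                      # drop oldest atomic groups if over budget
        reply = await model.stream(chat.messages)  # stream tokens, collecting any tool calls
        if not reply.tool_calls:
            return reply.text                      # no tool calls: this is the final response
        chat.append_assistant(reply)               # keep the tool-calling assistant message...
        results = await asyncio.gather(            # ...execute all requested tools in parallel
            *(tools[c.name](**c.arguments) for c in reply.tool_calls)
        )
        for call, result in zip(reply.tool_calls, results):
            chat.append_tool_result(call.id, result)  # ...inject results in the same atomic group
```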
Atomic context trimming — when the context window fills, turns are dropped oldest-first as atomic groups. An assistant message carrying tool calls is always dropped together with its tool results — splitting them leaves orphaned references that cause model errors or hallucinations.
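A sketch of the grouping rule, assuming OpenAI-style messages where tool results carry role `"tool"` (illustrative helpers, not the repo's code):

```python
def atomic_groups(messages: list[dict]) -> list[list[dict]]:
    """Group messages so an assistant tool-call message travels with its tool results."""
    groups: list[list[dict]] = []
    for msg in messages:
        if msg["role"] == "tool" and groups:
            groups[-1].append(msg)   # a tool result belongs to the preceding assistant call
        else:
            groups.append([msg])
    return groups

def trim_to_budget(messages: list[dict], tokens_of, budget: int) -> list[dict]:
    """Drop whole groups oldest-first until the conversation fits the token budget."""
    groups = atomic_groups(messages)
    while len(groups) > 1 and sum(tokens_of(m) for g in groups for m in g) > budget:
        groups.pop(0)                # drop the group as a unit: no orphaned tool results
    return [m for g in groups for m in g]
```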
Token counting — uses tiktoken `cl100k_base` as a fast approximation. Accurate for OpenAI models; undercounts ~10–15% on Qwen/Llama with heavy code content.
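The count itself is cheap with tiktoken; the approximation error comes from the tokenizer choice, not the counting:

```python
import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")

def approx_tokens(text: str) -> int:
    # cl100k_base is a fast stand-in; Qwen/Llama tokenizers split code
    # differently, so expect roughly a 10-15% undercount there.
    return len(_enc.encode(text))
```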
If the go-to-rag proto changes:

```sh
make proto   # fetches from GitHub, regenerates agent_forge/_gen/
```

Related:
- go-to-rag — Go RAG engine (gRPC + MCP server)
Apache 2.0