Code Agent — Domain-Adaptive Agentic Code Execution

An LLM agent framework that discovers its environment, generates code, executes it, and reviews the results — with pluggable domain adapters that inject all domain-specific knowledge. The core loop is fully general; domains like Minecraft server management are swappable modules.

Built with local models (Ollama) on a single 3090, using DSPy for structured LLM calls.

Status: Working prototype. The Minecraft/Paper/RCON adapter is battle-tested. The framework is ready for new domains.

How It Works

The agent runs a loop for each subtask: DISCOVER → CODE → EXECUTE → REVIEW.

┌──────────────────────────────────────────────────────────────┐
│                      ORCHESTRATOR                            │
│                                                              │
│  Domain detected from task → adapter selected automatically  │
│                                                              │
│  For each subtask:                                           │
│  ┌──────────┐  ┌────────┐  ┌──────────┐  ┌────────────────┐  │
│  │ DISCOVER │→ │  CODE  │→ │ EXECUTE  │→ │    REVIEW      │  │
│  │          │  │        │  │          │  │                │  │
│  │ run real │  │ LLM    │  │ language │  │ hard verdicts: │  │
│  │ commands │  │ writes │  │ detect + │  │ PASS or a      │  │
│  │ to learn │  │ Python │  │ run it   │  │ specific FAIL  │  │
│  │ the env  │  │ code   │  │          │  │                │  │
│  └──────────┘  └────────┘  └──────────┘  └────────────────┘  │
│                     │                            │           │
│                     │         on failure:        │           │
│                     │  ┌──────────────────────┐  │           │
│                     │  │     FIX PLANNER      │  │           │
│                     │  │ state patches, task  │  │           │
│                     │  │ rewrites, cleanup    │←─┘           │
│                     │  └──────────────────────┘              │
│                     │              │                         │
│                     └──────────────┘                         │
│                      retry with structurally different input │
│                                                              │
│  State flows forward: each subtask's outputs feed the next   │
│  Contracts enforced: subtasks declare required outputs       │
└──────────────────────────────────────────────────────────────┘

Example task: "Find the running Minecraft server and create a plugin that makes stone blocks fall with gravity"

The agent will: discover the Docker container and RCON credentials → plan subtasks (ensure server, write/compile/deploy plugin, restart, verify) → generate Python code that writes Java source into the container, compiles it, packages a JAR → execute and review each step → retry with structural fixes if anything fails.

Key Design Decisions

| Decision | Why |
|---|---|
| Domain adapters | Core framework has zero domain knowledge. Adding a new domain = one new file implementing the adapter interface. |
| Contract enforcement | Subtasks declare what they must produce (STATE:key=value). Missing outputs = hard FAIL. Prevents the LLM from hallucinating success. (See the sketch after this table.) |
| State prepending | Discovered values are injected as real Python variable assignments before the LLM's code. Eliminates NameError from guessed values. (Also sketched below.) |
| Hard verdicts | No confidence scores. PASS or a specific failure type (placeholder detected, contract violation, bad output, syntax error). |
| Fix planning with state mutations | Failed attempts don't just append "try again" to the prompt. The FixPlanner deletes bad state keys, rewrites the task from the original (preventing prompt bloat), and runs cleanup commands. The coder sees genuinely different inputs on retry. |
| Discovery-first | Real shell commands run before each subtask. The coder gets verified values like container IDs and passwords, not LLM guesses. |
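
Roughly, state prepending and contract enforcement fit together like this. A minimal sketch, assuming a simple STATE:key=value stdout protocol; the function names are illustrative, and the real logic lives in coder.py and reviewer.py:

import re

def prepend_state(llm_code: str, state: dict) -> str:
    """Inject discovered values as real Python assignments ahead of the LLM's code."""
    assignments = "\n".join(f"{key} = {value!r}" for key, value in state.items())
    return f"{assignments}\n\n{llm_code}"

def check_contract(stdout: str, required: set[str]) -> str:
    """Parse STATE:key=value lines from execution output; missing keys are a hard FAIL."""
    produced = dict(re.findall(r"^STATE:(\w+)=(.*)$", stdout, re.MULTILINE))
    missing = required - produced.keys()
    if missing:
        return f"FAIL: contract violation, missing {sorted(missing)}"
    return "PASS"

# The coder sees container_id as a bound variable, not a value to guess:
code = prepend_state('print(f"STATE:jar_path=/plugins/{container_id}.jar")',
                     {"container_id": "abc123"})
print(check_contract("STATE:jar_path=/plugins/abc123.jar", {"jar_path"}))  # PASS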

Project Structure

code_agent/
├── pyproject.toml                # Packaging & dependencies
├── config.example.yaml           # Example configuration (Ollama on 3090)
├── run_tests.py                  # Self-contained test suite (126 tests)
│
├── src/agent/
│   ├── cli.py                    # Entry point
│   ├── core/
│   │   ├── types.py              # All data models
│   │   ├── config.py             # YAML/env configuration
│   │   └── llm.py                # DSPy multi-model management + reasoning model support
│   ├── agents/
│   │   ├── discovery.py          # Environment discovery (run commands, parse results)
│   │   ├── planner.py            # Task decomposition with contracts
│   │   └── coder.py              # Code generation with state injection
│   ├── executor/
│   │   └── local.py              # Language detection, auto-import, sandboxed execution
│   ├── reviewer/
│   │   ├── reviewer.py           # Hard verdict system
│   │   ├── analyzer.py           # Error pattern analysis
│   │   └── fix_planner.py        # Structural fix planning (state patches, task rewrites)
│   ├── orchestrator/
│   │   └── workflow.py           # Main agent loop
│   └── domains/
│       ├── base.py               # DomainAdapter ABC — the plugin interface
│       ├── registry.py           # Auto-detection & selection
│       └── minecraft/
│           ├── adapter.py        # Minecraft/Paper/RCON domain knowledge
│           └── bukkit_reference.py
│
├── tests/                        # Deterministic tests (no LLM needed)
│
└── tools/                        # Ollama model management utilities
    ├── models.py
    └── modelfiles/

What I Learned (Design Tradeoffs)

This project started as "let an LLM figure out how to manage a Minecraft server." Here's what I discovered building it.

LLMs need more guardrails than you'd expect. The Minecraft adapter is ~3,300 lines — almost as much as the core framework. That's not a failure of the architecture; it's the reality of working with local models. They write Java when you ask for Python. They use -it flags with docker exec (which hangs forever). They invent container IDs instead of using discovered ones. They claim success without running verification commands. Every guardrail in the adapter exists because I hit that exact failure mode, sometimes dozens of times.

Structural fixes beat prompt fixes. Early versions just appended error messages to the prompt on retry. The LLM would read "don't use placeholder values" and then use placeholder values. The FixPlanner was born from frustration: instead of asking the LLM to change, change what the LLM sees. Delete the bad classpath from state so it can't reuse it. Rewrite the task description. Run cleanup commands to remove the broken plugin before retrying. This made a massive difference in success rates.
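
A rough sketch of that idea (the field names here are illustrative, not the actual FixPlanner data model in fix_planner.py):

import subprocess
from dataclasses import dataclass, field

@dataclass
class FixPlan:
    delete_state_keys: list[str] = field(default_factory=list)  # e.g. a bad classpath
    rewritten_task: str | None = None   # rebuilt from the ORIGINAL task, not the failed prompt
    cleanup_commands: list[list[str]] = field(default_factory=list)

def apply_fix(plan: FixPlan, state: dict, original_task: str) -> tuple[dict, str]:
    """Change what the coder sees on retry instead of asking the LLM to behave."""
    for key in plan.delete_state_keys:
        state.pop(key, None)                # the bad value can no longer be reused
    for cmd in plan.cleanup_commands:
        subprocess.run(cmd, check=False)    # e.g. remove the broken plugin before retrying
    return state, (plan.rewritten_task or original_task)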

Discovery is the most important phase. The single biggest improvement was running real commands to discover the environment before asking the LLM to write code. When the coder gets container_id = "abc123" as a real Python variable instead of having to guess, the failure rate drops dramatically.

Domain knowledge can't be fully delegated to the LLM. I tried. The planner would generate five-step plans for something that needs two steps, or skip critical steps like waiting for the server to restart. The domain_plan() hook exists because for known task patterns, a hardcoded plan works better than an LLM-generated one. This is an honest tradeoff: less "autonomous" but dramatically more reliable.
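
As a sketch only (the real hook signature and subtask model live in base.py and types.py; the dict shape below is a stand-in):

def domain_plan(self, task, state):
    """Return a fixed plan for recognized task patterns; None falls back to the LLM planner."""
    if "deop" in task.lower():
        return [
            {"goal": "find the running server container and RCON credentials",
             "must_produce": ["container_id", "rcon_password"]},
            {"goal": "send the deop command over RCON and verify with 'list'",
             "must_produce": ["rcon_output"]},
        ]
    return None  # unrecognized pattern: let the LLM decompose it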

The adapter interface has 15+ hooks, and that's the right number. I started with 5. Every new hook was added because I hit a case where the generic framework needed domain-specific input at a point where it didn't have one. The number of hooks is roughly proportional to the number of ways things go wrong.

Adding a New Domain

Implement the DomainAdapter interface. At minimum you need detect() and discovery_commands(). Everything else has sensible defaults.

from agent.domains.base import DomainAdapter
from agent.domains import registry
# Assumption: DiscoveryCommand is one of the data models in src/agent/core/types.py.
from agent.core.types import DiscoveryCommand

class KubernetesAdapter(DomainAdapter):
    @property
    def name(self):
        return "kubernetes"

    def detect(self, task):
        # Confidence in [0, 1]; the registry picks the highest-scoring adapter.
        t = task.lower()
        if any(kw in t for kw in ("kubectl", "k8s", "pod", "deployment")):
            return 0.8
        return 0.0

    def discovery_commands(self, task, subtask, state):
        # Real commands to run before coding; parsed output becomes verified state.
        return [
            DiscoveryCommand(cmd=["kubectl", "cluster-info"], ...),
            DiscoveryCommand(cmd=["kubectl", "get", "pods", "-A"], ...),
        ]

    def planning_context(self, task, state):
        return "Use kubectl apply for declarative config..."

    # Implement more hooks as you discover failure modes

registry.register(KubernetesAdapter())

Usage

# Install
pip install -e .

# Run
code-agent "find the running minecraft server and deop bob"

# With config
code-agent --config config.yaml "your task"

# Verbose
code-agent -v "your task"

Configuration

Copy config.example.yaml to config.yaml. The example configs are tuned for a 3090 (24GB) with Ollama:

  • Config A: Best quality — 32B coder + 32B reasoning planner (models swap in/out of VRAM)
  • Config B: Both resident — 14B coder + 14B planner (no swapping, faster)
  • Config C: Qwen3 coder + QwQ planner
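
The authoritative schema is config.example.yaml; purely as a hypothetical illustration, a Config B style setup might look something like:

# Hypothetical keys for illustration only; copy config.example.yaml for the real schema.
llm:
  provider: ollama
  base_url: http://localhost:11434   # Ollama's default endpoint
  coder_model: qwen2.5-coder:14b     # two 14B models stay resident together in 24GB
  planner_model: qwen2.5:14b
executor:
  timeout_seconds: 120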

Running Tests

python run_tests.py

126 deterministic tests. No LLM or external services needed — everything is mocked.

Dependencies

  • Python 3.10+
  • DSPy — structured LLM calls
  • Ollama — local model serving
  • Docker — for the Minecraft adapter (the framework itself doesn't require it)

License

MIT
