An agentic debugging tool built on Claude's native tool use. You give it a buggy Python file and a plain-English description of the bug; it then autonomously reads the code, searches documentation when it's stuck, writes a fix, runs the test suite, reads the failures, and iterates — looping until the tests pass or it hits a 5-cycle limit. Crucially, Claude decides which tool to call next based on each result; nothing in the code hardcodes a fixed read→fix→test sequence. That open-ended, result-driven control loop — over a real tool surface (code execution, web search, file I/O, pytest) — is what makes this a genuine agent rather than a single prompt, and it's the part worth showing an employer.
┌──────────────────────────────────────┐
buggy file + │ Streamlit UI │
bug description ─────▶ │ upload · live log · diff · results │
└───────────────────┬──────────────────┘
│ background thread
▼
┌─────────────────────── agent loop (agent.py) ───────────────────────┐
│ │
│ ┌─────────────┐ tool_use ┌──────────────────────────────────┐ │
│ │ Claude │ ──────────▶ │ dispatch tool (tools.py) │ │
│ │ (Sonnet 4.6)│ │ • execute_code ┐ │ │
│ │ + adaptive │ │ • web_search │→ sandbox.py │ │
│ │ thinking │ ◀────────── │ • read_write_file│ (Docker or │ │
│ └─────────────┘ tool_result │ • run_tests ┘ subprocess)│ │
│ │ └──────────────────────────────────┘ │
│ │ tests pass? ── no ──▶ iterate (up to 5 test cycles) │
│ ▼ │
│ tests pass ✓ / max iterations reached / stopped by user │
└──────────────────────────────────────────────────────────────────────┘
Each turn, Claude is given the four tool schemas (tool_schemas.py) and decides
what to do. The loop intercepts every tool_use block, executes it, logs it, and
feeds the tool_result back — repeating until Claude stops calling tools or the
tests go green.
| File | Role |
|---|---|
app.py |
Streamlit UI: setup wizard, inputs, live log, diff/result panel |
agent.py |
The Claude tool-use loop + iteration logic |
tools.py |
The four tool implementations + dispatcher |
tool_schemas.py |
JSON schemas passed to Claude's tools parameter |
sandbox.py |
Docker / subprocess execution backend |
Dockerfile |
The coding-agent-sandbox image (pytest preinstalled) |
.env.example |
Required env vars with placeholder values |
requirements.txt |
Pinned dependencies |
- Anthropic — sign in at https://console.anthropic.com/, open API
Keys, and create a key (starts with
sk-ant-). - Tavily (web search) — sign up at https://tavily.com/ and copy your key
from the dashboard (starts with
tvly-). The free tier is plenty for a demo.
python -m venv .venv
# Windows (PowerShell): .venv\Scripts\Activate.ps1
# macOS/Linux: source .venv/bin/activate
pip install -r requirements.txtdocker build -t coding-agent-sandbox .This builds the image the app uses to run code/tests in an isolated, network-disabled container. You can sanity-check it directly:
docker run --rm coding-agent-sandbox python -c "import pytest; print('sandbox OK')"Docker is optional but strongly recommended. Without it, the app falls back to running code in a plain subprocess on your machine (see the security notice).
# With the venv activated:
streamlit run app.py
# Or without activating (Windows):
.venv\Scripts\python.exe -m streamlit run app.pyOn first launch you'll see a setup screen: paste your two keys, and they're
validated with a live API call and saved to a local .env. You won't see that
screen again (use the Reset API keys expander in the sidebar to re-enter
them). Then upload or paste a buggy .py file, describe the bug, optionally
attach a pytest file, and click Run agent.
When the agent finishes, a Download fixed file button appears below the diff so you can save the fixed version — the agent always works on an isolated copy of your file and never modifies the original.
This agent executes code on your computer. When the Docker sandbox is available, code and tests run inside an ephemeral, network-isolated container. When Docker is not available, the app falls back to running code in a plain subprocess directly on your host — with no isolation. The model writes and runs this code autonomously.
- The live log shows every tool call (timestamp, tool, inputs, output) — review them, especially in subprocess-fallback mode (the UI flags it in yellow).
- Do not point this at sensitive code, secrets, or a machine you can't afford to have arbitrary code run on, without reviewing each tool call first.
- File writes are confined to a throwaway per-session workspace, but executed code is only as contained as your backend (Docker = isolated; subprocess = not). Prefer Docker.
Built with the Anthropic Python SDK, Streamlit, and Tavily.