A local multi-agent AI pipeline that plans, writes, executes, and self-corrects Python code — powered entirely by open-source models running on consumer hardware.
NEXUS is a self-correcting code generation system that uses four specialized AI agents working in a loop to turn natural language descriptions into working Python scripts. Unlike ChatGPT or Copilot, NEXUS runs 100% locally — no API keys, no cloud, no subscription fees, no data leaving your machine.
You describe what you want. NEXUS plans the approach, writes the code, executes it in a sandbox, reviews the output, and if something's wrong, fixes it automatically, retrying up to 3 iterations until the code works or fails gracefully.
```
You:   "Write a script that checks if 153 is an Armstrong number"
NEXUS: Plans → Writes code → Runs it → NameError detected →
       Reviews → "num_digits not in scope" → Fixes it →
       Runs again → ✅ "153 is an Armstrong number" → Done.
```
```
➜ Write a script that generates a random password of 16 characters containing
  uppercase, lowercase, digits, and special characters, then prints the password
  and its strength rating

🧠 Planner   → 5-step execution plan
💻 Developer → 49-line script with generate_password() and calculate_strength()
🔧 Runner    → NameError on line 38: 'all_uppercase' not defined
🔍 Reviewer  → ❌ REJECTED — "use 'has_uppercase' instead of 'all_uppercase'"
💻 Developer → Fixed variable name
🔧 Runner    → Password: "clJjSU23GG05-kS(" | Strength: "Strong"
🔍 Reviewer  → ✅ APPROVED
🏁 Result    → SUCCESS in 51.4s (2/3 iterations)
```
```
┌─────────────┐
│  User Task  │
└──────┬──────┘
       │
┌──────▼──────┐
│   PLANNER   │  Breaks task into 2-10 steps
│    Agent    │  (structured JSON plan)
└──────┬──────┘
       │
┌──────▼──────┐
│  DEVELOPER  │◄─────┐  Writes complete Python script
│    Agent    │      │  (fixes bugs on retry)
└──────┬──────┘      │
       │             │
┌──────▼──────┐      │
│ CODE RUNNER │      │  Executes in sandboxed subprocess
│  (Sandbox)  │      │  (10s timeout, captures stdout/stderr)
└──────┬──────┘      │
       │             │
┌──────▼──────┐      │
│  REVIEWER   │      │  Validates output against task
│    Agent    │      │  (classifies errors, gives feedback)
└──────┬──────┘      │
       │             │
 ┌─────▼─────┐  No   │
 │   Pass?   ├───────┘
 └─────┬─────┘ (max 3)
       │ Yes
       ▼
 ✅ Final Output
```
Each agent is a separate module with its own Pydantic schema for structured output validation. The pipeline is orchestrated by LangGraph as a state machine with conditional routing.
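The shared state that flows through the pipeline can be sketched as a `TypedDict` (the field names here are illustrative assumptions; the real `AgentState` in `state.py` may differ):

```python
from typing import TypedDict


class AgentState(TypedDict, total=False):
    """Shared pipeline state passed between nodes (illustrative sketch)."""
    task: str         # original natural-language request
    plan: list[str]   # Planner output
    code: str         # latest Developer script
    stdout: str       # Code Runner capture
    stderr: str       # Code Runner capture
    is_valid: bool    # Reviewer verdict
    feedback: str     # Reviewer guidance for the retry
    iteration: int    # current attempt (capped at 3)
```

Because every node reads and writes the same dictionary, each agent stays a pure function from state to state, which is what lets LangGraph wire them together with conditional edges.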
| Component | Technology | Purpose |
|---|---|---|
| LLM | qwen2.5-coder:7b via Ollama | All agent reasoning (planning, coding, reviewing) |
| Orchestration | LangGraph 0.4.1 | State machine with conditional edges |
| Schema Validation | Pydantic v2 | Structured JSON output from every agent |
| LLM Integration | LangChain + langchain-ollama | Ollama API wrapper with format="json" |
| Code Execution | subprocess (sandboxed) | Isolated Python execution with timeout |
| Terminal UI | Rich | Panels, syntax highlighting, spinners, tables |
| Language | Python 3.10+ | Everything |
Hardware used during development:
| Component | Spec |
|---|---|
| CPU | Intel i3-12100 |
| GPU | NVIDIA RTX 3050 6GB |
| RAM | 8GB DDR4 |
| OS | Windows 11 |
Note: NEXUS also supports `qwen2.5-coder:3b` for faster inference (~10s/call vs ~20s/call) at the cost of accuracy. Change `MODEL_NAME` in `config.py` to switch.
```
nexus/
├── main.py              # Rich terminal UI + manual pipeline runner
├── graph.py             # LangGraph state machine with conditional routing
├── state.py             # AgentState TypedDict (shared pipeline state)
├── config.py            # Ollama settings, prompt templates, iteration limits
├── schemas.py           # Pydantic v2 models + JSON parser + model validator
├── requirements.txt     # Pinned dependencies
├── agents/
│   ├── planner.py       # Breaks task into execution steps
│   ├── developer.py     # Writes/fixes Python code
│   └── reviewer.py      # Validates code + output, classifies errors
├── tools/
│   └── code_runner.py   # Sandboxed subprocess execution (10s timeout)
└── .gitignore
```
- Python 3.10+
- Ollama installed and running (ollama.com)
- ~4.5GB disk space for the model weights
```shell
# 1. Clone the repo
git clone https://github.com/Swapnil-bo/nexus.git
cd nexus

# 2. Create virtual environment
python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS/Linux
source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Pull the model
ollama pull qwen2.5-coder:7b
```

```shell
# Interactive mode
python main.py

# Direct task
python main.py "Write a script that prints all prime numbers up to 50"
```

Takes the user's natural language task and produces a structured execution plan (2-10 steps). The plan guides the Developer but never contains code.
```json
{
  "steps": [
    "Define a function to convert Fahrenheit to Celsius",
    "Define a function to convert Celsius to Kelvin",
    "Call both functions with 98.6 as input",
    "Print all three temperature values"
  ]
}
```

Receives the plan (and on retries: previous code, error output, and reviewer feedback) and writes a complete, self-contained Python script. Uses context pruning — only sees the latest code and feedback, never the full error log, keeping the context window small and focused.
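Context pruning can be sketched as a prompt builder that only ever includes the most recent attempt (a hypothetical helper for illustration; NEXUS's real prompt templates live in `config.py`):

```python
def build_developer_prompt(plan, code=None, feedback=None):
    """Assemble the Developer prompt with context pruning: only the latest
    code and the latest Reviewer feedback are included, never the
    accumulated error log. (Illustrative sketch, not the actual template.)"""
    parts = ["Write a complete Python script for this plan:"]
    parts += [f"- {step}" for step in plan]
    if code and feedback:
        # Retry path: show only the most recent attempt and verdict.
        parts.append("\nYour previous attempt:\n" + code)
        parts.append("Reviewer feedback:\n" + feedback)
        parts.append("Fix the issue and return the full corrected script.")
    return "\n".join(parts)
```

Keeping the retry context to one attempt plus one piece of feedback matters for a 7B model: a growing transcript of every failed attempt would quickly crowd out the instructions.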
Executes the generated code in an isolated subprocess with:
- 10-second timeout — kills runaway scripts (infinite loops, blocking `input()` calls)
- stdout/stderr capture — both streams returned to the Reviewer
- Windows-safe temp files — manual create/write/close/execute/delete cycle (avoids Windows file locking issues with `NamedTemporaryFile`)
- Virtual environment awareness — uses `sys.executable` instead of `python` to ensure the correct interpreter
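A minimal sketch of this execution strategy, assuming a plain-dict return shape (the actual API in `code_runner.py` may differ):

```python
import os
import subprocess
import sys
import tempfile


def run_code(code: str, timeout: int = 10) -> dict:
    """Execute a Python snippet in an isolated subprocess (sketch)."""
    # Manual create/write/close cycle: NamedTemporaryFile keeps the file
    # open, which on Windows blocks a second process from reading it.
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(code)
        # sys.executable guarantees the venv's interpreter is used,
        # not whatever "python" happens to resolve to on PATH.
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return {"stdout": proc.stdout, "stderr": proc.stderr,
                "returncode": proc.returncode}
    except subprocess.TimeoutExpired:
        return {"stdout": "",
                "stderr": f"TimeoutExpired: killed after {timeout}s",
                "returncode": -1}
    finally:
        os.remove(path)
```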
Compares the execution output against the original task requirements. Returns a structured verdict:
```json
{
  "is_valid": false,
  "error_type": "logic_error",
  "feedback": "The variable 'num_digits' is not defined in the outer scope. Define it before the print statement."
}
```

Error classifications: `syntax_error`, `logic_error`, `infinite_loop`, `incomplete`, `pass`
If the Reviewer rejects the code, the Developer gets the feedback and tries again — up to 3 iterations. The loop is managed by LangGraph's conditional edges:
```
# Conditional routing after Reviewer
if is_valid         → finalize  → END (success)
if iteration >= max → fail      → END (max retries)
else                → Developer → Code Runner → Reviewer (retry)
```

Small local models are unreliable JSON producers. NEXUS handles this with multiple layers:
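The routing decision itself reduces to a small function (illustrative; the real conditional edge is wired up in `graph.py`):

```python
from typing import Literal

MAX_ITERATIONS = 3  # assumption: mirrors the documented 3-try limit


def route_after_review(state: dict) -> Literal["finalize", "fail", "developer"]:
    """Pick the next node after the Reviewer runs (sketch of the
    conditional-edge logic, not the actual graph.py code)."""
    if state["is_valid"]:
        return "finalize"   # success → END
    if state["iteration"] >= MAX_ITERATIONS:
        return "fail"       # retries exhausted → END
    return "developer"      # loop back for another fix attempt
```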
Every Ollama call uses `format="json"`, which constrains the model's token generation to valid JSON syntax.
The `parse_json_response()` helper strips markdown wrappers, extracts JSON from garbage text, and handles common formatting issues:
````
# Handles: ```json { ... } ```, leading text { ... } trailing text, etc.
````

Every parsed response is validated against strict Pydantic v2 schemas with field constraints (list lengths, literal types, required fields).
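A stdlib-only sketch of the JSON-extraction helper described above (the real `parse_json_response()` in `schemas.py` may differ):

```python
import json
import re


def parse_json_response(text: str) -> dict:
    """Extract the first JSON object from a raw model response (sketch)."""
    # Strip ```json ... ``` markdown fences if present.
    text = re.sub(r"```(?:json)?", "", text)
    # Keep everything from the first "{" to the last "}" so that
    # leading/trailing chatter around the object is discarded.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in response")
    return json.loads(text[start:end + 1])
```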
A `@model_validator` on `ReviewerOutput` auto-corrects common model quirks:

| Model Returns | Auto-Corrected To |
|---|---|
| `is_valid: true`, `error_type: ""` | `error_type: "pass"` |
| `is_valid: false`, `error_type: "FileNotFoundError"` | `error_type: "logic_error"` |
| `is_valid: false`, `error_type: "SyntaxError"` | `error_type: "syntax_error"` |
| `error_type: "pass"`, `is_valid: false` | `is_valid: true` |
| `is_valid: true`, `error_type: "logic_error"` | `is_valid: false` (trusts the error) |
This prevents valid code from being falsely rejected due to schema validation failures.
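The auto-correction table can be implemented with a Pydantic v2 `@model_validator`. This is an illustrative reconstruction, not the exact code from `schemas.py`:

```python
from pydantic import BaseModel, model_validator

KNOWN = {"syntax_error", "logic_error", "infinite_loop", "incomplete", "pass"}


class ReviewerOutput(BaseModel):
    """Illustrative reconstruction of the Reviewer schema."""
    is_valid: bool
    error_type: str = ""
    feedback: str = ""

    @model_validator(mode="after")
    def reconcile(self) -> "ReviewerOutput":
        # Map invented or empty error types into the known vocabulary.
        if self.error_type not in KNOWN:
            if not self.error_type and self.is_valid:
                self.error_type = "pass"          # "" + valid → "pass"
            elif "syntax" in self.error_type.lower():
                self.error_type = "syntax_error"  # "SyntaxError" → syntax_error
            else:
                self.error_type = "logic_error"   # e.g. "FileNotFoundError"
        # Resolve contradictions, trusting the error type over the boolean.
        if self.error_type == "pass" and not self.is_valid:
            self.is_valid = True
        elif self.error_type != "pass" and self.is_valid:
            self.is_valid = False
        return self
```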
Comprehensive testing across 13 tasks with both the 3b and 7b models:
| Task | Result | Iterations | Time | Notes |
|---|---|---|---|---|
| Sum 1 to 10 | ✅ Pass | 1/3 | 12.3s | Clean first-try |
| Multiplication table | ✅ Pass | 1/3 | 16.0s | Clean first-try |
| Factorial (recursion) | ✅ Pass | 1/3 | ~20s | Clean first-try |
| Find duplicates in list | ✅ Pass | 1/3 | 41.5s | Clean first-try |
| Temperature conversion | ✅ Pass | 1/3 | 29.6s | Float math, multi-function |
| Student dictionaries (highest grade) | ✅ Pass | 1/3 | 29.2s | Nested data structures |
| Multiplication grid (formatted) | ✅ Pass | 1/3 | 21.2s | Formatted print output |
| Armstrong number check | ✅ Pass | 2/3 | 65.5s | Self-corrected scoping bug |
| Password generator (simple) | ✅ Pass | 2/3 | 51.4s | Self-corrected variable name |
| Second largest number | ❌ Fail | 3/3 | 88.9s | Reviewer hallucination |
| String multi-operation | ❌ Fail | 3/3 | 58.7s | Task ambiguity + Reviewer confusion |
| File read (data.txt) | ❌ Fail | 3/3 | 42.9s | Expected: file doesn't exist in sandbox |
| Even/odd (user input) | ❌ Fail | 3/3 | 86.4s | Expected: input() blocks in sandbox |
Pass rate: 9/13 (69%), with 2 of the 4 failures being expected sandbox limitations
| Metric | qwen2.5-coder:3b | qwen2.5-coder:7b |
|---|---|---|
| Pass rate | ~50% | ~69% |
| Avg time per call | 5-10s | 15-25s |
| Total task time | ~30s | ~30-60s |
| Reviewer accuracy | Frequent hallucinations | Mostly accurate |
| VRAM usage | ~2GB | ~4.5GB |
| Self-correction | Occasionally works | Reliably works |
- Reviewer hallucination: The Reviewer sometimes rejects correct code by inventing bugs that don't exist (e.g., claiming a duplicate-handling algorithm doesn't handle duplicates, while the code clearly does). This is inherent to 7B-parameter model reasoning.
- Task ambiguity: Multi-step tasks with ambiguous requirements (e.g., "reverse AND uppercase AND capitalize") can confuse both the Developer and Reviewer about what the correct output should be.
- Multi-line strings: The 3b model sometimes merges lines in f-strings, producing syntax errors it can't debug. The 7b model handles this correctly.
- No interactive input: `input()` blocks forever in a headless subprocess. The Developer is instructed to hardcode example values instead.
- No filesystem access: Generated scripts run in the system temp directory with no access to project files.
- No network access: Scripts cannot make HTTP requests or access external APIs.
- 10-second timeout: Long-running computations will be killed.
- No error log feedback: The error log is kept for the UI only and never fed back to agents, preventing context window pollution but also preventing the system from learning across iterations.
- Single file output: NEXUS generates single Python scripts only — no multi-file projects.
- No dependency installation: Only Python standard library is available in the sandbox.
| Bug | Root Cause | Fix |
|---|---|---|
| Planner returning 7 steps crashed validation | `max_length=6` too strict | Raised to `max_length=10` |
| Correct code rejected 3 times (factorial) | Reviewer returned `error_type: ""` instead of `"pass"` | Added `@model_validator` to auto-coerce empty → `"pass"` |
| `FileNotFoundError` crashed Reviewer parsing | Reviewer invented `error_type: "FileNotFoundError"` | Extended validator with fuzzy keyword matching to map unknown types |
| `input()` caused 10s timeout every time | Sandbox has no stdin | Added "CRITICAL: NEVER use input()" to Developer prompt |
| Developer & Reviewer fought over `input()` | Reviewer rejected hardcoded values | Added sandbox awareness to Reviewer prompt |
| Windows file locking errors | `NamedTemporaryFile` doesn't work on Windows | Manual create/write/close/execute/delete pattern |
| Wrong model loaded | Used `qwen2.5:3b` (general) instead of coder variant | Corrected to `qwen2.5-coder:7b` |
NEXUS demonstrates several concepts that are directly relevant to production AI engineering:
- Multi-agent orchestration — Four specialized agents collaborating through a shared state machine, each with distinct responsibilities and schemas.
- Self-correcting systems — The feedback loop between Reviewer and Developer mirrors real-world CI/CD patterns where failing tests trigger code fixes.
- Small model engineering — Making a 7B-parameter model (100x smaller than GPT-4) reliably produce structured, correct output through prompt engineering, schema validation, and defensive parsing.
- Local-first AI — Zero cloud dependency, zero cost, complete data privacy. The entire system runs on a $300 GPU.
- Production hardening — JSON bulletproofing, Windows compatibility, timeout protection, graceful error handling — the unglamorous work that separates prototypes from real tools.
- Swap to a larger model (e.g., `qwen2.5-coder:14b` with quantization) for better Reviewer accuracy
- Add a test-generation agent that writes unit tests before the Developer writes code
- Multi-file project support with a virtual filesystem in the sandbox
- Web UI with FastAPI + WebSocket for real-time streaming of agent outputs
- Model-agnostic backend — support for any Ollama model, llama.cpp, or remote APIs
- Persistent memory — learn from past successes/failures across sessions
```
langchain==0.3.25
langchain-core==0.3.61
langchain-ollama==0.2.3
langgraph==0.4.1
pydantic==2.11.3
rich==14.0.0
```
Built with 🧠 by Swapnil as part of the 100 Days of Vibe Coding challenge.
Powered by Ollama, LangGraph, and stubbornness.