A local multi-agent AI pipeline that plans, writes, executes, and self-corrects Python code — powered entirely by open-source models running on consumer hardware.
NEXUS is a self-correcting code generation system that uses four specialized AI agents working in a loop to turn natural language descriptions into working Python scripts. Unlike ChatGPT or Copilot, NEXUS runs 100% locally — no API keys, no cloud, no subscription fees, no data leaving your machine.
You describe what you want. NEXUS plans the approach, writes the code, executes it in a sandbox, reviews the output, and if something's wrong, fixes it automatically, retrying up to 3 iterations until the code works or fails gracefully.
```
You:   "Write a script that checks if 153 is an Armstrong number"
NEXUS: Plans → Writes code → Runs it → NameError detected →
       Reviews → "num_digits not in scope" → Fixes it →
       Runs again → ✅ "153 is an Armstrong number" → Done.
```
```
➜ Write a script that generates a random password of 16 characters containing
  uppercase, lowercase, digits, and special characters, then prints the password
  and its strength rating

🧠 Planner   → 5-step execution plan
💻 Developer → 49-line script with generate_password() and calculate_strength()
🔧 Runner    → NameError on line 38: 'all_uppercase' not defined
🔍 Reviewer  → ❌ REJECTED — "use 'has_uppercase' instead of 'all_uppercase'"
💻 Developer → Fixed variable name
🔧 Runner    → Password: "clJjSU23GG05-kS(" | Strength: "Strong"
🔍 Reviewer  → ✅ APPROVED
🏁 Result    → SUCCESS in 51.4s (2/3 iterations)
```
```
┌─────────────┐
│  User Task  │
└──────┬──────┘
       │
┌──────▼──────┐
│   PLANNER   │  Breaks task into 2-10 steps
│    Agent    │  (structured JSON plan)
└──────┬──────┘
       │
┌──────▼──────┐
│  DEVELOPER  │◄─────┐  Writes complete Python script
│    Agent    │      │  (fixes bugs on retry)
└──────┬──────┘      │
       │             │
┌──────▼──────┐      │
│ CODE RUNNER │      │  Executes in sandboxed subprocess
│  (Sandbox)  │      │  (10s timeout, captures stdout/stderr)
└──────┬──────┘      │
       │             │
┌──────▼──────┐      │
│  REVIEWER   │      │  Validates output against task
│    Agent    │      │  (classifies errors, gives feedback)
└──────┬──────┘      │
       │             │
 ┌─────▼─────┐  No   │
 │   Pass?   ├───────┘
 └─────┬─────┘ (max 3)
       │ Yes
       ▼
 ✅ Final Output
```
Each agent is a separate module with its own Pydantic schema for structured output validation. The pipeline is orchestrated by LangGraph as a state machine with conditional routing.
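The shared state that flows through the pipeline can be sketched as a `TypedDict` (the field names here are illustrative assumptions; the real `AgentState` in `state.py` may differ):

```python
from typing import TypedDict


class AgentState(TypedDict, total=False):
    """Shared pipeline state passed between nodes (illustrative sketch)."""
    task: str         # original natural-language request
    plan: list[str]   # Planner output
    code: str         # latest Developer script
    stdout: str       # Code Runner capture
    stderr: str       # Code Runner capture
    is_valid: bool    # Reviewer verdict
    feedback: str     # Reviewer guidance for the retry
    iteration: int    # current attempt (capped at 3)
```

Because every node reads and writes the same dictionary, each agent stays a pure function from state to state, which is what lets LangGraph wire them together with conditional edges.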
| Component | Technology | Purpose |
|---|---|---|
| LLM | qwen2.5-coder:7b via Ollama | All agent reasoning (planning, coding, reviewing) |
| Orchestration | LangGraph 0.4.1 | State machine with conditional edges |
| Schema Validation | Pydantic v2 | Structured JSON output from every agent |
| LLM Integration | LangChain + langchain-ollama | Ollama API wrapper with format="json" |
| Code Execution | subprocess (sandboxed) | Isolated Python execution with timeout |
| Terminal UI | Rich | Panels, syntax highlighting, spinners, tables |
| Language | Python 3.10+ | Everything |
Hardware used during development:
| Component | Spec |
|---|---|
| CPU | Intel i3-12100 |
| GPU | NVIDIA RTX 3050 6GB |
| RAM | 8GB DDR4 |
| OS | Windows 11 |
Note: NEXUS also supports `qwen2.5-coder:3b` for faster inference (~10s/call vs ~20s/call) at the cost of accuracy. Change `MODEL_NAME` in `config.py` to switch.
```
nexus/
├── main.py              # Rich terminal UI + manual pipeline runner
├── graph.py             # LangGraph state machine with conditional routing
├── state.py             # AgentState TypedDict (shared pipeline state)
├── config.py            # Ollama settings, prompt templates, iteration limits
├── schemas.py           # Pydantic v2 models + JSON parser + model validator
├── requirements.txt     # Pinned dependencies
├── agents/
│   ├── planner.py       # Breaks task into execution steps
│   ├── developer.py     # Writes/fixes Python code
│   └── reviewer.py      # Validates code + output, classifies errors
├── tools/
│   └── code_runner.py   # Sandboxed subprocess execution (10s timeout)
└── .gitignore
```
- Python 3.10+
- Ollama installed and running (ollama.com)
- ~4.5GB disk space for the model weights
```shell
# 1. Clone the repo
git clone https://github.com/Swapnil-bo/nexus.git
cd nexus

# 2. Create virtual environment
python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS/Linux
source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Pull the model
ollama pull qwen2.5-coder:7b
```

```shell
# Interactive mode
python main.py

# Direct task
python main.py "Write a script that prints all prime numbers up to 50"
```

Takes the user's natural language task and produces a structured execution plan (2-10 steps). The plan guides the Developer but never contains code.
```json
{
  "steps": [
    "Define a function to convert Fahrenheit to Celsius",
    "Define a function to convert Celsius to Kelvin",
    "Call both functions with 98.6 as input",
    "Print all three temperature values"
  ]
}
```

Receives the plan (and on retries: previous code, error output, and reviewer feedback) and writes a complete, self-contained Python script. Uses context pruning — only sees the latest code and feedback, never the full error log, keeping the context window small and focused.
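Context pruning can be sketched as a prompt builder that only ever includes the most recent attempt (a hypothetical helper for illustration; NEXUS's real prompt templates live in `config.py`):

```python
def build_developer_prompt(plan, code=None, feedback=None):
    """Assemble the Developer prompt with context pruning: only the latest
    code and the latest Reviewer feedback are included, never the
    accumulated error log. (Illustrative sketch, not the actual template.)"""
    parts = ["Write a complete Python script for this plan:"]
    parts += [f"- {step}" for step in plan]
    if code and feedback:
        # Retry path: show only the most recent attempt and verdict.
        parts.append("\nYour previous attempt:\n" + code)
        parts.append("Reviewer feedback:\n" + feedback)
        parts.append("Fix the issue and return the full corrected script.")
    return "\n".join(parts)
```

Keeping the retry context to one attempt plus one piece of feedback matters for a 7B model: a growing transcript of every failed attempt would quickly crowd out the instructions.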
Executes the generated code in an isolated subprocess with:
- 10-second timeout — kills runaway scripts (infinite loops, blocking `input()` calls)
- stdout/stderr capture — both streams returned to the Reviewer
- Windows-safe temp files — manual create/write/close/execute/delete cycle (avoids Windows file locking issues with `NamedTemporaryFile`)
- Virtual environment awareness — uses `sys.executable` instead of `python` to ensure the correct interpreter
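A minimal sketch of this execution strategy, assuming a plain-dict return shape (the actual API in `code_runner.py` may differ):

```python
import os
import subprocess
import sys
import tempfile


def run_code(code: str, timeout: int = 10) -> dict:
    """Execute a Python snippet in an isolated subprocess (sketch)."""
    # Manual create/write/close cycle: NamedTemporaryFile keeps the file
    # open, which on Windows blocks a second process from reading it.
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(code)
        # sys.executable guarantees the venv's interpreter is used,
        # not whatever "python" happens to resolve to on PATH.
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return {"stdout": proc.stdout, "stderr": proc.stderr,
                "returncode": proc.returncode}
    except subprocess.TimeoutExpired:
        return {"stdout": "",
                "stderr": f"TimeoutExpired: killed after {timeout}s",
                "returncode": -1}
    finally:
        os.remove(path)
```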
Compares the execution output against the original task requirements. Returns a structured verdict:
```json
{
  "is_valid": false,
  "error_type": "logic_error",
  "feedback": "The variable 'num_digits' is not defined in the outer scope. Define it before the print statement."
}
```

Error classifications: `syntax_error`, `logic_error`, `infinite_loop`, `incomplete`, `pass`
If the Reviewer rejects the code, the Developer gets the feedback and tries again — up to 3 iterations. The loop is managed by LangGraph's conditional edges:
```
# Conditional routing after Reviewer
if is_valid         → finalize  → END (success)
if iteration >= max → fail      → END (max retries)
else                → Developer → Code Runner → Reviewer (retry)
```

Small local models are unreliable JSON producers. NEXUS handles this with multiple layers:
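The routing decision itself reduces to a small function (illustrative; the real conditional edge is wired up in `graph.py`):

```python
from typing import Literal

MAX_ITERATIONS = 3  # assumption: mirrors the documented 3-try limit


def route_after_review(state: dict) -> Literal["finalize", "fail", "developer"]:
    """Pick the next node after the Reviewer runs (sketch of the
    conditional-edge logic, not the actual graph.py code)."""
    if state["is_valid"]:
        return "finalize"   # success → END
    if state["iteration"] >= MAX_ITERATIONS:
        return "fail"       # retries exhausted → END
    return "developer"      # loop back for another fix attempt
```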
Every Ollama call uses `format="json"`, which constrains the model's token generation to valid JSON syntax.
The `parse_json_response()` helper strips markdown wrappers, extracts JSON from garbage text, and handles common formatting issues:
````
# Handles: ```json { ... } ```, leading text { ... } trailing text, etc.
````

Every parsed response is validated against strict Pydantic v2 schemas with field constraints (list lengths, literal types, required fields).
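A stdlib-only sketch of the JSON-extraction helper described above (the real `parse_json_response()` in `schemas.py` may differ):

```python
import json
import re


def parse_json_response(text: str) -> dict:
    """Extract the first JSON object from a raw model response (sketch)."""
    # Strip ```json ... ``` markdown fences if present.
    text = re.sub(r"```(?:json)?", "", text)
    # Keep everything from the first "{" to the last "}" so that
    # leading/trailing chatter around the object is discarded.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in response")
    return json.loads(text[start:end + 1])
```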
A `@model_validator` on `ReviewerOutput` auto-corrects common model quirks:

| Model Returns | Auto-Corrected To |
|---|---|
| `is_valid: true`, `error_type: ""` | `error_type: "pass"` |
| `is_valid: false`, `error_type: "FileNotFoundError"` | `error_type: "logic_error"` |
| `is_valid: false`, `error_type: "SyntaxError"` | `error_type: "syntax_error"` |
| `error_type: "pass"`, `is_valid: false` | `is_valid: true` |
| `is_valid: true`, `error_type: "logic_error"` | `is_valid: false` (trusts the error) |
This prevents valid code from being falsely rejected due to schema validation failures.
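The auto-correction table can be implemented with a Pydantic v2 `@model_validator`. This is an illustrative reconstruction, not the exact code from `schemas.py`:

```python
from pydantic import BaseModel, model_validator

KNOWN = {"syntax_error", "logic_error", "infinite_loop", "incomplete", "pass"}


class ReviewerOutput(BaseModel):
    """Illustrative reconstruction of the Reviewer schema."""
    is_valid: bool
    error_type: str = ""
    feedback: str = ""

    @model_validator(mode="after")
    def reconcile(self) -> "ReviewerOutput":
        # Map invented or empty error types into the known vocabulary.
        if self.error_type not in KNOWN:
            if not self.error_type and self.is_valid:
                self.error_type = "pass"          # "" + valid → "pass"
            elif "syntax" in self.error_type.lower():
                self.error_type = "syntax_error"  # "SyntaxError" → syntax_error
            else:
                self.error_type = "logic_error"   # e.g. "FileNotFoundError"
        # Resolve contradictions, trusting the error type over the boolean.
        if self.error_type == "pass" and not self.is_valid:
            self.is_valid = True
        elif self.error_type != "pass" and self.is_valid:
            self.is_valid = False
        return self
```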
Comprehensive testing across 13 tasks with both the 3b and 7b models:
| Task | Result | Iterations | Time | Notes |
|---|---|---|---|---|
| Sum 1 to 10 | ✅ Pass | 1/3 | 12.3s | Clean first-try |
| Multiplication table | ✅ Pass | 1/3 | 16.0s | Clean first-try |
| Factorial (recursion) | ✅ Pass | 1/3 | ~20s | Clean first-try |
| Find duplicates in list | ✅ Pass | 1/3 | 41.5s | Clean first-try |
| Temperature conversion | ✅ Pass | 1/3 | 29.6s | Float math, multi-function |
| Student dictionaries (highest grade) | ✅ Pass | 1/3 | 29.2s | Nested data structures |
| Multiplication grid (formatted) | ✅ Pass | 1/3 | 21.2s | Formatted print output |
| Armstrong number check | ✅ Pass | 2/3 | 65.5s | Self-corrected scoping bug |
| Password generator (simple) | ✅ Pass | 2/3 | 51.4s | Self-corrected variable name |
| Second largest number | ❌ Fail | 3/3 | 88.9s | Reviewer hallucination |
| String multi-operation | ❌ Fail | 3/3 | 58.7s | Task ambiguity + Reviewer confusion |
| File read (data.txt) | ❌ Fail | 3/3 | 42.9s | Expected: file doesn't exist in sandbox |
| Even/odd (user input) | ❌ Fail | 3/3 | 86.4s | Expected: input() blocks in sandbox |
Pass rate: 9/13 (69%), with 2 of the 4 failures being expected sandbox limitations
| Metric | qwen2.5-coder:3b | qwen2.5-coder:7b |
|---|---|---|
| Pass rate | ~50% | ~69% |
| Avg time per call | 5-10s | 15-25s |
| Total task time | ~30s | ~30-60s |
| Reviewer accuracy | Frequent hallucinations | Mostly accurate |
| VRAM usage | ~2GB | ~4.5GB |
| Self-correction | Occasionally works | Reliably works |
- Reviewer hallucination: The Reviewer sometimes rejects correct code by inventing bugs that don't exist (e.g., claiming a duplicate-handling algorithm doesn't handle duplicates, while the code clearly does). This is inherent to 7B-parameter model reasoning.
- Task ambiguity: Multi-step tasks with ambiguous requirements (e.g., "reverse AND uppercase AND capitalize") can confuse both the Developer and Reviewer about what the correct output should be.
- Multi-line strings: The 3b model sometimes merges lines in f-strings, producing syntax errors it can't debug. The 7b model handles this correctly.
- No interactive input: `input()` blocks forever in a headless subprocess. The Developer is instructed to hardcode example values instead.
- No filesystem access: Generated scripts run in the system temp directory with no access to project files.
- No network access: Scripts cannot make HTTP requests or access external APIs.
- 10-second timeout: Long-running computations will be killed.
- No error log feedback: The error log is kept for the UI only and never fed back to agents, preventing context window pollution but also preventing the system from learning across iterations.
- Single file output: NEXUS generates single Python scripts only — no multi-file projects.
- No dependency installation: Only Python standard library is available in the sandbox.
| Bug | Root Cause | Fix |
|---|---|---|
| Planner returning 7 steps crashed validation | `max_length=6` too strict | Raised to `max_length=10` |
| Correct code rejected 3 times (factorial) | Reviewer returned `error_type: ""` instead of `"pass"` | Added `@model_validator` to auto-coerce empty → `"pass"` |
| `FileNotFoundError` crashed Reviewer parsing | Reviewer invented `error_type: "FileNotFoundError"` | Extended validator with fuzzy keyword matching to map unknown types |
| `input()` caused 10s timeout every time | Sandbox has no stdin | Added "CRITICAL: NEVER use input()" to Developer prompt |
| Developer & Reviewer fought over `input()` | Reviewer rejected hardcoded values | Added sandbox awareness to Reviewer prompt |
| Windows file locking errors | `NamedTemporaryFile` doesn't work on Windows | Manual create/write/close/execute/delete pattern |
| Wrong model loaded | Used `qwen2.5:3b` (general) instead of coder variant | Corrected to `qwen2.5-coder:7b` |
NEXUS demonstrates several concepts that are directly relevant to production AI engineering:
- Multi-agent orchestration — Four specialized agents collaborating through a shared state machine, each with distinct responsibilities and schemas.
- Self-correcting systems — The feedback loop between Reviewer and Developer mirrors real-world CI/CD patterns where failing tests trigger code fixes.
- Small model engineering — Making a 7B-parameter model (100x smaller than GPT-4) reliably produce structured, correct output through prompt engineering, schema validation, and defensive parsing.
- Local-first AI — Zero cloud dependency, zero cost, complete data privacy. The entire system runs on a $300 GPU.
- Production hardening — JSON bulletproofing, Windows compatibility, timeout protection, graceful error handling — the unglamorous work that separates prototypes from real tools.
- Swap to a larger model (e.g., `qwen2.5-coder:14b` with quantization) for better Reviewer accuracy
- Add a test-generation agent that writes unit tests before the Developer writes code
- Multi-file project support with a virtual filesystem in the sandbox
- Web UI with FastAPI + WebSocket for real-time streaming of agent outputs
- Model-agnostic backend — support for any Ollama model, llama.cpp, or remote APIs
- Persistent memory — learn from past successes/failures across sessions
```
langchain==0.3.25
langchain-core==0.3.61
langchain-ollama==0.2.3
langgraph==0.4.1
pydantic==2.11.3
rich==14.0.0
```
Built with 🧠 by Swapnil as part of the 100 Days of Vibe Coding challenge.
Powered by Ollama, LangGraph, and stubbornness.