
NEXUS

Neural EXecution & Understanding System

A local multi-agent AI pipeline that plans, writes, executes, and self-corrects Python code — powered entirely by open-source models running on consumer hardware.



What is NEXUS?

NEXUS is a self-correcting code generation system that uses four specialized AI agents working in a loop to turn natural language descriptions into working Python scripts. Unlike ChatGPT or Copilot, NEXUS runs 100% locally — no API keys, no cloud, no subscription fees, no data leaving your machine.

You describe what you want. NEXUS plans the approach, writes the code, executes it in a sandbox, reviews the output, and if something's wrong, fixes it automatically. The loop runs up to 3 iterations until the code works or fails gracefully.

You: "Write a script that checks if 153 is an Armstrong number"

NEXUS: Plans → Writes code → Runs it → NameError detected →
       Reviews → "num_digits not in scope" → Fixes it →
       Runs again → ✅ "153 is an Armstrong number" → Done.

Demo

➜ Write a script that generates a random password of 16 characters containing
  uppercase, lowercase, digits, and special characters, then prints the password
  and its strength rating

🧠 Planner    → 5-step execution plan
💻 Developer  → 49-line script with generate_password() and calculate_strength()
🔧 Runner     → NameError on line 38: 'all_uppercase' not defined
🔍 Reviewer   → ❌ REJECTED — "use 'has_uppercase' instead of 'all_uppercase'"
💻 Developer  → Fixed variable name
🔧 Runner     → Password: "clJjSU23GG05-kS(" | Strength: "Strong"
🔍 Reviewer   → ✅ APPROVED
🏁 Result     → SUCCESS in 51.4s (2/3 iterations)

Architecture

                    ┌─────────────┐
                    │  User Task  │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │   PLANNER   │  Breaks task into 2-10 steps
                    │   Agent     │  (structured JSON plan)
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
              ┌────►│  DEVELOPER  │  Writes complete Python script
              │     │   Agent     │  (fixes bugs on retry)
              │     └──────┬──────┘
              │            │
              │     ┌──────▼──────┐
              │     │ CODE RUNNER │  Executes in sandboxed subprocess
              │     │  (Sandbox)  │  (10s timeout, captures stdout/stderr)
              │     └──────┬──────┘
              │            │
              │     ┌──────▼──────┐
              │     │  REVIEWER   │  Validates output against task
              │     │   Agent     │  (classifies errors, gives feedback)
              │     └──────┬──────┘
              │            │
              │      ┌─────▼─────┐
              │      │  Pass?    │
              │      └─────┬─────┘
              │        No  │  Yes
              │   (max 3)  │
              └────────────┘  ──────►  ✅ Final Output

Each agent is a separate module with its own Pydantic schema for structured output validation. The pipeline is orchestrated by LangGraph as a state machine with conditional routing.


Tech Stack

| Component | Technology | Purpose |
|---|---|---|
| LLM | qwen2.5-coder:7b via Ollama | All agent reasoning (planning, coding, reviewing) |
| Orchestration | LangGraph 0.4.1 | State machine with conditional edges |
| Schema Validation | Pydantic v2 | Structured JSON output from every agent |
| LLM Integration | LangChain + langchain-ollama | Ollama API wrapper with format="json" |
| Code Execution | subprocess (sandboxed) | Isolated Python execution with timeout |
| Terminal UI | Rich | Panels, syntax highlighting, spinners, tables |
| Language | Python 3.10+ | Everything |

Hardware used during development:

| Component | Spec |
|---|---|
| CPU | Intel i3-12100 |
| GPU | NVIDIA RTX 3050 6GB |
| RAM | 8GB DDR4 |
| OS | Windows 11 |

Note: NEXUS also supports qwen2.5-coder:3b for faster inference (~10s/call vs ~20s/call) at the cost of accuracy. Change MODEL_NAME in config.py to switch.
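Model selection lives in config.py. A minimal sketch of what the relevant settings might look like; only MODEL_NAME is named in this README, and the other constants are illustrative, drawn from limits described elsewhere in this document:

```python
# Hypothetical sketch of config.py settings. Only MODEL_NAME is named in the
# README; the other names are illustrative.
MODEL_NAME = "qwen2.5-coder:7b"   # switch to "qwen2.5-coder:3b" for faster inference
MAX_ITERATIONS = 3                # self-correction retry budget
EXECUTION_TIMEOUT = 10            # seconds before the sandbox kills a script
```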


Project Structure

nexus/
├── main.py                  # Rich terminal UI + manual pipeline runner
├── graph.py                 # LangGraph state machine with conditional routing
├── state.py                 # AgentState TypedDict (shared pipeline state)
├── config.py                # Ollama settings, prompt templates, iteration limits
├── schemas.py               # Pydantic v2 models + JSON parser + model validator
├── requirements.txt         # Pinned dependencies
├── agents/
│   ├── planner.py           # Breaks task into execution steps
│   ├── developer.py         # Writes/fixes Python code
│   └── reviewer.py          # Validates code + output, classifies errors
├── tools/
│   └── code_runner.py       # Sandboxed subprocess execution (10s timeout)
└── .gitignore

Quick Start

Prerequisites

  • Python 3.10+
  • Ollama installed and running (ollama.com)
  • ~4.5GB disk space for the model weights

Installation

# 1. Clone the repo
git clone https://github.com/Swapnil-bo/nexus.git
cd nexus

# 2. Create virtual environment
python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS/Linux
source .venv/bin/activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Pull the model
ollama pull qwen2.5-coder:7b

Usage

# Interactive mode
python main.py

# Direct task
python main.py "Write a script that prints all prime numbers up to 50"

How It Works (Deep Dive)

1. Planner Agent

Takes the user's natural language task and produces a structured execution plan (2-10 steps). The plan guides the Developer but never contains code.

{
  "steps": [
    "Define a function to convert Fahrenheit to Celsius",
    "Define a function to convert Celsius to Kelvin",
    "Call both functions with 98.6 as input",
    "Print all three temperature values"
  ]
}
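The plan's shape constraints (2-10 steps, text only, no code) can be sketched in plain Python. The project itself enforces this with a Pydantic v2 schema in schemas.py; this function just illustrates the rules:

```python
# Sketch of the plan-shape rules described above (2-10 steps, non-empty text).
# The real project validates this via a Pydantic v2 model in schemas.py.
def validate_plan(plan: dict) -> list[str]:
    steps = plan.get("steps")
    if not isinstance(steps, list):
        raise ValueError("plan must contain a 'steps' list")
    if not 2 <= len(steps) <= 10:
        raise ValueError(f"expected 2-10 steps, got {len(steps)}")
    if not all(isinstance(s, str) and s.strip() for s in steps):
        raise ValueError("every step must be a non-empty string")
    return steps
```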

2. Developer Agent

Receives the plan (and on retries: previous code, error output, and reviewer feedback) and writes a complete, self-contained Python script. Uses context pruning — only sees the latest code and feedback, never the full error log, keeping the context window small and focused.

3. Code Runner (Sandbox)

Executes the generated code in an isolated subprocess with:

  • 10-second timeout — kills runaway scripts (infinite loops, blocking input() calls)
  • stdout/stderr capture — both streams returned to the Reviewer
  • Windows-safe temp files — manual create/write/close/execute/delete cycle (avoids Windows file locking issues with NamedTemporaryFile)
  • Virtual environment awareness — uses sys.executable instead of python to ensure the correct interpreter
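The runner's behavior can be sketched in a few lines; this is a simplified stand-in for the project's tools/code_runner.py, showing the manual temp-file cycle, sys.executable, and the timeout:

```python
import os
import subprocess
import sys
import tempfile

# Simplified sketch of the sandboxed runner described above: manual
# create/write/close/execute/delete (Windows-safe), sys.executable for
# virtual-environment awareness, and a hard timeout.
def run_code(code: str, timeout: int = 10) -> tuple[str, str, int]:
    fd, path = tempfile.mkstemp(suffix=".py")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(code)                 # write and close BEFORE executing
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=timeout)
        return result.stdout, result.stderr, result.returncode
    except subprocess.TimeoutExpired:
        return "", f"Timed out after {timeout}s", -1
    finally:
        os.remove(path)                   # always clean up the temp file
```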

4. Reviewer Agent

Compares the execution output against the original task requirements. Returns a structured verdict:

{
  "is_valid": false,
  "error_type": "logic_error",
  "feedback": "The variable 'num_digits' is not defined in the outer scope. Define it before the print statement."
}

Error classifications: syntax_error, logic_error, infinite_loop, incomplete, pass

5. Self-Correction Loop

If the Reviewer rejects the code, the Developer gets the feedback and tries again — up to 3 iterations. The loop is managed by LangGraph's conditional edges:

# Conditional routing after Reviewer
if is_valid          →  finalize  →  END  (success)
if iteration >= max  →  fail      →  END  (max retries)
else                 →  Developer → Code Runner → Reviewer  (retry)
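The same decision, written as a plain routing function. The state keys (is_valid, iteration) are illustrative; in graph.py a function like this would be wired in via LangGraph's add_conditional_edges:

```python
# Sketch of the post-Reviewer routing decision. State keys are illustrative;
# in the real graph this is attached with LangGraph's add_conditional_edges.
MAX_ITERATIONS = 3

def route_after_review(state: dict) -> str:
    if state["is_valid"]:
        return "finalize"      # success -> END
    if state["iteration"] >= MAX_ITERATIONS:
        return "fail"          # retry budget exhausted -> END
    return "developer"         # loop back: Developer -> Code Runner -> Reviewer
```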

JSON Bulletproofing

Small local models are unreliable JSON producers. NEXUS handles this with multiple layers:

Layer 1: Forced JSON Mode

Every Ollama call uses format="json", which constrains the model's token generation to valid JSON syntax.

Layer 2: Response Cleaning

The parse_json_response() helper strips markdown wrappers, extracts JSON from garbage text, and handles common formatting issues:

# Handles: ```json { ... } ```, leading text { ... } trailing text, etc.
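One way to implement this cleaning step: strip markdown fences, then extract the first balanced {...} object from the surrounding text. A sketch of the idea (the project's parse_json_response() in schemas.py plays this role; the exact implementation may differ, and this version ignores braces inside strings):

```python
import json
import re

# Sketch of the response-cleaning layer: drop markdown fences, then pull the
# first balanced {...} object out of any surrounding noise. Simplified: brace
# counting here ignores braces that appear inside JSON string values.
def parse_json_response(text: str) -> dict:
    text = re.sub(r"```(?:json)?", "", text)   # strip markdown code fences
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start:i + 1])
    raise ValueError("unbalanced JSON object")
```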

Layer 3: Pydantic Validation

Every parsed response is validated against strict Pydantic v2 schemas with field constraints (list lengths, literal types, required fields).

Layer 4: Model Validator (Smart Coercion)

A @model_validator on ReviewerOutput auto-corrects common model quirks:

| Model Returns | Auto-Corrected To |
|---|---|
| `is_valid: true`, `error_type: ""` | `error_type: "pass"` |
| `is_valid: false`, `error_type: "FileNotFoundError"` | `error_type: "logic_error"` |
| `is_valid: false`, `error_type: "SyntaxError"` | `error_type: "syntax_error"` |
| `error_type: "pass"`, `is_valid: false` | `is_valid: true` |
| `is_valid: true`, `error_type: "logic_error"` | `is_valid: false` (trusts the error) |

This prevents valid code from being falsely rejected due to schema validation failures.
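The coercion rules above can be expressed as a small normalization function. The project implements them as a Pydantic v2 @model_validator on ReviewerOutput; this stdlib sketch just mirrors the table:

```python
# Sketch of the coercion rules from the table above, as a plain function.
# The project implements this as a Pydantic v2 @model_validator on
# ReviewerOutput; this version only mirrors the documented behavior.
KNOWN_TYPES = {"syntax_error", "logic_error", "infinite_loop", "incomplete", "pass"}

def normalize_verdict(is_valid: bool, error_type: str) -> tuple[bool, str]:
    if is_valid and not error_type:
        return True, "pass"              # empty type on success -> "pass"
    if error_type == "SyntaxError":
        return False, "syntax_error"     # map the Python exception name
    if error_type not in KNOWN_TYPES:
        return False, "logic_error"      # unknown types (e.g. FileNotFoundError)
    if error_type == "pass":
        return True, "pass"              # "pass" overrides is_valid=False
    return False, error_type             # a real error type overrides is_valid=True
```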


Stress Test Results

Comprehensive testing across 14 tasks with both 3b and 7b models:

qwen2.5-coder:7b (Primary Model)

| Task | Result | Iterations | Time | Notes |
|---|---|---|---|---|
| Sum 1 to 10 | ✅ Pass | 1/3 | 12.3s | Clean first-try |
| Multiplication table | ✅ Pass | 1/3 | 16.0s | Clean first-try |
| Factorial (recursion) | ✅ Pass | 1/3 | ~20s | Clean first-try |
| Find duplicates in list | ✅ Pass | 1/3 | 41.5s | Clean first-try |
| Temperature conversion | ✅ Pass | 1/3 | 29.6s | Float math, multi-function |
| Student dictionaries (highest grade) | ✅ Pass | 1/3 | 29.2s | Nested data structures |
| Multiplication grid (formatted) | ✅ Pass | 1/3 | 21.2s | Formatted print output |
| Armstrong number check | ✅ Pass | 2/3 | 65.5s | Self-corrected scoping bug |
| Password generator (simple) | ✅ Pass | 2/3 | 51.4s | Self-corrected variable name |
| Second largest number | ❌ Fail | 3/3 | 88.9s | Reviewer hallucination |
| String multi-operation | ❌ Fail | 3/3 | 58.7s | Task ambiguity + Reviewer confusion |
| File read (data.txt) | ❌ Fail | 3/3 | 42.9s | Expected: file doesn't exist in sandbox |
| Even/odd (user input) | ❌ Fail | 3/3 | 86.4s | Expected: input() blocks in sandbox |

Pass Rate: 9/13 (69%) — with 2 failures being expected sandbox limitations

3b vs 7b Comparison

| Metric | qwen2.5-coder:3b | qwen2.5-coder:7b |
|---|---|---|
| Pass rate | ~50% | ~69% |
| Avg time per call | 5-10s | 15-25s |
| Total task time | ~30s | ~30-60s |
| Reviewer accuracy | Frequent hallucinations | Mostly accurate |
| VRAM usage | ~2GB | ~4.5GB |
| Self-correction | Occasionally works | Reliably works |

Known Limitations

Model Limitations (7b ceiling)

  • Reviewer hallucination: The Reviewer sometimes rejects correct code by inventing bugs that don't exist (e.g., claiming a duplicate-handling algorithm doesn't handle duplicates, while the code clearly does). This is inherent to 7B-parameter model reasoning.
  • Task ambiguity: Multi-step tasks with ambiguous requirements (e.g., "reverse AND uppercase AND capitalize") can confuse both the Developer and Reviewer about what the correct output should be.
  • Multi-line strings: The 3b model sometimes merges lines in f-strings, producing syntax errors it can't debug. The 7b model handles this correctly.

Sandbox Limitations

  • No interactive input: input() blocks forever in a headless subprocess. The Developer is instructed to hardcode example values instead.
  • No filesystem access: Generated scripts run in the system temp directory with no access to project files.
  • No network access: Scripts cannot make HTTP requests or access external APIs.
  • 10-second timeout: Long-running computations will be killed.

Architecture Limitations

  • No error log feedback: The error log is kept for the UI only and never fed back to agents, preventing context window pollution but also preventing the system from learning across iterations.
  • Single file output: NEXUS generates single Python scripts only — no multi-file projects.
  • No dependency installation: Only Python standard library is available in the sandbox.

Bugs Found & Fixed During Development

| Bug | Root Cause | Fix |
|---|---|---|
| Planner returning 7 steps crashed validation | `max_length=6` too strict | Raised to `max_length=10` |
| Correct code rejected 3 times (factorial) | Reviewer returned `error_type: ""` instead of `"pass"` | Added `@model_validator` to auto-coerce empty → `"pass"` |
| `FileNotFoundError` crashed Reviewer parsing | Reviewer invented `error_type: "FileNotFoundError"` | Extended validator with fuzzy keyword matching to map unknown types |
| `input()` caused 10s timeout every time | Sandbox has no stdin | Added "CRITICAL: NEVER use `input()`" to Developer prompt |
| Developer & Reviewer fought over `input()` | Reviewer rejected hardcoded values | Added sandbox awareness to Reviewer prompt |
| Windows file locking errors | `NamedTemporaryFile` doesn't work on Windows | Manual create/write/close/execute/delete pattern |
| Wrong model loaded | Used `qwen2.5:3b` (general) instead of coder variant | Corrected to `qwen2.5-coder:7b` |

Why This Project Matters

NEXUS demonstrates several concepts that are directly relevant to production AI engineering:

  1. Multi-agent orchestration — Four specialized agents collaborating through a shared state machine, each with distinct responsibilities and schemas.

  2. Self-correcting systems — The feedback loop between Reviewer and Developer mirrors real-world CI/CD patterns where failing tests trigger code fixes.

  3. Small model engineering — Making a 7B-parameter model (100x smaller than GPT-4) reliably produce structured, correct output through prompt engineering, schema validation, and defensive parsing.

  4. Local-first AI — Zero cloud dependency, zero cost, complete data privacy. The entire system runs on a $300 GPU.

  5. Production hardening — JSON bulletproofing, Windows compatibility, timeout protection, graceful error handling — the unglamorous work that separates prototypes from real tools.


Future Improvements

  • Swap to a larger model (e.g., qwen2.5-coder:14b with quantization) for better Reviewer accuracy
  • Add a test-generation agent that writes unit tests before the Developer writes code
  • Multi-file project support with a virtual filesystem in the sandbox
  • Web UI with FastAPI + WebSocket for real-time streaming of agent outputs
  • Model-agnostic backend — support for any Ollama model, llama.cpp, or remote APIs
  • Persistent memory — learn from past successes/failures across sessions

Dependencies

langchain==0.3.25
langchain-core==0.3.61
langchain-ollama==0.2.3
langgraph==0.4.1
pydantic==2.11.3
rich==14.0.0

Built with 🧠 by Swapnil as part of the 100 Days of Vibe Coding challenge.
Powered by Ollama, LangGraph, and stubbornness.
