CodeMaestro Alpha

Reliable, auditable AI coding execution. CodeMaestro turns developer intent into bounded execution phases where every action is traceable, every write is gated, and “done” is verified instead of claimed.

  • Version: 0.1.0-alpha
  • CLI command: red-exec
  • Default model: deepseek-chat
  • Language: JavaScript (CommonJS)

CodeMaestro is an alpha AI coding tool built around governed execution instead of open-ended agent loops. You give it a task and a bounded file list. The system decomposes the task, generates code in isolated sections, and checks structural invariants before writing. The core idea comes from accounting controls: the model should not be the same actor that proposes, verifies, writes, and declares completion.


Why governed execution?

Most AI coding agents let the same model plan, write, verify, retry, and declare completion. From an accounting controls perspective, that is a control failure. CodeMaestro explores a different model: bounded execution, constrained attention, invariant gates, and auditable writes.

For the full argument, see PHILOSOPHY.md.
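
To make the invariant-gate idea concrete, here is a minimal illustrative sketch in plain Node.js of the kind of structural check a write gate can enforce: generated code must not silently drop exports that existed before. This is an assumption-level sketch with hypothetical names, not CodeMaestro's actual implementation:

// Sketch only: one invariant a write gate might enforce (hypothetical code,
// not CodeMaestro internals). Collect `exports.name = ...` assignments.
const collectExports = (source) => {
  const names = new Set();
  const re = /(?:module\.)?exports\.([A-Za-z_$][\w$]*)\s*=/g;
  let match;
  while ((match = re.exec(source)) !== null) names.add(match[1]);
  return names;
};

// Block the write if the generated file removes any previously exported name.
function gateWrite(originalSource, generatedSource) {
  const before = collectExports(originalSource);
  const after = collectExports(generatedSource);
  const removed = [...before].filter((name) => !after.has(name));
  return removed.length > 0
    ? { ok: false, reason: 'removed exports: ' + removed.join(', ') }
    : { ok: true };
}

// Example: a rewrite that drops `sanitize` is blocked instead of written.
const original = 'exports.validate = () => true;\nexports.sanitize = (s) => s;';
const generated = 'exports.validate = () => true;';
console.log(gateWrite(original, generated)); // { ok: false, reason: 'removed exports: sanitize' }

The point is the separation of duties: the generator proposes, and a deterministic check decides whether the write happens.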


Demo: Side-by-side Benchmark

This short demo compares CodeMaestro Alpha against a Claude Code baseline on the same module-split benchmark. Skip to the end for the results.

Watch the demo video

In this run, both systems completed the benchmark successfully:

| System | Score | Wall Time | LLM Calls | Tokens | Estimated Cost |
| --- | --- | --- | --- | --- | --- |
| Claude baseline | 20 / 20 | 59.0s | 1 | 207,914 | ~$0.1158 |
| CodeMaestro Alpha | 20 / 20 | 33.4s | 2 | 4,159 | ~$0.0006 |

Result: CodeMaestro Alpha completed the same benchmark with approximately 50x fewer tokens and was approximately 193x cheaper in this run.

The goal of this demo is not to claim universal performance across all coding tasks. It shows the intended execution model: bounded context, structured execution, and deterministic benchmark completion with substantially lower token usage.


Benchmark Summary

| Metric | JavaScript (Polyglot) | Multi-File (Fixtures) |
| --- | --- | --- |
| Pass Rate | 100% (49/49) | 100% (44/44) |
| Total Cost (USD) | $0.0949 | $0.0051 |
| Avg. Cost per Task | ~$0.0019 | ~$0.0017 |
| LLM Calls | 406 | 21 |
| Total Tokens | 678,155 | 36,413 |
| Wall Time | 66m 02s | 2m 52s |

Note: Costs are calculated using DeepSeek-Chat at $0.14/M tokens.
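For example, the polyglot total works out as 678,155 tokens × $0.14 / 1,000,000 ≈ $0.0949, and the multi-file total as 36,413 × $0.14 / 1,000,000 ≈ $0.0051.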


Quick Start (Standalone Binary)

Download one file, set your API key, run. (Benchmarks require Node.js 18+ — see Prerequisites.)

1. Pick your binary

| Platform | File | Size | Link |
| --- | --- | --- | --- |
| Windows x64 | red-exec-win-x64.exe | ~101 MB | https://github.com/CodeMaestro-AI/CodeMaestro/releases/download/alpha-0.1.0/red-exec-win-x64.exe |
| Linux x64 (Ubuntu, Debian, etc.) | red-exec-linux-x64 | ~109 MB | https://github.com/CodeMaestro-AI/CodeMaestro/releases/download/alpha-0.1.0/red-exec-linux-x64 |
| macOS x64 (Intel, or Apple Silicon via Rosetta 2) | red-exec-macos-x64 | ~113 MB | https://github.com/CodeMaestro-AI/CodeMaestro/releases/download/alpha-0.1.0/red-exec-macos-x64 |
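
On Linux/macOS you can fetch the binary from the command line; for example (Linux x64 shown -- substitute the URL for your platform):

curl -LO https://github.com/CodeMaestro-AI/CodeMaestro/releases/download/alpha-0.1.0/red-exec-linux-x64
chmod +x ./red-exec-linux-x64
./red-exec-linux-x64 --version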

2. Set your API key

You need at least one API key. Pick your provider:

| Provider | Model | Env variable | Status |
| --- | --- | --- | --- |
| DeepSeek | deepseek-chat (default) | DEEPSEEK_API_KEY | Recommended. All benchmarks use this model. |
| OpenAI | gpt-4.1-mini, gpt-4.1, gpt-4o | OPENAI_API_KEY | Supported but not well-tested. Fallback only. |

Get a DeepSeek key at https://platform.deepseek.com/. OpenAI keys at https://platform.openai.com/api-keys.

Windows (PowerShell):

# DeepSeek (default model, cheapest)
$env:DEEPSEEK_API_KEY = "your-key-here"

# Or OpenAI
$env:OPENAI_API_KEY = "your-key-here"

To persist across sessions, add it to your system environment variables:

[System.Environment]::SetEnvironmentVariable("DEEPSEEK_API_KEY", "your-key-here", "User")
# or
[System.Environment]::SetEnvironmentVariable("OPENAI_API_KEY", "your-key-here", "User")

Linux / macOS:

# DeepSeek (default model, cheapest)
export DEEPSEEK_API_KEY="your-key-here"

# Or OpenAI
export OPENAI_API_KEY="your-key-here"

To persist, add the line to ~/.bashrc, ~/.zshrc, or ~/.profile.
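
For example, with bash:

echo 'export DEEPSEEK_API_KEY="your-key-here"' >> ~/.bashrc
source ~/.bashrc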

Which should I use? DeepSeek-chat is the default and the model the pipeline was built and tuned for. All benchmark results in this document use DeepSeek-chat. OpenAI models are supported but not well-tested -- expect lower consistency, higher cost, and occasional prompt-format mismatches. Use --model gpt-4.1-mini only as a fallback if DeepSeek is unavailable.

3. Make it executable (Linux / macOS only)

chmod +x red-exec-linux-x64    # or red-exec-macos-x64

4. Run

Windows:

# With DeepSeek (default)
.\red-exec-win-x64.exe "Add health endpoint" --files src/server.js

# With OpenAI
.\red-exec-win-x64.exe "Add health endpoint" --files src/server.js --model gpt-4.1-mini

Linux / macOS:

# With DeepSeek (default)
./red-exec-linux-x64 "Add health endpoint" --files src/server.js

# With OpenAI
./red-exec-linux-x64 "Add health endpoint" --files src/server.js --model gpt-4.1-mini

That's it. See Running Tasks and CLI Reference below for the full command set.

macOS Gatekeeper / “developer cannot be verified”

macOS may block red-exec on first run with:

“red-exec” cannot be opened because the developer cannot be verified.

This is expected for unsigned alpha binaries downloaded directly.

Option A (recommended): allow via System Settings
  1. Try to open red-exec once (so macOS records the block).
  2. Go to System Settings → Privacy & Security.
  3. Scroll down to the Security section.
  4. Click Open Anyway next to the red-exec warning.
Option B (Terminal): remove the quarantine attribute
cd <folder-containing-red-exec>
xattr -dr com.apple.quarantine ./red-exec
chmod +x ./red-exec
./red-exec --version

What You Can Use Alpha For

The alpha is strongest at creating new code from a clear spec and restructuring existing code into new modules. These are the tasks where the pipeline consistently produces production-quality output.

Strong (high confidence, proven at scale)

| Task type | Example | Evidence |
| --- | --- | --- |
| Create a new module from spec | "Create a validation module that exports validate, sanitize, RULES" | 49/49 polyglot, 44/44 multifile -- all CREATE-heavy |
| Split a large file into smaller modules | "Extract the database layer from app.js into db.js and queries.js" | module-split benchmark: 20/20 checks, 7/7 runs |
| Migrate code between patterns | "Move inline SQL queries into a separate query builder module" | tool-migration benchmark: 13/13 checks, 7/7 runs |
| Create a skeleton with wiring | "Create a pipeline with stages, a runner, and an index that exports everything" | pipeline-skeleton benchmark: 11/11 checks, 7/7 runs |
| Implement algorithmic logic | "Implement bowling scoring, constraint solvers, reactive streams, parsers" | 49 diverse exercises at 100% pass rate |
| Single-file feature additions | "Add a health endpoint to this Express server" | Well-scoped single-file tasks with clear inputs |

Moderate (works, but review the output)

| Task type | What to watch for |
| --- | --- |
| Multi-file tasks where files call each other | Cross-module wiring (imports, method signatures) is correct in ~7/8 runs. Mechanical post-processing handles most cases, but novel integration patterns may need a manual fix. |
| Tasks involving template literals or complex string generation | The LLM occasionally produces unescaped backticks inside template literals (~1 in 8 runs). Built-in retry usually recovers. |
| Modifying existing files | The pipeline can modify existing files (proven on files up to 937 lines), but the MODIFY path is less reliable than CREATE for complex changes. Well-scoped modifications (e.g., add a method, extract a section) work reliably. Broad behavioral changes across many functions may need review or a re-run. |
| CommonJS projects | DeepSeek-chat sometimes outputs ESM syntax (export default) despite explicit CJS instructions (~1 in 8 runs). If this happens, re-run or use --model gpt-4.1-mini. |

What Alpha Cannot Do Well (Yet)

Be honest with yourself about these -- using the alpha for these tasks will produce poor results or waste time.

| Task type | Why it struggles | What to do instead |
| --- | --- | --- |
| Cross-cutting changes across 3+ existing files | The pipeline sees only the files you pass to --files. It has no codebase-wide understanding of implicit dependencies, call graphs, or side effects. | Break the work into smaller tasks, or use an agentic tool that can explore the codebase. |
| Ambiguous or exploratory tasks | "Make this faster," "improve the error handling," "refactor this to be cleaner" -- the pipeline needs a concrete spec, not a direction. It executes; it does not explore. | Decide what you want first (with Cursor/Claude), then hand the spec to red-exec. |
| Tasks requiring deep domain context | The pipeline knows only what you tell it via --files and the task description. It does not read your README, your tests, your architecture docs, or your commit history. | Provide context via skill files (.md files with conventions, patterns, constraints; see the sketch after this table). |
| Non-JavaScript languages | Alpha only supports JavaScript (CommonJS). No TypeScript, Python, or other languages yet. | Wait for post-alpha language support. |
| Test generation | Not validated. The pipeline generates implementation code, not test code. | Use your existing test workflow. |
| Large-scale refactoring (100+ line files with many interdependencies) | Works for well-scoped extractions (proven up to 937-line files). Struggles when the refactoring requires understanding implicit contracts across many modules. | Break into phases: extract first (CREATE), then integrate (smaller MODIFYs). |
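
As a concrete illustration of the skill-file idea from the table above, a small conventions file might look like the following. The filename and contents are hypothetical -- this README does not define a skill-file schema, only that skill files are .md files carrying conventions, patterns, and constraints:

# conventions.md (hypothetical skill file)
- Module system: CommonJS only. Use require()/module.exports, never import/export.
- Every public function must be exported by name from module.exports.
- Cross-module references: require sibling modules with relative paths (./db, ./queries).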

The Golden Rule

The pipeline is a code generator, not a code architect. You decide what to build, which files are involved, and what the output should look like. The pipeline generates the code mechanically, with structural safety guarantees. The better your spec, the better the output.

If you find yourself writing a paragraph-long task description to explain what you want -- that's a sign you should break the task into smaller pieces.


Prerequisites

For running tasks: No prerequisites. The binary is standalone -- no Node.js, no package manager, no dependencies.

For running benchmarks (--benchmark): Node.js 18+ and npm must be installed on the target machine. The benchmark runner shells out to npm install and npx jest to install exercise dependencies and run tests.
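
To verify the benchmark prerequisites:

node --version   # should print v18.0.0 or later
npm --version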


Running Tasks

Throughout this document, commands are shown using red-exec as the executable name. Substitute the actual binary for your platform:

| Platform | Command |
| --- | --- |
| Windows | .\red-exec-win-x64.exe |
| Linux | ./red-exec-linux-x64 |
| macOS | ./red-exec-macos-x64 |

Basic task execution

red-exec "<task description>" --files <comma-separated-paths>

Examples:

# Single file
red-exec "Add health endpoint" --files src/server.js

# Multiple files
red-exec "Split runner into modules" --files lib/runner.js,lib/setup.js,lib/execution.js

# With a different model
red-exec "Refactor into modules" --files lib/runner.js --model gpt-4.1-mini

# Dry run (works on a temp copy, originals untouched)
red-exec "Add logging" --files src/app.js --dry-run

# Disable safety gate
red-exec "Rewrite exports" --files lib/api.js --no-gate

# Generate a run report (.red-exec-report.md)
red-exec "Add caching" --files src/cache.js --report

Running Benchmarks

There are two ways to run benchmarks:

Path 1: Via the binary (quick, red-exec only)

The binary has benchmarks built in. No extra setup needed:

# Multi-file benchmark (3 fixtures, 44 checks)
red-exec --benchmark multifile

# Single fixture
red-exec --benchmark multifile --task module-split

# Polyglot benchmark (49 exercises)
red-exec --benchmark polyglot

# Single exercise
red-exec --benchmark polyglot --exercise triangle

Path 2: Via the benchmark folder (any agent)

The benchmarks/ folder is a standalone, agent-agnostic test suite. Use this to benchmark any coding agent -- not just red-exec. Requires Node.js 18+ (for Jest test validation).

See benchmarks/README.md for full documentation.

cd benchmarks

# Run multi-file benchmark with red-exec
npm run multifile -- --harness ./harnesses/red-exec-direct.js --all

# Run with Claude Code instead
npm run multifile -- --harness ./harnesses/claude-code.js --all

# Run with your own agent
npm run multifile -- --harness ./harnesses/my-agent.js --all

Writing your own harness -- a harness is a single .js file that receives a workspace directory and runs your agent:

// harnesses/my-agent.js
const { execSync } = require('child_process');
const fs = require('fs');

// The benchmark runner invokes the harness with the workspace directory as argv[2].
const dir = process.argv[2];

// Each fixture workspace contains TASK.md (the task description)
// and FILES.txt (one target file path per line).
const task = fs.readFileSync(dir + '/TASK.md', 'utf8');
const files = fs.readFileSync(dir + '/FILES.txt', 'utf8').trim().split('\n').join(' ');

// Run the agent inside the workspace; adapt the argument format to your agent's CLI.
execSync(`my-agent ${JSON.stringify(task)} --files ${files}`, { cwd: dir, stdio: 'inherit' });

That's the entire contract. The runner handles setup, scoring, timing, and results.

Expected output

Multifile benchmark — 3 fixture(s)
Timeout: 600s

[module-split] 20/20 checks passed
[pipeline-skeleton] 11/11 checks passed
[tool-migration] 13/13 checks passed

──────────────────────────────────────────────────
Fixtures: 3
Checks:   44/44 passed (100.0%)
Tokens:   20,694t total (12 LLM calls)

Writing Good Task Descriptions

The task description is plain English -- no JSON, no schema, no special format. The pipeline parses it internally. But the quality of your description directly determines the quality of the output.

What works well

Be specific about what to create or change:

red-exec "Create a validation module that exports validate(input), sanitize(input), and RULES object" --files src/validation.js

Name the functions, exports, and patterns you want:

red-exec "Split the database layer from app.js into db.js (connection, pool) and queries.js (getUser, createUser, deleteUser)" --files app.js,db.js,queries.js

Specify the file list carefully. The pipeline only sees files you pass to --files. It cannot discover files on its own.

# Good: all involved files listed
red-exec "Move SQL queries from server.js into queries.js, update server.js imports" --files server.js,queries.js

# Bad: missing the file that needs updating
red-exec "Move SQL queries into queries.js" --files queries.js

What doesn't work

Vague or exploratory instructions:

# Bad: no concrete spec
red-exec "Make this faster" --files src/app.js
red-exec "Improve error handling" --files lib/server.js
red-exec "Clean up this code" --files src/utils.js

Tasks that require understanding code you didn't provide:

# Bad: pipeline can't read files not in --files
red-exec "Update all callers of getUser()" --files src/db.js

Tips

  • One task, one run. Don't combine unrelated changes.
  • If you're describing more than 2-3 sentences of instructions, break the task into smaller runs (see the example after this list).
  • For MODIFY tasks on existing files, the files must exist at the paths you specify.
  • For CREATE tasks, the files will be created at the paths you specify.
  • Use --dry-run to preview changes without modifying your files.
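
For example, applying the extract-then-integrate split recommended earlier (file and function names here are hypothetical):

# Phase 1: extract (CREATE)
red-exec "Extract parsing helpers from app.js into parse.js (parseRow, parseHeader)" --files app.js,parse.js

# Phase 2: integrate (MODIFY)
red-exec "Update app.js to require parseRow and parseHeader from ./parse" --files app.js,parse.js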

Diagnostics & Reporting

Every run produces a structured pipeline report covering each stage. Use --report to write a detailed markdown file.

Console report (always printed)

After every task execution, the CLI prints a summary:

=== Run Report ===
Status:   OK
Stages:   L1 ✓  L2 ✓  L3 ✓  Gate ✓
Files:    3 written, 2 deleted, 0 blocked
Tokens:   9,126 (7 calls, 124.0s)

Status values:

  • OK — all stages passed, all files written
  • PARTIAL — syntax errors or InvariantGate blocked one or more files
  • CRASHED — pipeline threw an unrecoverable error

Detailed report (--report)

red-exec "Migrate tools" --files src/tools/*.js --report

Writes .red-exec-report.md to the workspace with:

| Section | Contents |
| --- | --- |
| Stages table | Per-stage status: L1 Decompose, L2 Section, L3 Execute, Gate |
| L1 Actions | What the pipeline decided to do: CREATE, MODIFY, EXTRACT, DELETE per file |
| Syntax Errors | Any L3-generated files that failed node --check |
| Invariant Violations | Gate-detected issues (removed exports, out-of-scope writes) |
| Blocked Files | Files the gate refused to write, with reasons |
| Written Files | Files successfully written to disk |
| Suggested Action | Actionable next step when status is not OK |
| Token Usage | Prompt/completion/total tokens and LLM call count |

Interpreting failures

When a run shows PARTIAL status:

  1. Syntax errors — L3 generated code that doesn't parse. Retry usually fixes this (LLM sampling variance). Try --max-retries 2 or a different model (--model gpt-4.1-mini).
  2. Gate blocked files — InvariantGate prevented a destructive write (e.g., removing an exported function). Check .red-exec-report.md for the specific violation. If the write was intentional, use --no-gate or adjust the task spec.
  3. Missing imports — Generated files may reference symbols from other files without importing them. This is the most common cross-module issue. Skill files can guide the LLM on required imports.

Debug export (--export-debug)

If a run fails or produces unexpected output, export a debug bundle for analysis:

red-exec "Add caching" --files src/cache.js --export-debug

This produces a .zip file containing the run metadata, full event log, LLM prompts/responses, and generated file contents. You can also export from a previous run using --run-id:

red-exec --export-debug --run-id <run-id>

On pipeline errors, the debug bundle is exported automatically.

Benchmark diagnostics

The multifile benchmark harness prints the pipeline report for each fixture, so you can see per-fixture stage status alongside the check results:

[tool-migration] 13/13 checks passed
=== Run Report ===
Status:   OK
Stages:   L1 ✓  L2 ✓  L3 ✓  Gate ✓
Files:    3 written, 2 deleted, 0 blocked
Tokens:   9,126 (7 calls, 124.0s)

CLI Reference

red-exec v0.1.0-alpha  Structured code generation pipeline

USAGE:
    red-exec "<task>" --files <paths>           Run a task
    red-exec --benchmark polyglot               Run polyglot benchmark (49 exercises)
    red-exec --benchmark multifile              Run multi-file benchmark (3 fixtures)
    red-exec --score --workspace <d> --checks <f>   Score results
    red-exec --compare                          Compare to baselines

OPTIONS:
    --files <paths>          Comma-separated file paths to modify
    --model <name>           LLM model (default: deepseek-chat)
    --no-gate                Disable InvariantGate safety checks
    --dry-run                Run in temp copy, preserve originals
    --report                 Write .red-exec-report.md to workspace
    --harness <path>         Agent harness for benchmarks
    --exercise <name>        Single polyglot exercise
    --task <name>            Single multi-file fixture
    --max-retries <n>        Fresh retries on failure (default: 2)
    --export-debug           Export debug bundle (zip) for the run
    --run-id <id>            Run ID for retroactive --export-debug
    --version                Print version
    --help                   Print this help

ENVIRONMENT:
    DEEPSEEK_API_KEY         Required for deepseek-chat (default model)
    OPENAI_API_KEY           Required for gpt-4.1, gpt-4.1-mini, gpt-4o, gpt-4o-mini
    LLM_MODEL                Override default model

Logs

Run logs are written to .red-exec-logs/ in your current working directory.


Supported Models

| Model | Adapter | Env var required |
| --- | --- | --- |
| deepseek-chat (default) | DeepSeek | DEEPSEEK_API_KEY |
| deepseek-reasoner | DeepSeek | DEEPSEEK_API_KEY |
| gpt-4.1 | OpenAI | OPENAI_API_KEY |
| gpt-4.1-mini | OpenAI | OPENAI_API_KEY |
| gpt-4o | OpenAI | OPENAI_API_KEY |
| gpt-4o-mini | OpenAI | OPENAI_API_KEY |
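
You can select a non-default model per run with --model, or via the LLM_MODEL environment variable listed in the CLI reference (assuming it overrides the default for subsequent runs):

# Per run
red-exec "Add caching" --files src/cache.js --model deepseek-reasoner

# Via environment
export LLM_MODEL=deepseek-reasoner
red-exec "Add caching" --files src/cache.js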

Benchmark Results (April 2026)

Multi-file: public benchmark suite (reproducible)

7 consecutive runs with --max-retries 0 (no retries):

| Run | Pass Rate | Tokens | LLM Calls |
| --- | --- | --- | --- |
| 1 | 44/44 (100%) | 22,934 | 13 |
| 2 | 44/44 (100%) | 21,538 | 12 |
| 3 | 44/44 (100%) | 22,968 | 12 |
| 4 | 44/44 (100%) | 19,869 | 11 |
| 5 | 44/44 (100%) | 23,580 | 13 |
| 6 | 44/44 (100%) | 21,787 | 12 |
| 7 | 44/44 (100%) | 20,694 | 12 |

Per-fixture averages: module-split ~3.8K tokens (1 call), pipeline-skeleton ~6.8K tokens (3 calls), tool-migration ~10.3K tokens (8 calls).

Multi-file: red-exec vs Claude Code (public benchmarks)

| Task | red-exec (deepseek-chat) tokens | Claude Code (Sonnet) tokens | Ratio |
| --- | --- | --- | --- |
| Module split (20 checks) | 3.9K | 90K | 23x cheaper |
| Pipeline skeleton (11 checks) | 6.5K | 191K | 29x cheaper |
| Tool migration (13 checks) | 10.3K | 163K | 16x cheaper |
| Total | 21K | 444K | 21x cheaper |

Multi-file: internal production codebase (not reproducible externally)

| Task | red-exec (deepseek-chat) tokens | Claude Code (Sonnet) tokens | Ratio |
| --- | --- | --- | --- |
| Pipeline split (937-line file, 12 checks) | 10.4K | 473K | 45x cheaper |
| Tool migration (7 files, 13 checks) | 44K | 547K | 12x cheaper |
| Pipeline skeleton (11 checks) | 16.2K | 787K | 49x cheaper |

The gap widens on production code because larger codebases amplify context accumulation in agentic loops while the structured pipeline stays bounded. Public benchmarks are the verifiable floor.

Polyglot (49 JavaScript exercises)

| Metric | red-exec | Cline (agentic loop) |
| --- | --- | --- |
| Pass rate | 49/49 (100%) | 49/49 (100%) |
| Total tokens | 689K | 10,287K |
| Cost ratio | 1x | 14.9x |

Troubleshooting

"DEEPSEEK_API_KEY is not set" -- The environment variable is not visible to the binary. On Windows, make sure you set it in the same PowerShell session (or added it to system env vars and opened a new terminal). On Linux/macOS, make sure you export it (not just DEEPSEEK_API_KEY=... without export).

"Permission denied" (Linux/macOS) -- Run chmod +x ./red-exec-linux-x64 (or the macOS binary) first.

"This app can't run on your PC" (Windows) -- You may be on an ARM64 Windows device. The provided binary is x64 only. Run it under x64 emulation.

macOS Gatekeeper warning -- macOS may block unsigned binaries. To allow it:

xattr -d com.apple.quarantine ./red-exec-macos-x64

Antivirus flags the binary -- pkg-compiled Node.js binaries are sometimes flagged by antivirus software as false positives because they contain a bundled runtime. Add an exception for the binary.

Pipeline returns PARTIAL or CRASHED -- See Interpreting failures above. Most common fix: retry with --max-retries 2 or try a different model (--model gpt-4.1-mini).


Other Documents

| File | Audience | Content |
| --- | --- | --- |
| ALPHA-NOTES.md | Co-architects / testers | What works, what doesn't, design decisions to challenge |
| PHILOSOPHY.md | Technical readers | "Constrained Attention Is What You Need" -- the architectural thesis |
