AvixoSec/codeaxiom

CodeAxiom

https://codeaxiom.avixosec.xyz | contact@avixosec.xyz

CodeAxiom trains coding models through executable verification.

The goal is to train a specialized coding model that writes, edits, tests, and repairs code. The model does not exist yet. This repository is the public plan, verifier demo, and GPU-credit package for the first training sprint.

The current priority is GPU credit outreach. No training run is planned until credits, trial access, or an approved budget is available.

The training direction is simple:

task -> code or patch -> compile -> run tests -> inspect failure -> improve -> verify
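As a rough illustration of that loop, here is a minimal sketch in Python. It is not the repository's runner: the names `run_tests`, `verified_loop`, and `CANDIDATES` are hypothetical, the candidates are fixed strings rather than model outputs, and a real runner would execute untrusted code in a sandbox instead of calling `exec` in-process.

```python
from __future__ import annotations

from typing import Callable, Iterable, Optional


def run_tests(func: Callable[[int, int], int]) -> Optional[str]:
    """Return None when every check passes, else a short failure message."""
    cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
    for args, expected in cases:
        try:
            got = func(*args)
        except Exception as exc:  # runtime errors count as failures
            return f"{args}: raised {type(exc).__name__}"
        if got != expected:
            return f"{args}: expected {expected}, got {got}"
    return None


def verified_loop(candidates: Iterable[str]) -> tuple[Optional[str], list[str]]:
    """Try candidates in order; return the first verified source and the failure log."""
    failures: list[str] = []
    for source in candidates:
        try:
            code = compile(source, "<candidate>", "exec")  # compile step
        except SyntaxError as exc:
            failures.append(f"compile error: {exc.msg}")
            continue
        namespace: dict = {}
        exec(code, namespace)  # load the candidate (unsandboxed, demo only)
        func = namespace.get("add")
        if func is None:
            failures.append("missing function: add")
            continue
        failure = run_tests(func)  # run tests, inspect failure
        if failure is None:
            return source, failures  # verified
        failures.append(f"test failure: {failure}")
    return None, failures


# Hypothetical candidates: a syntax error, a wrong answer, then a fix.
CANDIDATES = [
    "def add(a, b) return a + b",
    "def add(a, b):\n    return a - b\n",
    "def add(a, b):\n    return a + b\n",
]

if __name__ == "__main__":
    winner, log = verified_loop(CANDIDATES)
    print("failures:", log)
    print("verified:", winner is not None)
```

In a training setting, the failure log is the part that would be fed back to the model before the next attempt; here it is only collected.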

The first serious training phase needs B200-class GPU compute. A small self-funded card budget is reserved for Modal access tests, and it will be spent only if doing so helps secure a larger credit grant.

Current state

This repository includes:

  • local verified-run demo
  • YAML task format
  • noop backend
  • prepared-patch backend
  • local toy repository
  • result, log, patch, report, plan, and trace artifacts
  • benchmark and compute plans
  • GPU-credit funding brief
  • public site
  • CI for demo tasks

The current runner is local and offline. It is not a trained model and it does not call model APIs.

Why executable verification

Raw code data teaches syntax and patterns. It does not tell a model whether the code actually runs.

CodeAxiom is built around feedback from execution:

  • compiler result
  • unit tests
  • patch apply result
  • runtime errors
  • repository-level checks
  • benchmark harnesses
  • failure repair attempts

A generated answer is not enough. A code model should be measured by code that runs.
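To make a few of these signals concrete, the sketch below separates a compile failure from a runtime failure for a single Python snippet. The `Feedback` record and the `execute` helper are hypothetical names for illustration, not the repository's API, and the field set is an assumption.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feedback:
    compiled: bool            # did the snippet parse and compile?
    exit_code: Optional[int]  # process exit code, None if it never ran or timed out
    stderr_tail: str          # last line of stderr, usually the error type


def execute(source: str, timeout: float = 10.0) -> Feedback:
    """Compile a snippet, then run it in a fresh interpreter and record the outcome."""
    try:
        compile(source, "<snippet>", "exec")
    except SyntaxError as exc:
        return Feedback(False, None, f"SyntaxError: {exc.msg}")
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return Feedback(True, None, "timeout")
    lines = proc.stderr.strip().splitlines()
    return Feedback(True, proc.returncode, lines[-1] if lines else "")


if __name__ == "__main__":
    print(execute("assert 1 + 1 == 2"))  # compiles and exits cleanly
    print(execute("assert 1 + 1 == 3"))  # compiles, fails at runtime
    print(execute("def f(:"))            # does not compile
```

Keeping the compile result apart from the exit code means a repair step can tell "this never parsed" from "this ran and failed a check".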

Target benchmarks

CodeAxiom targets coding benchmarks as evaluation goals, not claimed results.

Benchmark                            What it tests
LiveCodeBench                        fresh algorithmic coding tasks
HumanEval+                           Python function correctness
MBPP+                                small Python programs with stronger tests
MultiPL-E                            multilingual code generation
Aider Code Editing Benchmark         editing existing files
RepoQA                               long-context repository understanding
SWE-bench Verified                   real repository bug fixing
SWE-agent style tasks                agentic edit and shell loop
SWE-bench Long or SWE-bench Pro      long-horizon software engineering

See docs/benchmark-plan.md and eval/benchmark-matrix.md.

Training stack

The planned training stack:

  1. Baseline evaluation of the selected base model.
  2. Core SFT on public and licensed coding data.
  3. Edit SFT for patches and existing-file changes.
  4. Multilingual compiler loop for Python, JavaScript, TypeScript, Java, C++, Go, and Rust.
  5. Long repository training for RepoQA and SWE-style tasks.
  6. Execution feedback training from compiler and test failures.
  7. Patch search with many candidate fixes.
  8. CodeWorldModel for patch pass prediction.
  9. Verifier-guided RL on narrow task families.
  10. One-shot distillation from verified search winners.

See docs/training-roadmap.md.
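Stages 7 and 8 are named here but not specified. One possible way they could fit together, assuming the pass predictor is used to order candidate patches before the more expensive verifier runs, is sketched below. The function `search_with_predictor`, its parameters, and the toy scores are all hypothetical.

```python
from __future__ import annotations

from typing import Callable, Optional, Sequence


def search_with_predictor(
    candidates: Sequence[str],
    predict_pass: Callable[[str], float],
    verify: Callable[[str], bool],
    budget: int,
) -> tuple[Optional[str], int]:
    """Verify candidates in order of predicted pass probability.

    Returns the first verified candidate (or None) and the number of
    verifier calls spent. `budget` caps how many candidates are verified.
    """
    ranked = sorted(candidates, key=predict_pass, reverse=True)
    calls = 0
    for patch in ranked[:budget]:
        calls += 1
        if verify(patch):
            return patch, calls
    return None, calls


if __name__ == "__main__":
    # Toy stand-ins: fixed scores instead of a learned predictor,
    # and a verifier that accepts exactly one patch.
    scores = {"patch_a": 0.2, "patch_b": 0.7, "patch_c": 0.5}
    winner, calls = search_with_predictor(
        list(scores), scores.get, lambda p: p == "patch_b", budget=3
    )
    print(winner, calls)
```

Under this assumption, a verified winner together with its task could become a training pair for the distillation stage, while the predictor mainly reduces how many verifier runs each task costs.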

GPU credit sprint

The first target is to get GPU credits in the first week, ideally within 48 hours.

Fast starter ask:

$5k to $10k GPU credits

Direct hardware ask:

2 to 4 B200 GPUs for 3 to 7 days

Fallback for a first proof checkpoint:

H100, H200, GB200, A100 80GB, or equivalent GPU credits

See docs/compute-plan.md and docs/funding-brief.md.

Card budget

A $30 card budget is not enough for meaningful model training.

Use it only on Modal, and only for:

  • account verification
  • trial activation
  • short GPU access checks
  • a tiny smoke run if it supports a credit application

With the listed prices, $30 buys about:

  • 11.9 hours on A100 80GB
  • 7.6 hours on H100
  • 6.6 hours on H200
  • 4.8 hours on B200
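The arithmetic behind these figures is budget divided by hourly price. The prices themselves are not reproduced in this README, so the rates in the sketch below are placeholders back-derived from the hour counts above, not quoted provider prices.

```python
# Hypothetical hourly rates in USD, back-derived from the hour counts above.
# They are illustrative placeholders, not quoted provider prices.
ASSUMED_USD_PER_HOUR = {
    "A100 80GB": 2.52,
    "H100": 3.95,
    "H200": 4.55,
    "B200": 6.25,
}


def hours_for_budget(budget_usd: float, usd_per_hour: float) -> float:
    """Return how many GPU-hours a budget buys at a flat hourly rate."""
    if usd_per_hour <= 0:
        raise ValueError("hourly rate must be positive")
    return budget_usd / usd_per_hour


if __name__ == "__main__":
    for gpu, rate in ASSUMED_USD_PER_HOUR.items():
        print(f"{gpu}: {hours_for_budget(30, rate):.1f} hours")
```

These are single-GPU hours, so a multi-GPU allocation divides them further.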

The current plan is to preserve the card budget, spend it only on Modal if needed, and focus on GPU credits first.

Local verifier demo

Install dependencies:

python -m pip install -r requirements.txt

Run a task that should remain failing:

python agent/run_task.py examples/tasks/toy_noop.yaml

Run a task that should pass after a prepared patch:

python agent/run_task.py examples/tasks/toy_fix.yaml

On Windows:

py -m pip install -r requirements.txt
py .\agent\run_task.py .\examples\tasks\toy_noop.yaml
py .\agent\run_task.py .\examples\tasks\toy_fix.yaml

Expected result:

toy_noop -> failed
toy_fix -> passed

Each run writes:

artifacts/runs/<run_id>.json
artifacts/logs/<run_id>.log
artifacts/patches/<run_id>.patch
artifacts/reports/<run_id>.md
artifacts/plans/<run_id>.md
artifacts/traces/<run_id>.jsonl
artifacts/tmp/<run_id>/

Generated artifacts are ignored by Git.
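For scripts that need to locate a run's outputs, the layout above can be expressed as a small helper. This is a sketch: `artifact_paths` is a hypothetical function, and the format of `<run_id>` is not specified here, so the example value is a placeholder.

```python
from pathlib import PurePosixPath

# Directory name -> file suffix, taken from the layout listed above.
ARTIFACT_LAYOUT = {
    "runs": ".json",
    "logs": ".log",
    "patches": ".patch",
    "reports": ".md",
    "plans": ".md",
    "traces": ".jsonl",
}


def artifact_paths(run_id: str, root: str = "artifacts") -> dict:
    """Map each artifact kind to its expected path for one run."""
    base = PurePosixPath(root)
    paths = {
        kind: base / kind / f"{run_id}{suffix}"
        for kind, suffix in ARTIFACT_LAYOUT.items()
    }
    paths["tmp"] = base / "tmp" / run_id  # per-run scratch directory
    return paths


if __name__ == "__main__":
    # "demo-run" is a placeholder run id, not a real one.
    for kind, path in artifact_paths("demo-run").items():
        print(f"{kind:8} {path}")
```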

Data policy

The first training phase uses public and licensed datasets. Private user code is not training data by default.

Benchmark data must stay separate from training data. Every public score needs a contamination note.

See docs/data-policy.md and docs/contamination-policy.md.
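As one illustration of keeping the two sets apart, the sketch below drops training items whose normalized text matches a benchmark item exactly. It is an assumption about how a basic check might look, not the rule set in docs/contamination-policy.md; the names `fingerprint` and `drop_benchmark_overlap` are hypothetical, and exact matching misses paraphrased or partially copied items.

```python
import hashlib
from typing import Iterable, List, Tuple


def fingerprint(text: str) -> str:
    """Hash text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_benchmark_overlap(
    training: List[str], benchmark: Iterable[str]
) -> Tuple[List[str], int]:
    """Return training items with no exact normalized match in the benchmark,
    plus the number of items dropped."""
    blocked = {fingerprint(item) for item in benchmark}
    kept = [item for item in training if fingerprint(item) not in blocked]
    return kept, len(training) - len(kept)


if __name__ == "__main__":
    training = [
        "def add(a, b): return a + b",
        "Def  Add(a, b):   return a + b",  # same text modulo case and spacing
        "def sub(a, b): return a - b",
    ]
    benchmark = ["def add(a, b): return a + b"]
    kept, dropped = drop_benchmark_overlap(training, benchmark)
    print(kept, dropped)
```

The dropped count is the kind of number a contamination note could report alongside a public score.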

Files worth reading

ONE_PAGER.md                               short funding summary
docs/codeaxiom-model-training-plan.md      implementation handoff plan
docs/benchmark-plan.md                     target benchmark plan
docs/compute-plan.md                       GPU credit and $30 smoke-run plan
docs/contamination-policy.md               benchmark leakage rules
docs/funding-brief.md                      GPU credit application brief
docs/training-roadmap.md                   model training stages
docs/architecture.md                       system shape
docs/roadmap.md                            fast execution roadmap
docs/limitations.md                        current limits
eval/benchmark-matrix.md                   benchmark matrix
agent/README.md                            verifier demo notes
examples/tasks/                            demo task files
site/                                      static public site

Scope

CodeAxiom is not a trained model yet. It is a public training plan, verifier demo, and GPU-credit package.

No benchmark score is claimed before evaluation.

Contact

contact@avixosec.xyz

License

MIT

About

Verified coding-agent workbench for real repository tasks: patches, tests, logs, artifacts, and metrics.
