https://codeaxiom.avixosec.xyz | contact@avixosec.xyz
CodeAxiom trains coding models through executable verification.
The goal is to train a specialized coding model that writes, edits, tests, and repairs code. The model does not exist yet. This repository is the public plan, verifier demo, and GPU-credit package for the first training sprint.
The current priority is GPU credit outreach. No training run is planned until credits, trial access, or an approved budget is available.
The training direction is simple:
task -> code or patch -> compile -> run tests -> inspect failure -> improve -> verify
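As a rough illustration of that loop, here is a minimal sketch in Python. The names `Attempt`, `verified_loop`, `propose`, and `verify` are hypothetical and do not come from this repository; the toy `propose` function stands in for a model, which does not exist yet.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Attempt:
    candidate: str  # code or patch text proposed for the task
    passed: bool    # True when the verifier accepted the candidate
    feedback: str   # failure output passed to the next proposal


def verified_loop(
    propose: Callable[[str], str],
    verify: Callable[[str], tuple[bool, str]],
    max_attempts: int = 3,
) -> list[Attempt]:
    """Propose, verify, and retry until a candidate passes or attempts run out."""
    attempts: list[Attempt] = []
    feedback = ""  # the first proposal sees no failure output
    for _ in range(max_attempts):
        candidate = propose(feedback)
        passed, feedback = verify(candidate)
        attempts.append(Attempt(candidate, passed, feedback))
        if passed:
            break
    return attempts


def propose(feedback: str) -> str:
    # Hypothetical stand-in for a model: emit the fix once a failure is seen.
    if feedback:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


def verify(candidate: str) -> tuple[bool, str]:
    namespace: dict = {}
    try:
        exec(candidate, namespace)          # "compile" step
        assert namespace["add"](2, 3) == 5  # "run tests" step
    except AssertionError:
        return False, "add(2, 3) != 5"
    except Exception as exc:
        return False, repr(exc)
    return True, ""


attempts = verified_loop(propose, verify)
```

In this toy run the first candidate fails its test, the failure message is handed back to `propose`, and the second candidate is verified.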
The first serious training phase needs B200-class GPU compute. A small self-funded card budget is reserved for Modal access tests, and only if those tests help secure larger credits.
This repository includes:
- local verified-run demo
- YAML task format
- noop backend
- prepared-patch backend
- local toy repository
- result, log, patch, report, plan, and trace artifacts
- benchmark and compute plans
- GPU-credit funding brief
- public site
- CI for demo tasks
The current runner is local and offline. It is not a trained model and it does not call model APIs.
Raw code data teaches syntax and patterns.
CodeAxiom is built around feedback from execution:
- compiler result
- unit tests
- patch apply result
- runtime errors
- repository-level checks
- benchmark harnesses
- failure repair attempts
A generated answer is not enough. A code model should be measured by code that runs.
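One way to turn "code that runs" into a signal is to execute a candidate together with its tests in a fresh interpreter and record the outcome. The sketch below assumes plain `assert`-based tests and a hypothetical helper name, `run_with_tests`; it is not the repository's runner and provides no sandboxing.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_tests(solution: str, tests: str, timeout: float = 10.0) -> dict:
    """Run solution code followed by its assert-based tests in a new process."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + tests + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # A hang counts as a failure, with no return code to report.
            return {"passed": False, "returncode": None, "stderr": "timeout"}
    return {
        "passed": proc.returncode == 0,  # non-zero exit means a test or runtime error
        "returncode": proc.returncode,
        "stderr": proc.stderr,           # traceback text, usable as repair feedback
    }


good = run_with_tests("def sq(x):\n    return x * x", "assert sq(3) == 9")
bad = run_with_tests("def sq(x):\n    return x + x", "assert sq(3) == 9")
```

The `stderr` field carries the traceback for a failing candidate, which is the kind of runtime feedback the list above describes.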
CodeAxiom targets coding benchmarks as evaluation goals, not claimed results.
| Benchmark | What it tests |
|---|---|
| LiveCodeBench | fresh algorithmic coding tasks |
| HumanEval+ | Python function correctness |
| MBPP+ | small Python programs with stronger tests |
| MultiPL-E | multilingual code generation |
| Aider Code Editing Benchmark | editing existing files |
| RepoQA | long-context repository understanding |
| SWE-bench Verified | real repository bug fixing |
| SWE-agent style tasks | agentic edit and shell loop |
| SWE-bench Long or SWE-bench Pro | long-horizon software engineering |
See docs/benchmark-plan.md and eval/benchmark-matrix.md.
The planned training stack:
- Baseline evaluation of the selected base model.
- Core SFT on public and licensed coding data.
- Edit SFT for patches and existing-file changes.
- Multilingual compiler loop for Python, JavaScript, TypeScript, Java, C++, Go, and Rust.
- Long repository training for RepoQA and SWE-style tasks.
- Execution feedback training from compiler and test failures.
- Patch search with many candidate fixes.
- CodeWorldModel for patch pass prediction.
- Verifier-guided RL on narrow task families.
- One-shot distillation from verified search winners.
See docs/training-roadmap.md.
The first target is to get GPU credits in the first week, ideally within 48 hours.
Fast starter ask:
$5k to $10k GPU credits
Direct hardware ask:
2 to 4 B200 GPUs for 3 to 7 days
Fallback for a first proof checkpoint:
H100, H200, GB200, A100 80GB, or equivalent GPU credits
See docs/compute-plan.md and docs/funding-brief.md.
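For scale, the direct hardware ask can be expressed in GPU-hours. This is plain arithmetic on the figures above, assuming the GPUs run around the clock for the whole period; the helper name `gpu_hours` is illustrative.

```python
def gpu_hours(gpus: int, days: int) -> int:
    """Total GPU-hours if every GPU runs 24 hours a day."""
    return gpus * days * 24


low = gpu_hours(2, 3)   # smallest ask: 2 GPUs for 3 days
high = gpu_hours(4, 7)  # largest ask: 4 GPUs for 7 days
```

Under that assumption the ask spans 144 to 672 B200 GPU-hours.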
A $30 card budget is not enough for meaningful model training.
Use it only on Modal, and only for:
- account verification
- trial activation
- short GPU access checks
- a tiny smoke run if it supports a credit application
With the listed prices, $30 buys about:
- 11.9 hours on A100 80GB
- 7.6 hours on H100
- 6.6 hours on H200
- 4.8 hours on B200
The current plan is to preserve card budget, use it only on Modal if needed, and focus on GPU credits first.
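The hour figures above follow from dividing the budget by an hourly price. Since the prices themselves are not repeated in this section, the sketch below works backwards and derives the hourly rate each figure implies. These are back-calculated values, not quoted prices, and the hour figures are already rounded.

```python
BUDGET_USD = 30.0

# Approximate hours quoted above for a $30 budget.
hours_for_budget = {
    "A100 80GB": 11.9,
    "H100": 7.6,
    "H200": 6.6,
    "B200": 4.8,
}


def implied_hourly_rate(budget_usd: float, hours: float) -> float:
    """Hourly price implied by spending the whole budget over the given hours."""
    return budget_usd / hours


implied_rates = {
    gpu: round(implied_hourly_rate(BUDGET_USD, hours), 2)
    for gpu, hours in hours_for_budget.items()
}
```

For example, 4.8 hours on a B200 implies about $6.25 per GPU-hour.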
Install dependencies:
python -m pip install -r requirements.txt
Run a task that should stay failed:
python agent/run_task.py examples/tasks/toy_noop.yaml
Run a task that should pass after a prepared patch:
python agent/run_task.py examples/tasks/toy_fix.yaml
On Windows:
py -m pip install -r requirements.txt
py .\agent\run_task.py .\examples\tasks\toy_noop.yaml
py .\agent\run_task.py .\examples\tasks\toy_fix.yaml
Expected result:
toy_noop -> failed
toy_fix -> passed
Each run writes:
artifacts/runs/<run_id>.json
artifacts/logs/<run_id>.log
artifacts/patches/<run_id>.patch
artifacts/reports/<run_id>.md
artifacts/plans/<run_id>.md
artifacts/traces/<run_id>.jsonl
artifacts/tmp/<run_id>/
Generated artifacts are ignored by Git.
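The layout above maps a run id to one file per artifact type plus a scratch directory. A small helper can make that mapping explicit. The function name `artifact_paths` and the example run id are hypothetical; the runner in this repository may build these paths differently.

```python
from pathlib import Path

# Folder and file suffix for each artifact type, taken from the list above.
ARTIFACT_LAYOUT = {
    "run": ("runs", ".json"),
    "log": ("logs", ".log"),
    "patch": ("patches", ".patch"),
    "report": ("reports", ".md"),
    "plan": ("plans", ".md"),
    "trace": ("traces", ".jsonl"),
}


def artifact_paths(run_id: str, root: Path = Path("artifacts")) -> dict[str, Path]:
    """Return the expected artifact locations for one run id."""
    paths = {
        name: root / folder / f"{run_id}{suffix}"
        for name, (folder, suffix) in ARTIFACT_LAYOUT.items()
    }
    paths["tmp"] = root / "tmp" / run_id  # per-run scratch directory
    return paths


paths = artifact_paths("demo")  # "demo" is a made-up run id
```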
The first training phase uses public and licensed datasets. Private user code is not training data by default.
Benchmark data must stay separate from training data. Every public score needs a contamination note.
See docs/data-policy.md and docs/contamination-policy.md.
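As a minimal sketch of keeping the two sets apart, the check below flags training samples that match a benchmark item exactly after whitespace and case normalization. This is an assumption about one possible check, not the rule defined in docs/contamination-policy.md, and it only catches near-verbatim leakage; paraphrased or partially overlapping items would need a stronger method.

```python
import hashlib
from typing import Iterable


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial formatting changes still match."""
    return " ".join(text.split()).lower()


def fingerprint(text: str) -> str:
    """Stable hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_overlap(training_samples: Iterable[str], benchmark_items: Iterable[str]) -> list[str]:
    """Return training samples whose fingerprint matches any benchmark item."""
    benchmark_hashes = {fingerprint(item) for item in benchmark_items}
    return [s for s in training_samples if fingerprint(s) in benchmark_hashes]


overlap = find_overlap(
    ["def f(x):\n    return x", "print('hi')"],
    ["def  f(x): return x"],
)
```

Any sample returned by `find_overlap` would be removed from training data or recorded in the contamination note.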
| Path | Purpose |
|---|---|
| ONE_PAGER.md | short funding summary |
| docs/codeaxiom-model-training-plan.md | implementation handoff plan |
| docs/benchmark-plan.md | target benchmark plan |
| docs/compute-plan.md | GPU credit and $30 smoke-run plan |
| docs/contamination-policy.md | benchmark leakage rules |
| docs/funding-brief.md | GPU credit application brief |
| docs/training-roadmap.md | model training stages |
| docs/architecture.md | system shape |
| docs/roadmap.md | fast execution roadmap |
| docs/limitations.md | current limits |
| eval/benchmark-matrix.md | benchmark matrix |
| agent/README.md | verifier demo notes |
| examples/tasks/ | demo task files |
| site/ | static public site |
CodeAxiom is not a trained model yet. It is a public training plan, verifier demo, and GPU-credit package.
No benchmark score is claimed before evaluation.
MIT