CodeLeWM

CodeLeWM is a Python ML research harness for learning latent transition models over code edits.

It is not a code generator. It is a scorer and reranker for candidate patches: given a before-state, an edit instruction, and candidate after-states or diffs, CodeLeWM estimates which candidate best matches the learned transition.

CodeState_before + EditAction -> latent(CodeState_after)
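
To make the transition concrete, here is a minimal sketch of reranking by latent similarity. It is not the CodeLeWM API: the names `cosine` and `rerank`, the use of cosine similarity as the score, and the toy 3-dimensional latents are all assumptions for illustration. In the real harness the latents come from learned encoders and a learned predictor.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(
    predicted_after: Sequence[float],
    candidate_latents: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (candidate_id, score) pairs sorted from best to worst match."""
    scored = [(cid, cosine(predicted_after, z)) for cid, z in candidate_latents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy values standing in for predictor(encode(before), encode(action)) and
# encode(candidate_after). They are hypothetical, not model output.
predicted = [1.0, 0.0, 0.0]
candidates = {
    "patch_a": [0.9, 0.1, 0.0],
    "patch_b": [0.0, 1.0, 0.0],
}
ranking = rerank(predicted, candidates)
print(ranking[0][0])
```

The candidate whose latent lies closest to the predicted after-state latent is ranked first; no candidate is generated or modified.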

Current Result

The systems path works end to end:

public-safe Python edit datasets;
manifest-backed training on Hugging Face Jobs;
public dataset/model/run artifacts on Hugging Face;
downloaded-artifact verification with checksums and secret scans;
retrieval, action ablation, surprise, latent-probe, latent-matrix, scorer-quality, score, rerank, downstream-pack, downstream-rerank, and LLM-demo reports.
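
As a rough illustration of what checksum verification of downloaded artifacts involves, the sketch below checks files against a flat mapping of relative path to SHA-256 digest. The manifest shape and the names `sha256_of` and `verify_manifest` are assumptions; the actual CodeLeWM manifest schema is not shown in this README.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: dict) -> list:
    """Return the relative paths that are missing or whose digest differs."""
    failures = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures


# Self-contained demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "model.bin").write_bytes(b"weights")
    manifest = {"model.bin": hashlib.sha256(b"weights").hexdigest()}
    ok_failures = verify_manifest(root, manifest)   # file matches the manifest
    (root / "model.bin").write_bytes(b"tampered")
    bad_failures = verify_manifest(root, manifest)  # file no longer matches
```

An empty failure list means every listed file exists and matches its recorded digest.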

The first scientific result is negative. The tested action-conditioned variants do not beat the no-action baseline on headline retrieval, and the v0.2 representation/downstream gates remain closed. This repository is publishable as infrastructure and negative evidence, not as a claim that CodeLeWM improves coding.
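
For readers unfamiliar with this kind of comparison, the sketch below shows one common way to compare a model against a baseline on retrieval: recall@k over the rank of the true after-state. The choice of recall@k as the headline metric is an assumption, and the ranks are made up for illustration; they are not CodeLeWM results.

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose true target ranks within the top k (ranks are 1-indexed)."""
    if not ranks:
        return 0.0
    return sum(1 for r in ranks if r <= k) / len(ranks)


def beats_baseline(model_ranks, baseline_ranks, k=1):
    """True only if the model's recall@k is strictly higher than the baseline's."""
    return recall_at_k(model_ranks, k) > recall_at_k(baseline_ranks, k)


# Hypothetical ranks of the true after-state for four queries.
action_conditioned = [1, 3, 2, 5]
no_action = [1, 2, 1, 4]

print(recall_at_k(action_conditioned, 1), recall_at_k(no_action, 1))
print(beats_baseline(action_conditioned, no_action))
```

A negative result in this framing means `beats_baseline` stays false: conditioning on the action does not rank the true after-state higher than ignoring it.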

Quickstart

uv sync --group dev --group data --group train
uv run scripts/first-results --overwrite
uv run codelewm secret-scan .artifacts/first-results docs/benchmark/FIRST_RESULTS.md --json

This rebuilds the local smoke artifact set and regenerates docs/benchmark/FIRST_RESULTS.md. It proves the package-native dataset, pack, train, eval, index, scorer-quality, manifest, and secret-scan loop on tiny fixtures. It does not prove model quality.
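
The secret-scan step can be pictured as a pattern search over publishable text. The sketch below is a simplified stand-in, not the `codelewm secret-scan` implementation: the two patterns and the name `scan_text` are hypothetical, and a real scanner would cover many more credential formats.

```python
import re

# Hypothetical patterns chosen for illustration only.
SECRET_PATTERNS = {
    # A long token with an "sk-" prefix.
    "token_shape": re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),
    # An *API_KEY assignment with a real-looking value; "<placeholder>" values are ignored.
    "env_assignment": re.compile(r"(?i)\b[A-Z0-9_]*API_KEY\s*=\s*[^\s<][^\s]{7,}"),
}


def scan_text(text):
    """Return (line_number, pattern_name) for every line that matches a pattern."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings


print(scan_text("OPENROUTER_API_KEY=<openrouter-api-key>"))
print(scan_text("note\nMY_API_KEY=abcd1234efgh"))
```

Placeholder values in angle brackets pass the scan, while an assignment with a concrete value is reported with its line number.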

LLM + World-Model Demo

Run the deterministic fixture demo:

uv sync --group dev --group data --group train --group llm
uv run scripts/llm-world-model-demo

The task performs the following steps:

loads .env if present and stays in CODELEWM_LLM_DRY_RUN=1 by default;
materializes the bugfix-edge-case scenario;
generates candidate diffs through the OpenRouter adapter fixture path;
writes codelewm.llm_candidate_pack.v1;
runs codelewm llm-demo with a trusted package-native torch checkpoint and --require-learned-scorer;
verifies manifests;
secret-scans publishable outputs;
writes a visual report at .artifacts/llm-world-model-demo/run/demo.html.

If the local first-results checkpoint is missing, the script regenerates it before scoring. The default output is a terminal walkthrough of scenario selection, candidate generation, learned world-model inference, artifact gates, and claim status.

Expected success signal:

CodeLeWM LLM + World-Model Demo
mode: fixture dry-run | scorer: codelewm.torch_transition_scorer.v1 | success: true
[ok] 4/6 World-model inference
[ok] 5/6 Artifact gates
html report: .artifacts/llm-world-model-demo/run/demo.html

For non-interactive JSON output, use uv run scripts/llm-world-model-demo --json or CODELEWM_LLM_DEMO_OUTPUT=json.

Select another built-in scenario with --scenario <id> or CODELEWM_LLM_DEMO_SCENARIO=<id>. List available scenarios with uv run scripts/llm-world-model-demo --list-scenarios.

Run the v0.6 execution-rerank tour with a downloaded seed-42 checkpoint:

CODELEWM_LLM_DRY_RUN=0 CODELEWM_LLM_MAX_CANDIDATES=2 \
  uv run scripts/llm-world-model-demo \
  --scenario execution-rerank-mbpp \
  --checkpoint .artifacts/v0_6/runs/codelewm-v0-6-execution-20260530-af1a114-seed-42/checkpoints/last.pt \
  --tour 5 \
  --html .artifacts/v0-6-execution-rerank-tour-live.html

The tour samples live OpenRouter candidates for five public-safe synthetic MBPP-style tasks, labels them only through codelewm.data.sandbox, and scores them with the v0.6 execution-substrate checkpoint. It writes codelewm.harness.execution_rerank_tour.v1 plus the unchanged codelewm.harness.execution_rerank_view_model.v1. The claim gate stays closed because the tour is smaller than the scaled 100-example downstream benchmark. A committed HTML report and asciicast live in docs/demo/.
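
One plausible way to summarize an execution-labeled rerank run is to compare how often the top-scored candidate passes against the expected pass rate of a uniformly random pick. The sketch below assumes each candidate is a dict with hypothetical `score` and `passed` fields; the field names, the metric choice, and the data are illustrative and do not describe the actual tour schema or results.

```python
def top1_pass_rate(tasks):
    """Share of tasks where the highest-scored candidate passed execution."""
    hits = 0
    for candidates in tasks:
        best = max(candidates, key=lambda c: c["score"])
        if best["passed"]:
            hits += 1
    return hits / len(tasks)


def random_pick_pass_rate(tasks):
    """Expected pass rate when one candidate per task is picked uniformly at random."""
    per_task = [sum(c["passed"] for c in cands) / len(cands) for cands in tasks]
    return sum(per_task) / len(per_task)


# Two hypothetical tasks with two candidates each.
tasks = [
    [{"score": 0.9, "passed": True}, {"score": 0.2, "passed": False}],
    [{"score": 0.4, "passed": False}, {"score": 0.7, "passed": False}],
]
print(top1_pass_rate(tasks), random_pick_pass_rate(tasks))
```

With only a handful of tasks, a gap between the two numbers is anecdotal, which is consistent with keeping the claim gate closed at this scale.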

v0.6 Publication Landing

The v0.6 public artifact map is:

artifact index: docs/benchmark/PUBLIC_ARTIFACT_INDEX_2026-05-31.md;
dataset card: docs/cards/codelewm-v0-6-execution-dataset-2026-05-31.md;
model cards: docs/cards/codelewm-v0-6-execution-model-seed-42-2026-05-31.md and docs/cards/codelewm-v0-6-execution-model-seed-1729-2026-05-31.md;
blog-style announcement draft: docs/blog/2026-05-31-codelewm-v0-6-substrate-pivot.md;
demo: docs/demo/execution_rerank_tour_2026-05-31.html;
arXiv package: docs/papers/ARXIV_SUBMISSION.md.

The arXiv URL is still pending operator upload. Until that URL lands, the safe public framing is partial-positive substrate evidence: execution-pack retrieval and semantic-decoy score diagnostics pass, while broad semantic surprise and coding-agent utility claims remain closed.

Live OpenRouter mode is explicit:

cp .env.example .env
# Fill OPENROUTER_API_KEY locally. Keep .env untracked.
CODELEWM_LLM_DRY_RUN=0 uv run scripts/llm-world-model-demo

Anthropic BYOK Through OpenRouter

CodeLeWM supports OpenRouter BYOK for Anthropic keys without silently switching to a direct Anthropic client.

# .env, kept local
OPENROUTER_API_KEY=<openrouter-api-key>
OPENROUTER_MANAGEMENT_KEY=<openrouter-management-key>
ANTHROPIC_API_KEY=<anthropic-provider-key>
CODELEWM_LLM_DRY_RUN=0
CODELEWM_OPENROUTER_BYOK=1
CODELEWM_OPENROUTER_BYOK_PROVIDER=anthropic
CODELEWM_OPENROUTER_BYOK_KEY_ENV=ANTHROPIC_API_KEY
CODELEWM_OPENROUTER_BYOK_MANAGEMENT_KEY_ENV=OPENROUTER_MANAGEMENT_KEY
CODELEWM_OPENROUTER_BYOK_REQUIRE=1
CODELEWM_OPENROUTER_BYOK_REGISTER=1
CODELEWM_OPENROUTER_BYOK_DRY_RUN=0

CODELEWM_OPENROUTER_BYOK_REGISTER=1 intentionally creates an encrypted Anthropic BYOK credential in the OpenRouter workspace via OpenRouter's BYOK API. Keep CODELEWM_OPENROUTER_BYOK_DRY_RUN=1 to validate the registration contract without sending the provider key. Registration uses the OpenRouter management key named by CODELEWM_OPENROUTER_BYOK_MANAGEMENT_KEY_ENV; normal chat requests still authenticate with OPENROUTER_API_KEY. If the BYOK credential already exists in the OpenRouter dashboard, set CODELEWM_OPENROUTER_BYOK_REGISTER=0 and keep CODELEWM_OPENROUTER_BYOK=1. CodeLeWM records redacted BYOK routing metadata and never writes provider keys to reports.
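
The interplay of these variables can be sketched as a small validation step that records variable names but never key values. This is an illustrative reading of the paragraph above, not the harness code: the function names, the rule that the provider and management keys are only required when registering, and the dry-run default of 1 are assumptions.

```python
def _flag(env, name, default="0"):
    """Interpret an environment-style value as a boolean; only "1" is true."""
    return env.get(name, default).strip() == "1"


def load_byok_settings(env):
    """Summarize BYOK settings from a mapping such as os.environ.

    The summary stores variable NAMES only, so it is safe to write to a report.
    """
    if not _flag(env, "CODELEWM_OPENROUTER_BYOK"):
        return {"enabled": False}

    key_env = env.get("CODELEWM_OPENROUTER_BYOK_KEY_ENV", "")
    management_env = env.get("CODELEWM_OPENROUTER_BYOK_MANAGEMENT_KEY_ENV", "")
    register = _flag(env, "CODELEWM_OPENROUTER_BYOK_REGISTER")

    # Chat requests always need the OpenRouter API key. Registration
    # (assumed here) also needs the provider key and the management key.
    required = ["OPENROUTER_API_KEY"]
    if register:
        required.append(key_env or "CODELEWM_OPENROUTER_BYOK_KEY_ENV")
        required.append(management_env or "CODELEWM_OPENROUTER_BYOK_MANAGEMENT_KEY_ENV")
    missing = [name for name in required if not env.get(name)]

    return {
        "enabled": True,
        "provider": env.get("CODELEWM_OPENROUTER_BYOK_PROVIDER", ""),
        "register": register,
        # Assumed default: stay in dry-run unless explicitly set to 0.
        "dry_run": _flag(env, "CODELEWM_OPENROUTER_BYOK_DRY_RUN", default="1"),
        "key_env": key_env,
        "missing": missing,
    }


example_env = {
    "OPENROUTER_API_KEY": "dummy-openrouter-value",
    "OPENROUTER_MANAGEMENT_KEY": "dummy-management-value",
    "ANTHROPIC_API_KEY": "dummy-provider-value",
    "CODELEWM_OPENROUTER_BYOK": "1",
    "CODELEWM_OPENROUTER_BYOK_PROVIDER": "anthropic",
    "CODELEWM_OPENROUTER_BYOK_KEY_ENV": "ANTHROPIC_API_KEY",
    "CODELEWM_OPENROUTER_BYOK_MANAGEMENT_KEY_ENV": "OPENROUTER_MANAGEMENT_KEY",
    "CODELEWM_OPENROUTER_BYOK_REGISTER": "1",
}
summary = load_byok_settings(example_env)
print(summary)
```

Because the summary holds only the name `ANTHROPIC_API_KEY` and not its value, printing or logging it does not leak the provider key.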

For Anthropic BYOK, start with CODELEWM_LLM_PROVIDER_OPTIONS_JSON='{"sort":"price"}'. Add zdr: true only when OpenRouter shows a matching Zero Data Retention endpoint for the pinned provider route; otherwise OpenRouter rejects the request before generation.
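
A small helper can build the options value so that `zdr` is only added deliberately. The function name `provider_options` and its boolean parameter are hypothetical; only the `sort` and `zdr` keys come from the text above.

```python
import json


def provider_options(zdr_endpoint_available: bool) -> str:
    """Build a compact JSON string for CODELEWM_LLM_PROVIDER_OPTIONS_JSON."""
    options = {"sort": "price"}
    if zdr_endpoint_available:
        # Only request Zero Data Retention when a matching endpoint is known to exist.
        options["zdr"] = True
    return json.dumps(options, separators=(",", ":"))


print(provider_options(False))
print(provider_options(True))
```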

Dry-run the registration contract without sending secrets:

uv run codelewm openrouter byok-register \
  --provider anthropic \
  --key-env ANTHROPIC_API_KEY \
  --management-key-env OPENROUTER_MANAGEMENT_KEY \
  --name "CodeLeWM Anthropic BYOK" \
  --allowed-model anthropic/claude-4.5-sonnet \
  --dry-run \
  --json

Evidence

| Evidence | Result | Report |
| --- | --- | --- |
| First local smoke loop | systems smoke only | `docs/benchmark/FIRST_RESULTS.md` |
| Scaled HF systems run | negative vs no-action | `docs/benchmark/SCALED_HF_RESULTS_2026-05-20.md` |
| Action-use margin run | negative vs no-action | `docs/benchmark/ACTION_USE_HF_RESULTS_2026-05-20.md` |
| Margin + retrieval run | improved but still negative | `docs/benchmark/ACTION_USE_RETRIEVAL_HF_RESULTS_2026-05-20.md` |
| v0.2 action-swap run | negative across action-use, latent-probe, downstream gates | `docs/benchmark/V0_2_ACTION_SWAP_HF_RESULTS_2026-05-20.md` |
| Public summary | negative/diagnostic boundary | `docs/benchmark/PRELIMINARY_RESULTS_2026-05-21.md` |
| Public artifact index | HF dataset/model/run paths | `docs/benchmark/PUBLIC_ARTIFACT_INDEX_2026-05-21.md` |
| Downstream fixture gate | one example, claim-blocked | `docs/benchmark/DOWNSTREAM_RERANKING_BENCHMARK.md` |

Public Hugging Face repositories:

abdelstark/codelewm-public-shard
abdelstark/codelewm-transition-model
abdelstark/codelewm-runs

Command Surface

uv run codelewm dataset build --help
uv run codelewm dataset pack --help
uv run codelewm train --help
uv run codelewm eval retrieval --help
uv run codelewm eval latent-probe --help
uv run codelewm eval latent-matrix --help
uv run codelewm eval surprise --help
uv run codelewm eval scorer-quality --help
uv run codelewm eval downstream-pack --help
uv run codelewm eval downstream-rerank --help
uv run codelewm score --help
uv run codelewm rerank --help
uv run codelewm llm-demo --help
uv run codelewm openrouter byok-register --help
uv run codelewm manifest verify --help
uv run codelewm secret-scan --help

Full usage guide: docs/usage/USAGE.md.

Architecture

raw edit sources
  -> source adapters, license gates, split/dedup policy
  -> CodeState_before, EditAction, CodeState_after
  -> packed transition batches
  -> JEPA-style latent transition training
  -> checkpoint + transition index
  -> score/rerank, retrieval, surprise, downstream, and LLM-demo reports
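
The middle of this pipeline can be pictured with a few plain data types. The sketch below is illustrative only: the field names (`path`, `source`, `instruction`) and the `pack` helper are assumptions, and the real CodeState extraction and packing in codelewm.data carry more structure than shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CodeState:
    path: str     # hypothetical field: file the snapshot belongs to
    source: str   # hypothetical field: file contents at this point


@dataclass(frozen=True)
class EditAction:
    instruction: str  # hypothetical field: natural-language edit instruction


@dataclass(frozen=True)
class Transition:
    before: CodeState
    action: EditAction
    after: CodeState


def pack(transitions: List[Transition], batch_size: int) -> List[List[Transition]]:
    """Group transitions into fixed-size batches; the last batch may be shorter."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [transitions[i:i + batch_size] for i in range(0, len(transitions), batch_size)]


example = Transition(
    before=CodeState("m.py", "def f(x):\n    return x\n"),
    action=EditAction("handle None input"),
    after=CodeState("m.py", "def f(x):\n    return 0 if x is None else x\n"),
)
batches = pack([example] * 5, batch_size=2)
print([len(batch) for batch in batches])
```

Training then consumes such batches to predict the latent of `after` from `before` and `action`.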

Core packages:

codelewm.data: source loading, filtering, CodeState extraction, packing;
codelewm.model: action encoders, predictor modules, objective helpers;
codelewm.training: manifest-backed CPU smoke and torch training runners;
codelewm.eval: retrieval, surprise, latent probes, downstream claim gates;
codelewm.harness: scorer, reranker, OpenRouter adapter, LLM demo, CLI;
codelewm.observability: artifact manifests, logs, redaction;
codelewm.security: non-execution parsing, license policy, secret scans.

Root train.py, eval.py, and Hydra configs are compatibility artifacts from the original LeWorldModel seed. The package CLI is the CodeLeWM path.

Install

uv sync --group dev
uv sync --group dev --group data
uv sync --group dev --group train
uv sync --group dev --group eval
uv sync --group dev --group llm
uv sync --group dev --group release

The package extras mirror the same boundaries:

uv sync --extra data
uv sync --extra train
uv sync --extra eval
uv sync --extra llm

Validate

uv run pytest tests/ -q
uv run python -m compileall -q -x 'tests/fixtures/codestate/invalid_(before|after)\.py$' codelewm tests
uv lock --check
git diff --check

For release work, also run the package build, dependency audit, provenance, and wheel-install gates described in docs/release/DEPENDENCY_PROVENANCE.md.
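
The `-x` pattern in the compileall command above is a regular expression that compileall searches for in each file path; matching files are skipped. The sketch below shows which paths the pattern excludes and the programmatic counterpart via `compileall.compile_dir`. The helper names are hypothetical, and the reading that the two invalid fixtures are excluded because they are not meant to compile is an inference from their names.

```python
import compileall
import re

# Same pattern as the -x argument in the validation command.
EXCLUDE = re.compile(r"tests/fixtures/codestate/invalid_(before|after)\.py$")


def is_skipped(path):
    """A file is skipped when the pattern is found anywhere in its path."""
    return EXCLUDE.search(path) is not None


def compile_tree(directory):
    """Programmatic counterpart of `python -m compileall -q -x ... <directory>`."""
    return compileall.compile_dir(directory, quiet=1, rx=EXCLUDE)


print(is_skipped("tests/fixtures/codestate/invalid_before.py"))
print(is_skipped("tests/fixtures/codestate/valid_before.py"))
```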

Roadmap

The completed v1.1 boundary is a claim-safe diagnostic workflow:

LLM + world-model demo complete through #186-#189;
preliminary publication package complete through #193-#194;
downstream reranking fixture and claim gate complete through #190-#192;
BYOK/demo/readme polish complete through #206;
meaningful scenario selection complete through #226;
visual model observability and TUI roadmap locked through #236;
optional TensorBoard-compatible training export complete through #237;
checkpoint tensor/layer inspection complete through #238;
latent representation matrix diagnostics complete through #239;
run timeline and monitoring reports complete through #240;
optional Textual TUI for demo inspection complete through #241;
shared visual view model for JSON/rich/HTML demo parity complete through #242;
model/latent/tensor diagnostic links in demo reports complete through #243;
diagnostics-driven candidate-contrast experiment plan complete through #244;
public visual observability artifact set complete through #245;
v0.6 execution-rerank LLM showcase complete through #307, with a committed asciicast and static HTML report path.

Open next streams:

scaled downstream reranking benchmark: #209/#210/#211;
next positive-model research hypothesis: #212, with CWM comparison in #178.

Public wording cannot say CodeLeWM improves candidate patch ranking until a scaled downstream gate passes on at least 100 labeled examples.
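
The wording rule reduces to a two-part check. The sketch below is a minimal illustration; the function and constant names are hypothetical, and what counts as the gate passing is defined by the downstream benchmark, not by this snippet.

```python
MIN_LABELED_EXAMPLES = 100  # threshold stated in the wording rule


def may_claim_ranking_improvement(labeled_examples: int, gate_passed: bool) -> bool:
    """Allow the public claim only if the gate passed on enough labeled examples."""
    return gate_passed and labeled_examples >= MIN_LABELED_EXAMPLES


print(may_claim_ranking_improvement(1, True))     # a one-example fixture stays blocked
print(may_claim_ranking_improvement(100, True))
print(may_claim_ranking_improvement(500, False))
```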

Live planning:

docs/roadmap/POST_V0_2_SHOWCASE_ROADMAP.md
docs/roadmap/MODEL_OBSERVABILITY_TUI_ROADMAP.md
docs/roadmap/DIAGNOSTICS_DRIVEN_MODEL_EXPERIMENT.md
docs/benchmark/VISUAL_OBSERVABILITY_ARTIFACTS_2026-05-21.md
docs/roadmap/FULL_COMPLETION.md
docs/roadmap/IMPLEMENTATION.md
docs/roadmap/NEXT_GOAL_PROMPT.md

Claim Boundary

You can cite CodeLeWM today as:

a public, reproducible code-edit world-model research harness;
a verified Hugging Face Jobs and artifact-publication pipeline;
a negative result for tested action-use interventions;
a fixture-proven LLM-candidate reranking workflow.

Do not cite it today as:

a model that improves coding;
a model with validated semantic latent dimensions;
a checkpoint that beats no-action on action-conditioned retrieval;
a downstream patch-ranking system with proven usefulness.

Attribution

CodeLeWM starts from the LeWorldModel codebase and keeps its JEPA-style model shape as the implementation seed:

@article{maes_lelidec2026lewm,
  title={LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels},
  author={Maes, Lucas and Le Lidec, Quentin and Scieur, Damien and LeCun, Yann and Balestriero, Randall},
  journal={arXiv preprint},
  year={2026}
}

