Action-conditioned JEPA world models for genomic edits, built on top of Carbon.
GenoLeWM is an alpha research codebase. The first public paper/demo
release is published as geno-lewm-v0.1.0-r1: it includes a real
Carbon-backed SNV training run, public model and dataset artifacts,
measured first-release evaluation, an artifact-backed terminal demo, and
final publication evidence with ok=true.
As of June 6, 2026:
| Area | Current state |
|---|---|
| Edit/action representation | Implemented: EditSpec, RelEdit, edit application, synthetic edit samplers, and optional-runtime ActionEncoder |
| Privacy-safe infrastructure | Implemented: typed errors, structured logging, redaction, metrics |
| Artifact provenance | Implemented: content-addressed manifests, input/output commitments, checksum receipt verification |
| CLI surface | Implemented scaffolds plus working geno-lewm-verify, geno-lewm-update, data prep, score, eval, rollout-metrics, and train paths |
| Desktop/runtime scaffolds | Present but not a complete product |
| Carbon encoder integration | Lazy CarbonStateEncoder wrapper and native artifact loading are implemented; the v0.1 terminal demo replayed from public model/data/demo artifacts; broader platform/runtime validation remains v0.2 work |
| Data/training stream | Carbon window sampler, tuple-builder contract, GenoLeWMDataset iterator, source-state cache lookup, local VCF-to-Parquet prep, and the v0.1 public dataset package are in place; larger held-out benchmark coverage and warm-cache throughput validation remain v0.2 work |
| Predictor/training | Base cross-attention Predictor, ARPredictor rollout wrapper, losses, collapse checks, torch trainer core, WSD scheduling, optimizer grouping, Carbon preflight/training launch plumbing, packaged run evidence, and one real Carbon-backed SNV run are published; true attention KV-cache speedups remain open |
| Evaluation | geno-lewm-eval, geno-lewm-carbon-baseline, geno-lewm-eval-all, geno-lewm-rollout, tools.release.v02_benchmark_suite, and bench.inference --release-efficiency cover measured metrics/report contracts plus a checked v0.2 benchmark-suite planning template; broader coding/non-coding, BRCA2/TraitGym Spearman, Carbon-baseline, rollout-fidelity artifact generation, and planning-readiness benchmark results are still v0.2 work |
| Package/model release | Public model, dataset, demo, paper, and publication-evidence artifacts are published; the first PyPI package tag remains open |
The v0.1 measured evaluation is intentionally narrow: chr21 ClinVar,
3,000 variants, AUROC 0.5191596847727398, AP
0.1651739690365932, and balanced accuracy 0.5. Treat those as
first-release evidence and negative findings, not as clinical utility,
deployment readiness, privacy assurance, or broad model-quality claims.
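
For readers less familiar with these metrics, here is a minimal sketch of how AUROC and balanced accuracy are computed. The data is a made-up toy example, not release data, and the helper names are hypothetical; the released numbers come from geno-lewm-eval. The sketch also shows one way a balanced accuracy of exactly 0.5 can arise (predicting a single class for every variant). Whether that explains the v0.1 value is not stated here.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def balanced_accuracy(predictions, labels):
    """Mean of the true-positive rate and the true-negative rate."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# Toy data (illustrative only, not taken from the release).
labels = [1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1]

print(auroc(scores, labels))  # 7 of 8 positive/negative pairs are ordered correctly
# Predicting the majority class everywhere gives TPR 0 and TNR 1.
print(balanced_accuracy([0] * len(labels), labels))
```
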
Public v0.1 artifacts:
- Model: https://huggingface.co/abdelstark/geno-lewm
- Dataset: https://huggingface.co/datasets/abdelstark/geno-lewm-data
- Demo release assets: https://github.com/AbdelStark/GenoLeWM/releases/tag/geno-lewm-v0.1.0-r1
- Paper artifact: https://github.com/AbdelStark/GenoLeWM/releases/download/geno-lewm-v0.1.0-r1/paper.md
- Final publication binder: https://huggingface.co/abdelstark/geno-lewm-runs/resolve/main/geno-lewm-coherent-cd2bfcc/publication/publication_evidence_report.json
| If you want to... | Start here |
|---|---|
| Understand what is implemented today | Status and What You Can Run Today |
| Try the stable Python surface | Install and Quickstart |
| Audit the first release and next work | First Experiment Evidence, v0.2 Readiness Work, and Release Evidence Matrix |
| Contribute code | Repository Layout, Development, and Contributing |
| Check safety and data-handling boundaries | Safety, PRIVACY.md, and SECURITY.md |
Current DNA foundation models usually score a variant by comparing two full sequence likelihoods: one for the reference allele and one for the alternate allele. GenoLeWM instead makes the edit itself an action in a latent world model:

    s_t         = enc(window_ref)
    a_t         = action(edit)
    s_hat_{t+1} = g(s_t, a_t)
    loss        = distance(s_hat_{t+1}, enc(window_alt)) + representation regularization

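
The formulation above can be illustrated with a toy, standard-library-only sketch. Everything in it is a stand-in chosen for illustration: `toy_encoder` is not Carbon, `toy_predictor` is not the GenoLeWM predictor, cosine distance is only one possible choice of `distance`, and the regularization term is omitted.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def toy_encoder(window):
    """Stand-in for the frozen encoder: base frequencies for A, C, G, T."""
    return [window.count(base) / len(window) for base in "ACGT"]


def toy_predictor(state, action):
    """Stand-in for g(s_t, a_t): add the action vector to the state."""
    return [s + a for s, a in zip(state, action)]


window_ref = "ACGTACGT"
window_alt = "CCGTACGT"  # SNV A>C at relative position 0

s_t = toy_encoder(window_ref)
target = toy_encoder(window_alt)
# For this toy encoder, a perfect action embedding is the latent difference.
a_t = [t - s for t, s in zip(target, s_t)]
s_hat = toy_predictor(s_t, a_t)

# Distance term only; a real objective would add a regularizer.
loss = cosine_distance(s_hat, target)
print(abs(loss) < 1e-9)
```

In the real system the action embedding is learned rather than computed from the target, which is what makes the prediction residual informative.
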
The goal is to learn a small action-conditioned predictor on top of a frozen DNA encoder. If this works, the same model can support:
- single-variant effect scoring;
- multi-edit latent rollout;
- planning over edit sequences;
- surprise scores based on prediction residuals;
- local-first inference on personal genome files.
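
As a rough illustration of the rollout and surprise items in the list above, the sketch below applies a predictor once per edit and measures a prediction residual. The additive predictor and the L2 residual are assumptions made for the example, not the project's definitions.

```python
import math


def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rollout(state, actions, predictor):
    """Apply the predictor once per action, feeding each output back in."""
    trajectory = [state]
    for action in actions:
        state = predictor(state, action)
        trajectory.append(state)
    return trajectory


def surprise(predicted, observed):
    """Prediction residual: distance between predicted and observed latents."""
    return l2(predicted, observed)


def additive_predictor(state, action):
    # Hypothetical stand-in for g(s_t, a_t).
    return [s + a for s, a in zip(state, action)]


s0 = [0.0, 0.0]
actions = [[1.0, 0.0], [0.0, 2.0]]  # two edits applied in sequence
states = rollout(s0, actions, additive_predictor)
print(states[-1])

observed = [1.0, 5.0]  # toy "encoded edited window"
print(surprise(states[-1], observed))
```
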
The project deliberately optimizes for a publishable, reproducible ML system: explicit data snapshots, model cards, evaluation reports, calibration artifacts, and terminal demos are first-class deliverables.

    reference window
          |
          v
    Carbon encoder (frozen) -------------------> state s_t
                                                    |
    genomic edit -> action encoder -> action a_t    |
                                                    v
                                          predictor g(s_t, a_t)
                                                    |
                                                    v
                                          predicted next state
                                                    |
                                                    v
                                      surprise / rollout / planning

The intended training target is enc(edited_window). Carbon remains the
heavy frozen state encoder; GenoLeWM trains the action encoder and
predictor. The deployed package keeps heavyweight ML dependencies behind
extras so the pure-Python utilities stay lightweight.
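
One common way to keep heavy dependencies behind extras is to import them lazily, at the point of use. The sketch below shows that general pattern; `require_optional` and `load_backend` are hypothetical names, and this is not a description of how GenoLeWM implements it.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional dependency, or fail with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is needed for this feature; "
            f"install the '{extra}' extra to enable it"
        ) from exc


def load_backend():
    # The heavy import happens only when this function is called,
    # so importing the surrounding package stays cheap.
    return require_optional("torch", extra="train")


# A stdlib module stands in for an installed optional dependency.
json_module = require_optional("json", extra="train")
print(json_module.dumps({"ok": True}))
```
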
Detailed design:
- ARCHITECTURE.md - narrative architecture walkthrough
- docs/spec/01-architecture.md - module boundaries
- docs/spec/03-data-model.md - dataset and checkpoint layouts
- ROADMAP.md - current execution plan
Python 3.10 or newer is required. The first PyPI release has not been cut yet, so install from source for now:

    git clone https://github.com/AbdelStark/GenoLeWM.git
    cd GenoLeWM
    uv venv
    source .venv/bin/activate
    uv pip install -e "."

For development extras:

    git clone https://github.com/AbdelStark/GenoLeWM.git
    cd GenoLeWM
    uv venv
    source .venv/bin/activate
    uv pip install -e ".[dev,docs]"

Optional extras:

| Extra | Use |
|---|---|
| geno-lewm[train] | PyTorch, Transformers, datasets, training utilities |
| geno-lewm[eval] | VCF/FASTA parsing and evaluation dependencies |
| geno-lewm[deploy] | ONNX export/runtime dependencies |
| geno-lewm[docs] | MkDocs documentation build |
| geno-lewm[dev] | Tests, linting, typing, packaging checks |
| geno-lewm[all] | Train, eval, and deploy extras |

These commands exercise local contracts. They are useful for development and release hardening, but they do not replace the public v0.1 artifact set or prove broader model quality.
| Task | Command | What it proves |
|---|---|---|
| Verify a checksum receipt fixture | geno-lewm-verify examples/data/verify_receipt/receipt.json --manifest examples/data/verify_receipt/manifest.json | Receipt schema, manifest identity, and output commitment plumbing work locally |
| Inspect the released terminal demo | terminal-demo-transcript.md | The v0.1 release replayed geno-lewm-score from public model/data/demo artifacts and recorded score/receipt hashes |
| Run fixture training smoke | geno-lewm-train --fixture-smoke --run-dir /tmp/geno-lewm-smoke --steps 50 | Trainer packaging path can emit deterministic fixture artifacts without optional Carbon weights |
| Validate the first-experiment dataset spec | python -m tools.release.dataset_snapshot --spec-json configs/first_experiment/dataset-snapshot-snv.json --check-spec | Dataset rebuild metadata, source layout, split coverage, and staged paths are internally consistent without local upstream files |
| Check public API drift | uv run python tools/api/snapshot.py check | The exported Python surface matches tests/api/public_surface.json |
| Check retired-scope language | uv run python -m tools.lint.check_scope_language | Public docs/code do not reintroduce unsupported runtime-assurance claims |
| Build docs strictly | uv run mkdocs build --strict | MkDocs renders the public documentation with strict link/page checks |


    from geno_lewm import EditSpec, EditType, RelEdit, apply_edit, apply_edits

    # Describe an SNV in genomic coordinates.
    edit = EditSpec(chrom="chr17", pos=43_091_983, ref="A", alt="T")
    assert edit.edit_type is EditType.SNV

    # Re-express the edit relative to a sequence window.
    relative = edit.relative_to(window_start_bp=43_091_900, window_end_bp=43_092_100)
    print(relative.rel_pos)

    # Apply a single edit, then a list of edits, to a toy window.
    window = "ACGT" * 64
    edited = apply_edit(window, RelEdit(0, EditType.SNV, "A", "C"))
    haplotype = apply_edits(
        window,
        [
            RelEdit(rel_pos=0, edit_type=EditType.SNV, ref_bases="A", alt_bases="T"),
            RelEdit(rel_pos=4, edit_type=EditType.SNV, ref_bases="A", alt_bases="C"),
        ],
    )

All validation failures use typed GenoLeWMError subclasses with stable
machine-readable codes.
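
To show what typed errors with stable codes buy a caller, here is a minimal sketch of the general pattern. The class names, code strings, and `check_ref_matches` helper are hypothetical and are not the GenoLeWMError hierarchy itself.

```python
class ToyGenoError(Exception):
    """Hypothetical base error carrying a stable machine-readable code."""

    code = "toy.error"

    def to_dict(self):
        return {"code": self.code, "message": str(self)}


class ToyEditValidationError(ToyGenoError):
    code = "toy.edit.invalid"


def check_ref_matches(window, rel_pos, ref_bases):
    """Reject an edit whose reference bases disagree with the window."""
    observed = window[rel_pos:rel_pos + len(ref_bases)]
    if observed != ref_bases:
        raise ToyEditValidationError(
            f"ref mismatch at {rel_pos}: expected {ref_bases!r}, found {observed!r}"
        )


try:
    check_ref_matches("ACGT", 0, "G")
except ToyGenoError as err:
    # Callers branch on the code, not on the human-readable message.
    payload = err.to_dict()
    print(payload["code"])
```
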

    from geno_lewm import get_logger

    log = get_logger("inference", run_id="run-42")
    log.info("inference.batch.end", n=10, batch_id="b-1", throughput_per_s=87.2)

The logging layer is deny-list and allow-list based. It rejects long DNA strings and personal-data fields before events leave the process.
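
A deny-list plus allow-list check of this kind can be sketched as follows. The field names, the 32-base threshold, and the `check_event` helper are illustrative assumptions; the actual policy values are not shown in this README.

```python
import re

# Hypothetical policy values chosen for the example.
ALLOWED_FIELDS = {"n", "batch_id", "throughput_per_s"}
DENIED_FIELDS = {"sample_id", "patient_name"}
DNA_RUN = re.compile(r"[ACGTNacgtn]{32,}")


def check_event(fields):
    """Raise ValueError if an event carries denied, unknown, or DNA-like fields."""
    for key, value in fields.items():
        if key in DENIED_FIELDS:
            raise ValueError(f"denied field: {key}")
        if key not in ALLOWED_FIELDS:
            raise ValueError(f"field not on allow-list: {key}")
        if isinstance(value, str) and DNA_RUN.search(value):
            raise ValueError(f"field {key} looks like a DNA sequence")
    return fields


print(check_event({"n": 10, "batch_id": "b-1", "throughput_per_s": 87.2}))
```

Checking before the event is emitted means a rejected event never reaches a log sink.
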

    from geno_lewm import DtypeConfig, EditSpec, PoolingConfig, compute_input_commitment

    edit = EditSpec(chrom="1", pos=10, ref="A", alt="T")
    pool = PoolingConfig(state_layer=12, pool_type="centered_mean", pool_radius=64, normalize=True)
    dtype = DtypeConfig(encoder_dtype="bf16", predictor_dtype="bf16")
    window = "ACGT" * 64
    print(compute_input_commitment(window, edit, pool, dtype))

geno-lewm-verify checks receipt schema validity, manifest identity,
optional input commitments, and output commitments:

    $ geno-lewm-verify examples/data/verify_receipt/receipt.json \
        --manifest examples/data/verify_receipt/manifest.json
    reading receipt: examples/data/verify_receipt/receipt.json
    schema_version=1.0.0 provenance.kind=checksum_only
    reading manifest: examples/data/verify_receipt/manifest.json
    model_id ok (sha256:3bcf3c87e5dd99...)
    input_commitment: skipped (no input flags supplied)
    output_commitment ok (sha256:982aee9fc1786...)
    ok

This is reproducibility and tamper-detection plumbing. It is not a model-quality or runtime-assurance guarantee.
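
The idea behind a checksum commitment can be sketched in a few lines. The receipt layout, the canonical-JSON encoding, and the helper names below are assumptions made for the example; they are not the geno-lewm-verify receipt schema.

```python
import hashlib
import hmac
import json


def commitment(payload):
    """SHA-256 over canonical JSON (sorted keys, compact separators)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify_output(receipt, output):
    """Recompute the commitment and compare it in constant time."""
    return hmac.compare_digest(receipt["output_commitment"], commitment(output))


# Toy output and receipt (hypothetical fields).
output = {"variant": "chr21:100:A>T", "score": 0.25}
receipt = {"schema_version": "1.0.0", "output_commitment": commitment(output)}

print(verify_output(receipt, output))    # untouched output verifies
tampered = dict(output, score=0.99)
print(verify_output(receipt, tampered))  # any change breaks the match
```

A match shows only that the output bytes are the ones that were committed to. It says nothing about whether the score is good.
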
geno-lewm-score --variant ... --receipt path/to/receipt.json writes
one canonical receipt. geno-lewm-score --vcf ... --receipt path/to/receipts.jsonl writes one canonical receipt per scored ALT as a
JSONL sidecar. Both paths require manifest-verified local scorer
components. The runtime can now attempt local native component loading
when torch, transformers, and safetensors are installed. The v0.1
clean-machine demo replayed the VCF scoring path from the published
model, dataset, and demo artifacts; broader platform coverage still
needs v0.2 validation.
The first paper/demo experiment was intentionally narrow:
| Component | Target |
|---|---|
| Encoder | Frozen Carbon-500M state vectors |
| Edits | SNVs only |
| Data | Versioned Carbon corpus slice plus prepared gnomAD/ClinVar shards and held-out ClinVar coding/non-coding variants |
| Model | Action encoder + predictor head |
| Metrics | rollout cosine similarity, residual distribution, AUROC/AUPRC against ClinVar labels, throughput |
| Release artifacts | dataset package metadata, dataset input check report, dataset card, model package metadata, model card, checkpoint, manifest, source metrics JSON, effective eval config, eval report, efficiency report, terminal demo transcript, terminal demo manifest, runtime preflight report, batch receipt report |
The first conclusions are deliberately conservative: the release proves the artifact chain and records near-chance held-out chr21 ClinVar metrics. It does not establish broad variant-effect quality, speed at the RFC-0004 autoregressive rollout target, clinical utility, privacy assurance, or planning usefulness.
Completed v0.1 release gates
| Gate | Issue | v0.1 evidence |
|---|---|---|
| Dataset snapshot and data card | #163 | Public dataset package and data card are published at https://huggingface.co/datasets/abdelstark/geno-lewm-data |
| First Carbon-backed run | #164 | geno-lewm-coherent-cd2bfcc trained for 20,000 steps / 160,000 samples and published run evidence |
| Paper-ready results report | #165 | Published eval_metrics.json, eval_report.md, and efficiency_report.json record the first-release measured results and negative findings |
| Terminal real-inference showcase | #166 | Public terminal transcript replayed geno-lewm-score over 32 VCF records with score and receipt JSONL hashes |
| First experiment paper package | #167 | Public paper.md binds the dataset, checkpoint, eval, efficiency, terminal demo, conclusions, and negative findings |
| Model checkpoint Hub release | #101 | Public model package, model card, checkpoint files, manifest, checksums, eval report, and demo links are published at https://huggingface.co/abdelstark/geno-lewm |
Use this table to separate reusable local release contracts from the public v0.1 evidence. Green local tooling is necessary, but it is not a substitute for real artifacts in future releases.
| Evidence artifact | Local contract | Paper-release status |
|---|---|---|
| Dataset package | python -m tools.release.dataset_snapshot --spec-json configs/first_experiment/dataset-snapshot-snv.json --check-spec validates the checked rebuild spec; --check-inputs hashes staged upstream files; the same spec with --dataset-dir ... --overwrite writes dataset_input_check_report.json, dataset_snapshot_report.json, dataset_package.json, dataset_manifest.json, data_card.md, split_integrity.json, and SHA256SUMS | Completed for v0.1 and published with #163; repeat for larger v0.2 benchmark snapshots |
| Training run | geno-lewm-train --carbon-preflight ... and geno-lewm-train --carbon-train --package-release-run ... bind config, dataset, CUDA/VRAM readiness, Carbon model, checkpoint, logs, metrics, and training_run_SHA256SUMS | Completed for v0.1 with #164; v0.2 needs stronger runs only after data/eval gates improve |
| Evaluation and efficiency | geno-lewm-eval, geno-lewm-carbon-baseline, geno-lewm-eval-all, geno-lewm-rollout, and python -m bench.inference --release-efficiency generate eval_metrics.json, eval_config.effective.yaml, eval_report.md, rollout-fidelity metric rows from measured state JSONL, and efficiency_report.json | Completed for the narrow v0.1 release with #165; broader measured benchmark and rollout evidence remain open, including real rollout state-row evidence |
| Terminal demo | python tools/demo/terminal_inference.py ... records terminal-demo-transcript.md, terminal_demo_manifest.json, runtime_preflight_report.json, scores.jsonl, receipts.jsonl, and batch_receipt_report.json | Completed for v0.1 with #166; v0.2 should demonstrate benchmark/planning behavior without clinical claims |
| Paper and publication evidence | python -m tools.release.paper_draft, python -m tools.release.paper_package, python -m tools.release.release_candidate, python -m tools.release.clean_machine_demo, and python -m tools.release.publication_report bind the paper, Hub plan, public links, replay, and final evidence report | Completed for v0.1 through #167 and #101; final binder is public and has ok=true |

The next work should improve the evidence substrate before expanding demos. Track it in #197:
Use the v0.2 readiness report to bind measured eval, efficiency, and AR rollout speed artifacts before making broader claims:

    python -m tools.release.v02_benchmark_suite \
      --manifest configs/first_experiment/v0.2_benchmark_suite.template.json \
      --output-report .../v0.2_benchmark_suite_report.json

The checked manifest is a planning template. Stage a release-local copy,
replace the identity fields and input artifact paths with measured v0.2
artifacts, then run from the package root. The suite runner composes
existing commands for GenoLeWM scoring, Carbon-baseline scoring,
per-benchmark eval, rollout-fidelity metrics, aggregate report
generation, and the all-up readiness report. ClinVar coding and
non-coding rows use binary ClinVar metrics; BRCA2 saturation and
TraitGym Mendelian rows use geno-lewm-eval --metric-mode spearman
with continuous labels. Without --execute, the suite writes a command
plan with ok=false; this is not measured evidence. With --execute,
ok=true only means every planned command completed after the suite
cleared that step's declared output files, then wrote those output files
again. Passed execute-mode steps record output identities with
package-local paths, SHA-256 values, and sizes, but the generated
metrics, efficiency, rollout-speed, and readiness artifacts must still
validate separately. Suite reports bind the manifest with a package-local
path, SHA-256, and size identity rather than a build-machine absolute
path. Run the final release-input readiness command after an executed
suite report exists, passing that report with --suite-report; the first
suite execution cannot consume the report it is still writing. A
second-pass suite manifest can express that final command by setting
readiness.suite_report.
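
For the Spearman rows mentioned above, the statistic itself is the Pearson correlation of the two rank vectors. The sketch below computes it on toy continuous labels using only the standard library; it is an illustration of the metric, not the geno-lewm-eval implementation, and the data is invented.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(scores, labels):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(scores), average_ranks(labels))


# Toy continuous labels (illustrative only).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [1.0, 2.5, 3.0, 9.0]
print(spearman(scores, labels))  # one swapped pair out of four items
```
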
For rollout benchmarks, the manifest can optionally include a
state_generation block. When it names spec_jsonl, cache_dir, and
examples_report_json, the suite first runs
python -m tools.release.rollout_state_examples to resolve
cache-keyed measured source/target/candidate latent states into
tools.release.rollout_state_examples JSONL. It then runs
python -m tools.release.rollout_state_rows before geno-lewm-rollout;
that second generator consumes the measured latent examples and the
manifest-backed action encoder/predictor to produce
geno-lewm-rollout-states JSONL.

    python -m tools.release.v02_benchmark_readiness \
      --metrics-json .../eval_metrics.json \
      --rollout-speed-report .../rollout.ar_speed.json \
      --rollout-speed-scope-report .../rollout_speed_scope.json \
      --efficiency-report .../efficiency_report.json \
      --suite-report .../v0.2_benchmark_suite_report.json \
      --output .../v0.2_benchmark_readiness_report.json \
      --require-ok

With --require-ok, the gate also requires release-shaped input
provenance: package-relative score/label or aggregate metrics inputs,
efficiency input identities, rollout-state generation report artifacts
for rollout metrics, measured VEP values with baseline deltas and
confidence intervals, measured efficiency latency/throughput/memory
values plus efficiency command provenance, an executed passing suite
report with passed-step output identities, and non-fixture release
identity text. The readiness report records input artifact identities
and readiness, efficiency, accepted scope, or nested rollout-speed
command path arguments with public-safe paths plus SHA-256 and size
where applicable, and its release_inputs row records checked metrics
artifact paths, efficiency input identities, and suite output identities.
The direct bench.rollout speed report must also carry a claim boundary
stating that rollout speed is not model-quality, clinical, privacy, or
release-readiness evidence.
The suite report must include the readiness --metrics-json artifact in
passed-step output identities, preventing stale metrics from being paired
with an unrelated suite execution.
It must also preserve negative findings and a claim boundary that keep
measured model-quality claims dependent on downstream artifact
validators.
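
The stale-metrics check described above amounts to requiring that the metrics artifact's identity appears among the outputs of a passed suite step. A minimal sketch of that idea follows; the report layout and field names (`steps`, `status`, `outputs`) are assumptions for illustration, not the suite report schema.

```python
import hashlib


def identity(path, data):
    """Package-local path plus SHA-256 and size, as a plain dict."""
    return {"path": path, "sha256": hashlib.sha256(data).hexdigest(), "size": len(data)}


def metrics_bound_to_suite(metrics_identity, suite_report):
    """True only if a passed step lists exactly this metrics artifact as an output."""
    for step in suite_report["steps"]:
        if step["status"] != "passed":
            continue
        if metrics_identity in step["outputs"]:
            return True
    return False


# Same path, different bytes: the stale file must not be accepted.
fresh = identity("eval/eval_metrics.json", b'{"auroc": 0.5}')
stale = identity("eval/eval_metrics.json", b'{"auroc": 0.9}')
suite_report = {"steps": [{"name": "eval-all", "status": "passed", "outputs": [fresh]}]}

print(metrics_bound_to_suite(fresh, suite_report))
print(metrics_bound_to_suite(stale, suite_report))
```

Comparing path, hash, and size together is what prevents a same-named file from a different run from slipping through.
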
Absolute CLI paths do not enter the report. The report also derives a
readiness checklist and a blockers list with issue refs from the same
benchmark rows. Metric conclusions include measured values, baseline
deltas, split/track context, confidence intervals, and evaluated
variant-key identities where available; non-passing conclusions include
missing metrics, missing confidence intervals, baseline gaps, failed
targets, or release-input findings where applicable. The report is
expected to remain ok=false until the broader benchmark suite passes
from measured artifacts and the #42 rollout speed target either passes
or is explicitly re-scoped through
python -m tools.release.rollout_speed_scope. A scope report must bind
the failing bench.rollout report, GitHub issue refs including #42 and
#197, UTC generated and accepted timestamps, HTTP(S) decision URL,
rationale, replacement target, public-safe input path/SHA-256/size
identity, and public-safe scope and nested rollout command paths. The
source bench.rollout report must preserve its own claim boundary before
the accepted re-scope artifact is generated. The scope report must also
preserve negative findings and a claim boundary stating that the failed
target is still not passing rollout-speed evidence; readiness verifies
those scope-report identities before recording the AR-speed row as
rescoped while preserving the failed measured speedups, report
identity, accepter, rationale, replacement target, timestamps, decision
URL, and issue refs in scope_decisions. The re-scope metric conclusion
also carries failed-target details plus the accepted decision URL,
rationale, replacement target, and issue refs.
Before building the all-up readiness report, use
geno-lewm-eval-all --require-v02-vep-metrics --require-v02-rollout-metrics
to fail the aggregate metrics refresh when coding/non-coding ClinVar,
BRCA2 saturation, TraitGym Mendelian, or rollout-fidelity rows are
missing required measured metric coverage. The VEP gate requires
Carbon-baseline deltas, confidence intervals, and evaluated variant-key
identities; the rollout gate requires phased-haplotype and synthetic
edit-chain cosine/L2/Recall@k rows. These are only aggregate coverage
gates; efficiency, rollout speed, and release-input provenance still
belong to tools.release.v02_benchmark_readiness.
For rollout-fidelity evidence, geno-lewm-rollout --states-jsonl ... --output-metrics ...
now aggregates measured latent-state rows into eval-compatible
cosine_similarity_mean, l2_distance_mean, and recall_at_k metrics
with source-state baseline deltas and per-K stratification. It does not
generate held-out haplotypes or run Carbon encoding; those measured
state-row artifacts are still v0.2 benchmark inputs. The lower-level
tools.release.rollout_state_examples helper resolves explicit
cache-key specs into measured latent examples, and
tools.release.rollout_state_rows bridges those examples to
rollout-state JSONL. The rows helper only accepts versioned
schema_version=1.0.0 example rows generated by
tools.release.rollout_state_examples. Neither helper runs Carbon
encoding, constructs held-out haplotypes, or turns fixture states into
benchmark evidence.
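
The three rollout metrics named above can be illustrated on toy latent rows. The row field names (`predicted`, `target`, `candidates`) and the nearest-neighbour definition of Recall@k are assumptions for the example; baseline deltas and per-K stratification are left out, and this is not the geno-lewm-rollout implementation.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def recall_at_k(predicted, target, candidates, k):
    """1.0 if the true target is among the k candidates nearest to the prediction."""
    nearest = sorted(candidates, key=lambda c: l2(predicted, c))[:k]
    return 1.0 if target in nearest else 0.0


def aggregate(rows, k):
    """Average each metric over all rows."""
    n = len(rows)
    return {
        "cosine_similarity_mean": sum(cosine(r["predicted"], r["target"]) for r in rows) / n,
        "l2_distance_mean": sum(l2(r["predicted"], r["target"]) for r in rows) / n,
        "recall_at_k": sum(
            recall_at_k(r["predicted"], r["target"], r["candidates"], k) for r in rows
        ) / n,
    }


# Two toy rows: one exact prediction and one poor prediction.
rows = [
    {"predicted": [1.0, 0.0], "target": [1.0, 0.0],
     "candidates": [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]},
    {"predicted": [0.0, 1.0], "target": [3.0, 4.0],
     "candidates": [[3.0, 4.0], [0.0, 1.5], [0.0, 0.5]]},
]
metrics = aggregate(rows, k=1)
print(metrics)
```
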
In release-input mode, geno-lewm-rollout should record
--rollout-state-examples-report and --rollout-state-rows-report so
the readiness report can bind both generation stages.
- audit data issues #49, #50, #51, and #52 against the actual v0.1 pipeline and turn remaining deltas into narrower v0.2 work;
- run broader coding/non-coding held-out benchmarks with measured GenoLeWM-vs-Carbon deltas, exact variant identities, and negative findings;
- implement or explicitly re-scope the RFC-0004 AR rollout KV-cache speed target;
- add regression/benchmark gates for finite loss, collapse health, eval-artifact integrity, and rollout performance;
- unblock planning API/CLI work only after the predictor/eval substrate supports honest claims.
The v0.1 release satisfied this checklist for its narrow first-publication scope. Future releases should satisfy the same contract with stronger data, evaluation, and rollout evidence before making broader claims:
- Dataset snapshot is reproducible from scripts and pinned revisions,
  starting from a checked snapshot spec and explicit local upstream
  files with
  `python -m tools.release.dataset_snapshot --spec-json configs/first_experiment/dataset-snapshot-snv.json --check-spec` for public spec validation, then `python -m tools.release.dataset_snapshot --spec-json configs/first_experiment/dataset-snapshot-snv.json --check-inputs` to record SHA-256 and byte-size identities for staged upstream inputs, then `python -m tools.release.dataset_snapshot --spec-json configs/first_experiment/dataset-snapshot-snv.json --dataset-dir ... --overwrite` once the upstream files are staged under `configs/first_experiment/inputs/`. That command stages Carbon source-mix files, builds gnomAD and ClinVar Parquet shards from local VCF/VCF.gz inputs, writes `dataset_package.json`, runs `python -m tools.release.dataset_package --dataset-dir ... --metadata-json ...`, and emits `dataset_input_check_report.json`, `dataset_snapshot_report.json`, `dataset_manifest.json`, `data_card.md`, `split_integrity.json`, and `SHA256SUMS`. The snapshot report records the checked spec hash plus upstream source file hashes without embedding private absolute input paths, binds the input-check report, generated dataset package metadata, manifest, data card, and split-integrity artifacts by path/hash/size, and keeps the nested package file table aligned with the top-level staged file identities; it is included in `SHA256SUMS` and validated by the release verifier. The release verifier checks that generated dataset package metadata carries `generated_by=tools.release.dataset_package` and that the data card and manifest still match `dataset_package.json`; it also rejects invalid or duplicate `SHA256SUMS` paths. The split-integrity report covers record counts, file identities, observed label/class balance, Parquet variant-key extraction, train/eval leakage checks, and the `tools.release.dataset_integrity` source header; leakage evidence fails closed when train/eval comparable keys are missing, and the data card renders the same class-balance summary from `split_integrity.json`.
- Training tuples are built through
  `geno_lewm.data.build_training_tuples` or streamed through `geno_lewm.data.GenoLeWMDataset` so source mix, ClinVar fallback, and holdout exclusions are enforced before the trainer sees a batch.
- The real trainer core uses
  `geno_lewm.training.encode_training_batch` and `geno_lewm.training.TorchTrainer` to turn Carbon-encoded source and target windows plus relative edits into predictor steps with AdamW parameter groups, WSD learning-rate scheduling, gradient clipping, and distinct data/predictor/LoRA seed records. Source `s_t` states use the documented window cache when a compatible `$GENO_LEWM_CACHE/embeddings/index.sqlite` is present; cache misses fall through to live untargeted Carbon encoding, while edited `s_{t+1}` targets are still encoded on the fly.
- Train/eval configs are committed and can be run from a clean machine;
  the first-experiment checked configs live under
  `configs/first_experiment/`, and Carbon training preflight validates the effective training config against the closed GenoLeWM schema before launch; fixture smoke training is available via `geno-lewm-train --fixture-smoke --run-dir ... --steps 50`; real training inputs are preflighted with `geno-lewm-train --carbon-preflight --dataset-dir ... --carbon-model-dir ... --training-config ... --run-dir ...`; that preflight now requires the packaged dataset release evidence set (`dataset_package.json`, `dataset_manifest.json`, `data_card.md`, `split_integrity.json`, `dataset_input_check_report.json`, `dataset_snapshot_report.json`, and `SHA256SUMS`), requires the first-experiment config to resolve `runtime.device: cuda`, checks CUDA availability plus the default 40 GiB minimum device-memory threshold, and rejects stale input-check evidence before the trainer can launch; the single-process launcher is `geno-lewm-train --carbon-train --dataset-dir ... --carbon-model-dir ... --training-config ... --run-dir ...`; the CLI writes `training_config.effective.yaml`, preflights that exact effective config, mirrors `training_preflight_report.json` into the run directory, and `--package-release-run` builds `training_run_manifest.json`, `training_run_card.md`, and `training_run_SHA256SUMS` immediately after a successful Carbon-backed run; `--resume-from predictor_checkpoint.pt` is available for Carbon runs but only accepts checkpoints whose run id, dataset snapshot, seed split, and config identity match the target run, and the resumed step is recorded in metrics, logs, and `training_run.json`; the paper run still requires a completed clean-machine Carbon-backed execution; completed training evidence is packaged with `python -m tools.release.training_run --run-dir ... --metadata-json ...`.
  Release training-run packages include checksum-covered `training_preflight_report.json`, require `generated_by=tools.release.training_run`, and release-mode verification requires the preflight report's dataset core-file evidence for `dataset_package.json`, `dataset_input_check_report.json`, `dataset_snapshot_report.json`, and `SHA256SUMS`. The paper/demo verifier rejects missing, stale, incomplete, or private-path preflight evidence plus `training_run_card.md` drift from `training_run_manifest.json` before model publication can pass.
- Checkpoint is packaged with
  `python -m tools.release.model_package --model-dir ... --metadata-json ...` before publication; the model-package command writes normalized `model_package.json`, renders `model_card.md` from that metadata plus `manifest.json`, requires `generated_by=tools.release.model_package`, requires packaged `eval_metrics.json` plus `efficiency_report.json`, verifies `eval_report.md` is rendered from the metrics source, requires the `tools.release.efficiency_report` source header, cross-checks eval/efficiency release id, dataset snapshot, commit, and model-result identity, requires model metadata to list `training_preflight_report.json`, `training_run_manifest.json`, `training_run_card.md`, and `training_run_SHA256SUMS` as release evidence, and includes all generated source artifacts plus model-local eval artifact references from `eval_metrics.json` in `SHA256SUMS`. The paper/package verifier re-renders the model card, rejects invalid or duplicate checksum paths, binds training-run dataset snapshot, training config path/hash, and commit identity to the manifest plus eval/efficiency evidence, and rejects stale model metadata before Hub dry-runs or release-candidate reports pass.
- Evaluation metrics are first generated from real score/label artifacts
with
geno-lewm-eval --scores-jsonl ... --labels-jsonl ... --efficiency-report ... --output-metrics ...; primary score rows must carrygenerated_by=geno-lewm-score;geno-lewm-evalrecords checkpoint, config, dataset-manifest, effective eval config, efficiency, score, label, and baseline-score artifacts as package-relative paths under--artifact-root(defaulting to the metrics output directory), writeseval_config.effective.yamlbesideeval_metrics.json, and prevents absolute private workstation paths from entering release metrics JSON; accepted metrics payloads must carrygenerated_by=geno-lewm-evalorgenerated_by=geno-lewm-eval-all, so paper reports cannot be rendered from hand-labelled metrics JSON; Carbon zero-shot baseline scores are generated separately withgeno-lewm-carbon-baseline --artifact-root ... --vcf ... --fasta ... --carbon-model-dir ... --output-scores ... --logp-cache-jsonl ...and each baseline row carriesgenerated_by=geno-lewm-carbon-baseline; optional sequence log-likelihood cache rows are scoped to the Carbon model and revision before reuse and must have unique sequence SHA-256 keys within that scope, while--artifact-rootkeeps model, input, output, and cache paths in the generated summary metadata package-relative. Baseline scores are attached with--baseline-scores-jsonl ... --baseline-score-field carbon_zero_shot_score --baseline-name carbon_zero_shot; generated reports that include baseline comparisons are rejected unlessbaseline,baseline_value, anddelta_vs_baselineare supplied together and the metrics payload also records a baseline score artifact; this emits deterministic stratified bootstrap confidence intervals by default and records an omission reason when bootstrap resampling is disabled; multiple metrics artifacts are then aggregated and rendered withgeno-lewm-eval-all --metrics-json ... --output-metrics ... --output-report .... 
That command refresheseval_config.effective.yamlnext toeval_metrics.json; the eval-report parser requires each accepted metrics payload to record it as a package-relativeeval_configartifact, and generated reports must include the same artifact row. Metrics inputs must also live under the aggregate metrics directory so the report is tied to the committed eval config plus explicit CLI overrides without private absolute paths. Metric conclusions ineval_metrics.jsonmust explicitly reference every measured metric name, split, measured value, and baseline delta when a baseline is present;negative_findingsmust be a non-empty list rendered as## Negative Findings, so generic result summaries cannot be packaged as paper conclusions. Inference efficiency evidence is generated separately withpython -m bench.inference --release-efficiency --model-dir ... --vcf ... --fasta ... --variant ... --window ... --output-json ...so single-variant latency, batched throughput, peak memory, hardware/runtime notes, command, and package-relative or inline input identities are machine-readable release artifacts rather than prose claims. The lower-level report renderer remains available aspython -m tools.release.eval_report --metrics-json ... --output ...and includes baselines, confidence intervals, hardware, wall-clock cost, and known failure modes, but it rejects metrics payloads whose generator is not one of the eval CLIs. 
The paper/package verifier requires generated report markers, the Summary/Artifacts/Results sections, model and dataset identity lines, checkpoint/config/dataset-manifest plus efficiency-report artifact rows, and baseline score artifacts whenever baseline rows are reported; it resolves eval artifact paths inside the package and validates primary/baseline score JSONL generated_by markers; model-local eval artifact references must also be listed in model SHA256SUMS; it also re-renders eval_report.md from the packaged eval_metrics.json, validates efficiency_report.json, checks that eval and efficiency evidence agree with the manifest release id and training dataset snapshot, and rejects stale Markdown.
- Terminal demo runs real model inference, not fixtures.
- Demo transcript is generated by
tools/demo/terminal_inference.py from the actual geno-lewm-score command and records generated time, exit code, model release/version/id, score/receipt JSONL hashes, row counts, JSONL field names, artifact-input paths, and an explicit claim-boundary sentence; the same run emits terminal_demo_manifest.json to bind the command, model id, input identities, VCF input summary, transcript hash, score/receipt hashes, generated report hashes, and a compact score_receipt_batch summary with record count, checked score fields, receipt stream, model id, calibration hash, and runtime identity as machine-readable release evidence. The demo runner clears owned score, receipt, batch-report, and demo-manifest outputs before invoking the score command so stale JSONL rows cannot satisfy a later run. The package verifier rejects stale input identities, stale VCF input summaries, or VCF/FASTA demo inputs that are not shipped inside the demo package, and it requires recorded commands plus artifact labels to resolve to the canonical package files; it also rejects runtime-preflight command drift from the terminal-demo manifest command, stale terminal-demo manifest runtime_preflight summaries that no longer match runtime_preflight_report.json, stale transcript claim-boundary or artifact-input markers, stale manifest JSONL field lists, or score_receipt_batch summaries that no longer match the packaged score, receipt, and batch-report artifacts. The same run also emits runtime_preflight_report.json to record model/input hashes, native runtime dependency availability, backend probes, and the fail-closed network guard; release verification rejects reports generated with fixture/test manifest allowance enabled. Before writing terminal_demo_manifest.json, the demo runner re-opens that preflight report and rejects stale or mutated evidence whose model id, release id, VCF/FASTA identities, command argv, requested backend, runtime requirement flags, or model artifact checks no longer match the same run.
The same run also emits batch_receipt_report.json so the score rows, receipt rows, model id, calibration hash, runtime identity, and per-row output commitments are checked as one batch artifact. The release-package verifier rejects score/receipt batches whose model id or calibration hash does not match the packaged model manifest.
- Paper draft is generated from the release artifacts with
python -m tools.release.paper_draft --model-dir ... --dataset-dir ... --demo-dir ... --output ... so Citation Metadata, Results, Conclusions, Negative Findings, Limitations, and Artifact Availability are grounded in the generated eval report, efficiency report, manifest, dataset package, and demo evidence. Draft generation rejects stale eval_report.md output that no longer matches eval_metrics.json and stale terminal-demo VCF summaries that no longer match the packaged demo VCF, requires a UTC Generated: ...Z timestamp, then renders that scored-input summary in Demo Evidence. The draft names model_package.json, dataset_package.json, dataset_input_check_report.json, dataset_snapshot_report.json, eval_metrics.json, eval_config.effective.yaml, eval_report.md, efficiency_report.json, and demo evidence paths, using package-local artifact names rather than build-machine root paths; the package verifier re-renders the draft from the current artifact set and rejects stale Markdown or drafts missing Citation Metadata or Negative Findings.
- Release package passes
python -m tools.release.paper_package across the model, dataset, demo, and paper artifacts.
- Hub publication dry-run passes
python -m tools.release.hub_release --model-dir ... --dataset-dir ... --demo-dir ... before any checkpoint upload; paper candidates require --paper-url. The versioned hub_release_plan.json records model files from SHA256SUMS plus training_run_SHA256SUMS, dataset files plus dataset SHA256SUMS, and demo files from portable terminal_demo_manifest.json with unique GitHub release asset names. When a paper artifact is included, it also records the verified public-safe paper source name/path, SHA-256, and size next to the public paper URL. For a direct GitHub .../releases/download/<tag>/<paper-file> URL whose asset name matches the verified paper file, the plan also emits the exact paper upload command. Private files beside the package are never published by a directory sync. The non-publishing .github/workflows/release-hub-dry-run.yml workflow runs the package verifier, Hub dry-run planner, and release candidate report without requiring Hub credentials.
- Credentialed publication runs
python -m tools.release.hub_publish --model-dir ... --dataset-dir ... --demo-dir ... through .github/workflows/release-hub-publish.yml after the dry-run is clean. The workflow requires the protected release environment, HF_TOKEN, and GitHub release permissions; it syncs the locked dev, train, eval, and deploy extras so the clean-machine replay has the native runtime stack available; it uploads only the model, dataset, demo, and matching paper files named by the verified Hub plan. Paper publication requires a direct GitHub release download URL whose final asset name matches the verified paper file, because the final release-candidate check hashes the public paper URL bytes. The helper then regenerates release_candidate_report.json from the public links and fetched public artifact bytes. The protected workflow then runs the clean-machine terminal replay from that ready report with native runtime checks enabled. It passes the release HF_TOKEN only to Hugging Face artifact fetches and the GitHub token only to the release asset listing. After the final binder passes, the workflow uploads hub_release_plan.json, release_candidate_report.json, hub_publish_report.json, clean_machine_demo_report.json, and publication_evidence_report.json, then runs python -m tools.release.publication_assets to write publication_evidence_assets.json with the GitHub release target and evidence-asset hashes and upload command. It uploads that manifest plus the clean-machine replay transcript, manifest, score/receipt JSONL streams, runtime preflight report, and batch receipt report to the public demo release tag, and keeps the replay directory as a workflow artifact for debugging.
- A generated release-candidate report from
python -m tools.release.release_candidate --model-dir ... --dataset-dir ... --demo-dir ... --paper-path ... --paper-url ... --repo-id ... --dataset-url ... --demo-url ... --commit-sha ... --output ... binds the package verifier, Hub publication plan, public-link reachability checks, commit, model id, dataset snapshot, dataset package metadata, dataset snapshot report, source metrics JSON, effective eval config, generated eval report, efficiency report, manifest-backed checkpoint/config/calibration artifacts, training-run checksums, Hub model/dataset/demo upload inventories, and key artifact hashes using package-role artifact paths rather than private absolute workstation paths. It also emits a readiness checklist covering package verification, model artifacts, dataset artifacts, terminal-demo evidence, paper artifact, public links, provider-backed public artifact exact file-set, hash, and size checks plus direct paper byte hash/size checks, and upload-plan completeness; readiness rows and blockers carry issue_refs pointing to the live release issues that own each failure. ready=true requires the model, dataset, demo, and paper URLs to be reachable and, for recognized Hugging Face/GitHub targets, requires the remote listings to contain exactly the expected model, dataset, and terminal-demo files, and requires the public paper URL bytes to match the verified paper file hash and size. Fetched public bytes must match the upload-inventory SHA-256 and size values unless the command is explicitly run in offline fixture mode with both --allow-fixture-manifest and --skip-public-link-check; skipping public checks without fixture mode keeps ready=false.
- Dataset, model, training-run, paper-draft, and terminal-demo command reports use package-local artifact names in their success JSON output; the terminal transcript uses the same portable names for the score command, output artifacts, and input references. These artifacts must not serialize private workstation roots.
- Clean-machine terminal replay from
python -m tools.release.clean_machine_demo --release-candidate-report ... --output-dir ... downloads the published model files, dataset snapshot files, and GitHub release demo assets named by the generated ready release-candidate report. It rejects hand-authored reports, candidates missing generated readiness rows, candidates with non-empty blockers, skipped or failed public link checks, and skipped, missing, incomplete, or failed public artifact checks before any replay download. It also rejects embedded Hub plans whose source headers or model/repo/URL identities do not match, rejects unsafe Hub-plan destinations or malformed expected hashes before network fetches, verifies downloaded SHA-256 values against the Hub plan, re-runs tools.release.paper_package on the downloaded model, dataset, and demo package, reruns geno-lewm-score from those downloaded bytes, then rejects replayed terminal_demo_manifest.json files with invalid source headers, non-passing status, model id mismatch, downloaded model/manifest.json hash/size mismatch, stale VCF/FASTA input identities, stale runtime_preflight summaries, stale score_receipt_batch summaries, or replay artifact hash/size drift before writing the clean-machine report. The final publication binder also checks the replay manifest's VCF/FASTA input identities against the downloaded demo artifacts and checks the replay manifest's artifact table against the clean-machine replay report for the transcript, scores, receipts, runtime preflight, and batch report. Before scoring, the replay helper checks the downloaded demo VCF/FASTA hashes and sizes against the downloaded demo manifest; after scoring, it rejects replay manifests whose VCF/FASTA identities do not match those downloaded inputs.
The replay tool writes clean_machine_demo_report.json with the release-candidate report filename plus hash/size identity, output-directory-relative downloaded artifact identities, package-verification result, replay transcript and manifest identities, and replay score, receipt, runtime-preflight, and batch-report artifact hashes without serializing private absolute workstation paths. Optional HF_TOKEN, HUGGINGFACE_HUB_TOKEN, GH_TOKEN, or GITHUB_TOKEN environment values are used only for authenticated fetches and are never serialized into the report.
- Final publication evidence from
python -m tools.release.publication_report --plan ... --release-candidate ... --publish-report ... --clean-machine-demo-report ... --output ... writes publication_evidence_report.json, which binds the Hub release plan, release-candidate report, credentialed publish report, and clean-machine replay report by public-safe filename plus hash/size identity, including the clean-machine replay's recorded release-candidate report filename/path, hash, and size identity, the verified paper file source name, URL, hash, and size identity, the full paper-critical release_candidate_artifacts table for model, dataset, eval, demo, and paper identities, public-safe release-candidate readiness rows plus public link and public artifact check summaries, every uploaded release-candidate artifact identity in that table checked against the Hub plan plus the downloaded public artifact, and the replayed terminal-demo manifest's model id, downloaded manifest.json identity, VCF/FASTA input identities, runtime_preflight summary, and replayed runtime-preflight model/input identities without private absolute paths. It also rejects a release candidate whose embedded Hub plan differs from hub_release_plan.json, requires the generated readiness checklist with all expected rows marked ok=true, empty candidate blockers, and current issue_refs, requires generated public_links and public_artifacts sections with required checks present and passing for the model, dataset, demo, and paper/public artifact targets, and fails the release gate if the published candidate, final readiness check, exact Hub-plan download set, public source URLs, hashes, or replay artifacts disagree. Its issues entries carry issue_refs so final publication failures route back to #163, #164, #165, #166, #167, and #101. The protected publish workflow uploads the resulting evidence JSON files and asset manifest as GitHub release assets, so paper/demo release notes can link durable public evidence rather than a retention-scoped workflow artifact.
- README and docs distinguish measured results from targets.
- Privacy statement and safety boundaries are consistent with the demo.
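
Two of the acceptance rules in the checklist above can be illustrated with a short sketch: the generated_by marker required on score rows and metrics payloads, and the ban on absolute workstation paths in release JSON. This is not the project's verifier. The function names and the score, artifacts, and path field names are hypothetical, and the absolute-path test is a simple heuristic over string values; only the marker values are taken from the text above.

```python
import json
from pathlib import PurePosixPath, PureWindowsPath

# Generator markers named in the checklist above.
SCORE_GENERATOR = "geno-lewm-score"
EVAL_GENERATORS = {"geno-lewm-eval", "geno-lewm-eval-all"}


def check_score_rows(jsonl_text: str) -> int:
    """Return the number of rows, or raise if a row lacks the score marker."""
    count = 0
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between rows
        row = json.loads(line)
        if row.get("generated_by") != SCORE_GENERATOR:
            raise ValueError(f"row {line_no}: generated_by is not {SCORE_GENERATOR}")
        count += 1
    return count


def is_absolute_path(value: str) -> bool:
    """Heuristic: true for POSIX-style or Windows-style absolute paths."""
    return PurePosixPath(value).is_absolute() or PureWindowsPath(value).is_absolute()


def find_absolute_paths(payload, prefix="$"):
    """Yield the JSON location of every string value that looks absolute."""
    if isinstance(payload, dict):
        for key, value in payload.items():
            yield from find_absolute_paths(value, f"{prefix}.{key}")
    elif isinstance(payload, list):
        for index, value in enumerate(payload):
            yield from find_absolute_paths(value, f"{prefix}[{index}]")
    elif isinstance(payload, str) and is_absolute_path(payload):
        yield prefix


def check_metrics_payload(payload: dict) -> None:
    """Raise if the payload was not written by an eval CLI or leaks a path."""
    if payload.get("generated_by") not in EVAL_GENERATORS:
        raise ValueError("metrics payload was not written by an eval CLI")
    leaks = list(find_absolute_paths(payload))
    if leaks:
        raise ValueError(f"absolute paths at: {', '.join(leaks)}")
```

A marker check like this only proves which tool claims to have written a file; the checklist pairs it with hash, size, and re-rendering checks for that reason.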
Current gaps are tracked in ROADMAP.md, docs/roadmap/IMPLEMENTATION.md, and GitHub issues.
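
For a feel of the hash/size identity checks that the clean-machine replay applies to downloaded artifacts, here is a minimal standard-library sketch. The layout of the expected mapping and both function names are assumptions for illustration; the real hub_release_plan.json schema is not reproduced here.

```python
import hashlib
from pathlib import Path


def file_identity(path: Path, chunk_size: int = 1 << 20) -> dict:
    """Return the SHA-256 hex digest and byte size of a file, read in chunks."""
    digest = hashlib.sha256()
    size = 0
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return {"sha256": digest.hexdigest(), "size": size}


def verify_downloads(root: Path, expected: dict) -> list:
    """Compare files under root with expected identities; return the problems.

    `expected` maps a package-relative name to {"sha256": ..., "size": ...}
    (a hypothetical layout, not the project's plan format).
    """
    problems = []
    for relative_name, identity in sorted(expected.items()):
        relative = Path(relative_name)
        # Refuse destinations that could escape the download directory.
        if relative.is_absolute() or ".." in relative.parts:
            problems.append(f"{relative_name}: unsafe destination")
            continue
        candidate = root / relative
        if not candidate.is_file():
            problems.append(f"{relative_name}: missing")
            continue
        if file_identity(candidate) != identity:
            problems.append(f"{relative_name}: hash/size mismatch")
    return problems
```

An empty list means every listed file is present with the expected digest and size. The sketch does not detect extra files, which the exact file-set checks described above also reject.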
GenoLeWM/
├── geno_lewm/
│ ├── action/ # edit specs, relative edits, edit application, samplers
│ ├── provenance/ # preferred manifest, hashing, commitment, receipt API
│ ├── cli/ # console entry points
│ ├── deploy/ # runtime/update/export scaffolds
│ ├── encoder/ # Carbon windowing/cache scaffolds
│ ├── evaluation.py # measured metrics and eval report payloads
│ ├── carbon_zero_shot.py # Carbon baseline score artifacts
│ ├── planning/ # latent planning contracts
│ ├── predictor/ # predictor, rollout, and loss contracts
│ ├── surprise/ # surprise scoring/calibration contracts
│ ├── training/ # fixture/Carbon training and preflight helpers
│ ├── errors.py # typed exception hierarchy
│ ├── observability.py # structured logs and event registry
│ └── metrics.py # metrics registry/export
├── bench/ # local benchmark and release-efficiency harnesses
├── configs/ # checked first-experiment training/eval configs
├── tests/ # unit, property, lint, API snapshot, benchmark tests
├── tools/ # API snapshot, lint gates, release tooling
├── docs/ # MkDocs source
├── rfcs/ # design records
├── examples/ # executable notebooks and fixture data
├── desktop/ # reference desktop scaffold
└── pyproject.toml
make install
make hooks
make ci

Important gates:
| Gate | Command |
|---|---|
| Lockfile | uv lock --check |
| Format | ruff format --check . |
| Lint | ruff check . |
| Types | mypy geno_lewm tools |
| Tests | pytest |
| ML smoke | pytest tests/ml -q --tb=long --durations=10 |
| Eval smoke | python -m tools.ci.eval_smoke_gate --work-dir .eval-smoke --summary-json .eval-smoke/eval_smoke_summary.json |
| Public API | python tools/api/snapshot.py check |
| Scope language | python -m tools.lint.check_scope_language |
| Dataset spec | python -m tools.release.dataset_snapshot --spec-json configs/first_experiment/dataset-snapshot-snv.json --check-spec |
| Release docs contract | pytest tests/lint/test_docs_release_blocker_contract.py -q |
| Docs | mkdocs build --strict |
| Package build | python -m build && twine check --strict dist/* && python -m tools.release.check_sdist_assets dist/*.tar.gz |
The public API snapshot is intentional. If you change a public symbol, update the snapshot in the same PR and explain the compatibility impact.
Start with CONTRIBUTING.md. The most useful contributions now are implementation work that moves the project from the v0.1 proof release toward v0.2 benchmark and rollout readiness:
- broader held-out data and benchmark builders with pinned revisions, tuple-builder wiring, holdout enforcement, and small deterministic smoke fixtures;
- trainer/evaluator paths that produce better publishable artifacts without weakening the v0.1 evidence contract;
- AR rollout speed work and benchmark gates for the RFC-0004 target;
- planning API/CLI work backed by measured predictor and eval evidence;
- documentation that keeps claims aligned with measured behavior.
Personal-genome reproducers are not accepted. Use synthetic data or public benchmark files.
GenoLeWM is a research tool. It is not a diagnostic device, clinical decision-support system, or medical product. Do not use it for embryo selection, reproductive decision-making, or clinical care.
The runtime is designed to be local-first. Variant data should remain on the user's machine unless the user explicitly exports it. See PRIVACY.md and SECURITY.md.
@software{genolewm2026,
title = {{GenoLeWM}: Action-conditioned {JEPA} world models for genomic edits},
author = {{GenoLeWM Authors}},
year = {2026},
url = {https://github.com/AbdelStark/GenoLeWM},
note = {Apache-2.0},
}

GenoLeWM builds on the LeWorldModel/LeJEPA idea of action-conditioned latent prediction and on Carbon as the frozen DNA foundation model. The project is independent; any errors in implementation or interpretation are ours.