Snakemake-based execution layer + content-addressed manifests + Dask substrate #82
✅ Eval Results
Graders: ✅ spec_valid (1.00)
Replace the vendored snakemake-executor-plugin-lightconeflux + srun-flux-start wrap with a Dask-based stack: a new vendored snakemake-executor-plugin-lightconedask that submits each rule's shell command via client.submit, plus lightcone.engine.dask_cluster.cluster_for_run that produces a scheduler address for the run's lifetime.

Rationale:
- pip-installable everywhere (no `module load flux`, no PMI bootstrap dependency, no system-binary requirement)
- One execution code path covers laptop, workstation, and SLURM allocation: LocalCluster locally, srun-launched workers across nodes inside an allocation, or an existing scheduler if DASK_SCHEDULER_ADDRESS is set
- Resource translation is first-class: snakemake's threads/mem_mb/gpus map cleanly to Dask's per-task resources; the scheduler bin-packs tasks into per-node workers up to advertised cpu/memory/gpu budgets

Trade: Dask doesn't offer Flux's hierarchical sub-allocation scheduling, which we weren't using anyway.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
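As a rough illustration of the "one execution code path" point, the mode selection could be sketched as below. This is a minimal sketch, not the code in `dask_cluster.py`: the function name `choose_cluster_mode`, the precedence order (an explicit scheduler address wins over a SLURM allocation), and the returned tuple shape are all assumptions. The environment variable names come from this PR.

```python
import os
import socket
from typing import Mapping, Optional, Tuple


def choose_cluster_mode(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Pick an execution mode from environment variables (hypothetical helper).

    Returns (mode, detail):
      ("existing", address)  when DASK_SCHEDULER_ADDRESS is set
      ("slurm", bind_host)   when SLURM_JOB_ID is set
      ("local", None)        otherwise
    """
    env = os.environ if env is None else env

    # Assumption: an explicitly provided scheduler takes precedence.
    address = env.get("DASK_SCHEDULER_ADDRESS")
    if address:
        return "existing", address

    if env.get("SLURM_JOB_ID"):
        # Bind to a hostname remote workers can reach, not 127.0.0.1.
        host = env.get("SLURMD_NODENAME") or socket.gethostname()
        return "slurm", host

    return "local", None
```

A caller would then start a `LocalCluster`, launch `srun` workers, or connect a client depending on the returned mode.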
…ecutor_plugin_dask

The plugin has nothing lightcone-specific in it: it's a generic dask.distributed executor for snakemake. Drop the project prefix so the discovery name (`--executor dask`) is clean and the package is reusable outside lightcone-cli if anyone wants. Confirmed no upstream `snakemake-executor-plugin-dask` exists on PyPI or GitHub at rename time.

Class renamed: LightconeDaskExecutor → DaskExecutor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two questions on the cluster execution section, since they seem to pull in different directions:
- Default --rerun-triggers now includes `params` so per-universe code_version drift in cfg actually triggers Snakemake reruns; embed code_version literally into shell_command as belt-and-braces.
- Bind the SLURM-mode Dask scheduler to SLURMD_NODENAME (or gethostname() fallback) instead of 127.0.0.1, so workers on remote nodes can connect.
- Narrow read_manifest to swallow only JSONDecodeError; propagate OSError so permission failures don't masquerade as missing manifests.
- Stop closing the Dask client in cancel_jobs: Snakemake calls it for partial cancellations and subsequent submissions need the client.

Adds five regression tests covering each fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
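To make the `read_manifest` fix concrete, here is a minimal sketch of the narrowed error handling. It is not the function from `manifest.py`: the signature, the `None` return for an unparseable manifest, and the explicit handling of a genuinely absent file are assumptions. The manifest filename comes from the PR description.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

MANIFEST_FILENAME = ".lightcone-manifest.json"


def read_manifest(output_dir: Path) -> Optional[Dict[str, Any]]:
    """Return the parsed manifest, or None if it is absent or not valid JSON.

    PermissionError and other OSError subclasses are deliberately not caught,
    so an unreadable manifest is reported as an error rather than as missing.
    """
    path = Path(output_dir) / MANIFEST_FILENAME
    try:
        text = path.read_text(encoding="utf-8")
    except FileNotFoundError:
        # Assumption: a genuinely absent file is the one "missing" case.
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # A corrupt or half-written manifest is treated like a missing one.
        return None
```

The key property is that only two narrowly defined conditions map to "no manifest"; everything else surfaces to the caller.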
- LocalCluster now advertises memory and gpus alongside cpus, so rules with mem_mb / gpus_per_task are schedulable on a workstation.
- code_version hashes the resolved (content-addressed) image tag, so Containerfile or lockfile edits propagate to lc status. The same resolution is now used by snakefile.generate and status.get_output_status via a memoizing make_image_tag_resolver helper.
- verify resolves qualified-id inputs (e.g. sub.foo) through the tree via a shared find_upstream_output helper, so cross-analysis chain drift surfaces as broken_chain instead of being silently skipped.
- lc run --force scopes to explicit targets; --forceall is used only when no targets were named.
- validate_output is wired into the rule body; output_type flows through cfg so empty / all-NaN / wrong-extension outputs surface as warnings.
- Drop unused snakemake-executor-plugin-slurm dependency.
- _target_for accepts qualified analysis.output ids and errors clearly on ambiguity; uses MANIFEST_FILENAME and resolve_output_path.
- Tighten the recommended permission tier (rm -rf * / rm -fr *).
- Collapse _input_path_for onto find_upstream_output; factor _resource_dict shared by _local_cluster and _resources_arg; hoist the rule-body validation block to a constant.

216 tests passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
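The version hashes referred to here follow the formulas in the PR description (`data_version = sha256(output_dir)`, `code_version = sha256(recipe ‖ container_image ‖ decisions)`). A minimal sketch of how they could be computed is below. The length-prefixed encoding, the JSON serialisation of decisions, and the exclusion of the manifest file from the directory hash are assumptions for illustration, not the actual `manifest.py` implementation.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict, Tuple

MANIFEST_FILENAME = ".lightcone-manifest.json"


def _feed(digest: Any, data: bytes) -> None:
    """Add one length-prefixed field so adjacent fields cannot run together."""
    digest.update(len(data).to_bytes(8, "big"))
    digest.update(data)


def sha256_dir(root: Path, exclude: Tuple[str, ...] = (MANIFEST_FILENAME,)) -> str:
    """Hash relative paths and contents of all files under root, in sorted order."""
    root = Path(root)
    files = {p.relative_to(root).as_posix(): p for p in root.rglob("*") if p.is_file()}
    digest = hashlib.sha256()
    for rel in sorted(files):
        if rel in exclude:
            continue  # assumption: the manifest does not hash itself
        _feed(digest, rel.encode("utf-8"))
        _feed(digest, files[rel].read_bytes())
    return digest.hexdigest()


def code_version(recipe: str, image_tag: str, decisions: Dict[str, Any]) -> str:
    """Hash recipe, resolved image tag, and decisions into one version string."""
    digest = hashlib.sha256()
    _feed(digest, recipe.encode("utf-8"))
    _feed(digest, image_tag.encode("utf-8"))
    # Sorted keys keep the hash independent of dict insertion order.
    _feed(digest, json.dumps(decisions, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()
```

Because the image tag is itself content-addressed, any Containerfile or lockfile edit changes the tag and therefore changes `code_version`, which is the propagation this commit describes.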
The Snakemake redesign removed the target-selection layer (engine/targets.py deleted in c2e04a6), but `default_target: local` was still being written into new ~/.lightcone/config.yaml files by `lc setup`, by the eval sandbox bootstrap, and by the test fixture, and the Claude guides still documented a long-gone `lc target` command, `lc dev`, `--qos`/`--constraint`/etc. SLURM-target overrides, and the "Choosing run options" section that depended on the removed target system. None of it is read by current code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cailmdailey, from Claude:

What I found

slurm-jobstep exists and does roughly what the reviewer described, but with significant qualifications:

Honest read: the reviewer's principle is right (don't reimplement what Snakemake does), but the specific recommendation oversells slurm-jobstep's maturity for standalone pilot-job use. The strongest version of the critique would be: try slurm-jobstep first, keep the Dask layer as a fallback only if slurm-jobstep proves unreliable in your testing. That's a defensible middle ground: the redesign would prove out the in-tree path before committing to bespoke code.
NERSC's home and CFS filesystems are mounted via Cray DVS, which doesn't implement llistxattr. Buildah's copier (used by podman / podman-hpc) calls it on every COPY source and crashes with EPROTO when the project lives on DVS.

Stage the Containerfile, dependency files, and parsed COPY/ADD sources into a tempdir before invoking the runtime. compute_image_tag and the staging logic share one iterator so the hashed and staged file sets cannot drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
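The "one shared iterator" idea can be sketched as follows. This is a simplified illustration, not the project's container module: `copy_sources`, `iter_build_inputs`, and `stage_build_context` are hypothetical names, the COPY/ADD parser ignores JSON-form instructions, globs, and line continuations, and using the full hex digest in the `lc-<name>-<sha256>` tag is an assumption.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Iterator, List


def copy_sources(containerfile: Path) -> List[str]:
    """Return source operands of COPY/ADD lines (deliberately simplified parser)."""
    sources: List[str] = []
    for line in containerfile.read_text(encoding="utf-8").splitlines():
        parts = line.split()
        if len(parts) < 3 or parts[0].upper() not in ("COPY", "ADD"):
            continue
        if any(p.startswith("--from") for p in parts[1:]):
            continue  # copied from another build stage, not from the project
        operands = [p for p in parts[1:] if not p.startswith("--")]
        sources.extend(operands[:-1])  # the last operand is the destination
    return sources


def iter_build_inputs(project: Path, name: str = "Containerfile") -> Iterator[Path]:
    """Yield every file feeding the build, in a deterministic order."""
    containerfile = project / name
    yield containerfile
    for src in copy_sources(containerfile):
        target = project / src
        if target.is_file():
            yield target
        elif target.is_dir():
            files = [p for p in target.rglob("*") if p.is_file()]
            yield from sorted(files, key=lambda p: p.relative_to(project).as_posix())


def compute_image_tag(name: str, project: Path) -> str:
    """Content-addressed tag derived from exactly the files that get staged."""
    digest = hashlib.sha256()
    for path in iter_build_inputs(project):
        for chunk in (path.relative_to(project).as_posix().encode(), path.read_bytes()):
            digest.update(len(chunk).to_bytes(8, "big"))
            digest.update(chunk)
    return f"lc-{name}-{digest.hexdigest()}"


def stage_build_context(project: Path) -> Path:
    """Copy the build inputs into a fresh temporary directory and return it."""
    staging = Path(tempfile.mkdtemp(prefix="lc-build-"))
    for path in iter_build_inputs(project):
        dest = staging / path.relative_to(project)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(path, dest)  # copies file contents only, no metadata
    return staging
```

Since both `compute_image_tag` and `stage_build_context` consume `iter_build_inputs`, a file cannot influence the build without also influencing the tag, and vice versa.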
The redesign lost the Containerfile that `lc init` used to write, and the boilerplate `astra.yaml` had drifted into a parallel spec template maintained inside lightcone-cli.

`lc init` now calls `astra.cli.init` for the spec scaffold (astra.yaml, universes/baseline.yaml, base .gitignore, src/) and only owns the lightcone-specific layer on top:
- patches `container: python:3.12-slim` → `container: Containerfile`
- writes Containerfile + requirements.txt
- appends lightcone-specific `.gitignore` lines
- writes `.lightcone/lightcone.yaml`, `.claude/`, CLAUDE.md, results/

Bumps astra-tools to >=0.2.5 (the release that ships the `container:` field in the boilerplate).

Bundles assorted local edits to engine modules, tests, and skill guides.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The first lc invocation now writes ~/.lightcone/config.yaml with defaults if missing, replacing the "Run lc setup first" gate. The setup command was non-interactive and only created the same file, so it's removed.

Also brings README, lightcone-cli-reference, and CLAUDE.md in line with the current Snakemake/Dask CLI surface (drops stale lc dev, lc target, --qos/--time-limit/--strategy, and --sub-analysis docs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary

- Realign `lc-new`, `lc-migrate`, `lightcone-cli-reference.md`, `astra-reference.md`, and the project `CLAUDE.md` template with the current ASTRA spec, scaffold, and CLI surface
- Drop `lc-build`, `lc-verify` skills and `ui-brand.md` guide; consolidate skill set to `lc-new`, `lc-migrate`, `lc-feedback`
- Wire the four project hooks into `.claude/settings.json` on `lc init` (previously bundled but never invoked)
- Rewrite `check-lc-run.sh` as pure bash + jq (was importing `lightcone.engine.status` from a Python heredoc, which silently no-op'd in the empty venv `lc init` creates). Add `recipe_command` to `OutputStatus` and surface it in `lc status --json` so the rewritten hook can match agent invocations against recipe scripts via jq.
- Fix `session-start.sh` validation preview (`head -5` always landed before the real error block), drop a dead `decision_count` line, and remove the lc-build loop block

This PR replaces #97, which had merged #82 history conflicts. This is rebased cleanly on `main` with the same scope.

## Test plan

- [x] `uv run pytest`: 287 passed
- [x] `uv run ruff check src/ tests/`: clean
- [ ] Manual: `lc init` in a fresh dir, confirm hooks land in `.claude/settings.json` and trigger as expected
- [ ] Manual: edit a malformed `astra.yaml`, confirm `validate-on-save` surfaces real errors
- [ ] Manual: run a recipe script directly, confirm `check-lc-run` warns

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Francois Lanusse <EiffL@users.noreply.github.com>
TL;DR
Snakemake owns DAG resolution, parallelism, retries, and dry-run. Dask owns task dispatch, per-task resource matching, and bin-packing into worker capacity. We own one Python function (`write_manifest`), a Snakefile generator, the container build/hash/wrap module, an in-process `cluster_for_run()` bootstrap, and a vendored Snakemake executor plugin that hands rules to Dask. Net diff vs. main: −9.7k / +4.4k LOC while delivering more of `design_review.md` than the previous engine did.

Design principles guiding this iteration
1. Snakemake does what Snakemake does
DAG construction, topological execution, parallelism, retries, dry-run, profiles, locking — all delegated to upstream. We do not write code for any of it.
2. We own only the integrity layer
The single property `design_review.md` calls non-negotiable (agents cannot fake outputs) becomes a consequence of how the system is built, not a policy enforced by a process boundary:

- Every rule declares `.lightcone-manifest.json` as an output. Snakemake re-runs any rule whose manifest is missing.
- `data_version = sha256(output_dir)`, `code_version = sha256(recipe ‖ container_image ‖ decisions)`, `input_versions` chain to upstream manifests.
- `lc verify` recomputes hashes and validates the chain, with no orchestrator dependency.
- `lc status` walks manifests offline. Snakemake's `.snakemake/metadata/` is treated as a cache and never read.

3. User-facing surface does not change
`lc run`, `lc status`, `lc verify` keep their semantics. The Snakefile is an implementation detail: generated on every `lc run` from `astra.yaml`, never edited by hand.

4. Clean-slate replacement, no dual-engine
No backwards-compatibility flag, no migration command, no shim. The Dagster / Postgres / runner / slurm-info / targets modules and their tests are deleted, not deprecated.
5. Containers are ours, end-to-end
We build images deterministically from `Containerfile`s with content-addressed tags (`lc-<name>-<sha256>`) and pre-wrap each recipe with the explicit runtime call at Snakefile-generation time. Snakemake's `container:` directive is intentionally unused, so the manifest's `container_image` field is the strong evidence of what actually ran.

6. One execution substrate everywhere: Dask
A user inside an `salloc`/`sbatch` who runs `lc run` gets routed through Dask automatically; the same plugin and bootstrap path covers the workstation case and a pre-existing scheduler. `lightcone.engine.dask_cluster.cluster_for_run` returns a scheduler address for the run's lifetime:

- `DASK_SCHEDULER_ADDRESS` set: the existing scheduler at that address is used.
- `SLURM_JOB_ID` set: `LocalCluster(n_workers=0)` bound to the driver's hostname; `srun --ntasks=$NNODES --ntasks-per-node=1 dask worker $ADDR` launches one persistent worker per node, each advertising the node's full `cpus`/`memory`/`gpus` as Dask resources.
- Otherwise: `LocalCluster()` sized to the local machine.

The scheduler is always in-process: its lifetime equals the run's lifetime. No service to manage, no orphaned schedulers if the driver crashes, no `lc cluster` lifecycle commands.

Snakemake dispatches each rule via our own executor plugin at `src/snakemake_executor_plugin_dask/`: `client.submit(_run_shell, cmd, resources={cpus, memory, gpus}, pure=False)`. Per-rule `cpus_per_task`/`mem_mb`/`gpus_per_task` translate 1:1 to per-task Dask resources, and the scheduler bin-packs tasks into workers up to each worker's advertised budget.

Why Dask, not `slurm-jobstep` or `slurm`:

- `--executor slurm` is the wrong shape: head-node sbatch submission, with the Snakemake docs warning that running it inside an active SLURM job leads to unpredictable behavior. Our users want pilot-job semantics: one big `salloc`, many tasks dispatched within it.
- `--executor slurm-jobstep` is the right shape, but its repo description states it is "meant for internal use by snakemake-executor-plugin-slurm"; using it standalone is an off-label code path the maintainers do not commit to keeping working. It also inherits known footguns (SLURM 22.05+'s `--cpus-per-task` non-inheritance, the long-standing one-core dispatch issue) that would land in our user docs.
- Dask is `pip install` everywhere, gives us persistent workers within a run (sub-second task dispatch), exposes a live dashboard, and the Dask scheduler's failure modes are well understood. The trade we accept is that resource matching is advisory rather than cgroup-enforced; for minutes-to-hours recipes whose `mem_mb` is declared in `astra.yaml`, this is acceptable, and SLURM's per-allocation memory cgroup is the backstop.
- Flux: `module load flux` (or building flux-core from source) on every host is install friction we don't want outside Perlmutter.

If standalone `slurm-jobstep` becomes a maintained, user-facing executor, this is worth revisiting. Until then, owning ~350 LOC of well-scoped Dask plumbing is the better trade.

7. No service to manage
The old `lc cluster start/attach/stop` lifecycle is gone with no replacement. The Dask scheduler is in-process; SLURM workers are bounded by the user's `salloc`. No Postgres, no scheduler daemon, no daemon-per-project.

What's in the diff
Added
- `redesign.md`, `design_review.md`: spec + requirements
- `src/lightcone/engine/snakefile.py`: generator (Jinja → `.lightcone/Snakefile` + sidecar JSON)
- `src/lightcone/engine/manifest.py`: `write_manifest()`, `sha256_dir()`, `code_version` helpers
- `src/lightcone/engine/verify.py`: chain integrity check
- `src/lightcone/engine/dask_cluster.py`: `cluster_for_run()` context manager
- `src/snakemake_executor_plugin_dask/`: vendored Snakemake executor plugin

Removed

- `engine/assets.py`, `engine/runner.py`, `engine/slurm_info.py`, `engine/targets.py`, `engine/io_manager.py`

Kept (re-purposed)
Test plan
Open questions / not in this PR
🤖 Generated with Claude Code