rfc: execution lifecycle consolidation#3

Open
lewisjared wants to merge 5 commits into main from feat/execution-lifecycle

@lewisjared
Contributor

Summary

Consolidate the lifecycle of one diagnostic execution
(allocation, dispatch, run, classify, publish, ingest, finalise)
into a single deep module (ExecutionLifecycle) sitting behind one
Transport port.
Surface a declarative ResourceHint on every Diagnostic so providers can
express memory / CPU / wall-clock expectations once,
and capture per-execution Telemetry so future adaptive scheduling lands
as a feature addition rather than a schema migration.
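
A minimal sketch of how these pieces could fit together, for orientation only. The field names, the `submit` and `dispatch` methods, and `RecordingTransport` are illustrative assumptions; the actual shapes are defined in the RFC text, not here.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol, Tuple


@dataclass(frozen=True)
class ResourceHint:
    """Hypothetical shape: what a diagnostic expects to need."""

    memory_gb: Optional[float] = None
    cpus: Optional[int] = None
    wall_clock_s: Optional[float] = None


@dataclass(frozen=True)
class Telemetry:
    """Hypothetical shape: what one execution actually used."""

    peak_memory_gb: Optional[float]
    wall_clock_s: float


class Transport(Protocol):
    """The single port the lifecycle talks to (method name is assumed)."""

    def submit(self, execution_id: str, hint: ResourceHint) -> None: ...


class RecordingTransport:
    """Toy adapter that only records what it was asked to run."""

    def __init__(self) -> None:
        self.submitted: List[Tuple[str, ResourceHint]] = []

    def submit(self, execution_id: str, hint: ResourceHint) -> None:
        self.submitted.append((execution_id, hint))


class ExecutionLifecycle:
    """Owns the lifecycle steps; only dispatch is sketched here."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def dispatch(self, execution_id: str, hint: ResourceHint) -> None:
        # The hint travels with the execution, so any adapter can act on it.
        self._transport.submit(execution_id, hint)


transport = RecordingTransport()
lifecycle = ExecutionLifecycle(transport)
lifecycle.dispatch("exec-1", ResourceHint(memory_gb=16.0, wall_clock_s=8 * 3600))
```

Because `Transport` is a structural `Protocol`, an adapter such as `RecordingTransport` needs no inheritance; it only has to provide a matching `submit` method.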

Motivation in one paragraph

Today the lifecycle of one execution is fragmented across ~8 files in 2
packages. _is_system_error lives in climate-ref-core/executor.py;
CondaCommandError handling lives beside it; missing-log handling lives
in result_handling.py; per-task timeout enforcement only exists in
LocalExecutor; CeleryExecutor enforces no per-task timeout at all;
ExecutionGroup.dirty is toggled in three places.
Providers have nowhere to declare resource expectations — ESMValTool
diagnostics that need 16 GB of memory or 8 hours of wall-clock cannot say
so, which blocks any future SLURM/PBS/K8s transport from doing its job.
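
To illustrate what a single home for classification could look like, here is a hedged sketch. `Outcome`, `EnvironmentSetupError`, and `classify` are hypothetical names, and the rules (which exceptions count as system errors, and that a missing log counts as one) are assumptions for illustration, not the behaviour of the existing `_is_system_error` or `CondaCommandError` handling.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    SUCCESS = "success"
    DIAGNOSTIC_FAILURE = "diagnostic_failure"
    SYSTEM_ERROR = "system_error"


class EnvironmentSetupError(Exception):
    """Hypothetical stand-in for an environment/tooling failure."""


# Assumed rule: these exception types indicate infrastructure trouble,
# not a fault in the diagnostic itself. TimeoutError is an OSError subclass.
_SYSTEM_ERROR_TYPES = (EnvironmentSetupError, OSError, MemoryError)


def classify(error: Optional[BaseException], log_exists: bool) -> Outcome:
    """Map the raw result of one execution onto a single outcome."""
    if isinstance(error, _SYSTEM_ERROR_TYPES):
        return Outcome.SYSTEM_ERROR
    if error is not None:
        return Outcome.DIAGNOSTIC_FAILURE
    if not log_exists:
        # Assumed rule: a run that left no log cannot be trusted.
        return Outcome.SYSTEM_ERROR
    return Outcome.SUCCESS
```

With every rule in one function, each transport reports the same outcome for the same failure, and the rules can be unit-tested without running an executor.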

Reading order

Direct link to the rendered RFC:
text/0000-execution-lifecycle.md

Key sections:

Scope guard

This RFC is not about replacing SLURM, PBS, K8s, or Celery as schedulers.
It is about defining the seam between climate-ref and a scheduler so the
scheduler has enough information to do its job, and so the lifecycle around
the scheduler is one well-tested module instead of eight thin ones.
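
As an illustration of that seam, a scheduler adapter could translate a resource hint into the scheduler's own vocabulary. The sketch below maps a hypothetical `ResourceHint` (field names assumed) onto the standard `sbatch` options `--mem`, `--cpus-per-task`, and `--time`; `to_sbatch_args` is an invented helper, not part of the RFC.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ResourceHint:
    """Hypothetical shape; the real fields are defined by the RFC."""

    memory_gb: Optional[float] = None
    cpus: Optional[int] = None
    wall_clock_s: Optional[float] = None


def to_sbatch_args(hint: ResourceHint) -> List[str]:
    """Translate a hint into sbatch flags, omitting anything undeclared."""
    args: List[str] = []
    if hint.memory_gb is not None:
        # sbatch accepts a size with a unit suffix; use whole megabytes.
        args.append(f"--mem={math.ceil(hint.memory_gb * 1024)}M")
    if hint.cpus is not None:
        args.append(f"--cpus-per-task={hint.cpus}")
    if hint.wall_clock_s is not None:
        # A bare integer passed to --time is interpreted as minutes.
        args.append(f"--time={math.ceil(hint.wall_clock_s / 60)}")
    return args


# The ESMValTool-style case from the motivation: 16 GB and 8 hours.
flags = to_sbatch_args(ResourceHint(memory_gb=16, wall_clock_s=8 * 3600))
# flags == ["--mem=16384M", "--time=480"]
```

The scheduler still makes every placement decision; the adapter only forwards what the provider declared.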

Process

Following the instructions in this repo's README:
once a PR number is assigned, the RFC file and the RFC PR link inside the
file will be renamed/updated in a follow-up commit on this branch.

Consolidate diagnostic execution lifecycle (allocation, dispatch, run,
classify, publish, ingest, finalise) into one deep module behind a single
Transport port. Surface ResourceHint on Diagnostic so providers can declare
memory/CPU/wall-clock once. Capture per-execution Telemetry to enable future
adaptive scheduling without a schema change.

Mermaid classDiagram is idiomatic for the Protocol + adapters pattern.
LR layout fits PR width; method signatures stay legible without HTML hacks.

Compress prose, drop per-design subsections in Rationale (table tells the
story), tighten Drawbacks/Prior art/Unresolved/Future to bullets. All three
diagrams kept; technical substance preserved.