Phase 1: provider abstraction + scalable.yaml manifest foundation#20

Merged
crvernon merged 12 commits into version/2.0.0 from version/2.0.0-phase1-provider-manifest
May 19, 2026
crvernon commented May 19, 2026

Phase 1: Provider Abstraction + scalable.yaml Manifest Foundation

Summary

This PR delivers Phase 1 of the v2.0.0 roadmap defined in plans/v2.0.0_phase1_plan.md: a provider-neutral execution seam, a declarative manifest layer, a deterministic dry-run planner, the public ScalableSession API, and the scalable validate / scalable plan --dry-run CLI commands. All legacy v1.1.0 imperative APIs (SlurmCluster, add_container, add_workers, ScalableClient) remain functional. Phase 1 is strictly additive plus one targeted deprecation warning.


Scope (delivered work units)

| WU | Description |
|---|---|
| WU-1 | Branch + package scaffolding + pyproject.toml bump to 2.0.0a1, console script registration, optional extras placeholders |
| WU-2 | Manifest schema dataclasses + YAML parser with ${VAR} / ${VAR:-default} expansion + version check |
| WU-3 | Manifest semantic validator with ValidationReport / ValidationIssue |
| WU-4 | DeploymentProvider protocol, DeploymentSpec, ScalePlan, ResourceRequest, ClusterHandle, registry with entry-point discovery |
| WU-5 | LocalProvider over Dask LocalCluster + unit + integration coverage |
| WU-6 | SlurmProvider translation layer over existing SlurmCluster + mocked unit tests |
| WU-7 | Manifest-to-legacy adapter + ModelConfig deprecation warning gated by adapter context |
| WU-8 | ScalableSession lifecycle API (from_yaml, validate, plan, start, close, context manager, reserved Phase 4 kwargs) |
| WU-9 | Deterministic dry-run planner + SHA-256 compute_manifest_lock() |
| WU-10 | scalable validate and scalable plan --dry-run CLI commands + Phase 1 reserved-namespace stubs |
| WU-11 | Top-level public re-exports updated in scalable/__init__.py |
| WU-12 | Unit + integration coverage rounded out (session, planning, CLI, exports, settings env) |
| WU-13 | Docs: manifest.rst, providers.rst, README v2 quickstart, CHANGELOG.md |
| WU-14 | CI: validate-example-manifests job, macOS matrix, version-branch triggers |
| WU-15 | This PR |
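
For reviewers unfamiliar with the ${VAR} / ${VAR:-default} grammar named in WU-2, the following is a minimal sketch of how such expansion can work. It is not the parser shipped in scalable/manifest/parser.py: the name expand_env, the regular expression, and the choice to raise KeyError for an unset variable without a default are all assumptions made for illustration.

```python
import os
import re

# Matches ${VAR} and ${VAR:-default}; the default part may be empty.
_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_env(text, env=None):
    """Expand ${VAR} and ${VAR:-default} references in a string (illustrative)."""
    env = os.environ if env is None else env

    def _replace(match):
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        if default is not None:
            # Shell-style ":-": fall back when the variable is unset OR empty.
            return value if value else default
        if value is None:
            # Assumption: an unset variable with no default is an error.
            raise KeyError(f"environment variable {name!r} is not set")
        return value

    return _VAR_PATTERN.sub(_replace, text)
```

Passing env explicitly (rather than always reading os.environ) keeps the function easy to test; for example, expand_env("${QUEUE:-short}", {}) yields "short" in this sketch.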

Architectural changes

New package layout

scalable/
  manifest/   # schema, parser, validate, adapter, errors
  providers/  # base protocol, registry, local, slurm
  session/    # ScalableSession
  planning/   # deterministic dry-run + manifest_lock
  cli/        # main, cmd_validate, cmd_plan

No existing modules are removed. Existing v1.1.0 imports continue to work unchanged.

Public API surface (additive)

scalable/__init__.py now also re-exports:

  • ScalableSession
  • DeploymentProvider
  • LocalProvider
  • SlurmProvider

Legacy exports (SlurmCluster, JobQueueCluster, ScalableClient, cacheable, SEED, settings, etc.) are preserved.

Schema and provider contracts (frozen for Phase 1)

Session API (Phase 1 minimal form)

ScalableSession.plan(...) is purely deterministic in Phase 1. The objective= and policy= keyword arguments named in the v2 north-star are reserved on the public surface and currently raise NotImplementedError, locking the API shape for the Phase 4 AI planner without committing behavior.
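
The reserved-keyword pattern described above can be sketched as follows. SessionSketch is a hypothetical stand-in, not ScalableSession itself; the plan dictionary shape and the decision to raise only when a non-None value is passed are assumptions.

```python
class SessionSketch:
    """Hypothetical stand-in for ScalableSession; illustrates reserved kwargs only."""

    def __init__(self, targets):
        # targets: mapping of target name -> options dict
        self._targets = dict(targets)

    def plan(self, target, *, objective=None, policy=None):
        # The keywords exist on the signature so callers can rely on the
        # API shape, but any attempt to use them is rejected for now.
        if objective is not None or policy is not None:
            raise NotImplementedError(
                "objective= and policy= are reserved for a later phase"
            )
        # Deterministic: the same input always yields the same plan.
        options = self._targets[target]
        return {"target": target, "options": dict(sorted(options.items()))}
```

Making the reserved arguments keyword-only means a later phase can give them behavior without changing positional call sites.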

CLI

  • scalable validate <manifest> — exits 0/non-zero, emits a structured JSON report.
  • scalable plan <manifest> --target <name> --dry-run --output plan.json — writes plan.json and manifest.lock and prints the plan to stdout.
  • Reserved verbs (run, diagnose, explain, init-component, compose, report) are registered as namespace stubs that exit 2 with a phase-pointer message, so the UX namespace cannot be hijacked by third-party packages.
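
One way to register reserved verbs as namespace stubs is sketched below with argparse. This is an illustration, not the contents of scalable/cli/main.py: build_parser, the message text, and the placeholder validate branch are assumptions; only the verb names and the exit code 2 come from the description above.

```python
import argparse
import sys

RESERVED_VERBS = ("run", "diagnose", "explain", "init-component", "compose", "report")


def build_parser():
    parser = argparse.ArgumentParser(prog="scalable")
    sub = parser.add_subparsers(dest="command", required=True)

    validate = sub.add_parser("validate", help="validate a manifest")
    validate.add_argument("manifest")

    # Registering the verbs now claims the names before they gain behavior.
    for verb in RESERVED_VERBS:
        sub.add_parser(verb, help="reserved for a later phase")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.command in RESERVED_VERBS:
        print(f"'{args.command}' is reserved for a later phase.", file=sys.stderr)
        return 2
    # Placeholder: real manifest validation is out of scope for this sketch.
    return 0
```

Returning the exit code from main (instead of calling sys.exit inside it) lets tests assert on the code directly; a console-script wrapper can then call sys.exit(main()).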

manifest_lock

SHA-256 over canonicalized JSON of the post-env-expansion manifest (sorted keys, UTF-8, compact separators). Designed to be stable so Phase 2 telemetry and Phase 4 AI assistants can durably reference manifests.
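
A minimal sketch of this kind of fingerprint, using only the standard library, is shown below. The function name sketch_manifest_lock and the ensure_ascii=False setting are assumptions; the real compute_manifest_lock() may canonicalize differently.

```python
import hashlib
import json


def sketch_manifest_lock(manifest):
    """Return a SHA-256 hex digest over a canonical JSON form of a manifest dict."""
    canonical = json.dumps(
        manifest,
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # compact form, no incidental whitespace
        ensure_ascii=False,      # assumption: hash the raw UTF-8 text
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because keys are sorted before hashing, two manifests that differ only in key order produce the same digest, while any change to a value produces a different one.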


Deprecations

  • ModelConfig.__init__ emits DeprecationWarning when invoked outside the manifest adapter context. This is the path slated for replacement by the manifest. Behavior is unchanged; the warning only surfaces when the legacy auto-discovery is used directly. Suppression is provided by model_config_adapter_context() for adapter-internal callers.

No other public API is deprecated in this phase.


Configuration surface

  • Bumped version = "2.0.0a1" in pyproject.toml.
  • Registered [project.scripts] scalable = "scalable.cli.main:main".
  • Optional extras placeholders declared (empty in Phase 1):
    • ai
    • cloud
    • kubernetes
  • Added pyyaml >= 6.0 to core dependencies.
  • New env-var-driven settings in scalable/common.py:
    • SCALABLE_MANIFEST → Settings.manifest_path (default ./scalable.yaml)
    • SCALABLE_TARGET → Settings.target
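
As an illustration of the env-var mapping above, a settings object might be populated like this. SettingsSketch and from_env are hypothetical; the None default for target is an assumption, since the description only states the default for manifest_path.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SettingsSketch:
    """Hypothetical stand-in for the Settings object in scalable/common.py."""

    manifest_path: str = "./scalable.yaml"
    target: Optional[str] = None  # assumption: no target selected by default

    @classmethod
    def from_env(cls, env=None):
        env = os.environ if env is None else env
        return cls(
            manifest_path=env.get("SCALABLE_MANIFEST", "./scalable.yaml"),
            target=env.get("SCALABLE_TARGET"),
        )
```

With this shape, SettingsSketch.from_env({"SCALABLE_TARGET": "slurm"}) keeps the default manifest path and selects the slurm target.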

Documentation


CI updates

.github/workflows/tests.yml:

  • push and pull_request triggers now include version/**.
  • Test matrix expanded to include macOS for Python 3.11 (LocalProvider path).
  • New validate-example-manifests job runs scalable validate and scalable plan --dry-run against the docs examples to lock the documented manifest grammar against drift.
  • lint (ruff + mypy) job retained; no rule changes.

Test coverage

Full unit suite is green locally (156 passed):

pytest -q
........................................................................ [ 46%]
........................................................................ [ 92%]
............                                                             [100%]
156 passed in 1.04s

ruff check scalable tests is also clean.

New test modules added in this phase:

CLI smoke checks performed manually against the new docs examples.


Backward compatibility

  • All v1.1.0 imports keep working; nothing is removed.
  • Existing imperative flow (SlurmCluster(...) → add_container(...) → add_workers(...) → ScalableClient(cluster)) is unchanged and remains tested.
  • ModelConfig Dockerfile auto-discovery now warns but still functions.
  • version = "2.0.0a1" is alpha-tagged so downstreams pinning <2.0.0 are unaffected.

Phase 1 success criteria checklist (from plans/v2.0.0_phase1_plan.md)

  • from scalable import ScalableSession works; ScalableSession.from_yaml(..., target="local") produces a working LocalCluster + ScalableClient capable of tagged submit(...).
  • ScalableSession.from_yaml(..., target="slurm") configures SlurmCluster from manifest with no functional regression vs. v1.1.0 imperative path.
  • Existing v1.1.0 user code continues to work unchanged; only ModelConfig Dockerfile path emits DeprecationWarning.
  • scalable validate exits 0 on valid manifest, non-zero on invalid, with structured error report.
  • scalable plan --dry-run --target <name> writes plan.json + manifest.lock without instantiating a scheduler.
  • All new modules have unit tests; LocalProvider has integration coverage; existing tests remain green.
  • CI configured for macOS + Linux; ruff clean for new modules.
  • CHANGELOG.md entry, README + docs updates, migration note for ModelConfig users.

Groundwork explicitly enabling later phases

| Phase 1 artifact | Future phase consumer |
|---|---|
| DeploymentProvider protocol with provider-neutral DeploymentSpec / ScalePlan | Phase 3 (Kubernetes/cloud), Phase 4 (AI planner) |
| Provider registry with entry_points discovery | Phase 3 third-party providers |
| manifest_lock SHA-256 fingerprint | Phase 2 telemetry, Phase 4 plan explanations |
| targets[*].options: dict[str, Any] passthrough + unknown-key warnings | Phase 3 cloud overlays, Phase 4 migration assistant |
| ScalableSession.plan(objective=, policy=) reserved kwargs raising NotImplementedError | Phase 4 AI planner, Phase 5 ML resource advisor |
| CLI subcommand stubs (run, diagnose, explain, init-component, compose, report) | Phases 2–5 |
| Optional-dependency extras (ai, cloud, kubernetes) declared empty | Phases 3–5 |
| Settings.manifest_path / Settings.target + env vars | Phase 2 telemetry config, Phase 3 overlay selection |
| Manifest-to-legacy adapter as a pure function | Phase 4 onboarding assistant |
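
To make the provider seam concrete, here is a rough sketch of a structural protocol plus a name-keyed registry. None of these names or signatures are taken from scalable/providers/base.py: SpecSketch, ProviderSketch, register, get_provider, and EchoProvider are hypothetical, and the in-memory dictionary stands in for the entry-point discovery the real registry performs.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Protocol


@dataclass(frozen=True)
class SpecSketch:
    """Hypothetical provider-neutral description of what to deploy."""

    target: str
    options: Dict[str, Any] = field(default_factory=dict)


class ProviderSketch(Protocol):
    """Structural contract: any object with a matching plan() qualifies."""

    def plan(self, spec: SpecSketch) -> Dict[str, Any]: ...


# Assumption: a plain dict replaces entry-point discovery in this sketch.
_REGISTRY: Dict[str, Callable[[], ProviderSketch]] = {}


def register(name: str, factory: Callable[[], ProviderSketch]) -> None:
    _REGISTRY[name] = factory


def get_provider(name: str) -> ProviderSketch:
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise KeyError(f"no provider registered under {name!r}") from None


class EchoProvider:
    """Toy provider that reports the spec back without starting anything."""

    def plan(self, spec: SpecSketch) -> Dict[str, Any]:
        return {"provider": "echo", "target": spec.target, "options": dict(spec.options)}
```

Because the protocol is structural, EchoProvider does not inherit from ProviderSketch; a third-party provider only needs to supply the expected methods and register a factory under its name.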

Risk and rollback

  • All changes are additive; rollback is a git revert of the squash/merge commit on version/2.0.0 without affecting develop/master.
  • Slurm provider tests are mocked and require no live cluster.
  • Local provider integration test uses processes=False, n_workers=1 to keep CI fast and macOS-stable.

How to review

  1. Skim plans/v2.0.0_phase1_plan.md for the architectural intent.
  2. Read protocol shapes in scalable/providers/base.py — these are frozen for Phase 1.
  3. Read the schema in scalable/manifest/schema.py and scalable/manifest/parser.py.
  4. Read ScalableSession in scalable/session/session.py — note the auto-target heuristic and reserved kwargs.
  5. Skim CLI behavior in scalable/cli/main.py, scalable/cli/cmd_validate.py, and scalable/cli/cmd_plan.py.
  6. Verify no regression in scalable/__init__.py re-exports.

Phase 1 architecture

flowchart LR
    subgraph User
      U[scalable.yaml]
      CLI[scalable CLI]
      PY[ScalableSession.from_yaml]
    end

    subgraph Manifest_Layer
      P[parser]
      V[validate]
      S[schema v1]
      A[adapter]
      L[manifest_lock]
    end

    subgraph Provider_Layer
      B[DeploymentProvider Protocol]
      R[registry]
      LP[LocalProvider]
      SP[SlurmProvider]
    end

    subgraph Existing_v1_1_0
      JC[JobQueueCluster]
      SC[SlurmCluster]
      CC[ScalableClient]
    end

    U --> P --> V --> S
    P --> L
    CLI --> P
    CLI --> V
    CLI --> A
    PY --> P
    PY --> A
    A --> B
    B --> R
    R --> LP
    R --> SP
    LP --> CC
    SP --> SC --> JC

Merge target: version/2.0.0
Source branch: version/2.0.0-phase1-provider-manifest
Tracking PR: #20

crvernon added 12 commits May 19, 2026 14:51
Creates the additive Phase 1 package structure off of version/2.0.0:
manifest/, providers/, session/, planning/, cli/. Each new package ships
with a docstring describing its Phase 1 role and its hooks for later
phases (telemetry, AI assistants, Kubernetes/cloud providers, ML
advisor).

scalable/manifest/schema.py defines the frozen v1 schema dataclasses
(ManifestModel, ProjectConfig, TargetConfig, ComponentConfig, TaskConfig)
and SCHEMA_VERSION = 1. The schema is intentionally implemented with
stdlib dataclasses so manifest validation works without the optional
[ai] extra (resolves Phase 1 plan section 9 open question #1).

scalable/manifest/errors.py declares the ManifestError hierarchy used by
the parser, validator, and Phase 4 AI migration assistant.

scalable/cli/main.py is a Phase 1 stub for the [project.scripts] entry
point; the real validate / plan --dry-run wiring lands in WU-10.

pyproject.toml: version bumped to 2.0.0a1, pyyaml pinned explicitly,
empty placeholder extras for ai/cloud/kubernetes registered so
pip install scalable[ai] resolves cleanly from day one, scalable
console script registered, packages.find used so the new sub-packages
are picked up by setuptools.

Verified: existing 73 unit tests pass unchanged; ruff clean on all new
modules. No public API removed or renamed.

Refs plans/v2.0.0_phase1_plan.md WU-1.
crvernon merged commit fc0a8e8 into version/2.0.0 on May 19, 2026
12 checks passed