phase 2 progress towards telemetry and deterministic advising #21

Merged
crvernon merged 1 commit into version/2.0.0 from version/2.0.0-phase2-telemetry-advising
May 19, 2026
@crvernon
Member

Phase 2: Telemetry and Deterministic Advising

Implements plans/v2.0.0_phase2_plan.md on the dedicated feature branch version/2.0.0-phase2-telemetry-advising, which was branched from the long-lived integration branch version/2.0.0.

Summary

Phase 2 introduces durable run telemetry, a manifest-linked run history store, deterministic resource advising, and a working implementation of the scalable report CLI command. All Phase 1 contracts (ScalableSession, DeploymentProvider, manifest validation, dry-run planning, manifest.lock) are preserved and are now anchored to deterministic per-run telemetry artifacts.

Scope (as defined in the Phase 2 plan)

  • Run history store
  • Task/resource/failure event schema
  • Artifact metadata
  • Baseline ResourceAdvisor
  • Cache hit/miss reports
  • scalable report

Implementation highlights

Telemetry package

New package scalable/telemetry/ with:

  • events.py: versioned dataclasses for RunMetadata, TaskEvent, ResourceEvent, WorkerEvent, FailureEvent, CacheEvent, ArtifactEvent (schema_version=1).
  • store.py: TelemetryStore persists each run under .scalable/runs/<run-id>/ with canonical JSONL streams, persisted manifest.yaml, plan.json, manifest.lock, run.json, finalization summary, and optional parquet snapshots when an extra is installed.
  • collectors.py: deterministic run loading, summary aggregation, --latest resolution, and text/JSON report rendering.
  • runtime.py: contextvar-based active store and task context plumbing (with cross-process global fallback).
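As a rough illustration of the versioned-event and JSONL-stream idea described above, the following is a minimal sketch and not the code in scalable/telemetry/. The TaskEvent field layout, the JsonlRunStore class, and the tasks.jsonl stream name are assumptions made for the example; only the event class name, schema_version=1, and the .scalable/runs/<run-id>/ layout come from the PR text.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path

SCHEMA_VERSION = 1  # matches the schema_version=1 quoted in the PR


@dataclass(frozen=True)
class TaskEvent:
    """Hypothetical field layout; only the class name comes from the PR."""
    run_id: str
    task_id: str
    state: str
    timestamp: float
    schema_version: int = SCHEMA_VERSION


class JsonlRunStore:
    """Illustrative stand-in for TelemetryStore: one directory per run."""

    def __init__(self, runs_dir, run_id):
        self.run_dir = Path(runs_dir) / run_id
        # exist_ok=False makes a reused run id fail loudly instead of mixing runs
        self.run_dir.mkdir(parents=True, exist_ok=False)

    def append(self, stream, event):
        # Sorted keys and compact separators give one canonical line per event
        line = json.dumps(asdict(event), sort_keys=True, separators=(",", ":"))
        with open(self.run_dir / f"{stream}.jsonl", "a", encoding="utf-8") as fh:
            fh.write(line + "\n")

    def load(self, stream):
        with open(self.run_dir / f"{stream}.jsonl", encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]


def demo():
    """Write two events to a throwaway run directory and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        store = JsonlRunStore(Path(tmp) / ".scalable" / "runs", "run-0001")
        store.append("tasks", TaskEvent("run-0001", "t1", "submitted", 0.0))
        store.append("tasks", TaskEvent("run-0001", "t1", "succeeded", 1.5))
        return store.load("tasks")
```

Append-only JSONL keeps each event independently parseable, so a run that crashes partway still leaves a readable history, and carrying schema_version on every record lets later readers branch on the format.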

Session/client/provider/caching integration

  • scalable/session/session.py: creates the run telemetry store at session start, finalizes it at close, and exposes record_artifact(...) for runtime artifact metadata recording.
  • scalable/client.py: instruments ScalableClient.submit and ScalableClient.map to record submitted/running/succeeded/failed/cancelled task events through Dask future callbacks; binds task context for downstream cache events.
  • scalable/caching.py: emits cache hit/miss events with timing information through the runtime hook.
  • scalable/providers/local.py and scalable/providers/slurm.py: emit cluster lifecycle and scaling worker events.
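The instrumentation pattern in this list (task lifecycle events recorded from future callbacks, plus a bound task context that cache events can read) might look roughly like the sketch below. It uses standard-library concurrent.futures as a stand-in for Dask futures, and every name in it (record, cache_lookup, submit, EVENTS) is hypothetical rather than taken from scalable/client.py or scalable/caching.py.

```python
import contextvars
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical names throughout; only the pattern is taken from the PR text.
_current_task = contextvars.ContextVar("current_task", default=None)
EVENTS = []  # stand-in for the active telemetry store


def record(kind, **fields):
    EVENTS.append({"kind": kind, "timestamp": time.time(), **fields})


def cache_lookup(cache, key):
    """Emit a cache hit/miss event tagged with the task bound in this context."""
    start = time.perf_counter()
    hit = key in cache
    record(
        "cache_hit" if hit else "cache_miss",
        key=key,
        task_id=_current_task.get(),
        elapsed_s=time.perf_counter() - start,
    )
    return cache.get(key)


def submit(pool, task_id, fn, *args):
    def run_with_context():
        token = _current_task.set(task_id)  # bind task context inside the worker
        try:
            record("task", task_id=task_id, state="running")
            return fn(*args)
        finally:
            _current_task.reset(token)

    record("task", task_id=task_id, state="submitted")
    future = pool.submit(run_with_context)

    def on_done(fut):
        # Check cancellation first: exception() raises on a cancelled future
        if fut.cancelled():
            state = "cancelled"
        elif fut.exception() is not None:
            state = "failed"
        else:
            state = "succeeded"
        record("task", task_id=task_id, state=state)

    future.add_done_callback(on_done)
    return future


def demo():
    EVENTS.clear()
    cache = {"a": 1}

    def boom():
        raise RuntimeError("boom")

    # Leaving the with-block joins the worker, so all callbacks have run
    with ThreadPoolExecutor(max_workers=1) as pool:
        submit(pool, "t1", cache_lookup, cache, "a")
        submit(pool, "t2", cache_lookup, cache, "b")
        submit(pool, "t3", boom)
    return list(EVENTS)
```

Binding the context variable inside the wrapped function, rather than in the submitting thread, matters because worker threads do not inherit the caller's context. The real implementation presumably handles the cross-process case through the global fallback mentioned for runtime.py.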

Deterministic advising API

  • New scalable/advising/resources.py with ResourceAdvisor.from_history(...) and recommend(...) returning explainable ResourceRecommendation payloads using confidence-indexed quantiles plus safety margins, with sparse-history fallbacks.
  • Top-level exports added in scalable/__init__.py.
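To make "confidence-indexed quantiles plus safety margins, with sparse-history fallbacks" concrete, here is one way such a rule could be written. This is a sketch under stated assumptions, not the ResourceAdvisor implementation: the confidence labels, the quantile values, the 20% margin, the minimum sample count, and the nearest-rank quantile method are all illustrative choices.

```python
import math
from dataclasses import dataclass
from typing import Optional

# Assumed mapping from a confidence label to a quantile of observed usage
CONFIDENCE_QUANTILES = {"low": 0.50, "medium": 0.90, "high": 0.99}


@dataclass(frozen=True)
class Recommendation:
    """Hypothetical payload; the PR's ResourceRecommendation fields are not shown."""
    memory_gb: float
    quantile: Optional[float]
    samples: int
    reason: str


def nearest_rank(sorted_values, q):
    """Smallest observed value with at least a fraction q of the data at or below it."""
    rank = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[rank - 1]


def recommend_memory(history_gb, requested_gb, confidence="medium",
                     safety_margin=0.2, min_samples=5):
    n = len(history_gb)
    if n < min_samples:
        # Sparse-history fallback: too few runs to trust a quantile
        return Recommendation(requested_gb, None, n,
                              "sparse history: keeping requested value")
    q = CONFIDENCE_QUANTILES[confidence]
    base = nearest_rank(sorted(history_gb), q)
    value = round(base * (1 + safety_margin), 3)
    reason = f"p{round(q * 100)} of {n} samples plus {safety_margin:.0%} margin"
    return Recommendation(value, q, n, reason)
```

Because the rule sorts the history and uses no randomness, the same history always yields the same recommendation, and the reason string records how the number was derived.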

CLI: scalable report

Configuration

New telemetry settings in scalable/common.py and tests in tests/unit/test_common_settings.py:

  • runs_dir (SCALABLE_RUNS_DIR, default ./.scalable/runs)
  • telemetry_enabled (SCALABLE_TELEMETRY, default enabled for manifest-driven sessions)
  • telemetry_parquet (SCALABLE_TELEMETRY_PARQUET, optional)
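A minimal sketch of how these three settings could be resolved from the environment follows. The variable names and the runs_dir default come from the list above; the TelemetrySettings class, the accepted boolean spellings, and the choice of False as the parquet default are assumptions, and the sketch ignores the "manifest-driven sessions" condition on the telemetry default.

```python
import os
from dataclasses import dataclass
from pathlib import Path

# Assumed boolean spellings; the PR does not list the accepted values
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def _env_bool(env, name, default):
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError(f"{name} must be a boolean-like value, got {raw!r}")


@dataclass(frozen=True)
class TelemetrySettings:
    """Hypothetical container; scalable/common.py may structure this differently."""
    runs_dir: Path
    telemetry_enabled: bool
    telemetry_parquet: bool


def load_telemetry_settings(env=None):
    # Accepting a mapping keeps the function testable without touching os.environ
    env = os.environ if env is None else env
    return TelemetrySettings(
        runs_dir=Path(env.get("SCALABLE_RUNS_DIR", "./.scalable/runs")),
        telemetry_enabled=_env_bool(env, "SCALABLE_TELEMETRY", True),
        telemetry_parquet=_env_bool(env, "SCALABLE_TELEMETRY_PARQUET", False),
    )
```

For example, load_telemetry_settings({"SCALABLE_TELEMETRY": "0"}) would return settings with telemetry disabled and the default run directory.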

Documentation

Tests

Files changed

Created

  • scalable/telemetry/__init__.py
  • scalable/telemetry/events.py
  • scalable/telemetry/store.py
  • scalable/telemetry/collectors.py
  • scalable/telemetry/runtime.py
  • scalable/advising/__init__.py
  • scalable/advising/resources.py
  • scalable/cli/cmd_report.py
  • docs/telemetry.rst
  • docs/advising.rst
  • tests/unit/test_telemetry_collectors.py
  • tests/unit/test_telemetry_store.py
  • tests/unit/test_cli_report.py
  • tests/unit/test_resource_advisor.py
  • tests/integration/test_session_telemetry_local.py

Modified

  • scalable/__init__.py
  • scalable/common.py
  • scalable/caching.py
  • scalable/client.py
  • scalable/session/session.py
  • scalable/providers/local.py
  • scalable/providers/slurm.py
  • scalable/cli/main.py
  • scalable/cli/__init__.py
  • tests/conftest.py
  • tests/unit/test_common_settings.py
  • tests/unit/test_public_api_exports.py
  • README.md
  • docs/index.rst
  • docs/getting_started.rst
  • CHANGELOG.md

Removed

  • None. Phase 2 is strictly additive.

Validation

  • ruff check scalable tests passes
  • pytest -q passes: 166 tests, 0 failures (Phase 1 regression tests preserved + Phase 2 unit/integration suites added)

Phase 2 success criteria checklist

  • Starting a session creates a unique run directory under .scalable/runs/
  • Each run directory contains manifest.yaml, plan.json, manifest.lock, and run.json
  • Task events captured for queued/running/succeeded/failed/cancelled with timestamps and run linkage
  • Resource events include requested CPU/memory/walltime and provider context
  • Cache hit/miss events persisted with task/function context
  • Failure events include classification, message, and provider context
  • Artifact metadata records persisted via session API
  • ResourceAdvisor.from_history(...) builds deterministic state
  • ResourceAdvisor.recommend(...) supports confidence-indexed quantile recommendations with safety margins
  • scalable report implemented with text and JSON outputs
  • Unit + integration tests for telemetry, report, and advisor
  • Documentation and changelog updates

Cross-phase groundwork enabled

| Artifact | Consumer |
| --- | --- |
| Stable event schemas with schema_version | Phase 4 diagnosis, Phase 5 ML |
| Run store with manifest linkage | Phase 4 explain, migration assistant |
| ResourceAdvisor API | Phase 5 learned advisor (drop-in implementation) |
| Cache event metrics | Phase 3 remote cache, Phase 5 cache policy learning |
| scalable report JSON envelope | Phase 4 assistant-readable diagnostics |

Out of scope (per Phase 2 plan)

  • No Kubernetes/cloud providers (Phase 3)
  • No LLM-driven planning/diagnosis (Phase 4)
  • No learned predictive models or emulators (Phase 5)
  • No replacement of legacy imperative API paths

Branching and workflow

  • Source branch: version/2.0.0-phase2-telemetry-advising
  • Target branch: version/2.0.0
  • Phase 3 will branch off version/2.0.0 after this PR is merged
  • version/2.0.0 will not be merged to master until all phases land

crvernon merged commit 1ae928c into version/2.0.0 on May 19, 2026
12 checks passed