phase 2 progress towards telemetry and deterministic advising #21
Merged
This was referenced May 19, 2026
Phase 2: Telemetry and Deterministic Advising
Implements `plans/v2.0.0_phase2_plan.md` on the dedicated feature branch `version/2.0.0-phase2-telemetry-advising`, branched off the long-lived integration branch `version/2.0.0`.

Summary
Phase 2 introduces durable run telemetry, a manifest-linked run history store, deterministic resource advising, and a working implementation of the `scalable report` CLI command. All Phase 1 contracts (`ScalableSession`, `DeploymentProvider`, manifest validation, dry-run planning, `manifest.lock`) are preserved and are now anchored to deterministic per-run telemetry artifacts.

Scope (as defined in the Phase 2 plan)
- `ResourceAdvisor`
- `scalable report`

Implementation highlights
Telemetry package
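As a rough illustration of what a versioned event and a canonical JSONL stream could look like for this package, here is a minimal sketch. It is not the code in `scalable/telemetry/`: the class `TaskEventSketch`, the helpers `append_event` and `load_events`, the field names, and the `tasks.jsonl` file name are all assumptions made for the example. Only `schema_version=1` and the `<runs-dir>/<run-id>/` layout come from this PR.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path

SCHEMA_VERSION = 1  # the PR states schema_version=1 for telemetry events


@dataclass(frozen=True)
class TaskEventSketch:
    """Hypothetical stand-in for a versioned telemetry event."""

    run_id: str
    task_id: str
    state: str  # e.g. "submitted", "running", "succeeded"
    timestamp: float
    schema_version: int = SCHEMA_VERSION


def append_event(runs_dir: Path, event: TaskEventSketch) -> Path:
    """Append one event as a JSON line under <runs_dir>/<run-id>/."""
    run_dir = runs_dir / event.run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    stream = run_dir / "tasks.jsonl"  # assumed stream name
    # Sorted keys and fixed separators keep the serialized form stable,
    # so identical events always produce identical bytes.
    line = json.dumps(asdict(event), sort_keys=True, separators=(",", ":"))
    with stream.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return stream


def load_events(stream: Path) -> list:
    """Read the stream back in the order it was written."""
    with stream.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Appending one line per event keeps writes cheap during a run, and the fixed key order makes two runs with the same events diff cleanly.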
New package `scalable/telemetry/` with:
- `events.py`: versioned dataclasses for `RunMetadata`, `TaskEvent`, `ResourceEvent`, `WorkerEvent`, `FailureEvent`, `CacheEvent`, `ArtifactEvent` (`schema_version=1`).
- `store.py`: `TelemetryStore` persists each run under `.scalable/runs/<run-id>/` with canonical JSONL streams, persisted `manifest.yaml`, `plan.json`, `manifest.lock`, `run.json`, a finalization summary, and optional parquet snapshots when an extra is installed.
- `collectors.py`: deterministic run loading, summary aggregation, `--latest` resolution, and text/JSON report rendering.
- `runtime.py`: contextvar-based active store and task context plumbing (with cross-process global fallback).

Session/client/provider/caching integration
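The client instrumentation in this section records task state transitions through future callbacks. The sketch below shows that general pattern using the standard library's `concurrent.futures` as a stand-in for Dask futures; it is not the code in `scalable/client.py`. The helpers `record` and `submit_with_telemetry` and the event dictionary shape are hypothetical, and the "running" state is left out because a plain executor does not expose it.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


def record(events: list, task_id: str, state: str) -> None:
    """Append a task event; a real store would persist it instead."""
    events.append({"task_id": task_id, "state": state, "ts": time.time()})


def submit_with_telemetry(executor, events: list, task_id: str, fn, *args) -> Future:
    """Submit fn and record its terminal state when the future completes."""
    record(events, task_id, "submitted")

    def _on_done(fut: Future) -> None:
        # Check cancellation first: calling exception() on a cancelled
        # future would raise CancelledError.
        if fut.cancelled():
            state = "cancelled"
        elif fut.exception() is not None:
            state = "failed"
        else:
            state = "succeeded"
        record(events, task_id, state)

    fut = executor.submit(fn, *args)
    fut.add_done_callback(_on_done)
    return fut


if __name__ == "__main__":
    log: list = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        submit_with_telemetry(pool, log, "t-1", pow, 2, 10)
    for event in log:
        print(event["task_id"], event["state"])
```

Recording from a callback keeps the submit path non-blocking: the caller gets the future back immediately and the terminal event is written whenever the task finishes.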
- `scalable/session/session.py`: creates the run telemetry store at session start, finalizes it at close, and exposes `record_artifact(...)` for recording runtime artifact metadata.
- `scalable/client.py`: instruments `ScalableClient.submit` and `ScalableClient.map` to record submitted/running/succeeded/failed/cancelled task events through Dask future callbacks; binds task context for downstream cache events.
- `scalable/caching.py`: emits cache hit/miss events with timing information through the runtime hook.
- `scalable/providers/local.py` and `scalable/providers/slurm.py`: emit cluster lifecycle and scaling worker events.

Deterministic advising API
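To make "confidence-indexed quantiles plus safety margins, with sparse-history fallbacks" concrete, here is one way such a recommendation could be computed. This is a sketch under stated assumptions, not the `ResourceAdvisor` implementation: the confidence-to-quantile table, the 20% margin, the `min_samples` threshold, the default value, and the names `RecommendationSketch`, `nearest_rank`, and `recommend_memory` are all invented for illustration.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence

# Assumed mapping from a confidence label to the quantile of observed usage.
CONFIDENCE_QUANTILES = {"low": 0.50, "medium": 0.90, "high": 0.99}


@dataclass(frozen=True)
class RecommendationSketch:
    memory_mb: float
    quantile: Optional[float]  # None when a fallback path was used
    samples: int
    reason: str  # human-readable explanation of how the value was chosen


def nearest_rank(values: Sequence[float], q: float) -> float:
    """Nearest-rank quantile: deterministic, no interpolation."""
    ordered = sorted(values)
    # The small epsilon guards against float noise in q * n before ceil.
    rank = max(1, math.ceil(q * len(ordered) - 1e-9))
    return ordered[rank - 1]


def recommend_memory(
    history_mb: Sequence[float],
    confidence: str = "medium",
    safety_margin: float = 0.2,
    min_samples: int = 5,
    default_mb: float = 1024.0,
) -> RecommendationSketch:
    n = len(history_mb)
    if n == 0:
        return RecommendationSketch(default_mb, None, 0, "no history: default used")
    if n < min_samples:
        # Sparse history: a quantile is not meaningful, so use the peak.
        peak = max(history_mb)
        return RecommendationSketch(
            peak * (1 + safety_margin), None, n,
            "sparse history: max observed plus safety margin",
        )
    q = CONFIDENCE_QUANTILES[confidence]
    base = nearest_rank(history_mb, q)
    return RecommendationSketch(
        base * (1 + safety_margin), q, n,
        f"q={q} of {n} samples plus {safety_margin:.0%} safety margin",
    )
```

Because the quantile uses nearest rank on sorted data and no randomness, the same history always yields the same recommendation, and the `reason` field keeps the result explainable.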
- `scalable/advising/resources.py` with `ResourceAdvisor.from_history(...)` and `recommend(...)` returning explainable `ResourceRecommendation` payloads using confidence-indexed quantiles plus safety margins, with sparse-history fallbacks.
- `scalable/__init__.py`.

CLI: `scalable report`
- `scalable/cli/cmd_report.py` replaces the Phase 1 stub.
- `--runs-dir`, `--run-id`, `--latest`, `--format text|json`, `--output`.
- `scalable/cli/main.py` and the package CLI module docs in `scalable/cli/__init__.py`.

Configuration
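The telemetry settings in this section are driven by environment variables. A minimal sketch of how such settings might be resolved follows. The variable names and the `./.scalable/runs` default come from this PR; the boolean parsing rules, the parquet default of off, and the names `TelemetrySettingsSketch` and `load_settings` are assumptions, and the actual logic in `scalable/common.py` may differ.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping

# Assumed spellings that switch a flag off; anything else counts as on.
_FALSY = {"", "0", "false", "no", "off"}


def _flag(env: Mapping[str, str], name: str, default: bool) -> bool:
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in _FALSY


@dataclass(frozen=True)
class TelemetrySettingsSketch:
    runs_dir: Path
    telemetry_enabled: bool
    telemetry_parquet: bool


def load_settings(env: Mapping[str, str] = os.environ) -> TelemetrySettingsSketch:
    """Resolve telemetry settings from an environment-like mapping."""
    return TelemetrySettingsSketch(
        runs_dir=Path(env.get("SCALABLE_RUNS_DIR", "./.scalable/runs")),
        # Enabled by default, as described for manifest-driven sessions.
        telemetry_enabled=_flag(env, "SCALABLE_TELEMETRY", True),
        # Parquet output is optional; off by default in this sketch.
        telemetry_parquet=_flag(env, "SCALABLE_TELEMETRY_PARQUET", False),
    )
```

Taking the environment as a mapping argument makes the resolver easy to unit test with a plain dictionary instead of patching `os.environ`.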
New telemetry settings in `scalable/common.py` and tests in `tests/unit/test_common_settings.py`:
- `runs_dir` (`SCALABLE_RUNS_DIR`, default `./.scalable/runs`)
- `telemetry_enabled` (`SCALABLE_TELEMETRY`, default enabled for manifest-driven sessions)
- `telemetry_parquet` (`SCALABLE_TELEMETRY_PARQUET`, optional)

Documentation
- `docs/telemetry.rst`
- `docs/advising.rst`
- `docs/index.rst` and `docs/getting_started.rst`
- `README.md`

Tests
- `tests/unit/test_telemetry_collectors.py`
- `tests/unit/test_telemetry_store.py`
- `tests/unit/test_cli_report.py`
- `tests/unit/test_resource_advisor.py`
- `tests/integration/test_session_telemetry_local.py`
- `tests/conftest.py` and updated `tests/unit/test_common_settings.py` and `tests/unit/test_public_api_exports.py`.

Files changed
Created
- `scalable/telemetry/__init__.py`
- `scalable/telemetry/events.py`
- `scalable/telemetry/store.py`
- `scalable/telemetry/collectors.py`
- `scalable/telemetry/runtime.py`
- `scalable/advising/__init__.py`
- `scalable/advising/resources.py`
- `scalable/cli/cmd_report.py`
- `docs/telemetry.rst`
- `docs/advising.rst`
- `tests/unit/test_telemetry_collectors.py`
- `tests/unit/test_telemetry_store.py`
- `tests/unit/test_cli_report.py`
- `tests/unit/test_resource_advisor.py`
- `tests/integration/test_session_telemetry_local.py`

Modified
- `scalable/__init__.py`
- `scalable/common.py`
- `scalable/caching.py`
- `scalable/client.py`
- `scalable/session/session.py`
- `scalable/providers/local.py`
- `scalable/providers/slurm.py`
- `scalable/cli/main.py`
- `scalable/cli/__init__.py`
- `tests/conftest.py`
- `tests/unit/test_common_settings.py`
- `tests/unit/test_public_api_exports.py`
- `README.md`
- `docs/index.rst`
- `docs/getting_started.rst`
- `CHANGELOG.md`

Removed
Validation
- `ruff check scalable tests` passes
- `pytest -q` passes: 166 tests, 0 failures (Phase 1 regression tests preserved, Phase 2 unit and integration suites added)

Phase 2 success criteria checklist
- `.scalable/runs/`
- `manifest.yaml`, `plan.json`, `manifest.lock`, and `run.json`
- `ResourceAdvisor.from_history(...)` builds deterministic state
- `ResourceAdvisor.recommend(...)` supports confidence-indexed quantile recommendations with safety margins
- `scalable report` implemented with text and JSON outputs

Cross-phase groundwork enabled
- `schema_version`
- `ResourceAdvisor` API
- `scalable report` JSON envelope

Out of scope (per Phase 2 plan)
Branching and workflow
- `version/2.0.0-phase2-telemetry-advising`
- `version/2.0.0`
- `version/2.0.0` after this PR is merged
- `version/2.0.0` will not be merged to `master` until all phases land