Implement provenance-complete reproducible query results with deterministic replay#996

Merged
SkBlaz merged 6 commits into master from copilot/implement-provenance-graph-queries
Jan 6, 2026
Conversation

Contributor

Copilot AI commented Jan 5, 2026

Adds versioned provenance schema that captures query AST, network snapshots, and randomness state to enable byte-identical result reproduction across executions.

Core Changes

New py3plex.provenance module (~1,150 LOC)

  • schema.py: Versioned provenance schema v1.0 with log/replayable modes
  • capture.py: Canonical network serialization with stable ordering
  • replay.py: Query reconstruction from serialized AST
  • bundle.py: Compressed JSON bundle I/O
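The key property of the capture layer is that serialization is order-stable, so equal networks always hash to the same fingerprint. A minimal sketch of that idea, using only the stdlib; the function name and the tuple shapes for nodes/edges are illustrative, not the actual `capture.py` API:

```python
import hashlib
import json

def canonical_snapshot(nodes, edges):
    """Serialize a network with stable (sorted) ordering so that
    structurally equal graphs always produce byte-identical JSON
    and therefore the same fingerprint.

    `nodes`: iterable of (node_id, layer, attrs) tuples.
    `edges`: iterable of (src, src_layer, dst, dst_layer, attrs) tuples.
    Both shapes are assumptions for this sketch.
    """
    snapshot = {
        "nodes": sorted(
            ({"id": n, "layer": l, "attrs": a} for n, l, a in nodes),
            key=lambda d: (str(d["id"]), str(d["layer"])),
        ),
        "edges": sorted(
            ({"src": s, "src_layer": sl, "dst": t,
              "dst_layer": tl, "attrs": a}
             for s, sl, t, tl, a in edges),
            key=lambda d: (str(d["src"]), str(d["dst"]),
                           str(d["src_layer"]), str(d["dst_layer"])),
        ),
    }
    # sort_keys + compact separators make the JSON byte-stable,
    # which is what makes the SHA-256 fingerprint reproducible
    payload = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return snapshot, hashlib.sha256(payload.encode()).hexdigest()
```

Feeding the same graph in a different iteration order yields the same hash, which is the property replay verification relies on.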

QueryResult extensions

  • .provenance property: structured metadata access
  • .is_replayable property: mode detection
  • .replay(strict=bool): deterministic re-execution
  • .export_bundle(path): gzip-compressed archive

DSL v2 builder methods

  • .provenance(mode, capture, seed): explicit configuration
  • .reproducible(enabled, seed): convenience sugar

Auto-capture policy

  • ≤10K nodes/50K edges: inline snapshot
  • Larger: fingerprint only (prevents OOM)
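The policy above reduces to a single threshold check. A sketch, with the thresholds matching the defaults stated above; the function name and the `capture_method` strings mirror the schema's vocabulary but the helper itself is hypothetical:

```python
def choose_capture_method(n_nodes, n_edges,
                          node_limit=10_000, edge_limit=50_000):
    """Auto-capture policy: store a full inline snapshot for small
    graphs, fall back to a fingerprint-only record above the
    thresholds to avoid blowing up memory on large networks."""
    if n_nodes <= node_limit and n_edges <= edge_limit:
        return "snapshot_graph"
    return "fingerprint_only"
```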

Usage

# Execute with replayable provenance
# (assumes the DSL v2 builder `Q`, the `network` object, and the
#  loader `replay_from_bundle` are already in scope)
result = (
    Q.nodes()
     .reproducible(True, seed=42)
     .uq(method="bootstrap", n_samples=100)
     .compute("betweenness_centrality")
     .execute(network)
)

# Deterministic replay (identical CI bounds)
result2 = result.replay()

# Portable bundles
result.export_bundle("analysis.json.gz")
result3 = replay_from_bundle("analysis.json.gz")

Implementation Details

  • AST serialization handles enum reconstruction (Target/ExportTarget)
  • Dual provenance system: legacy ProvenanceBuilder + new ProvenanceSchema coexist
  • Seed capture via numpy.random.SeedSequence for UQ determinism
  • Default remains log mode (backward compatible)
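Why `SeedSequence` makes UQ replay deterministic: spawning child sequences from the same base seed is itself deterministic, so recording only the base seed is enough to reconstruct every per-sample RNG stream. A standalone illustration (not py3plex code):

```python
import numpy as np

# spawn one child sequence per bootstrap sample; spawn keys are
# assigned deterministically, so the same base seed always yields
# the same family of independent streams
children = np.random.SeedSequence(42).spawn(3)
draws = [np.random.default_rng(c).random() for c in children]

# a fresh SeedSequence built from the same base seed reproduces
# every per-sample draw exactly -- this is what replay relies on
children2 = np.random.SeedSequence(42).spawn(3)
draws2 = [np.random.default_rng(c).random() for c in children2]
assert draws == draws2
```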

Testing

  • 11 new tests: capture roundtrip, replay determinism, bundle I/O
  • 154 existing DSL tests verified (no regressions)
  • Integration test: full bundle workflow

Documentation

  • AGENTS.md: comprehensive provenance section
  • docfiles/user_guide/dsl.rst: usage patterns, limitations, best practices
  • examples/network_analysis/example_replayable_queries.py: 5 working demos
Original prompt

This section details the original issue you should resolve.

<issue_title>replicability</issue_title>
<issue_description>COPILOT PROMPT (py3plex) — Implement “Provenance-Complete Reproducible Graph Queries” (Replayable Results)
Constraints:

  • DO NOT create any new .md files.
  • Update existing AGENTS.md (root) to reflect the new feature + agent guidance.
  • Update existing .rst docs (in docfiles/) to document the feature (edit existing files only; no new rst).
  • Add tests + examples as needed (prefer adding to existing example files if present; otherwise add new .py example files under examples/, which is allowed).
  • Keep backward compatibility: existing QueryResult + legacy dict export must not break.

Goal:
Make every query result optionally “replayable”: it carries sufficient provenance to deterministically reproduce the result later, including query AST, parameter bindings, network snapshot/delta, randomness state, and execution plan. Provide a clean API: result.provenance, result.is_replayable, result.replay(), result.export_bundle(), and loader utilities.

================================================================================
PHASE 0 — Repo reconnaissance (fast + precise)
TODO:

  1. Locate current provenance implementation:
    • Search for: meta["provenance"], "ast_hash", "dsl_v2_executor", QueryResult.meta
    • Identify where provenance is assembled (DSL executor, graph_ops, pipeline steps).
  2. Find QueryResult class + serialization exports:
    • QueryResult.to_dict(), to_pandas(), to_networkx(), to_arrow()
    • Legacy DSL returns dict; confirm how meta is attached there.
  3. Find network mutation/versioning:
    • Is there a network_version counter? (It’s mentioned as possibly None in docs.)
    • Identify how to fingerprint networks (node/edge counts, layers) today.
  4. Inspect docs targets to update (existing only):
    • docfiles/user_guide/dsl.rst (and any other relevant rst already present: workflows, uncertainty, temporal, provenance mentions).
  5. Inspect AGENTS.md sections about provenance, reproducibility, UQ, determinism.

Deliverable at end of Phase 0:

  • A short internal design note (in code comments / PR description only; no new md) that enumerates touched modules + chosen serialization approach.

================================================================================
PHASE 1 — Define the provenance contract (data model)
Design principle:

  • “Provenance logging” (current) ≠ “Replayable provenance” (new).
  • Add a structured contract with versioning so it’s stable and evolvable.

TODO:

  1. Add a new provenance schema version:
    • provenance["schema_version"] = "1.0" (string)
    • provenance["mode"] in {"log", "replayable"} (log = existing behavior; replayable = new)
  2. Extend provenance payload shape (minimum replay set):
    A) query:
      • engine, ast_hash, ast (serialized), ast_summary, params/bindings
      • execution_plan (ordered stages + key toggles like autocompute, fast_path)
    B) randomness:
      • base_seed
      • derived_seeds (per stage / per sample) OR a seed-sequence descriptor
      • NOTE: must support UQ (n_samples) deterministically
    C) network_capture:
      • capture_method in {"fingerprint_only", "snapshot_graph", "delta_from_dataset", "delta_from_base"}
      • base_reference (optional): dataset id + version OR file path hash OR user-provided label
      • node_table / edge_table (if snapshot_graph)
      • or delta representation (ops log) if available
      • encoding metadata (format, compression)
    D) environment:
      • py3plex_version
      • python_version
      • platform
      • dependency versions for critical libs (networkx, numpy, scipy, etc.) best-effort
  3. Add size guardrails:
    • provenance["size_bytes_estimate"]
    • provenance["warnings"] when capture too large and falls back
    • allow user config: capture="auto|fingerprint|snapshot|delta"

Implementation hint:

  • Put schema in a dedicated module, e.g. py3plex/provenance/schema.py (new .py ok).
  • Keep it dependency-light; use stdlib json + hashlib + platform + sys.
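Pulling the contract above together, a minimal stdlib-only sketch of the v1.0 payload; the field names follow the contract as written, but the builder function itself is hypothetical and only fills the smallest useful subset:

```python
import platform
import sys

def build_provenance(mode="log", ast_hash=None, base_seed=None):
    """Sketch of a v1.0 provenance record. Field names follow the
    contract defined above; the builder and its defaults are
    assumptions, not the real schema.py API."""
    assert mode in {"log", "replayable"}
    return {
        "schema_version": "1.0",
        "mode": mode,
        "query": {"engine": "dsl_v2", "ast_hash": ast_hash},
        "randomness": {"base_seed": base_seed},
        "network_capture": {"capture_method": "fingerprint_only"},
        "environment": {
            "python_version": sys.version.split()[0],
            "platform": platform.platform(),
        },
        "warnings": [],
    }
```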

================================================================================
PHASE 2 — Implement network capture (snapshot / delta)
Goal:
Enable deterministic replay by capturing the exact network state used at execution time.

TODO:

  1. Implement canonical network snapshot export (lossless for py3plex):
    • Export nodes with (id, layer, attributes)
    • Export edges with (source, target, source_layer, target_layer, attributes including weight, timestamps if temporal)
    • Preserve directedness + multigraph semantics if applicable
    • Ensure stable ordering (sorted) for hashing and deterministic serialization
  2. Add compact serialization:
    • JSON is easiest; optionally compress with gzip (stdlib) when exporting bundles
    • Provide in-memory dict form stored in provenance when small enough
  3. Add “auto capture” policy:
    • If graph <= threshold (configurable: node_count/edge_count or bytes), store s...
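The compact-serialization step above (JSON plus stdlib gzip) needs nothing beyond two small helpers. A sketch under those assumptions; the function names are illustrative, not the real bundle.py API:

```python
import gzip
import json

def write_bundle(path, provenance):
    """Write a provenance bundle as gzip-compressed JSON,
    using only the stdlib as suggested above."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(provenance, f, sort_keys=True)

def read_bundle(path):
    """Load a bundle written by write_bundle."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```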


Copilot AI and others added 4 commits January 5, 2026 20:39
Co-authored-by: SkBlaz <10035780+SkBlaz@users.noreply.github.com>
Copilot AI changed the title [WIP] Implement provenance-complete reproducible graph queries Implement provenance-complete reproducible query results with deterministic replay Jan 5, 2026
Copilot AI requested a review from SkBlaz January 5, 2026 20:54
@SkBlaz SkBlaz marked this pull request as ready for review January 6, 2026 01:29
@SkBlaz SkBlaz merged commit 7113f64 into master Jan 6, 2026
31 checks passed


Development

Successfully merging this pull request may close these issues.

replicability