feat(perf): three-mode memory harness + sprint tracker (S1) #28
Conversation
Sprint 1 of the 7-Dimension Extension. Lands two things:
1. Honest memory measurement via a new --memory-mode flag with three modes plus a composite:
- getrusage (default, cheap, sticky lifetime peak — preserves prior behavior)
- tracemalloc (Python-heap peak — exposes Rust vs. Python-allocator divergence)
- time (per-iteration /usr/bin/time -l subprocess — OS-honest peak RSS,
includes Python startup + adapter import; intended for quarterly deep-dive)
- all (composite — runs every iteration in all three for direct comparison)
PerfOpResult gains rss_kb_via_time and python_heap_peak_kb fields. The
existing rss_peak_mb is unchanged, so historical results stay comparable.
The HTML dashboard renders dual "RSS — getrusage / time -l" cells with a
tooltip explaining the divergence whenever the time-l field is present.
2. TRACKER.md at the repo root — single source of truth for the 7-sprint
roadmap. Each sprint flips its row (Planned → In Progress → Shipped) and
appends an acceptance note so any session can resume cold.
DEC-018 in decisions.md documents why three modes coexist (no single mode
tells the truth: getrusage is sticky, tracemalloc misses Rust, time -l is
slow but OS-honest).
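The DEC-018 contrast between the two in-process primitives can be sketched in a few lines. This is an illustrative comparison only, not the excelbench harness; `measure_heap_and_rss` is a hypothetical helper name.

```python
# Illustrative sketch (not the excelbench code): the two in-process
# primitives DEC-018 contrasts, wrapped around one workload call.
import resource
import tracemalloc


def measure_heap_and_rss(fn):
    """Return (python_heap_peak_kb, ru_maxrss_after) for one call of fn."""
    tracemalloc.start()
    fn()
    _, heap_peak = tracemalloc.get_traced_memory()  # Python-heap peak, bytes
    tracemalloc.stop()
    # ru_maxrss is the "sticky" lifetime peak: it never decreases, so two
    # consecutive runs in one process cannot be compared with it alone.
    # (Units also differ: kilobytes on Linux, bytes on macOS.)
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return heap_peak / 1024, rss


heap_kb, rss_peak = measure_heap_and_rss(lambda: bytes(8_000_000))
# heap_kb sees the 8 MB allocation; a Rust-side allocator would be
# invisible here, which is exactly the divergence the PR measures.
```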
Verification:
uv run pytest tests/ → 1140 passed
uv run ruff check / mypy → clean
excelbench perf --memory-mode=all → all 3 fields populated; openpyxl
uses 16× (read) / 227× (write) more Python heap than wolfxl on the
same fixture, confirming the honesty signal.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR introduces a multi-strategy memory measurement harness for excelbench perf (Sprint 1 of the “7-Dimension Extension” roadmap), adds a --memory-mode CLI flag to select measurement strategy, and updates result rendering/docs to carry and explain the new memory fields.
Changes:
- Add `--memory-mode {getrusage,tracemalloc,time,all}` and plumb it through the perf runner, extending `PerfOpResult` with `rss_kb_via_time` and `python_heap_peak_kb`.
- Implement a new memory harness (`MemoryProbe` + `/usr/bin/time` subprocess wrapper) plus an internal one-iteration subprocess entrypoint.
- Update the HTML dashboard to render a dual RSS column (getrusage/time) with explanatory tooltips; add `TRACKER.md` and DEC-018 documentation.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| src/excelbench/results/html_dashboard.py | Renders dual RSS cells/tooltip when time-based RSS is present. |
| src/excelbench/perf/runner.py | Plumbs memory_mode, records new memory fields, and runs optional /usr/bin/time subprocess probes. |
| src/excelbench/perf/memory.py | New memory measurement module (in-process probe + /usr/bin/time wrapper + stderr parser). |
| src/excelbench/perf/_iter_subprocess.py | New internal subprocess entrypoint to run a single iteration for /usr/bin/time wrapping. |
| src/excelbench/cli.py | Adds and validates the --memory-mode CLI option. |
| decisions.md | Documents rationale/constraints for three coexisting memory modes (DEC-018). |
| TRACKER.md | Adds the sprint tracker and S1 acceptance notes/template. |
```diff
@@ -748,13 +858,13 @@ def _measure_write_workload_iteration(
     wall1 = time.perf_counter_ns()
     cpu1 = time.process_time_ns()

     rss_after = _ru_maxrss_mb(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
     rss_peak = max(rss_before, rss_after)
     probe.__exit__(None, None, None)
```
Fixed in 4a80614 — _measure_write_workload_iteration now uses with MemoryProbe(memory_mode) as probe:. Cleanup is guaranteed even if any adapter call (create_workbook, add_sheet, _run_workload_write, save_workbook) raises.
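The exception-safety pattern this fix adopts can be sketched with a minimal context manager (names assumed; the real `MemoryProbe` in excelbench carries more state than this):

```python
# Hedged sketch of the with-block pattern the fix adopts; the real
# MemoryProbe in excelbench.perf.memory is richer than this.
import tracemalloc


class MemoryProbe:
    def __init__(self, mode: str) -> None:
        self.mode = mode
        self.python_heap_peak_kb: float | None = None

    def __enter__(self) -> "MemoryProbe":
        if self.mode in ("tracemalloc", "all"):
            tracemalloc.start()
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs even when the wrapped adapter call raises, so tracemalloc
        # state can never leak into the next iteration.
        if self.mode in ("tracemalloc", "all"):
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            self.python_heap_peak_kb = peak / 1024


try:
    with MemoryProbe("tracemalloc") as probe:
        raise RuntimeError("simulated adapter failure")
except RuntimeError:
    pass

cleaned_up = not tracemalloc.is_tracing()  # True: __exit__ still ran
```

A manual `probe.__enter__()` / `probe.__exit__(None, None, None)` pair, by contrast, skips the exit call entirely when the body raises.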
```diff
@@ -1279,13 +1390,13 @@ def _measure_write_iteration(
     wall1 = time.perf_counter_ns()
     cpu1 = time.process_time_ns()

     rss_after = _ru_maxrss_mb(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
     rss_peak = max(rss_before, rss_after)
     probe.__exit__(None, None, None)
```
Fixed in 4a80614 — _measure_write_iteration now uses with MemoryProbe(memory_mode) as probe:. Same cleanup guarantee as the workload helpers.
```python
        "--feature",
        feature,
    ]
    _metrics, rss_kb = run_iteration_under_time_l(cli_args, cwd=manifest_path.parent.parent)
```
Fixed in 4a80614 — dropped cwd=manifest_path.parent.parent. The subprocess takes paths explicitly via --manifest, so the parent's cwd doesn't matter and forcing it could break python -m excelbench.perf._iter_subprocess resolution in some installs.
```python
    cli_args = [
        sys.executable,
        "-m",
        "excelbench.perf._iter_subprocess",
        "--library",
        adapter_name,
        "--kind",
        kind,
        "--manifest",
        str(manifest_path),
        "--feature",
        feature,
    ]
    _metrics, rss_kb = run_iteration_under_time_l(cli_args, cwd=manifest_path.parent.parent)
    return rss_kb
```
Fixed in 4a80614 — added a new _measure_one_sample helper that centralizes the iteration logic. In time-only mode it runs ONLY the subprocess and uses its JSON metrics for wall_ms/cpu_ms (no double-execute). In all mode both the in-process loop and the subprocess run (intentional — the divergence is the deep-dive signal). Verified with a smoke test that --memory-mode=time produces both wall_ms and rss_kb_via_time from a single subprocess invocation per recorded sample.
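The single-execution guarantee can be illustrated with a toy dispatcher. This is an assumed shape, not the actual `_measure_one_sample` helper, which takes the real iteration callables and richer metrics:

```python
# Toy model of the dispatch described in the reply (assumed shape):
# "time" mode runs only the subprocess path, "all" runs both paths,
# and every other mode stays purely in-process.
def measure_one_sample(mode, run_inprocess, run_subprocess):
    if mode == "time":
        metrics, rss_kb_via_time = run_subprocess()
        return metrics, rss_kb_via_time  # subprocess JSON supplies wall/cpu
    metrics = run_inprocess()
    rss_kb_via_time = run_subprocess()[1] if mode == "all" else None
    return metrics, rss_kb_via_time


calls = {"inproc": 0, "sub": 0}


def fake_inprocess():
    calls["inproc"] += 1
    return {"wall_ms": 1.0}


def fake_subprocess():
    calls["sub"] += 1
    return {"wall_ms": 2.0}, 4096.0


time_result = measure_one_sample("time", fake_inprocess, fake_subprocess)
# time mode: one subprocess run, zero in-process runs (no double-execute)
```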
```python
    "in-process RUSAGE_SELF peak. 'tracemalloc' adds the Python heap "
    "peak (misses Rust allocations). 'time' spawns each iteration "
    "under /usr/bin/time -l for an OS-honest peak RSS (slow). "
    "'all' runs every iteration in all three modes — quarterly deep-dive, "
    "not the CI hot path."
```
Fixed in 4a80614 — updated the help text to describe the new behavior. tracemalloc mode now reports getrusage RSS plus the Python heap peak (see linked memory.py change), time mode is described as -l on macOS / -v on Linux and noted as subprocess-only (no double-execute).
```python
    time_l = _resolve_time_l_path()
    if time_l is None:
        return None, None

    cmd = [time_l, "-l", *cli_args]
    try:
        completed = subprocess.run(
            cmd,
            cwd=str(cwd) if cwd else None,
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        return None, None
    except (FileNotFoundError, OSError):
        return None, None

    metrics: dict[str, float] | None = None
    if completed.returncode == 0 and completed.stdout.strip():
        try:
            parsed = json.loads(completed.stdout.strip().splitlines()[-1])
            if isinstance(parsed, dict):
                metrics = {k: float(v) for k, v in parsed.items() if isinstance(v, int | float)}
        except (json.JSONDecodeError, ValueError):
            metrics = None

    rss_kb = parse_time_l_stderr(completed.stderr)
    return metrics, rss_kb


def parse_time_l_stderr(text: str) -> float | None:
    """Extract peak RSS in KB from ``/usr/bin/time -l`` stderr output.

    Cross-platform: macOS reports bytes ("maximum resident set size") while
    Linux's GNU time -l reports kilobytes ("Maximum resident set size (kbytes)").
    """
```
Fixed in 4a80614 — added _time_flag() helper that returns -l on macOS (BSD time) and -v on GNU/Linux time. parse_time_l_stderr already handled both stderr formats ("maximum resident set size" bytes / "Maximum resident set size (kbytes):"); only the invocation needed to be split per platform.
```python
def includes_getrusage(mode: MemoryMode) -> bool:
    return mode in ("getrusage", "all")


def includes_tracemalloc(mode: MemoryMode) -> bool:
    return mode in ("tracemalloc", "all")
```
Fixed in 4a80614 — chose option (b): includes_getrusage now also returns True for tracemalloc mode. getrusage is one syscall, essentially free, so always populating rss_peak_mb in any in-process mode keeps the help text honest. Verified with a smoke test that --memory-mode=tracemalloc populates both rss_peak_mb (~140 MB) and python_heap_peak_kb (previously rss_peak_mb was None).
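Option (b) amounts to a one-line predicate change, roughly (a sketch using the mode strings from this PR, with plain `str` instead of the `MemoryMode` alias):

```python
# Sketch of the revised predicates after option (b): any in-process mode
# now reports the getrusage RSS, since the syscall is essentially free.
def includes_getrusage(mode: str) -> bool:
    return mode in ("getrusage", "tracemalloc", "all")


def includes_tracemalloc(mode: str) -> bool:
    return mode in ("tracemalloc", "all")
```

Under this version `--memory-mode=tracemalloc` populates both `rss_peak_mb` and `python_heap_peak_kb`, matching the help-text contract.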
```diff
@@ -691,13 +800,13 @@ def _measure_read_workload_iteration(
     wall1 = time.perf_counter_ns()
     cpu1 = time.process_time_ns()

     rss_after = _ru_maxrss_mb(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
     rss_peak = max(rss_before, rss_after)
     probe.__exit__(None, None, None)
```
Fixed in 4a80614 — _measure_read_workload_iteration now uses with MemoryProbe(memory_mode) as probe: (matching the existing pattern in _measure_read_iteration). Tracemalloc cleanup is now exception-safe across all four iteration helpers.
Fixes seven Copilot review issues on the Sprint 1 memory honesty PR.

memory.py:
- `includes_getrusage` now also returns True for tracemalloc mode so `--memory-mode=tracemalloc` reports RSS alongside the Python heap peak (matches the documented help-text contract).
- `run_iteration_under_time_l` now picks the right verbose flag per platform: `-l` on macOS (BSD time), `-v` on Linux (GNU coreutils time). `parse_time_l_stderr` already handled both stderr formats; only the invocation needed splitting.
- Renamed local `time_l` → `time_bin` since the flag is no longer constant.

runner.py:
- Wrapped MemoryProbe usage in `with` blocks for the three remaining measurement helpers (`_measure_read_workload_iteration`, `_measure_write_workload_iteration`, `_measure_write_iteration`). Manual `__enter__`/`__exit__` leaked tracemalloc state on adapter exceptions.
- Removed the redundant `cwd=manifest_path.parent.parent` kwarg — the subprocess takes paths explicitly and shouldn't change cwd.
- New `_measure_one_sample` helper centralizes the bench-loop iteration: in `time` mode it now runs ONLY the subprocess (using its JSON metrics for wall_ms/cpu_ms); in `all` mode the in-process loop also runs (the divergence is the deep-dive signal); other modes stay in-process. Eliminates the previous 2× execution per recorded sample.

cli.py:
- `--memory-mode` help text updated to describe the new behavior: tracemalloc mode reports getrusage RSS, and time mode picks `-l`/`-v` per platform and runs only the subprocess.

Verification:
- uv run pytest tests/ ✓ 1140 passed, 32 skipped, 6 xfailed
- uv run ruff check src/ tests/ ✓
- uv run mypy src/excelbench/perf/ src/excelbench/cli.py ✓
- Smoke test: --memory-mode=tracemalloc populates both rss_peak_mb and python_heap_peak_kb (previously rss_peak_mb was None).
- Smoke test: --memory-mode=time populates rss_kb_via_time and uses subprocess wall_ms (no double-execute).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the throughput-fixture pipeline with a parametric (dtype × scale) matrix so the dashboard can answer "how does library X handle int vs string vs formula at 1M cells?" — the most actionable question for users picking between libraries.

What ships:
- 7 new value_type branches in `_run_workload_write` (float, date, datetime, boolean, formula_simple, formula_cross_sheet, mixed_realistic); string-short/long fold into the existing string op via `string_length`.
- `generate_data_shape_scenarios()` in the throughput generator: 10 dtypes × 3 tiers (1k/10k/100k) by default, +1M tier behind `--include-1m`. New `--shape-only` flag for fast-iteration runs.
- `excelbench perf-shape` subcommand with `--rows`/`--types`/`--regenerate`/`--memory-mode` (inherits Sprint 1's memory-mode plumbing).
- `_section_data_shape` dashboard heatmap (read + write), per-dtype-column log-scale color so slow columns don't wash out fast ones.
- DEC-019 with mixed_realistic 60/30/5/3/2 ratio rationale and `fixtures/synthetic_calibration/sample_set.md` calibration provenance.

Branched off feat/perf-mem-honesty (S1) since #28 hasn't merged; once it does, this PR will retarget master.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
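The scenario matrix is a plain Cartesian product of dtype and cell count. A sketch, with dtype names assumed from the description above rather than verified against the actual generator:

```python
# Sketch of the (dtype x scale) matrix (dtype names are assumptions from
# the PR text; the real generate_data_shape_scenarios() may differ).
from itertools import product

DTYPES = [
    "int", "float", "string_short", "string_long", "date", "datetime",
    "boolean", "formula_simple", "formula_cross_sheet", "mixed_realistic",
]
DEFAULT_TIERS = [1_000, 10_000, 100_000]  # the 1k/10k/100k cell tiers


def data_shape_scenarios(include_1m: bool = False) -> list[tuple[str, int]]:
    # The 1M tier sits behind a flag because it dominates runtime.
    tiers = DEFAULT_TIERS + ([1_000_000] if include_1m else [])
    return [(dtype, cells) for dtype, cells in product(DTYPES, tiers)]


default_count = len(data_shape_scenarios())      # 10 dtypes x 3 tiers
full_count = len(data_shape_scenarios(True))     # adds the 1M tier
```

With the 1M tier enabled this yields the 40 scenarios that produce 80 features (read + write) mentioned in the review fixes below.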
feat(perf): Sprint 2 — data-shape benchmark matrix (10 dtypes × 4 tiers) (#31)

* feat(perf): Sprint 2 — data-shape benchmark matrix (10 dtypes × 4 tiers)

* test(perf): add Sprint 2 data-shape coverage tests

Brings PR #31 above the 65% coverage gate by adding 31 tests for the data-shape benchmark matrix:
- Parametrized value_type writes (10 dtypes including formula_simple, formula_cross_sheet, mixed_realistic, sparse cells)
- Unsupported value_type surfaces in row.notes (catch-all branch)
- `_resolve_shape_features` (rows-to-tiers, types filtering, error paths)
- `_shape_fixtures_stale` (missing manifest, generator-newer-than-manifest, needs_1m gate, fresh-manifest-no-regen)
- `_section_data_shape` dashboard helper (heatmap rendering, color scaling)
- perf_shape CLI (invalid memory_mode, invalid dtype, unknown adapter, happy path writing results.json/matrix.csv/history.jsonl)

Local coverage 67.64% (above the 65% gate); Linux CI estimated ~65.9%.

* fix(perf): address PR #31 Copilot review comments

- cli.py: tier cap for 100k from 100,000 → 99,856 (matches the actual 316×316 generated fixture, so --rows 99856 doesn't silently drop the tier)
- generate_throughput_fixtures.py: align boolean/date/datetime fixture seeds with the runner's start=1 convention (was 0-based while the runner was 1-based, causing read fixtures and write payloads to differ for the same dtype)
- generate_throughput_fixtures.py: distinguish "scenarios" (xlsx files) from "manifest rows" in the --shape-only output line
- README.md: clarify "40 scenarios produce 80 features" (read + write)
- html_dashboard.py: switch isinstance(p50, int | float) to tuple form (universally compatible; Copilot incorrectly flagged the PEP 604 form)

Co-authored-by: Wolfgang Schoenberger <221313372+wolfiesch@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Sprint 1 of 7 in the 7-Dimension Extension roadmap. Lands two things:
Honest memory measurement via a new `--memory-mode` flag with three coexisting modes:

- `getrusage` (default, cheap, sticky lifetime peak — preserves prior behavior)
- `tracemalloc` (Python-heap peak — exposes Rust vs Python allocator divergence)
- `time` (per-iteration `/usr/bin/time -l` subprocess — OS-honest peak RSS, includes Python startup + adapter import; quarterly deep-dive)
- `all` (composite — runs every iteration in all three modes for direct comparison)

`PerfOpResult` gains `rss_kb_via_time` and `python_heap_peak_kb` fields. The existing `rss_peak_mb` is unchanged so historical results stay comparable. The HTML dashboard renders dual "RSS — getrusage / time -l" cells with a tooltip explaining divergence whenever the time-l field is present.

`TRACKER.md` at the repo root — single source of truth for the 7-sprint roadmap. Each sprint flips its row (Planned → In Progress → Shipped) and appends an acceptance note, so any future session can resume cold. See `TRACKER.md` for the full table.

DEC-018 in `decisions.md` documents why three modes coexist (no single mode tells the truth: getrusage is sticky, tracemalloc misses Rust, time -l is slow but OS-honest).

Honesty signal
End-to-end smoke (`--memory-mode=all` on `cell_values`, wolfxl + openpyxl, warmup=1 iters=2):

The Python-heap column quantitatively confirms wolfxl pushes allocations into Rust. The `time -l` RSS column will diverge meaningfully once Sprint 2 lands ≥1M-cell fixtures (subprocess startup currently dominates a 7-cell file, so the two RSS numbers are ~equal here — that's expected).

Test plan
- `uv run pytest tests/` → 1140 passed, 32 skipped, 6 xfailed
- `uv run ruff check src/excelbench/perf/ src/excelbench/cli.py src/excelbench/results/html_dashboard.py` → clean
- `uv run mypy src/excelbench/perf/` → no issues
- `excelbench perf --memory-mode=all --feature cell_values --adapter wolfxl --adapter openpyxl --warmup 1 --iters 2` → all three fields populated, ratio table above
- `excelbench perf` without `--memory-mode` defaults to `getrusage` (existing behavior preserved)
- `/usr/bin/time -l` parser tested on macOS BSD time + GNU time output formats. Windows path returns `None` silently (no `/usr/bin/time`); CI runs on macOS + Linux only — Windows handling will get its full test in S6 (cold-start).

Out of scope (deferred to later sprints per the plan)
- `ws.append`, `iter_rows`, `modify_one_cell` (S4)

🤖 Generated with Claude Code