feat(perf): three-mode memory harness + sprint tracker (S1) #28
Conversation
Sprint 1 of the 7-Dimension Extension. Lands two things:
1. Honest memory measurement via a new --memory-mode flag with three modes plus a composite:
- getrusage (default, cheap, sticky lifetime peak — preserves prior behavior)
- tracemalloc (Python-heap peak — exposes Rust vs. Python-allocator divergence)
- time (per-iteration /usr/bin/time -l subprocess — OS-honest peak RSS,
includes Python startup + adapter import; intended for quarterly deep-dive)
- all (composite — runs every iteration in all three for direct comparison)
PerfOpResult gains rss_kb_via_time and python_heap_peak_kb fields. The
existing rss_peak_mb is unchanged, so historical results stay comparable.
The HTML dashboard renders dual "RSS — getrusage / time -l" cells with a
tooltip explaining the divergence whenever the time-l field is present.
2. TRACKER.md at the repo root — single source of truth for the 7-sprint
roadmap. Each sprint flips its row (Planned → In Progress → Shipped) and
appends an acceptance note so any session can resume cold.
DEC-018 in decisions.md documents why three modes coexist (no single mode
tells the truth: getrusage is sticky, tracemalloc misses Rust, time -l is
slow but OS-honest).
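The DEC-018 contrast between the two in-process primitives can be sketched in a few lines. This is an illustrative comparison only, not the excelbench harness; `measure_heap_and_rss` is a hypothetical helper name.

```python
# Illustrative sketch (not the excelbench code): the two in-process
# primitives DEC-018 contrasts, wrapped around one workload call.
import resource
import tracemalloc


def measure_heap_and_rss(fn):
    """Return (python_heap_peak_kb, ru_maxrss_after) for one call of fn."""
    tracemalloc.start()
    fn()
    _, heap_peak = tracemalloc.get_traced_memory()  # Python-heap peak, bytes
    tracemalloc.stop()
    # ru_maxrss is the "sticky" lifetime peak: it never decreases, so two
    # consecutive runs in one process cannot be compared with it alone.
    # (Units also differ: kilobytes on Linux, bytes on macOS.)
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return heap_peak / 1024, rss


heap_kb, rss_peak = measure_heap_and_rss(lambda: bytes(8_000_000))
# heap_kb sees the 8 MB allocation; a Rust-side allocator would be
# invisible here, which is exactly the divergence the PR measures.
```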
Verification:
uv run pytest tests/ → 1140 passed
uv run ruff check / mypy → clean
excelbench perf --memory-mode=all → all 3 fields populated; openpyxl
uses 16× (read) / 227× (write) more Python heap than wolfxl on the
same fixture, confirming the honesty signal.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR introduces a multi-strategy memory measurement harness for excelbench perf (Sprint 1 of the “7-Dimension Extension” roadmap), adds a --memory-mode CLI flag to select measurement strategy, and updates result rendering/docs to carry and explain the new memory fields.
Changes:
- Add `--memory-mode {getrusage,tracemalloc,time,all}` and plumb it through the perf runner, extending `PerfOpResult` with `rss_kb_via_time` and `python_heap_peak_kb`.
- Implement a new memory harness (`MemoryProbe` + `/usr/bin/time` subprocess wrapper) plus an internal one-iteration subprocess entrypoint.
- Update the HTML dashboard to render a dual RSS column (getrusage/time) with explanatory tooltips; add `TRACKER.md` and DEC-018 documentation.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| src/excelbench/results/html_dashboard.py | Renders dual RSS cells/tooltip when time-based RSS is present. |
| src/excelbench/perf/runner.py | Plumbs memory_mode, records new memory fields, and runs optional /usr/bin/time subprocess probes. |
| src/excelbench/perf/memory.py | New memory measurement module (in-process probe + /usr/bin/time wrapper + stderr parser). |
| src/excelbench/perf/_iter_subprocess.py | New internal subprocess entrypoint to run a single iteration for /usr/bin/time wrapping. |
| src/excelbench/cli.py | Adds and validates the --memory-mode CLI option. |
| decisions.md | Documents rationale/constraints for three coexisting memory modes (DEC-018). |
| TRACKER.md | Adds the sprint tracker and S1 acceptance notes/template. |
```diff
@@ -748,13 +858,13 @@ def _measure_write_workload_iteration(
     wall1 = time.perf_counter_ns()
     cpu1 = time.process_time_ns()

     rss_after = _ru_maxrss_mb(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
     rss_peak = max(rss_before, rss_after)
     probe.__exit__(None, None, None)
```
Fixed in 4a80614 — _measure_write_workload_iteration now uses with MemoryProbe(memory_mode) as probe:. Cleanup is guaranteed even if any adapter call (create_workbook, add_sheet, _run_workload_write, save_workbook) raises.
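The exception-safety pattern this fix adopts can be sketched with a minimal context manager (names assumed; the real `MemoryProbe` in excelbench carries more state than this):

```python
# Hedged sketch of the with-block pattern the fix adopts; the real
# MemoryProbe in excelbench.perf.memory is richer than this.
import tracemalloc


class MemoryProbe:
    def __init__(self, mode: str) -> None:
        self.mode = mode
        self.python_heap_peak_kb: float | None = None

    def __enter__(self) -> "MemoryProbe":
        if self.mode in ("tracemalloc", "all"):
            tracemalloc.start()
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs even when the wrapped adapter call raises, so tracemalloc
        # state can never leak into the next iteration.
        if self.mode in ("tracemalloc", "all"):
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            self.python_heap_peak_kb = peak / 1024


try:
    with MemoryProbe("tracemalloc") as probe:
        raise RuntimeError("simulated adapter failure")
except RuntimeError:
    pass

cleaned_up = not tracemalloc.is_tracing()  # True: __exit__ still ran
```

A manual `probe.__enter__()` / `probe.__exit__(None, None, None)` pair, by contrast, skips the exit call entirely when the body raises.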
```diff
@@ -1279,13 +1390,13 @@ def _measure_write_iteration(
     wall1 = time.perf_counter_ns()
     cpu1 = time.process_time_ns()

     rss_after = _ru_maxrss_mb(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
     rss_peak = max(rss_before, rss_after)
     probe.__exit__(None, None, None)
```
Fixed in 4a80614 — _measure_write_iteration now uses with MemoryProbe(memory_mode) as probe:. Same cleanup guarantee as the workload helpers.
```python
        "--feature",
        feature,
    ]
    _metrics, rss_kb = run_iteration_under_time_l(cli_args, cwd=manifest_path.parent.parent)
```
Fixed in 4a80614 — dropped cwd=manifest_path.parent.parent. The subprocess takes paths explicitly via --manifest, so the parent's cwd doesn't matter and forcing it could break python -m excelbench.perf._iter_subprocess resolution in some installs.
```python
    cli_args = [
        sys.executable,
        "-m",
        "excelbench.perf._iter_subprocess",
        "--library",
        adapter_name,
        "--kind",
        kind,
        "--manifest",
        str(manifest_path),
        "--feature",
        feature,
    ]
    _metrics, rss_kb = run_iteration_under_time_l(cli_args, cwd=manifest_path.parent.parent)
    return rss_kb
```
Fixed in 4a80614 — added a new _measure_one_sample helper that centralizes the iteration logic. In time-only mode it runs ONLY the subprocess and uses its JSON metrics for wall_ms/cpu_ms (no double-execute). In all mode both the in-process loop and the subprocess run (intentional — the divergence is the deep-dive signal). Verified with a smoke test that --memory-mode=time produces both wall_ms and rss_kb_via_time from a single subprocess invocation per recorded sample.
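The single-execution guarantee can be illustrated with a toy dispatcher. This is an assumed shape, not the actual `_measure_one_sample` helper, which takes the real iteration callables and richer metrics:

```python
# Toy model of the dispatch described in the reply (assumed shape):
# "time" mode runs only the subprocess path, "all" runs both paths,
# and every other mode stays purely in-process.
def measure_one_sample(mode, run_inprocess, run_subprocess):
    if mode == "time":
        metrics, rss_kb_via_time = run_subprocess()
        return metrics, rss_kb_via_time  # subprocess JSON supplies wall/cpu
    metrics = run_inprocess()
    rss_kb_via_time = run_subprocess()[1] if mode == "all" else None
    return metrics, rss_kb_via_time


calls = {"inproc": 0, "sub": 0}


def fake_inprocess():
    calls["inproc"] += 1
    return {"wall_ms": 1.0}


def fake_subprocess():
    calls["sub"] += 1
    return {"wall_ms": 2.0}, 4096.0


time_result = measure_one_sample("time", fake_inprocess, fake_subprocess)
# time mode: one subprocess run, zero in-process runs (no double-execute)
```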
```python
    "in-process RUSAGE_SELF peak. 'tracemalloc' adds the Python heap "
    "peak (misses Rust allocations). 'time' spawns each iteration "
    "under /usr/bin/time -l for an OS-honest peak RSS (slow). "
    "'all' runs every iteration in all three modes — quarterly deep-dive, "
    "not the CI hot path."
```
Fixed in 4a80614 — updated the help text to describe the new behavior. tracemalloc mode now reports getrusage RSS plus the Python heap peak (see linked memory.py change), time mode is described as -l on macOS / -v on Linux and noted as subprocess-only (no double-execute).
```python
    time_l = _resolve_time_l_path()
    if time_l is None:
        return None, None

    cmd = [time_l, "-l", *cli_args]
    try:
        completed = subprocess.run(
            cmd,
            cwd=str(cwd) if cwd else None,
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        return None, None
    except (FileNotFoundError, OSError):
        return None, None

    metrics: dict[str, float] | None = None
    if completed.returncode == 0 and completed.stdout.strip():
        try:
            parsed = json.loads(completed.stdout.strip().splitlines()[-1])
            if isinstance(parsed, dict):
                metrics = {k: float(v) for k, v in parsed.items() if isinstance(v, int | float)}
        except (json.JSONDecodeError, ValueError):
            metrics = None

    rss_kb = parse_time_l_stderr(completed.stderr)
    return metrics, rss_kb


def parse_time_l_stderr(text: str) -> float | None:
    """Extract peak RSS in KB from ``/usr/bin/time -l`` stderr output.

    Cross-platform: macOS reports bytes ("maximum resident set size") while
    Linux's GNU time -l reports kilobytes ("Maximum resident set size (kbytes)").
    """
```
Fixed in 4a80614 — added _time_flag() helper that returns -l on macOS (BSD time) and -v on GNU/Linux time. parse_time_l_stderr already handled both stderr formats ("maximum resident set size" bytes / "Maximum resident set size (kbytes):"); only the invocation needed to be split per platform.
```python
def includes_getrusage(mode: MemoryMode) -> bool:
    return mode in ("getrusage", "all")


def includes_tracemalloc(mode: MemoryMode) -> bool:
    return mode in ("tracemalloc", "all")
```
Fixed in 4a80614 — chose option (b): includes_getrusage now also returns True for tracemalloc mode. getrusage is one syscall, essentially free, so always populating rss_peak_mb in any in-process mode keeps the help text honest. Verified with a smoke test that --memory-mode=tracemalloc populates both rss_peak_mb (~140 MB) and python_heap_peak_kb (previously rss_peak_mb was None).
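Option (b) amounts to a one-line predicate change, roughly (a sketch using the mode strings from this PR, with plain `str` instead of the `MemoryMode` alias):

```python
# Sketch of the revised predicates after option (b): any in-process mode
# now reports the getrusage RSS, since the syscall is essentially free.
def includes_getrusage(mode: str) -> bool:
    return mode in ("getrusage", "tracemalloc", "all")


def includes_tracemalloc(mode: str) -> bool:
    return mode in ("tracemalloc", "all")
```

Under this version `--memory-mode=tracemalloc` populates both `rss_peak_mb` and `python_heap_peak_kb`, matching the help-text contract.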
```diff
@@ -691,13 +800,13 @@ def _measure_read_workload_iteration(
     wall1 = time.perf_counter_ns()
     cpu1 = time.process_time_ns()

     rss_after = _ru_maxrss_mb(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
     rss_peak = max(rss_before, rss_after)
     probe.__exit__(None, None, None)
```
Fixed in 4a80614 — _measure_read_workload_iteration now uses with MemoryProbe(memory_mode) as probe: (matching the existing pattern in _measure_read_iteration). Tracemalloc cleanup is now exception-safe across all four iteration helpers.
Fixes seven Copilot review issues on the Sprint 1 memory honesty PR.

memory.py:
- `includes_getrusage` now also returns True for tracemalloc mode so `--memory-mode=tracemalloc` reports RSS alongside the Python heap peak (matches the documented help-text contract).
- `run_iteration_under_time_l` now picks the right verbose flag per platform: `-l` on macOS (BSD time), `-v` on Linux (GNU coreutils time). `parse_time_l_stderr` already handled both stderr formats; only the invocation needed splitting.
- Renamed local `time_l` → `time_bin` since the flag is no longer constant.

runner.py:
- Wrapped MemoryProbe usage in `with` blocks for the three remaining measurement helpers (`_measure_read_workload_iteration`, `_measure_write_workload_iteration`, `_measure_write_iteration`). Manual `__enter__`/`__exit__` leaked tracemalloc state on adapter exceptions.
- Removed the redundant `cwd=manifest_path.parent.parent` kwarg — the subprocess takes paths explicitly and shouldn't change cwd.
- New `_measure_one_sample` helper centralizes the bench-loop iteration: in `time` mode it now runs ONLY the subprocess (using its JSON metrics for wall_ms/cpu_ms); in `all` mode the in-process loop also runs (the divergence is the deep-dive signal); other modes stay in-process. Eliminates the previous 2× execution per recorded sample.

cli.py:
- `--memory-mode` help text updated to describe the new behavior: tracemalloc mode reports getrusage RSS, and time mode picks `-l`/`-v` per platform and runs only the subprocess.

Verification:
- uv run pytest tests/ ✓ 1140 passed, 32 skipped, 6 xfailed
- uv run ruff check src/ tests/ ✓
- uv run mypy src/excelbench/perf/ src/excelbench/cli.py ✓
- Smoke test: --memory-mode=tracemalloc populates both rss_peak_mb and python_heap_peak_kb (previously rss_peak_mb was None).
- Smoke test: --memory-mode=time populates rss_kb_via_time and uses subprocess wall_ms (no double-execute).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the throughput-fixture pipeline with a parametric (dtype × scale) matrix so the dashboard can answer "how does library X handle int vs string vs formula at 1M cells?" — the most actionable question for users picking between libraries.

What ships:
- 7 new value_type branches in `_run_workload_write` (float, date, datetime, boolean, formula_simple, formula_cross_sheet, mixed_realistic); string-short/long fold into the existing string op via `string_length`.
- `generate_data_shape_scenarios()` in the throughput generator: 10 dtypes × 3 tiers (1k/10k/100k) by default, +1M tier behind `--include-1m`. New `--shape-only` flag for fast-iteration runs.
- `excelbench perf-shape` subcommand with `--rows`/`--types`/`--regenerate`/`--memory-mode` (inherits Sprint 1's memory-mode plumbing).
- `_section_data_shape` dashboard heatmap (read + write), per-dtype-column log-scale color so slow columns don't wash out fast ones.
- DEC-019 with mixed_realistic 60/30/5/3/2 ratio rationale and `fixtures/synthetic_calibration/sample_set.md` calibration provenance.

Branched off feat/perf-mem-honesty (S1) since #28 hasn't merged; once it does, this PR will retarget master.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
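The scenario matrix is a plain Cartesian product of dtype and cell count. A sketch, with dtype names assumed from the description above rather than verified against the actual generator:

```python
# Sketch of the (dtype x scale) matrix (dtype names are assumptions from
# the PR text; the real generate_data_shape_scenarios() may differ).
from itertools import product

DTYPES = [
    "int", "float", "string_short", "string_long", "date", "datetime",
    "boolean", "formula_simple", "formula_cross_sheet", "mixed_realistic",
]
DEFAULT_TIERS = [1_000, 10_000, 100_000]  # the 1k/10k/100k cell tiers


def data_shape_scenarios(include_1m: bool = False) -> list[tuple[str, int]]:
    # The 1M tier sits behind a flag because it dominates runtime.
    tiers = DEFAULT_TIERS + ([1_000_000] if include_1m else [])
    return [(dtype, cells) for dtype, cells in product(DTYPES, tiers)]


default_count = len(data_shape_scenarios())      # 10 dtypes x 3 tiers
full_count = len(data_shape_scenarios(True))     # adds the 1M tier
```

With the 1M tier enabled this yields the 40 scenarios that produce 80 features (read + write) mentioned in the review fixes below.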
feat(perf): Sprint 2 — data-shape benchmark matrix (10 dtypes × 4 tiers) (#31)

* feat(perf): Sprint 2 — data-shape benchmark matrix (10 dtypes × 4 tiers)

* test(perf): add Sprint 2 data-shape coverage tests

Brings PR #31 above the 65% coverage gate by adding 31 tests for the data-shape benchmark matrix:
- Parametrized value_type writes (10 dtypes including formula_simple, formula_cross_sheet, mixed_realistic, sparse cells)
- Unsupported value_type surfaces in row.notes (catch-all branch)
- `_resolve_shape_features` (rows-to-tiers, types filtering, error paths)
- `_shape_fixtures_stale` (missing manifest, generator-newer-than-manifest, needs_1m gate, fresh-manifest-no-regen)
- `_section_data_shape` dashboard helper (heatmap rendering, color scaling)
- perf_shape CLI (invalid memory_mode, invalid dtype, unknown adapter, happy path writing results.json/matrix.csv/history.jsonl)

Local coverage 67.64% (above the 65% gate); Linux CI estimated ~65.9%.

* fix(perf): address PR #31 Copilot review comments

- cli.py: tier cap for 100k from 100,000 → 99,856 (matches the actual 316×316 generated fixture, so --rows 99856 doesn't silently drop the tier)
- generate_throughput_fixtures.py: align boolean/date/datetime fixture seeds with the runner's start=1 convention (was 0-based while the runner was 1-based, causing read fixtures and write payloads to differ for the same dtype)
- generate_throughput_fixtures.py: distinguish "scenarios" (xlsx files) from "manifest rows" in the --shape-only output line
- README.md: clarify "40 scenarios produce 80 features" (read + write)
- html_dashboard.py: switch isinstance(p50, int | float) to tuple form (universally compatible; Copilot incorrectly flagged the PEP 604 form)

Co-authored-by: Wolfgang Schoenberger <221313372+wolfiesch@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Sprint 1 of 7 in the 7-Dimension Extension roadmap. Lands two things:
Honest memory measurement via a new `--memory-mode` flag with three coexisting modes:

- `getrusage` (default, cheap, sticky lifetime peak — preserves prior behavior)
- `tracemalloc` (Python-heap peak — exposes Rust vs Python allocator divergence)
- `time` (per-iteration `/usr/bin/time -l` subprocess — OS-honest peak RSS, includes Python startup + adapter import; quarterly deep-dive)
- `all` (composite — runs every iteration in all three modes for direct comparison)

`PerfOpResult` gains `rss_kb_via_time` and `python_heap_peak_kb` fields. The existing `rss_peak_mb` is unchanged so historical results stay comparable. The HTML dashboard renders dual "RSS — getrusage / time -l" cells with a tooltip explaining divergence whenever the time-l field is present.

`TRACKER.md` at the repo root — single source of truth for the 7-sprint roadmap. Each sprint flips its row (Planned → In Progress → Shipped) and appends an acceptance note, so any future session can resume cold. See `TRACKER.md` for the full table.

DEC-018 in `decisions.md` documents why three modes coexist (no single mode tells the truth: getrusage is sticky, tracemalloc misses Rust, time -l is slow but OS-honest).

Honesty signal
End-to-end smoke (`--memory-mode=all` on `cell_values`, wolfxl + openpyxl, warmup=1 iters=2):

The Python-heap column quantitatively confirms wolfxl pushes allocations into Rust. The `time -l` RSS column will diverge meaningfully once Sprint 2 lands ≥1M-cell fixtures (subprocess startup currently dominates a 7-cell file, so the two RSS numbers are ~equal here — that's expected).

Test plan
- `uv run pytest tests/` → 1140 passed, 32 skipped, 6 xfailed
- `uv run ruff check src/excelbench/perf/ src/excelbench/cli.py src/excelbench/results/html_dashboard.py` → clean
- `uv run mypy src/excelbench/perf/` → no issues
- `excelbench perf --memory-mode=all --feature cell_values --adapter wolfxl --adapter openpyxl --warmup 1 --iters 2` → all three fields populated, ratio table above
- `excelbench perf` without `--memory-mode` defaults to `getrusage` (existing behavior preserved)
- `/usr/bin/time -l` parser tested on macOS BSD time + GNU time output formats. Windows path returns `None` silently (no `/usr/bin/time`); CI runs on macOS + Linux only — Windows handling will get its full test in S6 (cold-start).

Out of scope (deferred to later sprints per the plan)
- `ws.append`, `iter_rows`, `modify_one_cell` (S4)

🤖 Generated with Claude Code