
feat(perf): three-mode memory harness + sprint tracker (S1)#28

Merged
wolfiesch merged 3 commits into master from feat/perf-mem-honesty
Apr 27, 2026

Conversation

@wolfiesch
Collaborator

Summary

Sprint 1 of 7 in the 7-Dimension Extension roadmap. Lands two things:

  1. Honest memory measurement via a new --memory-mode flag with three coexisting measurement modes plus an all composite:

    • getrusage (default, cheap, sticky lifetime peak — preserves prior behavior)
    • tracemalloc (Python-heap peak — exposes Rust vs Python allocator divergence)
    • time (per-iteration /usr/bin/time -l subprocess — OS-honest peak RSS, includes Python startup + adapter import; quarterly deep-dive)
    • all (composite — runs every iteration in all three modes for direct comparison)

    PerfOpResult gains rss_kb_via_time and python_heap_peak_kb fields. The existing rss_peak_mb is unchanged, so historical results stay comparable. The HTML dashboard renders dual "RSS — getrusage / time -l" cells with a tooltip explaining the divergence whenever the time-based field is present.

  2. TRACKER.md at the repo root — single source of truth for the 7-sprint roadmap. Each sprint flips its row (Planned → In Progress → Shipped) and appends an acceptance note, so any future session can resume cold. See TRACKER.md for the full table.

DEC-018 in decisions.md documents why three modes coexist (no single mode tells the truth: getrusage is sticky, tracemalloc misses Rust, time -l is slow but OS-honest).
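The divergence DEC-018 describes can be demonstrated with stdlib calls alone. This is an illustrative sketch, not the repo's actual harness; `probe_once` is a hypothetical helper, and the field semantics follow the PR description:

```python
import resource
import sys
import tracemalloc


def probe_once(workload):
    """Run a callable and report both an RSS-style peak and the Python-heap peak.

    getrusage's ru_maxrss is a sticky lifetime high-water mark: it never
    decreases, so a later, smaller workload inherits an earlier peak.
    tracemalloc only sees the Python allocator, so memory held by a Rust
    extension is invisible to it -- hence no single mode tells the truth.
    """
    tracemalloc.start()
    workload()
    _, heap_peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    ru_maxrss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # macOS reports ru_maxrss in bytes; Linux reports kilobytes.
    rss_kb = ru_maxrss / 1024 if sys.platform == "darwin" else float(ru_maxrss)
    return rss_kb, heap_peak_bytes / 1024


# ~10 MB of Python-heap allocations: visible to tracemalloc AND getrusage.
rss_kb, heap_kb = probe_once(lambda: [bytearray(1024) for _ in range(10_000)])
```

An allocation made inside a native extension would move `rss_kb` while leaving `heap_kb` nearly flat, which is exactly the wolfxl-vs-openpyxl signal in the table below.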

Honesty signal

End-to-end smoke (--memory-mode=all on cell_values, wolfxl + openpyxl, warmup=1 iters=2):

| Adapter  | Op    | rss_peak_mb | rss_kb_via_time | python_heap_peak_kb |
|----------|-------|-------------|-----------------|---------------------|
| wolfxl   | read  | 144.7 MB    | 144.0 MB        | 9.7 KB              |
| openpyxl | read  | 144.7 MB    | 143.7 MB        | 158.4 KB (16×)      |
| wolfxl   | write | 144.7 MB    | 143.9 MB        | 1.5 KB              |
| openpyxl | write | 144.7 MB    | 143.9 MB        | 354.3 KB (227×)     |

The Python-heap column quantitatively confirms wolfxl pushes allocations into Rust. The time -l RSS column will diverge meaningfully once Sprint 2 lands ≥1M-cell fixtures (subprocess startup currently dominates a 7-cell file, so the two RSS numbers are ~equal here — that's expected).

Test plan

  • uv run pytest tests/ → 1140 passed, 32 skipped, 6 xfailed
  • uv run ruff check src/excelbench/perf/ src/excelbench/cli.py src/excelbench/results/html_dashboard.py → clean
  • uv run mypy src/excelbench/perf/ → no issues
  • excelbench perf --memory-mode=all --feature cell_values --adapter wolfxl --adapter openpyxl --warmup 1 --iters 2 → all three fields populated, ratio table above
  • Backwards-compat: excelbench perf without --memory-mode defaults to getrusage (existing behavior preserved)
  • Cross-platform: /usr/bin/time -l parser tested on macOS BSD time + GNU time output formats. Windows path returns None silently (no /usr/bin/time); CI runs on macOS + Linux only — Windows handling will get its full test in S6 (cold-start).

Out of scope (deferred to later sprints per the plan)

  • Synthetic ≥1M-cell fixtures that make the OS-level RSS divergence visible (S2)
  • Wide / tall / sparse / many-sheets file shapes (S3)
  • High-cost ops like ws.append, iter_rows, modify_one_cell (S4)
  • Per-feature complexity perf at scale (S5)
  • Subprocess-isolated cold-start timings (S6)
  • LibreOffice round-trip fidelity oracle (S7)

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 27, 2026 07:27

Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a multi-strategy memory measurement harness for excelbench perf (Sprint 1 of the “7-Dimension Extension” roadmap), adds a --memory-mode CLI flag to select measurement strategy, and updates result rendering/docs to carry and explain the new memory fields.

Changes:

  • Add --memory-mode {getrusage,tracemalloc,time,all} and plumb it through the perf runner, extending PerfOpResult with rss_kb_via_time and python_heap_peak_kb.
  • Implement a new memory harness (MemoryProbe + /usr/bin/time subprocess wrapper) plus an internal one-iteration subprocess entrypoint.
  • Update the HTML dashboard to render a dual RSS column (getrusage/time) with explanatory tooltips; add TRACKER.md and DEC-018 documentation.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/excelbench/results/html_dashboard.py Renders dual RSS cells/tooltip when time-based RSS is present.
src/excelbench/perf/runner.py Plumbs memory_mode, records new memory fields, and runs optional /usr/bin/time subprocess probes.
src/excelbench/perf/memory.py New memory measurement module (in-process probe + /usr/bin/time wrapper + stderr parser).
src/excelbench/perf/_iter_subprocess.py New internal subprocess entrypoint to run a single iteration for /usr/bin/time wrapping.
src/excelbench/cli.py Adds and validates the --memory-mode CLI option.
decisions.md Documents rationale/constraints for three coexisting memory modes (DEC-018).
TRACKER.md Adds the sprint tracker and S1 acceptance notes/template.


Comment thread src/excelbench/perf/runner.py Outdated
Comment on lines 825 to 862
@@ -748,13 +858,13 @@ def _measure_write_workload_iteration(
wall1 = time.perf_counter_ns()
cpu1 = time.process_time_ns()

rss_after = _ru_maxrss_mb(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
rss_peak = max(rss_before, rss_after)
probe.__exit__(None, None, None)

Collaborator Author

Fixed in 4a80614 — _measure_write_workload_iteration now uses with MemoryProbe(memory_mode) as probe:. Cleanup is guaranteed even if any adapter call (create_workbook, add_sheet, _run_workload_write, save_workbook) raises.
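A minimal sketch of why the `with` form fixes the leak. The real class lives in excelbench.perf.memory; this shape is an assumption for illustration only:

```python
import tracemalloc


class MemoryProbe:
    """Sketch of a tracemalloc probe used as a context manager.

    The `with` statement guarantees __exit__ (and thus tracemalloc.stop())
    runs even when an adapter call inside the block raises -- the leak the
    review flagged in the manual __enter__/__exit__ pattern.
    """

    def __init__(self, enabled: bool = True) -> None:
        self.enabled = enabled
        self.heap_peak_kb: float | None = None

    def __enter__(self) -> "MemoryProbe":
        if self.enabled:
            tracemalloc.start()
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        if self.enabled:
            _, peak = tracemalloc.get_traced_memory()
            self.heap_peak_kb = peak / 1024
            tracemalloc.stop()


# Even if the adapter blows up mid-iteration, tracing is torn down.
try:
    with MemoryProbe() as probe:
        raise RuntimeError("adapter blew up")
except RuntimeError:
    pass
```

With the manual pattern, an exception between `__enter__` and the explicit `__exit__` call would leave tracemalloc running and skew every subsequent iteration.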

Comment thread src/excelbench/perf/runner.py Outdated
Comment on lines 1341 to 1394
@@ -1279,13 +1390,13 @@ def _measure_write_iteration(
wall1 = time.perf_counter_ns()
cpu1 = time.process_time_ns()

rss_after = _ru_maxrss_mb(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
rss_peak = max(rss_before, rss_after)
probe.__exit__(None, None, None)

Collaborator Author

Fixed in 4a80614 — _measure_write_iteration now uses with MemoryProbe(memory_mode) as probe:. Same cleanup guarantee as the workload helpers.

Comment thread src/excelbench/perf/runner.py Outdated
"--feature",
feature,
]
_metrics, rss_kb = run_iteration_under_time_l(cli_args, cwd=manifest_path.parent.parent)
Collaborator Author

Fixed in 4a80614 — dropped cwd=manifest_path.parent.parent. The subprocess takes paths explicitly via --manifest, so the parent's cwd doesn't matter and forcing it could break python -m excelbench.perf._iter_subprocess resolution in some installs.

Comment thread src/excelbench/perf/runner.py Outdated
Comment on lines +1495 to +1509
cli_args = [
sys.executable,
"-m",
"excelbench.perf._iter_subprocess",
"--library",
adapter_name,
"--kind",
kind,
"--manifest",
str(manifest_path),
"--feature",
feature,
]
_metrics, rss_kb = run_iteration_under_time_l(cli_args, cwd=manifest_path.parent.parent)
return rss_kb
Collaborator Author

Fixed in 4a80614 — added a new _measure_one_sample helper that centralizes the iteration logic. In time-only mode it runs ONLY the subprocess and uses its JSON metrics for wall_ms/cpu_ms (no double-execute). In all mode both the in-process loop and the subprocess run (intentional — the divergence is the deep-dive signal). Verified with a smoke test that --memory-mode=time produces both wall_ms and rss_kb_via_time from a single subprocess invocation per recorded sample.
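The dispatch described above can be sketched as follows. Names and callback shapes are hypothetical; only the mode semantics (subprocess-only in `time`, deliberate double execution in `all`) come from the comment:

```python
def measure_one_sample(mode, run_in_process, run_under_time):
    """Hypothetical dispatch mirroring the described _measure_one_sample.

    run_in_process() -> dict with wall_ms/cpu_ms/rss-style fields
    run_under_time() -> (metrics_dict_or_None, rss_kb_or_None)
    """
    if mode == "time":
        # Subprocess only: its JSON metrics supply wall_ms/cpu_ms, so the
        # workload executes exactly once per recorded sample (no double-execute).
        metrics, rss_kb = run_under_time()
        sample = dict(metrics or {})
        sample["rss_kb_via_time"] = rss_kb
        return sample

    sample = run_in_process()
    if mode == "all":
        # Deliberate second execution: the in-process vs subprocess
        # divergence is the quarterly deep-dive signal.
        _, rss_kb = run_under_time()
        sample["rss_kb_via_time"] = rss_kb
    return sample
```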

Comment thread src/excelbench/cli.py Outdated
Comment on lines +268 to +272
"in-process RUSAGE_SELF peak. 'tracemalloc' adds the Python heap "
"peak (misses Rust allocations). 'time' spawns each iteration "
"under /usr/bin/time -l for an OS-honest peak RSS (slow). "
"'all' runs every iteration in all three modes — quarterly deep-dive, "
"not the CI hot path."
Collaborator Author

Fixed in 4a80614 — updated the help text to describe the new behavior. tracemalloc mode now reports getrusage RSS plus the Python heap peak (see linked memory.py change), time mode is described as -l on macOS / -v on Linux and noted as subprocess-only (no double-execute).

Comment thread src/excelbench/perf/memory.py Outdated
Comment on lines +133 to +170
time_l = _resolve_time_l_path()
if time_l is None:
return None, None

cmd = [time_l, "-l", *cli_args]
try:
completed = subprocess.run(
cmd,
cwd=str(cwd) if cwd else None,
capture_output=True,
text=True,
timeout=timeout_s,
check=False,
)
except subprocess.TimeoutExpired:
return None, None
except (FileNotFoundError, OSError):
return None, None

metrics: dict[str, float] | None = None
if completed.returncode == 0 and completed.stdout.strip():
try:
parsed = json.loads(completed.stdout.strip().splitlines()[-1])
if isinstance(parsed, dict):
metrics = {k: float(v) for k, v in parsed.items() if isinstance(v, int | float)}
except (json.JSONDecodeError, ValueError):
metrics = None

rss_kb = parse_time_l_stderr(completed.stderr)
return metrics, rss_kb


def parse_time_l_stderr(text: str) -> float | None:
"""Extract peak RSS in KB from ``/usr/bin/time -l`` stderr output.

Cross-platform: macOS reports bytes ("maximum resident set size") while
Linux's GNU time -l reports kilobytes ("Maximum resident set size (kbytes)").
"""
Collaborator Author

Fixed in 4a80614 — added _time_flag() helper that returns -l on macOS (BSD time) and -v on GNU/Linux time. parse_time_l_stderr already handled both stderr formats ("maximum resident set size" bytes / "Maximum resident set size (kbytes):"); only the invocation needed to be split per platform.

Comment on lines +54 to +60
def includes_getrusage(mode: MemoryMode) -> bool:
return mode in ("getrusage", "all")


def includes_tracemalloc(mode: MemoryMode) -> bool:
return mode in ("tracemalloc", "all")

Collaborator Author

Fixed in 4a80614 — chose option (b): includes_getrusage now also returns True for tracemalloc mode. getrusage is one syscall, essentially free, so always populating rss_peak_mb in any in-process mode keeps the help text honest. Verified with a smoke test that --memory-mode=tracemalloc populates both rss_peak_mb (~140 MB) and python_heap_peak_kb (previously rss_peak_mb was None).
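The revised predicates would look roughly like this (shape assumed from the quoted snippet; only the tracemalloc membership change comes from the fix description):

```python
# Sketch: MemoryMode is one of "getrusage" | "tracemalloc" | "time" | "all".
MemoryMode = str


def includes_getrusage(mode: MemoryMode) -> bool:
    # Option (b): getrusage is a single cheap syscall, so every in-process
    # mode reports rss_peak_mb -- including tracemalloc mode.
    return mode in ("getrusage", "tracemalloc", "all")


def includes_tracemalloc(mode: MemoryMode) -> bool:
    return mode in ("tracemalloc", "all")
```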

Comment thread src/excelbench/perf/runner.py Outdated
Comment on lines 770 to 804
@@ -691,13 +800,13 @@ def _measure_read_workload_iteration(
wall1 = time.perf_counter_ns()
cpu1 = time.process_time_ns()

rss_after = _ru_maxrss_mb(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
rss_peak = max(rss_before, rss_after)
probe.__exit__(None, None, None)

Collaborator Author

Fixed in 4a80614 — _measure_read_workload_iteration now uses with MemoryProbe(memory_mode) as probe: (matching the existing pattern in _measure_read_iteration). Tracemalloc cleanup is now exception-safe across all four iteration helpers.

Fixes seven Copilot review issues on the Sprint 1 memory honesty PR:

memory.py:
- includes_getrusage now also returns True for tracemalloc mode so
  --memory-mode=tracemalloc reports RSS alongside the Python heap peak
  (matches the documented help-text contract).
- run_iteration_under_time_l now picks the right verbose flag per
  platform: -l on macOS (BSD time), -v on Linux (GNU coreutils time).
  parse_time_l_stderr already handled both stderr formats; only the
  invocation needed splitting.
- Renamed local time_l → time_bin since the flag is no longer constant.

runner.py:
- Wrapped MemoryProbe usage in `with` blocks for the three remaining
  measurement helpers (_measure_read_workload_iteration,
  _measure_write_workload_iteration, _measure_write_iteration). Manual
  __enter__/__exit__ leaked tracemalloc state on adapter exceptions.
- Removed redundant cwd=manifest_path.parent.parent kwarg — the
  subprocess takes paths explicitly and shouldn't change cwd.
- New _measure_one_sample helper centralizes the bench-loop iteration:
  in `time` mode it now runs ONLY the subprocess (using its JSON
  metrics for wall_ms/cpu_ms); in `all` mode the in-process loop also
  runs (the divergence is the deep-dive signal); other modes stay
  in-process. Eliminates the previous 2× execution per recorded sample.

cli.py:
- --memory-mode help text updated to describe the new behavior:
  tracemalloc mode reports getrusage RSS, time mode picks -l/-v per
  platform, time mode runs only the subprocess.

Verification:
- uv run pytest tests/ ✓ 1140 passed, 32 skipped, 6 xfailed
- uv run ruff check src/ tests/ ✓
- uv run mypy src/excelbench/perf/ src/excelbench/cli.py ✓
- Smoke test --memory-mode=tracemalloc populates both rss_peak_mb and
  python_heap_peak_kb (previously rss_peak_mb was None).
- Smoke test --memory-mode=time populates rss_kb_via_time and uses
  subprocess wall_ms (no double-execute).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wolfiesch wolfiesch merged commit 373c896 into master Apr 27, 2026
5 checks passed
@wolfiesch wolfiesch deleted the feat/perf-mem-honesty branch April 27, 2026 09:39
wolfiesch added a commit that referenced this pull request Apr 27, 2026
Extends the throughput-fixture pipeline with a parametric (dtype × scale)
matrix so the dashboard can answer "how does library X handle int vs string
vs formula at 1M cells?" — the most actionable question for users picking
between libraries.

What ships:
- 7 new value_type branches in _run_workload_write (float, date, datetime,
  boolean, formula_simple, formula_cross_sheet, mixed_realistic);
  string-short/long fold into existing string op via string_length.
- generate_data_shape_scenarios() in the throughput generator: 10 dtypes ×
  3 tiers (1k/10k/100k) by default, +1M tier behind --include-1m. New
  --shape-only flag for fast-iteration runs.
- excelbench perf-shape subcommand with --rows/--types/--regenerate/
  --memory-mode (inherits Sprint 1's memory-mode plumbing).
- _section_data_shape dashboard heatmap (read + write), per-dtype-column
  log-scale color so slow columns don't wash out fast ones.
- DEC-019 with mixed_realistic 60/30/5/3/2 ratio rationale and
  fixtures/synthetic_calibration/sample_set.md calibration provenance.

Branched off feat/perf-mem-honesty (S1) since #28 hasn't merged; once it
does, this PR will retarget master.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
wolfiesch added a commit that referenced this pull request Apr 27, 2026
…rs) (#31)

* feat(perf): Sprint 2 — data-shape benchmark matrix (10 dtypes × 4 tiers)

* test(perf): add Sprint 2 data-shape coverage tests

Brings PR #31 above the 65% coverage gate by adding 31 tests for the
data-shape benchmark matrix:

- Parametrized value_type writes (10 dtypes including formula_simple,
  formula_cross_sheet, mixed_realistic, sparse cells)
- Unsupported value_type surfaces in row.notes (catch-all branch)
- _resolve_shape_features (rows-to-tiers, types filtering, error paths)
- _shape_fixtures_stale (missing manifest, generator-newer-than-manifest,
  needs_1m gate, fresh-manifest-no-regen)
- _section_data_shape dashboard helper (heatmap rendering, color scaling)
- perf_shape CLI (invalid memory_mode, invalid dtype, unknown adapter,
  happy path writing results.json/matrix.csv/history.jsonl)

Local coverage 67.64% (above 65% gate); Linux CI estimated ~65.9%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(perf): address PR #31 Copilot review comments

- cli.py: tier cap for 100k from 100,000 → 99,856 (matches actual 316×316
  generated fixture, so --rows 99856 doesn't silently drop the tier)
- generate_throughput_fixtures.py: align boolean/date/datetime fixture
  seeds with the runner's start=1 convention (was 0-based, runner was
  1-based, causing read fixtures and write payloads to differ for the
  same dtype)
- generate_throughput_fixtures.py: distinguish "scenarios" (xlsx files)
  from "manifest rows" in the --shape-only output line
- README.md: clarify "40 scenarios produce 80 features" (read + write)
- html_dashboard.py: switch isinstance(p50, int | float) to tuple form
  (universally compatible; Copilot incorrectly flagged PEP 604 form)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Wolfgang Schoenberger <221313372+wolfiesch@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
