bench: Add internal performance benchmark suite + CodSpeed CI#771
bench: Add internal performance benchmark suite + CodSpeed CI#771FBumann wants to merge 113 commits into
Conversation
…smoke Refactors the internal benchmark suite around a reusable ModelSpec / REGISTRY pattern so adding a model is one self-registering file with metadata (features, applicable phases, sizes, optional deps). Other tests and scripts can import it via `from benchmarks import REGISTRY`. New model specs cover gaps in the existing coverage: - milp: general (non-binary) integers (capacitated facility location) - qp: continuous quadratic objective (diagonal portfolio) - sos: SOS1 multi-mode generation (Model.add_sos_constraints) - piecewise: piecewise-linear fuel cost (Model.add_piecewise_formulation) - masked: sparse-route transportation using mask= on add_variables SOS and piecewise specs gate their own registration on API availability, so the suite stays runnable on older linopy. New phase tests: - test_solver_handoff.py: parametrizes lp.io.to_highspy/to_gurobipy/ to_mosek/to_xpress across applicable models, skipping per-solver when the solver isn't installed. Uses stable lp.io wrappers (not the new Solver.from_name API) for backward compatibility. - test_netcdf.py: separate to_netcdf / read_netcdf benchmarks. CI: new benchmark-smoke.yml runs the suite under --quick --benchmark-disable on PRs, so refactors that break a model spec get caught early. Timings stay off CI (~35s smoke locally, no regression tracking). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The default ``pytest benchmarks/`` run now skips the slowest 1-2 sizes per spec (e.g. knapsack at 1M, basic at 1600, pypsa_scigrid at >50) so a full timing pass completes in ~2 minutes instead of 20-45. ModelSpec grows a ``long_threshold`` mirror of ``quick_threshold``: - ``--quick`` → ``size <= quick_threshold`` (CI smoke) - default → ``size <= long_threshold`` (medium-cost regression) - ``--long`` → no cap (full sweep) Verified locally: - --quick: 227 passed / 230 skipped / 35s - default: 333 passed / 124 skipped / 45s - --long : 457 passed / 0 skipped / 2m13s Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop pypsa_scigrid from --quick entirely (quick_threshold=0). PyPSA import + example loading dominates the smoke wall-clock; the model still runs in default and --long modes. - Lower every other spec's quick_threshold to its smallest size, so --quick exercises one size per model across all phases. The default tier (which uses long_threshold) still gives broad regression coverage. Verified locally: - --quick: 85 passed / 372 skipped / 18.5s (was 35s) - default: 333 passed / 124 skipped / 44.8s (unchanged) - --long : 457 passed / 0 skipped / 2m11s (unchanged) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
benchmarks/notebooks/registry_usage.py is the canonical walkthrough for the model registry. Authored in jupytext percent format so it triples as: - runnable Python script (CI executes it on every PR) - notebook in JupyterLab / VSCode with the jupytext extension - readable doc on GitHub (markdown cells render directly) Covers: import, lookup by name, iterate, filter_by feature/phase, parametrize-your-own-pytest pattern, one-off tracemalloc profiling, and the three CLI tiers. CI: benchmark-smoke.yml gains an "Execute registry-usage notebook" step right after the pytest smoke — so doc rot fails the build instead of hiding until someone next opens the file. README: new "Worked walkthrough" subsection points at the notebook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Replace jupytext-style ``registry_usage.py`` with a proper ``registry_usage.ipynb`` — matches the repo convention (examples/*.ipynb, nbsphinx, nbstripout). CI executes it via ``jupyter nbconvert --execute``. - Add ``__repr__`` (one-line summary) and ``_repr_html_`` (attribute table) to ModelSpec. Visible in pytest -v output, in interactive Python, and as rich HTML in Jupyter cells. - Notebook simplified to lean on the new reprs: explicit-attribute prints in sections 2-5 replaced by bare expression evaluations. - README points at the .ipynb and notes the "launch jupyter from repo root" convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds ``python -m benchmarks <command>`` with typer subcommands:
- list / show / filter — introspect the registry
- smoke — pytest --quick --benchmark-disable (CI)
- run [--long --phase --model --filter --json]
— pytest --benchmark-only with knobs
- notebook — execute the registry-usage notebook
- memory save/compare — replaces the argparse main in memory.py
Modern typer style throughout: Annotated[...] for every parameter,
Literal[...] for the --phase choice, function docstrings for command
help. ``--help`` is auto-generated and is the source of truth — README
and the notebook just point at it instead of duplicating the menu.
CI smoke now calls ``python -m benchmarks smoke`` and
``python -m benchmarks notebook``. memory.py keeps its save/compare
functions but loses the argparse layer. typer added to the [benchmarks]
extra.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cleanups after checking typer's docs: - Pin typer to the latest release (==0.26.2) in the [benchmarks] extra, so the CLI's behaviour is reproducible across dev / CI / contributor machines. - Switch ``smoke`` and ``run`` from the ``extra: list[str]`` argument to the idiomatic ``typer.Context`` + ``context_settings`` pattern (allow_extra_args, ignore_unknown_options). With the old style, any trailing ``--flag`` would be parsed as an unknown option and rejected; with ctx.args, ``python -m benchmarks run --long -- --tb=short -x`` actually works. Other patterns already match typer's recommended style: Annotated[...], Literal for choice params, docstrings for command help, sub-apps via add_typer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two layers of pinning for stable measurement: - ``[benchmarks]`` extra in pyproject pins the test infra exactly (pytest, pytest-benchmark, pytest-memray, pypsa, highspy, netcdf4, nbconvert, typer). Loose enough that the sweep workflow can install varying linopy versions on top. - ``benchmarks/requirements.lock`` is the full transitive resolution (numpy, scipy, pandas, xarray, plus everything else). Generated via ``uv pip compile --no-emit-package linopy`` so the lockfile pins the *environment around linopy* without pinning linopy itself — that lets the same lockfile work for both current-tip regression runs and cross-version sweeps. README clarifies that the lockfile gives consistency over time on the same machine, not absolute reproducibility across machines (CPU / cache / memory bandwidth still matter). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``python -m benchmarks sweep 0.5.0 0.6.0 0.7.0`` builds a fresh venv per version with uv, installs the benchmark infra (lockfile by default, or the [benchmarks] pinned subset with --no-use-lock) plus the target linopy in a single resolution pass, and runs the suite. Snapshots land in ``<output-dir>/linopy-<version>.json``. Useful for bootstrapping a perf history against published linopy releases. The current benchmark code runs against each linopy version (constant measurement layer); the ``_API_AVAILABLE`` gates on sos / piecewise specs make older linopy versions skip those phases gracefully. Verified locally: ``sweep 0.7.0 --quick --no-use-lock`` runs end-to-end in ~2 min (uv installs 57 packages in 200ms; the rest is the benchmark run). Plain releases (0.4.0) and pip specs (git+https://...) both work via the ``_linopy_install_spec`` helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The README previously duplicated content from three sources: - the notebook (models table, registry-usage code blocks) - ``--help`` (quick-reference command list) - a stale memory.py invocation (since replaced by ``memory save/compare``) After the consolidation each surface has a clear single job: - README: 1-paragraph what, setup (uv sync / lockfile), size-tier table (architectural), pointers to the notebook + ``--help``, metrics blurb. - ``notebooks/registry_usage.ipynb``: the walkthrough — registry import, lookup / iterate / filter, parametrize your own pytest, profiling. - ``python -m benchmarks --help``: command reference, autogenerated by typer from docstrings / Annotated[..., Option(...)] declarations. Drops ~140 lines from the README; nothing actually disappears — it just lives in the one place that owns it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pypsa removed from the [benchmarks] pinned set, from the sweep
``--no-use-lock`` install list, and from the lockfile. The
``test_pypsa_carbon_management.py`` module uses ``pytest.importorskip``
so collection no longer fails without pypsa; ``pypsa_scigrid`` already
had ``requires=("pypsa",)`` so its phase tests skip gracefully.
Install pypsa separately when you want those benchmarks.
Notebook (registry_usage.ipynb) rewritten as a proper operator guide:
- Architecture overview + per-phase measurement table up front.
- Registry walkthrough (lookup / iterate / filter) kept as the spine.
- Reuse patterns (parametrize-your-own-pytest, tracemalloc spot check).
- ``Running`` section now embeds ``--help`` output live via a
``show_help()`` helper that shells out to ``python -m benchmarks ...
--help``. The doc stays in sync with the typer implementation
automatically — change a flag in cli.py, re-run the notebook,
documentation updates.
- New sections cover timing snapshots, memory snapshots, the
cross-version sweep, and lockfile regeneration.
README gains an explicit "pypsa is optional" note in setup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rough
Mirrors ``run``'s filter knobs and applies them to every version's
pytest invocation. Also switches to the ``typer.Context`` +
``context_settings`` pattern so trailing args after ``--`` are
forwarded to pytest verbatim (same shape ``smoke`` / ``run`` use).
python -m benchmarks sweep 0.6.7 --phase build --model basic
python -m benchmarks sweep 0.6.7 -- --tb=short -x
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``python -m benchmarks compare a.json b.json [-- --columns=...]`` shells out to ``pytest-benchmark compare`` so the whole suite stays under one entry point. Accepts any number of snapshots; first is the baseline. When called with no arguments — or with paths that don't exist — it prints a copy-paste-ready list of snapshots found under ``.benchmarks/`` (including ``.benchmarks/sweep/`` for cross-version runs). If nothing's saved yet, points at the ``run --json`` flow. For memory snapshots use ``memory compare`` instead — different format, different tool. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-paste) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ame) pytest-benchmark's own default emits 10 columns side-by-side, which is unreadable for any non-trivial comparison. Wrapper now prepends ``--columns=median,iqr --sort=name`` so the table is two stats wide and the (baseline, candidate) pair of each test sits together alphabetically. Defaults are only applied when the user hasn't already set the flag, so trailing pass-through overrides still work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arg split Two fixes for the ``compare`` UX surfaced by the cross-version sweep: - Default to ``--group-by=fullname`` so each test gets its own mini table showing (baseline, candidate) side-by-side with the parenthesized auto-ratio per column. Easy to scan ``(>1.10)`` for regressions in the median column. Combined with the existing ``--columns=median,iqr --sort=name`` defaults, the output goes from 10-columns-wide-on-one-line to a focused two-column per-test view. - Switch ``compare`` away from a positional ``list[Path]`` argument and parse ``ctx.args`` by hand: typer's positional list was greedily grabbing trailing ``--group-by=fullname`` etc. (and the ``--`` separator didn't escape it either). Now arg-splitting is explicit: anything starting with ``-`` is pytest-benchmark pass-through, everything else is a snapshot path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tried switching to the canonical typer pattern (``--`` separator for pass-through) but typer's positional ``list[Path]`` + ``allow_extra_args`` still greedily ate the trailing options. There's no clean typer/click idiom for "list-typed positional + pass-through" — workarounds are manual splitting, bounding the positional count, or named flags. Manual splitting is the most pragmatic: snapshots come first, once we see any flag-like token the rest is forwarded to pytest-benchmark. That preserves things like ``--histogram=/tmp/hist/cmp`` (built-in SVG-per-test plotting), ``--csv=out.csv``, ``--group-by=fullname``, and the value-taking flags whose value doesn't start with ``-``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three opinionated interactive HTML views over pytest-benchmark JSONs, auto-picked from snapshot count or set explicitly via ``--view``: - **compare** (2 snapshots) — horizontal bar chart of per-test median delta, sorted by magnitude, green→red colormap. The "did this PR regress anything?" picture in one glance, vs pytest-benchmark's 60-individual-SVGs which are useless for that workflow. - **sweep** (3+ snapshots) — heatmap of median ratio relative to the first snapshot, rows = tests, columns = labels. Pairs with the ``sweep`` subcommand. - **scaling** (1 snapshot) — log-log median vs ``n`` for size-parametrized tests (e.g. ``[basic-n=10..1600]``), faceted by phase. Shows whether linopy's complexity scales as expected. plotly==6.7.0 pinned in [benchmarks]; lockfile regenerated. plotly is lazy-imported inside ``plot`` so the rest of the suite stays usable without it (with a clear error if a user tries ``plot`` and it's missing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er absolutes
- New ``benchmarks/plotting.py`` module owns the three views
(``plot_compare`` / ``plot_sweep`` / ``plot_scaling``) plus a
``RENDERERS`` dispatch dict. cli.py drops ~140 lines and just imports
``PlotView`` + ``RENDERERS``; plotly is still lazy-loaded inside the
view functions so importing the module without plotly works.
- ``compare`` bar chart and ``sweep`` heatmap now use ``text_auto``
so values render inside each bar / cell.
- Hover info upgraded:
- compare hover shows the per-test median of *both* snapshots
(formatted to 4 significant figures) in addition to the delta %.
- sweep hover shows the absolute median (s) alongside the ratio, via
a customdata + hovertemplate plumbed through ``update_traces``.
scaling view already shows the absolute median on hover by virtue of
being a line chart.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ride For microbenchmarks the lowest observed time is closest to the "true" cost — background noise (GC, context switches, thermal throttling) can only slow things down. pytest-benchmark's own ``--sort`` default is ``min`` for the same reason; LLVM's perf guide, JMH, Google Benchmark and Alexandrescu's "Speed is found in the minds of people" all argue similarly. Changes: - ``plot`` defaults to ``--metric min`` (was median). Accepts ``--metric median|mean|max`` to override. The metric drives the bar values, heatmap ratios, scaling-curve y-axis, and the hover labels. - ``plot_compare`` / ``plot_sweep`` / ``plot_scaling`` in ``benchmarks/plotting.py`` all take a ``metric: Metric = "min"`` arg. - ``compare`` table defaults to ``--columns=min,iqr --sort=min`` (was median,iqr / name). The auto-ratios next to each ``min`` flag regressions in the same readable form. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For a suite where test costs span six orders of magnitude (knapsack microsecond builds vs PyPSA carbon at 2.4 s), sorting by % delta overweights cheap tests — a 100% regression on a 1µs test ranks above a 1% regression on a 2s test, but the absolute impacts are 1µs vs 24ms. Changes: - Default sort is now ``absolute`` (``b - a`` in seconds). Bar values are the time delta with SI-prefix formatting on the x-axis (24 ms, 240 µs, etc.). Big actual-time impacts float to the bottom. - ``--sort relative`` keeps the old percent behaviour. - Both ``delta_abs`` and ``delta_pct`` are surfaced in hover regardless of which one drives the sort, so you can read off whichever lens. - ``plot_sweep`` / ``plot_scaling`` accept a ``sort`` arg for uniform signature but ignore it (no two-snapshot diff there). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The compare bar chart forces a choice between sorting by relative % or absolute Δ. Both have blind spots: pure-relative makes microbenchmark noise look catastrophic, pure-absolute hides real algorithmic regressions on fast paths. The two-axis scatter resolves the tension visually. Per test: - x = baseline time (log scale) - y = candidate / baseline ratio - colour = absolute Δ A point is a real regression worth chasing only when it sits in the top-right — slow tests that got slower. Top-left (high ratio, tiny absolute) reads as microbenchmark noise; bottom-right (high absolute, ratio ≈ 1) was already slow and didn't change. A dashed reference line at ``y=1`` makes "no change" trivial to see. The view is auto-picked for nothing (compare wins for 2 snapshots); pass ``--view scatter`` explicitly to get it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two-axis scatter now scales beyond a single baseline-vs-candidate pair. With N>=3 inputs the first is still the baseline (reference); each subsequent snapshot becomes one animation frame. Use the slider / play button to scrub through versions and watch tests drift across releases. Implementation: - First snapshot is the baseline. Skipped from the frame set (would trivially be y=1 everywhere). - Each subsequent snapshot contributes points at (baseline_time, ratio, Δ) per overlapping test. ``animation_frame="version"`` does the per-frame slicing; ``category_orders`` preserves input order in the slider so the timeline reads left→right. - ``range_x`` / ``range_y`` are pinned to the global min/max so the camera doesn't jump between frames. - 2 inputs still produces a static scatter (no animation overhead). Considered ``facet_col`` but it gets cramped past ~4 versions — the slider scales to arbitrary length. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… |Δ| Two small but high-value tweaks for the multi-snapshot scatter: - The baseline snapshot now contributes its own animation frame where every point sits at ratio=1, Δ=0. Gives the animation a "before anything happened" anchor: hit play and watch points drift from the baseline horizon outward. Previously the first frame was the first candidate, which made the visual feel as if it started mid-story. - ``range_color`` is pinned to the 95th-percentile absolute Δ (±p95). One huge outlier no longer drags the colour scale and flattens everyone else to white; outliers saturate at the bound, the rest of the distribution stays readable. Colour-bar label notes ``Δ (s, p95-clipped)`` so the convention is explicit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "no change" line sits at y=1, but with asymmetric data (e.g. some 2x regression, no symmetric speedup), it landed near the bottom of the visible range and improvements got squeezed near the floor. Now: ``max_dist = max(|1 - y_lo|, |y_hi - 1|)`` and ``range_y = [1 - max_dist, 1 + max_dist]``. Pure min/max coverage (no clipping) but the window is symmetric around 1.0, so regressions above and improvements below are equally readable regardless of the data skew. The colour scale keeps the p95-clipped centred-at-0 treatment from the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h warn Three safe fixes from a code review of benchmarks/plotting.py: - Row-height multiplier 14 → 22 in plot_compare and plot_sweep, with the floor bumped from 400 to 500. At 25+ tests the y-axis labels were colliding; now they breathe. - plot_scaling reads ``params.size`` (the cleanly-stored int from parametrize) and only falls back to the id regex if absent. The ``model`` name still needs the regex because pytest-benchmark serializes our ModelSpec as ``UNSERIALIZABLE[ModelSpec(...)]``, so a full params switch isn't possible here — but the size path is now robust to test-id rename. - plot_compare surfaces the mismatch between snapshots: prints a stderr line with the test counts only in A / only in B / common, and embeds the same as a subtitle in the figure. Silent intersection was the worst-case footgun. Skipped (per review note): the default-view swap for 3+ snapshots (sweep → scatter) is a judgement call left for the user. Default output filename change (clobber on each run) also skipped — they want to decide whether per-view filenames are worth the API change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…view>.html Two coupled changes setting up the notebook-embedding path: - ``plot_compare`` / ``plot_scatter`` / ``plot_sweep`` / ``plot_scaling`` in ``benchmarks/plotting.py`` now return a ``plotly.graph_objects.Figure`` instead of writing to disk + returning a count. The CLI does the ``fig.write_html(output)`` step. ``benchmarks.plotting.n_points(fig)`` is exported as a helper so the CLI still emits a "N points → path" status line. This unblocks rendering plots directly in jupyter — call ``plot_compare(...)`` and Jupyter's display hook renders the Figure inline. - Default ``-o`` for ``plot`` is now ``.benchmarks/plots/<view>.html`` (was ``benchmark-plot.html`` in cwd). Matches where snapshots already land (and is gitignored), and the per-view filename means consecutive runs of different views don't clobber each other. Bonus: two ``numpy_array or fallback`` bugs in scatter (``df.abs().max() or 1e-9``) and the new ``n_points`` helper (``trace.x or trace.z``) — both triggered ``ValueError: The truth value of an array with more than one element is ambiguous``. Replaced with explicit ``is None`` checks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ospection The ``n_points(fig)`` helper added in the previous commit walked ``fig.data`` traces and called ``len(trace.x)`` to recover the test count. That's backwards — the count is sitting right there in the source DataFrame at render time, no need to reach into the rendered plot. Renderers now return ``tuple[Figure, int]`` directly. ``len(df)`` for compare / sweep / scaling; ``df["test"].nunique()`` for scatter (rows are per-(test, version) so the raw len double-counts). n_points helper dropped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two refinements to the end-to-end plotting section: - tqdm wraps the subprocess loop that generates the two snapshots. Each ``--quick --phase build`` run takes ~10 s; tqdm makes the ~20 s wait visible. ``tqdm.auto`` auto-picks the notebook widget vs console bar based on context. - Plots are now rendered via ``python -m benchmarks plot --view <name>`` rather than direct ``plot_compare`` / ``plot_scatter`` imports. A small ``cli_plot(view, snapshots)`` helper runs the subprocess, reads the generated HTML, and inlines it via ``IPython.display.HTML``. Demonstrates the actual user-facing CLI path inside the notebook rather than the internal API. Notebook end-to-end runtime: ~37 s (~33 s for the run loop + plotting overhead). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both CodSpeed jobs now use CodSpeedHQ/action@v4 with id-token:write and no CODSPEED_TOKEN — nothing to rotate or leak. Matches the canonical config. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tracks peak memory + allocations per-PR-to-master — the instrument that reflects the sparsity/memory work, which simulation and walltime miss. Starts on a standard runner; first run settles whether Python memory mode needs the bare-metal macro runner. Non-gating, OIDC auth. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CODSPEED_MODULES in conftest is the single source of truth for what the --codspeed instruments measure: build, matrices, lp_write (file IO) and solver_handoff (direct in-memory IO). test_netcdf is deselected under --codspeed (slow + noisy under walltime) but still runs under `benchmarks smoke`. Workflows are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Keeps the sparse(0)/dense(100) contrast plus a midpoint, cuts pattern items ~40% across every instrument. Finer characterization via per-spec severities. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s from CodSpeed Phase taxonomy now names the directional operation instead of mixed verbs: - lp_write -> to_lp (test_lp_write.py -> test_to_lp.py) - netcdf -> to_netcdf / from_netcdf (one phase split into the two directions; test_netcdf.py functions test_to_netcdf / test_from_netcdf) - solver_handoff -> to_solver (test_solver_handoff.py -> test_to_solver.py; 'to_solver' avoids reading as Solver.solve()) build / matrices stay un-prefixed (lifecycle, not exports); the per-solver to_highspy/... tags were already in this scheme. Also drops test_matrices from CODSPEED_MODULES — matrix-gen is a subset of the direct to_solver handoff (both warm the same label_index cache), so CodSpeed keeps full coverage via build + to_lp + to_solver. Propagated through registry, __init__, memory.py (phase ids + CLI tags), cli/_base.py (PhaseName + test-file map), sos.py, docs. Renames benchmark ids, so CodSpeed re-baselines those benchmarks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cachegrind (simulation) is ~hundreds× native and under-weights sparse/native
work, so the per-PR simulation job now trims to the cheap phases (build, to_lp).
to_solver keeps per-PR coverage via the memory instrument and master coverage
via walltime. Each CodSpeed workflow declares its instrument with
--codspeed-set {simulation,memory,walltime}; conftest maps it to a module
subset via config.getoption (our own option — no pytest-codspeed internals, no
silent mode-sniffing). Definition stays in Python (CODSPEED_SETS).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
walltime/memory both map to the full set (the default), so passing --codspeed-set there was a no-op. Only the cachegrind job opts into the trimmed set; memory/walltime revert to plain pytest --codspeed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The walkthrough is a CLI demo, but its run/memory-save cells swept the whole registry (16 models + patterns) twice each — 'run --quick --phase build' alone was ~23s x2. Scope them to '--filter basic' (~3s each): same workflow demonstrated and still executed in CI, ~50s less work. Drop the filter to sweep everything. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gated full run Separate per-instrument CI jobs overwrote each other: CodSpeed records one run per (commit, environment), so a later upload to the same environment supersedes the earlier one and its benchmarks show as 'skipped' (memory kept vanishing behind the slower simulation upload). Fix: - codspeed-memory.yml: per-PR memory only (solo upload, fast, no overwrite). - codspeed.yml: simulation+memory in ONE run (mode: simulation,memory) on master push + manual dispatch + opt-in PR via the 'trigger:benchmark' label (keeps PR context so CodSpeed comments; label auto-removed after the run). - codspeed-macro.yml: walltime adds the same label-gated PR path. Drops the --codspeed-set per-instrument subset machinery (incompatible with the one-upload-per-environment model); CODSPEED_MODULES stays as the eligibility guard. Runtime is now traded via cadence (per-PR = memory only), not subsets. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ting (#28) * refactor(benchmarks): lean up — trim docstrings/comments, dedupe plotting Behavior-preserving readability pass: - plotting.py: extract _metric_label (4x) and _diverging_kwargs (3x); cut comments that restate code, keep the plotly-quirk ones; tighten docstrings. - memory/bench/registry/snapshot: trim the 21-34 line module docstrings to essentials (keep the non-obvious why); fix stale lp_write -> to_lp refs. - sweep.py: trim restate comments (orchestration left as-is — no test coverage). No logic/API/CLI change; -84 lines. Verified: mypy clean, smoke 172 passed, _tests 14 passed, walkthrough notebook renders all plots. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci(codspeed): run memory before simulation in the combined run mode: memory,simulation — memory (fast, always-on) is measured first, so its baseline lands even if the slow cachegrind pass is interrupted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aseline (#29) Instruction counts under-weight the numpy/scipy/sparse work this fork targets, and the cachegrind pass was ~11 min — by far the slowest. This project doesn't do micro-CPU tuning, so its deterministic-sub-1%-CPU value isn't worth it. Dropping it also removes the whole reason codspeed.yml existed (memory+simulation shared one run to avoid same-environment overwrite). Now: - codspeed-memory.yml: memory on master + every PR + dispatch (always-on). - codspeed-macro.yml: walltime on master + dispatch + labeled PRs (unchanged). - codspeed.yml: deleted. memory (ubuntu) and walltime (macro) are separate environments, so no clash. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(benchmarks): log2 sweep colouring + --clip clamp (shared symmetric p95)
The sweep heatmap coloured by raw ratio on plotly's linear scale, so a 2x and
its mirror 1/2x looked asymmetric. Colour by log2(ratio) instead — folds
symmetric around 1x, with a fold-change colourbar (1/8x...8x).
Add --clip to override the colour clamp (a fold-change >1 for sweep, an absolute
delta for scatter) over a new shared _symmetric_clip(magnitudes, override)
helper that defaults to the symmetric p95 of the data, reused by both views.
numpy promoted to a module-level import.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(benchmarks): log-scale ratio axis in scatter; symmetric p95 colour in compare
Same fix as the sweep view, applied across the others:
- scatter: the ratio y-axis was linear, so a 2x and its mirror 1/2x read
asymmetrically (they even centred it on 1.0 *linearly*). Make it log_y so folds
are symmetric about 1.0; window symmetric in log space; drop non-positive
ratios (a log axis can't show a 0).
- compare: clamp the bar colour with the shared symmetric p95 (consistency; the
bar *length* still shows the full delta).
- scaling already log-scales size; left as is.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(benchmarks): make --clip a uniform fold-change on the ratio dimension
Was inconsistent: a fold for sweep but an absolute Δ for scatter, and compare
ignored it. Now --clip is always a fold-change (>1, default symmetric p95) that
bounds the *ratio* dimension wherever a view has one:
- sweep: the ratio colour (±log2)
- scatter: the ratio y-axis ([1/clip, clip]) — moved off the colour, which
reverts to the auto symmetric-p95 Δ clamp
- compare / scaling: no ratio axis → ignored
Validation is now uniform (fold-change > 1).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(benchmarks): --clip clamps only the colour scale (log or linear per plot)
Colour is the one thing you can't adjust after the plot is drawn (axes zoom).
So --clip targets the colour only, and its unit follows the plot's colour scale:
- sweep (colour = log2 ratio): a fold-change (>1)
- scatter / compare (colour = absolute Δ): a linear Δ bound
- scaling: no diverging colour → ignored
Default stays the symmetric p95. Axes are full-range and zoomable — scatter's
y-axis no longer p95-clips (which hid outlier points). Validation is per-scale
(fold > 1 for sweep; any positive for the linear ones).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(benchmarks): make --sort actually sort compare bars; add --order for inputs
- compare: --sort was misleading — bars were hardcoded alphabetical
(sort_values('test_id')) while --sort only switched the dimension. Now it
sorts by the chosen Δ (delta_abs/delta_pct): biggest regressions on top,
improvements at the bottom. The name/help are finally truthful.
- plot --order {input,version}: default 'input' preserves the order you pass
(the plot never re-sorts); 'version' sorts inputs by parsed linopy-<ver>,
fixing a glob's string order (0.3.10 before 0.3.2) for release-history sweeps.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(benchmarks): add plot --reverse to flip snapshot order
Applied after --order, so e.g. --order version --reverse = newest-first (which
also makes the newest snapshot the sweep baseline / compare 'a').
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* revert(benchmarks): drop plot --order/--reverse
plot serves arbitrary snapshots (bench labels, baseline.json, …), so parsing
linopy-<ver> from filenames is a leaky abstraction — --order version is
meaningless for non-version snapshots. plot already preserves input order, so
callers control the axis by the order they pass. The --sort fix (compare bars
sort by Δ, not alphabetically) stays — that was a real bug.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(benchmarks): --quick as explicit per-spec subset + manual axis selection
Replace the quick_threshold size cap with quick_subset — the first/middle/last
of each spec's sweep — so --quick runs three representative points instead of a
threshold side-effect. This fixes two latent bugs the threshold caused:
- models ran only their smallest size under --quick (most quick_threshold=10),
so the per-PR memory signal was dominated by sub-MiB benchmarks;
- patterns silently skipped severity=100 (quick_threshold=50), so the densest,
peak-memory regime never ran in CI.
Both now follow one rule (first/mid/last); patterns get all of (0, 50, 100).
pypsa_scigrid opts out via quick_sizes=().
Add manual --size / --severity pytest options (repeatable) that override the
tier flags per axis, so CI can pin exact values.
Widen DEFAULT_SEVERITIES to (0, 25, 50, 75, 100) for finer default/--long/sweep
resolution; --quick still distills to (0, 50, 100), so CodSpeed isn't rebaselined.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(benchmarks): sweep --metric {time,memory,both} + compare auto-detect
Make the metric (time vs memory) a flag on the workflow rather than a separate
command tree — the first half of unifying the time/memory CLI split.
- Add a Measure enum (time|memory|both) and `--metric` to `sweep`, dispatching
to the timing or memory engine (or both, sequentially — memray overhead would
skew wall-clock if run concurrently). `both` writes per-metric subdirs so the
two linopy-<version>.json sets don't collide. Default snapshot dirs are now
.benchmarks/{time,memory}/ (timing default moved from .benchmarks/sweep).
- `compare` auto-detects memory snapshots (peak_mib key) and diffs them with a
peak-RSS table; timing still goes through pytest-benchmark. Mixing errors out.
Adds memory.compare_snapshots (path-based; compare() now delegates to it).
The `memory` sub-app still works; Stage 2 retires it and adds `run --metric` +
--long/--size/--severity parity on the memory engine.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(benchmarks): retire memory sub-app — `run --metric` + unified gating
Finish folding the time/memory split into a single command surface. Memory is
now a --metric flag everywhere, not a parallel `memory` sub-app.
- `run --metric {time,memory,both}` replaces `memory save`: prints results,
saves only with --json (label = filename stem) — one rule for both metrics.
Adds --quick/--size/--severity/--repeats so flags match across metrics.
- Delete the `memory` sub-app (cli/memory.py, memory_app); `memory sweep` →
`sweep --metric memory`, `memory compare` → `compare` (auto-detect, Stage 1).
- Unify size/severity selection in registry.skip_reason(), shared by
conftest.maybe_skip and memory.run_phase so the two engines can't drift.
Threads --long/--size/--severity through the memory worker; the id-alignment
test still passes.
- Split memory.save() into measure() (returns the dict) + save() (writes), so
`run --metric memory` can measure-then-maybe-write.
- run_memory_sweep now invokes `run --metric memory --json <abs>` and writes
straight to output_dir (no save-to-default-then-move dance); gains --long.
- Update the walkthrough (CI-executed, re-runs clean) and benchmarks/README.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(benchmarks): fail cleanly on malformed/unreadable snapshot files
compare crashed with a raw JSONDecodeError traceback (and plot printed a
cryptic "Expecting value:" with no filename) when handed an empty or malformed
snapshot. load_snapshot now raises a clear, file-named ValueError on unreadable
JSON or an unrecognized shape; compare routes its metric detection through
load_snapshot too, so both commands report e.g.
/tmp/x.json: not a readable JSON snapshot (Expecting value: line 1 ...)
The mixed memory/timing guard (compare's any/all, plot's _check_same_unit) was
already correct and is unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mall --clip (#32) The fold colourbar placed ticks at integer log2 steps (…, 1/2×, 1×, 2×, …). For a tight range — a small `--clip` like 1.5 (bound = log2 1.5 ≈ 0.585), or a naturally tight p95 default — those integer steps fall *outside* [-bound, bound], so only `1×` rendered. Pick ticks that fit the range and read as round folds: clean powers of two for a wide range, round sub-2× folds (1.5×, 1.25×, 1.1×) for a tight one, with an even-spaced fallback when nothing round fits. `--clip 1.5` now shows 1/1.5× … 1.5× instead of a lone 1×. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…thresholds) (#33) Replace the derived quick subset and the long_threshold number with explicit size-tier constants at the top of each model file, so what runs in each tier is visible at a glance: SIZES = (10, 50, 100, 250, 500, 1000, 1600) # --long QUICK_SIZES = (10, 250) # --quick (per-PR) LONG_SIZES = (1000, 1600) # --long only Tiers: --quick → quick_sizes; default → sizes minus long_sizes; --long → all. LONG_SIZES is the sizes above each old long_threshold, so default/--long behave exactly as before. QUICK_SIZES is the lean per-PR pair — the giant-drop that fixes the 14-min CodSpeed memory job, now spelled out rather than derived. Patterns expose the same shape via the QUICK_SEVERITIES (0, 50, 100) constant and long_sizes=() (every severity runs by default). skip_reason / introspect / help text updated; long_threshold removed. Supersedes the derivation-based giant-drop. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@FBumann codspeed has now all required permissions. I would like to pull this in before the release 0.8 so we have a clear starting point for the benchmark in upcoming versions |
There was a problem hiding this comment.
Hey @FBumann,
codspeed is great stuff! Thanks, I would wanna use this for other repos as well. I only wen now through the workflows, and have a couple of commits.
Also note while you guys need to decide that. If I skim the rest, I think it's very verbose and pretty sure this is not necessary
Another point:
Can we remove the comments that Codspeed is creating? Similar to Codcov? In my opinion, it's enough to just get the status checks
| name: Benchmark smoke | ||
|
|
||
| # Builds every spec and fires every phase once under --quick | ||
| # --benchmark-disable: a "did a refactor break a spec?" check, not timing. |
There was a problem hiding this comment.
2 minutes
1 for the actual smoke test, 1 for the notebook
Can be optimized I think.
https://github.com/fluxopt/linopy/actions/runs/27085593476/job/79939326221
| - name: Set up Python 3.12 | ||
| uses: actions/setup-python@v6 | ||
| with: | ||
| python-version: "3.12" |
| # arbitrary PR code — the label is a maintainer-controlled gate, so only apply it | ||
| # to trusted (same-repo) PRs. | ||
| # | ||
| # Requires the repo under a GitHub org (macro runners are org-only) with the |
| # Requires the repo under a GitHub org (macro runners are org-only) with the | ||
| # CodSpeed app connected to the repo (OIDC auth — no token secret needed). | ||
|
|
||
| on: |
There was a problem hiding this comment.
Can we merge both codspeed files?
| pull_request: | ||
| types: [ labeled, synchronize ] | ||
| branches: [ master ] | ||
| workflow_dispatch: |
There was a problem hiding this comment.
I guess push master + workflow_dispatch is enough
| if: >- | ||
| ${{ github.event_name != 'pull_request' || | ||
| contains(github.event.pull_request.labels.*.name, 'trigger:benchmark') }} | ||
| runs-on: codspeed-macro |
There was a problem hiding this comment.
What are costs here. Looks great running memory on github runners and clocktime on codspeed runners, can you point me to more resources? Specially if we wanna use this across multiple repos. How long does this run and how much is for free?
There was a problem hiding this comment.
Walltime (macro) is free up to 600 min/month
Cost after that is 0,013 €/min
But I'm not sure if we can go over the 600 minutes without a paid plan...
Everything else can run on a girhub runner and is free
| "polars==1.35.2", | ||
| "dask==2025.11.0", | ||
| "pytest==9.0.3", | ||
| "pytest-benchmark==5.2.3", |
| # stable while still surfacing upstream perf changes per-PR with | ||
| # eyes-open review. Loose ``[project.dependencies]`` (numpy, scipy, ...) | ||
| # have no version specifier so Dependabot leaves them alone — only the | ||
| # ``==`` pins in ``[benchmarks]`` produce PRs. |
There was a problem hiding this comment.
Pinning like you did in pyproject.toml doesn't give us full pins, because the deps of deps aren't pinned. Also, I'm not sure we need these per dependency PRs just to get codspeed stats on them. The user will use the newest versions anyway, and it's very unlikely that we'd pin deps for the user just because of codspeed.
So I'd just fully remove this and the benchmark pins in pyproject.toml. For PRs, we probably want to just commit uv.lock and use that one in all PRs (for tests and benchmarks), and on master we always resolve from scratch. But all CI runs need to be updated then. I guess this can rather be done in another PR, and here we just ignore dependency pinning
There was a problem hiding this comment.
I decided against a uv lock file to be able to sweep across linopy versions. As this is arguably not needed for ci and regression tracking, I'm fine with removing this and maybe even commuting a lock file instead. But lockfile usage needs to stay optional to make sweeping possible.
Another reason to add those here was to catch indirect improvements in upstream repos (xarray etc), as a dependabot or would also run when trying to bump the deps.
But we could probably simplify this.
|
The PR is very large and will be hard to maintain. I know it does not touch the production code but I would like to have things a bit leaner.
Note AI-generated (Claude) analysis, updated to fold in the CodSpeed CLI capabilities. Posted as input for the plan above; decisions and replies are the maintainers'. On (1) + (2) — what "CI first, CLI later" maps to in the tree. Both CodSpeed jobs run only On (3) —
On (4) — the registry makes "fewer tracked cases" a data edit, not a rewrite. Tracked count = specs × phases × sizes. The builders you want to keep (merge, sparse KVL, rolling, …) are one entry each in Component breakdown — CI baseline vs. ship-later, with disposition
Counts approximate ( |
|
I think If we go all in on Godspeed we could trim a lot of things from this pr. If you rather keep it lean, feel free to drop all the local dev stuff. But note that this will mean that we will probably leverage ci/cod speed more during dev. |
Let me think about it once more, this is my first intuition. but I let me try the local benchmarking first and then we decide together |
|
I tried the local benchmarking to try out different approaches for memory reduction, and having a local memory benchmark felt really valuable to me (pytest doesn't have one). |
|
I think splitting the registry and patterns etc from the utility suite is the right call. Working on it here: fluxopt#34 |
I worked on this a bit on my fork. The results can be seen on this PR on my fork, where i started to do some performance work: fluxopt#25
Its meant to be used during development, as well as in ci.
local dev:
ci (codspeed):
runs a memory benchmark on a suite of models and modeling patterns and compares it to the master baseline. will catch changes/regression of memory usage.
On master or by adding the "trigger:benchmark" label: Runs walltime on a dedicated codspeed runner with great consistency
To activate CodSpeed upstream: a maintainer needs to connect the repo to CodSpeed (install the CodSpeed app on the PyPSA org/repo) — that's the one required step; the workflows already authenticate via OIDC, no token secret. The
trigger:benchmarklabel is created. The walltime job needs CodSpeed macro runners enabled on the org (or dropcodspeed-macro.ymlto keep just the free per-PR memory job).Notebook walkthrough see walkthrough.md
Rough ci times:
Some setup quirks documented in fluxopt#27. Probably irrelevant since now
Note
This description was drafted by AI (Claude); adjust/trim per AGENTS.md. The code is accumulated fork work.
Adds an internal performance-benchmark suite for linopy (build → matrix gen → LP / netCDF export → solver handoff), wired to CodSpeed CI, plus a few correctness fixes surfaced while building it.
Changes proposed in this Pull Request
Benchmark suite (
benchmarks/)A registry-driven harness measuring time and peak memory across linopy's phases.
REGISTRY) — whole models swept over sizen; patterns (PATTERNS) — realistic modelling idioms swept over a 0–100severitydial (how a data shape going dense propagates through build/export).python -m benchmarks):run/sweep(across linopy versions, viauv) with--metric {time,memory,both};compare/plotauto-detect the metric from the snapshot; plussmoke,list/show/filter,notebook.SIZES/QUICK_SIZES/LONG_SIZES):--quickper-PR subset, default =SIZES − LONG_SIZES,--long= all. A singleskip_reasongates pytest and the memray engine identically.log₂(fold)colour scale (symmetric about 1×) with a--clipclamp and round fold-change labels; scatter / compare / scaling views; loaders fail cleanly on malformed snapshots.walkthrough.md, run in CI to prevent doc-rot) + README.CI
CodSpeed instrumentation: memory (heap allocations) on every PR + master push; walltime (macro runner) on master push and
trigger:benchmark-labelled PRs; OIDC auth (no token secret). The simulation/cachegrind instrument is deliberately left out — instruction-count precision is overkill here and burns CI time; memory + walltime cover what matters. We can add it if we dont mid the ci time (only master?)Library fixes
linopy/alignment.py: tolerate xarray builds withoutCoordinateValidationError; read tuple coords entries as xarray's(dim_name, values)rather than a bare sequence.linopy/expressions.py: group by non-dimension coordinate names; faster multi-key groupby path.Checklist
AGENTS.md).benchmarks/walkthrough.md)._tests/).doc/release_notes.rstof the upcoming release is included.