bench: Add internal performance benchmark suite + CodSpeed CI by FBumann · Pull Request #771 · PyPSA/linopy

FBumann · 2026-06-07T20:21:15Z

I worked on this a bit on my fork. The results can be seen on this PR on my fork, where i started to do some performance work: fluxopt#25

Its meant to be used during development, as well as in ci.

local dev:

provides a cli tool to benchmark both walltime and memory usage.
Can be used to comapre the whole package, or focuse on a single method.
Includes tools for quick evaluation

ci (codspeed):

runs a memory benchmark on a suite of models and modeling patterns and compares it to the master baseline. will catch changes/regression of memory usage.
On master or by adding the "trigger:benchmark" label: Runs walltime on a dedicated codspeed runner with great consistency

To activate CodSpeed upstream: a maintainer needs to connect the repo to CodSpeed (install the CodSpeed app on the PyPSA org/repo) — that's the one required step; the workflows already authenticate via OIDC, no token secret. The trigger:benchmark label is created. The walltime job needs CodSpeed macro runners enabled on the org (or drop codspeed-macro.yml to keep just the free per-PR memory job).

Notebook walkthrough see walkthrough.md

Rough ci times:

memory: 3 minutes
walltime: 10 minutes
(instrumentation: 20+ minutes - deliberatly not added, see below)

Some setup quirks documented in fluxopt#27. Probably irrelevant since now

Note

This description was drafted by AI (Claude); adjust/trim per AGENTS.md. The code is accumulated fork work.

Adds an internal performance-benchmark suite for linopy (build → matrix gen → LP / netCDF export → solver handoff), wired to CodSpeed CI, plus a few correctness fixes surfaced while building it.

Changes proposed in this Pull Request

Benchmark suite (`benchmarks/`)

A registry-driven harness measuring time and peak memory across linopy's phases.

Models (REGISTRY) — whole models swept over size n; patterns (PATTERNS) — realistic modelling idioms swept over a 0–100 severity dial (how a data shape going dense propagates through build/export).
One CLI (python -m benchmarks): run / sweep (across linopy versions, via uv) with --metric {time,memory,both}; compare / plot auto-detect the metric from the snapshot; plus smoke, list/show/filter, notebook.
Explicit size tiers per spec (SIZES / QUICK_SIZES / LONG_SIZES): --quick per-PR subset, default = SIZES − LONG_SIZES, --long = all. A single skip_reason gates pytest and the memray engine identically.
Plotting — sweep heatmap on a log₂(fold) colour scale (symmetric about 1×) with a --clip clamp and round fold-change labels; scatter / compare / scaling views; loaders fail cleanly on malformed snapshots.
Executable walkthrough (walkthrough.md, run in CI to prevent doc-rot) + README.

CI

CodSpeed instrumentation: memory (heap allocations) on every PR + master push; walltime (macro runner) on master push and trigger:benchmark-labelled PRs; OIDC auth (no token secret). The simulation/cachegrind instrument is deliberately left out — instruction-count precision is overkill here and burns CI time; memory + walltime cover what matters. We can add it if we dont mid the ci time (only master?)

Library fixes

linopy/alignment.py: tolerate xarray builds without CoordinateValidationError; read tuple coords entries as xarray's (dim_name, values) rather than a bare sequence.
linopy/expressions.py: group by non-dimension coordinate names; faster multi-key groupby path.
Tests for both.

Checklist

AI-generated content is marked (this description; see AGENTS.md).
Code changes are sufficiently documented (docstrings + benchmarks/walkthrough.md).
Unit tests for new features were added (alignment/groupby; benchmark harness _tests/).
A note for doc/release_notes.rst of the upcoming release is included.
I consent to the release of this PR's code under the MIT license.

…smoke Refactors the internal benchmark suite around a reusable ModelSpec / REGISTRY pattern so adding a model is one self-registering file with metadata (features, applicable phases, sizes, optional deps). Other tests and scripts can import it via `from benchmarks import REGISTRY`. New model specs cover gaps in the existing coverage: - milp: general (non-binary) integers (capacitated facility location) - qp: continuous quadratic objective (diagonal portfolio) - sos: SOS1 multi-mode generation (Model.add_sos_constraints) - piecewise: piecewise-linear fuel cost (Model.add_piecewise_formulation) - masked: sparse-route transportation using mask= on add_variables SOS and piecewise specs gate their own registration on API availability, so the suite stays runnable on older linopy. New phase tests: - test_solver_handoff.py: parametrizes lp.io.to_highspy/to_gurobipy/ to_mosek/to_xpress across applicable models, skipping per-solver when the solver isn't installed. Uses stable lp.io wrappers (not the new Solver.from_name API) for backward compatibility. - test_netcdf.py: separate to_netcdf / read_netcdf benchmarks. CI: new benchmark-smoke.yml runs the suite under --quick --benchmark-disable on PRs, so refactors that break a model spec get caught early. Timings stay off CI (~35s smoke locally, no regression tracking). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The default ``pytest benchmarks/`` run now skips the slowest 1-2 sizes per spec (e.g. knapsack at 1M, basic at 1600, pypsa_scigrid at >50) so a full timing pass completes in ~2 minutes instead of 20-45. ModelSpec grows a ``long_threshold`` mirror of ``quick_threshold``: - ``--quick`` → ``size <= quick_threshold`` (CI smoke) - default → ``size <= long_threshold`` (medium-cost regression) - ``--long`` → no cap (full sweep) Verified locally: - --quick: 227 passed / 230 skipped / 35s - default: 333 passed / 124 skipped / 45s - --long : 457 passed / 0 skipped / 2m13s Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Drop pypsa_scigrid from --quick entirely (quick_threshold=0). PyPSA import + example loading dominates the smoke wall-clock; the model still runs in default and --long modes. - Lower every other spec's quick_threshold to its smallest size, so --quick exercises one size per model across all phases. The default tier (which uses long_threshold) still gives broad regression coverage. Verified locally: - --quick: 85 passed / 372 skipped / 18.5s (was 35s) - default: 333 passed / 124 skipped / 44.8s (unchanged) - --long : 457 passed / 0 skipped / 2m11s (unchanged) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

benchmarks/notebooks/registry_usage.py is the canonical walkthrough for the model registry. Authored in jupytext percent format so it triples as: - runnable Python script (CI executes it on every PR) - notebook in JupyterLab / VSCode with the jupytext extension - readable doc on GitHub (markdown cells render directly) Covers: import, lookup by name, iterate, filter_by feature/phase, parametrize-your-own-pytest pattern, one-off tracemalloc profiling, and the three CLI tiers. CI: benchmark-smoke.yml gains an "Execute registry-usage notebook" step right after the pytest smoke — so doc rot fails the build instead of hiding until someone next opens the file. README: new "Worked walkthrough" subsection points at the notebook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Replace jupytext-style ``registry_usage.py`` with a proper ``registry_usage.ipynb`` — matches the repo convention (examples/*.ipynb, nbsphinx, nbstripout). CI executes it via ``jupyter nbconvert --execute``. - Add ``__repr__`` (one-line summary) and ``_repr_html_`` (attribute table) to ModelSpec. Visible in pytest -v output, in interactive Python, and as rich HTML in Jupyter cells. - Notebook simplified to lean on the new reprs: explicit-attribute prints in sections 2-5 replaced by bare expression evaluations. - README points at the .ipynb and notes the "launch jupyter from repo root" convention. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds ``python -m benchmarks <command>`` with typer subcommands: - list / show / filter — introspect the registry - smoke — pytest --quick --benchmark-disable (CI) - run [--long --phase --model --filter --json] — pytest --benchmark-only with knobs - notebook — execute the registry-usage notebook - memory save/compare — replaces the argparse main in memory.py Modern typer style throughout: Annotated[...] for every parameter, Literal[...] for the --phase choice, function docstrings for command help. ``--help`` is auto-generated and is the source of truth — README and the notebook just point at it instead of duplicating the menu. CI smoke now calls ``python -m benchmarks smoke`` and ``python -m benchmarks notebook``. memory.py keeps its save/compare functions but loses the argparse layer. typer added to the [benchmarks] extra. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two cleanups after checking typer's docs: - Pin typer to the latest release (==0.26.2) in the [benchmarks] extra, so the CLI's behaviour is reproducible across dev / CI / contributor machines. - Switch ``smoke`` and ``run`` from the ``extra: list[str]`` argument to the idiomatic ``typer.Context`` + ``context_settings`` pattern (allow_extra_args, ignore_unknown_options). With the old style, any trailing ``--flag`` would be parsed as an unknown option and rejected; with ctx.args, ``python -m benchmarks run --long -- --tb=short -x`` actually works. Other patterns already match typer's recommended style: Annotated[...], Literal for choice params, docstrings for command help, sub-apps via add_typer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two layers of pinning for stable measurement: - ``[benchmarks]`` extra in pyproject pins the test infra exactly (pytest, pytest-benchmark, pytest-memray, pypsa, highspy, netcdf4, nbconvert, typer). Loose enough that the sweep workflow can install varying linopy versions on top. - ``benchmarks/requirements.lock`` is the full transitive resolution (numpy, scipy, pandas, xarray, plus everything else). Generated via ``uv pip compile --no-emit-package linopy`` so the lockfile pins the *environment around linopy* without pinning linopy itself — that lets the same lockfile work for both current-tip regression runs and cross-version sweeps. README clarifies that the lockfile gives consistency over time on the same machine, not absolute reproducibility across machines (CPU / cache / memory bandwidth still matter). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

``python -m benchmarks sweep 0.5.0 0.6.0 0.7.0`` builds a fresh venv per version with uv, installs the benchmark infra (lockfile by default, or the [benchmarks] pinned subset with --no-use-lock) plus the target linopy in a single resolution pass, and runs the suite. Snapshots land in ``<output-dir>/linopy-<version>.json``. Useful for bootstrapping a perf history against published linopy releases. The current benchmark code runs against each linopy version (constant measurement layer); the ``_API_AVAILABLE`` gates on sos / piecewise specs make older linopy versions skip those phases gracefully. Verified locally: ``sweep 0.7.0 --quick --no-use-lock`` runs end-to-end in ~2 min (uv installs 57 packages in 200ms; the rest is the benchmark run). Plain releases (0.4.0) and pip specs (git+https://...) both work via the ``_linopy_install_spec`` helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The README previously duplicated content from three sources: - the notebook (models table, registry-usage code blocks) - ``--help`` (quick-reference command list) - a stale memory.py invocation (since replaced by ``memory save/compare``) After the consolidation each surface has a clear single job: - README: 1-paragraph what, setup (uv sync / lockfile), size-tier table (architectural), pointers to the notebook + ``--help``, metrics blurb. - ``notebooks/registry_usage.ipynb``: the walkthrough — registry import, lookup / iterate / filter, parametrize your own pytest, profiling. - ``python -m benchmarks --help``: command reference, autogenerated by typer from docstrings / Annotated[..., Option(...)] declarations. Drops ~140 lines from the README; nothing actually disappears — it just lives in the one place that owns it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pypsa removed from the [benchmarks] pinned set, from the sweep ``--no-use-lock`` install list, and from the lockfile. The ``test_pypsa_carbon_management.py`` module uses ``pytest.importorskip`` so collection no longer fails without pypsa; ``pypsa_scigrid`` already had ``requires=("pypsa",)`` so its phase tests skip gracefully. Install pypsa separately when you want those benchmarks. Notebook (registry_usage.ipynb) rewritten as a proper operator guide: - Architecture overview + per-phase measurement table up front. - Registry walkthrough (lookup / iterate / filter) kept as the spine. - Reuse patterns (parametrize-your-own-pytest, tracemalloc spot check). - ``Running`` section now embeds ``--help`` output live via a ``show_help()`` helper that shells out to ``python -m benchmarks ... --help``. The doc stays in sync with the typer implementation automatically — change a flag in cli.py, re-run the notebook, documentation updates. - New sections cover timing snapshots, memory snapshots, the cross-version sweep, and lockfile regeneration. README gains an explicit "pypsa is optional" note in setup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rough Mirrors ``run``'s filter knobs and applies them to every version's pytest invocation. Also switches to the ``typer.Context`` + ``context_settings`` pattern so trailing args after ``--`` are forwarded to pytest verbatim (same shape ``smoke`` / ``run`` use). python -m benchmarks sweep 0.6.7 --phase build --model basic python -m benchmarks sweep 0.6.7 -- --tb=short -x Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

``python -m benchmarks compare a.json b.json [-- --columns=...]`` shells out to ``pytest-benchmark compare`` so the whole suite stays under one entry point. Accepts any number of snapshots; first is the baseline. When called with no arguments — or with paths that don't exist — it prints a copy-paste-ready list of snapshots found under ``.benchmarks/`` (including ``.benchmarks/sweep/`` for cross-version runs). If nothing's saved yet, points at the ``run --json`` flow. For memory snapshots use ``memory compare`` instead — different format, different tool. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-paste) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ame) pytest-benchmark's own default emits 10 columns side-by-side, which is unreadable for any non-trivial comparison. Wrapper now prepends ``--columns=median,iqr --sort=name`` so the table is two stats wide and the (baseline, candidate) pair of each test sits together alphabetically. Defaults are only applied when the user hasn't already set the flag, so trailing pass-through overrides still work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…arg split Two fixes for the ``compare`` UX surfaced by the cross-version sweep: - Default to ``--group-by=fullname`` so each test gets its own mini table showing (baseline, candidate) side-by-side with the parenthesized auto-ratio per column. Easy to scan ``(>1.10)`` for regressions in the median column. Combined with the existing ``--columns=median,iqr --sort=name`` defaults, the output goes from 10-columns-wide-on-one-line to a focused two-column per-test view. - Switch ``compare`` away from a positional ``list[Path]`` argument and parse ``ctx.args`` by hand: typer's positional list was greedily grabbing trailing ``--group-by=fullname`` etc. (and the ``--`` separator didn't escape it either). Now arg-splitting is explicit: anything starting with ``-`` is pytest-benchmark pass-through, everything else is a snapshot path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tried switching to the canonical typer pattern (``--`` separator for pass-through) but typer's positional ``list[Path]`` + ``allow_extra_args`` still greedily ate the trailing options. There's no clean typer/click idiom for "list-typed positional + pass-through" — workarounds are manual splitting, bounding the positional count, or named flags. Manual splitting is the most pragmatic: snapshots come first, once we see any flag-like token the rest is forwarded to pytest-benchmark. That preserves things like ``--histogram=/tmp/hist/cmp`` (built-in SVG-per-test plotting), ``--csv=out.csv``, ``--group-by=fullname``, and the value-taking flags whose value doesn't start with ``-``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three opinionated interactive HTML views over pytest-benchmark JSONs, auto-picked from snapshot count or set explicitly via ``--view``: - **compare** (2 snapshots) — horizontal bar chart of per-test median delta, sorted by magnitude, green→red colormap. The "did this PR regress anything?" picture in one glance, vs pytest-benchmark's 60-individual-SVGs which are useless for that workflow. - **sweep** (3+ snapshots) — heatmap of median ratio relative to the first snapshot, rows = tests, columns = labels. Pairs with the ``sweep`` subcommand. - **scaling** (1 snapshot) — log-log median vs ``n`` for size-parametrized tests (e.g. ``[basic-n=10..1600]``), faceted by phase. Shows whether linopy's complexity scales as expected. plotly==6.7.0 pinned in [benchmarks]; lockfile regenerated. plotly is lazy-imported inside ``plot`` so the rest of the suite stays usable without it (with a clear error if a user tries ``plot`` and it's missing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…er absolutes - New ``benchmarks/plotting.py`` module owns the three views (``plot_compare`` / ``plot_sweep`` / ``plot_scaling``) plus a ``RENDERERS`` dispatch dict. cli.py drops ~140 lines and just imports ``PlotView`` + ``RENDERERS``; plotly is still lazy-loaded inside the view functions so importing the module without plotly works. - ``compare`` bar chart and ``sweep`` heatmap now use ``text_auto`` so values render inside each bar / cell. - Hover info upgraded: - compare hover shows the per-test median of *both* snapshots (formatted to 4 significant figures) in addition to the delta %. - sweep hover shows the absolute median (s) alongside the ratio, via a customdata + hovertemplate plumbed through ``update_traces``. scaling view already shows the absolute median on hover by virtue of being a line chart. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ride For microbenchmarks the lowest observed time is closest to the "true" cost — background noise (GC, context switches, thermal throttling) can only slow things down. pytest-benchmark's own ``--sort`` default is ``min`` for the same reason; LLVM's perf guide, JMH, Google Benchmark and Alexandrescu's "Speed is found in the minds of people" all argue similarly. Changes: - ``plot`` defaults to ``--metric min`` (was median). Accepts ``--metric median|mean|max`` to override. The metric drives the bar values, heatmap ratios, scaling-curve y-axis, and the hover labels. - ``plot_compare`` / ``plot_sweep`` / ``plot_scaling`` in ``benchmarks/plotting.py`` all take a ``metric: Metric = "min"`` arg. - ``compare`` table defaults to ``--columns=min,iqr --sort=min`` (was median,iqr / name). The auto-ratios next to each ``min`` flag regressions in the same readable form. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

For a suite where test costs span six orders of magnitude (knapsack microsecond builds vs PyPSA carbon at 2.4 s), sorting by % delta overweights cheap tests — a 100% regression on a 1µs test ranks above a 1% regression on a 2s test, but the absolute impacts are 1µs vs 24ms. Changes: - Default sort is now ``absolute`` (``b - a`` in seconds). Bar values are the time delta with SI-prefix formatting on the x-axis (24 ms, 240 µs, etc.). Big actual-time impacts float to the bottom. - ``--sort relative`` keeps the old percent behaviour. - Both ``delta_abs`` and ``delta_pct`` are surfaced in hover regardless of which one drives the sort, so you can read off whichever lens. - ``plot_sweep`` / ``plot_scaling`` accept a ``sort`` arg for uniform signature but ignore it (no two-snapshot diff there). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The compare bar chart forces a choice between sorting by relative % or absolute Δ. Both have blind spots: pure-relative makes microbenchmark noise look catastrophic, pure-absolute hides real algorithmic regressions on fast paths. The two-axis scatter resolves the tension visually. Per test: - x = baseline time (log scale) - y = candidate / baseline ratio - colour = absolute Δ A point is a real regression worth chasing only when it sits in the top-right — slow tests that got slower. Top-left (high ratio, tiny absolute) reads as microbenchmark noise; bottom-right (high absolute, ratio ≈ 1) was already slow and didn't change. A dashed reference line at ``y=1`` makes "no change" trivial to see. The view is auto-picked for nothing (compare wins for 2 snapshots); pass ``--view scatter`` explicitly to get it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The two-axis scatter now scales beyond a single baseline-vs-candidate pair. With N>=3 inputs the first is still the baseline (reference); each subsequent snapshot becomes one animation frame. Use the slider / play button to scrub through versions and watch tests drift across releases. Implementation: - First snapshot is the baseline. Skipped from the frame set (would trivially be y=1 everywhere). - Each subsequent snapshot contributes points at (baseline_time, ratio, Δ) per overlapping test. ``animation_frame="version"`` does the per-frame slicing; ``category_orders`` preserves input order in the slider so the timeline reads left→right. - ``range_x`` / ``range_y`` are pinned to the global min/max so the camera doesn't jump between frames. - 2 inputs still produces a static scatter (no animation overhead). Considered ``facet_col`` but it gets cramped past ~4 versions — the slider scales to arbitrary length. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… |Δ| Two small but high-value tweaks for the multi-snapshot scatter: - The baseline snapshot now contributes its own animation frame where every point sits at ratio=1, Δ=0. Gives the animation a "before anything happened" anchor: hit play and watch points drift from the baseline horizon outward. Previously the first frame was the first candidate, which made the visual feel as if it started mid-story. - ``range_color`` is pinned to the 95th-percentile absolute Δ (±p95). One huge outlier no longer drags the colour scale and flattens everyone else to white; outliers saturate at the bound, the rest of the distribution stays readable. Colour-bar label notes ``Δ (s, p95-clipped)`` so the convention is explicit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The "no change" line sits at y=1, but with asymmetric data (e.g. some 2x regression, no symmetric speedup), it landed near the bottom of the visible range and improvements got squeezed near the floor. Now: ``max_dist = max(|1 - y_lo|, |y_hi - 1|)`` and ``range_y = [1 - max_dist, 1 + max_dist]``. Pure min/max coverage (no clipping) but the window is symmetric around 1.0, so regressions above and improvements below are equally readable regardless of the data skew. The colour scale keeps the p95-clipped centred-at-0 treatment from the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…h warn Three safe fixes from a code review of benchmarks/plotting.py: - Row-height multiplier 14 → 22 in plot_compare and plot_sweep, with the floor bumped from 400 to 500. At 25+ tests the y-axis labels were colliding; now they breathe. - plot_scaling reads ``params.size`` (the cleanly-stored int from parametrize) and only falls back to the id regex if absent. The ``model`` name still needs the regex because pytest-benchmark serializes our ModelSpec as ``UNSERIALIZABLE[ModelSpec(...)]``, so a full params switch isn't possible here — but the size path is now robust to test-id rename. - plot_compare surfaces the mismatch between snapshots: prints a stderr line with the test counts only in A / only in B / common, and embeds the same as a subtitle in the figure. Silent intersection was the worst-case footgun. Skipped (per review note): the default-view swap for 3+ snapshots (sweep → scatter) is a judgement call left for the user. Default output filename change (clobber on each run) also skipped — they want to decide whether per-view filenames are worth the API change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…view>.html Two coupled changes setting up the notebook-embedding path: - ``plot_compare`` / ``plot_scatter`` / ``plot_sweep`` / ``plot_scaling`` in ``benchmarks/plotting.py`` now return a ``plotly.graph_objects.Figure`` instead of writing to disk + returning a count. The CLI does the ``fig.write_html(output)`` step. ``benchmarks.plotting.n_points(fig)`` is exported as a helper so the CLI still emits a "N points → path" status line. This unblocks rendering plots directly in jupyter — call ``plot_compare(...)`` and Jupyter's display hook renders the Figure inline. - Default ``-o`` for ``plot`` is now ``.benchmarks/plots/<view>.html`` (was ``benchmark-plot.html`` in cwd). Matches where snapshots already land (and is gitignored), and the per-view filename means consecutive runs of different views don't clobber each other. Bonus: two ``numpy_array or fallback`` bugs in scatter (``df.abs().max() or 1e-9``) and the new ``n_points`` helper (``trace.x or trace.z``) — both triggered ``ValueError: The truth value of an array with more than one element is ambiguous``. Replaced with explicit ``is None`` checks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ospection The ``n_points(fig)`` helper added in the previous commit walked ``fig.data`` traces and called ``len(trace.x)`` to recover the test count. That's backwards — the count is sitting right there in the source DataFrame at render time, no need to reach into the rendered plot. Renderers now return ``tuple[Figure, int]`` directly. ``len(df)`` for compare / sweep / scaling; ``df["test"].nunique()`` for scatter (rows are per-(test, version) so the raw len double-counts). n_points helper dropped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two refinements to the end-to-end plotting section: - tqdm wraps the subprocess loop that generates the two snapshots. Each ``--quick --phase build`` run takes ~10 s; tqdm makes the ~20 s wait visible. ``tqdm.auto`` auto-picks the notebook widget vs console bar based on context. - Plots are now rendered via ``python -m benchmarks plot --view <name>`` rather than direct ``plot_compare`` / ``plot_scatter`` imports. A small ``cli_plot(view, snapshots)`` helper runs the subprocess, reads the generated HTML, and inlines it via ``IPython.display.HTML``. Demonstrates the actual user-facing CLI path inside the notebook rather than the internal API. Notebook end-to-end runtime: ~37 s (~33 s for the run loop + plotting overhead). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both CodSpeed jobs now use CodSpeedHQ/action@v4 with id-token:write and no CODSPEED_TOKEN — nothing to rotate or leak. Matches the canonical config. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Tracks peak memory + allocations per-PR-to-master — the instrument that reflects the sparsity/memory work, which simulation and walltime miss. Starts on a standard runner; first run settles whether Python memory mode needs the bare-metal macro runner. Non-gating, OIDC auth. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CODSPEED_MODULES in conftest is the single source of truth for what the --codspeed instruments measure: build, matrices, lp_write (file IO) and solver_handoff (direct in-memory IO). test_netcdf is deselected under --codspeed (slow + noisy under walltime) but still runs under `benchmarks smoke`. Workflows are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Keeps the sparse(0)/dense(100) contrast plus a midpoint, cuts pattern items ~40% across every instrument. Finer characterization via per-spec severities. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…s from CodSpeed Phase taxonomy now names the directional operation instead of mixed verbs: - lp_write -> to_lp (test_lp_write.py -> test_to_lp.py) - netcdf -> to_netcdf / from_netcdf (one phase split into the two directions; test_netcdf.py functions test_to_netcdf / test_from_netcdf) - solver_handoff -> to_solver (test_solver_handoff.py -> test_to_solver.py; 'to_solver' avoids reading as Solver.solve()) build / matrices stay un-prefixed (lifecycle, not exports); the per-solver to_highspy/... tags were already in this scheme. Also drops test_matrices from CODSPEED_MODULES — matrix-gen is a subset of the direct to_solver handoff (both warm the same label_index cache), so CodSpeed keeps full coverage via build + to_lp + to_solver. Propagated through registry, __init__, memory.py (phase ids + CLI tags), cli/_base.py (PhaseName + test-file map), sos.py, docs. Renames benchmark ids, so CodSpeed re-baselines those benchmarks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Cachegrind (simulation) is ~hundreds× native and under-weights sparse/native work, so the per-PR simulation job now trims to the cheap phases (build, to_lp). to_solver keeps per-PR coverage via the memory instrument and master coverage via walltime. Each CodSpeed workflow declares its instrument with --codspeed-set {simulation,memory,walltime}; conftest maps it to a module subset via config.getoption (our own option — no pytest-codspeed internals, no silent mode-sniffing). Definition stays in Python (CODSPEED_SETS). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

walltime/memory both map to the full set (the default), so passing --codspeed-set there was a no-op. Only the cachegrind job opts into the trimmed set; memory/walltime revert to plain pytest --codspeed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The walkthrough is a CLI demo, but its run/memory-save cells swept the whole registry (16 models + patterns) twice each — 'run --quick --phase build' alone was ~23s x2. Scope them to '--filter basic' (~3s each): same workflow demonstrated and still executed in CI, ~50s less work. Drop the filter to sweep everything. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…gated full run Separate per-instrument CI jobs overwrote each other: CodSpeed records one run per (commit, environment), so a later upload to the same environment supersedes the earlier one and its benchmarks show as 'skipped' (memory kept vanishing behind the slower simulation upload). Fix: - codspeed-memory.yml: per-PR memory only (solo upload, fast, no overwrite). - codspeed.yml: simulation+memory in ONE run (mode: simulation,memory) on master push + manual dispatch + opt-in PR via the 'trigger:benchmark' label (keeps PR context so CodSpeed comments; label auto-removed after the run). - codspeed-macro.yml: walltime adds the same label-gated PR path. Drops the --codspeed-set per-instrument subset machinery (incompatible with the one-upload-per-environment model); CODSPEED_MODULES stays as the eligibility guard. Runtime is now traded via cadence (per-PR = memory only), not subsets. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ting (#28) * refactor(benchmarks): lean up — trim docstrings/comments, dedupe plotting Behavior-preserving readability pass: - plotting.py: extract _metric_label (4x) and _diverging_kwargs (3x); cut comments that restate code, keep the plotly-quirk ones; tighten docstrings. - memory/bench/registry/snapshot: trim the 21-34 line module docstrings to essentials (keep the non-obvious why); fix stale lp_write -> to_lp refs. - sweep.py: trim restate comments (orchestration left as-is — no test coverage). No logic/API/CLI change; -84 lines. Verified: mypy clean, smoke 172 passed, _tests 14 passed, walkthrough notebook renders all plots. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci(codspeed): run memory before simulation in the combined run mode: memory,simulation — memory (fast, always-on) is measured first, so its baseline lands even if the slow cachegrind pass is interrupted. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…aseline (#29) Instruction counts under-weight the numpy/scipy/sparse work this fork targets, and the cachegrind pass was ~11 min — by far the slowest. This project doesn't do micro-CPU tuning, so its deterministic-sub-1%-CPU value isn't worth it. Dropping it also removes the whole reason codspeed.yml existed (memory+simulation shared one run to avoid same-environment overwrite). Now: - codspeed-memory.yml: memory on master + every PR + dispatch (always-on). - codspeed-macro.yml: walltime on master + dispatch + labeled PRs (unchanged). - codspeed.yml: deleted. memory (ubuntu) and walltime (macro) are separate environments, so no clash. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(benchmarks): log2 sweep colouring + --clip clamp (shared symmetric p95) The sweep heatmap coloured by raw ratio on plotly's linear scale, so a 2x and its mirror 1/2x looked asymmetric. Colour by log2(ratio) instead — folds symmetric around 1x, with a fold-change colourbar (1/8x...8x). Add --clip to override the colour clamp (a fold-change >1 for sweep, an absolute delta for scatter) over a new shared _symmetric_clip(magnitudes, override) helper that defaults to the symmetric p95 of the data, reused by both views. numpy promoted to a module-level import. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(benchmarks): log-scale ratio axis in scatter; symmetric p95 colour in compare Same fix as the sweep view, applied across the others: - scatter: the ratio y-axis was linear, so a 2x and its mirror 1/2x read asymmetrically (they even centred it on 1.0 *linearly*). Make it log_y so folds are symmetric about 1.0; window symmetric in log space; drop non-positive ratios (a log axis can't show a 0). - compare: clamp the bar colour with the shared symmetric p95 (consistency; the bar *length* still shows the full delta). - scaling already log-scales size; left as is. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(benchmarks): make --clip a uniform fold-change on the ratio dimension Was inconsistent: a fold for sweep but an absolute Δ for scatter, and compare ignored it. Now --clip is always a fold-change (>1, default symmetric p95) that bounds the *ratio* dimension wherever a view has one: - sweep: the ratio colour (±log2) - scatter: the ratio y-axis ([1/clip, clip]) — moved off the colour, which reverts to the auto symmetric-p95 Δ clamp - compare / scaling: no ratio axis → ignored Validation is now uniform (fold-change > 1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(benchmarks): --clip clamps only the colour scale (log or linear per plot) Colour is the one thing you can't adjust after the plot is drawn (axes zoom). So --clip targets the colour only, and its unit follows the plot's colour scale: - sweep (colour = log2 ratio): a fold-change (>1) - scatter / compare (colour = absolute Δ): a linear Δ bound - scaling: no diverging colour → ignored Default stays the symmetric p95. Axes are full-range and zoomable — scatter's y-axis no longer p95-clips (which hid outlier points). Validation is per-scale (fold > 1 for sweep; any positive for the linear ones). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(benchmarks): make --sort actually sort compare bars; add --order for inputs - compare: --sort was misleading — bars were hardcoded alphabetical (sort_values('test_id')) while --sort only switched the dimension. Now it sorts by the chosen Δ (delta_abs/delta_pct): biggest regressions on top, improvements at the bottom. The name/help are finally truthful. - plot --order {input,version}: default 'input' preserves the order you pass (the plot never re-sorts); 'version' sorts inputs by parsed linopy-<ver>, fixing a glob's string order (0.3.10 before 0.3.2) for release-history sweeps. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(benchmarks): add plot --reverse to flip snapshot order Applied after --order, so e.g. --order version --reverse = newest-first (which also makes the newest snapshot the sweep baseline / compare 'a'). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * revert(benchmarks): drop plot --order/--reverse plot serves arbitrary snapshots (bench labels, baseline.json, …), so parsing linopy-<ver> from filenames is a leaky abstraction — --order version is meaningless for non-version snapshots. plot already preserves input order, so callers control the axis by the order they pass. The --sort fix (compare bars sort by Δ, not alphabetically) stays — that was a real bug. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(benchmarks): --quick as explicit per-spec subset + manual axis selection Replace the quick_threshold size cap with quick_subset — the first/middle/last of each spec's sweep — so --quick runs three representative points instead of a threshold side-effect. This fixes two latent bugs the threshold caused: - models ran only their smallest size under --quick (most quick_threshold=10), so the per-PR memory signal was dominated by sub-MiB benchmarks; - patterns silently skipped severity=100 (quick_threshold=50), so the densest, peak-memory regime never ran in CI. Both now follow one rule (first/mid/last); patterns get all of (0, 50, 100). pypsa_scigrid opts out via quick_sizes=(). Add manual --size / --severity pytest options (repeatable) that override the tier flags per axis, so CI can pin exact values. Widen DEFAULT_SEVERITIES to (0, 25, 50, 75, 100) for finer default/--long/sweep resolution; --quick still distills to (0, 50, 100), so CodSpeed isn't rebaselined. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(benchmarks): sweep --metric {time,memory,both} + compare auto-detect Make the metric (time vs memory) a flag on the workflow rather than a separate command tree — the first half of unifying the time/memory CLI split. - Add a Measure enum (time|memory|both) and `--metric` to `sweep`, dispatching to the timing or memory engine (or both, sequentially — memray overhead would skew wall-clock if run concurrently). `both` writes per-metric subdirs so the two linopy-<version>.json sets don't collide. Default snapshot dirs are now .benchmarks/{time,memory}/ (timing default moved from .benchmarks/sweep). - `compare` auto-detects memory snapshots (peak_mib key) and diffs them with a peak-RSS table; timing still goes through pytest-benchmark. Mixing errors out. Adds memory.compare_snapshots (path-based; compare() now delegates to it). The `memory` sub-app still works; Stage 2 retires it and adds `run --metric` + --long/--size/--severity parity on the memory engine. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(benchmarks): retire memory sub-app — `run --metric` + unified gating Finish folding the time/memory split into a single command surface. Memory is now a --metric flag everywhere, not a parallel `memory` sub-app. - `run --metric {time,memory,both}` replaces `memory save`: prints results, saves only with --json (label = filename stem) — one rule for both metrics. Adds --quick/--size/--severity/--repeats so flags match across metrics. - Delete the `memory` sub-app (cli/memory.py, memory_app); `memory sweep` → `sweep --metric memory`, `memory compare` → `compare` (auto-detect, Stage 1). - Unify size/severity selection in registry.skip_reason(), shared by conftest.maybe_skip and memory.run_phase so the two engines can't drift. Threads --long/--size/--severity through the memory worker; the id-alignment test still passes. - Split memory.save() into measure() (returns the dict) + save() (writes), so `run --metric memory` can measure-then-maybe-write. - run_memory_sweep now invokes `run --metric memory --json <abs>` and writes straight to output_dir (no save-to-default-then-move dance); gains --long. - Update the walkthrough (CI-executed, re-runs clean) and benchmarks/README. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(benchmarks): fail cleanly on malformed/unreadable snapshot files compare crashed with a raw JSONDecodeError traceback (and plot printed a cryptic "Expecting value:" with no filename) when handed an empty or malformed snapshot. load_snapshot now raises a clear, file-named ValueError on unreadable JSON or an unrecognized shape; compare routes its metric detection through load_snapshot too, so both commands report e.g. /tmp/x.json: not a readable JSON snapshot (Expecting value: line 1 ...) The mixed memory/timing guard (compare's any/all, plot's _check_same_unit) was already correct and is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…mall --clip (#32) The fold colourbar placed ticks at integer log2 steps (…, 1/2×, 1×, 2×, …). For a tight range — a small `--clip` like 1.5 (bound = log2 1.5 ≈ 0.585), or a naturally tight p95 default — those integer steps fall *outside* [-bound, bound], so only `1×` rendered. Pick ticks that fit the range and read as round folds: clean powers of two for a wide range, round sub-2× folds (1.5×, 1.25×, 1.1×) for a tight one, with an even-spaced fallback when nothing round fits. `--clip 1.5` now shows 1/1.5× … 1.5× instead of a lone 1×. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…thresholds) (#33) Replace the derived quick subset and the long_threshold number with explicit size-tier constants at the top of each model file, so what runs in each tier is visible at a glance: SIZES = (10, 50, 100, 250, 500, 1000, 1600) # --long QUICK_SIZES = (10, 250) # --quick (per-PR) LONG_SIZES = (1000, 1600) # --long only Tiers: --quick → quick_sizes; default → sizes minus long_sizes; --long → all. LONG_SIZES is the sizes above each old long_threshold, so default/--long behave exactly as before. QUICK_SIZES is the lean per-PR pair — the giant-drop that fixes the 14-min CodSpeed memory job, now spelled out rather than derived. Patterns expose the same shape via the QUICK_SEVERITIES (0, 50, 100) constant and long_sizes=() (every severity runs by default). skip_reason / introspect / help text updated; long_threshold removed. Supersedes the derivation-based giant-drop. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

FabianHofmann · 2026-06-08T11:09:49Z

@FBumann codspeed has now all required permissions. I would like to pull this in before the release 0.8 so we have a clear starting point for the benchmark in upcoming versions

codspeed-hq · 2026-06-08T11:10:31Z

Congrats! CodSpeed is installed 🎉

🆕 138 new benchmarks were detected.

You will start to see performance impacts in the reports once the benchmarks are run from your default branch.

Detected benchmarks

test_to_solver[highs-milp-n=10] (Memory): 98.6 KB
test_to_lp[piecewise-n=1000] (Memory): 162.4 KB
test_to_solver[highs-milp-n=50] (Memory): 802.1 KB
test_to_solver[gurobi-basic-n=250] (Memory): 69.1 MB
test_to_solver[gurobi-rolling-severity=100] (Memory): 203.6 MB
test_to_solver[gurobi-knapsack-n=100] (Memory): 336.3 KB
test_to_solver[gurobi-milp-n=10] (Memory): 336.3 KB
test_to_lp[milp-n=10] (Memory): 4.7 KB
test_to_solver[highs-kvl_cycles-severity=50] (Memory): 245.6 MB
test_to_solver[highs-rolling-severity=50] (Memory): 161.6 MB
test_to_solver[highs-rolling-severity=0] (Memory): 3.2 MB
test_build[merge_balance-severity=0] (Memory): 794.9 KB
test_to_solver[gurobi-sparse_network-n=10] (Memory): 336.3 KB
test_to_solver[highs-knapsack-n=100] (Memory): 95.7 KB
test_build[storage-n=10] (Memory): 478.6 KB
test_to_lp[sparse_network-n=10] (Memory): 29.8 KB
test_to_lp[sparse_network-n=250] (Memory): 34.5 MB
test_to_lp[sos-n=1000] (Memory): 90.7 KB
test_to_lp[merge_balance-severity=0] (Memory): 367.8 KB
test_to_solver[gurobi-masked-n=100] (Memory): 925.2 KB
...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

lkstrp

Hey @FBumann,
codspeed is great stuff! Thanks, I would wanna use this for other repos as well. I only wen now through the workflows, and have a couple of commits.

Also note while you guys need to decide that. If I skim the rest, I think it's very verbose and pretty sure this is not necessary

Another point:
Can we remove the comments that Codspeed is creating? Similar to Codcov? In my opinion, it's enough to just get the status checks

lkstrp · 2026-06-08T11:05:56Z

+name: Benchmark smoke
+
+# Builds every spec and fires every phase once under --quick
+# --benchmark-disable: a "did a refactor break a spec?" check, not timing.


How long does this run?

2 minutes
1 for the actual smoke test, 1 for the notebook

Can be optimized I think.

https://github.com/fluxopt/linopy/actions/runs/27085593476/job/79939326221

lkstrp · 2026-06-08T11:05:58Z

+    - name: Set up Python 3.12
+      uses: actions/setup-python@v6
+      with:
+        python-version: "3.12"


Switch to 3.13?

lkstrp · 2026-06-08T11:06:05Z

+# arbitrary PR code — the label is a maintainer-controlled gate, so only apply it
+# to trusted (same-repo) PRs.
+#
+# Requires the repo under a GitHub org (macro runners are org-only) with the


Please remove the noise

lkstrp · 2026-06-08T11:06:08Z

+# Requires the repo under a GitHub org (macro runners are org-only) with the
+# CodSpeed app connected to the repo (OIDC auth — no token secret needed).
+
+on:


Can we merge both codspeed files?

lkstrp · 2026-06-08T11:06:10Z

+  pull_request:
+    types: [ labeled, synchronize ]
+    branches: [ master ]
+  workflow_dispatch:


I guess push master + workflow_dispatch is enough

lkstrp · 2026-06-08T11:06:11Z

+    if: >-
+      ${{ github.event_name != 'pull_request' ||
+          contains(github.event.pull_request.labels.*.name, 'trigger:benchmark') }}
+    runs-on: codspeed-macro


What are costs here. Looks great running memory on github runners and clocktime on codspeed runners, can you point me to more resources? Specially if we wanna use this across multiple repos. How long does this run and how much is for free?

Walltime (macro) is free up to 600 min/month

Cost after that is 0,013 €/min

But I'm not sure if we can go over the 600 minutes without a paid plan...

Everything else can run on a girhub runner and is free

https://codspeed.io/pricing

lkstrp · 2026-06-08T11:07:04Z

+    "polars==1.35.2",
+    "dask==2025.11.0",
+    "pytest==9.0.3",
+    "pytest-benchmark==5.2.3",


See comment above

lkstrp · 2026-06-08T11:15:25Z

+# stable while still surfacing upstream perf changes per-PR with
+# eyes-open review. Loose ``[project.dependencies]`` (numpy, scipy, ...)
+# have no version specifier so Dependabot leaves them alone — only the
+# ``==`` pins in ``[benchmarks]`` produce PRs.


Pinning like you did in pyproject.toml doesn't give us full pins, because the deps of deps aren't pinned. Also, I'm not sure we need these per dependency PRs just to get codspeed stats on them. The user will use the newest versions anyway, and it's very unlikely that we'd pin deps for the user just because of codspeed.

So I'd just fully remove this and the benchmark pins in pyproject.toml. For PRs, we probably want to just commit uv.lock and use that one in all PRs (for tests and benchmarks), and on master we always resolve from scratch. But all CI runs need to be updated then. I guess this can rather be done in another PR, and here we just ignore dependency pinning

I decided against a uv lock file to be able to sweep across linopy versions. As this is arguably not needed for ci and regression tracking, I'm fine with removing this and maybe even commuting a lock file instead. But lockfile usage needs to stay optional to make sweeping possible.
Another reason to add those here was to catch indirect improvements in upstream repos (xarray etc), as a dependabot or would also run when trying to bump the deps.
But we could probably simplify this.

FabianHofmann · 2026-06-08T11:31:25Z

The PR is very large and will be hard to maintain. I know it does not touch the production code but I would like to have things a bit leaner.

Ship Codspeed CI first
Ship local CLI later
Trim down the code a bit, remove comments, think about resusing pytest built-ins where possible, DRY
Reduce the number of test cases. I like the coverage of the individual functionality (merge, sparse KVL and so on), I just think it is too much to really track. Let's think about how to make that more useable

Note

AI-generated (Claude) analysis, updated to fold in the CodSpeed CLI capabilities. Posted as input for the plan above; decisions and replies are the maintainers'.

On (1) + (2) — what "CI first, CLI later" maps to in the tree. Both CodSpeed jobs run only pytest benchmarks/ --quick --codspeed, so the CI baseline is registry → phases.py → test_*.py + conftest.py (~1,900 lines). The local-only analytics stack — memray engine, plotting, cross-version sweep, snapshot/compare, ad-hoc bench (~55%) — is exactly the "ship later" layer; none of it runs in CI today. Disposition mapped in the table below.

On (3) — pytest-codspeed built-ins replace a chunk of the local stack. The pinned plugin (5.0.3) has native --codspeed-mode {walltime,memory}, so the local dev story can be "run the same command CI runs, locally":

Local memory profiling → pytest benchmarks/ -k <name> --codspeed --codspeed-mode memory. This makes memory.py (511 lines, the memray engine) redundant — no bespoke memory harness to maintain.
Ad-hoc local timing → --codspeed-mode walltime, covering most of what bench.py (346 lines) exists for.
What the CLI/plugin does not provide: a local A/B compare or local plots — those live on the hosted app. So sweep.py / cli/compare.py / plotting.py have no built-in equivalent. But the hosted app already tracks version-over-version on master, so the only genuinely uncovered case is a one-off "linopy X vs Y" study → a dev-scripts/ script, not merged infra. (Caveat: the docs don't confirm a fully offline / no-token local mode — worth a quick check before relying on it.)
DRY quick win: DEFAULT_PHASES is an identical longhand copy of ALL_PHASES (registry.py) and collapses to one line.

On (4) — the registry makes "fewer tracked cases" a data edit, not a rewrite. Tracked count = specs × phases × sizes. The builders you want to keep (merge, sparse KVL, rolling, …) are one entry each in patterns/ / models/ and are cheap to keep. The dials already exist (quick_sizes / long_sizes per spec, --quick / --long). So the question is purely what the dashboard tracks by default: e.g. default CodSpeed to one representative size per spec (keeping every builder runnable on demand) so the dashboard shows ~1 trend per functionality instead of 3–6 — full coverage stays a flag away without flooding the regression view.

Component breakdown — CI baseline vs. ship-later, with disposition

Component	~Lines	In CI baseline?	Disposition under the plan
`registry.py` + `models/` + `patterns/`	~1,400	✅	ship now (core)
`phases.py`	86	✅	ship now
`conftest.py` + `test_*.py`	~380	✅	ship now
`cli/run.py` (smoke) + introspect/notebook	~450	✅ (smoke)	ship now, thin
`memory.py` (memray engine)	511	❌	drop — `--codspeed-mode memory` covers it
`bench.py` (ad-hoc timer)	346	❌	mostly dropped — `--codspeed-mode walltime`
`plotting.py`	533	❌	CLI-later; hosted app covers ongoing
`sweep.py` (cross-version via uv)	424	❌	CLI-later / one-off `dev-scripts/`
`snapshot.py` + `cli/compare.py`	333	❌	CLI-later (feeds plotting/compare)
`walkthrough.md` + notebook CI job	361	smoke only	trim with the layer it documents

Counts approximate (wc -l); CI usage from the three workflow files. CodSpeed CLI facts from https://codspeed.io/docs/cli and the pytest-codspeed reference.

FBumann · 2026-06-08T11:37:45Z

I think If we go all in on Godspeed we could trim a lot of things from this pr.
But I think the local benchmarking cli is very valuable during the next few months.

If you rather keep it lean, feel free to drop all the local dev stuff. But note that this will mean that we will probably leverage ci/cod speed more during dev.
If you don't mind the ci minutes, this is ok for me. We can add it back in later if we change minds

FabianHofmann · 2026-06-08T11:40:09Z

I think If we go all in on Godspeed we could trim a lot of things from this pr. But I think the local benchmarking cli is very valuable during the next few months.

If you rather keep it lean, feel free to drop all the local dev stuff. But note that this will mean that we will probably leverage ci/cod speed more during dev. If you don't mind the ci minutes, this is ok for me. We can add it back in later if we change minds

Let me think about it once more, this is my first intuition. but I let me try the local benchmarking first and then we decide together

FBumann · 2026-06-08T11:49:02Z

I tried the local benchmarking to try out different approaches for memory reduction, and having a local memory benchmark felt really valuable to me (pytest doesn't have one).
As memory is arguably more important than time in some cases, this was the main reason to add it

FBumann · 2026-06-08T17:20:25Z

I think splitting the registry and patterns etc from the utility suite is the right call.
I means that linopy only keeps the thinks it needs for codspeed.
And another package, which can be reused in other projects, holds the logic for sewwping, plotting etc!

Working on it here: fluxopt#34

FBumann and others added 30 commits May 27, 2026 23:04

docs: update benchmark readme

8c908af

benchmarks: compare lists snapshots as relative paths (easier to copy…

83bdeda

…-paste) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

FBumann and others added 17 commits June 5, 2026 14:02

ci(codspeed): authenticate via OIDC instead of a token secret

dc8d404

Both CodSpeed jobs now use CodSpeedHQ/action@v4 with id-token:write and no CODSPEED_TOKEN — nothing to rotate or leak. Matches the canonical config. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ci(codspeed): set mode=simulation (required input in action v4)

ce449ce

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

perf(benchmarks): thin default severity sweep 5->3 (0,50,100)

ab61e27

Keeps the sparse(0)/dense(100) contrast plus a midpoint, cuts pattern items ~40% across every instrument. Finer characterization via per-spec severities. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge branch 'PyPSA:master' into master

a64c0f2

FBumann mentioned this pull request Jun 7, 2026

benchmarks: reusable registry, with models, patterns and phases, CI smoke and Codspeed #735

Closed

FBumann changed the title ~~Add internal performance benchmark suite + CodSpeed CI (and alignment/groupby fixes)~~ bench: Add internal performance benchmark suite + CodSpeed CI Jun 7, 2026

pre-commit

85c103c

FBumann requested a review from FabianHofmann June 7, 2026 20:45

FabianHofmann mentioned this pull request Jun 8, 2026

Add CodSpeed continuous benchmarking #772

Closed

lkstrp reviewed Jun 8, 2026

View reviewed changes

Conversation

FBumann commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

local dev:

ci (codspeed):

Changes proposed in this Pull Request

Benchmark suite (benchmarks/)

CI

Library fixes

Checklist

Uh oh!

FabianHofmann commented Jun 8, 2026

Uh oh!

codspeed-hq Bot commented Jun 8, 2026

Congrats! CodSpeed is installed 🎉

Detected benchmarks

Uh oh!

lkstrp left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

FabianHofmann commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

FBumann commented Jun 8, 2026

Uh oh!

FabianHofmann commented Jun 8, 2026

Uh oh!

FBumann commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

FBumann commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

FBumann commented Jun 7, 2026 •

edited

Loading

Benchmark suite (`benchmarks/`)

lkstrp left a comment •

edited

Loading

FabianHofmann commented Jun 8, 2026 •

edited

Loading

FBumann commented Jun 8, 2026 •

edited

Loading

FBumann commented Jun 8, 2026 •

edited

Loading