Skip to content

bench: Add internal performance benchmark suite + CodSpeed CI#771

Open
FBumann wants to merge 113 commits into
PyPSA:masterfrom
fluxopt:master
Open

bench: Add internal performance benchmark suite + CodSpeed CI#771
FBumann wants to merge 113 commits into
PyPSA:masterfrom
fluxopt:master

Conversation

@FBumann
Copy link
Copy Markdown
Collaborator

@FBumann FBumann commented Jun 7, 2026

I worked on this a bit on my fork. The results can be seen on this PR on my fork, where i started to do some performance work: fluxopt#25

Its meant to be used during development, as well as in ci.

local dev:

  • provides a cli tool to benchmark both walltime and memory usage.
  • Can be used to comapre the whole package, or focuse on a single method.
  • Includes tools for quick evaluation

ci (codspeed):

runs a memory benchmark on a suite of models and modeling patterns and compares it to the master baseline. will catch changes/regression of memory usage.
On master or by adding the "trigger:benchmark" label: Runs walltime on a dedicated codspeed runner with great consistency

To activate CodSpeed upstream: a maintainer needs to connect the repo to CodSpeed (install the CodSpeed app on the PyPSA org/repo) — that's the one required step; the workflows already authenticate via OIDC, no token secret. The trigger:benchmark label is created. The walltime job needs CodSpeed macro runners enabled on the org (or drop codspeed-macro.yml to keep just the free per-PR memory job).

Notebook walkthrough see walkthrough.md

Rough ci times:

  • memory: 3 minutes
  • walltime: 10 minutes
  • (instrumentation: 20+ minutes - deliberatly not added, see below)

Some setup quirks documented in fluxopt#27. Probably irrelevant since now

Note

This description was drafted by AI (Claude); adjust/trim per AGENTS.md. The code is accumulated fork work.

Adds an internal performance-benchmark suite for linopy (build → matrix gen → LP / netCDF export → solver handoff), wired to CodSpeed CI, plus a few correctness fixes surfaced while building it.

Changes proposed in this Pull Request

Benchmark suite (benchmarks/)

A registry-driven harness measuring time and peak memory across linopy's phases.

  • Models (REGISTRY) — whole models swept over size n; patterns (PATTERNS) — realistic modelling idioms swept over a 0–100 severity dial (how a data shape going dense propagates through build/export).
  • One CLI (python -m benchmarks): run / sweep (across linopy versions, via uv) with --metric {time,memory,both}; compare / plot auto-detect the metric from the snapshot; plus smoke, list/show/filter, notebook.
  • Explicit size tiers per spec (SIZES / QUICK_SIZES / LONG_SIZES): --quick per-PR subset, default = SIZES − LONG_SIZES, --long = all. A single skip_reason gates pytest and the memray engine identically.
  • Plotting — sweep heatmap on a log₂(fold) colour scale (symmetric about 1×) with a --clip clamp and round fold-change labels; scatter / compare / scaling views; loaders fail cleanly on malformed snapshots.
  • Executable walkthrough (walkthrough.md, run in CI to prevent doc-rot) + README.

CI

CodSpeed instrumentation: memory (heap allocations) on every PR + master push; walltime (macro runner) on master push and trigger:benchmark-labelled PRs; OIDC auth (no token secret). The simulation/cachegrind instrument is deliberately left out — instruction-count precision is overkill here and burns CI time; memory + walltime cover what matters. We can add it if we dont mid the ci time (only master?)

Library fixes

  • linopy/alignment.py: tolerate xarray builds without CoordinateValidationError; read tuple coords entries as xarray's (dim_name, values) rather than a bare sequence.
  • linopy/expressions.py: group by non-dimension coordinate names; faster multi-key groupby path.
  • Tests for both.

Checklist

  • AI-generated content is marked (this description; see AGENTS.md).
  • Code changes are sufficiently documented (docstrings + benchmarks/walkthrough.md).
  • Unit tests for new features were added (alignment/groupby; benchmark harness _tests/).
  • A note for doc/release_notes.rst of the upcoming release is included.
  • I consent to the release of this PR's code under the MIT license.

FBumann and others added 30 commits May 27, 2026 23:04
…smoke

Refactors the internal benchmark suite around a reusable ModelSpec /
REGISTRY pattern so adding a model is one self-registering file with
metadata (features, applicable phases, sizes, optional deps). Other tests
and scripts can import it via `from benchmarks import REGISTRY`.

New model specs cover gaps in the existing coverage:
- milp: general (non-binary) integers (capacitated facility location)
- qp: continuous quadratic objective (diagonal portfolio)
- sos: SOS1 multi-mode generation (Model.add_sos_constraints)
- piecewise: piecewise-linear fuel cost (Model.add_piecewise_formulation)
- masked: sparse-route transportation using mask= on add_variables

SOS and piecewise specs gate their own registration on API availability,
so the suite stays runnable on older linopy.

New phase tests:
- test_solver_handoff.py: parametrizes lp.io.to_highspy/to_gurobipy/
  to_mosek/to_xpress across applicable models, skipping per-solver when
  the solver isn't installed. Uses stable lp.io wrappers (not the new
  Solver.from_name API) for backward compatibility.
- test_netcdf.py: separate to_netcdf / read_netcdf benchmarks.

CI: new benchmark-smoke.yml runs the suite under --quick
--benchmark-disable on PRs, so refactors that break a model spec get
caught early. Timings stay off CI (~35s smoke locally, no regression
tracking).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The default ``pytest benchmarks/`` run now skips the slowest 1-2 sizes per
spec (e.g. knapsack at 1M, basic at 1600, pypsa_scigrid at >50) so a full
timing pass completes in ~2 minutes instead of 20-45.

ModelSpec grows a ``long_threshold`` mirror of ``quick_threshold``:

- ``--quick``  → ``size <= quick_threshold``  (CI smoke)
- default      → ``size <= long_threshold``   (medium-cost regression)
- ``--long``   → no cap                       (full sweep)

Verified locally:
- --quick: 227 passed / 230 skipped / 35s
- default: 333 passed / 124 skipped / 45s
- --long : 457 passed /   0 skipped / 2m13s

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop pypsa_scigrid from --quick entirely (quick_threshold=0). PyPSA
  import + example loading dominates the smoke wall-clock; the model
  still runs in default and --long modes.
- Lower every other spec's quick_threshold to its smallest size, so
  --quick exercises one size per model across all phases. The default
  tier (which uses long_threshold) still gives broad regression coverage.

Verified locally:
- --quick:  85 passed / 372 skipped / 18.5s   (was 35s)
- default: 333 passed / 124 skipped / 44.8s   (unchanged)
- --long : 457 passed /   0 skipped / 2m11s   (unchanged)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
benchmarks/notebooks/registry_usage.py is the canonical walkthrough for
the model registry. Authored in jupytext percent format so it triples
as:

- runnable Python script (CI executes it on every PR)
- notebook in JupyterLab / VSCode with the jupytext extension
- readable doc on GitHub (markdown cells render directly)

Covers: import, lookup by name, iterate, filter_by feature/phase,
parametrize-your-own-pytest pattern, one-off tracemalloc profiling,
and the three CLI tiers.

CI: benchmark-smoke.yml gains an "Execute registry-usage notebook" step
right after the pytest smoke — so doc rot fails the build instead of
hiding until someone next opens the file.

README: new "Worked walkthrough" subsection points at the notebook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Replace jupytext-style ``registry_usage.py`` with a proper
  ``registry_usage.ipynb`` — matches the repo convention (examples/*.ipynb,
  nbsphinx, nbstripout). CI executes it via ``jupyter nbconvert --execute``.
- Add ``__repr__`` (one-line summary) and ``_repr_html_`` (attribute table)
  to ModelSpec. Visible in pytest -v output, in interactive Python, and as
  rich HTML in Jupyter cells.
- Notebook simplified to lean on the new reprs: explicit-attribute prints
  in sections 2-5 replaced by bare expression evaluations.
- README points at the .ipynb and notes the "launch jupyter from repo root"
  convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds ``python -m benchmarks <command>`` with typer subcommands:

- list / show / filter — introspect the registry
- smoke               — pytest --quick --benchmark-disable (CI)
- run [--long --phase --model --filter --json]
                      — pytest --benchmark-only with knobs
- notebook            — execute the registry-usage notebook
- memory save/compare — replaces the argparse main in memory.py

Modern typer style throughout: Annotated[...] for every parameter,
Literal[...] for the --phase choice, function docstrings for command
help. ``--help`` is auto-generated and is the source of truth — README
and the notebook just point at it instead of duplicating the menu.

CI smoke now calls ``python -m benchmarks smoke`` and
``python -m benchmarks notebook``. memory.py keeps its save/compare
functions but loses the argparse layer. typer added to the [benchmarks]
extra.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cleanups after checking typer's docs:

- Pin typer to the latest release (==0.26.2) in the [benchmarks] extra,
  so the CLI's behaviour is reproducible across dev / CI / contributor
  machines.

- Switch ``smoke`` and ``run`` from the ``extra: list[str]`` argument
  to the idiomatic ``typer.Context`` + ``context_settings`` pattern
  (allow_extra_args, ignore_unknown_options). With the old style, any
  trailing ``--flag`` would be parsed as an unknown option and rejected;
  with ctx.args, ``python -m benchmarks run --long -- --tb=short -x``
  actually works.

Other patterns already match typer's recommended style: Annotated[...],
Literal for choice params, docstrings for command help, sub-apps via
add_typer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two layers of pinning for stable measurement:

- ``[benchmarks]`` extra in pyproject pins the test infra exactly
  (pytest, pytest-benchmark, pytest-memray, pypsa, highspy, netcdf4,
  nbconvert, typer). Loose enough that the sweep workflow can install
  varying linopy versions on top.

- ``benchmarks/requirements.lock`` is the full transitive resolution
  (numpy, scipy, pandas, xarray, plus everything else). Generated via
  ``uv pip compile --no-emit-package linopy`` so the lockfile pins the
  *environment around linopy* without pinning linopy itself — that lets
  the same lockfile work for both current-tip regression runs and
  cross-version sweeps.

README clarifies that the lockfile gives consistency over time on the
same machine, not absolute reproducibility across machines (CPU / cache
/ memory bandwidth still matter).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``python -m benchmarks sweep 0.5.0 0.6.0 0.7.0`` builds a fresh venv
per version with uv, installs the benchmark infra (lockfile by default,
or the [benchmarks] pinned subset with --no-use-lock) plus the target
linopy in a single resolution pass, and runs the suite. Snapshots land
in ``<output-dir>/linopy-<version>.json``.

Useful for bootstrapping a perf history against published linopy
releases. The current benchmark code runs against each linopy version
(constant measurement layer); the ``_API_AVAILABLE`` gates on sos /
piecewise specs make older linopy versions skip those phases gracefully.

Verified locally: ``sweep 0.7.0 --quick --no-use-lock`` runs end-to-end
in ~2 min (uv installs 57 packages in 200ms; the rest is the benchmark
run). Plain releases (0.4.0) and pip specs (git+https://...) both work
via the ``_linopy_install_spec`` helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The README previously duplicated content from three sources:
- the notebook (models table, registry-usage code blocks)
- ``--help`` (quick-reference command list)
- a stale memory.py invocation (since replaced by ``memory save/compare``)

After the consolidation each surface has a clear single job:

- README: 1-paragraph what, setup (uv sync / lockfile), size-tier table
  (architectural), pointers to the notebook + ``--help``, metrics blurb.
- ``notebooks/registry_usage.ipynb``: the walkthrough — registry import,
  lookup / iterate / filter, parametrize your own pytest, profiling.
- ``python -m benchmarks --help``: command reference, autogenerated by
  typer from docstrings / Annotated[..., Option(...)] declarations.

Drops ~140 lines from the README; nothing actually disappears — it just
lives in the one place that owns it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pypsa removed from the [benchmarks] pinned set, from the sweep
``--no-use-lock`` install list, and from the lockfile. The
``test_pypsa_carbon_management.py`` module uses ``pytest.importorskip``
so collection no longer fails without pypsa; ``pypsa_scigrid`` already
had ``requires=("pypsa",)`` so its phase tests skip gracefully.
Install pypsa separately when you want those benchmarks.

Notebook (registry_usage.ipynb) rewritten as a proper operator guide:

- Architecture overview + per-phase measurement table up front.
- Registry walkthrough (lookup / iterate / filter) kept as the spine.
- Reuse patterns (parametrize-your-own-pytest, tracemalloc spot check).
- ``Running`` section now embeds ``--help`` output live via a
  ``show_help()`` helper that shells out to ``python -m benchmarks ...
  --help``. The doc stays in sync with the typer implementation
  automatically — change a flag in cli.py, re-run the notebook,
  documentation updates.
- New sections cover timing snapshots, memory snapshots, the
  cross-version sweep, and lockfile regeneration.

README gains an explicit "pypsa is optional" note in setup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rough

Mirrors ``run``'s filter knobs and applies them to every version's
pytest invocation. Also switches to the ``typer.Context`` +
``context_settings`` pattern so trailing args after ``--`` are
forwarded to pytest verbatim (same shape ``smoke`` / ``run`` use).

    python -m benchmarks sweep 0.6.7 --phase build --model basic
    python -m benchmarks sweep 0.6.7 -- --tb=short -x

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``python -m benchmarks compare a.json b.json [-- --columns=...]``
shells out to ``pytest-benchmark compare`` so the whole suite stays
under one entry point. Accepts any number of snapshots; first is the
baseline.

When called with no arguments — or with paths that don't exist — it
prints a copy-paste-ready list of snapshots found under
``.benchmarks/`` (including ``.benchmarks/sweep/`` for cross-version
runs). If nothing's saved yet, points at the ``run --json`` flow.

For memory snapshots use ``memory compare`` instead — different
format, different tool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-paste)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ame)

pytest-benchmark's own default emits 10 columns side-by-side, which is
unreadable for any non-trivial comparison. Wrapper now prepends
``--columns=median,iqr --sort=name`` so the table is two stats wide
and the (baseline, candidate) pair of each test sits together
alphabetically.

Defaults are only applied when the user hasn't already set the flag,
so trailing pass-through overrides still work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arg split

Two fixes for the ``compare`` UX surfaced by the cross-version sweep:

- Default to ``--group-by=fullname`` so each test gets its own mini
  table showing (baseline, candidate) side-by-side with the
  parenthesized auto-ratio per column. Easy to scan ``(>1.10)`` for
  regressions in the median column. Combined with the existing
  ``--columns=median,iqr --sort=name`` defaults, the output goes from
  10-columns-wide-on-one-line to a focused two-column per-test view.

- Switch ``compare`` away from a positional ``list[Path]`` argument and
  parse ``ctx.args`` by hand: typer's positional list was greedily
  grabbing trailing ``--group-by=fullname`` etc. (and the ``--``
  separator didn't escape it either). Now arg-splitting is explicit:
  anything starting with ``-`` is pytest-benchmark pass-through,
  everything else is a snapshot path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tried switching to the canonical typer pattern (``--`` separator for
pass-through) but typer's positional ``list[Path]`` + ``allow_extra_args``
still greedily ate the trailing options. There's no clean typer/click
idiom for "list-typed positional + pass-through" — workarounds are
manual splitting, bounding the positional count, or named flags.

Manual splitting is the most pragmatic: snapshots come first, once we
see any flag-like token the rest is forwarded to pytest-benchmark.
That preserves things like ``--histogram=/tmp/hist/cmp`` (built-in
SVG-per-test plotting), ``--csv=out.csv``, ``--group-by=fullname``,
and the value-taking flags whose value doesn't start with ``-``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three opinionated interactive HTML views over pytest-benchmark JSONs,
auto-picked from snapshot count or set explicitly via ``--view``:

- **compare** (2 snapshots) — horizontal bar chart of per-test median
  delta, sorted by magnitude, green→red colormap. The "did this PR
  regress anything?" picture in one glance, vs pytest-benchmark's
  60-individual-SVGs which are useless for that workflow.
- **sweep** (3+ snapshots) — heatmap of median ratio relative to the
  first snapshot, rows = tests, columns = labels. Pairs with the
  ``sweep`` subcommand.
- **scaling** (1 snapshot) — log-log median vs ``n`` for
  size-parametrized tests (e.g. ``[basic-n=10..1600]``), faceted by
  phase. Shows whether linopy's complexity scales as expected.

plotly==6.7.0 pinned in [benchmarks]; lockfile regenerated. plotly is
lazy-imported inside ``plot`` so the rest of the suite stays usable
without it (with a clear error if a user tries ``plot`` and it's
missing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er absolutes

- New ``benchmarks/plotting.py`` module owns the three views
  (``plot_compare`` / ``plot_sweep`` / ``plot_scaling``) plus a
  ``RENDERERS`` dispatch dict. cli.py drops ~140 lines and just imports
  ``PlotView`` + ``RENDERERS``; plotly is still lazy-loaded inside the
  view functions so importing the module without plotly works.

- ``compare`` bar chart and ``sweep`` heatmap now use ``text_auto``
  so values render inside each bar / cell.

- Hover info upgraded:
  - compare hover shows the per-test median of *both* snapshots
    (formatted to 4 significant figures) in addition to the delta %.
  - sweep hover shows the absolute median (s) alongside the ratio, via
    a customdata + hovertemplate plumbed through ``update_traces``.

scaling view already shows the absolute median on hover by virtue of
being a line chart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ride

For microbenchmarks the lowest observed time is closest to the "true"
cost — background noise (GC, context switches, thermal throttling) can
only slow things down. pytest-benchmark's own ``--sort`` default is
``min`` for the same reason; LLVM's perf guide, JMH, Google Benchmark
and Alexandrescu's "Speed is found in the minds of people" all argue
similarly.

Changes:

- ``plot`` defaults to ``--metric min`` (was median). Accepts
  ``--metric median|mean|max`` to override. The metric drives the bar
  values, heatmap ratios, scaling-curve y-axis, and the hover labels.
- ``plot_compare`` / ``plot_sweep`` / ``plot_scaling`` in
  ``benchmarks/plotting.py`` all take a ``metric: Metric = "min"`` arg.
- ``compare`` table defaults to ``--columns=min,iqr --sort=min`` (was
  median,iqr / name). The auto-ratios next to each ``min`` flag
  regressions in the same readable form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For a suite where test costs span six orders of magnitude (knapsack
microsecond builds vs PyPSA carbon at 2.4 s), sorting by % delta
overweights cheap tests — a 100% regression on a 1µs test ranks above
a 1% regression on a 2s test, but the absolute impacts are 1µs vs
24ms.

Changes:

- Default sort is now ``absolute`` (``b - a`` in seconds). Bar values
  are the time delta with SI-prefix formatting on the x-axis (24 ms,
  240 µs, etc.). Big actual-time impacts float to the bottom.
- ``--sort relative`` keeps the old percent behaviour.
- Both ``delta_abs`` and ``delta_pct`` are surfaced in hover regardless
  of which one drives the sort, so you can read off whichever lens.
- ``plot_sweep`` / ``plot_scaling`` accept a ``sort`` arg for uniform
  signature but ignore it (no two-snapshot diff there).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The compare bar chart forces a choice between sorting by relative %
or absolute Δ. Both have blind spots: pure-relative makes microbenchmark
noise look catastrophic, pure-absolute hides real algorithmic regressions
on fast paths.

The two-axis scatter resolves the tension visually. Per test:

- x = baseline time (log scale)
- y = candidate / baseline ratio
- colour = absolute Δ

A point is a real regression worth chasing only when it sits in the
top-right — slow tests that got slower. Top-left (high ratio, tiny
absolute) reads as microbenchmark noise; bottom-right (high absolute,
ratio ≈ 1) was already slow and didn't change. A dashed reference line
at ``y=1`` makes "no change" trivial to see.

The view is auto-picked for nothing (compare wins for 2 snapshots);
pass ``--view scatter`` explicitly to get it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two-axis scatter now scales beyond a single baseline-vs-candidate
pair. With N>=3 inputs the first is still the baseline (reference); each
subsequent snapshot becomes one animation frame. Use the slider / play
button to scrub through versions and watch tests drift across releases.

Implementation:

- First snapshot is the baseline. Skipped from the frame set (would
  trivially be y=1 everywhere).
- Each subsequent snapshot contributes points at (baseline_time,
  ratio, Δ) per overlapping test. ``animation_frame="version"`` does
  the per-frame slicing; ``category_orders`` preserves input order in
  the slider so the timeline reads left→right.
- ``range_x`` / ``range_y`` are pinned to the global min/max so the
  camera doesn't jump between frames.
- 2 inputs still produces a static scatter (no animation overhead).

Considered ``facet_col`` but it gets cramped past ~4 versions — the
slider scales to arbitrary length.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… |Δ|

Two small but high-value tweaks for the multi-snapshot scatter:

- The baseline snapshot now contributes its own animation frame where
  every point sits at ratio=1, Δ=0. Gives the animation a "before
  anything happened" anchor: hit play and watch points drift from the
  baseline horizon outward. Previously the first frame was the first
  candidate, which made the visual feel as if it started mid-story.

- ``range_color`` is pinned to the 95th-percentile absolute Δ
  (±p95). One huge outlier no longer drags the colour scale and
  flattens everyone else to white; outliers saturate at the bound,
  the rest of the distribution stays readable. Colour-bar label notes
  ``Δ (s, p95-clipped)`` so the convention is explicit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "no change" line sits at y=1, but with asymmetric data (e.g. some
2x regression, no symmetric speedup), it landed near the bottom of the
visible range and improvements got squeezed near the floor.

Now: ``max_dist = max(|1 - y_lo|, |y_hi - 1|)`` and ``range_y = [1 -
max_dist, 1 + max_dist]``. Pure min/max coverage (no clipping) but the
window is symmetric around 1.0, so regressions above and improvements
below are equally readable regardless of the data skew.

The colour scale keeps the p95-clipped centred-at-0 treatment from the
previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h warn

Three safe fixes from a code review of benchmarks/plotting.py:

- Row-height multiplier 14 → 22 in plot_compare and plot_sweep, with
  the floor bumped from 400 to 500. At 25+ tests the y-axis labels
  were colliding; now they breathe.
- plot_scaling reads ``params.size`` (the cleanly-stored int from
  parametrize) and only falls back to the id regex if absent. The
  ``model`` name still needs the regex because pytest-benchmark
  serializes our ModelSpec as ``UNSERIALIZABLE[ModelSpec(...)]``, so a
  full params switch isn't possible here — but the size path is now
  robust to test-id rename.
- plot_compare surfaces the mismatch between snapshots: prints a
  stderr line with the test counts only in A / only in B / common,
  and embeds the same as a subtitle in the figure. Silent intersection
  was the worst-case footgun.

Skipped (per review note): the default-view swap for 3+ snapshots
(sweep → scatter) is a judgement call left for the user. Default
output filename change (clobber on each run) also skipped — they want
to decide whether per-view filenames are worth the API change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…view>.html

Two coupled changes setting up the notebook-embedding path:

- ``plot_compare`` / ``plot_scatter`` / ``plot_sweep`` / ``plot_scaling``
  in ``benchmarks/plotting.py`` now return a ``plotly.graph_objects.Figure``
  instead of writing to disk + returning a count. The CLI does the
  ``fig.write_html(output)`` step. ``benchmarks.plotting.n_points(fig)``
  is exported as a helper so the CLI still emits a "N points → path"
  status line.

  This unblocks rendering plots directly in jupyter — call
  ``plot_compare(...)`` and Jupyter's display hook renders the Figure
  inline.

- Default ``-o`` for ``plot`` is now ``.benchmarks/plots/<view>.html``
  (was ``benchmark-plot.html`` in cwd). Matches where snapshots already
  land (and is gitignored), and the per-view filename means consecutive
  runs of different views don't clobber each other.

Bonus: two ``numpy_array or fallback`` bugs in scatter (``df.abs().max()
or 1e-9``) and the new ``n_points`` helper (``trace.x or trace.z``) —
both triggered ``ValueError: The truth value of an array with more
than one element is ambiguous``. Replaced with explicit ``is None``
checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ospection

The ``n_points(fig)`` helper added in the previous commit walked
``fig.data`` traces and called ``len(trace.x)`` to recover the test
count. That's backwards — the count is sitting right there in the
source DataFrame at render time, no need to reach into the rendered
plot.

Renderers now return ``tuple[Figure, int]`` directly. ``len(df)`` for
compare / sweep / scaling; ``df["test"].nunique()`` for scatter
(rows are per-(test, version) so the raw len double-counts).

n_points helper dropped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two refinements to the end-to-end plotting section:

- tqdm wraps the subprocess loop that generates the two snapshots.
  Each ``--quick --phase build`` run takes ~10 s; tqdm makes the
  ~20 s wait visible. ``tqdm.auto`` auto-picks the notebook widget
  vs console bar based on context.

- Plots are now rendered via ``python -m benchmarks plot --view <name>``
  rather than direct ``plot_compare`` / ``plot_scatter`` imports.
  A small ``cli_plot(view, snapshots)`` helper runs the subprocess,
  reads the generated HTML, and inlines it via ``IPython.display.HTML``.
  Demonstrates the actual user-facing CLI path inside the notebook
  rather than the internal API.

Notebook end-to-end runtime: ~37 s (~33 s for the run loop + plotting
overhead).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FBumann and others added 17 commits June 5, 2026 14:02
Both CodSpeed jobs now use CodSpeedHQ/action@v4 with id-token:write and no
CODSPEED_TOKEN — nothing to rotate or leak. Matches the canonical config.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Tracks peak memory + allocations per-PR-to-master — the instrument that
reflects the sparsity/memory work, which simulation and walltime miss.
Starts on a standard runner; first run settles whether Python memory mode
needs the bare-metal macro runner. Non-gating, OIDC auth.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CODSPEED_MODULES in conftest is the single source of truth for what the
--codspeed instruments measure: build, matrices, lp_write (file IO) and
solver_handoff (direct in-memory IO). test_netcdf is deselected under
--codspeed (slow + noisy under walltime) but still runs under `benchmarks
smoke`. Workflows are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Keeps the sparse(0)/dense(100) contrast plus a midpoint, cuts pattern items
~40% across every instrument. Finer characterization via per-spec severities.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s from CodSpeed

Phase taxonomy now names the directional operation instead of mixed verbs:
- lp_write       -> to_lp        (test_lp_write.py        -> test_to_lp.py)
- netcdf         -> to_netcdf / from_netcdf  (one phase split into the two
  directions; test_netcdf.py functions test_to_netcdf / test_from_netcdf)
- solver_handoff -> to_solver    (test_solver_handoff.py  -> test_to_solver.py;
  'to_solver' avoids reading as Solver.solve())
build / matrices stay un-prefixed (lifecycle, not exports); the per-solver
to_highspy/... tags were already in this scheme.

Also drops test_matrices from CODSPEED_MODULES — matrix-gen is a subset of the
direct to_solver handoff (both warm the same label_index cache), so CodSpeed
keeps full coverage via build + to_lp + to_solver.

Propagated through registry, __init__, memory.py (phase ids + CLI tags),
cli/_base.py (PhaseName + test-file map), sos.py, docs. Renames benchmark ids,
so CodSpeed re-baselines those benchmarks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cachegrind (simulation) is ~hundreds× native and under-weights sparse/native
work, so the per-PR simulation job now trims to the cheap phases (build, to_lp).
to_solver keeps per-PR coverage via the memory instrument and master coverage
via walltime. Each CodSpeed workflow declares its instrument with
--codspeed-set {simulation,memory,walltime}; conftest maps it to a module
subset via config.getoption (our own option — no pytest-codspeed internals, no
silent mode-sniffing). Definition stays in Python (CODSPEED_SETS).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
walltime/memory both map to the full set (the default), so passing
--codspeed-set there was a no-op. Only the cachegrind job opts into the
trimmed set; memory/walltime revert to plain pytest --codspeed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The walkthrough is a CLI demo, but its run/memory-save cells swept the whole
registry (16 models + patterns) twice each — 'run --quick --phase build' alone
was ~23s x2. Scope them to '--filter basic' (~3s each): same workflow
demonstrated and still executed in CI, ~50s less work. Drop the filter to sweep
everything.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gated full run

Separate per-instrument CI jobs overwrote each other: CodSpeed records one run
per (commit, environment), so a later upload to the same environment supersedes
the earlier one and its benchmarks show as 'skipped' (memory kept vanishing
behind the slower simulation upload). Fix:

- codspeed-memory.yml: per-PR memory only (solo upload, fast, no overwrite).
- codspeed.yml: simulation+memory in ONE run (mode: simulation,memory) on master
  push + manual dispatch + opt-in PR via the 'trigger:benchmark' label (keeps PR
  context so CodSpeed comments; label auto-removed after the run).
- codspeed-macro.yml: walltime adds the same label-gated PR path.

Drops the --codspeed-set per-instrument subset machinery (incompatible with the
one-upload-per-environment model); CODSPEED_MODULES stays as the eligibility
guard. Runtime is now traded via cadence (per-PR = memory only), not subsets.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ting (#28)

* refactor(benchmarks): lean up — trim docstrings/comments, dedupe plotting

Behavior-preserving readability pass:
- plotting.py: extract _metric_label (4x) and _diverging_kwargs (3x); cut
  comments that restate code, keep the plotly-quirk ones; tighten docstrings.
- memory/bench/registry/snapshot: trim the 21-34 line module docstrings to
  essentials (keep the non-obvious why); fix stale lp_write -> to_lp refs.
- sweep.py: trim restate comments (orchestration left as-is — no test coverage).

No logic/API/CLI change; -84 lines. Verified: mypy clean, smoke 172 passed,
_tests 14 passed, walkthrough notebook renders all plots.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci(codspeed): run memory before simulation in the combined run

mode: memory,simulation — memory (fast, always-on) is measured first, so its
baseline lands even if the slow cachegrind pass is interrupted.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aseline (#29)

Instruction counts under-weight the numpy/scipy/sparse work this fork targets,
and the cachegrind pass was ~11 min — by far the slowest. This project doesn't
do micro-CPU tuning, so its deterministic-sub-1%-CPU value isn't worth it.

Dropping it also removes the whole reason codspeed.yml existed (memory+simulation
shared one run to avoid same-environment overwrite). Now:
- codspeed-memory.yml: memory on master + every PR + dispatch (always-on).
- codspeed-macro.yml: walltime on master + dispatch + labeled PRs (unchanged).
- codspeed.yml: deleted.
memory (ubuntu) and walltime (macro) are separate environments, so no clash.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(benchmarks): log2 sweep colouring + --clip clamp (shared symmetric p95)

The sweep heatmap coloured by raw ratio on plotly's linear scale, so a 2x and
its mirror 1/2x looked asymmetric. Colour by log2(ratio) instead — folds
symmetric around 1x, with a fold-change colourbar (1/8x...8x).

Add --clip to override the colour clamp (a fold-change >1 for sweep, an absolute
delta for scatter) over a new shared _symmetric_clip(magnitudes, override)
helper that defaults to the symmetric p95 of the data, reused by both views.
numpy promoted to a module-level import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(benchmarks): log-scale ratio axis in scatter; symmetric p95 colour in compare

Same fix as the sweep view, applied across the others:
- scatter: the ratio y-axis was linear, so a 2x and its mirror 1/2x read
  asymmetrically (they even centred it on 1.0 *linearly*). Make it log_y so folds
  are symmetric about 1.0; window symmetric in log space; drop non-positive
  ratios (a log axis can't show a 0).
- compare: clamp the bar colour with the shared symmetric p95 (consistency; the
  bar *length* still shows the full delta).
- scaling already log-scales size; left as is.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(benchmarks): make --clip a uniform fold-change on the ratio dimension

Was inconsistent: a fold for sweep but an absolute Δ for scatter, and compare
ignored it. Now --clip is always a fold-change (>1, default symmetric p95) that
bounds the *ratio* dimension wherever a view has one:
- sweep: the ratio colour (±log2)
- scatter: the ratio y-axis ([1/clip, clip]) — moved off the colour, which
  reverts to the auto symmetric-p95 Δ clamp
- compare / scaling: no ratio axis → ignored

Validation is now uniform (fold-change > 1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(benchmarks): --clip clamps only the colour scale (log or linear per plot)

Colour is the one thing you can't adjust after the plot is drawn (axes zoom).
So --clip targets the colour only, and its unit follows the plot's colour scale:
- sweep (colour = log2 ratio): a fold-change (>1)
- scatter / compare (colour = absolute Δ): a linear Δ bound
- scaling: no diverging colour → ignored
Default stays the symmetric p95. Axes are full-range and zoomable — scatter's
y-axis no longer p95-clips (which hid outlier points). Validation is per-scale
(fold > 1 for sweep; any positive for the linear ones).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(benchmarks): make --sort actually sort compare bars; add --order for inputs

- compare: --sort was misleading — bars were hardcoded alphabetical
  (sort_values('test_id')) while --sort only switched the dimension. Now it
  sorts by the chosen Δ (delta_abs/delta_pct): biggest regressions on top,
  improvements at the bottom. The name/help are finally truthful.
- plot --order {input,version}: default 'input' preserves the order you pass
  (the plot never re-sorts); 'version' sorts inputs by parsed linopy-<ver>,
  fixing a glob's string order (0.3.10 before 0.3.2) for release-history sweeps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(benchmarks): add plot --reverse to flip snapshot order

Applied after --order, so e.g. --order version --reverse = newest-first (which
also makes the newest snapshot the sweep baseline / compare 'a').

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* revert(benchmarks): drop plot --order/--reverse

plot serves arbitrary snapshots (bench labels, baseline.json, …), so parsing
linopy-<ver> from filenames is a leaky abstraction — --order version is
meaningless for non-version snapshots. plot already preserves input order, so
callers control the axis by the order they pass. The --sort fix (compare bars
sort by Δ, not alphabetically) stays — that was a real bug.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(benchmarks): --quick as explicit per-spec subset + manual axis selection

Replace the quick_threshold size cap with quick_subset — the first/middle/last
of each spec's sweep — so --quick runs three representative points instead of a
threshold side-effect. This fixes two latent bugs the threshold caused:

- models ran only their smallest size under --quick (most quick_threshold=10),
  so the per-PR memory signal was dominated by sub-MiB benchmarks;
- patterns silently skipped severity=100 (quick_threshold=50), so the densest,
  peak-memory regime never ran in CI.

Both now follow one rule (first/mid/last); patterns get all of (0, 50, 100).
pypsa_scigrid opts out via quick_sizes=().

Add manual --size / --severity pytest options (repeatable) that override the
tier flags per axis, so CI can pin exact values.

Widen DEFAULT_SEVERITIES to (0, 25, 50, 75, 100) for finer default/--long/sweep
resolution; --quick still distills to (0, 50, 100), so CodSpeed isn't rebaselined.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(benchmarks): sweep --metric {time,memory,both} + compare auto-detect

Make the metric (time vs memory) a flag on the workflow rather than a separate
command tree — the first half of unifying the time/memory CLI split.

- Add a Measure enum (time|memory|both) and `--metric` to `sweep`, dispatching
  to the timing or memory engine (or both, sequentially — memray overhead would
  skew wall-clock if run concurrently). `both` writes per-metric subdirs so the
  two linopy-<version>.json sets don't collide. Default snapshot dirs are now
  .benchmarks/{time,memory}/ (timing default moved from .benchmarks/sweep).
- `compare` auto-detects memory snapshots (peak_mib key) and diffs them with a
  peak-RSS table; timing still goes through pytest-benchmark. Mixing errors out.
  Adds memory.compare_snapshots (path-based; compare() now delegates to it).

The `memory` sub-app still works; Stage 2 retires it and adds `run --metric` +
--long/--size/--severity parity on the memory engine.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(benchmarks): retire memory sub-app — `run --metric` + unified gating

Finish folding the time/memory split into a single command surface. Memory is
now a --metric flag everywhere, not a parallel `memory` sub-app.

- `run --metric {time,memory,both}` replaces `memory save`: prints results,
  saves only with --json (label = filename stem) — one rule for both metrics.
  Adds --quick/--size/--severity/--repeats so flags match across metrics.
- Delete the `memory` sub-app (cli/memory.py, memory_app); `memory sweep` →
  `sweep --metric memory`, `memory compare` → `compare` (auto-detect, Stage 1).
- Unify size/severity selection in registry.skip_reason(), shared by
  conftest.maybe_skip and memory.run_phase so the two engines can't drift.
  Threads --long/--size/--severity through the memory worker; the id-alignment
  test still passes.
- Split memory.save() into measure() (returns the dict) + save() (writes), so
  `run --metric memory` can measure-then-maybe-write.
- run_memory_sweep now invokes `run --metric memory --json <abs>` and writes
  straight to output_dir (no save-to-default-then-move dance); gains --long.
- Update the walkthrough (CI-executed, re-runs clean) and benchmarks/README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(benchmarks): fail cleanly on malformed/unreadable snapshot files

compare crashed with a raw JSONDecodeError traceback (and plot printed a
cryptic "Expecting value:" with no filename) when handed an empty or malformed
snapshot. load_snapshot now raises a clear, file-named ValueError on unreadable
JSON or an unrecognized shape; compare routes its metric detection through
load_snapshot too, so both commands report e.g.

  /tmp/x.json: not a readable JSON snapshot (Expecting value: line 1 ...)

The mixed memory/timing guard (compare's any/all, plot's _check_same_unit) was
already correct and is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mall --clip (#32)

The fold colourbar placed ticks at integer log2 steps (…, 1/2×, 1×, 2×, …). For
a tight range — a small `--clip` like 1.5 (bound = log2 1.5 ≈ 0.585), or a
naturally tight p95 default — those integer steps fall *outside* [-bound, bound],
so only `1×` rendered.

Pick ticks that fit the range and read as round folds: clean powers of two for a
wide range, round sub-2× folds (1.5×, 1.25×, 1.1×) for a tight one, with an
even-spaced fallback when nothing round fits. `--clip 1.5` now shows
1/1.5× … 1.5× instead of a lone 1×.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…thresholds) (#33)

Replace the derived quick subset and the long_threshold number with explicit
size-tier constants at the top of each model file, so what runs in each tier is
visible at a glance:

    SIZES       = (10, 50, 100, 250, 500, 1000, 1600)  # --long
    QUICK_SIZES = (10, 250)                            # --quick (per-PR)
    LONG_SIZES  = (1000, 1600)                          # --long only

Tiers: --quick → quick_sizes; default → sizes minus long_sizes; --long → all.
LONG_SIZES is the sizes above each old long_threshold, so default/--long behave
exactly as before. QUICK_SIZES is the lean per-PR pair — the giant-drop that
fixes the 14-min CodSpeed memory job, now spelled out rather than derived.
Patterns expose the same shape via the QUICK_SEVERITIES (0, 50, 100) constant
and long_sizes=() (every severity runs by default). skip_reason / introspect /
help text updated; long_threshold removed.

Supersedes the derivation-based giant-drop.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@FBumann FBumann changed the title Add internal performance benchmark suite + CodSpeed CI (and alignment/groupby fixes) bench: Add internal performance benchmark suite + CodSpeed CI Jun 7, 2026
@FabianHofmann
Copy link
Copy Markdown
Collaborator

@FBumann codspeed has now all required permissions. I would like to pull this in before the release 0.8 so we have a clear starting point for the benchmark in upcoming versions

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented Jun 8, 2026

Congrats! CodSpeed is installed 🎉

🆕 138 new benchmarks were detected.

You will start to see performance impacts in the reports once the benchmarks are run from your default branch.

Detected benchmarks


ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.


Open in CodSpeed

Copy link
Copy Markdown
Member

@lkstrp lkstrp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @FBumann,
codspeed is great stuff! Thanks, I would wanna use this for other repos as well. I only wen now through the workflows, and have a couple of commits.

Also note while you guys need to decide that. If I skim the rest, I think it's very verbose and pretty sure this is not necessary

Another point:
Can we remove the comments that Codspeed is creating? Similar to Codcov? In my opinion, it's enough to just get the status checks

name: Benchmark smoke

# Builds every spec and fires every phase once under --quick
# --benchmark-disable: a "did a refactor break a spec?" check, not timing.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long does this run?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 minutes
1 for the actual smoke test, 1 for the notebook

Can be optimized I think.

https://github.com/fluxopt/linopy/actions/runs/27085593476/job/79939326221

- name: Set up Python 3.12
uses: actions/setup-python@v6
with:
python-version: "3.12"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switch to 3.13?

# arbitrary PR code — the label is a maintainer-controlled gate, so only apply it
# to trusted (same-repo) PRs.
#
# Requires the repo under a GitHub org (macro runners are org-only) with the
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the noise

# Requires the repo under a GitHub org (macro runners are org-only) with the
# CodSpeed app connected to the repo (OIDC auth — no token secret needed).

on:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we merge both codspeed files?

pull_request:
types: [ labeled, synchronize ]
branches: [ master ]
workflow_dispatch:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess push master + workflow_dispatch is enough

if: >-
${{ github.event_name != 'pull_request' ||
contains(github.event.pull_request.labels.*.name, 'trigger:benchmark') }}
runs-on: codspeed-macro
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are costs here. Looks great running memory on github runners and clocktime on codspeed runners, can you point me to more resources? Specially if we wanna use this across multiple repos. How long does this run and how much is for free?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Walltime (macro) is free up to 600 min/month

Cost after that is 0,013 €/min

But I'm not sure if we can go over the 600 minutes without a paid plan...

Everything else can run on a girhub runner and is free

https://codspeed.io/pricing

Comment thread pyproject.toml
"polars==1.35.2",
"dask==2025.11.0",
"pytest==9.0.3",
"pytest-benchmark==5.2.3",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above

Comment thread .github/dependabot.yml
# stable while still surfacing upstream perf changes per-PR with
# eyes-open review. Loose ``[project.dependencies]`` (numpy, scipy, ...)
# have no version specifier so Dependabot leaves them alone — only the
# ``==`` pins in ``[benchmarks]`` produce PRs.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pinning like you did in pyproject.toml doesn't give us full pins, because the deps of deps aren't pinned. Also, I'm not sure we need these per dependency PRs just to get codspeed stats on them. The user will use the newest versions anyway, and it's very unlikely that we'd pin deps for the user just because of codspeed.

So I'd just fully remove this and the benchmark pins in pyproject.toml. For PRs, we probably want to just commit uv.lock and use that one in all PRs (for tests and benchmarks), and on master we always resolve from scratch. But all CI runs need to be updated then. I guess this can rather be done in another PR, and here we just ignore dependency pinning

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided against a uv lock file to be able to sweep across linopy versions. As this is arguably not needed for ci and regression tracking, I'm fine with removing this and maybe even commuting a lock file instead. But lockfile usage needs to stay optional to make sweeping possible.
Another reason to add those here was to catch indirect improvements in upstream repos (xarray etc), as a dependabot or would also run when trying to bump the deps.
But we could probably simplify this.

@FabianHofmann
Copy link
Copy Markdown
Collaborator

FabianHofmann commented Jun 8, 2026

The PR is very large and will be hard to maintain. I know it does not touch the production code but I would like to have things a bit leaner.

  1. Ship Codspeed CI first
  2. Ship local CLI later
  3. Trim down the code a bit, remove comments, think about resusing pytest built-ins where possible, DRY
  4. Reduce the number of test cases. I like the coverage of the individual functionality (merge, sparse KVL and so on), I just think it is too much to really track. Let's think about how to make that more useable

Note

AI-generated (Claude) analysis, updated to fold in the CodSpeed CLI capabilities. Posted as input for the plan above; decisions and replies are the maintainers'.

On (1) + (2) — what "CI first, CLI later" maps to in the tree. Both CodSpeed jobs run only pytest benchmarks/ --quick --codspeed, so the CI baseline is registry → phases.py → test_*.py + conftest.py (~1,900 lines). The local-only analytics stack — memray engine, plotting, cross-version sweep, snapshot/compare, ad-hoc bench (~55%) — is exactly the "ship later" layer; none of it runs in CI today. Disposition mapped in the table below.

On (3) — pytest-codspeed built-ins replace a chunk of the local stack. The pinned plugin (5.0.3) has native --codspeed-mode {walltime,memory}, so the local dev story can be "run the same command CI runs, locally":

  • Local memory profiling → pytest benchmarks/ -k <name> --codspeed --codspeed-mode memory. This makes memory.py (511 lines, the memray engine) redundant — no bespoke memory harness to maintain.
  • Ad-hoc local timing → --codspeed-mode walltime, covering most of what bench.py (346 lines) exists for.
  • What the CLI/plugin does not provide: a local A/B compare or local plots — those live on the hosted app. So sweep.py / cli/compare.py / plotting.py have no built-in equivalent. But the hosted app already tracks version-over-version on master, so the only genuinely uncovered case is a one-off "linopy X vs Y" study → a dev-scripts/ script, not merged infra. (Caveat: the docs don't confirm a fully offline / no-token local mode — worth a quick check before relying on it.)
  • DRY quick win: DEFAULT_PHASES is an identical longhand copy of ALL_PHASES (registry.py) and collapses to one line.

On (4) — the registry makes "fewer tracked cases" a data edit, not a rewrite. Tracked count = specs × phases × sizes. The builders you want to keep (merge, sparse KVL, rolling, …) are one entry each in patterns/ / models/ and are cheap to keep. The dials already exist (quick_sizes / long_sizes per spec, --quick / --long). So the question is purely what the dashboard tracks by default: e.g. default CodSpeed to one representative size per spec (keeping every builder runnable on demand) so the dashboard shows ~1 trend per functionality instead of 3–6 — full coverage stays a flag away without flooding the regression view.

Component breakdown — CI baseline vs. ship-later, with disposition
Component ~Lines In CI baseline? Disposition under the plan
registry.py + models/ + patterns/ ~1,400 ship now (core)
phases.py 86 ship now
conftest.py + test_*.py ~380 ship now
cli/run.py (smoke) + introspect/notebook ~450 ✅ (smoke) ship now, thin
memory.py (memray engine) 511 drop — --codspeed-mode memory covers it
bench.py (ad-hoc timer) 346 mostly dropped — --codspeed-mode walltime
plotting.py 533 CLI-later; hosted app covers ongoing
sweep.py (cross-version via uv) 424 CLI-later / one-off dev-scripts/
snapshot.py + cli/compare.py 333 CLI-later (feeds plotting/compare)
walkthrough.md + notebook CI job 361 smoke only trim with the layer it documents

Counts approximate (wc -l); CI usage from the three workflow files. CodSpeed CLI facts from https://codspeed.io/docs/cli and the pytest-codspeed reference.

@FBumann
Copy link
Copy Markdown
Collaborator Author

FBumann commented Jun 8, 2026

I think If we go all in on Godspeed we could trim a lot of things from this pr.
But I think the local benchmarking cli is very valuable during the next few months.

If you rather keep it lean, feel free to drop all the local dev stuff. But note that this will mean that we will probably leverage ci/cod speed more during dev.
If you don't mind the ci minutes, this is ok for me. We can add it back in later if we change minds

@FabianHofmann
Copy link
Copy Markdown
Collaborator

I think If we go all in on Godspeed we could trim a lot of things from this pr. But I think the local benchmarking cli is very valuable during the next few months.

If you rather keep it lean, feel free to drop all the local dev stuff. But note that this will mean that we will probably leverage ci/cod speed more during dev. If you don't mind the ci minutes, this is ok for me. We can add it back in later if we change minds

Let me think about it once more, this is my first intuition. but I let me try the local benchmarking first and then we decide together

@FBumann
Copy link
Copy Markdown
Collaborator Author

FBumann commented Jun 8, 2026

I tried the local benchmarking to try out different approaches for memory reduction, and having a local memory benchmark felt really valuable to me (pytest doesn't have one).
As memory is arguably more important than time in some cases, this was the main reason to add it

@FBumann
Copy link
Copy Markdown
Collaborator Author

FBumann commented Jun 8, 2026

I think splitting the registry and patterns etc from the utility suite is the right call.
I means that linopy only keeps the thinks it needs for codspeed.
And another package, which can be reused in other projects, holds the logic for sewwping, plotting etc!

Working on it here: fluxopt#34

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants