Make multi-key LinearExpression.groupby([names]) fast and flat (align with xarray's grouper API) #753

@FBumann

Note

AI-written analysis (Claude Code, prompted by @FBumann), combining a linopy code trace with a dig through PyPSA's groupby usage. The code-level claims and the prototype output below are verified on the #751 branch; the PyPSA-side file references come from that investigation — please sanity-check before acting.

Multi-key grouping of a LinearExpression currently forces a choice: fast (pandas DataFrame grouper → stacked-MultiIndex output) or separate dims per key (list-of-names → slow xarray fallback). This issue makes groupby([names]) fast while keeping its separate-dims output, without breaking the existing DataFrame contract.

What each layer accepts as a grouper

| Grouper | plain xarray | linopy | PyPSA uses it? |
| dimension name: groupby("snapshot") | | ✓ (always worked; #751 moved it onto the fast path) | |
| non-dim coordinate name: groupby("carrier") | | fixed in #751 (was ValueError) + fast | |
| list of names: groupby(["carrier","bus"]) | | ✓ but slow (xarray fallback) | no |
| DataArray: groupby(da) | | ✓ (fast) | ✓ nodal balance (constraints.py, via c._as_xarray) |
| Grouper obj: groupby(x=UniqueGrouper()) | | ✗ unsupported | |
| pandas DataFrame/Series | KeyError | ✓ (fast, stacked output) | ✓ statistics (expressions.py) |

Note: the pandas DataFrame/Series grouper is a linopy-only extension — plain xarray rejects it (KeyError). PyPSA is also internally inconsistent: its constraint/nodal-balance code uses the idiomatic DataArray grouper, while only the statistics layer uses the pandas one.

The gap: fast XOR separate-dims for multi-key

| Spelling | Path | Speed | Output |
| groupby(["period","season"]) (names) | xarray fallback | slow | (period, season) grid (separate dims) |
| groupby(df[["period","season"]]) (DataFrame) | reindex fast path | fast | stacked-MultiIndex group dim |

Root cause in LinearExpressionGroupby.sum: _resolve_group resolves a single name to its coordinate (fast path), but a multi-name list is returned unchanged, fails the non_fallback_types = (Series, DataFrame, DataArray) check, and drops to the fallback. The only multi-key fast path is the pd.DataFrame branch, which emits a stacked pd.MultiIndex.

Proposed direction (non-breaking)

Key the output shape on how the group was supplied — so the one fast path serves both spellings:

  1. groupby([names]) → fast, separate dims. Resolve a list of coord names to a value frame so it takes the reindex fast path:

    if isinstance(group, (list, tuple)) and all(isinstance(g, str) and g in data.coords for g in group):
        group = data[list(group)].to_dataframe()[list(group)]   # tag this as "came from names"

    Then, for the names spelling only, unstack the stacked result back into separate dims. This reproduces today's fallback output exactly — verified on a sparse crossing (combo (2020,'s') absent):

    FALLBACK  list-of-names       : {'period': 2, 'season': 2, '_term': 2}   (slow)
    FAST      DataFrame (stacked) : {'group': 3, ...}  -> unstack('group') -> {'period': 2, 'season': 2, '_term': 2}   (fast)
              missing (2020,'s') cell -> vars [-1, -1], coeffs [nan, nan]    (filled, == fallback)
    
  2. groupby(df) (user-supplied DataFrame) → unchanged. Keep the stacked-MultiIndex group output. The tested contract (test_linear_expression_groupby_with_dataframe, asserting set(group.values) == set(MultiIndex.from_frame(groups).values)) still holds — no deprecation, no breaking change.
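
The stacked-versus-grid relationship in the two items above can be illustrated with plain pandas. This is a minimal sketch, not linopy code: the toy frame `terms` and its `coeff` column are made up for illustration, and pandas fills the absent cell with NaN only, whereas the linopy output quoted above also fills vars with -1.

```python
import pandas as pd

# Toy data (hypothetical): one row per term, two key columns.
# The combo (2020, "s") is deliberately absent, as in the sparse crossing above.
terms = pd.DataFrame(
    {
        "period": [2020, 2030, 2030, 2030],
        "season": ["w", "w", "s", "s"],
        "coeff": [1.0, 2.0, 3.0, 4.0],
    }
)

# Stacked form: one entry per observed (period, season) combo, on a MultiIndex.
stacked = terms.groupby(["period", "season"])["coeff"].sum()

# Separate-dims form: unstack the inner level into its own axis.
# The result is the full period x season grid; the absent combo becomes NaN.
grid = stacked.unstack("season")

print(len(stacked))  # number of observed combos
print(grid.shape)    # size of the dense grid
```

Here the stacked result holds three observed combos while the unstacked grid has four cells, one of them fill. That is the same densification the closing unstack performs on the names path.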

Implementation notes:

  • sum needs to know the frame came from names (a flag from _resolve_group, or handle the list branch directly), so the final unstack applies only to the names spelling.
  • The names path builds a stacked MultiIndex internally and immediately unstacks it — a transient intermediate index, fine.
  • The closing unstack densifies to the full grid — the same shape today's fallback produces, so parity, not a regression.
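
The flag idea from the first note might look roughly like the following, with pandas standing in for the linopy internals. `resolve_group` and `grouped_sum` are hypothetical names used only for this sketch; they are not linopy's `_resolve_group` or `LinearExpressionGroupby.sum`, and `coords` is an assumed stand-in for the expression's coordinates.

```python
import pandas as pd


def resolve_group(coords: pd.DataFrame, group):
    """Hypothetical helper: return (key_frame, from_names)."""
    if isinstance(group, (list, tuple)) and all(
        isinstance(g, str) and g in coords.columns for g in group
    ):
        # Names spelling: build the key frame ourselves and remember that we did.
        return coords[list(group)], True
    if isinstance(group, pd.DataFrame):
        # User-supplied frame: keep the existing stacked contract.
        return group, False
    raise TypeError(f"unsupported grouper: {type(group).__name__}")


def grouped_sum(values: pd.Series, coords: pd.DataFrame, group):
    keys, from_names = resolve_group(coords, group)
    # One shared path for both spellings: group on the key columns.
    stacked = values.groupby([keys[c] for c in keys.columns]).sum()
    if from_names and keys.shape[1] > 1:
        # Names spelling only: unstack every key after the first into its own axis.
        return stacked.unstack(list(keys.columns[1:]))
    return stacked


coords = pd.DataFrame(
    {"period": [2020, 2030, 2030, 2030], "season": ["w", "w", "s", "s"]}
)
values = pd.Series([1.0, 2.0, 3.0, 4.0])

by_names = grouped_sum(values, coords, ["period", "season"])          # dense grid
by_frame = grouped_sum(values, coords, coords[["period", "season"]])  # stacked
```

Both calls share the same grouping step; only the final reshape depends on the flag, so the DataFrame spelling keeps its MultiIndex output.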

Optional follow-up (not required by the above)

With a fast separate-dims list grouping available, PyPSA could migrate its statistics sites to the idiomatic, label-safe form for consistency with its constraint code — but it's no longer forced, since the DataFrame grouper keeps working:

gx = grouping.to_xarray()                          # name-indexed, label-aligned (no positional footgun)
da = da.assign_coords(carrier=gx.carrier, bus=gx.bus)
da.groupby(["carrier", "bus"]).sum()               # now fast, separate dims

Relationship to #751

This builds directly on #751 (which fixes #750). #751 deliberately stops at single-key parity and does not depend on this; nothing here blocks it.

Future directions (sparse representations)

The grid memory cost is fundamental to dense xarray: separate dims are a dense cartesian grid, so a sparse/correlated key crossing materialises mostly-fill cells (measured ~100× vs the DataFrame grouper for a diagonal crossing; see #740). Truly getting separate dims and compact needs a sparse representation. Options, roughly by payoff vs cost:

  1. sparse.COO duck-arrays under xarray — separate dims and observed-only storage. Requires linopy to be sparse-aware end to end (solver export, matmul, merge).
  2. Long-format / polars kernel (#756, the umbrella issue for a long-format / sparse _term kernel) — groupby becomes a relational group_by().agg(), inherently sparse; no grid, no MultiIndex. Cleanest long-term; biggest change.
  3. Integer-code segment-sum into a single observed-combo dim with the keys as plain coords — avoids both the grid and the MultiIndex, but is non-idiomatic (no .sel).
  4. Lazy / fused grouping — keep the compact form and only densify if the result is indexed by label.
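
To make option 3 concrete, here is a minimal NumPy sketch of an integer-code segment sum over observed combos only. It is an illustration under assumed toy data, not a proposed linopy implementation; all array names are hypothetical.

```python
import numpy as np

# Toy keys and values (hypothetical); the combo (2020, "s") never occurs.
period = np.array([2020, 2030, 2030, 2030])
season = np.array(["w", "w", "s", "s"])
coeff = np.array([1.0, 2.0, 3.0, 4.0])

# Factorize each key into integer codes plus its sorted unique labels.
p_labels, p_codes = np.unique(period, return_inverse=True)
s_labels, s_codes = np.unique(season, return_inverse=True)

# Fold both codes into one integer per row, then compress to observed combos.
flat = p_codes * len(s_labels) + s_codes
observed, group_codes = np.unique(flat, return_inverse=True)

# Segment sum: accumulate each row into the slot of its observed combo.
sums = np.bincount(group_codes, weights=coeff, minlength=len(observed))

# Recover the key labels per group; these would be plain (non-index) coords.
group_period = p_labels[observed // len(s_labels)]
group_season = s_labels[observed % len(s_labels)]

grid_cells = len(p_labels) * len(s_labels)
print(len(observed), "observed combos vs", grid_cells, "grid cells")
```

The result has one slot per observed combo and never allocates the full grid. For a diagonal crossing with n values per key, that is n slots against n² grid cells, which is where the gap between the two shapes comes from.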

Until a sparse kernel lands, the two-shape split above (grid via [names], compact MultiIndex via df) plus the blow-up warning is the pragmatic dense-xarray answer. This is also evidence for the #756 kernel decision: groupby's grid blow-up, matmul's zero-padding (#748), and merge's padding (#749) all dissolve together under a sparse/long kernel.

See also

Surfaced while reviewing #751.
