Make multi-key LinearExpression.groupby([names]) fast and flat (align with xarray's grouper API)

> [!NOTE]
> AI-written analysis (Claude Code, prompted by @FBumann), combining a linopy code trace with a dig through PyPSA's groupby usage. The code-level claims and the prototype output below are verified on the #751 branch; the PyPSA-side file references come from that investigation — please sanity-check before acting.

Multi-key grouping of a `LinearExpression` currently forces a choice: *fast* (pandas `DataFrame` grouper → stacked-MultiIndex output) **or** *separate dims per key* (list-of-names → slow xarray fallback). This issue makes `groupby([names])` **fast** while keeping its separate-dims output, without breaking the existing `DataFrame` contract.

## What each layer accepts as a grouper

| Grouper | plain xarray | linopy | PyPSA uses it? |
|---|---|---|---|
| dimension name — `groupby("snapshot")` | ✓ | ✓ (always worked; #751 moved it onto the fast path) | — |
| non-dim coordinate name — `groupby("carrier")` | ✓ | ✓ **fixed in #751** (was `ValueError`) + fast | — |
| list of names — `groupby(["carrier","bus"])` | ✓ | ✓ but **slow** (xarray fallback) | no |
| `DataArray` — `groupby(da)` | ✓ | ✓ (fast) | ✓ nodal balance (`constraints.py`, via `c._as_xarray`) |
| `Grouper` obj — `groupby(x=UniqueGrouper())` | ✓ | ✗ unsupported | — |
| pandas `DataFrame`/`Series` | ✗ `KeyError` | ✓ (fast, **stacked** output) | ✓ statistics (`expressions.py`) |

Note: the pandas `DataFrame`/`Series` grouper is a **linopy-only extension**; plain xarray rejects it (`KeyError`). PyPSA is also internally inconsistent: its constraint/nodal-balance code uses the idiomatic `DataArray` grouper, while only the statistics layer uses the pandas one.
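For reference, a minimal plain-xarray sketch of the two idiomatic single-key spellings from the table (coordinate name and `DataArray`). The array, its `name` dimension and the `carrier` labels are made up for illustration; no linopy objects are involved.

```python
import numpy as np
import xarray as xr

# Toy data: six components along "name", with "carrier" attached as a
# non-dimension coordinate (illustrative values only).
da = xr.DataArray(
    np.arange(6.0),
    dims="name",
    coords={
        "name": list("abcdef"),
        "carrier": ("name", ["wind", "solar", "wind", "gas", "solar", "wind"]),
    },
)

# Spelling 1: group by the coordinate *name*.
by_name = da.groupby("carrier").sum()

# Spelling 2: group by the coordinate passed as a DataArray.
by_array = da.groupby(da.carrier).sum()

print(by_name.to_series())
```

Both spellings should yield one `carrier` dimension holding the unique labels, which is the shape the linopy rows of the table aim to match.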

## The gap: fast XOR separate-dims for multi-key

| Spelling | Path | Speed | Output |
|---|---|---|---|
| `groupby(["period","season"])` (names) | xarray fallback | slow | `(period, season)` grid (separate dims) |
| `groupby(df[["period","season"]])` (DataFrame) | reindex fast path | fast | stacked-MultiIndex `group` dim |

Root cause in `LinearExpressionGroupby.sum`: `_resolve_group` resolves a *single* name to its coordinate (fast path), but a multi-name list is returned unchanged, fails the `non_fallback_types = (Series, DataFrame, DataArray)` check, and drops to the fallback. The only multi-key fast path is the `pd.DataFrame` branch, which emits a stacked `pd.MultiIndex`.
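The dispatch described above can be modelled with a toy sketch. This is not linopy's code: `resolve_group` and `chosen_path` are hypothetical stand-ins that only mirror the behaviour stated in this issue (a single name is resolved to its coordinate, a list is passed through unchanged and then fails the type check).

```python
import numpy as np
import pandas as pd
import xarray as xr

# Mirrors the tuple named in the issue text; everything else here is a toy.
NON_FALLBACK_TYPES = (pd.Series, pd.DataFrame, xr.DataArray)


def resolve_group(data, group):
    """Hypothetical stand-in: resolve a single coordinate name to its DataArray."""
    if isinstance(group, str) and group in data.coords:
        return data.coords[group]
    return group  # a list of names falls through unchanged


def chosen_path(data, group):
    """Report which path the toy dispatcher would take."""
    resolved = resolve_group(data, group)
    return "fast" if isinstance(resolved, NON_FALLBACK_TYPES) else "fallback"


ds = xr.Dataset(
    {"coeffs": ("name", np.ones(3))},
    coords={
        "name": ["a", "b", "c"],
        "carrier": ("name", ["wind", "gas", "wind"]),
        "bus": ("name", ["n1", "n1", "n2"]),
    },
)

print(chosen_path(ds, "carrier"))           # single name
print(chosen_path(ds, ["carrier", "bus"]))  # list of names
```

In this toy model the single name lands on the fast path and the list lands on the fallback, which is the gap the proposal below closes.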

## Proposed direction (non-breaking)

Key the output shape on **how the group was supplied** — so the one fast path serves both spellings:

1. **`groupby([names])` → fast, separate dims.** Resolve a list of coord names to a value frame so it takes the reindex fast path:
   ```python
   if isinstance(group, (list, tuple)) and all(isinstance(g, str) and g in data.coords for g in group):
       group = data[list(group)].to_dataframe()[list(group)]   # tag this as "came from names"
   ```
   Then, **for the names spelling only**, `unstack` the stacked result back into separate dims. This reproduces today's fallback output *exactly* — verified on a sparse crossing (combo `(2020,'s')` absent):

   ```
   FALLBACK  list-of-names       : {'period': 2, 'season': 2, '_term': 2}   (slow)
   FAST      DataFrame (stacked) : {'group': 3, ...}  -> unstack('group') -> {'period': 2, 'season': 2, '_term': 2}   (fast)
             missing (2020,'s') cell -> vars [-1, -1], coeffs [nan, nan]    (filled, == fallback)
   ```

2. **`groupby(df)` (user-supplied `DataFrame`) → unchanged.** Keep the stacked-MultiIndex `group` output. The tested contract (`test_linear_expression_groupby_with_dataframe`, asserting `set(group.values) == set(MultiIndex.from_frame(groups).values)`) still holds — **no deprecation, no breaking change**.

Implementation notes:
- `sum` needs to know the frame came from names (a flag from `_resolve_group`, or handle the list branch directly), so the final `unstack` applies only to the names spelling.
- The names path builds a stacked MultiIndex internally and immediately unstacks it. The intermediate index is transient and never reaches the caller, so this is acceptable.
- The closing `unstack` densifies to the full grid — the *same* shape today's fallback produces, so parity, not a regression.
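The stacked-then-unstacked shape can be illustrated with a pandas-only analogue of the sparse crossing above (combo `(2020, 's')` absent). This is a sketch of the idea, not the linopy implementation: the `value` column and its numbers are invented, and pandas fills the missing cell with `NaN`, whereas linopy fills `vars` with `-1` and `coeffs` with `nan` as shown in the prototype output.

```python
import numpy as np
import pandas as pd

# Five rows, three observed (period, season) combos; (2020, "s") never occurs.
df = pd.DataFrame(
    {
        "period": [2020, 2020, 2030, 2030, 2030],
        "season": ["w", "w", "w", "s", "s"],
        "value": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)

# Compact result: one entry per observed combo, indexed by a MultiIndex
# (the analogue of the stacked `group` dim).
stacked = df.groupby(["period", "season"])["value"].sum()

# Separate-dims result: unstacking densifies to the full period x season grid,
# and the unobserved cell is filled.
grid = stacked.unstack("season")

print(stacked)
print(grid)
```

The stacked result has three entries while the grid has four cells, one of them fill. That extra cell is the densification the last implementation note refers to.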

## Optional follow-up (not required by the above)

With a fast separate-dims list grouping available, PyPSA *could* migrate its statistics sites to the idiomatic, label-safe form for consistency with its constraint code — but it's no longer forced, since the `DataFrame` grouper keeps working:

```python
gx = grouping.to_xarray()                          # name-indexed, label-aligned (no positional footgun)
da = da.assign_coords(carrier=gx.carrier, bus=gx.bus)
da.groupby(["carrier", "bus"]).sum()               # now fast, separate dims
```

## Relationship to #751

This builds directly on #751 (which fixes #750):

- #750/#751 is specifically about grouping by a **non-dimension coordinate name** — `groupby("carrier")` raised `ValueError: carrier already exists as coordinate` before, works after (verified). Grouping by a *dimension* name always worked; #751 only moved it onto the fast path via `_resolve_group` (commit `5a9b2e5`).
- This issue extends that same resolver to the multi-name list case.
- #751's `TestGroupbyByAttachedCoordinate` matrix already pins the separate-dims output for the list spelling, so the target shape is established and tested.

#751 deliberately stops at single-key parity and does not depend on this; nothing here blocks it.

## Future directions (sparse representations)

The grid memory cost is fundamental to *dense* xarray: separate dims are a dense cartesian grid, so a sparse/correlated key crossing materialises mostly-fill cells (measured ~100× vs the DataFrame grouper for a diagonal crossing; see #740). Truly getting *separate dims and compact* needs a sparse representation. Options, roughly by payoff vs cost:

1. **`sparse.COO` duck-arrays under xarray** — separate dims *and* observed-only storage. Requires linopy to be sparse-aware end to end (solver export, `matmul`, `merge`).
2. **Long-format / polars kernel (#756)** — groupby becomes a relational `group_by().agg()`, inherently sparse; no grid, no MultiIndex. Cleanest long-term; biggest change.
3. **Integer-code segment-sum** into a single observed-combo dim with the keys as plain coords — avoids both the grid and the MultiIndex, but is non-idiomatic (no `.sel`).
4. **Lazy / fused grouping** — keep the compact form and only densify if the result is indexed by label.
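Option 3 can be sketched in plain NumPy. This is a minimal illustration under assumed toy data, not a proposed linopy kernel: each key is factorised to integer codes, the codes are fused into one group id per row, and the values are segment-summed over observed combos only.

```python
import numpy as np

# Toy keys and values; (2020, "s") is never observed.
period = np.array([2020, 2020, 2030, 2030, 2030])
season = np.array(["w", "w", "w", "s", "s"])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Factorise each key into labels plus integer codes.
p_labels, p_codes = np.unique(period, return_inverse=True)
s_labels, s_codes = np.unique(season, return_inverse=True)

# Fuse the per-key codes into one integer per row (row-major over the grid).
fused = p_codes * len(s_labels) + s_codes

# Keep only the combos that actually occur, then segment-sum into them.
observed, group_ids = np.unique(fused, return_inverse=True)
sums = np.bincount(group_ids, weights=values, minlength=len(observed))

# Recover the key labels as plain per-group arrays (no MultiIndex, no grid).
group_period = p_labels[observed // len(s_labels)]
group_season = s_labels[observed % len(s_labels)]

for p, s, total in zip(group_period, group_season, sums):
    print(p, s, total)
```

The result has one entry per observed combo (three here, against four grid cells), with the keys carried as ordinary arrays. The trade-off named in option 3 still applies: without an index on those keys, label-based selection is not available.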

Until a sparse kernel lands, the two-shape split above (grid via `[names]`, compact MultiIndex via `df`) plus the blow-up warning is the pragmatic dense-xarray answer. This is also evidence for the #756 kernel decision: groupby's grid blow-up, `matmul`'s zero-padding (#748), and `merge`'s padding (#749) all dissolve together under a sparse/long kernel.

## See also

- #730 (open) — `.vars`→`.variables` rename: same xarray-API-unification family.
- #741, #752 (closed) — "PyPSA reaches into linopy internals": same PyPSA/linopy alignment theme.

Surfaced while reviewing #751.




