AI-written analysis (Claude Code, prompted by @FBumann), combining a linopy code trace with a dig through PyPSA's groupby usage. The code-level claims and the prototype output below are verified on the #751 branch; the PyPSA-side file references come from that investigation — please sanity-check before acting.
Multi-key grouping of a LinearExpression currently forces a choice: fast (pandas DataFrame grouper → stacked-MultiIndex output) or separate dims per key (list-of-names → slow xarray fallback). This issue makes groupby([names]) fast while keeping its separate-dims output, without breaking the existing DataFrame contract.
What each layer accepts as a grouper
| Grouper | plain xarray | linopy | PyPSA uses it? |
| --- | --- | --- | --- |
| dimension name: groupby("snapshot") | ✓ | ✓ (always worked; #751 moved it onto the fast path) | ✓ nodal balance (constraints.py, via c._as_xarray) |
| Grouper obj: groupby(x=UniqueGrouper()) | ✓ | ✗ unsupported | - |
| pandas DataFrame/Series | ✗ KeyError | ✓ (fast, stacked output) | ✓ statistics (expressions.py) |
Note: the pandas DataFrame/Series grouper is a linopy-only extension; plain xarray rejects it (KeyError). PyPSA is also internally inconsistent: its constraint/nodal-balance code uses the idiomatic DataArray grouper, while only the statistics layer uses the pandas one.
The gap: fast XOR separate-dims for multi-key
| Spelling | Path | Speed | Output |
| --- | --- | --- | --- |
| groupby(["period","season"]) (names) | xarray fallback | slow | (period, season) grid (separate dims) |
| groupby(df[["period","season"]]) (DataFrame) | reindex fast path | fast | stacked-MultiIndex group dim |
Root cause in LinearExpressionGroupby.sum: _resolve_group resolves a single name to its coordinate (fast path), but a multi-name list is returned unchanged, fails the non_fallback_types = (Series, DataFrame, DataArray) check, and drops to the fallback. The only multi-key fast path is the pd.DataFrame branch, which emits a stacked pd.MultiIndex.
Proposed direction (non-breaking)
Key the output shape on how the group was supplied — so the one fast path serves both spellings:
groupby([names]) → fast, separate dims. Resolve a list of coord names to a value frame so it takes the reindex fast path:
    if isinstance(group, (list, tuple)) and all(isinstance(g, str) and g in data.coords for g in group):
        group = data[list(group)].to_dataframe()[list(group)]  # tag this as "came from names"
Then, for the names spelling only, unstack the stacked result back into separate dims. This reproduces today's fallback output exactly — verified on a sparse crossing (combo (2020,'s') absent):
groupby(df) (user-supplied DataFrame) → unchanged. Keep the stacked-MultiIndex group output. The tested contract (test_linear_expression_groupby_with_dataframe, asserting set(group.values) == set(MultiIndex.from_frame(groups).values)) still holds — no deprecation, no breaking change.
Implementation notes:
sum needs to know the frame came from names (a flag from _resolve_group, or handle the list branch directly), so the final unstack applies only to the names spelling.
The names path builds a stacked MultiIndex internally and immediately unstacks it — a transient intermediate index, fine.
The closing unstack densifies to the full grid — the same shape today's fallback produces, so parity, not a regression.
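The notes above can be illustrated with a small pandas-only analogy. This is a sketch under stated assumptions, not linopy code: `grouped_sum`, the `from_names` flag, and the toy data are hypothetical names chosen for illustration. It shows one shared grouping path whose output shape depends only on how the group was supplied, and how the closing unstack densifies a sparse crossing (here the combination (2020, "s") is absent).

```python
import pandas as pd


def grouped_sum(data: pd.DataFrame, value: str, group):
    """Toy analogue: one grouping path, output shape keyed on how `group` was given.

    Assumes at least two grouping keys.
    """
    # Names spelling: a list/tuple of column names that all exist in `data`.
    from_names = isinstance(group, (list, tuple)) and all(
        isinstance(g, str) and g in data.columns for g in group
    )
    keys = data[list(group)] if from_names else group
    # Shared path: group by the key columns, which yields a stacked MultiIndex.
    stacked = data[value].groupby([keys[c] for c in keys.columns]).sum()
    if from_names:
        # Names spelling only: unstack into a dense grid (missing combos become NaN).
        return stacked.unstack(list(group)[1:])
    return stacked


data = pd.DataFrame(
    {
        "period": [2020, 2030, 2030, 2030],
        "season": ["w", "w", "s", "s"],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

grid = grouped_sum(data, "value", ["period", "season"])  # names spelling
stacked = grouped_sum(data, "value", data[["period", "season"]])  # DataFrame spelling

print(grid.size, len(stacked))  # grid cells vs observed combinations
```

The grid holds one cell per (period, season) pair, including the unobserved one, while the stacked result stores only the combinations that occur in the data.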
Optional follow-up (not required by the above)
With a fast separate-dims list grouping available, PyPSA could migrate its statistics sites to the idiomatic, label-safe form for consistency with its constraint code — but it's no longer forced, since the DataFrame grouper keeps working:
    gx = grouping.to_xarray()  # name-indexed, label-aligned (no positional footgun)
    da = da.assign_coords(carrier=gx.carrier, bus=gx.bus)
    da.groupby(["carrier", "bus"]).sum()  # now fast, separate dims
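The "positional footgun" mentioned in the snippet can be shown with a pandas-only sketch. The component names, carriers, and values below are made up for illustration and do not come from PyPSA; the point is only that pairing keys with values by row position gives a different answer from pairing them by label when the two tables are ordered differently.

```python
import pandas as pd

# Values indexed by component name.
values = pd.Series([10.0, 20.0, 30.0], index=["gen_a", "gen_b", "gen_c"], name="p")

# Hypothetical grouping table, stored in a different row order.
grouping = pd.DataFrame(
    {"carrier": ["wind", "gas", "wind"]},
    index=["gen_c", "gen_a", "gen_b"],
)

# Positional: pairs row i of `values` with row i of `grouping`, ignoring names.
positional = values.groupby(grouping["carrier"].to_numpy()).sum()

# Label-aligned: look up each component's carrier by name before grouping.
aligned = values.groupby(grouping["carrier"].reindex(values.index)).sum()

print(positional.to_dict())
print(aligned.to_dict())
```

Only the label-aligned result assigns gen_a to gas; the positional one silently groups gen_b under gas instead.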
Relationship to #751
This builds directly on #751 (which fixes #750):
groupby("carrier") raised ValueError: carrier already exists as coordinate before #751 and works after it (verified).
Grouping by a dimension name always worked; #751 only moved it onto the fast path via _resolve_group (commit 5a9b2e5).
The TestGroupbyByAttachedCoordinate matrix already pins the separate-dims output for the list spelling, so the target shape is established and tested.
#751 deliberately stops at single-key parity and does not depend on this; nothing here blocks it.
Future directions (sparse representations)
The grid memory cost is fundamental to dense xarray: separate dims are a dense cartesian grid, so a sparse/correlated key crossing materialises mostly-fill cells (measured ~100× vs the DataFrame grouper for a diagonal crossing; see #740). Getting both separate dims and compact storage needs a sparse representation. Options, roughly by payoff vs cost:
sparse.COO duck-arrays under xarray: separate dims and observed-only storage. Requires linopy to be sparse-aware end to end (solver export, matmul, merge).
Sparse/long _term kernel (#756, dense-_term memory cluster): groupby becomes a relational group_by().agg(), inherently sparse; no grid, no MultiIndex. Cleanest long-term; biggest change.
Integer-code segment-sum into a single observed-combo dim with the keys as plain coords: avoids both the grid and the MultiIndex, but is non-idiomatic (no .sel).
Lazy / fused grouping: keep the compact form and only densify if the result is indexed by label.
Until a sparse kernel lands, the two-shape split above (grid via [names], compact MultiIndex via df) plus the blow-up warning is the pragmatic dense-xarray answer. This is also evidence for the #756 kernel decision: groupby's grid blow-up, matmul's zero-padding (#748), and merge's padding (#749) all dissolve together under a sparse/long kernel.
See also
The .vars → .variables rename: same xarray-API-unification family.
#741, #752 (closed), "PyPSA reaches into linopy internals": same PyPSA/linopy alignment theme.
Surfaced while reviewing #751.
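As a companion to the integer-code segment-sum option listed under Future directions, here is a minimal NumPy sketch of the idea. It is an illustration under assumptions, not a proposed linopy implementation: the arrays and variable names are hypothetical, and a real kernel would operate on expression terms rather than plain floats. Each key is factorized to integer codes, the codes are combined into one code per row, and the values are accumulated into a single observed-combination dimension, with the keys carried as plain per-group arrays.

```python
import numpy as np

period = np.array([2020, 2030, 2030, 2030])
season = np.array(["w", "w", "s", "s"])
values = np.array([1.0, 2.0, 3.0, 4.0])

# Factorize each key to integer codes.
period_labels, period_codes = np.unique(period, return_inverse=True)
season_labels, season_codes = np.unique(season, return_inverse=True)

# Combine into one code per row, then keep only the observed combinations.
combined = period_codes * len(season_labels) + season_codes
observed, segment = np.unique(combined, return_inverse=True)

# Segment-sum: accumulate each row into its observed-combination slot.
sums = np.bincount(segment, weights=values, minlength=len(observed))

# Keys as plain per-group coordinates (no grid, no MultiIndex).
group_period = period_labels[observed // len(season_labels)]
group_season = season_labels[observed % len(season_labels)]

for p, s, total in zip(group_period, group_season, sums):
    print(p, s, total)
```

The result has one entry per observed combination, so storage scales with the number of combinations present rather than with the product of the key cardinalities; the trade-off, as noted above, is that the keys are no longer indexable dimensions.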