Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: constructor from mgr geometry col preservation #3173

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

m-richards
Copy link
Member

Closes #3165 (in the specific case of _geometry_column_name=="geometry" - which is also the case which previously worked prior to the regression

@@ -686,14 +686,16 @@ def func(group):
df.groupby("value2").apply(func, **kwargs)
# selecting the non-group columns -> no need to pass the keyword
if (
compat.PANDAS_GE_21 if geometry_name == "geometry" else compat.PANDAS_GE_20
) and not compat.PANDAS_GE_22:
compat.PANDAS_GE_22
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just swapped the if and else block so I could negate the conditional. Can revert but I find this easier to read as here's the cases where it works, rather than the cases where it's broken.

@@ -1634,7 +1634,17 @@ def _constructor_from_mgr(self, mgr, axes):
return _geodataframe_constructor_with_fallback(
pd.DataFrame._from_mgr(mgr, axes)
)
return GeoDataFrame._from_mgr(mgr, axes)
gdf = GeoDataFrame._from_mgr(mgr, axes)
# _from_mgr doesn't preserve metadata (expect __finalize__ to be called)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope we don't end up leaking more of init in here, the nice thing about wrapping everything in _geodataframe_constructor_with_fallback would be that it's straightforward to maintain, even if there's some overhead - though I'm not saying we actually can do that directly because of the sindex caching.

# still need to mimic __init__ behaviour with geometry=None
if (gdf.columns == "geometry").sum() == 1: # only if "geometry" is single col
try:
gdf["geometry"] = _ensure_geometry(gdf["geometry"].values)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this _ensure_geometry call?
Yes, to preserve the exact behaviour of before (and so maybe best to do that for 0.x), but longer term, we could just set the _geometry_column_name (in this code path, we already checked that there is at least one column with geometry dtype, we just didn't check it is the one named "geometry")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's probably uncessary, but I don't know the pandas internals well enough to be sure - I would think if we get a blockmanager it's all or nothing, either all dtypes were conserved or they were all lost (I would expect e.g. apply doesn't come from _from_mgr, but I haven't checked)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just tested that using

        if (gdf.columns == "geometry").sum() == 1:
            gdf._geometry_column_name = "geometry"

is enough to fix the example from the issue on pandas 2.1.4. So the try/except may not be needed here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's definitely not needed for this example, it's moreso that by implementing _constructor_from_mgr there are pandas code paths in pandas which will use this instead of our _constructor which dispatches to __init__, and this behaviour is part of __init__. I will remove it for now then, there's a possibility this could lead to a weird error after some kind of groupby apply type operation, but we can deal with that if it comes up

@martinfleis martinfleis added this to the 1.0 milestone Apr 28, 2024
@jorisvandenbossche
Copy link
Member

@m-richards do you remember what the status is here?

If I try the original example from the issue, that seems to work fine for geopandas main with both pandas main and pandas 1.5

@m-richards
Copy link
Member Author

m-richards commented May 21, 2024

@m-richards do you remember what the status is here?

If I try the original example from the issue, that seems to work fine for geopandas main with both pandas main and pandas 1.5

@jorisvandenbossche I believe it's only an issue with pandas 2.0 and 2.1 (I can replicate the original issue now with my pandas 2.1.4 env, and that it's fixed on this branch). The test modified indicates that this scenario is still problematic if the geometry column was not called "geometry" - we effectively hide another bug because _constructor_from_mgr will auto attempt to try "geometry", but from memory that's a mess to solve because it would mean depending on "self" in defining _constructor_from_mgr and there's a pandas issue to convert that instance method into a classmethod I think.

Edit: more context, this I think is a bug we introduced by backporting the constructor_from_mgr_sliced etc changes to silence warnings in conjunction with a pandas bug fixed in 2.3

Copy link
Member

@martinfleis martinfleis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

REGR: geometry column not found after groupby & column select
3 participants