BUG: .to_parquet unable to write empty GeoDataFrame #3137

Open
2 of 3 tasks
kauevestena opened this issue Jan 10, 2024 · 1 comment

kauevestena commented Jan 10, 2024

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of geopandas.

  • (optional) I have confirmed this bug exists on the main branch of geopandas.


Code Sample, a copy-pastable example

import geopandas as gpd
gpd.GeoDataFrame().to_parquet('test.parquet')

Generated output (short version):

raise ValueError("Writing to Parquet/Feather requires string column names")

ValueError: Writing to Parquet/Feather requires string column names

Problem description

I think this capability is useful (arguably necessary) for generality. With GeoJSON, for example, an empty GeoDataFrame can easily be serialized and read back. I know that Parquet is column-oriented and, as the error suggests, the frame must have at least one column, so writing it like this works:

import geopandas as gpd
gpd.GeoDataFrame(columns=['geometry']).to_parquet('test.parquet')

print(gpd.read_parquet('test.parquet'))

But with a column set that does not include "geometry" (such as ['data1', 'data2'] or ['data']), the file is written, and the print() line that reads it back raises this error:

ValueError: 'geo' metadata in Parquet/Feather file is missing required key: 'primary_column'

Output of the working example above (columns=['geometry']):

Empty GeoDataFrame
Columns: [geometry]
Index: []
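
The 'primary_column' error quoted above points at a check on the file-level 'geo' metadata. As a rough illustration only (this is not GeoPandas's actual code), the sketch below tests a 'geo' JSON string for the three top-level keys that the GeoParquet spec requires. The function name and both sample metadata strings are hypothetical, and treating a null value as missing is a choice made for this sketch.

```python
import json

# Top-level keys that the GeoParquet spec requires in the "geo" metadata.
REQUIRED_GEO_KEYS = ("version", "primary_column", "columns")


def missing_geo_keys(geo_json: str) -> list[str]:
    """Return required top-level keys that are absent or null (hypothetical helper)."""
    metadata = json.loads(geo_json)
    return [key for key in REQUIRED_GEO_KEYS if metadata.get(key) is None]


# Hypothetical metadata for a frame that has a geometry column.
complete = json.dumps({
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB", "geometry_types": []}},
})

# Hypothetical metadata for a frame without one: present, but incomplete.
incomplete = json.dumps({"version": "1.0.0", "primary_column": None, "columns": {}})

print(missing_geo_keys(complete))    # []
print(missing_geo_keys(incomplete))  # ['primary_column']
```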

Expected Output

The generation of a file that will be read as Empty GeoDataFrame without the need to specify any column names.

Output of geopandas.show_versions()

SYSTEM INFO

python : 3.11.7 (main, Dec 8 2023, 18:56:58) [GCC 11.4.0]
executable : /home/kaue/opensidewalkmap_beta/.venv/bin/python
machine : Linux-6.2.0-39-generic-x86_64-with-glibc2.35

GEOS, GDAL, PROJ INFO

GEOS : 3.11.2
GEOS lib : None
GDAL : 3.6.4
GDAL data dir: /home/kaue/opensidewalkmap_beta/.venv/lib/python3.11/site-packages/fiona/gdal_data
PROJ : 9.3.0
PROJ data dir: /home/kaue/opensidewalkmap_beta/.venv/lib/python3.11/site-packages/pyproj/proj_dir/share/proj

PYTHON DEPENDENCIES

geopandas : 0.14.2
numpy : 1.26.3
pandas : 2.1.4
pyproj : 3.6.1
shapely : 2.0.2
fiona : 1.9.5
geoalchemy2: None
geopy : None
matplotlib : None
mapclassify: None
pygeos : None
pyogrio : None
psycopg2 : None
pyarrow : 14.0.2
rtree : None
None

@brendan-ward
Member

Presumably the main use case for writing an empty Parquet file is to preserve column information and possibly the CRS (if set), rather than to write a truly empty DataFrame (no columns, no CRS, etc.). As a special case for GeoPandas on top of pandas, we also preserve information about the geometry column of the GeoDataFrame.

While pandas allows writing empty DataFrames to Parquet, GeoPandas does not always do so, for a few reasons, some of which may be out of scope for GeoPandas:

  • Most important: the GeoParquet spec requires certain metadata that we cannot fill out for a truly empty GeoDataFrame. We have neither a primary geometry column (nor anything derived from it, such as geometry type) nor CRS info. It is reasonable for the spec to require these. In other words, an empty Parquet file may be valid Parquet, but it is not valid GeoParquet. (A truly empty DataFrame is an edge case.)
  • On our end, we check that all column names are strings, to avoid issues with writing integer column names (a prior bug). We could special-case the check to skip it when there are no columns. But this is very much an edge case (writing a DataFrame with no columns is not very useful), so it does not seem like a high-priority bug to fix, and fixing it still doesn't get around the point above.
  • When you create an empty GeoDataFrame that includes a geometry column (gpd.GeoDataFrame(columns=['geometry']), where the column name "geometry" has special meaning), we can meet the requirements of the GeoParquet spec, and this round-trips fine.
  • Any GeoDataFrame that doesn't actually have a geometry column also runs into the first issue above: there is no column from which we can derive the required GeoParquet metadata. I'm not quite sure what we want to do in this case, because writing a subset of columns (which may have records present) should fall back to the pandas implementation and produce a valid Parquet file that is not GeoParquet. As noted above, it cannot be valid GeoParquet without a geometry column. I don't think we want to raise an error in this case, nor force users to convert that subset to a pandas DataFrame first. Maybe we should raise a warning indicating that the output will not be valid GeoParquet? We also probably shouldn't create the GeoParquet metadata at all in this case. Creating it obscures the error above a bit (the metadata is present, but incomplete and invalid); we do raise an error if it is not present at all.
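
The string-column-name check from the second bullet, with the suggested special case for a frame with no columns, could look roughly like the sketch below. This is not the check GeoPandas ships: the function name is hypothetical, and it takes a plain sequence of column labels rather than a DataFrame so that it stays self-contained.

```python
def check_column_names(columns) -> None:
    """Raise ValueError if any column label is not a string (hypothetical helper).

    An empty sequence passes, because there is no non-string label to reject.
    """
    bad = [name for name in columns if not isinstance(name, str)]
    if bad:
        raise ValueError(
            f"Writing to Parquet/Feather requires string column names; got {bad!r}"
        )


check_column_names(["geometry", "data1"])  # passes
check_column_names([])                     # passes: no columns to reject

try:
    check_column_names(["geometry", 0])    # an integer label is rejected
except ValueError as exc:
    print(exc)
```

Letting the empty case through only removes the first error. As the first bullet notes, a frame with no columns still has no geometry column from which to build valid GeoParquet metadata.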

So, I think we already support writing an empty GeoDataFrame that includes a geometry column and is compatible with the GeoParquet spec. The only real improvement to make is to state more clearly that calling .to_parquet without a geometry column will not produce a valid GeoParquet file, and that you will not be able to round-trip it by later calling .read_parquet on it in GeoPandas.
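
One way to picture the warning proposed above is a small decision step that runs before writing. The sketch below assumes a hypothetical helper that takes only the column names and the geometry column name; it is not an existing GeoPandas function. In practice the two return values would correspond to the GeoPandas writer and the pandas fallback.

```python
import warnings


def choose_parquet_writer(column_names, geometry_name="geometry") -> str:
    """Pick a write path from the column names (hypothetical helper)."""
    if geometry_name in column_names:
        return "geoparquet"
    # No geometry column: warn instead of raising, then fall back.
    warnings.warn(
        "No geometry column: the output will be plain Parquet, not valid "
        "GeoParquet, and will not round-trip through GeoPandas read_parquet.",
        UserWarning,
        stacklevel=2,
    )
    return "parquet"


print(choose_parquet_writer(["geometry"]))  # geoparquet

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(choose_parquet_writer(["data1", "data2"]))  # parquet
    print(len(caught))                                # 1
```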
