BUG: .to_parquet unable to write empty GeoDataFrame #3137

kauevestena · 2024-01-10T15:05:33Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of geopandas.
(optional) I have confirmed this bug exists on the main branch of geopandas.

Code Sample, a copy-pastable example

import geopandas as gpd
gpd.GeoDataFrame().to_parquet('test.parquet')

Generated output (short version):

raise ValueError("Writing to Parquet/Feather requires string column names")

ValueError: Writing to Parquet/Feather requires string column names

Problem description

I think this capability is useful (or even necessary) for generality. With GeoJSON, for example, an empty GeoDataFrame can easily be serialized and read back. I know that Parquet is column-oriented and, as the output suggests, the file must have at least one column, so writing it like this works:

import geopandas as gpd
gpd.GeoDataFrame(columns=['geometry']).to_parquet('test.parquet')

print(gpd.read_parquet('test.parquet'))

But with a set of columns (such as ['data1', 'data2'] or ['data']) that does not include "geometry", it throws this error at the print() call:

ValueError: 'geo' metadata in Parquet/Feather file is missing required key: 'primary_column'

Output of the example that includes the geometry column:

Empty GeoDataFrame
Columns: [geometry]
Index: []
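The second error message refers to the `geo` metadata stored in the Parquet file. As a minimal sketch of what such a check looks like, the snippet below tests a `geo` JSON string for the three top-level keys the GeoParquet specification requires (`version`, `primary_column`, `columns`). The function name `missing_geo_keys` and both sample metadata documents are hypothetical and only for illustration; this is not the GeoPandas implementation, and the exact metadata GeoPandas writes in this case is an assumption.

```python
import json

# Top-level keys that the GeoParquet specification requires in the "geo" metadata.
REQUIRED_GEO_KEYS = ("version", "primary_column", "columns")


def missing_geo_keys(geo_metadata_json):
    """Return the required top-level keys that are absent or null.

    Hypothetical helper for illustration; not part of GeoPandas.
    """
    geo = json.loads(geo_metadata_json)
    return [key for key in REQUIRED_GEO_KEYS if geo.get(key) is None]


# Assumed metadata for a file written without a geometry column:
# no primary column could be recorded.
incomplete = json.dumps({"version": "1.0.0", "primary_column": None, "columns": {}})

# Assumed metadata for a file written with an (empty) geometry column.
complete = json.dumps(
    {
        "version": "1.0.0",
        "primary_column": "geometry",
        "columns": {"geometry": {"encoding": "WKB", "geometry_types": []}},
    }
)

print(missing_geo_keys(incomplete))  # ['primary_column']
print(missing_geo_keys(complete))    # []
```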

Expected Output

A file that is read back as an empty GeoDataFrame, without the need to specify any column names.

Output of `geopandas.show_versions()`

SYSTEM INFO

python : 3.11.7 (main, Dec 8 2023, 18:56:58) [GCC 11.4.0]
executable : /home/kaue/opensidewalkmap_beta/.venv/bin/python
machine : Linux-6.2.0-39-generic-x86_64-with-glibc2.35

GEOS, GDAL, PROJ INFO

GEOS : 3.11.2
GEOS lib : None
GDAL : 3.6.4
GDAL data dir: /home/kaue/opensidewalkmap_beta/.venv/lib/python3.11/site-packages/fiona/gdal_data
PROJ : 9.3.0
PROJ data dir: /home/kaue/opensidewalkmap_beta/.venv/lib/python3.11/site-packages/pyproj/proj_dir/share/proj

PYTHON DEPENDENCIES

geopandas : 0.14.2
numpy : 1.26.3
pandas : 2.1.4
pyproj : 3.6.1
shapely : 2.0.2
fiona : 1.9.5
geoalchemy2: None
geopy : None
matplotlib : None
mapclassify: None
pygeos : None
pyogrio : None
psycopg2 : None
pyarrow : 14.0.2
rtree : None
None


brendan-ward · 2024-01-10T17:35:42Z

Presumably the main use case for writing an empty Parquet file is to preserve column information and possibly the CRS (if set), rather than to write a truly empty DataFrame (no columns, no CRS, etc.). As a special case for GeoPandas on top of Pandas, we also preserve information about the geometry column of the GeoDataFrame.

While Pandas allows us to write empty DataFrames to Parquet, we're not always doing so for a few reasons - some of which may be out of scope of GeoPandas:

1. Most important: the GeoParquet spec requires certain metadata that we cannot fill out for a truly empty GeoDataFrame. We have no primary geometry column (and nothing that can be derived from it, such as geometry type) and no CRS info. It is reasonable for the spec to require these. That is, an empty Parquet file may be valid Parquet, but it is not valid GeoParquet. (A truly empty DataFrame is an edge case.)
2. On our end, we check that column names are only strings, to avoid issues around writing integer column names (a prior bug). We could special-case the check to skip it when there are no columns, but this is very much an edge case (writing a DataFrame with no columns is not very useful), so it does not seem like a high-priority bug to fix, and it still doesn't get around the point above.
3. When you create an empty GeoDataFrame that includes a geometry column (gpd.GeoDataFrame(columns=['geometry']); the column name geometry has special meaning here), we can meet the requirements of the GeoParquet spec, and this round-trips fine.
4. Any GeoDataFrame that doesn't actually have a geometry column also runs into the first issue above: there isn't a column from which we can derive the required GeoParquet metadata. I'm not quite sure what we want to do in this case, because writing a subset of columns (which may have records present) should fall back to the Pandas implementation and produce a valid Parquet file that is not GeoParquet-compatible. As noted above, it cannot be valid GeoParquet without a geometry column. I don't think we want to raise an error in this case, nor force users to convert that subset to a Pandas DataFrame first. Maybe we should raise a warning indicating that the output will not be valid GeoParquet? We also probably shouldn't create the GeoParquet metadata at all in this case; it obfuscates the above error a bit (the metadata is present but incomplete and invalid), whereas we do raise an error if it is not present at all.

So, I think we already support writing an empty GeoDataFrame that includes a geometry column and is compatible with the GeoParquet spec. The only real place we could improve is to make it clearer that calling .to_parquet without a geometry column will not produce a valid GeoParquet file, and that you will not be able to round-trip it by later calling .read_parquet on it in GeoPandas.
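The warning idea floated above could look roughly like the following. This is only a sketch of the proposed behavior, not existing GeoPandas code: the function name `check_geoparquet_compatible`, its arguments, and the warning text are all hypothetical, and it works on plain column names so that it runs without GeoPandas installed.

```python
import warnings


def check_geoparquet_compatible(column_names, geometry_column):
    """Warn, rather than raise, when no geometry column is available.

    Hypothetical helper: `geometry_column` is the name of the active
    geometry column, or None if the frame has none.
    Returns True when the output could carry valid GeoParquet metadata.
    """
    if geometry_column is None or geometry_column not in column_names:
        warnings.warn(
            "No geometry column present; the output will not be valid "
            "GeoParquet and cannot be read back with geopandas.read_parquet.",
            UserWarning,
            stacklevel=2,
        )
        return False
    return True


# A frame with only attribute columns triggers the warning.
print(check_geoparquet_compatible(["data1", "data2"], None))  # False
# An empty frame that still has a geometry column passes silently.
print(check_geoparquet_compatible(["geometry"], "geometry"))  # True
```

A warning keeps the fallback to plain Parquet usable for attribute-only subsets while still telling the user that the file will not round-trip through GeoPandas.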

kauevestena added bug needs triage labels Jan 10, 2024

brendan-ward removed the needs triage label Jan 10, 2024
