# Vector data formats

As geospatial data evolved, multiple formats emerged for vector data. The first major one was the shapefile, which remains the most widely used and supported format, and its long history makes it very stable. However, it has several flaws and limitations: it requires at least three files (`.shp`, `.shx`, `.dbf`), allows only one geometry type per file, limits attribute names to 10 characters, and supports at most 255 attributes.
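
To make the attribute-name limit concrete, here is a minimal sketch that flags column names which would become identical once cut to 10 characters. The helper `find_truncation_collisions` and the column names are hypothetical; real drivers may rename clashing fields in their own way, so this only detects the problem before writing.

```python
from collections import defaultdict

SHAPEFILE_MAX_FIELD_NAME = 10  # attribute-name limit stated above
SHAPEFILE_MAX_FIELDS = 255     # maximum number of attributes stated above


def find_truncation_collisions(columns, max_len=SHAPEFILE_MAX_FIELD_NAME):
    """Group column names that become identical once cut to max_len characters."""
    groups = defaultdict(list)
    for name in columns:
        groups[name[:max_len]].append(name)
    # keep only the truncated names shared by more than one original column
    return {short: names for short, names in groups.items() if len(names) > 1}


# hypothetical attribute table
columns = ["population_2020", "population_2021", "type", "area"]
collisions = find_truncation_collisions(columns)
print(collisions)
print(f"Too many attributes: {len(columns) > SHAPEFILE_MAX_FIELDS}")
```

Running such a check before `to_file` makes it easier to rename long columns deliberately instead of discovering altered names after the export.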

Next came GeoJSON, which is based on the JSON format and adds geospatial information, but it performs poorly on large datasets. Because it is human-readable, it remains useful for small datasets. Another human-readable format for geospatial vector data is KML. It is based on XML and also performs poorly. It is mainly used to display geospatial information in Earth browsers such as Google Earth. The GeoPackage format was then introduced, offering improved performance and removing the limitations of shapefiles; it is now widely regarded as a modern replacement. Most software and infrastructure is compatible with this format.
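
As an illustration of what "human-readable" means here, the sketch below builds a one-feature GeoJSON document with the standard `json` module only. The feature, its coordinates, and its properties are made up for the example; GeoJSON stores coordinates in longitude, latitude order.

```python
import json

# A hypothetical single-point dataset
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [2.3522, 48.8566]},  # lon, lat
            "properties": {"name": "Paris", "type": "urban"},
        }
    ],
}

# The serialized text can be opened and edited in any text editor
text = json.dumps(feature_collection, indent=2)
print(text)

# Parsing it back only needs a JSON parser
parsed = json.loads(text)
lon, lat = parsed["features"][0]["geometry"]["coordinates"]
```

The same verbosity that makes the file readable (every coordinate written as text) is what makes it large and slow to parse on big datasets.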

For even better performance, FlatGeobuf was developed, based on [FlatBuffers](https://flatbuffers.dev/). Unlike GeoPackage, it allows for advanced and optimized operations like data streaming; however, it is more recent and not yet as widely supported. Finally, GeoParquet was created with cloud computing in mind and implements features that are particularly useful for cloud applications. Like FlatGeobuf, it requires modern infrastructure and software.

This notebook showcases how to use `geopandas` to read and write vector files, which lets us measure the reading and writing times for each format. `pyarrow` is also required to use Parquet files with `geopandas`.

## Data used

As an example, we will be using a land use shapefile, available here: https://www.data.gouv.fr/fr/datasets/carte-des-departements-2-1/#/resources/d823cf85-5f6d-4767-a9fe-25a50a266a04.

In [13]:
import time
from pathlib import Path
import hvplot.pandas
import geopandas as gpd
import pandas as pd

from utils import compare_read_write_times, download_sample_vector_data, get_file_size_in_mb

import warnings
warnings.filterwarnings('ignore')  # ignore warnings when reading invalid polygons

In [14]:
download_dir = "./sample_data/vector/"
land_use_shapefile_path = download_sample_vector_data(download_dir)

start = time.time()
land_use = gpd.read_file(land_use_shapefile_path, on_invalid="ignore")  # ignore invalid polygons
read_time = time.time() - start
print(f"Reading time: {read_time:.3f} seconds\n"
      f"File size: {get_file_size_in_mb(land_use_shapefile_path)} MB")


land_use = land_use.to_crs(epsg=2154)  # convert the CRS to a metric system

land_use["area"] = land_use.geometry.area  # new area column
all_types_area = land_use.groupby("type")["area"].sum()  # sum of area by land use type
# print(all_types_area)  # uncomment to see the results

Reading time: 0.068 seconds
File size: 10.418388 MB
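
`geometry.area` returns a planar area expressed in the square of the CRS units, which is why the cell above reprojects to EPSG:2154 (Lambert-93, in metres) before computing it. As an illustration only (this is not how `geopandas` implements it), the planar area of a simple polygon ring can be sketched with the shoelace formula; the rectangle below is a made-up example.

```python
def planar_area(ring):
    """Shoelace formula: area enclosed by a ring of (x, y) vertices, in squared CRS units."""
    total = 0.0
    # pair each vertex with the next one, wrapping around to close the ring
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# A 1000 m x 500 m rectangle in a metric CRS
rectangle_m = [(0, 0), (1000, 0), (1000, 500), (0, 500)]
area_m2 = planar_area(rectangle_m)
print(f"{area_m2:.0f} m² = {area_m2 / 10_000:.0f} ha")  # 500000 m² = 50 ha
```

Had the coordinates been in degrees, the same formula would return "square degrees", which is not a usable area.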


Let's write a function to run this analysis on any file, and another function to write a GeoDataFrame to any file (Parquet files use different methods for reading and writing).

In [28]:
def analyze_vector_file(filepath):
    """Read a vector file, compute polygon areas, and return (read_time, file_size)."""
    suffix = Path(filepath).suffix

    # time the read only; GeoParquet has a dedicated reader
    start = time.time()
    if suffix == ".parquet":
        land_use = gpd.read_parquet(filepath)
    else:
        land_use = gpd.read_file(filepath, on_invalid="ignore")
    read_time = time.time() - start
    land_use = land_use.to_crs(epsg=2154)  # metric CRS, not included in the timings

    # time the area computation
    start = time.time()
    land_use["area"] = land_use.geometry.area
    compute_time = time.time() - start
    file_size = get_file_size_in_mb(filepath)
    print(f"File type: {suffix}\n"
          f"File size: {file_size:.1f} MB\n"
          f"Time to open: {read_time:.3f} seconds\n"
          f"Time to compute: {compute_time:.5f} seconds\n")

    return read_time, file_size

def write_gdf_to_file(gdf, output_file):
    """Write a GeoDataFrame to output_file and return the writing time in seconds."""
    suffix = Path(output_file).suffix
    gdf = gdf[gdf.geometry.notna()]  # drop rows without geometry before writing
    start = time.time()
    if suffix == ".parquet":
        gdf.to_parquet(output_file)
    elif suffix == ".kml":
        gdf.to_file(output_file, driver="KML")  # the KML driver is named explicitly
    else:
        gdf.to_file(output_file)  # driver inferred from the file extension
    write_time = time.time() - start
    print(f"Writing time of {suffix}: {write_time:.2f} seconds")

    return write_time
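
The two functions above time a single call. Single measurements are sensitive to disk caching and background load, so a possible refinement is to repeat each call and keep the median. The helper `time_call` below is a hypothetical sketch that is not used in the rest of this notebook; it relies on `time.perf_counter`, a clock intended for measuring short durations.

```python
import statistics
import time


def time_call(func, *args, repeats=5, **kwargs):
    """Run func several times and return (median_seconds, last_result)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), result


# Stand-in workload; with geopandas this could be
# time_call(gpd.read_file, path, on_invalid="ignore")
median_s, total = time_call(sum, range(1_000_000), repeats=3)
print(f"median over 3 runs: {median_s:.4f} s")
```

Note that repeated reads of the same file mostly measure the cached case, which may or may not be the scenario of interest.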

The shapefile is then opened and converted to different formats, measuring the writing time for each one. Each file is then analyzed to check its size and read time. This information is displayed as bar charts (using `hvplot`, which builds on `holoviews`) in the following cells.

In [29]:
# get files paths
land_use_shape_copy_path = Path(land_use_shapefile_path).with_stem("landuse_copy")  # used to measure the write time of shp
land_use_geojson_path = Path(land_use_shapefile_path).with_suffix(".geojson")  # path to the GeoJSON file
land_use_geopkg_path = Path(land_use_shapefile_path).with_suffix(".gpkg")  # path to the GeoPackage file
land_use_geoparquet_path = Path(land_use_shapefile_path).with_suffix(".parquet")  # path to the GeoParquet file
land_use_fgb_path = Path(land_use_shapefile_path).with_suffix(".fgb")  # path to the FlatGeobuf file
land_use_kml_path = Path(land_use_shapefile_path).with_suffix(".kml")  # path to the KML file


# convert files and retrieve write times
gdf_to_write = gpd.read_file(land_use_shapefile_path, on_invalid="ignore")
shp_write_time = write_gdf_to_file(gdf_to_write, land_use_shape_copy_path)
gpkg_write_time = write_gdf_to_file(gdf_to_write, land_use_geopkg_path)
geojson_write_time = write_gdf_to_file(gdf_to_write, land_use_geojson_path)
geoparquet_write_time = write_gdf_to_file(gdf_to_write, land_use_geoparquet_path)
fgb_write_time = write_gdf_to_file(gdf_to_write, land_use_fgb_path)
kml_write_time = write_gdf_to_file(gdf_to_write, land_use_kml_path)

# analyze files
shp_read_time, shp_file_size = analyze_vector_file(land_use_shapefile_path)
gpkg_read_time, gpkg_file_size = analyze_vector_file(land_use_geopkg_path)
geojson_read_time, geojson_file_size = analyze_vector_file(land_use_geojson_path)
geoparquet_read_time, geoparquet_file_size = analyze_vector_file(land_use_geoparquet_path)
fgb_read_time, fgb_file_size = analyze_vector_file(land_use_fgb_path)
kml_read_time, kml_file_size = analyze_vector_file(land_use_kml_path)

Writing time of .shp: 0.13 seconds
Writing time of .gpkg: 0.12 seconds
Writing time of .geojson: 1.21 seconds
Writing time of .parquet: 0.05 seconds
Writing time of .fgb: 0.10 seconds
Writing time of .kml: 0.84 seconds
File type: .shp
File size: 10.4 MB
Time to open: 0.069 seconds
Time to compute: 0.00367 seconds

File type: .gpkg
File size: 13.3 MB
Time to open: 0.054 seconds
Time to compute: 0.00439 seconds

File type: .geojson
File size: 18.3 MB
Time to open: 0.678 seconds
Time to compute: 0.00353 seconds

File type: .parquet
File size: 9.6 MB
Time to open: 0.058 seconds
Time to compute: 0.00542 seconds

File type: .fgb
File size: 12.2 MB
Time to open: 0.068 seconds
Time to compute: 0.00470 seconds

File type: .kml
File size: 18.4 MB
Time to open: 0.298 seconds
Time to compute: 0.00328 seconds



In [31]:
# Read and write times comparison
read_times = [shp_read_time, gpkg_read_time, geojson_read_time, geoparquet_read_time, fgb_read_time, kml_read_time]
write_times = [shp_write_time, gpkg_write_time, geojson_write_time, geoparquet_write_time, fgb_write_time, kml_write_time]
labels = ["shp", "gpkg", "geojson", "geoparquet", "fgb", "kml"]

compare_read_write_times(read_times, write_times, labels)

In [34]:
# Comparison without geojson and kml
write_times = [shp_write_time, gpkg_write_time, geoparquet_write_time, fgb_write_time]
read_times = [shp_read_time, gpkg_read_time, geoparquet_read_time, fgb_read_time]
labels = ["shp", "gpkg", "geoparquet", "fgb"]
compare_read_write_times(read_times, write_times, labels)

In [35]:
# File size comparison
files_sizes = [shp_file_size, gpkg_file_size, geojson_file_size, geoparquet_file_size, fgb_file_size, kml_file_size]
labels = ["shp", "gpkg", "geojson", "geoparquet", "fgb", "kml"]

# Bar chart
data = pd.DataFrame({
    'Formats': labels,
    'File size in MB': files_sizes
})
data.hvplot.bar(
    x='Formats',
    y='File size in MB',
    width=500,
    height=400,
    color='#a6bfe0',
    title='File size depending on format',
    xlabel='Formats',
    ylabel='File size in MB'
)
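
To put numbers on the comparison, the timings printed earlier can be expressed relative to the fastest format. The values below are copied from the single run shown in this notebook, so they illustrate the arithmetic rather than a general ranking; `relative_to_fastest` is a hypothetical helper.

```python
# Timings in seconds, copied from the outputs above (one run, one dataset)
write_s = {"shp": 0.13, "gpkg": 0.12, "geojson": 1.21, "geoparquet": 0.05, "fgb": 0.10, "kml": 0.84}
read_s = {"shp": 0.069, "gpkg": 0.054, "geojson": 0.678, "geoparquet": 0.058, "fgb": 0.068, "kml": 0.298}


def relative_to_fastest(timings):
    """Express each timing as a multiple of the smallest one."""
    fastest = min(timings.values())
    return {fmt: round(t / fastest, 1) for fmt, t in timings.items()}


for label, timings in (("write", write_s), ("read", read_s)):
    print(label, relative_to_fastest(timings))
```

In this run, writing GeoJSON took about 24 times as long as writing GeoParquet, and reading GeoJSON about 13 times as long as reading GeoPackage.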

As stated at the beginning, GeoJSON and KML perform worse than every other tested format: they are the slowest to read and write and produce the largest files. GeoParquet, on the other hand, has the fastest write time and the smallest file, with a read time close to that of GeoPackage. However, this benchmark is not exhaustive:

- it was tested on only one dataset: performance is data-dependent; see https://flatgeobuf.org/#performance (from the official FlatGeobuf documentation) for another benchmark
- a specific `geopandas` version was used: older versions of `geopandas` were much slower and/or used different I/O engines, which also lowered performance
- it was run on local infrastructure: cloud-specific performance was not measured
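
Since the results depend on the library versions, it helps to store them next to the measurements. The sketch below uses only the standard library; the package list is an assumption about what is relevant for this notebook, and packages that are not installed are reported as `None`. `geopandas` also provides `gpd.show_versions()` for a more detailed report.

```python
import platform
from importlib import metadata


def collect_versions(packages=("geopandas", "pandas", "pyarrow", "shapely", "pyogrio", "fiona")):
    """Return a dict of installed versions for the given packages, plus Python itself."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


print(collect_versions())
```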


However, performance isn't the only point of comparison. As explained in the introduction of this notebook, the shapefile suffers from many flaws, but every format has its advantages and limitations. Here is a summary table:

| Format     | Advantages                                                              | Drawbacks                                                                                                             |
|------------|-------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------|
| Shapefile  | The most popular and supported format, very stable                      | Requires multiple files, limits on attribute name length and attribute count, file size limited to 2 GB ...           |
| GeoPackage | Faster, widely supported, based on SQLite                               | Larger files, no streaming                                                                                            |
| GeoJSON    | Streaming available, human-readable                                     | Lower performance and larger file sizes                                                                               |
| KML        | Widely supported, easy integration with Google Earth, human-readable    | Lower performance, larger file sizes, not actively maintained (see [libkml github](https://github.com/libkml/libkml)) |
| GeoParquet | Cloud-optimized, great performance, smaller files, supports indexing    | Less widely supported, requires modern infrastructure and up-to-date software                                         |
| FlatGeobuf | Good performance and file compression, optimized for spatial queries    | Less widely supported                                                                                                 |