# GeoParquet: Parquet for geospatial data

In this tutorial, we will learn what the Parquet and GeoParquet file formats are.

## What is Apache Parquet?

**Apache Parquet** is a `columnar storage file format` widely used in big data analytics (Spark, Hive, Dask, DuckDB). Unlike row-based file formats, Parquet stores values from the same column sequentially. This makes column scans fast and improves compression.

For example, row-based storage saves data like this:
```csv
id, name, age
1, Alice, 34
2, Bob, 45
3, Carol, 29
```

And column-based storage saves data like this:
```text
id: [1, 2, 3]
name: [Alice, Bob, Carol]
age: [34, 45, 29]
```
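The difference between the two layouts can be sketched in plain Python. This is only an in-memory illustration of the idea, not how Parquet actually lays out bytes on disk, and the helper name `to_columns` is made up for this example.

```python
# Row-oriented layout: one record after another, as in a CSV file.
rows = [
    {"id": 1, "name": "Alice", "age": 34},
    {"id": 2, "name": "Bob", "age": 45},
    {"id": 3, "name": "Carol", "age": 29},
]


def to_columns(records):
    """Pivot a list of records into one list of values per column."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# Averaging the ages touches a single list in the columnar layout,
# whereas the row layout forces a visit to every full record.
mean_age_columnar = sum(columns["age"]) / len(columns["age"])
mean_age_rows = sum(record["age"] for record in rows) / len(rows)
```

Both computations give the same result; the point is that the columnar version never has to look at `id` or `name`.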

Compared to row-based formats, a `columnar storage file format` such as Parquet has the following advantages:
- Column encoding is possible: encodings such as `dictionary`, `run-length encoding (RLE)`, and `bit-packing` can reduce storage size and improve read speed.
- Better compression: values in the same column tend to be similar, so they compress much better than in row formats.
- Better data loading performance: `projection pushdown` (reads only the required columns) and `predicate pushdown` (skips chunks whose statistics show that no row can match the filter).
- The schema and metadata are embedded and customizable.
- Data partitioning is supported: it works well in distributed systems and is efficient for large-scale analytical queries (data warehouses, data lakes).
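To make the encoding and pushdown ideas more concrete, here is a simplified sketch in plain Python. It is not Parquet's real binary format (which combines RLE with bit-packing and stores statistics in the file footer), and the function names are hypothetical, chosen only for this illustration.

```python
def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def chunks_to_read(chunk_stats, lower_bound):
    """Return indices of chunks that may hold values above lower_bound.

    chunk_stats holds one (min, max) pair per chunk. A chunk whose max is
    not above the bound cannot match, so it is skipped without being read.
    """
    return [i for i, (_, high) in enumerate(chunk_stats) if high > lower_bound]


country = ["FR", "FR", "FR", "DE", "DE", "FR"]
dictionary, indices = dictionary_encode(country)
runs = run_length_encode(indices)

# Filter "age > 40" evaluated against per-chunk (min, max) statistics.
age_stats = [(18, 35), (30, 52), (41, 67)]
selected = chunks_to_read(age_stats, 40)
```

Repeated strings shrink to small integers, runs of identical integers shrink to pairs, and the first chunk of ages is never read because its maximum (35) cannot satisfy the filter.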


The disadvantages are:
- Not ideal for small datasets (< 1 MB): metadata and encoding overhead can take more space than the data itself.
- Not human-readable: Parquet is a binary format. You need tools (e.g. Pandas, DuckDB) to read and debug it.
- Complexity and write cost: writing is more CPU-intensive (encoding + compression). Inserting or updating a row may require rewriting the whole Parquet file.

> Parquet is a file format designed for large-scale data analytics with a write once, read many (WORM) access pattern. It is not ideal for small, transactional workloads (e.g. data that must be updated every second).

## What is GeoParquet?

**GeoParquet** is an extension of `Apache Parquet` designed for storing `geospatial vector data` (points, lines, polygons) in a columnar, compressed, efficient format.

To store `geospatial vector data`, **GeoParquet** adds the following information to the Parquet file metadata:
- which column is the geometry column
- how geometries are encoded (e.g. WKB, GeoArrow)
- which CRS (Coordinate Reference System) is used
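To give a sense of what the WKB encoding mentioned above looks like, the sketch below encodes and decodes a single 2D point with Python's standard `struct` module. It covers only the point case and is meant as an illustration; in practice a library such as Shapely handles WKB for all geometry types. The function names are made up for this example.

```python
import struct

WKB_POINT = 1  # geometry type code for a 2D point in WKB


def encode_point_wkb(x, y):
    """Encode a 2D point as little-endian WKB (21 bytes)."""
    # "<" little-endian, B byte-order flag, I uint32 type, dd two float64
    return struct.pack("<BIdd", 1, WKB_POINT, x, y)


def decode_point_wkb(data):
    """Decode a 2D WKB point, honouring the byte-order flag."""
    byte_order = "<" if data[0] == 1 else ">"
    geometry_type, x, y = struct.unpack(byte_order + "Idd", data[1:21])
    if geometry_type != WKB_POINT:
        raise ValueError(f"expected a point, got geometry type {geometry_type}")
    return x, y


# x is the longitude and y the latitude (roughly Paris).
wkb = encode_point_wkb(2.3522, 48.8566)
point = decode_point_wkb(wkb)
```

Each point takes 21 bytes: one byte for the byte order, four for the geometry type, and eight for each coordinate.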

For example, typical GeoParquet metadata looks like this:

```json
{
  "version": "1.0.0-beta.1",
  "primary_column": "geometry",
  "columns": {
    "geometry": {
      "encoding": "WKB",
      "geometry_types": ["Polygon", "MultiPolygon"],
      "crs": "EPSG:4326"
    }
  }
}
```

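> The Parquet file has one geometry column called `geometry`, which uses `WKB` as its geometry encoding. The CRS is `EPSG:4326`.
> In the GeoParquet specification, `crs` holds a full PROJJSON object; the short string is used here for readability.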
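As a minimal sketch of how a reader might use this metadata, the snippet below parses the JSON shown above with the standard `json` module and summarises each geometry column. The helper name `describe_geometry_columns` is hypothetical, and the JSON string is hard-coded here rather than read from a real file.

```python
import json

# The example metadata, as a JSON string. In a real file it is stored in
# the Parquet key-value metadata under the key "geo".
geo_json = """
{
  "version": "1.0.0-beta.1",
  "primary_column": "geometry",
  "columns": {
    "geometry": {
      "encoding": "WKB",
      "geometry_types": ["Polygon", "MultiPolygon"],
      "crs": "EPSG:4326"
    }
  }
}
"""


def describe_geometry_columns(raw):
    """Summarise each geometry column declared in GeoParquet metadata."""
    meta = json.loads(raw)
    primary = meta["primary_column"]
    summary = {}
    for name, spec in meta["columns"].items():
        summary[name] = {
            "primary": name == primary,
            "encoding": spec["encoding"],
            "geometry_types": spec.get("geometry_types", []),
            "crs": spec.get("crs"),
        }
    return summary


summary = describe_geometry_columns(geo_json)
```

Assuming `pyarrow` is installed, the raw string for a real file could be obtained with something like `pyarrow.parquet.read_schema(path).metadata[b"geo"]`.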


The advantages of GeoParquet:

- Much faster than GeoJSON or Shapefile for analytics.
- Can be partitioned for large data volumes, and works well with distributed engines (Spark, Dask).
- Reduces disk usage thanks to columnar storage and compression.
- Standardized metadata ensures interoperability.

### Supported tools

The following tools can read GeoParquet:

- GeoPandas (Python)
- Apache Sedona (Python, Java, Scala)
- DuckDB (Python, R, SQL)
- Fiona/GDAL
- QGIS/ArcGIS Pro