Parquet
Apache Parquet is a columnar, binary file format for tabular data. Unlike csv, it stores the value type of each column in the file itself and compresses the data, which makes it both faster to read/write and smaller on disk.
Parquet (and the related Arrow format) is the recommended format for exchanging tabular data with Python (e.g. with pandas / pyarrow): no separate .csvt is needed to communicate value types and the columnar layout is read efficiently on both sides.
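To illustrate from the Python side why embedded value types matter, the following minimal sketch uses only the standard-library csv module (not the GeoDMS or GDAL). Values written as integers and floats come back as strings, so the reader has to be told the types separately. The column names pand_id and oppervlakte are borrowed from the write example on this page; the values are made up.

```python
import csv
import io

# Made-up rows; pand_id and oppervlakte mirror attribute names used on this page.
rows = [
    {"pand_id": 1001, "oppervlakte": 85.5},
    {"pand_id": 1002, "oppervlakte": 120.0},
]

# Write the rows to an in-memory csv "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["pand_id", "oppervlakte"])
writer.writeheader()
writer.writerows(rows)

# Read them back: csv carries no type information, so every value is a str.
buffer.seek(0)
read_back = list(csv.DictReader(buffer))

# The types have to be supplied by the reader
# (the role of a .csvt file or of conversion functions).
column_types = {"pand_id": int, "oppervlakte": float}
typed = [
    {name: column_types[name](value) for name, value in row.items()}
    for row in read_back
]
```

With Parquet the column types travel inside the file, so the last step is not needed.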
The GeoDMS reads and writes Parquet through the GDAL library, using the same gdal.vect / gdalwrite.vect StorageManagers as for csv.
The following example shows how to read a .parquet file with the gdal.vect StorageManager.
unit<uint32> woningvoorraad
: StorageName = "%LocalDataProjDir%/Python/temp/woningvoorraad_startgebouwopties.parquet"
, StorageType = "gdal.vect"
, StorageReadOnly = "True"
{
}
All attributes from the Parquet file are read. In contrast to csv, the value type of each attribute is taken from the file, so in most cases no conversion functions are needed.
If the extension is not .parquet, the driver can be set explicitly, see GDAL:
parameter<string> GDAL_Driver : ['Parquet'];
When a Parquet file is produced by an external process (for instance a Python script, see below), reading it back creates a new domain unit. This new domain does not automatically match an already configured domain, even when it has the same number of rows. There are several ways to relate the imported data to an existing domain, called results in the examples below:
- If the rows correspond one-to-one and in the same order as results, give the read domain a calculation rule referring to the existing domain:
unit<uint32> startgebouwopties_python := results
, StorageName = "%LocalDataProjDir%/Python/temp/woningvoorraad_startgebouwopties.parquet"
, StorageType = "gdal.vect"
, StorageReadOnly = "True";
- Alternatively, configure the attributes explicitly against the existing domain:
container startgebouwopties_python
: StorageName = "%LocalDataProjDir%/Python/temp/woningvoorraad_startgebouwopties.parquet"
, StorageType = "gdal.vect"
, StorageReadOnly = "True"
{
attribute<Classifications/GebouwOptie> GebouwOptie (results); // read against the existing domain results
}
- Or copy the values one-to-one with union_data (an error is raised on too few or too many rows):
attribute<Classifications/GebouwOptie> GebouwOptie_rel (results) :=
union_data(results, startgebouwopties_python/GebouwOptie);
- If the relation is not one-to-one, relate both domains on an external key with rlookup and read the values with lookup. This is the safest route when selections or row orders may diverge (e.g. for editable reference tables, "stamtabellen", to which rows can be added):
attribute<startgebouwopties_python> py_rel (results) := rlookup(results/externe_key, startgebouwopties_python/externe_key);
attribute<Classifications/GebouwOptie> GebouwOptie_rel (results) := py_rel -> GebouwOptie;
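The key-based route in the last item can be pictured in plain Python. This is only a sketch of the semantics described above, not GeoDMS code: rlookup_sketch and lookup_sketch are hypothetical helper names, the data are made up, None stands in for a null value on rows without a match, and keeping the first row on duplicate keys is an assumption.

```python
from typing import Hashable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def rlookup_sketch(
    foreign_keys: Sequence[Hashable], primary_keys: Sequence[Hashable]
) -> List[Optional[int]]:
    """For each foreign key, return the row index of the matching primary key, or None."""
    index_of = {}
    for row, key in enumerate(primary_keys):
        index_of.setdefault(key, row)  # assumption: keep the first row on duplicate keys
    return [index_of.get(key) for key in foreign_keys]


def lookup_sketch(
    relation: Sequence[Optional[int]], values: Sequence[T]
) -> List[Optional[T]]:
    """Follow each index in relation into values; unmatched rows stay None."""
    return [None if row is None else values[row] for row in relation]


# Made-up data: the existing domain has 4 rows,
# the imported table has 3 rows in a different order.
results_key = ["A", "B", "C", "D"]
python_key = ["C", "A", "B"]
python_gebouwoptie = [2, 0, 1]

py_rel = rlookup_sketch(results_key, python_key)
gebouwoptie_rel = lookup_sketch(py_rel, python_gebouwoptie)
```

Row D has no counterpart in the imported table and therefore ends up without a value, whereas a positional copy of the same data would fail on the differing row counts.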
The following example shows how to write a .parquet file with the gdalwrite.vect StorageManager.
unit<uint32> input_export := src/woningen
, StorageName = "%LocalDataProjDir%/Python/temp/input_startgebouwopties.parquet"
, StorageType = "gdalwrite.vect"
, StorageReadOnly = "False"
{
attribute<uint32> pand_id := src/woningen/pand_id;
attribute<float32> oppervlakte := src/woningen/oppervlakte;
}
Attributes of all value types, except for value types of the point group, are written. The value types are stored in the file, so a reading process gets correctly typed columns without an accompanying .csvt.
A typical pattern is to let the GeoDMS write a Parquet input file, run a Python script on it, and read the Parquet output back. The script can be triggered automatically the moment its output is requested, by using the ExitCode of exec_ec in the StorageName of the output. See exec_ec for a worked end-to-end example.
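The Python side of this pattern could look roughly like the sketch below. It is not taken from the GeoDMS documentation: the function names, the threshold rule and the GebouwOptie output column are hypothetical, and it assumes the conventional exit codes (0 for success, non-zero for failure) so that the calling side can tell whether the output file is usable. The Parquet reading and writing relies on pandas with pyarrow installed and is kept in a separate, replaceable function.

```python
import sys
from pathlib import Path
from typing import Callable


def process_with_pandas(input_path: str, output_path: str) -> None:
    """Placeholder processing step; requires pandas with pyarrow installed."""
    import pandas as pd  # imported lazily so the skeleton itself has no dependencies

    df = pd.read_parquet(input_path)
    # Hypothetical rule, for illustration only: derive an option code from the floor area.
    df["GebouwOptie"] = (df["oppervlakte"] > 100).astype("uint8")
    df[["pand_id", "GebouwOptie"]].to_parquet(output_path, index=False)


def run(
    input_path: str,
    output_path: str,
    process: Callable[[str, str], None] = process_with_pandas,
) -> int:
    """Run the processing step and return an exit code (0 = success, 1 = failure)."""
    if not Path(input_path).is_file():
        print(f"input not found: {input_path}", file=sys.stderr)
        return 1
    # Remove output left over from an earlier run, so it cannot be mistaken for a fresh result.
    Path(output_path).unlink(missing_ok=True)
    try:
        process(input_path, output_path)
    except Exception as exc:  # report any failure through the exit code
        print(f"processing failed: {exc}", file=sys.stderr)
        return 1
    if not Path(output_path).is_file():
        print(f"output was not written: {output_path}", file=sys.stderr)
        return 1
    return 0
```

A script built on this would end with sys.exit(run(sys.argv[1], sys.argv[2])), passing the input and output paths as command-line arguments.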
GeoDMS ©Object Vision BV. Source code distributed under GNU GPL-3. Documentation distributed under CC BY-SA 4.0.