# Writing a Pandas data frame with LAS data to a Parquet file

## Populate data frame

In [4]:
%pip install laspy
import laspy
las = laspy.read("../data/test.las")

import numpy as np
import pandas as pd
# Build the data frame from the scaled coordinates and a subset of point attributes
df = pd.DataFrame({
    'X': np.array(las.x),
    'Y': np.array(las.y),
    'Z': np.array(las.z),
    'intensity': las.intensity,
    'raw_classification': las.raw_classification,
    'gps_time': las.gps_time,
})
df.head(5)

Note: you may need to restart the kernel to use updated packages.


|   | X | Y | Z | intensity | raw_classification | gps_time |
|---|---|---|---|---|---|---|
| 0 | 555000.0625 | 4887200.0 | 120.940003 | 30 | 1 | 467000.4375 |
| 1 | 555000.6875 | 4887199.5 | 117.330002 | 22 | 1 | 467000.5 |
| 2 | 555001.3125 | 4887200.0 | 115.339996 | 10 | 1 | 467000.5 |
| 3 | 555000.1875 | 4887197.0 | 123.910004 | 31 | 1 | 467000.53125 |
| 4 | 555001.9375 | 4887200.0 | 111.110001 | 8 | 1 | 467000.53125 |


Show which version of Pandas is installed:

In [5]:
pd.__version__

'2.1.2'

The built-in Parquet export (`DataFrame.to_parquet`) requires pandas 0.21.0 or newer. See the [Pandas doc](https://pandas.pydata.org/pandas-docs/stable/io.html#parquet). At least one Parquet engine is also required, so either [fastparquet](https://fastparquet.readthedocs.io/en/latest/) or [pyarrow](https://arrow.apache.org/docs/python/) needs to be installed. Both packages can also be used directly (see below).

In [19]:
%pip install pyarrow 
%pip install fastparquet
df.to_parquet('../data/test.parquet', compression=None)  # None (not the string 'None') disables compression
df.to_parquet('../data/test.parquet_snappy', compression='snappy')
df.to_parquet('../data/test.parquet.gzip', compression='gzip')

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


We can compare the sizes of the resulting Parquet files with that of the original LAS file. Note that the comparison is not like for like: in this example only some of the attributes in the LAS file were copied over, and the coordinates are written as doubles (64-bit floats).

In [20]:
import os
print('LAS: {} MB'.format(os.stat('../data/test.las').st_size/(1024*1024)))
print('Parquet: {} MB'.format(os.stat('../data/test.parquet').st_size/(1024*1024)))
print('Parquet snappy: {} MB'.format(os.stat('../data/test.parquet_snappy').st_size/(1024*1024)))
print('Parquet gzip: {} MB'.format(os.stat('../data/test.parquet.gzip').st_size/(1024*1024)))


LAS: 8.686984062194824 MB
Parquet: 2.001495361328125 MB
Parquet snappy: 1.9697446823120117 MB
Parquet gzip: 1.7694463729858398 MB
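The remark about coordinates being written as doubles matters for this comparison: LAS point records store each coordinate as a 32-bit integer that is scaled and offset by values from the file header, whereas the data frame holds the resulting 64-bit floats. The sketch below illustrates the difference in raw size with NumPy alone; the `scale`, `offset` and `raw_x` values are hypothetical and not taken from `test.las`.

```python
import numpy as np

# Hypothetical header values, chosen only for illustration.
scale, offset = 0.01, 555000.0

# LAS point records hold each coordinate as a 32-bit integer.
raw_x = np.array([6, 69, 131, 19, 194], dtype=np.int32)

# Real-world coordinate = raw integer * scale + offset, which yields float64.
x = raw_x * scale + offset

print(raw_x.dtype, raw_x.nbytes)  # int32 20
print(x.dtype, x.nbytes)          # float64 40
```

Before compression, each coordinate column therefore takes twice as many bytes in the data frame as in the LAS point records.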


Alternatively, we can use the Apache Arrow Python module (pyarrow) directly to export to Parquet. The data frame must first be converted to a pyarrow Table, a step that may copy the data.

In [21]:
import pyarrow as pa
import pyarrow.parquet as pq

In [22]:
table = pa.Table.from_pandas(df)
pq.write_table(table, '../data/test_pyarrow.parquet')

Alternatively, fastparquet can write the pandas data frame directly, with no intermediate conversion. Here the output is written as a multi-file dataset (`file_scheme='hive'`), the layout typically used with Hive.

In [23]:
from fastparquet import write
write('../data/test_fastparquet_compressed.parq', df,
      compression='GZIP', file_scheme='hive')