# Writing a Pandas data frame with LAS data to a Parquet file

## Populate data frame

In [16]:
!pip install laspy
# laspy 1.x API (laspy 2.x replaced laspy.file.File with laspy.read)
from laspy.file import File
inFile = File("../data/test.las", mode="r")

import pandas as pd
# Copy a subset of the point attributes into data frame columns
df = pd.DataFrame({'X': inFile.x, 'Y': inFile.y, 'Z': inFile.z,
                   'intensity': inFile.intensity,
                   'raw_classification': inFile.raw_classification,
                   'gps_time': inFile.gps_time})
df.head(5)



| | X | Y | Z | gps_time | intensity | raw_classification |
|---|---|---|---|---|---|---|
| 0 | 555000.0625 | 4887200.0 | 120.940003 | 467000.4375 | 30 | 1 |
| 1 | 555000.6875 | 4887199.5 | 117.330002 | 467000.5 | 22 | 1 |
| 2 | 555001.3125 | 4887200.0 | 115.339996 | 467000.5 | 10 | 1 |
| 3 | 555000.1875 | 4887197.0 | 123.910004 | 467000.53125 | 31 | 1 |
| 4 | 555001.9375 | 4887200.0 | 111.110001 | 467000.53125 | 8 | 1 |


Show which version of pandas is installed:

In [17]:
pd.__version__

u'0.23.4'

The built-in Parquet export requires pandas 0.21.0 or newer; see the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/io.html#parquet). At least one Parquet engine must be installed, either [fastparquet](https://fastparquet.readthedocs.io/en/latest/) or [pyarrow](https://arrow.apache.org/docs/python/). Both packages can also be used directly (see below).
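
The two requirements above can be checked before calling `to_parquet`. The following is a minimal sketch, assuming Python 3 (for `importlib.util`); the helper names `version_at_least` and `available_parquet_engines` are hypothetical and not part of pandas.

```python
import importlib.util
import re


def version_at_least(version, minimum):
    """Return True if a dotted version string is >= the `minimum` tuple."""
    # Keep only the leading numeric components, e.g. '0.23.4' -> (0, 23, 4).
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError("unrecognised version string: {!r}".format(version))
    parts = tuple(int(group or 0) for group in match.groups())
    return parts >= tuple(minimum)


def available_parquet_engines():
    """Return the names of the Parquet engines that can be imported."""
    candidates = ("pyarrow", "fastparquet")
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]


# '0.23.4' is the pandas version reported above; 0.21.0 is the minimum
# version for the built-in Parquet export.
print(version_at_least("0.23.4", (0, 21, 0)))
print(available_parquet_engines())
```

In a notebook, `pd.__version__` would be passed in place of the literal version string. An empty engine list means `to_parquet` has no backend to write with.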

In [18]:
!pip install pyarrow
!pip install fastparquet
# compression=None is the pandas-documented way to disable compression
df.to_parquet('../data/test.parquet', compression=None)
df.to_parquet('../data/test.parquet.gzip', compression='gzip')

Collecting pyarrow
  Using cached https://files.pythonhosted.org/packages/1d/b6/c4e63f8bdb175d2223df26ddf12e2a0ba3fa347890128b5f128cb3f72589/pyarrow-0.11.0.tar.gz
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
Building wheels for collected packages: pyarrow
  Running setup.py bdist_wheel for pyarrow: started
  Running setup.py bdist_wheel for pyarrow: finished with status 'error'
  Complete output from command c:\programdata\anaconda2\python.exe -u -c "import setuptools, tokenize;__file__='c:\\users\\jan\\appdata\\local\\temp\\pip-install-xm1zsd\\pyarrow\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d c:\users\jan\appdata\local\temp\pip-wheel-e8anqf --python-tag cp27:
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-2.7
  creating build\lib.win-amd64-2.7\pyarrow
  copying

  Failed building wheel for pyarrow
Command "c:\programdata\anaconda2\python.exe -u -c "import setuptools, tokenize;__file__='c:\\users\\jan\\appdata\\local\\temp\\pip-install-xm1zsd\\pyarrow\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record c:\users\jan\appdata\local\temp\pip-record-r9wb6z\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in c:\users\jan\appdata\local\temp\pip-install-xm1zsd\pyarrow\




We can compare the size of the resulting files to the original LAS file. Note that in this example not all attributes from the LAS file were copied over and that the coordinates are written as doubles.

In [19]:
import os
print('LAS: {}'.format(os.stat('../data/test.las').st_size))
print('Parquet: {}'.format(os.stat('../data/test.parquet').st_size))
print('Parquet compressed: {}'.format(os.stat('../data/test.parquet.gzip').st_size))


LAS: 9108963
Parquet: 10718130
Parquet compressed: 2900803
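
To put these numbers in proportion, the sketch below computes each size relative to the LAS file, using the byte counts printed above. It also compares the width of three coordinates stored as 32-bit integers (the LAS format stores X, Y and Z as scaled integers) with three doubles. The wider coordinates are one plausible contributor to the larger uncompressed Parquet file, not a full accounting of the difference.

```python
import struct

# File sizes in bytes, copied from the output above.
sizes = {
    "LAS": 9108963,
    "Parquet": 10718130,
    "Parquet compressed": 2900803,
}

# Size of each file relative to the original LAS file.
ratios = {name: size / sizes["LAS"] for name, size in sizes.items()}
for name, ratio in ratios.items():
    print("{}: {:.2f} x LAS".format(name, ratio))

# A LAS point record stores X, Y and Z as 32-bit integers that are scaled
# and offset via the file header; a double takes twice as many bytes.
las_xyz_bytes = struct.calcsize("<3i")     # three little-endian int32
double_xyz_bytes = struct.calcsize("<3d")  # three little-endian float64
print("XYZ per point: {} bytes in LAS, {} bytes as double".format(
    las_xyz_bytes, double_xyz_bytes))
```

The uncompressed Parquet file comes out roughly 18% larger than the LAS file, while the gzip-compressed one is roughly a third of its size.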


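Alternatively, we can use the Apache Arrow Python module (pyarrow) directly to export to Parquet. The data frame first has to be converted to a pyarrow table, which may copy the data. In this session the cells below fail because the pyarrow installation above did not succeed.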

In [20]:
import pyarrow as pa
import pyarrow.parquet as pq

ImportError: No module named pyarrow

In [21]:
table = pa.Table.from_pandas(df)
pq.write_table(table, '../data/test_pyarrow.parquet')

NameError: name 'pa' is not defined
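
The two errors above come from running the export without pyarrow being importable. A guarded import fails with a clearer message instead; this is a minimal sketch, and the names `HAVE_PYARROW` and `write_with_pyarrow` are hypothetical helpers introduced for illustration.

```python
try:
    import pyarrow as pa
    import pyarrow.parquet as pq
    HAVE_PYARROW = True
except ImportError:
    # Same failure as above: pyarrow is not installed in this environment.
    HAVE_PYARROW = False


def write_with_pyarrow(df, path):
    """Write a pandas data frame to `path` via a pyarrow table."""
    if not HAVE_PYARROW:
        raise RuntimeError("pyarrow is not installed; "
                           "use df.to_parquet(..., engine='fastparquet') instead")
    table = pa.Table.from_pandas(df)  # convert the data frame to an Arrow table
    pq.write_table(table, path)
```

With pyarrow installed, `write_with_pyarrow(df, '../data/test_pyarrow.parquet')` performs the same two steps as the cell above.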

Alternatively, fastparquet can write the pandas data frame directly, without an intermediate conversion. Here a multi-file output is written, as is typical with Hive.

In [22]:
from fastparquet import write
write('../data/test_fastparquet_compressed.parq', df,
      compression='GZIP', file_scheme='hive')
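
With `file_scheme='hive'` the output path is a directory rather than a single file, so `os.stat(...).st_size` as used in the size comparison earlier would not report the size of the data. A minimal sketch for summing the files in such a directory follows, assuming Python 3; `directory_size` is a hypothetical helper, and the file names in the demonstration are illustrative only.

```python
import os
import tempfile


def directory_size(path):
    """Return the total size in bytes of all files below `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


# Demonstration on a throwaway directory; the file names and sizes here are
# illustrative only and are not taken from a real fastparquet run.
with tempfile.TemporaryDirectory() as tmp:
    for name, size in [("_metadata", 10), ("part.0.parquet", 100)]:
        with open(os.path.join(tmp, name), "wb") as handle:
            handle.write(b"\0" * size)
    demo_total = directory_size(tmp)
    print("Total: {}".format(demo_total))
```

Calling `directory_size('../data/test_fastparquet_compressed.parq')` would give a figure comparable to the single-file sizes printed earlier.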